Authored by:

Christopher R. Taylor 1, Joseph C.R. Thacker 2, Andrew I. Cooper 2, Simon J. Coles 1, Graeme M. Day 1

1 School of Chemistry, University of Southampton, Southampton, SO17 1BJ

2 Department of Chemistry and Centre for Materials Discovery, University of Liverpool, Liverpool, L69 7ZD


Key focus and activity

This case study explored and tested the CCDC’s “CSD-Theory” software as a solution for storing, sharing, and analysing the results of computational crystal structure prediction (CSP) on porous organic materials.  In particular, it investigated the package’s ease of use, available tools, and extensibility, both for sharing computational results straightforwardly with experimental colleagues and for expediting typical analytical calculations on a published example dataset.

Area of Physical Sciences covered

Computational and Theoretical Chemistry

Specific research topics: organic crystal structure prediction, materials discovery, solid form screening

Applicability to the Research Data Lifecycle

This Case Study arguably supports every phase of the research data lifecycle; it has particular relevance to the “Collect and Capture”, “Collaborate and Analyse”, “Manage, Store, & Preserve”, and “Discover, Reuse, & Cite” phases.

Figure 1 – JISC Research Data Lifecycle Model [1]

Main outputs

This study began with the installation of the CSD-Theory software and related tools on a shared virtual machine, assisted by technical support from the CCDC.  This process was relatively straightforward, demonstrating an attractive ease-of-setup for collaborative projects.

Secondly, a short Python script (fewer than 50 lines) was written to convert existing CSP datasets into annotated Crystallographic Information Files (CIFs), which could in turn be imported into the CSD-Theory database format using the provided tools.  This was again straightforward, though it would be preferable to prepare compatible databases directly from existing datasets using a documented schema, rather than via an intermediate file conversion and import step.  Such a facility would allow more rapid adoption of these tools and easier integration into existing software workflows, which may differ considerably between research groups.
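
As an illustration of the conversion step described above, the following is a minimal Python sketch of writing one predicted structure to an annotated CIF.  It is not the script used in this study.  The cell, symmetry, and atom-site data names are standard core CIF items, but the annotation tag `_example_csp_lattice_energy`, the function name `format_annotated_cif`, and all example values are hypothetical placeholders; the tags that the CSD-Theory import tools actually expect should be taken from the CCDC documentation.  The sketch also assumes a structure expressed in space group P1, so that the identity is the only symmetry operator.

```python
def format_annotated_cif(name, cell, atoms, annotations):
    """Return CIF text for one structure expressed in space group P1.

    cell        -- (a, b, c, alpha, beta, gamma) in angstroms and degrees
    atoms       -- iterable of (label, element, x, y, z), fractional coordinates
    annotations -- mapping of CIF data name to value; the names used in the
                   example below are hypothetical, not CSD-Theory tags
    """
    a, b, c, alpha, beta, gamma = cell
    lines = [f"data_{name}"]

    # Annotation items (for example a computed lattice energy) are written first.
    for tag, value in annotations.items():
        lines.append(f"{tag} {value}")

    # Unit cell and symmetry, using core CIF data names.
    lines += [
        f"_cell_length_a {a:.4f}",
        f"_cell_length_b {b:.4f}",
        f"_cell_length_c {c:.4f}",
        f"_cell_angle_alpha {alpha:.3f}",
        f"_cell_angle_beta {beta:.3f}",
        f"_cell_angle_gamma {gamma:.3f}",
        "_symmetry_space_group_name_H-M 'P 1'",
        "loop_",
        "_symmetry_equiv_pos_as_xyz",
        "'x,y,z'",
        "loop_",
        "_atom_site_label",
        "_atom_site_type_symbol",
        "_atom_site_fract_x",
        "_atom_site_fract_y",
        "_atom_site_fract_z",
    ]

    # One row per atom in the atom-site loop.
    for label, element, x, y, z in atoms:
        lines.append(f"{label} {element} {x:.5f} {y:.5f} {z:.5f}")

    return "\n".join(lines) + "\n"


# Placeholder values for illustration only; not taken from any real dataset.
example_cif = format_annotated_cif(
    name="predicted_0001",
    cell=(10.0, 12.0, 8.0, 90.0, 95.0, 90.0),
    atoms=[("C1", "C", 0.1, 0.2, 0.3), ("O1", "O", 0.4, 0.5, 0.6)],
    annotations={"_example_csp_lattice_energy": -123.4},
)
print(example_cif)
```

Writing one data block per predicted structure in this way keeps the conversion independent of any particular CSP code, at the cost of the intermediate file step discussed above.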

Next, to test the basic exploration and visualisation tools in the CSD-Theory software, the existing CSP datasets described above were imported into a CSD-Theory database, then searched and visualised using the WebCSD platform.  A number of technical issues arose during this testing, which was expected because the CSD-Theory package is under active development and not yet released.  A more recent build of the software provided by the CCDC eventually resolved some of these issues, particularly those affecting the simultaneous searching of predicted and published structures.  It would be beneficial, however, for the software’s existing documentation to be expanded, particularly with worked examples and fuller descriptions of features.  It is also crucial that any such system included in the PSDI be supported comprehensively and in the long term, though we are pleased to acknowledge the CCDC’s generally high standard of support during this case study.

We additionally explored the suitability of the CSD-Theory software and related tools for performing research-relevant analysis on a published CSP dataset for the molecule trimesic acid (TMA).  In particular, we were interested in reproducing the published workflow and results, which demonstrated that an experimentally discovered new polymorph (delta) of this molecule could be located in the CSP set from a comparison of powder X-ray diffraction (PXRD) patterns alone (see our detailed report for more information).

After the published CSP dataset for TMA had been imported into the CSD-Theory software (again via annotated CIF intermediates), another short script was written that used the associated CSD Python API tools to compute the PXRD patterns of all predicted structures, as well as those of the 3 experimental structures published in the CSD.  A similar comparison was made in the previously published work, but using a combination of an alternative PXRD simulator (PLATON) and in-house comparison code; our goal was to determine whether the CSD-Theory software could perform the same analysis, as this would aid portability and collaboration.  Using the Python API to compute and compare patterns from the imported CSD-Theory database, we correctly identified the delta form of TMA as a very close match to a number of low-energy CSP structures.  This agrees with the published work and demonstrates that CSD-Theory could contribute research-relevant functionality to the PSDI.
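
To make the idea of ranking predicted structures by PXRD pattern similarity concrete, the following is a minimal, self-contained Python sketch.  It does not use the CSD Python API, PLATON, or the in-house code from the published work, and it is not the script written for this study.  Instead it swaps in a generic triangle-weighted cross-correlation similarity, applied to synthetic patterns built from Gaussian peaks on an arbitrary integer grid.  All function names, peak positions, and parameter values are assumptions chosen purely for illustration.

```python
from math import exp, sqrt


def synthetic_pattern(peak_positions, n_points=500, width=3.0):
    """Build a toy pattern: unit-height Gaussian peaks on an integer grid."""
    return [
        sum(exp(-((i - p) ** 2) / (2.0 * width ** 2)) for p in peak_positions)
        for i in range(n_points)
    ]


def weighted_cross_correlation(f, g, max_shift):
    """Sum the cross-correlation of f and g over small shifts, weighting each
    shift with a triangular window that reaches zero just beyond max_shift."""
    n = len(f)
    total = 0.0
    for shift in range(-max_shift, max_shift + 1):
        weight = 1.0 - abs(shift) / (max_shift + 1)
        start = max(0, -shift)
        stop = min(n, n - shift)
        total += weight * sum(f[i] * g[i + shift] for i in range(start, stop))
    return total


def pattern_similarity(f, g, max_shift=10):
    """Normalised similarity: 1 for identical patterns, close to 0 when no
    peaks of f lie within roughly max_shift grid points of peaks of g."""
    if len(f) != len(g):
        raise ValueError("patterns must be sampled on the same grid")
    norm = sqrt(
        weighted_cross_correlation(f, f, max_shift)
        * weighted_cross_correlation(g, g, max_shift)
    )
    if norm == 0.0:
        return 0.0
    return weighted_cross_correlation(f, g, max_shift) / norm


def rank_candidates(reference, candidates, max_shift=10):
    """Return (name, similarity) pairs sorted from best to worst match."""
    scores = {
        name: pattern_similarity(reference, pattern, max_shift)
        for name, pattern in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: "close" has the reference peaks shifted slightly, "far" does not
# share any peaks with the reference.
reference = synthetic_pattern([100, 250])
candidates = {
    "far": synthetic_pattern([50, 400]),
    "close": synthetic_pattern([102, 252]),
}
ranking = rank_candidates(reference, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Allowing small shifts before scoring means that a candidate whose peaks are slightly displaced from the reference still scores highly, which is the kind of tolerance one would want when comparing patterns from predicted structures with an experimental pattern.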

Outcomes and Recommendations

We recommend continued exploration of the tools and capabilities of the CSD-Theory software and of its suitability for analysing and sharing CSP datasets in collaborative projects.  In particular, due to technical issues and time constraints, we could not explore in depth the more intuitive GUI-based tools for CSP datasets (which would help experimentalists to understand computed results rapidly); we suggest that this aspect be prioritised for further consideration and testing as time and resources allow.  We also acknowledge the CCDC’s ongoing work through its CSPC consortium in defining which CSP-related data and metadata should be stored, and how.  We recommend that the PSDI keep such definitions in mind and contribute if appropriate, since agreeing these definitions is a primary need for our research.

We additionally recommend exploring the possibility of a common, open database schema for CSP results and associated metadata that is compatible with software including the CSD-Theory package (NB: we have assumed that the existing internal database schema is closed, due to commercial sensitivity).  Allowing researchers to produce compatible databases directly is essential for any product employed as part of the PSDI: there will likely be interest in converting and storing large amounts of historical data, and having to convert this into a closed database format through intermediate steps would discourage adoption (cf. the FAIR principles).

A longer-term recommendation is that the eventual form of the PSDI ensures that contributed datasets (regardless of type or origin) are given individual, fixed identifiers (e.g. DOIs) to allow for unambiguous reference and FAIR attribution, independent of the storage format, location, or product/interface used.  In particular, it is important to avoid a situation in which datasets referenced in literature are solely attributed to the curator(s)/repository of the dataset, to the exclusion of the original creator(s).  This is crucial both in principle and for more pragmatic reasons, in that researchers are much more likely to deposit their data in a scheme that ensures they will be appropriately cited by any derivative work.

Overall, our recommendations are that the CSD-Theory package continue to be explored in the context of the PSDI, both as a possible integrated tool and alternatively as a prototype of functionalities and capabilities that could be independently replicated or expanded upon in the PSDI.  We encourage further collaboration with the CCDC in this regard, as well as contribution from the wider community on standards, software, and protocols for the storage, access, and attribution of CSP datasets.

Full detail report

The full detail report produced by this case study can be found at: Full Report Link