Pathfinder 1: Experimental Data Capture (lead Abraham Nieva de la Hidalga)

The research supported by large-scale facilities generates valuable, high-quality research data. However, data publishing practices seem to undervalue the significance of these data. There is limited support for reuse beyond the original research, reproducibility is challenging, and the provenance link between published results and supporting data is not always explicit. The PSDI can address these issues by supporting the development and adoption of tools that facilitate experimental data capture. The tools should (a) produce provenance data for every processing step, (b) provide references for all new and existing data used, (c) unambiguously map results to data objects, and (d) generate the metadata needed to publish the data supporting the presented results. We believe that this initiative will make a valuable contribution to physical sciences research. The current examples of Galaxy tools and workflows developed for XAS processing can be extended to support other experimental areas, including XPS, INS, QENS and PDF.
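Requirements (a) to (d) above can be illustrated with a minimal Python sketch. It is not a PSDI or Galaxy implementation: the class names, the use of SHA-256 checksums as data references, and the step names "normalise" and "fit" are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def checksum(data: bytes) -> str:
    """SHA-256 digest, used here as a stable reference to a data object (b)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProcessingStep:
    """One provenance entry: which tool turned which inputs into which outputs."""
    tool: str
    parameters: dict
    inputs: list   # checksums of the data objects consumed
    outputs: list  # checksums of the data objects produced
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class ProvenanceLog:
    """Hypothetical in-memory log; steps must be recorded in execution order."""
    steps: list = field(default_factory=list)

    def record(self, tool, parameters, inputs, outputs):
        # (a) one provenance entry per processing step
        step = ProcessingStep(
            tool,
            parameters,
            [checksum(d) for d in inputs],
            [checksum(d) for d in outputs],
        )
        self.steps.append(step)
        return step

    def lineage(self, result: bytes) -> list:
        # (c) walk backwards from a result to every step that contributed to it
        wanted = {checksum(result)}
        chain = []
        for step in reversed(self.steps):
            if wanted & set(step.outputs):
                chain.append(step)
                wanted |= set(step.inputs)
        return list(reversed(chain))

    def to_metadata(self) -> str:
        # (d) serialise the log as JSON metadata to accompany published data
        return json.dumps([asdict(s) for s in self.steps], indent=2)


# Illustrative data objects and step names (assumptions, not real XAS tools).
raw = b"raw spectrum"
normalised = b"normalised spectrum"
fitted = b"fit result"

log = ProvenanceLog()
log.record("normalise", {"edge": "K"}, [raw], [normalised])
log.record("fit", {"model": "example"}, [normalised], [fitted])

print([s.tool for s in log.lineage(fitted)])  # → ['normalise', 'fit']
```

Content checksums are only one possible choice of reference. A production service would more likely use persistent identifiers, with the checksum kept as an integrity check.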

Pathfinder 2: Process Recording and Data Collections (lead Samantha Pearman-Kanza)

Process recording is fundamental to driving validation, trust, and reuse of data across the sciences, and is a crucial aspect of data management and integration. It supports the researcher in the laboratory whilst simultaneously structuring data for downstream management, publication, and reuse. It encompasses workflow support and data management systems but, more importantly, tools for ingest that fit seamlessly into existing laboratory practices and account for a range of processes, from capturing handwritten observations to interfacing with instruments and recording data analysis. There are also socio-cultural aspects of disciplinary working practices that must be considered, and the support must interface seamlessly with other aspects of scholarly practice such as linking to published literature, database searching, and writing reports, papers, and theses. This pathfinder will investigate generic and domain-specific tools for process recording and assess the landscape of semantic web technologies and metadata schemas for the physical sciences to aid with this process. It will produce services that empower researchers to choose the best tools for them, and provide exemplars and guidelines on best practices for publishing and sharing data.

Please note: due to a convergence of interests, this pathfinder now incorporates the work that was being done by Pathfinder 3 in Phase 1.

Pathfinder 4: Data Infrastructure and Tooling for Biomolecular Simulation (lead James Gebbie-Rayet)

This pathfinder aims to create production data infrastructure services that will enable researchers to maximise the value and quality of their data, and to enable the development of novel methods and technologies that are currently not possible due to the absence of centralised approaches to data. It will do this by providing a technological solution that enables and simplifies the capture and sharing of data from within biomolecular simulation workflows, in a way that does not require huge cultural shifts in ways of working, thus lowering the barriers to adoption. The intention is to capture the full pipeline of data, from experimental input through to analytical outputs, to preserve the full provenance of how scientific studies are performed. The work will be done in partnership with, and informed by, the biomolecular simulation community via existing relationships under the STFC CoSeC programme in the form of the CCPBioSim (scientific) and HECBioSim (HPC) consortia, along with external partner institutions such as the EBI.

Pathfinder 5: Data to Knowledge (lead Alin Elena)

Pathfinder 5 focuses on the field of machine learning interatomic potentials (MLIPs). Advances in this area require expensive calculations to be produced and used for training machine learning models. In addition, the resulting models are non-trivial to store and distribute compared with previous-generation interatomic potentials, which tended to be analytical. This pathfinder aims to design and deploy the hardware infrastructure to host training data for ML and to act as a distribution centre within PSDI. Work is also planned to provide suitable benchmarks and validations for MLIPs, to build on work done with the Cambridge Crystallographic Metal Organic Frameworks database to screen high volumes of compounds and extract those relevant to specific work, and to provide training to the community through summer schools.

Pathfinder 6: Collaborative Computational Project for NMR Crystallography (lead Sathya Sai Seetharaman)

Pathfinder 6 focuses on the Collaborative Computational Project for NMR Crystallography (CCP-NC). CCP-NC combines experimental NMR and computation to provide new insight, with atomic resolution, into structure, disorder, and dynamics in the solid state. The first aim of the pathfinder is to improve the existing Magnetic Resonance (magres) database at CCP-NC, which is an open repository of computational solid-state NMR results. The second is to redevelop this database to future-proof the frontend whilst adding new features, including but not limited to more complex search options and interoperability with other materials databases.

Pathfinder 7: Reproducible Computational Workflows (lead Leandro Liborio)

The post-processing of experimental and simulation data associated with large-scale experiments performed at national facilities requires that software tools from various domains are connected into workflows. These workflows can be quite complex and, in this pathfinder, we will use the open-source, web-based Galaxy platform to manage the software tools and data associated with them. Galaxy is a platform for FAIR data analysis that enables users to: run code in interactive environments; share and publish results, workflows and their associated visualizations; and ensure the reproducibility of their results by capturing and packaging the data, metadata and provenance models required for repeating and understanding their data analyses. We are initially working with muon science and catalysis experiments, but plan to expand the methodology to other large-scale experiments.