Phase 2 Pathfinders - Physical Sciences Data Infrastructure

Pathfinder 1: Experimental Data Capture (lead Abraham Nieva de la Hidalga)

The research supported by large-scale facilities generates valuable, high-quality research data. However, current data publishing practices undervalue these data: there is limited support for reuse beyond the original research, reproducibility is challenging, and the provenance link between published results and supporting data is not always explicit. The PSDI can address these issues by supporting the development and adoption of tools that facilitate experimental data capture. The tools should (a) produce provenance data for every processing step, (b) provide references for all new and existing data used, (c) unambiguously map results to data objects, and (d) generate the necessary metadata to facilitate the publication of data supporting the presented results. We believe that this initiative will make a valuable contribution to physical sciences research. The current examples of Galaxy tools and workflows developed for XAS processing can be extended to support other experimental areas including XPS, INS, QENS or PDF.
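As an illustration only, requirements (a) to (d) could be sketched in Python roughly as follows. Every name here (content_id, ProcessingStep, ProvenanceLog, the "normalise" step and its parameters) is hypothetical and is not taken from PSDI or Galaxy tooling; the sketch assumes that data objects can be referenced by a hash of their content.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def content_id(data: bytes) -> str:
    """Reference a data object by the SHA-256 digest of its content (assumption)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


@dataclass
class ProcessingStep:
    """(a) Provenance for one processing step, with (b) references to the data used."""
    tool: str
    parameters: dict
    inputs: list
    outputs: list
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class ProvenanceLog:
    steps: list = field(default_factory=list)
    results: dict = field(default_factory=dict)  # result label -> data reference

    def record(self, tool, parameters, inputs, outputs):
        """Append a provenance record for a single processing step."""
        step = ProcessingStep(
            tool,
            parameters,
            [content_id(d) for d in inputs],
            [content_id(d) for d in outputs],
        )
        self.steps.append(step)
        return step

    def map_result(self, label, data):
        """(c) Map a published result (e.g. a figure) to exactly one data object."""
        self.results[label] = content_id(data)

    def to_metadata(self) -> str:
        """(d) Serialise the log as JSON metadata to accompany published data."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical usage: the upper-casing stands in for a real processing step.
raw = b"energy,mu\n7100,0.12\n7101,0.15\n"
normalised = raw.upper()

log = ProvenanceLog()
log.record("normalise", {"method": "placeholder"}, [raw], [normalised])
log.map_result("Figure 1", normalised)
print(log.to_metadata())
```

Referencing data by content hash is one possible design choice: it lets a result be tied to an exact data object without relying on file names, although a real deployment would more likely use persistent identifiers issued by a repository.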

Pathfinder 2: Process Recording (lead Samantha Pearman-Kanza)

Process recording is fundamental to driving validation, trust, and reuse of data across the sciences, and is a crucial aspect of data management and integration. It supports the researcher in the laboratory whilst simultaneously structuring data for subsequent downstream management, publication, and reuse. It encompasses workflow support and data management systems but, more importantly, tools for ingest that fit seamlessly into existing laboratory practices and account for a range of processes, from capturing handwritten observations to interfacing with instruments and recording data analysis. There are also socio-cultural aspects around disciplinary working practices that must be considered, and the support must interface seamlessly with other aspects of scholarly practice such as linking to published literature, database searching and writing reports, papers and theses. This pathfinder will investigate generic and domain-specific tools for process recording and assess the landscape of semantic web technologies and metadata schemas for the physical sciences to aid with this process. It will produce services to empower researchers to choose the best tools for them, and provide exemplars and guidelines on best practices for publishing and sharing data.

Pathfinder 3: Building Data Collections (lead Jeremy Frey)

Constructing and aggregating data collections is a key underpinning service for research, powering discovery, understanding and prediction. However, the majority of data measured in practice does not get incorporated into any form of usable collection. This pathfinder is exploring methods to build, store, manage and access collections for the different types of data, such as institutional data deposited by numerous groups into a single repository, facilities data collected at beamlines on a range of different types of samples, legacy data extracted from papers and proprietary digital sources, and orphaned data such as collections generated for a specific one-off purpose.

Pathfinder 4: Data Infrastructure and Tooling for Biomolecular Simulation (lead James Gebbie-Rayet)

This pathfinder aims to create production data infrastructure services that will enable researchers to maximise the value and quality of their data and enable the development of future novel methods and technologies that are currently not possible due to the absence of centralised approaches to data. It will do this by providing a technological solution that enables and simplifies the capture and sharing of data from within biomolecular simulation workflows, in a way that does not require huge cultural shifts in ways of working, thus lowering the barriers to adoption. The intention is to capture the full pipeline of data, from experimental input through to analytical outputs, to preserve the full provenance of how scientific studies are performed. The work will be carried out in partnership with, and informed by, the biomolecular simulation community via existing relationships under the STFC CoSeC programme in the form of the CCPBioSim (scientific) and HECBioSim (HPC) consortia, along with external partner institutions such as the EBI.

Pathfinder 5: Data to Knowledge (lead Alin Elena)

Pathfinder 5 focuses on the field of machine learning interatomic potentials (MLIPs). Advances in this area require expensive calculations to be produced and used for training machine learning models. In addition, the resulting models are non-trivial to store and distribute compared with previous-generation interatomic potentials, which tended to be analytical. This pathfinder aims to design and deploy the hardware infrastructure to host training data for ML and to act as a distribution centre within PSDI. There is also planned work to provide suitable benchmarks and validations for MLIPs, to build on work done with the Cambridge Crystallographic Metal Organic Frameworks database to screen high volumes of compounds and extract those relevant for specific work, and to provide training to the community through summer schools.

Pathfinder 6: Collaborative Computational Project for NMR Crystallography (lead Sathya Sai Seetharaman)

Pathfinder 6 focuses on the Collaborative Computational Project for NMR Crystallography (CCP-NC). CCP-NC combines experimental NMR and computation to provide new insight, with atomic resolution, into structure, disorder, and dynamics in the solid state. The first aim of the pathfinder is to improve the existing Magnetic Resonance (magres) database at CCP-NC, which is an open repository of computational solid-state NMR results (https://www.ccpnc.ac.uk/database/). The second is to redevelop this database to future-proof the frontend whilst adding new features including, but not limited to, more complex search options and interoperability with other materials databases.

Pathfinder 7: Reproducible Computational Workflows (lead Leandro Liborio)

The post-processing of experimental and simulation data associated with large-scale experiments performed at national facilities requires that different software tools from various domains are connected into workflows. These workflows can be quite complex and, in this pathfinder, we will be using the open-source, web-based Galaxy platform to manage the software tools and data associated with them. Galaxy is a platform for FAIR data analysis that enables users to: run code in interactive environments; share and publish results, workflows and their associated visualizations; and ensure the reproducibility of their results by capturing and packaging the data, metadata and provenance models required for repeating and understanding their data analyses. We are initially working with muon science and catalysis experiments, but plan to expand the methodology to other large-scale experiments.

Pathfinder 8: Data-capture for advanced metals processing (lead Chris Race)

The Henry Royce Institute will lead a Pathfinder project on developing data infrastructure for exploring Processing-Structure-Property (P-S-P) relationships in metallurgy. The project will have three objectives: (i) develop an ontology and metadata schema for describing metals processing; (ii) develop microstructural descriptors and software tools to digitally fingerprint microscopy images; (iii) develop and install infrastructure (physical and digital) for capturing and storing P-S-P information (the digital thread). It will build on work in previous PSDI pathfinders and adapt some of these learnings to the particularly challenging metallurgical space, where the processing history strongly affects microstructure and performance, where considerable redundancy and inefficiency may exist, and where current data capture is often unstructured.

The pathfinder will be led through the Royce Discovery Centre in Sheffield and will make use of its extensive additive manufacturing facilities. These provide a data-rich platform on which to develop the capabilities that form part of this Pathfinder, with a view to generalising them for deployment in more traditional metallurgical applications such as casting and forging.

Pathfinder 9: Predictive outcomes for materials selection (lead Anna Croft)

The pathfinder, led by Anna Croft at Loughborough University, will focus on the development of predictive outcomes for materials selection, initially for polymers and amorphous materials. It aims to design and deploy a combined database that integrates workflows for both experimental and calculated data to assist with materials selection. It will encourage adoption of FAIR data in this domain by implementing standards agreed with, and useful to, the materials community. The project will explore the technological requirements for storing data, potential tools and method recording, and will work with the community to explore standards for these areas.