WP4 - First Round Pathfinders - Physical Sciences Data Infrastructure

First Round Pathfinders

WP4 was led by Simon Coles. Pathfinders are a means of focusing development activity on key strategic areas. They explored and established exemplar approaches and systems that can be folded into the PSDI infrastructure and act as templates or starting points for bringing in further domains, data types, techniques and user communities in future phases. The first-round pathfinders were selected and designed to address key strategic (cross-)disciplinary areas (catalysis/environment, biomolecular science, artificial intelligence) while also developing approaches to support key scientific techniques and components of the data pipeline. This phase of PSDI took forward five pathfinders to begin initial development work.

In Phase 1, the initial pathfinders interacted more deeply with specific user communities to gather detailed requirements and produce systems and tools that will interface with the Hub as PSDI becomes more integrated in future phases. They also helped develop a series of exemplars and blueprints that can serve as starting points for expanding PSDI to more user communities and for future development, and they contributed to standards to be incorporated into the Hub.

Pathfinder 1: Experimental data capture (lead Abraham Nieva de la Hidalga)

Catalysis research involves combining data from simulation, characterisation and activity/properties of a huge array of candidate compounds. It covers everything from the fundamental chemistry of materials through to industrial-scale processes and addresses key issues such as the environment, sustainability and energy. The Catalysis Hub is a nexus for this community and is considering systems to support the management and reuse of data. A PSDI pilot case study on this topic produced a number of recommendations that could be addressed through interaction with the PSDI Hub. PSDI and the Catalysis Hub will work together to address these issues more quickly by co-creating an integrated environment to support the capture, analysis and reuse of data, initially X-ray absorption spectroscopy (XAS) data and then extending to other key techniques.

Pathfinder 2: Process recording (lead Samantha Pearman-Kanza)

Process recording is fundamental to driving validation, trust and reuse of data across the sciences and is a crucial aspect of data management and integration. It supports the researcher in the laboratory whilst simultaneously structuring data for subsequent downstream management, publication and reuse. It encompasses workflow support and data management systems but, more importantly, tools for ingest that fit seamlessly into existing laboratory practices and account for a range of processes, from capturing handwritten observations to interfacing with instruments and recording data analysis. There are also socio-cultural aspects of disciplinary working practices that must be considered, and the support must interface smoothly with other aspects of scholarly practice such as linking to published literature, database searching and writing reports, papers and theses. This pathfinder implemented several generic routes for recording process into the Hub and developed metadata and ontology layers that enable new and existing processing and analysis tools to interact intelligently, for example to generate supplementary information for papers or appendices for theses and reports automatically.

Pathfinder 3: Building Data Collections (lead Jeremy Frey)

Constructing and aggregating data collections is a key underpinning service for research, powering discovery, understanding and prediction. However, the majority of data measured in practice does not get incorporated into any form of usable collection. This pathfinder explored and developed methods to build, store, manage and access collections for different types of data: institutional data deposited by numerous groups into a single repository, facilities data collected at beamlines on a range of different types of samples, legacy data extracted from papers and proprietary digital sources, and orphaned data such as collections generated for a specific one-off purpose. Spectroscopy is a key technique in the environmental, life and physical sciences, providing not only characterisation of samples but also in-situ analysis of dynamic systems, and will provide the testbed for much of this work. This work was conducted with the University of Cambridge and Imperial College and developed standards, based on IUPAC FAIRSpec, and infrastructure components to drive tools and processes to capture, manage, analyse and reuse spectroscopic data from across the sciences.

Pathfinder 4: Process Orchestration (lead James Gebbie)

Biomolecular science links physical science with biological and medical science. Working alongside national and international initiatives, this pathfinder created a repository for biomolecular simulation trajectories, including data storage, a web front end, and automated capture of metadata and provenance. It looked to integrate datasets from CCPBioSim’s educational SlimMD database and trajectory data from the dinaMISMO project for simulating cryoEM images. We built on the AiiDA biomolecular demonstrator carried out in the PSDI pilot to build provenance models for the included data, and will prototype linking these provenance models with input data sources such as the PDB, EMDB and EMPIAR, so that the full flow of data from experiment to simulation is tracked. Key to this was the close involvement of RFI, EMBL-EBI and the Francis Crick Institute in defining requirements, e.g., software tools for in situ analysis and linking to other chemistry/biology repositories for integrative studies, enabling modelling and experimental data to be used synergistically.

Pathfinder 5: Data to Knowledge (lead Alin Elena)

Transforming data to knowledge requires constructing workflows that connect various types of data sources (e.g., from computational and experimental facilities to ELNs) for data ingestion, discovery, integration, analysis and visualisation. Fundamental examples of this knowledge discovery will be developed on PSDI resources and infrastructure. This work was underpinned by expertise in computational science, data mining and AI at STFC and will leverage dedicated specialists and communities already assembled through partnership with, among others, the Alan Turing Institute, CoSeC, the RSEs and the ALC. Recent research on machine learning interatomic potentials (MLIPs) has produced breakthroughs that position them as the next paradigm shift in atomistic molecular simulation, with applications ranging from battery design to modelling catalytic chemical reactions for hydrogen storage or CO2 capture. However, these advances require expensive calculations to train the machine learning models, and the resulting models are non-trivial to store and distribute. This pathfinder designed and deployed the infrastructure to host MLIP training data and to act as a distribution centre for it.