PSDI Project Shows How Physical Sciences Data Can Be Made Ready for AI


May 8, 2026

Authors: Dr Matthew Partridge and Dr Aileen Day

A PSDI-supported project has shown how experimental and computational data from across the physical sciences can be turned into datasets that are easier for AI researchers and tools to find, understand and use.

The UKRI-funded project, Provision of ‘AI-ready’ data: prototyping data pipelines and repositories, tackled a practical problem. Artificial intelligence and machine learning are increasingly being used in science, but they depend on good data. In many cases, that data already exists, but it is scattered across formats, repositories and research groups and often hidden in large volumes of material of variable quality. It may be clear to the people who produced it, but much harder for someone else, or a machine learning workflow, to interpret. 

The project explored what needs to change for physical sciences data to become genuinely useful for AI. The answer was not simply to publish more data. It was to make data better described, better structured and better connected to the scientific questions it could help answer. 

Working through real research examples, PSDI and its partners created a new AI Ready Datasets Community Data Collection. The collection brings together datasets from microscopy, diffraction, crystallography, physical chemistry and computational chemistry, showing how different kinds of scientific data can be prepared for AI use through shared infrastructure and common standards.   

The work involved PSDI, the Ada Lovelace Centre, Diamond Light Source, the National Electron Diffraction Facility at the University of Southampton, UCL, Queen Mary University of London, and the University of Strathclyde. 

From stored data to usable data 

Across the physical sciences, researchers produce valuable data: images, diffraction patterns, crystal structures, computational outputs and measured properties. All of these can support machine learning, but only if they are more than simply available. They need to be usable. 

A researcher looking at a folder of files may know exactly what each one means. A future user may not. An AI workflow certainly will not. It needs to know what the files contain, how they relate to one another, where the data came from, what the labels mean, what was measured and what limitations apply. 

“Usable data needs context”

This is where the project made its main contribution. It developed and tested ways of describing scientific datasets so they can be used more easily by people and software. A key part of this was Croissant metadata, a standard for describing machine learning datasets. Croissant files can record file structure, variables, relationships, labels, provenance, annotation and intended use. 
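
To make that concrete, here is a minimal sketch of what a Croissant-style description of a single CSV file could look like. It is an illustration only: the dataset, file name and columns are hypothetical, the `@context` is heavily abbreviated, and the output has not been validated against the Croissant specification or taken from any of the project's datasets.

```python
import json


def build_croissant_sketch(name, description, csv_name, columns):
    """Return a simplified Croissant-style description as a plain dict.

    `columns` maps a column name to a (data type, meaning) pair.
    """
    file_id = "data-file"
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"records/{column}",
            "name": column,
            "description": meaning,
            "dataType": data_type,
            # Tie the field back to the column it is read from.
            "source": {
                "fileObject": {"@id": file_id},
                "extract": {"column": column},
            },
        }
        for column, (data_type, meaning) in columns.items()
    ]
    return {
        # Abbreviated context (assumption); real files carry a fuller one.
        "@context": {
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "name": csv_name,
                "contentUrl": csv_name,
                "encodingFormat": "text/csv",
            }
        ],
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "records",
                "name": "records",
                "field": fields,
            }
        ],
    }


# Hypothetical example data, invented for illustration.
metadata = build_croissant_sketch(
    name="example-particles",
    description="Hypothetical table of particle measurements.",
    csv_name="example_particles.csv",
    columns={
        "particle_id": ("sc:Text", "Identifier of the measured particle"),
        "diameter_nm": ("sc:Float", "Particle diameter in nanometres"),
    },
)
print(json.dumps(metadata, indent=2))
```

Even in this reduced form, the description tells both a person and a loader which file holds the data, which columns exist and what each one means.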

The project did not treat metadata as a purely technical add-on. It also ensured that important information was visible and understandable to human users. AI-ready data has to work for both audiences: the researcher deciding whether a dataset is scientifically appropriate, and the software trying to load and use it without making false assumptions or misunderstanding the context. 

Examples from real science 

The project tested its approach through six examples, each chosen to explore a different part of the AI-ready data challenge. 

One, led by University College London, used synthetic transmission electron microscopy images of gold nanoparticles, designed to train and test machine learning tools that can automatically identify nanoparticles in microscope images. Because the data is synthetic, the correct answer is already known, giving researchers reliable labels for AI training and validation. 

A second used Project M crystallisation and diffraction data from Diamond Light Source beamline I11. This showed how experimental records can be brought together into a single dataset shaped around a specific machine learning question, such as identifying trends in additives or spotting unusual results. 

A third came from the National Electron Diffraction Facility at the University of Southampton. This dataset is being used to develop AI-based decision-making for sample screening, where selecting particles for electron diffraction currently relies heavily on skilled human judgement. 

Another example supported University College London and Queen Mary University of London to apply CrystaLLM-pi, a tool for generating crystal structures from powder X-ray diffraction profiles. The associated chili100k_strat dataset was prepared to support materials discovery by helping train and fine-tune the system. 

A project by Strathclyde produced BenchmarkSet1500, a reference dataset of 1,500 organic semiconductor crystal structures with high-accuracy calculations. This provides a standardised resource for building machine learning models, comparing computational methods and supporting screening studies. 

Finally, the project worked with PChProp, a physical chemistry property collection managed by PSDI. An export from PChProp (v.1), covering 3,973 compounds with both melting point and boiling point data, was packaged as an AI-ready research object using RO-Crate, which bundles data files with structured metadata describing context, provenance and relationships, alongside Croissant metadata. This showed how an existing scientific data collection can be repackaged with the information needed for machine learning, while preserving context, provenance and usability for future researchers.
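
As a rough sketch of the packaging idea, the snippet below builds the kind of `ro-crate-metadata.json` content that RO-Crate 1.1 uses to describe a bundle of files. The file names and descriptions are hypothetical, and this is not the actual PChProp crate; a published crate would normally carry more detail, such as licence and authorship.

```python
import json


def build_ro_crate_sketch(name, description, files):
    """Return minimal RO-Crate 1.1 style metadata as a plain dict.

    `files` maps a relative file path to a short description.
    """
    # The metadata descriptor points at the root dataset it describes.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # The root dataset lists every file that belongs to the crate.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "hasPart": [{"@id": path} for path in files],
    }
    # One entity per file, so each file carries its own context.
    parts = [
        {"@id": path, "@type": "File", "name": path, "description": text}
        for path, text in files.items()
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *parts],
    }


# Hypothetical crate contents, invented for illustration.
crate = build_ro_crate_sketch(
    name="Example property export",
    description="Hypothetical export of melting and boiling point data.",
    files={
        "properties.csv": "Table of compounds with measured properties",
        "croissant.json": "Croissant description of properties.csv",
    },
)
print(json.dumps(crate, indent=2))
```

Listing the Croissant file as part of the crate is one way the two kinds of metadata can sit alongside each other in a single package.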

Together, these examples showed that there is no single shape for AI-ready data. It can be images, tables, diffraction data, crystal structures or chemical property records. What matters is that the data is organised and described in a way that makes sense beyond the original project, beyond the lifetime of the lab notebook that first explained it, for both current and future AI use. 

Building a practical pipeline

One key outcome of working with these real-world examples was a practical approach for taking datasets through an AI-readiness pipeline.

That pipeline starts with the people who understand the data best: the researchers who produced it. They can explain what the files contain, how the data was collected, what the variables mean and what the data might be suitable for. With the right prompts and structure, much of this information can be captured during normal dataset deposition. 

More complex cases need input from machine learning users. A dataset may need to be reorganised for a particular task, split into training and test sets, or linked across several files. The Project M work showed how important that collaboration can be. The original data could be described by the people who produced it, but shaping it for a specific machine learning task needed discussion with the people who would use it computationally.
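
One of those reshaping steps, splitting into training and test sets, can be sketched in a few lines. The approach below is an assumption chosen for illustration, not the method used in the project: it hashes each record identifier so that a record always lands in the same split, even if the dataset is later extended. The identifiers are hypothetical.

```python
import hashlib


def assign_split(record_id, test_fraction=0.2):
    """Assign a record to 'train' or 'test' from a hash of its identifier."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in the range [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "test" if bucket < test_fraction * 2**64 else "train"


def split_records(record_ids, test_fraction=0.2):
    """Group record identifiers into train and test lists."""
    splits = {"train": [], "test": []}
    for record_id in record_ids:
        splits[assign_split(record_id, test_fraction)].append(record_id)
    return splits


# Hypothetical record identifiers, invented for illustration.
example_ids = [f"sample-{number}" for number in range(1000)]
example_splits = split_records(example_ids)
print(len(example_splits["train"]), len(example_splits["test"]))
```

Because the assignment depends only on the identifier, the split can be recorded in the metadata and reproduced by anyone. It does not account for related records that should stay together, which is exactly the kind of decision that needs discussion between data producers and machine learning users.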

The project also explored ways to make this easier. A tool based on Ollama was developed to help populate Croissant files from an agreed schema. This kind of support could reduce manual work, while keeping the scientific judgement of domain researchers at the centre of the process. 
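
The general pattern can be sketched as follows. This is not the project's tool: the schema, the prompt wording and the `ask_model` callable are all hypothetical, and `ask_model` simply stands in for whatever local model call is used. The point of the sketch is the division of labour: the model proposes values, only fields in the agreed schema are kept, and anything it cannot fill is flagged for a domain researcher.

```python
import json

# Hypothetical agreed schema: field name -> guidance given to the model.
AGREED_SCHEMA = {
    "name": "Short dataset title",
    "description": "What the files contain and how the data was collected",
    "license": "Licence under which the data may be reused",
    "intended_use": "Machine learning tasks the data is suitable for",
}


def build_prompt(schema, readme_text):
    """Turn the schema and a README into a single instruction for the model."""
    lines = [f"- {key}: {hint}" for key, hint in schema.items()]
    return (
        "Fill in these metadata fields from the README below. "
        "Reply with one JSON object and leave out any field you cannot support.\n"
        + "\n".join(lines)
        + "\n\nREADME:\n"
        + readme_text
    )


def populate_from_model(schema, readme_text, ask_model):
    """Return (filled fields, fields that still need human review)."""
    raw_reply = ask_model(build_prompt(schema, readme_text))
    try:
        proposed = json.loads(raw_reply)
    except json.JSONDecodeError:
        proposed = {}
    if not isinstance(proposed, dict):
        proposed = {}
    filled = {}
    for key in schema:
        value = proposed.get(key)
        # Keep only non-empty text for fields the schema defines.
        if isinstance(value, str) and value.strip():
            filled[key] = value.strip()
    needs_review = [key for key in schema if key not in filled]
    return filled, needs_review


def fake_model(prompt):
    """Stand-in for a local model call; returns a fixed reply for the demo."""
    return json.dumps(
        {"name": "Example diffraction set", "license": "", "colour": "blue"}
    )


filled, needs_review = populate_from_model(
    AGREED_SCHEMA, "Example README text.", fake_model
)
print(filled)
print(needs_review)
```

Unknown keys such as `colour` are dropped and empty answers are treated as missing, so the model's output never bypasses the schema or the researcher's review.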

Tools, governance and collaboration

The datasets were not created in isolation. Several were linked to tools and workflows that show how AI-ready data can help speed up research. 

In microscopy, the project supported TEMPOS, an open-source platform for automated nanoparticle segmentation in transmission electron microscopy images. TEMPOS uses synthetic images with known labels to train models, reducing manual annotation. Related work explored automated TEM alignment, reducing reliance on skilled manual adjustment and improving reproducibility. 

In crystallography and materials research, CrystaLLM-pi and a related crystal virtualiser were developed to help generate and refine crystal structures from experimental diffraction data. At the National Electron Diffraction Facility, the work is helping move towards automated screening of electron diffraction samples. The approach learns the likely experiment outcome from a single-shot diffraction pattern taken from each of the thousands of nanoparticles in a sample.

These examples illustrate a wider point. AI-ready data is not an abstract data management exercise. It can support tools that reduce manual work, improve reproducibility and allow researchers to ask questions at a larger scale. It also gives AI tools a fighting chance of doing something useful, rather than boldly learning the hidden structure of inconsistent column headings.

The project also looked carefully at how AI-ready data should be governed. Datasets were deposited through PSDI Community Data Collections, with requirements for review, metadata, licensing and, where needed, embargo periods. Responsible AI metadata fields were used where possible, recording provenance, annotation, intended use and limitations. 

The project showed that much of the work needed to make data AI-ready can begin with researchers themselves. With the right prompts, structure and supporting tools, data producers can capture key information about files, variables, provenance and intended use as part of normal dataset deposition, rather than as an extra layer of work added later. Further development is still needed, but the project has shown the beginnings of a system that could make AI readiness part of routine data sharing without significantly increasing the burden on researchers.

At the same time, some datasets need more specialist input. Where data must be reshaped for a specific machine learning task, aggregated across records, split into training and test sets, or described in more technical detail, collaboration becomes essential. In those cases, domain scientists, data engineers, software engineers and AI researchers all have a role to play. PSDI and the Ada Lovelace Centre helped connect those groups, while the use cases brought in researchers from across national facilities and universities. 

What comes next

The project has shown that physical sciences data can be made AI-ready in practical, reusable ways. It has produced datasets, tools, workflows and examples that will continue to inform PSDI’s work. It has also shown where further effort is needed. 

The next step is to move from successful examples to routine practice. Researchers need tools that can help them create Croissant metadata, RO-Crates and README files as part of normal data deposition. Repositories need ways to aggregate separate records into task-specific datasets. Metadata standards need to work more smoothly with physical sciences conventions, including units, provenance, licences and permitted reuse. Validation tools are needed so that datasets can be checked before publication, rather than leaving problems to be discovered by users later. 
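
A pre-publication check of the kind described above could start very simply. The sketch below is an assumption about what such a tool might test, not an existing PSDI validator: the required fields and the rule that numeric columns need units are hypothetical choices, and the example metadata is invented.

```python
# Hypothetical list of fields a deposit must provide before publication.
REQUIRED_FIELDS = ("name", "description", "license", "provenance", "intended_use")


def validate_metadata(metadata, columns):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    for column, info in columns.items():
        if not info.get("description"):
            problems.append(f"column '{column}' has no description")
        # Physical quantities are hard to reuse without units.
        if info.get("numeric") and not info.get("unit"):
            problems.append(f"numeric column '{column}' has no unit")
    return problems


# Hypothetical deposit, invented for illustration.
example_metadata = {
    "name": "Example property table",
    "description": "Melting and boiling points",
    "license": "",
    "provenance": "Exported from a curated collection",
}
example_columns = {
    "melting_point": {"description": "Melting point", "numeric": True, "unit": "K"},
    "boiling_point": {"description": "", "numeric": True},
}
for problem in validate_metadata(example_metadata, example_columns):
    print(problem)
```

Running a check like this at deposition time gives the data producer a short list of gaps to fill while the details are still fresh, rather than leaving them for a later user to find.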

There is also a cultural challenge. AI-ready data sits between communities. Data producers know how the data was made and what it means. AI researchers know what information is needed to train, test and evaluate models. The infrastructure has to help those groups meet in the middle. 

That does not mean every researcher should become an AI specialist, nor that every machine learning developer should become an expert in every branch of the physical sciences. The aim is to create systems that capture the right information at the right point, using tools that make good practice easier.

This project has provided a strong starting point. It has shown that AI-ready data needs standards, but also judgement; automation, but also expert curation; repositories, but also communities. Most importantly, it needs to be built around real scientific use.  

“Automation – with expert judgement”

As AI becomes more deeply embedded in research, the value of data will depend not only on how much is available, but on how well it can be understood and reused. PSDI’s work has helped show what that future can look like for the physical sciences, and what still needs to be built to make it work at scale. 

To borrow from the well-known AI idiom, “garbage in, garbage out”: even the best data in the world can look like garbage to an AI if it is not properly described and identified as the valuable scientific resource it is. 

Interested in making data AI-ready?

Join us to explore what it truly means to be “AI Ready”

A full day in-person workshop sharing successes, challenges and practical approaches for improving AI readiness

Project Partners and Contributors

 

Principal Investigators  Role             Organisation
Simon Coles              Project Lead     University of Southampton
Ricardo Grau-Crespo      Project Co-Lead  Queen Mary University of London
Juan Bicarregui          Project Co-Lead  STFC – Laboratories
Paul Quinn               Project Co-Lead  STFC – Ada Lovelace Centre
Andrew Stewart           Project Co-Lead  University College London
Keith Butler             Project Co-Lead  University College London
Robert Palgrave          Project Co-Lead  University College London
Julia Parker             Project Co-Lead  Diamond Light Source
Phil Chater              Project Co-Lead  Diamond Light Source

Project Team             Role                               Organisation
Aileen Day               Research and innovation associate  University of Southampton
Matthew Partridge        Research and innovation associate  University of Southampton
Daniel Rainer            Technician                         University of Southampton
Mansoor Nellikkal        Professional enabling staff        University of Southampton
Martin Bradley           Researcher                         University College London
Jonathan Bathe           Researcher                         STFC – Laboratories
Jaehoon Cha              Researcher co-lead                 STFC – Laboratories
Maitrayee Singh          Researcher                         STFC – Laboratories
Natalia Da Silva De Sa   Researcher                         University College London
Tobias Bird              Researcher                         Diamond Light Source