At the end of April 2024, Dr Samantha Pearman-Kanza attended the 2nd Edition of the MADICES Workshop (Machine-actionable data interoperability for the chemical sciences) in snowy Berlin, on behalf of PSDI.
This was a really interesting workshop that explored many of the research areas that PSDI is interested in, including metadata, interoperability, data sharing and processing, and the use of semantic web technologies in producing FAIR data. The workshop was a mix of discussion and hackathon-style activities, and a number of datasets were made available in advance for use in the hackathon.
Thanks to several pre-planning calls and a GitHub repository set up in advance, workshop members had been able to suggest discussion topics and areas of interest, and these were formalised into three main working groups on day 1:
- Semantic Annotation Standards & Tools – This looked at identifying best practices for semantically annotating and structuring data and the range of tools available to automate this process. GitHub: https://github.com/MADICES/MADICES-2024/discussions/9
- Platform Communication – This looked at how different platforms should communicate requests, what these should look like and what data formats should be used. GitHub: https://github.com/MADICES/MADICES-2024/discussions/10
- Proprietary Dataset Processing – This looked at the different methods for processing proprietary datasets. GitHub: https://github.com/MADICES/MADICES-2024/discussions/11
All three areas were of interest, but Samantha naturally took part in the Semantic Annotation Standards & Tools group. The group started with a knowledge-sharing exercise, which was really interesting as it exposed the range of tools and techniques that different group members were and weren’t aware of. These were documented on GitHub: https://github.com/MADICES/MADICES-2024/discussions/9.
The main interests of the group centred around a few key topics, namely generating semantic metadata, how to create JSON-LD files to do this, and the different ways of representing some of the scientific information semantically, e.g. units. These were formalised into three specific tasks for the hackathon: gathering information about unit implementations and creating a package on PyPI; creating an exemplar semantic metadata record in JSON-LD from a pre-existing metadata record; and writing a simple automation script to identify which areas were simple or complex to automate. The results of these activities can be found in this repository: https://github.com/MADICES/semantic-annotation.
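To give a flavour of the second task, here is a minimal Python sketch of turning a flat metadata record into JSON-LD. It is not the workshop's output: the input record, the `to_jsonld` helper and the choice of schema.org terms with a QUDT unit IRI are all assumptions made purely for illustration.

```python
import json

# Hypothetical pre-existing metadata record (not taken from the workshop datasets).
plain_record = {
    "title": "Melting point of sample A",
    "creator": "Example Researcher",
    "temperature": 355.2,
    "temperature_unit": "K",
}

# Assumed lookup from unit labels to QUDT unit IRIs; check IRIs against QUDT before reuse.
UNIT_IRIS = {"K": "http://qudt.org/vocab/unit/K"}


def to_jsonld(record):
    """Wrap a flat record in a JSON-LD document using schema.org terms."""
    unit_label = record["temperature_unit"]
    return {
        "@context": {"schema": "https://schema.org/"},
        "@type": "schema:Dataset",
        "schema:name": record["title"],
        "schema:creator": {
            "@type": "schema:Person",
            "schema:name": record["creator"],
        },
        "schema:variableMeasured": {
            "@type": "schema:PropertyValue",
            "schema:name": "temperature",
            "schema:value": record["temperature"],
            # The unit is given both as human-readable text and as an IRI.
            "schema:unitText": unit_label,
            "schema:unitCode": UNIT_IRIS[unit_label],
        },
    }


doc = to_jsonld(plain_record)
print(json.dumps(doc, indent=2))
```

The printed document can be pasted into the JSON-LD Playground to check that it expands as expected. Even in a toy example like this, the free-text fields map across easily, while the unit needs a deliberate choice of vocabulary.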
Undertaking these activities raised some really important points. With respect to units, which at first glance feel like they should be a simple area to work in, there is actually a range of different ontologies and methods for representing units and measurements semantically, and they can prove quite complex to get your head around. It can also be quite tricky to ensure that you have selected the appropriate unit to use with your data.
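As a rough illustration of why unit selection is tricky to automate, the sketch below maps unit labels to candidate unit IRIs and flags the labels that need a human decision. The lookup table and the `resolve_unit` helper are hypothetical, and the QUDT IRIs are illustrative and should be verified before use.

```python
# Illustrative table: unit labels as they might appear in a dataset,
# mapped to candidate QUDT unit IRIs (assumed, not exhaustive).
CANDIDATES = {
    "K": ["http://qudt.org/vocab/unit/K"],
    "degC": ["http://qudt.org/vocab/unit/DEG_C"],
    # "M" could mean molar (mol/L) in a chemistry dataset, or be a
    # miscapitalised metre, so it cannot be resolved automatically.
    "M": [
        "http://qudt.org/vocab/unit/MOL-PER-L",
        "http://qudt.org/vocab/unit/M",
    ],
}


def resolve_unit(label):
    """Return (status, iris); status is 'resolved', 'ambiguous' or 'unknown'."""
    iris = CANDIDATES.get(label, [])
    if len(iris) == 1:
        return "resolved", iris
    if iris:
        return "ambiguous", iris
    return "unknown", iris


for label in ["K", "M", "furlong"]:
    status, iris = resolve_unit(label)
    print(label, status, iris)
```

Labels that come back as "ambiguous" or "unknown" are the ones where domain knowledge is needed to pick the right term.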
With respect to semantic annotation, this exercise highlighted just how important it is to have the combination of semantic knowledge and domain knowledge. This combination ensures that the people creating the semantic annotations can actually understand what the data is about and can correctly identify the required ontology terms. Many of the same terms are used across chemistry and biology to mean slightly different things, so it’s important to ensure that the selected classes actually have the same meaning as the data represented in the datasets. It’s all about semantics, literally! An example of the JSON-LD that was created can be found in this JSON-LD Playground link, and the test data and conversion scripts can be found in the GitHub repository.
This was a highly pertinent workshop for PSDI as it aligns with many of our research interests and goals for the future. We are actively looking into automated metadata generation, best practices for metadata, and how to usefully implement ontologies and linked data within the physical sciences to improve interoperability.
The full conference program and participants list can be found here: https://www.cecam.org/workshop-details/machine-actionable-data-interoperability-for-the-chemical-sciences-madices-2-1321