Case Study 5: Data curation and availability at instrument-based facilities


Authored by:

Richard Boardman 1, Robert Bannister 2, Simon Coles 2

1School of Engineering, Faculty of Engineering and Physical Sciences, University of Southampton, SO17 1BJ

2School of Chemistry, Faculty of Engineering and Physical Sciences, University of Southampton, SO17 1BJ

Download a .pdf copy of this summary report

Download a .pdf copy of the full detail report

Overall Aim

To understand the facility data management needed to publish standalone datasets, for example to support formal publishing routes or machine learning, within the National Research Facility for lab-based X-ray CT (NXCT).

Key focus and activity

This case study is concerned with the Southampton node of the NXCT (the µ-VIS X-ray Imaging Centre). The extant O(PB) archive of computed tomography data housed at this node includes captured metadata. The main activity is assessing whether this is “good enough” to publish datasets, and building prototype tools to “bridge the gap”. Additionally, the work examines whether this approach translates to other instrument-based facilities.

Area of Physical Sciences covered

X-ray Computed Tomography, Image and Vision Computing, Materials Engineering, Chemical Structure, Biomaterials and Tissue Engineering, Medical Imaging

Applicability to the Research Data Lifecycle

The case study, through its specific example and its demonstrator, is applicable to all parts of the Research Data Lifecycle: the goal is to expose data generated by instrument-based facilities according to the FAIR principles. The existing example system handles collection and capture, management, storage and preservation. The demonstrator complements it in the other areas, promoting sharing, collaboration and analysis, and discovery and reuse.

Figure 1 – JISC Research Data Lifecycle Model [1]

Main outputs

The case study has established what already exists in terms of unique identifiers for datasets at the NXCT node, and that metadata and data are not only linked but can be exposed in a manner that makes it possible to reidentify them. It is also apparent that these data meet the criteria for DOI assignment.

As part of the demonstrator, a queryable interface has been built that provides modest (demonstration) search capabilities over the metadata. It can return results in either HTML or JSON format, to aid human or machine readability respectively.
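As an illustration only, the idea of one query returning either HTML or JSON can be sketched in a few lines of Python. This is not the demonstrator's code: the records, the field names and the functions `search` and `render` are all hypothetical, chosen to show the pattern.

```python
import html
import json

# Hypothetical records; the field names are illustrative, not the NXCT schema.
RECORDS = [
    {"id": "scan-0001", "sample": "carbon fibre coupon", "voltage_kV": 120},
    {"id": "scan-0002", "sample": "bone biopsy", "voltage_kV": 80},
    {"id": "scan-0003", "sample": "carbon fibre panel", "voltage_kV": 160},
]


def search(records, term):
    """Return records in which any field value contains the term (case-insensitive)."""
    needle = term.lower()
    return [r for r in records if any(needle in str(v).lower() for v in r.values())]


def render(results, fmt="json"):
    """Serialise results as JSON (for machines) or as an HTML table (for people)."""
    if fmt == "json":
        return json.dumps(results, indent=2)
    if fmt == "html":
        if not results:
            return "<p>No results</p>"
        columns = list(results[0])
        head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
        rows = "".join(
            "<tr>"
            + "".join(f"<td>{html.escape(str(r.get(c, '')))}</td>" for c in columns)
            + "</tr>"
            for r in results
        )
        return f"<table><tr>{head}</tr>{rows}</table>"
    raise ValueError(f"unsupported format: {fmt}")
```

Calling `render(search(RECORDS, "carbon"), "html")` would give a table for a browser, while the same search with `"json"` gives output a script can parse. Keeping the search separate from the rendering means both views always describe the same results.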

Importantly, the facility to retrieve data or metadata already exists at the NXCT node, so it was not necessary to establish a new route to meet these requirements. Other facilities may need to establish one by identifying what needs to be implemented, alongside the logistics for handling or retrieving large datasets, managing offline or archived datasets, and handling embargoes.

Finally, the case study has revealed routes to abstraction: the metadata at the NXCT node are either derivable (requiring domain understanding) or extractable (requiring an understanding of the instrument or data format), and extraction is automated. To assess whether the model followed by the NXCT at Southampton applies to NCS and beyond, we need to be able to answer the question “is it practical to make a generalisable ‘shim’ layer without compromising metadata richness?”. The NXCT node has demonstrably provided this, and prima facie it would appear to be applicable in its current form to other facilities, though this warrants further assessment.

More information can be found in the detailed activity report

Outcomes and Recommendations

Data/metadata capture

The Southampton NXCT node has been capturing all data and metadata for over ten years. Metadata come from two sources. The first is user-supplied: users describe the sample they wish to scan and outline their expectations. These metadata are potentially subject to embargo or other restriction, and carry the risk of being “wrong”.

The second source of metadata is the equipment itself, recorded when the experiment is conducted. It is important to capture these metadata even when they are not currently well understood: some are instrument-specific, and their meaning, whilst not necessarily clear initially, can become clear later, for example when an instrument manufacturer explains a field, or when context or experience allows the operators to provide additional insight.

Recommendation: Ensure data and metadata are being captured and stored by the facility, and identify the gaps between this and FAIR data.

Benefits: Capturing data and metadata immediately ensures these are not lost, and new understanding can be applied retrospectively. Identifying the gaps between what is currently captured (and not captured), and FAIR data provides a pathway to full FAIR compliance for any instrument-based facility.

Infrastructure: facility dependent. Users need training in FAIR principles, and the facility needs the means (software, databases and hardware) to capture data and metadata. Without these, there is nothing to share.
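One way to picture the gap identification recommended above is as a checklist run against each captured record. The sketch below is an assumption throughout: the checks are only loosely motivated by the FAIR principles and are not an official FAIR assessment, and the field names (`identifier`, `description`, `access_url`, `licence`) are hypothetical.

```python
# Hypothetical checks, loosely motivated by the FAIR principles.
# Each check takes a metadata record (a dict) and returns True if satisfied.
CHECKS = {
    "persistent identifier": lambda r: bool(r.get("identifier")),
    "descriptive metadata": lambda r: bool(r.get("description")),
    "retrieval location": lambda r: bool(r.get("access_url")),
    "usage licence": lambda r: bool(r.get("licence")),
}


def find_gaps(record):
    """Return the names of the checks this record does not yet satisfy."""
    return [name for name, check in CHECKS.items() if not check(record)]
```

A record with only an identifier and a description would be reported as missing a retrieval location and a usage licence, which gives the facility a concrete list to work through.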

Create lightweight “shim” layers to provide a common interface

Many established instrument-based facilities should already have a data and metadata management framework that is in partial compliance with FAIR. Recognising that it might be difficult or impractical to change this (especially as some metadata may be in a fixed proprietary format supplied by the instrument manufacturer), create instead a very thin “shim” that sits between what already exists and what is desired.
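A minimal sketch of such a shim, assuming the existing system can hand over each record as a dictionary of native field names, might look like the following. The vendor field names and the common vocabulary in `FIELD_MAP` are invented for illustration and do not describe any real instrument format.

```python
# Hypothetical mapping from a vendor's field names to a common vocabulary.
FIELD_MAP = {
    "XraykV": "tube_voltage_kV",
    "XrayuA": "tube_current_uA",
    "Projections": "projection_count",
}


def shim(native):
    """Build a common view of a native metadata record without altering the source.

    Mapped fields are renamed to the common vocabulary. Unmapped fields are
    kept under the "native" key so that no metadata are lost.
    """
    common = {}
    passthrough = {}
    for key, value in native.items():
        if key in FIELD_MAP:
            common[FIELD_MAP[key]] = value
        else:
            passthrough[key] = value
    common["native"] = passthrough
    return common
```

The existing framework is left untouched: the shim only reads from it, so the mapping can grow as more fields become understood.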

During the development of this shim layer, domain requirements may be established iteratively: thinking about the shim and talking to potential consumers of the service may reveal new requirements of the metadata and encourage the facility to adapt.

Recommendation: Create lightweight “shim” layers to provide a common interface – HTML and JSON – to the data. This will satisfy both human and machine accessibility requirements.

Benefits: Light-touch layers that work with existing frameworks are more likely to be implemented, and where they are implemented, the development time frame is shorter. Simple methods that give powerful, usable output will encourage reuse and shorten the time to discovery.

Infrastructure: possible software support to develop these shims, particularly if an established system has no active maintainers (e.g. a postdoc moving on). Hardware may be necessary if the returned data are large (scaling and bandwidth issues should be considered).

Leverage domain-appropriate standards, and expose all metadata

In many areas, there are existing standards that can be implemented or, where they are insufficient, used as a basis to provide compatibility and guidance. A wider consultation on a per-domain or per-facility basis would be advisable to assess these standards.

Many instruments generate metadata that does not follow a standard, but it should be kept, and presented in a useful manner (e.g. translated to an SI unit with a rich description), possibly as a superset of any existing standard to ensure that all possible metadata are available for reuse.
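To make the idea of translating to SI units with a rich description concrete, here is a small sketch. The catalogue entries, field names and descriptions are hypothetical and are not drawn from any existing standard; fields that are not in the catalogue are passed through unchanged, so the output is a superset of whatever is currently understood.

```python
# Hypothetical catalogue: native field -> (common name, SI unit, scale, description).
CATALOGUE = {
    "voltage_kV": ("tube_voltage", "V", 1e3, "X-ray tube accelerating voltage"),
    "exposure_ms": ("exposure_time", "s", 1e-3, "Detector exposure time per projection"),
    "voxel_um": ("voxel_size", "m", 1e-6, "Edge length of a reconstructed voxel"),
}


def describe(native):
    """Convert known fields to SI units with a description; keep unknown fields as captured."""
    described = []
    for key, value in native.items():
        if key in CATALOGUE:
            name, unit, scale, text = CATALOGUE[key]
            described.append({
                "name": name,
                "value": value * scale,
                "unit": unit,
                "description": text,
                "source_field": key,
            })
        else:
            # Not in the catalogue: retain the raw value rather than discard it.
            described.append({
                "name": key,
                "value": value,
                "unit": None,
                "description": "Not yet understood; retained as captured",
                "source_field": key,
            })
    return described
```

Recording `source_field` alongside each converted value keeps the link back to the instrument's own output, which helps if a field's meaning is later revised.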

Recommendation: Leverage domain-appropriate standards where available, and in all cases expose potentially useful metadata.

Benefits: if the standards already exist, then familiarity follows. If they do not, there is an opportunity to develop an initial offering and allow it to be guided by the community. Exposing all metadata as a superset of a standard maximises flexibility and knowledge.

Infrastructure: consultations and community engagement to establish the extent and availability of existing standards, and to provide guidance where those standards do not exist.