Case Study 7: Data trust, sharing & preservation

Authored by:

Colin Bird, Samantha Kanza, Nicola Knight, Cerys Willoughby, Jeremy Frey, Simon Coles

School of Chemistry, Faculty of Engineering and Physical Sciences, University of Southampton, Southampton, SO17 1BJ

Key focus and activity

The preliminary aim of Case Study 7 was to “Explore data trust and sharing framework for applicability to PSDI and develop recommendations for preservation and curation approaches” [PSDI All-Partners meeting, 11th January 2022].

During the pilot phase, the aim broadened to include other options for data exchange and sharing frameworks, so that the recommendations reflect more realistically the probable requirements of the Physical Sciences community.

The preliminary literature probe into what a Data Trust might involve began from prior knowledge of work funded by the Food Standards Agency and the EPSRC Internet of Food Things DE Network, which led to the proposal for a Food Data Trust Framework. From that starting point, the probe went on to assess publications about Data Trust principles and requirements. The impression gained is that more has been written about the principles of data trusts, and about what their use involves and implies, than about actual trusts operating in practice.

The investigation was then expanded to search for existing systems set up specifically as data exchanges or essentially to facilitate data sharing. The results are set out in the “What does a data trust mean?” and “Existing Frameworks & Concepts” sections of the full Case Study 7 Report. The concept definitions comprise a glossary of the principal terms relating to this case study.

Of the existing infrastructures, FAIRmat (which is based on NOMAD) was explored further, as it is easily the nearest “role model” for what we currently understand PSDI might look to provide, notwithstanding NOMAD’s focus on computational materials science data.

The case study also undertook some exploratory work into the potential requirements for a data exchange infrastructure and the scope of what such a mechanism could offer.

Area of Physical Sciences covered

Given that all Physical Sciences Research Areas are to some degree digitally enabled, and in many cases strongly data-driven, this case study is highly relevant across the board. The extent to which a data exchange framework will facilitate a specific Research Area will depend not only on any existing data sharing culture but also on the further opportunities that would open up if an effective and efficient data exchange framework were available to all Physical Sciences.

Related PS research areas

See previous sub-heading.

Applicability to the Research Data Lifecycle

This case study is expected to address and support every part of the research data lifecycle.

The Jisc guidance asserts that projects “should consider research data at the project start”. For a project that intends to share its data, this case study expects to extend that consideration to assessing how to ensure that the data complies with the FAIR principles and how the necessary metadata and other information would be provided to a data exchange framework.

Figure 1 – Jisc Research Data Lifecycle Model [1]

Main outputs

During this pilot phase, we have consulted widely through extensive engagement discussions and All-Partners meetings, coupled with literature studies of data trusts in practice, other existing frameworks for supporting data sharing, and the associated concepts. It is apparent that data is now generated at a rate that prohibits the validation carried out historically by organisations such as NIST [https://www.nist.gov/]. While there is no substitute for thorough review, the 20th-century ecosystem of publication, review, and testing for reproducibility served the Physical Sciences tolerably well. In the 21st-century landscape there is only limited exchange of data, certainly less than the recognised need for sharing would lead us to expect. The expanding use of machine learning generates a strong demand for more training data, but that demand is hard to satisfy because the data that might be available has not been categorised and compiled. Our conclusion is that the community needs a data sharing ethos that embodies trust, albeit without the formality associated with a data trust.

Outcomes and Recommendations

Currently the community appears not to be confronting the issues of trust and trustworthiness directly, although there are undoubtedly strong proxies in the form of provenance, reliable metadata, and possibly durability. In the absence of authoritative trust metrics, these proxies underpin the questions researchers ask when they consider reusing data generated outside their own control. While the Open Science movement mitigates the trust issue through closer collaboration, it is implausible to extend such practices across the entire Physical Sciences community. Accordingly, we regard an exploration of the nature of trust in data as an underpinning aim of this case study.

Reframing Case Study 7 to reflect the required characteristics as now perceived following the pilot study investigations, the aim should become: “Explore a sharing framework for applicability to PSDI and develop recommendations to provide a mechanism or process, based on trust, for connecting prospective users of data with potential providers.”

Accordingly, the principal recommendations from the pilot phase of Case Study 7 are as follows:

Reframe the case study aim to explore a sharing framework such as a data exchange infrastructure

This recommendation reflects the conclusions of the pilot phase of a study whose original aim was to “Explore data trust and sharing framework for applicability to PSDI …”

The pilot phase investigation indicated that Data Trusts involve at least some additional formal arrangements. Neither All-Partners meetings nor informal engagement discussions have given any affirmative indication that the PSDI community either wants or needs a formal arrangement.

Moreover, there is little evidence of Data Trusts becoming widely established, except in the commercial sector, despite the arguments presented in the Hall-Pesenti review [2].

The emphasis during the pilot phase of this case study therefore shifted to explore how trust and trustworthiness should be integrated into data sharing processes.

Consult the Physical Sciences community about their data sharing expectations

The objective of this consultation would be to collect information about firm requirements, issues of concern, and features that might not be essential, but would be “nice to have”.

It is apparent from All-Partners meetings, other engagement discussions, and informal conversations that a level of commitment to data sharing exists across the Physical Sciences. Feedback from all eight case studies reveals metadata, standards, integration, and sharing to be common threads. Reliable – trustworthy – metadata is one of the keys to interoperability.

However, before establishing a data exchange mechanism for the PSDI, it is vital to understand properly what these incentives mean in practice. Hence the proposal for a comprehensive consultation of the Physical Sciences community, with the option subsequently to create a standing forum for reviewing matters relating to data sharing.

The consultation should address the following aspects specifically:

Understanding of, and expectations regarding, trust and trustworthiness;
Understanding of provenance and how it should be documented;
Attitudes, positive and negative, to sharing;
Metadata standards, particularly the extent to which they can be domain-independent;
Expectations regarding quality assurance;
Expectations regarding data durability.

Investigate a data exchange infrastructure

This investigation would be a long-term project, with a general ambition to facilitate a design thinking approach.

The initial phase would involve collecting views about the desirable features of a data exchange, leading into thorough research into the needs of users across the range of Physical Sciences. This phase corresponds to the first stage of the design thinking process, to be followed by generating a portfolio of use cases that would inform the process of creating a high-level architecture and a provisional design brief.

Drawing on the work carried out during this pilot phase, our preliminary – and necessarily provisional – view is that the primary objective of a data exchange would be to make connections between prospective data (re)users and providers, so it would essentially be a finding mechanism. While much of that process could possibly be automated, there are potential advantages to having a ‘trusted broker’ at the core of the exchange mechanism. As well as making connections, the broker could provide additional functions, for example feedback comprising comments, suggestions, and (maybe) ratings.
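To make the idea of a finding mechanism with a broker at its core more concrete, the following is a minimal illustrative sketch in Python. It is not a PSDI design: the class names (`DatasetOffer`, `Feedback`, `TrustedBroker`), the keyword-overlap matching, and the optional numeric rating are all assumptions introduced here purely for illustration. The sketch assumes that the broker holds only descriptions of data, connects a prospective (re)user with providers whose descriptions match, and records feedback.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Feedback:
    """A comment or suggestion about a dataset; the rating is optional (assumed scale)."""
    comment: str
    rating: Optional[int] = None


@dataclass
class DatasetOffer:
    """A provider's description of data it is prepared to share (hypothetical fields)."""
    provider: str
    title: str
    keywords: Set[str]
    feedback: List[Feedback] = field(default_factory=list)


class TrustedBroker:
    """Hypothetical broker: stores descriptions of data, not the data itself."""

    def __init__(self) -> None:
        self._offers: List[DatasetOffer] = []

    def register(self, offer: DatasetOffer) -> None:
        """Record a provider's offer so that prospective users can find it."""
        self._offers.append(offer)

    def find(self, wanted: Set[str]) -> List[DatasetOffer]:
        """Return offers sharing at least one keyword with the request, best overlap first."""
        wanted_lower = {k.lower() for k in wanted}
        scored = []
        for offer in self._offers:
            overlap = len(wanted_lower & {k.lower() for k in offer.keywords})
            if overlap:
                scored.append((overlap, offer))
        # Stable sort: offers with equal overlap keep their registration order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [offer for _, offer in scored]

    def add_feedback(self, offer: DatasetOffer, feedback: Feedback) -> None:
        """Attach a (re)user's comment, suggestion, or rating to an offer."""
        offer.feedback.append(feedback)


if __name__ == "__main__":
    broker = TrustedBroker()
    # Placeholder providers and titles, not real datasets.
    offer = DatasetOffer("Provider A", "Example dataset A", {"materials", "crystallography"})
    broker.register(offer)
    for match in broker.find({"materials"}):
        print(match.provider, "-", match.title)
    broker.add_feedback(offer, Feedback("Provenance is clearly documented", rating=4))
```

Keyword overlap stands in here for whatever matching a real exchange would use. In practice the matching would presumably draw on the metadata standards and provenance information that the proposed consultation is intended to clarify.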

[1] https://www.jisc.ac.uk/guides/rdm-toolkit

[2] Hall, W.; Pesenti, J. Growing the Artificial Intelligence Industry in the UK. 2017.
