To produce a series of recommendations across the whole pilot project, the findings from all the work packages in the exploratory pilot were brought together and examined. This included findings and recommendations from the case study activities, the technical investigations, the community engagement activities and expert input from the project team. From the discussion surrounding these findings we produced a series of top-level recommendations for future development of data infrastructures in the Physical Sciences.

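These recommendations were grouped into four areas:

  • Connecting Existing Infrastructures
  • Best Use of Data
  • Best Use of People
  • Best Use of Technology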

Recommendations at a more detailed level were also generated in many of the work packages and case studies. These findings can be explored in the additional reports produced through the pilot work.

Connecting Existing Infrastructures

The primary finding confirmed the view set out in the Statement of Need, that the current data landscape is fragmented, making data analysis and reuse unduly complex, and that physical sciences research would be greatly accelerated by more integration of the systems that handle data. This would enable researchers not only to undertake their own analyses more effectively, but also to make their data products available as inputs for further research.

There is a strong view in the community that the major need is for a data infrastructure that connects existing systems, widening their applicability and adding value through aggregation, rather than for the development of new functions. Such integration would support data workflows, enabling researchers to concentrate on their science rather than spend time on data management activities.

Stakeholders were understandably concerned that a data infrastructure must be trustworthy and enduring. Without assurance of its longevity, researchers would be reluctant to invest the time required to engage with a system that may be temporary or fail to gain traction as key infrastructure.

R1

Construct an integrated digital infrastructure for physical sciences that connects existing research data services with each other, with those from other domains, and internationally.

R2

The infrastructure must facilitate curation of, and long-term support for, data, software and services beyond the lifespan of individual projects.

R3

Encourage and implement governance and communication mechanisms whereby co-operation and co-creation between all stakeholder organisations is enabled and fostered across domains and internationally.

Best Use of Data

There is a need to open up data for reuse and aggregation into collections that add value, and to link up with data sources from other domains for cross-disciplinary, multiscale modelling and multimodal research. It should be possible to readily access provenanced data, including reference quality data, and secondary data underpinning publications. Availability of data should support reproducibility and validation of research, in addition to application in further research including machine learning and AI.

There is also a crucial need to establish better data-level connectivity across the pillars, particularly bridging between experimental and computational activities.

For an integrated, distributed physical sciences data landscape to be realised, some new connecting functionality will need to be developed. The new infrastructure should support the overarching principles of data being as open and FAIR as possible, and drive international collaborations and interdisciplinary research through the use of open standards.

R4

Develop a toolkit for publishing data collections for physical science researchers, covering data management, metadata, ontologies, curation, provenance and tools that support research teams in implementing data policies.

R5

Establish mechanisms to provide access to provenanced data from many sources, notably:

  • reference quality data from commercial and open sources
  • original data generated from experiments and simulations
  • secondary data underpinning articles, theses and reports
  • derived or analysed intermediate data
  • collections of results data representing aggregations of properties or features of analysed data

R6

Provide tooling to support reproducible data processing workflows, including providing a common platform to run models and code; access, transfer and integrate data from different sources; and more easily access performance compute services for scaling up.

R7

Develop support for transforming data to knowledge, including visualisation, discovery, data mining, aggregating data for integrative science, and AI techniques.

Best Use of People

It was clearly recognised that an effective research ecosystem requires not only investment in technology, but also needs support professionals to make it usable, and appropriately trained people to fully exploit it. We observed a wide variation in levels of data skills in different groups. This highlighted an opportunity for sharing knowledge and best practice between projects, disciplines and research domains.

Much of a physical sciences researcher’s time is spent finding, cleaning, transforming and importing/exporting data. There is a need for dedicated professionals who can either fully support researchers’ data workflows, enabling them to concentrate on research without being impeded by cumbersome data management, or provide streamlined tools for data-intensive research in the physical sciences, enabling researchers to support themselves more easily. The role of these professionals must be fully established, recognised and sustained.

The physical sciences research community is extensive and varied and therefore needs broad community participation in its governance, planning and development.

R8

Provide co-ordination for community engagement activities, discussions and training, supporting community adoption, and enabling community input to ensure gaps are identified and plans developed to address them in a coordinated way.

R9

Nucleate expertise for the support and service of research groups in data science for physical sciences, providing community training and support.

R10

Establish recognition and professionalisation for roles related to provision of data tools and infrastructure such as RSEs, data scientists, engineers and curators.

R11

Establish a governance structure for PSDI involving wide-ranging expertise to provide steering such as prioritising activities, advising on allocation of resources and overseeing developments, as well as to work with other projects to deliver complementary functionality across the UK’s Digital Research Infrastructure.

Best Use of Technology

Information Technology for data management and data analysis is rapidly changing and often diverging, with important new functionality emerging continuously. Physical science researchers currently have to navigate a wide diversity of provision in a highly heterogeneous technological environment. Physical Science research workflows should “Ride the wave” of technology evolution and integrate the latest technological developments.

Providing an integrated infrastructure where researchers can adopt diverse tools, yet continue to work together, will require agreement on and maintenance of the vocabulary, interfaces and tools that enable interoperability. An essential feature of a data infrastructure for physical sciences should thus be to develop and maintain interoperability standards and the associated supporting tools that enable sharing and discovery of metadata and data.

R12

Early-stage deployment of the PSDI should include services enabling the connection of existing provision, extensible to under-supported areas of physical science, together with the standards to underpin the interoperability of those services. In particular, the PSDI should deploy tools and services for:

R12.1

Connecting data: aggregated catalogues of data and services; creation of standards for data formats, metadata, vocabularies and ontologies.

R12.2

Connecting Services: protocols and APIs; security and access control mechanisms; data storage and transfer systems; cloud services; monitoring and reporting systems; computation workflow orchestration systems.

R13

An underlying principle of the PSDI should be to adopt existing technology where it has already been successfully deployed elsewhere, adapting it if necessary, working in partnership with those developing it for other domains.