The overarching goal of the NIH Data Commons is to accelerate new biomedical discoveries by developing and testing a cloud-based platform where investigators can store, share, access, and interact with digital objects (data, software, etc.) generated from biomedical and behavioral research. By connecting the digital objects and making them accessible, the Data Commons is intended to allow novel scientific research that was not possible before, including hypothesis generation, discovery, and validation.
Researchers funded through the NIH Data Commons Pilot Phase Consortium (DCPPC) are working out the best ways to build and implement the cloud-based platform described above. They are iteratively building and testing a series of key capabilities – fundamental computational units – needed for the Commons to operate and meet standards for being FAIR – findable, accessible, interoperable, and reusable. Engaging the biomedical research community to develop the Data Commons helps ensure the needs of the research community are met. Three distinct, high-value test case data sets help set policies, processes, and architecture for the Data Commons Pilot Phase, with the aim of being able to use all three data sets simultaneously in analyses. The tools and best practices developed by the DCPPC will help researchers discover and interpret connections between human genes and traits and those of model organisms like fruit flies or mice.
Data Commons Pilot Phase and the NIH Strategic Plan for Data Science
The Data Commons Pilot Phase is part of the New Models of Data Stewardship program, which is a trans-NIH endeavor. The program supports the NIH Strategic Plan for Data Science goal to develop new, cutting-edge methods for storing, sharing, and analyzing NIH-derived data sets in the cloud environment.
- Expand our understanding of biology;
- Enable more accurate disease risk prediction, tailored diagnostics, and prevention and treatment strategies.
The Data Commons seeks to create a digital ecosystem to benefit all generators and users of biomedical research data.
Data Commons Pilot Phase Vision
The lessons learned from the Data Commons Pilot Phase will inform the development of best practices, guidelines and standards, and cohesive approaches to Data Commons architecture and principles. The adoption of such standards and guidelines by all NIH Commons-like efforts will result in an interoperable NIH Data Commons consistent with the NIH Strategic Plan for Data Science.
The goal is to create a platform that engages different research communities to develop new tools that will be integrated into the Commons. Researchers will be able to access and interact with data directly in the cloud environment, eliminating the need for costly and time-consuming downloads to local servers. The DCPPC is also developing these components to work on a variety of cloud platforms and in multiple cloud locations.
The NIH Data Commons Pilot Phase is being developed with an eye toward interoperability with existing data resources such as the NCI Genomic Data Commons, the AHA Precision Medicine Platform, and the European Data Commons. Data Commons Pilot Phase staff are also coordinating with other NIH programs (e.g., All of Us Research Program), US government agencies (e.g., National Science Foundation), and international activities (e.g., ELIXIR) relevant to the Data Commons.
Data Commons Pilot Phase Implementation
A multidisciplinary NIH Data Commons Pilot Phase Consortium (DCPPC), including data scientists, computer scientists, information technology engineers, cloud service providers, biomedical researchers, and the stewards of the test case data sets, is charged with setting community-endorsed processes and metrics for FAIR data management. Its members are developing a plan for building the Data Commons through key capabilities – fundamental computational components – to support access, use, and sharing of the test case data sets.
The NIH Data Commons is implemented as a pilot phase over Fiscal Years 2017-2020.
Cloud-Based Platform- Researchers and staff from the NIH Data Commons Pilot Phase and the STRIDES Initiative will work with cloud service providers (CSPs) to learn how to provide a sustainable cloud infrastructure to support NIH-derived data sets, including the Data Commons Pilot Phase test case data sets.
NIH-Funded Data Sets- The first data available in the cloud through the NIH Data Commons Pilot Phase will be the three test case data sets: Genotype-Tissue Expression (GTEx), Trans-Omics for Precision Medicine (TOPMed), and several Model Organism Databases (MODs) that make up the Alliance of Genome Resources. The DCPPC will use the test case data sets to develop the capabilities of the Data Commons Pilot Phase. The stewards of each data set are members of the DCPPC and are tasked with preparing and moving the data sets to the cloud environment, as well as developing best practices for these tasks.
Key capabilities for data use- The DCPPC is also developing key capabilities – fundamental computational components – to support access, use and sharing of the test case data sets. The key capabilities include:
- Guidelines and metrics for making data Findable, Accessible, Interoperable, and Reusable (FAIR)
- An approach to Global Unique Identifiers (GUIDs)
- Application Program Interfaces (APIs) based on open standards
- Architecture independent of a specific cloud platform or provider
- Workspaces to find and interact with data and associated tools
- Research ethics, privacy, and security (including authentication and authorization)
- Indexing and search functionality
- Use cases that demonstrate how the NIH Data Commons Pilot Phase can advance biomedical research
- Coordination, training, and outreach
This page last reviewed on July 24, 2018