NIH Data Commons Pilot Phase
The overarching goal of the NIH Data Commons was to accelerate new biomedical discoveries by developing and testing a cloud-based platform where investigators could store, share, access, and interact with digital objects (data, software, etc.) generated from biomedical and behavioral research. By connecting these digital objects and making them accessible, the Data Commons was intended to foster novel scientific research that wasn’t possible before, including hypothesis generation, discovery, and validation. The program supported the NIH Strategic Plan for Data Science goal of developing new, cutting-edge methods for storing, sharing, and analyzing NIH-supported datasets in the cloud.
From FY 2017 to FY 2018, researchers funded as part of the NIH Data Commons Pilot Phase Consortium (DCPPC) tested the best ways to build and implement the cloud-based platform described above. They iteratively experimented with a set of key capabilities, the fundamental computational units needed for the Commons to operate and to meet FAIR standards (findable, accessible, interoperable, and reusable). Three distinct, high-value test case data sets helped set policies, processes, and architecture for the Data Commons Pilot Phase. The tools and best practices developed by the DCPPC will inform a broader trans-NIH data ecosystem strategy planned through the Office of Data Science Strategy (ODSS). The Common Fund will continue to test, evaluate, and refine a subset of deliverables from the Data Commons Pilot Phase, working with Common Fund programs to establish a cloud-based data ecosystem for Common Fund datasets.
What were the outcomes of the NIH Data Commons?
The NIH Data Commons Pilot Phase has ended, and the NIH currently has no plans to fund a second phase of this program. The Data Commons Pilot Phase Consortium (DCPPC) generated exciting tools to help researchers work in the cloud environment, including advances such as:
- Computational tools and grading rubrics to assess the FAIRness of digital research objects
- Digital tools and infrastructures for searching and indexing digital objects and data sets in the cloud
- A new method for fast, open, and free big data analysis in the cloud environment
- New modes for community engagement, training, outreach, and support across multiple levels of expertise
- Partnerships with Google Cloud and Amazon Web Services via the STRIDES Initiative to develop and test new ways to implement cloud services in support of biomedical research
- A guidebook that captures current best practices in using public cloud service providers for biomedical research
- An Application Programming Interface (API) registry for the definition and discovery of APIs
- Documentation of core metadata required to register data sets in the cloud and assign them each a Globally Unique IDentifier (GUID)
- A service for registering namespaces for Compact Identifiers
- A prototype use case library that compiles user narratives, stories, and science objectives describing the cross-disciplinary workflow between scientists and engineers
- Guidelines for restricted data access and derived data management
- Services to resolve GUIDs (and retrieve the object contents they identify) and compact identifiers (see the sketch after this list)
- Interoperability across multiple robust and sustainable software stacks implementing Commons standards
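The resolution services noted above map a compact identifier (a prefix:accession pair) or a GUID to the location of the digital object it names. As a rough illustration of that pattern only, the sketch below resolves a compact identifier by following HTTP redirects through a public meta-resolver; identifiers.org is used here purely as an example endpoint and is an assumption, not necessarily the service the DCPPC deployed.

```python
"""Minimal sketch of compact-identifier resolution.

Assumptions: identifiers.org is an illustrative meta-resolver; the actual
Data Commons resolution services and their APIs may differ.
"""
import urllib.request


def resolve_compact_identifier(curie: str,
                               resolver: str = "https://identifiers.org/") -> str:
    """Return the landing URL that the resolver redirects to.

    `curie` is a compact identifier of the form "prefix:accession",
    e.g. "taxonomy:9606".
    """
    request = urllib.request.Request(resolver + curie, method="HEAD")
    with urllib.request.urlopen(request) as response:
        # urllib follows HTTP redirects automatically; geturl() reports the
        # URL of the final response, i.e. the resolved landing page.
        return response.geturl()


if __name__ == "__main__":
    print(resolve_compact_identifier("taxonomy:9606"))
```

A production resolution service would add error handling, authentication for restricted objects, and checks on the metadata returned; this sketch only shows the redirect-following step common to compact-identifier resolvers.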