Test Case Data Sets
Three NIH-funded data sets will serve as test cases for the NIH Data Commons Pilot Phase. These data sets were chosen based on their value to users in the biomedical research community, the diversity of the data they contain, and their coverage of both basic and clinical research. Although only three data sets will be used at the outset of the project, the NIH Data Commons effort is expected to expand to include other data resources once the pilot phase has achieved its primary objectives. The three data sets are:
Genotype-Tissue Expression (GTEx)
The GTEx program explores how human genes are expressed and regulated in different tissues, and the role that genomic variation plays in changing gene expression. GTEx has collected multiple human tissues from over 900 deceased donors. The donors' DNA and RNA were sequenced to assess variation within their genomes, the effects of that variation on gene expression, and which tissues contribute to predisposition to disease. GTEx data and biospecimens are a research community resource. Visit the GTEx Portal for information on program resources. The program is funded by the NIH Common Fund.
Alliance of Genome Resources
The Model Organism Databases (MODs) provide in-depth biological data for intensively studied model organisms. Six MODs are working as a consortium with the Gene Ontology Consortium to create an integrated resource known as the Alliance of Genome Resources (AGR). The goal of the AGR is to streamline and standardize data models and interfaces, and to provide a web-based resource where data from all Alliance groups are integrated and searchable in a single place. The six participating MODs are the Saccharomyces Genome Database, WormBase, FlyBase, the Zebrafish Information Network, the Mouse Genome Database, and the Rat Genome Database. The AGR is co-funded by the NIH Common Fund’s Big Data to Knowledge (BD2K) Program and the National Human Genome Research Institute.
Trans-Omics for Precision Medicine (TOPMed)
The TOPMed program collects and pairs whole-genome sequencing (WGS) and other large-scale data (e.g., DNA methylation signatures, RNA expression profiles, metabolite profiles, proteomics) with molecular, behavioral, imaging, environmental, and clinical data from studies focused on heart, lung, blood and sleep (HLBS) disorders. TOPMed aims to collect WGS data from 120,000 individuals. The program is funded by the National Heart, Lung, and Blood Institute.