HMP Data Release and Resource Sharing Guidelines for Human Microbiome Project Data Production Grants
Rapid and unrestricted sharing of metagenomic data and associated metadata is essential for advancing research on human health and disease. The utility to the scientific community of the generated data is largely dependent on how quickly these data can be deposited into public databases. NIH is committed to rapid, pre-publication release of genomic and other data types and recognizes that metagenomic and associated data are unique research resources. For these reasons, NIH has designated the data producing projects (cooperative agreements) funded by the HMP as community resource projects and endorses rapid release of HMP-generated data and reagents.
The data release policy for HMP data production projects is based on the guiding principle that pre-publication metagenomic and associated data should be released to the scientific community as rapidly as possible via deposition into public databases. At the same time, NIH expects that users of the pre-publication data will act responsibly to recognize the scientific contribution of the HMP data producers by following normal standards of scientific etiquette and fair use of unpublished data.
The HMP has funded a Data Analysis and Coordination Center (DACC) as an HMP informatics resource (http://www.hmpdacc.org). The DACC provides information to the scientific community about the progress of the HMP in general, provides analytic support to the project as needed, and serves as a repository to the community for software tools. More specifically, the DACC helps to manage metagenomic data and metadata, facilitates analysis of data and utilization of tools from HMP-funded projects, and incorporates information from other relevant resources as needed. As part of the NIH HMP general data release guidelines, it is expected that all Principal Investigators on HMP data production grants will, when appropriate, collaborate to facilitate the standardization, transfer, exchange and dissemination of information.
Several factors should be taken into consideration concerning release of metadata, including clinical data, into the public domain. As part of the HMP, it is expected that all metadata will be submitted rapidly either to a public open database or a controlled access database at the same time as other HMP molecular data are submitted. Data of all types (human sequence, clinical data, transcription data, etc.) that are potentially identifying should be submitted to NCBI's dbGaP (a controlled access database). Some of these types of data should be released only after verification; data release plans should specify how those types of data will be verified. The HMP Steering Committee may establish a subcommittee to discuss and propose consistent guidelines for verified data of each type.
NIH acknowledges the variety and differences among the HMP's data production projects, but at the same time, considers it of the highest importance to develop a policy that is flexible enough to achieve rapid data release in line with the principles noted above and to be sensitive to the aims of the individual projects under the HMP.
Continued discussion of the data release policy will be an ongoing function of the HMP Steering Committee and the NIH to achieve the overall program goals of the HMP. Revisions to Data Release Policies may be required as a result of Steering Committee action.
An appropriate data release plan for each HMP-funded data production cooperative agreement is a condition of the award. It ensures that all types of data generated by the HMP, including clinical data and other metadata, will be released in accordance with the guiding principles stated above to achieve the goals of the HMP.
The HMP data release plan includes release of raw sequence data; data from next generation sequencing platforms, as agreed to by the sequence producers and NCBI; genome assemblies and their annotations; other experimental data used to characterize the human microbiome (e.g., transcriptomics, metabolic reconstructions, PCR assays); clinical data and other metadata associated with all data types; and reagents and other resources.
The HMP policy for release of different data types follows:
It is expected that all raw genome and metagenome sequence (which includes 16S rDNA sequence) and next generation sequence data that are generated by HMP data cooperative agreements will be submitted as rapidly as possible (e.g., on a weekly basis) to the Trace Archive or to the Short Read Archive at NCBI/NLM/NIH. These data should also include information on templates, vectors, and quality values for each sequence. Once the set of data to be released for the next generation sequencing data is agreed to by the centers and NCBI, the Notice of Award will be revised to reflect the data that should be submitted to the repository.
A minimum metadata set associated with genome and metagenome sequence data should be submitted to NCBI along with the sequence data. All sequence data and metadata that are potentially identifying of the donor should be deposited in dbGaP (a controlled access database). The set of minimum metadata will be agreed to by the HMP Steering Committee following award of the cooperative agreements. HMP's policy for release of clinical data is that the richest possible set of data should be released to the controlled access database, consistent with the protection of donor privacy.
Genome and metagenome full and partial assemblies and their annotations should be deposited in GenBank at NCBI after verification by the center. Assemblies and annotations should be submitted to GenBank as rapidly as possible (e.g., within 45 calendar days of being generated), followed by release to other projects' web sites, if appropriate.
Deposited metadata associated with genome and metagenome assemblies and annotations should be linked to the corresponding assembly and annotation data files.
Single nucleotide polymorphisms should be submitted as rapidly as possible to NCBI dbSNP (e.g., within 30 calendar days of validation). Non-identifying metadata associated with SNPs should be submitted to dbSNP along with the SNP data.
As noted above, any data that could potentially identify an individual should be deposited in the controlled access component of dbGaP.
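The routing and timing rules for sequence data, assemblies and SNPs can be summarized in a short sketch. It is illustrative only and is not part of the policy: the names `Dataset`, `deposit_target` and `suggested_deadline` are hypothetical, the day counts are the "e.g." intervals quoted in these guidelines rather than binding limits, and treating "on a weekly basis" as a 7-day window is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Example intervals quoted in the guidelines ("e.g." values, not hard limits).
# Modeling "weekly" as 7 days from generation is an assumption of this sketch.
EXAMPLE_WINDOW_DAYS = {
    "raw_sequence": 7,   # "e.g., on a weekly basis"
    "assembly": 45,      # "e.g., within 45 calendar days of being generated"
    "snp": 30,           # "e.g., within 30 calendar days of validation"
}

# Public repositories named in the guidelines for each data type.
PUBLIC_REPOSITORY = {
    "raw_sequence": "Trace Archive / Short Read Archive",
    "assembly": "GenBank",
    "snp": "dbSNP",
}


@dataclass
class Dataset:
    """Hypothetical record describing one data set awaiting deposit."""
    data_type: str                 # key of the tables above
    reference_date: date           # generation date (validation date for SNPs)
    potentially_identifying: bool  # could the data identify the donor?


def deposit_target(ds: Dataset) -> str:
    """Potentially identifying data go to dbGaP; the rest to the public database."""
    if ds.potentially_identifying:
        return "dbGaP (controlled access)"
    return PUBLIC_REPOSITORY[ds.data_type]


def suggested_deadline(ds: Dataset) -> date:
    """Reference date plus the example interval, counted in calendar days."""
    return ds.reference_date + timedelta(days=EXAMPLE_WINDOW_DAYS[ds.data_type])


if __name__ == "__main__":
    assembly = Dataset("assembly", date(2009, 1, 1), potentially_identifying=False)
    print(deposit_target(assembly), suggested_deadline(assembly))
```

Other data types (expression, metabolomic and similar data) are left out of the sketch because their verification standards and deposit sites are to be agreed by the HMP Steering Committee.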
The HMP center and demonstration projects are expected to generate other data types that will be used to characterize the microbiome, such as expression data, immunological data, metabolomic data, NMR spectra, and PCR assay results. These data should be verified before release and should be deposited at a broadly accessible site. The standards for verification of each data type will be agreed to by the HMP Steering Committee.
Analyses performed by the awardees should be made available to the public upon acceptance of a manuscript for publication. It may be appropriate in some cases for the DACC to house these analyses.
Reagents, such as microbial strains to be sequenced, should be deposited at the HMP Repository at BEI (http://www.beiresources.org/About/HumanMicrobiomeProject/tabid/625/Default.aspx) before the strain is sequenced. Other resources and reagents to be shared should be released rapidly to promote the principles expressed above, consistent with achieving the program goals of the HMP. The HMP Steering Committee and the NIH intend to identify the specific resources to be shared with the scientific community through BEI.
This document provides guidelines for establishing consistent data release plans across all HMP data production projects funded as cooperative agreements. It is in line with the data release guidelines already developed by the International Human Microbiome Consortium (IHMC); all HMP data production grantees are members of IHMC. If data release plans are modified, they should remain in compliance with the IHMC policy. For example, the minimum metadata set to be released with the sequence data should be consistent with the IHMC's plans for release of a minimum data set once they are formulated.