Title of proposed idea: A synthetic cohort for the analysis of longitudinal effects of gene-environment interactions
Nominator: NIH Institutes/Centers
Major obstacle/challenge to overcome: Francis Collins has emphasized that there is a pressing need for a large-scale US prospective cohort study of genes and the environment, with a minimal sample size estimated at 500,000 (Collins, 2004; Manolio et al., 2006). However, even with a very large cohort, at least 7 to 15 years of follow-up are needed to accrue enough incident cases to adequately power studies of common diseases (Burton et al., 2009).
Several NIH institutes support multiple high-quality cohort studies with rich biomedical, environmental, behavioral, and social data that, if strategically coordinated, could help elucidate how genes and environments interact to affect trajectories of health and disease (Willett et al., 2007). Recently, many of these cohorts have been extensively genotyped and some have undergone whole exome sequencing. Open sharing policies through dbGaP have created databases of extensive genotype data linked to phenotypes which are incompletely harmonized, and whose data standards vary significantly. More complex analyses could be served by a synthetic cohort with harmonized data. The longitudinal data included in these cohorts increase the reliability of measures and afford the examination of change phenotypes, which are crucial for understanding the causal pathways leading to changes in health. An integrated, harmonized, synthetic cohort will also facilitate the selection of genetically distinctive subpopulations for specific studies of genetic and environmental influences and their interaction. It would accelerate and support new approaches to discovery such as PheWAS (Denny et al., 2010; see also phenotype mining), fine-grained admixture mapping (Shriner et al., 2011), or conditioning on known risk variants to find potential gene-gene interactions via GWAS (Wijsman et al., 2011). Harnessing the potential inherent in these existing studies will allow us to analyze the roles played by life-style factors, social circumstances, and environmental exposures in modulating disease risk and progression.
We face a similar set of opportunities and challenges from the ever-growing number of patient registries, which increasingly include rich data, but where there is little consistency to date as to what data are reported or required. Investigators and patient advocacy groups alike express the need for patient registries to facilitate research, but we are not yet getting the maximal benefit from the available data (Drolet & Johnson, 2008). In rare diseases especially, efficient research across the biomedical spectrum depends on the ability to identify affected persons rapidly and accurately, and to know enough about their genotype and phenotype to determine who might be informative for basic biology studies and who might be available for clinical trials. The feasibility of greatly expanding current patient registries is reinforced by the fact that patients are increasingly willing to share their data through internet sites such as PatientsLikeMe (http://www.patientslikeme.com/), although the data collected through these sites have limited utility to researchers. Patients are also signing up to participate in research studies through sites such as ResearchMatch at Vanderbilt (https://www.researchmatch.org/). The value of existing and proposed registries is severely limited by inconsistencies in data standards, ontologies, and policies for access and sharing, and by the inefficiencies of duplicated effort and investment. The need for such registries crosses multiple ICs.
Emerging scientific opportunity ripe for Common Fund investment: The initiative would systematically evaluate design issues for a synthetic cohort study of genes, environment, health, and behavior drawing from existing cohort studies that have rich longitudinal or early life data. The large collection of longitudinal NIH-funded cohorts with valuable data on exposures throughout the life course had a profound impact on medical, behavioral, and social science even before the genomic era. Many of these cohorts have added genetic and biomarker collections and now provide unparalleled opportunities for exploiting these epidemiological findings at a genomic level. To date, however, attempts to harmonize or synthesize data collection efforts among studies have been modest, in part due to resource constraints and IC boundaries.
Strikingly similar issues have arisen for patient registries, which have been proliferating rapidly for both rare and common diseases, with goals no longer limited to enhancing participation in clinical trials but extending to enhancing our understanding of patient outcomes (e.g., AHRQ’s 2010 Registries for Evaluating Patient Outcomes: A User’s Guide). The common obstacle faced by both cohort studies and patient registries is a lack of harmonization among existing efforts. Essential analyses requiring large samples are frustrated by the lack of a documented history of the creation of the registry, of common or documented consent procedures, of data standards, of common measures, and of common time points for observation across the sample. Patient registries thus present an additional opportunity for trans-NIH harmonization. It should be possible to leverage the investment in the design of a virtual cohort to create a format for all NIH IC supported registries. With common data standards and common policies for sharing of and access to data, all ICs could support registries within the virtual cohort system, which would create efficiencies of scale and consistency of operations and greatly enhance the impact of individual IC investments.
With this initiative, we could also explore the possibility of collaborating with privately funded efforts (e.g., 23andMe) and cohorts funded by other government agencies. Although implementing the synthetic cohort design would require costs for harmonization and data collection, there are existing platforms for the former (e.g., P3G, www.p3g.org, resources developed by caBIG) and a growing number of freely available, high quality measures of phenotypes and exposures developed by NIH (e.g., instruments from the HRS, PROMIS, PhenX, and the NIH Toolbox). Furthermore, the HITECH and Affordable Care Acts have enhanced the prospects for the widespread incorporation of these measures in EHRs (e.g., the eMERGE network). This initiative is therefore well poised to create a time- and cost-effective harmonization plan.
Common Fund investment that could accelerate scientific progress in this field:
The recent special issue of Science (Feb 11 2011) on large datasets emphasized that “large integrated data sets can potentially provide a much deeper understanding of both nature and society and open up many new avenues of research”. Better organization of and access to data, including the development of common metadata, are imperative to realizing emerging scientific opportunities. We propose that the Common Fund be used to invest in the three most important planning activities for a future synthetic cohort project and to explore the potential for extending this effort to cover patient registries.
- Catalog and prioritize the existing NIH cohorts and form a harmonization plan for these studies.
- Develop the analytical framework and bioinformatics infrastructure that would be required for the future synthetic cohort. This will include executing calibration studies to create crosswalks between measures, especially among longitudinal studies with rich behavioral, clinical, and biological phenotypes, and adding EHR information to the data where feasible.
- Develop plans for data-sharing and consent policies and, crucially, detailed scenarios for the cost of synthetic cohorts of varying sizes and intensities.
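As an illustration of what a crosswalk from a calibration study might look like, the following is a minimal sketch of linear (mean-sigma) equating, one simple option among several. It assumes a calibration sample in which the same participants completed both instruments; the function name, variable names, and scores are hypothetical and are not drawn from any existing cohort.

```python
from statistics import mean, pstdev


def fit_linear_crosswalk(scores_a, scores_b):
    """Fit a mean-sigma linear equating from instrument A to instrument B.

    Both lists are assumed to come from one calibration sample in which
    each participant completed both instruments (hypothetical design).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need paired scores from at least two participants")
    mu_a, mu_b = mean(scores_a), mean(scores_b)
    sd_a, sd_b = pstdev(scores_a), pstdev(scores_b)
    if sd_a == 0:
        raise ValueError("instrument A scores have no variance")
    slope = sd_b / sd_a

    def crosswalk(score_a):
        # Match the z-score on A to the same z-score on B's scale.
        return mu_b + slope * (score_a - mu_a)

    return crosswalk


# Hypothetical calibration data: the same five participants on both measures.
calibration_a = [10, 12, 14, 16, 18]
calibration_b = [40, 45, 50, 55, 60]
to_b_scale = fit_linear_crosswalk(calibration_a, calibration_b)
print(round(to_b_scale(16), 2))
```

A real calibration study would likely need more flexible methods (for example, equipercentile or item response theory linking) and an assessment of whether the two instruments measure the same construct; the sketch only shows the basic idea of placing scores from different studies on one scale.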
Although we believe that in the long run the synthetic cohort would be cost effective, given that ICs currently bear the full costs of participating studies, the actual per-participant costs of any follow-up CF activity are uncertain. It will therefore be crucial during the pilot to determine the post-CF, longer-term maintenance costs of the synthetic cohort and to identify a credible source of funds (e.g., costs could be distributed across the ICs/agencies supporting the synthesized cohorts, which would be buying the substantial benefits of harmonization). The target design would include a synthetic cohort of 500,000 well-phenotyped participants. We estimate the cost of this initial developmental and feasibility initiative at $2.0 million per year for two years, with funds to be divided between RMS needs and supplements to enable cohort study investigators to participate.
Should the initial process of creating a virtual cohort prove successful, efforts could be extended to include issues related to patient registries. The initial investment by the Common Fund could create a program for investment of IC-specific resources for global benefit at marginal increases in cost. The Office of Rare Diseases has recently developed a web-based Global Rare Diseases Patient Registry-Data Repository (GRDR), which could form the nucleus of the registry (see Forrest et al., 2010, and Rubinstein et al., 2010, for more information). To ensure maximal benefit, the NIH would require the following of all data deposited in a new, central Registry:
- Consent would require data sharing, with the level of sharing (controlled versus generally available) determined by the degree of PII in the data.
- Data standards would be established by groups with appropriate expertise. Only data meeting the pre-defined standards would be accepted.
- Access and use criteria would be established consistent with levels of PII.
- Provision would be made for withdrawal of data from the Registry in response to participant request.
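The four requirements above can be read as simple acceptance rules. The following is a minimal sketch of how a deposit check might encode them; the field names, the two access tiers, and the `Registry` class are all hypothetical illustrations and do not describe the GRDR or any existing NIH system.

```python
from dataclasses import dataclass, field

# Hypothetical pre-defined data standard: fields every record must carry.
REQUIRED_FIELDS = {"participant_id", "diagnosis_code", "consent_to_share"}


@dataclass
class Registry:
    records: dict = field(default_factory=dict)

    def deposit(self, record: dict, contains_pii: bool) -> str:
        """Accept a record only if it meets the standard and consent rules."""
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record does not meet data standard: {sorted(missing)}")
        if not record["consent_to_share"]:
            raise ValueError("consent must include data sharing")
        # Level of sharing follows the degree of PII in the data.
        tier = "controlled" if contains_pii else "open"
        self.records[record["participant_id"]] = {**record, "access_tier": tier}
        return tier

    def withdraw(self, participant_id: str) -> bool:
        """Remove a participant's data on request; True if anything was removed."""
        return self.records.pop(participant_id, None) is not None


registry = Registry()
tier = registry.deposit(
    {"participant_id": "P001", "diagnosis_code": "X00", "consent_to_share": True},
    contains_pii=True,
)
print(tier)
print(registry.withdraw("P001"))
```

In practice the degree of PII would probably be graded rather than binary, and withdrawal would also have to address copies of data that have already been shared; the sketch leaves both questions open.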
Potential impact of Common Fund investment: The plan for a synthetic cohort for the analysis of longitudinal effects of gene-environment interactions would establish the feasibility and cost of a scalable synthetic national cohort for discovery research in health and disease; in essence it would design an affordable but well-powered phenome-genome project as recommended at the recent NIH Innovation Brainstorm meeting. Because the founder cohorts are longitudinal, the synthetic cohort would provide rapid access to trajectory information as well as a richer characterization of the social, environmental, and genetic factors influencing health and health disparities. Leveraging such a system to harmonize new and existing registries would ensure ongoing value and provide multiple benefits: increased sample size for study of the pathogenesis and treatment of rare diseases, the ability to study the overlap in pathogenesis of apparently unrelated diseases sharing an etiology (e.g., coronary artery disease and invasive melanoma; see Manolio, 2010), and the chance to observe gene-environment interactions for phenotypically specific or genetically overlapping disease states, identifying genetic and environmental “pathogens” that can be studied in a synthetic cohort. This planning activity could also inform the design of a de novo national cohort study, or determine to what degree a synthetic cohort could provide similar information while allowing a greater degree of innovation within individual studies to address questions within their specific areas of science.
If determined to be feasible, the development of a centralized Registry, within which ICs and outside organizations could deposit secure and sharable data, would allow more rapid translation of basic science information into clinically useful knowledge, greatly improved data quality, and enhanced engagement of patient advocacy groups in support of research, all at minimal cost. The centralized process would also facilitate communication efforts to ensure broad awareness of the resource.
Burton, P. R., Hansell, A. L., Fortier, I., Manolio, T. A., Khoury, M. J., Little, J., & Elliott, P. (2009). Size matters: just how big is BIG? Quantifying realistic sample size requirements for human genome epidemiology. International Journal of Epidemiology, 38(1), 263-273.
Collins, F. S. (2004). The case for a US prospective cohort study of genes and environment. Nature, 429, 475-477.
Collins, F. S. & Manolio, T. A. (2007). Necessary but not sufficient. Nature, 445, 259.
Denny, J. C., Ritchie, M. D., Basford, M. A., Pulley, J. M., Bastarache, L., and others. (2010). PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics, 26(9), 1205-1210.
Drolet, B. C., & Johnson, K. B. (2008). Categorizing the world of registries. Journal of Biomedical Informatics, 41, 1009-1020.
Forrest, C. B., Bartek, R. J., Rubinstein, Y., & Groft, S. C. The case for a global rare diseases registry. The Lancet, 377, 1057-1059.
Manolio, T. A., Bailey-Wilson, J. E., & Collins, F. S. (2006). Genes, environment and the value of prospective cohort studies. Nature Reviews Genetics, 7, 812-820.
Manolio, T. A. (2010). Genomewide association studies and assessment of the risk of disease. NEJM, 363, 166-176.
Science Staff. (2011). Challenges and Opportunities. Science, 331, 692-693.
Shriner, D., Adeyemo, A., Ramos, E., Chen, G., & Rotimi, C. N. (2011). Mapping of disease-associated variants in admixed populations. Genome Biology, 12, 223-231.
Willett, W. C., Blot, W. J., Colditz, G. A., Folsom, A. R., Henderson, B. E., & Stampfer, M. J. (2007). Not worth the wait. Nature, 445, 257-258.
Wijsman, E. M., Pankratz, N. D., Choi, Y., Rothstein, J. H., Faber, K. M., and others. (2011). Genome-wide association of familial late-onset Alzheimer’s disease replicates BIN1 and CLU and nominates CUGBP2 in interaction with APOE. PLoS Genetics, 7(2), e1001308.
Rubinstein, Y. R., Groft, S. C., Bartek, R., Brown, K., Christensen, and others. (2010). Creating a global rare disease patient registry linked to a rare diseases biorepository database: Rare Disease-HUB (RD-HUB). Contemporary Clinical Trials, 31(5), 394-404.