The Common Fund Data Ecosystem (CFDE) program aims to enable broad use of the data generated by its many programs by creating a data ecosystem: the management infrastructure, analytics, applications, and user interfaces needed to work within and across existing Common Fund data sets. To continue developing and growing this ecosystem, the CFDE has now funded two more Common Fund Data Coordinating Centers (DCCs) as well as additional partnership projects among DCCs.
Data Coordinating Centers newly engaged with CFDE
A DCC representing the Molecular Transducers of Physical Activity Consortium (MoTrPAC) Program formally joined CFDE in 2022. It will work to integrate data on the molecular changes caused by exercise, collected from both human and animal models, into CFDE. The DCCs from two new Common Fund (CF) programs, Cellular Senescence Network (SenNet) and Bridge to Artificial Intelligence (Bridge2AI), will begin to engage in CFDE, too. Bridge2AI will generate flagship data sets that are ethically sourced, trustworthy, well-defined, accessible, and truly representative of the diversity of the population. SenNet will generate publicly accessible atlases of senescent cells, and the molecules they secrete, using data collected from multiple human and model organism tissues, with a particular emphasis on single-cell data.
The addition of MoTrPAC, SenNet, and Bridge2AI will help expand CFDE by contributing a wealth of clinical information while increasing the diversity of data types included within CFDE. They join ten DCCs, including those of the 4D Nucleome (4DN), Extracellular RNA Communication (exRNA), Gabriella Miller Kids First (Kids First), Glycoscience, Genotype-Tissue Expression (GTEx), The Human BioMolecular Atlas Program (HuBMAP), Illuminating the Druggable Genome (IDG), Library of Integrated Network-based Cellular Signatures (LINCS), Metabolomics, and Stimulating Peripheral Activity to Relieve Conditions (SPARC) programs. Together with the CFDE-Coordination Center (CFDE-CC), these awardee teams are continuing to advance development of processes for harmonizing basic metadata elements, providing data sets for the CFDE Portal, forming a culture of sharing insight and knowledge across DCCs, and contributing to CFDE-wide training and outreach efforts.
New Partnerships among Data Coordinating Centers
Five new DCC partnership projects have also been funded by the CFDE. These collaborative projects will develop approaches and tools to harmonize data and workflows from multiple Common Fund programs, enabling cross-dataset analysis. The partnerships are meant to enhance DCC-DCC interactions and to demonstrate the utility of their data integration tools and approaches for CF datasets to the broader scientific community. The projects and their partnering DCCs are:
- Workflow Playbook: Partnering DCCs: exRNA, Glycoscience, Kids First, LINCS, and Metabolomics
This project will develop an interactive workflow engine that will draw knowledge from across CF DCCs. Several CF DCC tools, APIs, and databases will serve the CFDE Workflow Playbook (CFDE-WP) framework. The CFDE-WP will be a network of DCC microservices served through a user interface, with nodes representing semantic types (for example, gene sets, signatures, diseases, or drugs) and edges representing transformations or visualizations of these objects performed by various tools. Users will be able to upload their own data and analyze it in the context of cross-CF data and tools, dynamically constructing workflows and implementing unique use cases to generate novel hypotheses.
- RNA Seq: Partnering DCCs: GTEx, HuBMAP, Kids First, and SPARC
This project will produce common harmonized RNAseq data resources for the CFDE, and harmonized processing pipeline(s) for further use, to increase the FAIRness and interoperability of the RNA datasets in the CFDE. This will involve deploying a standard RNAseq pipeline across the DCCs, based on a common GENCODE reference and using a revised aligner to improve detection of rare variant effects, and reprocessing all of the existing accessible bulk-tissue RNAseq reference resources. The harmonized data resource and processing pipeline will be made widely available. We will also harmonize all single-cell RNAseq datasets using the HuBMAP pipeline, with the goal of evaluating results, incorporating any improvements into that pipeline, and producing harmonized scRNAseq data across the CFDE.
- Data Distillery: Partnering DCCs: 4DN, exRNA, Glycoscience, GTEx, HuBMAP, IDG, Kids First, LINCS, Metabolomics, and SPARC
This partnership will produce the largest research knowledge graph database of integrated NIH project data to date, with hundreds of millions of experimental and ontological data points and relationships mapped, including those from the NIH Unified Medical Language System (UMLS). This knowledge graph will provide a rich and connected data space for biomedical search, analysis, and machine learning applications.
- Making Gene Regulatory Knowledge FAIR: Partnering DCCs: exRNA, GTEx, and Kids First
The project will focus on gene regulatory element knowledge as the key “stepping stone” connecting genes, pathways, and regulators in tissue-specific, developmental, and disease contexts. This approach will combine existing information from CFDE DCCs into knowledge that will then be applied to generate more knowledge, thus igniting a virtuous cycle of FAIR knowledge creation. Because most genetic disease risk in humans is attributable to genetic variants impacting regulatory elements, the knowledge gained from this project will be key to interpreting whole genome sequence data from current and future projects.
- Clinical Observations and Vocabularies: Partnering DCCs: Kids First, Metabolomics, and SPARC
The goal of the CLOVoc project is to improve the ability to query and integrate across CF datasets for a given disease/phenotype or clinical profile, allowing secondary analyses that drive insights about health and disease. In CLOVoc I, we developed interoperability across clinical resources in CFDE through a FAIR minimal clinical metadata framework and harmonized Fast Healthcare Interoperability Resources (FHIR) profiles. CLOVoc II will enhance the capabilities developed in CLOVoc I by (i) developing CLOVoc knowledge graph(s) from FHIR profiles, (ii) managing patient data values, and (iii) deploying learning systems using the CLOVoc knowledge graphs and algorithms to demonstrate our use cases. One such use case will be to characterize Type 1 diabetes mellitus, a spectral disorder, into its clinical variants and underlying endotypes. The outputs from this project will be synergistic and complementary with other CFDE efforts.
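To make the Workflow Playbook design described above more concrete, the following is a minimal illustrative sketch, not the CFDE-WP implementation. It models semantic types as graph nodes and tools as directed edges, then uses a breadth-first search to find the shortest chain of tools that converts one semantic type into another. The tool names, the semantic type labels, and the `find_workflow` helper are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A directed edge: a tool that transforms one semantic type into another."""
    name: str
    source: str  # input semantic type (graph node)
    target: str  # output semantic type (graph node)


# Hypothetical tools; real DCC services and their names are not assumed here.
TOOLS = [
    Tool("find_gene_signature", "gene", "signature"),
    Tool("signature_to_gene_set", "signature", "gene_set"),
    Tool("enrich_gene_set_for_diseases", "gene_set", "disease"),
    Tool("find_drugs_reversing_signature", "signature", "drug"),
]


def find_workflow(tools, start, goal):
    """Return the shortest list of tools leading from start to goal, or None."""
    if start == goal:
        return []

    # Index the edges by their input semantic type.
    outgoing = {}
    for tool in tools:
        outgoing.setdefault(tool.source, []).append(tool)

    # Breadth-first search over semantic types.
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        for tool in outgoing.get(node, []):
            if tool.target in visited:
                continue
            new_path = path + [tool]
            if tool.target == goal:
                return new_path
            visited.add(tool.target)
            queue.append((tool.target, new_path))
    return None


if __name__ == "__main__":
    workflow = find_workflow(TOOLS, "gene", "drug")
    print(" -> ".join(tool.name for tool in workflow))
```

In this sketch a workflow is simply a path through the graph, so adding a new tool only means adding one more edge. A real system would also need to carry data between steps and handle tools with several inputs, which this example leaves out.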
Learn more about these CFDE awards by visiting the Funded Research page.