Centers of Excellence for Big Data Computing

The Big Data to Knowledge (BD2K) Centers of Excellence have developed new approaches, methods, software tools and related resources including publications, data standards, and educational resources to advance Big Data Science in their relevant biomedical area of focus.


Click on the names of the Centers in the list below for more information and direct links to the tools and resources that are available from each Center.


BD2K Training and Education

The Big Data to Knowledge (BD2K) Training activities were designed to improve big data skills of biomedical scientists and increase the number of biomedical data scientists. BD2K-funded grants have produced a number of educational resources to strengthen the role of data science in modern biomedical research.

NIH-funded biomedical data science training programs represent a broad range of degree programs, career-development paths, in-person workshops, virtual events, and other unique activities. 

The BD2K Training Coordination Center (TCC) helps promote and support training and educational activities across the collection of NIH-funded Big Data to Knowledge (BD2K) grants. Learn more about the TCC.

Click on items in the list below for links and descriptions of each resource produced by the TCC.

BD2K training grants have produced a number of in-person courses, Massive Open Online Courses (MOOCs), workshops, summer training programs, and other activities, which can be accessed through the sunburst, an interactive display of NIH-funded biomedical data science training programs. Explore educational resources from the BD2K training grants through the sunburst.

BD2K Mentored Career Development Award in Biomedical Big Data Science for Clinicians and Doctorally Prepared Scientists (K01)

  • Project Tycho: A repository for global health data in a standardized format that is compliant with FAIR guidelines. Project Tycho contains case counts for notifiable conditions for the United States and includes data for dengue-related conditions for 100 countries obtained from the World Health Organization and national health agencies.
  • HashtagHealth: A resource that both addresses the dearth of neighborhood data and offers novel characterizations of neighborhoods. Neighborhood indicators include food themes, healthiness of food mentions, frequency of exercise/recreation mentions, metabolic intensity of physical activities, and happiness levels.
  • genTB: An analysis tool for translational tuberculosis genomic data that offers a means for sharing, citing, and crediting tuberculosis data and metadata; prediction of resistance from genotype using a machine learning algorithm; geographic data mapping; and a user-friendly statistical analysis tool.

BD2K Open Educational Resources for Biomedical Big Data (R25)

  • Oregon Health & Science University (OHSU) Educational Materials: A repository of advanced introductory materials for individuals seeking to learn more about data science in order to expand their research programs, explore future career paths in data science, and understand and apply BD2K concepts in their present jobs.

The BD2K Centers also produced training and educational resources including courses, workshops, webinars, lecture series, summer internships and training programs. Visit the Centers pages below for additional information.

BD2K-Library of Integrated Network-based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC)
The BD2K-LINCS DCIC delivers high-quality educational materials through web-based formats such as Massive Open Online Courses (MOOCs), as well as through mentoring, seminars, and symposia:

Center for Causal Discovery (CCD)
The CCD center offers courses, workshops, and lectures on causal relationships in big biomedical data: 

Center for Expanded Data Annotation and Retrieval (CEDAR)
The CEDAR center provides a list of educational resources for metadata training, and offers tutorials on using the CEDAR software to create simple templates and metadata records.

The Mobilize Center
The Mobilize Center faculty have created a number of MOOCs and run workshops for individuals interested in data science:

Center for Predictive Computational Phenotyping (CPCP)
The CPCP conducts training activities on data science, predictive models for biomedicine, and computational phenotyping for a broad set of audiences:

Mobile Sensor Data-to-Knowledge (MD2K)
MD2K offers an annual training program to help investigators develop the multidisciplinary skills needed to generate high-quality mHealth research and solutions. Lectures from past training programs, training videos, and webinars on biomedical applications are available on the MD2K website:


The KnowEnG center offers an online resource that hosts prototypes of educational games for teaching sequence alignment, dynamic programming, and phylogenetic tree reconstruction algorithms. Through an R25 program partnership among the University of Illinois Urbana-Champaign, Mayo Clinic, and Fisk University, the KnowEnG center provides under-represented minority undergraduate students with curricular training and experience in Bioinformatics and Big Data.

PIC-SURE trains the next generation of biomedical big data scientists through its Summer training program in Biomedical Informatics, and by offering data science and precision medicine graduate-level courses:

Resource Indexing

The data discovery index (DataMed) prototype was developed through the BD2K biomedical and healthCAre Data Discovery Indexing Ecosystem project (bioCADDIE), and allows users to find and access biomedical datasets from multiple sources based on key attributes.

Software & Analysis

Big Data to Knowledge (BD2K) supported the development of software tools and methods to tackle data management, transformation, and analysis challenges in areas of high need to the biomedical research community.

Click below for descriptions and direct links to the tools and methods developed under each area of high need.

The data compression/reduction awards are developing solutions for compressing files across data types ranging from genomics to structural data.

  • MMTF: A new compression format for large structural biology data files, the Macromolecular Transmission Format, enables 100-1000-fold speedup of interactive visualization of 3D structures over the internet.
  • GTRAC, MetaCram, smallWig, and ChipWig: A suite of compression algorithms that can dramatically reduce the size of many of the common types of files (SAM, FASTQ, Wig) used in genome sequencing, metagenomics, RNA-seq, and ChIP-seq. Genomic data compression is improved by 10-100-fold using these tools.
  • HaMMLET: Software that is able to improve detection of genomic copy number variants in array comparative genomic hybridization experiments. 
  • LinDen: A tool for constructing and compressing statistical epistasis networks from genome wide association studies. LinDen greatly increases the speed of a complete pairwise epistasis screen by reducing the number of statistical tests performed. 
  • Chopper: A MATLAB Toolbox used for retrieving Top-K proximities in large real-world networks. Chopper yields asymptotically faster convergence in theory, and significantly reduced convergence times in practice.
  • TruenoDB: A comprehensive manual for the TruenoDB distributed graph database system for biological networks.

The data privacy awards are developing tools that allow multiple individuals to compute on restricted access data sets without removing the encryption.

  • PrivaSeq: A toolbase for quantifying and analyzing the leakage of individual-characterizing information, which can be used to link phenotype datasets to genotype datasets and reveal sensitive information in linking attacks.
  • PopMedNet: A scalable and extensible open-source informatics platform designed to facilitate the implementation and operation of distributed health data networks.
  • PeerSMC: A web browser-based tool that allows two or more parties to conduct secure multiparty computation.

The data provenance awards are generating tools to assign provenance information to biomedical datasets to improve reproducibility of these data for version tracking and citation.

  • ProvCaRe: Provenance for Clinical Research and Healthcare (ProvCaRe) is a new framework ontology for data provenance in biomedical Big Data.

The data visualization awards are making a wide range of large biomedical datasets easier to use and interpret, including brain scan imaging, geo-referenced data, health care systems dynamic data, and genomics data.

  • GGV: The Geography of Genetic Variants (GGV) browser is a web services software implementation of EEMS. EEMS is a new and innovative method for visualizing and analyzing population genetics data and other such geo-tagged biomedical data.
  • HSD ontology: A novel method for identifying and extracting healthcare systems dynamics (HSD) data, and for integrating these data with "traditional" electronic health record (EHR) data. HSD data take into account the dynamics of the healthcare system when interpreting medical records. (For example, the date when a patient developed a disease can be inferred from when they received a diagnosis, scheduled a doctor visit, tests were ordered, etc.)
  • Caleydo Web: Caleydo Web is a suite of web-based methods and software tools designed to meet current needs for visualization and analysis of complex, heterogeneous biomedical data.
  • Vials: Vials is a novel visual analysis tool for analyzing splicing patterns in RNA-seq data.

The data wrangling awards are developing new methods and tools to improve the utility of big datasets by making them easier to share, integrate, and transform.

  • IRRMC: The Integrated Resource for Reproducibility in Macromolecular Crystallography is a public database of X-ray crystallography data, which provides a method for cleaning, collecting, and providing metadata for raw X-ray diffraction datasets.
  • Fitmunk: A new program for the automatic building of amino-acid side chains in protein crystal structures.
  • MODMatcher: A computational approach to identify and correct sample labeling errors in the multiple types of molecular data that can be used in subsequent integrative analyses.
  • ActMiR: A software tool that infers the activity of miRNAs from expression data of target genes.
  • AutoEEG and MERCuRY: New methods to process EEG cohort datasets and clinical records, align epileptic events, and identify seizure onset patterns that are of direct impact to clinicians studying epilepsy.
  • and Open source, high-performance, and continuously-updated data application programming interfaces (APIs) for accessing comprehensive, structured gene and variant annotations. The integration of multiple information streams into a community platform for annotating gene and genetic variation data significantly reduces siloing and duplication of effort across multiple databases and their user communities.
  • AsterixDB: A data management tool enabling ready access to and use of behavioral and other health-relevant data contained in social media streams, developed primarily for HIV risk behavioral research.
  • geQTL: A sparse regression method that can detect both group-wise and individual associations between SNPs and expression traits.

Forums for Integrative Phenomics

BD2K supported the development of community-based data and metadata standards. The Forums for Integrative Phenomics combines data across species to illuminate challenges in genomics, human health and disease. 

  • Phenotype Ontology Reconciliation Effort: A community effort that attempts to reconcile logical definitions across a number of important phenotype ontologies. The outcome of this effort will be an integrated ecosystem of phenotype ontologies that can be leveraged in clinical diagnostics and disease mechanism discovery in humans.

Interactive Digital Media & Crowdsourcing

Big Data to Knowledge (BD2K) supported the development of interactive media tools for analyzing biomedical data via crowdsourcing.

  • Fold It: A revolutionary crowdsourcing computer game enabling users to contribute to important scientific research. Users solve puzzles to help researchers find out if humans' pattern-recognition and puzzle-solving abilities make them more efficient than existing computer programs at pattern-folding tasks, informing models of protein structure prediction.
  • EyeWire II: A 3D puzzle game that allows players to solve puzzles of neuron configurations to help researchers map the brain.
  • Quorum: A flexible gaming platform that will crowdsource the analysis of visual data, such as microscopic images or graphical charts, that are provided by research scientists.
  • Cancer Crusade: A game in which players can help improve scientific understanding of combination therapies that fight cancer.
  • GraphSpace: An easy-to-use web-based platform that collaborating research groups can use for storing, interacting with, and sharing networks.
  • StarGEO: The Search Tag Analyze Resource for the Gene Expression Omnibus (STARGEO) project aims to crowdsource annotations of open genomics big data, allowing users to discover the functional genes and biological pathways that are defective in disease.

This page last reviewed on September 13, 2023