
Software & Analysis

Big Data to Knowledge (BD2K) supported the development of software tools and methods to tackle data management, transformation, and analysis challenges in areas of high need to the biomedical research community.

Click below for descriptions and direct links to the tools and methods developed under each area of high need.

Data Compression/Reduction

The data compression/reduction awards are developing solutions for compressing many kinds of biomedical data files, from genomics to structural biology; a short illustrative sketch of one such technique follows the list below.

  • MMTF: A new compression format for large structural biology data files, the Macromolecular Transmission Format, enables 100-1000-fold speedup of interactive visualization of 3D structures over the internet.
  • GTRAC, MetaCram, smallWig, and ChipWig: A suite of compression algorithms that dramatically reduces the size of many common file types (SAM, FASTQ, Wig) used in genome sequencing, metagenomics, RNA-seq, and ChIP-seq, improving genomic data compression 10- to 100-fold.
  • HaMMLET: Software that is able to improve detection of genomic copy number variants in array comparative genomic hybridization experiments. 
  • LinDen: A tool for constructing and compressing statistical epistasis networks from genome-wide association studies. LinDen greatly increases the speed of a complete pairwise epistasis screen by reducing the number of statistical tests performed.
  • Chopper: A MATLAB toolbox for retrieving top-k proximities in large real-world networks. Chopper yields asymptotically faster convergence in theory and significantly reduced convergence times in practice.
  • TruenoDB: A comprehensive manual for the TruenoDB distributed graph database system for biological networks.
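
Many of these tools exploit the fact that neighboring values in structural and genomic data change slowly, so storing differences between successive values and collapsing repeated runs is far cheaper than storing the raw values (delta and run-length encoding are among MMTF's documented codecs, for example). The sketch below is a minimal, illustrative Python version of that idea, not any of the actual codecs above; the function names are hypothetical.

```python
from itertools import groupby

def delta_encode(values):
    """Keep the first value, then store successive differences.

    Slowly varying series (scaled atomic coordinates, per-base coverage)
    produce many small or zero deltas, which compress far better than
    the raw values.
    """
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(values):
    """Collapse runs of identical values into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

# Toy coverage track: long flat stretches collapse to a handful of pairs.
coverage = [0] * 50 + [12] * 30 + [13] * 5 + [0] * 40
encoded = run_length_encode(delta_encode(coverage))
print(f"{len(coverage)} values -> {len(encoded)} (delta, run) pairs")
# 125 values -> 7 (delta, run) pairs
```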

Data Privacy

The data privacy awards are developing tools that allow multiple individuals to compute on restricted-access data sets without removing the encryption.

  • PrivaSeq: A toolset for quantifying and analyzing individual-characterizing information leakage, which can be used to link phenotype datasets to genotype datasets and reveal sensitive information in linking attacks.
  • PopMedNet: A scalable and extensible open-source informatics platform designed to facilitate the implementation and operation of distributed health data networks.
  • PeerSMC: A web browser-based tool that allows two or more parties to conduct secure multiparty computation; a minimal secret-sharing sketch follows this list.
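
To give a flavor of the kind of computation such tools enable, the sketch below shows additive secret sharing, a standard building block of secure multiparty computation: each party splits its private value into random shares, and only the combination of everyone's partial sums reveals the aggregate. This is an illustrative example, not PeerSMC's actual protocol or API, and the function names are hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def make_shares(value, n_parties):
    """Split one party's private integer into n additive shares mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)  # shares sum to value mod PRIME
    return shares

def secure_sum(private_values):
    """Compute the sum of private values; no single share reveals any input."""
    n = len(private_values)
    # share_matrix[i][j] is the share that party i sends to party j.
    share_matrix = [make_shares(v, n) for v in private_values]
    # Each party j locally sums the shares it received from everyone.
    partials = [sum(row[j] for row in share_matrix) % PRIME for j in range(n)]
    # Only the combination of all partial sums reconstructs the total.
    return sum(partials) % PRIME

# Three sites jointly compute a total case count without disclosing their own.
print(secure_sum([42, 17, 99]))  # prints 158
```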

Data Provenance

The data provenance awards are generating tools that attach provenance information to biomedical datasets, improving reproducibility and supporting version tracking and citation; a minimal example of such a provenance record follows the entry below.

  • ProvCaRe: Provenance for Clinical Research and Healthcare (ProvCaRe) is a new framework and ontology for data provenance in biomedical Big Data.
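
To make this concrete, the sketch below assembles a minimal provenance record in the spirit of the W3C PROV model (entity, activity, agent), a common vocabulary for this kind of metadata. It is an illustrative example, not the ProvCaRe ontology itself, and the function and field choices are assumptions made for the sketch.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def provenance_record(dataset_path, derived_from, activity, agent):
    """Build a minimal PROV-style record for a derived dataset.

    The checksum pins the record to an exact file version, which is what
    makes the dataset citable and its transformations reproducible.
    """
    checksum = hashlib.sha256(Path(dataset_path).read_bytes()).hexdigest()
    return {
        "entity": {"path": str(dataset_path), "sha256": checksum},
        "wasDerivedFrom": derived_from,    # upstream dataset(s)
        "wasGeneratedBy": activity,        # e.g., script name and version
        "wasAttributedTo": agent,          # person or pipeline identity
        "generatedAtTime": datetime.now(timezone.utc).isoformat(),
    }

# Create a tiny stand-in file so the example runs end to end.
Path("cohort_cleaned.csv").write_text("id,age\n1,42\n")

record = provenance_record(
    "cohort_cleaned.csv",
    derived_from=["cohort_raw.csv"],
    activity="clean_cohort.py v1.2",
    agent="example-lab-pipeline",
)
print(json.dumps(record, indent=2))
```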

Data Visualization

The data visualization awards are making a wide range of large biomedical datasets easier to use and interpret, including brain imaging, geo-referenced data, healthcare systems dynamics data, and genomics data.

  • GGV: The Geography of Genetic Variants (GGV) browser is a web-based implementation of EEMS, a method for visualizing and analyzing population genetics data and other geo-tagged biomedical data; a minimal plotting sketch follows this list.
  • HSD ontology: A novel method for identifying and extracting healthcare systems dynamics (HSD) data and integrating them with "traditional" electronic health record (EHR) data. HSD data take into account the dynamics of the healthcare system when interpreting medical records. (For example, the date when a patient developed a disease can be inferred from when they received a diagnosis, scheduled a doctor's visit, or had tests ordered.)
  • Caleydo Web: Caleydo Web is a suite of web-based methods and software tools designed to meet current needs for visualization and analysis of complex, heterogeneous biomedical data.
  • Vials: Vials is a visual analysis tool for exploring splicing patterns in RNA-seq data.
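
As a minimal illustration of the geo-referenced view that GGV provides, the sketch below plots made-up allele frequencies as scaled markers at sampling coordinates using matplotlib. It assumes matplotlib is installed, is not part of the GGV or EEMS codebase, and all data values are invented.

```python
import matplotlib.pyplot as plt

# Made-up sampling locations (longitude, latitude) and allele frequencies.
samples = [
    ("Nigeria",   8.7,   9.1, 0.62),
    ("Finland",  25.7,  61.9, 0.08),
    ("Japan",   138.3,  36.2, 0.15),
    ("Peru",    -75.0,  -9.2, 0.31),
]

fig, ax = plt.subplots(figsize=(6, 3))
for name, lon, lat, freq in samples:
    # Marker area scales with allele frequency, as on a GGV-style map.
    ax.scatter(lon, lat, s=2000 * freq, alpha=0.6)
    ax.annotate(f"{name}: {freq:.2f}", (lon, lat), fontsize=8)

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Toy allele frequencies by sampling location")
fig.tight_layout()
fig.savefig("allele_frequency_map.png")
```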

Data Wrangling

The data wrangling awards are developing new methods and tools to improve the utility of big datasets by making them easier to share, integrate, and transform.

  • IRRMC: The Integrated Resource for Reproducibility in Macromolecular Crystallography is a public database of X-ray crystallography data that collects, cleans, and provides metadata for raw X-ray diffraction datasets.
  • Fitmunk: A new program for the automatic building of amino-acid side chains in protein crystal structures.
  • MODMatcher: A computational approach to identify and correct sample labeling errors in the multiple types of molecular data that can be used in subsequent integrative analyses.
  • ActMiR: A software tool that infers the activity of miRNAs from expression data of target genes.
  • AutoEEG and MERCuRY: New methods to process EEG cohort datasets and clinical records, align epileptic events, and identify seizure onset patterns of direct relevance to clinicians studying epilepsy.
  • MyGene.info and MyVariant.info: Open-source, high-performance, and continuously updated data application programming interfaces (APIs) for accessing comprehensive, structured gene and variant annotations. Integrating multiple information streams into a community platform for annotating gene and genetic-variation data significantly reduces siloing and duplication of effort across databases and their user communities; an example REST query appears after this list.
  • AsterixDB: A data management tool, developed primarily for HIV risk behavioral research, that enables ready access to and use of behavioral and other health-relevant data contained in social media streams.
  • geQTL: A sparse regression method that can detect both group-wise and individual associations between SNPs and expression traits.
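
Because MyGene.info and MyVariant.info expose plain REST APIs, they can be queried from any language. The sketch below uses Python's requests library against the public MyGene.info v3 endpoints; the endpoint paths follow the service's documentation, but treat the exact response fields as assumptions to verify against the live API.

```python
import requests

BASE = "https://mygene.info/v3"

def query_gene_symbol(symbol, species="human"):
    """Return the top hit for a gene symbol from the MyGene.info query endpoint."""
    resp = requests.get(
        f"{BASE}/query",
        params={"q": f"symbol:{symbol}", "species": species},
        timeout=30,
    )
    resp.raise_for_status()
    hits = resp.json().get("hits", [])
    return hits[0] if hits else None

def get_gene_annotation(gene_id, fields="symbol,name,summary"):
    """Fetch selected annotation fields for a gene ID (e.g., an Entrez ID)."""
    resp = requests.get(
        f"{BASE}/gene/{gene_id}", params={"fields": fields}, timeout=30
    )
    resp.raise_for_status()
    return resp.json()

hit = query_gene_symbol("CDK2")
if hit:
    annotation = get_gene_annotation(hit["_id"])
    print(annotation.get("name"), "-", annotation.get("summary", "")[:120])
```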

This page last reviewed on March 22, 2024