Project Summary

Bridging the Gap Between Data Collection and Analysis

Patients, health care providers, and researchers are creating vast troves of data essential to developing precision cancer therapies. To advance care, these data must be widely shared. ISB hosts one of three National Cancer Institute Cloud Resources, which provide access to terabytes of data that would otherwise be hard to reach, along with the analytical tools to mine them.

Abstracted image of a cancer regulome from ISB-CGC. Image credit: ISB.

Executive Summary

As genome sequencing and other molecular analysis technologies become faster and more affordable, the amount of data available for nearly every biological condition has skyrocketed, and cancer is no exception. Thanks to federally funded projects such as The Cancer Genome Atlas (TCGA), many of these data are openly available for researchers around the world to access and analyze. However, downloading and processing large amounts of data requires correspondingly large computer storage and processing power. To democratize not just access to cancer genomics and other kinds of “omics” data, but also the ability to analyze these datasets without specialized computers, researchers at ISB and General Dynamics Information Technology (GDIT) built the ISB Cancer Gateway in the Cloud, or ISB-CGC. This openly available platform allows anyone with a computer to perform cancer data analysis right in their web browser.

Project At-A-Glance
  • Funded by the National Cancer Institute (as a subaward from Leidos Biomedical Research)
  • Led by William Longabaugh
  • Key collaborators:
    • David Pot, PhD
    • General Dynamics Information Technology (GDIT)

ISB Cancer Gateway in the Cloud: Democratizing Big Data for Cancer Research

Researchers working with cancer genomics data from the National Cancer Institute (NCI) once needed to download large datasets to their own devices. To bypass the need for individual data storage, the NCI put out a call for institutions to build platforms that would host these data in commercial clouds and set up resources for analysis in the cloud as well. In 2014, the NCI awarded three organizations, including ISB, funding for pilot projects to build cloud-based platforms for cancer genomics data storage and analysis, starting with TCGA data. ISB partnered with Google Cloud to build the ISB-CGC platform and has since received ongoing funding from the NCI to maintain and expand the portal.

Initially focused on hosting cancer genomics data, the platform has since expanded to support analysis of many different kinds of cancer-related datasets, including proteomics data, data from cancer cell lines, and imaging data. As it has evolved, ISB-CGC has come to focus primarily on analyzing derived data that do not include a patient’s entire genome sequence. For example, researchers can extract gene expression data (a measure of how strongly a given gene is switched on) or cancer-related genetic mutation data without seeing the entire genome.

Initially led by the late Ilya Shmulevich, PhD, and now helmed by William Longabaugh, ISB-CGC uses a Google Cloud Platform data warehouse system known as BigQuery that is optimized to extract specific information from very large datasets. For example, if a dataset holds data from thousands or tens of thousands of lung cancer patients, BigQuery allows researchers to easily extract and analyze data from all female non-smoking patients over age 45. The system also allows researchers to access and analyze data without being experts in cloud engineering, Longabaugh said.
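To make the lung cancer example concrete, here is a minimal sketch of that kind of filter expressed as a SQL query. It is illustrative only: it runs against a small in-memory SQLite table as a stand-in for BigQuery, and the table name, column names, and sample rows are all hypothetical rather than taken from the actual ISB-CGC tables.

```python
import sqlite3

# Stand-in for a cloud data warehouse: a tiny in-memory SQLite database.
# The "clinical" table, its columns, and its rows are hypothetical examples,
# not the real ISB-CGC BigQuery schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clinical ("
    "case_id TEXT, sex TEXT, smoking_status TEXT, age_at_diagnosis INTEGER)"
)
conn.executemany(
    "INSERT INTO clinical VALUES (?, ?, ?, ?)",
    [
        ("case-001", "female", "never", 52),
        ("case-002", "male", "never", 61),
        ("case-003", "female", "current", 58),
        ("case-004", "female", "never", 39),
        ("case-005", "female", "never", 47),
    ],
)


def select_cohort(connection, sex, smoking_status, min_age):
    """Return case IDs matching sex and smoking status, older than min_age."""
    query = (
        "SELECT case_id FROM clinical "
        "WHERE sex = ? AND smoking_status = ? AND age_at_diagnosis > ? "
        "ORDER BY case_id"
    )
    rows = connection.execute(query, (sex, smoking_status, min_age)).fetchall()
    return [row[0] for row in rows]


# Female non-smoking patients over age 45.
cohort = select_cohort(conn, "female", "never", 45)
print(cohort)
```

The point of the sketch is that the selection is a single declarative query: the researcher states the conditions and the warehouse does the scanning, which is what makes this approach practical on tables with tens of thousands of patients.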

The ISB-CGC portal features a graphical cohort builder, which allows researchers to search for and extract data from patients who match particular conditions, like a certain kind of cancer or a specific genetic mutation. This selection process is typically the first step in performing an analysis of cancer data. The portal also provides an extensive BigQuery search system, which helps researchers find the relevant ISB-CGC BigQuery data tables for their analysis. Given the many different types of data hosted on the platform, the ISB-CGC BigQuery collection enables analysis of data from different modalities in one virtual experiment; for example, extracting and analyzing protein-based and genetic data for the same condition together.
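As a rough illustration of combining modalities in one virtual experiment, the sketch below joins a gene expression table with a protein abundance table on a shared case identifier, so that only patients with both measurements are returned. It again uses in-memory SQLite in place of BigQuery, and every table name, column name, and value is a hypothetical placeholder.

```python
import sqlite3

# Two hypothetical tables holding different data types for the same patients.
# Names and values are invented for illustration, not real ISB-CGC data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene_expression (case_id TEXT, gene TEXT, expression REAL);
    CREATE TABLE protein_abundance (case_id TEXT, gene TEXT, abundance REAL);
    """
)
conn.executemany(
    "INSERT INTO gene_expression VALUES (?, ?, ?)",
    [
        ("case-001", "TP53", 8.5),
        ("case-002", "TP53", 3.0),
        ("case-003", "TP53", 6.25),
    ],
)
conn.executemany(
    "INSERT INTO protein_abundance VALUES (?, ?, ?)",
    [
        ("case-001", "TP53", 1.5),
        ("case-003", "TP53", 0.75),
        ("case-004", "TP53", 2.0),
    ],
)


def paired_measurements(connection, gene):
    """Return (case_id, expression, abundance) for cases present in both tables."""
    query = """
        SELECT e.case_id, e.expression, p.abundance
        FROM gene_expression AS e
        JOIN protein_abundance AS p
          ON e.case_id = p.case_id AND e.gene = p.gene
        WHERE e.gene = ?
        ORDER BY e.case_id
    """
    return connection.execute(query, (gene,)).fetchall()


for case_id, expression, abundance in paired_measurements(conn, "TP53"):
    print(case_id, expression, abundance)
```

An inner join is used here so that cases missing either measurement drop out; an analysis that needs to keep unmatched cases would use an outer join instead.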

Technologies Developed

ISB Cancer Gateway in the Cloud (ISB-CGC)

William Longabaugh

Senior Software Engineer, Thorsson-Shmulevich Lab

ISB