Bridging the Gap Between Data Collection and Analysis
Patients, health care providers and researchers are creating vast troves of data essential to developing precision cancer therapies. To advance care, this data must be widely shared. ISB hosts one of three National Cancer Institute Cloud Resources, which give researchers access to terabytes of data that would otherwise be hard to reach, along with the analytical tools to mine it.
Abstracted image of a cancer regulome from ISB-CGC. Image credit: ISB.
As genome sequencing and other molecular analysis technologies become faster and more affordable, the amount of data available for nearly every biological condition has skyrocketed, and cancer is no exception. Thanks to federally funded projects such as The Cancer Genome Atlas (TCGA), much of this data is openly available for researchers around the world to access and analyze. However, downloading and processing large amounts of data requires correspondingly large amounts of storage and computing power. To democratize not just access to cancer genomics data and other kinds of “omics” data, but also the ability to analyze these datasets without specialized computers, researchers at ISB and General Dynamics Information Technology (GDIT) built the ISB Cancer Gateway in the Cloud, or ISB-CGC. This openly available platform allows anyone with a computer to perform cancer data analysis right in their web browser.
- Funded by the National Cancer Institute (as a subaward from Leidos Biomedical Research)
- Led by William Longabaugh
- Key collaborators:
  - David Pot, PhD
  - General Dynamics Information Technology (GDIT)
ISB Cancer Gateway in the Cloud: Democratizing Big Data for Cancer Research
In 2014, the National Cancer Institute awarded three organizations, including ISB, funding for pilot projects to build cloud-based platforms for cancer genomics data storage and analysis, starting with TCGA data. ISB partnered with Google Cloud to build the ISB-CGC platform and has since received ongoing funding from the NCI to maintain and expand the portal. At the time, researchers working with cancer genomics data from the NCI needed to download large datasets to their own devices. The NCI put out a call for institutions to build platforms that would host this data in commercial clouds and set up resources for analysis in the cloud as well, bypassing the need for individual data storage.
Initially focused on hosting cancer genomics data, the platform has since expanded to support analysis of many kinds of cancer-related datasets, including proteomics data, data from cancer cell lines, and imaging data. ISB-CGC now focuses primarily on analyzing derived data that does not include a patient’s entire genome sequence. For example, researchers can extract gene expression data (a measure of how strongly a given gene is switched on) or cancer-related genetic mutation data without seeing the entire genome.
Initially led by the late Ilya Shmulevich, PhD, and now helmed by William Longabaugh, ISB-CGC uses a Google Cloud Platform data warehouse system known as BigQuery that is optimized to extract specific information from very large datasets. For example, if a dataset holds data from thousands or tens of thousands of lung cancer patients, BigQuery allows researchers to easily extract and analyze data from all non-smoker female patients over 45. The system also allows researchers to access and analyze data without being experts in cloud engineering, Longabaugh said.
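The kind of cohort filter described above can be sketched in a few lines. This is a minimal illustration, not ISB-CGC's actual schema or API: the table name `clinical`, its columns, and the sample rows are all hypothetical, and Python's built-in `sqlite3` stands in for BigQuery so the example runs locally without a cloud account. The SQL itself is the sort of query a researcher might adapt to a real clinical table.

```python
import sqlite3

# Hypothetical patient rows: (case_id, gender, age_at_diagnosis, smoking_status).
# Real ISB-CGC tables have their own names, columns, and value encodings.
ROWS = [
    ("P001", "FEMALE", 52, "never"),
    ("P002", "FEMALE", 61, "current"),
    ("P003", "MALE", 58, "never"),
    ("P004", "FEMALE", 39, "never"),
    ("P005", "FEMALE", 47, "never"),
]

# Select female patients over 45 who have never smoked.
COHORT_SQL = """
SELECT case_id
FROM clinical
WHERE gender = 'FEMALE'
  AND smoking_status = 'never'
  AND age_at_diagnosis > 45
ORDER BY case_id
"""


def select_cohort(rows):
    """Load rows into an in-memory table and return the matching case IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clinical ("
        "case_id TEXT, gender TEXT, age_at_diagnosis INTEGER, smoking_status TEXT)"
    )
    conn.executemany("INSERT INTO clinical VALUES (?, ?, ?, ?)", rows)
    result = [row[0] for row in conn.execute(COHORT_SQL)]
    conn.close()
    return result


if __name__ == "__main__":
    print(select_cohort(ROWS))
```

With the sample rows above, only P001 and P005 satisfy all three conditions. On the real platform the same filter would run inside the data warehouse, so only the small result set, not the full dataset, would need to leave the cloud.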
The ISB-CGC portal features a graphical cohort builder, which allows researchers to search for and extract data from patients that match particular conditions, like a certain kind of cancer or a specific genetic mutation. This selection process is typically the first step in performing an analysis of cancer data. The portal also provides an extensive BigQuery search system, which helps researchers find the ISB-CGC BigQuery data tables relevant to their analysis. Because the platform hosts many different types of data, the ISB-CGC BigQuery collection enables analysis of data from different modalities in one virtual experiment; for example, extracting and analyzing protein-based and genetic data for the same condition together.
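A cross-modality analysis of this kind usually comes down to joining tables on a shared patient identifier. The sketch below assumes three hypothetical tables (`clinical`, `rna_expression`, `protein_abundance`) keyed by a hypothetical `case_id` column, with invented values; it again uses `sqlite3` as a local stand-in for BigQuery, and none of the names reflect the actual ISB-CGC tables.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with small, made-up tables for three data types."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE clinical (case_id TEXT, disease TEXT);
        CREATE TABLE rna_expression (case_id TEXT, gene TEXT, expr REAL);
        CREATE TABLE protein_abundance (case_id TEXT, gene TEXT, abundance REAL);
        INSERT INTO clinical VALUES
            ('P001', 'BRCA'), ('P002', 'BRCA'), ('P003', 'LUAD');
        INSERT INTO rna_expression VALUES
            ('P001', 'TP53', 8.5), ('P002', 'TP53', 6.0), ('P003', 'TP53', 7.2);
        INSERT INTO protein_abundance VALUES
            ('P001', 'TP53', 1.5), ('P002', 'TP53', 0.5), ('P003', 'TP53', 0.9);
        """
    )
    return conn


# Step 1 (cohort): restrict to one disease.
# Step 2 (modalities): pair each patient's RNA and protein values for one gene.
JOIN_SQL = """
SELECT c.case_id, r.expr, p.abundance
FROM clinical AS c
JOIN rna_expression AS r
  ON r.case_id = c.case_id
JOIN protein_abundance AS p
  ON p.case_id = c.case_id AND p.gene = r.gene
WHERE c.disease = ? AND r.gene = ?
ORDER BY c.case_id
"""


def paired_measurements(conn, disease, gene):
    """Return (case_id, expression, protein abundance) rows for the chosen cohort and gene."""
    return conn.execute(JOIN_SQL, (disease, gene)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in paired_measurements(demo, "BRCA", "TP53"):
        print(row)
```

The inner joins keep only patients that have both measurement types, which is one reasonable choice for a paired comparison; a left join would be the alternative if patients with a missing modality should still appear in the output.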
Citations
- Sheila M. Reynolds, Michael Miller, Phyliss Lee, et al. The ISB cancer genomics cloud: A flexible cloud-based platform for cancer genomics research. Cancer Research. 2017. doi: 10.1158/0008-5472.CAN-17-0617.
- Kawther Abdilleh, Boris Aguilar, J. Ross Thomson, et al. Multi-omics data integration in the Cloud: Analysis of Statistically Significant Associations Between Clinical and Molecular Features in Breast Cancer. 2020. doi: 10.1145/3388440.3414917.
- Dondra Bailey, Kawther Abdilleh, Boris Aguilar, et al. Multi-omics characterization of Microtubule-actin cross linking factor 1 (MACF1) using the ISB-Cancer Genomics Cloud. 2020. doi: 10.1145/3388440.3414918.
- Kawther Abdilleh, Boris Aguilar, Ronald C. Taylor, et al. Large-scale Cloud-based Inference of Differential Breast Cancer-related Network Gene Between Patient Cohorts. 2020.
- Aguilar B, Gibbs DL, Reiss DJ, et al. A generalizable data-driven multicellular model of pancreatic ductal adenocarcinoma. GigaScience. 2020. doi: 10.1093/gigascience/giaa075.
- Gibbs DL, Aguilar B, Thorsson V, Ratushny AV, Shmulevich I. Patient-Specific Cell Communication Networks Associate With Disease Progression in Cancer. Frontiers in Genetics. 2021. doi: 10.3389/fgene.2021.667382.
- Boris Aguilar, Kawther Abdilleh, George Acquaah-Mensah. A tale of two cohorts: Transcriptomics and epigenomic analysis in breast cancer. 2021.
- Kawther Abdilleh, Boris Aguilar, Ronald C. Taylor, et al. Multi-omics data analysis in the cloud: inference of differential breast cancer-related network hubs between TCGA patient cohorts. 2021. doi: 10.7490/f1000research.1118296.1.
- Plaugher D, Aguilar B, Murrugarra D. Uncovering potential interventions for pancreatic cancer patients via mathematical modeling. bioRxiv. 2022. doi: 10.1101/2022.01.11.475711.
- de Andrade KC, Lee EE, Tookmanian EM, et al. The TP53 Database: transition from the International Agency for Research on Cancer to the US National Cancer Institute. Cell Death & Differentiation. 2022. doi: 10.1038/s41418-022-00976-3.
- Tercan B, Qin G, Kim TK, et al. SL-Cloud: A Cloud-based resource to support synthetic lethal interaction discovery. F1000 Research. 2022. doi: 10.12688/f1000research.110903.2.
- Wang J, Zheng J, Lee E, et al. A cloud-based resource for genome coordinate-based exploration and large-scale analysis of chromosome aberrations and gene fusions in cancer. Genes Chromosomes & Cancer. 2023. doi: 10.1002/gcc.23128.
- Torcivia J, Abdilleh K, Seidl F, et al. Whole Genome Variant Dataset for Enriching Studies across 18 Different Cancers. 2023. doi: 10.3390/onco2020009.
- Pot D, Worman Z, Baumann A, et al. NCI Cancer Research Data Commons: Cloud-based Analytical Resources. Cancer Research. 2024. doi: 10.1158/0008-5472.CAN-23-2657.
- Seidl F, Hagen L, Wilson J, et al. The ISB Cancer Gateway in the Cloud (ISB-CGC): Access, explore and analyze large-scale cancer data through the Google Cloud. Cancer Research. 2024. doi: 10.1158/1538-7445.AM2024-3547.
ISB-CGC Blogs
- Bleich D, Wilson J. New Notebook Demonstrates Machine Learning in Google BigQuery Using Updated Mitelman Database. 2024.
- Thomson R. How to run statistics inside BigQuery. 2023.
- Bleich D. How the Mitelman Database Can Help You Explore Genomic Abnormalities. 2023.
- Bleich D. ISB-CGC Cloud Resource: Providing Researchers with Shortcuts to Data Analysis. 2022.
- Bleich D, Kuan A, Pot D, Ray M, Subramanian SL, Van der Auwera G. NCI’s Cloud Resources Help Tame Today’s Data Windfall. 2021.
Technologies Developed
ISB Cancer Gateway in the Cloud (ISB-CGC)
Contact William Longabaugh
Senior Software Engineer, Thorsson-Shmulevich Lab
ISB