pan-cancer-dataset-sources

Pan-Cancer Datasets

This page compiles pan-cancer datasets spanning several modalities: medical and clinical records, radiology (CT, MRI, PET), pathology (H&E and IHC), and omics data (genomics and proteomics). The collection is non-exhaustive and is updated periodically. Its purpose is to give the cancer research community a unified view of the resources available for studying various cancer sites, organs, and modalities. We aim to use these resources in our ongoing research and in the fight against cancer.

The list is compiled primarily from data portals under the NIH National Cancer Institute (NCI): The Cancer Imaging Archive (TCIA), the Genomic Data Commons (GDC) portal, which hosts The Cancer Genome Atlas (TCGA), and the Proteomic Data Commons (PDC) portal, which hosts data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). A summary of the datasets available at these portals follows.

We first present the National Cancer Institute (NCI) data modalities, followed by the 32 cancer types with their corresponding datasets, primary publications, number of cases, and modalities. The list is organized by cancer type and then by data modality. The data modalities include clinical, copy number, DNA, imaging, and miRNA, mRNA, and protein expression data. A second table presents the non-NCI dataset resources available for public access. Lastly, we list the abbreviations used in this compilation.
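The case counts in this compilation are snapshots, and the portals are updated over time. As a minimal sketch, assuming the public GDC REST API at `https://api.gdc.cancer.gov` and its `/cases` endpoint, a current count for one project could be retrieved as follows. The function names `build_case_count_url` and `fetch_case_count` are hypothetical helpers introduced for illustration and are not part of this repository.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public GDC REST API endpoint for case records.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_case_count_url(project_id: str) -> str:
    """Build a /cases query URL that filters on one project and requests no hits."""
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    # size=0 asks only for pagination metadata, which carries the total count.
    params = {"filters": json.dumps(filters), "size": "0", "format": "json"}
    return GDC_CASES_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_case_count(project_id: str, timeout: float = 30.0) -> int:
    """Query GDC (network access required) and return the number of matching cases."""
    url = build_case_count_url(project_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload["data"]["pagination"]["total"]
```

For example, `fetch_case_count("TCGA-BRCA")` would return the number of cases GDC currently lists for that project, which may differ from the figures recorded here.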

Data Modalities

NIH / NCI-hosted Datasets

| Ser | Cancer Site | #Cases | Primary Publication (#Cases Studied) | Clinical | Genomics | Proteomics | Pathology | Radiology |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Acute Myeloid Leukemia (TCGA-LAML, CPTAC-AML) | 200 | NEJM 2013 (200) | 135 Cases (TCGA-LAML) | 135 Cases (TCGA-LAML) | 41 Cases | 120 svs | |
| 2 | Adrenocortical Carcinoma (TCGA-ACC) | 92 | Cancer Cell 2016 (91) | 92 Cases (TCGA-ACC) | 92 Cases (TCGA-ACC) | | 323 svs | |
| 3 | Bladder Urothelial Carcinoma | 412 | Nature 2014, Cell 2017 (123) | 408 Cases (TCGA-BLCA) | 408 Cases (TCGA-BLCA) | | 926 svs | TCGA-BLCA: 111,781 imgs (CT,CR,MR,PT,DX), 58GB size |
| 4 | Breast Ductal Carcinoma | 778 | Nature 2012 (430) | 1036 Cases (TCGA-BRCA) | 1036 Cases (TCGA-BRCA) | | 3,111 svs | TCGA-BRCA: 230,167 imgs (MR,MG,CT), 88GB size |
| 5 | Breast Lobular Carcinoma | 201 | Cell 2015 (127) | 1036 Cases (TCGA-BRCA) | 1036 Cases (TCGA-BRCA) | | 3,111 svs | TCGA-BRCA: 230,167 imgs (MR,MG,CT), 88GB size |
| 6 | Cervical Carcinoma | 307 | Nature 2017 (228) | 305 Cases (TCGA-CESC) | 305 Cases (TCGA-CESC) | | 604 svs | TCGA-CESC: 19,135 imgs (MR), 9.5GB size |
| 7 | Cholangiocarcinoma (TCGA-CHOL) | 51 | Cell Reports 2017 (38) | 355 Cases (TCGA-CHOL) | 355 Cases (TCGA-CHOL) | | 110 svs | |
| 8 | Colorectal Adenocarcinoma | 633 | Nature 2012 (276) | 458 Cases (TCGA-COAD) | 458 Cases (TCGA-COAD) | | 1,442 svs | TCGA-COAD: 8,387 imgs (CT), 4.5GB size |
| 9 | Esophageal Carcinoma | 185 | Nature 2017 (164) | 183 Cases (TCGA-ESCA) | 183 Cases (TCGA-ESCA) | | 396 svs | TCGA-ESCA: 20,593 imgs (CT), 11GB size |
| 10 | Stomach/Gastric Adenocarcinoma | 443 | Nature 2014 (295) | 437 Cases (TCGA-STAD) | 437 Cases (TCGA-STAD) | | 1,197 svs | TCGA-STAD: 43,908 imgs (CT), 23.3GB size |
| 11 | Glioblastoma Multiforme | 617 | Nature 2008, Cell 2013 (206) | 523 Cases (TCGA-GBM) | 523 Cases (TCGA-GBM) | 100 Cases | 2,053 svs | TCGA-GBM: 481,158 imgs (CT,MR,DX), 73.5GB size |
| 12 | Head and Neck Squamous Cell Carcinoma | 528 | Nature 2015 (279) | 523 Cases (TCGA-HNSC) | 523 Cases (TCGA-HNSC) | | 1,263 svs | TCGA-HNSC: 270,376 imgs (CT,MR,PET,RTDOSE,RTPLAN,RTSTRUCT), 130GB size |
| 13 | Liver Hepatocellular Carcinoma | 377 | Cell 2017 (363) | 375 Cases (TCGA-LIHC) | 375 Cases (TCGA-LIHC) | | 870 svs | TCGA-LIHC: 125,397 imgs (CT,MR,PT), 52.5GB size |
| 14 | Kidney Chromophobe Carcinoma | 113 | Cancer Cell 2014 (66) | 66 Cases (TCGA-KICH) | 66 Cases (TCGA-KICH) | | 326 svs | TCGA-KICH: 9,221 imgs (CT,MR), 4.2GB size |
| 15 | Kidney Clear Cell Carcinoma | 537 | Nature 2013 (446) | 523 Cases (TCGA-KIRC) | 523 Cases (TCGA-KIRC) | | 2,173 svs | TCGA-KIRC: 192,581 imgs (CT,MR), 91.6GB size |
| 16 | Kidney Papillary Cell Carcinoma | 291 | NEJM 2016 (161) | 289 Cases (TCGA-KIRP) | 289 Cases (TCGA-KIRP) | | 773 svs | TCGA-KIRP: 26,667 imgs (CT,MR,PT), 9.6GB size |
| 17 | Low Grade Glioma | 516 | NEJM 2015 (293) | 509 Cases (TCGA-LGG) | 509 Cases (TCGA-LGG) | | 1,572 svs | TCGA-LGG: 241,183 imgs (CT,MR), 42.8GB size |
| 18 | Lung Adenocarcinoma | 585 | Nature 2014, Nature Genetics 2016 (230) | 563 Cases (TCGA-LUAD) | 563 Cases (TCGA-LUAD) | 111 Cases | 1,608 svs | TCGA-LUAD: 48,931 imgs (CT,PT,NM), 18.3GB size |
| 19 | Lung Squamous Cell Carcinoma | 504 | Nature 2012, Nature Genetics 2016 (178) | 501 Cases (TCGA-LUSC) | 501 Cases (TCGA-LUSC) | 118 Cases | 1,612 svs | TCGA-LUSC: 36,518 imgs (CT,PET,NM), 14GB size |
| 20 | Mesothelioma (TCGA-MESO) | 74 | Cancer Discovery 2018 (87) | 85 Cases (TCGA-MESO) | 85 Cases (TCGA-MESO) | | 175 svs | |
| 21 | Ovarian Serous Adenocarcinoma | 608 | Nature 2011 (489) | 570 Cases (TCGA-OV) | 570 Cases (TCGA-OV) | | 1,481 svs | TCGA-OV: 53,662 imgs (CT), 28.3GB size |
| 22 | Pancreatic Ductal Adenocarcinoma (TCGA-PAAD, CPTAC-PDA) | 185 | Cancer Cell 2017 (150) | 173 Cases (TCGA-PAAD) | 173 Cases (TCGA-PAAD) | 166 Cases | 466 svs, 557 svs | From CPTAC-PDA: 105,546 imgs (CR,CT,MR,PT,RF,US,XA), 50.8GB size |
| 23 | Paraganglioma & Pheochromocytoma (TCGA-PCPG) | 179 | Cancer Cell 2017 (173) | 169 Cases (TCGA-PCPG) | 169 Cases (TCGA-PCPG) | | 385 svs | |
| 24 | Prostate Adenocarcinoma | 500 | Cell 2015 (333) | 469 Cases (TCGA-PRAD) | 469 Cases (TCGA-PRAD) | | 1,172 svs | TCGA-PRAD: 16,790 imgs (CT,PT,MR), 3.74GB size |
| 25 | Sarcoma | 261 | Cell 2017 (206) | 255 Cases (TCGA-SARC) | 255 Cases (TCGA-SARC) | | 890 svs | TCGA-SARC: 5,653 imgs (CT,MR), 2.8GB size |
| 26 | Skin Cutaneous Melanoma (TCGA-SKCM, CPTAC-CM) | 470 | Cell 2015 (331) | 469 Cases (TCGA-SKCM) | 469 Cases (TCGA-SKCM) | | 950 svs, 404 svs | From CPTAC-CM: 32,103 imgs (CT,MR,CR,PT), 14GB size |
| 27 | Testicular Germ Cell Cancer (TCGA-TGCT) | 150 | Cell Reports 2018 (137) | 150 Cases (TCGA-TGCT) | 150 Cases (TCGA-TGCT) | | 413 svs | |
| 28 | Thymoma (TCGA-THYM) | 124 | Cancer Cell 2018 (117) | 97 Cases (TCGA-THYM) | 97 Cases (TCGA-THYM) | | 318 svs | |
| 29 | Thyroid Papillary Carcinoma | 507 | Cell 2014 (496) | 473 Cases (TCGA-THCA) | 473 Cases (TCGA-THCA) | | 1,158 svs | TCGA-THCA: 2,780 imgs (CT,PET), 1.16GB size |
| 30 | Uterine Carcinosarcoma (TCGA-UCS) | 57 | Cancer Cell 2017 (57) | 57 Cases (TCGA-UCS) | 57 Cases (TCGA-UCS) | | 154 svs | |
| 31 | Uterine Corpus Endometrioid Carcinoma | 560 | Nature 2013 (373) | 542 Cases (TCGA-UCEC) | 542 Cases (TCGA-UCEC) | 104 Cases | 1,371 svs | TCGA-UCEC: 75,829 imgs (CT,CR,MR,PT), 36.1GB size |
| 32 | Uveal Melanoma (TCGA-UVM) | 80 | Cancer Cell 2017 (80) | 80 Cases (TCGA-UVM) | 80 Cases (TCGA-UVM) | | 150 svs | |
| 33 | Rectum Adenocarcinoma | | | 170 Cases (TCGA-READ) | 170 Cases (TCGA-READ) | | 530 svs | TCGA-READ: 1,796 imgs (CT,MR), 279MB size |
| 34 | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | | | | | | 103 svs, 246 svs | |
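The Radiology entries above follow a regular pattern: a collection name, an image count, a parenthesized list of modality codes, and a size in GB or MB. For readers who want these figures in structured form, the following is a minimal sketch of a parser for one such entry. `RadiologySummary` and `parse_radiology_cell` are hypothetical names introduced for illustration, and the conversion of MB to GB assumes decimal units (1 GB = 1000 MB).

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class RadiologySummary:
    collection: str
    image_count: int
    modalities: List[str]
    size_gb: float


# Matches entries such as "TCGA-BLCA: 111,781 imgs (CT,CR,MR,PT,DX), 58GB size".
_CELL_PATTERN = re.compile(
    r"(?P<collection>[A-Z]+-[A-Z]+):+\s*"
    r"(?P<count>[\d,]+)\s*imgs\s*"
    r"\((?P<modalities>[A-Z,\s]+)\),\s*"
    r"(?P<size>[\d.]+)\s*(?P<unit>GB|MB)"
)


def parse_radiology_cell(cell: str) -> RadiologySummary:
    """Parse one Radiology table entry into its component fields."""
    match = _CELL_PATTERN.search(cell)
    if match is None:
        raise ValueError(f"Unrecognized radiology entry: {cell!r}")
    size = float(match["size"])
    if match["unit"] == "MB":
        size /= 1000  # assumption: decimal units
    return RadiologySummary(
        collection=match["collection"],
        image_count=int(match["count"].replace(",", "")),
        modalities=[code.strip() for code in match["modalities"].split(",")],
        size_gb=size,
    )
```

Because the pattern is searched rather than anchored, entries with a leading "From" (as in the CPTAC rows) parse the same way as the TCGA rows.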

Other Sources of Data

| Organ | Disease | Name | Access | Images | Reference |
| --- | --- | --- | --- | --- | --- |
| Multiple | Multi | UKBiobank | RC | MRI, DXA | https://www.ukbiobank.ac.uk/ |
| Multiple | Multi | Grand-Challenges | OA | Multi-domain | https://grand-challenge.org |
| Multiple | Multi | Kaggle | OA | Multi-domain | https://www.kaggle.com |
| Multiple | Multi | VISCERAL: Visual Concept Extraction Challenge in Radiology | RC | Multi-domain | http://www.visceral.eu/benchmarks |
| Multiple | Multi | Medical Segmentation Decathlon | OA/RC | CT, MRI | http://medicaldecathlon.com |
| Brain | Multi | OpenNeuro | OA/RC | Multi-domain | https://openneuro.org |
| Brain | Multi | Image and Data Archive (IDA) | OA/RC | s/f/dMRI, CT/PET/SPECT | https://ida.loni.usc.edu |
| Brain | Normal, dementia, Alzheimer’s | OASIS Brains Dataset | OA | MRI | https://www.oasis-brains.org |
| Brain | Multi | NITRC: NeuroImaging Tools and Resources Collaboratory | OA | s/fMRI | https://nitrc.org |
| Brain | TBI | The Federal Interagency TBI Research (FITBIR) | RC | MRI, PET, Contrast | https://fitbir.nih.gov |
| Brain | TBI, Stroke | CQ500 | OA/RC | CT | http://headctstudy.qure.ai/dataset |
| Brain | Multi | NDA | RC | MRI | https://nda.nih.gov |
| Brain | Multi | Connectome | RC | sMRI, fMRI | https://www.humanconnectome.org |
| Breast | Cancer screening | MIAS mini-database | OA | MG, US | http://peipa.essex.ac.uk/info/mias.html |
| Breast | Cancer screening | BCDR | RC | MG, US | https://bcdr.eu |
| Breast | Cancer | DDSM | OA | MG | http://www.eng.usf.edu/cvprg/Mammography/Database.html |
| Breast | Cancer | OMI-DB | RC | MG | https://medphys.royalsurrey.nhs.uk/omidb |
| Breast | Cancer | INbreast | OA/RC | MG | http://medicalresearch.inescporto.pt/breastresearch/index.php/Get_INbreast_Database |
| Cardiac | Clinical routine care | EchoNet-Dynamic | OA/RC | Echocardiogram videos | https://echonet.github.io/dynamic |
| Cardiac | Multi-abnormal | CAMUS project | OA/RC | Echocardiogram | https://www.creatis.insa-lyon.fr/Challenge/camus |
| Cardiac | Multi | EuCanShare | RC | MRI | http://www.eucanshare.eu |
| Cardiac | Multi | Cardiac Atlas Project | OA/RC | MRI | http://www.cardiacatlas.org |
| Full body | Healthy, unknown | Visible Human Project (VHP) | OA | CT, MRI | https://www.nlm.nih.gov/research/visible |
| Lung | Thorax | NIH Chest X-ray (NIHCC) | OA | X-ray | https://nihcc.app.box.com/v/ChestXray-NIHCC |
| Lung | Multi | Cornell Engineering: Vision and Image Analysis lab | OA | CT | http://www.via.cornell.edu/databases |
| Lung | COVID19 | MosMedData | OA | CT | https://mosmed.ai/en |
| Lung | COVID19 | COVID-19 CT segmentation | OA | CT | http://medicalsegmentation.com/covid19 |
| Lung | COVID19 | BIMCV COVID-19 | OA | CT, CXR | https://github.com/BIMCV-CSUSP/BIMCV-COVID-19 |
| Lung | COVID19 | COVID-19 Image Data Collection | OA | CT, CXR | https://github.com/ieee8023/covid-chestxray-dataset, https://josephpcohen.com/w/public-covid19-dataset/ |
| Lung | COVID19 | COVID-19 Chest X-ray Dataset Initiative | OA | CXR | https://github.com/agchung/Figure1-COVID-chestxray-dataset |
| Retina | Multi | STARE: Structured Analysis of the Retina | OA | Retinal fundus | http://cecas.clemson.edu/~ahoover/stare |
| Retina | Diabetes | CHASE_DB1 | OA | Retinal fundus | https://blogs.kingston.ac.uk/retinal/chasedb1 |
| Retina | Diabetes | High-Resolution Fundus (HRF) Image Database | OA | Retinal fundus | https://www5.cs.fau.de/research/data/fundus-images |
| Skin | Lesion | International Skin Imaging Collaboration (ISIC) | OA | Digital images | https://www.isic-archive.com |

Abbreviations

| Ser | Abbreviation | Full Name |
| --- | --- | --- |
| 1 | NM | Nuclear Medicine |
| 2 | CT | Computerized Tomography |
| 3 | CR | Computed Radiography |
| 4 | PET, PT | Positron Emission Tomography |
| 5 | MR | Magnetic Resonance |
| 6 | MG | Mammography |
| 7 | DX | Digital Radiography |
| 8 | RF | Radio Fluoroscopy |
| 9 | US | Ultrasound |
| 10 | XA | X-Ray Angiography |
| 11 | RTDOSE | Radiotherapy Dose |
| 12 | RTSTRUCT | Radiotherapy Structure Set |
| 13 | RTPLAN | Radiotherapy Plan |
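The abbreviations above can be kept as a lookup table when working with the modality lists in the NCI table. The following is a minimal sketch: the names `MODALITY_NAMES`, `MODALITY_ALIASES`, and `expand_modalities` are hypothetical, and treating PET as an alias of PT is an assumption based on the two codes sharing one row in the list.

```python
from typing import List

# Modality codes and full names, copied from the abbreviations list.
MODALITY_NAMES = {
    "NM": "Nuclear Medicine",
    "CT": "Computerized Tomography",
    "CR": "Computed Radiography",
    "PT": "Positron Emission Tomography",
    "MR": "Magnetic Resonance",
    "MG": "Mammography",
    "DX": "Digital Radiography",
    "RF": "Radio Fluoroscopy",
    "US": "Ultrasound",
    "XA": "X-Ray Angiography",
    "RTDOSE": "Radiotherapy Dose",
    "RTSTRUCT": "Radiotherapy Structure Set",
    "RTPLAN": "Radiotherapy Plan",
}

# Assumption: PET and PT denote the same modality.
MODALITY_ALIASES = {"PET": "PT"}


def expand_modalities(codes: str) -> List[str]:
    """Expand a comma-separated code list such as "CT,PET,NM" into full names.

    Codes that are not in the lookup table are returned unchanged.
    """
    names = []
    for raw_code in codes.split(","):
        code = raw_code.strip().upper()
        code = MODALITY_ALIASES.get(code, code)
        names.append(MODALITY_NAMES.get(code, code))
    return names
```

Returning unknown codes unchanged, rather than raising an error, keeps the helper usable on entries that contain codes not covered by the list.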