Pan-Cancer Datasets
This is a collection of pan-cancer datasets spanning several modalities: medical and clinical records, radiology (CT, MRI, PET), pathology (H&E and IHC), and omics data (genomics and proteomics). The collection is non-exhaustive and is updated periodically. Its purpose is to give the cancer research community a unified view of the resources available for studying various cancer sites, organs, and modalities. We aim to use these resources in our ongoing research and in the fight against cancer.
The list is compiled primarily from data portals under the NIH National Cancer Institute (NCI): The Cancer Imaging Archive (TCIA), the Genomic Data Commons (GDC) portal for The Cancer Genome Atlas (TCGA), and the Proteomic Data Commons (PDC) portal for the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Below is a summary of the datasets available at these portals.
- TCGA molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
- Joint effort between NCI and the National Human Genome Research Institute.
- Over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data.
- Publicly available for research use.
- Genomics data available at the Genomic Data Commons (GDC) portal, open access.
- Imaging data available at The Cancer Imaging Archive (TCIA), open access. The radiology and histopathology data of TCIA can be accessed and downloaded through the following portals:
- Proteomics data is available through the Proteomic Data Commons (PDC) under the Clinical Proteomic Tumor Analysis Consortium (CPTAC) program.
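
For readers who prefer scripted access to the GDC over the web portal, the following is a minimal sketch of how a file query could be assembled. It only builds the request URL and sends nothing over the network. The endpoint and the field names (`cases.project.project_id`, `data_format`) follow the public GDC API as we understand it, and should be checked against the current GDC documentation; the function name `build_gdc_file_query` is hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed public GDC endpoint for file metadata searches.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_file_query(project_id, data_format, size=10):
    """Build a URL asking for files of one format within one project.

    No request is sent; the caller decides how and when to fetch the URL.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "data_format",
                         "value": [data_format]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),          # filters travel as a JSON string
        "fields": "file_id,file_name,file_size",  # metadata columns to return
        "format": "JSON",
        "size": str(size),                        # maximum number of hits
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


# Example: whole slide images (SVS) for the TCGA-ACC project.
url = build_gdc_file_query("TCGA-ACC", "SVS", size=5)
query = parse_qs(urlsplit(url).query)  # decode again to inspect what was built
print(url)
```

The resulting URL can be passed to any HTTP client. Controlled-access data additionally requires an authentication token, which this sketch does not cover.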
Below we first present the National Cancer Institute (NCI) data modalities, followed by a table of cancer sites (34 entries) with their corresponding datasets, primary publications, number of cases, and modalities. The data modalities are grouped as clinical, copy number, DNA, imaging, and miRNA, mRNA, and protein expression. A second table presents non-NCI dataset resources available for public access. Lastly, we list the abbreviations used in this compilation.
Data Modalities
- Clinical
  - Clinical data
    - Available for all cancer types
    - May include demographic information, treatment information, survival data, etc.
    - XML (per patient), tab-delimited TXT
    - Additional information in the Clinical Data Elements (CDE) Browser
  - Biospecimen data
    - Available for all cancer types
    - Information on how samples were processed by the Biospecimen Core Resource Center
    - XML (per patient), tab-delimited TXT
    - Additional information in the Clinical Data Elements (CDE) Browser
  - Pathology Reports
    - Available for all cancer types
    - Pathology reports (for select cases)
    - PDF format
- Copy Number
  - SNP microarray
  - Copy number microarray
    - Available for GBM, OV, LUSC
    - Tab-delimited TXT (raw signals per probe), tab-delimited TSV (normalized values per aggregated region), MAT
    - Probe information contained in array design files for each platform
  - DNA Sequencing
    - Available for some tumor types
    - Low-pass whole genome sequencing of tumor and matched normal samples, with analysis of differences in read counts between tumor and normal
    - Tab-delimited TSV (normal vs. tumor cells)
- DNA
  - Whole exome
    - Available for all cancer types
    - Whole exome sequencing of tumor and matched normal samples
    - VCF, MAF (mutation calls)
  - Whole genome
    - Available for all cancer types
    - Whole genome sequencing of tumor and matched normal samples (for select cases)
    - VCF, MAF (mutation calls)
  - SNP microarray
    - Available for all cancer types
    - Tab-delimited TXT (genotypes per SNP)
- Imaging
  - Diagnostic image
    - Available for all cancer types
    - Whole slide images of tissue used to diagnose the participant
    - SVS
    - Available at the GDC, open access
  - Tissue image
    - Available for all cancer types
    - Whole slide images of tissue samples from each participant that were used for TCGA analyses
    - SVS
    - Available at the GDC, open access
  - Radiological image
    - Available for some cancer types
    - Pre-surgical radiological imaging, e.g., MRI, CT, PET (for select cases)
    - DCM or DICOM format
    - Available at The Cancer Imaging Archive, open access
- miRNA, mRNA, and Protein Expression
  - miRNA Sequencing
    - Available for all cancer types except GBM
    - miRNA sequencing of tumor samples
    - Tab-delimited TXT (normalized expression values per miRNA or isoform)
  - Array-based
    - Available for GBM, OV cancer types
    - TXT (raw signals per probe, normalized expression values per probe, gene, or exon)
    - Probe information contained in array design files for each platform
  - mRNA Sequencing
    - Available for all cancer types
    - mRNA sequencing of tumor samples using a poly(A) enrichment RNA preparation
    - TXT (normalized expression values per gene, isoform, exon, or splice junction)
    - Labeled as RNASeqV1 and RNASeqV2
  - Total RNA Sequencing
    - Available for some cancer types
    - mRNA sequencing of tumor samples using a ribosomal depletion RNA preparation
    - TXT (normalized expression values per gene, isoform, exon, or splice junction)
    - Labeled as TotalRNASeqV2
  - Microarray
    - Available for BRCA, COAD, GBM, KIRC, KIRP, LAML, LGG, LUAD, LUSC, OV, READ, UCEC cancer types
    - TXT (raw signals per probe, normalized expression values per probe, gene, or exon)
    - Probe information contained in array design files for each platform
  - Reverse-Phase Protein Array
    - Available for all cancer types
    - High-resolution images of protein array slides (up to 1000 participant tumor samples per slide) and raw signals per slide
    - TIFF, tab-delimited TXT (signal values, dilution curves, normalized expression values)
NIH / NCI-hosted Datasets
| Ser | Cancer Site | #Cases | Primary Publication (#Cases Studied) | Clinical | Genomics | Proteomics | Pathology | Radiology |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Acute Myeloid Leukemia (TCGA-LAML, CPTAC-AML) | 200 | NEJM 2013 (200) | 135 Cases (TCGA-LAML) | 135 Cases (TCGA-LAML) | 41 Cases | 120 svs | ❌ |
| 2 | Adrenocortical Carcinoma (TCGA-ACC) | 92 | Cancer Cell 2016 (91) | 92 Cases (TCGA-ACC) | 92 Cases (TCGA-ACC) | ❌ | 323 svs | ❌ |
| 3 | Bladder Urothelial Carcinoma | 412 | Nature 2014, Cell 2017 (123) | 408 Cases (TCGA-BLCA) | 408 Cases (TCGA-BLCA) | ❌ | 926 svs | TCGA-BLCA: 111,781 imgs (CT,CR,MR,PT,DX), 58GB size |
| 4 | Breast Ductal Carcinoma | 778 | Nature 2012 (430) | 1036 Cases (TCGA-BRCA) | 1036 Cases (TCGA-BRCA) | ❌ | 3,111 svs | TCGA-BRCA: 230,167 imgs (MR,MG,CT), 88GB size |
| 5 | Breast Lobular Carcinoma | 201 | Cell 2015 (127) | 1036 Cases (TCGA-BRCA) | 1036 Cases (TCGA-BRCA) | ❌ | 3,111 svs | TCGA-BRCA: 230,167 imgs (MR,MG,CT), 88GB size |
| 6 | Cervical Carcinoma | 307 | Nature 2017 (228) | 305 Cases (TCGA-CESC) | 305 Cases (TCGA-CESC) | ❌ | 604 svs | TCGA-CESC: 19,135 imgs (MR), 9.5GB size |
| 7 | Cholangiocarcinoma (TCGA-CHOL) | 51 | Cell Reports 2017 (38) | 355 Cases (TCGA-CHOL) | 355 Cases (TCGA-CHOL) | ❌ | 110 svs | ❌ |
| 8 | Colorectal Adenocarcinoma | 633 | Nature 2012 (276) | 458 Cases (TCGA-COAD) | 458 Cases (TCGA-COAD) | ❌ | 1,442 svs | TCGA-COAD: 8,387 imgs (CT), 4.5GB size |
| 9 | Esophageal Carcinoma | 185 | Nature 2017 (164) | 183 Cases (TCGA-ESCA) | 183 Cases (TCGA-ESCA) | ❌ | 396 svs | TCGA-ESCA: 20,593 imgs (CT), 11GB size |
| 10 | Stomach/Gastric Adenocarcinoma | 443 | Nature 2014 (295) | 437 Cases (TCGA-STAD) | 437 Cases (TCGA-STAD) | ❌ | 1,197 svs | TCGA-STAD: 43,908 imgs (CT), 23.3GB size |
| 11 | Glioblastoma Multiforme | 617 | Nature 2008, Cell 2013 (206) | 523 Cases (TCGA-GBM) | 523 Cases (TCGA-GBM) | 100 Cases | 2,053 svs | TCGA-GBM: 481,158 imgs (CT,MR,DX), 73.5GB size |
| 12 | Head and Neck Squamous Cell Carcinoma | 528 | Nature 2015 (279) | 523 Cases (TCGA-HNSC) | 523 Cases (TCGA-HNSC) | ❌ | 1,263 svs | TCGA-HNSC: 270,376 imgs (CT,MR,PET,RTDOSE,RTPLAN,RTSTRUCT), 130GB size |
| 13 | Liver Hepatocellular Carcinoma | 377 | Cell 2017 (363) | 375 Cases (TCGA-LIHC) | 375 Cases (TCGA-LIHC) | ❌ | 870 svs | TCGA-LIHC: 125,397 imgs (CT,MR,PT), 52.5GB size |
| 14 | Kidney Chromophobe Carcinoma | 113 | Cancer Cell 2014 (66) | 66 Cases (TCGA-KICH) | 66 Cases (TCGA-KICH) | ❌ | 326 svs | TCGA-KICH: 9,221 imgs (CT,MR), 4.2GB size |
| 15 | Kidney Clear Cell Carcinoma | 537 | Nature 2013 (446) | 523 Cases (TCGA-KIRC) | 523 Cases (TCGA-KIRC) | ❌ | 2,173 svs | TCGA-KIRC: 192,581 imgs (CT,MR), 91.6GB size |
| 16 | Kidney Papillary Cell Carcinoma | 291 | NEJM 2016 (161) | 289 Cases (TCGA-KIRP) | 289 Cases (TCGA-KIRP) | ❌ | 773 svs | TCGA-KIRP: 26,667 imgs (CT,MR,PT), 9.6GB size |
| 17 | Low Grade Glioma | 516 | NEJM 2015 (293) | 509 Cases (TCGA-LGG) | 509 Cases (TCGA-LGG) | ❌ | 1,572 svs | TCGA-LGG: 241,183 imgs (CT,MR), 42.8GB size |
| 18 | Lung Adenocarcinoma | 585 | Nature 2014, Nature Genetics 2016 (230) | 563 Cases (TCGA-LUAD) | 563 Cases (TCGA-LUAD) | 111 Cases | 1,608 svs | TCGA-LUAD: 48,931 imgs (CT,PT,NM), 18.3GB size |
| 19 | Lung Squamous Cell Carcinoma | 504 | Nature 2012, Nature Genetics 2016 (178) | 501 Cases (TCGA-LUSC) | 501 Cases (TCGA-LUSC) | 118 Cases | 1,612 svs | TCGA-LUSC: 36,518 imgs (CT,PET,NM), 14GB size |
| 20 | Mesothelioma (TCGA-MESO) | 74 | Cancer Discovery 2018 (87) | 85 Cases (TCGA-MESO) | 85 Cases (TCGA-MESO) | ❌ | 175 svs | ❌ |
| 21 | Ovarian Serous Adenocarcinoma | 608 | Nature 2011 (489) | 570 Cases (TCGA-OV) | 570 Cases (TCGA-OV) | ❌ | 1,481 svs | TCGA-OV: 53,662 imgs (CT), 28.3GB size |
| 22 | Pancreatic Ductal Adenocarcinoma (TCGA-PAAD, CPTAC-PDA) | 185 | Cancer Cell 2017 (150) | 173 Cases (TCGA-PAAD) | 173 Cases (TCGA-PAAD) | 166 Cases | 466 svs, 557 svs | From CPTAC-PDA: 105,546 imgs (CR,CT,MR,PT,RF,US,XA), 50.8GB size |
| 23 | Paraganglioma & Pheochromocytoma (TCGA-PCPG) | 179 | Cancer Cell 2017 (173) | 169 Cases (TCGA-PCPG) | 169 Cases (TCGA-PCPG) | ❌ | 385 svs | ❌ |
| 24 | Prostate Adenocarcinoma | 500 | Cell 2015 (333) | 469 Cases (TCGA-PRAD) | 469 Cases (TCGA-PRAD) | ❌ | 1,172 svs | TCGA-PRAD: 16,790 imgs (CT,PT,MR), 3.74GB size |
| 25 | Sarcoma | 261 | Cell 2017 (206) | 255 Cases (TCGA-SARC) | 255 Cases (TCGA-SARC) | ❌ | 890 svs | TCGA-SARC: 5,653 imgs (CT,MR), 2.8GB size |
| 26 | Skin Cutaneous Melanoma (TCGA-SKCM, CPTAC-CM) | 470 | Cell 2015 (331) | 469 Cases (TCGA-SKCM) | 469 Cases (TCGA-SKCM) | ❌ | 950 svs, 404 svs | From CPTAC-CM: 32,103 imgs (CT,MR,CR,PT), 14GB size |
| 27 | Testicular Germ Cell Cancer (TCGA-TGCT) | 150 | Cell Reports 2018 (137) | 150 Cases (TCGA-TGCT) | 150 Cases (TCGA-TGCT) | ❌ | 413 svs | ❌ |
| 28 | Thymoma (TCGA-THYM) | 124 | Cancer Cell 2018 (117) | 97 Cases (TCGA-THYM) | 97 Cases (TCGA-THYM) | ❌ | 318 svs | ❌ |
| 29 | Thyroid Papillary Carcinoma | 507 | Cell 2014 (496) | 473 Cases (TCGA-THCA) | 473 Cases (TCGA-THCA) | ❌ | 1,158 svs | TCGA-THCA: 2,780 imgs (CT,PET), 1.16GB size |
| 30 | Uterine Carcinosarcoma (TCGA-UCS) | 57 | Cancer Cell 2017 (57) | 57 Cases (TCGA-UCS) | 57 Cases (TCGA-UCS) | ❌ | 154 svs | ❌ |
| 31 | Uterine Corpus Endometrioid Carcinoma | 560 | Nature 2013 (373) | 542 Cases (TCGA-UCEC) | 542 Cases (TCGA-UCEC) | 104 Cases | 1,371 svs | TCGA-UCEC: 75,829 imgs (CT,CR,MR,PT), 36.1GB size |
| 32 | Uveal Melanoma (TCGA-UVM) | 80 | Cancer Cell 2017 (80) | 80 Cases (TCGA-UVM) | 80 Cases (TCGA-UVM) | ❌ | 150 svs | ❌ |
| 33 | Rectum Adenocarcinoma | ❌ | ❌ | 170 Cases (TCGA-READ) | 170 Cases (TCGA-READ) | ❌ | 530 svs | TCGA-READ: 1,796 imgs (CT,MR), 279MB size |
| 34 | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | ❌ | ❌ | ❌ | ❌ | ❌ | 103 svs, 246 svs | ❌ |
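
The Radiology column packs a collection name, image count, modality codes, and download size into one string. The sketch below shows one way such a cell could be split into fields, for example to total the download size before fetching several collections. The regular expression is written against the cell layout used in this table only, and the names `RADIOLOGY_CELL` and `parse_radiology_cell` are hypothetical.

```python
import re

# Pattern for cells such as
# "TCGA-BLCA: 111,781 imgs (CT,CR,MR,PT,DX), 58GB size".
RADIOLOGY_CELL = re.compile(
    r"(?P<collection>[A-Z]+-[A-Z]+):+\s*"   # collection id, e.g. TCGA-BLCA
    r"(?P<images>[\d,]+)\s+imgs\s+"         # image count with thousands commas
    r"\((?P<modalities>[^)]*)\),\s*"        # comma-separated modality codes
    r"(?P<size>[\d.]+)\s*(?P<unit>MB|GB)"   # size and unit
)

# Divisors converting each unit to gigabytes (decimal units assumed).
UNIT_DIVISOR = {"MB": 1000.0, "GB": 1.0}


def parse_radiology_cell(cell):
    """Return a dict of fields, or None when the cell holds no radiology data."""
    match = RADIOLOGY_CELL.search(cell)
    if match is None:
        return None
    return {
        "collection": match["collection"],
        "images": int(match["images"].replace(",", "")),
        "modalities": [m.strip() for m in match["modalities"].split(",")],
        "size_gb": float(match["size"]) / UNIT_DIVISOR[match["unit"]],
    }


cells = [
    "TCGA-BLCA: 111,781 imgs (CT,CR,MR,PT,DX), 58GB size",
    "TCGA-READ: 1,796 imgs (CT,MR), 279MB size",
    "❌",
]
parsed = [parse_radiology_cell(c) for c in cells]
total_gb = sum(p["size_gb"] for p in parsed if p is not None)
print(parsed[0]["collection"], parsed[0]["images"], round(total_gb, 3))
```

Cells marked ❌ return `None`, so they drop out of any aggregate without special handling.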
Other Sources of Data
| Organ | Disease | Name | Access | Images | Reference |
| --- | --- | --- | --- | --- | --- |
| Multiple | Multi | UKBiobank | RC | MRI, DXA | https://www.ukbiobank.ac.uk/ |
| Multiple | Multi | Grand-Challenges | OA | Multi-domain | https://grand-challenge.org |
| Multiple | Multi | Kaggle | OA | Multi-domain | https://www.kaggle.com |
| Multiple | Multi | VISCERAL: Visual Concept Extraction Challenge in Radiology | RC | Multi-domain | http://www.visceral.eu/benchmarks |
| Multiple | Multi | Medical Segmentation Decathlon | OA/RC | CT, MRI | http://medicaldecathlon.com |
| Brain | Multi | OpenNeuro | OA/RC | Multi-domain | https://openneuro.org |
| Brain | Multi | Image and Data Archive (IDA) | OA/RC | s/f/dMRI, CT/PET/SPECT | https://ida.loni.usc.edu |
| Brain | Normal, dementia, Alzheimer’s | OASIS Brains Dataset | OA | MRI | https://www.oasis-brains.org |
| Brain | Multi | NITRC: NeuroImaging Tools and Resources Collaboratory | OA | s/fMRI | https://nitrc.org |
| Brain | TBI | The Federal Interagency TBI Research (FITBIR) | RC | MRI, PET, Contrast | https://fitbir.nih.gov |
| Brain | TBI, Stroke | CQ500 | OA/RC | CT | http://headctstudy.qure.ai/dataset |
| Brain | Multi | NDA | RC | MRI | https://nda.nih.gov |
| Brain | Multi | Connectome | RC | sMRI, fMRI | https://www.humanconnectome.org |
| Breast | Cancer screening | MIAS mini-database | OA | MG, US | http://peipa.essex.ac.uk/info/mias.html |
| Breast | Cancer screening | BCDR | RC | MG, US | https://bcdr.eu |
| Breast | Cancer | DDSM | OA | MG | http://www.eng.usf.edu/cvprg/Mammography/Database.html |
| Breast | Cancer | OMI-DB | RC | MG | https://medphys.royalsurrey.nhs.uk/omidb |
| Breast | Cancer | INbreast | OA/RC | MG | http://medicalresearch.inescporto.pt/breastresearch/index.php/Get_INbreast_Database |
| Cardiac | Clinical routine care | EchoNet-Dynamic | OA/RC | Echocardiogram videos | https://echonet.github.io/dynamic |
| Cardiac | Multi-abnormal | CAMUS project | OA/RC | Echocardiogram | https://www.creatis.insa-lyon.fr/Challenge/camus |
| Cardiac | Multi | EuCanShare | RC | MRI | http://www.eucanshare.eu |
| Cardiac | Multi | Cardiac Atlas Project | OA/RC | MRI | http://www.cardiacatlas.org |
| Full body | Healthy, unknown | Visible Human Project (VHP) | OA | CT, MRI | https://www.nlm.nih.gov/research/visible |
| Lung | Thorax | NIH Chest X-ray (NIHCC) | OA | X-ray | https://nihcc.app.box.com/v/ChestXray-NIHCC |
| Lung | Multi | Cornell Engineering: Vision and Image Analysis lab | OA | CT | http://www.via.cornell.edu/databases |
| Lung | COVID19 | MosMedData | OA | CT | https://mosmed.ai/en |
| Lung | COVID19 | COVID-19 CT segmentation | OA | CT | http://medicalsegmentation.com/covid19 |
| Lung | COVID19 | BIMCV COVID-19 | OA | CT, CXR | https://github.com/BIMCV-CSUSP/BIMCV-COVID-19 |
| Lung | COVID19 | COVID-19 Image Data Collection | OA | CT, CXR | https://github.com/ieee8023/covid-chestxray-dataset https://josephpcohen.com/w/public-covid19-dataset/ |
| Lung | COVID19 | COVID-19 Chest X-ray Dataset Initiative | OA | CXR | https://github.com/agchung/Figure1-COVID-chestxray-dataset |
| Retina | Multi | STARE: Structured Analysis of the Retina | OA | Retinal fundus | http://cecas.clemson.edu/~ahoover/stare |
| Retina | Diabetes | CHASE_DB1 | OA | Retinal fundus | https://blogs.kingston.ac.uk/retinal/chasedb1 |
| Retina | Diabetes | High-Resolution Fundus (HRF) Image Database | OA | Retinal fundus | https://www5.cs.fau.de/research/data/fundus-images |
| Skin | Lesion | International Skin Imaging Collaboration (ISIC) | OA | Digital images | https://www.isic-archive.com |
Abbreviations
| Ser | Abbreviation | Full Name |
| --- | --- | --- |
| 1 | NM | Nuclear Medicine |
| 2 | CT | Computed Tomography |
| 3 | CR | Computed Radiography |
| 4 | PET, PT | Positron Emission Tomography |
| 5 | MR | Magnetic Resonance |
| 6 | MG | Mammography |
| 7 | DX | Digital Radiography |
| 8 | RF | Radio Fluoroscopy |
| 9 | US | Ultrasound |
| 10 | XA | X-Ray Angiography |
| 11 | RTDOSE | Radiotherapy Dose |
| 12 | RTSTRUCT | Radiotherapy Structure Set |
| 13 | RTPLAN | Radiotherapy Plan |
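
When working with the modality codes programmatically, a small lookup can turn a code string from the Radiology column into readable names. The following is an illustrative sketch only: the dictionary restates the abbreviations listed above, and the names `MODALITY_NAMES` and `expand_modalities` are hypothetical.

```python
# Modality codes from the abbreviations list above.
# PET and PT are two spellings of the same modality.
MODALITY_NAMES = {
    "NM": "Nuclear Medicine",
    "CT": "Computed Tomography",  # also written "Computerized Tomography"
    "CR": "Computed Radiography",
    "PET": "Positron Emission Tomography",
    "PT": "Positron Emission Tomography",
    "MR": "Magnetic Resonance",
    "MG": "Mammography",
    "DX": "Digital Radiography",
    "RF": "Radio Fluoroscopy",
    "US": "Ultrasound",
    "XA": "X-Ray Angiography",
    "RTDOSE": "Radiotherapy Dose",
    "RTSTRUCT": "Radiotherapy Structure Set",
    "RTPLAN": "Radiotherapy Plan",
}


def expand_modalities(codes):
    """Expand a comma-separated code string into unique names, keeping order.

    Codes missing from the lookup are passed through unchanged so that
    nothing is silently dropped.
    """
    names = []
    for code in codes.split(","):
        code = code.strip()
        if not code:
            continue
        name = MODALITY_NAMES.get(code, code)
        if name not in names:
            names.append(name)
    return names


print(expand_modalities("CT,MR,PET,RTDOSE,RTPLAN,RTSTRUCT"))
```

Because PET and PT map to the same name, a cell that mixes both spellings yields a single entry.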