前言
TCGAbiolinks
是一个利用 GDC API
接口来查询、下载和分析 TCGA
数据库的数据的 R
包
TCGAbiolinks
包的功能主要可以分为三大块:
- 数据查询和下载
- 数据的常规分析
- 可视化
该包可以从 Bioconductor
上安装稳定版本
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
或者从 GitHub
上安装开发版本
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
数据查询
TCGAbiolinks
提供了一些函数用于查询和下载 GDC
中的数据,包括:
-
Harmonized
:这部分数据都比较新,使用的是GRCh38
(hg38
) 基因组版本,使用的是GDC pipeline
来处理数据 -
Legacy
:这部分的数据应该是较早之前测的,使用的是GRCh37
(hg19
) 基因组版本
使用 GDCquery
函数来查询 GDC
的数据,该函数的参数为:
GDCquery(
project,
data.category,
data.type,
workflow.type,
legacy = FALSE,
access,
platform,
file.type,
barcode,
data.format,
experimental.strategy,
sample.type
)
-
project
:该参数的取值非常多,可以使用如下命令来查询所有可用的项目
> TCGAbiolinks:::getGDCprojects()$project_id
[1] "GENIE-MSK" "TCGA-UCEC" "TCGA-LGG" "TCGA-SARC"
[5] "TCGA-PAAD" "TCGA-ESCA" "TCGA-PRAD" "GENIE-VICC"
[9] "TCGA-LAML" "TCGA-KIRC" "TCGA-PCPG" "TCGA-HNSC"
[13] "GENIE-JHU" "TCGA-OV" "TCGA-GBM" "TCGA-UCS"
[17] "TCGA-MESO" "TCGA-TGCT" "TCGA-KICH" "TCGA-READ"
[21] "TCGA-UVM" "TCGA-THCA" "OHSU-CNL" "GENIE-DFCI"
[25] "GENIE-NKI" "GENIE-GRCC" "FM-AD" "GENIE-UHN"
[29] "GENIE-MDA" "TCGA-LIHC" "TCGA-THYM" "TCGA-CHOL"
[33] "TARGET-ALL-P1" "ORGANOID-PANCREATIC" "TCGA-DLBC" "TCGA-KIRP"
[37] "TCGA-BLCA" "TARGET-ALL-P3" "TARGET-CCSK" "TARGET-NBL"
[41] "TARGET-AML" "TARGET-ALL-P2" "NCICCR-DLBCL" "CTSP-DLBCL1"
[45] "TARGET-RT" "TARGET-OS" "TCGA-BRCA" "TCGA-COAD"
[49] "TCGA-CESC" "TCGA-LUSC" "TCGA-STAD" "TCGA-SKCM"
[53] "TCGA-LUAD" "TARGET-WT" "TCGA-ACC" "BEATAML1.0-CRENOLANIB"
[57] "BEATAML1.0-COHORT" "VAREPOP-APOLLO" "MMRF-COMMPASS" "WCDT-MCRPC"
[61] "CGCI-HTMCP-CC" "CMI-ASC" "CMI-MPC" "CMI-MBC"
[65] "CPTAC-2" "CGCI-BLGSP" "CPTAC-3" "HCMI-CMDC"
-
data.category
:可以使用如下方式来查询TCGA-BRCA
项目的可用的分类数据
> TCGAbiolinks:::getProjectSummary("TCGA-BRCA")
$file_count
[1] 33766
$data_categories
file_count case_count data_category
1 4679 1098 Sequencing Reads
2 1183 1098 Clinical
3 6627 1098 Copy Number Variation
4 5315 1098 Biospecimen
5 1234 1095 DNA Methylation
6 6080 1097 Transcriptome Profiling
7 8648 1044 Simple Nucleotide Variation
$case_count
[1] 1098
$file_size
[1] 5.707687e+13
对于 harmonized
数据,主要包含 7
个分类:
- Biospecimen
- Clinical
- Copy Number Variation
- DNA Methylation
- Sequencing Reads
- Simple Nucleotide Variation
- Transcriptome Profiling
而 legacy
包含的分类数据有:
- Biospecimen
- Clinical
- Copy number variation
- DNA methylation
- Gene expression
- Protein expression
- Raw microarray data
- Raw sequencing data
- Simple nucleotide variation
-
sample.type
:样本类型,取值可以是三列中的任意一个值
至于其他参数,对于 harmonized
数据,其参数取值有
参数列表的详细取值,可以从下面的网址中下载得到:https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=2046985454
对于 legacy
数据,其参数取值有
文件链接:https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=1817673686
示例
1. 查询 harmonized 数据
1.1 甲基化数据:复发肿瘤样本
在这个例子中,我们在 harmonized
数据库中搜索复发性 GBM
样本的甲基化数据
query <- GDCquery(
project = c("TCGA-GBM"),
data.category = "DNA Methylation",
legacy = FALSE,
platform = c("Illumina Human Methylation 450"),
sample.type = "Recurrent Tumor"
)
data <- getResults(query)
getResults()
用于获取查询结果,可以设置 rows
和 cols
参数来设置返回的行和列
如果是使用 RStudio
,可以使用 View
(data
) 来查看数据,也可以使用 DT
包的 datatable
来展示数据
结果中包含了 13
个样本的信息
1.2 查询甲基化和表达数据
从 harmonized
数据库中分别获取所有包含甲基化数据(450k
)和表达数据的乳腺癌患者信息
query.met <- GDCquery(
project = "TCGA-BRCA",
data.category = "DNA Methylation",
legacy = FALSE,
platform = c("Illumina Human Methylation 450")
)
query.exp <- GDCquery(
project = "TCGA-BRCA",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - FPKM"
)
# 获取既检测甲基化又检测表达的样本
common.patients <- intersect(
substr(getResults(query.met, cols = "cases"), 1, 12),
substr(getResults(query.exp, cols = "cases"), 1, 12)
)
> length(common.patients)
[1] 786
共有 786
个样本,可以设置 barcode
参数,只选择部分样本的数据
query.met <- GDCquery(
project = "TCGA-BRCA",
data.category = "DNA Methylation",
legacy = FALSE,
platform = c("Illumina Human Methylation 450"),
barcode = common.patients[1:5]
)
1.3 查询原始数据
从 harmonized
数据库中查询乳腺癌原始测序数据(bam
)
query <- GDCquery(
project = c("TCGA-BRCA"),
data.category = "Sequencing Reads",
sample.type = "Primary Tumor"
)
> getResults(query, rows = 1:10, cols = c("file_name","cases"))
file_name cases
1 00925769611d88cb0379798246946580_gdc_realn.bam TCGA-B6-A409-01A-11D-A243-09
2 6a1209cf-d93c-4ea2-ac97-1b2871ebbeb1_gdc_realn_rehead.bam TCGA-AR-A24K-01A-11R-A169-07
3 TCGA-A8-A08S-01A-11R-A035-13_mirna_gdc_realn.bam TCGA-A8-A08S-01A-11R-A035-13
4 TCGA-BH-A0W5-01A-11R-A108-13_mirna_gdc_realn.bam TCGA-BH-A0W5-01A-11R-A108-13
5 92e895d1-b891-4f21-8422-b175972b11fa_gdc_realn_rehead.bam TCGA-AR-A24Z-01A-11R-A169-07
6 ed0d4464ea1b10742473d7a1187ce3b4_gdc_realn.bam TCGA-E9-A54X-01A-11D-A25Q-09
7 1c4097b9-10d2-488e-b03e-f2ba9d78317d_gdc_realn_rehead.bam TCGA-AN-A04C-01A-21R-A034-07
8 TCGA-A7-A26J-01A-11R-A168-13_mirna_gdc_realn.bam TCGA-A7-A26J-01A-11R-A168-13
9 37cfd42c-9043-44b6-b4f7-a8ad774c3242_gdc_realn_rehead.bam TCGA-EW-A1PG-01A-11R-A144-07
10 TCGA-B6-A0IE-01A-11W-A050-09_IlluminaGA-DNASeq_exome_gdc_realn.bam TCGA-B6-A0IE-01A-11W-A050-09
2. 查询 Legacy 数据
2.1 甲基化数据
- 芯片数据
下面这个例子,用于查询多形性胶质母细胞瘤(GMB
)和低级别胶质瘤(LGG
) Illumina Human Methylation 450
和 Illumina Human Methylation 27
平台的 DNA
甲基化数据
query <- GDCquery(
project = c("TCGA-GBM","TCGA-LGG"),
legacy = TRUE,
data.category = "DNA methylation",
platform = c("Illumina Human Methylation 450", "Illumina Human Methylation 27")
)
> getResults(query, rows = 1:10, cols = 1:3)
id data_format access
1 12c1cd18-151f-4324-9bc6-a261549122d9 TXT open
2 792b642b-92e2-4307-bdf5-8cd769a30aa4 TXT open
3 5547dde5-11dd-478a-b3a6-1f5d6487bdf8 TXT open
4 b657702f-92e2-4f27-af97-781e0b3a57fe TXT open
5 275d3ee1-812a-4d08-aa7b-81fccff6fe28 TXT open
6 f4d64cd6-32e1-466c-b852-6d91f3a3c135 TXT open
7 229d11df-5587-4fc7-9f37-b1273071c43d TXT open
8 8f3bf221-d738-4850-aa00-ca1e0d10a7e7 TXT open
9 3aa79265-61ef-48bf-b521-4bf38782d830 TXT open
10 3c15f175-6e31-49da-aab7-6aa91651b1fb TXT open
- 全基因组亚硫酸氢盐测序(
WGBS
)
获取肺腺癌 WGBS
不同类型的文件
query <- GDCquery(
project = c("TCGA-LUAD"),
legacy = TRUE,
data.category = "DNA methylation",
data.type = "Methylation percentage",
experimental.strategy = "Bisulfite-Seq"
)
# VCF 文件,受控的文件
query <- GDCquery(
project = c("TCGA-LUAD"),
legacy = TRUE,
data.category = "DNA methylation",
data.type = "Bisulfite sequence alignment",
experimental.strategy = "Bisulfite-Seq"
)
# WGBS BAM 文件,受控的文件
query <- GDCquery(
project = c("TCGA-LUAD"),
legacy = TRUE,
data.type = "Aligned reads",
data.category = "Raw sequencing data",
experimental.strategy = "Bisulfite-Seq"
)
2.2 表达数据
- 获取乳腺癌标准化后的基因表达数据
query.exp.hg19 <- GDCquery(
project = "TCGA-BRCA",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
legacy = TRUE
)
> getResults(query.exp.hg19, rows = 1:10, cols = 1:4)
id data_format access cases
1 2819c29c-9233-4b9f-9adf-fc30765ee818 TXT open TCGA-GM-A2DN-01A-11R-A180-07
2 2fa430d7-7c33-405f-8570-4779d916c008 TXT open TCGA-D8-A1XR-01A-11R-A14M-07
3 51166b6c-4e3a-45fb-991f-3472fe9e64f7 TXT open TCGA-E2-A106-01A-11R-A10J-07
4 b2f68fce-3269-4c77-9ac9-01864172e7ee TXT open TCGA-D8-A1XZ-01A-11R-A14M-07
5 4307937d-70ca-4d3c-8a14-449d8434ef57 TXT open TCGA-A2-A3XZ-01A-42R-A239-07
6 8281ecbc-c5aa-410b-a707-5463c12c6ef6 TXT open TCGA-D8-A1XC-01A-11R-A14D-07
7 63b6a817-5179-4983-8226-01c8951b307e TXT open TCGA-D8-A27H-01A-11R-A16F-07
8 f9200920-0b78-492b-a764-bc30dec7c69e TXT open TCGA-A7-A13E-01A-11R-A12P-07
9 f0376900-a389-40d0-81c7-cf42da635013 TXT open TCGA-E9-A1RE-01A-11R-A157-07
10 98f6130b-2025-4c95-b650-97f67b87e453 TXT open TCGA-B6-A0IO-01A-11R-A034-07
- 使用
getManifest()
函数获取查询的manifest
文件
> getManifest(query.exp.hg19, save = FALSE)[1:6, 1]
[1] "2819c29c-9233-4b9f-9adf-fc30765ee818" "2fa430d7-7c33-405f-8570-4779d916c008"
[3] "51166b6c-4e3a-45fb-991f-3472fe9e64f7" "b2f68fce-3269-4c77-9ac9-01864172e7ee"
[5] "4307937d-70ca-4d3c-8a14-449d8434ef57" "8281ecbc-c5aa-410b-a707-5463c12c6ef6"
数据下载与处理
TCGAbiolinks
提供了一些函数,用于下载和处理来自 GDC
数据库的数据,并用于后续的分析
- 数据下载
GDCdownload
下载 GDC
数据的方法有两种:
-
client
:该方法会创建MANIFEST
文件并使用GDC Data Transfer Tool
来下载数据,但是速度上会比使用API
更慢 -
api
:该方法使用GDC API
接口来下载数据,也会创建MANIFEST
文件并将数据保存为tar.gz
文件,如果数据过大,会将数据拆分为小块分别下载,files.per.chunk
参数可以控制同时下载的文件个数,减少下载错误
参数列表
- 数据处理
GDCprepare
:
使用 GDCprepare
处理数据后,会返回 SummarizedExperiment
或 data.frame
对象,主要包含三个主要的矩阵,可以使用 SummarizedExperiment
包来获取对应的信息:
- 样本信息矩阵,使用
colData(data)
来获取样本信息 - 检测矩阵信息,使用
assay(data)
来获取分子数据 - 特征矩阵信息,使用
rowRanges(data)
来获取特征元数据信息,如基因的信息
参数列表
1. 简单示例
1.1 下载 legacy 中的数据
我们从 legacy
(参考基因组为 hg19
)数据库中下载基因表达数据
query <- GDCquery(
project = "TCGA-GBM",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
barcode = c(
"TCGA-14-0736-02A-01R-2005-01",
"TCGA-06-0211-02A-02R-2005-01"
),
legacy = TRUE
)
GDCdownload(query, method = "api", files.per.chunk = 10)
data <- GDCprepare(query)
获取样本信息
> # 先导入包
> library(SummarizedExperiment)
> colData(data)[,1:4]
DataFrame with 2 rows and 4 columns
barcode patient sample shortLetterCode
<character> <character> <character> <character>
TCGA-14-0736-02A-01R-2005-01 TCGA-14-0736-02A-01R.. TCGA-14-0736 TCGA-14-0736-02A TR
TCGA-06-0211-02A-02R-2005-01 TCGA-06-0211-02A-02R.. TCGA-06-0211 TCGA-06-0211-02A TR
获取表达数据
> assay(data)[1:6,]
TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
A1BG 798.1285 402.2892
A2M 11681.6573 33835.5229
NAT1 164.5998 70.9712
NAT2 5.6370 4.2689
RP11-986E7.7 206399.0981 45121.6649
AADAC 1.1274 0.5336
获取基因信息
> rowRanges(data)
GRanges object with 19947 ranges and 3 metadata columns:
seqnames ranges strand | gene_id entrezgene ensembl_gene_id
<Rle> <IRanges> <Rle> | <character> <integer> <character>
A1BG chr19 58856544-58864865 - | A1BG 1 ENSG00000121410
A2M chr12 9220260-9268825 - | A2M 2 ENSG00000175899
NAT1 chr8 18027986-18081198 + | NAT1 9 ENSG00000171428
NAT2 chr8 18248755-18258728 + | NAT2 10 ENSG00000156006
RP11-986E7.7 chr14 95058395-95090983 + | RP11-986E7.7 12 ENSG00000273259
... ... ... ... . ... ... ...
RASAL2-AS1 chr1 178060643-178063119 - | RASAL2-AS1 100302401 ENSG00000224687
LINC00882 chr3 106555658-106959488 - | LINC00882 100302640 ENSG00000242759
FTX chrX 73183790-73513409 - | FTX 100302692 ENSG00000230590
TICAM2 chr5 114914339-114961876 - | TICAM2 100302736 ENSG00000243414
SLC25A5-AS1 chrX 118599997-118603061 - | SLC25A5-AS1 100303728 ENSG00000224281
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
1.2 从 harmonized 中下载数据
我们从 harmonized
数据库中下载基因表达定量信息
query <- GDCquery(
project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - FPKM-UQ",
barcode = c(
"TCGA-14-0736-02A-01R-2005-01",
"TCGA-06-0211-02A-02R-2005-01"
)
)
GDCdownload(query)
data <- GDCprepare(query)
获取表达矩阵
> assay(data)[1:6,]
TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
ENSG00000000003 1027433.83 424382.67
ENSG00000000005 22979.34 2481.99
ENSG00000000419 810469.94 738697.92
ENSG00000000457 46320.62 39986.17
ENSG00000000460 26077.41 31642.57
ENSG00000000938 89581.99 98203.47
由于 GDCprepare()
还在开发中,并不能在所有情况下都能有效处理数据,对于数据的支持主要有
对于 Harmonized
数据
对于 Legacy
数据
- 获取
CNV
数据
query <- GDCquery(
project = "TCGA-GBM",
data.category = "Copy Number Variation",
data.type = "Copy Number Segment",
barcode = c(
"TCGA-02-0089-01A-01D-0193-01",
"TCGA-81-5911-10A-01D-1842-01"
)
)
GDCdownload(query)
data <- GDCprepare(query)
- 获取蛋白质表达数据
query <- GDCquery(
project = "TCGA-GBM",
data.category = "Protein expression",
legacy = TRUE,
barcode = c(
"TCGA-OX-A56R-01A-21-A44T-20",
"TCGA-08-0357-01A-21-1898-20"
)
)
GDCdownload(query)
data <- GDCprepare(
query,
save = TRUE,
save.filename = "gbmProteinExpression.rda",
remove.files.prepared = TRUE
)
miRNA
query.mirna <- GDCquery(
project = "TARGET-AML",
experimental.strategy = "miRNA-Seq",
data.category = "Transcriptome Profiling",
barcode = c("TARGET-20-PATDNN", "TARGET-20-PAPUNR"),
data.type = "miRNA Expression Quantification"
)
GDCdownload(query.mirna)
mirna <- GDCprepare(
query = query.mirna
)