GDAS002-Bioconductor备忘录


title: GDAS002-Bioconductor备忘录
date: 2019-09-02 12:0:00
type: "tags"
tags:

  • Bioconductor
    categories:
  • Genomics Data Analysis Series

前言

这篇笔记的主要内容是关于Bioconductor的一些备忘录(cheat sheet),作者总结的不错,我就按原文翻译了下来,由于一些知识点并没有学到,因此未翻译,后面学到后,再回头过来翻译。

安装

安装细节参考http://bioconductor.org/install/,具体代码如下所示:

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install()
BiocManager::install(c("package1","package2")
BiocManager::valid() # are packages up to date?

# what Bioc version is release right now?
http://bioconductor.org/bioc-version
# what Bioc versions are release/devel?
http://bioconductor.org/js/versions.js

R语言帮助

简单的帮忙如下所示:

?函数名
?"eSet-class" # 如果要查看类的帮助,需要后缀'-class'
help(package="foo",help_type="html") # 使用web浏览器来查看帮助文档
vignette("topic")
browseVignettes(package="package") # 某包的简单案例

高级用户的帮助功能

functionName # 直接输入函数名(不带括号),则显示函数代码
getMethod(method,"class")  # 显示某类的方法代码
selectMethod(method, "class") # 查看继承来的方法
showMethods(classes="class") # 显示某类的所有方法show all methods for class
methods(class="GRanges") # 此函数限于R >=3.2
?"functionName,class-method" # 用于查看S4对象的方法文档
?"plotMA,data.frame-method" # 需要加载geneplotter包,from library(geneplotter)
?"method.class" # 查看S3对象的方法 
?"plot.lm"
sessionInfo() # 获取R语言版本的相关帮助信息 
packageVersion("foo") # 查看某个包的版本

Bioconductor相关的论坛: https://support.bioconductor.org

如果你使用的Rstudio,则可以使用?help来查看帮忙文档,如果使用是命令行模式(例如Linux),则可以使用下面的代码通过浏览器来查看帮助文档(我使用的Rstudio,因此未尝试以下代码):

alias rhelp="Rscript -e 'args <- commandArgs(TRUE); help(args[2], package=args[3], help_type=\"html\"); Sys.sleep(5)' --args"

debugging R

traceback() # 查看哪一步导致了R错误 
# debug a function
debug(myFunction) # 逐行遍历函数中的代码 
undebug(myFunction) # 停止debugging
debugonce(myFunction) # 功能如上,但是不需要undebug()
# 在写代码的过程中,可以将函数browser()放到一个关系的节点上
# this plus devtools::load_all() can be useful for programming
# to jump in function on error:
options(error=recover)
# turn that behavior off:
options(error=NULL)
# debug, e.g. estimateSizeFactors from DESeq2...
# debugging an S4 method is more difficult; this gives you a peek inside:
trace(estimateSizeFactors, browser, exit=browser, signature="DESeqDataSet")

显示某个类的包特异方法

以下两个段代码实现的功能大致相同:查看特定的某个包中,某个特定类对象的操作方法。给定类的对象的操作方法。

intersect(sapply(strsplit(as.character(methods(class="DESeqDataSet")), ","), `[`, 1), ls("package:DESeq2"))
sub("Function: (.*) \\(package .*\\)","\\1",grep("Function",showMethods(classes="DESeqDataSet", 
    where=getNamespace("DESeq2"), printTo=FALSE), value=TRUE))

注释Annotations

这里需要先更新一下AnnotationHub,现在我们查看一下AnnotationHub的某个案例:

https://master.bioconductor.org/packages/release/bioc/html/AnnotationHub.html

以下代码是如何使用数据库包,以及biomart,先看AnnotationDbi

# 某用包个注释包
library(AnnotationDbi)
library(org.Hs.eg.db) # 这个是人类注释包,即Homo.sapiens
columns(org.Hs.eg.db)
keytypes(org.Hs.eg.db) # columns与keytypes返回的结果一样,即有什么样的注释内容
k <- head(keys(org.Hs.eg.db, keytype="ENTREZID"))

# 返回一个字符向量(1对1的关系),查看mapIds则可以查看多个选项
res <- mapIds(org.Hs.eg.db, keys=k, column="ENSEMBL", keytype="ENTREZID")

# 产生1对多的映射关系 
res <- select(org.Hs.eg.db, keys=k,
  columns=c("ENTREZID","ENSEMBL","SYMBOL"),
  keytype="ENTREZID")

biomaRt

这个包主要用于各种基因ID的转换。

# 使用biomart包,将一个注释与另外一个注释进行映射
library(biomaRt)
m <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
map <- getBM(mart = m,
  attributes = c("ensembl_gene_id", "entrezgene"),
  filters = "ensembl_gene_id", 
  values = some.ensembl.genes)

GenomicRanges包

GenomicRanges这个包用于分析高通量测序数据,它能有效地表壳与操作基因组注释信息(genomic annotations)与比对结果(alignments)。GenomicRanges包定义了一个通用容器,此容器用于储存与操作基因间隔(genomic intervals)与变量(variables)。在GenomicAlignments包和SummarizedExperiment包中定义了一个更加特殊的容器,GenomicAlignments包中的这个更加特殊的容器可以根据一个参考基因组来表示和操作一些短比对结果(short alignments),SummarizedExperiment包中容器可以构建一个实验的矩阵样汇总信息。

此包的使用如下所示:

library(GenomicRanges)
z <- GRanges("chr1",IRanges(1000001,1001000),strand="+")
start(z)
end(z)
width(z)
strand(z)
mcols(z) # the 'metadata columns', any information stored alongside each range
ranges(z) # gives the IRanges
seqnames(z) # the chromosomes for each ranges
seqlevels(z) # the possible chromosomes
seqlengths(z) # the lengths for each chromosome

Intra-range methods

不清楚功能,后面补充。

Affects ranges independently

function description
shift moves left/right
narrow narrows by relative position within range
resize resizes to width, fixing start for +, end for -
flank returns flanking ranges to the left +, or right -
promoters similar to flank
restrict restricts ranges to a start and end position
trim trims out of bound ranges
+/- expands/contracts by adding/subtracting fixed amount
* zooms in (positive) or out (negative) by multiples

Inter-range methods

Affects ranges as a group

function description
range one range, leftmost start to rightmost end
reduce cover all positions with only one range
gaps uncovered positions within range
disjoin breaks into discrete ranges based on original starts/ends

Nearest methods

Given two sets of ranges, x and subject, for each range in x, returns...

function description
nearest index of the nearest neighbor range in subject
precede index of the range in subject that is directly preceded by the range in x
follow index of the range in subject that is directly followed by the range in x
distanceToNearest distances to its nearest neighbor in subject (Hits object)
distance distances to nearest neighbor (integer vector)

A Hits object can be accessed with queryHits, subjectHits and mcols if a distance is associated.

set methods

If y is a GRangesList, then use punion, etc. All functions have default ignore.strand=FALSE, so are strand specific.

union(x,y) 
intersect(x,y)
setdiff(x,y)

Overlaps

x %over% y  # logical vector of which x overlaps any in y
fo <- findOverlaps(x,y) # returns a Hits object
queryHits(fo)   # which in x
subjectHits(fo) # which in y 

Seqnames and seqlevels

GenomicRanges and GenomeInfoDb

gr.sub <- gr[seqlevels(gr) == "chr1"]
seqlevelsStyle(x) <- "UCSC" # convert to 'chr1' style from "NCBI" style '1'

Sequences

Biostrings是一个操作序列或序列集的包,简要操作可以参考Biostrings Quick Overview PDF,关于一些物种的,缩写的信息可以参考cheat sheet for annotation

library(BSgenome.Hsapiens.UCSC.hg19)
dnastringset <- getSeq(Hsapiens, granges) # returns a DNAStringSet
# also Views() for Bioconductor >= 3.1
library(Biostrings)
dnastringset <- readDNAStringSet("transcripts.fa")
substr(dnastringset, 1, 10) # to character string
subseq(dnastringset, 1, 10) # returns DNAStringSet
Views(dnastringset, 1, 10) # lightweight views into object
complement(dnastringset)
reverseComplement(dnastringset)
matchPattern("ACGTT", dnastring) # also countPattern, also works on Hsapiens/genome
vmatchPattern("ACGTT", dnastringset) # also vcountPattern
letterFrequecy(dnastringset, "CG") # how many C's or G's
# also letterFrequencyInSlidingView
alphabetFrequency(dnastringset, as.prob=TRUE)
# also oligonucleotideFrequency, dinucleotideFrequency, trinucleotideFrequency
# transcribe/translate for imitating biological processes

测序数据

Rsamtools中的 scanBam 返回BAM文件中的原始数值:

library(Rsamtools)
which <- GRanges("chr1",IRanges(1000001,1001000))
what <- c("rname","strand","pos","qwidth","seq")
param <- ScanBamParam(which=which, what=what)
# for more BamFile functions/details see ?BamFile
# yieldSize for chunk-wise access
bamfile <- BamFile("/path/to/file.bam")
reads <- scanBam(bamfile, param=param)
res <- countBam(bamfile, param=param) 
# for more sophisticated counting modes
# see summarizeOverlaps() below

# quickly check chromosome names
seqinfo(BamFile("/path/to/file.bam"))

# DNAStringSet is defined in the Biostrings package
# see the Biostrings Quick Overview PDF
dnastringset <- scanFa(fastaFile, param=granges)

GenomicAlignments 返回一个Bioconductor对象(GRanges-based)

library(GenomicAlignments)
ga <- readGAlignments(bamfile) # single-end
ga <- readGAlignmentPairs(bamfile) # paired-end

转录本数据库

GenomicFeatures

# get a transcript database, which stores exon, trancript, and gene information
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# or build a txdb from GTF file (e.g. downloadable from Ensembl FTP site)
txdb <- makeTranscriptDbFromGFF("file.GTF", format="gtf")

# or build a txdb from Biomart (however, not as easy to reproduce later)
txdb <- makeTranscriptDbFromBiomart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# in Bioconductor >= 3.1, also makeTxDbFromGRanges

# saving and loading
saveDb(txdb, file="txdb.sqlite")
loadDb("txdb.sqlite")

# extracting information from txdb
g <- genes(txdb) # GRanges, just start to end, no exon/intron information
tx <- transcripts(txdb) # GRanges, similar to genes()
e <- exons(txdb) # GRanges for each exon
ebg <- exonsBy(txdb, by="gene") # exons grouped in a GRangesList by gene
ebt <- exonsBy(txdb, by="tx") # similar but by transcript

# then get the transcript sequence
txSeq <- extractTranscriptSeqs(Hsapiens, ebt)

根据范围(ranges)和实验汇总信息

SummarizedExperiment是高维数据的储存类, 它与GRangesGRangesList配合使用(使用每个基因外显子的read计数)。

library(GenomicAlignments)
fls <- list.files(pattern="*.bam$")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ebg <- exonsBy(txdb, by="gene")
# see yieldSize argument for restricting memory
bf <- BamFileList(fls)
library(BiocParallel)
register(MulticoreParam(4))
# lots of options in the man page
# singleEnd, ignore.strand, inter.features, fragments, etc.
se <- summarizeOverlaps(ebg, bf)

# operations on SummarizedExperiment
assay(se) # the counts from summarizeOverlaps
colData(se)
rowRanges(se)

Salmon是一个定量工具,如果数据占有GC依赖的数据,那么就添加上--gcBias参数,接着使用 tximport,以下是使用案例:

coldata <- read.table("samples.txt")
rownames(coldata) <- coldata$id
files <- coldata$files; names(files) <- coldata$id
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
dds <- DESeqDataSetFromTximport(txi, coldata, ~condition)

Rsubread包中的featureCounts函数也可以用于read计算,速度可能更快。

library(Rsubread)
res <- featureCounts(files, annot.ext="annotation.gtf",
  isGTFAnnotationFile=TRUE,
  GTF.featureType="exon",
  GTF.attrType="gene_id")
res$counts

RNA-seq分析方法

DESeq2:My preferred pipeline for DESeq2 users is to start with a lightweight transcript
abundance quantifier such as Salmon
and to use tximport, followed
by DESeqDataSetFromTximport.

Here, coldata is a data.frame with group as a column.

library(DESeq2)
# from tximport
dds <- DESeqDataSetFromTximport(txi, coldata, ~ group)
# from SummarizedExperiment
dds <- DESeqDataSet(se, ~ group)
# from count matrix
dds <- DESeqDataSetFromMatrix(counts, coldata, ~ group)
# minimal filtering helps keep things fast 
# one can set 'n' to e.g. min(5, smallest group sample size)
keep <- rowSums(counts(dds) >= 10) >= n 
dds <- dds[keep,]
dds <- DESeq(dds)
res <- results(dds) # no shrinkage of LFC, or:
res <- lfcShrink(dds, coef = 2, type="apeglm") # shrink LFCs

edgeR

# this chunk from the Quick start in the edgeR User Guide
library(edgeR) 
y <- DGEList(counts=counts,group=group)
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit)
topTags(lrt)
# or use the QL methods:
qlfit <- glmQLFit(y,design)
qlft <- glmQLFTest(qlfit)
topTags(qlft)

limma-voom

library(limma)
design <- model.matrix(~ group)
y <- DGEList(counts)
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
v <- voom(y,design)
fit <- lmFit(v,design)
fit <- eBayes(fit)
topTable(fit)

更多有关RNA-seq操作的包可以参考:Many more RNA-seq packages

表达数据集Expression set

library(Biobase)
data(sample.ExpressionSet)
e <- sample.ExpressionSet
exprs(e)
pData(e)
fData(e)

下载GEO数据库

library(GEOquery)
e <- getGEO("GSE9514")

芯片分析

library(affy)
library(limma)
phenoData <- read.AnnotatedDataFrame("sample-description.csv")
eset <- justRMA("/celfile-directory", phenoData=phenoData)
design <- model.matrix(~ Disease, pData(eset))
fit <- lmFit(eset, design)
efit <- eBayes(fit)
topTable(efit, coef=2)

iCOBRA包

iCOBRA包用于计算和可视化排序数据以及二分类匹配后的数据。例如,iCOBRA包的一个典型用途就是用于评估基因表达实验中不同组之间的比较方法,我们可以视之为一个排序问题(评估正确的效应量(effect size)以及按照显著性进行排列的基因),还是一个二分类比较问题(binary assignment problem)(例如将基因分为差异表达和非差异表达)。

library(iCOBRA)
cd <- COBRAData(pval=pval.df, padj=padj.df, score=score.df, truth=truth.df)
cp <- calculate_performance(cd, binary_truth = "status", cont_truth = "logFC")
cobraplot <- prepare_data_for_plot(cp)
plot_fdrtprcurve(cobraplot)
# interactive shiny app:
COBRAapp(cd)

参考资料

  1. Genomic annotation in Bioconductor: Cheat sheet

  2. HarvardX Biomedical Data Science Open Online Training

最后编辑于
?著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 214,029评论 6 493
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,238评论 3 388
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事?!?“怎么了?”我有些...
    开封第一讲书人阅读 159,576评论 0 349
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,214评论 1 287
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,324评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,392评论 1 292
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,416评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,196评论 0 269
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,631评论 1 306
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,919评论 2 328
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,090评论 1 342
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,767评论 4 337
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,410评论 3 322
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,090评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,328评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,952评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,979评论 2 351

推荐阅读更多精彩内容