批次效应 是指在实验中由于不同的实验条件、操作人员、试剂批次、设备差异等非生物学因素引起的系统性差异。具体来说,在单细胞RNA测序(scRNA-seq)中,不同的实验批次可能会产生显著的技术变异,这些变异会掩盖实际的生物学信号,导致错误的分析结果。直观来看,批次效应表现为,在整体数据集进行UMAP 和/或 tSNE 降维分析中,不同供体的数据不能 “很好地混合”,未经校正的批次效应会导致不同批次的样本聚集在一起,而不是根据细胞类型或状态进行聚类。如果在此基础上进行更深的数据挖掘,就很可能会获得虚假差异(例如来自单一供体的细胞群)。单细胞测序数据集的 整合分析(例如跨实验批次、供体或条件) 通常是 scRNA-seq 工作流程中的关键步骤。通过综合分析,可以匹配不同数据集中的共享细胞类型和状态,这不仅提高了统计能力,更重要的是,促进了跨数据集的准确比较分析。在单细胞分析的各种工作流中,Seurat
分析套件是目前应用最广泛的方法之一。随着近年来 Seurat 团队的不断更新,该分析框架已经发展到了 v5 版本。相较于之前的版本,Seurat v5
在数据结构和分析方法上都进行了许多重要的调整和改进。本帖收集了一些资料特别对新旧版本的数据整合工作流之间的一些差异进行了汇总。
Seurat 标准整合工作流:Analysis, visualization, and integration of Visium HD spatial datasets with Seurat ? Seurat (satijalab.org)
1、Seurat v4 Integration Work-flow
在 Seurat v4 整合工作流程中,每个整合步骤中发生的情况非常清楚,如每个命令的文档页面底部所记录。
1) SelectIntegrationFeatures()
Summary:
Choose the features to use when integrating multiple datasets. This function ranks features by the number of datasets they are deemed variable in, breaking ties by the median variable feature rank across datasets. It returns the top scoring features by this ranking.
2) FindIntegrationAnchors()
Summarized as this:
Perform dimensional reduction on the dataset pair as specified via the reduction parameter. If l2.norm is set to TRUE, perform L2 normalization of the embedding vectors.Identify anchors - pairs of cells from each dataset that are contained within each other's neighborhoods (also known as mutual nearest neighbors).
Filter low confidence anchors to ensure anchors in the low dimension space are in broad agreement with the high dimensional measurements. This is done by looking at the neighbors of each query cell in the reference dataset using max.features to define this space. If the reference cell isn't found within the first k.filter neighbors, remove the anchor.
Assign each remaining anchor a score. For each anchor cell, determine the nearest k.score anchors within its own dataset and within its pair's dataset. Based on these neighborhoods, construct an overall neighbor graph and then compute the shared neighbor overlap between anchor and query cells (analogous to an SNN graph). We use the 0.01 and 0.90 quantiles on these scores to dampen outlier effects and rescale to range between 0-1
3)IntegrateData()
Summary:
For pairwise integration:Construct a weights matrix that defines the association between each query cell and each anchor. These weights are computed as 1 - the distance between the query cell and the anchor divided by the distance of the query cell to the k.weightth anchor multiplied by the anchor score computed in FindIntegrationAnchors. We then apply a Gaussian kernel width a bandwidth defined by sd.weight and normalize across all k.weight anchors.
Compute the anchor integration matrix as the difference between the two expression matrices for every pair of anchor cells
Compute the transformation matrix as the product of the integration matrix and the weights matrix.
Subtract the transformation matrix from the original expression matrix.
It is clear then that the output of this integration workflow is a corrected matrix used for downstream analysis.
1.1 Seurat v4 Log-Normalization Based Integration
{
pacman::p_load(Seurat,SeuratData,dplyr)
options(future.globals.maxSize = 1e+12)
obt.list <- SplitObject(ifnb, split.by = "stim")
features <- SelectIntegrationFeatures(object.list = obt.list)
set.anchors <- FindIntegrationAnchors(object.list = obt.list, anchor.features = features)
set.combined <- IntegrateData(anchorset = set.anchors)
DefaultAssay(set.combined) <- 'integrated'
set.combined <- set.combined %>% RunUMAP(dims = 1:30,min.dist = 1e-5)
}
1.2 Seurat v4 SCT-Normalization Based Integration
{
pacman::p_load(Seurat,SeuratData,dplyr)
options(future.globals.maxSize = 1e+12)
ifnb <- UpdateSeuratObject(LoadData("ifnb"))
seuObj.list <- SplitObject(ifnb, split.by = "stim")
lapply(seuObj.list, function(obj){
obj %>% SCTransform(verbose = FALSE) %>%
RunPCA(verbose = FALSE)
})
features <- SelectIntegrationFeatures(object.list = seuObj.list, nfeatures = 3000)
seuObj.list <- PrepSCTIntegration(object.list = seuObj.list, anchor.features = features)
### default reduction is CCA
anchors <- FindIntegrationAnchors(object.list = seuObj.list, normalization.method = "SCT",anchor.features = features)
seuObj.combined.sct <- IntegrateData(anchorset = anchors, normalization.method = "SCT")
seuObj.combined.sct <- seuObj.combined.sct %>%
RunPCA( verbose = T) %>%
RunUMAP(dims = 1:30,min.dist = 1e-3, verbose = F)
}
2、Seurat v5 Integration Work-flow
在 Seurat v5 整合工作流程中,相比起 v4,开发团队引入了
Layer
数据结构, 并且工作流程从代码/可用性角度进行了简化,但 v5总体步骤与v4工作流程相同。该流程的主要区别在于,在 Seurat v5 中,校正是在整个对象的全部细胞的low-dimensional space
上进行的(默认为PCA
空间),而不是将基因表达值本身作为校正的输入,并且不再返回integrated assay
,而是直接返回一个校正的低维空间 (默认为integrated.dr
)。值得注意的是,在生成整个对象的PCA
空间的时候,Seurat v5 更新了FindVariableFeatures
的方法,现流程在一个包含2+ Layers
的数据对象中查找 可变特征 时,会分别识别每一层的可变特征,然后找出共同的可变特征。接着,使用每个Layer
中最可变且在其他矩阵中也存在的特征,补充进来直到达到所需的 n 个可变特征(这与Seurat v4中用于整合数据时识别特征的方法SelectIntegrationFeatures
相同),从而确保整合后的数据能够准确反映细胞之间的生物学差异,而不是技术差异。此外,如果研究确实依赖
integrated assay
,Seurat v5 仍然支持旧的整合工作流程(使用IntegrateData
而不是IntegrateLayers
)。
见讨论:
- Unclear what IntegrateLayers (v5) is actually doing vs IntegrateData (v4) · Issue #8653 · satijalab/seurat (github.com)
- Seurat V5 FindVariableFeatures() and HarmonyIntegration() Question · Issue #8325 · satijalab/seurat (github.com)
2.1 Seurat v5 Log-Normalization Based Integration
{
pacman::p_load(Seurat,SeuratData,dplyr)
options(future.globals.maxSize = 1e+12)
ifnb <- UpdateSeuratObject(LoadData("ifnb"))
ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
ifnb <- ifnb %>% NormalizeData(verbose = F) %>%
FindVariableFeatures(verbose = F) %>%
ScaleData(verbose = F) %>%
RunPCA(verbose = F) %>%
IntegrateLayers(method = CCAIntegration,verbose = FALSE) %>%
RunUMAP(dims = 1:30,reduction = "integrated.dr",min.dist = 1e-3, verbose = F)
# re-join layers after integration
ifnb[["RNA"]] <- JoinLayers(ifnb[["RNA"]])
}
2.2 Seurat v5 SCT-Normalization Based Integration
pacman::p_load(Seurat,SeuratData,dplyr,ggplot2)
options(future.globals.maxSize = 1e+12)
{
ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
ifnb <- ifnb %>% SCTransform(verbose = F) %>%
RunPCA(verbose = F) %>%
IntegrateLayers(method = CCAIntegration, normalization.method = "SCT", verbose = F) %>%
RunUMAP(dims = 1:30,reduction = "integrated.dr",min.dist = 1e-3, verbose = F)
}
3、Compare Result of Seurat v4 & v5 SCT-Normalization Based Integration
通过对CCA方法校正后的UMAP表示进行目视检查,Seurat v4和v5 版本都能比较好的校正不同技术平台引入的技术偏差,并且保留了细胞类型的可区分性。
DimPlot(seuObj.combined.sct,group.by = "seurat_annotations",label = T) & labs(title = "V4 Integ") | DimPlot(ifnb,group.by = "seurat_annotations",label = T) & labs(title = "V5 Integ")
Reference
Seurat -- SCTransform_> e14 <- sctransform(object = e14) running sctrans
https://satijalab.org/seurat/archive/v4.3/integration_introduction
Unclear what IntegrateLayers (v5) is actually doing vs IntegrateData (v4) · Issue #8653 · satijalab/seurat (github.com)