单细胞下采样

最近做了一些个超大数据集的分析,其中出现了超级大量的问题….除了数据不整洁,找不到直接适用的meta还得自己生成之外,最大的问题,就是数据实在太大了没法跑….有些分析甚至直接能给200G的服务器内存给爆了,总不能什么时候都氪金吧?而且部分方法直接不支持超大量数据或者无法并行的问题。这时候,就需要用到下采样了,理论上如果我们认可自己在细胞聚类和注释这一步中,得到的cluster的纯度,那么,取部分细胞代表整体是很合理的,当然这一步可以用ROGUE工具进行检查,下采样时,除了subset里的downsample参数,和pseudocell聚合间接下采样,或者直接分层抽样,bootstrap抽样等方法之外,还发现了一个专门的工具可以进行下采样:geosketch

这是个python包,但是通过reticulate可以很方便地调用且直接兼容R中的矩阵直接下采样,因而可以在纯R环境下运行:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
#在R中需要调用reticulate使用python环境,py中则可以原生运行
library(reticulate)
library(rsvd)

geosketch <- import('geosketch')

# Generate some random data.
#samples in rows, features in columns
X <- replicate(20, rnorm(1000))

# Get top PCs from randomized SVD.
s <- rsvd(X, k=10)
X.pcs <- s$u %*% diag(s$d)

# Sketch 10% of data.
sketch.size <- as.integer(100)
sketch.indices <- geosketch$gs(X.pcs, sketch.size, one_indexed = TRUE)
print(sketch.indices)

#下采样:
X_sub <- X[unlist(sketch.indices),]

#有点坑啊....试试pbmc3k:
library(Seurat)
library(SeuratData)

sce <- pbmc3k.final

DimPlot(sce)

X <- t(as.matrix(GetAssayData(sce,assay = "RNA",slot = "counts")))

s <- rsvd(X, k=10)
X.pcs <- s$u %*% diag(s$d)

sketch.size <- as.integer(500)
sketch.indices <- geosketch$gs(X.pcs, sketch.size, one_indexed = TRUE)
print(sketch.indices)

#这里技巧是,不需要获取下采样的矩阵,直接获取下采样出来的细胞名就行了
X_sub_cell <- rownames(X)[unlist(sketch.indices)]
sce$BC <- rownames(sce@meta.data)

sce.sub <- subset(sce, subset = BC %in% X_sub_cell)

DimPlot(sce) + DimPlot(sce.sub)

table(sce$seurat_annotations)
table(sce.sub$seurat_annotations)

从结果上讲,还真是有点东西!即便对于最少的细胞类型,抽样上也并没有忽略掉…而且在全程分析中并没有使用过CellType这个信息,且没有缺少某种类型,证明这个方法的确比较稳健。经过反复多次尝试,确定基本上这个方法能够照顾到极不均衡数据,不至于出现一下采样直接缺少某种细胞类型的问题。
当然,这么方便的工具不会全无问题,关键就在其中的获取转置矩阵那一步,因为这个工具需要行是细胞,列是基因,还不支持稀疏矩阵,所以消耗上在R里有时候仍然会有问题……但这个包原生是为py开发的,python中,特别是scanpy中,表达矩阵的形式直接就是符合要求的….因而不需要任何转置操作,而且内存消耗极小,看上去python环境刻不容缓啊….

那么,不依托于python,有没有其他的工具呢?

也有,那就是metacell类工具,当然,由于metacell坑爹的开发组,这个工具没法在win平台安装,linux和unix平台都能安装,但metacell的接口写的很晦涩,结果展示做的也不怎么样,因此自然就有大佬感觉不爽,去开发新的工具,那就是Supercell,当然严格来说这类工具做的是cell aggregating,但是能够实现和下采样类似的效果,极大减少细胞数量,降低计算开销,以及提升某些分析的准确度:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
library(SuperCell)
#这里直接用pbmc的数据做一次单数据集尝试
rm(list = ls())

library(SeuratData)
library(Seurat)

sce <- pbmc3k.final
sce <- UpdateSeuratObject(sce)
sce$seurat_annotations
DimPlot(sce,reduction = "umap",group.by = "seurat_annotations")

gamma <- 10 # Graining level,聚合的力度
hvg <- VariableFeatures(sce,selection.method = "vst")

# Compute metacells using SuperCell package
MC <- SCimplify(
X = GetAssayData(sce), # single-cell log-normalized gene expression data
genes.use = hvg,
gamma = gamma,
n.pc = 10
)

# Compute gene expression of metacells by simply averaging gene expression within each metacell
MC.ge <- supercell_GE(
ge = GetAssayData(sce),
groups = MC$membership
)

MC$cell_line <- supercell_assign(
cluster = sce$seurat_annotations, # single-cell assignment to cell lines
supercell_membership = MC$membership, # single-cell assignment to metacells
method = "jaccard" # available methods are c("jaccard", "relative", "absolute"), function's help() for explanation
)

method_purity <- c("max_proportion", "entropy")[1]
MC$purity <- supercell_purity(
clusters = sce$seurat_annotations,
supercell_membership = MC$membership,
method = method_purity
)

# Metacell purity distribution
summary(MC$purity)

hist(MC$purity, main = paste0("Purity of metacells \nin terms of cell line composition (", method_purity,")"))

supercell_plot(
MC$graph.supercells,
group = MC$cell_line,
seed = 1,
alpha = -pi/2,
main = "Metacells colored by cell line assignment"
)

supercell_plot(
MC$graph.singlecell,
group = sce$seurat_annotations,
do.frames = FALSE,
lay.method = "components",
seed = 1,
alpha = -pi/2,
main = "Single cells colored by cell line assignment"
)

#直接使用标准的下游处理,也即不管weight的问题
MC.seurat <- supercell_2_Seurat(
SC.GE = MC.ge,
SC = MC,
fields = c("cell_line", "purity"),
var.genes = MC$genes.use,
N.comp = 10
)

Idents(MC.seurat) <- "cell_line"

MC.seurat <- RunUMAP(MC.seurat, dims = 1:10)

DimPlot(MC.seurat, reduction = "umap",pt.size = 2.0,label = T) +
DimPlot(sce,reduction = "umap",group.by = "seurat_annotations",label = T,pt.size = 1.2)

VlnPlot(MC.seurat, features = c("CD8A","CD4","GZMA","GZMB","S100A6"), pt.size = 0.0)

从聚类结果上将,效果确实不错!而且同样兼顾了稀有细胞的问题,还支持meta信息的直接映射,支持直接使用aggregating后的数据走Seurat的流程,也提到了细胞不均等的时候引入weight的方法,考虑得非常全面,甚至支持aggregating后进行velocyto的运算,综合性能非常好,以后大规模数据就用这个了,而且这个工具aggregating后甚至支持去批次的问题,也即可以从一开始,对于每个sample的数据就进行aggregating,然后再进行下游分析,就能极大程度减少运算压力:

直接用TISCH2上面的CRC的数据,来自文章,数据为7 patients,no treatment,10X,primary,2.3W cells,理论上这种大小的数据集当然能直接分析,但是对于这种每个样本单独获取数据的形式,如果数据扩大10倍呢?那显然就得引入一些降低开销的方法了:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
sce <- Read10X_h5("./CRC_EMTAB8107_expression.h5")
sce.meta <- read.table("./CRC_EMTAB8107_CellMetainfo_table.tsv",
header = T,
check.names = F,row.names = 1,
sep = "\t")

sce <- CreateSeuratObject(sce,
project = "CRC_EMTAB8107",
meta.data = sce.meta,
min.features = 200,
min.cells = 10)
sce
#22236 features across 34284 samples within 1 assay

library(gtsummary)
table1 <- tbl_summary(sce@meta.data)
table1
#可以看到,orig.ident和sample不一致,但是sample和patient一致
#因此这个去批次选什么很迷惑,由于这里是癌症的数据,因此....用harmony对patient整合,然后弱去批次

#或者其实这里就可以直接split然后做supercell
sce.list <- SplitObject(sce,split.by = "Patient")

#首先单独QC

#单独对每个seurat对象执行supercell操作,最后仍然得到一个list
gamma <- 20
MC.list <- list()

for(i in seq_along(sce.list)){
sce.list[[i]] <- NormalizeData(sce.list[[i]])
sce.list[[i]] <- FindVariableFeatures(sce.list[[i]])
hvg = VariableFeatures(sce.list[[i]])
MC = SCimplify(
X = GetAssayData(sce.list[[i]]),
genes.use = hvg,
gamma = gamma,
n.pc = 10)
MC.ge = supercell_GE(
ge = GetAssayData(sce.list[[i]]),
groups = MC$membership)
MC$Cell_line <- supercell_assign(
cluster = sce.list[[i]]$Celltype..major.lineage.,
supercell_membership = MC$membership,
method = "jaccard")
MC$Sample <- supercell_assign(
cluster = sce.list[[i]]$Sample,
supercell_membership = MC$membership,
method = "absolute")
MC.list[[i]] <- supercell_2_Seurat(
SC.GE = MC.ge,
SC = MC,
fields = c("Cell_line","Sample"),
var.genes = MC$genes.use,
N.comp = 15,
is.log.normalized = F)}
rm(MC,MC.ge)

MC.harmony <- merge(MC.list[[1]], y=c(MC.list[[2]], MC.list[[3]],MC.list[[4]],
MC.list[[5]], MC.list[[6]], MC.list[[7]]))

MC.harmony <- NormalizeData(MC.harmony) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(verbose=FALSE)

library(harmony)
MC.harmony <- RunHarmony(MC.harmony, group.by.vars = "Sample")

MC.harmony <- RunUMAP(MC.harmony, reduction = "harmony", dims = 1:10)
MC.harmony <- FindNeighbors(MC.harmony, reduction = "harmony", dims = 1:10) %>% FindClusters()

#group_by_cluster
plot1 = DimPlot(MC.harmony, reduction = "umap",group.by = "Cell_line", label=T,pt.size = 1.2)
#group_by_sample
plot2 = DimPlot(MC.harmony, reduction = "umap", group.by='Sample',pt.size = 1.2)
#combinate
plotc <- plot1+plot2
plotc

#检查一下marker的情况,看看在表达上supercell是否有优势
genes.to.check <- c('MS4A1','CD79A','CD37','CD74',
'IL7R','LTB','CD3D','CXCR4',
'GNLY','CCL5','GZMA','NKG7',
'S100A9','S100A8','TYROBP','LYZ',
'JCHAIN','IGKC','MZB1','IGLC2')

SCpubr::do_DotPlot(sample = MC.harmony,
features = genes.to.check,
group.by = "Cell_line")

SCpubr::do_CorrelationPlot(sample = MC.harmony,
group.by = "Cell_line",
column_names_rot = 90,
cell_size = 8,
cluster_cols = T)

network <- decoupleR::get_dorothea(organism = "human",
levels = c("A", "B", "C"))

table(MC.harmony$Cell_line,MC.harmony$Sample)

整个流程非常nice,几乎和原生单细胞流程一致,从表达量上,也能够提升marker基因的表现,harmony的结果也非常可靠,和TISCH网站上的整体数据相对比,降维聚类的结构完全一致,基因表达上则更加准确,显然是一种极好地公共数据大规模复用的方法!