0%

从头绘制二倍体单细胞的染色质三维结构

背景

2018 年 Dip-C文章[1]提供了一种利用单细胞Hi-C数据重构细胞染色质三维结构的方法。利用退火算法根据paternal 和maternal 染色体上snp 的不同区分每一条contact 分别来自于哪条染色体。

Dip-C fig1A

这篇文章之后又有几篇用该方法的类似文章,但是他们有一个共同的特点即事先有完成了phasing 的snp文件,比如对于人类细胞来说 illumina 提供的NA12878或者老鼠本身是杂交的等等。 但是如果我们想用这个技术绘制一个比如衰老细胞的三维基因组图谱,就会碰到不知道哪些snp位于同一个染色体的问题,从而无法完成二倍体的染色质三维结构的重构。这个时候我们需要对snp做phasing,也就是区分那些snp是在一个染色体上。

目前的方法

目前做phasing最简单的方法是测一下父母的血,很容易就能知道哪些snp来自父亲哪些来自母亲。但是吧,假如样品本身来自于一个七八十岁的人,你很难搞到人家父母的血对吧。

第二种是测个三代,利用三代或者三代+hic组装,比如FALCON-phase。但是三代太贵了。

第三种方法就是直接做hic,唉我们正好是做的单细胞hic,挺好。这个分支下有两个方法。一种是简单的hic,有个叫hapcut2的工具。另一种是结合群体遗传学,也是这个作者做的。

一个能用的流程

于是自然而然的我选择了只用hic 的hapcut2 流程。这个事情主要有下面几个步骤:

  1. 数据预处理,拿到hic bam文件。
  2. Call snp
  3. Hapcut2 phasing

至于从fastq 到hic bam 的步骤,相信大家都有流程。总之要拿到每个细胞的{cellname}.sorted.rmdup.bam

Call SNP

Call SNP 使用Deep Variant[2] 比较简单快捷。

1
2
3
4
5
6
7
8
9
10
# merge bamfile
ls bamFile | grep sorted.rmdup.bam | grep -v "bai" | awk '{print "./bamFile/"$0}' > filelist.txt
samtools merge -b filelist.txt labGM.merge.bam

# call snp with DeepVariant
# 注意这里由于使用docker 用到的文件不能是链接而应做一份拷贝。另外bam 和fasta 文件都应该有index
work_dir="/share/home/zliu/test/202103/hicPhasing/callsnp"
samtools faidx GRCh37d5.fasta
samtools index -@ 40 labGM.merge.bam
docker run -v "${work_dir}":"/input" -v "${work_dir}":"/output" google/deepvariant:"1.1.0" /opt/deepvariant/bin/run_deepvariant --model_type=WGS --ref=/input/GRCh37d5.fasta --reads=/input/labGM.merge.bam --output_vcf=/output/LabGM.vcf --num_shards=40 --logging_dir=/output/logs --call_variants_extra_args="use_openvino=true

然后对于我的实验来说,但细胞实验的错误率比较高,需要对输出的这个vcf做一点点的过滤。参考了一下LIANTI 文章[2]:

We call a heterozygous SNP from the bulk sample if (i) the total read depth is above 15; (ii) both alleles are supported by 5 reads or more; (iii) the fraction of reads supporting each allele is at least 30%; (iv) the root-mean-square mapping quality is no less than 40. F

对于任意snp,我选择了以下的条件:

  1. 每个allele至少出现在5% 的细胞里。
  2. mapping quality 的rms小于40的snp位点不要。
  3. read score小于10 的不要。

摸个filter的脚本:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
import pysam
import pandas as pd
import numpy as np
from pandarallel import pandarallel
import time

def filterSingleCellSNP(bamfilePath,chr,pos,ref,alt,totalCell,percentage = 0.1,rmsQualMin = 40,baseQualMin = 10):
"""
input:
output: ture or false value
"""
baseRes = []
baseQual = []
mapQual = []
library = []

bamfile = pysam.AlignmentFile(bamfilePath,"rb")
iter = bamfile.fetch(chr,pos-1,pos) # position for bam file, 0-based, -1 needed, same below
for x in iter:
try:
baseRes.append(x.query[pos - x.blocks[0][1]-1])
except:
continue
baseQual.append(ord(x.qual[pos - x.blocks[0][1]-1])-33)
mapQual.append(x.mapq)
library.append(x.get_tag("RG"))

siteInfo = pd.DataFrame({
"base" : baseRes,
"baseQual" : baseQual,
"mapQual" : mapQual,
"library" : library,
}).query('baseQual > @baseQualMin')

rms = np.sqrt(np.mean(siteInfo.mapQual.to_numpy()**2))
if rms < rmsQualMin:
return "False"

baseLib = siteInfo[["library","base"]].drop_duplicates().groupby("base").count().to_dict()["library"]
if ref in baseLib and alt in baseLib:
if (baseLib[ref] > totalCell*percentage and baseLib[alt] > totalCell*percentage):
return "True"
else: return "False"
else: return "False"


vcf = pd.read_csv("LabGM.vcf",sep="\t",comment="#",names = ["chrom","pos","n1","ref","alt","score","type","n2","featrure","value"])
vcf=vcf[vcf.ref.apply(lambda x: len(str(x))==1)]
vcf=vcf[vcf.alt.apply(lambda x: len(str(x))==1)]
vcf=vcf[vcf.value.str.match("0/1")]

pandarallel.initialize()
start = time.time()

# config below
vcf['scPass'] = vcf.parallel_apply(lambda row: filterSingleCellSNP("./labGM.merge.bam",row['chrom'],row['pos'],row['ref'],row['alt'],totalCell=249,percentage = 0.05,rmsQualMin = 40,baseQualMin = 10),axis=1)

end = time.time()
print("Time consume : ",end-start)
vcf.to_csv("./libfilter.vcf",sep="\t",header=False)

用大约2000G的GM12878 做了个测试,以NA12878 作为true set:

image-20210415212806406

可以看到右边经过filter之后的效果明显比单纯使用GQ 的结果要好一些。然后我选了GQ >15 的SNP 位点。

1
2
3
4
5
rawVCF <- read_table2("./libfilter.vcf",comment = "#",col_names=F) %>% separate(X11,sep = ":",into = c("GT","GQ","DP","AD","VAF","PL")) %>% select(-X1,-X10)
names(rawVCF) <- c("chrom","pos","n1","ref","alt","score","type","n2","GT","GQ","DP","AD","VAF","PL","singleCellPass")
rawVCF$GQ <- as.numeric(rawVCF$GQ)

rawVCF %>% filter(chrom %in% chromlist,GQ>=15) %>% select(1,2) %>% inner_join(read_table2("./LabGM.vcf",comment = "#",col_names=c("chrom","pos","n1","ref","alt","score","type","n2","GT","GQ"))) %>% write_tsv("../callsnp/GMlab.subset.vcf",col_names = F)

HAPCUT2

后面就是跑一下hapcut2[2] 的流程:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
ids=(`seq 22` X Y);for id in ${ids[@]};do;echo "chr${id}";done | parallel 'cat GMlab.subset.vcf | grep {} > ./vcfPerChr/labGM.{}.vcf'

#run step 1
for id in ${ids[@]};do;echo "chr${id}";done | parallel 'extractHAIRS --hic 1 --bam ./bamPerChr/labGM.{}.bam --VCF ./vcfPerChr/labGM.{}.vcf --out ./fragment/labGM.{}.fragment'

#run step 2
for id in ${ids[@]};do;echo "chr${id}";done | parallel 'HAPCUT2 --hic 1 --VCF ./vcfPerChr/labGM.{}.vcf --fragments ./fragment/labGM.{}.fragment --output ./phased_haplotypes/labGM.{}.tsv'


#cat labGM.chr10.tsv| grep BLOCK | sort -k9 -n | tail -n 1 | xargs -I {} sed -n '/^{}/,/\*\*\*\*\*\*\*\*/p' labGM.chr10.tsv | grep -v "BLOCK\|*" > labGM.chr10.largestFragment.tsv
base;snakemake -j30;conda deactivate;conda activate py3;

cat largest_contig/labGM.*.largestContig.tsv | awk 'BEGIN {OFS="\t"};{if($2==0) print $4,$5,$6,$7; else print $4,$5,$7,$6}' | awk -v x=2 '{if(length($3)<x ){print $0}}' | awk -v x=2 '{if(length($4)<x ){print $0}}'| sort -k1,1 -V -s > ../Structuretest/labGM.snp.clean.txt

cat labGM.snp.txt| awk -v x=2 '{if(length($3)<x ){print $0}}' | awk -v x=2 '{if(length($4)<x ){print $0}}' > labGM.snp.clean.txt

Dip-C 重构结构

类似于Dip-C流程我们可以用新生成的snp file 重构结构, 然后看一下结果:

1
find processed/ -name 50k.align.rms.info | xargs -I {} grep --with-filename "RMS RMSD" {} | sed -e "s/processed\///g" -e "s/\/3d_info\/50k.align.rms.info:\[M::align\] RMS RMSD: /\t/g" | sort

Result

其实目前看重构三维结构,能看的东西主要是染色质的Radial Position, 这个结构吧,也不是不能用。

Reference

[1]. Tan, L., et al., Three-dimensional genome structures of single diploid human cells. Science, 2018. 361(6405): p. 924-928.

[2]. Chen, C., Xing, D., Tan, L., Li, H., Zhou, G., Huang, L., & Xie, X. S. (2017). Single-cell whole-genome analyses by Linear Amplification via Transposon Insertion (LIANTI). Science, 356(6334), 189–194.

[3]. Poplin, R., Chang, P. C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., Newburger, D., Dijamco, J., Nguyen, N., Afshar, P. T., Gross, S. S., Dorfman, L., McLean, C. Y., & Depristo, M. A. (2018). A universal snp and small-indel variant caller using deep neural networks. Nature Biotechnology, 36(10), 983.

[4]. Edge, P., Bafna, V., & Bansal, V. (2017). HapCUT2: Robust and accurate haplotype assembly for diverse sequencing technologies. Genome Research, 27(5), 801–812.