一 :hisat2流程
hisat2比对生成sam
nohup hisat2 -p 8 -x /home/huawei/raw_data/YSQ/4AL_resequence/Chinese_Spring_v1.0/Chinese_Spring_v1.0 -1 /home/huawei/2019_9_29_data/wheat/C3_1.clean.fq.gz -2 /home/huawei/2019_9_29_data/wheat/C3_2.clean.fq.gz |samtools sort -@ 8 -o C3_sort.bam - &
bam取唯一比对
nohup samtools view -@ 8 -h C2.bam |grep -e @ -e "NH:i:1"|samtools view -@ 8 -Sb >C2.uniq.bam &
替换头文件
samtools view -H Tri_1.uniq.bam | sed 's,^@PG.*,@PG\tID:4\tSM:T1.add.bam\tLB:lib1\tPL:Illumina,g' | samtools reheader - Tri_1.uniq.bam > T1.new.bam
https://www.biostars.org/p/249376/
markduplicate
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I 4AL_TTD.sort.uniq.bam -O 4AL_TTD.sort.uniq_mkdup.bam -M 4AL_TTD_mkdup.metrics
用AddOrReplaceReadGroups分组
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" AddOrReplaceReadGroups -I T1.sort.uniq_mkdup.bam -O T1.add.bam --LB RNA -PL illumina -PU bwa -SM T1
建立索引
samtools index -@ 8 T1.add.bam
每个混池生成单个g.vcf
gatk HaplotypeCaller -R ../0.index/wheat_parts.fa -I T2.add.bam -ERC GVCF -O T2.g.vcf
合成vcf
合并两个g.vcf
(1) 准备一个input.list,内容是单个g.vcf的文件名,一个文件一行
cat >inputE_F.list
E.g.vcf
F.g.vcf
(2) 合并g.vcf
使用的是gatk4,用GenotypeGVCFs出现报错
gatk GenotypeGVCFs -R /home/huawei/raw_data/YSQ/4AL_resequence/Chinese_Spring_v1.0/Chinese_Spring_v1.0.fasta --variant /home/huawei/4.mergevcf/inputE_F.list -o E_Fmerge.vcf
用CombineGVCFs合并两个g.vcf
gatk CombineGVCFs -R ../0.index/wheat_parts.fa --variant T1.g.vcf --variant T2.g.vcf -O T1_T2merge.vcf
试图解决:
conda update --all
conda update gatk
conda update conda
进行call SNP,过滤流程
gatk SelectVariants -select-type SNP -V E_Fmerge.vcf -O E_F.snp.vcf
gatk VariantFiltration -V E_F.snp.vcf --filter-expression "QUAL < 30.0 || QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "Filter" -O E_F_filter.snp.vcf
gatk SelectVariants --exclude-filtered true -V E_F_filter.snp.vcf -O E_F_filtered.snp.vcf
https://www.biostars.org/p/249376/
http://08643.cn/p/a3b3c6190d35
https://www.bioinfo-scrounger.com/archives/183/
二:STAR流程
建立索引需要gtf文件,小麦下载到为gff3(网址Triticum_aestivum.IWGSC.48.gff3.gz)。
先解压:gunzip Triticum_aestivum.IWGSC.48.gff3.gz
gff3转为gtf文件
conda install gffread
gffread Triticum_aestivum.IWGSC.48.gff3 -T -o IWGSC.48.gtf
打断gtf,位置与序列fa文件相同
perl gtf_break.pl break.txt IWGSC.48.gtf IWGSC.48_break.gtf
STAR下载安装
https://github.com/alexdobin/STAR/
BSR-(RNA-seq)数据进行BSR分析
call variants from wheat RNA_seq
构建索引(大于4h)
nohup /public/software/apps/STAR/2.7.0f/bin/Linux_x86_64/STAR --runThreadN 10 --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ./wheat_parts.fa --sjdbGTFfile ./IWGSC.48_break.gtf --sjdbOverhang 149 --limitGenomeGenerateRAM 68800833920 2>log &
生成一个dict
/public/home/zhanghq/biosoft/gatk-4.1.9.0/gatk CreateSequenceDictionary -R ../0_reference/IWGSC_parts.fa -O IWGSC_parts.dict
建立.fai,要不然报错
/public/software/apps/samtools/1.9/bin/samtools faidx IWGSC.fa
比对(15min左右)
/public/software/apps/STAR/2.7.0f/bin/Linux_x86_64/STAR --runThreadN 10 --twopassMode Basic --genomeDir ../0_reference/ --limitSjdbInsertNsj 5000000 --outSAMtype BAM SortedByCoordinate --twopass1readsN -1 --readFilesCommand zcat --outFilterMismatchNmax 10 --readFilesIn Tri_di_1_1.clean.fq.gz Tri_di_1_2.clean.fq.gz --outSAMattrRGline ID:Tri_1 SM:Tri_1 PL:ILLUMINA LB:Tri_1
/public/software/apps/STAR/2.7.0f/bin/Linux_x86_64/STAR --runThreadN 10 --twopassMode Basic --genomeDir ../0_reference/ --limitSjdbInsertNsj 5000000 --outSAMtype BAM SortedByCoordinate --twopass1readsN -1 --readFilesCommand zcat --outFilterMismatchNmax 10 --readFilesIn Tri_di_2_1.clean.fq.gz Tri_di_2_2.clean.fq.gz --outSAMattrRGline ID:Tri_2 SM:Tri_2 PL:ILLUMINA LB:Tri_2
改名
mv Aligned.sortedByCoord.out.bam Tri_1.sort.bam
mv Aligned.sortedByCoord.out.bam Tri_2.sort.bam
取唯一比对
/public/software/apps/samtools/1.9/bin/samtools view -@ 10 -h -q 255 -o sorted.bam -O BAM Tri_1.sort.bam
/public/software/apps/samtools/1.9/bin/samtools view -@ 10 -h -q 255 -o sorted.bam -O BAM Tri_2.sort.bam
mkdup
../biosoft/gatk-4.1.9.0/gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I Tri_1_uniq_sort.bam -O Tri_1_mkdup.bam -M Tri_1_mkdup.metrics 2>log
../biosoft/gatk-4.1.9.0/gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I Tri_2_uniq_sort.bam -O Tri_2_mkdup.bam -M Tri_2_mkdup.metrics 2>log
建立索引
/public/software/apps/samtools/1.9/bin/samtools index -@ 10 Tri_1_mkdup.bam
/public/software/apps/samtools/1.9/bin/samtools index -@ 10 Tri_2_mkdup.bam
../biosoft/gatk-4.1.9.0/gatk SplitNCigarReads -R ../0_reference/IWGSC_parts.fa -I Tri_1_mkdup.bam -O Tri_1_split.bam 1>Tri_1_split_log.mark 2>&1
../biosoft/gatk-4.1.9.0/gatk SplitNCigarReads -R ../0_reference/IWGSC_parts.fa -I Tri_2_mkdup.bam -O Tri_2_split.bam 1>Tri_2_split_log.mark 2>&1
生成单独的g.vcf(每个大于2h)
../biosoft/gatk-4.1.9.0/gatk HaplotypeCaller -R ../0_reference/IWGSC_parts.fa -I Tri_1_mkdup.bam -ERC GVCF -O Tri_1.g.vcf
../biosoft/gatk-4.1.9.0/gatk HaplotypeCaller -R ../0_reference/IWGSC_parts.fa -I Tri_2_mkdup.bam -ERC GVCF -O Tri_2.g.vcf
merge vcf
nohup ../biosoft/gatk-4.1.9.0/gatk CombineGVCFs -R ../0_reference/IWGSC_parts.fa --variant Tri_1.g.vcf --variant Tri_2.g.vcf -O T1_T2merge.vcf
遇到小插曲,只是所有的从头来过而已??,具体什么原因不知道,返回7A打断位置查看,发现dict的位置附近有基因,而fai的位置附近基因较少,所以从头来过,全部改为fai的位置(448855049)
要检查7A_part1打断的位置,生成的dict(与打断位置一样,450046985)与fai(448855049)文件不一样
# 先按照read name排序
nohup samtools sort -@ 8 -n -O bam -o s35S1-M.name_sorted.bam s35S1-M.bam &
nohup samtools sort -@ 8 -n -O bam -o s35S1-W.name_sorted.bam s35S1-W.bam &
# 利用fixmate来添加ms和MC tag
nohup samtools fixmate -m -@ 8 -O bam s35S1-M.name_sorted.bam s35S1-M.name_mc_sorted.bam &
nohup samtools fixmate -m -@ 8 -O bam s35S1-W.name_sorted.bam s35S1-W.name_mc_sorted.bam &
# 把添加了MC tag后的文件再按染色体位置排序
nohup samtools sort -O bam -o s35S1-M.name_mc_pos_sorted.bam s35S1-M.name_mc_sorted.bam &
nohup samtools sort -O bam -o s35S1-W.name_mc_pos_sorted.bam s35S1-W.name_mc_sorted.bam &
# 最后可以标记重复了
nohup time samtools markdup -@ 8 aligment_result/s35S1-M.name_mc_pos_sorted.bam tools_result/samtool_markdup/s35S1-M.sorted.samtools_markdup.bam &
nohup time samtools markdup -@ 8 aligment_result/s35S1-W.name_mc_pos_sorted.bam tools_result/samtool_markdup/s35S1-W.sorted.samtools_markdup.bam &
samtools中faidx索引格式
RNA-seq汇总(分析+工具+GATK流程)
NGS的变异检测高效工具Sentieon使用体验
GATK/BWA加速神器Sentieon更新列表
如何让GATK飞起来--Sentieon初步体验