2013年发布了GRCh38,每年会在不改变序列和坐标的情况下发布一些Patches
https://www.ncbi.nlm.nih.gov/grc/help/patches/
**《Biostar Handbook》建议使用最新版本的基因组,并且要知道如何在不同基因组之间映射信息(即liftover coordinates)
liftOver from UCSC (web工具和命令行工具)
https://www.ncbi.nlm.nih.gov/genome/tools/remap
remap from NCBI (web工具)
https://www.ncbi.nlm.nih.gov/genome/tools/remap
crossmap (命令行工具)
http://crossmap.sourceforge.net/
进行liftover需要一个chain data,用于描述新旧build之间的差异:
conda install crossmap -y
CrossMap.py
# Get the chain file that maps from hg19 to hg38.
# 下载chain data
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
# Get a test data file that will be remapped.
# bed文件?
wget http://data.biostarhandbook.com/data/ucsc/test.hg19.bed
# Run the remapping process.
# 进行remap
CrossMap.py bed hg19ToHg38.over.chain.gz test.hg19.bed test.hg38.bed
*.bed
文件不知道是什么,学习:
《生信分析过程中这些常见文件(fastq/bed/gtf/sam/bam/wig)的格式以及查看方式你都知道吗?》https://blog.csdn.net/qazplm12_3/article/details/85222665
bwa作者Heng Li 2017年的博客给出了一些选择参考基因组的建议:
https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use
1. 比对至GRCh37(hg19),使用hs37-1kg
:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
2. 比对至GRCh37,并且认为 decoy sequence* 有助于variant calling,使用hs37d5
:
关于decoy sequence,在博文《关于人参考基因组fasta文件的组成部分说明》中有提及,EB病毒基因组:
http://08643.cn/p/5b73773e30ef
3. 比对至GRCh38(hg38):
GRCh37(hg19)和GRCh38(hg38)还有其它小版本。
各个版本的基因组可能存在的问题:
1. Inclusion of ALT contigs.
由于基因组是用单倍体类型表现的,因此需要alt序列表示双倍体中的等位基因等。
ALT contigs are large variations with very long flanking sequences nearly identical to the primary human assembly. Most read mappers will give mapping quality zero to reads mapped in the flanking sequences. This will reduce the sensitivity of variant calling and many other analyses. You can resolve this issue with an ALT-aware mapper, but no mainstream variant callers or other tools can take the advantage of ALT-aware mapping.
2. Padding ALT contigs with long “N”s. (?)
This has the same problem with 1 and also increases the size of genome unnecessarily. It is worse.
3. Inclusion of multi-placed sequences.
伪常染色体序列(PARs)是X和Y染色体上核苷酸的同源序列,假常染色体基因(到目前为止至少发现了29个)表现出常染色体遗传而不是性别相关的遗传模式。
alpha satellites在维基百科中重定向至centromere了
https://en.wikipedia.org/wiki/Centromere#The_centromeric_sequence
In both GRCh37 and GRCh38, the pseudo-autosomal regions (PARs) of chrX are also placed on to chrY. If you use a reference genome that contains both copies, you will not be able to call any variants in PARs with a standard pipeline. In GRCh38, some alpha satellites are placed multiple times, too. The right solution is to hard mask PARs on chrY and those extra copies of alpha repeats.
4. Not using the rCRS mitochondrial sequence.
rCRS是1981年宣布的人类线粒体DNA的剑桥参考序列(CRS)的修订版(rCRS)。储存在Genebank NCBI数据库,检索号NC_012920。
同时还有非洲(Yoruba)参考序列,非洲(Uganda)参考序列,瑞典参考序列,日本参考序列,重构智人参考序列(RSRS)
rCRS is widely used in population genetics. However, the official GRCh37 comes with a mitochondrial sequence 2bp longer than rCRS. If you want to analyze mitochondrial phylogeny, this 2bp insertion will cause troubles. GRCh38 uses rCRS.
5. Converting semi-ambiguous IUB codes to “N”.
将RYKM等简并碱基都替换成N
This is a very minor issue, though. Human chromosomal sequences contain few semi-ambiguous bases.
6. Using accession numbers instead of chromosome names.
使用检索号而非染色体名
Do you know CM000663.2 corresponds to chr1 in GRCh38?
7. Not including unplaced and unlocalized contigs.
基因组中不包括来自unlocalized和unplaced序列,导致来自这些序列的读段被强制map到其它染色体上,导致错误的variant call.
This will force reads originated from these contigs to be mapped to the chromosomal assembly and lead to false variant calls.
不同版本基因组问题简要总结:
- Alt contigs的存在→variant calling和其它分析的敏感性降低→使用ALT-aware tools
- 用Ns填充Alt contigs→造成和1相似的问题
- 包括PARs→使用standard pipeline会call不到PARs上的variants→hard mask掉chrY上的PARs
- 不使用rCRS→在分析线粒体系统发育时会遇到问题
- 用N表示所有简并碱基→不是什么大问题
- 使用Accession Number而非染色体名
- 不包括unlocalized和unplaced序列--导致false variant calls
- hg19/chromFa.tar.gz from UCSC: 1, 3, 4 and 5.
- hg38/hg38.fa.gz from UCSC: 1, 3 and 5.
- GCA_000001405.15_GRCh38_genomic.fna.gz from NCBI: 1, 3, 5 and 6.
- Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz from EnsEMBL: 3.
- Homo_sapiens.GRCh38.dna.toplevel.fa.gz from EnsEMBL: 1, 2 and 3.