De novo prediction of genomic repeats with RepeatModeler

Introduction

Official site: http://www.repeatmasker.org/RepeatModeler/

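RepeatMasker finds repeats by aligning the genome against databases of known repeat sequences, whereas RepeatModeler annotates repeats de novo from their structural features, so it can discover repeats that are specific to a species.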

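RepeatMasker relies on the RepBase and Dfam databases to mask repeats, and for non-model organisms the coverage of these two databases is limited. The usual approach is to predict repeats de novo with RepeatModeler and then pass the result to RepeatMasker as its repeat library.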

Installing dependencies

RepeatModeler depends on RepeatMasker, TRF, and RMBlast; for those, see my previous post, "Masking genomic repeats with RepeatMasker". Here we only install the remaining dependencies.

RECON - De Novo Repeat Finder

cd ~/software/
wget http://www.repeatmasker.org/RepeatModeler/RECON-1.08.tar.gz
tar zxf RECON-1.08.tar.gz
cd RECON-1.08/src/
make && make install
rm ~/software/RECON-1.08.tar.gz

RepeatScout - De Novo Repeat Finder

cd ~/software/
wget http://www.repeatmasker.org/RepeatScout-1.0.6.tar.gz
tar zxf RepeatScout-1.0.6.tar.gz
cd RepeatScout-1.0.6/
make
rm ~/software/RepeatScout-1.0.6.tar.gz

To run the LTR structural search, the following software is also required.

LtrHarvest - The LtrHarvest program is part of the GenomeTools suite

cd ~/software/
wget http://genometools.org/pub/binary_distributions/gt-1.6.2-Linux_x86_64-64bit-complete.tar.gz
tar zxf gt-1.6.2-Linux_x86_64-64bit-complete.tar.gz
rm ~/software/gt-1.6.2-Linux_x86_64-64bit-complete.tar.gz

CD-HIT - A sequence clustering package

cd ~/software/
wget https://github.com/weizhongli/cdhit/releases/download/V4.8.1/cd-hit-v4.8.1-2019-0228.tar.gz
tar zxf cd-hit-v4.8.1-2019-0228.tar.gz
cd cd-hit-v4.8.1-2019-0228/
make
cd cd-hit-auxtools/
make
rm ~/software/cd-hit-v4.8.1-2019-0228.tar.gz

Ltr_retriever - A LTR discovery post-processing and filtering tool

cd ~/software/
wget https://github.com/oushujun/LTR_retriever/archive/v2.9.0.tar.gz
tar zxf v2.9.0.tar.gz
rm ~/software/v2.9.0.tar.gz

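LTR_retriever depends on TRF, BLAST+, CD-HIT, HMMER, and RepeatMasker. CD-HIT is installed earlier in this post; for TRF, HMMER, and RepeatMasker, see the previous post, "Masking genomic repeats with RepeatMasker"; for BLAST+, see another post, "Local BLAST".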

Edit the paths file in ~/software/LTR_retriever-2.9.0/ and write the following content to specify where these dependencies are installed.

BLAST+= /home/chenwen/software/ncbi-blast-2.12.0+/bin # a path that contains makeblastdb, blastn, blastx
RepeatMasker= /home/chenwen/software/RepeatMasker # a path that contains RepeatMasker
HMMER= /home/chenwen/software/hmmer-3.2.1/bin # a path that contains hmmsearch
CDHIT= /home/chenwen/software/cd-hit-v4.8.1-2019-0228 # a path that contains cd-hit-est (preferred). CDHIT and BLAST are replaceable
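A wrong directory in the paths file tends to surface only once LTR_retriever is actually run, so it can help to check the entries up front. The sketch below is a hypothetical helper (check_tool_dir is not part of LTR_retriever) that only tests whether the executables named in the comments of the paths file exist in the given directories.

```shell
# check_tool_dir DIR EXE...
# Hypothetical helper: reports every EXE that is missing or not executable in DIR.
# Returns 0 if all of them are present, 1 otherwise.
check_tool_dir() {
    dir=$1
    shift
    rc=0
    for exe in "$@"; do
        if [ ! -x "$dir/$exe" ]; then
            echo "missing: $dir/$exe" >&2
            rc=1
        fi
    done
    return $rc
}

# Example calls, using the directories from the paths file above:
# check_tool_dir /home/chenwen/software/ncbi-blast-2.12.0+/bin makeblastdb blastn blastx
# check_tool_dir /home/chenwen/software/RepeatMasker RepeatMasker
# check_tool_dir /home/chenwen/software/hmmer-3.2.1/bin hmmsearch
# check_tool_dir /home/chenwen/software/cd-hit-v4.8.1-2019-0228 cd-hit-est
```

The helper does not run the tools, so it cannot catch a broken build; it only catches typos and missing files.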

MAFFT - A multiple sequence alignment program.

cd ~/software/
wget https://mafft.cbrc.jp/alignment/software/mafft-7.487-with-extensions-src.tgz
tar zxf mafft-7.487-with-extensions-src.tgz

Edit the Makefile in ~/software/mafft-7.487-with-extensions/core/, changing PREFIX = /usr/local to PREFIX = /home/chenwen/software/mafft-7.487, then run the following commands.
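If you prefer not to open an editor, the PREFIX change can probably be scripted. The sketch below is a hypothetical helper (set_makefile_prefix is my own name) that assumes the Makefile contains a line starting with "PREFIX = " in the first column, as the text above describes, and that the new prefix contains no "|" or "&" characters.

```shell
# set_makefile_prefix MAKEFILE NEW_PREFIX
# Hypothetical helper: rewrites the "PREFIX = ..." line of MAKEFILE in place.
set_makefile_prefix() {
    makefile=$1
    new_prefix=$2
    tmp_file="$makefile.tmp.$$"
    # '|' is used as the sed delimiter because the prefix contains '/'.
    sed "s|^PREFIX = .*|PREFIX = $new_prefix|" "$makefile" > "$tmp_file" &&
        mv "$tmp_file" "$makefile"
}

# Example:
# set_makefile_prefix ~/software/mafft-7.487-with-extensions/core/Makefile "$HOME/software/mafft-7.487"
```

Once the Makefile has been changed, by hand or by script, the build continues with the commands below.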

cd ~/software/mafft-7.487-with-extensions/core/
make clean
make
make install
rm -rf ~/software/mafft-7.487-with-extensions*

Ninja - A tool for large-scale neighbor-joining phylogeny inference and clustering

cd ~/software/
wget https://github.com/TravisWheelerLab/NINJA/archive/0.95-cluster_only.tar.gz
tar zxf 0.95-cluster_only.tar.gz
cd NINJA-0.95-cluster_only/NINJA/
make all
cp Ninja ~/software/
rm -rf ~/software/NINJA-0.95-cluster_only
rm ~/software/0.95-cluster_only.tar.gz

UCSC TwoBit Tools

mkdir -p ~/software/kent/
cd ~/software/kent/
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitInfo
chmod +x *

Installing and configuring RepeatModeler

cd ~/software/
wget http://www.repeatmasker.org/RepeatModeler/RepeatModeler-2.0.2a.tar.gz
tar zxf RepeatModeler-2.0.2a.tar.gz
rm ~/software/RepeatModeler-2.0.2a.tar.gz
cd RepeatModeler-2.0.2a/
perl ./configure

Set the following paths when prompted.

REPEATMASKER_DIR /home/chenwen/software/RepeatMasker
RECON_DIR /home/chenwen/software/RECON-1.08/bin
RSCOUT_DIR /home/chenwen/software/RepeatScout-1.0.6
TRF_PRGM /home/chenwen/software/trf
CDHIT_DIR /home/chenwen/software/cd-hit-v4.8.1-2019-0228
UCSCTOOLS_DIR /home/chenwen/software/kent
RMBLAST_DIR /home/chenwen/software/rmblast-2.11.0/bin
ABBLAST_DIR /home/chenwen/software/ab-blast-20200317-linux-x64
GENOMETOOLS_DIR /home/chenwen/software/gt-1.6.2-Linux_x86_64-64bit-complete/bin
LTR_RETRIEVER_DIR /home/chenwen/software/LTR_retriever-2.9.0
MAFFT_DIR /home/chenwen/software/mafft-7.487/bin
NINJA_DIR /home/chenwen/software

If the configuration succeeds, the following message appears.

Congratulations!  RepeatModeler is now ready to use.

Add RepeatModeler to the PATH environment variable.

echo "export PATH=$HOME/software/RepeatModeler-2.0.2a:\$PATH" >> ~/.bashrc
source ~/.bashrc
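To confirm that the PATH change took effect in the current shell, one option is a small check like the one below. missing_cmds is a hypothetical helper name; it relies only on the standard command -v builtin.

```shell
# missing_cmds CMD...
# Hypothetical helper: prints each CMD that cannot be found on PATH, one per line.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example: prints nothing if both programs used later in this post are found.
# missing_cmds RepeatModeler BuildDatabase
```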

Running RepeatModeler

Here we use the common carp genome downloaded from Ensembl as an example.

mkdir ~/test_RepeatModeler
cd ~/test_RepeatModeler/
wget http://ftp.ensembl.org/pub/release-104/fasta/cyprinus_carpio/dna/Cyprinus_carpio.common_carp_genome.dna.toplevel.fa.gz
gunzip Cyprinus_carpio.common_carp_genome.dna.toplevel.fa.gz
mv Cyprinus_carpio.common_carp_genome.dna.toplevel.fa Cyprinus_carpio.fa
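Before starting a run that takes days, it may be worth confirming that the download and decompression produced a complete FASTA file. The following sketch uses a hypothetical helper, fasta_stats, that counts sequences and bases with awk; the totals can then be compared against the assembly statistics published by Ensembl.

```shell
# fasta_stats FASTA
# Hypothetical helper: prints "<number of sequences> <total bases>"
# for an uncompressed FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }                 # header line: count one sequence
        {
            gsub(/[ \t\r]/, "")            # drop stray whitespace and carriage returns
            bases += length($0)            # sequence line: add its length
        }
        # %.0f avoids integer overflow in some awk implementations for large genomes
        END { printf "%d %.0f\n", n, bases }
    ' "$1"
}

# Example:
# fasta_stats Cyprinus_carpio.fa
```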

Building the database

cd ~/test_RepeatModeler/
mkdir db
BuildDatabase -name db/Cyprinus_carpio Cyprinus_carpio.fa

Parameters

-name name of the database

Running RepeatModeler

cd ~/test_RepeatModeler/
nohup RepeatModeler -database db/Cyprinus_carpio -pa 12 -LTRStruct >& run.out &

Parameters

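-database name of the database, the same as in the previous step
-pa number of parallel search jobs, used here as the thread count
-LTRStruct enable the LTR structural search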
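As far as I recall, the help text of the RepeatModeler 2.0.x series describes -pa as the number of parallel search jobs, with each RMBlast job using up to 4 cores; please treat that figure as an assumption and check RepeatModeler -h on your own installation. Under that assumption, a value that roughly matches the machine's core count could be derived as sketched below (jobs_for_cores is a hypothetical helper).

```shell
# jobs_for_cores CORES [CORES_PER_JOB]
# Hypothetical helper: number of parallel jobs that fit into CORES,
# assuming each job uses CORES_PER_JOB cores (default 4, an assumption).
# Never returns less than 1.
jobs_for_cores() {
    cores=$1
    per_job=${2:-4}
    n_jobs=$((cores / per_job))
    [ "$n_jobs" -ge 1 ] || n_jobs=1
    echo "$n_jobs"
}

# Example: on a 12-thread machine this prints 3.
# jobs_for_cores 12
```

The run in this post used -pa 12 on a 12-thread CPU, which also works; a larger value mainly means the jobs compete for the same cores.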

RepeatModeler's search engine now supports only RMBlast. If you pass -engine abblast, RepeatModeler prints the warning WARNING: "-engine abblast" is deprecated, this verison of RepeatModeler uses rmblast only. and then carries on with RMBlast.

Interpreting the results

Let's look at the end of the log file run.out.

The common carp genome is about 1.7 Gb, and RepeatModeler ran for about 78 hours on my i5 10400 machine (6 cores, 12 threads).

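RepeatModeler creates a folder with a name like RM_1208021.MonAug20735032021. This is the working directory; it holds intermediate results and can be deleted. There are two main results: db/Cyprinus_carpio-families.fa can be used as a library for RepeatMasker to mask repeats, and db/Cyprinus_carpio-families.stk is in the Dfam-compatible Stockholm format and can be uploaded to the Dfam database.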
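A quick way to get a feel for the families file is to tally the consensus sequences by classification. The sketch below assumes the header layout that RepeatModeler normally writes, ">name#Class/Subclass" followed by optional free text (for example ">rnd-1_family-0#LINE/L2 ..."); count_by_class is a hypothetical helper, and headers without a "#" are reported as Unclassified.

```shell
# count_by_class FAMILIES_FA
# Hypothetical helper: prints "<count><TAB><classification>" per class,
# most frequent first. Assumes headers of the form ">name#Class/Subclass ...".
count_by_class() {
    awk '
        /^>/ {
            cls = "Unclassified"
            header = $1                     # first token, e.g. ">rnd-1_family-0#LINE/L2"
            pos = index(header, "#")
            if (pos > 0) cls = substr(header, pos + 1)
            count[cls]++
        }
        END { for (c in count) print count[c] "\t" c }
    ' "$1" | LC_ALL=C sort -k1,1nr -k2,2
}

# Example:
# count_by_class db/Cyprinus_carpio-families.fa
```

A large share of "Unknown" entries is common for de novo libraries and does not by itself indicate a failed run.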

Running RepeatMasker

We still need to take the Cyprinus_carpio-families.fa generated by RepeatModeler and use RepeatMasker to mask the repeats in the genome.

cd ~/test_RepeatModeler/
RepeatMasker Cyprinus_carpio.fa -lib db/Cyprinus_carpio-families.fa -e rmblast -xsmall -s -gff -pa 12

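Using -lib overrides -species, so there is no need to specify -species as well. With -lib db/Cyprinus_carpio-families.fa, 40.02% of the genome was masked as repeats, while -species 'Cyprinus carpio' masked 39.62%. The two figures are close, probably because the repeats in Dfam already cover common carp. For some non-model organisms, RepeatMasker alone masks only about 10% of the genome as repeats, and in those cases RepeatModeler is really necessary.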
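The masked percentage can also be estimated directly from the soft-masked output. With -xsmall, RepeatMasker writes repeats in lowercase to a file named after the input with a .masked suffix. The sketch below uses a hypothetical helper, masked_percent, and assumes the input genome contained no lowercase bases to begin with; the result may differ slightly from the figure in RepeatMasker's own .tbl summary.

```shell
# masked_percent MASKED_FASTA
# Hypothetical helper: prints the percentage of lowercase (soft-masked) bases.
masked_percent() {
    LC_ALL=C awk '
        /^>/ { next }                       # skip header lines
        {
            total += length($0)             # count all bases before modifying the line
            masked += gsub(/[a-z]/, "")     # gsub returns how many lowercase letters it removed
        }
        END { if (total > 0) printf "%.2f\n", 100 * masked / total }
    ' "$1"
}

# Example:
# masked_percent Cyprinus_carpio.fa.masked
```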