PacBio基因组组装之MARVEL

1. 软件介绍

PacBio三代数据基因组组装,针对低测序深度的组装;

2. 软件安装

软件主页:The MARVEL Assembler
clone 后发现目录下没有 configure 文件,而是configure.ac文件,怎么安装???

2.1. 加载所需环境

1
2
3
4
5
source /public/home/software/.bashrc
module load Autoconf/2.69
## added python3.6
export PATH="/public/home/zpxu/miniconda3/bin:$PATH"
export LD_LIBRARY_PATH="$PATH:/public/home/zpxu/miniconda3/lib"

版本检测

1
2
autoreconf --version
python --version

然后运行 autoreconf 来生成 configure 文件。

2.2. 安装

1
2
3
4
## 注意此目录与git clone 的 MARVEL 目录不同
./configure --prefix=/public/home/cotton/software/MARVEL-bin
make
make install

make install 后提示如下信息

1
2
3
4
---------------------------------------------------------------
Installation into /public/home/cotton/software/MARVEL-bin finished.
Don't forget to include /public/home/cotton/software/MARVEL-bin/lib.python in your PYTHONPATH.
---------------------------------------------------------------

2.3. 添加PYTHONPATH

参见:Permanently add a directory to PYTHONPATH

1
export PYTHONPATH="$PYTHONPATH:/public/home/cotton/software/MARVEL-bin/lib.python"

3. 软件使用

整体过程如下:

  1. overlap
  2. patch reads
  3. overlap (again)
  4. scrubbing
  5. assembly graph construction and touring
  6. optional read correction
  7. fasta file creation
    实际运行主要经过两个过程,READ PATCHING PHASEASSEMBLY PHASE,即reads的处理阶段(DBprepare.py)和组装阶段(do.py);

3.1. 初始化数据库: DBprepare.py

1
/public/home/cotton/software/MARVEL-bin/scripts/DBprepare.py ECOL p6.25x.fasta

DBprepare.py 实际上是一个集成的python脚本,里面包含大约4步MARVEL程序 (FA2db,DBsplit,DBdustHPCdaligner)。

运行完成将生成 ECOL.db,两个隐藏文件.ECOL.idx.ECOL.bps ,两个plan后缀的文件;

3.2. 组装: do.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
##!/usr/bin/env python

import multiprocessing
import marvel
import marvel.config
import marvel.queue

###### settings

DB = "ECOL" ## database name,需修改,和DBprepare.py中命名一样
COVERAGE = 25 ## coverage of the dataset,需修改
DB_FIX = DB + "_FIX" ## name of the database containing the patched reads
PARALLEL = multiprocessing.cpu_count() ## number of available processors

###### patch raw reads

q = marvel.queue.queue(DB, COVERAGE, PARALLEL)

###### run daligner to create initial overlaps
q.plan("{db}.dalign.plan")

###### run LAmerge to merge overlap blocks
q.plan("{db}.merge.plan")

## create quality and trim annotation (tracks) for each overlap block
q.block("{path}/LAq -b {block} {db} {db}.{block}.las")
## merge quality and trim tracks
q.single("{path}/TKmerge -d {db} q")
q.single("{path}/TKmerge -d {db} trim")

## run LAfix to patch reads based on overlaps
q.block("{path}/LAfix -g -1 {db} {db}.{block}.las {db}.{block}.fixed.fasta")
## join all fixed fasta files
q.single("!cat {db}.*.fixed.fasta > {db}.fixed.fasta")

## create a new Database of fixed reads (-j numOfThreads, -g genome size)
q.single("{path_scripts}/DBprepare.py -s 50 -r 2 -j 4 -g 4600000 {db_fixed} {db}.fixed.fasta", db_fixed = DB_FIX)

## run the commands
q.process()

########## assemble patched reads

q = marvel.queue.queue(DB_FIX, COVERAGE, PARALLEL)

###### run daligner to create overlaps
q.plan("{db}.dalign.plan")
###### run LAmerge to merge overlap blocks
q.plan("{db}.merge.plan")

###### start scrubbing pipeline

########## for larger genomes (> 100MB) LAstitch can be run with the -L option (preload reads)
########## with the -L option two passes over the overlap files are performed:
########## first to buffer all reads and a second time to stitch them
########## otherwise the random file access can make LAstitch pretty slow.
########## Another option would be, to keep the whole db in cache (/dev/shm/)
q.block("{path}/LAstitch -f 50 {db} {db}.{block}.las {db}.{block}.stitch.las")

########## create quality and trim annotation (tracks) for each overlap block and merge them
q.block("{path}/LAq -s 5 -T trim0 -b {block} {db} {db}.{block}.stitch.las")
q.single("{path}/TKmerge -d {db} q")
q.single("{path}/TKmerge -d {db} trim0")

########## create a repeat annotation (tracks) for each overlap block and merge them
q.block("{path}/LArepeat -c {coverage} -b {block} {db} {db}.{block}.stitch.las")
q.single("{path}/TKmerge -d {db} repeats")

########## detects "borders" in overlaps due to bad regions within reads that were not detected
########## in LAfix. Those regions can be quite big (several Kb). If gaps within a read are
########## detected LAgap chooses the longer part oft the read as valid range. The other part(s) are
########## discarded
########## option -L (see LAstitch) is also available
q.block("{path}/LAgap -t trim0 {db} {db}.{block}.stitch.las {db}.{block}.gap.las")

########## create a new trim1 track, (trim0 is kept)
q.block("{path}/LAq -s 5 -u -t trim0 -T trim1 -b {block} {db} {db}.{block}.gap.las")
q.single("{path}/TKmerge -d {db} trim1")

########## based on different filter critera filter out: local-alignments, repeat induced-alifnments
########## previously discarded alignments, ....
########## -r repeats, -t trim1 ... use repeats and trim1 track
########## -n 500 ... overlaps must be anchored by at least 500 bases (non-repeats)
########## -u 0 ... overlaps with unaligned bases according to the trim1 interval are discarded
########## -o 2000 ... overlaps shorter than 2k bases are discarded
########## -p ... purge overlaps, overlaps are not written into the output file
########## option -L (see LAstitch) is also available
q.block("{path}/LAfilter -n 300 -r repeats -t trim1 -T -o 2000 -u 0 {db} {db}.{block}.gap.las {db}.{block}.filtered.las")

########## merge all filtered overlap files into one overlap file
q.single("{path}/LAmerge -S filtered {db} {db}.filtered.las")

########## create overlap graph
q.single("{path}/OGbuild -t trim1 {db} {db}.filtered.las {db}.graphml")

########## tour the overlap graph and create contigs paths
q.single("{path_scripts}/OGtour.py -c {db} {db}.graphml")

q.single("{path}/LAcorrect -j 4 -r {db}.tour.rids {db} {db}.filtered.las {db}.corrected")
q.single("{path}/FA2db -c {db}_CORRECTED [expand:{db}.corrected.*.fasta]")

########## create contig fasta files
q.single("{path_scripts}/tour2fasta.py -c {db}_CORRECTED -t trim1 {db} {db}.tour.graphml {db}.tour.paths")

###### optional: create a layout of the overlap graph which can viewed in a Browser (svg) or Gephi (dot)
## q.single("{path}/OGlayout -R {db}.tour.graphml {db}.tour.layout.svg")
q.single("{path}/OGlayout -R {db}.tour.graphml {db}.tour.layout.dot")

q.process()

4. 软件运行实例

4.1. 分步运行

The axolotl genome and the evolution of key tissue formation regulators 基因组组装过程,来源于MARVEL的Github主页的 MARVEL/examples/axolotl/README (初始化数据库时经过Fix的修正过程)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
This file contains the steps we took to assemble the axolotl genome.


########## READ PATCHING PHASE

## create initial database
FA2db -v -x 2000 -b AXOLOTL -f axoFiles.txt

## split database into blocks
DBsplit AXOLOTL

## create dust track
DBdust AXOLOTL

## create daligner and merge plans, replace SERVER and PORT
HPCdaligner -v -t 100 -D SERVER:PORT -r1 -j16 --dal 32 --mrg 32 -o AXOLOTL AXOLOTL
################################################################## DBprepare.py 脚本可运行上诉4步 ######################################################################################################
## start dynamic repeat masking server on SERVER:PORT, replace PORT (可跳过)
DMserver -t 16 -p PORT -C AXOLOTL 40 AXOLOTL_CP
################################################################################################################################################################################################################################
## AXOLOTL.daligner.plan 和 AXOLOTL.merge.plan 是 DBprepare.py 生成的结果,内容是一些运行LAmerge和daligner程序的命令
## run daligner to create initial overlaps
AXOLOTL.daligner.plan

## after all daligner jobs are finshied the dynamic repeat masker has to be shut down (可跳过)
DMctl -h HOST -p PORT shutdown

###### run LAmerge to merge overlap blocks
AXOLOTL.merge.plan
##############################################################################################################################################################################
## create quality and trim annotation (tracks) for each overlap block for each database block
LAq -b <block> AXOLOTL AXOLOTL.<block>.las

TKmerge -d AXOLOTL q
TKmerge -d AXOLOTL trim

## run LAfix to patch reads based on overlaps for each database block
LAfix -c -x 2000 AXOLOTL AXOLOTL.<block>.las AXOLOTL.<block>.fixed.fasta


########## ASSEMBLY PHASE

## create a new database with the fixed reads
FA2db -v -x 2000 -c AXOLOTL_FIX AXOLOTL.*.fixed.fasta

## split database into blocks
DBsplit AXOLOTL_FIX

## combine repeat tracks maskr and maskc that were created during read patching phase
TKcombine AXOLOTL_FIX mask maskr maskc

## create daligner and merge plans, replace SERVER and PORT
HPCdaligner -v -t 100 -D SERVER:PORT -m mask -r2 -j16 --dal 32 --mrg 32 -o AXOLOTL_FIX AXOLOTL_FIX

## start dynamic repeat masking server on SERVER:PORT, replace PORT
DMserver -t 16 -p PORT -C AXOLOTL_FIX 40 AXOLOTL_CP

###### run daligner to create overlaps
AXOLOTL_FIX.dalign.plan

## after all daligner jobs are finshied the dynamic repeat masker has to be shut down
DMctl -h HOST -p PORT shutdown

###### run LAmerge to merge overlap blocks
AXOLOTL_FIX.merge.plan

###### SCRUBBING PHASE

## repair alignments that prematurely stopped due to left-over errors in the reads for each database block
LAstitch -f 50 AXOLOTL_FIX AXOLOTL_FIX.<block>.las AXOLOTL_FIX.<block>.stitch.las

## create quality and trim annotation for each database block
LAq -T trim0 -s 5 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las

TKmerge -d AXOLOTL_FIX q
TKmerge -d AXOLOTL_FIX trim0

## create a repeat annotation for each database block
LArepeat -c <coverage> -l 1.5 -h 2.0 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las

TKmerge -d AXOLOTL_FIX repeats

## merge duplicate & overlapping annotation repeat annotation and masking server output
TKcombine {db} frepeats repeats maskc maskr

## remove gaps (ie. due to chimeric reads, ...) for each database block
LAgap -s 100 -t trim0 AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las AXOLOTL_FIX.<block>.gap.las

## recalculate the trim track based on the cleaned up gaps for each database block
LAq -u -t trim0 -T trim1 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.gap.las

TKmerge -d AXOLOTL_FIX trim1

## filter repeat induced alignments and try to resolve repeat modules for each database block
LAfilter -p -s 100 -n 300 -r frepeats -t trim1 -o 1000 -u 0 AXOLOTL_FIX AXOLOTL_FIX.<block>.gap.las AXOLOTL_FIX.<block>.filtered.las

## not much is left now, so we can merge everything into a single las file
LAmerge -S filtered AXOLOTL_FIX AXOLOTL_FIX.filtered.las

## create overlap graph
mkdir -p components
OGbuild -t trim1 -s -c 1 AXOLOTL_FIX AXOLOTL_FIX.filtered.las components/AXOLOTL_FIX

## tour the overlap graph and create contigs paths for each components/*.graphml
OGtour.py -c AXOLOTL_FIX <component.graphml>

## create contig fasta files for each components/*.paths
tour2fasta.py -t trim1 AXOLOTL_FIX <component.graphml> <component.paths>

4.2. 集合运行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
source /public/home/software/.bashrc
export PATH="/public/home/cotton/software/MARVEL-bin/bin:$PATH"
export PATH="/public/home/zpxu/miniconda3/bin:$PATH"
export LD_LIBRARY_PATH="$PATH:/public/home/zpxu/miniconda3/lib"

cd /public/home/zpxu/AS/results/MARVEL
fa="/public/home/zpxu/AS/PacBio/bmk_nh_filtered_subreads.fasta"

echo "1. initially the database"
DBprepare.py AS ${fa}
cd /public/home/zpxu/AS/results/MARVEL
##sed -i "s/^/\/public\/home\/cotton\/software\/MARVEL-bin\/bin\//g" /public/home/zpxu/AS/results/MARVEL/AS.dalign.plan
##sed -i "s/^/\/public\/home\/cotton\/software\/MARVEL-bin\/bin\//g" /public/home/zpxu/AS/results/MARVEL/AS.merge.plan
python /public/home/zpxu/AS/scripts/MARVEL_do.py

5. 疑问参数

5.1. 线程数目参数

-j参数表示线程数,需为2的幂次。

5.2. HPCdaligner 的 -D SERVER:PORT

解释见:HPCdaligner parameter <-D host:port> ##7,即重复序列多的基因组组装需要设置此参数,如果不是则不需要设置此参数和运行随后的 DMserver 步骤。如下即可:

1
HPCdaligner -v -t 100 -r1 -j16 --dal 32 --mrg 32 -o AXOLOTL AXOLOTL

5.3. DMserver 的 -p PORT 和 expected.coverage

Dynamic repeat Masking server主要目的在于降低大基因组组装对于计算机CPU和内存的需求,

While preliminary calculations for computational time and storage space estimated over multiple millions of CPU hours and >2 PB of storage for one daligner run, the usage of a dynamic repeat masking server (below) reduced this dramatically to 150,000 CPU hours and 120 Tb of storage space for the complete pipeline.

tiramisutes wechat
欢迎关注