PacBio Genome Assembly with MARVEL

1. Software Introduction

MARVEL is an assembler for PacBio long-read data, designed in particular for low-coverage datasets.

2. Installation

Software homepage: The MARVEL Assembler
After cloning, the directory contains a configure.ac file but no configure script. How do we install it? The fix is to generate configure with autoreconf, as described below.

2.1. Load the required environment

source /public/home/software/.bashrc
module load Autoconf/2.69
## add python 3.6 to PATH
export PATH="/public/home/zpxu/miniconda3/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/public/home/zpxu/miniconda3/lib"

Check the versions:

autoreconf --version
python --version

Then run autoreconf to generate the configure script.

2.2. Build and install

## note: this prefix is different from the git-cloned MARVEL source directory
./configure --prefix=/public/home/cotton/software/MARVEL-bin
make
make install

make install prints the following message:

---------------------------------------------------------------
Installation into /public/home/cotton/software/MARVEL-bin finished.
Don't forget to include /public/home/cotton/software/MARVEL-bin/lib.python in your PYTHONPATH.
---------------------------------------------------------------

2.3. Add to PYTHONPATH

See: Permanently add a directory to PYTHONPATH

export PYTHONPATH="$PYTHONPATH:/public/home/cotton/software/MARVEL-bin/lib.python"
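To confirm the change took effect, you can check whether the marvel package is importable. This is a quick sanity check only; the path is the one from the install message above:

```python
import importlib.util

def marvel_available():
    # True if the 'marvel' package can be imported, i.e. PYTHONPATH
    # (or sys.path) includes MARVEL's lib.python directory
    return importlib.util.find_spec("marvel") is not None

print("marvel importable:", marvel_available())
```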

3. Usage

The overall workflow is as follows:

  1. overlap
  2. patch reads
  3. overlap (again)
  4. scrubbing
  5. assembly graph construction and touring
  6. optional read correction
  7. fasta file creation
In practice the pipeline runs in two phases, the READ PATCHING PHASE and the ASSEMBLY PHASE: read patching (DBprepare.py) followed by assembly (do.py).
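The two phases can be sketched as a minimal driver script. The paths and script names are the ones used later in this post and are placeholders for your own install; with dry_run the commands are only printed:

```python
import subprocess

# placeholder path; adjust to your MARVEL install
MARVEL_SCRIPTS = "/public/home/cotton/software/MARVEL-bin/scripts"

def run_pipeline(db, reads_fasta, do_script="MARVEL_do.py", dry_run=True):
    steps = [
        # READ PATCHING PHASE: build the database and job plans
        [f"{MARVEL_SCRIPTS}/DBprepare.py", db, reads_fasta],
        # ASSEMBLY PHASE: patch reads, overlap again, scrub, tour, write fasta
        ["python", do_script],
    ]
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return steps

steps = run_pipeline("ECOL", "p6.25x.fasta")
```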

3.1. Initialize the database: DBprepare.py

/public/home/cotton/software/MARVEL-bin/scripts/DBprepare.py ECOL p6.25x.fasta

DBprepare.py is an integrated Python script that wraps roughly four MARVEL programs (FA2db, DBsplit, DBdust and HPCdaligner).

When it finishes it produces ECOL.db, two hidden files .ECOL.idx and .ECOL.bps, and two files with the .plan suffix.
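A hypothetical helper listing these expected outputs can serve as a quick sanity check after DBprepare.py finishes (file names follow the ECOL example above):

```python
def dbprepare_outputs(db):
    # files DBprepare.py is expected to leave behind for database `db`
    return [
        f"{db}.db",           # the database
        f".{db}.idx",         # hidden index file
        f".{db}.bps",         # hidden base-pair file
        f"{db}.dalign.plan",  # daligner job plan
        f"{db}.merge.plan",   # LAmerge job plan
    ]

print(dbprepare_outputs("ECOL"))
```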

3.2. Assembly: do.py

#!/usr/bin/env python

import multiprocessing
import marvel
import marvel.config
import marvel.queue

###### settings

DB = "ECOL" ## database name; change to match the name given to DBprepare.py
COVERAGE = 25 ## coverage of the dataset; adjust to your data
DB_FIX = DB + "_FIX" ## name of the database containing the patched reads
PARALLEL = multiprocessing.cpu_count() ## number of available processors

###### patch raw reads

q = marvel.queue.queue(DB, COVERAGE, PARALLEL)

###### run daligner to create initial overlaps
q.plan("{db}.dalign.plan")

###### run LAmerge to merge overlap blocks
q.plan("{db}.merge.plan")

## create quality and trim annotation (tracks) for each overlap block
q.block("{path}/LAq -b {block} {db} {db}.{block}.las")
## merge quality and trim tracks
q.single("{path}/TKmerge -d {db} q")
q.single("{path}/TKmerge -d {db} trim")

## run LAfix to patch reads based on overlaps
q.block("{path}/LAfix -g -1 {db} {db}.{block}.las {db}.{block}.fixed.fasta")
## join all fixed fasta files
q.single("!cat {db}.*.fixed.fasta > {db}.fixed.fasta")

## create a new Database of fixed reads (-j numOfThreads, -g genome size)
q.single("{path_scripts}/DBprepare.py -s 50 -r 2 -j 4 -g 4600000 {db_fixed} {db}.fixed.fasta", db_fixed = DB_FIX)

## run the commands
q.process()

########## assemble patched reads

q = marvel.queue.queue(DB_FIX, COVERAGE, PARALLEL)

###### run daligner to create overlaps
q.plan("{db}.dalign.plan")
###### run LAmerge to merge overlap blocks
q.plan("{db}.merge.plan")

###### start scrubbing pipeline

########## for larger genomes (> 100MB) LAstitch can be run with the -L option (preload reads)
########## with the -L option two passes over the overlap files are performed:
########## first to buffer all reads and a second time to stitch them
########## otherwise the random file access can make LAstitch pretty slow.
########## Another option would be to keep the whole db in cache (/dev/shm/)
q.block("{path}/LAstitch -f 50 {db} {db}.{block}.las {db}.{block}.stitch.las")

########## create quality and trim annotation (tracks) for each overlap block and merge them
q.block("{path}/LAq -s 5 -T trim0 -b {block} {db} {db}.{block}.stitch.las")
q.single("{path}/TKmerge -d {db} q")
q.single("{path}/TKmerge -d {db} trim0")

########## create a repeat annotation (tracks) for each overlap block and merge them
q.block("{path}/LArepeat -c {coverage} -b {block} {db} {db}.{block}.stitch.las")
q.single("{path}/TKmerge -d {db} repeats")

########## detects "borders" in overlaps due to bad regions within reads that were not detected
########## in LAfix. Those regions can be quite big (several Kb). If gaps within a read are
########## detected LAgap chooses the longer part of the read as valid range. The other part(s) are
########## discarded
########## option -L (see LAstitch) is also available
q.block("{path}/LAgap -t trim0 {db} {db}.{block}.stitch.las {db}.{block}.gap.las")

########## create a new trim1 track, (trim0 is kept)
q.block("{path}/LAq -s 5 -u -t trim0 -T trim1 -b {block} {db} {db}.{block}.gap.las")
q.single("{path}/TKmerge -d {db} trim1")

########## based on different filter criteria filter out: local alignments, repeat-induced alignments,
########## previously discarded alignments, ....
########## -r repeats, -t trim1 ... use repeats and trim1 track
########## -n 300 ... overlaps must be anchored by at least 300 bases (non-repeats)
########## -u 0 ... overlaps with unaligned bases according to the trim1 interval are discarded
########## -o 2000 ... overlaps shorter than 2k bases are discarded
########## -p ... purge overlaps, overlaps are not written into the output file
########## option -L (see LAstitch) is also available
q.block("{path}/LAfilter -n 300 -r repeats -t trim1 -T -o 2000 -u 0 {db} {db}.{block}.gap.las {db}.{block}.filtered.las")

########## merge all filtered overlap files into one overlap file
q.single("{path}/LAmerge -S filtered {db} {db}.filtered.las")

########## create overlap graph
q.single("{path}/OGbuild -t trim1 {db} {db}.filtered.las {db}.graphml")

########## tour the overlap graph and create contigs paths
q.single("{path_scripts}/OGtour.py -c {db} {db}.graphml")

q.single("{path}/LAcorrect -j 4 -r {db}.tour.rids {db} {db}.filtered.las {db}.corrected")
q.single("{path}/FA2db -c {db}_CORRECTED [expand:{db}.corrected.*.fasta]")

########## create contig fasta files
q.single("{path_scripts}/tour2fasta.py -c {db}_CORRECTED -t trim1 {db} {db}.tour.graphml {db}.tour.paths")

###### optional: create a layout of the overlap graph which can viewed in a Browser (svg) or Gephi (dot)
## q.single("{path}/OGlayout -R {db}.tour.graphml {db}.tour.layout.svg")
q.single("{path}/OGlayout -R {db}.tour.graphml {db}.tour.layout.dot")

q.process()
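The q.block()/q.single() calls above rely on template placeholders. A minimal sketch (not the real marvel.queue API) of how a block template expands into one command per database block, with a placeholder binary path:

```python
def expand_block_command(template, db, n_blocks, path="/opt/MARVEL/bin"):
    # substitute {path}, {db} and {block} once per database block,
    # mimicking what q.block() does before dispatching the commands
    return [
        template.format(path=path, db=db, block=block)
        for block in range(1, n_blocks + 1)
    ]

cmds = expand_block_command("{path}/LAq -b {block} {db} {db}.{block}.las", "ECOL", 3)
for c in cmds:
    print(c)
```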

4. Worked Examples

4.1. Step-by-step run

The genome assembly procedure from The axolotl genome and the evolution of key tissue formation regulators, taken from MARVEL/examples/axolotl/README on the MARVEL GitHub page (here the database initialization includes the read-patching fix step).

This file contains the steps we took to assemble the axolotl genome.


########## READ PATCHING PHASE

## create initial database
FA2db -v -x 2000 -b AXOLOTL -f axoFiles.txt

## split database into blocks
DBsplit AXOLOTL

## create dust track
DBdust AXOLOTL

## create daligner and merge plans, replace SERVER and PORT
HPCdaligner -v -t 100 -D SERVER:PORT -r1 -j16 --dal 32 --mrg 32 -o AXOLOTL AXOLOTL
########## DBprepare.py can run the four steps above ##########
## start dynamic repeat masking server on SERVER:PORT, replace PORT (optional)
DMserver -t 16 -p PORT -C AXOLOTL 40 AXOLOTL_CP
###############################################################
## AXOLOTL.daligner.plan and AXOLOTL.merge.plan are generated by DBprepare.py;
## they contain the daligner and LAmerge commands to run
## run daligner to create initial overlaps
AXOLOTL.daligner.plan

## after all daligner jobs are finished the dynamic repeat masker has to be shut down (optional)
DMctl -h HOST -p PORT shutdown

###### run LAmerge to merge overlap blocks
AXOLOTL.merge.plan
###############################################################
## create quality and trim annotation (tracks) for each overlap block
for each database block
LAq -b <block> AXOLOTL AXOLOTL.<block>.las

TKmerge -d AXOLOTL q
TKmerge -d AXOLOTL trim

## run LAfix to patch reads based on overlaps
for each database block
LAfix -c -x 2000 AXOLOTL AXOLOTL.<block>.las AXOLOTL.<block>.fixed.fasta


########## ASSEMBLY PHASE

## create a new database with the fixed reads
FA2db -v -x 2000 -c AXOLOTL_FIX AXOLOTL.*.fixed.fasta

## split database into blocks
DBsplit AXOLOTL_FIX

## combine repeat tracks maskr and maskc that were created during read patching phase
TKcombine AXOLOTL_FIX mask maskr maskc

## create daligner and merge plans, replace SERVER and PORT
HPCdaligner -v -t 100 -D SERVER:PORT -m mask -r2 -j16 --dal 32 --mrg 32 -o AXOLOTL_FIX AXOLOTL_FIX

## start dynamic repeat masking server on SERVER:PORT, replace PORT
DMserver -t 16 -p PORT -C AXOLOTL_FIX 40 AXOLOTL_CP

###### run daligner to create overlaps
AXOLOTL_FIX.dalign.plan

## after all daligner jobs are finished the dynamic repeat masker has to be shut down
DMctl -h HOST -p PORT shutdown

###### run LAmerge to merge overlap blocks
AXOLOTL_FIX.merge.plan

###### SCRUBBING PHASE

## repair alignments that prematurely stopped due to left-over errors in the reads
for each database block
LAstitch -f 50 AXOLOTL_FIX AXOLOTL_FIX.<block>.las AXOLOTL_FIX.<block>.stitch.las

## create quality and trim annotation
for each database block
LAq -T trim0 -s 5 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las

TKmerge -d AXOLOTL_FIX q
TKmerge -d AXOLOTL_FIX trim0

## create a repeat annotation
for each database block
LArepeat -c <coverage> -l 1.5 -h 2.0 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las

TKmerge -d AXOLOTL_FIX repeats

## merge duplicate & overlapping repeat annotations and the masking server output
TKcombine AXOLOTL_FIX frepeats repeats maskc maskr

## remove gaps (ie. due to chimeric reads, ...)
for each database block
LAgap -s 100 -t trim0 AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las AXOLOTL_FIX.<block>.gap.las

## recalculate the trim track based on the cleaned up gaps
for each database block
LAq -u -t trim0 -T trim1 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.gap.las

TKmerge -d AXOLOTL_FIX trim1

## filter repeat induced alignments and try to resolve repeat modules
for each database block
LAfilter -p -s 100 -n 300 -r frepeats -t trim1 -o 1000 -u 0 AXOLOTL_FIX AXOLOTL_FIX.<block>.gap.las AXOLOTL_FIX.<block>.filtered.las

## not much is left now, so we can merge everything into a single las file
LAmerge -S filtered AXOLOTL_FIX AXOLOTL_FIX.filtered.las

## create overlap graph
mkdir -p components
OGbuild -t trim1 -s -c 1 AXOLOTL_FIX AXOLOTL_FIX.filtered.las components/AXOLOTL_FIX

## tour the overlap graph and create contigs paths
for each components/*.graphml
OGtour.py -c AXOLOTL_FIX <component.graphml>

## create contig fasta files
for each components/*.paths
tour2fasta.py -t trim1 AXOLOTL_FIX <component.graphml> <component.paths>
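The "for each database block" pseudocode above can be turned into a real loop. A sketch using LAq as the example command; the block count and dry_run behaviour are assumptions for illustration:

```python
import subprocess

def run_per_block(db, n_blocks, dry_run=True):
    # run LAq once per database block, as in the README pseudocode
    cmds = []
    for block in range(1, n_blocks + 1):
        cmd = ["LAq", "-b", str(block), db, f"{db}.{block}.las"]
        cmds.append(cmd)
        if not dry_run:
            # requires the MARVEL binaries on PATH
            subprocess.run(cmd, check=True)
    return cmds

for cmd in run_per_block("AXOLOTL", 2):
    print(" ".join(cmd))
```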

4.2. All-in-one run

source /public/home/software/.bashrc
export PATH="/public/home/cotton/software/MARVEL-bin/bin:$PATH"
export PATH="/public/home/zpxu/miniconda3/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/public/home/zpxu/miniconda3/lib"

cd /public/home/zpxu/AS/results/MARVEL
fa="/public/home/zpxu/AS/PacBio/bmk_nh_filtered_subreads.fasta"

echo "1. initialize the database"
DBprepare.py AS ${fa}
cd /public/home/zpxu/AS/results/MARVEL
##sed -i "s/^/\/public\/home\/cotton\/software\/MARVEL-bin\/bin\//g" /public/home/zpxu/AS/results/MARVEL/AS.dalign.plan
##sed -i "s/^/\/public\/home\/cotton\/software\/MARVEL-bin\/bin\//g" /public/home/zpxu/AS/results/MARVEL/AS.merge.plan
python /public/home/zpxu/AS/scripts/MARVEL_do.py

5. Notes on Parameters

5.1. Thread-count parameter

The -j option sets the number of threads; it must be a power of 2.
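A small helper to validate the power-of-two constraint before launching a job (a standard bit trick, shown here as an illustration):

```python
def is_power_of_two(n):
    # n & (n - 1) clears the lowest set bit; the result is 0 only for powers of 2
    return n > 0 and (n & (n - 1)) == 0

for j in (1, 2, 3, 4, 6, 8, 16):
    print(j, is_power_of_two(j))
```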

5.2. HPCdaligner's -D SERVER:PORT

See: HPCdaligner parameter <-D host:port> #7. In short, this option is needed when assembling repeat-rich genomes; otherwise you can omit it and skip the subsequent DMserver step, like this:

HPCdaligner -v -t 100 -r1 -j16 --dal 32 --mrg 32 -o AXOLOTL AXOLOTL

5.3. DMserver's -p PORT and expected.coverage

The main purpose of the dynamic repeat masking server is to reduce the CPU and memory demands of assembling large genomes:

While preliminary calculations for computational time and storage space estimated over multiple millions of CPU hours and >2 PB of storage for one daligner run, the usage of a dynamic repeat masking server (below) reduced this dramatically to 150,000 CPU hours and 120 Tb of storage space for the complete pipeline.
