The spec file is an optional input to the runCA executive that launches the Celera Assembler pipeline. A spec file provides a convenient way to generate assemblies while documenting their parameters faithfully. The use of spec files is STRONGLY recommended.
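Spec file parameters explained
A PBcR hybrid assembly requires two spec configuration files: pacbio.spec (error correction) and asm.spec (assembly). Both files contain algorithm-specific parameters and hardware-related parameters. The algorithm parameters can usually be omitted (the software defaults are then used), but the hardware parameters need to be adjusted to the machine actually in use.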
All parameters take the form option = value. For boolean options, the value is 1 for true and 0 for false.
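To make the option = value format concrete, here is a minimal Python sketch that reads such lines into a dictionary. The helpers parse_spec and as_bool are hypothetical names, not part of Celera Assembler, and treating "#" as a comment marker is an assumption of this sketch.

```python
def parse_spec(text):
    """Parse 'option = value' lines into a dict of strings.

    Assumption of this sketch: blank lines and anything after '#' are ignored.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an option line; skipped in this sketch
        options[key.strip()] = value.strip()
    return options


def as_bool(value):
    """Interpret a boolean option: 1 is true, 0 is false."""
    if value not in ("0", "1"):
        raise ValueError(f"expected 0 or 1, got {value!r}")
    return value == "1"


example = """
merSize=16
useGrid = 1
merylMemory = -segments 4 -threads 4
"""
spec = parse_spec(example)
print(spec["merSize"], as_bool(spec["useGrid"]))
```

Values are kept as strings on purpose, since options such as merylMemory carry a whole argument string rather than a single number.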
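Global parameters
Error Rates
There are five configurable error rates. An 'error rate' is a fraction of the overlap, ranging from 0.0 to 0.4, while an 'error limit' is an absolute number of errors with no restriction on its range. An overlap is used as long as it falls below either the 'error rate' or the 'error limit' threshold. For example, take a 100-base overlap with 2% error, that is, 2 errors. With utgErrorRate=0.015 and utgErrorLimit=2.5, the overlap exceeds the rate but stays under the limit, so it is still used in unitigging. The error rates must satisfy utg ≤ ovl ≤ cns ≤ cgw. Usually, ovl = cns.
The error rates for the unitigger are more complicated. Overlaps above the error rate are not used in unitig construction, and each unitigger algorithm uses a different set of error rate settings: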
1>utg uses utgErrorRate.
2>bog uses utgErrorRate and utgErrorLimit.
3>bogart uses utgGraphErrorRate, utgGraphErrorLimit, utgMergeErrorRate and utgMergeErrorLimit.
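The rate-or-limit rule described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and treating the thresholds as inclusive (<=) is an assumption, since the text only says "below".

```python
def overlap_is_used(overlap_length, error_fraction, error_rate, error_limit):
    """Return True if the overlap passes either threshold.

    error_fraction and error_rate are fractions (0.02 means 2%);
    error_limit is an absolute number of errors.
    """
    errors = error_fraction * overlap_length
    return error_fraction <= error_rate or errors <= error_limit


def rates_are_ordered(utg, ovl, cns, cgw):
    """Check the required ordering utg <= ovl <= cns <= cgw."""
    return utg <= ovl <= cns <= cgw


# The example from the text: 100 bases at 2% error, rate 0.015, limit 2.5.
print(overlap_is_used(100, 0.02, 0.015, 2.5))
# The same error fraction on a 1000-base overlap means 20 errors.
print(overlap_is_used(1000, 0.02, 0.015, 2.5))
```

The second call shows why the limit mostly matters for short overlaps: on long overlaps the absolute error count quickly exceeds it, and only the rate applies.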
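Minimum Fragment Length and Minimum Overlap Length
Fragments shorter than the minimum length are discarded by gatekeeper, and overlaps shorter than the minimum length are not computed.
Stopping runCA Early
runCA can be stopped early, either before a given stage starts or after it finishes (the stopBefore and stopAfter options, respectively). In the lists below, the first group of stage names (meryl through terminator) stops before the stage, and the second group (initialStoreBuilding through ctgcns) stops after it.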
meryl: Stop before computing mer histograms.
initialTrim: Stop before the OBT initial quality trim.
deDuplication: Stop before the OBT de-duplication.
finalTrimming: Stop before the OBT trim point merge.
chimeraDetection: Stop before the OBT chimera detection.
classifyMates: Stop before de-novo classification.
unitigger: Stop before unitigger.
scaffolder: Stop before the scaffolding stage starts.
CGW: Stop before the CGW program starts.
eCR: Stop before the extend clear ranges program starts. extendClearRanges is an alias for this.
eCRPartition: Stop before partitioning for extend clear ranges. extendClearRangesPartition is an alias for this.
terminator: Stop before terminator.
initialStoreBuilding: Stop after the fragment and gatekeeper stores are created.
meryl: Stop after mer counts are generated.
overlapBasedTrimming: Stop after the Overlap Based Trimming algorithm has updated the clear ranges. OBT is an alias for this.
overlapper: Stop after the overlapper finishes, and the overlap store is created.
classifyMates: Stop after de-novo classification.
unitigger: Stop after unitigs are constructed, but before consensus starts.
utgcns: Stop after unitig consensus finishes; consensusAfterUnitigger is an alias for this.
scaffolder: Stop after all stages of scaffolding are finished.
ctgcns: Stop after contig consensus finishes; consensusAfterScaffolder is an alias for this.
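Because several stage names have aliases, a small lookup can help when generating spec files from scripts. The Python sketch below only encodes the two groups of names and the aliases listed above; canonical_stage is a hypothetical helper, not something runCA provides.

```python
# Stage names from the "stop before" group listed above.
STOP_BEFORE = {
    "meryl", "initialTrim", "deDuplication", "finalTrimming",
    "chimeraDetection", "classifyMates", "unitigger", "scaffolder",
    "CGW", "eCR", "eCRPartition", "terminator",
}
# Stage names from the "stop after" group listed above.
STOP_AFTER = {
    "initialStoreBuilding", "meryl", "overlapBasedTrimming", "overlapper",
    "classifyMates", "unitigger", "utgcns", "scaffolder", "ctgcns",
}
# Aliases mentioned in the lists, mapped to the name used there.
ALIASES = {
    "extendClearRanges": "eCR",
    "extendClearRangesPartition": "eCRPartition",
    "OBT": "overlapBasedTrimming",
    "consensusAfterUnitigger": "utgcns",
    "consensusAfterScaffolder": "ctgcns",
}


def canonical_stage(name, allowed):
    """Resolve an alias and check the stage against one of the groups."""
    stage = ALIASES.get(name, name)
    if stage not in allowed:
        raise ValueError(f"{name!r} is not a valid stage for this option")
    return stage


print(canonical_stage("OBT", STOP_AFTER))
print(canonical_stage("extendClearRanges", STOP_BEFORE))
```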
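Grid Engine Options
Grid computing (useGrid = 1) requires specific support from the cluster. If the run fails with the error "qsub: script file 'smp' cannot be loaded - No such file or directory", your current cluster environment does not support the grid; for details see the post on parallel computing, distributed computing, cluster computing and cloud computing.
Local parameters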
Gatekeeper
The standard deviation must satisfy 0.1 mean < std.dev. < 1/3 mean; if it falls outside this range, it is reset to 0.1 * mean.
Fragment Trimming
Overlapper
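Every pair of fragments is compared to determine whether the two overlap.
For small assemblies, the block size can be derived from the number of parallel jobs wanted: for example, for 16 jobs, divide the number of fragments by 4.
For large assemblies, it is better to control the number of jobs with larger ovlRefBlockSize and ovlHashBlockLength values.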
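The small-assembly rule can be written as a short calculation. The sketch below assumes that the divisor is the square root of the job count, which is what the "16 jobs, divide by 4" example suggests (4 blocks compared against 4 blocks gives 16 jobs). That interpretation and the function name block_size_for_jobs are assumptions, not something stated in the Celera Assembler documentation quoted here.

```python
import math


def block_size_for_jobs(num_fragments, jobs):
    """Fragments per block so that blocks x blocks comparisons give `jobs`.

    Assumption: the job count is a perfect square (16 -> 4 blocks per side).
    """
    per_side = math.isqrt(jobs)
    if per_side * per_side != jobs:
        raise ValueError("this sketch only handles perfect-square job counts")
    # Round up so that every fragment lands in some block.
    return math.ceil(num_fragments / per_side)


# One million fragments split for 16 overlap jobs.
print(block_size_for_jobs(1_000_000, 16))
```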
OVL Overlapper
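How should ovlHashBits and ovlHashBlockLength be chosen together? It depends on the hardware actually in use. Suppose the machine has 8 GB of memory.
1> Set ovlHashBits=25. According to the memory table for ovlHashBits, loading this hash table takes close to 7 GB. Leaving 1/2 GB for the operating system, only about 500 MB remains for loading sequence data, so ovlHashBlockLength can be at most 50,000,000. A 25-bit hash table can hold 704,643,072 k-mers, but only 50,000,000 k-mers (one k-mer per base of sequence) can be loaded, so this setting wastes memory.
2> Set ovlHashBits=24. The hash table takes about 3.5 GB, leaving 3 GB for loading sequence, so ovlHashBlockLength can be as large as 300,000,000. A 24-bit hash table can hold 352,321,536 k-mers, so this configuration is more reasonable.
The overlap job log files (0-overlaptrim-overlap/#######.out and 1-overlapper/######.out) help in choosing suitable values; they contain lines such as: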
HASH LOADING STOPPED: strings 38020 out of 38020 max.
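In this log, 15,487,424 bases of sequence were loaded, and only 4,435,417 of the 66,060,288 entries the hash table can hold were used. This means we can either increase ovlHashBlockLength (to load more sequence) or decrease ovlHashBits (to use less memory).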
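The memory reasoning above is simple arithmetic and can be checked with a short Python sketch. The capacities are the figures quoted in the text (the 24-bit value is half of the 25-bit one), and the cost of 10 bytes per base is an assumption inferred from "500 MB holds 50,000,000 bases"; the function names are hypothetical.

```python
# k-mer capacity per ovlHashBits value, as quoted in the text.
HASH_CAPACITY = {24: 352_321_536, 25: 704_643_072}

# Assumption inferred from the text: 500 MB of memory holds 50,000,000 bases.
BYTES_PER_BASE = 10


def max_block_length(sequence_memory_bytes):
    """Largest ovlHashBlockLength that fits in the memory left for sequence."""
    return sequence_memory_bytes // BYTES_PER_BASE


def fill_fraction(kmers_loaded, capacity):
    """Fraction of the hash table that is actually used."""
    return kmers_loaded / capacity


# The two scenarios from the text: memory left for sequence at 25 and 24 bits.
for bits, seq_mem in ((25, 500_000_000), (24, 3_000_000_000)):
    length = max_block_length(seq_mem)
    used = fill_fraction(length, HASH_CAPACITY[bits])
    print(bits, length, f"{used:.1%}")

# The log example: 4,435,417 entries used out of 66,060,288.
print(f"{fill_fraction(4_435_417, 66_060_288):.1%}")
```

A low fill fraction, as in the 25-bit scenario and in the log example, is the signal to either load more sequence per block or shrink the hash table.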
MER Overlapper
The mer overlapper also uses the classic overlapper parameters obtMerSize and ovlMerSize.
Meryl
merylMemory = -segments 4 -threads 4
Fragment Error Correction
Unitigger
Scaffolder
The scaffold module is called CGW (chunk graph walker). It builds contigs and scaffolds from unitigs and mate pairs.
Consensus
Consensus is computed after the unitigger and again after the scaffolder.
Terminator
Unitig Repeat/Unique Toggling
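Celera Assembler uses a Poisson distribution to classify unitigs as unique or repeat. Because of coverage bias and truncation effects, this classification occasionally labels unique unitigs as repeats. To avoid misassembly, repeat unitigs are not trusted during assembly. Unitig Repeat/Unique Toggling lets Celera Assembler correct these "repeat" unitigs and then assemble again. This process creates a 10-toggledAsm directory, and the final assembly results are in the 10-toggledAsm/9-terminator directory.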
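To illustrate what a Poisson-based unique/repeat call looks like, the Python sketch below compares how well a unitig's read count fits a single-copy model against a two-copy model. It is a generic log-odds illustration, not Celera Assembler's actual code: the function name, its parameters, and the choice of a two-copy alternative are all assumptions made for this example.

```python
import math


def unique_log_odds(unitig_length, reads_in_unitig, total_reads, genome_size):
    """Log-odds that a unitig is single-copy rather than two-copy.

    Under a Poisson model, a single-copy unitig is expected to contain
    lam = unitig_length * total_reads / genome_size reads, and a two-copy
    (collapsed repeat) unitig 2 * lam. The log ratio of the two Poisson
    probabilities for k observed reads simplifies to lam - k * ln(2).
    """
    lam = unitig_length * total_reads / genome_size
    return lam - reads_in_unitig * math.log(2)


# Hypothetical numbers: 10 kb unitig, 1M reads, 100 Mb genome (lam = 100).
print(unique_log_odds(10_000, 100, 1_000_000, 100_000_000))  # positive: unique
print(unique_log_odds(10_000, 200, 1_000_000, 100_000_000))  # negative: repeat
```

The sketch also shows where the misclassification comes from: a unique unitig in a region with unusually high coverage collects more reads than expected, which pushes its score toward the repeat side.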
Spec file examples
pacbio.spec
merSize=16
asm.spec
ovlStoreMemory = 60000
References
SpecFiles
RunCA#Global_Options