De Nove转录组组装质量评估

无参De Nove组装通常用到Trinity软件,组装过程中最重要的两个参数就是--min_kmer_cov--min_glue为组装出高质量结果我们通常需要去尝试用不同的参数,github上也有软件开发者讨论关于这两个参数Optimizing parameters可供参考,其实问题最终也就归结为你是否关心你数据中的低丰度转录本?
此外作者也提供了一系列方法来评估组装质量Transcriptome Assembly Quality Assessment总共列出6种方法可对不同参数的组装结果进行评估,看完后综合总结出其中4种评估方法。

Assessing the Read Content of the Transcriptome Assembly

1
2
3
4
5
6
bowtie2-build  ../trinity_out_dir${i}/Trinity.fasta ../trinity_out_dir${i}/Trinity.fasta
bowtie2 --local --no-unal -p ${cpu} -x ../trinity_out_dir${i}/Trinity.fasta -q -1 ${left} -2 ${right} \
| samtools view -Sb - | samtools sort -no - - > bowtie2.nameSorted.bam
#参看proper pairs reads数量和百分比
${TRINITY_DIR}/util/SAM_nameSorted_to_uniq_count_stats.pl bowtie2.nameSorted.bam
grep "^proper_pairs" Read-Representation.out

第二步的bowtie2比对序列到组装转录本结果时可选部分数据来比对,这样可大大降低比对耗时。

Full-length transcript analysis for model and non-model organisms using BLAST+

1
2
3
4
5
blastall -p blastx -i ./trinity_out_dir${i}/Trinity.fasta  -d ${uniprot} -v 1 -b 1 -m 8 -e 1e-5 -a ${cpu} -F F -o uniprot_sprot.fasta_blastx.outfmt8
${TRINITY_DIR}/util/analyze_blastPlus_topHit_coverage.pl uniprot_sprot.fasta_blastx.outfmt8 ./trinity_out_dir${i}/Trinity.fasta /public/home/cotton/public_data/SwissProt/uniprot_sprot.fasta
${TRINITY_DIR}/util/misc/blast_outfmt6_group_segments.pl \
./uniprot_sprot.fasta_blastx.outfmt8 ./trinity_out_dir${i}/Trinity.fasta uniprot_sprot.fasta > ./uniprot_sprot.fasta_blastx.outfmt8.grouped
${TRINITY_DIR}/util/misc/blast_outfmt6_group_segments.tophit_coverage.pl ./uniprot_sprot.fasta_blastx.outfmt8.grouped

Compute DETONATE scores

RSEM-EVAL软件对于双端reads数据需要提供一个average fragment length值,可参考我的另一篇博文评估文库 Average Insert Size来计算得到此值。

1
2
3
4
5
6
rsem-eval-estimate-transcript-length-distribution ./trinity_out_dir${i}/Trinity.fasta ./RSEM-EVAL${i}/length_distribution_parameter.txt
rsem-eval/rsem-eval-calculate-score -p 1 \
--transcript-length-parameters ./RSEM-EVAL${i}/length_distribution_parameter.txt \
--paired-end --phred33 --strand-specific ../1.clean.fq ../2.clean.fq\
./trinity_out_dir${i}/Trinity.fasta \
hope-trinity_out_dir${i} 300

评估结果解释见:RSEM-EVAL: A novel reference-free transcriptome assembly evaluation measure

  • RSEM-EVAL produces the following three score related files: ‘sample_name.score’, ‘sample_name.score.isoforms.results’ and ‘sample_name.score.genes.results’.
  • sample_name.score: stores the evaluation score for the evaluated assembly. The first lines Score the RSEM-EVAL score.
  • Higher RSEM-EVAL scores are better than lower scores. This is true despite the fact that the scores are always negative. For example, a score of -80000 is better than a score of -200000, since -80000 > -200000.
  • BUSCO explore completeness according to conserved ortholog

    1
    git clone https://gitlab.com/ezlab/busco.git

    点击BUSCO官网相应图标下载所需数据库。

    1
    python BUSCO.py -i SEQUENCE_FILE -o OUTPUT_NAME -l LINEAGE -m tran

    SEQUENCE_FILE:transcript set (DNA nucleotide sequences) file in FASTA format
    OUTPUT_NAME:name to use for the run and temporary files (appended)
    LINEAGE:location of the BUSCO lineage data to use (e.g. fungi_odb9)
    察看结果: 在运行结果文件夹下short_summary_OUTPUT_NAME.txt中有如下统计信息👇

    1
    2
    3
    4
    5
    6
    7
    8
    C:80.0%[S:80.0%,D:0.0%],F:0.0%,M:20.0%,n:10

    8 Complete BUSCOs (C)
    8 Complete and single-copy BUSCOs (S)
    0 Complete and duplicated BUSCOs (D)
    0 Fragmented BUSCOs (F)
    2 Missing BUSCOs (M)
    10 Total BUSCO groups searched

    也可图像化展示结果👇:

    1
    2
    cp short_summary_OUTPUT_NAME.txt ./plot
    python2.7 BUSCO_plot.py -wd ./busco/plot/

    tiramisutes wechat
    欢迎关注