Trinity:BinPacker:Shannon – A head to head comparison

I’ve been working on a head-to-head comparison of the de novo transcriptome assembler Trinity v2.2.0 versus the new kids on the block, Shannon and BinPacker. The tl;dr version of the post is that these new assemblers are very good and should be considered for new assembly projects, with just a couple of caveats.

Shannon is based on an information-theoretic approach to assembly that seeks to establish both necessary and sufficient conditions for optimal assembly as well as algorithms for achieving optimal assembly.

BinPacker models the transcriptome assembly problem as tracking a set of trajectories of items with their sizes representing coverage of their corresponding isoforms by solving a series of bin-packing problems.

Trinity: well, there is probably no need to describe this one again, as most people are already familiar with it.

Anyway, I used the standard mouse dataset that Trinity uses for benchmarking. The dataset can be downloaded with

curl -LO https://sourceforge.net/projects/trinityrnaseq/files/misc/MouseRNASEQ/mouse_SS_rnaseq.50M.fastqs.tgz

I assembled this dataset using BinPacker and Shannon. The Trinity assembly was provided to me by Ben Fulton/Brian Haas. I evaluated each assembly with BUSCO and TransRate. BinPacker was version 1.0, downloaded on 3/4/16 from https://github.com/macmanes-lab/BinPacker (which is a fork of the most recent version on SourceForge). The version of Shannon I used was from the develop branch, specifically at this commit: https://github.com/sreeramkannan/Shannon/commit/428c3106289ce5b658f17f64879e23bbc59d5ad3

BinPacker, strand-specific assembly

/share/BinPacker/BinPacker -q -d -s fq -p pair -m RF -k 25 -g 200 -o binpacker_mouse \
-l /mouse/trin_mouse/mouse.all.Left.fq \
-r /mouse/trin_mouse/mouse.all.Right.fq

Shannon, non-strand-specific assembly

python /share/Shannon/shannon.py -p 20 -o shannon_trin_noSS  --left /mouse/trin_mouse/mouse.all.Left.fq  --right /mouse/trin_mouse/mouse.all.Right.fq

and

Shannon, strand-specific assembly

python /share/Shannon/shannon.py -p 20 -o shannon_trin_SS  --left /mouse/trin_mouse/mouse.all.Left.fq  --right /mouse/trin_mouse/mouse.all.Right.fq --ss

I’ll post the raw data tables below, but here is the compiled version. (BinPacker and Trinity were run in strand-specific mode.)

Run Time

Trinity = BinPacker = Shannon SS < Shannon non-SS

For ~50M reads, we’re talking about 8 hours for the first three, and about 24 hours for Shannon non-SS. I’ve been talking a lot to Sreeram, the Shannon developer, about this. The SS mode is already very fast and I’m not sure why the non-SS mode is so much slower.

BUSCO Complete

Shannon non-SS > Trinity = BinPacker = Shannon SS

Here, we’re talking about 69%-73%. All of them are pretty good.
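For anyone who wants to check these percentages, here is a minimal Python sketch that recomputes percent complete (single-copy plus duplicated) and percent duplicated from the raw BUSCO counts listed at the end of this post. The dictionary and function names are my own, purely for illustration; note that BUSCO's summary line appears to truncate rather than round, so the decimals here can differ slightly from the C: and D: values it prints.

```python
# Raw BUSCO counts copied from the data tables at the end of this post, in the order
# (complete single-copy, complete duplicated, fragmented, missing).
BUSCO_COUNTS = {
    "BinPacker": (1504, 660, 114, 745),
    "Trinity": (1389, 724, 148, 762),
    "Shannon SS": (546, 1553, 161, 763),
    "Shannon non-SS": (769, 1441, 116, 697),
}


def busco_percentages(single, duplicated, fragmented, missing):
    """Return (percent complete, percent duplicated) out of all BUSCO groups searched."""
    total = single + duplicated + fragmented + missing
    complete = 100.0 * (single + duplicated) / total
    dup = 100.0 * duplicated / total
    return complete, dup


for name, counts in BUSCO_COUNTS.items():
    complete, dup = busco_percentages(*counts)
    print(f"{name}: C={complete:.1f}% D={dup:.1f}%")
```

The duplicated percentage is the number to watch for the two Shannon assemblies.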

TransRate Score

BinPacker > Trinity > Shannon SS > Shannon non-SS

TransRate Optimized Score

Trinity > Shannon SS > BinPacker > Shannon non-SS

Number of reconstructed ‘transcripts’

BinPacker <<< Trinity <<< Shannon SS < Shannon non-SS
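The contig counts are easier to interpret next to TransRate's "good contigs" numbers from the raw data at the end of this post. Here is a small Python sketch that computes the fraction of each assembly that TransRate kept as good. The names used are mine and only illustrative, and treating this fraction as a rough signal:noise proxy is my own assumption, not something TransRate claims.

```python
# (total contigs "n" from abyss-fac, "good contigs" from TransRate),
# both copied from the raw data tables at the end of this post.
CONTIG_COUNTS = {
    "BinPacker": (39356, 31583),
    "Trinity": (80922, 34896),
    "Shannon SS": (141089, 58011),
    "Shannon non-SS": (178823, 54453),
}


def good_fraction(total_contigs, good_contigs):
    """Fraction of assembled contigs that TransRate classified as good."""
    return good_contigs / total_contigs


for name, (total, good) in CONTIG_COUNTS.items():
    print(f"{name}: {good} of {total} contigs good ({good_fraction(total, good):.2f})")
```

These fractions should line up with the "p good contigs" lines in the TransRate logs.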

Summary

The main issue I have with the new assemblers is their scalability. Neither BinPacker nor Shannon can really handle large datasets at the moment: any more than 50-100M reads and they seem to choke. This is an issue that both development teams are aware of and are actively working on. Aside from this, both could use checkpoints and better parallelization (i.e., speed!). Signal:noise ratio is an issue for Shannon, as is the reconstruction of duplicates (see the BUSCO duplicate fraction), and it is too bad that the BUSCO percent complete is so much lower for the Shannon SS versus non-SS assembly.

It occurs to me now more than ever that the ‘best’ assembly is likely to result from merging a bunch of different assemblies. The new tool, transfuse, applied to these 3 assemblies may effectively pull out the best from each, resulting in a better assembly than any of the individual ones. This analysis is running; stay tuned for another blog post ASAP.

BinPacker Data

abyss-fac BinPacker.fa
n   n:500   L50 min N80 N50 N20 E-size  max sum name
39356   30183   6814    500 1530    3060    5247    3564    17514   64.75e6 BinPacker.fa

Summarized benchmarks in BUSCO notation:
        C:71%[D:21%],F:3.7%,M:24%,n:3023

Representing:
        1504    Complete Single-copy BUSCOs
        660     Complete Duplicated BUSCOs
        114     Fragmented BUSCOs
        745     Missing BUSCOs
        3023    Total BUSCO groups searched

[ INFO] 2016-03-08 11:18:17 : fragments                  52645238
[ INFO] 2016-03-08 11:18:17 : fragments mapped           43605273
[ INFO] 2016-03-08 11:18:17 : p fragments mapped             0.83
[ INFO] 2016-03-08 11:18:17 : good mappings              38104840
[ INFO] 2016-03-08 11:18:17 : p good mapping                 0.72
[ INFO] 2016-03-08 11:18:17 : bad mappings                5500433
[ INFO] 2016-03-08 11:18:17 : potential bridges             20876
[ INFO] 2016-03-08 11:18:17 : bases uncovered             1011715
[ INFO] 2016-03-08 11:18:17 : p bases uncovered              0.01
[ INFO] 2016-03-08 11:18:17 : contigs uncovbase             24873
[ INFO] 2016-03-08 11:18:17 : p contigs uncovbase            0.63
[ INFO] 2016-03-08 11:18:17 : contigs uncovered               385
[ INFO] 2016-03-08 11:18:17 : p contigs uncovered            0.01
[ INFO] 2016-03-08 11:18:17 : contigs lowcovered            22996
[ INFO] 2016-03-08 11:18:17 : p contigs lowcovered           0.58
[ INFO] 2016-03-08 11:18:17 : contigs segmented              3325
[ INFO] 2016-03-08 11:18:17 : p contigs segmented            0.08
[ INFO] 2016-03-08 11:18:17 : Read metrics done in 1580 seconds
[ INFO] 2016-03-08 11:18:17 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-08 11:18:17 : TRANSRATE ASSEMBLY SCORE     0.2836
[ INFO] 2016-03-08 11:18:17 : -----------------------------------
[ INFO] 2016-03-08 11:18:17 : TRANSRATE OPTIMAL SCORE      0.3465
[ INFO] 2016-03-08 11:18:17 : TRANSRATE OPTIMAL CUTOFF     0.2549
[ INFO] 2016-03-08 11:18:18 : good contigs                  31583
[ INFO] 2016-03-08 11:18:18 : p good contigs                  0.8

Trinity Data

abyss-fac Trinity.fasta
n   n:500   L50 min N80 N50 N20 E-size  max sum name
80922   36311   8127    500 1246    2573    4592    3125    15366   67.12e6 Trinity.fasta

#Summarized BUSCO benchmarking for file: Trinity.fasta
#BUSCO was run in mode: trans

Summarized benchmarks in BUSCO notation:
    C:69%[D:23%],F:4.8%,M:25%,n:3023

Representing:
    1389    Complete Single-copy BUSCOs
    724     Complete Duplicated BUSCOs
    148     Fragmented BUSCOs
    762     Missing BUSCOs
    3023    Total BUSCO groups searched

[ INFO] 2016-03-09 10:04:52 : fragments                  52645238
[ INFO] 2016-03-09 10:04:52 : fragments mapped           43243512
[ INFO] 2016-03-09 10:04:52 : p fragments mapped             0.82
[ INFO] 2016-03-09 10:04:52 : good mappings              37043549
[ INFO] 2016-03-09 10:04:52 : p good mapping                  0.7
[ INFO] 2016-03-09 10:04:52 : bad mappings                6199963
[ INFO] 2016-03-09 10:04:52 : potential bridges             39139
[ INFO] 2016-03-09 10:04:52 : bases uncovered             5012605
[ INFO] 2016-03-09 10:04:52 : p bases uncovered              0.06
[ INFO] 2016-03-09 10:04:52 : contigs uncovbase             46089
[ INFO] 2016-03-09 10:04:52 : p contigs uncovbase            0.57
[ INFO] 2016-03-09 10:04:52 : contigs uncovered              4457
[ INFO] 2016-03-09 10:04:52 : p contigs uncovered            0.06
[ INFO] 2016-03-09 10:04:52 : contigs lowcovered            58918
[ INFO] 2016-03-09 10:04:52 : p contigs lowcovered           0.73
[ INFO] 2016-03-09 10:04:52 : contigs segmented              3940
[ INFO] 2016-03-09 10:04:52 : p contigs segmented            0.05
[ INFO] 2016-03-09 10:04:52 : Read metrics done in 1768 seconds
[ INFO] 2016-03-09 10:04:52 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-09 10:04:53 : TRANSRATE ASSEMBLY SCORE     0.1241
[ INFO] 2016-03-09 10:04:53 : -----------------------------------
[ INFO] 2016-03-09 10:04:53 : TRANSRATE OPTIMAL SCORE      0.3793
[ INFO] 2016-03-09 10:04:53 : TRANSRATE OPTIMAL CUTOFF     0.4422
[ INFO] 2016-03-09 10:04:53 : good contigs                  34896
[ INFO] 2016-03-09 10:04:53 : p good contigs                 0.43

Shannon SS Data

abyss-fac shannon.fasta
n       n:500   L50     min     N80     N50     N20     E-size  max     sum     name
141089  95941   23199   500     1646    3067    5203    3565    22986   218.6e6 shannon.fasta

#BUSCO was run in mode: trans

Summarized benchmarks in BUSCO notation:
        C:69%[D:51%],F:5.3%,M:25%,n:3023

Representing:
        546     Complete Single-copy BUSCOs
        1553    Complete Duplicated BUSCOs
        161     Fragmented BUSCOs
        763     Missing BUSCOs
        3023    Total BUSCO groups searched

[ INFO] 2016-03-16 10:15:02 : -----------------------------------
[ INFO] 2016-03-16 10:15:02 : fragments                  52645238
[ INFO] 2016-03-16 10:15:02 : fragments mapped           43275563
[ INFO] 2016-03-16 10:15:02 : p fragments mapped             0.82
[ INFO] 2016-03-16 10:15:02 : good mappings              37376795
[ INFO] 2016-03-16 10:15:02 : p good mapping                 0.71
[ INFO] 2016-03-16 10:15:02 : bad mappings                5898768
[ INFO] 2016-03-16 10:15:02 : potential bridges             42051
[ INFO] 2016-03-16 10:15:02 : bases uncovered            79375764
[ INFO] 2016-03-16 10:15:02 : p bases uncovered              0.34
[ INFO] 2016-03-16 10:15:02 : contigs uncovbase            109024
[ INFO] 2016-03-16 10:15:02 : p contigs uncovbase            0.77
[ INFO] 2016-03-16 10:15:02 : contigs uncovered             37726
[ INFO] 2016-03-16 10:15:02 : p contigs uncovered            0.27
[ INFO] 2016-03-16 10:15:02 : contigs lowcovered           113452
[ INFO] 2016-03-16 10:15:02 : p contigs lowcovered            0.8
[ INFO] 2016-03-16 10:15:02 : contigs segmented              8526
[ INFO] 2016-03-16 10:15:02 : p contigs segmented            0.06
[ INFO] 2016-03-16 10:15:02 : Read metrics done in 2874 seconds
[ INFO] 2016-03-16 10:15:02 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-16 10:15:02 : TRANSRATE ASSEMBLY SCORE     0.0875
[ INFO] 2016-03-16 10:15:02 : -----------------------------------
[ INFO] 2016-03-16 10:15:02 : TRANSRATE OPTIMAL SCORE        0.36
[ INFO] 2016-03-16 10:15:02 : TRANSRATE OPTIMAL CUTOFF     0.3962
[ INFO] 2016-03-16 10:15:03 : good contigs                  58011
[ INFO] 2016-03-16 10:15:03 : p good contigs                 0.41

Shannon non-SS Data

abyss-fac shannon.fa
n       n:500   L50     min     N80     N50     N20     E-size  max     sum     name
178823  136990  32463   500     1875    3477    5960    4097    23409   350.6e6 shannon.fa

#BUSCO was run in mode: trans

Summarized benchmarks in BUSCO notation:
        C:73%[D:47%],F:3.8%,M:23%,n:3023

Representing:
        769     Complete Single-copy BUSCOs
        1441    Complete Duplicated BUSCOs
        116     Fragmented BUSCOs
        697     Missing BUSCOs
        3023    Total BUSCO groups searched

[ INFO] 2016-03-19 12:37:02 : fragments                  52645238
[ INFO] 2016-03-19 12:37:02 : fragments mapped           43506801
[ INFO] 2016-03-19 12:37:02 : p fragments mapped             0.83
[ INFO] 2016-03-19 12:37:02 : good mappings              37476940
[ INFO] 2016-03-19 12:37:02 : p good mapping                 0.71
[ INFO] 2016-03-19 12:37:02 : bad mappings                6029861
[ INFO] 2016-03-19 12:37:02 : potential bridges             39894
[ INFO] 2016-03-19 12:37:02 : bases uncovered           168476247
[ INFO] 2016-03-19 12:37:02 : p bases uncovered              0.46
[ INFO] 2016-03-19 12:37:02 : contigs uncovbase            156980
[ INFO] 2016-03-19 12:37:02 : p contigs uncovbase            0.88
[ INFO] 2016-03-19 12:37:02 : contigs uncovered             75138
[ INFO] 2016-03-19 12:37:02 : p contigs uncovered            0.42
[ INFO] 2016-03-19 12:37:02 : contigs lowcovered           154536
[ INFO] 2016-03-19 12:37:02 : p contigs lowcovered           0.86
[ INFO] 2016-03-19 12:37:02 : contigs segmented              8918
[ INFO] 2016-03-19 12:37:02 : p contigs segmented            0.05
[ INFO] 2016-03-19 12:37:02 : Read metrics done in 1873 seconds
[ INFO] 2016-03-19 12:37:02 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-19 12:37:02 : TRANSRATE ASSEMBLY SCORE     0.0573
[ INFO] 2016-03-19 12:37:02 : -----------------------------------
[ INFO] 2016-03-19 12:37:02 : TRANSRATE OPTIMAL SCORE      0.3274
[ INFO] 2016-03-19 12:37:02 : TRANSRATE OPTIMAL CUTOFF     0.4245
[ INFO] 2016-03-19 12:37:02 : good contigs                  54453
[ INFO] 2016-03-19 12:37:02 : p good contigs                  0.3
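
If you would rather pull the numbers out of TransRate logs like the ones above than read them by eye, something along these lines should work. This is only a sketch that assumes the log layout shown in this post; the regular expression, the function name, and the sample text are my own and are not part of TransRate.

```python
import re

# Matches metric lines such as:
#   [ INFO] 2016-03-08 11:18:17 : TRANSRATE ASSEMBLY SCORE     0.2836
# Lines that do not end in a number (separators, "Read metrics done in ... seconds")
# are skipped.
LINE_RE = re.compile(
    r"^\[\s*INFO\]\s+\S+\s+\S+\s+:\s+(?P<key>.+?)\s+(?P<value>-?\d+(?:\.\d+)?)\s*$"
)


def parse_transrate_log(text):
    """Return a dict mapping metric name to float for every numeric log line."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            metrics[match.group("key")] = float(match.group("value"))
    return metrics


# A few lines copied from the BinPacker log in this post.
SAMPLE = """\
[ INFO] 2016-03-08 11:18:17 : p good mapping                 0.72
[ INFO] 2016-03-08 11:18:17 : Read metrics done in 1580 seconds
[ INFO] 2016-03-08 11:18:17 : TRANSRATE ASSEMBLY SCORE     0.2836
[ INFO] 2016-03-08 11:18:17 : -----------------------------------
[ INFO] 2016-03-08 11:18:17 : TRANSRATE OPTIMAL SCORE      0.3465
[ INFO] 2016-03-08 11:18:18 : good contigs                  31583
"""

scores = parse_transrate_log(SAMPLE)
print(scores["TRANSRATE ASSEMBLY SCORE"], scores["TRANSRATE OPTIMAL SCORE"])
```

Running the same function over each of the four logs gives a dictionary per assembler, which makes it easy to sort them by either score.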