Bioinformatics Leipzig - Publications

Publications - Supplemental material

Please find below supplemental material corresponding to publications of our group. Currently, we list 133 supplements. If you have problems accessing electronic information, please let us know:

©NOTICE: All documents are copyrighted by the authors; If you would like to use all or a portion of any paper, please contact the author.

This supplement is also available at http://www.bioinf.uni-leipzig.de/publications/supplements/13-008
You may use this URL to cite or link to us.

BIOINF 13-008: Lacking alignments? The next-generation sequencing mapper segemehl revisited

Christian Otto, Peter F. Stadler, Steve Hoffmann

1. General Information

Supplementary methods, figures, and tables [PDF]

2. Remapping unmapped reads with lack

In the following, using a small example data set, we outline how to perform remapping
on unmapped reads, initially mapped by segemehl, using our novel tool lack, which is
part of the segemehl distribution. Moreover, the complete list of options available for
lack is described and details about the compatibility of lack to other split-read aligners
such as Blat, TopHat 2, and STAR are explained.

2.1 Basic usage

Download the example data, custom scripts (necessary for compatibility with other
split-read aligners) and segemehl version 0.1.7 (or newer)
uncompress all files, and compile segemehl.x and lack.x

tar -xzvf lack_example_data.tar.gz

tar -xzvf lack_comp_scripts.tar.gz

tar -xzvf segemehl_0_1_7.tar.gz

cd segemehl_0_1_7/segemehl && make && cd ../..

run split-read alignment with segemehl with split-mapping option

segemehl_0_1_7/segemehl/segemehl.x -s -d reference.fa -i reference.idx -q reads.fq -o mapped.sam -u unmapped.fq -S

sort and compress SAM file, compress unmapped read file

samtools view -bS mapped.sam | samtools sort - mapped.sorted

samtools view -h mapped.sorted.bam | gzip -c > mapped.sorted.sam.gz

gzip -f unmapped.fq

run remapping with lack

segemehl_0_1_7/segemehl/lack.x -s -d reference.fa -q mapped.sorted.sam.gz -r unmapped.fq.gz -o remapped.sam

2.2 More options

Parameters:

`-d, --database <file> [<file> ...]`	list of path/filename(s) of database sequence(s)
`-q, --query [ ...]`	path/filename of alignment file in (gzip'ed) SAM format, file must be coordinate-sorted
`-o, --outfile <string>`	outputfile in SAM format (default:stdout)
`-r, --remapfilename <file>`	filename for unmapped reads to be remapped (default:none) file must contain best-seed information, that is automatically added by segemehl (see Section below if another aligner was used)
`-u, --nomatchfilename <file>`	filename for reads that remained unmapped (default:none)
`-t, --threads <n>`	start <n> threads (default:1)
`-s, --silent`	shut up!
`-A, --accuracy <n>`	min percentage of matches per read in semi-global alignment (default:90)
`-W, --minsplicecover <n>`	min coverage for spliced transcripts (default:80)
`-U, --minfragscore <n>`	min score of a spliced fragment (default:5)
`-Z, --minfraglen <n>`	min length of a spliced fragment (default:5)
`-M, --maxdist <n>`	max number of distant sites to consider, 0 to disable (default:100)

Version:
0.1.7-403 (2013-09-12 11:46:53 +0200 (Thu, 12 Sep 2013))

Bugs:
Please report bugs to christian [at] bioinf.uni-leipzig [dot] de.

References:
SEGEMEHL is free software for non-commercial use
© 2012 Bioinformatik Leipzig

2.3 Compatibility with other split-read aligners

In order to use lack with split-read aligners other than segemehl, it is necessary to get a file
comprising unmapped reads, to add best seed information, and to postprocess the alignments
reported by the split-read aligner.

run split-read aligner and convert into extended SAM format of segemehl

(1) Blat (tested with version 35x1):: blat -noHead reference.fa reads.fa mapped.psl pslReps -nohead mapped.psl mapped.filtered.psl mapped.filtered.psr cat mapped.filtered.psl | perl convBl2Se.pl reads.fq | samtools view -bT reference.fa - | samtools sort - mapped.sorted samtools calmd mapped.sorted.bam reference.fa | gzip -c > mapped.sorted.sam.gz
(2) TopHat2 (tested with version 2.0.9):: bowtie2-build reference.fa reference_bowtie2 tophat2 --fusion-search -o mapped -p 8 reference_bowtie2 reads.fq samtools view -F 0x4 mapped/accepted_hits.bam | perl convTo2Se.pl | samtools view -bT reference.fa - | samtools sort - mapped.sorted samtools calmd mapped.sorted.bam reference.fa | gzip -c > mapped.sorted.sam.gz
(3) STAR (tested with version 2.3.0e):: STAR --runMode genomeGenerate --genomeDir reference_star --genomeFastaFiles reference.fa --runThreadN 8 STAR --outFileNamePrefix mapped/ --genomeLoad NoSharedMemory --genomeDir reference_star --runThreadN 8 --chimSegmentMin 17 --chimScoreMin 17 --outSAMattributes All --readFilesIn reads.fq samtools view -F 0x4 -S mapped/Aligned.out.sam | perl convTo2Se.pl > mapped.tmp1 samtools view -F 0x4 -S mapped/Chimeric.out.sam | perl convSt2Se.pl > mapped.tmp2 perl mergeOutputStar.pl mapped.tmp1 mapped.tmp2 | samtools view -bhT reference.fa - | samtools sort - mapped.sorted samtools calmd mapped.sorted.bam reference.fa | gzip -c > mapped.sorted.sam.gz

get unmapped reads,
either

by split-read aligner itself, stored in preunmapped.fq

by use of getUnmapped.pl (part of custom script package above)

samtools view -S mapped.sorted.sam.gz | getUnmapped.pl reads.fq > preunmapped.fq

find best seeds using segemehl

segemehl_0_1_7/segemehl/segemehl.x -d genome.fa -i genome.idx -q preunmapped.fq -A 100 -u unmapped.fq -o preunmapped.sam -t 8 gzip -f unmapped.fq

run remapping with lack

segemehl_0_1_7/segemehl/lack.x -s -d reference.fa -q mapped.sorted.sam.gz -r unmapped.fq.gz -o remapped.sam

3. Read aligner comparison

Here, we provide detailed information on these performance tests. In the light of the
heated debates, we would like to stress that benchmarks only measure specific aspects
and may not be used to claim any universal superiority or inferiority of a tool.

In order to reproduce the results of the read aligner comparisons, we have assembled
a benchmark package containing all data sets (simulated and down-sampled real data),
read aligners (pre-compiled versions), optimal alignments (obtained by RazerS 3), and
all custom scripts. In addition, we subsequently explain how to re-run the benchmarks.

We would like to encourage all readers to reproduce this data and to come up with
alternative benchmarks.

3.1 Download

benchmark package 0.1 (342 MB, initial version)
benchmark package 0.2 (369 MB, paired-end mRNA-seq data set added)
benchmark package 0.3 (369 MB, bug fixes in run*.sh files)
benchmark package 0.4 (372 MB, base qualities scaled to Phred+33)

3.2 Requirements

64-bit linux machine with 50 GB RAM or more
multi-threading of each aligner with 8 threads is utilized
(adapt parameter $THR in scripts/run*.sh if necessary)
samtools installed (newest version 0.1.19 and bugfixed according to
samtools-devel, otherwise there will be an error message if no
alignments are present in SAM)

3.3 How to run the benchmarks?

Download the benchmark package, uncompress it, and change into
the new directory by typing

tar -xzvf segemehl_benchmarks.tar.gz && cd segemehl_benchmarks

Check whether all aligners are working properly using the following
commands

./tools/bowtie2-2.1.0/bowtie2 -h

./tools/bwa-0.7.4/bwa

./tools/GEM-binaries-Linux-x86_64-core_i3-20130406-045632/bin/gem-mapper

./tools/segemehl_0_1_7/segemehl/segemehl.x -h

./tools/STAR_2.3.0e.Linux_x86_64/STAR --outFileNamePrefix test && rm -rf test*

If not, please refer to the section 'Tools' below.

Run the benchmarks under default parameter values by typing

./scripts/runDefault.sh

Details on the output is given in the section 'Output' below.

Note: The initial execution will take much longer since the genome
will be downloaded and the genome index of each aligner needs be
built.

Run the benchmarks with the set of different of parameter options
(in order to define best-sensitivity and best-false positive settings)
by typing

./scripts/runVarParam.sh

Details on the output is given in the section 'Output' below.

Note: The execution of these benchmarks takes longer than under
default parameters since multiple different parameter settings are
used for every aligner and data set, some of which are much more
time-consuming than the default setting.

3.4 Tools

Bowtie 2 v.2.1.0, BWA/BWA-SW/BWA-MEM v.0.7.4, GEM pre-release 3,
segemehl v.0.1.7, and STAR v.2.3.0e are already included precompiled
in the package above. Please check whether all tools are working
properly. Note that all aligners were downloaded as source (if
available) and compiled on a linux x86 (64-bit) machine.

If any aligner is not working properly, you would need to download
the aligner manually, compile it, and possibly update the paths in the
respective runX.sh in the scripts subfolder.

3.5 Optimal alignments

The optimal alignments, obtained by RazerS 3 v.3.1, are included in the
package above. These are necessary for the evaluation of the benchmarks.

Details about the command line options used to generate these alignments
and the processing of paired-end data are given in the paper.

3.6 Genome and genome indices

To reduce the size of the package, genome sequence and genome indices
of the aligners are not included in the package above.

During the first execution of the scripts, the genome will be downloaded
and the indices will be build. More specifically, the Human genome (hg19)
will automatically be downloaded from UCSC and converted properly.
Haplotypes, random contigs, and 'non-chromosomal' sequences will
therefore be removed.

Genome indices of the aligners will be build once as well.

3.7 Output

The output of runDefault.sh or runVarParam.sh will be put into the
subfolders mappings and mapping_varparam, respectively. The
subfolders contain all final alignment files (coordinate sorted
gzip'ed SAM), time and memory files containing the running time and
peak virtual memory measurements, and evaluation files.

Finally, there will be a file, named all.eval.txt, that contains all
the evaluation measures of every dataset and aligner. It is formatted
as follows:
dataset<tab>tool<tab>params<tab>type<tab>time<tab>mem<tab>sens<tab>fp
where dataset is the name of the data set, tool is the short name of
the split-read aligner, params is the concatenated, white-spaced
reduced list of parameters used for execution, type is the evaluation
scenario (allbest or anybest), time is the user time in seconds, mem is
the peak virtual memory consumtion in MB, sens is the calculated
sensitivity, and fp is the calculated number of false positives.

Note that the all.eval.txt files are sufficient to generate all
figures and tables of the read aligner comparison, shown in paper
and supplement.