Proteinortho4
NAME
Proteinortho4 - Orthology detection tool
SYNTAX
proteinortho4.pl [OPTIONS]... <FILE1> <FILE2>... >OUTPUT
proteinortho4.pl [OPTIONS]... <FILELIST> >OUTPUT
DESCRIPTION
Predicts orthologous and co-orthologous proteins within different species.
Given a list of proteome files in FASTA format, Proteinortho runs an
all-against-all blast search and applies a partitioning algorithm to the
adaptive best blast hits to infer (co-)orthologous groups.
This tool is designed to deal with large data sets and remains
memory-efficient even when applied to millions of
proteins.
- Important:
-
Protein ids must be globally unique! Also note that
blast may truncate an id at the first whitespace and use only the first part.
- Proteome files can be given directly on the command line (<FILE1> <FILE2>...) or in a <FILELIST> containing one filename per line.
-
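The uniqueness requirement above can be checked before starting a run. The following is a minimal sketch, not part of Proteinortho: the function name find_duplicate_ids and the miniature proteomes are hypothetical, and an id is assumed to be the first whitespace-delimited token of a FASTA header line, mirroring the truncation described above.

```python
from collections import defaultdict


def find_duplicate_ids(proteomes):
    """Return {id: [file names]} for ids that occur more than once overall.

    `proteomes` maps a file name to an iterable of FASTA lines.
    Only the first whitespace-delimited token of each header is used,
    since blast may cut the id at the first whitespace.
    """
    seen = defaultdict(list)
    for name, lines in proteomes.items():
        for line in lines:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    seen[fields[0]].append(name)
    return {pid: files for pid, files in seen.items() if len(files) > 1}


# Hypothetical miniature proteomes, for illustration only.
example = {
    "speciesA.faa": [">geneA1 kinase", "MKT", ">geneA2", "MLS"],
    "speciesB.faa": [">geneB1", "MAA", ">geneA2 hypothetical protein", "MQQ"],
}
duplicates = find_duplicate_ids(example)
print(duplicates)
```

Any id reported here would need to be renamed (for example by adding a species prefix) before the files are passed to Proteinortho.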
OUTPUT FORMAT
The OUTPUT is a tab-separated matrix.
The first line starts with # followed by the column and file names, respectively.
The second line starts with # followed by the corresponding values and numbers of proteins for each file.
Each following line represents a (co-)orthologous group.
Besides the number of species and proteins, the (approximated) algebraic connectivity of each group is given.
This value indicates the degree of conservation within each group.
Values close to 1 indicate highly conserved groups, whereas values close to 0 indicate poorly conserved ones.
The last line starts with # and reports the version used as well as the applied parameters.
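A matrix of this shape can be read with a few lines of code. The sketch below is illustrative only and makes several assumptions that are not stated above: the first three columns are taken to be the species count, the protein count and the algebraic connectivity, in that order; the column labels, file names and protein ids in the sample are made up; and the contents of the second and last comment lines are placeholders that the parser simply skips.

```python
def read_groups(text):
    """Parse a Proteinortho-style matrix into (header, groups).

    Assumed layout (not confirmed by the manual): species count,
    protein count, algebraic connectivity, then one column per file.
    """
    header = None
    groups = []
    for line in text.splitlines():
        if not line.strip():
            continue
        row = line.split("\t")
        if row[0].startswith("#"):
            if header is None:
                # The first comment line carries the column and file names.
                header = [row[0].lstrip("#").strip()] + row[1:]
            continue  # all other comment lines are ignored here
        groups.append({
            "species": int(row[0]),
            "proteins": int(row[1]),
            "connectivity": float(row[2]),
            "members": dict(zip(header[3:], row[3:])),  # raw cell text per file
        })
    return header, groups


# Hypothetical sample output, for illustration only.
sample = "\n".join([
    "# species\tproteins\talg.-conn.\tspeciesA.faa\tspeciesB.faa",
    "# ...",
    "2\t2\t1\tgeneA1\tgeneB1",
    "2\t3\t0.5\tgeneA2\tgeneB2,geneB3",
    "# ...",
])
header, groups = read_groups(sample)
for group in groups:
    print(group["species"], group["proteins"], group["connectivity"])
```

The per-file cells are kept as raw strings because the manual does not specify how several proteins from one file are written within a cell.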
OPTIONS
- -e=<E-VALUE>
-
<E-VALUE> threshold for blasts [default: 1e-10]
- -p=blastp|blastn
-
Defines the blast program [default: blastp]
Use blastp for amino acid sequences (.faa)
Use blastn for nucleotide sequences (.fna)
- -id=(0..100)
-
Min. percent identity of best blast hits [default: 25]
Hits below this level will be ignored.
- -cov=(0..1)
-
Min. coverage of best blast hits [default: 0.5]
Hits below this level will be ignored.
- -conn=(0..1)
-
Min. algebraic connectivity for each (co-)orthologous group [default: 0.1]
Proteinortho will split groups until the given level of connectivity is reached.
Raising this level can be useful to remove less conserved paralogs from the output.
The average group size will decrease. A value of 0.5 is already very strict.
Raising the value even higher is not recommended unless you want to focus on strongly conserved sets only.
- -m=(0..1)
-
Min. similarity for additional hits [default: 0.95 (nearly equal)]
All blast hits whose score is at least m times the best score (hit score/best score >= m) are included, even if they are not the best hit.
This option allows recovering (co-)orthologous groups even if ambiguous paralogs exist.
Setting this value to 1 corresponds to running a regular (non-adaptive) reciprocal best blast hit approach.
Lowering this value will potentially include more paralogous proteins in the groups.
Values lower than 0.75 are not recommended unless you know what you are doing.
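The effect of -m can be illustrated with a small sketch. It is not Proteinortho's implementation: it assumes a hit is kept when its score divided by the best score for the same query is at least m, which is the reading consistent with m=1 reducing to plain best hits and with lower values admitting more paralogs. The function name adaptive_best_hits and the scores are hypothetical.

```python
def adaptive_best_hits(scores, m=0.95):
    """Return the hit names whose score is at least m times the best score.

    `scores` maps a hit name to its blast score for a single query.
    """
    if not scores:
        return []
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if score >= m * best)


# Hypothetical scores of three hits for one query protein.
hits = {"geneB1": 200.0, "geneB2": 192.0, "geneB3": 150.0}

default_hits = adaptive_best_hits(hits)         # m = 0.95, threshold 190
strict_hits = adaptive_best_hits(hits, m=1.0)   # best hit only
loose_hits = adaptive_best_hits(hits, m=0.7)    # threshold 140
print(default_hits, strict_hits, loose_hits)
```

With the default, geneB2 is kept alongside the best hit because its score is within 5% of it, while geneB3 only enters once m is lowered considerably.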
- -pairs
-
Do not remove simple pairs from output
(Co-)orthologous groups of size two are very likely to occur by chance, so they are normally removed.
However, these groups might be of interest for some users as well, especially if the number of species is small.
- -selfblast
-
Apply blast for each species against itself.
Proteinortho infers paralogous genes indirectly from comparisons to other species.
Consequently, paralogs will not be detected if there is no co-orthologous gene in any other species.
Use this option to recover them as well. This will significantly increase runtime.
Using -pairs in addition is recommended.
- -unambiguous
-
Exclude connected components with paralogs from output
Beware: This option might exclude a considerable number of groups.
- -a=<THREADS>
-
The number of processors to use [default: auto]
- -noiolimit
-
Proteinortho automatically limits the number of concurrent I/O threads to spare the hard disk.
If you use an SSD or a RAM disk, you can disable this behavior to speed up the analysis.
- -f
-
Force blastall (even if blast output is found)
- -ff
-
Force formatdb (even if databases are found), implies -f
- -dir=<DIRECTORY>
-
Defines the <DIRECTORY> for the blast outputs [default: working directory]
- -remove
-
Proteinortho allows reusing the blast output for additional analyses.
This saves a significant amount of time.
However, if you do not intend to run an additional analysis with at
least some of these species, you can tell Proteinortho to remove unnecessary
files.
- -log=<FILE>
-
Writes a detailed log of reciprocal best hits to <FILE>.
- -o=<FILE>
-
Prints the output to the given <FILE> rather than STDOUT.
- -verbose
-
Gives information about what happens, including a progress report
- -debug
-
Gives detailed information for bug tracking
Does not work in combination with -verbose
MULTIPLE MACHINE OPTIONS
The main part of Proteinortho consists of blasting each species against each other. This can take from several hours up to days if hundreds of
species are involved, even on multi-core machines. For this purpose, a mechanism has been implemented that allows distributing the workload
over multiple machines.
Every option aside from -a=<THREADS> needs to be the
same on all machines. This is especially important for the directory in which the blasts
are stored. A file named sync
will be created there and used to synchronize the processes. As flock
is not suitable for network file systems, a temporary directory named lock is used for locking. Both may need to be removed if Proteinortho was
interrupted or crashed and a restart is intended.
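The idea of locking with a directory instead of flock can be sketched as follows. This is a generic illustration of the technique, not Proteinortho's code: it relies on directory creation either succeeding or failing as a whole, and the function name directory_lock, the polling interval and the timeout are assumptions made for the example.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def directory_lock(lock_dir, poll_seconds=0.1, timeout_seconds=10.0):
    """Hold a lock by creating `lock_dir`; release it by removing it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            os.mkdir(lock_dir)  # fails if another process holds the lock
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_dir}") from None
            time.sleep(poll_seconds)
    try:
        yield
    finally:
        os.rmdir(lock_dir)


# A temporary directory stands in for the shared blast directory.
blast_dir = tempfile.mkdtemp()
lock_path = os.path.join(blast_dir, "lock")
with directory_lock(lock_path):
    held = os.path.isdir(lock_path)      # critical section, e.g. updating sync
released = not os.path.exists(lock_path)
print(held, released)
```

The sketch also shows why a leftover lock directory has to be removed by hand after a crash: nothing else will ever delete it, so every later attempt to acquire the lock would wait until its timeout.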
Run all scripts using the option
- -blastonly
-
As the scripts synchronize themselves, the order and time at which you start them on
different machines do not matter. You can even stop certain processes
if needed. See SIGNALS for more details on that topic. After the blasts are done, all started scripts will terminate.
Once that has happened, you can collect the results and finish the calculations.
Start the script again on one machine using the same options as before.
Instead of -blastonly use the option
- -blastdone
-
This skips database creation and blasts and thus speeds up the
start of the connected component calculation.
- -batch
-
Returns a batch-list of jobs on STDOUT
This is preferable to -blastonly if you use a
cluster-management system.
"wait;" indicates that all jobs above have to be finished before proceeding further.
Requires -o=<FILE>
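The role of "wait;" in such a batch list can be made explicit by splitting the list into stages, where every job of one stage must finish before the next stage starts. The sketch below is only an illustration: the function name split_stages is hypothetical, the job lines are placeholders rather than real Proteinortho output, and "wait;" is assumed to appear on a line of its own.

```python
def split_stages(batch_text):
    """Group job lines into stages separated by "wait;" lines."""
    stages = []
    current = []
    for line in batch_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "wait;":
            if current:
                stages.append(current)
                current = []
        else:
            current.append(line)
    if current:
        stages.append(current)
    return stages


# Placeholder job lines, for illustration only.
batch = "\n".join([
    "echo prepare A",
    "echo prepare B",
    "wait;",
    "echo compare A B",
    "wait;",
    "echo finish",
])
stages = split_stages(batch)
for number, jobs in enumerate(stages, start=1):
    print(number, jobs)
```

Jobs within one stage are independent of each other and can be handed to the cluster-management system in any order or in parallel.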
SIGNALS
Sending signal INT or TERM to a Proteinortho process will lead to a clean stop, which allows a later continuation from this point. If used
on MULTIPLE MACHINES,
this allows stopping certain processes without interfering with the
ongoing calculation. As running blast jobs need to finish first,
termination may take a while.
- However, sending the signal twice (or using KILL) will lead
to an immediate stop and may result in corrupted data. In that case it is advisable
to remove all files from the blast output directory and not use the data
any further. This is also the case if the blasts were distributed
over multiple machines.
-
- Furthermore, if a full stop of all processes on MULTIPLE MACHINES
is intended, a file named stop can be placed in the blast output
directory. This will lead to a clean stop, as described above, for all
running scripts.
-
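The behavior described in this section (finish the running job after the first signal, stop immediately after the second, and honor a stop file) can be sketched in a few lines. This is a generic illustration, not Proteinortho's implementation; the class StopController, the function run_jobs and the job callables are hypothetical names introduced for the example.

```python
import os
import signal
import tempfile


class StopController:
    """Track stop requests from signals and from a file named stop."""

    def __init__(self, blast_dir):
        self.blast_dir = blast_dir
        self.signals_seen = 0

    def handle_signal(self, signum, frame):
        self.signals_seen += 1
        if self.signals_seen >= 2:
            # Second signal: give up at once, results may be incomplete.
            raise SystemExit("second signal: stopping immediately")

    def install(self):
        signal.signal(signal.SIGINT, self.handle_signal)
        signal.signal(signal.SIGTERM, self.handle_signal)

    def should_stop(self):
        stop_file = os.path.join(self.blast_dir, "stop")
        return self.signals_seen > 0 or os.path.exists(stop_file)


def run_jobs(jobs, controller):
    """Run jobs in order; a stop request takes effect between jobs."""
    finished = []
    for job in jobs:
        if controller.should_stop():
            break
        finished.append(job())
    return finished


controller = StopController(tempfile.mkdtemp())


def interrupted_job():
    # Simulate a first INT signal arriving while this job is running.
    controller.handle_signal(signal.SIGINT, None)
    return "job2"


finished = run_jobs([lambda: "job1", interrupted_job, lambda: "job3"], controller)
print(finished)
```

The job that was running when the signal arrived is still completed, which corresponds to the delay before termination mentioned above. In a real script, install() would be called once at startup; the demonstration calls the handler directly instead.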
EXAMPLES
- To run this program the standard way, comparing two or more species, type:
-
proteinortho4.pl speciesA.faa speciesB.faa >orthologs.out
- If you want a live progress report and want to store blast files in a separate folder, type:
-
mkdir blastout/
proteinortho4.pl -verbose -dir=blastout/ files.list >orthologs.out
- If you use a cluster-management system and want to handle threads yourself:
-
mkdir blastout/
proteinortho4.pl -batch -dir=blastout/ -o=orthologs.out files.list >batch.sh
- Now batch.sh can be executed via the management software.
-
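The examples above pass a file named files.list. According to DESCRIPTION, such a <FILELIST> contains one filename per line, so it could be produced with a sketch like the following. The function name write_filelist and the proteome names are hypothetical, and a temporary directory is used so that the example does not touch real data.

```python
import os
import tempfile


def write_filelist(proteome_paths, list_path):
    """Write one proteome filename per line to `list_path`."""
    with open(list_path, "w") as handle:
        for path in proteome_paths:
            handle.write(path + "\n")


# Hypothetical proteome names, written to a temporary working directory.
workdir = tempfile.mkdtemp()
list_path = os.path.join(workdir, "files.list")
write_filelist(["speciesA.faa", "speciesB.faa", "speciesC.faa"], list_path)

with open(list_path) as handle:
    content = handle.read()
print(content, end="")
```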
COPYRIGHT
Copyright 2008 Free Software Foundation, Inc. License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
REPORTING BUGS
Marcus Lechner <marcus[at]bioinf.uni-leipzig.de>
AUTHORS
Written by Marcus Lechner and Lydia Steiner
Interdisciplinary Center for Bioinformatics, University of Leipzig
SEE ALSO
- po2tree.pl: a tool for generating pseudo-phylogenetic
trees based on the proteins contained
-