Proteinortho depends on the following tools. To check whether one is installed, type the given command in the command line:
BLAST: tblastn (BLAST+) or blastall (BLAST legacy)
Perl: perl -v
Python: python -V
make: make
g++: g++
tar -xzvf proteinortho_v5.13.tar.gz
cd proteinortho_v5.13
You can run ./proteinortho5.pl directly from this folder.
To install Proteinortho, run make followed by sudo make install (requires root privileges).
Run make test to make sure Proteinortho works as expected.
proteinortho5.pl [OPTIONS] FASTA1 FASTA2 [FASTAn...] performs an orthology analysis for the given sets of proteins.
Add -p=blastn+ (or -p=blastn with NCBI BLAST legacy) if your sequences are represented as nucleotides (ACTG) rather than as amino acids.
If Proteinortho is installed, you can call it as proteinortho5.pl.
If you have not installed it but only downloaded and extracted it to a folder, use /FULL/PATH/TO/proteinortho5.pl instead.
Proteinortho assumes that all your gene sequences are in FASTA format, represented either as amino acids or as nucleotides. The source code archive contains some examples, namely C.faa, E.faa, L.faa, M.faa located in the test/ directory.
By default Proteinortho assumes amino acids and thus uses blastp+ to compare sequences. If you have nucleotide sequences, you need to change this by adding the parameter -p=blastn+. (If you only have NCBI BLAST legacy installed, you need to state this too, by adding -p=blastp or -p=blastn respectively.)
The full command for the example files would thus be proteinortho5.pl -project=test test/C.faa test/E.faa test/L.faa test/M.faa.
Instead of naming the FASTA files one by one, you could also supply test/*.faa as argument.
Please note that the parameter -project=test is optional. With this, you can set the prefix of the output files generated by Proteinortho.
If you skip the project parameter, the default project name will be myproject.
Proteinortho will automatically determine the number of available CPU threads and use them accordingly to speed up the calculations. You can use the parameter -cpus= to manually set the number of threads.
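When Proteinortho is started from a script, the options described above can be assembled programmatically. The following is only a minimal sketch: the helper name build_command and its keyword arguments are hypothetical, and only the flags -project=, -cpus= and -p= mentioned in this text are used.

```python
def build_command(fasta_files, project=None, cpus=None, program=None,
                  executable="proteinortho5.pl"):
    """Assemble a Proteinortho call as an argument list (hypothetical helper)."""
    # The usage line lists FASTA1 and FASTA2 as mandatory arguments.
    if len(fasta_files) < 2:
        raise ValueError("at least two FASTA files are expected")
    command = [executable]
    if project is not None:
        command.append("-project=" + project)   # prefix of the output files
    if cpus is not None:
        command.append("-cpus=" + str(cpus))    # overrides automatic detection
    if program is not None:
        command.append("-p=" + program)         # e.g. blastn+ for nucleotides
    command.extend(fasta_files)
    return command


if __name__ == "__main__":
    files = ["test/C.faa", "test/E.faa", "test/L.faa", "test/M.faa"]
    print(" ".join(build_command(files, project="test", cpus=4)))
```

The returned list can be handed to subprocess.run(command, check=True) if the call should actually be executed.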
When the analysis is done you will find a new file in your current working directory, namely test.proteinortho. To have a quick look, you can e.g. use less -S test.proteinortho.
The tab-separated output generated looks like this:
# Species  Genes  Alg.-Conn.  M.faa  L.faa  C.faa      E.faa
4          4      1           M_10   L_10   C_10       E_10
4          4      1           M_11   L_11   C_11       E_11
4          4      1           M_14   L_14   C_14       E_14
...
4          5      0.2         M_19   L_19   C_22,C_63  E_19
...
The first line, starting with #, is a comment line indicating the meaning of each column. Each of the following lines represents one orthologous group.
The very first column indicates the number of species covered by this group. The second column indicates the number of genes included in the group. Often, this number will equal the number of species, meaning that there is a single ortholog in each species. If the number of genes is bigger than the number of species, there are co-orthologs present.
The third column gives the algebraic connectivity of the respective group. Basically, this indicates how densely the genes are connected in the orthology graph that was used for clustering. A connectivity of 1 indicates a perfectly dense cluster with each gene similar to each other gene. By default, Proteinortho splits each group into two denser subgroups when the connectivity is below 0.1.
In the second-to-last line of the example above, there is a group with two paralogs in species C (C.faa). They are separated by a comma (,), indicating that they are co-orthologous to the genes in the other species.
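The column layout described above can be read with a few lines of Python. This is a sketch, not part of Proteinortho: the function name read_groups is hypothetical, and treating an empty cell or a * as "no gene in this species" is an assumption that should be checked against your own output.

```python
import io


def read_groups(handle):
    """Yield one dict per orthologous group of a .proteinortho table."""
    header = handle.readline().rstrip("\n").split("\t")
    species = header[3:]          # columns 4 and up hold one species each
    for line in handle:
        row = line.rstrip("\n").split("\t")
        if len(row) < 4 or row[0].startswith("#"):
            continue              # skip blank lines and further comments
        members = {}
        for name, cell in zip(species, row[3:]):
            # Assumption: "" or "*" marks a species without a gene here.
            members[name] = [] if cell in ("", "*") else cell.split(",")
        yield {
            "species": int(row[0]),
            "genes": int(row[1]),
            "connectivity": float(row[2]),
            "members": members,
        }


SAMPLE = (
    "# Species\tGenes\tAlg.-Conn.\tM.faa\tL.faa\tC.faa\tE.faa\n"
    "4\t4\t1\tM_10\tL_10\tC_10\tE_10\n"
    "4\t5\t0.2\tM_19\tL_19\tC_22,C_63\tE_19\n"
)

if __name__ == "__main__":
    for group in read_groups(io.StringIO(SAMPLE)):
        if group["genes"] > group["species"]:
            # More genes than species means co-orthologs are present.
            print("co-orthologs:", group["members"])
```

With a real result file, open("test.proteinortho") can be passed in place of the io.StringIO sample.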
The PoFF extension allows you to use the relative order of genes (synteny) as an additional criterion to disentangle complex co-orthology relations.
To do so, add the parameter -synteny.
You can use it either to come closer to one-to-one orthology relations by preferring syntenically conserved copies in the presence of two very similar paralogs (default),
or just to reduce noise in the predictions by detecting multiple copies of genomic areas (add the parameter -dups=3).
Please note that you need additional data to include synteny, namely the gene positions in GFF3 format.
As Proteinortho is primarily made for proteins, it will only accept GFF entries of type CDS (column #3 in the GFF-file).
The attributes column (#9) must contain Name=GENE IDENTIFIER where GENE IDENTIFIER corresponds to the respective identifier in the FASTA file.
The identifier must not contain a semicolon (;)! Alternatively, you can also set ID=GENE IDENTIFIER.
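Before running with -synteny it can help to check that every FASTA entry has a matching CDS line in the GFF3 file. The sketch below is not part of Proteinortho and all function names are hypothetical. It assumes that the FASTA identifier is the first whitespace-delimited word of the header line, and it looks at Name= before ID=, which is an arbitrary choice.

```python
def cds_identifiers(gff_lines):
    """Collect gene identifiers from the CDS entries of a GFF3 file."""
    identifiers = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # Only entries of type CDS (column 3) are considered.
        if len(columns) < 9 or columns[2] != "CDS":
            continue
        attributes = {}
        for field in columns[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                attributes[key.strip()] = value.strip()
        # Assumption: prefer Name=, fall back to ID=.
        identifier = attributes.get("Name") or attributes.get("ID")
        if identifier:
            identifiers.add(identifier)
    return identifiers


def fasta_identifiers(fasta_lines):
    """Collect the first word of every FASTA header line."""
    identifiers = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                identifiers.add(words[0])
    return identifiers


def missing_positions(fasta_lines, gff_lines):
    """Return FASTA identifiers that have no CDS entry in the GFF3 data."""
    return sorted(fasta_identifiers(fasta_lines) - cds_identifiers(gff_lines))
```

A non-empty result of missing_positions points to genes whose positions are not available to the synteny step.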
Example files are provided in the source code archive.
Hence, we can run proteinortho5.pl -project=test -synteny test/A1.faa test/B1.faa test/E1.faa test/F1.faa
to add synteny information to the calculations.
Of course, this only makes sense if species are sufficiently similar. You won't gain much when comparing e.g. bacteria with fungi.
When the analysis is done you will find an additional file in your current working directory, namely test.poff.
This file is equivalent to the .proteinortho file (above) but can be considered more accurate as synteny was involved for its construction.
In addition Proteinortho will generate graph files containing all pairwise orthology relationships including similarity scores. If they are not generated, rerun Proteinortho with the -graph parameter.
myproject.blast-graph:
filtered raw blast data based on adaptive reciprocal best blast matches (= reciprocal best match plus all reciprocal matches within a range of 95% by default)
myproject.proteinortho-graph:
clustered blast graph. Its connected components are represented in myproject.proteinortho.
myproject.ffadj-graph:
filtered blast data based on adaptive reciprocal best blast matches and synteny (only if -synteny is set)
myproject.poff-graph:
clustered ffadj graph. Its connected components are represented in myproject.poff (only if -synteny is set)
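The adaptive reciprocal best match criterion used for the blast-graph (best match plus all matches within 95% of it, kept only if found in both directions) can be illustrated with a small sketch. This is not Proteinortho's implementation: ranking hits by bit score, the function names and the example scores are all assumptions made for illustration.

```python
def adaptive_best_hits(hits, sim=0.95):
    """Keep, per query, all targets scoring at least sim * best score.

    hits maps query -> {target: bit score}.
    """
    kept = {}
    for query, targets in hits.items():
        if not targets:
            continue
        best = max(targets.values())
        kept[query] = {t for t, score in targets.items() if score >= sim * best}
    return kept


def reciprocal_pairs(hits_ab, hits_ba, sim=0.95):
    """Return (a, b) pairs kept in both search directions."""
    keep_ab = adaptive_best_hits(hits_ab, sim)
    keep_ba = adaptive_best_hits(hits_ba, sim)
    pairs = set()
    for a, targets in keep_ab.items():
        for b in targets:
            if a in keep_ba.get(b, set()):
                pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    # Made-up bit scores for one gene of species M against species C.
    m_vs_c = {"M_19": {"C_22": 262, "C_63": 255, "C_70": 120}}
    c_vs_m = {
        "C_22": {"M_19": 262},
        "C_63": {"M_19": 250},
        "C_70": {"M_19": 118, "M_20": 300},
    }
    print(sorted(reciprocal_pairs(m_vs_c, c_vs_m)))
```

In this example C_22 and C_63 both survive as matches of M_19, while C_70 is dropped because its score is far below the best one.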
All graph files share a similar format:
# file_a  file_b
# a       b       evalue_ab  bitscore_ab  evalue_ba  bitscore_ba
# M.faa   L.faa
M_15      L_15    0.0        893          0.0        893
M_16      L_16    3e-175     481          3e-175     481
M_19      L_19    8e-93      262          8e-93      262
...
# M.faa   E.faa
M_10      E_10    3e-137     415          2e-148     441
M_11      E_11    2e-71      221          9e-68      209
...
The first two lines are comments explaining the layout of the species lines and of the match lines. Each later comment line (starting with #) announces that results comparing the two named species follow. E.g. # M.faa L.faa tells that the next lines represent results for species M and L. All matches are reciprocal matches: if a match for M_15 L_15 is shown, L_15 M_15 exists implicitly. E-values and bit scores for both directions are given after each match.
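The graph format above can be parsed by tracking the most recent species comment line. The following is a sketch with a hypothetical function name. It assumes whitespace-separated columns and recognises a species line as a comment with exactly two entries other than the file_a file_b header.

```python
def read_graph(lines):
    """Yield one dict per reciprocal match of a blast-graph style file."""
    species = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if line.startswith("#"):
            names = line.lstrip("#").split()
            # Species lines name exactly two files; other comments are skipped.
            if len(names) == 2 and names != ["file_a", "file_b"]:
                species = (names[0], names[1])
            continue
        if species is None or len(fields) < 6:
            continue
        yield {
            "species": species,
            "genes": (fields[0], fields[1]),
            "evalue_ab": float(fields[2]),
            "bitscore_ab": float(fields[3]),
            "evalue_ba": float(fields[4]),
            "bitscore_ba": float(fields[5]),
        }


SAMPLE = (
    "# file_a\tfile_b\n"
    "# a\tb\tevalue_ab\tbitscore_ab\tevalue_ba\tbitscore_ba\n"
    "# M.faa\tL.faa\n"
    "M_15\tL_15\t0.0\t893\t0.0\t893\n"
    "M_16\tL_16\t3e-175\t481\t3e-175\t481\n"
    "# M.faa\tE.faa\n"
    "M_10\tE_10\t3e-137\t415\t2e-148\t441\n"
)

if __name__ == "__main__":
    for match in read_graph(SAMPLE.splitlines()):
        print(match["species"], match["genes"], match["bitscore_ab"])
```

Any columns after the sixth are ignored by this sketch.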
The synteny-based graph files (myproject.ffadj-graph and myproject.poff-graph) have two additional columns: same_strand and simscore. The first one indicates whether the two genes of a match are located on the same strand (1) or not (-1). The second one is an internal score that can be interpreted as a normalized weight ranging from 0 to 1 based on the respective e-values. Moreover, a second comment line follows each species line, e.g.
...
# M.faa L.faa
# Scores: 4 39 34.000000 39.000000
...
These scores are derived from the ffadj algorithm comparing the gene similarities and gene orders in the respective species. They are:
grab_proteins.pl fetches protein sequences of orthologous groups from the Proteinortho output table.
proteinortho5.pl [OPTIONS] FASTA1 FASTA2 [FASTAn...]
Option | Default value | Description |
[General options] | | |
-project= | myproject | prefix for all result file names |
-cpus= | auto | use the given number of threads |
-verbose | | give a lot of information about the current progress |
-keep | - | store temporary blast results for reuse (advisable for larger jobs) |
-temp= | working directory | path for temporary files |
-force | | force recalculation of blast results in any case |
-clean | | remove all unnecessary files after processing |
[Search options] | | |
-e= | 1e-05 | E-value for blast |
-p= | blastp+ | blast program to use: |
-selfblast | | apply selfblast to detect paralogs directly; normally these are inferred indirectly from orthology data to other species (experimental!) |
-sim= | 0.95 | min. similarity for additional hits |
-identity= | 25 | min. percent identity of best blast alignments |
-cov= | 50 | min. coverage of best blast alignments in percent |
-subpara= | | additional parameters for blast; set these in quotes (e.g. -subpara='-seg no'). This parameter was named -blastParameters in earlier versions |
[Synteny options] | | |
-synteny | - | activate PoFF extension to separate similar sequences using synteny (requires a GFF file for each FASTA file) |
-dups= | 0 | applied in combination with -synteny; number of reiterations for adjacencies heuristic to determine duplicated regions; if set to a higher number, co-orthologs will tend to get clustered together rather than getting separated |
-cs= | 3 | applied in combination with -synteny; size of a maximum common substring (MCS) for adjacency matches; the larger this value becomes, the longer syntenic regions need to be in order to be detected |
-alpha= | 0.5 | weight of adjacencies vs. sequence similarity |
[Clustering options] | | |
-singles | | also report genes without orthologs in table output |
-purity= | 1 | avoid spurious graph assignments [range: 0.01-1, default 0.75] |
-conn= | 0.1 | min. algebraic connectivity of orthologous groups during clustering |
-nograph | - | do not generate .graph files with pairwise orthology data (saves some time) |
[Misc options] | | |
-desc | | write gene description file (XXX.descriptions); currently works only with NCBI-formatted FASTA entries |
-blastpath= | | path to your local blast installation (if not present in default paths) |
-debug | | gives detailed information for bug tracking |
[Large compute jobs] | | |
Parameters needed to distribute the runs over several machines | | |
-step= | 0 | perform only specific steps of the analysis |
-jobs=N/M | | distribute blast step into M subsets and run job number N out of M in this very process; only works in combination with -step=2 |
Run proteinortho5.pl -step=1 ... to generate the indices.
Then you can run proteinortho5.pl -step=2 -jobs=N/M ... to run small chunks separately.
Replace M and N with numbers: M is the number of jobs you want to divide the run into and N is the job division to be performed by the process.
E.g. to divide a Proteinortho run into 4 jobs to run on several machines, use
proteinortho5.pl -step=1 ...
on a single PC, then
proteinortho5.pl -step=2 -jobs=1/4 ...
proteinortho5.pl -step=2 -jobs=2/4 ...
proteinortho5.pl -step=2 -jobs=3/4 ...
proteinortho5.pl -step=2 -jobs=4/4 ...
separately on different machines (can be run in parallel or iteratively within the same shared working directory).
After all step 2 runs are done, run
proteinortho5.pl -step=3 ...
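
For a larger number of jobs, the command lines of the three phases can be generated instead of typed by hand. This is a sketch with a hypothetical helper name. It uses the -step= spelling from the options table and the job/total form of -jobs= shown in the example with 4 jobs; adjust it if your version expects something else.

```python
def distributed_commands(total_jobs, arguments, executable="proteinortho5.pl"):
    """Return the command lines for a run split into total_jobs blast jobs."""
    if total_jobs < 1:
        raise ValueError("total_jobs must be at least 1")
    tail = " ".join(arguments)
    # Phase 1: build the indices once.
    commands = ["{} -step=1 {}".format(executable, tail)]
    # Phase 2: one blast job per chunk, numbered 1..total_jobs.
    for job in range(1, total_jobs + 1):
        commands.append(
            "{} -step=2 -jobs={}/{} {}".format(executable, job, total_jobs, tail))
    # Phase 3: run once after all phase 2 jobs have finished.
    commands.append("{} -step=3 {}".format(executable, tail))
    return commands


if __name__ == "__main__":
    for command in distributed_commands(4, ["-project=test", "test/*.faa"]):
        print(command)
```

All generated commands must receive the same remaining arguments and share one working directory, as described above.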