Documentation
Manual Pages
- ncDNAlign.1.cutSequences.pl
- ncDNAlign.2.getGwAln.pl
- ncDNAlign.3.mergeGwAln.pl
- ncDNAlign.4.realign.pl
- ncDNAlign.5.trimAln.pl
NAME
cutSequences.pl
- part of the NcDNAlign alignment pipeline, step (1)
Extracts (sub)sequences out of sequence files and writes them out in NcDNAlign standardized FASTA format (supported input file formats: FASTA, GenBank).
SYNOPSIS
cutSequences.pl [options, mode 1] OR [options, mode 2] OR [options, mode 3]
There are three ways of running the program:
- Mode 1:
EITHER you specify the configuration file and the program runs for a whole screen,
- Mode 2:
OR you specify a gbk input file, the organism's name (3-letter code), the gbk feature keys to be cut out, and a file that will store the produced output, in order to cut one single gbk file,
- Mode 3:
OR you specify a fasta input file, the organism's name (3-letter code), and a file that will store the produced output, in order to cut one single fasta file.
OPTIONS
Mode 1:
- -c, --conf FILE
Path to the central NcDNAlign configuration file. When given, it disables -gbk, -name, -gbk_keys, -cut and -min_aln_len.
Mode 2:
- --gbk FILE
Path to a single gbk file containing the input data.
- -n, --name xyz
The 3-letter name of the organism that belongs to the gbk file.
- --gbk_keys "foo;bar"
Specify the gbk feature keys that should be cut out of the gbk file. List them in a single quoted string, separated by semicolons (e.g. "CDS;repeat_region"). Reference of GenBank feature keys: http://www.ncbi.nlm.nih.gov/collab/FT/#7.3
- --cut FILE
Path to a file storing the output in NcDNAlign customized fasta format. Cut regions will be stored there (e.g. *.cut.fa).
- --min_aln_len INT
Minimal required alignment length.
- -v, --version
Prints version information and exits.
- -h, --help
Prints a short help message and exits.
- --man
Prints a detailed manual page and exits.
Mode 3:
- --fasta FILE
Path to a single fasta file containing the input data.
- --cut FILE
Path to a file storing the output in NcDNAlign customized fasta format. Cut regions will be stored there (e.g. *.cut.fa).
- -n, --name xyz
The 3-letter name of the organism that belongs to the fasta file.
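The three option groups above can be summarised as a small decision rule. The following Python sketch is only an illustration of that rule as read from this manual page, not the script's own option handling: the function name detect_mode is hypothetical, and treating --gbk and --fasta as mutually exclusive is an assumption.

```python
def detect_mode(given):
    """Return 1, 2 or 3 for a collection of option names (without dashes).

    Hypothetical helper: mirrors the SYNOPSIS, not the real Perl code.
    """
    given = set(given)
    if "conf" in given:
        # The configuration file takes over; per-file options are disabled.
        return 1
    if "gbk" in given and "fasta" in given:
        # Assumption: a single run cuts either a gbk file or a fasta file.
        raise ValueError("--gbk and --fasta cannot be combined")
    if "gbk" in given:
        mode, required = 2, {"gbk", "name", "gbk_keys", "cut"}
    elif "fasta" in given:
        mode, required = 3, {"fasta", "name", "cut"}
    else:
        raise ValueError("need one of --conf, --gbk or --fasta")
    missing = required - given
    if missing:
        raise ValueError(
            "mode %d is missing: %s" % (mode, ", ".join(sorted(missing)))
        )
    return mode
```

For instance, detect_mode({"fasta", "name", "cut"}) selects mode 3, while a call with only {"gbk", "name"} raises an error naming the missing options.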
DESCRIPTION
cutSequences.pl optionally cuts out specific regions from genome files, resulting in FASTA-formatted files that contain either subsequences or the complete genomic sequence with appropriate headers.
If alignments of all possible regions of the given genomes are wanted and nothing should be cut, just pass an empty string for the option C_GBK_KEYS in the config file or for the -gbk_keys option on the command line. Getting rid of disturbing elements as early as possible directly speeds up subsequent BLAST searches and further alignment procedures because the sequences are shorter.
However, one must be familiar with the GenBank feature keys documented at http://www.ncbi.nlm.nih.gov/collab/FT/#7.3 to ensure that no desired loci are unintentionally cut out of the genomes (e.g. tRNAs can lie within exons, so dropping exons would cause those tRNAs to be lost).
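To make the effect of cutting concrete, here is a minimal Python sketch of the idea, not the actual Perl implementation. The function names, the 0-based half-open coordinates, the semicolon-separated key string, and the min_len filter on leftover pieces are all assumptions made for illustration.

```python
def parse_keys(spec):
    """Split a key string such as "CDS;repeat_region"; "" means cut nothing."""
    return [key.strip() for key in spec.split(";") if key.strip()]


def cut_features(seq, features, keys, min_len=1):
    """Return (start, end, subsequence) for stretches not covered by `keys`.

    features: iterable of (feature_key, start, end), 0-based, end exclusive.
    min_len:  hypothetical threshold; shorter leftover pieces are dropped.
    """
    min_len = max(min_len, 1)
    selected = set(keys)
    # Regions to remove, sorted so that overlaps can be skipped in one pass.
    masked = sorted((s, e) for key, s, e in features if key in selected)
    pieces = []
    pos = 0
    for start, end in masked:
        if start - pos >= min_len:
            pieces.append((pos, start, seq[pos:start]))
        pos = max(pos, end)
    if len(seq) - pos >= min_len:
        pieces.append((pos, len(seq), seq[pos:]))
    return pieces


if __name__ == "__main__":
    genome = "AAAACCCCGGGGTTTT"
    # Toy annotation: a tRNA that lies inside an exon.
    annotation = [("exon", 4, 12), ("tRNA", 6, 9)]
    for piece in cut_features(genome, annotation, parse_keys("exon")):
        print(piece)
```

In this toy annotation the tRNA at positions 6 to 9 lies entirely inside the exon, so cutting "exon" also removes the tRNA, which is the pitfall described above. With an empty key string the whole sequence is returned as one piece.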
EXAMPLES
- Mode 1 (whole screen):
$ cutSequences.pl -c config-file.cfg
- Mode 2 (cutting single file):
$ cutSequences.pl -gbk Eco.gbk -name Eco -gbk_keys "CDS;repeat_region" -cut Eco.cut.fa
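When several single files have to be cut outside of a configured screen, the command lines can be assembled programmatically. The Python sketch below builds argument lists for modes 2 and 3 from the options documented in OPTIONS. The helper names and the file name Eco.fa are hypothetical, and the commands are only printed unless cutSequences.pl is found on the PATH.

```python
import shlex
import shutil
import subprocess


def mode2_command(gbk_file, name, gbk_keys, cut_file, program="cutSequences.pl"):
    """Argument list for cutting one GenBank file (mode 2)."""
    return [program, "--gbk", gbk_file, "--name", name,
            "--gbk_keys", gbk_keys, "--cut", cut_file]


def mode3_command(fasta_file, name, cut_file, program="cutSequences.pl"):
    """Argument list for cutting one FASTA file (mode 3)."""
    return [program, "--fasta", fasta_file, "--name", name, "--cut", cut_file]


def run_or_print(cmd):
    """Run the command if the program is installed, otherwise just show it."""
    if shutil.which(cmd[0]):
        subprocess.run(cmd, check=True)
    else:
        # shlex.join quotes arguments such as the semicolon-separated keys.
        print(shlex.join(cmd))


if __name__ == "__main__":
    run_or_print(mode2_command("Eco.gbk", "Eco", "CDS;repeat_region", "Eco.cut.fa"))
    # Hypothetical FASTA input for the same organism.
    run_or_print(mode3_command("Eco.fa", "Eco", "Eco.cut.fa"))
```

Passing the arguments as a list avoids shell quoting problems with the semicolon in the key string.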
AUTHORS
Dominic Rose (dominic@bioinf.uni-leipzig.de)
Jana Hertel (jana@bioinf.uni-leipzig.de)