NAME

cutSequences.pl - part of the NcDNAlign alignment pipeline, step (1)

Extracts (sub)sequences out of sequence files and writes them out in NcDNAlign standardized FASTA format (supported input file formats: FASTA, GenBank).

SYNOPSIS

cutSequences.pl [options, mode 1] OR [options, mode 2] OR [options, mode 3]


There are three ways of running the program:

Mode 1:: EITHER you specify the configuration file and the program runs for a whole screen,
Mode 2:: OR, to cut one single gbk file, you specify a gbk input file, the organism's name (3-letter code), the disturbing gbk feature keys, and a file that will store the produced output,
Mode 3:: OR, to cut one single fasta file, you specify a fasta input file, the organism's name (3-letter code), and a file that will store the produced output.

OPTIONS

Mode 1:

-c, --conf FILE: Path to the central NcDNAlign configuration file; inhibits --gbk, --name, --gbk_keys, --cut, --min_aln_len.

Mode 2:

--gbk FILE: Path to a single gbk file containing the input data.
-n, --name xyz: The 3-letter name of the organism that the gbk file belongs to.
--gbk_keys "foo;bar": Specify the gbk feature keys that should be cut out of the gbk file. List them in one quoted string, separated by semicolons as in the examples. Reference of GenBank feature keys: http://www.ncbi.nlm.nih.gov/collab/FT/#7.3
--cut FILE: Path to a file storing the output in NcDNAlign customized fasta format. Cut regions will be stored there (e.g. *.cut.fa).
--min_aln_len INT: Minimum required alignment length.
-v, --version: Prints version information and exits.
-h, --help: Prints a short help message and exits.
--man: Prints a detailed manual page and exits.

Mode 3:

--fasta FILE: Path to a single fasta file containing the input data.
--cut FILE: Path to a file storing the output in NcDNAlign customized fasta format. Cut regions will be stored there (e.g. *.cut.fa).
-n, --name xyz: The 3-letter name of the organism that the fasta file belongs to.

DESCRIPTION

cutSequences.pl optionally cuts specific regions out of genome files. The result is a FASTA-formatted file that contains either subsequences or the complete genomic sequence, each with an appropriate header.
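
The effect of cutting can be illustrated with a minimal Python sketch. It is not the code of cutSequences.pl; it only shows the idea of removing annotated intervals from a sequence and keeping what is left. The function names, the 0-based half-open coordinates, and the min_len filter (loosely modelled on --min_aln_len) are assumptions made for this illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed convention: 0-based, half-open [start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def remaining_regions(seq_len: int, cut: List[Interval], min_len: int = 1) -> List[Interval]:
    """Return the regions left after removing `cut` from [0, seq_len).

    Regions shorter than min_len are dropped (hypothetical filter).
    """
    min_len = max(min_len, 1)  # never report empty regions
    kept: List[Interval] = []
    pos = 0
    for start, end in merge_intervals(cut):
        # Clamp the interval to the sequence boundaries.
        start = min(max(start, 0), seq_len)
        end = min(max(end, 0), seq_len)
        if start - pos >= min_len:
            kept.append((pos, start))
        pos = max(pos, end)
    if seq_len - pos >= min_len:
        kept.append((pos, seq_len))
    return kept


genome = "AAAACCCCGGGGTTTT"
# Two overlapping features are removed; two subsequences remain.
parts = [genome[s:e] for s, e in remaining_regions(len(genome), [(4, 8), (6, 10)])]
# parts == ["AAAA", "GGTTTT"]

# With nothing to cut, the complete sequence is kept.
whole = remaining_regions(len(genome), [])
# whole == [(0, 16)]
```

With an empty interval list the sketch returns the complete sequence, which mirrors the two possible outcomes described above.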

If alignments of all possible regions of the given genomes are wanted and nothing should be cut, pass an empty string for the option C_GBK_KEYS in the config file or for the --gbk_keys option on the command line. Removing disturbing elements as early as possible directly speeds up subsequent BLAST searches and further alignment procedures, because the sequences are shorter.

However, one must be familiar with the GenBank feature keys documented at http://www.ncbi.nlm.nih.gov/collab/FT/#7.3 to ensure that no desired loci are unintentionally cut out of the genomes (e.g. tRNAs can lie within annotated exons, so dropping exons would also lose those tRNAs).
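
One way to check for this risk before cutting is to list the wanted features that overlap a feature scheduled for removal. The following Python sketch is only an illustration and not part of NcDNAlign; the Feature tuple, the function name, and the coordinates are hypothetical, and the features are assumed to have been parsed from the gbk file beforehand.

```python
from typing import List, NamedTuple, Set


class Feature(NamedTuple):
    key: str    # GenBank feature key, e.g. "tRNA" or "exon"
    start: int  # assumed 0-based, inclusive
    end: int    # assumed 0-based, exclusive


def features_at_risk(features: List[Feature],
                     cut_keys: Set[str],
                     wanted_keys: Set[str]) -> List[Feature]:
    """Return wanted features that overlap any feature whose key is in cut_keys."""
    to_cut = [f for f in features if f.key in cut_keys]
    return [
        f for f in features
        if f.key in wanted_keys
        # Two half-open intervals overlap if each starts before the other ends.
        and any(f.start < c.end and c.start < f.end for c in to_cut)
    ]


annotation = [
    Feature("exon", 100, 500),
    Feature("tRNA", 200, 275),   # lies inside the exon
    Feature("tRNA", 900, 975),   # lies outside any exon
]
at_risk = features_at_risk(annotation, cut_keys={"exon"}, wanted_keys={"tRNA"})
# at_risk == [Feature("tRNA", 200, 275)]
```

If the returned list is not empty, the chosen feature keys would remove loci that are meant to be kept.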

EXAMPLES

Mode 1 (whole screen):: $ cutSequences.pl -c config-file.cfg
Mode 2 (cutting single file):: $ cutSequences.pl --gbk Eco.gbk --name Eco --gbk_keys "CDS;repeat_region" --cut Eco.cut.fa

AUTHORS

 Dominic Rose (dominic@bioinf.uni-leipzig.de)
 Jana Hertel  (jana@bioinf.uni-leipzig.de)

AVAILABILITY

http://www.bioinf.uni-leipzig.de/Software/NcDNAlign/

NcDNAlign - Plausible Multiple Alignments of Non-Protein-Coding Genomic Sequences

Dominic Rose, Jana Hertel, Kristin Reiche, Peter F Stadler, Jörg Hackermüller
