NcDNAlign

NcDNAlign - Plausible Multiple Alignments of Non-Protein-Coding Genomic Sequences

Dominic Rose, Jana Hertel, Kristin Reiche, Peter F Stadler, Jörg Hackermüller


Documentation

PDF | PS

Manual Pages


NAME

cutSequences.pl - part of the NcDNAlign alignment pipeline, step (1)

Extracts (sub)sequences out of sequence files and writes them out in NcDNAlign standardized FASTA format (supported input file formats: FASTA, GenBank).


SYNOPSIS

cutSequences.pl [options, mode 1] OR [options, mode 2] OR [options, mode 3]


There are three ways of running the program:
Mode 1:

EITHER you specify the configuration file and the program runs for a whole screen,

Mode 2:

OR you specify a gbk input file, the organism's name (3-letter code), the disturbing gbk feature keys, and a file that will store the produced output, to cut one single gbk file,

Mode 3:

OR you specify a fasta input file, the organism's name (3-letter code), and a file that will store the produced output, to cut one single fasta file.


OPTIONS

Mode 1:

-c, --conf FILE

Path to the central NcDNAlign configuration file. Inhibits -gbk, -name, -gbk_keys, -cut, and -min_aln_len.

Mode 2:

--gbk FILE

Path to a single gbk file containing the input data.

-n, --name xyz

The 3-letter name of the organism to which the gbk file belongs.

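--gbk_keys "foo;bar"

Specify the gbk feature keys that should be cut out of the gbk file. Separate multiple keys with semicolons, as in the examples in this manual, and quote the list so that the shell does not interpret the semicolons. Reference of GenBank feature keys: http://www.ncbi.nlm.nih.gov/collab/FT/#7.3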

--cut FILE

Path to a file storing the output in NcDNAlign customized fasta format. Cut regions will be stored there (e.g. *.cut.fa).

--min_aln_len INT

Minimal required alignment length.

-v, --version

Prints version information and exits.

-h, --help

Prints a short help message and exits.

--man

Prints a detailed manual page and exits.

Mode 3:

--fasta FILE

Path to a single fasta file containing the input data.

--cut FILE

Path to a file storing the output in NcDNAlign customized fasta format. Cut regions will be stored there (e.g. *.cut.fa).

-n, --name xyz

The 3-letter name of the organism to which the fasta file belongs.
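
The way the three modes exclude each other can be summarized as a small check. The following Python sketch is only an illustration and is not taken from cutSequences.pl: the function select_mode and its dictionary argument are hypothetical, and two of its rules are assumptions read from the SYNOPSIS rather than documented behaviour, namely that --gbk and --fasta may not be combined and that --name and --cut are required in modes 2 and 3.

```python
def select_mode(opts):
    """Return 1, 2 or 3 for the run mode implied by the supplied options.

    opts maps long option names without dashes (e.g. "gbk_keys") to their
    values; None means the option was not given. Hypothetical helper.
    """
    given = {name for name, value in opts.items() if value is not None}

    if "conf" in given:
        # The option list states that --conf inhibits the single-file options.
        return 1
    if "gbk" in given and "fasta" in given:
        # Assumption: the SYNOPSIS describes the modes as alternatives.
        raise ValueError("give either --gbk or --fasta, not both")

    if "gbk" in given:
        required, mode = {"name", "gbk_keys", "cut"}, 2
    elif "fasta" in given:
        required, mode = {"name", "cut"}, 3
    else:
        raise ValueError("need --conf, --gbk or --fasta")

    missing = sorted(required - given)
    if missing:
        flags = ", ".join("--" + name for name in missing)
        raise ValueError("mode %d also needs: %s" % (mode, flags))
    return mode
```

An empty string counts as a supplied value here, so passing an empty --gbk_keys (cut nothing) still selects mode 2.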


DESCRIPTION

cutSequences.pl optionally cuts specific regions out of genome files, resulting in FASTA-formatted files that contain either subsequences or the complete genomic sequence, each with appropriate headers.

To obtain alignments of all possible regions of the supplied genomes without cutting anything, pass an empty string for the option C_GBK_KEYS in the config file or for the -gbk_keys option on the command line. Getting rid of disturbing elements as early as possible directly speeds up subsequent BLAST searches and further alignment procedures, because the sequences are shorter.

However, one must be familiar with the GenBank feature keys documented at http://www.ncbi.nlm.nih.gov/collab/FT/#7.3 to ensure that no desired loci are unintentionally cut out of the genomes (e.g. tRNAs can occur within exons, so dropping exons would lose those tRNAs).
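
The effect of cutting feature regions out of a genome can be illustrated with a short Python sketch. It is not the implementation used by cutSequences.pl. The function names, the (key, start, end) tuples with 0-based, end-exclusive coordinates, and the min_len filter are all assumptions made for this example; in particular, whether the real program discards short fragments according to --min_aln_len is not stated in this manual.

```python
def parse_keys(keys):
    """Split a string such as 'CDS;repeat_region'; '' means: cut nothing."""
    return [key for key in keys.split(";") if key]


def cut_regions(seq, features, keys, min_len=1):
    """Return (start, end, subsequence) for every stretch of seq that is not
    covered by a feature whose key is listed in keys.

    features: iterable of (key, start, end), 0-based and end-exclusive
    (a hypothetical, simplified stand-in for a GenBank feature table).
    min_len: assumed filter, fragments shorter than this are dropped.
    """
    wanted = set(keys)
    masked = sorted((s, e) for key, s, e in features if key in wanted)

    pieces, pos = [], 0
    for start, end in masked:
        # Keep the unmasked stretch in front of this feature, if long enough.
        if start > pos and start - pos >= min_len:
            pieces.append((pos, start, seq[pos:start]))
        # max() handles overlapping or nested features.
        pos = max(pos, end)
    # Keep whatever follows the last masked feature.
    if len(seq) > pos and len(seq) - pos >= min_len:
        pieces.append((pos, len(seq), seq[pos:]))
    return pieces
```

With an empty key list the whole sequence comes back as one piece, which mirrors the empty-string setting described above. The tRNA caveat is visible as well: a tRNA annotated inside a region whose key is cut disappears together with that region.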


EXAMPLES

Mode 1 (whole screen):

$ cutSequences.pl -c config-file.cfg

Mode 2 (cutting single file):

$ cutSequences.pl -gbk Eco.gbk -name Eco -gbk_keys "CDS;repeat_region" -cut Eco.cut.fa
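
When cutSequences.pl is driven from a script, the single-file modes can be assembled as argument lists, which sidesteps the shell quoting of the semicolon-separated key list. The Python sketch below uses only options listed under OPTIONS. The helper names mode2_command and mode3_command are hypothetical, as is the input file name Eco.fa in the usage note.

```python
import shlex


def mode2_command(gbk, name, keys, cut):
    """Argument list for cutting a single gbk file (mode 2)."""
    return ["cutSequences.pl", "--gbk", gbk, "--name", name,
            "--gbk_keys", ";".join(keys), "--cut", cut]


def mode3_command(fasta, name, cut):
    """Argument list for cutting a single fasta file (mode 3)."""
    return ["cutSequences.pl", "--fasta", fasta, "--name", name,
            "--cut", cut]


def as_shell_line(argv):
    """Render an argument list as a safely quoted shell command line."""
    return shlex.join(argv)
```

For instance, as_shell_line(mode3_command("Eco.fa", "Eco", "Eco.cut.fa")) yields a mode 3 command line, and either list can be passed directly to subprocess.run, provided cutSequences.pl is on the PATH.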


AUTHORS

 Dominic Rose (dominic@bioinf.uni-leipzig.de)
 Jana Hertel  (jana@bioinf.uni-leipzig.de)


AVAILABILITY

http://www.bioinf.uni-leipzig.de/Software/NcDNAlign/