fragrep Tutorial
Homology Search with Fragmented Nucleic Acid Sequence Patterns using fragrep
Introduction
The fragrep tool was developed to detect distantly homologous
sequences that exhibit a commonly observed fragmented conservation
pattern. fragrep comes with a collection of companion tools, most
notably aln2pattern, which extracts homology patterns from multiple
sequence alignments, as well as tools for visualizing these patterns.
Creating a Search Pattern
A typical starting point for creating PWM-based search patterns is a
multiple sequence alignment. Often, when dealing with the aligned
members of a non-coding RNA family, these alignments exhibit certain
conserved blocks. The tool aln2pattern relies on manual
identification and annotation of these blocks and converts them into a
search pattern that can be used by fragrep.
The following example shows an alignment of YRNA sequences from
different vertebrate species. By visual inspection, we easily identify
two conserved blocks, which we annotate in a separate line
labelled FRAGREP in the alignment:
CLUSTALW
Ta_gut_Y1 GGCTGGTCCGAAGGTAGTGGGGTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGATCTCCTTGTTCT-CTCTTTCCCC--CCTTCCCACTACTGCACTCGACTAGTCTTTT---
Ga_gal_Y1 GGCTGGTCCGAAGGTAGTGGGGTGTCTCAATTGATTGTTCACAGTCAGTTACAGATTGATCTCCTCGTTCT-CTCTTTCCCC--CCTTCCCACTACTGCACTTGACTAGTCTTTT---
Or_ana_Y1 GGCTAGTCCGAAGGTAGTGAGTTATCACAATTGATTGTTCACAGTCAGTTACAGATCGATATCCCTGTTCT-CTCTTCCTCCCACCTTCTCACTACCGTATTTGACTAGTCTTTT---
Mo_dom_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAAATGATTGTTCACAGTCAGTTACAGATCGATCTCCTTGTTCT-CTCTTTCCCTC-CCTTCTCACTACTGCACTCGACTAGTCTTTT---
Tu_tru_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATCGAACTCCTCGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTT-----
Ho_sap_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATCGAACTCCTTGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTT-----
Mu_mus_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGAACTCCT-GTTCTACACTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTTTT---
Sp_tri_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGAACTCCTTGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTTTT---
An_car_Y3 GGCTGGTCCGATTGCAGTGGT----------ACTTATAATTAATTGATCACAGTCAGTTACAGGTTTCTTTGTTCTTT---CTCCACTCCCACTGCTTCACTTGACTAGTCTTTT---
Ta_gut_Y3 GGCTGGTCCGATGGCAGTGGT----------ATTTA----TAATTGATCACAGTCAGTTACAGATTTCTTTGTTCTTT---CTCCACACCCACTGCTGCACTTGACTAGTCTTTT---
Ga_gal_Y3 GGCTGGTCCGATGGCAATGGC----------ATTTATAA-TAATTGATCACAGTCCGTTACAGATTTCTTTGCTCTTT---CTCCACTCCCACTGCTGCATTTGACTAGTCTTTT---
Or_ana_Y3 GGCTGGTCCGAGTGCAGTGGA----------ATTTATAATTAATTGATCACAGTCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGTCTTTT---
Mo_dom_Y3 GGCTGGTCCGATTGCAGTGGT----------AACTCTAATTAATTGATTACAGTCAGTTACAGATTTCTTTGTTCTTT---CTCCGCTCCCACTGCTTCACTTGACTAGTCTTTT---
Tu_tru_Y3 GGCTGGTCCGCGTGCAGTGGT----------GTTTACAATTAATTGATCACAGCCAGTTACAGATTTC-TTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTTTT---
Ho_sap_Y3 GGCTGGTCCGAGTGCAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTT-----
Sp_tri_Y3 GGCTGGTCCGAGTGCAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTTTT---
Mu_mus_Y3 GGTTGGTCCGAGAGTAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCGCTCCCACTGCTTCACTTGACCAGCCTTTT---
Lo_afr_Y3 GGCTGGTCCGAGTGCAGTGTG--------AGGCTTACAACTAATTGATCACAACCAGTTACAGATTCCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTAGACCGGTCTTTT---
An_car_Y4 GGCTGGTCCGAAAGTAGTGGGTTACCA----------CAGAAATTATTACAGTT-AGTTTCACTAACCTTTCTAAGT-----TCCA-CCCCACTGCTAACCTTGACTGGTCTCCTTTT
Ta_gut_Y4 GGCTGGTCCGATGGTAGTGGGTAGT------------CAGAAATTATTACTGCT-ACTTAAGCTAACCTTTCTATAT-----TCCA-CCCCACTGCTAACCTCGACTGGTCTTT----
Or_ana_Y4 GGCTGGTCCGATAGTAGTGGGTGCC------------CAGAACTCTTTAATAT--AGTTTCACTAAACTTGGTATAT-----TCCA-CCTCACTGCTAAACTTGACTGGCCTTTT---
Mo_dom_Y4 GGCTGGTCCGATGGCAGTGGTTTAC------------CAGAACTTATTGATATT-AGTTTCACAACAAGTTAATATAT----TCCACCCCCACTGCTAAATTTGACTGGCCTTTT---
Tu_tru_Y4 GGCTGGTCCGATGGTAGTGGGTTAC------------CAGAACTTATTAACGTT-AGTGTCACTAAAGTTGGTATACAA----CC--CCCCACTGCTAAATTTGACTAGCTTT-----
Lo_afr_Y4 GGCTGGTCCGATGGCAGTGGGTTAT------------CAGAACTTATTAACGTT-AGTGTCACTAAAGTTGGTATACAA----CC-CCCCCACTGCTGAATTTGACTGGCTTTTT---
Ho_sap_Y4 GGCTGGTCCGATGGTAGTGGGTTAT------------CAGAACTTATTAACATT-AGTGTCACTAAAGTTGGTATACAA----CC--CCCCACTGCTAAATTTGACTGGCTTT-----
Sp_tri_Y4 GGCTGGTCCGATGGTAGTGGGTTAT------------CAGAACTTATTAACATT-AGTGTCACTAAAGTTGATATACAA----CC--CCCCACTGCTAAATTTGACTGGCTTTTT---
FRAGREP ********************-------------------------------------------------------------------*************************------
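The FRAGREP line marks conserved columns with `*` and all other columns with `-`. As an illustration only (this is not aln2pattern's own code, and the function name `annotated_blocks` is hypothetical), the following Python sketch shows how such an annotation line could be turned into column ranges, one per conserved block:

```python
def annotated_blocks(annotation: str, mark: str = "*") -> list[tuple[int, int]]:
    """Return (start, end) column ranges of consecutive marked columns.

    Columns are 0-based and `end` is exclusive, as in Python slices.
    """
    blocks = []
    start = None
    for i, ch in enumerate(annotation):
        if ch == mark and start is None:
            start = i                      # a new block begins here
        elif ch != mark and start is not None:
            blocks.append((start, i))      # the current block just ended
            start = None
    if start is not None:                  # block runs to the end of the line
        blocks.append((start, len(annotation)))
    return blocks


# Shortened stand-in for the annotation line of the alignment.
print(annotated_blocks("****----***--"))   # [(0, 4), (8, 11)]
```

Applied to the annotation line of the YRNA alignment, the same function would return one range per conserved block, which can then be used to slice the aligned sequences.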
This annotated alignment can now be processed
with aln2pattern to obtain our search pattern, which we
redirect into a file named YRNA.pattern:
aln2pattern -m YRNA.aln > YRNA.pattern
The pattern file YRNA.pattern contains a description of
the annotated columns as a matrix, along with some further constraints
such as match thresholds and distance constraints between the
blocks:
2 matrices
0 0 M0:GGCUGGUCCGADGGYAGUGG 0.93 0
47 66 M1:YCCCACURCUKMACUUGACURGYCU 0.89 0
M0:GGCUGGUCCGADGGYAGUGG
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 25 | 9 | 3 | 0 | 0 | 26 | 1 | 0 | 0 | 6 |
| 0 | 0 | 25 | 0 | 0 | 0 | 0 | 26 | 26 | 0 | 1 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 |
| 26 | 26 | 0 | 0 | 25 | 26 | 0 | 0 | 0 | 26 | 0 | 6 | 16 | 26 | 0 | 0 | 25 | 0 | 26 | 19 |
| 0 | 0 | 1 | 26 | 0 | 0 | 26 | 0 | 0 | 0 | 0 | 11 | 7 | 0 | 15 | 0 | 0 | 26 | 0 | 1 |
| # | G | G | C | U | G | G | U | C | C | G | A | D | G | G | Y | A | G | U | G | G |
M1:YCCCACURCUKMACUUGACURGYCU
| 1 | 0 | 0 | 0 | 26 | 0 | 0 | 8 | 0 | 0 | 7 | 8 | 24 | 0 | 0 | 1 | 0 | 26 | 0 | 0 | 18 | 0 | 0 | 0 | 0 |
| 8 | 26 | 19 | 26 | 0 | 26 | 0 | 0 | 26 | 1 | 0 | 17 | 2 | 19 | 0 | 3 | 0 | 0 | 26 | 2 | 0 | 0 | 10 | 22 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 26 | 0 | 0 | 0 | 8 | 26 | 0 | 0 | 0 |
| 17 | 0 | 7 | 0 | 0 | 0 | 26 | 0 | 0 | 25 | 8 | 1 | 0 | 7 | 26 | 22 | 0 | 0 | 0 | 24 | 0 | 0 | 16 | 4 | 26 |
| # | Y | C | C | C | A | C | U | R | C | U | K | M | A | C | U | U | G | A | C | U | R | G | Y | C | U |
The parameters are chosen such that every sequence in the original
alignment yields a match. The first block is required to match with a
score of at least 0.93 and 0 insertions or deletions. Possible match
scores range between 0 and 1; the number of insertions or deletions
can be any non-negative integer. Further constraints are given by the
values 47 and 66 in front of the second pattern. These specify a lower
bound and an upper bound on the number of nucleotides between the two
blocks.
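To make the meaning of the five fields on a constraint line concrete, here is a minimal Python sketch that reads one such line. It assumes the layout shown in the tutorial (lower bound, upper bound, `name:consensus`, score threshold, number of insertions or deletions); the class and field names are hypothetical and not part of fragrep.

```python
from dataclasses import dataclass


@dataclass
class BlockConstraint:
    min_gap: int      # lower bound on nucleotides between this block and the previous one
    max_gap: int      # upper bound on that distance
    name: str         # matrix name, e.g. "M1"
    consensus: str    # consensus string of the block
    threshold: float  # required match score, between 0 and 1
    max_indels: int   # allowed insertions or deletions


def parse_constraint(line: str) -> BlockConstraint:
    """Split one constraint line of a pattern file into its fields."""
    lo, hi, label, threshold, indels = line.split()
    name, consensus = label.split(":", 1)
    return BlockConstraint(int(lo), int(hi), name, consensus,
                           float(threshold), int(indels))


def gap_allowed(constraint: BlockConstraint, gap: int) -> bool:
    """Check a distance between two blocks against the bounds."""
    return constraint.min_gap <= gap <= constraint.max_gap


m1 = parse_constraint("47 66 M1:YCCCACURCUKMACUUGACURGYCU 0.89 0")
print(m1.name, m1.min_gap, m1.max_gap, m1.threshold, m1.max_indels)  # M1 47 66 0.89 0
print(gap_allowed(m1, 53), gap_allowed(m1, 80))                      # True False
```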
Note that a graphical representation of the pattern created
by aln2pattern can be found in the
file aln2pattern.eps. PostScript output can also be created
from any valid fragrep pattern file using the
tool pattern2eps.
Searching for Matches
The pattern created above can readily be used to start a fragrep
search. Imagine, however, that we want to search a genome that is
phylogenetically divergent from all the sequences in
the alignment YRNA.aln. In this case, we can tune the
parameters for more fault tolerance, for instance by modifying the
first three lines to
2 matrices
0 0 M0:GGCUGGUCCGADGGYAGUGG 0.9 1
30 80 M1:YCCCACURCUKMACUUGACURGYCU 0.8 1
M0:GGCUGGUCCGADGGYAGUGG
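The same edit can also be scripted instead of done by hand. The following Python sketch rewrites only the constraint lines at the top of a pattern file and leaves the matrices untouched. It assumes the header layout shown in this tutorial (a count line followed by one constraint line per block); the function name `relax_header` is hypothetical and not part of fragrep.

```python
def relax_header(pattern_text: str,
                 new_params: dict[str, tuple[int, int, float, int]]) -> str:
    """Replace bounds, threshold and indel count for the named blocks.

    new_params maps a matrix name (e.g. "M1") to
    (lower bound, upper bound, score threshold, insertions/deletions).
    """
    lines = pattern_text.splitlines()
    n_blocks = int(lines[0].split()[0])        # "2 matrices" -> 2
    for i in range(1, n_blocks + 1):           # constraint lines follow the count line
        label = lines[i].split()[2]            # e.g. "M1:YCCCACUR..."
        name = label.split(":", 1)[0]
        if name in new_params:
            lo, hi, threshold, indels = new_params[name]
            lines[i] = f"{lo} {hi} {label} {threshold} {indels}"
    return "\n".join(lines) + "\n"


header = (
    "2 matrices\n"
    "0 0 M0:GGCUGGUCCGADGGYAGUGG 0.93 0\n"
    "47 66 M1:YCCCACURCUKMACUUGACURGYCU 0.89 0\n"
    "M0:GGCUGGUCCGADGGYAGUGG\n"
)
relaxed = relax_header(header, {"M0": (0, 0, 0.9, 1), "M1": (30, 80, 0.8, 1)})
print(relaxed, end="")
```

In practice the text would be read from YRNA.pattern and the result written to a new file, so that the pattern generated by aln2pattern is kept for comparison.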
We use this modified version of YRNA.pattern to start
a fragrep search, using option -q to avoid
reporting redundant matches:
fragrep -q YRNA.pattern X.fa
This query indeed yields the following match, which can be processed
further to test for a suitable RNA secondary structure, possibly
characteristic promoter sequences, or other evidence of a functional
ncRNA gene.
>X-fwd:pos780 weight=11.0799 p-value=1.54188e-05
TAGTGGTCCGATGGTAGTGGGTTATCAGAACTTATTAACATTAGTGTCACTAAAGAAGTCTTGATATACAACCCCCCACTGCTAAATTTGACTAGCT
>matchsequence
GG-UGGUCCGAWKGYAGUGG-----------------------------------------------------YCYCACURCUDMAYUUGACURGCU
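For further processing, the header line of each match can be split into its parts. The Python sketch below is based only on the single header shown above, so the regular expression is an assumption about the general output format; the function name and dictionary keys are hypothetical, and whether the reported position is 0-based or 1-based is not specified here.

```python
import re

# Pattern inferred from the one header line shown in this tutorial.
HEADER_RE = re.compile(
    r"^>(?P<sequence>.+)-(?P<strand>\w+):pos(?P<position>\d+)\s+"
    r"weight=(?P<weight>\S+)\s+p-value=(?P<pvalue>\S+)\s*$"
)


def parse_hit_header(line: str):
    """Return the fields of a match header as a dict, or None if it does not fit."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    return {
        "sequence": m.group("sequence"),
        "strand": m.group("strand"),
        "position": int(m.group("position")),
        "weight": float(m.group("weight")),
        "pvalue": float(m.group("pvalue")),
    }


hit = parse_hit_header(">X-fwd:pos780 weight=11.0799 p-value=1.54188e-05")
print(hit["sequence"], hit["strand"], hit["position"])  # X fwd 780
```

With the fields available as numbers, matches can for example be sorted by weight or filtered by p-value before secondary structure is examined.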
References
Axel Mosig, Julian Chen, Peter F. Stadler,
Homology Search with Fragmented Nucleic Acid Sequence Patterns,
Proc. Worksh. Alg. Bioinf. (WABI), 2007.
Axel Mosig, PICB Shanghai