fragrep Tutorial
Homology Search with Fragmented Nucleic Acid Sequence Patterns using fragrep
Introduction
The fragrep tool was developed to detect distantly homologous sequences
that exhibit a commonly observed fragmented conservation pattern.
fragrep comes with a collection of companion tools, most notably
aln2pattern, for extracting homology patterns from multiple sequence
alignments as well as visualizing them.
Creating a Search Pattern
A typical starting point for creating PWM-based search patterns is a
multiple sequence alignment. Often, when dealing with the aligned
members of a non-coding RNA family, these alignments exhibit certain
conserved blocks. The tool aln2pattern relies on manual identification
and annotation of these blocks and converts them into a search pattern
that can be used by fragrep.
The following example shows an alignment of YRNA sequences from
different vertebrate species. By visual inspection, we easily identify
two conserved blocks, which we mark with asterisks in a separate line
labelled FRAGREP at the end of the alignment:
CLUSTALW
Ta_gut_Y1 GGCTGGTCCGAAGGTAGTGGGGTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGATCTCCTTGTTCT-CTCTTTCCCC--CCTTCCCACTACTGCACTCGACTAGTCTTTT---
Ga_gal_Y1 GGCTGGTCCGAAGGTAGTGGGGTGTCTCAATTGATTGTTCACAGTCAGTTACAGATTGATCTCCTCGTTCT-CTCTTTCCCC--CCTTCCCACTACTGCACTTGACTAGTCTTTT---
Or_ana_Y1 GGCTAGTCCGAAGGTAGTGAGTTATCACAATTGATTGTTCACAGTCAGTTACAGATCGATATCCCTGTTCT-CTCTTCCTCCCACCTTCTCACTACCGTATTTGACTAGTCTTTT---
Mo_dom_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAAATGATTGTTCACAGTCAGTTACAGATCGATCTCCTTGTTCT-CTCTTTCCCTC-CCTTCTCACTACTGCACTCGACTAGTCTTTT---
Tu_tru_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATCGAACTCCTCGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTT-----
Ho_sap_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATCGAACTCCTTGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTT-----
Mu_mus_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGAACTCCT-GTTCTACACTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTTTT---
Sp_tri_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGAACTCCTTGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTTTT---
An_car_Y3 GGCTGGTCCGATTGCAGTGGT----------ACTTATAATTAATTGATCACAGTCAGTTACAGGTTTCTTTGTTCTTT---CTCCACTCCCACTGCTTCACTTGACTAGTCTTTT---
Ta_gut_Y3 GGCTGGTCCGATGGCAGTGGT----------ATTTA----TAATTGATCACAGTCAGTTACAGATTTCTTTGTTCTTT---CTCCACACCCACTGCTGCACTTGACTAGTCTTTT---
Ga_gal_Y3 GGCTGGTCCGATGGCAATGGC----------ATTTATAA-TAATTGATCACAGTCCGTTACAGATTTCTTTGCTCTTT---CTCCACTCCCACTGCTGCATTTGACTAGTCTTTT---
Or_ana_Y3 GGCTGGTCCGAGTGCAGTGGA----------ATTTATAATTAATTGATCACAGTCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGTCTTTT---
Mo_dom_Y3 GGCTGGTCCGATTGCAGTGGT----------AACTCTAATTAATTGATTACAGTCAGTTACAGATTTCTTTGTTCTTT---CTCCGCTCCCACTGCTTCACTTGACTAGTCTTTT---
Tu_tru_Y3 GGCTGGTCCGCGTGCAGTGGT----------GTTTACAATTAATTGATCACAGCCAGTTACAGATTTC-TTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTTTT---
Ho_sap_Y3 GGCTGGTCCGAGTGCAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTT-----
Sp_tri_Y3 GGCTGGTCCGAGTGCAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTTTT---
Mu_mus_Y3 GGTTGGTCCGAGAGTAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCGCTCCCACTGCTTCACTTGACCAGCCTTTT---
Lo_afr_Y3 GGCTGGTCCGAGTGCAGTGTG--------AGGCTTACAACTAATTGATCACAACCAGTTACAGATTCCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTAGACCGGTCTTTT---
An_car_Y4 GGCTGGTCCGAAAGTAGTGGGTTACCA----------CAGAAATTATTACAGTT-AGTTTCACTAACCTTTCTAAGT-----TCCA-CCCCACTGCTAACCTTGACTGGTCTCCTTTT
Ta_gut_Y4 GGCTGGTCCGATGGTAGTGGGTAGT------------CAGAAATTATTACTGCT-ACTTAAGCTAACCTTTCTATAT-----TCCA-CCCCACTGCTAACCTCGACTGGTCTTT----
Or_ana_Y4 GGCTGGTCCGATAGTAGTGGGTGCC------------CAGAACTCTTTAATAT--AGTTTCACTAAACTTGGTATAT-----TCCA-CCTCACTGCTAAACTTGACTGGCCTTTT---
Mo_dom_Y4 GGCTGGTCCGATGGCAGTGGTTTAC------------CAGAACTTATTGATATT-AGTTTCACAACAAGTTAATATAT----TCCACCCCCACTGCTAAATTTGACTGGCCTTTT---
Tu_tru_Y4 GGCTGGTCCGATGGTAGTGGGTTAC------------CAGAACTTATTAACGTT-AGTGTCACTAAAGTTGGTATACAA----CC--CCCCACTGCTAAATTTGACTAGCTTT-----
Lo_afr_Y4 GGCTGGTCCGATGGCAGTGGGTTAT------------CAGAACTTATTAACGTT-AGTGTCACTAAAGTTGGTATACAA----CC-CCCCCACTGCTGAATTTGACTGGCTTTTT---
Ho_sap_Y4 GGCTGGTCCGATGGTAGTGGGTTAT------------CAGAACTTATTAACATT-AGTGTCACTAAAGTTGGTATACAA----CC--CCCCACTGCTAAATTTGACTGGCTTT-----
Sp_tri_Y4 GGCTGGTCCGATGGTAGTGGGTTAT------------CAGAACTTATTAACATT-AGTGTCACTAAAGTTGATATACAA----CC--CCCCACTGCTAAATTTGACTGGCTTTTT---
FRAGREP ********************-------------------------------------------------------------------*************************------
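The annotation line can be read mechanically as a list of column intervals. The following Python sketch only illustrates that reading and is not part of fragrep; the function name `annotated_blocks` and the shortened toy mask are assumptions made for this example.

```python
from itertools import groupby


def annotated_blocks(mask, marker="*"):
    """Return half-open (start, end) column intervals of consecutive markers."""
    blocks = []
    column = 0
    for char, run in groupby(mask):
        length = len(list(run))
        if char == marker:
            blocks.append((column, column + length))
        column += length
    return blocks


# Toy annotation line, much shorter than the YRNA example: two blocks.
toy_mask = "****------*****--"
print(annotated_blocks(toy_mask))
```

Applied to the annotation line of the YRNA alignment, the same function would report two intervals, one for each conserved block.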
This annotated alignment can now be processed with aln2pattern to
obtain our search pattern, which we redirect into a file named
YRNA.pattern.
aln2pattern -m YRNA.aln > YRNA.pattern
The pattern file YRNA.pattern contains a description of the annotated
columns as a matrix, along with some further constraints such as match
thresholds and distance constraints between the blocks:
2 matrices
0 0 M0:GGCUGGUCCGADGGYAGUGG 0.93 0
47 66 M1:YCCCACURCUKMACUUGACURGYCU 0.89 0
M0:GGCUGGUCCGADGGYAGUGG
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 25 | 9 | 3 | 0 | 0 | 26 | 1 | 0 | 0 | 6 |
| 0 | 0 | 25 | 0 | 0 | 0 | 0 | 26 | 26 | 0 | 1 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 |
| 26 | 26 | 0 | 0 | 25 | 26 | 0 | 0 | 0 | 26 | 0 | 6 | 16 | 26 | 0 | 0 | 25 | 0 | 26 | 19 |
| 0 | 0 | 1 | 26 | 0 | 0 | 26 | 0 | 0 | 0 | 0 | 11 | 7 | 0 | 15 | 0 | 0 | 26 | 0 | 1 |
# | G | G | C | U | G | G | U | C | C | G | A | D | G | G | Y | A | G | U | G | G |
M1:YCCCACURCUKMACUUGACURGYCU
| 1 | 0 | 0 | 0 | 26 | 0 | 0 | 8 | 0 | 0 | 7 | 8 | 24 | 0 | 0 | 1 | 0 | 26 | 0 | 0 | 18 | 0 | 0 | 0 | 0 |
| 8 | 26 | 19 | 26 | 0 | 26 | 0 | 0 | 26 | 1 | 0 | 17 | 2 | 19 | 0 | 3 | 0 | 0 | 26 | 2 | 0 | 0 | 10 | 22 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 26 | 0 | 0 | 0 | 8 | 26 | 0 | 0 | 0 |
| 17 | 0 | 7 | 0 | 0 | 0 | 26 | 0 | 0 | 25 | 8 | 1 | 0 | 7 | 26 | 22 | 0 | 0 | 0 | 24 | 0 | 0 | 16 | 4 | 26 |
# | Y | C | C | C | A | C | U | R | C | U | K | M | A | C | U | U | G | A | C | U | R | G | Y | C | U |
The parameters are the tightest settings that still yield a match for
every sequence in the original alignment. The first block is required
to match with a score of at least 0.93 and 0 insertions or deletions;
possible match scores range between 0 and 1, and the number of
insertions or deletions can be any non-negative integer. Further
constraints are given by the values 47 and 66 in front of the second
pattern. They specify a lower bound and an upper bound on the number of
nucleotides between the two blocks.
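To make the distance bounds concrete, the following Python sketch shows one plausible way such bounds could be derived from an annotated alignment: count the non-gap characters between the two blocks in every row and take the minimum and maximum. This is an assumption for illustration only, since the tutorial does not describe how aln2pattern computes them; the function name `gap_bounds` and the toy sequences are hypothetical.

```python
def gap_bounds(rows, first_block, second_block, gap_char="-"):
    """Smallest and largest number of residues between two annotated blocks.

    rows: aligned sequences of equal length
    first_block, second_block: half-open (start, end) column intervals
    """
    between = slice(first_block[1], second_block[0])
    # Alignment gaps are not nucleotides, so they are removed before counting.
    counts = [len(row[between].replace(gap_char, "")) for row in rows]
    return min(counts), max(counts)


# Toy alignment with hypothetical sequences (not the YRNA data).
toy_rows = [
    "GGCUAAC--GUCCA",
    "GGCUA----GUCCA",
    "GGCUAACUUGUCCA",
]
print(gap_bounds(toy_rows, (0, 4), (9, 14)))
```

Bounds obtained this way are the narrowest range that still covers every row, which is why a more divergent target sequence may need a wider range.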
Note that a graphical representation of the pattern created by
aln2pattern can be found in the file aln2pattern.eps. PostScript output
can also be created from any valid fragrep pattern file using the tool
pattern2eps.
Searching for Matches
The pattern created above can readily be used to start a fragrep
search. Suppose, however, that we want to search a genome that is
phylogenetically distant from all the sequences in the alignment
YRNA.aln. In this case, we can tune the parameters for more fault
tolerance, for instance by modifying the first three lines to
2 matrices
0 0 M0:GGCUGGUCCGADGGYAGUGG 0.9 1
30 80 M1:YCCCACURCUKMACUUGACURGYCU 0.8 1
M0:GGCUGGUCCGADGGYAGUGG
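The same edit can be scripted. The Python sketch below assumes that each block header consists of five whitespace-separated fields in the order shown in this tutorial (lower distance bound, upper distance bound, labelled consensus, score threshold, number of insertions or deletions); the names `BlockHeader`, `parse_header` and `format_header` are hypothetical helpers, not part of fragrep.

```python
from typing import NamedTuple


class BlockHeader(NamedTuple):
    min_dist: int     # lower bound on nucleotides between the previous block and this one
    max_dist: int     # upper bound on nucleotides between the previous block and this one
    label: str        # matrix name and consensus, e.g. "M1:YCCC..."
    threshold: float  # required match score, between 0 and 1
    max_indels: int   # tolerated insertions or deletions


def parse_header(line):
    """Split one block header line into typed fields (assumed field order)."""
    min_dist, max_dist, label, threshold, max_indels = line.split()
    return BlockHeader(int(min_dist), int(max_dist), label,
                       float(threshold), int(max_indels))


def format_header(header):
    """Write the fields back in the same order, separated by single spaces."""
    return "{} {} {} {:g} {}".format(*header)


original = parse_header("47 66 M1:YCCCACURCUKMACUUGACURGYCU 0.89 0")
relaxed = original._replace(min_dist=30, max_dist=80, threshold=0.8, max_indels=1)
print(format_header(relaxed))
```

Only the header lines are rewritten here; the matrices that follow them in the pattern file stay unchanged.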
We use this modified version of YRNA.pattern to start a fragrep search,
using option -q to avoid reporting redundant matches:
fragrep -q YRNA.pattern X.fa
This query indeed yields the following match, which can be processed
further to test for a suitable RNA secondary structure, possibly
characteristic promoter sequences, or other evidence of a functional
ncRNA gene.
>X-fwd:pos780 weight=11.0799 p-value=1.54188e-05
TAGTGGTCCGATGGTAGTGGGTTATCAGAACTTATTAACATTAGTGTCACTAAAGAAGTCTTGATATACAACCCCCCACTGCTAAATTTGACTAGCT
>matchsequence
GG-UGGUCCGAWKGYAGUGG-----------------------------------------------------YCYCACURCUDMAYUUGACURGCU
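For downstream processing it can help to turn the match header into structured fields. The Python sketch below is modelled only on the single header line shown above; the regular expression, the field names and the helper `parse_match_header` are assumptions, and other fragrep output may contain variations that this sketch does not handle.

```python
import re

# Assumed layout: ">NAME-STRAND:posN weight=W p-value=P", as in the example above.
MATCH_HEADER = re.compile(
    r"^>(?P<name>.+)-(?P<strand>[A-Za-z]+):pos(?P<pos>\d+)"
    r"\s+weight=(?P<weight>\S+)\s+p-value=(?P<pvalue>\S+)$"
)


def parse_match_header(line):
    """Return the fields of one match header line as a dictionary."""
    found = MATCH_HEADER.match(line.strip())
    if found is None:
        raise ValueError("unrecognized match header: " + line)
    return {
        "name": found.group("name"),
        "strand": found.group("strand"),
        "pos": int(found.group("pos")),
        "weight": float(found.group("weight")),
        "pvalue": float(found.group("pvalue")),
    }


hit = parse_match_header(">X-fwd:pos780 weight=11.0799 p-value=1.54188e-05")
print(hit["name"], hit["strand"], hit["pos"], hit["weight"], hit["pvalue"])
```

Parsed hits can then be sorted or filtered by p-value before they are passed on to secondary structure checks.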
References
Axel Mosig, Julian Chen, Peter F. Stadler,
Homology Search with Fragmented Nucleic Acid Sequence Patterns,
Proc. Worksh. Alg. Bioinf. (WABI), 2007.
Axel Mosig, PICB Shanghai