fragrep Tutorial

Homology Search with Fragmented Nucleic Acid Sequence Patterns using fragrep

Introduction

The fragrep tool was developed to detect distantly homologous sequences that exhibit a commonly observed fragmented conservation pattern. fragrep comes with a collection of companion tools, most notably aln2pattern, which extracts homology patterns from multiple sequence alignments, as well as tools for visualizing these patterns.

Creating a Search Pattern

A typical starting point for creating PWM-based search patterns is a multiple sequence alignment. Alignments of the members of a non-coding RNA family often exhibit a few well-conserved blocks. The tool aln2pattern relies on manual identification and annotation of these blocks and converts them into a search pattern that can be used by fragrep. The following example shows an alignment of YRNA sequences from different vertebrate species. By visual inspection, we easily identify two conserved blocks, which we mark with asterisks in a separate line labelled FRAGREP at the end of the alignment:

CLUSTALW

Ta_gut_Y1 GGCTGGTCCGAAGGTAGTGGGGTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGATCTCCTTGTTCT-CTCTTTCCCC--CCTTCCCACTACTGCACTCGACTAGTCTTTT---
Ga_gal_Y1 GGCTGGTCCGAAGGTAGTGGGGTGTCTCAATTGATTGTTCACAGTCAGTTACAGATTGATCTCCTCGTTCT-CTCTTTCCCC--CCTTCCCACTACTGCACTTGACTAGTCTTTT---
Or_ana_Y1 GGCTAGTCCGAAGGTAGTGAGTTATCACAATTGATTGTTCACAGTCAGTTACAGATCGATATCCCTGTTCT-CTCTTCCTCCCACCTTCTCACTACCGTATTTGACTAGTCTTTT---
Mo_dom_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAAATGATTGTTCACAGTCAGTTACAGATCGATCTCCTTGTTCT-CTCTTTCCCTC-CCTTCTCACTACTGCACTCGACTAGTCTTTT---
Tu_tru_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATCGAACTCCTCGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTT-----
Ho_sap_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATCGAACTCCTTGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTT-----
Mu_mus_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGAACTCCT-GTTCTACACTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTTTT---
Sp_tri_Y1 GGCTGGTCCGAAGGTAGTGAGTTATCTCAATTGATTGTTCACAGTCAGTTACAGATTGAACTCCTTGTTCTACTCTTTCCCC--CCTTCTCACTACTGCACTTGACTAGTCTTTT---
An_car_Y3 GGCTGGTCCGATTGCAGTGGT----------ACTTATAATTAATTGATCACAGTCAGTTACAGGTTTCTTTGTTCTTT---CTCCACTCCCACTGCTTCACTTGACTAGTCTTTT---
Ta_gut_Y3 GGCTGGTCCGATGGCAGTGGT----------ATTTA----TAATTGATCACAGTCAGTTACAGATTTCTTTGTTCTTT---CTCCACACCCACTGCTGCACTTGACTAGTCTTTT---
Ga_gal_Y3 GGCTGGTCCGATGGCAATGGC----------ATTTATAA-TAATTGATCACAGTCCGTTACAGATTTCTTTGCTCTTT---CTCCACTCCCACTGCTGCATTTGACTAGTCTTTT---
Or_ana_Y3 GGCTGGTCCGAGTGCAGTGGA----------ATTTATAATTAATTGATCACAGTCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGTCTTTT---
Mo_dom_Y3 GGCTGGTCCGATTGCAGTGGT----------AACTCTAATTAATTGATTACAGTCAGTTACAGATTTCTTTGTTCTTT---CTCCGCTCCCACTGCTTCACTTGACTAGTCTTTT---
Tu_tru_Y3 GGCTGGTCCGCGTGCAGTGGT----------GTTTACAATTAATTGATCACAGCCAGTTACAGATTTC-TTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTTTT---
Ho_sap_Y3 GGCTGGTCCGAGTGCAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTT-----
Sp_tri_Y3 GGCTGGTCCGAGTGCAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTTGACTAGCCTTTT---
Mu_mus_Y3 GGTTGGTCCGAGAGTAGTGGT----------GTTTACAACTAATTGATCACAACCAGTTACAGATTTCTTTGTTCCTT---CTCCGCTCCCACTGCTTCACTTGACCAGCCTTTT---
Lo_afr_Y3 GGCTGGTCCGAGTGCAGTGTG--------AGGCTTACAACTAATTGATCACAACCAGTTACAGATTCCTTTGTTCCTT---CTCCACTCCCACTGCTTCACTAGACCGGTCTTTT---
An_car_Y4 GGCTGGTCCGAAAGTAGTGGGTTACCA----------CAGAAATTATTACAGTT-AGTTTCACTAACCTTTCTAAGT-----TCCA-CCCCACTGCTAACCTTGACTGGTCTCCTTTT
Ta_gut_Y4 GGCTGGTCCGATGGTAGTGGGTAGT------------CAGAAATTATTACTGCT-ACTTAAGCTAACCTTTCTATAT-----TCCA-CCCCACTGCTAACCTCGACTGGTCTTT----
Or_ana_Y4 GGCTGGTCCGATAGTAGTGGGTGCC------------CAGAACTCTTTAATAT--AGTTTCACTAAACTTGGTATAT-----TCCA-CCTCACTGCTAAACTTGACTGGCCTTTT---
Mo_dom_Y4 GGCTGGTCCGATGGCAGTGGTTTAC------------CAGAACTTATTGATATT-AGTTTCACAACAAGTTAATATAT----TCCACCCCCACTGCTAAATTTGACTGGCCTTTT---
Tu_tru_Y4 GGCTGGTCCGATGGTAGTGGGTTAC------------CAGAACTTATTAACGTT-AGTGTCACTAAAGTTGGTATACAA----CC--CCCCACTGCTAAATTTGACTAGCTTT-----
Lo_afr_Y4 GGCTGGTCCGATGGCAGTGGGTTAT------------CAGAACTTATTAACGTT-AGTGTCACTAAAGTTGGTATACAA----CC-CCCCCACTGCTGAATTTGACTGGCTTTTT---
Ho_sap_Y4 GGCTGGTCCGATGGTAGTGGGTTAT------------CAGAACTTATTAACATT-AGTGTCACTAAAGTTGGTATACAA----CC--CCCCACTGCTAAATTTGACTGGCTTT-----
Sp_tri_Y4 GGCTGGTCCGATGGTAGTGGGTTAT------------CAGAACTTATTAACATT-AGTGTCACTAAAGTTGATATACAA----CC--CCCCACTGCTAAATTTGACTGGCTTTTT---
FRAGREP   ********************-------------------------------------------------------------------*************************------


This annotated alignment can now be processed with aln2pattern to obtain our search pattern, which we redirect into a file named YRNA.pattern.

aln2pattern -m YRNA.aln > YRNA.pattern


The pattern file YRNA.pattern contains a description of the annotated columns as a matrix, along with some further constraints such as match thresholds and distance constraints between the blocks:

2 matrices
 0  0 M0:GGCUGGUCCGADGGYAGUGG 0.93 0
47 66 M1:YCCCACURCUKMACUUGACURGYCU 0.89 0
M0:GGCUGGUCCGADGGYAGUGG
0 0 0 0 1 0 0 0 0 0 25 9 3 0 0 26 1 0 0 6
0 0 25 0 0 0 0 26 26 0 1 0 0 0 11 0 0 0 0 0
26 26 0 0 25 26 0 0 0 26 0 6 16 26 0 0 25 0 26 19
0 0 1 26 0 0 26 0 0 0 0 11 7 0 15 0 0 26 0 1
# G G C U G G U C C G A D G G Y A G U G G
M1:YCCCACURCUKMACUUGACURGYCU
1 0 0 0 26 0 0 8 0 0 7 8 24 0 0 1 0 26 0 0 18 0 0 0 0
8 26 19 26 0 26 0 0 26 1 0 17 2 19 0 3 0 0 26 2 0 0 10 22 0
0 0 0 0 0 0 0 18 0 0 11 0 0 0 0 0 26 0 0 0 8 26 0 0 0
17 0 7 0 0 0 26 0 0 25 8 1 0 7 26 22 0 0 0 24 0 0 16 4 26
# Y C C C A C U R C U K M A C U U G A C U R G Y C U
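To make the matrix part of the listing more concrete, the following minimal Python sketch reads the four count rows of block M0 and turns them into per-column relative frequencies. The row order A, C, G, U is an assumption inferred by comparing the counts with the consensus line; it is not stated in the pattern file. The names parse_counts and column_frequencies are hypothetical helpers and not part of fragrep.

```python
# Count rows of block M0, copied from the pattern file shown above.
# ASSUMPTION: the rows are ordered A, C, G, U (inferred from the consensus).
M0_ROWS = """\
0 0 0 0 1 0 0 0 0 0 25 9 3 0 0 26 1 0 0 6
0 0 25 0 0 0 0 26 26 0 1 0 0 0 11 0 0 0 0 0
26 26 0 0 25 26 0 0 0 26 0 6 16 26 0 0 25 0 26 19
0 0 1 26 0 0 26 0 0 0 0 11 7 0 15 0 0 26 0 1
"""
BASES = "ACGU"


def parse_counts(text):
    """Parse four whitespace-separated count rows into {base: [counts]}."""
    rows = [[int(x) for x in line.split()] for line in text.strip().splitlines()]
    if len(rows) != len(BASES) or len({len(r) for r in rows}) != 1:
        raise ValueError("expected four count rows of equal length")
    return dict(zip(BASES, rows))


def column_frequencies(counts):
    """Convert raw counts into relative frequencies, one dict per column."""
    width = len(counts[BASES[0]])
    freqs = []
    for i in range(width):
        total = sum(counts[b][i] for b in BASES)
        if total == 0:
            raise ValueError(f"column {i} has no observations")
        freqs.append({b: counts[b][i] / total for b in BASES})
    return freqs


counts = parse_counts(M0_ROWS)
freqs = column_frequencies(counts)
# Each column should add up to the number of aligned sequences.
column_totals = [sum(counts[b][i] for b in BASES) for i in range(len(freqs))]
print(column_totals)
print(freqs[11])  # the column whose consensus symbol is D
```

Under this reading, every column sums to the number of sequences in the alignment, which is a quick sanity check when a pattern file has been edited by hand.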


The parameters written by aln2pattern are the least restrictive values that still yield a match for every sequence in the original alignment. The first block is required to match with a score of at least 0.93 and with 0 insertions or deletions. Match scores range between 0 and 1; the number of insertions or deletions can be any non-negative integer. Further constraints are given by the values 47 and 66 in front of the second pattern. These specify a lower bound and an upper bound on the number of nucleotides between the two blocks.
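The field layout described above can be illustrated with a short Python sketch that splits a constraint line into its parts and tests a gap length against the distance bounds. The field names and the helpers parse_constraint and gap_allowed are hypothetical and chosen for this illustration; the interpretation of the columns follows the description in this tutorial only.

```python
from typing import NamedTuple


class BlockConstraint(NamedTuple):
    min_gap: int      # lower bound on nucleotides before this block
    max_gap: int      # upper bound on nucleotides before this block
    name: str         # block identifier, e.g. "M1"
    consensus: str    # consensus string of the block
    threshold: float  # required match score, between 0 and 1
    max_indels: int   # permitted insertions or deletions


def parse_constraint(line):
    """Split one constraint line of a pattern file into typed fields."""
    min_gap, max_gap, block, threshold, max_indels = line.split()
    name, consensus = block.split(":", 1)
    return BlockConstraint(int(min_gap), int(max_gap), name, consensus,
                           float(threshold), int(max_indels))


def gap_allowed(constraint, gap):
    """Check whether a gap length lies within the distance bounds."""
    return constraint.min_gap <= gap <= constraint.max_gap


# The two constraint lines from the pattern file produced by aln2pattern.
header = [
    " 0  0 M0:GGCUGGUCCGADGGYAGUGG 0.93 0",
    "47 66 M1:YCCCACURCUKMACUUGACURGYCU 0.89 0",
]
constraints = [parse_constraint(line) for line in header]
for c in constraints:
    print(c.name, c.min_gap, c.max_gap, c.threshold, c.max_indels)
print(gap_allowed(constraints[1], 50), gap_allowed(constraints[1], 67))
```

Keeping the fields in a named tuple makes it easy to compare a hand-edited pattern against the values originally written by aln2pattern.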

Note that aln2pattern also writes a graphical representation of the pattern to the file aln2pattern.eps. PostScript output can also be created from any valid fragrep pattern file using the tool pattern2eps.

Searching for Matches

The pattern created above can readily be used to start a fragrep search. Suppose, however, that we want to search a genome that is phylogenetically divergent from all the sequences in the alignment YRNA.aln. In this case, we can tune the parameters for more fault tolerance, for instance by modifying the first three lines as follows (the unchanged M0 line is shown for context):

2 matrices
 0  0 M0:GGCUGGUCCGADGGYAGUGG 0.9 1
30 80 M1:YCCCACURCUKMACUUGACURGYCU 0.8 1
M0:GGCUGGUCCGADGGYAGUGG


We use this modified version of YRNA.pattern to start a fragrep search, using option -q to avoid reporting redundant matches:

fragrep -q YRNA.pattern X.fa


This query indeed yields the following match, which can be processed further to test for a suitable RNA secondary structure, possibly characteristic promoter sequences, or other evidence of a functional ncRNA gene.

>X-fwd:pos780 weight=11.0799 p-value=1.54188e-05
TAGTGGTCCGATGGTAGTGGGTTATCAGAACTTATTAACATTAGTGTCACTAAAGAAGTCTTGATATACAACCCCCCACTGCTAAATTTGACTAGCT
>matchsequence
GG-UGGUCCGAWKGYAGUGG-----------------------------------------------------YCYCACURCUDMAYUUGACURGCU
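As a starting point for such further processing, the header of a reported match can be split into its identifier and its numeric attributes. The following Python sketch assumes that headers follow the layout of the single example above (an identifier followed by key=value fields); this layout is inferred from that one line and is not a documented fragrep format. The helper parse_hit_header and the cutoff value are hypothetical.

```python
def parse_hit_header(header):
    """Split a match header into its identifier and numeric key=value fields."""
    fields = header.lstrip(">").split()
    hit_id = fields[0]
    attrs = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:  # skip any token that is not of the form key=value
            attrs[key] = float(value)
    return hit_id, attrs


hit_id, attrs = parse_hit_header(
    ">X-fwd:pos780 weight=11.0799 p-value=1.54188e-05"
)
# Illustrative cutoff only; choose a threshold appropriate for the search.
P_VALUE_CUTOFF = 1e-3
keep = attrs["p-value"] < P_VALUE_CUTOFF
print(hit_id, attrs, keep)
```

Filtering on the reported p-value in this way is one simple option for reducing a long list of matches before running more expensive checks such as secondary structure prediction.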


References

Axel Mosig, Julian Chen, Peter F. Stadler, Homology Search with Fragmented Nucleic Acid Sequence Patterns, Proc. Worksh. Alg. Bioinf. (WABI), 2007.
Axel Mosig, PICB Shanghai