Practical course in summer semester 2011 (SS11): Modules 10-202-2205 and 10-202-2208
News
The introductory session for the practical course (modules 10-202-2205 and 10-202-2208) takes place on 17 June 2011 at 3 pm in room 109. Please register [HERE] to take part in the practical course.
Alignment graph
The Problem
Genome-wide alignments contain a wealth of information on genome evolution. In particular, they contain information on synteny, i.e., conservation of sequence order, and thus also on rearrangements. The latter appear as breakpoints between syntenic regions. In order to get a global overview, we represent the genome-wide alignment data as a graph and analyse its properties.
The Data
Genome-wide alignments are organized in blocks. Each block consists of homologous sequences of a subset of species. Within each alignment block, entries look like this:
a score=903779.000000
s dm3.chr2L 837206 169 + 23011544 CATCATCAGTTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
s droSim1.chr2L 835292 169 + 22036055 TATCATCAGATCGCG------TTGCTCCTGC---------------TGGTGCACAGCCCGT
i droSim1.chr2L C 0 C 0
s droSec1.super_14 807477 169 + 2068291 TATCATCAGATCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
i droSec1.super_14 C 0 C 0
s droYak2.chr2L 820953 169 + 22324452 CATCATCAGCTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
....
dm3 = Drosophila melanogaster assembly version 3
chr2L = chromosome, super-contig or contig (in general "sequence fragment")
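As a rough illustration of how the fields of the "s" lines can be read, here is a minimal Python sketch. It assumes the usual MAF layout of an "s" line (source name, start, size, strand, source size, aligned text) and that the source name has the form species.fragment. The names Entry and parse_maf are hypothetical helpers, block IDs are simply running numbers, and the sequences in the sample are shortened for brevity.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    species: str    # e.g. "dm3"
    fragment: str   # e.g. "chr2L"
    start: int
    size: int
    strand: str     # "+" or "-"
    src_size: int   # total length of the sequence fragment


def parse_maf(lines):
    """Yield (block_id, [Entry, ...]) for each alignment block."""
    block_id = -1
    entries = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":            # an "a" line opens a new block
            if entries is not None:
                yield block_id, entries
            block_id += 1
            entries = []
        elif fields[0] == "s" and entries is not None:
            species, _, fragment = fields[1].partition(".")
            entries.append(Entry(species, fragment, int(fields[2]),
                                 int(fields[3]), fields[4], int(fields[5])))
        # other line types (such as "i") are ignored in this sketch
    if entries is not None:
        yield block_id, entries


# Sample block from above, with the aligned text shortened.
sample = """\
a score=903779.000000
s dm3.chr2L 837206 169 + 23011544 CATCATC
s droSim1.chr2L 835292 169 + 22036055 TATCATC
i droSim1.chr2L C 0 C 0
s droSec1.super_14 807477 169 + 2068291 TATCATC
""".splitlines()

blocks = list(parse_maf(sample))
for block_id, entries in blocks:
    for e in entries:
        print(block_id, e.species, e.fragment, e.start, e.size, e.strand)
```

Splitting the source name at the first dot keeps fragment names such as super_14 intact.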
The Graph(s)
In our graphs, alignment blocks will be vertices. Directed edges are defined independently for each species.
We draw a directed edge for a given species S from block A to block B if
(i) the sequences of S in blocks A and B are located on the same sequence fragment,
(ii) the sequence of B follows the sequence of A accounting for the reading direction, and
(iii) there is no block C whose sequence of S is entirely between that of A and B.
Since the graphs for all species S are defined on the same set of vertices, we can superimpose their edge sets. We obtain a multigraph, since there is a separate arc for each species.
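One possible reading of conditions (i) to (iii) can be sketched as follows. This is only an illustration under stated assumptions: all positions are mapped to the forward strand of the fragment (in MAF, the start of a "-" entry is given relative to the reverse-complemented source), blocks of one species on one fragment are assumed not to overlap, and edges always point along the forward strand. The function name species_edges and the record layout are hypothetical, and the coordinates are made-up toy values.

```python
from collections import defaultdict


def species_edges(records):
    """records: (block_id, species, fragment, start, size, strand, src_size).

    Returns {species: [(block_a, block_b), ...]}.
    """
    by_fragment = defaultdict(list)
    for block_id, species, fragment, start, size, strand, src_size in records:
        if strand == "-":
            # convert to a forward-strand start coordinate
            start = src_size - start - size
        by_fragment[(species, fragment)].append((start, block_id))

    edges = defaultdict(list)
    for (species, _fragment), placed in by_fragment.items():
        placed.sort()                       # order along the fragment
        for (_, a), (_, b) in zip(placed, placed[1:]):
            edges[species].append((a, b))   # consecutive blocks only
    return dict(edges)


# Toy data (not taken from the real alignment).
records = [
    (0, "dm3", "chr2L", 100, 50, "+", 1000),
    (1, "dm3", "chr2L", 200, 50, "+", 1000),
    (2, "dm3", "chr2L", 400, 50, "+", 1000),
    (0, "droSim1", "chr2L", 100, 50, "+", 1000),
    (2, "droSim1", "chr2L", 300, 50, "+", 1000),
    (1, "droSim1", "chrX", 750, 50, "-", 1000),
]

edges = species_edges(records)
print(edges)
```

Because of the sorting step this sketch runs in O(n log n) time for n entries; if the blocks of the reference species are already ordered in the input, that sort could be avoided for that species.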
The work program
1. Construction of a graph for each species
Extract the necessary information from the maf file. (Don't forget to define IDs for the blocks!)
HINT: use UNIX command line tools for this task. Can you do this in O(#Edges) time or even better?
It will probably be most useful to store the graph as an edge list.
2. Construction of the combined multigraph
Keep the species information as a label annotating each edge.
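A labelled edge list is one simple way to hold the combined graph. The sketch below assumes the per-species graphs are available as a dictionary from species name to a list of (A, B) pairs; the names combine and multiplicity are hypothetical, and the edges are toy data.

```python
from collections import Counter


def combine(edges_by_species):
    """Merge per-species edge lists into one list of (a, b, species)."""
    return [(a, b, species)
            for species, edges in sorted(edges_by_species.items())
            for a, b in edges]


def multiplicity(multigraph):
    """Count how many species support each arc (a, b)."""
    return Counter((a, b) for a, b, _species in multigraph)


# Toy per-species graphs (not taken from the real alignment).
edges_by_species = {
    "dm3": [(0, 1), (1, 2)],
    "droSim1": [(0, 1), (0, 2)],
}

multigraph = combine(edges_by_species)
support = multiplicity(multigraph)
for a, b, species in multigraph:
    print(a, b, species)
```

Written one triple per line, this is also a format that UNIX tools such as sort and uniq can process directly.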
3. Extract information from the individual graphs and the combined graph
Question 1.
What do we expect if each piece of sequence of a given species appears
exactly once in the alignment? What do you observe?
Count the connected components.
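Counting components can be done with a union-find structure, as in the following sketch. It assumes that "connected" is meant in the weak sense (edge direction is ignored) and that blocks are numbered 0 to num_blocks - 1, so that isolated blocks count as components of their own. The function name count_components is hypothetical.

```python
def count_components(num_blocks, edges):
    """Count weakly connected components of a graph on blocks 0..num_blocks-1."""
    parent = list(range(num_blocks))

    def find(x):
        # follow parent pointers to the root, halving the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_blocks
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b   # merge the two components
            components -= 1
    return components


# Toy example: blocks 0-1-2 form one component, 3-4 another.
print(count_components(5, [(0, 1), (1, 2), (3, 4)]))  # prints 2
```

The same function works for a single species graph and for the multigraph, since parallel arcs do not change connectivity.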
Question 2. What happens if we switch from the strand in the data to its reverse complement?
Question 3. What is the maximal vertex degree of this graph? Plot the distribution of in- and out-degrees of the multigraph.
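The degree counts needed for such a plot can be collected in a single pass over the edge list. The sketch below assumes the multigraph is stored as (A, B, species) triples and that blocks are numbered 0 to num_blocks - 1; the function name degree_distributions is hypothetical, and the resulting histograms could be handed to any plotting tool.

```python
from collections import Counter


def degree_distributions(num_blocks, multigraph):
    """Return (in_hist, out_hist): degree value -> number of vertices."""
    out_deg = Counter()
    in_deg = Counter()
    for a, b, _species in multigraph:
        out_deg[a] += 1
        in_deg[b] += 1
    # vertices without any arc are counted with degree 0
    in_hist = Counter(in_deg[v] for v in range(num_blocks))
    out_hist = Counter(out_deg[v] for v in range(num_blocks))
    return in_hist, out_hist


# Toy multigraph on three blocks.
multigraph = [(0, 1, "dm3"), (1, 2, "dm3"), (0, 2, "droSim1")]
in_hist, out_hist = degree_distributions(3, multigraph)

for degree in sorted(out_hist):
    print("out-degree", degree, "vertices", out_hist[degree])
for degree in sorted(in_hist):
    print("in-degree", degree, "vertices", in_hist[degree])
```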
Question 4. Can we change the coordinatization (reading direction) of individual sequence fragments to match with the reference genome as much as possible? How does this affect our multigraph?
Question 5. Suppose a region of DNA was placed (a) in reverse direction at the same location, (b) on a different chromosome, (c) on the same chromosome in a different location relative to the reference sequence. How do we see this in our multigraph?
Question 6. What does a deletion of a DNA region look like? What about insertions or local duplications?
Question 7. How does the phylogenetic position of a rearrangement influence the local pattern in the multigraph?
Question 8. How can we estimate the phylogenetic age of a rearrangement?