Bioinformatics Leipzig

Practical course in summer semester 2011 (SS11): Modules 10-202-2205 and 10-202-2208

News

The introductory session for the practical course (modules 10-202-2205 and 10-202-2208) takes place on 17 June 2011 at 3 pm in room 109.

Please register [HERE] to participate in the practical course.

Alignment graph

The Problem

Genome-wide alignments contain a wealth of information on genome evolution. In particular, they contain information on synteny, i.e., conservation of sequence order, and thus, also on rearrangements. The latter appear as breakpoints between syntenic regions. In order to get a global overview, we represent the genome-wide alignment data as a graph and analyse its properties.

The Data

Genome-wide alignments are organized in blocks. Each block consists of homologous sequences of a subset of species. Within each alignment block, entries look like this:

    a score=903779.000000
    s dm3.chr2L 837206 169 + 23011544 CATCATCAGTTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
    s droSim1.chr2L 835292 169 + 22036055 TATCATCAGATCGCG------TTGCTCCTGC---------------TGGTGCACAGCCCGT
    i droSim1.chr2L C 0 C 0
    s droSec1.super_14 807477 169 + 2068291 TATCATCAGATCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
    i droSec1.super_14 C 0 C 0
    s droYak2.chr2L 820953 169 + 22324452 CATCATCAGCTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
    ....

Lines starting with "a" start a new block.

Lines starting with "s" denote a sequence interval:

dm3.chr2L denotes the species and genome assembly
dm3 = Drosophila melanogaster assembly version 3
chr2L = chromosome, super-contig or contig (in general "sequence fragment")

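837206 169 + 23011544 are the start position, the length of the aligned interval, the reading direction (strand), and the total length of the sequence fragment. In the MAF format the start position is zero-based and the interval length does not count gap characters; for entries on the "-" strand, the start position is given relative to the reverse-complemented fragment.

The rest of the entry is the aligned sequence itself, including gap characters ("-").

We will not need the other entry types (for example, the lines starting with "i").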
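The field layout described above can be turned into a small parser. The following is a minimal sketch, not a reference implementation: the names `Interval` and `parse_maf` are hypothetical, block IDs are simply assigned as running numbers, and the species name is assumed to be everything before the first "." of the source field.

```python
from typing import Iterable, Iterator, NamedTuple


class Interval(NamedTuple):
    """One "s" line of a MAF block (field names are illustrative)."""
    block_id: int   # running number of the enclosing "a" block (our own ID)
    species: str    # e.g. "dm3"
    fragment: str   # e.g. "chr2L" or "super_14"
    start: int      # start position as written in the file
    size: int       # length of the aligned interval
    strand: str     # "+" or "-"
    src_size: int   # total length of the sequence fragment


def parse_maf(lines: Iterable[str]) -> Iterator[Interval]:
    """Yield one Interval per "s" line; all other line types are skipped."""
    block_id = -1
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            block_id += 1  # a new alignment block starts here
        elif fields[0] == "s":
            # Assumption: the species is the part before the first ".".
            species, _, fragment = fields[1].partition(".")
            yield Interval(block_id, species, fragment, int(fields[2]),
                           int(fields[3]), fields[4], int(fields[5]))


# The first lines of the example block shown above.
example = [
    "a score=903779.000000",
    "s dm3.chr2L 837206 169 + 23011544 CATCATCAGTTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC",
    "s droSim1.chr2L 835292 169 + 22036055 TATCATCAGATCGCG------TTGCTCCTGC---------------TGGTGCACAGCCCGT",
    "i droSim1.chr2L C 0 C 0",
    "s droSec1.super_14 807477 169 + 2068291 TATCATCAGATCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC",
]
records = list(parse_maf(example))
for record in records:
    print(record)
```

For a real file, the same function can be fed an open file handle instead of a list of strings, so that the file is read line by line.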

The Graph(s)

In our graphs, alignment blocks will be vertices.
Directed edges are defined independently for each species.
We draw a directed edge for a given species S from block A to block B if
(i) the sequences of S in blocks A and B are located on the same sequence fragment,
(ii) the sequence of B follows the sequence of A accounting for the reading direction, and
(iii) there is no block C whose sequence of S is entirely between that of A and B.
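One possible reading of conditions (i) to (iii) can be sketched as follows. This is an illustration under explicit assumptions, not the intended solution: the function name `species_edges` and the input tuple layout are hypothetical, intervals of one species are assumed not to overlap, "-" strand coordinates are assumed to be given relative to the reverse-complemented fragment (as in the MAF format) and are converted to forward-strand coordinates, and edges are always drawn in forward-strand order.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (block_id, fragment, start, size, strand, src_size) for ONE species
Record = Tuple[int, str, int, int, str, int]


def species_edges(intervals: Iterable[Record]) -> List[Tuple[int, int]]:
    """Return directed edges (A, B) between neighbouring blocks of one species."""
    by_fragment = defaultdict(list)
    for block_id, fragment, start, size, strand, src_size in intervals:
        if strand == "-":
            # Assumption: convert to forward-strand coordinates.
            start = src_size - start - size
        # Condition (i): only blocks on the same fragment are compared.
        by_fragment[fragment].append((start, start + size, block_id))

    edges = []
    for items in by_fragment.values():
        # Condition (ii): order the intervals along the fragment.
        items.sort()
        # Condition (iii): for non-overlapping intervals, consecutive entries
        # in sorted order have no other block entirely between them.
        for (_, _, a), (_, _, b) in zip(items, items[1:]):
            edges.append((a, b))
    return edges


# Made-up coordinates for illustration only.
demo = [
    (0, "chr2L", 100, 50, "+", 1000),
    (1, "chr2L", 300, 20, "+", 1000),
    (2, "chr2L", 780, 20, "-", 1000),  # forward start: 1000 - 780 - 20 = 200
    (3, "chrX", 10, 5, "+", 500),      # alone on its fragment: no edge
]
demo_edges = species_edges(demo)
print(demo_edges)  # [(0, 2), (2, 1)]
```

Sorting costs O(n log n) per fragment; whether the sort can be avoided depends on the order in which the blocks appear in the input.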

Since the graphs for all species S are defined on the same set of vertices, we can superimpose their edge sets. We obtain a multigraph, since there is a separate arc for each species.

The work program

1. Construction of a graph for each species

Extract the necessary information from the MAF file. (Don't forget to define IDs for the blocks!)

HINT: use UNIX command line tools for this task. Can you do this in O(#Edges) time or even better?
It will probably be most useful to store the graph as an edge list.

2. Construction of the combined multigraph

Keep the species information as label annotating each edge.
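A labelled edge list is one simple way to keep the species annotation. The sketch below assumes that the per-species edge lists are already available as a dictionary; the function name `combine` and the toy edges are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[int, int]
LabelledEdge = Tuple[int, int, str]


def combine(edges_by_species: Dict[str, List[Edge]]) -> List[LabelledEdge]:
    """Superimpose per-species edge lists into one species-labelled edge list."""
    return [(a, b, species)
            for species, edges in sorted(edges_by_species.items())
            for a, b in edges]


# Toy input for illustration only.
per_species = {
    "dm3": [(0, 1), (1, 2)],
    "droSim1": [(0, 1), (2, 1)],
}
multigraph = combine(per_species)

# Number of parallel arcs between each ordered pair of blocks.
multiplicity = Counter((a, b) for a, b, _ in multigraph)
print(multigraph)
print(multiplicity[(0, 1)])  # 2
```

Keeping one entry per species and arc means that parallel arcs are preserved, so the result is a multigraph rather than a simple graph.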

3. Extract information from the individual graphs and the combined graph

Question 1. What do we expect if each piece of sequence of a given species appears exactly once in the alignment? What do you observe?
Count the connected components.
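Connected components can be counted with a union-find structure in nearly linear time. The sketch below assumes that blocks are numbered 0 to n-1 and that edge direction is ignored (weakly connected components); the function name `count_components` is hypothetical.

```python
from typing import Iterable, Tuple


def count_components(num_vertices: int, edges: Iterable[Tuple[int, int]]) -> int:
    """Count connected components, ignoring edge direction (union-find)."""
    parent = list(range(num_vertices))

    def find(x: int) -> int:
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_vertices
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components
            components -= 1
    return components


# Toy graph: a path 0-1-2, a pair 3-4, and the isolated vertex 5.
n_components = count_components(6, [(0, 1), (1, 2), (3, 4)])
print(n_components)  # 3
```

Blocks in which the species does not occur at all show up as isolated vertices, so it matters whether they are included in `num_vertices`.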

Question 2. What happens if we switch from the strand in the data to its reverse complement?

Question 3. What is the maximal vertex degree of this graph? Plot the distribution of in- and out-degrees of the multigraph.
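The degree statistics can be collected directly from a labelled edge list. This sketch assumes edges stored as (A, B, species) triples and takes the vertex degree to be in-degree plus out-degree; the function name and the toy edges are hypothetical. Vertices without any edge do not appear in the counts.

```python
from collections import Counter
from typing import Iterable, Tuple


def degree_statistics(edges: Iterable[Tuple[int, int, str]]):
    """Return (in-degree histogram, out-degree histogram, maximal degree)."""
    in_degree: Counter = Counter()
    out_degree: Counter = Counter()
    for a, b, _species in edges:
        out_degree[a] += 1
        in_degree[b] += 1
    total = in_degree + out_degree        # degree = in-degree + out-degree
    in_hist = Counter(in_degree.values())    # degree value -> number of vertices
    out_hist = Counter(out_degree.values())
    max_degree = max(total.values(), default=0)
    return in_hist, out_hist, max_degree


# Toy multigraph for illustration only.
toy = [(0, 1, "dm3"), (1, 2, "dm3"), (0, 1, "droSim1"), (2, 1, "droSim1")]
in_hist, out_hist, max_degree = degree_statistics(toy)
print(dict(in_hist), dict(out_hist), max_degree)
```

The two histograms map each degree value to the number of vertices with that degree and can be passed to any plotting tool.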

Question 4. Can we change the coordinatization (reading direction) of individual sequence fragments to match with the reference genome as much as possible? How does this affect our multigraph?

Question 5. Suppose a region of DNA was placed (a) in reverse direction at the same location, (b) on a different chromosome, or (c) on the same chromosome in a different location relative to the reference sequence. How do we see this in our multigraph?

Question 6. What does a deletion of a DNA region look like? What about insertions and local duplications?

Question 7. How does the phylogenetic position of a rearrangement influence the local pattern in the multigraph?

Question 8. How can we estimate the phylogenetic age of a rearrangement?