Bioinformatics Leipzig

Graphen und biologische Netze (Praktikum, Modul 10-202-2205)

Das Praktikum findet wie folgt statt:

Raum:	109, Härtelstr. 16-18
Zeit:	ganztägig, Kernzeit: 10-15Uhr
Datum:	26.01.2015 - 06.02.2014 Endbesprechung: (wird während des Praktikums vereinbar)

Alignment graph

The Problem

Genome-wide alignments contain a wealth of information on genome evolution. In particular, they contain information on synteny, i.e., conservation of sequence order, and thus, also on rearrangements. The latter appear as breakpoints between syntenic regions. In order to get a global overview, we represent the genome-wide alignment data as a graph and analyse its properties.

The Data

Genome wide alignments are organized in blocks. Each block consists of homologous sequences of a subset of species. Within each alignment block, entries look like this a score=903779.000000 s dm3.chr2L 837206 169 + 23011544 CATCATCAGTTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC s droSim1.chr2L 835292 169 + 22036055 TATCATCAGATCGCG------TTGCTCCTGC---------------TGGTGCACAGCCCGT i droSim1.chr2L C 0 C 0 s droSec1.super_14 807477 169 + 2068291 TATCATCAGATCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC i droSec1.super_14 C 0 C 0 s droYak2.chr2L 820953 169 + 22324452 CATCATCAGCTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC ....

The lines starting a start a block

The lines starting s denote a sequence interval:

dm3.chr2L denotes the species and genome assembly
dm3 = Drosophila melanogaster assembly version 3
chr2L = chromosome, super-contig or contig (in general "sequence fragment")

837206 169 + 23011544 are start position, length, reading direction, and length of the sequence fragment.

the rest of the entry is actual sequence

We will not need other entries

MAF file format

We will use alignments from several different species.
Do not download them. We provide local copies at /scratch/will/Data on k53.bioinf.uni-leipzig.de.

The Graph(s)

In our graphs, alignment blocks will be vertices.
Directed edges are defined independently for each species.
We draw a directed edge for a given species S from block A to block B if
(i) the sequences of S in blocks A and B are located on the same sequence fragment
(ii) the sequence of B follows the sequence of A accounting for the reading direction,
(iii) there is no block C whose sequence of S is entirely between that of A and B.

Since the graphs for each species S are defined on the same set of vertices, we can superimpose the edge set. We obtain a multigraph, since there is an arc for each species.

The work program

0a. Definition of data structures

We require one or more intermediate, lightweight data structures. We will agree one a single, common format for everybody to use. The format shall be discussed.
Apart from the IDs mentioned below, what else is needed?
Choosing a common format has multiple advantages. (i) We can easily verify that everybody is doing the right thing. (ii) We only have to store one set of intermediate data structures.
Of course, everybody still needs to develop their version of the tools discussed below. (Just saying...)
The format definition will end up here

0a. Definition of data structures

We have to deal with a big files. All your pipelines should start with one of (gunzip, zcat, ...) in pipeline mode. Similarly, your pipelines should end with (gzip).
Exceptions occur when you want to output summary results.
In particular, don't extract the genomes; and don't keep the lightweight format uncompressed on disk.

0b. Programming languages

You are, in principle, free to choose a language of your choice. We only care that (i) you did the work, (ii) the results are ok.

1. Construction of a graph for each species

Extract the necessary information from the maf file. (Don't forget to define IDs for the blocks!)

HINT: use UNIX command line tools for this task. Can you do this in O(#Edges) time or even better?
It will probably be most useful to store the graph as edge list or adjacency list.

2. Construction of a graph for each Multiple Sequence Alignment

Keep the species information as label annotating each edge.

3. Extract information from the individual graphs and the combined graph

We will want to select some simple tasks for everyone to start with!

Extract graph statistics:
Distribution of in-degree and out-degree; graph cycles

Differences between the 4-way, 8-way, x-way alignments:
How does the number of genomes in the graph influence the graph statistics

graph artifacts:
are sequences multi-mapped?;

cleanup:
which artifacts can we clean up, thereby generating a more useful multiple alignment?

re-arrangements:
how to detect them?

Multiple in/out paths:
Given a node with multiple in- and out-edges, how to choose the correct path?

event detection:
strand breaks; inversions; can you detect other patterns? Some events are natural events; other alignment artifacts.

Question 1. What do we expect if each piece of sequence of a given species appears exactly once in the alignment? What do you observe?
Count the connected components.

Question 2. What happens if we switch from the strand in the data to its reverse complement?

Question 3. What is the maximal vertex degree of this graph? Plot the distribution of in and out degrees of the multigraph.

Question 4. Can we change the coordinatization (reading direction) of individual sequence fragments to match with the reference genome as much as possible? How does this affect our multigraph?

Question 5. Suppose a region of DNA was placed (a) in reverse direction at the same location, (b) on different chromosome, (c) on the same chromosome in a different location relative to the reference sequence. How do we see this in our multigraph?

Question 6. How does a deletion of a DNA region look like? What about insertions, local duplications?

Question 7. How does the phylogenetic position of a rearrangment influence the local pattern in the multigraph?

Question 8. How can we estimate the phylogenetic age of a rearrangment?

4. Comparison of Different Alignments for the same set of species

Question 1. Different alignment on the same species sets have different vertices. How can we neverthless compare them?

Question 2. Determine trends of graph characteristics as a function of the number of the species. What do you expect?

File handling guidelines

Files are big, use compression
Each step of a pipeline, should start and end with zcat, gzip, gunzip. Say, you want to transform the MAF file into the leightweight format: zcat alignment.maf.gz | myPipeline | gzip > myleightweight.gz .