The practical course takes place as follows:
Room: | 109, Härtelstr. 16-18 |
Time: | all day, core hours: 10:00-15:00 |
Date: | 26.01.2015 - 06.02.2015 Final discussion: (to be arranged during the course) |
Alignment graph
The Problem
Genome-wide alignments contain a wealth of information on genome evolution. In particular, they contain information on synteny, i.e., conservation of sequence order, and thus also on rearrangements. The latter appear as breakpoints between syntenic regions. In order to get a global overview, we represent the genome-wide alignment data as a graph and analyse its properties.
The Data
Genome-wide alignments are organized in blocks. Each block consists of homologous sequences of a subset of species. Within each alignment block, entries look like this:
a score=903779.000000
s dm3.chr2L 837206 169 + 23011544 CATCATCAGTTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
s droSim1.chr2L 835292 169 + 22036055 TATCATCAGATCGCG------TTGCTCCTGC---------------TGGTGCACAGCCCGT
i droSim1.chr2L C 0 C 0
s droSec1.super_14 807477 169 + 2068291 TATCATCAGATCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
i droSec1.super_14 C 0 C 0
s droYak2.chr2L 820953 169 + 22324452 CATCATCAGCTCGCG------CTGCTCCTGC---------------TGGTGCACAGCCCGC
....
dm3 = Drosophila melanogaster assembly version 3
chr2L = chromosome, super-contig or contig (in general "sequence fragment")
Do not download the alignments yourself. We provide local copies at /scratch/will/Data on k53.bioinf.uni-leipzig.de.
The Graph(s)
In our graphs, alignment blocks will be vertices. Directed edges are defined independently for each species.
We draw a directed edge for a given species S from block A to block B if
(i) the sequences of S in blocks A and B are located on the same sequence fragment,
(ii) the sequence of B follows the sequence of A, accounting for the reading direction, and
(iii) there is no block C whose sequence of S lies entirely between those of A and B.
Since the graphs for each species S are defined on the same set of vertices, we can superimpose their edge sets. We obtain a multigraph, since each species contributes its own arcs and two blocks can therefore be joined by several parallel arcs.
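The edge definition above can be sketched for a single species as follows. This is only a minimal illustration under stated assumptions, not the agreed format or the reference solution: the record layout (block_id, fragment, start, size, strand, src_size) and the function names are hypothetical, the intervals of one species are assumed not to overlap, and all coordinates are converted to the forward strand before sorting. The conversion uses the MAF convention that the start of a "-" entry is counted on the reverse-complemented fragment.

```python
from collections import defaultdict


def forward_start(start, size, strand, src_size):
    """Return the start position in forward-strand ('+') coordinates."""
    if strand == "+":
        return start
    # For '-' entries, MAF counts the start from the end of the fragment.
    return src_size - start - size


def species_edges(records):
    """Build the directed edges for ONE species.

    records: iterable of (block_id, fragment, start, size, strand, src_size).
    This tuple layout is an assumption made for this sketch.
    """
    by_fragment = defaultdict(list)
    for block_id, fragment, start, size, strand, src_size in records:
        position = forward_start(start, size, strand, src_size)
        by_fragment[fragment].append((position, block_id))

    edges = []
    for fragment in sorted(by_fragment):
        ordered = sorted(by_fragment[fragment])
        # Neighbours in sorted order have no other block between them,
        # provided the intervals do not overlap (assumption).
        for (_, a), (_, b) in zip(ordered, ordered[1:]):
            edges.append((a, b))
    return edges


records = [
    (0, "chr2L", 100, 50, "+", 1000),
    (1, "chr2L", 300, 20, "+", 1000),
    (2, "chr2L", 780, 20, "-", 1000),  # forward start: 1000 - 780 - 20 = 200
    (3, "chrX", 5, 10, "+", 500),      # alone on its fragment: no edge
]
print(species_edges(records))
```

Sorting dominates the running time here; if the blocks are already ordered along the reference, a single linear pass may be enough for that species.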
The work program
0a. Definition of data structures
We require one or more intermediate, lightweight data structures. We will agree on a single, common format for everybody to use. The format shall be discussed.
Apart from the IDs mentioned below, what else is needed?
Choosing a common format has multiple advantages. (i) We can easily verify that everybody is doing the right thing. (ii) We only have to store one set of intermediate data structures.
Of course, everybody still needs to develop their version of the tools discussed below. (Just saying...)
The format definition will end up here
0a. Definition of data structures
We have to deal with big files. Every pipeline should start with a decompressing command (gunzip, zcat, ...) in pipeline mode and, similarly, end with gzip.
Exceptions occur when you want to output summary results.
In particular, don't extract the genomes, and don't keep the lightweight format uncompressed on disk.
0b. Programming languages
You are, in principle, free to use any language you like. We only care that (i) you did the work and (ii) the results are OK.
1. Construction of a graph for each species
Extract the necessary information from the MAF file. (Don't forget to define IDs for the blocks!)
HINT: use UNIX command line tools for this task. Can you do this in O(#Edges) time or even better?
It will probably be most useful to store the graph as an edge list or an adjacency list.
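As one possible starting point for the extraction step, the sketch below reads MAF lines in a single pass and emits one record per "s" line. Several things here are assumptions rather than agreed conventions: the function name maf_records, the record layout, the use of a running counter as block ID, and the rule that source names split at the first dot into species and sequence fragment (as in dm3.chr2L). The sequences in the example are shortened for readability.

```python
def maf_records(lines):
    """Yield (block_id, species, fragment, start, size, strand, src_size)
    for every 's' line of a MAF stream."""
    block_id = -1  # running counter as block ID (assumption of this sketch)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line between blocks
        if fields[0] == "a":
            block_id += 1  # a new alignment block starts
        elif fields[0] == "s":
            # Assumes names of the form <species>.<fragment>
            species, fragment = fields[1].split(".", 1)
            yield (block_id, species, fragment,
                   int(fields[2]), int(fields[3]), fields[4], int(fields[5]))
        # all other line types ('i', comments, ...) are skipped


example = """a score=903779.000000
s dm3.chr2L 837206 169 + 23011544 CATCATCAG
s droSim1.chr2L 835292 169 + 22036055 TATCATCAG
i droSim1.chr2L C 0 C 0
s droSec1.super_14 807477 169 + 2068291 TATCATCAG
""".splitlines()

for record in maf_records(example):
    print(*record, sep="\t")
```

Because the function only needs an iterable of lines, it can also be fed from standard input at the end of a zcat pipe, which keeps the memory footprint independent of the file size.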
2. Construction of a graph for each Multiple Sequence Alignment
Keep the species information as a label annotating each edge.
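A minimal sketch of the combination step, assuming the per-species graphs are available as edge lists keyed by species name. The function name superimpose and the dictionary layout are hypothetical; the species names are taken from the MAF excerpt, the block IDs are made up.

```python
from collections import defaultdict


def superimpose(edges_by_species):
    """Combine per-species edge lists into one labelled multigraph.

    Returns a dict mapping each directed block pair (a, b) to the list of
    species that contribute an arc from a to b.
    """
    labels = defaultdict(list)
    for species in sorted(edges_by_species):
        for a, b in edges_by_species[species]:
            labels[(a, b)].append(species)
    return dict(labels)


edges_by_species = {
    "dm3": [(0, 1), (1, 2)],
    "droSim1": [(0, 1), (2, 1)],
}
multigraph = superimpose(edges_by_species)
for (a, b), species in sorted(multigraph.items()):
    print(a, b, ",".join(species))
```

Storing the labels per block pair keeps parallel arcs together, so arcs that are supported by all species are easy to tell apart from those supported by only a few.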
3. Extract information from the individual graphs and the combined graph
We will want to select some simple tasks for everyone to start with!
Extract graph statistics:
Distribution of in-degree and out-degree; graph cycles
Differences between the 4-way, 8-way, x-way alignments:
How does the number of genomes in the graph influence the graph statistics?
Graph artifacts:
Are sequences multi-mapped?
Cleanup:
Which artifacts can we clean up, thereby generating a more useful multiple alignment?
Rearrangements:
How can we detect them?
Multiple in/out paths:
Given a node with multiple in- and out-edges, how do we choose the correct path?
Event detection:
Strand breaks; inversions; can you detect other patterns? Some events are natural; others are alignment artifacts.
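For the degree statistics in the task list above, a sketch along the following lines may serve as a first step. It assumes the multigraph is given as a list of (from_block, to_block, species) triples plus the full vertex set, so that blocks without any arc are counted with degree 0; both the layout and the function name are hypothetical.

```python
from collections import Counter


def degree_distributions(vertices, edges):
    """Return (in_distribution, out_distribution).

    Each distribution maps a degree to the number of vertices having it.
    Parallel arcs of different species are counted separately.
    """
    in_deg = Counter({v: 0 for v in vertices})
    out_deg = Counter({v: 0 for v in vertices})
    for a, b, _species in edges:
        out_deg[a] += 1
        in_deg[b] += 1
    return Counter(in_deg.values()), Counter(out_deg.values())


vertices = {0, 1, 2}
edges = [(0, 1, "dm3"), (1, 2, "dm3"), (0, 1, "droSim1"), (2, 1, "droSim1")]
in_dist, out_dist = degree_distributions(vertices, edges)
print("in-degree distribution: ", sorted(in_dist.items()))
print("out-degree distribution:", sorted(out_dist.items()))
```

The two counters can be passed directly to a plotting tool of your choice; the largest key in either counter is the maximal in- or out-degree.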
Question 1. What do we expect if each piece of sequence of a given species appears exactly once in the alignment? What do you observe? Count the connected components.
Question 2. What happens if we switch from the strand in the data to its reverse complement?
Question 3. What is the maximal vertex degree of this graph? Plot the distribution of in and out degrees of the multigraph.
Question 4. Can we change the coordinatization (reading direction) of individual sequence fragments to match with the reference genome as much as possible? How does this affect our multigraph?
Question 5. Suppose a region of DNA was placed (a) in reverse direction at the same location, (b) on a different chromosome, (c) on the same chromosome in a different location relative to the reference sequence. How do we see this in our multigraph?
Question 6. What does a deletion of a DNA region look like? What about insertions and local duplications?
Question 7. How does the phylogenetic position of a rearrangement influence the local pattern in the multigraph?
Question 8. How can we estimate the phylogenetic age of a rearrangement?
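For counting the connected components asked for in Question 1 of this section, a union-find structure is one simple option. The sketch below ignores edge directions, i.e. it counts weakly connected components; whether that is the intended notion is an assumption, as are the function name and the input layout.

```python
def count_components(vertices, edges):
    """Count connected components, treating every edge as undirected."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    return len({find(v) for v in vertices})


# Hypothetical block IDs: two chains and one isolated block.
print(count_components(range(6), [(0, 1), (1, 2), (3, 4)]))
```

The function can be applied to the edge list of a single species as well as to the superimposed multigraph, since parallel arcs do not change connectivity.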
4. Comparison of Different Alignments for the same set of species
Question 1. Different alignments of the same species set have different vertices. How can we nevertheless compare them?
Question 2. Determine trends of graph characteristics as a function of the number of species. What do you expect?
File handling guidelines
Files are big, use compression
Each pipeline should start with zcat or gunzip and end with gzip. Say you want to transform the MAF file into the lightweight format:
zcat alignment.maf.gz | myPipeline | gzip > mylightweight.gz
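If a step is written in Python, the same streaming idea can also be expressed inside the script with the standard gzip module, so that nothing is stored uncompressed on disk. This is only a sketch of an alternative to the shell pipe: the function names filter_gz and keep_s_lines are hypothetical, and the per-line transformation is a placeholder, since the lightweight format is not fixed yet.

```python
import gzip


def filter_gz(in_path, out_path, transform):
    """Stream a gzip-compressed text file line by line into a new one.

    transform(line) returns the line to write, or None to drop the line.
    """
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            out = transform(line)
            if out is not None:
                dst.write(out)


def keep_s_lines(line):
    """Placeholder transformation: keep only the 's' lines of a MAF file."""
    return line if line.startswith("s ") else None


# Example call (file names are placeholders):
# filter_gz("alignment.maf.gz", "slines.gz", keep_s_lines)
```

Only one line is held in memory at a time, so the approach works for files that are much larger than the available RAM.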