Next: Issues Up: Methods and Realization Previous: Comparison of SR protein   Contents

Homology

After all domains had been identified for each SR protein, the Superfam database provided by the NCBI (accessed 01/18/2009) was searched for the SCOP ID of each annotated domain as follows:
1. Download the NCBI Superfam database
2. Extract all species which possess an annotated domain that is known to be part of an SR protein
3. Build a species-based Newick tree, annotated with domain counts, for each domain
4. Build a species-based Newick tree, annotated with domain counts, for all domain combinations
The algorithmic procedure can be summarized like this:

Algorithm (example for the RRM domain):

  1. Extract RRM with superfamily SCOP-ID: 54929
    awk 'BEGIN { FS = "[\t]" } ; { if($9 == "54929") { print }};' 964.ass.tab > 54929.tab
    

  2. For each species, count the distinct entries (unique combinations of columns 1 and 2) that carry the RRM, and write species ID and count to domaincount_species_rrm.tab
    awk '{print $1"\t"$2}' 54929.tab | sort -u | awk '{print $1}' | uniq -c | awk '{print $2"\t"$1}' > domaincount_species_rrm.tab
    

  3. Join ncbi_names_id.tab and domaincount_species_rrm.tab on <short code> to get a file with tab-separated values of the form <NCBI id> <taxon name> <count of domains in taxon>
    map_domain_counts.sh ncbi_names_id.tab domaincount_species_rrm.tab > domaincount_species_and_names_rrm.tab
    

  4. Extract the taxon name and domain count from the joined file (i.e. strip the NCBI ID)
    awk 'BEGIN { FS = "[\t]" } ; {print $2"\t"$3}' domaincount_species_and_names_rrm.tab > genecount.tab
    

  5. Build the Newick tree with Axel Wintsche's tax2 and addStats scripts
    awk 'BEGIN { FS = "[\t]" } ; {print $1}' domaincount_species_and_names_rrm.tab | sort -u > ncbi_id_all.tab
    tax2 -f ncbi_id_all.tab -n > basic_tree.tree
    addStats.pl genecount.tab basic_tree.tree > annotated_tree.tree
    

  6. Plot the annotated Newick tree with NJplot
    njplot.linux annotated_tree.tree
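
To make the data flow of steps 1 to 4 easier to follow, the sketch below runs the same kind of pipeline on a tiny invented table. It is an illustration only. The file toy.ass.tab, its column layout (species code in column 1, protein ID in column 2, SCOP superfamily ID in column 9), the short codes and the layout of names.tab are assumptions, and the join command merely stands in for map_domain_counts.sh, whose contents are not shown here. Steps 5 and 6 are left out because they depend on the tax2, addStats and NJplot programs.

```shell
#!/bin/sh
# Toy walk-through of steps 1-4. All input data is invented for illustration.
set -eu
export LC_ALL=C            # byte-wise ordering, so sort, uniq and join agree
work=$(mktemp -d)
cd "$work"

# Assumed layout: col 1 = species code, col 2 = protein id,
# col 9 = SCOP superfamily id; the remaining columns are filler.
printf 'hs\tp1\t-\t-\t-\t-\t-\t-\t54929\n'  > toy.ass.tab
printf 'hs\tp1\t-\t-\t-\t-\t-\t-\t54929\n' >> toy.ass.tab  # second RRM hit, same protein
printf 'hs\tp2\t-\t-\t-\t-\t-\t-\t54929\n' >> toy.ass.tab
printf 'at\tp3\t-\t-\t-\t-\t-\t-\t54929\n' >> toy.ass.tab
printf 'at\tp4\t-\t-\t-\t-\t-\t-\t12345\n' >> toy.ass.tab  # other superfamily

# Step 1: keep only rows whose 9th column is the RRM superfamily.
awk -F'\t' '$9 == "54929"' toy.ass.tab > 54929.tab

# Step 2: collapse to unique (species, protein) pairs, then count per species.
cut -f1,2 54929.tab | sort -u | cut -f1 | uniq -c \
    | awk '{print $2"\t"$1}' > counts.tab

# Step 3 (stand-in for map_domain_counts.sh): join on the short code.
# Assumed names.tab layout: <short code> <NCBI id> <taxon name>, sorted on column 1.
printf 'at\t3702\tArabidopsis thaliana\n'  > names.tab
printf 'hs\t9606\tHomo sapiens\n'         >> names.tab
tab=$(printf '\t')
join -t "$tab" -o 1.2,1.3,2.2 names.tab counts.tab > counts_named.tab

# Step 4: drop the NCBI id, keeping taxon name and count.
cut -f2,3 counts_named.tab > genecount.tab
cat genecount.tab
```

With this toy input, genecount.tab lists Arabidopsis thaliana with a count of 1 and Homo sapiens with a count of 2. The duplicated hs/p1 row is counted once because step 2 removes duplicate pairs before counting.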
    

Files: assets/files.tar.gz


Rene Ploetz 2009-06-05