
This site contains supplementary figures and tables from the paper

An overabundance of phase 0 introns immediately after the start codon in eukaryotic genes
Henrik Nielsen and Rasmus Wernersson
Submitted, 2006.


Supplementary figures

Figure S1: intron length distribution for introns up to 250 nt


Figure S2: Intron position statistics for genome data without ribosomal proteins

In order to show that the phenomenon of start codon introns is not limited to ribosomal proteins, we have calculated the distribution of intron positions for the whole-genome data sets (for proteins without signal peptides) with ribosomal proteins removed. This figure should be compared to the right half of Figure 3 in the paper.


Figure S3: Intron position statistics for human and Drosophila proteins with conserved N-terminals

In order to show that the start codon peak is also present in proteins with few indels in the N-terminal part, we selected from the list of RBHs (Reciprocal Best Hits) between human and Drosophila those global alignments where neither the human nor the fly sequence had gaps within the first 20 positions. This yielded 661 pairs, of which 582 human sequences and 599 fly sequences were predicted to be without signal peptides.
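The N-terminal filter described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes each global alignment is available as two equal-length strings with "-" as the gap character, and the function name and example sequences are hypothetical.

```python
def gap_free_n_terminus(aligned_a: str, aligned_b: str, window: int = 20) -> bool:
    """Return True if neither aligned sequence has a gap in the first `window` columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return "-" not in aligned_a[:window] and "-" not in aligned_b[:window]


# Hypothetical aligned pairs (human, fly); only the first has a gap-free N-terminus.
pairs = [
    ("MSTKVLLAGGHHEEQQRRSSTTAAVV", "MSTKILLAGGHHDEQQRRSSTTAAVV"),
    ("MS--KVLLAGGHHEEQQRRSSTTAAVV", "MSTTKVLLAGGHHEEQQRRSSTTAAVV"),
]
kept = [pair for pair in pairs if gap_free_n_terminus(*pair)]
print(len(kept))
```

Counting the window in alignment columns rather than in residues of either sequence is a choice made here; with no gaps allowed in the window, the two coincide.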


Figure S4: All vs. all alignments: Score/Identity Plots and distance trees

To make sure that the start codon and position 5 peaks were not an artefact of homologous proteins (even though the datasets have been homology reduced), we analyzed the data in the following way:

For each dataset, all protein sequences were aligned pairwise against the rest of the set using the ALIGN program [Pearson and Lipman, 1988], and alignment score and percent identity were plotted. Furthermore, we used CLUSTALW [Thompson et al., 1997] to construct a "phylogenetic" tree based on pairwise distances (CLUSTALW's guide tree), and visualized the relationships by plotting the tree with "UNROOTED" [ref: http://pbil.univ-lyon1.fr/software/unrooted.html].
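As a rough illustration of the all-vs-all comparison, the sketch below enumerates every unordered pair in a set and computes percent identity for one already-aligned pair. It does not call ALIGN; the identity definition used here (identical non-gap columns divided by alignment length) is an assumption and may differ from what ALIGN reports, and the identifiers are hypothetical.

```python
from itertools import combinations


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    # Gap-versus-gap columns are not counted as matches.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# Hypothetical protein identifiers; n sequences give n*(n-1)/2 pairs to align.
ids = ["p1", "p2", "p3", "p4"]
all_pairs = list(combinations(ids, 2))
print(len(all_pairs), percent_identity("ACDE", "ACDF"))
```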

As seen in Sections 1 and 2 below, the proteins in both the start codon peak and the position 5 peak are clearly not related. For reference, we did the same analysis on 100 randomly selected proteins from the non-homology-reduced Vertebrate set; as seen in Section 3, a number of these sequences are related by homology.

Section 1: Start codon peaks

Section 2: Arthropoda pos. 5 peak

Section 3: Reference plot of non-homology reduced data


Supplementary tables

Table S1: lengths, nucleotide frequencies and dinucleotide frequencies for start codon introns compared to other first phase 0 introns.

 
                        Vertebrata          Arthropoda          Fungi               Magnoliophyta
                        sci       no sci    sci       no sci    sci       no sci    sci       no sci
Length statistics
Mean                    3059.6    4941.3    631.0     822.4     170.0     106.6     498.6     478.8
P-val                   0.01681             0.4927              0.008802            0.738
Nucleotide statistics
a                       26.20%    26.95%    29.27%    29.28%    29.16%    26.72%    28.17%    27.98%
c                       21.05%    20.75%    19.90%    20.27%    19.93%    21.63%    19.94%    19.36%
g                       23.48%    22.11%    19.84%    19.83%    20.61%    21.30%    19.42%    20.29%
t                       29.18%    30.18%    30.99%    30.63%    30.31%    30.35%    32.46%    32.37%
Dinucleotide statistics
aa                      7.85%     8.27%     11.01%    10.52%    9.05%     7.65%     9.20%     8.91%
ac                      4.59%     4.66%     4.74%     5.09%     5.58%     5.87%     4.84%     4.77%
ag                      7.25%     7.09%     4.91%     5.08%     5.95%     5.82%     5.36%     5.48%
at                      6.50%     6.93%     8.66%     8.62%     8.76%     7.64%     8.83%     8.88%
ca                      6.59%     6.67%     6.50%     6.73%     6.44%     6.45%     5.98%     5.93%
cc                      5.57%     5.52%     4.65%     4.50%     4.04%     4.89%     4.38%     4.25%
cg                      1.69%     1.32%     3.64%     3.69%     3.26%     3.69%     3.14%     3.08%
ct                      7.20%     7.25%     5.14%     5.36%     6.30%     6.80%     6.48%     6.14%
ga                      5.95%     5.85%     4.99%     5.09%     6.34%     5.90%     5.43%     5.63%
gc                      5.13%     4.68%     5.29%     5.19%     4.30%     4.39%     4.11%     4.23%
gg                      6.64%     5.87%     3.91%     4.11%     3.99%     3.88%     4.11%     4.39%
gt                      5.72%     5.69%     5.52%     5.34%     5.51%     6.38%     5.61%     5.86%
ta                      5.79%     6.16%     6.82%     6.97%     7.50%     6.98%     7.62%     7.56%
tc                      5.75%     5.90%     5.25%     5.51%     6.13%     6.68%     6.64%     6.15%
tg                      7.87%     7.82%     7.25%     6.84%     6.94%     7.17%     6.66%     7.17%
tt                      9.76%     10.31%    11.71%    11.34%    9.92%     9.82%     11.60%    11.55%

Table S1: Length and nucleotide statistics for start codon introns ("sci") and other phase 0 first introns ("other", labelled "no sci" in the column headers). Where the difference for a particular dinucleotide is greater than 0.5%, the higher percentage is shown in boldface.
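For readers who want to reproduce this kind of table, the sketch below shows one way to compute dinucleotide frequencies and to apply the 0.5 percentage point rule from the caption. It assumes overlapping dinucleotides are counted and that any pair containing a character other than a, c, g, or t is skipped; neither detail is stated in the source, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def dinucleotide_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of each overlapping dinucleotide in `seq`."""
    seq = seq.lower()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if set(p) <= set("acgt")]  # skip ambiguous bases
    counts = Counter(pairs)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()} if total else {}


def higher_group(sci_pct: float, no_sci_pct: float, threshold: float = 0.5) -> Optional[str]:
    """Name the group with the higher percentage if the gap exceeds `threshold`."""
    if abs(sci_pct - no_sci_pct) <= threshold:
        return None
    return "sci" if sci_pct > no_sci_pct else "no sci"


print(dinucleotide_frequencies("aacgtt"))
# Vertebrata values taken from Table S1: gg (6.64 vs 5.87) and cg (1.69 vs 1.32).
print(higher_group(6.64, 5.87), higher_group(1.69, 1.32))
```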

When examining the nucleotide distribution, significant differences are found for vertebrates, fungi, and plants (p < 10⁻⁴, χ²-test, df=3), but not for arthropods. When using nucleotide pair frequencies, however, significant differences are found for all four groups (p < 10⁻³, χ²-test, df=15). The nucleotide pair frequencies are shown in Table S1 (above). There do not seem to be any dinucleotide preferences that are shared by all four organism groups.
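The degrees of freedom quoted above are consistent with a χ² test on a 2 × k contingency table (two intron classes, k = 4 nucleotides or k = 16 dinucleotides), since df = (2 − 1)(k − 1). The sketch below computes the statistic under that assumption. The exact test set-up is not described in the source, and the counts are hypothetical placeholders, because only percentages are given in Table S1.

```python
from typing import List, Sequence, Tuple


def chi_square(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals: List[float] = [sum(row) for row in table]
    col_totals: List[float] = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical a/c/g/t counts for "sci" and "no sci" introns (not from the paper).
hypothetical_counts = [
    [262, 210, 235, 293],
    [270, 208, 221, 301],
]
stat, df = chi_square(hypothetical_counts)
print(round(stat, 3), df)
```

The statistic would then be compared with the χ² distribution for the given df to obtain a p-value, for example with a statistics library.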


Table S2: Protein names or GO categories (for Arthropoda) for all the start codon intron proteins

Lists of gene identifiers and annotation of their protein product, for proteins found to carry start codon introns, can be downloaded here as four plain text files (Arthropoda.reduc.scp.names.txt, Fungi.reduc.scp.names.txt, Magnoliophyta.reduc.scp.names.txt, Vertebrata.reduc.scp.names.txt). The protein product annotation originating from the GenBank files is described in the following format:

Name1    Name2      Product (if known) 

AC132479 AAY24076.1 /product="unknown"
AF041427 AAB96967.1 /product="ribosomal protein s4 Y isoform"
AF305057 AAG29537.1 /product="RTS beta"
AL136181 CAH72985.1 /product="transmembrane protein 10"
AL137067 CAC08000.1 /product="Sec61 beta subunit"
AL139289 CAI23381.1 
AL157783 CAI12146.1 /product="cAMP responsive element modulator"
AL354928 CAI39640.1 /product="ribosomal protein L35"
AL355815 CAC19504.1 
AL357314 CAI22392.1 /product="Rab geranylgeranyltransferase, beta subunit"
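Lines in this format could be read with a small parser like the one below. It is a sketch based only on the excerpt above: it assumes whitespace-separated fields and an optional /product="..." annotation, and the record and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class NameRecord(NamedTuple):
    name1: str
    name2: str
    product: Optional[str]  # None when no /product annotation is present


_PRODUCT = re.compile(r'/product="(.*)"')


def parse_name_line(line: str) -> NameRecord:
    """Split one line into Name1, Name2 and the optional product string."""
    fields = line.strip().split(None, 2)
    if len(fields) < 2:
        raise ValueError(f"expected at least two fields: {line!r}")
    product = None
    if len(fields) == 3:
        match = _PRODUCT.search(fields[2])
        if match:
            product = match.group(1)
    return NameRecord(fields[0], fields[1], product)


with_product = parse_name_line('AF041427 AAB96967.1 /product="ribosomal protein s4 Y isoform"')
without_product = parse_name_line("AL139289 CAI23381.1 ")
print(with_product)
print(without_product)
```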

Furthermore, a file, Arthropoda.reduc.scp.go+names.txt, which includes the GO categories, is supplied for the arthropod subset. The format is as follows:

Name1    Name2      FlyBase name        GO-categories (here the list is shown truncated)

AE003422 AAF45673.1 Dmel_CG14813        /GO:0006886 [IEA]:intracellular protein transport;, /GO:0006887 [IEA]:exocytosis
AE003427 AAF45877.3 Dmel_CG14271        /GO:0005554 [ND]:molecular function unknown;, /GO:0005737 [NAS]:cytoplasm;, /GO:
AE003435 AAF46058.2 Dmel_CG4111_splic1  /GO:0003676 [IEA]:nucleic acid binding;, /GO:0003735 [ISS]:structural constituen
AE003456 AAF46816.3 Dmel_CG5625_splic1  /GO:0006886 [IEA]:intracellular protein transport;, /GO:0016192 [ISS]:vesicle-me
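The GO annotations in this format could be extracted with a regular expression, as sketched below. The pattern is inferred from the excerpt above only (a seven-digit GO identifier, an evidence code in square brackets, then a description ending at a semicolon), so it is an assumption about the full file; the function name is hypothetical. Entries cut off in the truncated listing, such as a trailing "/GO:", are simply not matched.

```python
import re
from typing import List, Tuple

_GO_ENTRY = re.compile(r"/GO:(\d{7}) \[(\w+)\]:([^;]*)")


def parse_go_line(line: str) -> Tuple[str, str, str, List[Tuple[str, str, str]]]:
    """Return Name1, Name2, FlyBase name and a list of (GO id, evidence, term)."""
    fields = line.strip().split(None, 3)
    if len(fields) < 3:
        raise ValueError(f"expected at least three fields: {line!r}")
    annotations = fields[3] if len(fields) == 4 else ""
    go_terms = [
        (f"GO:{go_id}", evidence, term.strip())
        for go_id, evidence, term in _GO_ENTRY.findall(annotations)
    ]
    return fields[0], fields[1], fields[2], go_terms


example = (
    "AE003422 AAF45673.1 Dmel_CG14813        "
    "/GO:0006886 [IEA]:intracellular protein transport;, /GO:0006887 [IEA]:exocytosis"
)
name1, name2, flybase_name, go_terms = parse_go_line(example)
print(flybase_name, go_terms)
```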



CORRESPONDENCE

Henrik Nielsen,