This site contains supplementary figures and tables from the paper
An overabundance of phase 0 introns immediately after the start
codon in eukaryotic genes
Henrik Nielsen and Rasmus Wernersson
Submitted, 2006.
Figure S1: intron length distribution for introns up to 250 nt
Figure S2: Intron position statistics for genome data without
ribosomal proteins
In order to show that the phenomenon of start codon introns is not
limited to ribosomal proteins, we have calculated the distribution of
intron positions for the whole-genome data sets (for proteins without
signal peptides) with ribosomal proteins removed. This figure should
be compared to the right half of Figure 3 in the paper.
Figure S3: Intron position statistics for human and
Drosophila proteins with conserved N-terminals
In order to show that the start codon peak is also present in
proteins with few indels in the N-terminal part, we selected from the
list of RBHs (Reciprocal Best Hits) between human and
Drosophila those global alignments where neither the human nor
the fly sequence had gaps within the first 20 positions. This
yielded 661 pairs, of which 582 human sequences and 599 fly sequences
were predicted to be without signal peptides.
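The selection step described above can be sketched as a simple filter. This is only an illustration, assuming that each global alignment is available as two gapped strings of equal length, that gaps are written as "-", and that "the first 20 positions" refers to alignment columns. The function names and the example sequences are hypothetical and are not taken from the paper.

```python
def gap_free_n_terminus(aligned_a, aligned_b, window=20, gap="-"):
    """Return True if neither aligned sequence has a gap in the first `window` columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return gap not in aligned_a[:window] and gap not in aligned_b[:window]


def select_conserved_pairs(alignments, window=20):
    """Keep only the (name, seq_a, seq_b) alignments with a gap-free N-terminal window."""
    return [
        (name, a, b)
        for name, a, b in alignments
        if gap_free_n_terminus(a, b, window)
    ]


# Made-up alignments for illustration only.
example = [
    ("pair1", "MSTNPKPQRKTKRNTNRRPQDVKF", "MSTNPKPQRKTKRNTNRRPQDVKF"),
    ("pair2", "M--KVLAAGIVGLLLAQPSFAAEK", "MSTKVLAAGIVGLLLAQPSFAAEK"),  # gap at columns 2-3
    ("pair3", "MASKGEELFTGVVPILVELDG-DV", "MASKGEELFTGVVPILVELDGADV"),  # gap after column 20
]
kept = select_conserved_pairs(example)
print([name for name, _, _ in kept])
```

Here pair2 is rejected because of the gap near the start, while pair3 is kept because its only gap lies beyond the 20-column window.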
Figure S4: All vs. all alignments: Score/Identity Plots and
distance trees
In order to make sure that the start codon and position 5 peaks
are not an artefact due to homologous proteins (even though
the datasets have been homology reduced), we have analyzed the data
in the following way:
For each dataset, all protein sequences were aligned pairwise
against the rest of the set using the ALIGN program [Pearson
and Lipman, 1988], and alignment score and percent identity were
plotted. Furthermore, we used CLUSTALW [Thompson et
al., 1997] to construct a "phylogenetic" tree based on pairwise
distances (CLUSTALW's guide tree), and visualized the
relationships by plotting the tree with "UNROOTED" [ref:
http://pbil.univ-lyon1.fr/software/unrooted.html].
As seen in Sections 1 and 2 below, the proteins in both the start
codon peak and the position 5 peak are clearly not related. For
reference, we did the same analysis on 100 randomly selected proteins
from the non-homology-reduced Vertebrate set; as seen in Section
3, a number of these sequences are clearly related
by homology.
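The all vs. all comparison can be sketched as follows. This is not the ALIGN program: the aligner is passed in as a callable, and `toy_align` below is a hypothetical placeholder that merely pads the shorter sequence with gaps. Percent identity is assumed here to be the number of identical columns divided by the number of columns without gaps; ALIGN may define it differently.

```python
from itertools import combinations


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues among alignment columns that contain no gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def all_vs_all(sequences, align):
    """Align every unordered pair and collect (name_a, name_b, score, identity).

    `sequences` maps names to sequences; `align` is any callable returning
    (aligned_a, aligned_b, score) for two input sequences.
    """
    results = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2):
        aligned_a, aligned_b, score = align(seq_a, seq_b)
        results.append((name_a, name_b, score, percent_identity(aligned_a, aligned_b)))
    return results


def toy_align(a, b):
    """Placeholder aligner (NOT a real alignment): pad the shorter sequence with gaps."""
    width = max(len(a), len(b))
    a, b = a.ljust(width, "-"), b.ljust(width, "-")
    score = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return a, b, score


# Made-up sequences for illustration only.
pairs = all_vs_all({"p1": "MKVLA", "p2": "MKVLAG", "p3": "MAVQA"}, toy_align)
for row in pairs:
    print(row)
```

Each resulting (score, identity) pair corresponds to one point in a score/identity plot; sets without homologs should show no points at high identity.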
Section 1: Start codon peaks
Section 2: Arthropoda pos. 5 peak
Section 3: Reference plot of non-homology reduced data
Supplementary tables
Table S1: lengths, nucleotide frequencies and dinucleotide
frequencies for start codon introns compared to other first phase 0
introns.
|       | Vertebrata      | Arthropoda      | Fungi           | Magnoliophyta   |
|       | sci    | no sci | sci    | no sci | sci    | no sci | sci    | no sci |
Length statistics
| Mean  | 3059.6 | 4941.3 | 631.0  | 822.4  | 170.0  | 106.6  | 498.6  | 478.8  |
| P-val | 0.01681         | 0.4927          | 0.008802        | 0.738           |
Nucleotide statistics
| a     | 26.20% | 26.95% | 29.27% | 29.28% | 29.16% | 26.72% | 28.17% | 27.98% |
| c     | 21.05% | 20.75% | 19.90% | 20.27% | 19.93% | 21.63% | 19.94% | 19.36% |
| g     | 23.48% | 22.11% | 19.84% | 19.83% | 20.61% | 21.30% | 19.42% | 20.29% |
| t     | 29.18% | 30.18% | 30.99% | 30.63% | 30.31% | 30.35% | 32.46% | 32.37% |
Dinucleotide statistics
| aa    | 7.85%  | 8.27%  | 11.01% | 10.52% | 9.05%  | 7.65%  | 9.20%  | 8.91%  |
| ac    | 4.59%  | 4.66%  | 4.74%  | 5.09%  | 5.58%  | 5.87%  | 4.84%  | 4.77%  |
| ag    | 7.25%  | 7.09%  | 4.91%  | 5.08%  | 5.95%  | 5.82%  | 5.36%  | 5.48%  |
| at    | 6.50%  | 6.93%  | 8.66%  | 8.62%  | 8.76%  | 7.64%  | 8.83%  | 8.88%  |
| ca    | 6.59%  | 6.67%  | 6.50%  | 6.73%  | 6.44%  | 6.45%  | 5.98%  | 5.93%  |
| cc    | 5.57%  | 5.52%  | 4.65%  | 4.50%  | 4.04%  | 4.89%  | 4.38%  | 4.25%  |
| cg    | 1.69%  | 1.32%  | 3.64%  | 3.69%  | 3.26%  | 3.69%  | 3.14%  | 3.08%  |
| ct    | 7.20%  | 7.25%  | 5.14%  | 5.36%  | 6.30%  | 6.80%  | 6.48%  | 6.14%  |
| ga    | 5.95%  | 5.85%  | 4.99%  | 5.09%  | 6.34%  | 5.90%  | 5.43%  | 5.63%  |
| gc    | 5.13%  | 4.68%  | 5.29%  | 5.19%  | 4.30%  | 4.39%  | 4.11%  | 4.23%  |
| gg    | 6.64%  | 5.87%  | 3.91%  | 4.11%  | 3.99%  | 3.88%  | 4.11%  | 4.39%  |
| gt    | 5.72%  | 5.69%  | 5.52%  | 5.34%  | 5.51%  | 6.38%  | 5.61%  | 5.86%  |
| ta    | 5.79%  | 6.16%  | 6.82%  | 6.97%  | 7.50%  | 6.98%  | 7.62%  | 7.56%  |
| tc    | 5.75%  | 5.90%  | 5.25%  | 5.51%  | 6.13%  | 6.68%  | 6.64%  | 6.15%  |
| tg    | 7.87%  | 7.82%  | 7.25%  | 6.84%  | 6.94%  | 7.17%  | 6.66%  | 7.17%  |
| tt    | 9.76%  | 10.31% | 11.71% | 11.34% | 9.92%  | 9.82%  | 11.60% | 11.55% |
Table S1:
Length and nucleotide statistics for start codon introns
("sci") and other phase 0 first introns ("other"; labelled "no sci"
in the table). Where the difference for a particular dinucleotide is
greater than 0.5%, the higher percentage is shown in boldface.
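The highlighting rule in the caption can be expressed as a small check. This sketch assumes that "greater than 0.5%" means an absolute difference of more than 0.5 percentage points; the function name is hypothetical, and the values are the Vertebrata dinucleotide percentages from Table S1.

```python
def higher_if_notable(sci, no_sci, threshold=0.5):
    """Return which column holds the higher value if the gap exceeds `threshold`, else None."""
    if abs(sci - no_sci) <= threshold:
        return None
    return "sci" if sci > no_sci else "no sci"


# Vertebrata values (sci, no sci) for a few dinucleotides, copied from Table S1.
vertebrata = {
    "aa": (7.85, 8.27),
    "cg": (1.69, 1.32),
    "gg": (6.64, 5.87),
    "tt": (9.76, 10.31),
}
for dinucleotide, (sci, no_sci) in vertebrata.items():
    print(dinucleotide, higher_if_notable(sci, no_sci))
```

Under this reading, only gg (higher in "sci") and tt (higher in "no sci") would be highlighted among the four vertebrate examples.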
When examining the nucleotide distribution, significant differences are
found for vertebrates, fungi, and plants (p < 10⁻⁴, χ²-test,
df=3), but not for arthropods. However, when
using nucleotide pair frequencies, significant differences are found
for all four groups (p < 10⁻³, χ²-test, df=15). The
nucleotide pair frequencies are shown in Table S1 (above). There do
not, however, seem to be any dinucleotide preferences that are
shared by all four organism groups.
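The degrees of freedom quoted above follow from comparing two groups of introns over 4 nucleotides (df = 3) or 16 dinucleotides (df = 15). The following is a minimal sketch of such a test, assuming raw counts pooled over all introns in each group and overlapping dinucleotide windows; the paper does not state how the counts were obtained, so both are assumptions, and the function names are hypothetical. Only the test statistic is computed here; a p-value would be obtained from the χ² distribution with the returned df (for example with `scipy.stats.chi2.sf`).

```python
from collections import Counter
from itertools import product


def ngram_counts(sequences, n, alphabet="acgt"):
    """Count overlapping n-mers over all sequences, in a fixed alphabetical order."""
    counts = Counter()
    for seq in sequences:
        seq = seq.lower()
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    return [counts[k] for k in keys]


def chi_square(row_a, row_b):
    """Pearson chi-square statistic and df for a 2 x k table of counts."""
    total_a, total_b = sum(row_a), sum(row_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both groups must contain at least one observation")
    grand = total_a + total_b
    stat = 0.0
    used_columns = 0
    for a, b in zip(row_a, row_b):
        column = a + b
        if column == 0:
            continue  # a category absent from both groups carries no information
        used_columns += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, used_columns - 1


# Made-up intron sequences for illustration only.
sci = ["gtaagtttcttttag", "gtatgtaattttcag"]
other = ["gtgagcgccgcgcag", "gtaggcgggccccag"]
print(chi_square(ngram_counts(sci, 1), ngram_counts(other, 1)))
```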
Table S2: Protein names or GO categories (for arthropoda) for all the
start codon intron proteins
Lists of gene identifiers and annotation of their protein product,
for proteins found to carry start codon introns,
can be downloaded here as four plain text files:
(
Arthropoda.reduc.scp.names.txt,
Fungi.reduc.scp.names.txt,
Magnoliophyta.reduc.scp.names.txt,
Vertebrata.reduc.scp.names.txt
).
The protein product annotation originating from the GenBank files
is described in the following format:
Name1 Name2 Product (if known)
AC132479 AAY24076.1 /product="unknown"
AF041427 AAB96967.1 /product="ribosomal protein s4 Y isoform"
AF305057 AAG29537.1 /product="RTS beta"
AL136181 CAH72985.1 /product="transmembrane protein 10"
AL137067 CAC08000.1 /product="Sec61 beta subunit"
AL139289 CAI23381.1
AL157783 CAI12146.1 /product="cAMP responsive element modulator"
AL354928 CAI39640.1 /product="ribosomal protein L35"
AL355815 CAC19504.1
AL357314 CAI22392.1 /product="Rab geranylgeranyltransferase, beta subunit"
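A line in this format could be read with a short parser such as the one below. It is a sketch that assumes whitespace-separated columns (the actual files may use tabs) and that a missing product simply leaves the third column empty, as in the example lines above. The function name is hypothetical.

```python
import re

# Matches the GenBank-style qualifier /product="..." and captures the quoted text.
PRODUCT_RE = re.compile(r'/product="([^"]*)"')


def parse_names_line(line):
    """Split one line into (name1, name2, product); product is None when absent."""
    fields = line.split(None, 2)
    if len(fields) < 2:
        raise ValueError("expected at least two identifier columns: %r" % line)
    name1, name2 = fields[0], fields[1]
    product = None
    if len(fields) == 3:
        match = PRODUCT_RE.search(fields[2])
        if match:
            product = match.group(1)
    return name1, name2, product


print(parse_names_line('AF041427 AAB96967.1 /product="ribosomal protein s4 Y isoform"'))
print(parse_names_line("AL139289 CAI23381.1"))
```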
Furthermore, a file,
Arthropoda.reduc.scp.go+names.txt,
which includes the GO categories, is supplied
for the arthropod subset. The format is as follows:
Name1 Name2 FlyBase name GO-categories (here the list is shown truncated)
AE003422 AAF45673.1 Dmel_CG14813 /GO:0006886 [IEA]:intracellular protein transport;, /GO:0006887 [IEA]:exocytosis
AE003427 AAF45877.3 Dmel_CG14271 /GO:0005554 [ND]:molecular function unknown;, /GO:0005737 [NAS]:cytoplasm;, /GO:
AE003435 AAF46058.2 Dmel_CG4111_splic1 /GO:0003676 [IEA]:nucleic acid binding;, /GO:0003735 [ISS]:structural constituen
AE003456 AAF46816.3 Dmel_CG5625_splic1 /GO:0006886 [IEA]:intracellular protein transport;, /GO:0016192 [ISS]:vesicle-me
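The GO annotations in such a line could be extracted as sketched below. The pattern is inferred only from the example lines shown above (entries of the form "/GO:<id> [<evidence code>]:<description>", separated by ";, "), so it is an assumption about the full file; the function name is hypothetical.

```python
import re

# Captures the GO identifier, the evidence code in brackets, and the description.
GO_RE = re.compile(r"/GO:(\d+) \[([A-Za-z]+)\]:([^;]*)")


def parse_go_line(line):
    """Split one line into (name1, name2, flybase_name, [(go_id, evidence, description), ...])."""
    fields = line.split(None, 3)
    if len(fields) < 3:
        raise ValueError("expected at least three identifier columns: %r" % line)
    name1, name2, flybase_name = fields[:3]
    terms = []
    if len(fields) == 4:
        terms = [
            ("GO:" + go_id, evidence, description.strip())
            for go_id, evidence, description in GO_RE.findall(fields[3])
        ]
    return name1, name2, flybase_name, terms


line = ("AE003422 AAF45673.1 Dmel_CG14813 /GO:0006886 [IEA]:intracellular protein transport;, "
        "/GO:0006887 [IEA]:exocytosis")
print(parse_go_line(line))
```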
CORRESPONDENCE
Henrik Nielsen,