GenePublisher Exercise
This exercise will illustrate analysis
of data produced by DNA arrays used for monitoring expression levels
of whole cell mRNA populations.
The GenePublisher
software used for the exercise integrates a large number of analysis methods
and automatically produces a report of all the results. You will download such an automatically generated report and try to interpret the analysis results, but first you should try to upload a dataset yourself.
Background material
- Knudsen, S. (2004) Guide to Analysis of DNA Microarray Data. 2nd Ed Wiley, New York.
Exercises
1. Running GenePublisher
On the GenePublisher webpage (in the middle of the page) there is a link to a file called "sample file". Download it and open it in a text editor. The file is in genetable format and contains a list of more than 4,000 Bacillus subtilis genes. Each gene is represented by its name, its annotation and the data from three experiments. The first three data columns show the expression in the wildtype and the last three show the expression in a TnrA mutant. TnrA is a transcription factor known to both activate and repress transcription.
Go to the middle of the GenePublisher webpage and upload your genetable (the sample.input file) using the "browse" button. Check that the parameters are correct.
2. Normalization
Download an analysis report. Open the report
using Adobe Acrobat. Go to the Normalization section under Results
and identify the MVA plots before and after normalization.
These are so-called M versus A plots; instead of plotting each probe on one chip against each probe
on another, the scales are changed so they plot, for each probe, the logarithm of
the ratio of expression between the two chips as a function of the logarithm of
the mean of the expression of the two chips. Two identical chips would yield a
straight, flat line through zero. Two comparable chips ideally have a straight, flat
line through zero and a few probes off the fitted line, indicating differential expression.
Deviation of the line from zero reveals a need for normalization before the
two chips can be compared, and deviation from a straight line reveals a need for
non-linear normalization (different normalization factors for highly and weakly
expressed genes).
In these figures all chips are compared to each other (up to a limit of 10 chips).
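The M and A values described above can be sketched in a few lines of Python. This is a minimal illustration, not GenePublisher's own code: the function name `mva_points`, the use of base-2 logarithms and the example intensities are all assumptions. A is computed as the logarithm of the mean of the two chips, following the description above; other tools often use the mean of the two logarithms instead.

```python
import math

def mva_points(chip_x, chip_y):
    """Return a list of (A, M) pairs for two chips measured on the same probes.

    Hypothetical helper for illustration only; base-2 logs are an assumption.
    """
    points = []
    for x, y in zip(chip_x, chip_y):
        if x <= 0 or y <= 0:
            continue  # the logarithm is undefined for non-positive intensities
        m = math.log2(x / y)        # M: log of the expression ratio between the chips
        a = math.log2((x + y) / 2)  # A: log of the mean expression of the two chips
        points.append((a, m))
    return points

# Made-up intensities: the second chip is uniformly half as bright as the first.
bright = [200.0, 400.0, 800.0]
dim = [100.0, 200.0, 400.0]
for a, m in mva_points(bright, dim):
    print(f"A={a:.2f}  M={m:.2f}")
```

In this made-up case every probe has the same non-zero M, so the points form a flat line shifted away from zero: a single scaling factor would be enough to normalize the pair. A line that bends with A would instead call for non-linear normalization.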
- Look at the MVA plots before normalization and the MVA plots
after normalization. Have the chips become more comparable after normalization?
How can you tell?
3. Statistical analysis
- Identify the table of genes that show significant change in expression. What is the P-value of the top ranking gene?
- Look in the text before the table. How many false positive genes do we expect on the list? What is the expected false discovery rate?
- Look at the volcano plot (if any). How many genes from the shuffled analysis have a P-value below the chosen cutoff? How does that compare to the number
of predicted false positive genes in the question above?
- Look at the volcano plot (if any). What is the largest fold change observed?
Is that fold change significant?
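The relationship between the P-value cutoff, the expected number of false positives and the false discovery rate can be sketched as follows. This is a simplified illustration that assumes every tested gene is unchanged (the worst case) and is not necessarily how GenePublisher computes the numbers in the report; all function names and all numbers in the example are hypothetical.

```python
def expected_false_positives(n_genes_tested, p_cutoff):
    """Genes expected to pass the cutoff by chance if no gene is truly changed."""
    return n_genes_tested * p_cutoff

def estimated_fdr(n_genes_tested, p_cutoff, n_genes_called):
    """Expected false positives divided by the number of genes on the list."""
    if n_genes_called == 0:
        return 0.0  # an empty list contains no false discoveries
    fdr = expected_false_positives(n_genes_tested, p_cutoff) / n_genes_called
    return min(1.0, fdr)  # a rate cannot exceed 100%

def count_below_cutoff(p_values, p_cutoff):
    """Count P-values below the cutoff, e.g. those from a shuffled analysis."""
    return sum(1 for p in p_values if p < p_cutoff)

# Hypothetical numbers: 4000 genes tested, cutoff 0.01, 200 genes on the list.
print(expected_false_positives(4000, 0.01))
print(estimated_fdr(4000, 0.01, 200))
```

The number of shuffled-analysis genes below the cutoff is an empirical counterpart to the first function, which is why the two numbers are worth comparing in the question above.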
4. Clustering and PCA of chips
- Look at the clustering of chips. Are the chips divided into groups that make sense based on the experiment?
- Compare the chip clustering to the chip PCA. Does the PCA add any additional
information?
5. Gene Cluster Analysis
Go to "Clustering of genes" under "Results."
- Look at the figure "Hierarchical clustering" in the "Clustering" section.
What is, in your opinion, a good number of clusters to divide the
genes into?
- Look at the figure "Optimization of number of clusters K".
What number of clusters did the computer find to be optimal
for dividing the genes?
- Look at the figure "K-means clustering". Do the genes within each cluster look
more like each other (in terms of expression profile) than they look like
members of another cluster? If they
do, the clustering method has performed as it should.
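The visual check in the last question can also be expressed numerically. The sketch below is only an illustration under stated assumptions: it uses Euclidean distance between expression profiles, assumes every cluster is non-empty, and the function names and example profiles are made up. It reports the fraction of genes that lie closer to the centre of their own cluster than to the centre of any other cluster.

```python
import math

def centroid(profiles):
    """Mean expression profile of a non-empty cluster."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def distance(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fraction_closest_to_own_centroid(clusters):
    """Fraction of genes whose nearest cluster centre is their own."""
    centroids = [centroid(cluster) for cluster in clusters]
    total = 0
    correct = 0
    for index, cluster in enumerate(clusters):
        for profile in cluster:
            dists = [distance(profile, c) for c in centroids]
            if dists.index(min(dists)) == index:
                correct += 1
            total += 1
    return correct / total

# Two made-up clusters of expression profiles over three chips.
low = [[1.0, 1.0, 1.0], [1.1, 0.9, 1.0]]
high = [[5.0, 5.0, 5.0], [5.2, 4.8, 5.0]]
print(fraction_closest_to_own_centroid([low, high]))
```

A value close to 1 corresponds to the situation described above, where genes resemble members of their own cluster more than members of other clusters.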
6. Promoter analysis
Go to "Promoter analysis" under "Results."
- Look at the results from the three different promoter analysis methods.
What is, in your opinion, the most significant result (if any)?
7. Further Analysis
Try to look at the function of genes that cluster closely together in the hierarchical clustering.
- Do you find any similarities in functions between members of a cluster?
This question can sometimes be easy to answer based on a quick look at the annotated function (shown in Table 2-3). Sometimes it requires a great deal of database browsing to find something in common. You can use the numbers referring to Table 2-3, the table of top ranking genes, to find out
more about individual genes.
8. Analysis of the Bacillus subtilis data
Now use your expertise to extract interesting results from the report that you generated at the beginning of this exercise. In case the process failed, you can download a backup report here: GenePub-report (3 exp).
- Do you find any genes that might belong to the TnrA regulon?
- How well are the data normalized?
- How well do the data cluster?
Take a look at this report: GenePub-report (6 exp). This is the same kind of experiment, but with six experiments instead of only three. Try to compare the results from the two reports.
- Is there a big difference between these two?
- Which report would you trust the most?
- Do you find more significantly affected genes when you use more replicates?
- If you look at the top 10 most significant genes, which report gives you the highest P-values?
Last updated by Steen Knudsen, CBS, March 12, 2004
Modified by Hanner Jarmer, CBS, August 5, 2004