GenePublisher Exercise



This exercise will illustrate analysis of data produced by DNA arrays used for monitoring expression levels of whole cell mRNA populations.

The GenePublisher software used for the exercise integrates a large number of analysis methods and automatically produces a report of all the results. You will download such an automatically generated report and try to interpret the analysis results, but first you should try to upload a dataset yourself.

Background material

  1. Knudsen, S. (2004) Guide to Analysis of DNA Microarray Data. 2nd ed. Wiley, New York.

Exercises

1. Running GenePublisher

On the GenePublisher webpage (in the middle of the page) there is a link to a file called "sample file". Download it and open it in a text editor. The file is in genetable format and contains a list of more than 4,000 Bacillus subtilis genes. Each gene is represented by its name, its annotation and the data from three experiments. The first three data columns show the expression in the wild type and the last three show the expression in a TnrA mutant. TnrA is a transcription factor known to both activate and repress transcription.

Go to the middle of the GenePublisher webpage and upload your genetable (the sample.input file) using the "browse" button. Check that the parameters are correct.

2. Normalization

Download an analysis report. Open the report using Adobe Acrobat. Go to the Normalization section under Results and identify the MVA plots before and after normalization.

These are so-called M versus A plots. Instead of plotting each probe on one chip against the same probe on another, the scales are changed so that the plot shows, for each probe, the logarithm of the ratio of expression between the two chips as a function of the logarithm of the mean of the expression of the two chips. Two identical chips would yield a straight, flat line through zero. Two comparable chips ideally give a straight, flat line through zero with a few probes off the fitted line, indicating differential expression. Deviation of the line from zero reveals a need for normalization before the two chips can be compared, and deviation from a straight line reveals a need for non-linear normalization (different normalization factors for highly and weakly expressed genes).
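The M and A values described above can be computed per probe as in the following minimal sketch. It makes several assumptions that are not taken from GenePublisher: the logarithm is base 2, A is the average of the two log intensities (the common convention, which is close to the logarithm of the mean when the two chips are similar), and the intensities shown are made-up numbers.

```python
import math

def ma_values(chip_x, chip_y):
    """Return (M, A) lists for two chips, one positive intensity per probe."""
    m_values, a_values = [], []
    for x, y in zip(chip_x, chip_y):
        log_x, log_y = math.log2(x), math.log2(y)
        m_values.append(log_x - log_y)          # M: log ratio between the chips
        a_values.append(0.5 * (log_x + log_y))  # A: average log intensity
    return m_values, a_values

# Hypothetical intensities for four probes measured on two chips
chip_1 = [100.0, 400.0, 1600.0, 6400.0]
chip_2 = [50.0, 200.0, 800.0, 3200.0]

m, a = ma_values(chip_1, chip_2)
for m_i, a_i in zip(m, a):
    print(f"A = {a_i:.2f}  M = {m_i:.2f}")
```

In this made-up example chip 1 is twice as bright as chip 2 for every probe, so all M values are 1: a flat line that is offset from zero, which is the pattern a single scaling factor would correct.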

In these figures all chips are compared to each other (up to a limit of 10 chips).

  1. Look at the MVA plots before normalization and the MVA plots after normalization. Have the chips become more comparable after normalization? How can you tell?

3. Statistical analysis

  1. Identify the table of genes that show significant change in expression. What is the P-value of the top ranking gene?

  2. Look in the text before the table. How many false positive genes do we expect on the list? What is the expected false discovery rate?
  3. Look at the volcano plot (if any). How many genes from the shuffled analysis have a P-value below the chosen cutoff? How does that compare to the number of predicted false positive genes in the question above?
  4. Look at the volcano plot (if any). What is the largest fold change observed? Is that fold change significant?
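The relation between a P-value cutoff, the expected number of false positives and the false discovery rate can be sketched as follows. This assumes that P-values of unaffected genes are uniformly distributed, so that the expected number of false positives is the number of genes tested times the cutoff. Whether the report computes its numbers in exactly this way is not stated here, and the gene counts used below are hypothetical.

```python
def expected_false_positives(n_genes_tested, p_cutoff):
    """Genes expected to pass the cutoff by chance alone."""
    return n_genes_tested * p_cutoff

def false_discovery_rate(n_genes_tested, p_cutoff, n_genes_called):
    """Expected fraction of false positives among the genes called significant."""
    if n_genes_called == 0:
        return 0.0
    expected = expected_false_positives(n_genes_tested, p_cutoff)
    return min(1.0, expected / n_genes_called)

# Hypothetical numbers, not taken from any report
n_tested = 4000   # genes on the chip
cutoff = 0.01     # chosen P-value cutoff
n_called = 200    # genes with a P-value below the cutoff

efp = expected_false_positives(n_tested, cutoff)
fdr = false_discovery_rate(n_tested, cutoff, n_called)
print(f"Expected false positives: {efp:.0f}")
print(f"Expected false discovery rate: {fdr:.0%}")
```

With these made-up numbers about 40 of the 200 genes on the list would be expected to be there by chance. The shuffled analysis gives an empirical count that can be compared with such an estimate.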

4. Clustering and PCA of chips

  1. Look at the clustering of chips. Are the chips divided into groups that make sense based on the experiment?

  2. Compare the chip clustering to the chip PCA. Does the PCA add any additional information?

5. Gene Cluster Analysis

Go to "Clustering of genes" under "Results."
  1. Look at the Figure Hierarchical clustering in the "Clustering" section. What is, in your opinion, a good number of clusters to divide the genes into?

  2. Look at the Figure Optimization of number of clusters K. What number of clusters did the computer find as the optimal number of clusters to divide the genes into?

  3. Look at the Figure K-means clustering. Do the genes within each cluster look more like each other (in terms of expression profile) than they look like members of another cluster? If they do, the clustering method has performed as it should.
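To make the idea behind K-means concrete, the following is a minimal sketch of the basic algorithm: assign each gene to the nearest cluster center, recompute each center as the mean profile of its members, and repeat until the assignment stops changing. It is not GenePublisher's implementation; the Euclidean distance, the caller-supplied starting centers and the expression profiles are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_profile(profiles):
    """Column-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def k_means(profiles, initial_centers, max_iter=100):
    """Return (labels, centers) after iterating until labels are stable."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center for every profile
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments did not change, so the centers will not either
        labels = new_labels
        # Update step: move each center to the mean of its members
        for k in range(len(centers)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:
                centers[k] = mean_profile(members)
    return labels, centers

# Hypothetical log expression ratios for four genes across three experiments
profiles = [
    [2.0, 2.1, 1.9],
    [1.8, 2.2, 2.0],
    [-2.0, -1.9, -2.1],
    [-2.2, -1.8, -2.0],
]
labels, centers = k_means(profiles, initial_centers=[profiles[0], profiles[2]])
print(labels)
```

Genes that end up with the same label have profiles closer to their own cluster center than to any other center, which is the property the question above asks you to check by eye.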

6. Promoter analysis

Go to "Promoter analysis" under "Results."
  1. Look at the results from the three different promoter analysis methods. What is, in your opinion, the most significant result (if any)?

7. Further Analysis

Try to look at the function of genes that cluster closely together in the hierarchical clustering.
  1. Do you find any similarities in functions between members of a cluster?

This question can sometimes be easy to answer based on a quick look at the annotated function (shown in Table 2-3). Sometimes it requires a great deal of database browsing to find something in common. You can use the numbers referring to Table 2-3, the table of top ranking genes, to find out more about individual genes.

8. Analysis of the Bacillus subtilis data

Now use your expertise to extract interesting results from the report that you generated in the beginning of this exercise. In case the process failed you can download a backup report here: GenePub-report (3 exp).
  1. Do you find any genes that might belong to the TnrA regulon?

  2. How well are the data normalized?

  3. How do the data cluster?

Take a look at this report: GenePub-report (6 exp). This is the same kind of experiment, but with six experiments instead of only three. Try to compare the results from the two reports.
  1. Is there a big difference between these two?

  2. Which report would you trust the most?

  3. Do you find more significantly affected genes when you use more replicates?

  4. If you look at the top 10 most significant genes, which report gives you the highest P-values?


Last updated by Steen Knudsen, CBS, March 12, 2004

Modified by Hanne Jarmer, CBS, August 5, 2004