Exercise in Data Integration and Systems Biology

Written by: Carsten Friis

Overview of Exercise

This exercise focuses on the software Cytoscape and its application to systems biology. Cytoscape is a Java-based, open source application for visualizing molecular interaction networks and for integrating these with other data, such as gene expression values. The exercise starts by covering the most basic features in Cytoscape, and slowly moves towards using the tool to form some biological conclusions. Throughout, we will use data on the galactose utilization pathway in yeast.

(Note on vocabulary: In this exercise the term "node" refers to genes and/or proteins, whereas the term "edge" describes an interaction between nodes in a network, regardless of its type. This matches the usage in many publications on interaction networks and in the Cytoscape documentation, where "node" is used interchangeably with "gene" or "protein".)

The Galactose Utilization Pathway

Figure and text from Ideker et al. (1)

As shown in the Figure above, galactose utilization consists of a biochemical pathway that converts galactose into glucose-6-phosphate and a regulatory mechanism that controls whether the pathway is on or off. This process has been reviewed extensively (2, 3) and involves at least three types of proteins. A transporter gene (GAL2) encodes a permease that transports galactose into the cell; several other hexose transporters (HXTs) may also have this ability (4). A group of enzymatic genes encodes the proteins required for conversion of intracellular galactose, including galactokinase (GAL1), uridylyltransferase (GAL7), epimerase (GAL10), and phosphoglucomutase (GAL5/PGM2). The regulatory genes GAL3, GAL4, and GAL80 exert tight transcriptional control over the transporter, the enzymes, and to a certain extent, each other. GAL4p is a DNA-binding factor that can strongly activate transcription, but in the absence of galactose, GAL80p binds GAL4p and inhibits its activity. When galactose is present in the cell, it causes GAL3p to associate with GAL80p. This association causes GAL80p to release its repression of GAL4p, so that the transporter and enzymes are expressed at a high level.

Although these genes and interactions form the core of the GAL pathway, the complete regulatory mechanism is more complex (5-8) and involves genes whose roles in galactose utilization are not entirely clear (9, 10). For instance, the gene GAL6 (LAP3) functions predominantly in a drug-resistance pathway, but can suppress transcription of the GAL transporter and enzymes under certain conditions and may itself be transcriptionally controlled by GAL4 (11).

Getting Started

Let's start off by looking at our data.

  1. For convenience just follow this link: galFiltered.sif
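If you open the file in a text editor, you will see that a SIF file is plain text: each line names a source node, an interaction type, and one or more target nodes. The sketch below shows how such lines could be read into edges. The helper `parse_sif` and the placeholder names `ORF_A` to `ORF_D` are assumptions made for illustration; they are not part of Cytoscape and not copied from galFiltered.sif.

```python
from typing import List, Tuple


def parse_sif(text: str) -> List[Tuple[str, str, str]]:
    """Return (source, interaction, target) triples from SIF-formatted text."""
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            # Blank line, or a node listed on its own without any interaction.
            continue
        source, interaction, targets = fields[0], fields[1], fields[2:]
        # A single line may list several targets for the same source.
        for target in targets:
            edges.append((source, interaction, target))
    return edges


# Illustrative lines only (hypothetical targets), not the real file contents.
sample = """\
YPL248C pd ORF_A
YPL248C pp ORF_B
ORF_B pp ORF_C ORF_D
"""

edges = parse_sif(sample)
for edge in edges:
    print(edge)
```

The interaction codes "pp" and "pd" used here are the two edge types that appear later in the exercise.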

Now we will do some basic navigation in Cytoscape.

  1. Open Cytoscape by double-clicking on the desktop shortcut
  2. Load the data by clicking on the load button in the toolbar and selecting the file "galFiltered.sif" in the "sampleData" directory. Because the data set is smaller than 500 nodes, a view of the network is automatically created

You can zoom in on the network by clicking the zoom-in button in the toolbar, and zoom out again with the zoom-out button. To center the view on the whole network, click the button that fits the network to the window.

  1. If you zoom in far enough on the nodes, the node labels will appear and become readable. Try this
    Tip: Try holding down space while clicking and holding the mouse button down anywhere on the background in the network view. Now move the mouse around. This is a handy way of scrolling around in the network while you're zoomed in

Selection in Cytoscape

You can select specific nodes, which allows you to perform operations on them. You can do this in the traditional ways, by clicking on individual nodes or by dragging a box around a group of nodes. You can also select several nodes by holding down shift while clicking. Typically, however, it makes most sense to select nodes through their names or attributes. We'll get to attributes later on, but let's try selecting the node "YPL248C" and its closest interaction partners. This is the GAL4 gene, one of the key regulators of galactose metabolism, and its interaction partners should include the majority of the GAL genes.

  1. Select the "YPL248C" node by clicking on "Select" in the menu, then choose "Nodes" and finally "By Name...". Use the zoom-to-selection button in the toolbar to zoom to GAL4/YPL248C

  2. Again, open the "Select" menu and click on "Nodes", but this time choose "First neighbors of selected nodes". Click on the zoom-to-selection button again to see all of your selected nodes

  3. By now you should have a very messy image in front of you. To clean it up and to get an overview, click on "Layout" -> "Apply Spring Embedded Layout" -> "All Nodes". Finish off by zooming to your selection once more, and you should have a readable network of GAL4/YPL248C and its partners
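To make clear what "First neighbors of selected nodes" does, here is a minimal sketch of the same operation on a list of edges. It assumes that edge direction is ignored when looking for neighbors. The function `first_neighbors` and the placeholder node names are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def first_neighbors(edges, selected):
    """Return the selected nodes plus every node directly connected to them."""
    adjacency = defaultdict(set)
    for source, _interaction, target in edges:
        # Treat every edge as undirected for the purpose of selection.
        adjacency[source].add(target)
        adjacency[target].add(source)

    result = set(selected)
    for node in selected:
        result |= adjacency.get(node, set())
    return result


# Hypothetical toy network; only YPL248C is a name taken from the exercise.
toy_edges = [
    ("YPL248C", "pd", "ORF_A"),
    ("YPL248C", "pp", "ORF_B"),
    ("ORF_B", "pp", "ORF_C"),
]

selection = first_neighbors(toy_edges, {"YPL248C"})
print(sorted(selection))
```

Note that `ORF_C` is not selected: it is two steps away from YPL248C, so you would have to apply the operation a second time to reach it.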

Annotation in Cytoscape

Cytoscape's greatest potential lies in its ability to integrate different genetic and proteomic features/annotations in one figure. To illustrate how annotations can be visualized in Cytoscape, let's try to distinguish between different types of interactions. The network file you have loaded actually contains both protein-to-protein (pp) and protein-to-DNA (pd) interactions. Let's add some colors to distinguish between the interaction types, and some arrows to visualize the direction of the interactions.

  1. Click on the visual style button to open the visual style dialog

  2. In this window you can define different visual styles for different projects and save them for later. For the purpose of the exercise, the "default" style will do fine; select it by clicking the "Define" button

  3. In the visual properties window select "Edge attributes". The "Edge Color" tab should be active by default, otherwise you'll need to select it

  4. In the "Mapping" pull-down menu select "BasicDiscrete". By default Cytoscape should suggest yellow color for pd interactions and blue for pp, which is fine. Otherwise you can change the color by clicking on either the "pp" or "pd" buttons

  5. Finalize your changes by clicking on the "Apply to Network" button, but do not close the window just yet...

Now we can identify protein-DNA interactions in the network view. We cannot, however, see which node represents the protein and which the gene, because the direction of the interactions is not shown. Let's add that information as well.

  1. Click on the tab "Edge Target Arrow"

  2. In the "Mapping" pull-down menu select "BasicDiscrete" and set "Map Attribute" to "Interaction" (this should be default)

  3. Now click the "pd" button and select one of the pointed arrows, then click "Apply to Network" and then "Close"
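What you have just built is a discrete mapping: each value of the "Interaction" attribute is looked up in a small table that returns a visual property. As a rough sketch, the two mappings above behave like the dictionaries below. The fallback values ("black", "none") and the name `edge_style` are assumptions for the example and do not reflect Cytoscape's internal defaults.

```python
# Discrete mappings from the edge attribute "Interaction" to visual properties.
EDGE_COLOR = {"pd": "yellow", "pp": "blue"}
TARGET_ARROW = {"pd": "arrow", "pp": "none"}


def edge_style(interaction):
    """Look up the color and target arrow for one interaction type."""
    return {
        # Fallbacks for unmapped interaction types are assumed values.
        "color": EDGE_COLOR.get(interaction, "black"),
        "target_arrow": TARGET_ARROW.get(interaction, "none"),
    }


for kind in ("pd", "pp"):
    print(kind, edge_style(kind))
```

Only the "pd" edges receive an arrow, because only protein-DNA interactions have a meaningful direction here: from the DNA-binding protein to the gene it binds.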

In this case the type of interaction was loaded along with the interaction network. It is also possible to load additional annotations from other, independent files.

Working with the built-in annotation

Annotation comes in two classes in Cytoscape: custom and built-in. What you have been working with so far has all been custom annotation, loaded as simple data files. This works for any organism, as long as you have the data. Built-in annotation, by contrast, only works for organisms supported by Cytoscape; currently, Cytoscape version 2.1 ships with support for yeast, and little else. The difference can sometimes be hard to make out, but it is significant, and the built-in annotation is very powerful. Here are some examples of what you can do with it.

By now you should have noticed that the nodes in the network are labelled with ORF names and not gene names. For most people, however, it would probably be more meaningful to use the gene names. Because Cytoscape has support for yeast, we can automatically translate the ORF names.

  1. Click on the visual style button and then "Define"

  2. Select "Node Attributes" and then the tab "Node Label"

  3. Under "Map Attribute" change "canonicalName" to "commonName" and apply the change
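Conceptually, switching the label from "canonicalName" to "commonName" is a simple lookup from ORF name to gene name, falling back to the ORF name when no gene name is known. A minimal sketch, assuming a lookup table that holds only the one pairing stated in this exercise (the table and the function `node_label` are hypothetical):

```python
# Only the pairing given in this exercise; a real table would cover all ORFs.
COMMON_NAMES = {"YPL248C": "GAL4"}


def node_label(orf, names=COMMON_NAMES):
    """Return the gene name for an ORF, or the ORF itself if none is known."""
    return names.get(orf, orf)


print(node_label("YPL248C"))
print(node_label("ORF_WITHOUT_NAME"))
```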

As you can see, this worked almost as it did for the custom annotation; the only difference is that we never loaded any data on the gene names. The annotation was in Cytoscape to begin with. Other built-in annotations include KEGG and GO annotations, but these need to be activated before we can use them. Let's try this and add some functional annotation to our data.

  1. Click on the annotation button to open the KEGG/GO annotation browser and hit the 'plus' next to "GO, molecular function". Select "2" to add second level GO functional categories to your data

  2. Now click the visual style button and change the node labels to reflect the new GO annotation

  3. Look at the network. Do proteins with similar functions cluster together?

  4. Change the labels back to reflect the gene names

Integrating gene expression levels

We happen to have some gene expression data on yeast, complete with p-values. The data are given as the logarithm of the ratio between the expression level in wildtype yeast and in a mutant in which GAL4 is knocked out (the "gal4RGexp" column). The data file also contains p-values (the "gal4RGsig" column) describing the statistical significance of each change in expression (i.e. how reliable each measurement is). The file contains data on two other mutants as well, but we'll use only the GAL4 mutant in this exercise.

  1. In the top menu, select "File" -> "Load" -> "Expression Matrix File". Select the file galExpData.pvals in the "sampleData" directory

  2. We have now loaded some microarray data. To see any difference, however, we'll need to define a new "Mapping" style. Open the "Visual Properties" window on the "Node Color" tab and click on the "New" button, then select a "Continuous Mapper". You'll need to name the new mapper; call it e.g. "myMap1"

  1. Now we need to describe our new mapper. First select "gal4RGexp" in the "Map Attribute" pull-down menu, then click the "Add point" button three times

  2. Define the mapper so that the color of the nodes reflects the direction of the expression change
    hint: Ideally, reproduce the settings from the screenshot to the right. Distinguish between "gal4RGexp" > -3 and "gal4RGexp" < 3

  3. By now two nodes should light up in pink. That is because these nodes represent interaction partners for which we cannot map any expression data. You can get rid of these nodes by first selecting them, and then clicking the "Hide" button. You can undo this later by clicking on the "Show All" button
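A continuous mapper interpolates between the points you added: values between two points get a blend of the two colors, and values outside the outermost points get the color of the nearest point. The sketch below illustrates the idea. The three control points (blue at -3, white at 0, red at 3), the exact shade of pink for missing data, and the function names are all assumptions made for the example; they are not taken from the screenshot or from Cytoscape itself.

```python
def blend(color_a, color_b, fraction):
    """Linearly interpolate between two RGB colors (fraction in 0..1)."""
    return tuple(round(a + (b - a) * fraction) for a, b in zip(color_a, color_b))


def continuous_color(value, points, missing=(255, 153, 153)):
    """Map a value to a color using sorted (value, RGB) control points."""
    if value is None:
        return missing  # no expression data for this node (shown as pink)
    if value <= points[0][0]:
        return points[0][1]
    if value >= points[-1][0]:
        return points[-1][1]
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x0 <= value <= x1:
            return blend(c0, c1, (value - x0) / (x1 - x0))


# Hypothetical control points: blue for low, white for zero, red for high.
POINTS = [(-3.0, (0, 0, 255)), (0.0, (255, 255, 255)), (3.0, (255, 0, 0))]

for log_ratio in (-4.0, -1.5, 0.0, 1.5, None):
    print(log_ratio, continuous_color(log_ratio, POINTS))
```

With a mapping like this, the direction of the expression change is read from the hue and its magnitude from the intensity.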

  1. Change the color of the node borders to reflect which genes show significant differential expression at a 99% confidence level in the GAL4 mutant
    hint: Significant genes are those with "gal4RGsig" values below 0.01. You will need to define a new continuous "Mapping" style to do this; otherwise you'll overwrite the "myMap1" style you created before

  2. Zoom in on GAL4; do the genes it interacts with show differential expression? Does GAL4 appear to be a repressor or an activator?
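The 99% confidence level used above corresponds to a simple cutoff on the p-values: a gene counts as significantly differentially expressed if its "gal4RGsig" value is below 0.01. As a sketch, with made-up gene names and p-values (the data and the function `significant_genes` are hypothetical, not taken from galExpData.pvals):

```python
def significant_genes(pvalues, alpha=0.01):
    """Return the genes whose p-value lies below the significance cutoff."""
    return sorted(gene for gene, p in pvalues.items() if p < alpha)


# Made-up p-values for illustration only.
example_pvalues = {"geneA": 0.0004, "geneB": 0.2, "geneC": 0.009, "geneD": 0.01}

print(significant_genes(example_pvalues))
```

A gene sitting exactly at 0.01 (`geneD` here) is not counted, since the hint asks for values below the cutoff.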

Final Questions

By now you should have a fairly firm grip on several features in Cytoscape. Can you use Cytoscape to answer some more complex biological questions?

  1. In the network, can you identify proteins involved in any carbohydrate metabolic pathways, and isolate them according to which pathway they belong to?
    hint: You can select the appropriate genes using the right window of the KEGG/GO annotation browser

  2. The GAL-genes are known to interact with other carbohydrate metabolic pathways. Can you identify any such links?
    hint: Galactose metabolism is found at KEGG level 2; use the KEGG/GO annotation browser to select the genes involved

  3. Can you identify other pathways/mechanisms which may be affected by galactose utilization?

  4. To which journal would you send your network?


  1. T. Ideker, V. Thorsson, J. A. Ranish, R. Christmas, J. Buhler, J. K. Eng, R. Bumgarner, D. R. Goodlett, R. Aebersold, L. Hood, Science 292, 929 (2001)
  2. D. Lohr, P. Venkov, J. Zlatanova, FASEB J. 9, 777 (1995)
  3. R. J. Reece, Cell Mol. Life Sci. 57, 1161 (2000)
  4. R. Wieczorke et al., FEBS Lett. 464, 123 (1999)
  5. M. Johnston, J. S. Flick, T. Pexton, Mol. Cell. Biol. 14, 3834 (1994)
  6. I. H. Greger, N. J. Proudfoot, EMBO J. 17, 4771 (1998)
  7. G. Peng, J. E. Hopper, Mol. Cell. Biol. 20, 5140 (2000)
  8. J. R. Rohde, J. Trinh, I. Sadowski, Mol. Cell. Biol. 20, 3880 (2000)
  9. S. Rudoni, I. Mauri, M. Ceriani, P. Coccetti, E. Martegani, Int. J. Biochem. Cell Biol. 32, 215 (2000)
  10. L. Fu, A. Miseta, D. Hunton, R. B. Marchase, D. M. Bedwell, J. Biol. Chem. 275, 5431 (2000)
  11. W. Zheng, H. E. Xu, S. A. Johnston, J. Biol. Chem. 272, 30350 (1997)