Exercise on data integration

Introduction:

This exercise will focus on using the software Cytoscape, and its application to systems biology. Cytoscape is a Java-based, open source application for visualizing molecular interaction networks and for integrating these with other data such as gene expression values, etc. The exercise will start by covering the most basic features in Cytoscape, and will slowly move towards trying to use the tool to form some biological conclusions. Throughout all of it we will use data on the galactose utilization pathway in yeast.

(Note: In this exercise the term "node" will refer to genes and/or proteins, whereas the term "edge" will be used to describe interactions between nodes in a network regardless of their type. This convention is used in most publications on biological interaction networks and in the Cytoscape documentation, where the terms are used interchangeably.)


Background: The Galactose Utilization Pathway

Figure and text from Ideker et al. Science 2001 (1)

As shown in the Figure above, galactose utilization consists of a biochemical pathway that converts galactose into glucose-6-phosphate and a regulatory mechanism that controls whether the pathway is on or off. This process has been reviewed extensively (2, 3) and involves at least three types of proteins. A transporter gene (GAL2) encodes a permease that transports galactose into the cell; several other hexose transporters (HXTs) may also have this ability (4). A group of enzymatic genes encodes the proteins required for conversion of intracellular galactose, including galactokinase (GAL1), uridylyltransferase (GAL7), epimerase (GAL10), and phosphoglucomutase (GAL5/PGM2). The regulatory genes GAL3, GAL4, and GAL80 exert tight transcriptional control over the transporter, the enzymes, and to a certain extent, each other. GAL4p is a DNA-binding factor that can strongly activate transcription, but in the absence of galactose, GAL80p binds GAL4p and inhibits its activity. When galactose is present in the cell, it causes GAL3p to associate with GAL80p. This association causes GAL80p to release its repression of GAL4p, so that the transporter and enzymes are expressed at a high level.

Although these genes and interactions form the core of the GAL pathway, the complete regulatory mechanism is more complex (5-8) and involves genes whose roles in galactose utilization are not entirely clear (9, 10). For instance, the gene GAL6 (LAP3) functions predominantly in a drug-resistance pathway, but can suppress transcription of the GAL transporter and enzymes under certain conditions and may itself be transcriptionally controlled by GAL4 (11).

Overview:

In this exercise we will integrate gene expression data from gene deletion studies with a protein-protein interaction network. In the study by Ideker et al. Science 2001, the yeast genes GAL1, GAL4, and GAL80 (among others) were analyzed for their importance in the galactose utilization pathway. In a gene deletion study, a gene is deleted and the expression of every gene in the mutant is compared to its expression in the wild type. The result is reported as a log10 expression ratio (mutant/wild type).
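
As a small illustration of the reported quantity, the sketch below computes a log10 expression ratio from two intensity values. The function name and the numbers are made up for illustration and are not taken from the Ideker et al. data.

```python
import math

def log10_ratio(mutant, wildtype):
    """Return log10(mutant / wildtype) for two positive expression values."""
    if mutant <= 0 or wildtype <= 0:
        raise ValueError("expression values must be positive")
    return math.log10(mutant / wildtype)

# Hypothetical intensities, chosen only to show the scale of the values.
examples = {
    "unchanged": (500.0, 500.0),
    "ten-fold higher in mutant": (5000.0, 500.0),
    "hundred-fold higher in mutant": (50000.0, 500.0),
}
for label, (mutant, wildtype) in examples.items():
    print(label, round(log10_ratio(mutant, wildtype), 3))
```

Because the scale is logarithmic, each unit step corresponds to a ten-fold change in the underlying ratio.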


Part I. Loading network and expression data

STEP 0: Download the files you will need for this exercise.

STEP 1: Start Cytoscape and import the network "galFiltered.sif". If you had trouble with the file extension, your Cytoscape installation's sampleData directory will already have this file.

  • Use File -> Import -> Network (multiple file types)...

Your network will contain a combination of protein-protein (pp) and protein-DNA (pd) interactions.
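
For orientation, each line of a SIF (simple interaction format) file lists a source node, an interaction type, and one or more target nodes. The sketch below parses such lines into edges and counts them by type. The helper `parse_sif` is hypothetical (it is not part of Cytoscape), and the three sample lines are made up for illustration rather than copied from galFiltered.sif.

```python
from collections import Counter

def parse_sif(lines):
    """Yield (source, interaction, target) tuples from SIF-formatted lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and lines that name only a single node
        source, interaction, targets = fields[0], fields[1], fields[2:]
        for target in targets:
            yield (source, interaction, target)

# Illustrative lines only; a line may list several targets for one source.
sample = [
    "YPL248C pd YBR020W",
    "YPL248C pd YBR019C YBR018C",
    "YPL248C pp YML051W",
]
edges = list(parse_sif(sample))
counts = Counter(interaction for _, interaction, _ in edges)
print(counts["pd"], "protein-DNA edges,", counts["pp"], "protein-protein edges")
```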

STEP 2: Import expression data table: File -> Import -> Attribute from Table (Text/MS Excel)… and select the "galExpData.txt" file you downloaded in STEP 0.

In the "Import Attribute from Table" dialog, check "Show Text File Import Options" and then check "Transfer first line as attribute names".

This file contains gene expression measurements for three knock-out perturbation experiments. In each experiment, expression was measured in a strain with a different gene knocked out (gal1, gal4, or gal80).
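
To make the "first line as attribute names" option concrete, the sketch below reads a small table whose header row supplies the attribute names. It assumes a tab-delimited layout; the key column name "GENE" and all values are hypothetical, and the real galExpData.txt contains more columns and rows.

```python
import csv
import io

# Hypothetical tab-delimited excerpt; the first line holds the attribute names.
text = (
    "GENE\tLFC_gal80\tpval_gal80\n"
    "YBR020W\t2.0\t0.0001\n"
    "YPL248C\t0.1\t0.5\n"
)

reader = csv.DictReader(io.StringIO(text), delimiter="\t")
attributes = {}
for row in reader:
    node_id = row.pop("GENE")  # the key column links each row to a network node
    attributes[node_id] = {name: float(value) for name, value in row.items()}

print(sorted(attributes))
print(attributes["YBR020W"])
```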

STEP 3: Now we will use the 'Data Panel' to browse through the expression data (node attributes), as follows.

  1. Select some node in the Cytoscape canvas.
  2. In the Data Panel, click the Select Attributes button (top left table icon of the Data Panel), and select the attributes "LFC_gal1", "LFC_gal4", and "LFC_gal80". These are the log-ratios (log10), where the ratio is the deletion mutant expression relative to wild-type expression.

Q1: What is the interpretation of negative values in the "LFC_xxx" attributes?


Part II. Coloring nodes

It is common to use expression data in Cytoscape to set the visual attributes of the nodes in a network. This visualization can portray functional relationships and experimental response at the same time. The steps for doing this are as follows:

STEP 4: To set visual properties: select the "VizMapper" in the Control Panel.

STEP 5: In the VizMapper manager window, click the button to create a new visual style (see figure) and name it something like "GalStyle". The new style starts as a duplicate of the default style.

STEP 6: Set the "Node Color" attribute as follows:

  1. In the pull-down next to "Node Color", select "LFC_gal80" (see figure).

  2. Under the associated "Mapping Type", select a "Continuous Mapping"
  3. Click on the "Graph View" field to bring up the mapping editor (see figure below).

  4. Add 1 break point (triangle) and move the 3 of them to -1, 0, and 1 respectively. You can do this with the help of the "Range Setting". Expression log-ratios range from about -3 to +3 in this study (log10, so 1/1000X to 1000X fold-change).
  5. By double-clicking on the range handles (small triangles), set the colors (you should only need 3 colors for all 5 range handles), then close the window.


  6. Note that the default node color of pink may fall within this spectrum. A useful trick is to choose a color outside this spectrum to distinguish nodes with no expression value defined. Under Defaults, click anywhere on the image to open the default editor. Then set the "NODE_FILL_COLOR" default to grey and click "Apply".
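
The idea behind a continuous color mapping can be sketched as linear interpolation between the colors assigned to the break points, with values beyond the outermost break points taking the end colors. This is a minimal sketch of the concept, not Cytoscape's actual implementation, and the blue/white/red color choice is an assumption.

```python
def interpolate_color(value, points):
    """Linearly interpolate an RGB color for value.

    points is a list of (break_point, (r, g, b)) pairs sorted by break_point.
    Values outside the break points receive the nearest end color.
    """
    if value <= points[0][0]:
        return points[0][1]
    if value >= points[-1][0]:
        return points[-1][1]
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x0 <= value <= x1:
            t = (value - x0) / (x1 - x0)  # fraction of the way from x0 to x1
            return tuple(round(a + t * (b - a)) for a, b in zip(c0, c1))

# Assumed colors: blue at -1, white at 0, red at +1 (break points from STEP 6).
points = [(-1.0, (0, 0, 255)), (0.0, (255, 255, 255)), (1.0, (255, 0, 0))]
for log_ratio in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(log_ratio, interpolate_color(log_ratio, points))
```

With three colors, every log-ratio below -1 or above +1 is drawn in the corresponding end color, which is why three colors suffice for all the range handles.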

Q2: Use this visualization to identify the gene that is the most up-regulated in the "gal80" knock-out experiment.


Part III. Using p-values

Now we will use the p-values of the expression changes in setting visual properties, by setting the node size based on them.

STEP 7: The p-value is a measure of how likely it is that a given expression change has happened by chance. Hence, p-values (e.g. "pval_gal80") range from 0 to 1, as they should, and the log10(p-values) (e.g. "logp_gal80") range from -infinity to 0. Select some nodes and look at their relative expression and p-values in the Data Panel. You can sort the list (up or down) by clicking on the column headings.
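
The relationship between a p-value and its log10 transform can be checked with a few lines of code. This is a minimal sketch; the function name and the example p-values are chosen for illustration and are not taken from the data file.

```python
import math

def log10_p(p_value):
    """Return log10 of a p-value in (0, 1]; a p-value of 0 would map to -infinity."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    return math.log10(p_value)

# Smaller p-values (more significant changes) give more negative logs.
for p in (1.0, 0.05, 0.001):
    print(p, round(log10_p(p), 2))
```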

STEP 8: Now, we will explore setting node size according to log10(p-values). Bigger nodes will then reflect more significant changes in expression for the attribute you select.

Note: Do not click on the "Add" button in this step. Clicking "Add" many times, as tends to happen, is certain to cause trouble.

  1. Double-click the "Node Size" tab in the VizMapper setting window.
  2. In the pull-down menu next to "Node Size", select "logp_gal80".
  3. In the pull-down menu under Mapping Type select Continuous Mapping
  4. Click anywhere on the "Graphic view" row to bring up the mapping editor.

    The y-axis represents the node size while the x-axis is the range of the attribute being mapped (-19.76 to 0 in this case).
  5. Double-click on the lower bound handle (solid red square) and set its size to 60. Set the upper bound size to 20. Slide the lower break point to -6 using the black triangle, or by using the "Range Setting" after selecting the break point (black triangle). Set the upper break point to -1. Set the lower break point size to 60 by double-clicking on the open red square, and set the upper break point size to 20 in the same way. You should see something like the following figure.

    Close the mapping editor dialog.
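
The resulting size mapping can be sketched as a piecewise-linear function of log10(p-value): size 60 at or below -6, size 20 at or above -1, and a straight line in between. This is an illustration of the settings in STEP 8 under that assumption, not Cytoscape's own code, and the function name is hypothetical.

```python
def node_size(logp, low=(-6.0, 60.0), high=(-1.0, 20.0)):
    """Map a log10(p-value) to a node size using two break points.

    low and high are (break_point, size) pairs; values outside the
    break points take the corresponding bound size.
    """
    (x0, s0), (x1, s1) = low, high
    if logp <= x0:
        return s0
    if logp >= x1:
        return s1
    t = (logp - x0) / (x1 - x0)  # fraction of the way from x0 to x1
    return s0 + t * (s1 - s0)

for logp in (-19.76, -6.0, -3.5, -1.0, 0.0):
    print(logp, node_size(logp))
```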

Q3: What color are the smallest nodes in your network?


Part IV. Biological analysis scenario

This section presents one scenario on how expression data can be combined with network data to tell a biological story. But first we need to load more relevant gene names.

STEP 9: Load the "ORF2name.na" node attribute file to get common gene names associated with our systematic ORF names used to define the network.

  • File -> Import -> Node Attributes...
  • then Open (check your Desktop if you cannot find the file you just downloaded)
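
Conceptually, the node attribute file is a lookup table from systematic ORF names to common gene names. The sketch below assumes the legacy Cytoscape node attribute layout (an attribute name on the first line, then one "nodeID = value" pair per line); the helper name is hypothetical and the sample lines are illustrative rather than copied from ORF2name.na.

```python
def parse_node_attributes(lines):
    """Parse 'nodeID = value' lines; the first non-blank line names the attribute."""
    lines = [line.strip() for line in lines if line.strip()]
    attribute_name = lines[0].split()[0]
    mapping = {}
    for line in lines[1:]:
        node_id, _, value = line.partition("=")
        mapping[node_id.strip()] = value.strip()
    return attribute_name, mapping

# Illustrative content in the assumed format.
sample = [
    "GeneName",
    "YPL248C = GAL4",
    "YML051W = GAL80",
    "YBR020W = GAL1",
]
name, orf_to_gene = parse_node_attributes(sample)
print(name, orf_to_gene["YPL248C"])
```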

STEP 10: In the VizMapper, find the Node Label attribute and set it to "GeneName" (this is the attribute you just loaded) and select passthrough mapping.

STEP 11: Now select the neighborhood of GAL4 and create a new sub-network.

  1. In the Control Panel, select the Filter tab and create a new filter (named "NodeName", for example). Select the attribute "node.GeneName" and add the filter. Type GAL4 in the new text box and then click Apply.
  2. To focus the view on the selected node, click "Zoom Selected Region" in the menu bar. Then zoom out with the '-' magnifying glass. You should see something like:

  3. While the GAL4 node is selected: Select -> Nodes -> ‘First neighbors of selected nodes’
  4. Create a child sub-network: File -> New -> Network -> ‘From selected nodes, all edges’
  5. In the new sub-network, apply a graph layout algorithm using the yFiles Hierarchic layout.
  6. Use the VizMapper to change the Edge Color attribute with a Discrete Mapping on the "interaction" attribute. This will distinguish regulatory interactions, "pd", from protein-protein interactions, "pp".

    Now set the Edge Target Arrow Shape in the VizMapper with a Discrete Mapping on the "interaction" attribute again.

    Notice that all three dark red nodes (highly induced genes) are in the same region of the graph. With a little exploration in the node attribute browser, you should see the following:
    • The two genes that interact with all three highly induced genes are GAL11 (a general transcription cofactor with many interactions) and GAL4.
    • Both GAL4 and GAL11 show small changes in expression, and neither change is statistically significant, suggesting that the critical change affecting GAL1, GAL7, and GAL10 might be somewhere else in the network.
    • GAL4 interacts with GAL80, which shows a significantly lower level (GAL80 was deleted after all).
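
The selection and sub-network steps in STEP 11 can be sketched in a few lines: find the first neighbors of the selected node, then keep every edge whose two endpoints both lie in that set. The function names are hypothetical and the toy edge list is illustrative only; it is not the content of galFiltered.sif.

```python
def first_neighbors(edges, selected):
    """Return the selected nodes plus every node directly connected to them."""
    nodes = set(selected)
    for source, _, target in edges:
        if source in selected:
            nodes.add(target)
        if target in selected:
            nodes.add(source)
    return nodes

def induced_subnetwork(edges, nodes):
    """Keep all edges whose endpoints are both in nodes ('From selected nodes, all edges')."""
    return [edge for edge in edges if edge[0] in nodes and edge[2] in nodes]

# Toy network: (source, interaction, target) tuples with made-up connectivity.
edges = [
    ("GAL4", "pd", "GAL1"),
    ("GAL4", "pp", "GAL80"),
    ("GAL4", "pp", "GAL11"),
    ("GAL11", "pd", "GAL1"),
    ("GAL80", "pp", "GAL3"),
]
nodes = first_neighbors(edges, {"GAL4"})
sub_edges = induced_subnetwork(edges, nodes)
print(sorted(nodes))
print(sub_edges)
```

Note that the edge between two neighbors of the selected node is kept, while the edge leading to a second-degree neighbor is dropped.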

Q4: If GAL80 levels are low (or absent) but most of the other genes linked to GAL4 show significant levels of induction, what does this say about the role of Gal80p? Is Gal80p activating or inhibiting the activity of Gal4p?