Exercise in Probe Design
Written by: Carsten Friis
With parts adapted from an exercise by: Rasmus Wernersson
The aim of this exercise is to illustrate some of the considerations in the design process for oligonucleotide probes for microarrays. The exercise will look both at arrays built for the single-genome case and at arrays for pangenomics/metagenomics.
It is always a good idea to begin by familiarizing ourselves with the data. To keep matters simple and computation times manageable, we will focus on a smaller subset of genes rather than a full genome. We have selected a set of genes from five Proteobacteria, including Escherichia coli K-12 substr. MG1655, namely those involved in the flagellum. Because the flagellar genes are extremely well conserved, they are best suited for comparisons at the species level, and the other bacteria in the data set have been chosen accordingly. On an actual array one could have more than a million probes and would be able to target both well-conserved and highly variable genes, thus differentiating all the way down to the isolate level. For the sake of speed and simplicity, we will focus solely on the flagellar genes and restrict ourselves to considerably fewer probes.
- Start by copying the data directory to your working directory
cp -r /home/projects/carsten/MVNexercises/ProbeDesign ~/
- Then enter the data directory
cd ~/ProbeDesign
Designing probes for a single genome
To begin with we will focus on just one genome/organism, namely Escherichia coli K-12 substr. MG1655, and forget the other four for a while.
Do-it-yourself probe design
To illustrate the concepts, let us spend a little time doing a hands-on examination of a few candidates. If you cannot remember the selection criteria from the lecture, look up the slides on the course programme page.
- Look at the following candidates for probe targets for the fliD gene:
Note: The numbers in the identifier correspond to the position in the gene.
- Use BLAST to check the sequences for cross-hybridization
You can use the NCBI Blast Server for this. Select the nt database and specify 'Escherichia coli K12 (taxid:83333)' in the organism field.
Alternatively, you can check for cross-hybridization by running BLAST on the UNIX command line against E. coli K-12:
echo "GCAAAAAGCGACGCTAACCCCCATTTCAAATCAGCAATCGTCGTTTACCGC" | saco_convert -I raw -O fasta | blastall -p blastn -d E.coli_K-12 -F F | less
For convenience we use the program less to view the BLAST output. You can scroll up and down using 'j' and 'k'. You quit less by typing 'q'. You can check the other sequences by replacing the nucleotides inside the quotation marks.
- Which is the best probe candidate? Why?
- How many of the probes do you think would give a good-quality signal?
- Trick Question: What is the problem with the 'fliD_???-???' probe?
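Alongside the BLAST check, a few basic sequence properties can be computed directly from a candidate. The following is a minimal Python sketch and not part of the course material: the function names are made up for illustration, and which properties matter (and which thresholds apply) should be taken from the lecture slides.

```python
from itertools import groupby


def gc_content(seq):
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq):
    """Return the length of the longest run of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq.upper()))


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# The candidate used in the BLAST command above.
candidate = "GCAAAAAGCGACGCTAACCCCCATTTCAAATCAGCAATCGTCGTTTACCGC"
print(f"length: {len(candidate)}")
print(f"GC content: {gc_content(candidate):.1f}%")
print(f"longest homopolymer run: {longest_homopolymer(candidate)}")
print(f"reverse complement: {reverse_complement(candidate)}")
```

Swapping in another candidate sequence makes it easy to compare the candidates side by side before looking at the BLAST hits.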
Designing probes with OligoWiz
We will use OligoWiz to create a standard probe set targeting the flagellar genes in E. coli K-12 substr. MG1655. To do this we look to the OligoWiz 2.0 server which is available here:
OligoWiz 2.0 Server
Take your time to study the instructions. To actually run OligoWiz you need to download the Java-based client from the OligoWiz page and execute it on your local machine. To do this, follow these instructions:
- Open the OligoWiz 2.0 webpage in a new browser window
- Open the "instructions" help page from the main OligoWiz webpage
It is a good idea to keep the help page open throughout this exercise. The help page contains answers to many of the questions that might arise.
- Optional: Install Java 1.4.1 or better
If you run this exercise from your own computer, you may need to install a suitable version of Java. OligoWiz requires that Java is already installed on your machine. For most PCs this is usually the case, but if not, Java is freely available for most systems. Please follow the instructions on the OligoWiz webpage.
- Download the OligoWiz client (the Java program) from the OligoWiz webpage
- Launch the OligoWiz 2.0 client
On Windows and Mac simply double-click on the JAR file.
- Download the dataset
You can download the data set described above as a FASTA format file here. In some browsers you may have to right-click and choose 'Save Target As...'.
- Inspect the data file
Use a simple text editor to inspect the file (e.g. Notepad or Wordpad on Windows, TextEdit or BBEdit on Mac, NEdit on UNIX).
- Get ready to launch the query to the OligoWiz 2.0 server
IMPORTANT: The OligoWiz help page you opened earlier contains a quick walk-through of how to launch a query, near the top of the page.
You launch the query by completing the following steps:
- Specify the input file
Notice: Some browsers may give the file a new extension (e.g. .txt). If you cannot see the file in the file-chooser dialog, select "All files" as the type of files you want to see.
- Accept the default result name suggested by OligoWiz - or choose a new one
- Select Escherichia coli (subspecies "K-12") as the species database to use
You can find the correct entry through the Taxonomy tree (E. coli belongs to the Gammaproteobacteria), or through the Alphabetically sorted tree (sorted by genus name).
- Once the correct database is selected, press the "?" button positioned just above the species database tree
This will bring up the "Databases" webpage (also linked from the main OligoWiz 2.0 webpage) and jump directly to the relevant database entry. [Launching an external browser _may_ not work on some Linux/UNIX configurations. It does work on Windows and Mac.]
- Adjust parameters
- Allow the length to vary between 45 and 55bp (set Aim oligo length to 50bp)
- Let OligoWiz determine the optimal Tm (default)
- Use default cross-hybridization settings
- Select "Random primed" as the position score
Launch query by pressing "Submit query"
Wait for OligoWiz to process and download the result
The query should be complete within one minute. When everything is ready the query status changes to "completed - click to view". Double-click the query to view the result.
Have a look at the sequence/probe inspection interface
Try to select different sequences in the leftmost table, and have a look at the scoring graphs. Notice how the weighting of the different scores can be changed.
Select probes for all the transcripts
Press the "Place oligos..." button to open up the probe placement dialog. At this point we will ignore the filter-options and only experiment with the distance settings.
Initially try setting the maximum number of probes per sequence to 1, and press "Apply to all". Inspect the result in the main window.
Notice how the length of the probes may vary in order to minimize the variation in Tm.
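The idea of trading probe length for a more uniform Tm can be illustrated with a short Python sketch. It is not how OligoWiz computes Tm: `approx_tm` uses a simple GC-based rule of thumb, and `best_length`, the demo sequence, and the target Tm of 70 are all made up for illustration. Only the 45 to 55 bp window is taken from the length settings used above.

```python
def approx_tm(seq):
    """Rough GC-based melting temperature estimate in degrees Celsius.

    Assumption for illustration only: Tm = 64.9 + 41 * (GC - 16.4) / N,
    where GC is the number of G and C bases and N the oligo length.
    Salt concentration and nearest-neighbour effects are ignored.
    """
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq)
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def best_length(target, start, target_tm, min_len=45, max_len=55):
    """Pick the probe length at `start` whose Tm is closest to `target_tm`.

    Returns None if not even the shortest probe fits in the sequence.
    Ties are resolved in favour of the shorter probe.
    """
    candidates = []
    for length in range(min_len, max_len + 1):
        probe = target[start:start + length]
        if len(probe) < length:  # ran off the end of the sequence
            break
        candidates.append((abs(approx_tm(probe) - target_tm), length))
    if not candidates:
        return None
    return min(candidates)[1]


# Hypothetical 105 bp sequence, not taken from the exercise data.
gene = "ATGGCGTTAACCGGCATTACG" * 5
for start in (0, 10, 20):
    print(start, best_length(gene, start, target_tm=70.0))
```

Letting the length float within a small window gives each position a chance to land near the common target Tm, which is why the exported probes are not all exactly 50 bp long.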
Export the probe sequences
In the main window press the "Export oligos..." button to open up the export dialog. Select FASTA format as the file format and save all the probes to a file.
The resulting file can be inspected in a text-editor.
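An exported FASTA file can also be summarized programmatically. The sketch below is a minimal Python FASTA reader, offered as an illustration only; the two records in `example` are invented and stand in for a real export.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example standing in for an exported probe file.
example = io.StringIO(">probe_1\nACGTACGT\nACGT\n>probe_2\nGGGCCC\n")
for name, seq in read_fasta(example):
    print(name, len(seq))
```

To read a real export, pass an open file instead of the `StringIO` object, for example `with open("probes.fasta") as handle:` where `probes.fasta` is a placeholder for whatever name you chose in the export dialog.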
Experiment with the probe distance settings
Try to select more than one probe per sequence by adjusting the distance parameters and pressing "Apply to all" again. Notice that the minimum distance criterion can be used to allow/disallow overlapping probes.
- Look for the fliD gene. How do the probes suggested by OligoWiz compare to the ones you investigated above?
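The effect of the minimum distance setting on probe placement can be sketched in a few lines of Python. This is a simplified greedy strategy written for illustration, not the algorithm OligoWiz uses. It assumes that distance is measured between probe start positions, and the per-position scores are invented.

```python
def place_probes(scores, max_probes, min_distance):
    """Greedily pick probe start positions from per-position scores.

    scores[i] is the (hypothetical) total score of a probe starting at
    position i; higher is better. Any two chosen starts must be at least
    `min_distance` bases apart, so a minimum distance at least as large
    as the probe length rules out overlapping probes.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for start in ranked:
        if len(chosen) == max_probes:
            break
        if all(abs(start - other) >= min_distance for other in chosen):
            chosen.append(start)
    return sorted(chosen)


# Invented scores for six possible start positions.
scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3]
print(place_probes(scores, max_probes=2, min_distance=1))
print(place_probes(scores, max_probes=2, min_distance=3))
```

With a small minimum distance the two best positions are picked even though they are adjacent; raising it forces the second probe further away at the cost of a lower score.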
Designing probes for use with more than one genome
With the price of custom microarrays coming down and the requirement to order a minimum number of chips per design disappearing, it may seem pointless to design a chip for more than one genome. After all, if you have several organisms you want to investigate, why not design a chip for each? It is easy with OligoWiz, and there is no longer any cost incentive to use one design rather than several.
There are times, however, when we may want to compare results obtained from one strain with those of another, a task not easily done if the array designs are different! We may also have an organism for which the complete genome is unavailable, while the genomes of similar organisms are available. These are some of the applications of multispecies, or pan-genomic, arrays. With the ever-growing popularity of metagenomics it is also tempting to target an array not at a genome, but at a specific environment, giving us a metagenomic array.
As stated above, we have prepared a data set consisting of 25 genes from the flagellum taken from five organisms. You have already worked with the genes from E. coli K-12, but the full set spans these five organisms:
- Escherichia coli str. K-12 substr. MG1655
- Escherichia coli O157:H7 EDL933
- Pseudomonas aeruginosa PAO1
- Salmonella enterica subsp. enterica serovar Typhi str. Ty2
- Campylobacter jejuni subsp. jejuni NCTC 11168
Do-it-yourself design for pan-genomic arrays
Let us look at the list of probe candidates for the fliD gene again:
- Use BLAST again, but this time check for cross-hybridization against all five genomes
This time you are better off running BLAST on the UNIX command line, since this allows you to align your sequences specifically against these five organisms:
echo "GCAAAAAGCGACGCTAACCCCCATTTCAAATCAGCAATCGTCGTTTACCGC" | saco_convert -I raw -O fasta | blastall -p blastn -d Five_proteobacter -F F | less
For convenience we use the program less to view the BLAST output. You can scroll up and down using the 'j' and 'k' keys. You quit less by pressing 'q'. You can check the other sequences by replacing the nucleotides inside the quotation marks.
- Which are now ideal probe candidates? Why?
- How many of the probes do you think would give a good-quality signal for a pan-genome array?
- Trick Question: Can you now see the purpose of the 'fliD_???-???' probe?
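When checking many candidates against five genomes, reading the pairwise BLAST output by eye becomes tedious. One option is to rerun the command with legacy BLAST's tabular output (`-m 8`) and summarize the hits per subject sequence. The Python sketch below assumes that 12-column tabular format; the thresholds, the function names, and the three demo rows are invented for illustration and are not values from the lecture.

```python
from collections import defaultdict


def parse_tabular_hits(lines):
    """Parse BLAST tabular lines (12 tab-separated columns) into dicts."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        hits.append({
            "query": cols[0],
            "subject": cols[1],
            "identity": float(cols[2]),   # percent identity
            "length": int(cols[3]),       # alignment length
            "mismatches": int(cols[4]),
        })
    return hits


def strong_hits_per_subject(hits, probe_len, min_identity=75.0, min_cover=0.5):
    """Count hits per subject that pass placeholder strength thresholds.

    A hit counts if its identity is at least `min_identity` percent and
    its alignment covers at least `min_cover` of the probe length.
    """
    counts = defaultdict(int)
    for hit in hits:
        if hit["identity"] >= min_identity and hit["length"] >= min_cover * probe_len:
            counts[hit["subject"]] += 1
    return dict(counts)


# Made-up rows with placeholder subject names, only to show the mechanics.
demo = [
    "probe\tgenome_A\t100.00\t50\t0\t0\t1\t50\t100\t149\t1e-20\t99.6",
    "probe\tgenome_A\t90.00\t20\t2\t0\t1\t20\t500\t519\t0.5\t20.1",
    "probe\tgenome_B\t94.00\t50\t3\t0\t1\t50\t200\t249\t1e-15\t80.0",
]
print(strong_hits_per_subject(parse_tabular_hits(demo), probe_len=50))
```

A probe with exactly one strong hit in each genome is a candidate for a shared target, whereas several strong hits within one genome hint at cross-hybridization.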
Automated design of oligos for pan-genomic arrays
We do not yet have a ready-to-use tool like OligoWiz for the design of pangenomic/metagenomic arrays, but we can take the development version for a test drive. Eventually this service will be made available to everyone online and it will be as easy to use as running OligoWiz was for the single genome above.
The biggest problem with the development version is speed. It will be impossibly slow for a selection of full genomes. Here, we overcome the speed issues by limiting ourselves to designing probes for the fliD gene only.
We have prepared a BLAST database called Five_proteobacter in advance. Unlike before, we now align our probes against several genomes. This has several implications. In particular, we must now consider that the probe is no longer guaranteed to be the perfect complement of its intended target. While we might have to deal with the occasional gene duplicate in the single-genome case, it will now be the rule rather than the exception that we have several intended targets for each probe, and these targets will not always be perfect complements of the probe.
To cope, we introduce the Sensitivity score which describes the affinity a given probe will have for its possible targets. The Cross-hyb score has also been replaced by the Specificity score. The score is essentially calculated the same way, but takes into account that we are now working with several genomes.
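To get a feel for why one probe can match its targets unequally well, the following Python sketch compares a probe's intended target sequence with the corresponding site in several genomes using plain ungapped percent identity. This is only a crude proxy chosen for illustration; it is not how the Sensitivity or Specificity scores are calculated, and the genome names and 10-base sequences are invented.

```python
def percent_identity(probe_target, site):
    """Percentage of matching positions between two equal-length sequences."""
    if len(probe_target) != len(site):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(probe_target.upper(), site.upper()))
    return 100.0 * matches / len(site)


def identity_per_genome(probe_target, sites):
    """Map each genome name to the identity of its site with the probe target."""
    return {
        genome: percent_identity(probe_target, site)
        for genome, site in sites.items()
    }


# Toy example: one designed target and its variants in three placeholder genomes.
probe_target = "ACGTTGCAAC"
sites = {
    "genome_1": "ACGTTGCAAC",  # perfect match
    "genome_2": "ACGTTGCTAC",  # one mismatch
    "genome_3": "ACCTTGCTAC",  # two mismatches
}
for genome, identity in identity_per_genome(probe_target, sites).items():
    print(genome, identity)
```

A real score would also have to account for where the mismatches fall and for hybridization energetics, which is what a dedicated tool is for.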
- Run the Microbial Array Designer script on the UNIX command-line
perl madwiz.pl -db Five_proteobacter fliD.in > fliD.5genomes.owz
For some unfortunate reason, it is not presently possible to run the OligoWiz client on our servers. It is thus necessary to copy the file 'fliD.5genomes.owz', which you just made, onto your own computer and run the OligoWiz client from there. In the ssh program under Windows, look for an icon resembling a yellow directory folder with blue dots on it to get a graphical interface to our servers. If this causes problems, or if you do not use the ssh program, please contact an assistant teacher.
- Transfer the data you made to your own computer
When you click on the yellow directory folder with the blue dots you should get a new window with a file interface. First, make sure you are set to transfer files as text files and not as binaries. Look for a button with the letters 'abcdef' and make sure it is pressed. Now click 'Quick connect' and specify organism.cbs.dtu.dk as host and your account name as user. You will be asked for a password; if it is entered correctly, your home directory on our server should appear in the right frame of the window. Open the ProbeDesign directory and drag and drop the file to your computer.
- Inspect the file in OligoWiz
To load the file in OligoWiz you first need to start a new instance of the OligoWiz client just like you did above. Then open the 'File' pull-down menu and select 'open OWZ data file'. You do not need to press 'Submit query' this time, because that calculation is the one we did with the script above.
- Use OligoWiz to place some oligos
Just like before, press 'Place oligos...'. For a more realistic run you might wish to consider reducing the number of oligos per gene a little, since all the probes will target the same gene, if not exactly the same version of the gene.
- Investigate the probes OligoWiz suggested
- How do they compare to the previous probes we designed?
If time allows, feel free to experiment with OligoWiz
You can adjust the weights in the Score management section to make some scores more important than others for probe selection. Melting temperature, for example, can be a problem since our Microbial Array Designer can currently only handle probes of a fixed length, so giving the Delta Tm score a higher weight might be a good idea.
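How a change in weights can reorder candidates is easy to see in a toy calculation. The Python sketch below assumes the total score is a weighted average of per-criterion scores between 0 and 1; that is an assumption made for illustration, not a description of OligoWiz internals, and all score values are invented (only the score names come from the text).

```python
def total_score(scores, weights):
    """Weighted average of per-criterion scores (each assumed to lie in 0..1)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Invented scores for two hypothetical candidate probes.
probe_a = {"Delta Tm": 0.9, "Specificity": 0.6, "Sensitivity": 0.7}
probe_b = {"Delta Tm": 0.5, "Specificity": 0.95, "Sensitivity": 0.8}

equal = {"Delta Tm": 1.0, "Specificity": 1.0, "Sensitivity": 1.0}
tm_heavy = {"Delta Tm": 3.0, "Specificity": 1.0, "Sensitivity": 1.0}

for label, weights in (("equal weights", equal), ("Delta Tm x3", tm_heavy)):
    a = total_score(probe_a, weights)
    b = total_score(probe_b, weights)
    print(f"{label}: probe_a={a:.3f} probe_b={b:.3f}")
```

With these made-up numbers, probe_b ranks first under equal weights, while tripling the Delta Tm weight moves probe_a ahead.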
Also, although it is too CPU-intensive to have all student groups run all 25 flagellar genes through at the same time, the run is quite feasible on its own. We have prepared a file for you in advance called '25genes.5genomes.owz'. Copy it to your computer and open it just like you did above. You can now try to design probes for all 25 genes.