Classification by Machine Learning Approaches

Bioinformatics and Gene Discovery Course #27616

 

 

Lecture Notes:

The lecture notes are available for download here.

The exercise solutions and summary slides will be available here after the exercises.

 

 

Background Information:

Read this review article by Kapetanovic et al. for a short overview and some applications of machine learning in bioinformatics.

 

 

Exercise:

During the exercise you will learn to use a data-mining software package (WEKA) and build your own classification model for splice site prediction.

 

We will test how feature selection affects a classifier for splice site prediction. First we will build a model using the full feature set. Then we will build a model from the same dataset after feature selection and compare the two resulting classifiers.

 

 

The WEKA Software Package:

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License. More information can be found on the project homepage http://www.cs.waikato.ac.nz/~ml/.

 

If you are using a notebook borrowed from CBS for the exercises (that is, you are using Windows and do not have administrative rights on your machine), click here to download a copy of the WEKA package. Once the download has finished, double-click the executable file and install it into a location where you have write permission (e.g. ‘My Documents\Weka\’) and NOT the suggested ‘C:\Program Files\...’.

 

If you are using your own Windows computer for the exercises and you already have a Java Virtual Machine installed (at least Java version 1.4; you can test this at http://www.java.com/en/download/help/testvm.xml), you can use the same link to download and install the program. If you also need Java, use this link for a combined installation file of Weka and Java.

 

For any other operating system, there is a plain zip file containing the Weka program files here. Unzip the contents, change to that directory and start Weka with ‘java -jar weka.jar’. You will need to have a Java VM installed for this to work.

 

 

Documentation for the WEKA package is included in its installation folder; you can also download it here. Please try to familiarize yourself with the program before the start of the exercise.

A detailed PowerPoint presentation created by one of the WEKA authors is available from their homepage or here.

 

You can start experimenting with WEKA with the provided example datasets in the folder ‘data’ in the WEKA installation folder, e.g. the file weather.nominal.arff.
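
If you want to look at an .arff file outside WEKA, the following Python sketch may help. It reads the header and data rows of a simple, dense ARFF file. The function name parse_arff and the small sample text are made up for illustration (the sample is only modeled on the weather example, not copied from it), and the parser ignores quoting, sparse rows and other ARFF features that WEKA itself supports.

```python
def parse_arff(text):
    """Parse a simple, dense ARFF document into (relation, attributes, rows).

    Simplified sketch: no quoted values, no sparse data, no date attributes.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            # After @data, every line is one comma-separated instance
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Illustrative sample only (hypothetical relation name and rows)
SAMPLE = """\
% toy example in the style of a nominal weather dataset
@relation toy_weather
@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,FALSE,no
overcast,FALSE,yes
rainy,TRUE,no
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), "attributes,", len(rows), "instances")
```

By WEKA's default convention the last attribute (here ‘play’) is treated as the class to be predicted.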

 

 

Datasets for building a Splice Site Predictor:

We will use datasets of human splice sites that have been used for training of the GENIE gene prediction system. Find information about these datasets at http://www.fruitfly.org/sequence/human-datasets.html.

 

The original sequence files in FASTA format have been converted to represent the four DNA bases in a binary fashion

 

A:   1 0 0 0

T:   0 1 0 0

C:   0 0 1 0

G:   0 0 0 1

 

and were subsequently converted into the .arff format used by WEKA (adapted from Yvan Saeys).
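
As an illustration of the binary encoding above, here is a minimal Python sketch that turns a DNA string into a list of 0/1 features and formats it as one comma-separated data row. The function names, the class label ‘donor’ and the row layout are assumptions made for this example; the attribute names and ordering in the provided .arff files may differ.

```python
# Binary code for each base, exactly as listed in the table above
BASE_CODES = {
    "A": (1, 0, 0, 0),
    "T": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
}


def encode_sequence(seq):
    """Return the flat list of binary features for a DNA sequence."""
    bits = []
    for base in seq.upper():
        # Raises KeyError for anything other than A, T, C or G
        bits.extend(BASE_CODES[base])
    return bits


def to_data_row(seq, label):
    """Format one instance as a comma-separated row (hypothetical layout)."""
    return ",".join(str(bit) for bit in encode_sequence(seq)) + "," + label


print(encode_sequence("GT"))
print(to_data_row("GT", "donor"))
```

Each sequence position contributes four features, so a window of n bases yields 4n binary attributes plus the class.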

 

Acceptor Splice Site files:

Training Set:                                        acceptors_trainset.arff

Test Set:                                              acceptors_testset.arff

Donor Splice Site files:

Training Set:                                        donors_trainset.arff

Test Set:                                              donors_testset.arff

Training Set (different encoding):      donor_trainset_diffencod.arff

 

 

Tasks:       

-          Building an acceptor splice site predictor will be demonstrated during the exercises

-          Based on this experience you will build your own donor splice site predictor

 

·         How do different classifiers perform with this dataset (donors_trainset.arff)?

(Load the dataset under ‘Preprocess’ (first tab on the top of the WEKA Explorer) → ‘Open file…’)

Use e.g. the three classifiers J48, NaiveBayes and SMO.

J48 (‘Classify’ (second WEKA tab on the top), Classifier ‘Choose’ → ‘trees’ → ‘J48’).
To start training the model, set cross-validation to 5 folds and click ‘Start’.

Naïve Bayes (Classifier ‘Choose’ → ‘bayes’ → ‘NaiveBayes’)

SMO (Support Vector Machine) (Classifier ‘Choose’ → ‘functions’ → ‘SMO’).

Which one performed best?

Compare the different reported performance measures.

 
Always use 5-fold cross-validation for consistency.

Save the text output (in particular the last section) of each classifier you want to report on for easier comparison.
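
To make the 5-fold setting and the reported performance measures more concrete, here is a rough Python sketch. It shows how instances can be split into k folds (each fold serves once as the test set) and how common measures follow from the counts of a two-class confusion matrix. The function names and the example counts are invented for illustration; this is not WEKA's implementation, and in particular the split below is not stratified by class, which WEKA's own procedure may be.

```python
import random


def k_fold_indices(n_instances, k=5, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k roughly equal parts
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def binary_metrics(tp, fp, tn, fn):
    """Compute common measures from confusion-matrix counts.

    Assumes at least one instance in total.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # also called TP rate
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "fp_rate": fp_rate,
        "f_measure": f_measure,
    }


for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), "training /", len(test_idx), "test instances")

# Hypothetical counts, pooled over all folds
print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

When the two classes are very unbalanced, accuracy alone can look good even for a poor classifier, which is one reason to compare several measures.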

 

·         Look at the file donor_trainset_diffencod.arff with a text editor.
Can you see how the feature encoding differs from the one previously used?
How does this different encoding affect performance with the different classifiers?

 

·         Does feature selection improve classification performance in our case (take the original donors_testset.arff)?
(Use the ‘Preprocess’ tab in WEKA, Filter ‘Choose’ → ‘filters’ → ‘supervised’ → ‘attribute’ → ‘AttributeSelection’, then
press ‘Apply’ (on the right of the text box) to apply the filter to your currently loaded dataset.)

How does it affect the different classifiers you used?

If you have time, test different feature selection schemes.
(Click on the AttributeSelection textbox to change the default attribute selection properties)
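
To give a feeling for what a feature selection scheme does, the following Python sketch ranks binary features by information gain with respect to the class. This is a deliberately simple stand-in technique, not the default scheme of WEKA's AttributeSelection filter, and the function names and the toy data are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Return (gain, feature_index) pairs, highest gain first."""
    n_features = len(rows[0])
    gains = [(information_gain([row[i] for row in rows], labels), i)
             for i in range(n_features)]
    # sorted() is stable, so features with equal gain keep index order
    return sorted(gains, key=lambda pair: -pair[0])


# Toy data: feature 0 determines the class, features 1 and 2 do not
rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
labels = ["site", "site", "none", "none"]
for gain, index in rank_features(rows, labels):
    print(f"feature {index}: information gain {gain:.3f}")
```

A ranking like this scores each feature on its own; subset-based schemes additionally take redundancy between features into account, which can matter for neighbouring sequence positions.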