|
Algorithms in Bioinformatics
Information for participants
GENERAL SCHEDULE
The course will in a two weeks period from Monday 28th of February - Friday 11th of March, each day from 10-17.
Lectures will be in the morning from 10.00 - 12.00, and practical exercises in the afternoon from 13.00 - 17.00.
Mornings will consist of lectures and small
practical exercises introducing the different algorithms, and afternoons will consist of programming exercises where
the algorithms will be implemented.
The main programming language will be C, and all program templates provided in the course will be written in C.
Prior knowledge of C programming is NOT required. However, basic programming skills are required to follow the course.
LITERATURE:
Compendium of review papers and selected chapters from Immunological Bioinformatics, Lund et al., MIT Press,
2005. The compendium will be handed out on the first day of the course.
Other material will be made available online during the course.
Course Programme
Please note that the programme is updated on a regular basis - click the 'refresh' button once in a while to make sure that you have the most updated information
Monday 28 of February: Course Introduction.
Introduction to course, Immune system overview, UNIX crash course 101
The over all structure of the course is described, and a short overview of the immune system is given.
Next an introduction to the UNIX operating system including an exercise
covering the most essential UNIX commands is given, as well as a short exercise introducing the
c-code template of the course.
Morten Nielsen
BACKGROUND TEXTS
- C notes by Tom Macke.
This PDF document is password protected, you get the password in the first course lecture.
- 10.00 - 10.20
- Introduction to course
- Introduction to course. [PDF] .
- 10.20 - 11.20
- Immune system overview
- Immune system overview. [PDF] .
- 11.20 - 12.30
- Unix crash course
- A UNIX/Linux crash course
- 12.30 - 13.30
- Lunch break
- 13.30 - 17.00
- C-programming crash course 101
- The development of the first c-program working using command line parsing and linked lists.
- C routine, linked listed and commandline parsing
- Details on C routines, linked lists
- Some notes on command line parsing.
[PDF].
- Doing it on your local machine
- Introduction to exercise. Sequence alignment refresher
- Some notes on Sequence alignment.
[PDF].
- The exercise
- C-programming crash course
- Answers to C-programming exercise
Tuesday 1 March: Weight matrix construction.
Weight matrix (PSSM) construction, Gibbs sampling, and Blosum matrices
The concept of sequence information and motif are introduced, and the theoretical background including sequence weighting and
pseudo counts for weight matrix construction are described.
The theoretical background for Blosum matrices is described.
Morten Nielsen
BACKGROUND TEXTS
- Immunological Bioinformatics. MIT Press. Chapter 4.
- 10.00 - 11.15
- Weight matrix construction.
[PDF].
- Handout. Estimation of pseudo counts. PLEASE PRINT AND BRING TO THE LECTURE
- NOTE. PLEASE BRING A CALCULATOR OR CELL PHONE WITH SOME SIMPLE MATH FUNCTIONS TO THE LECTURE SINCE YOU WILL BE ASKED TO
DO SOME SIMPLE MATH CALCULATIONS.
- Answer
- 11.15 - 11.35
- Gibbs sampling.
[PDF] .
- 11.35 - 12.00
- Blosum scoring matrices
[PDF] .
- 12.00 - 13.00
- Lunch break
- 13.00 - 17.00
- Implementation of PSSM construction from pre-aligned sequences including pseudo count
correction for low counts and sequence clustering
- PSSM construction and evaluation
- PSSM answers
Wednesday 2 March. Sequence alignment and data redundancy reduction algorithms..
Sequence alignment and Dynamic programming
A detailed introduction to dynamic programming and
sequence alignment is given.
In particular it is demonstrated how and why the use of sequence profiles can improve the alignment
accuracy significantly. Next, two commonly use algorithms for sequence redundancy reduction,
Hobohm 1 and Hobohm 2, are described.
Morten Nielsen
BACKGROUND TEXTS
- Immunological Bioinformatics. MIT Press. Chapter 3.
- A General method applicable to search for similarities in the amino acid sequences of two proteins. S. B. Needleman and C. D Wunsch J.Mol. Biol (1970), 48.
- Identification of common molecular subsequences. T. F. Smith and M. S. Waterman. J. Mol. Biol. (1981), 147.
- An improved algorithm for matching biological sequences. O. Gotoh. J. Mol. Biol. (1982), 162
- Selection of representative protein data sets. U. Hobohm et al. Protein Science. 1992.
- Protein distance constraints predicted by neural networks and probability density functions. O. Lund et al. Protein Engineering. (1997).
- 10.00 - 11.00
- Sequence alignment
[PDF] .
- Handout 1. PLEASE PRINT AND BRING TO CLASS
- Handout 2. PLEASE PRINT AND BRING TO CLASS
- Handout answers
- A few words on sequence profiles
[PDF] .
- 11.15- 12.00
- Hobohm algorithms
- Data redundancy reduction algorithms (Hobohm1 and Hobohm2). [ PDF].
- 12.00 - 13.00
- Lunch break
- 13.00 - 17.00
- Implementation of the Smith-Waterman Dynamic
programming and the two hobohm algortihms
- Matrix dumps from alignment programs (to be used for debugging)
- Answers to sequence alignment exercise
- Hobohm redundancy reduction algorithms
- Answers to Hobohm programming exercise
Thursday 3 March. Hidden Markov Models.
Hidden Markov Models
The concepts of hidden Markov Models are introduced. It is described how to construct a hidden Markov model from data,
and the Viterbi and posterior decoding algorithms are described in detail. Next it is demonstrated how HMM can be
used to predict features of un-characterized sequences.
Morten Nielsen
BACKGROUND TEXTS
- Immunological Bioinformatics. MIT Press. Chapter 4.
- Notes by Anders Krog in Compendium.
- 10.00 - 10.45
- Hidden Markov models introduction
- HMM slides [ PDF].
- 10.45 - 11.15
- Viterbi decoding
- Viterbi Handout PLEASE PRINT AND BRING TO CLASS
- Answers
- NOTE. PLEASE BRING A CALCULATOR OR CELL PHONE WITH SOME SIMPLE MATH FUNCTIONS TO THE LECTURE SINCE YOU WILL BE ASKED TO DO SOME SIMPLE MATH CALCULATIONS.
- 11.15 - 12.00
- Forward/Backward algorithm, Posterior decoding, Baum-Welsh learning
- Forward Handout PLEASE PRINT AND BRING TO CLASS
- Answers
- 12.00 - 12.15
- Profile Hidden Markov Models.
- 12.15 - 13.15
- Lunch break
- 13.15 - 17.00
- Develop program implementing the Viterbi and posterior decoding algorithms for HMM's.
- Hidden Markov exercises
- Answers to hidden Markov model exercise
Friday 4 March. Artificial neural networks..
Artificial neural networks
Artificial neural networks (ANN) are introduced. It is described how biological data can be encoded to optimally train ANN, and the feed-forward algorithm
is introduced demonstrating how ANN can be used to predict properties for biological data. Next, the back-propagation algorithm used for ANN training
is introduced, and the problem of method over-fitting is illustrated. It is shown how cross-validation can deal with this problem, and strategies
for neural network training are shown.
Morten Nielsen
BACKGROUND TEXTS
- Immunological Bioinformatics. MIT Press. Chapter 4.
- 10.00 - 11.45
- Artificial neural networks.
[PDF] .
- Background
- Feed forward algorithm
- Network training, back-propagation, gradient decent
- Sequence encoding
- Overfitting and cross-validation
- Handout (forward) PLEASE PRINT AND BRING TO CLASS
- Handout (forward) answers
- Handout (back-propagation) PLEASE PRINT AND BRING TO CLASS
- Handout (back-propagation) answers
- NOTE. PLEASE BRING A CALCULATOR OR CELL PHONE WITH SOME SIMPLE MATH FUNCTIONS TO THE LECTURE SINCE YOU WILL BE ASKED TO DO SOME SIMPLE MATH CALCULATIONS.
- 11.45 - 12.00
- Description of project format, project suggestions and formation of groups
- Project description.
[PDF] .
- 12.00 - 13.00
- Lunch Break
- 13.00 - 17.00
- Implementation of sequence encoding and feed forward algorithm
- Network part I answers
- Implementation of Implementation of back-propagation and neural network training
- Network part 2 answers
Monday 7 March - Thursday 10 March. Project work
Project work
In the project period, the student will apply one or more of the algorithms introduced in the course to
gain knowledge about a biological problem of interest. This can be in form of development of prediction
methods, or application of all ready developed prediction software to gain novel biological insights.
The project should be documented as a report written in form of a research paper.
The project must be submitted by email to me by Thursday 10 of March at 13.00.
Friday 11 March
Exam. 10.00 - 17.00
Remember that the exam consists first of a 15 minutes presentation of the project by both members of each group followed by 10-15 minutes individual examination in the course curriculum. In the examination I will not ask specific question to the c-code, but ask questions on the algorithms (how they work, what they do etc). I will strongly recommend that you take a careful look at all the handouts we have used in the course in order to prepare yourself for the exam.
- Program for exam in Algorithms in Bioinformatics
- 10.00 - 10.50 Group 1, Baum- Welsh,
- Pablo Gauna
- Paula Beati
- Eugenio Costa
- Ezequiel Bos
- 10.55 - 11.25 Group 2, MHC binding using PSSM
- Adrian Tamburri
- Cecilia Sanchez
- 11.30 - 12.10 Group 3, Fake versus True CV
- Paula Magarinos
- Santiago Carmona
- Ricci Alejandro
- 12.15 - 12.55 Group 4 Baum-Welsh
- Federico Pousa
- Facundo Carrillo
- Felipe Schargorosdky
- 12.55 - 13.45 Lunch break
- 13.45 - 14.35 Group 5, MHC binding using PSSM
- Esteban Lanzarotti
- Daniel Alvarez
- Omar Arganaras
- 14.40 - 15.10 Group 6, Fake versus True CV
- Christina Cossio Mercado
- Miguel Martinez Soler
- 15.15 - 16.15 Group 7, PSSM versus ANN
- Mariano Semelman
- Thomas Fischer
- Kevin Allekotte
- 16.20 - 17.00 Group 8, PSSM - DNA
- Gonzalo Parra
- Franco Simonetti
- Agustin Izquierdo
- 17.00 - 17.20 Group 9, NN
- Michael Jenik
Go to |