
Searching for signals/motifs in sequences

Description
A sequence motif is typically a short pattern in a DNA or amino acid sequence that is conserved across gene families or organisms. Such motifs are recognizable and may represent a promoter, a binding site or a domain that folds into a specific structure. The program is given a fasta file with DNA or amino acid sequences and a file describing the signal to search for. It then displays every occurrence of a match in each fasta entry.

Input and output
The program is given a fasta file, a signal description file and a deviation (a number) as input on the command line. The deviation is the maximum combined mismatch penalty that a match may have relative to the signal description.
The fasta file can contain more than one sequence/entry.
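
Since the fasta file may hold several entries, the reader has to return each header together with its full sequence. A minimal sketch of one way to do this follows; the function name `read_fasta` is hypothetical, and upper-casing the sequence is an assumption made here so that later comparisons are case-insensitive.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Assumption: sequences are upper-cased for matching.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Small in-memory example with two entries.
example = """>seq1 first entry
ACGT
ttga
>seq2
GGCC
""".splitlines()
entries = list(read_fasta(example))
print(entries)
```

With a real file the same function can be fed an open file handle, for example `read_fasta(open(filename))`.
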
The signal description file is a tab-separated file. Each line is one of the following:
1) one or more allowed letters at this position, followed by the penalty for a mismatch at that position.
2) the star character *, denoting unimportant characters in the sequence, followed by an interval giving how many such characters are allowed.
3) a line starting with the hash character #, which is a comment and should be ignored by the program.
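
The three line types above could be parsed roughly as sketched here. The name `parse_signal` and the tuple layout `("pos", allowed, penalty)` / `("gap", low, high)` are illustrative choices, not part of the exercise.

```python
def parse_signal(lines):
    """Turn signal description lines into a list of elements.

    Each element is either ("pos", allowed_letters, penalty)
    or ("gap", min_length, max_length).
    """
    elements = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or empty line
        left, right = line.split("\t")
        if left == "*":
            low, high = right.split("-")
            elements.append(("gap", int(low), int(high)))
        else:
            elements.append(("pos", frozenset(left.upper()), int(right)))
    return elements


demo = ["# comment", "T\t7", "AT\t5", "*\t15-21"]
parsed = parse_signal(demo)
print(parsed)
```

Storing the allowed letters as a set makes the later "is this letter allowed here" test a simple membership check.
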

As an example, here is a prokaryotic promoter, a two-part signal that can be described as:

# Shine-Dalgarno
T	7
T	8
G	6
A	5
C	5
A	5
# intervening unimportant bases
*	15-21
# Pribnow box
T	8
A	8
T	6
A	6
AT	5
T	8
The output should list all matches in each fasta entry, clearly stating the location of the match. A fasta file which could be used is here.

Details
The deviation is an important factor. If the deviation is set to 0, the search for the signal reduces to matching a regular expression. If the deviation is set to 16 in the above example, then any combination of mismatches with a combined penalty of 16 or less is allowed. In the promoter example the following signals would match (X marks a mismatched position, intervening bases are ignored, and the list is not complete):
TTGXXX*TATAAT
TTGACA*XXTAAT
TTGXXA*TAXAAT
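
The combined penalties of the three signals above can be checked with a few lines of code. This is only a sketch: the names `FIRST_PART`, `PRIBNOW`, `block_penalty` and `total_penalty` are hypothetical, and the two parts of the promoter are hard-coded here instead of being read from the description file.

```python
# (allowed letters, mismatch penalty) per position, copied from the example.
FIRST_PART = [("T", 7), ("T", 8), ("G", 6), ("A", 5), ("C", 5), ("A", 5)]
PRIBNOW = [("T", 8), ("A", 8), ("T", 6), ("A", 6), ("AT", 5), ("T", 8)]


def block_penalty(block, window):
    """Sum the penalties of all positions whose letter is not allowed."""
    return sum(pen for (allowed, pen), ch in zip(block, window)
               if ch not in allowed)


def total_penalty(first, second):
    """Combined penalty of both parts; the intervening bases cost nothing."""
    return block_penalty(FIRST_PART, first) + block_penalty(PRIBNOW, second)


print(total_penalty("TTGXXX", "TATAAT"))  # 15
print(total_penalty("TTGACA", "XXTAAT"))  # 16
print(total_penalty("TTGXXA", "TAXAAT"))  # 16
```

All three totals are at most 16, so each signal is accepted with a deviation of 16.
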

Note: I do not consider an approach based heavily on regular expressions a good idea, but you are free to do as you like. Because of the flexibility in the signal caused by the gaps, several different matches can start at the same position in the sequence. You should find all matches, not just the first one.
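
One possible way to find every match without regular expressions is a recursive search that tries each allowed gap length and abandons a path as soon as its penalty exceeds the deviation. The sketch below is an illustration under assumptions, not a reference solution: `find_matches` and the element tuples `("pos", allowed, penalty)` / `("gap", low, high)` are hypothetical names, and positions are reported 0-based with an exclusive end.

```python
def find_matches(seq, elements, max_dev):
    """Return sorted (start, end, penalty) tuples for all matches in seq."""
    found = []

    def extend(idx, pos, penalty, start):
        if penalty > max_dev:
            return  # too many mismatches on this path
        if idx == len(elements):
            found.append((start, pos, penalty))
            return
        kind, a, b = elements[idx]
        if kind == "gap":
            # Try every allowed gap length, so all matches are reported.
            for length in range(a, b + 1):
                if pos + length <= len(seq):
                    extend(idx + 1, pos + length, penalty, start)
        elif pos < len(seq):
            cost = 0 if seq[pos] in a else b
            extend(idx + 1, pos + 1, penalty + cost, start)

    for start in range(len(seq)):
        extend(0, start, 0, start)
    return sorted(found)


# Toy signal: A, C, a gap of 1-2 bases, then G (penalty 5 per mismatch).
toy = [("pos", "A", 5), ("pos", "C", 5), ("gap", 1, 2), ("pos", "G", 5)]
print(find_matches("ACTGG", toy, 0))  # [(0, 4, 0), (0, 5, 0)]
print(find_matches("ATTGG", toy, 5))  # [(0, 4, 5), (0, 5, 5)]
```

In both calls two matches start at position 0, one for each gap length, which is the situation the note above describes.
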