Random sequence generator

Description
The program is given a fasta file with DNA or amino acid sequences and produces a fasta file with random sequences that have the same statistical properties as the input. The program is useful for testing other programs (predictors, for instance) with randomly generated sequences that resemble those of the organism the predictor was trained on.

Input and output
The program is given a fasta file as input on the command line. It should understand the following options:
--seq=<number> Number of sequences to generate as output
--len=<minnumber>-<maxnumber> The length of each output sequence lies in this interval

The fasta file can contain more than one sequence/entry. If the --seq option is not given, the program produces a fasta file with a similar (random) number of sequences as the input (within +/- 10%). If the --len option is not given, the length of each generated sequence is chosen randomly between (shortest sequence in input - 10%) and (longest sequence in input + 10%). The fasta file produced should simply be printed on STDOUT.
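
The defaults described above could be computed roughly as follows. This is a minimal sketch in Python rather than Perl; the function names are hypothetical, and rounding the 10% margins down with integer arithmetic is an assumption, since the text does not say how to round.

```python
import random

def default_count(n_input, rng=random):
    """Pick a sequence count within roughly +/- 10% of the input count.

    Assumption: both bounds are rounded down, and at least 1 sequence is produced.
    """
    low = max(1, n_input * 9 // 10)
    high = max(low, n_input * 11 // 10)
    return rng.randint(low, high)  # randint includes both endpoints

def default_length_range(lengths):
    """Return (shortest - 10%, longest + 10%) for the input sequence lengths."""
    low = max(1, min(lengths) * 9 // 10)
    high = max(low, max(lengths) * 11 // 10)
    return low, high
```

Each output sequence would then get its own length, for example with rng.randint(low, high).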

Examples of program execution:
generate.pl fastafile.fsa
generate.pl --seq=10 fastafile.fsa
generate.pl --len=2000-3000 --seq=10 fastafile.fsa
Options always come before the fasta file, but their order and number may vary.
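
One way to accept the options in any order is to leave the parsing to a standard option parser. The sketch below uses Python's argparse instead of Perl; the helper names parse_length_range and build_parser and the attribute name length_range are illustrative assumptions, not part of the exercise.

```python
import argparse

def parse_length_range(text):
    """Turn '<min>-<max>' (e.g. '2000-3000') into a pair of integers."""
    low, sep, high = text.partition("-")
    if not sep or not low.isdigit() or not high.isdigit():
        raise argparse.ArgumentTypeError("expected <min>-<max>, e.g. 2000-3000")
    low, high = int(low), int(high)
    if low < 1 or low > high:
        raise argparse.ArgumentTypeError("need 1 <= min <= max")
    return low, high

def build_parser():
    parser = argparse.ArgumentParser(description="Random sequence generator")
    # Both options default to None, meaning "derive the value from the input file".
    parser.add_argument("--seq", type=int, default=None,
                        help="number of sequences to generate")
    parser.add_argument("--len", type=parse_length_range, default=None,
                        dest="length_range", metavar="MIN-MAX",
                        help="length interval of the output sequences")
    parser.add_argument("fastafile", help="input fasta file")
    return parser
```

Calling build_parser().parse_args() on the three example command lines above yields seq and length_range values that are either set or None, so the caller can fall back to the input-derived defaults.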

Statistical analysis
The program has to perform some statistical analysis on the input in order to produce similar output. The analysis is explained here with DNA as an example, but the type of sequence does not matter. The program should in fact work with arbitrary letters (not only DNA or amino acids) as input.
The minimum requirement is 1) that the distribution of the letters in the output matches the distribution in the input, and 2) that the probability of a certain letter being followed by a certain other letter is the same in input and output.
1) explained: If 20% of the letters in the input are A, then 20% of the output letters should also be A.
2) explained: If an A in the input is followed by A in 23% of the cases, T in 34%, G in 16% and C in 27%, then the same must be true for the output.
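
Requirements 1) and 2) can be approached by counting single letters and letter pairs, then drawing each new letter according to the counts for the letter before it. The following is a minimal sketch in Python rather than Perl; the function names are hypothetical. It draws the first letter from the overall distribution and assumes that sampling from the pair counts keeps the overall letter frequencies close to the input for reasonably long sequences, which is an approximation rather than a guarantee.

```python
import random
from collections import Counter, defaultdict

def analyse(sequences):
    """Count single letters and (letter, next letter) pairs in the input."""
    letters = Counter()
    followers = defaultdict(Counter)
    for seq in sequences:
        letters.update(seq)
        for current, nxt in zip(seq, seq[1:]):
            followers[current][nxt] += 1
    return letters, followers

def weighted_choice(counter, rng):
    """Draw one key with probability proportional to its count."""
    symbols = list(counter)
    weights = list(counter.values())
    return rng.choices(symbols, weights=weights, k=1)[0]

def generate(length, letters, followers, rng=random):
    """Generate one random sequence of the given length."""
    if length <= 0:
        return ""
    out = [weighted_choice(letters, rng)]
    while len(out) < length:
        options = followers.get(out[-1])
        # A letter seen only at the very end of input sequences has no
        # follower counts; fall back to the overall letter distribution.
        out.append(weighted_choice(options if options else letters, rng))
    return "".join(out)
```

Because the counts are keyed on whatever characters occur in the input, the same code works for DNA, amino acids, or any other alphabet.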
Another idea is to do some position-specific analysis: perhaps a certain letter always occurs at a certain position in the data, e.g. ATG at the beginning of the input sequences, or TAG at the end. It is, after all, strange to let your random sequences start with, for example, GGC if all your input sequences start with ATG.
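
A position-specific analysis could start by counting which letters occur at the first and last few positions of every input sequence. The sketch below is in Python rather than Perl, and the window width of 3, the 90% threshold and the function names are all illustrative assumptions.

```python
from collections import Counter

def positional_counts(sequences, width=3):
    """Count letters at the first and last `width` positions of each sequence."""
    starts = [Counter() for _ in range(width)]
    ends = [Counter() for _ in range(width)]
    for seq in sequences:
        if len(seq) < 2 * width:
            continue  # too short to have separate start and end windows
        for i in range(width):
            starts[i][seq[i]] += 1
            ends[i][seq[len(seq) - width + i]] += 1
    return starts, ends

def conserved(counters, threshold=0.9):
    """Per position, return the dominant letter, or None if none dominates."""
    motif = []
    for counter in counters:
        total = sum(counter.values())
        if total == 0:
            motif.append(None)
            continue
        letter, count = counter.most_common(1)[0]
        motif.append(letter if count / total >= threshold else None)
    return motif
```

For input sequences that all begin with ATG and end with TAG, conserved() returns those letters for the start and end windows, and the generator could copy them into the matching positions of each output sequence.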

You can make the analysis more complex if you want to. The case shown is somewhat similar to Hidden Markov Models (HMM). See what Wikipedia has to say about HMMs; it is good for inspiration.