|
Random sequence generator
Description
The program is given a fasta file with DNA or amino acid sequences,
and produces a fasta file with random sequences that have the same
statistical properties as the input. The program is useful for
testing other programs (predictors, for instance) with randomly generated
sequences that resemble those of the organism the predictor was trained on.
Input and output
The program is given a fasta file as input on the command line. It should understand
the following options:
--seq=<number> Number of sequences to generate as output
--len=<minnumber>-<maxnumber> The length of each output sequence lies in this interval
The fasta file can contain more than one sequence/entry. If the --seq option is not given,
the program produces a random number of sequences that is within +/- 10% of the number of
sequences in the input. If the --len option is not given, the length of each generated
sequence is chosen at random between (shortest sequence in input - 10%) and (longest
sequence in input + 10%). The fasta file produced should simply be printed on STDOUT.
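The defaults described above can be sketched as follows. This is a minimal illustration in Python rather than Perl; the function names `default_seq_count` and `default_length_range` are hypothetical, and the rounding (down for the lower bound, up for the upper bound, never below 1) is an assumption, since the text does not say how to round.

```python
import random

def default_seq_count(n_input, rng=random):
    """Pick a sequence count within +/- 10% of the input count."""
    # Integer arithmetic: floor for the lower bound, ceiling for the upper bound.
    low = max(1, n_input * 9 // 10)
    high = max(low, (n_input * 11 + 9) // 10)
    return rng.randint(low, high)  # randint includes both end points

def default_length_range(input_lengths):
    """Return (min, max): shortest input - 10% and longest input + 10%."""
    shortest = min(input_lengths)
    longest = max(input_lengths)
    low = max(1, shortest * 9 // 10)
    high = max(low, (longest * 11 + 9) // 10)
    return low, high

def random_length(input_lengths, rng=random):
    """Draw one output length uniformly from the default interval."""
    low, high = default_length_range(input_lengths)
    return rng.randint(low, high)
```

With input lengths 100 and 200, the interval becomes 90 to 220.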
Examples of program execution:
generate.pl fastafile.fsa
generate.pl --seq=10 fastafile.fsa
generate.pl --len=2000-3000 --seq=10 fastafile.fsa
Options always come before the fasta file, but they may appear in any order, and any of them may be omitted.
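One way to handle these options is sketched below in Python with the standard argparse module (the real program is a Perl script, so this is only an illustration). The helper names `parse_len` and `parse_args` are hypothetical. Note that argparse is more permissive than required: it also accepts options after the file name.

```python
import argparse

def parse_len(text):
    """Turn a string such as '2000-3000' into the tuple (2000, 3000)."""
    low, sep, high = text.partition("-")
    if not sep or not low.isdigit() or not high.isdigit():
        raise argparse.ArgumentTypeError("expected <min>-<max>, got %r" % text)
    low, high = int(low), int(high)
    if low < 1 or low > high:
        raise argparse.ArgumentTypeError("need 1 <= min <= max")
    return low, high

def parse_args(argv):
    """Parse the option list; missing options are returned as None."""
    parser = argparse.ArgumentParser(prog="generate")
    parser.add_argument("--seq", type=int, default=None,
                        help="number of sequences to generate")
    parser.add_argument("--len", type=parse_len, default=None, dest="length",
                        help="length interval, written as <min>-<max>")
    parser.add_argument("fastafile", help="input fasta file")
    return parser.parse_args(argv)
```

A value of None for `seq` or `length` signals that the default derived from the input should be used.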
Statistical analysis
The program has to do some statistical analysis on the input in order to produce similar output.
The analysis is explained here with DNA as an example, but the type of sequence does not matter. The program
should in fact work with arbitrary letters (not only DNA or AA) as input.
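Because the alphabet does not matter, the input can be read without checking which letters occur. A minimal Python sketch of such a reader follows; the name `read_fasta` is hypothetical, and it assumes the usual fasta layout where each entry starts with a header line beginning with ">".

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; no alphabet check is done.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

The function accepts an open file handle as well as a list of strings, since both are iterables of lines.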
The minimum requirement is 1) that the distribution of the letters in the output matches the distribution
in the input, and 2) that the probability that a given letter is followed by a given other letter is the same
in input and output.
1) explained:
If 20% of the letters in the input are A, then 20% of the output letters should also be A.
2) explained:
Suppose an A in the input is followed by an A in 23% of the cases, by T in 34%, by G in 16% and by C in 27%.
The same must be true for the output.
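Both requirements can be met by counting single letters and letter pairs in the input, then drawing each output letter according to those counts. The Python sketch below shows one possible approach, not the required implementation. All function names are hypothetical. Two choices are assumptions: the first letter is drawn from the overall letter frequencies, and a letter that has no successor in the input falls back to the overall frequencies.

```python
import random
from collections import Counter, defaultdict

def letter_counts(sequences):
    """Count how often each letter occurs over all input sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    return counts

def transition_counts(sequences):
    """Count, for each letter, which letters follow it and how often."""
    follows = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            follows[current][nxt] += 1
    return follows

def weighted_choice(counter, rng):
    """Draw one key with probability proportional to its count."""
    letters = list(counter)
    weights = [counter[letter] for letter in letters]
    return rng.choices(letters, weights=weights, k=1)[0]

def generate(length, counts, follows, rng=random):
    """Generate one sequence of the given length (length >= 1)."""
    # Assumption: the first letter follows the overall letter distribution.
    out = [weighted_choice(counts, rng)]
    while len(out) < length:
        successors = follows.get(out[-1])
        # Assumption: a letter never followed by anything in the input
        # falls back to the overall letter distribution.
        source = successors if successors else counts
        out.append(weighted_choice(source, rng))
    return "".join(out)
```

For long sequences, reproducing the pair statistics tends to reproduce the single-letter distribution as well, so requirement 1 usually follows from requirement 2; it is still worth checking on real output.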
Another idea is to do some position-specific analysis: perhaps a certain letter is always at a certain
position in the data, e.g. ATG at the beginning of the input sequences, or TAG at the end. It is, after all, strange
to let your random sequences start with, for example, GGC if all your input sequences start with ATG.
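A position-specific analysis could start by counting the letters at the first and last few positions of every input sequence. The Python sketch below is one hypothetical way to do that. The window size `k=3` and the names `position_counts` and `fixed_motif` are assumptions chosen to match the ATG/TAG example, not part of the specification.

```python
from collections import Counter

def position_counts(sequences, k=3):
    """Count the letters at each of the first k and last k positions."""
    start = [Counter() for _ in range(k)]
    end = [Counter() for _ in range(k)]
    for seq in sequences:
        if len(seq) < 2 * k:
            continue  # too short to have separate start and end windows
        for i in range(k):
            start[i][seq[i]] += 1
            end[i][seq[len(seq) - k + i]] += 1
    return start, end

def fixed_motif(position_counters):
    """Return the motif if every position always holds the same letter."""
    motif = []
    for counter in position_counters:
        if len(counter) != 1:
            return None  # this position varies, or no data was seen
        motif.append(next(iter(counter)))
    return "".join(motif)
```

If `fixed_motif` returns a string for the start or the end window, the generator could copy that motif into every output sequence. When the positions vary, the per-position counters could instead be sampled from directly.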
You can do more complex analysis if you want to. The case shown is in effect a simple Markov chain, which is
closely related to Hidden Markov Models. See what Wikipedia has to say about HMMs;
it is good for inspiration.
|