|
Read trimmer for Next-Generation-Sequencing data
DESCRIPTION
The advent of Next Generation Sequencing (NGS) technologies has transformed
how biological research is performed, and today almost all biological
fields use the technology for cutting-edge discoveries. A human genome
can now be sequenced in a very short time for approximately $1000, giving
unprecedented possibilities for investigating human traits, evolution and
diseases. Similarly, whole bacterial communities and their interplay with the
environment can be studied, unravelling novel enzymes and organisms.
These experiments produce massive amounts of data that often include a
lot of noise, and the program therefore has to be able to clean up NGS data.
INPUT AND OUTPUT
The input and output data are Illumina fastq files. A fastq file is very
similar to a fasta file, except that it also contains a quality value for
each base. One entry, also known as a 'read', is exactly four lines:
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
This is an example:
@ILLUMINA-3BDE4F_0027:2:1:12594:2417#TGACCA/1
GTTACTTGTGTCGTTGTAGACACTNCTGATACCTCCAGCATGCCTCACAGCACACCTTCGCAGGCTTACAGAACGCTCCCCTACCCAACAACACATAGTGT
+
a\EDZKOGNDPDIJDKFFNF]Z]`BaaaZaZEZVcaaX]accccccccccccccccccccccccc[ccacccccccccccccabcca_c_O____a_aXaV
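The four-line layout described above can be read one entry at a time, so that only a single read is held in memory. The following is a minimal sketch, not a prescribed implementation; the names open_fastq and read_fastq are hypothetical, and choosing gzip based on a ".gz" file suffix is an assumption.

```python
import gzip
import io
from itertools import islice


def open_fastq(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode.

    Assumption: compressed files are recognised by a '.gz' suffix.
    """
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(handle):
    """Yield (identifier, sequence, quality) for each four-line entry."""
    while True:
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not any(record):
            return  # end of file (or only trailing blank lines)
        if len(record) < 4:
            raise ValueError("truncated fastq entry")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed fastq entry: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], seq, qual


# Usage with an in-memory handle; a real run would use open_fastq(path).
example = io.StringIO("@read1 demo\nACGTN\n+\nIIII#\n")
for identifier, seq, qual in read_fastq(example):
    print(identifier, len(seq), len(qual))
```

Because read_fastq is a generator, the same loop works unchanged on files with millions of reads.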
Each character in the base-quality string (fourth line) encodes the
probability that the base at the same position is wrong. The ASCII code of
the character, minus an offset (33 or 64), gives the Phred quality score Q,
and the error probability is P = 10^(-Q/10). The input data contains
millions (and up to billions) of these reads, and they must be cleaned
according to user-specified settings.
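As an illustration of the conversion between quality characters and probabilities, the sketch below decodes a quality string into Phred scores and error probabilities. The function names phred_scores and error_probability are hypothetical; the offset is 33 for Phred+33 and 64 for Phred+64.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


# 'I' is ASCII 73, so under Phred+33 it decodes to Q = 40.
scores = phred_scores("II5#", offset=33)
probabilities = [error_probability(q) for q in scores]
```

The example read shown earlier contains characters such as 'c' (ASCII 99), which would decode to Q = 66 under Phred+33, so it was presumably written with the Phred+64 offset (Q = 35).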
References:
http://en.wikipedia.org/wiki/FASTQ_format
http://en.wikipedia.org/wiki/Phred_quality_score
DETAILS
The program must be able to:
- Read both compressed and uncompressed fastq files and write either
compressed or uncompressed files according to user input (gzip only).
- Use either Phred+33 or Phred+64 encoding according to user input.
- Trim each read from the 3' end based on quality, using either the minimum or the mean of a moving window.
- Trim X nucleotides from the 5' end of each read given user input.
- Trim X nucleotides from the 3' end of each read given user input.
- Filter out reads with a mean quality lower than specified.
- Filter out reads that are shorter than specified.
- Filter out reads that contain more than a specified maximum number of N bases (unknown bases).
- Must be efficient in RAM usage.
- Must be fast due to the millions of reads in real datasets.
- Must keep track of how many reads are trimmed, removed etc.
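The trimming, filtering and counting requirements above can be sketched as one function applied to each read. This is only one possible interpretation: the names trim_3prime and clean_read, the default thresholds, the order of the steps (fixed trimming, then window trimming, then filters) and the exact window rule (drop 3' bases until the last window reaches the threshold) are all assumptions.

```python
from collections import Counter


def trim_3prime(scores, threshold, window=5, mode="mean"):
    """Return how many bases to keep after quality trimming from the 3' end.

    Bases are dropped from the 3' end until the minimum (mode="min") or
    mean (mode="mean") of the last `window` scores reaches `threshold`.
    """
    if mode == "min":
        stat = min
    elif mode == "mean":
        stat = lambda values: sum(values) / len(values)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    end = len(scores)
    while end > 0:
        start = max(0, end - window)
        if stat(scores[start:end]) >= threshold:
            break
        end -= 1
    return end


def clean_read(seq, qual, stats, offset=33, trim5=0, trim3=0,
               window=5, min_window_q=20, mode="mean",
               min_mean_q=20, min_length=20, max_n=0):
    """Trim one read and return (seq, qual), or None if it is filtered out.

    `stats` is a Counter that is updated in place.
    """
    stats["total"] += 1
    original_length = len(seq)

    # Fixed trimming of the 5' and 3' ends.
    end = max(len(seq) - trim3, 0)
    seq, qual = seq[trim5:end], qual[trim5:end]

    # Quality-based trimming from the 3' end.
    scores = [ord(char) - offset for char in qual]
    keep = trim_3prime(scores, min_window_q, window, mode)
    seq, qual, scores = seq[:keep], qual[:keep], scores[:keep]
    if len(seq) < original_length:
        stats["trimmed"] += 1

    # Filters: length, unknown bases, mean quality.
    if not seq or len(seq) < min_length:
        stats["too_short"] += 1
        return None
    if seq.upper().count("N") > max_n:
        stats["too_many_n"] += 1
        return None
    if sum(scores) / len(scores) < min_mean_q:
        stats["low_quality"] += 1
        return None
    stats["kept"] += 1
    return seq, qual


stats = Counter()
result = clean_read("ACGTACGT", "IIIIII##", stats,
                    window=2, mode="min", min_length=4)
```

Processing one read at a time keeps memory usage independent of file size. The window scan here costs O(read length x window) per read; a running sum or a precomputed score table would be a natural optimisation if speed becomes a concern.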
Optionally, advanced features can be implemented in the program:
- Trim paired-end reads simultaneously, keeping pairs in sync between the two files.
- Autodetect the quality scale of a file (Phred+33 or Phred+64).
- Calculate statistics on a fastq/fasta file without trimming, such as number
of entries, counts of each base, length of entries, average length of
entries, quality (average, best/worst 10%), etc.
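Autodetection of the quality scale can be approached with a simple heuristic on the range of quality characters. The sketch below is an assumption-laden illustration, not a guaranteed method: detect_offset is a hypothetical name, the cutoffs assume that Phred+64 files never contain characters below ';' (ASCII 59) and that Phred+33 files do not exceed 'J' (ASCII 74), and it returns None when the sampled reads are ambiguous.

```python
def detect_offset(quality_strings, max_reads=10000):
    """Guess the Phred offset (33 or 64) from a sample of quality strings.

    Returns None if the sampled characters fit both encodings.
    """
    lowest, highest = 255, 0
    for index, quality in enumerate(quality_strings):
        if index >= max_reads:
            break
        if not quality:
            continue
        codes = quality.encode("ascii")
        lowest = min(lowest, min(codes))
        highest = max(highest, max(codes))
    if lowest < 59:
        return 33  # characters this low cannot occur with a +64 offset
    if highest > 74:
        return 64  # assumed to be above the usual Phred+33 range
    return None


guess = detect_offset(["IIII##", "IIIIII"])
```

When the result is None, falling back to a user-supplied setting is probably the safest choice.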
Paired end reads:
When a long DNA molecule is sequenced from both ends, this yields
two files, one with reads from the 5' end and one with reads from the 3' end
of the DNA molecule. Programs assume that paired reads have the
same position in the two files. When trimming, the two files have to
be kept in sync, e.g. if a read is removed from one file, then the
corresponding mate needs to be removed from the other file to maintain
synchrony between the files.
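Keeping the two files in sync can be sketched by iterating over both in lockstep and writing a pair only when both mates survive cleaning. The name sync_pairs and the `clean` callback (any function returning the cleaned read, or None if the read is removed) are hypothetical; the sketch relies on the assumption stated above that mates share the same position in the two files.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking the end of the shorter input


def sync_pairs(reads1, reads2, clean):
    """Yield (mate1, mate2) only when both mates pass `clean`."""
    for mate1, mate2 in zip_longest(reads1, reads2, fillvalue=_MISSING):
        if mate1 is _MISSING or mate2 is _MISSING:
            raise ValueError("paired files contain different numbers of reads")
        cleaned1, cleaned2 = clean(mate1), clean(mate2)
        if cleaned1 is not None and cleaned2 is not None:
            yield cleaned1, cleaned2


# Toy example: reads are (identifier, sequence); keep sequences of length >= 3.
forward = [("a/1", "ACGT"), ("b/1", "AC"), ("c/1", "ACGTA")]
reverse = [("a/2", "TTTT"), ("b/2", "TTTT"), ("c/2", "T")]
keep_long = lambda read: read if len(read[1]) >= 3 else None
pairs = list(sync_pairs(forward, reverse, keep_long))
```

Here only the pair "a" survives, because "b" fails in the first file and "c" fails in the second. Some tools instead write surviving single mates to a separate file; that variant is not shown here.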
|