Exercise: Assembling NGS data - from short reads to a draft genome

Mette Voldby Larsen (metteb@cbs.dtu.dk)


Europe experienced an outbreak of STEC (Shiga toxin-producing Escherichia coli) in 2011. The first cases were reported in May 2011 in Germany. A few months later, when the outbreak was over, more than 3,500 people had been infected, and more than 800 had developed the rare, life-threatening complication Haemolytic Uraemic Syndrome (HUS). Fifty-three people had died.

Along with other groups, the British Health Protection Agency sequenced the outbreak strain on a next-generation sequencing platform. More specifically, they used an Illumina MiSeq sequencer. The results, the short (raw) reads, can be downloaded by running this command in a terminal window:

scp studxxx@athena.cbs.dtu.dk:/usr/opt/www/pub/CBS/courses/27485.imm/exercise_NGS/Strain_H112180280.fastq Desktop/.

Alternatively, download the file by right-clicking on this link and choosing "Save link as...": Strain_H112180280.fastq. Don't try to open the file in your browser; it is too big.

Download the file and have a look at the content in a text editor.

  • How many lines does the information for one read cover?

  • How long is a read?

  • How many reads are there?
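Once you have answered the questions by eye, a short script can check the counts. The following is a minimal Python sketch, assuming the file follows the common FASTQ layout of four lines per read (header starting with "@", sequence, a "+" separator line, quality string) with no line wrapping. The function name fastq_stats and the two example reads are made up for illustration and are not taken from the outbreak data.

```python
import io
from collections import Counter
from itertools import islice


def fastq_stats(handle):
    """Count reads and tally read lengths in a FASTQ stream.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    # Drop blank lines (e.g. a trailing empty line at the end of the file).
    lines = (line.rstrip("\n") for line in handle if line.strip())
    n_reads = 0
    lengths = Counter()
    while True:
        record = list(islice(lines, 4))
        if not record:
            break
        if (len(record) < 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed record near read {n_reads + 1}")
        lengths[len(record[1])] += 1  # record[1] is the sequence line
        n_reads += 1
    return n_reads, lengths


# Tiny made-up example with two reads of eight bases each.
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nTTGCAAGC\n+\nIIIIIIII\n"
)
n_reads, lengths = fastq_stats(example)
print(n_reads, dict(lengths))
```

To run it on the downloaded reads, pass an open file handle instead of the example, e.g. open("Desktop/Strain_H112180280.fastq") if you saved the file to your Desktop as in the scp command above.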

    The raw reads can be used directly for, e.g., species identification with KmerFinder, but since we already know the reads are from E. coli, let's skip that part.

    For our purpose, which is to identify genes, the short reads must first be assembled into a draft genome.

    Go to: Assembler. This is a service developed by Simon Rasmussen, CBS, DTU. Depending on the sequencing platform used to generate the short reads, it uses different algorithms to assemble them.

    Via the "Browse" button, select the file with the short reads.

    Select "Illumina - single end reads" as the type of your reads.


    When the assembly job is finished (be patient), download the file containing the contigs.

    Have a look at the content in a text editor.

  • How many contigs are there?

  • Would it in theory be possible to end up with only one contig?

    Here are the contigs: E.coli_contigs.fsa.
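Counting contigs by eye in a text editor gets tedious for larger assemblies. The following is a minimal Python sketch for counting them, assuming the contigs file is in FASTA format, where each contig starts with a header line beginning with ">" and its sequence may span several lines. The function name contig_lengths and the two example contigs are made up for illustration.

```python
import io


def contig_lengths(handle):
    """Return a list of (header, length) pairs for a FASTA stream."""
    contigs = []  # each entry is [header, running length]
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            contigs.append([line[1:], 0])  # new contig begins
        elif contigs:
            contigs[-1][1] += len(line)  # sequence line of current contig
    return [(name, length) for name, length in contigs]


# Tiny made-up example: two contigs, the first wrapped over two lines.
example = io.StringIO(">contig_1\nACGTAC\nGTAC\n>contig_2\nTTGCA\n")
contigs = contig_lengths(example)
print(len(contigs), "contigs,", sum(n for _, n in contigs), "bases in total")
```

To run it on your own assembly, pass an open file handle for the downloaded contigs file instead of the example.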