Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

De novo assembly exercises

Simon Rasmussen - June 2013


Overview

In this exercise we will try to perform de novo assembly of Illumina Paired end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will try to:

  1. Run FastQC, adaptor and quality trimming reads
  2. Count k-mers and estimate genome size
  3. Correct reads using Quake
  4. Determine insert size of paired end reads
  5. Run de novo assembly using SOAPdenovo
  6. Calculate assembly statistics
  7. Plot coverage and length histograms of the assembly
  8. Re-assembly to improve assembly
  9. Visualize assembly using Circoletto

FastQC and trimming

Create dir and copy data

cd /home/local/ngs_course/studXXX
mkdir denovo
cd denovo/
cp /home/local/27626/exercises/denovo/vchol/* .

Start by running fastqc

mkdir fastqc
fastqc -o fastqc *.txt.gz
firefox fastqc/Vchol-001_6_1_sequence_fastqc/fastqc_report.html &
firefox fastqc/Vchol-001_6_2_sequence_fastqc/fastqc_report.html &

There are quite some issues with this data (but dont spend to much time on studying the report), we should clean it up first. Lets find out what type of encoding the qualities are in:

gzcat Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py
Q1. Which format are they encoded in?

Lets trim the reads using cutadapt. In the command below the most frequent adapter/primer sequence is already pasted in - dont worry about the others in our case they are just variations of that one. Cutadapt will assume a 10% error rate so they should also be clipped as well. Also we we use minimum of 40nt and trim to quality 20 (~5 mins). Note that both commands ends with "&" meaning that they will both run on the same time in the background. You must wait for both commands to finish before going to the next command!

cutadapt -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC -O 3 -m 40 -q 20 --quality-base=64 \
Vchol-001_6_1_sequence.txt.gz > Vchol-001_6_1.cut.fastq &
cutadapt -b GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG -O 3 -m 40 -q 20 --quality-base=64 \
Vchol-001_6_2_sequence.txt.gz > Vchol-001_6_2.cut.fastq &

Then we need to put the reads in the same order using cmpfastq (1min)

cmpfastq.pl Vchol-001_6_1.cut.fastq Vchol-001_6_2.cut.fastq

Ok, now we have our reads trimmed and ready to go.
Count stats of the trimmed data such as avg readlength, min length, max length, no. reads and no. bases. Note down the avg readlength and the no. of bases.

fastx_readlength.py --i Vchol-001_6_1.cut.fastq-common.out
fastx_readlength.py --i Vchol-001_6_2.cut.fastq-common.out

Genome size estimation

Lets try to count the occurence of k-mers in the data - a k-mer is simply a string of nucleotides of a certain length. Lets say we count all the 15-mers that are in the read data we have - we do this using a program called Jellyfish. Here we tell Jellyfish to count 15-mers and also add counts from the complementary strand. Afterwards we tell it to create a histogram that we will plot using R

jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C Vchol-001_6_*.cut.fastq-common.out
jellyfish histo Vchol-001_0 > Vchol-001.histo

R
dat=read.table("Vchol-001.histo")
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)

The plot shows how many times a k-mer is found (x-axis) and the number of kmers with this count (y-axis). The bars in the lower end are the k-mers that only occur very few times, these are probably sequencing errors, whereas the k-mers that occurs many times are "real" k-mers. We can use a program like Quake to use the information from the "real" k-mers to correct similar "error k-mers". In this way we can increase the performance of our assembly (this also works for SNP calling as well).

Q2. Where is the peak at?

Before we correct the reads, we will try to estimate the genome size of our genome. Of course we know that it is a V. cholerae, but if we did not know that or it was an unknown species then we could do like this. N = (M*L)/(L-K+1) and Genome_size = T/N, where N: Depth, M: Kmer peak, K: Kmer-size, L: avg readlength, T: Total bases. Try to put in the numbers (some of them you have from "fastx_readlength.py" command) and see what you get. The reference vibrio cholerae genome is around 4Mb, you should get within +/- 10% of this.

Q3. What is the estimated genome size

Quake error correction

Lets try to correct the reads using Quake. It is actually rather easy to do, we need to make a file with the read data that we have and then start quake with k=15 and tell it that the qualities are Illumina (-q 64). The corrected reads will be *.cor.out

# THIS TAKES SOME TIME - YOU CAN COPY MY FILES #
echo "Vchol-001_6_1.cut.fastq-common.out Vchol-001_6_2.cut.fastq-common.out" > file_list
/home/local/27626/bin/Quake/bin/quake.py -f file_list -k 15 -p 2 -q 64 --header

# COPY MY FILES #
ln -s /home/local/ngs_course/stud100/denovo/*.cor.out .

de novo assembly

Now the reads are ready for de novo assembly. We are going to use SOAPdenovo as the assembler, it uses the de bruijn approach. When running denovo assemblies one should try different k-mer sizes - the k-mer size is what is being used for building the de bruijn graph and is therefore very important. There is currently no way to estimate what will be the optimal k before running, the best k-size may change between different dataset and may change between different assemblers for the same dataset.

SOAPdenovo requires a configuration file with different information, a sample file is the Vchol-001.soap.conf. You need to open it and change the path to your file. To open it write "n Vchol-001.soap.conf" and then navigate to "q1" and "q2", here paste in the path and filename of your two files. Note that we have written "avg_ins=200" even though we do not know the average insert size. We will run an initial assembly to create contigs and then extract the information and then rerun the full assembly again. When you have changed the paths save the file (Ctrl-S) then start the assembler. We will try first with a k of 35.

SOAPdenovo-127mer pregraph -s Vchol-001.soap.conf -K 35 -p 2 -o initial
SOAPdenovo-127mer contig -g initial

Now you should have a file called "initial.contig", we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Lets only map the first 100.000 reads - this should be enough.

head -n 400000 Vchol-001_6_1.cut.fastq-common.out > Vchol_sample_1.fastq
head -n 400000 Vchol-001_6_2.cut.fastq-common.out > Vchol_sample_2.fastq

bwa index initial.contig
bwa mem initial.contig Vchol_sample_1.fastq Vchol_sample_2.fastq | samtools view -Sb - > initial.sample.bam

samtools view initial.sample.bam | cut -f9 > initial.insertsizes.txt

R
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])       # mean
sd(a.v[a.v >= mn & a.v <= mx])         # sd
Q4. What are the mean insert size and standard deviation of our library?

Open "Vchol-001.soap.conf" and change the avg_ins to the value you found. Then we are going to rerun the assembly where we also do scaffolding. Because the assembly takes some time to finish, you will all chose a different k-mer to run and then we will compare to find the "best" assembly. Chose one of the K-sizes on the google document and change the "XX" values below to your number. Remember to write your name and save the document.

SOAPdenovo-127mer all -s Vchol-001.soap.conf -K XX -p 2 -o asmKXX

When it is done, we need to calculate some statistics on it, there is a script from the Assemblathon that does it nicely, but first lets remove all sequences shorter than 100

fastx_filterfasta.py --i asmKXX.scafSeq --min 100
assemblathon_stats.pl asmKXX.scafSeq.filtered_100.fa > asmKXX.scafSeq.stats

Open the asmKXX.scafSeq.stats file and fill in the information in the google document for your respective K. We will then compare the results and find the best K-size for the particular dataset. If there are still k-sizes that has not been run, feel free to run them.

Copy the best assembly to your folder and use this assembly from now on (we renamed it to Vchol-001.best.fa).

cp /home/local/27626/exercises/denovo/best/Vchol-001.best.fa .
cp /home/local/27626/exercises/denovo/best/Vchol-001.best.stats .
Q5. How does the sum of the assembly compare with what we estimated using kmers?
Q6. How many contigs were joined together in scaffolds? Why do you think it is a such a low number (Hint: Repeats, PE insert size)?

Coverage of assembly

Lets calculate coverage and length for each sequence and plot it as a histogram in R. We also plot the lengths of the scaffolds.

fastx_soapcov.py --i Vchol-001.best.fa > Vchol-001.best.cov

R
library(plotrix)
dat=read.table("Vchol-001.best.cov", sep="\t")
par(mfrow=c(1,2))
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="Weighted coverage", xlab="Contig coverage")
hist(dat[,1], xlim=c(0,100), breaks=seq(0,1000,1), main="Raw coverage", xlab="Contig coverage")
dev.print("best.coverage.pdf", device=pdf)

# Lengths
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])), xlab="# Scaffold", ylab="Length", main="Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)

acroread best.coverage.pdf &
acroread scaffold.lengths.pdf &

The left plot shows the length-weighted coverage (of k-mers) of the contigs/scaffolds - this means that long sequences has a larger weight compared to short sequences. The plot on the right side shows a normal histogram of contig coverage. By comparing the two plots we see that the majority of the assembly has a k-mer coverage around 20-30X, and that remaining assembly are short sequences. Looking at the plot with the lengths you see that the majority of the assembly are in quite long scaffolds.

Q7. Can you think of why some short contigs have much higher coverage compared to main assembly?
Q8. Can you think of why some short contigs have much lower coverage compared to the main assembly?

Re-assembly

Now we created a fairly good assembly, but lets see if we can do it better. Lets try to map the reads to the assembly and then only use mapped reads for another assembly. We map using bwa.

bwa index Vchol-001.best.fa
bwa mem -M -t 4 Vchol-001.best.fa Vchol-001_6_1.cut.fastq-common.cor.out Vchol-001_6_2.cut.fastq-common.cor.out \
| samtools view -Sb - > Vchol-001.bwa.bam

When it is done we run we look at the statistics of the assembly.

samtools flagstat Vchol-001.bwa.bam
Q9. How many reads are properly paired (properly paired means that the map within expected range of each other)?

Lets extract the properly paired reads from the alignment and make a new assembly using only these. Removing reads from the assembly that may make troubles will decrease the complexity of the de bruijn graph and hopefully lead to a better assembly with less errors. Go to explain-flags and find the flag-number for a properly paired read. Insert the number below instead of Y to create a bam-file of only properly paired reads and the extract to fastq-files.

Q10. What is the flag for properly paired?
samtools view -b -f Y Vchol-001.bwa.bam > Vchol-001.bwa.pp.bam
java -jar /home/local/27626/bin/SamToFastq.jar INPUT=Vchol-001.bwa.pp.bam FASTQ=Vchol-001_1.remade.fastq SECOND_END_FASTQ=Vchol-001_2.remade.fastq

Update the "Vchol-001.soap.conf" file with the filenames "Vchol-001_1.remade.fastq" and "Vchol-001_2.remade.fastq" and then run SOAPdenovo again. Now we just use k=73 only, filter the output scaffolds and run stats on it.

SOAPdenovo-127mer all -s Vchol-001.soap.conf -K 73 -p 2 -o Vchol-001.reasm
fastx_filterfasta.py --i Vchol-001.reasm.scafSeq --min 100
assemblathon_stats.pl Vchol-001.reasm.scafSeq.filtered_100.fa > Vchol-001.reasm.scafSeq.filtered_100.stats

Open the new "Vchol-001.reasm.scafSeq.filtered_100.stats" and the original file "Vchol-001.best.stats" and compare the results.

Q11. How much improvement is there in the assembly? Is this what you expect from the number of properly paired reads in Q9?

Now that we are done with the assemblies lets clean up and remove our reads (we are using >7Gb of disk-space)

rm *.fastq
rm *.out

Visualization using circoletto

Lets try to visualize the assembly. Because we know that the data is from a Vibrio cholerae we can try to compare our assembly with the a V.cholerae reference genome. First lets filter the assembly to minimum 500bp.

fastx_filterfasta.py --i Vchol-001.reasm.scafSeq --min 500

Open a browser on your own computer and go to the Circoletto webpage. Circoletto is an easy to use web-program that builds upon Circos - a program to create amazing plots with genomic (and other) data. Open the filtered assembly ("n Vchol-001.reasm.scafSeq.filtered_500.fa") select all and copy-paste it into the upper box (query fasta). Do the same with the reference genome ("n vibrio_cholerae_O1_N16961.fa") and copy-paste it into the box just below. In the "output" section click the "ONLY show the best hit per query" and click "submit to Circoletto". When it has finished processing click "see the Circoletto output".

If it is not working you can see the file here.

You should see the two chromosomes of V. cholerae on the left hand side of the figure (named "gi|...") and then the alignment of our assembly to it. The color of the alignments are the bitscore, the red are the most confident and the black the least confident.

Q12. Do you think that our genome is similar to the reference genome?
Q13. Are there scaffolds/contigs that does not or only partially map to the reference genome?
Q14. What do you think the region on chromosome 2 (the small one) that has many small low-confident mappings could be. Hint: See the V. cholerae genome paper and search for "V. cholerae integron island"?

Bonus assembly stuff

If you are further interested in checking your assembly, it is possible to call SNPs by remapping reads to the assembly. For haploid chromosomes we do not expect any SNPs, so therefore any SNPs that are called are markers of assembly error or problems. Additionally one may also use re-mapping of paired end reads to break contigs where there are no support from read pairs.


Assembly of 454 or Ion Torrent data (Optional)

For 454 and Ion Torrent data we can use Newbler, an assembler made by Roche the company producing the 454 machines. The fact that the error-profile of Ion Torrent data is similar to 454 data, we can get quite good assemblies using Newbler for Ion data.

The commands to run Newbler assembly is this. It can take input as fasta+qual, fasta, sff and fastq. Note, that it if Ion Torrent data is in fastq-format it must be converted to fasta, because else Newbler will think it is Illumina reads and produce a very bad assembly.

setenv PATH /home/local/27626/bin/454/bin:${PATH}
newAssembly outdir
addRun -lib libname outdir single.fasta
addRun -p -lib PE1 outdir paired_1.fasta
addRun -p -lib PE1 outdir paired_2.fasta
runProject -cpu 2 outdir

Lets try with the E.coli data from the German cucumber/sprout outbreak.

mkdir ecoli
cd ecoli
cp /home/local/27626/exercises/denovo/ecoli/* .

454 data as fastq:

setenv PATH /home/local/27626/bin/454/bin:${PATH}
newAssembly asm454
addRun -lib wgs asm454 GOS1.fastq
runProject -cpu 2 asm454

Ion torrent data. First we need to convert the fastq to fasta.

setenv PATH /home/local/27626/bin/454/bin:${PATH}
fastx_fastq2fasta.py --i iontorrent.fq --o iontorrent
newAssembly asmIon
addRun -lib wgs asmIon iontorrent.fasta
runProject -cpu 2 asmIon

Calculate statstics of the assemblies and try to compare them.

assemblathon_stats.pl asm454/assembly/454LargeContigs.fna > asm454.LargeContigs.stats
assemblathon_stats.pl asmIon/assembly/454LargeContigs.fna > asmIon.LargeContigs.stats
Q15. Which of the technologies provided the best assembly of the E. coli strain?

Congratulations you finished the exercise!