
Metagenomic assembly

Simon Rasmussen - June 2013


Overview

In this exercise we will try to de novo assemble a metagenomic dataset of Illumina paired-end reads from stool samples. The data is part of the MetaHIT project and was published here. These are just two of the 124 samples; in the next exercise (tomorrow) you will analyze data from all 124 samples. You will try to:

  1. Analyze k-mer distributions
  2. Assemble using SOAPdenovo and MetaVelvet (well, it is already done)
  3. Analyze output from metagenomic assemblies
  4. Gene prediction and clustering on metagenomic data
  5. Create a gene abundance matrix

de novo assembly of metagenomic data


Analyze k-mer distributions

Assembly of metagenomic data is much harder than assembly of single-organism datasets. Here we will use SOAPdenovo, as we did in the de novo assembly exercise, and MetaVelvet. Let us first look at the k-mer coverage of our data. We have already counted the k-mers for you; otherwise you could use jellyfish as we did for the V. cholerae exercise, for example as sketched below.
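If you ever need to count the k-mers yourself (e.g. for your own project), a minimal hedged sketch with jellyfish could look like this; the read file names, the k-mer size and the hash size are illustrative assumptions, not the exact settings used to make the histograms below:

# count canonical 17-mers in the reads of a sample (file names are hypothetical)
jellyfish count -m 17 -s 1G -t 2 -C -o MH0047.jf MH0047.reads.1.fq MH0047.reads.2.fq
# summarise the counts as a histogram (k-mer coverage vs. frequency)
jellyfish histo MH0047.jf > MH0047.histo

Now copy the pre-computed histograms to a working folder and plot them: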

mkdir meta
cd meta
cp /home/local/27626/exercises/metagenomics/kmer_counts/MH0047.histo .
cp /home/local/27626/exercises/metagenomics/kmer_counts/MH0032.histo .

R
pdf(file="kmerFreq.pdf", width=12, height=12)
par(mfrow=c(2,1))
d = read.table("MH0047.histo")
barplot(d[,2], main="MH0047 - Kmer Frequencies", xlab="K-mer coverage", xlim=c(1,400), ylim=c(0,1e6))

d = read.table("MH0032.histo")
barplot(d[,2], main="MH0032 - Kmer Frequencies", xlab="K-mer coverage", xlim=c(1,400), ylim=c(0,1e6))
dev.off()

Exit R and open the plot using acroread.

acroread kmerFreq.pdf &
Q1. Can you identify the peak(s) in the Gaussian (bell-shaped) distributions in the two plots?

You should see that, compared to the V. cholerae sample we used in the de novo exercise, we have not got the same nice peak. The x-axis in the plots goes from 1 to 400, so there seems to be something with quite high coverage in the MH0032 sample; however, by far most of the sequences cannot be separated from each other in terms of abundance.

Q2. Can you think of why the coverage distributions look like this, with many sequences at "lower" coverage?
Q3. Given that MetaVelvet works by dividing the de Bruijn graph using coverage peaks, do you think it will perform better or worse compared to SOAPdenovo?

de novo assembly

Unfortunately, because the assemblies take 30-45 minutes and use 10-25 GB of RAM each, you will not run them in this exercise, but you can see the code that I used to run them here, in case you need to assemble some genomes for the projects.
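For reference only, the commands could look roughly like the sketch below (do not run this in the exercise). The config values, insert size and k-mer size are illustrative assumptions, not necessarily the settings used for the assemblies you will copy:

# hypothetical SOAPdenovo library configuration (soap.config)
max_rd_len=100          # maximum read length
[LIB]
avg_ins=350             # average insert size of the paired-end library
reverse_seq=0           # reads are forward-reverse
asm_flags=3             # use the reads for both contig building and scaffolding
q1=MH0047.reads.1.fq.gz
q2=MH0047.reads.2.fq.gz

# SOAPdenovo: "all" runs pregraph, contig, map and scaff in one go, here with k=31
SOAPdenovo-63mer all -s soap.config -K 31 -o MH0047_d0 -p 4

# MetaVelvet: build the graph with Velvet, then split it by coverage with meta-velvetg
velveth MH0047_velvet 31 -fastq.gz -shortPaired -separate MH0047.reads.1.fq.gz MH0047.reads.2.fq.gz
velvetg MH0047_velvet -exp_cov auto -ins_length 350
meta-velvetg MH0047_velvet -ins_length 350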

Instead copy the contigs and scaffolds to your folder and filter for minimum 100 bp:

cp /home/local/27626/exercises/metagenomics/MH0047_soap/MH0047_d0.scafSeq MH0047.soap.fa
cp /home/local/27626/exercises/metagenomics/MH0047_velvet/meta-velvetg.contigs.fa MH0047.velvet.fa
fastx_filterfasta.py --i MH0047.soap.fa --min 100
fastx_filterfasta.py --i MH0047.velvet.fa --min 100

OK, now let's calculate the coverage of the MetaVelvet and SOAPdenovo assemblies and plot them.

fastx_soapcov.py --i MH0047.soap.fa.filtered_100.fa > MH0047.soap.cov
fastx_velvecov.py --i MH0047.velvet.fa.filtered_100.fa > MH0047.velvet.cov

R
library(plotrix)
par(mfrow=c(1,2))
dat=read.table("MH0047.soap.cov", sep="\t")
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="SOAPdenovo - Weighted coverage", xlab="Contig coverage")
dat=read.table("MH0047.velvet.cov", sep="\t")
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="MetaVelvet - Weighted coverage", xlab="Contig coverage")
dev.print("MH0047.coverage.pdf", device=pdf)
Q4. Are there any differences between the two assemblies and what can you tell about the assembly from the coverage distributions?

Finally, let's look at the stats for the two assemblies. The paste command below will put the two files side by side so that we can compare the results directly.

assemblathon_stats.pl MH0047.soap.fa.filtered_100.fa > MH0047.soap.fa.filtered_100.asm
assemblathon_stats.pl MH0047.velvet.fa.filtered_100.fa > MH0047.velvet.fa.filtered_100.asm
paste MH0047.soap.fa.filtered_100.asm MH0047.velvet.fa.filtered_100.asm | less -S
Q5. Are there any large differences in the stats of the two assemblies?

Gene predictions and clustering

Let's try to predict genes; we do this using Prodigal. Together with MetaGeneMark it is (I think) the best and fastest prokaryotic gene finder available. We use it in the metagenomic setting by specifying "-p meta" and output the predictions as DNA and in GFF format. Let's use the MetaVelvet assembly as it has the fewest Ns; this way we don't get genes filled with Ns. For the sake of time, let's also only use the contigs larger than 500 bp (normally you could go down to 100 bp). NB: each gene prediction takes 3-4 minutes and we run them in the background (the "&"); wait for the commands to finish.

cp /home/local/27626/exercises/metagenomics/MH0047_velvet/meta-velvetg.contigs.fa MH0047.velvet.fa
cp /home/local/27626/exercises/metagenomics/MH0032_velvet/meta-velvetg.contigs.fa MH0032.velvet.fa
fastx_filterfasta.py --i MH0032.velvet.fa --min 500
fastx_filterfasta.py --i MH0047.velvet.fa --min 500

prodigal -p meta -f gff -d MH0047.prodigal.fna -i MH0047.velvet.fa.filtered_500.fa -o MH0047.prodigal.gff &
prodigal -p meta -f gff -d MH0032.prodigal.fna -i MH0032.velvet.fa.filtered_500.fa -o MH0032.prodigal.gff &

Let's look at how many genes were predicted:

grep ">" -c MH00*.prodigal.fna
Q6. How many genes are there in the samples? Consider that humans have about 20-30k genes.

To be able to create a count matrix of the gene abundances we must create a common set of genes across all the samples we want to compare (in our case only two). To do this we can use cd-hit, which will cluster the genes based on sequence similarity and select a representative gene from each cluster that we will then use. First we combine the two gene sets and then cluster them. We use a sequence identity threshold of 95% and require that the alignment covers 90% of the shorter sequence, as shown below. However, as the clustering takes 40-50 minutes, you can simply copy the file I made:

# code for making it yourself
cat MH0047.prodigal.fna MH0032.prodigal.fna > combined.prodigal.fna
cd-hit-est -i combined.prodigal.fna -o combined.prodigal.cdhit.fna -c 0.95 -n 8 -l 100 -aS 0.9 -d 0 -B 0 -T 2 -M 10000

# code to copy
cp /home/local/27626/exercises/metagenomics/combined.prodigal.cdhit.fna .

Create gene abundance matrix

OK, now that we have our common set of genes we can determine the abundance of each gene in our samples; we do that by mapping the reads from our samples to the common gene set. This we will do using bwa, but for this exercise we will only use a subset of 0.5 million reads from each of the libraries (NB: this is only 1% of the total number of reads):

bwa index combined.prodigal.cdhit.fna

ln -s /home/local/27626/exercises/metagenomics/sub/* .

bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0032_081224.1.fq.gz MH0032_081224.2.fq.gz | samtools view -Sb - > MH0032_081224.bam
bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0032_091021.1.fq.gz MH0032_091021.2.fq.gz | samtools view -Sb - > MH0032_091021.bam

bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0047_081223.1.fq.gz MH0047_081223.2.fq.gz | samtools view -Sb - > MH0047_081223.bam
bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0047_090201.1.fq.gz MH0047_090201.2.fq.gz | samtools view -Sb - > MH0047_090201.bam

Now that we have mapped the reads back to the gene set we can start counting. However, before we start we need to consider what to do with read pairs where both reads map to the same gene.

Q7. If both reads in a pair map to the same gene we only count it as one hit; can you think of why?
Q8. What should we do if the two reads in a pair map to different genes?

We filter read pairs mapping to the same gene using the read_count_bam.pl script below and then only keep reads that are mapped with a mapping quality better than 30 (-q30 below). After that we sort the BAM files:

samtools view -h MH0032_081224.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort - MH0032_081224.sort
samtools view -h MH0032_091021.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort - MH0032_091021.sort
samtools view -h MH0047_081223.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort - MH0047_081223.sort
samtools view -h MH0047_090201.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort - MH0047_090201.sort

Now let's merge them into one BAM per sample and index them:

samtools merge MH0032.sort.bam MH0032_081224.sort.bam MH0032_091021.sort.bam
samtools merge MH0047.sort.bam MH0047_081223.sort.bam MH0047_090201.sort.bam
samtools index MH0032.sort.bam
samtools index MH0047.sort.bam

Now it is very easy to do the counting using samtools idxstats. It will output four columns: gene name, gene length, no. of mapped reads and no. of unmapped reads. Try it out:

samtools idxstats MH0032.sort.bam | less

Let's create the count matrix then. First we create a header line using "echo", then we get all of the gene names, then the counts, and finally we combine it all into one file:

echo -e "Gene\tMH0032\tMH0047" > header
samtools idxstats MH0032.sort.bam | grep -v "*" | cut -f1 > gene_names
samtools idxstats MH0032.sort.bam | grep -v "*" | cut -f3 > counts1
samtools idxstats MH0047.sort.bam | grep -v "*" | cut -f3 > counts2
paste gene_names counts1 counts2 | cat header - > count_matrix.tab

Take a look at the count matrix using less. Let's also try to load it into R and look at the count distribution between the samples:

less count_matrix.tab

R
d = read.table("count_matrix.tab", sep="\t", header=TRUE, as.is=TRUE)
library(ggplot2)
p = ggplot(d, aes(x=MH0032, y=MH0047)) + stat_binhex(bins=50) + scale_fill_continuous(low="grey50", high="blue")
p + geom_abline(slope=1, intercept=0, col="white") + labs(title="Gene counts")
ggsave("count.hex.pdf")
Q9. What does the overlap between the samples look like? Are there genes in common and/or genes specific to each sample?

Annotation of the count matrix

Now that we have our count matrix, we can annotate the genes according to species or function. Let's try to annotate them according to species.

To do that we need to BLAST all of the genes against a database of organisms that we expect to be present; in our case let's BLAST against all fully sequenced bacteria at NCBI. I downloaded the database on June 6th and it contains the sequences of 4731 bacterial chromosomes and plasmids. Again, the BLAST of our ~100k genes against this database takes >12 hours, so you will not run it; instead you can use the file I made. If you were to run the BLAST yourself, the command is this:

# copy this file to your folder, it is the blast output #
cp /home/local/27626/exercises/metagenomics/combined.prodigal.cdhit.m8 . 

# command to blast genes vs. bacteria - do not run it in the exercise #
# the -a 1 tells it to run using 1 core, you can efficiently go up to 4-6 cores #
blastall -p blastn -d /home/local/27626/exercises/metagenomics/Bacteria.06062013.fna -a 1 -i combined.prodigal.cdhit.fna \
-m 8 -e 1e-2 -o combined.prodigal.cdhit.m8 -b 5 -v 5

Take a look at the BLAST report (combined.prodigal.cdhit.m8). The output is in BLAST tabular format (also known as m8); the fields are: query name, subject name, percent identity, alignment length, no. of mismatches, no. of gap openings, query start, query end, subject start, subject end, e-value and bit score. You should see that there often are several hits per gene (query). We will select only the best hit and require that it has >80% identity over at least 100 bp before we annotate the gene to a species. This is achieved using this Perl one-liner:

perl -ane 'BEGIN{$prev=""}; if ($F[0] eq $prev) { next } else { if ($F[2] > 80 && $F[3] > 100) { print $_}; $prev = $F[0];}' \
combined.prodigal.cdhit.m8 > combined.prodigal.cdhit.best.accepted.m8

Now let's just take columns 1 and 2, because these hold the information about which gene is annotated to which organism, and also copy a file mapping the sequence IDs to species names:

cut -f1,2 combined.prodigal.cdhit.best.accepted.m8 > combined.prodigal.ann
cp /home/local/27626/exercises/metagenomics/Bacteria.06062013.map .

Now we have:

  1. Count matrix with genes (count_matrix.tab)
  2. Annotation of genes to species-ids (combined.prodigal.ann)
  3. Annotation of species-ids to species names (Bacteria.06062013.map)
We will use these to create an abundance measure of the species in our samples using an R script. For each species we determine its abundance as the 75% quantile of the counts of all genes annotated to that particular species:

R --vanilla count_matrix.tab combined.prodigal.ann Bacteria.06062013.map species_matrix.tab < /home/local/27626/bin/create_species_matrix.R
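If you are curious, the core of that computation could look roughly like the R sketch below. It is only an illustration of the 75% quantile idea under the file formats created above, not the actual create_species_matrix.R script (which additionally maps the subject IDs to species names using Bacteria.06062013.map):

# read the gene count matrix and the gene-to-subject annotation (sketch, not the real script)
counts = read.table("count_matrix.tab", header=TRUE, sep="\t", as.is=TRUE)
ann = read.table("combined.prodigal.ann", sep="\t", as.is=TRUE, col.names=c("Gene", "SubjectID"))

# attach a subject (organism) ID to each gene; genes without an accepted blast hit are dropped
m = merge(counts, ann, by="Gene")

# abundance of an organism = 75% quantile of the counts of its genes, per sample
ab32 = tapply(m$MH0032, m$SubjectID, quantile, probs=0.75)
ab47 = tapply(m$MH0047, m$SubjectID, quantile, probs=0.75)
species_matrix = data.frame(SubjectID=names(ab32), MH0032=ab32, MH0047=ab47)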

The script outputs the species matrix and a plot of how many genes were annotated to each species. Take a look at the species matrix and at the plot:

less species_matrix.tab

acroread genes_pr_species.pdf &
Q10. When looking at the plot, were there many species for which we could identify most genes? Can you think of how to improve this?

Let's plot the species abundances in our two samples against each other.

R
library(ggplot2)
d = read.table("species_matrix.tab", header=TRUE, sep="\t")
p = ggplot(d, aes(x=MH0032, y=MH0047)) + stat_binhex(bins=50) + scale_fill_continuous(name="No. species", low="grey50", high="blue")
p + geom_abline(slope=1, intercept=0, col="white") + labs(title="Species counts", x="Abundance in MH0032", y="Abundance in MH0047")
ggsave("species.count.hex.pdf")
Q11. Are there some species with differential abundance between the two samples?

Additional analyses that can be done

We can investigate the abundance of different functional characteristics. This can be achieved by e.g. BLASTing all of our genes against different databases such as NCBI-nt, KEGG, eggNOG, etc. Additionally, one can cluster the genes according to gene abundance and try to use that to identify organisms in the community.


Congratulations, you finished the exercise!