Genomic Epidemiology exercise
Simon Rasmussen, Shinny Leekitecharoenphon - June 2012
Overview
You are a bioinformatician working with epidemiology in Copenhagen. In the last couple of days more and more people have been admitted to hospitals in the region with severe bacteremia (bacteria in the bloodstream), and several patients, especially infants and young children, have had fatal outcomes. Luckily, the probable causative bacterium has been isolated and sequenced on a desktop sequencer, and now it is your job to find out:
- What is it?
- Have we seen it before?
- How can we treat it (is it drug-resistant)?
Preparation
Go to your folder and make a copy of the data
cd /home/local/ngs_course/studXXX/
mkdir genomic_epi
cd genomic_epi
cp /home/local/27626/exercises/genomic_epi/ref* .
cp /home/local/27626/exercises/genomic_epi/unk* .
cp /home/local/27626/exercises/genomic_epi/data/reads* .
Preprocessing
Let's look at the number of reads that we have and the quality of the data. For the fastx_readlength.py command the output is "avg readlength, min length, max length, no. reads and no. bases".
fastx_readlength.py --i reads_R1.fastq.gz --gz
fastx_readlength.py --i reads_R2.fastq.gz --gz
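If you want to cross-check these numbers yourself, here is a small sketch (assuming standard 4-line fastq records) that counts reads and bases directly, shown for R1:
zcat reads_R1.fastq.gz | awk 'NR % 4 == 2 {reads++; bases += length($0)} END {print reads" reads, "bases" bases"}'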
Q1. How many reads and bases do we have in total?
mkdir fastqc
fastqc -o fastqc reads_R*.fastq.gz
firefox fastqc/reads_R1_fastqc/fastqc_report.html &
firefox fastqc/reads_R2_fastqc/fastqc_report.html &
Let's trim the data. The sequencing adapters and primers used are shown below (even though there are no overrepresented sequences in R2):
Adaptor for R1: ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC
Adaptor for R2: GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC
cutadapt -b ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC \
-b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC -O 3 -m 40 -q 20 reads_R1.fastq.gz > reads_R1.cut.fastq &
cutadapt -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC \
-b ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC -O 3 -m 40 -q 20 reads_R2.fastq.gz > reads_R2.cut.fastq &
Then we need to make sure that the fastq files are in sync, i.e. that both files contain the same read pairs in the same order:
cmpfastq.pl reads_R1.cut.fastq reads_R2.cut.fastq
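cmpfastq.pl writes the synchronized reads to the *-common.out files used in the following steps. As an optional sanity check, the two synchronized files should now contain the same number of reads:
echo $(( $(wc -l < reads_R1.cut.fastq-common.out) / 4 )) $(( $(wc -l < reads_R2.cut.fastq-common.out) / 4 ))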
Error correction
Count kmers and let's plot the kmer distribution - again we count 15-mers because we are looking at a bacterial genome.
jellyfish count -t 2 -m 15 -s 1000000000 -o reads_jellyfish -C reads_R*.cut.fastq-common.out
jellyfish histo reads_jellyfish_0 > reads.histo
R
dat=read.table("reads.histo")
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8)
dev.print(device=pdf, file="reads.histo.pdf")
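Note that the k-mer peak slightly underestimates the per-base coverage, since a read of length L contains only L - k + 1 k-mers. After quitting R (q()), back in the shell you can convert the peak into approximate genome coverage with the small sketch below (replace PEAK and READLEN with the peak position from the plot and the average read length from Q1):
awk -v peak=PEAK -v L=READLEN -v k=15 'BEGIN {print "approx. genome coverage:", peak * L / (L - k + 1)}'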
Q2. Where is the peak? Do we have enough coverage of the genome to correct the reads (e.g. above 15X)?
Let's correct the reads using Quake. Here we use the kmer distribution from above to separate "true" kmers from "error" kmers, and then pass through the reads, correcting those that contain error kmers.
# Run Quake #
echo "reads_R1.cut.fastq-common.out reads_R2.cut.fastq-common.out" > file_list
quake.py -f file_list -k 15 -p 2 --header
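Quake writes the corrected reads to reads_R1.cut.fastq-common.cor.out and reads_R2.cut.fastq-common.cor.out (these names are used in the next steps). If you are curious how many reads survived correction, you can rerun the read statistics on the corrected files (assuming fastx_readlength.py also accepts uncompressed fastq when --gz is omitted):
fastx_readlength.py --i reads_R1.cut.fastq-common.cor.out
fastx_readlength.py --i reads_R2.cut.fastq-common.cor.out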
What is it?
Identify species
Let's identify the species. Here we can use a program developed by the Center for Genomic Epidemiology to quickly identify species from either reads or an assembly. It uses a database of 16S rRNA genes and maps the reads to this database. The reads that map to the 16S rRNA database are then used to perform a de novo assembly, returning a full-length 16S rRNA sequence. This is then compared with BLAST to known 16S rRNAs, and the output is written to ssu.out.
speciesfinder.py reads_R1.cut.fastq-common.cor.out --Mbases 20 --n 2 --tax --sample pathogenic_bacteria
When it is done, open the output file (inside the pathogenic_bacteria folder, called ssu.out); the first line is the best hit. The second column is whether we can trust the prediction or not. Which species do we think it is? This particular species is known to cause gastroenteritis, with more than 90 million cases globally each year and around 155,000 deaths. Most human infections are self-limiting; however, around 5% of patients will develop bacteremia - and our strain seems to be even more pathogenic.
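A quick way to look at just the best hit from the command line (assuming, as described above, that the best hit is on the first line of ssu.out):
head -n 1 pathogenic_bacteria/ssu.out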
de novo assembly
OK, now we know what species it is - but this species is found everywhere on the planet, so maybe we should see if we can identify it more precisely within the species. For this we can use MLST (Multi Locus Sequence Typing), which is a method used in clinical epidemiology to type different sub-types of bacterial species. Instead of using just one locus (= gene), as is done with 16S rRNA, it uses multiple loci to identify the type. Luckily we have a server that can do that for us, but first we need to assemble the genome.
Let's start by making an initial de novo assembly to identify the insert size for the final assembly.
SOAPdenovo-127mer pregraph -s unknown.soap.conf -K 35 -p 2 -o initial
SOAPdenovo-127mer contig -g initial
Now you should have a file called "initial.contig". We need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Let's only map a subset of the reads (the first 100,000 lines of each fastq file, i.e. 25,000 reads from each) - this should be enough.
head -n 100000 reads_R1.cut.fastq-common.out > reads_sample_1.fastq
head -n 100000 reads_R2.cut.fastq-common.out > reads_sample_2.fastq
bwa index initial.contig
bwa mem initial.contig reads_sample_1.fastq reads_sample_2.fastq | samtools view -Sb - > initial.sample.bam
samtools view initial.sample.bam | cut -f9 > initial.insertsizes.txt  # column 9 of the SAM format is TLEN, the observed insert size
R
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]                     # keep only positive insert sizes
mn = quantile(a.v, seq(0,1,0.05))[4]   # 15% quantile
mx = quantile(a.v, seq(0,1,0.05))[18]  # 85% quantile
mean(a.v[a.v >= mn & a.v <= mx]) # mean
sd(a.v[a.v >= mn & a.v <= mx]) # sd
Update the unknown.soap.conf with the correct insert size and run a de novo assembly of the reads using K=65 (this is a fairly good K for this data).
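If you are unsure which field to edit: in the standard SOAPdenovo config format the insert size is set with avg_ins in the [LIB] section. A minimal sketch of what such a config can look like is shown below - the provided unknown.soap.conf may differ in details, so only change the insert size and leave the read file paths as they already are:
max_rd_len=100                          # maximal read length
[LIB]
avg_ins=XXX                             # replace XXX with the mean insert size estimated above
reverse_seq=0                           # 0 = forward-reverse (standard paired-end)
asm_flags=3                             # use the reads for both contigs and scaffolds
rank=1
q1=reads_R1.cut.fastq-common.cor.out    # illustrative paths - keep whatever is in the provided file
q2=reads_R2.cut.fastq-common.cor.out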
SOAPdenovo-127mer all -s unknown.soap.conf -K 65 -p 2 -o asmK65
fastx_filterfasta.py --i asmK65.contig --min 100
fastx_filterfasta.py --i asmK65.scafSeq --min 100
assemblathon_stats.pl asmK65.contig.filtered_100.fa > asmK65.contig.stats
assemblathon_stats.pl asmK65.scafSeq.filtered_100.fa > asmK65.scafSeq.stats
Look in the asmK65.contig.stats file under the "contigs" section to find the stats for the contigs; likewise, look in the asmK65.scafSeq.stats file under the "scaffold" section to find the stats for the scaffolds.
Q3. How many contigs and scaffolds do you get and what are the N50?
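If you want to cross-check the N50 by hand: N50 is the length of the contig at which, going from the longest contig downwards, the cumulative length first reaches half the total assembly size. A sketch using the fasta index (contig lengths are in column 2 of the .fai file):
samtools faidx asmK65.contig.filtered_100.fa
sort -k2,2nr asmK65.contig.filtered_100.fa.fai | awk '{len[NR]=$2; tot+=$2} END {half=tot/2; for(i=1;i<=NR;i++){cum+=len[i]; if(cum>=half){print "N50 =", len[i]; exit}}}'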
MLST
OK, now that we have an assembly, let's see if we can identify an MLST type for our strain. This works by finding the particular MLST alleles in our genome and comparing them to a table to see if there is a match with a known sequence type. Here we are going to use the command-line version, but we could just as well have used the webserver. Figure out what to use as the species name (-d option) by running mlst.pl without any inputs, and make sure to exchange assembly.fa for the scaffold sequences from our assembly.
mlst.pl -d speciesname -i assembly.fa > mlst.res
The run takes around 5 minutes to complete. When it is complete, open the file and look at the output - there are alignments of our assembly vs. the closest matches in the MLST database. You will see that for this particular species there are 7 loci, and it is the combination of these that makes up the sequence type. We also require a 100% match over the entire allele to assign a sequence type.
Q4. Do we have 100% matches to a sequence type?
Q5. What is the sequence type (ST-XXX) for our pathogen?
This particular sequence type is known to be present in sub-Saharan Africa and to have increased mortality in children - in some countries with a higher death toll than malaria. This is a dangerous sequence type.
SNP based phylogeny
We can now assume that our pathogen comes from sub-Saharan Africa, but we do not know where in particular the strain is from. To get a higher resolution we can build a phylogeny based on SNPs and compare our strain to other strains. First, let us map the reads to the ST-XXX reference genome.
bwa index reference.fa
bwa mem -t 2 -R "@RG\tID:ST313\tSM:ST313\tPL:ILLUMINA" reference.fa reads_R1.cut.fastq-common.cor.out \
reads_R2.cut.fastq-common.cor.out | samtools view -Sb - > aln.bam
Then let's extract all reads with a mapping quality of 30 or better, and sort and index the alignments. Thereafter, let's calculate the coverage and plot it using R.
samtools view -u -q 30 aln.bam | samtools sort - aln.sort
samtools index aln.sort.bam
bedtools genomecov -ibam aln.sort.bam > aln.cov
R --vanilla aln.cov aln.cov genome < /home/local/27626/bin/plotcov.R
acroread aln.cov.pdf &
Q6. How much of the genome is covered at minimum 10X?
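You can read this off the plot, or compute it directly from the genomecov histogram - in the "genome" rows of aln.cov the columns are: genome, depth, bases at that depth, genome size, and fraction of the genome at that depth:
awk '$1 == "genome" && $2 >= 10 {frac += $5} END {print frac}' aln.cov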
Let's process the alignments - recall the lectures and exercises on alignment post-processing and variant calling. We will start with removing duplicates and indexing the de-duplicated bam file:
java -Xmx2g -jar /home/local/27626/bin/MarkDuplicates.jar INPUT=aln.sort.bam OUTPUT=aln.sort.rmdup.bam \
ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
samtools index aln.sort.rmdup.bam
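Since MarkDuplicates writes its metrics file to /dev/null here, a simple way to see how many reads were removed as duplicates (for Q7 below) is to compare the read counts before and after de-duplication:
samtools view -c aln.sort.bam
samtools view -c aln.sort.rmdup.bam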
Then we are going to perform local realignment around indels:
ln -s /home/local/27626/bin/GenomeAnalysisTK-2.5-2/GenomeAnalysisTK.jar .
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xmx2g -jar GenomeAnalysisTK.jar \
-I aln.sort.rmdup.bam \
-R reference.fa \
-T RealignerTargetCreator \
-o IndelRealigner.intervals
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xmx2g -jar GenomeAnalysisTK.jar \
-I aln.sort.rmdup.bam \
-R reference.fa \
-T IndelRealigner \
-targetIntervals IndelRealigner.intervals \
-o aln.sort.rmdup.realn.bam
Q7. How many reads were removed as duplicates?
Now we are ready to call SNPs. Because we don't have a catalogue of known variation for our species, we are not going to do Base-Quality-Score-Recalibration (BQSR). To call variants we will use the HaplotypeCaller in GATK, and let's use a minimum posterior probability of 50 (phred-scaled) to report a SNP:
samtools index aln.sort.rmdup.realn.bam
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xmx2g -jar GenomeAnalysisTK.jar \
-R reference.fa \
-T HaplotypeCaller \
-I aln.sort.rmdup.realn.bam \
-o var.raw.vcf \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0
Q8. How many variant calls did we get? How many of those were indels (Hint: look at the REF and ALT columns)?
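A sketch for Q8: count the total number of calls, and the calls where REF or ALT is longer than one base (a simple, approximate indel test):
grep -vc "^#" var.raw.vcf
grep -v "^#" var.raw.vcf | awk 'length($4) > 1 || length($5) > 1' | wc -l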
Now we need to filter the SNPs. Because we don't have a catalogue of known variation we can't use the soft-filtering approach (Variant-Quality-Score-Recalibration); instead we will use hard filters. Let us require a minimum depth of 10, remove calls within 5 nt of each other, and also remove the indel calls - we are not going to use them for the phylogeny:
vcf-annotate --filter d=10/Q=50/c=2,5 var.raw.vcf > var.hard.vcf
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xmx2g -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
--variant var.hard.vcf \
-o var.hard.snps.vcf \
-selectType SNP
Because we are working with a haploid genome, we will remove all heterozygous SNP calls to get the final variant calls.
grep "#" var.hard.snps.vcf > header.vcf
grep "PASS" var.hard.snps.vcf | grep -v "0/1" | cat header.vcf - > var.hard.snps.final.vcf
grep -v "#" var.hard.snps.final.vcf | wc -l
Q9. How many SNPs do we have that pass the quality controls (last line above)?
Q10. Why do we need to filter heterozygote SNPs on haploid genomes?
Now we have the SNPs for our outbreak strain - and we also have SNPs for several other strains, as shown below. Let's create a phylogenetic tree using FastTree. Here we extract all SNPs from our vcf file and from vcf files of other ST313 strains from The Democratic Republic of Congo and Nigeria.
vcfwiz.pl -o all var.hard.snps.final.vcf /home/local/27626/exercises/genomic_epi/vcfs/*.gz
Let's look at the tree in FigTree. Open the program (below), choose "File", "Open" and select "all.main_tree.newick". Try choosing a different layout and see how it looks. You can get an overview of where the strains are from by looking at "/home/local/27626/exercises/genomic_epi/strain.info".
java -Xmx512m -jar /home/local/27626/bin/FigTree_v1.3.1/lib/figtree.jar &
Q11. Which strain is the closest to ours?
Let's see how many of our SNPs overlap - replace "X" below with the closest strain:
bedtools intersect -a var.hard.snps.final.vcf -b /home/local/27626/exercises/genomic_epi/vcfs/X.var.flt.vcf.gz | wc -l
Q12. How many overlap, and how many did we have in total (Q9)?
What would you use this information for to help track down the source of the outbreak?
How can we treat it?
Now that we have the de novo assembly of our strain, let's look for resistance genes. If we find any, then we know what not to treat the patients with. Here we will use another webserver from the CGE project, ResFinder. Download the assembly to your own computer (use SFTP on Windows, scp on Mac) and upload it to the server. Choose a threshold of 95% and submit the job.
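For example, from a terminal on your own machine (the server name below is a placeholder - replace it and studXXX with the actual course server and your user name):
scp studXXX@courseserver.example.org:/home/local/ngs_course/studXXX/genomic_epi/asmK65.scafSeq.filtered_100.fa .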
Q13. Are there any resistance genes in our outbreak strain?
The next step would be to investigate the differences between this strain and similar but non-pathogenic strains to see if we could find the cause of the increased virulence, but that will be for another time.
Congratulations, you finished the exercise!