Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Alignment exercises

Simon Rasmussen - June 2013


Overview

You will try to run alignment of NGS data:

  1. Blast vs BWA
  2. Inspect and work with BAM-files
  3. Single end alignment of P.aeruginosa Illumina reads
  4. Visualize alignments using IGV
  5. Use bedtools to calculate coverage and work with annotations
  6. Align Illumina paired end reads
  7. Determine insert size for paired end reads
  8. Align and trim 454, Ion Torrent and Solid reads (optional)

Blast vs BWA

Lets try to align 25 Illumina reads using blast and bwa to the human genome. The "time" command infront of the commands will time how much time was actually spent on the job. Note down the time for each command, it is the third number that is written, something like m:s.ms. First go to your work dir, create a folder for the exercise and copy data to there (I wrote down all the commands, but you probably know how to do you).

cd /home/local/ngs_course/studXXX/
mkdir alignment
cd alignment
mkdir blast
cd blast
cp /home/local/27626/exercises/alignment/blast/sample.fasta .
cp /home/local/27626/exercises/alignment/blast/sample.fastq .
cp /home/local/27626/exercises/alignment/blast/sample_2.fastq .
time blastall -p blastn -d /home/local/27626/exercises/alignment/blast/human_g1k_v37.fasta -m8 -i sample.fasta -o sample.m8

Lets do the same for bwa, here we only need to time the first step because the second step will write out the time used for us

time bwa mem /home/local/27626/exercises/alignment/blast/human_g1k_v37.fasta sample.fastq > sample.sam
Q1. Which one was the fastest?

Generally alignment using BWA is done like this.

bwa index ref.fa
bwa mem ref.fa reads.fastq > aln.sam

BAM-files

Lets take a look at the BAM-file, the SAM-format consists of two sections, header and alignments. Lets look at the header and the alignment of the alignment we made before (-h tells samtools to also show the header).

samtools view -Sb sample.sam > sample.bam
samtools view -h sample.bam | less -S

Header lines starts with "@", "@SQ" are all the sequences you mapped against (in your reference fasta file). For the alignment the fields are: Read name, flag, reference it mapped to, position, mapping quality, cigar, mate reference map, mate position, template length, read sequence and read qualities. See SAM-specification for a thorough description.

You should see that the first read is mapped to chr21 at around 45Mb. Additionally the flag is "0" meaning that it mapped on the forward strand, other flag values are "4" = unmapped and "16" = mapped reverse strand. A translator for the flags can be found here explain-flags. Try typing in 16 and see what you get out of it. You also see that the first read has a mapping quality of 60 (this is phred-scale) and the "cigar" is "100M". This means 100 matches or mismatches. You can see the sequence of the read and the qualities as well.

After the qualities there are additional fields that gives information on the alignment, eg. AS:i: means the alignment score for the read, NM:i:3 means that the edit distance for the alignment is 3, eg. we need to change 3 bases in the read to have a perfect match to the reference sequence.

Q2. How many reads are unmapped?

We could also get this using the filtering option in samtools view. Eg adding -f 4 would get us all unmapped reads, getting all reads that are not unmapped (ie. mapped) can be achieved by -F 4:

samtools view -f 4 sample.bam | less -S
samtools view -F 4 sample.bam | less -S

We can also filter on the mapping quality, eg. getting all reads with a mapping quality larger than eg 30, will remove the unmapped reads (mapping quality 0) and the read mapped with mapping quality 4 and 21.

samtools view -q 30 sample.bam | less -S

Lets tell samtools to make a bam-file with only mapped reads by adding -b (output bam).

samtools view -b -F 4 sample.bam > sample.mapped.bam

Paired end mapping

Lets try to add the pairs so that we have 2x25 reads and map these. Paired end reads are mapped like this (EXAMPLE). Remember that the "index" step only needs to be done once.

bwa index ref.fa
bwa mem ref.fa reads_1.fastq reads_2.fastq > aln.sam

Lets try it out, we already have an index of the human genome:

bwa mem /home/local/27626/exercises/alignment/blast/human_g1k_v37.fasta sample.fastq sample_2.fastq | samtools view -Sb - > sample.bam

Now the sample.bam file contains the alignment of both pairs, lets have a look:

samtools view sample.bam | less -S

The pairs are grouped together line-by-line. You can also see the flags have changed, now it also contains information on pairing, whether the paired read was mapped etc. If you want you can look up af few on explain-flags. Three extra columns are now filled, where the "=" means that the pair mapped to the same chromosome, then the mapping position of the pair and the distance between the pairs (the size of the entire DNA fragment).

Lets try to get reads where both pairs map. read unmapped=4, mate unmapped=8, to get the reads that are both unmapped and that the mate is unmapped: 4+8=12

samtools view -b -F 12 sample.bam > pairs_only.bam
samtools view pairs_only.bam | less -S

Finally lets try to recreate the original read files from the bam-file. To do this we are going to use a suite of Java-programs called Picard.

java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -jar /home/local/27626/bin/SamToFastq.jar INPUT=sample.bam \
 FASTQ=sample.remade.fastq SECOND_END_FASTQ=sample_2.remade.fastq

Check the difference of these vs. the original files, the diff program prints out lines that are different among the files:

diff sample.fastq sample.remade.fastq
diff sample_2.fastq sample_2.remade.fastq

P. aeruginosa single end Illumina reads

Lets try some reads from a study of Pseudomonas aeruginosa, an opportunistic pathogen that can live in the environment and may infect humans. Inside humans they may live in the lungs and form biofilms. The reads are from a strain that has infected ~40 patients over the last 30 years in Denmark and here it adapted to the human-host lung environment. We are going to compare it to the PAO1 reference genome which was isolated in 1955 in Australian from a wound (ie. probably it just came from the enviroment) and see which genes it has lost during the adaptation (assuming that the PAO1 genome is a relatively close ancestor).

Remember that these reads were the ones we trimmed yesterday. You can copy your own trimmed data but you also need to copy the files here to get everything we need for the exercise

mkdir paeruginosa
cd paeruginosa
cp -r /home/local1/27626/exercises/alignment/paeruginosa/* .

Alignment of single end Illumina reads

Create index and then run the alignment. You will have noticed that we have added a pipe ("|") and the samtools program after the bwa command. This is because the standard output from bwa is sam, but sam is text-file, and takes up a lot of space so we "pipe" it to samtools and tell it to convert it to bam (binary). Calculate the number of reads that are mapped, note down the number of mapped reads.

bwa index Paeruginosa_PAO1.fna
bwa mem Paeruginosa_PAO1.fna Paeruginosa.fastq.gz | samtools view -Sb - > Paeruginosa.bam
samtools flagstat Paeruginosa.bam

Map the processed reads and compare with the original mapping

Just adapter trimmed

bwa mem Paeruginosa_PAO1.fna Paeruginosa.fastq.cut.gz | samtools view -Sb - >  Paeruginosa.cut.bam

Adapter and quality trimmed

bwa mem Paeruginosa_PAO1.fna Paeruginosa.fastq.cut.trim.gz | samtools view -Sb - > Paeruginosa.cut.trim.bam

Count the number of reads that are mapped with each file, note down the numbers

samtools flagstat Paeruginosa.bam
samtools flagstat Paeruginosa.cut.bam
samtools flagstat Paeruginosa.cut.trim.bam
Q6. Do we get more or less reads mapped?

Lets look at the edit distances for each of the alignments, these are encoded in the BAM-file using the "NM:i: tag. Here we use a "perl-oneliner" to extract the edit distance for each read:

samtools view Paeruginosa.bam | perl -ne 'if ($_ =~ m/NM:i:(\d+)/) {print $1, "\n"}' > Paeruginosa.nm
samtools view Paeruginosa.cut.bam | perl -ne 'if ($_ =~ m/NM:i:(\d+)/) {print $1, "\n"}' > Paeruginosa.cut.nm
samtools view Paeruginosa.cut.trim.bam | perl -ne 'if ($_ =~ m/NM:i:(\d+)/) {print $1, "\n"}' > Paeruginosa.cut.trim.nm

Plot them using R. NB: to exit R, you must type q() and press n.

R
raw = read.table("Paeruginosa.nm")
cut = read.table("Paeruginosa.cut.nm")
trim = read.table("Paeruginosa.cut.trim.nm")
a = as.data.frame(table(raw[,1]))
b = as.data.frame(table(cut[,1]))
c = as.data.frame(table(trim[,1]))
m = merge(merge(a, b, by=1), c, by=1, all=TRUE)
m[is.na(m)] = 0
df = as.data.frame(m[,2:4])
rownames(df) = m[,1]
colnames(df) = c("Raw", "Cut", "Trim")
barplot(as.matrix(t(df)), beside=TRUE, col=rainbow(3), main="Effect of trimming", xlab="Edit distance")
legend("topright", legend=c("raw", "cut", "cut&trim"), col=rainbow(3), pch=16)
dev.print("trimming.effect.pdf", device=pdf)
Q7. Which reads is it that are removed using the quality trimming?
Q8. Do you think it is a good idea to trim the reads?

Alignment visualization

Sometimes it is nice to see things with you own eyes - we are therefore going to visualize the alignments using IGV. To do this we first need to sort the alignment based on mapping position and then index it using samtools.

samtools sort Paeruginosa.cut.trim.bam Paeruginosa.cut.trim.sort
samtools index Paeruginosa.cut.trim.sort.bam
java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -jar -Xmx1250m /home/local/27626/bin/IGV/igv.jar &

If the P.aeruginosa_PAO1 genome is not already loaded then load it by: File, Load Genome from File, then go to "/home/local/27626/exercises/alignment/paeruginosa/" and select "P.aeruginosa_PAO1.genome". Then load your alignment by File, Load from file and go to your working directory (something like: "/home/local/ngs_course/studXXX/alignment/") select "Paeruginosa.cut.trim.sort.bam". Try zooming in (upper right corner). From the top the windows are position on the genome, read depth (grey histogram), read alignments (grey arrows) and gene annotation on the PAO1 genome. Different colored arrows are differences from the reference, some mismatches are seen in all reads, some are seen in only a few, these are Singe-Nucleotide-Polymorphisms and sequencing errors. Try to find the gene "PA0668.1" (search in the white box in the top), these reads are all white because they map to multiple positions in the genome - why do you think this could be?

Q9. There seem to be a SNP at 722,193 (paste in: "gi|110645304|ref|NC_002516.2|:722,193"), would you trust it?
Q10. Try to navigate around the genome, what do you notice?

Lets try to calculate some statistics on the coverage instead of basing it on eye-view, for this we are going to use bedtools. Try to read about bedtools at the link, it is a very useful suite of programs for working with SAM/BAM, BED, VCF and GFF files, files you will encouter many times doing NGS analysis. Lets calculate coverage using bedtools and after plot it using R

bedtools genomecov -ibam Paeruginosa.cut.trim.bam > Paeruginosa.cut.trim.bam.cov
R --vanilla Paeruginosa.cut.trim.bam.cov P.aerugonisa_PAO1 genome < /home/local/27626/bin/plotcov.R

Now open the plot. The left plot is a histogram showing the number of bases covered at a certain depth (reads), whereas the plot on the right is showing the percentage of the genome covered at X depth or more.

acroread P.aerugonisa_PAO1.pdf &
Q11. How much of the genome is not covered with any read and how much of the genome is covered with eg. 10 or more reads?

Of course we only used 1/4 of the data so on average we should have 4 times the coverage if we used all data. Lets look at which regions we are missing, eg. regions in the the PAO1 genome but not in our strain. To do this we calculate the coverage in regions by adding the "-bga" option to bedtools genomecov. Afterwards we use grep to get all regions that are not covered.Take a look at both files (Paeruginosa.cut.trim.bam.cov.bed and Paeruginosa.cut.trim.bam.cov.bed.nocov).

bedtools genomecov -bga -ibam Paeruginosa.cut.trim.bam > Paeruginosa.cut.trim.bam.cov.bed
grep -w 0$ Paeruginosa.cut.trim.bam.cov.bed > Paeruginosa.cut.trim.bam.cov.bed.nocov

Now lets use gene position (PAO1.ann.bed) to get genes with at least 50% missing sequence in our strain

bedtools intersect -f 0.5 -a PAO1.ann.bed -b Paeruginosa.cut.trim.bam.cov.bed.nocov > Paeruginosa.missing.genes

Look into the file, you see chromosome, start, end, gene name, score, strand. NB the score is just set to 1, here it has no meaning. Lets get the annotation of the genes

cut -f4 Paeruginosa.missing.genes | grep -f - Paeruginosa.cds2ann > Paeruginosa.missing.genes.ann
Q12. How many missing genes are there, and what are their function, you should see a lot of hypothetical proteins and eg. bacteriophage proteins. Could we expect this?

Try to search for some of the genes in IGV, you should be able to see the regions of the genes that have been lost during adaptation.

Q13. Do you find eg. colicin and pyocin proteins in the list? Try to google these and see what they do.

Illumina Paired end reads

Finally let us look at some paired end Illumina reads, these reads are from a ~40X wgs of an asian individual. There are two different libraries - one called "_A_" and another library called "_B_". To save time we are not going to use all reads, we are only going to map them to chr21 and additionally the reads are already filtered so that we only have reads that map chr21. Create a directory for the exercise and copy the data to there. Recall that we trimmed the "_A_" reads yesterday.

mkdir human_pe
cp /home/local/27626/exercises/alignment/human/* human_pe
cd human_pe

Lets start the alignment, start by indexing chr21 and then afterward align the paired end files. We are going to add something called a read group. This should be actually be added all the time (we forgot to do it with the P.aeruginosa reads), this writes information on unique ID for the lane, which sample it is from, which platform was used etc

bwa index chr21.fasta
bwa mem -R "@RG\tID:libA\tSM:HG00418\tPL:ILLUMINA" chr21.fasta HG00418_A_1.trim.fastq-common.out.gz \
 HG00418_A_2.trim.fastq-common.out.gz | samtools view -Sb - > HG00418_A.bam

Lets calculate the coverage that we have on chr21, then we first need to sort and then calculate coverage using bedtools (just as we did with the P.aeruginosa reads).

samtools sort HG00418_A.bam HG00418_A.sort
bedtools genomecov -ibam HG00418_A.sort.bam > HG00418_A.chr21.cov
R --vanilla HG00418_A.chr21.cov HG00418_A.chr21 21 < /home/local/27626/bin/plotcov.R
acroread HG00418_A.chr21.pdf

Ok, we dont have very high depth lets map the HG00418_B reads as well. These I have already trimmed and checked with cmpfastq.pl:

bwa mem -R "@RG\tID:libB\tSM:HG00418\tPL:ILLUMINA" chr21.fasta HG00418_B_1.trim.fastq-common.out.gz \
 HG00418_B_2.trim.fastq-common.out.gz | samtools view -Sb - > HG00418_B.bam

Estimate insert size

We will try to calculate the insert sizes of the two different libraries from the bam files, sometimes when you get data you do not know the insert sizes used, then this can be very helpful. Recall that the columns in the bam-file are: Read name, flag, reference it mapped to, position, mapping quality, cigar, mate reference map, mate position, template length, read sequence, read qualities and the different tags, among here the read group (RG) and the edit distances (NM).

Open the bam-file for library A, the insert size is the number just before the read sequences. You see some are negative other positive, this depends on the direction of the pairs and whether they map on the positive or negative strand.

samtools view HG00418_A.bam | less -S

Extract all insert_sizes - we use cut to only take column 9:

samtools view HG00418_A.bam | cut -f9 > HG00418_A.insertsizes.txt
samtools view HG00418_B.bam | cut -f9 > HG00418_B.insertsizes.txt

Lets calculate it in R (remember to quit R, write q() and press n)

R
a = read.table("HG00418_A.insertsizes.txt")
a.v = a[a[,1]>0,1]
# Lets get rid of outliers and use the 5% and 95% intervals to calculate mean and standard deviation:
mn = quantile(a.v, seq(0,1,0.05))[2]
mx = quantile(a.v, seq(0,1,0.05))[20]
mean(a.v[a.v >= mn & a.v <= mx])    # Mean
sd(a.v[a.v >= mn & a.v <= mx])      # SD

b = read.table("HG00418_B.insertsizes.txt")
b.v = b[b[,1]>0,1]
mn = quantile(b.v, seq(0,1,0.05))[2]
mx = quantile(b.v, seq(0,1,0.05))[20]
mean(b.v[b.v >= mn & b.v <= mx])    # Mean
sd(b.v[b.v >= mn & b.v <= mx])      # SD
Q14. What are the mean and standard deviation of the insert sizes for the two libraries?

We will use the data generated here in a later exercise for SNP calling


Alignment of 454, Ion Torrent and Solid data (optional)

The 454 and Ion Torrent are from the German E.coli outbreak (cucumber/sprouts) during the summer 2011 (these were the ones we trimmed yesterday). Unfortunately there are no Solid data from the outbreak, but we will try to use data from resequencing of an E.coli K12 strain (one of the most used lab-strains). We will map the data to another E. coli O104 strain (55989) which is fairly close to the German outbreak strain.

Copy data the data and take a look at the reads:

mkdir ecoli
cd ecoli
cp /home/local/27626/exercises/alignment/ecoli/* .
gzcat file | less -S

You will see that the solid data is in two files, csfasta and qual. This is a pretty bad way of representing data, because the qualities are written as numbers and not encoded. Also see that it starts with a "T" and then numbers, this is the colorspace encoding. Lets convert it to csfastq. After it is done take a look at the "solid_ecoli.fastq.gz". This is now a fastq-file in cs format. Remove the original solid_F3 files.

solid2fastq solid_F3.csfasta.gz solid_F3_QV.qual.gz -z -Z -o solid_ecoli

Since we trimmed them yesterday lets start alignment (normally we dont trim the solid reads). For 454 and iontorrent we are going to use bwa - the bwa mem can be used to align relatively short reads (>70bp), long reads, contigs or similar. The "-t 1" stands for using one thread - if we increase the number it will run in parallel and therefore faster.

# index reference
bwa index -a bwtsw Escherichea_coli_55989.fna

# perform the alignment (1 min each). 
bwa mem -t 1 Escherichea_coli_55989.fna GOS1.trim.fastq.gz | samtools view -Sb - > GOS1.bam
bwa mem -t 1 Escherichea_coli_55989.fna iontorrent.fq.gz | samtools view -Sb - > ion.bam

Take a look at the mapped reads using samtools, notice all the indels (in the cigar string, "I" and "D", for the 2nd mapped read in GOS1 it looks like this: 21M1I23M1D49M2D7M1D279M1D10M28S). Remember that indels are the main error in 454/Ion torrent reads (at homopolymer runs).

samtools view -F 4 GOS1.bam | less -S
samtools view -F 4 ion.bam | less -S

Alignment of Solid reads

For the solid data we are going to use another mapper. Both bwa and bowtie can map solid data, but it is not very good. We are going to use SHRIMP which does hashed mapping. The downside of SHRiMP is that it is slower and uses a lot more memory compared to eg bwa and bowtie, but for colorspace data we really need the precision. Eg. the index for bwa and bowtie for human genome is around 3-4Gb, but SHRiMP uses ~48Gb. This can be handled by dividing the index into sub-indices and then merging them later on. For now, the ecoli index is not that big so we will just do it in one go. Lets start by creating the index.

gmapper-cs -S Escherichea_coli_55989.fna.cs Escherichea_coli_55989.fna

Perform alignment. Here we only want the best report best hit (-o 1 --single-best-mapping), we want sam output (--sam), input is in fastq (--fastq), minimum quality to map a read is 10 (--min-av-qv 10), also write unaligned reads (--sam-unaligned) and use the index we created before (-L ecoli_O157_H7.cs)

gmapper-cs -o 1 --sam --fastq --single-best-mapping --min-avg-qv 10 --sam-unaligned \
-L Escherichea_coli_55989.fna.cs solid_ecoli.fastq.gz | samtools view -Sb - > solid.bam

Look at the solid alignments, you see they are short (35bp), we also dont see that many indels as with the 454 and Solid data. Remember that these reads are from K-12, which is much more distant to the reference genome we are using compared to the german outbreak reads (454 and Ion Torrent).

samtools view -F 4 solid.bam | less -S

Now that we have mapped the cs reads to the genome they can be translated to base space, the BAM-file will now have base-versions of the reads as well as the colorspace. Also you see both a NM:i: and a CM:i: tag, this is the edit-distances for the translated basespace and the original colorspace reads. Lets make the alignments read for IGV.

samtools sort GOS1.bam GOS1.sort
samtools sort ion.bam ion.sort
samtools sort solid.bam solid.sort
samtools index GOS1.sort.bam
samtools index ion.sort.bam
samtools index solid.sort.bam

If IGV is not already running, start it again

java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -jar -Xmx1250m /home/local/27626/bin/IGV/igv.jar &

Import the E_coli_55989 genome from file ("/home/local/27626/exercises/alignment/ecoli/E_coli_55989.genome") and load the *.sort.bam files you just made ("/home/local/27626/ngs_course/studXXX/alignment/") - you can use control to click multiple at the same time. Zoom in to a region and you should be able to see the differences here as well (indels are short black lines).

Q15. What are the main difference you see between the reads?

NB. We only used 1/2 of the 454 and iontorrent data and only 1/25 of the Solid data (!) for the exercise.


Congratulations you finished the exercise!