
Data Preprocessing

Simon Rasmussen - June 2013


Overview

We will pre-process several types of NGS data. The output from each dataset will be used in the alignment exercise tomorrow.

  1. P. aeruginosa single end Illumina reads
  2. Human Illumina Paired end reads
  3. E. coli 454 and Ion Torrent data

P. aeruginosa single end Illumina reads

Introduction

Let's try some reads from a study of Pseudomonas aeruginosa, an opportunistic pathogen that lives in the environment and can infect humans. In humans it can colonise the lungs and form biofilms. The reads are from a strain that has infected ~40 patients in Denmark over the last 30 years and has adapted to the human-host lung environment. In the alignment exercise we are going to compare it to the PAO1 reference genome, which was isolated from a wound in Australia in 1955 (i.e. it probably came straight from the environment), and see which genes it has lost during the adaptation (assuming that the PAO1 genome is a relatively close ancestor). But first let's check the quality of the data. Copy the data to your folder (remember to change studXXX to your actual number):

cd /home/local/ngs_course/studXXX/
mkdir preprocess
cd preprocess
cp /home/local1/27626/exercises/alignment/paeruginosa/Paeruginosa.fastq.gz .

Look at the reads. They are gzipped to save space, so we need gzcat, a program that decompresses a gz-zipped file and prints it to the screen. We view the file by piping the output into the pager "less":

gzcat Paeruginosa.fastq.gz | less

Q1. How long are the reads?

Q2. Count the number of reads. This is done by counting the number of lines in the file and dividing by 4 (fastq always uses 4 lines per read).
gzcat Paeruginosa.fastq.gz | wc -l

Calculate the maximal coverage of the P. aeruginosa genome, i.e. how many times do we, on average, cover the genome? This is total bases / genome length. The genome size is 6588339 nt.

Q4. What is the maximal coverage?

We are actually only using 1/4 of the data for the exercise; what coverage would we have if we used all of the data?
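The coverage arithmetic above can be sketched in the shell. All numbers below except the genome size are placeholders (hypothetical): substitute the line count from wc -l and the read length you found in Q1.

```shell
# Placeholder values; replace with what you actually observed.
lines=1000000            # e.g. from: gzcat Paeruginosa.fastq.gz | wc -l
read_len=100             # read length from Q1 (placeholder)
genome=6588339           # P. aeruginosa PAO1 genome size in nt (from the text)
reads=$((lines / 4))                        # 4 lines per FASTQ record
coverage=$(( reads * read_len / genome ))   # total bases / genome length
echo "${reads} reads, ~${coverage}X coverage"
```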

Q5. Which encoding are the qualities in? We need this information when we trim the reads and perform the alignment. Instead of working it out by eye, let's use this little program:
gzcat Paeruginosa.fastq.gz | fastx_detect_fq.py
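For intuition, here is a rough manual check (a sketch, not what fastx_detect_fq.py does internally): find the smallest ASCII value in a quality string. Any value below 59 can only occur in Phred+33, while strings confined to '@'..'h' are consistent with Phred+64. The quality string below is made up.

```shell
# Hypothetical quality string from one read; '#' has ASCII value 35.
qual='CCCFF#FFHHHHHJJJJJ'
# od prints the decimal byte values; sort them and take the smallest.
min=$(printf '%s' "$qual" | od -An -tu1 | tr -s ' ' '\n' | grep -v '^$' | sort -n | head -n 1)
if [ "$min" -lt 59 ]; then echo "Phred+33"; else echo "Phred+64 (or ambiguous)"; fi
```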

FastQC

Let's take a look at the reads and see if they need trimming of adaptors or low-quality bases. We will run FastQC on the reads and view the report using Firefox:

mkdir fastqc
fastqc -o fastqc Paeruginosa.fastq.gz
firefox fastqc/Paeruginosa_fastqc/fastqc_report.html &

The program outputs several statistics in the report; look at each of the figures and see if any of them flags something we should look at (red or white cross). The most interesting figure is the first, where the quality of each base is plotted as you move from the beginning of the read to the end. Here you should see the trailing bad qualities typical of Illumina data. We need to remove these bad bases, as we do not want them to influence our assembly and SNP calls.

Also look for overrepresented sequences (second-to-last plot); these are often adaptors from the sequencing that are also present in the reads. It is especially important to remove these when doing a de novo assembly, as they will overlap between reads and produce wrong assemblies. For alignment they are also troublesome, as aligners often perform global alignment (i.e. they want to align all of the read to the genome).

Look at the "Per base sequence quality" plot: it shows the distribution of quality scores in the reads as you move from the 5' to the 3' end. You should see that the quality drops towards the 3' end; this is typical for Illumina data. Also look at "Overrepresented sequences": here you will see a sequence, often found in the reads, that matches the Illumina Paired End PCR Primer 2. Copy the primer sequence.

Cut adaptors/primers

Let's cut the adaptor from the reads. We only keep reads that are longer than 30 bp after a potential cut, and we require a match of at least 3 bases before the adaptor is cut (the -m and -O options below). When it is done, some statistics are printed; look at the top: does the number of times a read was cut fit the number of times the adaptor was found in the FastQC report (~2 mins)? IMPORTANT: you should paste in the primer sequence you found in the FastQC report, replacing "primer" below.

cutadapt -b primer -m 30 -o Paeruginosa.fastq.cut.gz -O 3 Paeruginosa.fastq.gz
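As a toy illustration (not cutadapt itself) of what "-b primer -m 30" means: cut the read at the first adaptor occurrence and keep it only if at least 30 bp remain. Both the read and the adaptor below are made-up examples; use the real primer from FastQC in the command above.

```shell
# Made-up 32 bp read followed by a made-up adaptor and some trailing bases.
read_seq='ACGTACGTACGTACGTACGTACGTACGTACGTAGATCGGAAGAGCTTTT'
adaptor='AGATCGGAAGAGC'              # placeholder; use the primer from FastQC
trimmed=${read_seq%%"$adaptor"*}     # drop the adaptor and everything after it
if [ "${#trimmed}" -ge 30 ]; then
    echo "keep: ${trimmed} (${#trimmed} bp)"
else
    echo "discard"
fi
```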

Also try trimming low-quality bases from the ends. Afterwards, take a look at the reads; you should see the varying read lengths now (again, replace "primer" with the primer from FastQC):

cutadapt -b primer -m 30 -o Paeruginosa.fastq.cut.trim.gz -O 3 -q 20 --quality-base=33 Paeruginosa.fastq.gz
gzcat Paeruginosa.fastq.cut.trim.gz | less -S

You can also re-run fastqc on the trimmed files and view them to see the difference:

fastqc -o fastqc Paeruginosa.fastq.cut.gz
fastqc -o fastqc Paeruginosa.fastq.cut.trim.gz
firefox fastqc/Paeruginosa.fastq.cut_fastqc/fastqc_report.html &
firefox fastqc/Paeruginosa.fastq.cut.trim_fastqc/fastqc_report.html &
Q6. What was the effect on the qualities, and what was the effect on the per base sequence content? Can you suggest an additional trimming step to increase the quality?

Human Illumina Paired end reads

Let us look at some paired end Illumina reads. These reads are from ~40X whole-genome sequencing of an Asian individual. To save time we are not going to use all the reads; they have been filtered so that we only have reads that map to chr21. Let's copy the data here:

cp /home/local/27626/exercises/alignment/human/HG*.fastq.gz .

Let's look at the data using FastQC. To begin with we are only going to use the "A" reads (A_1 and A_2). The _1_ file contains the first read of each pair and the _2_ file contains the second. Try to type in the commands needed to run FastQC:

# type in the commands to run fastqc on the HG* files here
firefox fastqc/HG00418_A_1_fastqc/fastqc_report.html &
firefox fastqc/HG00418_A_2_fastqc/fastqc_report.html &

It looks like the reads could use some trimming. We don't find any adaptor sequences, so let's skip that step and only trim the reads based on quality; for this we are going to use prinseq. Prinseq can't read compressed files, so we need to pipe the data in and out using gzcat and gzip. We trim quality from the right (3') end to a minimum of 20, require the read to have an average quality of 20, and after trimming the read should be at least 35 bp for us to keep it. NB: the "\" just means that we are writing the command over two lines; paste in both lines. Each command takes ~4 mins to complete.

gzcat HG00418_A_1.fastq.gz | prinseq-lite.pl -fastq stdin -out_bad null \
 -out_good HG00418_A_1.trim -trim_qual_right 20 -min_qual_mean 20 -min_len 35 -log prinseq.HG00418_A_1.log 
gzcat HG00418_A_2.fastq.gz | prinseq-lite.pl -fastq stdin -out_bad null \
 -out_good HG00418_A_2.trim -trim_qual_right 20 -min_qual_mean 20 -min_len 35 -log prinseq.HG00418_A_2.log
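The decompress | tool | compress pattern used above can be sketched with toy data, with tr standing in for prinseq-lite.pl (gzip -dc is equivalent to gzcat; files go to /tmp):

```shell
# Write a tiny one-record FASTQ file, compressed.
printf '@r1\nacgt\n+\nIIII\n' | gzip -c > /tmp/toy.fastq.gz
# Decompress, run a toy "tool" (uppercase the bases), recompress.
gzip -dc /tmp/toy.fastq.gz | tr 'acgt' 'ACGT' | gzip -c > /tmp/toy.up.fastq.gz
gzip -dc /tmp/toy.up.fastq.gz
```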

When we trimmed our paired end reads, we removed reads from each file independently. This means that the ordering is no longer in sync between the two files, e.g. if we removed read no. 5 in file1 but not in file2, then read6 in file1 will be paired with read5 in file2. We need to fix this using cmpfastq.pl before we can use the reads; you will see that it outputs "common" and "unique" files. It is the common files that we are going to use.

cmpfastq.pl HG00418_A_1.trim.fastq HG00418_A_2.trim.fastq
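As a toy illustration (not cmpfastq.pl itself) of the common/unique split: after independent trimming, only read IDs that survive in BOTH files remain proper pairs. The file names and read IDs below are made up; comm needs sorted input.

```shell
printf 'r1\nr3\nr4\n' > /tmp/kept_1.txt   # IDs surviving in the _1 file
printf 'r1\nr2\nr4\n' > /tmp/kept_2.txt   # IDs surviving in the _2 file
comm -12 /tmp/kept_1.txt /tmp/kept_2.txt  # "common": usable pairs
comm -3  /tmp/kept_1.txt /tmp/kept_2.txt  # "unique": mates lost on one side
```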

E. coli 454 and Ion Torrent data

The 454 and Ion Torrent data are from the German E. coli outbreak (cucumber/sprouts) during the summer of 2011. Copy the data and take a look at the reads; use gzcat and pipe it to less to see the file contents. You need to replace "file" with the data files you just copied, and try to type in the firefox commands to view the FastQC reports:

cp /home/local/27626/exercises/alignment/ecoli/GOS1.fastq.gz .
cp /home/local/27626/exercises/alignment/ecoli/iontorrent.fq.gz .
gzcat file | less -S

Run fastqc on the data

fastqc -o fastqc GOS1.fastq.gz
fastqc -o fastqc iontorrent.fq.gz
Q7. Look at the different quality profiles, what differences can you see between the data types?

Let's trim the 454 reads. Say we don't want reads shorter than 150 bp, we chop reads down to at most 500 bp, and we remove reads with an average quality lower than 20. Let's also output some graph data that we can visualise on the prinseq web page.

gzcat GOS1.fastq.gz | prinseq-lite.pl -fastq stdin -graph_data GOS1.gd -out_good stdout \
 -out_bad null -min_len 150 -min_qual_mean 20 -trim_to_len 500 | gzip -c > GOS1.trim.fastq.gz
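The two length rules above can be sketched for a single read (a toy illustration, not prinseq; the length is made up):

```shell
len=620                          # hypothetical read length
[ "$len" -gt 500 ] && len=500    # -trim_to_len 500: chop to at most 500 bp
if [ "$len" -ge 150 ]; then      # -min_len 150: discard anything shorter
    echo "keep (${len} bp)"
else
    echo "discard"
fi
```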

When it is done, open a browser on padawan (type "firefox &", or use the one you have open from the other FastQC reports) and go to the prinseq web page. Click "Get report" and then "Select a graph datafile to upload". Find your directory (if not already there: File System, home, local, ngs_course, studXXX, preprocess) and upload the GOS1.gd file. When it is uploaded, click "Continue" and wait for it to finish, then click on the link to the HTML report file. Here you will see some stats calculated a bit differently than in the FastQC report.

If it doesn't work or you can't find the file, you can do this:

firefox /home/local/27626/exercises/alignment/ecoli/GOS1.gd.html &

Congratulations, you finished the exercise!