
NGS exercise

Simon Rasmussen & Jose MG (Txema) Izarzugaza - December 2012


Overview

You will analyze Illumina paired end exome data from cancer tissue. During the exercise you will:

  1. Compute read statistics (FastQC)
  2. Determine theoretical and actual coverage of the experiment
  3. Align Illumina reads to a reference genome
  4. Filter and pre-process the alignments
  5. Visualize the alignments
  6. Perform genotyping (SNP calling)
  7. Filter the variants
  8. Analyze the coding/non-coding effects of SNPs
  9. Identify a probable cancer-causing mutation

Introduction

The data is from a human patient suffering from breast cancer. DNA was isolated from the tumor, the exome (the protein-coding part of the DNA) was captured, and it was sequenced on an Illumina machine, resulting in paired end reads. You are now in charge of analyzing the patient's genome for mutations that are likely to have caused the cancer. To speed things up we only have data from chromosome 13.


Get the data

Important: Every time you log on to padawan.cbs.dtu.dk in a new shell you must run the command below to make the correct programs available to you:

source /home/local/27634/bin/source.me

Let's start by making a directory to work in and by creating shortcuts (symbolic links) to the data. Remember to replace the XXXs in the lines below with your student account name.

source /home/local/27634/bin/source.me
mkdir /home/local/phdcourse/studXXX
cd /home/local/phdcourse/studXXX
ln -s /home/local/27634/exercises/cancer/data/reads* .
ln -s /home/local/27634/exercises/cancer/data/chr13.fa .
ln -s /home/local/27634/exercises/cancer/data/chr13.exome .

Brief look at the reads

Let's take a look at the reads. Because we have paired end reads there are two files: one containing the first read of each pair and one containing the second. We can view the files using a program called "less":

less reads_1.fq
less reads_2.fq
Q1. How long are the reads?

Count the number of reads. This is done by counting the number of lines in the file and dividing by 4 (FASTQ always uses 4 lines per read). Hint: use the program "wc".
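As a sketch of the arithmetic (the two-read FASTQ below is fabricated so the numbers are checkable; on the real data you would simply run "wc -l" on reads_1.fq):

```shell
# Toy FASTQ with two reads; each read occupies exactly 4 lines.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > tiny.fq
lines=$(wc -l < tiny.fq)   # 8 lines
reads=$((lines / 4))       # 8 / 4 = 2 reads
echo "$reads"
```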

Let's also count the number of bases, which can be done with this program:

fastx_sumofsequence.py reads_1.fq
fastx_sumofsequence.py reads_2.fq
Q2. Does it match the read length (e.g. bases/reads = average read length)?

Calculate the average coverage of chromosome 13, which is total bases / target length. Chromosome 13 is ~113 Mb, but only 1.66 Mb of the chromosome consists of exons (the coding regions), and that is what we have sequenced.
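The arithmetic can be sketched like this (the total below is a made-up placeholder; plug in the base count you got from fastx_sumofsequence.py):

```shell
total_bases=500000000   # placeholder: replace with your own total from Q2
exome_len=1660000       # ~1.66 Mb of exome target on chr13
chr13_len=113000000     # ~113 Mb for the whole chromosome
exome_cov=$(awk -v b="$total_bases" -v t="$exome_len" 'BEGIN { printf "%.1f", b/t }')
chr13_cov=$(awk -v b="$total_bases" -v t="$chr13_len" 'BEGIN { printf "%.1f", b/t }')
echo "exome coverage: ${exome_cov}x"
echo "chr13 coverage: ${chr13_cov}x"
```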

Q3. What is the average coverage of the chr13 exome? This is how many reads on average cover each position of the exome. What would the number be if we had sequenced the entire chromosome 13?

Q4. What are the qualities encoded in? This information is needed when we do the alignment (run the command below); Sanger means Phred+33, Illumina means Phred+64.
fastx_detect_fq.py --i reads_1.fq
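The difference between the two encodings is just an ASCII offset; a quick sketch of how one quality character decodes under each scale:

```shell
# The character 'I' is ASCII 73: Q40 on the Sanger scale (offset 33),
# but only Q9 on the old Illumina scale (offset 64).
q=$(printf '%d' "'I")
echo "ASCII: $q  Phred+33: $((q - 33))  Phred+64: $((q - 64))"
```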

Read statistics - FastQC

We can calculate statistics on our data to look for adapter sequences in the reads and to decide whether they should be trimmed.

mkdir fastqc
fastqc -o fastqc reads*.fq
firefox fastqc/*/fastqc_report.html &

The program outputs several statistics in the report; look at each of the figures and see whether it flags something we should look at (red/white cross). The most interesting figure is the first, where the quality of each base is plotted as you move from the beginning of the read to the end. For Illumina data we normally see a drop in base quality as we move from the 5' end towards the 3' end. These data are actually very good, however, and show only a small drop.

Also look for overrepresented sequences (second-to-last plot); these are often adapter/PCR primer sequences from the sequencing that are still present in the reads. It is especially important to remove them when doing a de novo assembly, as they will create overlaps between reads and lead to wrong assemblies. For alignment they are also troublesome, as aligners often perform global alignment (i.e. they want to align all of the read to the genome).

Q5. Which of the statistics are failed/warning? Try to think of what could cause this.

Trimming reads

Let's trim off the first bases and the 3' bases with poor quality (below 10), and let's only keep reads that are longer than 25 bases after trimming. If the qualities are encoded as Phred+64 (i.e. the Illumina scale - Question 4), add -phred64 to the end of the commands below. The trimming takes a couple of minutes.

prinseq-lite.pl -fastq reads_1.fq -trim_left 6 -trim_qual_right 10 -min_len 25 -out_good reads_1.trim
prinseq-lite.pl -fastq reads_2.fq -trim_left 6 -trim_qual_right 10 -min_len 25 -out_good reads_2.trim
Q6. Look at the output from the prinseq-lite.pl commands; how many reads passed our quality control?

By trimming our paired end reads we have removed reads from each file. This means that the ordering is no longer in sync between the two files: e.g. if we removed read no. 5 in file 1 but not in file 2, then read 6 in file 1 will be paired with read 5 in file 2. We need to fix this with cmpfastq.pl before we can use the reads. You will see that it outputs "common" and "unique" files; it is the common files that we are going to use for alignment.

cmpfastq.pl reads_1.trim.fastq reads_2.trim.fastq
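A minimal illustration of what the re-pairing does, using fabricated read IDs (cmpfastq.pl works on full FASTQ records; here plain "comm" on sorted ID lists shows the idea):

```shell
printf 'r1\nr2\nr4\n' > ids_1.txt   # read r3 was trimmed away from file 1
printf 'r1\nr3\nr4\n' > ids_2.txt   # read r2 was trimmed away from file 2
common=$(comm -12 ids_1.txt ids_2.txt)   # IDs present in BOTH files
echo "$common"
```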

Alignment

Let's start the alignment: first index chr13, then align the individual paired end files as if they were single end files. The indexing creates the BWT (Burrows-Wheeler Transform) and the suffix array that are used during alignment. If the qualities are encoded in Illumina format (Phred+64) you have to add "-I" (capital i, not "one") to the commands below (e.g. "bwa aln -I chr13.fa ..."); otherwise use them as they are.

bwa index chr13.fa
bwa aln chr13.fa reads_1.trim.fastq-common.out > reads_1.sai
bwa aln chr13.fa reads_2.trim.fastq-common.out > reads_2.sai
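As a toy illustration of the Burrows-Wheeler Transform that "bwa index" builds (on the string "banana$" rather than a genome): list every rotation of the string, sort the rotations, and keep each one's last character. This is a bash sketch, not how bwa computes it internally.

```shell
s='banana$'
n=${#s}
for ((i = 0; i < n; i++)); do
    echo "${s:i}${s:0:i}"           # rotation starting at position i
done | LC_ALL=C sort > rotations.txt  # '$' sorts before the letters
bwt=$(awk '{ printf "%s", substr($0, length($0), 1) }' rotations.txt)
echo "$bwt"   # the BWT of "banana$" is "annb$aa"
```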

What we did above with the "bwa aln" command was to find matches in the suffix array; now we convert these matches to alignments using the sampe command. We also add something called a read group, which records information about the data: a unique ID for the lane, which sample it is from, which platform was used, etc. The "\" at the end of the first line below means that the command continues on the next line. You have to paste in both lines for the command to work.

bwa sampe -r "@RG\tID:libA\tSM:cancer1\tPL:ILLUMINA" chr13.fa reads_1.sai reads_2.sai \
reads_1.trim.fastq-common.out reads_2.trim.fastq-common.out | samtools view -Sb - > alignment.bam

Let's calculate the coverage (average number of reads mapped to each base) that we have on the exome of chr13. First we filter out badly mapped reads and sort the alignment; then we calculate the coverage using bedtools.

samtools view -u -q 30 alignment.bam | samtools sort - alignment.sort
bedtools coverage -abam alignment.sort.bam -b chr13.exome -hist | grep all > exome.coverage.txt
R --vanilla exome.coverage.txt chr13-exome all < /home/local/27626/bin/plotcov.R
acroread chr13-exome.pdf &
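A note on the "-q 30" filter above: mapping quality is Phred-scaled, MAPQ = -10*log10(P(read is misplaced)), so 30 keeps reads with at most a 0.1% chance of being mapped to the wrong place. A quick sketch of the scale:

```shell
p30=$(awk 'BEGIN { printf "%g", 10^(-30/10) }')
echo "MAPQ 30 -> P(wrong) = $p30"
# The full scale, one line per 10 quality points:
awk 'BEGIN { for (q = 10; q <= 40; q += 10) printf "MAPQ %2d -> P(wrong) = %g\n", q, 10^(-q/10) }'
```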

The left plot is a histogram showing the number of bases covered at a certain depth (reads), whereas the plot on the right is showing the percentage of the exome covered at X depth or more.

Q7. How much of the exome is not covered by any read, and how much is covered by e.g. 10 or more reads? Does the average coverage correspond to the theoretical coverage we calculated in Q3?
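If you prefer numbers over the plot, the "-hist ... | grep all" output can be summed directly. Its columns are: the literal word "all", depth, bases at that depth, target size, and the fraction of the target at that depth. The three-line file below is fabricated so the sum is checkable; run the same awk on exome.coverage.txt for the real answer:

```shell
printf 'all\t0\t100\t1000\t0.1\nall\t5\t400\t1000\t0.4\nall\t10\t500\t1000\t0.5\n' > demo.coverage.txt
# Sum the fraction column for all rows with depth >= 10.
frac=$(awk '$2 >= 10 { s += $5 } END { printf "%.2f", s }' demo.coverage.txt)
echo "fraction of target at depth >= 10: $frac"
```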

Filter and pre-process alignment


Processing using GATK

Now we start using the Genome Analysis Toolkit (GATK), developed by the Broad Institute. It is one of the leading programs for processing BAM files and SNP calling. On our server it will print some error messages such as "HttpMethodDirector - I/O exception"; these are just warnings and do not matter.

GATK is a Java program invoked via the main program "GenomeAnalysisTK.jar", with each function implemented as a walker (option -T). Let's make a shortcut (symbolic link) so that the commands are easier to write. Unfortunately some of the commands are still horribly long, so we will use "\" to split them across several lines.

ln -s /home/local/27626/bin/GenomeAnalysisTK-1.6-5/GenomeAnalysisTK.jar .

The following is optional and can be done after the exercise is complete if interested.


Realignment (optional)

Because aligners such as BWA are built to align millions to billions of reads to very large genomes (the human genome is 3 Gb), they are optimized for speed. Therefore some reads may be mapped wrongly; in particular, indels (insertions and deletions) are known to cause wrong mappings that can lead to false SNP calls. Here we parse through the BAM file and select regions with indels in the alignment that can then be realigned using a more sensitive alignment algorithm.

Let's create the regions that must be realigned using the RealignerTargetCreator walker. NB: all of the GATK commands will show you how far they have progressed while running and approximately how much time is left.

samtools index alignment.sort.bam
java -Xmx2g -jar GenomeAnalysisTK.jar \
   -I alignment.sort.bam \
   -R chr13.fa \
   -T RealignerTargetCreator \
   -o IndelRealigner.intervals

Start the realignment using the IndelRealigner walker.

java -Xmx2g -jar GenomeAnalysisTK.jar \
   -I alignment.sort.bam \
   -R chr13.fa \
   -T IndelRealigner \
   -targetIntervals IndelRealigner.intervals \
   -o alignment.sort.realn.bam

Read quality re-calibration (optional)

We will use known SNPs from dbSNP (a database of known human variants) to recalibrate the quality scores. By knowing which positions are likely to be SNPs, we can calibrate the quality scores using positions that are either likely or unlikely to be true SNPs (i.e. where we expect a mismatch between the read and the reference genome).

We are going to analyze the variation based on read group, quality score, machine cycle and dinucleotide content. First we need to count the covariates and output the sites that need to be recalibrated.

ln -s /home/local/27626/exercises/aln_process/dbSNP135.snps.sites.vcf.gz .
ln -s /home/local/27626/exercises/aln_process/dbSNP135.snps.sites.vcf.gz.tbi .
java -Xmx2g -jar GenomeAnalysisTK.jar \
   -T CountCovariates \
   -R chr13.fa \
   -I alignment.sort.realn.bam \
   -recalFile recal_data.csv \
   -knownSites dbSNP135.snps.sites.vcf.gz \
   -cov ReadGroupCovariate \
   -cov QualityScoreCovariate \
   -cov CycleCovariate \
   -cov DinucCovariate

Now we recalibrate our alignments using the file we created just above, "recal_data.csv".

java -Xmx2g -jar GenomeAnalysisTK.jar \
   -T TableRecalibration \
   -R chr13.fa \
   -recalFile recal_data.csv \
   -I alignment.sort.realn.bam \
   -o alignment.sort.recal.bam

Let's plot the reported (i.e. stated) qualities vs. the empirical (observed) qualities.

mkdir recal_dir
java -Xmx2g -jar /home/local/27626/bin/GenomeAnalysisTK-1.6-5/AnalyzeCovariates.jar \
   -recalFile recal_data.csv \
   -outputDir recal_dir \
   -ignoreQ 5
acroread recal_dir/libA.QualityScoreCovariate.dat.quality_emp_v_stated.pdf &
Q8. Are the reported quality scores close to what you would expect? Note down the Root-Mean-Square Error (RMSE).

We will count covariates for the corrected data and see if there is a difference.

java -Xmx2g -jar GenomeAnalysisTK.jar \
   -T CountCovariates \
   -R chr13.fa \
   -I alignment.sort.recal.bam \
   -recalFile recal_data_corrected.csv \
   -knownSites dbSNP135.snps.sites.vcf.gz \
   -cov ReadGroupCovariate \
   -cov QualityScoreCovariate \
   -cov CycleCovariate \
   -cov DinucCovariate

mkdir recal_cor_dir
java -Xmx2g -jar /home/local/27626/bin/GenomeAnalysisTK-1.6-5/AnalyzeCovariates.jar \
   -recalFile recal_data_corrected.csv \
   -outputDir recal_cor_dir \
   -ignoreQ 5
acroread recal_cor_dir/libA.QualityScoreCovariate.dat.quality_emp_v_stated.pdf &
Q9. Does it look better now? What is the (RMSE) of the original vs. the recalibrated?

Correct the BAM-values (optional)

We will recalculate some of the values in the BAM-file and then calculate something called Base Alignment Quality (BAQ). BAQ is useful for avoiding false SNP calls near indels because it will tell the SNP caller to downgrade the quality of mismatching bases in poor alignments.

samtools calmd -Erb alignment.sort.recal.bam /home/local/27634/exercises/cancer/data/chr13.fa > alignment.sort.recal.baq.bam

Mark Duplicates (optional)

Finally, the last step is to remove duplicates. Duplicates arise from artefacts during PCR (bridge) amplification and sequencing; in the alignment they show up as many exact copies of the same read. Because all the duplicated reads were sampled from the same DNA molecule, they give an uneven representation of that molecule compared to others and will bias the SNP calling. We therefore remove them. Here duplicates are identified as reads that map with identical 5' coordinates and orientations, and only the read with the highest quality score is kept for further analysis.
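The detection logic can be sketched with a toy fragment table (chromosome, 5' position, strand, read name). MarkDuplicates additionally keeps the copy with the best base qualities, whereas this sketch simply keeps the first of each group:

```shell
# r1 and r2 start at the same position on the same strand -> duplicates.
printf 'chr13\t100\t+\tr1\nchr13\t100\t+\tr2\nchr13\t250\t-\tr3\n' > frags.txt
# Sort by position, then keep only the first read per (chrom, pos, strand) key.
sort -k1,1 -k2,2n -k3,3 frags.txt \
  | awk '!seen[$1 "\t" $2 "\t" $3]++' > frags.rmdup.txt
kept=$(wc -l < frags.rmdup.txt)
echo "kept $kept of 3 fragments"   # the r1/r2 pair collapses to one
```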

java -Xmx2g -jar /home/local/27626/bin/MarkDuplicates.jar INPUT=alignment.sort.recal.baq.bam OUTPUT=alignment.sort.recal.baq.rmdup.bam \
   ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
Q10. How many reads were removed as duplicates, and how many did we have in total? (Hint: view the on-screen output from the command: "Read XXXXXX records." and "Marking XXXXX records as duplicates".) Try to calculate the percentage.

Now the file is ready for SNP calling.


From here on, the exercise is no longer optional. If you skipped the optional part you need to do the following:

ln -s /home/local/27634/exercises/cancer/data/alignment.sort.recal.baq.rmdup.bam .
ln -s /home/local/27626/exercises/aln_process/dbSNP135.snps.sites.vcf.gz .
ln -s /home/local/27626/exercises/aln_process/dbSNP135.snps.sites.vcf.gz.tbi .
ln -s /home/local/27626/bin/GenomeAnalysisTK-1.6-5/GenomeAnalysisTK.jar .

Alignment visualization

Before we start SNP calling we are going to have a look at the alignments. Sometimes it is nice to see things with your own eyes, so we will visualize the alignments using IGV. To do this we first need to index the alignment using samtools; afterwards we start the viewer.

samtools index alignment.sort.recal.baq.rmdup.bam
java -jar -Xmx1250m /home/local/27626/bin/IGV/igv.jar &

If Human hg19 is not already loaded, select it in the drop-down menu in the upper left corner. Then load your alignment via File, Load from File; go to your working directory (something like "/home/local/phdcourse/studXXX/") and select "alignment.sort.recal.baq.rmdup.bam".
Select chr13 (box next to Human hg19) and try zooming in (upper right corner). From the top, the windows show the position in the genome, the read depth (grey histogram), the read alignments (grey arrows) and, lowest, the sequence, amino acid and gene/exon annotation on chromosome 13. Reads that differ from the reference are shown with small colored bars; some mismatches are seen in all reads and some in only a few: these are Single Nucleotide Polymorphisms and sequencing errors, respectively. Try zooming in on a random position on the genome; you will probably only see a few reads mapping sporadically.

Q11. Why do you think that there are reads at positions with no genes and no exons? (Hint: Our data is exome capture data)

Try going to position 21170000 (paste in chr13:21170000) and zoom out (the IFT88 gene). You should be able to see the different exons of the gene.

Q12. Can you see something that looks different compared to the annotation of the gene (lowest box)?

Try going to position 25029083 (PARP4 gene).

Q13. Can you find a SNP? Can you find possible sequencing errors?

The Unified Genotyper (GATK)

Let's start the SNP calling using the Unified Genotyper. First we index the alignment using samtools, then we run the genotyper (-T UnifiedGenotyper). The extra options we add are: "--dbsnp", a file of all known human SNPs used to annotate our calls with their ids; "-o", the output file in VCF format; "-stand_call_conf", the minimum Phred quality for a SNP to pass filtering; and "-stand_emit_conf", the minimum Phred quality for a SNP to be reported at all. If you also want to call indels, add the option "-glm BOTH"; for this exercise let us just call SNPs.

samtools index alignment.sort.recal.baq.rmdup.bam
java -jar GenomeAnalysisTK.jar \
   -R chr13.fa \
   -T UnifiedGenotyper \
   -I alignment.sort.recal.baq.rmdup.bam \
   --dbsnp /home/local/27626/exercises/snp_calling/dbSNP135.snps.sites.vcf.gz \
   -o cancer1.snps.raw.vcf \
   -stand_call_conf 30.0 \
   -stand_emit_conf 10.0

Open the output file with the raw SNP calls (given by -o). You will see that the file consists of two main sections: a header, denoted by "#" as the first character, and the actual SNP calls. The header explains all the abbreviations in the file and is quite handy. The columns in the variant part are: chromosome, position, dbSNP id, reference base, alternative base, Phred quality of the call, filter status, lots of info, format of the next field, and then genotype information for our sample (cancer1). Let's see how many SNP calls we got and how many pass the quality 30 filter:

less -S cancer1.snps.raw.vcf
grep -v "#" -c cancer1.snps.raw.vcf
grep -v "#" cancer1.snps.raw.vcf | grep -c "PASS"
Q14. How many SNP calls did we get and how many passed?

Variant Quality recalibration / Soft filtering

The raw SNP calls are likely to contain false positives caused by biases in the sequencing data. We will therefore recalibrate the variant qualities of our SNPs. First, let's create shortcuts to the datasets we are going to use for calibration.

ln -s /home/local/27626/exercises/snp_calling/hapmap_3.3.b37.sites.vcf .
ln -s /home/local/27626/exercises/snp_calling/1000G_omni2.5.b37.sites.vcf .
ln -s /home/local/27626/exercises/snp_calling/dbsnp_135.b37.vcf .

Now we are ready for recalibration. Because we have human data, we use information from HapMap and the Omni chip to recalibrate our SNPs. Additionally, because we only have a small dataset (only chr13), we add the option "--maxGaussians 4"; if you have full genome data this is not needed. Similarly, we leave out the dbSNP information, but it can be added ("-resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp_135.b37.vcf \"); the dbSNP information is only used for visualizing the results. We are going to output some plots in the var_recal directory.

mkdir var_recal
java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R chr13.fa \
   -input cancer1.snps.raw.vcf \
   -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ \
   -recalFile var_recal/cancer1.recal \
   -tranchesFile var_recal/cancer1.tranches \
   -rscriptFile var_recal/cancer1.plots.R \
   --maxGaussians 4

Open the tranche plot. It shows the number of variants (x-axis; here novel = all, because we did not use dbSNP information) and whether they are predicted to be true positives (TP) or false positives (FP). The "truth" number defines the tranches, so that "90" is calculated from the 90% most probable SNPs, "99" from the 99% most probable SNPs, and so on. Blue are predicted to be true SNPs and orange false SNPs. We use this plot to decide where to put the threshold for filtering, e.g. how many more false SNPs are we likely to get by increasing from 99.0 to 99.9.

Unfortunately the plot is arranged badly (a bug in the program), but you can look at the amount of blue (true positives) vs. orange (false positives) to get an idea of how much we should trust the SNPs (only the PASS ones from above).

acroread var_recal/cancer1.tranches.pdf &
Q15. At which tranche would you put the threshold (judge blue vs. orange)?

Apply calibration to SNPs

Let's apply the recalibration to our SNP calls, trying a threshold of 99.9:

java -Xmx2g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R chr13.fa \
   -input cancer1.snps.raw.vcf \
   --ts_filter_level 99.9 \
   -tranchesFile var_recal/cancer1.tranches \
   -recalFile var_recal/cancer1.recal \
   -o cancer1.snps.recal.filtered.vcf

Open the recalibrated file and check the "Filter" column. You should now see that all SNPs classified at 99.9 or better are termed "PASS", and the rest are termed "LowQual" or "TruthSensitivityTranche99.90to100.00". We can now get all the trusted SNPs by taking the ones marked "PASS".

less -S cancer1.snps.recal.filtered.vcf
grep "#" cancer1.snps.recal.filtered.vcf > header.vcf
grep "PASS" cancer1.snps.recal.filtered.vcf | cat header.vcf - > cancer1.snps.recal.pass.vcf
grep -c "PASS" cancer1.snps.recal.filtered.vcf
Q16. How many SNPs do we have that PASS the filtering?
Q17. Can you think of another requirement that we haven't used, which could be used to filter SNPs?

Identify cancer causing mutation

Start by loading in the programs we need for the mutation effect predictions.

source /home/local/27634/bin/source.cshrc.txt

ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected in diverse genomes (including human genome builds hg18 and hg19, as well as mouse, worm, fly, yeast and many others). Given a list of variants with chromosome, start position, end position, reference nucleotide and observed nucleotide, ANNOVAR can perform:

  1. Gene-based annotation: identify whether SNPs or CNVs cause protein coding changes and the amino acids that are affected. Users can flexibly use RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes, or many other gene definition systems.
  2. Region-based annotations: identify variants in specific genomic regions, for example, conserved regions among 44 species, predicted transcription factor binding sites, segmental duplication regions, GWAS hits, database of genomic variants, DNAse I hypersensitivity sites, ENCODE H3K4Me1/H3K4Me3/H3K27Ac/CTCF sites, ChIP-Seq peaks, RNA-Seq peaks, or many other annotations on genomic intervals.
  3. Filter-based annotation: identify variants that are reported in dbSNP, or identify the subset of common SNPs (MAF>1%) in the 1000 Genome Project, or identify subset of non-synonymous SNPs with SIFT score>0.05, or find intergenic variants with GERP++ score>2, or many other annotations on specific mutations.
  4. Other functionalities: Retrieve the nucleotide sequence in any user-specific genomic positions in batch, identify a candidate gene list for Mendelian diseases from exome data, and other utilities.

The first thing we need to do in order to work with ANNOVAR is to convert the VCF file to its own internal format. The convert2annovar.pl script does this very easily:

convert2annovar.pl -format vcf4 -outfile cancer1.annovar cancer1.snps.recal.pass.vcf

There are many options that we can play with (type convert2annovar.pl --help for a detailed description), but a basic execution includes:
-format vcf4 to indicate that the input file is in VCF4 format, and
-outfile to indicate the output file (the default is standard output)

If the program ran correctly, we can expect something similar to this:


head -n6 cancer1.annovar
13	19042327	19042327	A	G	hom	41.11	2	60.00	20.56
13	19042919	19042919	T	C	hom	264.65	7	60.00	37.81
13	19043558	19043558	A	C	hom	220.79	7	60.00	31.54
13	19171165	19171165	C	T	hom	45.74	2	60.00	22.87
13	19172757	19172757	C	T	hom	51.64	2	60.00	25.82
13	19185939	19185939	C	T	hom	39.62	2	60.00	19.81

Annotate the variants

We can annotate the variants in this file (we specify the genome build version with the -buildver hg19 parameter and the output file):

annotate_variation.pl -geneanno cancer1.annovar /home/local/27634/exercises/cancer/data/humandb \
-outfile cancer1.unfiltered -buildver hg19

Two files will be generated: one (*.variant_function) with all the variants and another (*.exonic_variant_function) with only the coding variants. In the cancer1.unfiltered.variant_function file, the first column shows the location type (intergenic, intronic, exonic, ...) and the second column shows which gene the variant is associated with. The cancer1.unfiltered.exonic_variant_function file has all variants within exons, where column 2 shows the type (nonsynonymous, synonymous) and column 3 shows "gene:transcript:exon:base-change:amino-acid-change".

We can easily filter the exonic file to extract only the protein-changing (nonsynonymous) mutations:

grep "nonsynonymous" cancer1.unfiltered.exonic_variant_function \
> cancer1.unfiltered.nonsynonymous_variant_function
Q18: How many nonsynonymous mutations can we find? (hint use wc)
Q19. Which gene is the most frequently mutated?
cut -f3 cancer1.unfiltered.nonsynonymous_variant_function | cut -f1 -d":" | sort | uniq -c | sort -n
Q20. How many different genes have nonsynonymous mutations on chr13 in this patient?
cut -f3 cancer1.unfiltered.nonsynonymous_variant_function | cut -f1 -d":" | sort | uniq | wc -l

Filter the variants

Sometimes we find more genes/mutations than we are able to analyze in a reasonable time, which is why it is a good idea to pre-filter the mutations. We want to analyze mutations that are NOT common SNPs (MAF < 1%) in the European population:

annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 --maf_threshold 0.01 \
 -outfile cancer1.filtered_1000g_CEU cancer1.annovar /home/local/27634/exercises/cancer/data/humandb

Two files will be generated: *_dropped contains the mutations matching the criteria (namely, being in the database) and *_filtered is the clean file after filtering (namely, without the common SNPs in the European population).

Q21. How many mutations remain after this filtering? (Hint: use wc on the "_filtered" file)

Let's find the nonsynonymous mutations in the filtered file by re-running the annotate_variation.pl script:

annotate_variation.pl -geneanno cancer1.filtered_1000g_CEU.hg19_EUR.sites.2012_04_filtered \
-buildver hg19 -outfile cancer1.filtered_1000G /home/local/27634/exercises/cancer/data/humandb

grep nonsynonymous cancer1.filtered_1000G.exonic_variant_function \
 > cancer1.filtered_1000G.nonsynonymous_variant_function
Q22. How many nonsynonymous mutations can we find (Hint: wc on the cancer1.filtered_1000G.nonsynonymous_variant_function file)?
Q23. Which gene is the most frequently mutated?
cut -f3 cancer1.filtered_1000G.nonsynonymous_variant_function | cut -f1 -d":" | sort | uniq -c | sort -n
Q24. How many different genes are mutated on chr13 in this patient?
cut -f3 cancer1.filtered_1000G.nonsynonymous_variant_function | cut -f1 -d":" | sort | uniq | wc -l
Q25. Inspect some of the genes in which the mutations are found (e.g. at NCBI Gene). One of them is highly associated with breast cancer - which one?

More information on ANNOVAR and the different filters can be found here.



Congratulations - you finished the exercise