|
First look at data
Overview
In this exercise you will try to look at little at "real" NGS data and also try to download and compile a couple of programs. Additionally you will try to use screen when using the shell. Often when you are working in NGS you want to try that new cool program that somebody made that solves your problem. Sometimes programs are written in perl or python and these are most of the times just download and run, but often the programs are written in C or C++ and needs to be compiled. You will try to download bwa and samtools and try to compile the programs.
- Use standard UNIX commands to work with NGS data
- Try to compile bwa and samtools
- Use screen in shell
First look at data
- Go to your student folder on /home/local/ngs_course/studXXX/ and create a folder called "first_look". Change dir to it.
- Copy the "reads.fastq" to your folder from "/home/local/27626/exercises/first_look/"
- Look at the data using less. You will see a couple of reads in here. We will try to count the number of reads that there are. A read is always four lines, it may look like there are more lines but it is because the lines are wrapped. Try opening the file with "less -S file" and you should see it:
- Header line starts with "@"
- Sequence line with the DNA sequence
- Middle header line with a "+" and sometimes also the header
- Base-quality line phred scaled probability that the base is wrong
- Try to count the number of reads in the file. You can of course do this by looking into the file, but this gets hard when there are millions of reads in a file. Remember that the header always starts with "@". Try to do this using grep and wc. How many reads do you get using both approaches? Do you get the same count using both approaches? If not why?
- The reads.fastq are actually not a real datafile, it is reads from five different NGS technologies. Can you determine which are which? Hint: copy the other fastq files from "/home/local/27626/exercises/first_look/" to your dir and look at those too.
- Now we are going to look at some Illumina paired end reads. Copy them from "/home/local/27626/exercises/first_look/paired/" to your dir. You will see that these files are zipped as .tar.gz. To unzip them use "tar -xvzf file.tar.gz".
- Now you should have two files, Ecoli_1.fastq and Ecoli_2.fastq. Have a look at them both. Compare the headers of the first reads. They should be identical except for the last character - this means that these two reads are paired together, ie. they are the DNA sequences from the two ends of the DNA fragment. It is important that they are in sync, ie. that read 5 in file 1 is paired together with read 5 in file 2.
- Lets try to see if they in sync. We will grep the header lines from each file and remove the last character (this is 1 or 2 dependent on the pair) using sed (sed 's/what_to_remove/replace_with/'), the "$" means only at the end of the line.
grep "^@ILLUMINA" Ecoli_1.fastq | sed 's/1$//' > Ecoli_1.headers
grep "^@ILLUMINA" Ecoli_2.fastq | sed 's/2$//' > Ecoli_2.headers
- Look at the two files, now they should contain headers from each pair. Try to paste them together and see the output. Are they in sync?
- Again, this only works for a few lines, try to use the program diff - it will print out lines that are not exactly the same in the two files. The command is called like: "diff file1 file2"
- What I did was to take one read in file 2 and switched it around, can you figure out which read number it was and where it should be to fix the files?
Use screen in the shell
Screen is a program that will create a virtual shell inside your shell. You may wonder why this is needed, but there are a couple of advantages in using it:
- If you log out or lose your connection you can log in again and get back to your shell as it was. This is extremely useful when you are working on a wireless network where the connection may sometimes drop. If the connection drops any programs you have running in a normal shell will be killed - but not if you are using screen.
- If you need to run a program for a long time, then it will be killed if you log out or close your computer. But by using screen you can have the program running for as long as you like.
- You can "detach" a screen at work and "attach" the screen from home or from another computer
Ok lets try it. Log in to padawan and start screen by typing "screen". You will get a welcome message, press enter. Now you have a shell inside screen - it works just like a normal shell except for a few things. You enter the control features by pressing "ctrl-a". Try pressing "ctrl-a" and afterwards "?" - this will take you to the help screen, press enter to leave it again. Try typing a command like "ls" in the shell. Then we will detach the shell by pressing "ctrl-a-d" - now you will see a message at the bottom of the screen saying "[detached]" and you are out of the screen program. But our shell is still there, type "screen -r" to reconnect to the screen - now you are back in! That is the same you would do if you lost your connection, the screen would detach itself and you would be able to reconnect to it using "screen -r". This is basically it, but there are alot more functionalities with screen, you can read more here.
Download and compile software (optional)
Compile bwa
Lets say we want to use the programs bwa and samtools to do alignment of our data (we will do that later in the course). You are lucky that I already installed all programs for the course on the server, but in case you need to do it at your own server lets try to download and install them here. These programs are written in C and needs to be compiled before they can be run by the computer.
Go and download the source code of the programs. Lets try first with bwa. Find the source code and copy the link for the code. Then download the it using wget, insert the link to the source code below.
wget link_to_source_code
Now the file is downloaded, unzip it using "tar -xvjf file.tar.bz2" and enter the directory it makes. In the directory you will see a lot of files, this is the source code (=human readable code) and needs to be compiled. This is done using the program called make. Lets compile it:
make
It should output some commands and the finish. Now a file called "bwa" has been created, this is the program. Test it by the command:
./bwa
If it works you should see some information on how to run the program. Congratulations you just compiled your first programme (unless you have compiled programs before of course!). Lets make a dir where you can put in your programs: "mkdir ~/bin" and now copy it to that directory ("cp bwa ~/bin/"). Now you can use the program by just typing "bwa" instead of having to write the entire path.
Compile samtools
Again download the source code (remember to change dir out of the bwa download), it can be done from samtools. Unzip it and change dir to the directory. Compile the software using make. Now you will see that the program "samtools" were made in the main dir. Copy it to the bin dir as before. However several other programs was also compiled, copy "bcftools/bcftools" and "bcftools/vcfutils.pl" to your bin. Alot of programs was also present/compiled in the "misc" dir, copy all the perlscripts and the seqtk program to your bin. Last type rehash to reload your path.
Now you installed samtools!
Congratulations you finished the exercise!
|