

Connect to the unix machine at CBS

On Windows

Double-click the SSH Secure Shell Client and choose File > Quick Connect. Fill in Host: "" and your username at CBS. You will then be prompted for your password.

On Mac

Open X11
In the terminal window type:

ssh -X


Basic commands

Where am I? - pwd
pwd   This command returns the path to your current location (the current directory, which you can also see in your prompt)
Copy the immu00 directory and its contents to here:
cp -R /usr/opt/www/pub/CBS/courses/27685.imm/exercise_unix/immu00   .
-R means recursively, i.e., include everything in the directory. "." means to here, so remember the period at the end of this command.

What is in this directory? - ls
ls                      short listing of the content of the current directory (a directory is called a folder in Windows or Mac OS)
ls ..                   short listing of the content of the directory above the current directory [".." means one directory up, "../.." is two directories up]
ls immu00        short listing of the immu00 directory (equivalent to ls ./immu00  ["." means here])
ls -l immu00     detailed listing of the immu00 directory
ls -ltr immu00   long listing sorted by time (t) and reversed (r): newest files last 
                        (essential for old bioinformaticians who cannot remember what they just did)
Paths starting with "/" are absolute addresses starting at the root directory (normally called C:\ in Windows), as opposed to relative addresses (addresses relative to where you are in the folder hierarchy)
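
To make the difference between absolute and relative addresses concrete, here is a small sketch. The directory names demo_paths and sub are made up for illustration, and everything happens in a scratch directory created by mktemp, so nothing in your own directories is touched.

```shell
# Hypothetical scratch area; demo_paths and sub are illustrative names only.
start=$(mktemp -d)
mkdir -p "$start/demo_paths/sub"

cd "$start/demo_paths/sub"    # absolute address: begins with "/"
abs_result=$(pwd)

cd ..                         # relative address: one directory up
up_result=$(pwd)

cd sub                        # relative address: back down into sub
rel_result=$(pwd)

echo "$abs_result"
echo "$up_result"
echo "$rel_result"

# Clean up the scratch area again
cd /
rm -rf "$start"
```

The first and the last pwd print the same location: the absolute address works from anywhere, while "cd sub" only works when you are standing in demo_paths.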

Make new directory - mkdir
mkdir testdir           Make a new directory (folder) with the name testdir in the directory (folder) where you are now.
mkdir mynewdir          Make a new directory (folder) with the name mynewdir in the directory where you are now
I want to go to? - cd
The cd command is used to move around in the file system.
cd testdir              go to the testdir directory (relative address to where you are)
cd ..                   up one level
cd                      go to my home directory
cd immu00        go to immu00 directory (verify you are there by the pwd command)
Moving or renaming files - mv
touch myfile            Makes a new empty file
mv myfile mynewfile     Rename myfile to mynewfile
mv mynewfile testdir    Moves the file mynewfile into the directory named testdir (How can you check that this has actually happened?)
Removing (deleting) files - rm, and empty directories (folders) - rmdir
rm mynewfile         removes (deletes) mynewfile
rmdir mydirectory    remove an empty directory 
rmdir testdir        remove an empty directory (this directory is not empty, so this command fails)
rm -rf testdir       remove a directory, including files and subdirectories, no questions asked. Make sure this is what you want to do: 
                     there is no recycle bin on UNIX; once it is gone it is gone!
Copying files - cp
touch myfile          make file called "myfile"
cp myfile mynewfile   copy myfile to mynewfile
Viewing text files - cat/more/less/head/tail
cat test.dat          write contents of file to screen
head test.dat         write top of file (default 10 lines)
head -30 test.dat     write top 30 lines of file
tail test.dat         write the last 10 lines of the file
tail -25 test.dat     write the last 25 lines of the file
more test.dat         show test.dat pagewise, press "space" to go one page down, "q" to quit.
less test.dat         show test.dat pagewise, press "space" to go one page down, "j" to go one line down, "k" to go one line up, "q" to quit.
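
If you want to convince yourself how many lines head and tail return, the following sketch may help. It assumes a generated 100-line file called numbers.dat (a made-up name) in which line N simply contains the number N.

```shell
# numbers.dat is a made-up 100-line test file (one number per line).
workdir=$(mktemp -d)
seq 1 100 > "$workdir/numbers.dat"

head "$workdir/numbers.dat" | wc -l        # default: 10 lines
head -30 "$workdir/numbers.dat" | wc -l    # 30 lines

# The last line printed by head -30 is line 30 of the file
top_last=$(head -30 "$workdir/numbers.dat" | tail -1)

# The first line printed by tail -25 is line 76 (100 - 25 + 1)
bottom_first=$(tail -25 "$workdir/numbers.dat" | head -1)

echo "$top_last $bottom_first"
rm -rf "$workdir"
```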
Editing files - n/nedit
The n, or nedit (the first is a shortcut alias for the latter) command is used to launch the nedit editor. Examples:
n test.dat   edit the file test.dat with nedit
Executing Programs

Moving data around
Redirecting: |, > and <

Use | to "pipe" (or send) data from one program to another.
cat test.dat | wc              pipe the contents of test.dat into the program called wc (word count), which counts the number of lines, words and bytes in test.dat
Use > to direct data to a file (and overwrite it).
head test.dat > tmp.dat        first ten lines of test.dat into tmp.dat
Use >> to direct data to a file and append the data to the contents of the file.
head test.dat >> tmp.dat       first ten lines of test.dat into tmp.dat (now it should contain 20 lines)
Use < to get data from a file to a program.
head < test.dat 
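
The redirection operators can be tried out safely with the sketch below. The file names demo.dat and tmp.dat are only used for this illustration and live in a scratch directory; tr is used to strip the padding that some versions of wc add around the number.

```shell
# demo.dat and tmp.dat are made-up file names used only for this sketch.
workdir=$(mktemp -d)
seq 1 50 > "$workdir/demo.dat"                   # ">" creates (or overwrites) demo.dat

head "$workdir/demo.dat" > "$workdir/tmp.dat"    # tmp.dat now holds 10 lines
head "$workdir/demo.dat" >> "$workdir/tmp.dat"   # ">>" appends: 20 lines
appended=$(wc -l < "$workdir/tmp.dat" | tr -d ' ')          # "<" feeds the file to wc

head "$workdir/demo.dat" > "$workdir/tmp.dat"    # ">" again: back to 10 lines
overwritten=$(cat "$workdir/tmp.dat" | wc -l | tr -d ' ')   # "|" pipes cat into wc

echo "$appended $overwritten"
rm -rf "$workdir"
```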

Bioinformatics using Unix commands

Awksome programming languages (awk, nawk, gawk)
awk, nawk, and gawk are different versions of the same programming language, and are very similar. It is recommended to use gawk or nawk, rather than the original version: awk, since they are more stable and have more features!
Basically gawk will read a file and do something with each line.
Examples of using gawk:
gawk '{print $1}' epitope2protein.HLA-D_m13.out                          Print first field in file
gawk '{print $1, $3}' epitope2protein.HLA-D_m13.out                      Print first and third field in file
cat epitope2protein.HLA-D_m13.out|gawk '{print $1}'                      Print first field in file getting data from standard input
cat epitope2protein.HLA-D_m13.out|gawk '{if (/NP/) {print $1}}'          Print first field in lines containing "NP"
cat epitope2protein.HLA-D_m13.out|gawk '{if (/^NP/) {print $1}}'         Print first field in lines starting with "NP"
gawk '{print substr($7,2,5)}' epitope2protein.HLA-D_m13.out              Print five characters of the seventh column, 
                                                                         starting with the second letter (in the seventh column). 
                                                                         NB! awk numbers strings starting with 1, where many other programming and 
                                                                         scripting languages start numbering from 0!
gawk '{print substr($7,length($7)-3,4)}' epitope2protein.HLA-D_m13.out   Print last four letters in seventh column

echo "Mary had a little lamb" |gawk '{line = $0; gsub (" ","",line);print line}'           
                                                                         Remove all spaces in all lines
gawk -v name=Mary -v animal=lamb '{print name,$1,animal}' epitope2protein.HLA-D_m13.out
                                                                         Passing variables to gawk

gawk -F "\t" '{print $1}' epitope2protein.HLA-D_m13.out                  Split input only on tabs (rather than on any whitespace, which is the default)

head epitope2protein.HLA-D_m13.out | gawk 'BEGIN{print "Here comes the data"}{print $1}END{print "No more data"}'
                                                                         statements in BEGIN{} and END{} are executed before and after 
                                                                         the data lines are read, respectively
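
The examples above can be tried without the real data file. The sketch below uses three made-up input lines with a sequence in the seventh field, and plain awk so that it runs on most systems; gawk accepts the same programs.

```shell
# Made-up three-line input; the real .out file is not needed for this sketch.
sample='NP_001 a 5 x x x MKTAYIAKQR
XP_002 b 6 x x x GSHMLEDPVA
NP_003 c 7 x x x ACDEFGHIKL'

# First field of lines starting with "NP"
np_ids=$(printf '%s\n' "$sample" | awk '/^NP/ {print $1}')

# substr counts from 1: characters 2 to 6 of the seventh field of line 1
middle=$(printf '%s\n' "$sample" | awk 'NR==1 {print substr($7,2,5)}')

# Last four characters of the seventh field of line 1
tail4=$(printf '%s\n' "$sample" | awk 'NR==1 {print substr($7,length($7)-3,4)}')

# BEGIN runs before the first line, END after the last one
count=$(printf '%s\n' "$sample" | awk 'BEGIN{n=0} {n++} END{print n}')

echo "$np_ids"
echo "$middle $tail4 $count"
```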

A more complex example: You have a file called epitope2protein.HLA-D_m13.out with a protein sequence in the seventh column and, in the third column, the residue number in the sequence where an epitope starts. You want to print the sequence surrounding the start of the epitope (in this case the four residues before the epitope and the first five residues of the epitope) in a format that can be read by the sequence motif visualization program logo. The first line in the output must be "* Aligned protein sequences.", and each sequence motif must be followed by a ".". Furthermore, only motifs that are nine amino acids long should be printed. This is what the command can look like:

cat epitope2protein.HLA-D_m13.out | gawk 'BEGIN{print "* Aligned protein sequences."}{s=substr($7,$3-4,9);if (length(s)==9){print s"."}}' | /home/projects/projects/vaccine/bin/logo2 -p - | gawk 'BEGIN{pr=0}{if (/\%\!PS-Adobe-3.0 EPSF-3.0/){pr=1} if(pr==1){print $0}}' >
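
The windowing step of this command can be studied on its own. The sketch below is only an illustration: it uses three made-up input lines (start position in field 3, sequence in field 7), leaves out the logo2 step entirely, and uses plain awk in place of gawk.

```shell
# Made-up input: field 3 is the epitope start, field 7 the protein sequence.
sample='id1 x 5 x x x MKTAYIAKQRLM
id2 x 8 x x x GSHMLEDPVA
id3 x 6 x x x ACDEFGHIKLMNP'

motifs=$(printf '%s\n' "$sample" | awk '
BEGIN { print "* Aligned protein sequences." }
{
    # 4 residues before the epitope start plus the first 5 of the epitope
    s = substr($7, $3 - 4, 9)
    # keep only complete 9-residue windows
    if (length(s) == 9) print s "."
}')
echo "$motifs"
```

The second line is dropped because the sequence ends before the window is complete. For epitopes that start fewer than five residues from the beginning of the sequence, $3-4 is less than 1; how substr treats such a start is worth checking on your own awk before relying on the length filter.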

Sort file - sort
Example of using sort:

Getting a test file:
cp /usr/opt/www/pub/CBS/researchgroups/immunology/intro/Unix/test.out .

  sort -n test.out  				sort file numerically
  sort -n -k3 test.out				sort file numerically (big numbers last) by 3rd column
  sort -r -n -k3 test.out			sort file reverse numerically (big numbers first) by 3rd column
  sort -u pdb.mhc.spnam			Keep only one copy of each unique line
  sort pdb.mhc.spnam | uniq -c		Count the number of each unique line
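
The sort options can be checked on a few made-up lines before using them on real data. In this sketch the third column holds the numbers 10, 9, 100 and 9, and one line occurs twice.

```shell
# Made-up four-line input with one duplicated line.
sample='b 2 10
a 1 9
c 3 100
a 1 9'

# Numeric sort on the third column: 9 9 10 100 (a plain text sort would put 100 before 9)
by_third=$(printf '%s\n' "$sample" | sort -n -k3 | awk '{print $3}')

# Reverse numeric sort: the biggest number comes first
rev_top=$(printf '%s\n' "$sample" | sort -r -n -k3 | head -1)

# sort -u keeps one copy of each line; sort | uniq -c also counts the copies
uniq_lines=$(printf '%s\n' "$sample" | sort -u | wc -l | tr -d ' ')
counts=$(printf '%s\n' "$sample" | sort | uniq -c | awk '{print $1 ":" $2}')

echo "$by_third"
echo "$rev_top"
echo "$uniq_lines"
echo "$counts"
```

Note that uniq only merges neighbouring lines, which is why the input is sorted first.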

Execute a string
Putting backquotes (``) around a command makes Unix run that command and then execute its output as a new command:
echo pwd
      print the string pwd
`echo pwd`
      execute the command pwd
echo pwd|sh
      echo the string pwd to the shell - which will then execute it
This will be used in the next example.
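
The three variants can be compared side by side. In this sketch the outer $( ) is used only to store what each variant prints in a variable (it is another way of writing backquotes in sh and bash); the variable names are made up.

```shell
as_text=$(echo pwd)        # just the string "pwd"; nothing is executed
executed=$(`echo pwd`)     # the backquoted output "pwd" is run as a command
piped=$(echo pwd | sh)     # a new shell reads the string "pwd" and runs it
here=$(pwd)                # the plain pwd command, for comparison

echo "$as_text"
echo "$executed"
echo "$piped"
```

The first echo prints the word pwd, while the other two print the current directory, just like typing pwd directly.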

Do something with many variables - foreach
Example 1: print each entry in list to screen

foreach entry (a b c)
      echo $entry
end

Example 2: get each swissprot entry from list and print it

foreach entry (`gawk '{print $1}' test.out`)
      echo $entry |sed 's/.*|//'| xargs getsprot
end

NB: echo ENV_HV1H2 | xargs getsprot      is the same as getsprot ENV_HV1H2     

Warning: the string within () is limited to a few thousand characters
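
foreach belongs to the csh/tcsh shells. If your shell is sh or bash instead, a for or while loop does the same job, and reading the list line by line with while read avoids building one long list inside (). The sketch below is an illustration under assumptions: fetch_entry is a hypothetical stand-in for getsprot, and the accession fields XXXXX1 and XXXXX2 are placeholders, not real accession numbers.

```shell
# fetch_entry is a hypothetical stand-in for getsprot.
fetch_entry() { echo "would fetch $1"; }

# Made-up identifiers in "db|accession|name" form
ids='sp|XXXXX1|1A68_HUMAN
sp|XXXXX2|ENV_HV1H2'

# Strip everything up to the last "|" and handle one entry per line
result=$(printf '%s\n' "$ids" | sed 's/.*|//' | while read -r entry; do
    fetch_entry "$entry"
done)
echo "$result"

# The same kind of loop over a short fixed list, like: foreach entry (a b c)
letters=$(for entry in a b c; do echo "$entry"; done)
echo "$letters"
```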

Concatenate side by side - paste
paste pdb.mhc.nam pdb.mhc.spnam     
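
What paste produces is easiest to see on two tiny files. The file names and their contents below are invented for this sketch; paste joins line 1 of the first file with line 1 of the second, separated by a tab, and so on.

```shell
# names.txt and spnames.txt are made-up two-line files.
workdir=$(mktemp -d)
printf '%s\n' pdbA pdbB > "$workdir/names.txt"
printf '%s\n' ENTRY1_HUMAN ENTRY2_MOUSE > "$workdir/spnames.txt"

pasted=$(paste "$workdir/names.txt" "$workdir/spnames.txt")
echo "$pasted"

# The result is tab-separated, so awk -F "\t" can pick out the columns again
second=$(printf '%s\n' "$pasted" | awk -F '\t' 'NR==2 {print $2}')
echo "$second"
rm -rf "$workdir"
```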

Get lines matching a pattern - grep/egrep

  grep 1A68_HUMAN pdb.mhc.spnam	Get lines with "1A68_HUMAN"
  grep -v HUMAN pdb.mhc.spnam		Get lines that do not contain "HUMAN"
  grep _HUMAN pdb.mhc.spnam		Get lines with "_HUMAN" (human SWISSPROT sequences)
  grep ^KA pdb.mhc.spnam			Get lines starting with "KA"
  grep "^KA.*MOUSE" pdb.mhc.spnam	Get lines starting with "KA", followed by anything ("." is a wildcard; "*" means repeated zero or more times), followed by "MOUSE"
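
The patterns can be tested on a short made-up list before running them on pdb.mhc.spnam. Apart from 1A68_HUMAN, the entry names below are invented for this sketch.

```shell
# Made-up list of entry names, one per line.
names='1A68_HUMAN
KA01_MOUSE
KA02_HUMAN
XX01_RAT'

human=$(printf '%s\n' "$names" | grep _HUMAN)           # lines containing "_HUMAN"
not_human=$(printf '%s\n' "$names" | grep -v HUMAN)     # lines without "HUMAN"
ka=$(printf '%s\n' "$names" | grep '^KA')               # lines starting with "KA"
ka_mouse=$(printf '%s\n' "$names" | grep '^KA.*MOUSE')  # "KA" at the start, "MOUSE" later

echo "$human"
echo "$not_human"
echo "$ka"
echo "$ka_mouse"
```

Putting the pattern in single quotes keeps the shell from interpreting characters such as "*" before grep sees them.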

What did I do

The history command lists the most recently executed commands (the last 100 - 500, depending on the shell settings)

Other useful commands
which cat

Find out where a program (the cat program in this case) is installed. Often when you edit a program and nothing changes, it is because you are editing a different copy of the program than the one you are running.

gunzip Unzip a zipped file (.gz files)

Command line options
Most Unix programs take options in the form "program -option". For example, head -5 prints the first 5 lines of a file (5 is the option), and ls -l uses the -l option of ls.

diff/gdiff Compare two files

chmod Change permissions (who can read, write to, or execute a file or script)
chown Change ownership of a file
autocompletion Press TAB to let the unix system complete a file/program name
"arrow up" press arrow up to get old commands

<ctrl> a Go to the start of the commandline
<ctrl> e Go to the end of the commandline

Can I use it?

Now the training is over and you have to solve a small problem using some of the commands above. By the command
getsprot 1A68_HUMAN     
you can get the SWISSPROT entry 1A68_HUMAN, and print it out to the screen

The command
getsprot 1A68_HUMAN|gawk '{print $0}'     
takes the output from the previous command and runs it through a gawk command that prints everything out, i.e., it does nothing. Your job is to rewrite the gawk command so that it writes out the SWISSPROT entry in fasta format:
Q1: Which lines contain the information you need to make the fasta file?

Q2: How can you recognise these lines?

Q3: Explain which fields in these lines should be printed to get the correct output, and/or how they should be edited to make the output correct.

Q4: Explain what a program that converts the SWISSPROT entry to a fasta entry should do.

Q5: Give the code of a gawk program that does this (or use any other programming language(s)).

By the command
getsprot -f pdb.mhc.spnam     
you get a lot of SWISSPROT entries

Q6: Give the code of a gawk program that converts all these to fasta format (maybe the code you developed above is general enough to be used again).

Q7: Write a program that counts how many fasta entries there are in a file.

Q8: There are several copies of some of the fasta entries. Can you think of a way to only print each of the fasta entries out once?

If you are interested in a more comprehensive introduction to UNIX you can look at the tutorial here.