EXERCISE: jEdit
----------------
Answers by: Rasmus Wernersson (v18103)

Question 1:
-----------
The file sizes are:

453 bytes: alpha_globin_OldMac.fsa
453 bytes: alpha_globin_Unix.fsa
461 bytes: alpha_globin_Windows.fsa

The important thing to notice here is that a DOS/Windows newline actually
consists of two bytes (CR + LF), whereas the UNIX and old Mac standards only
use one byte (LF and CR, respectively). The 8 byte difference corresponds to
the 8 lines of text within the file:

001 >pigeon_alpha-globin-D
002 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
003 GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
004 GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
005 AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
006 CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
007 CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
008 TAA

Question 2:
-----------
Yes - inspecting the files in the associated programs (e.g. Word and Firefox)
reveals the _textual_ contents to be the same.

The file sizes differ dramatically:

29184 bytes: alpha_globin.doc
  667 bytes: alpha_globin.html
  855 bytes: alpha_globin.rtf

Question 3:
-----------
In all three cases a LOT of extra information has been added to the files.

For both the HTML and RTF files, the extra information is actually text based,
so it's possible to get an idea of what's going on by simply inspecting the
file.

Contents of the HTML file:
>>>>>>>>>>>>>>>>>>>>>
>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
<<<<<<<<<<<<<<<<<<<<<

In this case (cleanly formatted HTML) it's easy to locate the original DNA
sequence.

To some degree it's possible to figure out what's going on in the RTF file -
the codes are basically about formatting:

Snippet from the file:
>>>>>>>>>>>>>>>>>>>>>
\f0\b\fs24 \cf0 >pigeon_alpha-globin-D\
\f1\b0 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG\
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT\
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG\
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC\
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC\
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA\
<<<<<<<<<<<<<<<<<<<<<

The Word file contains a HUGE amount of additional information in BINARY
form, which is why the file looks so strange when we open it in jEdit. Opening
any other non-text file, such as a JPG image, in jEdit looks much the same: a
lot of strange symbols.

Interestingly, it's actually possible to get a glimpse of a few text strings
within the mess of symbols: the DNA sequence and the name (Rasmus Wernersson)
of the person who created the file.
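
The readable fragments mentioned above are simply runs of printable characters
sitting between binary bytes. As a rough illustration (not the tool used for
these answers), such runs can be pulled out of a byte sequence with a few lines
of Python. The function name printable_strings, the minimum run length of 4 and
the sample bytes are all assumptions made for this sketch.

```python
import re


def printable_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters in data."""
    # Printable ASCII is the byte range 0x20 (space) to 0x7e (~).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hypothetical sample: readable text surrounded by non-printable bytes.
blob = b"\x00\x01jbjb\x00\xff>pigeon_alpha-globin-D\rATGCTGACC\x00TAA\x00"
print(printable_strings(blob))
# prints ['jbjb', '>pigeon_alpha-globin-D', 'ATGCTGACC']
```

Runs shorter than the minimum length, such as the three-letter TAA in the
sample bytes, are skipped.
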
These are some of the strings we can find (generated using the "strings"
command on a UNIX prompt):
>>>>>>>>>>>>>>>>>>>>>
jbjb
>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
>pigeon_alpha-globin-D
Rasmus Wernersson
Normal
Rasmus Wernersson
Microsoft Word 11.5.0
PICT20
MSWD
Courier New
Technical University of Denmark
>pigeon_alpha-globin-D
Title
Microsoft Word Document
NB6W
Word.Document.8
<<<<<<<<<<<<<<<<<<<<<

Question 4:
-----------
Cleaned up sequence:

AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGA
GTTGCCGGGGTTGCAGGAGGAAACGGCTCTACCAATAGTGGAGCAAGGTTAATGTCCACA
CAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT
GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTG
ACTATTAAGGAGTTGCAGAAAGCTGCGATCGACTATTTCATTGATTATGCAGGAATGGAC
GAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA
AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCC
TCAATGGGACCAGATTTTACTAAGGCAAGCTATAGCACTTTTGGAACATGGACTGATATT
ACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT
TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCC
GGCGCCGCC
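
The uncleaned input for Question 4 is not shown in these answers. Assuming it
was a listing with position numbers and space-separated blocks of lower-case
bases (an assumption, as is the sample input below), the same kind of clean-up
could be sketched in Python as follows. The function name clean_sequence and
the 60-character line width are choices made for this sketch only.

```python
import re


def clean_sequence(raw, width=60):
    """Strip digits and whitespace from raw, upper-case it, wrap to width."""
    # Keep letters only: this drops position numbers, spaces and newlines.
    letters = re.sub(r"[^A-Za-z]", "", raw).upper()
    # Guard against anything that is not a plain DNA base.
    unexpected = set(letters) - set("ACGT")
    if unexpected:
        raise ValueError("unexpected characters: %s" % sorted(unexpected))
    return "\n".join(letters[i:i + width] for i in range(0, len(letters), width))


# Hypothetical raw input; only the bases themselves come from the answer above.
raw = """
        1 aacgggcacg ggacgcatgt agctggaaca gtggcagccg taaataataa tggtatcgga
       61 gttgccgggg
"""
print(clean_sequence(raw))
```

For this sample the first output line is the first 60-base line of the cleaned
sequence above, followed by GTTGCCGGGG on a second line.
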