Define the GenBank entries to be analysed by specifying GenBank accession IDs (paste in or upload)
or by pasting in (or uploading) GenBank files. A combination of IDs and
GenBank files is equally acceptable. Hitting "Submit query" at this point will
run the server with default settings: all protein-coding genes ("CDSs") will be extracted
with full intron/exon annotation.
The feature types to extract (CDS, rRNA, etc.),
the naming preferences and the definition of flanking regions
can be specified using the Basic options.
Please note that all three "Submit query" buttons perform the same action. The idea is that it is not
necessary to scroll down the web page if the options are not altered.
Specifying the input data in GenBank format
1) Specify GenBank entries by accession IDs
The easiest way to specify GenBank information is by simply
supplying a list of GenBank entry IDs. The GenBank database that the
FeatureExtract server uses is a mirror of the GenBank flat file
distribution with the addition of several eukaryotic genomes
(see databases for details).
2) Supply your own GenBank format files
Use the "Upload file" option for large files. Smaller files can be pasted
in. Multiple files can be concatenated.
Any file complying with the GenBank format definition can be used here.
For example, this could be chromosome files from the eukaryotic genomes mentioned
above. Another example could be files with custom gene or promoter predictions.
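Since every GenBank entry ends with a line containing only "//", a concatenated file can be split back into individual entries. The following is a minimal sketch, not the server's own code; the function name split_genbank_records and the toy two-entry input are made up for illustration.

```python
def split_genbank_records(text):
    """Split concatenated GenBank flat-file text into individual entries.

    Each GenBank entry is terminated by a line containing only "//".
    """
    records = []
    current = []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            records.append("\n".join(current) + "\n")
            current = []
    # Keep a trailing entry that lacks its terminator, if it has any content.
    if any(l.strip() for l in current):
        records.append("\n".join(current) + "\n")
    return records


# Toy input (heavily abbreviated, not complete GenBank entries).
example = (
    "LOCUS       DEMO1\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "//\n"
    "LOCUS       DEMO2\n"
    "ORIGIN\n"
    "        1 ttgacctg\n"
    "//\n"
)
for record in split_genbank_records(example):
    print(record.splitlines()[0])
```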
Select type of features to extract
Select which feature type(s) to extract. A number of predefined feature
types can be selected. Multiple feature types can be entered in the text field as
a comma-separated list, e.g. CDS,rRNA,tRNA,repeat.
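As an illustration of the comma-separated format, the sketch below turns such a text-field value into a clean list of feature types. The function parse_feature_types is hypothetical, and the tolerance for stray spaces, empty items and duplicates is an assumption, not documented server behaviour.

```python
def parse_feature_types(field):
    """Parse a comma-separated feature type list such as "CDS,rRNA,tRNA,repeat"."""
    types = []
    for item in field.split(","):
        item = item.strip()
        # Assumption: ignore empty items and repeated feature types.
        if item and item not in types:
            types.append(item)
    return types


print(parse_feature_types("CDS,rRNA,tRNA,repeat"))
```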
The MOST keyword (see below)
can be useful when extracting intergenic regions.
Note that some feature types are not always used to mean the same thing. In particular,
the actual meaning of gene and mRNA varies a lot between entries.
Intergenic regions: Selecting this option will include the intergenic regions
in the set of extracted sequences. The intergenic regions are simply defined as
the areas between the features defined here. Intergenic regions can be extracted
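The definition of intergenic regions as "the areas between the features" can be sketched as a simple interval computation. This is an illustration only: the function name, the 1-based inclusive coordinates and the choice to also report the regions before the first and after the last feature are assumptions.

```python
def intergenic_regions(features, seq_length):
    """Return the gaps between features as (start, end) tuples.

    features   : list of (start, end), 1-based inclusive, any order,
                 possibly overlapping.
    seq_length : length of the full sequence.
    """
    gaps = []
    next_free = 1  # first position not yet covered by a feature
    for start, end in sorted(features):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    # Assumption: the stretch after the last feature also counts.
    if next_free <= seq_length:
        gaps.append((next_free, seq_length))
    return gaps


print(intergenic_regions([(10, 20), (15, 30), (50, 60)], 70))
```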
Specify the preferred naming of each extracted entry. If the desired type of
name is not available, the server falls back to the next level: 1 > 2 > 3.
1) Gene name (GenBank field: /gene="xxx")
2) Systematic name (GenBank field: /locus_tag="xxx")
3) Entry ID + distance (GenBank field: LOCUS)
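The fallback between naming levels can be sketched as follows. This is a hedged illustration: the function choose_name, the dictionary of qualifiers and the way the entry ID and a position are joined for the third level are all assumptions, since the exact format of "Entry ID + distance" is not specified here.

```python
def choose_name(qualifiers, locus, position, preference=("gene", "locus_tag")):
    """Pick a name for an extracted entry, falling back level by level.

    qualifiers : dict of GenBank qualifiers for the feature,
                 e.g. {"gene": "...", "locus_tag": "..."}.
    locus      : entry ID taken from the LOCUS line.
    position   : position of the feature within the entry.
    """
    for key in preference:
        value = qualifiers.get(key)
        if value:
            return value
    # Assumed format for the last level: entry ID plus position.
    return f"{locus}_{position}"


print(choose_name({"locus_tag": "TAG_0001"}, "ENTRY1", 1234))
```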
Define flanking regions, if any.
Notice: computations concerning flanking region elements
are only performed if flanking regions have been requested using this option.
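To make the notion of flanking regions concrete, the sketch below cuts an upstream and a downstream flank around a feature, taking the strand into account. It is an illustration under assumptions: the function names, the 1-based inclusive coordinates and the clipping of flanks at the sequence ends are not taken from the server.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T in either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def extract_with_flanks(genome, start, end, strand, upstream, downstream):
    """Return (upstream flank, feature, downstream flank).

    start/end are 1-based inclusive positions on the forward strand.
    Upstream/downstream are relative to the strand of the feature.
    """
    # On the minus strand, "upstream" lies to the right on the forward strand.
    left, right = (upstream, downstream) if strand == "+" else (downstream, upstream)
    lo = max(1, start - left)            # assumption: clip at sequence start
    hi = min(len(genome), end + right)   # assumption: clip at sequence end
    before = genome[lo - 1:start - 1]
    core = genome[start - 1:end]
    after = genome[end:hi]
    if strand == "-":
        return (reverse_complement(after),
                reverse_complement(core),
                reverse_complement(before))
    return before, core, after


print(extract_with_flanks("AAACCCGGGTTT", 4, 6, "+", 2, 3))
```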
Click on the "Submit query" button. If the processing of the query takes more
than a few seconds, you will get the option of supplying your email address and being notified
when the job is done.
FeatureExtract has support for a number of advanced options. Typically it is not necessary to
set these manually and most users can safely skip this section and proceed to submitting the query.
This option defines the cut-off value which determines
whether an intervening sequence will be annotated as a frameshift or an intron.
Intervening sequences shorter than the specified value will be
considered frameshifts; this includes negative frameshifts.
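The cut-off rule can be sketched as a length test on the gap between consecutive segments of a feature. This is an illustration only: the function name, the cut-off of 15 used in the call and the reading of a "negative frameshift" as overlapping or back-stepping segments (a negative gap length) are assumptions.

```python
def classify_gaps(segments, cutoff):
    """Classify the intervening sequence between consecutive segments.

    segments : list of (start, end), 1-based inclusive, in order
               along the feature.
    cutoff   : gaps shorter than this are called frameshifts.
    Returns a list of (gap_length, "frameshift" or "intron").
    """
    calls = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        gap = next_start - prev_end - 1  # negative if the segments overlap
        kind = "frameshift" if gap < cutoff else "intron"
        calls.append((gap, kind))
    return calls


# Hypothetical cut-off of 15 bp, chosen only for this example.
print(classify_gaps([(1, 100), (102, 200), (200, 300), (450, 600)], 15))
```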
Custom defined annotation
Using this option it is possible to extend (or redefine) the
built-in annotation table.
Notice: For all sequences containing introns or frameshifts,
the spliced sequence and annotation are by default added to the comment field.
Splice all intron-containing sequences
Enabling this option will cause the server to produce spliced sequences
(and annotation) for all intron-containing sequences. The full-length
sequence and annotation are then moved into the comment field.
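What splicing means for a sequence and its annotation can be sketched as follows. This is a minimal illustration, not the server's implementation: the function name is made up, the sequence is assumed to span exactly one gene, and the character "I" for introns is an assumption (only "E" for exons is mentioned in this documentation).

```python
def splice(sequence, exons, exon_char="E", intron_char="I"):
    """Return (spliced sequence, spliced annotation, full-length annotation).

    sequence : full-length gene sequence.
    exons    : list of (start, end), 1-based inclusive, in order.
    """
    # Everything outside the exons is marked with intron_char (assumption).
    annotation = [intron_char] * len(sequence)
    for start, end in exons:
        for i in range(start - 1, end):
            annotation[i] = exon_char
    full_annotation = "".join(annotation)
    spliced_seq = "".join(sequence[s - 1:e] for s, e in exons)
    spliced_ann = "".join(full_annotation[s - 1:e] for s, e in exons)
    return spliced_seq, spliced_ann, full_annotation


print(splice("ATGGTAAGTCCTAA", [(1, 3), (10, 14)]))
```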
Only output intron-containing sequences
Enabling this option will suppress the output of sequences which do not
contain introns or frameshifts. This option can be used in combination
with the "Splice all..." option mentioned above, as a quick way of
producing a spliced-only dataset.
Feature types to annotate in flanking regions
This option governs which feature types to annotate in the flanking regions.
The default value, the keyword MOST, is a list
built to minimize the problem with feature type synonyms
(e.g. 'CDS' vs. 'gene' vs. 'mRNA') but at the same time extract as much
information as possible. The keywords are defined below:
A custom defined list can be specified as a comma separated list.
Flanking region annotation scheme
This option governs how features in flanking regions are annotated.
- Full annotation
Use the same annotation scheme as in the extracted sequences (e.g. (EEEEEEE) for exons).
- Features on the opposite strand
relative to the individual extracted sequence are annotated in
- Presence/absence annotation
Only annotate the presence or absence of features.
- Characters used:
"+" : a feature on the same strand.
"-" : a feature on the opposite strand.
"#" : overlapping features.
Verbose mode: Output additional information about the contents of the GenBank
files and the general progress of the extraction.
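The presence/absence annotation scheme described above, with its characters "+", "-" and "#", can be sketched as follows. This is a hedged illustration: the function name is hypothetical, and the "." used for positions without any feature is an assumption, since the character for empty positions is not listed.

```python
def presence_annotation(length, features, target_strand, empty="."):
    """Build a presence/absence annotation string for a flanking region.

    length        : length of the flanking region.
    features      : list of (start, end, strand), 1-based inclusive,
                    in flank coordinates.
    target_strand : strand of the extracted sequence ("+" or "-").
    """
    counts = [0] * length
    marks = [empty] * length
    for start, end, strand in features:
        symbol = "+" if strand == target_strand else "-"
        # Clip the feature to the flank before marking it.
        for i in range(max(1, start) - 1, min(length, end)):
            counts[i] += 1
            marks[i] = symbol if counts[i] == 1 else "#"  # "#" marks overlaps
    return "".join(marks)


print(presence_annotation(12, [(2, 5, "+"), (4, 8, "-")], "+"))
```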
Example 1: Alpha globins using GenBank accession IDs
The following list of GenBank entries contains alpha globins from
a wide range of organisms. This example illustrates the annotation
of exon and intron regions in protein coding genes.
Instruction: Paste in the list and hit "submit".
Example 2: Yeast mitochondrial genes
This is an example of how to work with an uploaded
Instructions: Download GenBank file
NC_001224 (This file contains the
Yeast mitochondrial chromosome - part of the Yeast genome
build from SGD). Upload the file to the FeatureExtract server, using the
"Upload file containing one or more GenBank files" option.
Hit "Submit query".
Notes on working with a chromosomal file
The mitochondrial GenBank file is also a good example of how
FeatureExtract works with a chromosomal file containing multiple sequence features.
For experimentation, try enabling the extraction of flanks, say 300 bp upstream and 200 bp downstream.
Also, try widening the set of feature types to be extracted from the default (CDS) to a custom list: