gb2tab v 1.2.1 (command line program behind the FeatureExtract webserver)
gb2tab - extract sequence and annotation (intron/exon etc)
from GenBank format files.
gb2tab [-f 'CDS,mRNA,...'] [options...] [files...]
gb2tab is a tool for extracting sequence and annotation
information (such as intron / exon structure) from GenBank
format files.
This tool handles overlapping genes gracefully.
If no files are specified, input is read from STDIN.
Several GenBank files can be concatenated to STDIN.
The extracted sequences are streamed to STDOUT with one
entry per line in the following format (tab separated):
name seq ann com
name: The sequence id. See the --genename, --locustag and
--entryname options below.
seq: The DNA sequence itself. UPPERCASE is used for the
main sequence, lowercase is used for flanks (if any).
ann: Single letter sequence annotation. Position for position,
the annotation describes the DNA sequence: the first
letter in the annotation describes the first position
in the DNA sequence, and so forth.
The annotation code is defined as follows:
FEATURE BLOCKS (AKA. "EXON BLOCKS")
( First position
T tRNA exonic region
R rRNA / generic RNA exonic region
X Unknown feature type
) Last position
? Ambiguous first or last position
[ First UTR region position
] Last UTR region position
See also the --block-chars option, for a further
explanation of feature blocks and exonic regions.
INTRONS and FRAMESHIFTS
D First intron position (donor site)
I Intron position
A Last intron position (acceptor site)
< Start of frameshift
> End of frameshift
REGIONS WITHOUT FEATURES
. NULL annotation (no annotation).
ONLY IN FLANKING REGIONS:
+ Other feature defined on the SAME STRAND
as the current entry.
- Other feature defined on the OPPOSITE STRAND
relative to the current entry.
# Multiple or overlapping features.
A..Z: Feature on the SAME STRAND as the current entry.
a..z: Feature on the OPPOSITE STRAND as the current entry.
See the -e option for a description of which features
are annotated in the flanking regions.
The options --flank_ann_full (default) and
--flank_ann_presence determine if full annotation
(+upper/lower case) or annotation of presence/absence
(+/- and #) is used.
com: Comments (free text). All text, extra information etc.
defined in the GenBank files is concatenated into a single
comment string.
The following extra information is added by this program:
*) GenBank accession ID.
*) Source (organism)
*) Feature type (e.g. "CDS" or "rRNA")
*) Strand ("+" or "-").
*) Spliced DNA sequence. Simply the DNA sequence defined
by the JOIN statement.
This is provided for two reasons: 1) to overcome negative
frameshifts; 2) as an easy way of extracting the sequence
of the spliced product. See also the --splic_always and
--flank_splic options below.
*) Spliced DNA annotation.
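The output format described above can be read back with a few lines of
Python. The following is a minimal sketch, not part of gb2tab itself: the
names Entry, parse_line and main_region are hypothetical, and the sample
line is made up for illustration. It assumes every line carries all four
tab-separated fields and that flanks are the only lowercase parts of seq.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    """One gb2tab output line (hypothetical helper type)."""
    name: str
    seq: str
    ann: str
    com: str


def parse_line(line: str) -> Entry:
    """Split one output line into name, seq, ann and com."""
    # maxsplit=3 keeps any stray tabs inside the comment field intact.
    name, seq, ann, com = line.rstrip("\n").split("\t", 3)
    # The annotation describes the sequence position for position,
    # so both strings must have the same length.
    if len(seq) != len(ann):
        raise ValueError(f"{name}: sequence and annotation lengths differ")
    return Entry(name, seq, ann, com)


def main_region(entry: Entry) -> Tuple[str, str]:
    """Return the UPPERCASE main sequence and its annotation, without flanks."""
    upper = [i for i, base in enumerate(entry.seq) if base.isupper()]
    if not upper:
        return "", ""
    start, end = upper[0], upper[-1] + 1
    return entry.seq[start:end], entry.ann[start:end]


# Made-up example line: 2 bp flanks around a 7 bp feature block.
sample = "demo1\tacGATTACAgt\t..(EEEEE)..\tmade-up comment\n"
entry = parse_line(sample)
seq, ann = main_region(entry)
```

Here seq is "GATTACA" and ann is "(EEEEE)". A line with fewer than four
fields raises a ValueError when the fields are unpacked.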
The following options are available.
-f X, --feature_type=X
Define which feature type(s) to extract.
Default is 'CDS' which is the most general way
to annotate protein coding genes.
Multiple features can be selected by specifying a comma
separated list - for example "CDS,rRNA,tRNA".
ALL: Using the keyword "ALL", will extend the list to all
feature types listed in each GenBank file.
Please notice: This can occasionally lead to problems
in files that use multiple feature types to cover the
same actual feature (e.g. using both "gene" and "CDS").
MOST: Covers the following feature types:
The keyword can also be included in the user specified list.
For example "MOST,novel_feature" will construct a list containing
the list mentioned above + the new feature type "novel_feature".
-e X, --flank_features=X
Define which features to annotate in flanking regions.
The scheme for specifying features is the same as in the
-f option (see above).
The default value is "MOST".
If no flanking regions are requested (see options -b and -a
below) this option is ignored.
Extract intergenic regions. When this option is used, all
regions in between the features defined with the -f option
are extracted rather than the features themselves.
Please notice that features specified using the -e option
may be present in the intergenic regions.
Intergenic regions will always be extracted from the "+" strand.
For intron containing sequences, output the spliced version as
the main result (normally this information goes into the
comments). If this option is used, the full length product will
be added to the comments instead.
Using this option will force the inclusion of flanks (if any)
in the spliced product. See also option --flank_splic.
Only output intron containing sequences. Can be used in
combination with the -s option.
-b X, --flank_before=X
Extract X basepairs upstream of each sequence.
-a X, --flank_after=X
Extract X basepairs downstream of each sequence.
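As an illustration of how the -f, -b and -a options combine, the sketch
below assembles a command line from Python. It only uses option spellings
that appear on this help page. The function name build_gb2tab_command and
the file name are hypothetical, and the sketch assumes the gb2tab
executable is on the PATH if the command is actually run.

```python
from typing import List, Sequence


def build_gb2tab_command(
    files: Sequence[str],
    features: Sequence[str] = ("CDS",),
    flank_before: int = 0,
    flank_after: int = 0,
) -> List[str]:
    """Assemble an argument list for gb2tab (hypothetical helper)."""
    # Feature types are passed as one comma separated list, e.g. "CDS,rRNA".
    cmd = ["gb2tab", "-f", ",".join(features)]
    if flank_before > 0:
        cmd.append(f"--flank_before={flank_before}")
    if flank_after > 0:
        cmd.append(f"--flank_after={flank_after}")
    # With no files given, gb2tab reads GenBank data from STDIN.
    cmd.extend(files)
    return cmd


cmd = build_gb2tab_command(
    ["example.gbk"], features=["CDS", "rRNA"], flank_before=100, flank_after=50
)
# To run it (requires gb2tab to be installed):
#   import subprocess
#   result = subprocess.run(cmd, capture_output=True, text=True, check=True)
#   lines = result.stdout.splitlines()
```

Passing the arguments as a list avoids shell quoting problems with the
comma separated feature list.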
Print this help page and exit.
Run through all extraction steps but do not output any
data. Useful for debugging bad GenBank files in combination
with the verbose option.
Output messages about progress, details about the GenBank
file etc. to STDERR. Useful for finding errors.
Suppress all warnings, error messages and verbose info.
The exit value will still be non-zero if an error is
encountered.
Annotate presence/absence and relative strandness of
features in the flanking regions.
Features - of any kind - are annotated with "+" if they are
on the SAME STRAND as the extracted feature, and "-" if they
are on the OPPOSITE STRAND. "#" marks regions covered by
multiple features.
This option is very useful for use with OligoWiz-2.0.
Default: Include full-featured annotation in the flanking regions.
Features on the SAME STRAND as the extracted feature are uppercase;
features on the OPPOSITE STRAND are lowercase.
In case of regions covered by multiple features, the
feature defined FIRST by the -e option has preference.
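The two flank annotation modes can be compared with a small sketch. This
is an illustration of the rules as stated on this page, not the program's
actual code. Each flank position is modelled as a list of
(letter, same_strand) pairs, already ordered by their position in the -e
list; this data layout and both function names are assumptions.

```python
from typing import List, Tuple

Feature = Tuple[str, bool]  # (annotation letter, on same strand as entry?)


def presence_char(features: List[Feature]) -> str:
    """Presence/absence mode: '+', '-', '#' or '.' for one flank position."""
    if not features:
        return "."          # NULL annotation
    if len(features) > 1:
        return "#"          # multiple or overlapping features
    return "+" if features[0][1] else "-"


def full_char(features: List[Feature]) -> str:
    """Full annotation mode: letter case encodes the relative strand."""
    if not features:
        return "."
    # The feature listed first by the -e option takes preference.
    letter, same_strand = features[0]
    return letter.upper() if same_strand else letter.lower()


# Four made-up flank positions.
flank = [[], [("E", True)], [("E", True), ("T", False)], [("T", False)]]
presence = "".join(presence_char(pos) for pos in flank)
full = "".join(full_char(pos) for pos in flank)
```

For these four positions presence is ".+#-" and full is ".EEt".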
Also include flanking regions in the spliced product.
Default is to ignore flanks.
Include spliced product for ALL entries.
Default is to only print spliced product information for
intron/frameshift containing entries.
"Introns" shorter than X bp (default 15bp) are considered
frameshifts. This includes negative frameshifts.
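The intron/frameshift rule can be written out as a short sketch. It
assumes 1-based inclusive coordinates as used in GenBank JOIN statements;
the function name classify_gap is hypothetical. A gap of exactly 0 bp is
not discussed on this page, so the sketch simply treats it like any other
gap shorter than the threshold.

```python
def classify_gap(prev_block_end: int, next_block_start: int,
                 min_intron: int = 15) -> str:
    """Classify the gap between two joined blocks (hypothetical helper)."""
    # Number of bases strictly between the two blocks. A negative value
    # means the blocks overlap, i.e. a negative frameshift.
    gap = next_block_start - prev_block_end - 1
    return "frameshift" if gap < min_intron else "intron"


# join(1..100,103..200): 2 bp gap
short_gap = classify_gap(100, 103)
# join(1..100,99..200): blocks overlap by 2 bp
negative_gap = classify_gap(100, 99)
# join(1..100,201..300): 100 bp gap
long_gap = classify_gap(100, 201)
```

With the default threshold, the first two gaps are classified as
"frameshift" and the third as "intron".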
Specify which characters to use for annotation of the
extracted feature types. For spliced features (e.g. CDS),
each exonic block is annotated using the specified characters.
Three characters must be supplied (for each feature type):
First position, internal positions, last position.
For example the string "(E)" will cause a 10bp feature block
(i.e. a CDS exon block) to be annotated like this: (EEEEEEEE)
Introns are filled in as DII..IIA
By default the program determines the annotation characters
based on the type of feature being extracted:
(E) CDS, mRNA
(R) rRNA, snoRNA, snRNA, misc_RNA, scRNA
(X) Everything else.
This table can be expanded (and overwritten) by supplying a
list of relations between feature type and block chars.
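The block annotation scheme can be reproduced with a short sketch. The
function names are hypothetical and the code is an illustration of the
scheme described above, not the program's own implementation. Blocks and
introns shorter than 2 bp are rejected because their annotation is not
described on this page.

```python
from typing import List


def annotate_block(length: int, chars: str = "(E)") -> str:
    """Annotate one feature block: first, internal and last position."""
    if len(chars) != 3:
        raise ValueError("exactly three block characters are required")
    if length < 2:
        raise ValueError("blocks shorter than 2 bp are not covered here")
    first, inner, last = chars
    return first + inner * (length - 2) + last


def annotate_spliced(block_lengths: List[int], intron_lengths: List[int],
                     chars: str = "(E)") -> str:
    """Interleave feature blocks with introns filled in as D, I.., A."""
    if len(intron_lengths) != len(block_lengths) - 1:
        raise ValueError("need exactly one intron between adjacent blocks")
    parts = []
    for i, block in enumerate(block_lengths):
        parts.append(annotate_block(block, chars))
        if i < len(intron_lengths):
            # Donor site, intron positions, acceptor site.
            parts.append(annotate_block(intron_lengths[i], "DIA"))
    return "".join(parts)


ten_bp_block = annotate_block(10)              # "(EEEEEEEE)"
two_exons = annotate_spliced([5, 4], [6])      # "(EEE)DIIIIA(EE)"
```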
Try to extract the gene name from the /gene="xxxx"
tag (this is usually the classical gene name, e.g. HTA1).
If this is not possible, fall back to 1) locustag
or 2) entryname (see below).
Try to extract the locus tag (usually the systematic
gene name) from the /locus_tag="xxxx" tag. Fall back
to using the entryname if not possible (see below).
This is the default behavior.
Use the main GenBank entry name (the "LOCUS" name) as
the base of the sequence names.
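The fallback order of the three naming options can be summarised in a
sketch. The qualifiers are modelled as a plain dictionary keyed by
GenBank tag name ("gene", "locus_tag"). The function name pick_name and
the mode strings are assumptions, and the sketch returns only the base
name without any suffix the program may add.

```python
from typing import Dict


def pick_name(qualifiers: Dict[str, str], entryname: str,
              mode: str = "locustag") -> str:
    """Choose a sequence name following the documented fallback order."""
    if mode == "genename":
        order = ["gene", "locus_tag"]   # then fall back to the entry name
    elif mode == "locustag":
        order = ["locus_tag"]           # default behavior
    elif mode == "entryname":
        order = []
    else:
        raise ValueError(f"unknown naming mode: {mode}")
    for tag in order:
        value = qualifiers.get(tag)
        if value:
            return value
    return entryname


# Made-up qualifiers and LOCUS name for illustration.
quals = {"gene": "HTA1", "locus_tag": "YDR225W"}
by_gene = pick_name(quals, "DEMO_LOCUS", mode="genename")
by_locus = pick_name(quals, "DEMO_LOCUS")
no_tags = pick_name({}, "DEMO_LOCUS", mode="genename")
```

Here by_gene is "HTA1", by_locus is "YDR225W" and no_tags is "DEMO_LOCUS".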
This program DOES NOT support entries which span multiple
GenBank files. It is very unlikely this will ever be supported.
(Please notice that the webserver version supports expanding
reference GenBank entries to the listed subentries automatically).
Rasmus Wernersson, 2005.
"FeatureExtract - extraction of sequence annotation made easy".
Nucleic Acids Research, 2005, Vol. 33, Web Server issue W567-W569
The webpage contains detailed instructions and examples.
The most recent version of this program is downloadable
from this web address.
Rasmus Wernersson, firstname.lastname@example.org
Sep 2008 - bugfix + better IUPAC support