easy import has been designed to be flexible enough to fix many common problems with .gff files, so that the .ini file can contain a complete record of the process needed to take a provided .gff file and import it successfully into an Ensembl database. Guidelines for some common problems are given below; more cases can probably be "fixed" with creative use of the basic syntax.
Very broken gff
In some cases, there is no practical way to automate repair of a file (particularly when attributes for just a few features have been incorrectly edited by hand), or repair would require the full feature set of the gff parser, which is not practical to expose through the meta-syntax used in the .ini files. If manual edits are required, the commands used can be recorded as notes (prefixed by a semicolon) in the .ini file to ensure a complete record is preserved, e.g.:
[FILES]
GFF = [ gff http://example.com/example.gff3.gz ]
; perl -p -i -e 's/Parent=mrna3/Parent=mrna2/' example.gff3
IDs are automatically generated for all features that lack
In order to link features hierarchically in a .gff file, features below the level of gene should each have a Parent attribute containing the ID of their parent feature. If these are missing, the coordinates of a feature's parent can be inferred from its own coordinates using the expectation hasParent. A feature of the appropriate type spanning those coordinates can then either be created (keyword make) or identified from the set of existing features (keyword find).
[GFF]
CONDITION = [ EXPECTATION mrna hasParent gene force ]
In this example, the keyword force first attempts to find an existing feature to use as a parent but will make a new feature if there is no existing feature with correct coordinates.
Exons may have valid Parent attributes of different types within the same file. To allow testing for multiple types, it is possible to specify a pipe-separated list of types to find. If no matching feature is found, a new parent feature can be created using force with the type of the first item in the list.
[GFF]
CONDITION = [ EXPECTATION exon hasParent transcript|mrna force ]
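The find, make and force behaviour described above can be illustrated with a short Python sketch. This is not easy import's own code: the `Feature` class, the `has_parent` function and the rule that a candidate parent must fully contain the child on the same sequence and strand are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    # Hypothetical minimal feature record; not the easy import data model.
    seqid: str
    type: str
    start: int
    end: int
    strand: str = "+"


def has_parent(child: Feature, features: List[Feature],
               parent_types: str, mode: str = "force") -> Optional[Feature]:
    """Return a parent for `child`, chosen by coordinates alone.

    parent_types is a pipe-separated list such as "transcript|mrna".
    mode is "find", "make" or "force" (try find first, then make).
    """
    types = parent_types.split("|")
    if mode in ("find", "force"):
        for candidate in features:
            # Assumed matching rule: same sequence and strand, spans the child.
            if (candidate.type in types
                    and candidate.seqid == child.seqid
                    and candidate.strand == child.strand
                    and candidate.start <= child.start
                    and candidate.end >= child.end):
                return candidate
    if mode in ("make", "force"):
        # A new parent takes the type of the first item in the list.
        parent = Feature(child.seqid, types[0], child.start, child.end,
                         child.strand)
        features.append(parent)
        return parent
    return None
```

With `mode="find"` the function returns `None` when nothing matches, which is the case that `force` covers by creating a new feature.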
The Ensembl schema assumes that all transcripts are composed of exons. Many .gff files that lack non-coding annotations omit exons as they are essentially duplicates of CDS features. Exons can be inferred using the expectation hasSister.
[GFF]
CONDITION = [ EXPECTATION cds hasSister exon force ]
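As a rough illustration of inferring exons from CDS features, the sketch below adds an exon for every CDS that has no exon under the same parent covering it. The dictionary layout, the function name `add_sister_exons` and the choice to copy the CDS coordinates directly are assumptions for the example, not a description of how easy import implements hasSister.

```python
def add_sister_exons(features):
    """Return features plus one new exon per CDS that lacks a sister exon.

    Each feature is a dict with "type", "parent", "start" and "end" keys
    (a hypothetical layout used only for this sketch).
    """
    exons = [f for f in features if f["type"] == "exon"]
    made = []
    for cds in (f for f in features if f["type"] == "cds"):
        # A sister is an exon with the same parent that covers the CDS.
        has_sister = any(
            e["parent"] == cds["parent"]
            and e["start"] <= cds["start"]
            and e["end"] >= cds["end"]
            for e in exons + made
        )
        if not has_sister:
            # Assumed behaviour: the new exon duplicates the CDS coordinates.
            made.append({
                "type": "exon",
                "parent": cds["parent"],
                "start": cds["start"],
                "end": cds["end"],
            })
    return features + made
```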
Alternatively, if only introns are present in the file, it is possible to use FILL_GAPS to generate exon features between the introns (keyword internal) and before and after the first and last introns (keyword external).
[GFF]
CONDITION1 = [ FILL_GAPS intron exon internal ]
CONDITION2 = [ FILL_GAPS intron exon external ]
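The coordinate arithmetic behind gap filling can be sketched in Python as follows. The function `fill_gaps` is hypothetical, and it assumes 1-based inclusive coordinates and that the outer limits for the external exons come from the parent transcript; easy import may determine these differently.

```python
def fill_gaps(introns, parent_start, parent_end, mode):
    """Return exon (start, end) tuples implied by a list of introns.

    introns is a list of (start, end) tuples for one transcript.
    mode "internal" fills gaps between consecutive introns;
    mode "external" fills the regions before the first and after the last.
    """
    introns = sorted(introns)
    exons = []
    if mode == "internal":
        for (_, prev_end), (next_start, _) in zip(introns, introns[1:]):
            if prev_end + 1 <= next_start - 1:
                exons.append((prev_end + 1, next_start - 1))
    elif mode == "external":
        if introns:
            first_start = introns[0][0]
            last_end = introns[-1][1]
            if parent_start < first_start:
                exons.append((parent_start, first_start - 1))
            if last_end < parent_end:
                exons.append((last_end + 1, parent_end))
    else:
        raise ValueError("mode must be 'internal' or 'external'")
    return exons
```

For a transcript spanning 100 to 600 with introns at 201 to 300 and 401 to 500, the internal pass yields one exon (301 to 400) and the external pass yields two (100 to 200 and 501 to 600).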
Sometimes the start and end coordinates for a feature may be reversed. This example compares the start and end coordinates for a set of feature types to check that the start is not after the end. If this expectation is violated, it is not always clear what to do without examining the file, so the behaviour is set to warn.
[GFF]
CONDITION = [ EXPECTATION cds|exon|mrna|gene <=[_start,_end] SELF warn ]
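The check expressed by this expectation amounts to comparing two fields and reporting, rather than repairing, any violation. A minimal Python sketch, using a hypothetical `check_start_not_after_end` function and dict-based features, might look like this:

```python
import warnings


def check_start_not_after_end(features,
                              types=("cds", "exon", "mrna", "gene")):
    """Warn about features of the given types whose start is after their end.

    Returns the offending features unchanged so they can be inspected.
    """
    bad = []
    for feature in features:
        if feature["type"].lower() in types and feature["start"] > feature["end"]:
            warnings.warn(
                "%s has start %d after end %d"
                % (feature["type"], feature["start"], feature["end"])
            )
            bad.append(feature)
    return bad
```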
Often different types are used to refer to functionally equivalent features and, while the reasons for this may be legitimate, it can be inconvenient when parsing. This can be resolved by using MAP_TYPES to cause the gff parser to treat types as equivalent.
[GFF]
CONDITION1 = [ MAP_TYPES initial exon ]
CONDITION2 = [ MAP_TYPES terminal exon ]
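Conceptually, type mapping is a lookup applied to each feature type before any other rule is evaluated. The sketch below mirrors the two conditions above with a plain dictionary; `TYPE_MAP` and `map_type` are illustrative names, not part of easy import.

```python
# Hypothetical mapping equivalent to the two MAP_TYPES conditions above.
TYPE_MAP = {"initial": "exon", "terminal": "exon"}


def map_type(feature_type):
    """Return the canonical type, leaving unmapped types unchanged."""
    return TYPE_MAP.get(feature_type, feature_type)
```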
.gff assumes that each feature has a single, unique ID attribute. In some files, IDs are applied incorrectly to CDS features: all CDS feature lines that share a common transcript parent should share a single ID, even though they correspond approximately to a set of exons, each of which should correctly have its own unique ID. To fix a file in which each CDS line has a unique ID, it is possible to override the ID attribute and cause a new one to be generated.
[GFF]
CONDITION1 = [ OVERRIDE cds ID ]
CONDITION2 = [ LACKS_ID cds make ]
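The two-step fix (discard the existing ID, then generate a replacement) can be sketched in Python. Both function names and the dict layout are hypothetical, and the naming scheme for the new ID, derived from the Parent attribute so that CDS lines under one transcript share an ID, is purely an assumption; the source does not say how easy import names generated IDs.

```python
def override_attribute(features, feature_type, attribute):
    """Discard `attribute` from every feature of `feature_type`."""
    for feature in features:
        if feature["type"] == feature_type:
            feature["attributes"].pop(attribute, None)


def make_missing_ids(features, feature_type):
    """Give an ID to each feature of `feature_type` that lacks one.

    Assumed naming scheme: "<Parent>-<type>", so features that share a
    parent end up sharing an ID.
    """
    for feature in features:
        attributes = feature["attributes"]
        if feature["type"] == feature_type and "ID" not in attributes:
            attributes["ID"] = "%s-%s" % (attributes["Parent"], feature_type)
```

Running `override_attribute(features, "cds", "ID")` followed by `make_missing_ids(features, "cds")` corresponds to the order of CONDITION1 and CONDITION2 above.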
OVERRIDE can also be used to override any other attribute for a given feature type. When processing exceptions, the OVERRIDE can itself be overridden in a second .ini file by passing a feature type with no attribute specified.
[GFF]
CONDITION = [ OVERRIDE cds ]
Conflicting definitions for phase are used in different .gff files. easy import uses the sequenceontology.org specification, so for files that use the alternate definition (phase = frame - 1), it is necessary to invert the phase, converting 1 to 2 and vice versa. This is applied during Step 2.4: Import gff from prepared file (following Step 2.3: Prepare the gff file for import), so the appropriate control is located within the [MODIFY] stanza rather than the [GFF] stanza.
[MODIFY]
INVERT_PHASE = 1
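The inversion itself is a small piece of arithmetic. The sketch below swaps 1 and 2 as described above; leaving 0 and the "." placeholder unchanged is an assumption, and `invert_phase` is an illustrative name rather than an easy import function.

```python
def invert_phase(phase):
    """Swap phase 1 and 2; leave 0 and missing values ('.' or None) as-is."""
    if phase is None or phase == ".":
        return phase
    return (3 - int(phase)) % 3
```

Applying the function twice returns the original value, so running the inversion on a file that already follows the sequenceontology.org definition would introduce the error rather than fix it.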