Repairing gff
easy import has been designed to be flexible enough to fix many common problems with .gff files, so the .ini file can contain a complete record of the process needed to take a provided .gff file and import it successfully into an Ensembl database. Guidelines for some common problems are given below; there are probably more cases that can be "fixed" with creative use of the basic syntax.
Very broken gff
In some cases, there is no practical way to automate repair of a file (particularly when attributes for just a few features have been incorrectly manually edited), or repair would require the full feature set of the gff parser, which is not practical to expose through the meta-syntax used in the .ini files. If manual edits are required, the commands used can be recorded as notes (prefixed by a semicolon) in the .ini file to ensure a complete record is preserved, e.g.:
[FILES]
GFF = [ gff http://example.com/example.gff3.gz ]
; perl -p -i -e 's/Parent=mrna3/Parent=mrna2/' example.gff3
Missing ID
IDs are automatically generated for all features that lack ID attributes.
Missing Parent
In order to link features hierarchically in a .gff file, features below the level of gene should each have a Parent attribute containing the ID of their parent feature. If these are missing, the coordinates of a feature's parent can be inferred from its own coordinates using the expectation hasParent. A feature of the appropriate type spanning those coordinates can then either be created (keyword make) or identified from the set of existing features (keyword find).
[GFF]
CONDITION = [ EXPECTATION mrna hasParent gene force ]
In this example, the keyword force first attempts to find an existing feature to use as a parent but will make a new feature if there is no existing feature with the correct coordinates.
Exons may have valid Parent attributes of different types within the same file. To allow testing for multiple types, it is possible to specify a pipe-separated list of types to find. If no matching feature is found, a new parent feature can be created using make or force with the type of the first item in the list.
[GFF]
CONDITION = [ EXPECTATION exon hasParent transcript|mrna force ]
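The find, make and force behaviours described above can be pictured with a short Python sketch. This is not easy import's code: the `Feature` class, the `ensure_parent` function, the generated ID format and the rule that a parent must contain the child's coordinates on the same sequence and strand are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    # Hypothetical minimal feature record, not easy import's data model.
    seqid: str
    type: str
    start: int
    end: int
    strand: str
    id: str
    parent: Optional[str] = None


def ensure_parent(child: Feature, features: List[Feature],
                  parent_types: str, mode: str = "force") -> Optional[Feature]:
    """Attach a parent to `child` using find, make or force semantics."""
    types = parent_types.split("|")  # e.g. "transcript|mrna"
    if mode in ("find", "force"):
        for candidate in features:
            # Assumption: a valid parent spans the child's coordinates.
            if (candidate.type in types
                    and candidate.seqid == child.seqid
                    and candidate.strand == child.strand
                    and candidate.start <= child.start
                    and candidate.end >= child.end):
                child.parent = candidate.id
                return candidate
    if mode in ("make", "force"):
        # A new parent takes the first listed type and the child's coordinates.
        parent = Feature(child.seqid, types[0], child.start, child.end,
                         child.strand, id=f"{types[0]}:{child.id}")
        features.append(parent)
        child.parent = parent.id
        return parent
    return None  # find only, and nothing matched
```

With `mode="force"`, an mrna inside an existing gene is linked to that gene, while an exon with no matching transcript or mrna gets a newly made transcript as its parent.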
Missing exons
The Ensembl schema assumes that all transcripts are composed of exons. Many .gff files that lack non-coding annotations omit exons, as they are essentially duplicates of CDS features. Exons can be inferred using the expectation hasSister.
[GFF]
CONDITION = [ EXPECTATION cds hasSister exon force ]
Alternatively, if only introns are present in the file, it is possible to use FILL_GAPS to generate exon features between the introns (keyword internal) and before and after the first and last introns (keyword external).
[GFF]
CONDITION1 = [ FILL_GAPS intron exon internal ]
CONDITION2 = [ FILL_GAPS intron exon external ]
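As a rough illustration of the internal and external cases, the Python sketch below derives exon intervals from a list of intron intervals. The function name `fill_gaps` and the idea that the outer exon boundaries come from a known transcript span are assumptions for this example; how easy import determines the outer boundaries is not stated here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end), as in GFF


def fill_gaps(introns: List[Interval], span: Interval, mode: str) -> List[Interval]:
    """Return the exon intervals implied by `introns` within `span`."""
    introns = sorted(introns)
    exons: List[Interval] = []
    if mode == "internal":
        # One exon in each gap between consecutive introns.
        for (_, left_end), (right_start, _) in zip(introns, introns[1:]):
            if right_start - left_end > 1:
                exons.append((left_end + 1, right_start - 1))
    elif mode == "external":
        # Exons before the first intron and after the last intron.
        if introns:
            first_start = introns[0][0]
            last_end = introns[-1][1]
            if first_start > span[0]:
                exons.append((span[0], first_start - 1))
            if last_end < span[1]:
                exons.append((last_end + 1, span[1]))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return exons
```

For introns at 201-300 and 401-500 in a transcript spanning 100-600, the internal case yields one exon at 301-400 and the external case yields exons at 100-200 and 501-600.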
Bad coordinates
Sometimes the start and end coordinates for a feature may be reversed. This example compares the start and end coordinates for a set of feature types to check that the start is not after the end. If this expectation is violated, it is not always clear what to do without examining the file, so the behaviour is set to warn.
[GFF]
CONDITION = [ EXPECTATION cds|exon|mrna|gene <=[_start,_end] SELF warn ]
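The check expressed by this condition amounts to comparing two columns for the listed feature types. A minimal Python sketch of the same idea follows; the function name, the dictionary row format and the case-insensitive type matching are assumptions rather than a description of easy import's internals.

```python
import warnings


def check_start_before_end(rows, types=("cds", "exon", "mrna", "gene")):
    """Warn about features whose start is after their end; return the offenders."""
    offenders = []
    for row in rows:
        # Assumption: feature types are compared case-insensitively.
        if row["type"].lower() in types and row["start"] > row["end"]:
            warnings.warn(
                f"{row['type']} has start {row['start']} after end {row['end']}"
            )
            offenders.append(row)
    return offenders
```

Only a warning is issued, and the offending rows are returned unchanged so they can be inspected before any decision is made about whether to swap the coordinates.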
Inconsistent types
Often different types are used to refer to functionally equivalent features, and while the reasons for this may be legitimate, it can be inconvenient when parsing. This can be resolved by using MAP_TYPES to cause the gff parser to treat types as equivalent.
[GFF]
CONDITION1 = [ MAP_TYPES initial exon ]
CONDITION2 = [ MAP_TYPES terminal exon ]
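Conceptually, these two conditions act like a lookup table applied to the type column before any other rules are evaluated. The Python sketch below shows that reading; the names `TYPE_MAP` and `map_type` are hypothetical, and the direction of the mapping (first type treated as the second) is an assumption based on the example above.

```python
# Hypothetical lookup table mirroring the two MAP_TYPES conditions above.
TYPE_MAP = {"initial": "exon", "terminal": "exon"}


def map_type(feature_type, type_map=TYPE_MAP):
    """Return the type a feature should be treated as; unmapped types pass through."""
    return type_map.get(feature_type, feature_type)
```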
Many IDs for a single feature
Valid .gff assumes that each feature has a single, unique ID attribute. In some files IDs may be incorrectly applied to CDS features: all CDS feature lines that share a common transcript parent should share a single ID, even though they correspond approximately to a set of exons, each of which should correctly have a unique ID. To fix a file with unique CDS IDs, it is possible to override the ID attribute and cause a new one to be generated.
[GFF]
CONDITION1 = [ OVERRIDE cds ID ]
CONDITION2 = [ LACKS_ID cds make ]
OVERRIDE
can also be used to override any other attribute for a given feature type. When Processing exceptions, the OVERRIDE
can itself be overridden in a second .ini
file by passing a feature type with no attribute specified.
[GFF]
CONDITION = [ OVERRIDE cds ]
Incorrect phase
Conflicting definitions for phase are used in different .gff files. easy import uses the sequenceontology.org specification, so for files that use the alternate definition (phase = frame - 1), it is necessary to invert the phase to convert 1 to 2 and vice versa. This is applied during Step 2.4: Import gff from prepared file (following Step 2.3: Prepare the gff file for import), so the appropriate control is located within the [MODIFY] stanza rather than the [GFF] stanza.
[MODIFY]
INVERT_PHASE = 1
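Swapping 1 and 2 while leaving 0 unchanged is the same as computing (3 - phase) % 3. The Python sketch below shows that arithmetic on the phase column as it appears in a .gff file; the function name `invert_phase` and the pass-through of "." for features without a phase are assumptions for illustration, not easy import's implementation.

```python
def invert_phase(phase: str) -> str:
    """Swap phases 1 and 2, leaving 0 and the '.' placeholder unchanged."""
    if phase == ".":
        return phase  # features without a phase are left alone
    value = int(phase)
    if value not in (0, 1, 2):
        raise ValueError(f"invalid phase: {phase}")
    return str((3 - value) % 3)
```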