Repairing gff

easy import has been designed to be flexible enough to fix many common problems with .gff files, so the .ini file can contain a complete record of the process needed to take a provided .gff file and import it successfully into an Ensembl database. Guidelines for some common problems are given below; more cases can probably be "fixed" with creative use of the basic syntax.

Very broken gff

In some cases, there is no practical way to automate repair of a file (particularly when attributes for just a few features have been incorrectly manually edited), or repair would require the full feature set of the gff parser, which is not practical to expose through the meta-syntax used in the .ini files. If manual edits are required, then the commands used can be recorded as notes (prefixed by a semicolon) in the .ini file to ensure a complete record is preserved, e.g.:

[FILES]
  GFF = [ gff http://example.com/example.gff3.gz ]
  ; perl -p -i -e 's/Parent=mrna3/Parent=mrna2/' example.gff3

Missing ID

IDs are automatically generated for all features that lack ID attributes.

Missing Parent

In order to link features hierarchically in a .gff file, features below the level of gene should each have a Parent attribute containing the ID of their parent feature. If these are missing, the coordinates of a feature's parent can be inferred from its own coordinates using the expectation hasParent. A feature of the appropriate type spanning those coordinates can then either be created (keyword make) or identified from the set of existing features (keyword find).

[GFF]
  CONDITION = [ EXPECTATION mrna hasParent gene force ]

In this example, the keyword force first attempts to find an existing feature to use as a parent but will make a new feature if there is no existing feature with correct coordinates.
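
The difference between find, make and force can be sketched in Python. This is only an illustration of the behaviour described above, not the easy import implementation: the Feature class, the resolve_parent function and the generated ID format are hypothetical, and the sketch assumes the simplest case, in which the parent spans exactly the same coordinates as the child (as with an mRNA and its gene).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    # Hypothetical minimal feature record; not an easy import class.
    type: str
    start: int
    end: int
    id: str
    parent: Optional[str] = None


def resolve_parent(child: Feature, features: List[Feature],
                   parent_types: List[str], behaviour: str) -> Optional[Feature]:
    """Attach a parent to `child` according to `behaviour`.

    find  -> only reuse an existing feature of a matching type
    make  -> always create a new feature of the first listed type
    force -> try find first, then fall back to make
    Assumption: the parent has the same start and end as the child.
    """
    if behaviour in ("find", "force"):
        for candidate in features:
            if (candidate.type in parent_types
                    and candidate.start == child.start
                    and candidate.end == child.end):
                child.parent = candidate.id
                return candidate
    if behaviour in ("make", "force"):
        new_type = parent_types[0]
        # Hypothetical ID format, chosen only for this sketch.
        new_parent = Feature(new_type, child.start, child.end,
                             id=f"{new_type}:{child.start}-{child.end}")
        features.append(new_parent)
        child.parent = new_parent.id
        return new_parent
    return None
```

With force, an mRNA at 100-500 is linked to an existing gene at 100-500, while an mRNA at 600-900 with no matching gene causes a new gene feature to be created over those coordinates.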

Exons may have valid Parent attributes of different types within the same file. To allow testing for multiple types, it is possible to specify a pipe-separated list of types to find. If no matching feature is found, a new parent feature can be created using make or force with the type of the first item in the list.

[GFF]
  CONDITION = [ EXPECTATION exon hasParent transcript|mrna force ]

Missing exons

The Ensembl schema assumes that all transcripts are composed of exons. Many .gff files that lack non-coding annotations omit exons, as they are essentially duplicates of CDS features. Exons can be inferred using the expectation hasSister.

[GFF]
  CONDITION = [ EXPECTATION cds hasSister exon force ]

Alternatively, if only introns are present in the file, it is possible to use FILL_GAPS to generate exon features between the introns (keyword internal) and before and after the first and last introns (external).

[GFF]
  CONDITION1 = [ FILL_GAPS intron exon internal ]
  CONDITION2 = [ FILL_GAPS intron exon external ]
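
The arithmetic behind the two keywords can be sketched in Python. This is an illustration only, not the easy import implementation: the function names are hypothetical, coordinates are taken to be 1-based and inclusive as in .gff, and the sketch assumes that the outer bounds for external gaps come from the span of the parent transcript.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def fill_internal(introns: List[Interval]) -> List[Interval]:
    """Return the gaps lying between consecutive introns."""
    ordered = sorted(introns)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if prev_end + 1 <= next_start - 1:  # skip adjacent or overlapping introns
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


def fill_external(introns: List[Interval], parent: Interval) -> List[Interval]:
    """Return the gaps before the first and after the last intron.

    Assumption: `parent` is the span of the enclosing transcript.
    """
    if not introns:
        return []
    parent_start, parent_end = parent
    first_start = min(start for start, _ in introns)
    last_end = max(end for _, end in introns)
    gaps = []
    if parent_start < first_start:
        gaps.append((parent_start, first_start - 1))
    if last_end < parent_end:
        gaps.append((last_end + 1, parent_end))
    return gaps
```

For a transcript spanning 100-1000 with introns at 201-300 and 401-600, the internal gap is 301-400 and the external gaps are 100-200 and 601-1000, which together give the three exons.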

Bad coordinates

Sometimes the start and end coordinates for a feature are reversed. This example compares the start and end coordinates for a set of feature types to check that the start is not after the end. If this expectation is violated, it is not always clear what to do without examining the file, so the behaviour is set to warn.

[GFF]
  CONDITION = [ EXPECTATION cds|exon|mrna|gene <=[_start,_end] SELF warn ]
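
When the warning is raised, it can help to list the offending lines before deciding how to repair them. The following is a minimal standalone Python sketch of the same start/end comparison; it is not part of easy import, the function name is hypothetical, and it assumes standard nine-column tab-separated .gff lines with case-insensitive matching of feature types.

```python
from typing import Iterable, List, Set, Tuple


def reversed_features(gff_lines: Iterable[str],
                      types: Set[str]) -> List[Tuple[int, str, int, int]]:
    """Return (line number, type, start, end) for features with start > end."""
    wanted = {t.lower() for t in types}
    bad = []
    for number, line in enumerate(gff_lines, 1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a feature line
        feature_type, start, end = cols[2], int(cols[3]), int(cols[4])
        if feature_type.lower() in wanted and start > end:
            bad.append((number, feature_type, start, end))
    return bad
```

Passing an open file handle and the set {"cds", "exon", "mrna", "gene"} returns one tuple per reversed feature, which can then be inspected by hand.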

Inconsistent types

Often different types are used to refer to functionally equivalent features and while the reasons for this may be legitimate, it can be inconvenient when parsing. This can be resolved by using MAP_TYPES to cause the gff parser to treat types as equivalent.

[GFF]
  CONDITION1 = [ MAP_TYPES initial exon ]
  CONDITION2 = [ MAP_TYPES terminal exon ]

Many IDs for a single feature

Valid .gff assumes that each feature has a single, unique ID attribute. In some files, IDs are incorrectly applied to CDS features: all CDS feature lines that share a common transcript parent should share a single ID, even though they correspond approximately to a set of exons, each of which should correctly have its own unique ID. To fix a file in which every CDS line has a unique ID, it is possible to override the ID attribute and cause a new one to be generated.

[GFF]
  CONDITION1 = [ OVERRIDE cds ID ]
  CONDITION2 = [ LACKS_ID cds make ]

OVERRIDE can also be used to override any other attribute for a given feature type. When Processing exceptions, the OVERRIDE can itself be overridden in a second .ini file by passing a feature type with no attribute specified.

[GFF]
  CONDITION = [ OVERRIDE cds ]
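
The effect of discarding per-line CDS IDs and generating a shared replacement, as described at the start of this section, can be sketched in Python. This is an illustration only: the function name and the "<Parent>.cds" ID format are assumptions made for the sketch, and the IDs that easy import actually generates may look different.

```python
from typing import List


def regenerate_cds_ids(attribute_strings: List[str]) -> List[str]:
    """Replace the ID in each CDS attribute string with one shared per Parent.

    Input strings are column 9 of CDS lines, e.g. "ID=cds1;Parent=mrna1".
    Assumption: the new ID is "<Parent>.cds" (hypothetical format).
    """
    rewritten = []
    for attrs in attribute_strings:
        pairs = dict(p.split("=", 1) for p in attrs.strip(";").split(";") if p)
        pairs.pop("ID", None)  # discard the existing ID (the override step)
        parent = pairs.get("Parent")
        if parent is not None:
            # Put the regenerated ID first, keeping the other attributes in order.
            pairs = {"ID": f"{parent}.cds", **pairs}
        rewritten.append(";".join(f"{k}={v}" for k, v in pairs.items()))
    return rewritten
```

After this step, all CDS lines with Parent=mrna1 carry the same ID, so they are treated as segments of a single CDS feature.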

Incorrect phase

Conflicting definitions of phase are used in different .gff files. easy import uses the sequenceontology.org specification, so for files that use the alternate definition (phase = frame - 1) it is necessary to invert the phase, converting 1 to 2 and vice versa. This is applied during Step 2.4: Import gff from prepared file (following Step 2.3: Prepare the gff file for import), so the appropriate control is located within the [MODIFY] stanza rather than the [GFF] stanza.

[MODIFY]
  INVERT_PHASE = 1
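
The inversion itself is a simple swap, sketched here in Python for clarity. The function name is hypothetical and this is not the easy import code; it assumes that a phase of 0 and the "." placeholder used for features without a phase are left unchanged.

```python
def invert_phase(phase: str) -> str:
    """Swap phase 1 and 2; leave 0 and '.' (no phase) unchanged."""
    return {"1": "2", "2": "1"}.get(phase, phase)
```

Applying the function twice returns the original value, so running the inversion on a file that already follows the specification would introduce the error rather than fix it.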