Understanding the [GFF] syntax

The [GFF] stanza uses a meta-syntax to set options for the GFF parser. This method of configuration keeps the parser flexible across the many variations of .gff file that it may need to process, and is particularly useful for Repairing gff, but it may appear slightly intimidating at first.

If you are unsure how to relate this to your .gff file after reading this documentation, a good place to start is to use the same settings as the example below. Setting fewer EXPECTATIONs for a "clean" .gff will save a little processing time, but for most files these settings will not cause any problems. If your .gff has characteristics that need more conditions, running the script in Step 2.3: Prepare the gff file for import should give informative error messages. These can be compared to the examples (try pasting the error/warning into the search box) to show you what to do.

[GFF]
  ;  SPLIT = [ ##FASTA GFF CONTIG ]
  SORT = 1
  CHUNK = [ change region ]
  ;  CHUNK = [ separator		### ]
  CONDITION1 = [ MULTILINE   CDS ]
  CONDITION1a = [ MULTILINE  five_prime_UTR ]
  CONDITION1b = [ MULTILINE  three_prime_UTR ]
  CONDITION2 = [ EXPECTATION cds	 hasSister exon force ]
  CONDITION3 = [ EXPECTATION cds	 hasParent mrna force ];
  CONDITION4 = [ EXPECTATION exon	 hasParent mrna force ];
  CONDITION4a = [ EXPECTATION five_prime_UTR hasParent mrna force ];
  CONDITION4b = [ EXPECTATION three_prime_UTR  hasParent mrna force ];
  CONDITION5 = [ EXPECTATION mrna	 hasParent gene force ];
  CONDITION10 = [ EXPECTATION cds|exon|mrna|three_prime_UTR|five_prime_UTR|gene <=[_start,_end] SELF warn ];
  • For files with FASTA sequence included at the end, SPLIT splits the .gff file on the specified keyword (##FASTA) and assigns the resulting subfiles to the [FILES] handles GFF and CONTIG.
  • SORT is a flag to determine whether the file should be sorted prior to processing. This is a basic sort which will result in each sequence region forming a block in the sorted file, allowing the file to be processed in chunks for much faster performance.
  • CHUNK causes the file to be processed in independent chunks, which is much more efficient than reading the entire file into memory, particularly if there are a large number of validation steps.
    • For sorted files, specifying change region will split the file into a separate chunk for each sequence region.
    • Alternatively, for files with additional formatting rows, the file may be split on specific separators (### in the example above).
  • Most other keys (e.g. CONDITION1) can have any name and will be used to set validation conditions.
    • Each feature in a .gff file should have a unique ID. Specifying MULTILINE allows an individual feature of the named type (a CDS, for example) to be defined across multiple lines.
    • EXPECTATIONs can be set for individual feature types (or pipe-separated sets of feature types). An expectation may be of type hasParent <type> (the feature has a parent feature of the named type), hasSister <type> (the feature shares a parent with a feature of the named type at overlapping coordinates), or one of a set of comparison operators: <, <=, ==, >=, >.
    • For each expectation, the behaviour of the validator can be set to ignore, warn, find a matching feature, make a matching feature, force (find followed by make), or die.
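
To make the ideas behind SPLIT, SORT, CHUNK and a hasParent EXPECTATION more concrete, here is a minimal Python sketch. It is not the parser's own code: the function names (split_on_keyword, chunk_by_region, missing_parents) and the sample data are hypothetical, and it assumes GFF3-style ID and Parent attributes in column 9 and case-insensitive matching of feature types. It only reports features that fail the expectation, roughly what a warn behaviour might surface, and makes no attempt to find or make matching features.

```python
from itertools import groupby


def split_on_keyword(lines, keyword="##FASTA"):
    """Split lines at the first line equal to keyword (the idea behind SPLIT)."""
    for i, line in enumerate(lines):
        if line.strip() == keyword:
            return lines[:i], lines[i + 1:]
    return lines, []


def chunk_by_region(feature_lines):
    """Sort rows by sequence region (column 1), then yield one chunk per region."""
    rows = [l.split("\t") for l in feature_lines if l and not l.startswith("#")]
    rows.sort(key=lambda r: r[0])  # basic sort: each region becomes one block
    for region, group in groupby(rows, key=lambda r: r[0]):
        yield region, list(group)


def attributes(row):
    """Parse column 9 (key=value;key=value) into a dict (GFF3 style assumed)."""
    return dict(kv.split("=", 1) for kv in row[8].strip().split(";") if "=" in kv)


def missing_parents(chunk, child_type, parent_type):
    """Return IDs of child_type features with no parent_type parent in the chunk."""
    parent_ids = {
        attributes(r).get("ID") for r in chunk if r[2].lower() == parent_type
    }
    parent_ids.discard(None)  # ignore parent candidates that lack an ID
    missing = []
    for r in chunk:
        if r[2].lower() != child_type:
            continue
        attrs = attributes(r)
        parents = attrs.get("Parent", "").split(",")
        if not any(p in parent_ids for p in parents):
            missing.append(attrs.get("ID"))
    return missing


# Hypothetical sample: two regions, unsorted, with FASTA appended at the end.
sample = [
    "##gff-version 3",
    "chr2\tsrc\tgene\t1\t90\t.\t+\t.\tID=g2",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1",
    "chr2\tsrc\tmRNA\t1\t90\t.\t+\t.\tID=m2",
    "chr1\tsrc\tCDS\t1\t50\t.\t+\t0\tID=c1;Parent=m1",
    "##FASTA",
    ">chr1",
    "ACGT",
]

gff_lines, fasta_lines = split_on_keyword(sample)
chunks = dict(chunk_by_region(gff_lines))
for region, chunk in chunks.items():
    # Analogous in spirit to: [ EXPECTATION mrna hasParent gene warn ]
    print(region, missing_parents(chunk, "mrna", "gene"))
```

In this sample, the mRNA m2 on chr2 has no Parent attribute, so it is the only feature reported. Because each chunk holds a single sequence region, the parent lookup only has to consider features from that region, which is the efficiency argument made for CHUNK above.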