[_STABLE_IDS]

[GENE_STABLE_IDS]
  GFF = [ gene->Name /(.+)/ ]
[TRANSCRIPT_STABLE_IDS]
  GFF = [ SELF->Name /(.+)/ ]
[TRANSLATION_STABLE_IDS]
  GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]

These correspond directly to the stable_id field in the Ensembl database, where they serve as the primary identifiers for each gene, transcript and translation. They should be set to values that will, ideally, remain stable across assembly versions. The .gff attribute used as a feature's stable_id is selected by a pattern of the form [ feature->attribute /match/ /replace/ ] (see Referencing gff attributes and Match and replace for details).
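As a rough illustration of what a pattern such as [ SELF->Name /(.+)/ /-RA/-PA/ ] is expected to do, the Python sketch below mimics the match and replace steps on a single attribute value. It is not the tool's implementation: the function name apply_pattern, the example ID GENE0001-RA, and the assumption that the first capture group is kept and then rewritten by the replace pair are all hypothetical and used only for illustration.

```python
import re


def apply_pattern(value, match, replace=None):
    """Keep the first capture group of `match`, then optionally rewrite it.

    Hypothetical helper: assumes /match/ keeps capture group 1 and
    /search/replacement/ is applied to the captured text afterwards.
    Returns None when the value does not match.
    """
    found = re.search(match, value)
    if found is None:
        return None  # no stable_id could be derived from this value
    result = found.group(1)
    if replace is not None:
        search, replacement = replace
        result = re.sub(search, replacement, result)
    return result


# Made-up transcript Name attribute, used for both patterns.
name = "GENE0001-RA"

transcript_id = apply_pattern(name, r"(.+)")                    # /(.+)/
translation_id = apply_pattern(name, r"(.+)", ("-RA", "-PA"))   # /(.+)/ /-RA/-PA/

print(transcript_id, translation_id)
```

Under these assumptions the transcript keeps its Name unchanged, while the translation ID is derived from the same attribute with the -RA suffix swapped for -PA.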

🚧

missing stable_ids

If a gene or transcript stable_id cannot be found using the current patterns, the gene is not processed further and is instead written to a .exception.gff file. See Processing exceptions for details of how this behaviour can be used to extract information from different attributes for different transcript types.

[_STABLE_IDS] stanzas also provide the link between annotations in files of different types when extracting [_NAMES] and/or [_DESCRIPTIONS] from locations other than .gff files. In this case the [_STABLE_IDS] stanza should include additional lines that reference the other files by their handles, as defined in the [FILES] stanza. Specific patterns are available for files of type fa and tsv/csv.

[FILES]
  GFF = [ gff http://example.com/gene_models.gff3.gz ]
  PROTEIN = [ fa http://example.com/proteins.fa.gz ]
  ANNOTATION = [ tsv http://example.com/annotations.txt.gz ]
[GENE_STABLE_IDS]
  GFF = [ gene->Name /(.+)/ ]
  PROTEIN = [ DISPLAY_ID /(.+)-PA/ ]  
  ANNOTATION = [ FIELD_1 /(.+)/ ]  
  • for files of type fa, the keyword DISPLAY_ID retrieves the first part of the fasta header (before the first space) and DESCRIPTION retrieves the remainder of the fasta header (after the first space)
  • files of type tsv and csv are split into fields on tab/comma separators; the number in FIELD_1 indicates which field should be selected (1-indexed)
  • free-text files can also be parsed by setting the type to tsv, in which case each line will be placed into FIELD_1 (assuming there are no tabs in the file)
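
The keywords above can be sketched in Python as follows. This is only an approximation of the behaviour described in the list, not the tool's own code: the helper names fasta_header_parts and select_field, and the example header and annotation lines, are hypothetical.

```python
import re


def fasta_header_parts(header):
    """Split a fasta header into (DISPLAY_ID, DESCRIPTION) at the first space."""
    text = header.rstrip("\n")
    if text.startswith(">"):
        text = text[1:]  # drop the leading fasta marker
    display_id, _, description = text.partition(" ")
    return display_id, description


def select_field(line, number, separator="\t"):
    """Return field `number` (1-indexed) of a tab- or comma-separated line."""
    return line.rstrip("\n").split(separator)[number - 1]


# Made-up records that all refer to the same gene.
display_id, description = fasta_header_parts(">GENE0001-PA hypothetical protein")
annotation_id = select_field("GENE0001\tkinase domain\n", 1)

# PROTEIN = [ DISPLAY_ID /(.+)-PA/ ] keeps the text before the -PA suffix,
# so the protein record resolves to the same gene stable_id as the tsv line.
protein_gene_id = re.search(r"(.+)-PA", display_id).group(1)

print(protein_gene_id == annotation_id)

# A free-text line without tabs ends up whole in FIELD_1.
print(select_field("some free text", 1))
```

In this sketch both files resolve to the same gene stable_id, which is what allows names and descriptions from the fa and tsv files to be attached to the gene defined in the .gff file.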