Referencing gff attributes

Gene, transcript and translation [_STABLE_IDS], [_NAMES] and [_DESCRIPTIONS] can be set based on any attributes of a feature or related feature within a .gff file by following specific syntactic conventions in the .ini file.

Given a basic .gff:

scaffold1	.	gene	1389	2804	.	+	.	ID=gene1;Name=Eg00001
scaffold1	.	mRNA	1389	2804	.	+	.	ID=mrna1;Parent=gene1;Name=Eg00001-RA
scaffold1	.	CDS	1389	1571	.	+	0	ID=cds1;Parent=mrna1;Name=Eg00001-PA
scaffold1	.	CDS	1881	2054	.	+	0	ID=cds1;Parent=mrna1;Name=Eg00001-PA
scaffold1	.	CDS	2657	2804	.	+	2	ID=cds1;Parent=mrna1;Name=Eg00001-PA
scaffold1	.	exon	1389	1571	.	+	.	ID=exon1;Parent=mrna1
scaffold1	.	exon	1881	2054	.	+	.	ID=exon1;Parent=mrna1
scaffold1	.	exon	2321	2469	.	+	.	ID=exon1;Parent=mrna1

This file contains no information for [_NAMES] (i.e. synonyms) or [_DESCRIPTIONS], so only the [_STABLE_IDS] can be set. In each case, they should use the Name attribute of the corresponding feature:

[GENE_STABLE_IDS]
  GFF = [ gene->Name /(.+)/ ]
[TRANSCRIPT_STABLE_IDS]
  GFF = [ mRNA->Name /(.+)/ ]
[TRANSLATION_STABLE_IDS]
  GFF = [ CDS->Name /(.+)/ ]
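To make the lookup concrete, here is a minimal Python sketch of what a rule such as `gene->Name /(.+)/` amounts to for the example file above. It is an illustration only, not the importer's own code: the helper names `parse_attributes` and `first_attribute` are hypothetical, and the sketch assumes the first matching feature of the requested type is used.

```python
import re

# First three lines of the example .gff above (tab-separated columns).
GFF_LINES = """\
scaffold1\t.\tgene\t1389\t2804\t.\t+\t.\tID=gene1;Name=Eg00001
scaffold1\t.\tmRNA\t1389\t2804\t.\t+\t.\tID=mrna1;Parent=gene1;Name=Eg00001-RA
scaffold1\t.\tCDS\t1389\t1571\t.\t+\t0\tID=cds1;Parent=mrna1;Name=Eg00001-PA
"""


def parse_attributes(column):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    return dict(pair.split("=", 1) for pair in column.strip().split(";") if pair)


def first_attribute(gff_text, feature_type, attribute, pattern):
    """Return capture group 1 of `pattern` applied to `attribute` on the
    first feature of `feature_type`, or None if nothing matches.
    (Hypothetical helper; 'first feature wins' is an assumption.)"""
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[2] != feature_type:
            continue
        value = parse_attributes(fields[8]).get(attribute)
        if value is None:
            continue
        found = re.search(pattern, value)
        if found:
            return found.group(1)
    return None


gene_id = first_attribute(GFF_LINES, "gene", "Name", r"(.+)")
transcript_id = first_attribute(GFF_LINES, "mRNA", "Name", r"(.+)")
translation_id = first_attribute(GFF_LINES, "CDS", "Name", r"(.+)")
print(gene_id, transcript_id, translation_id)
```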

Nested feature types

When a .gff file is parsed, each gene is processed separately. While processing a gene, the script has access to all nested features of that gene. Similarly, while processing a transcript, the script has access to the parent gene and to the nested features of that transcript, but not to alternate transcripts. Translations are processed at the level of the associated transcript.

[GENE_STABLE_IDS]
    GFF = [ gene->Name /(.+)/ ]
    GFF = [ mRNA->Name /(.+)/ /-RA// ]
    GFF = [ CDS->Name /(.+)/ /-PA// ]
[TRANSCRIPT_STABLE_IDS]
    GFF = [ mRNA->Name /(.+)/ ]
    GFF = [ gene->Name /(.+)/ /(.+)/$1-RA/ ]
    GFF = [ CDS->Name /(.+)/ /-PA/-RA/ ]
[TRANSLATION_STABLE_IDS]
    GFF = [ CDS->Name /(.+)/ ]
    GFF = [ gene->Name /(.+)/ /(.+)/$1-PA/ ]
    GFF = [ mRNA->Name /(.+)/ /-RA/-PA/ ]

are all valid (optionally using Match and replace to extract the same string in each case).
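The effect of a Match and replace rule can be sketched in Python as follows. This is a hedged illustration rather than the parser's actual logic: `apply_rule` is a hypothetical helper, and it assumes the first pattern extracts capture group 1 and the optional second part is a single substitution applied to the extracted string. Perl's `$1` is written `\1` in Python.

```python
import re


def apply_rule(value, match, find=None, replacement=""):
    """Extract capture group 1 of `match` from `value`, then optionally
    replace the first occurrence of `find` in the extracted string.
    (Hypothetical helper; the match-then-substitute order is an assumption.)"""
    found = re.search(match, value)
    if found is None:
        return None
    extracted = found.group(1)
    if find is not None:
        extracted = re.sub(find, replacement, extracted, count=1)
    return extracted


# [GENE_STABLE_IDS]: three routes to the same gene ID
from_gene = apply_rule("Eg00001", r"(.+)")                 # gene->Name /(.+)/
from_mrna = apply_rule("Eg00001-RA", r"(.+)", r"-RA", "")  # mRNA->Name /(.+)/ /-RA//
from_cds = apply_rule("Eg00001-PA", r"(.+)", r"-PA", "")   # CDS->Name /(.+)/ /-PA//

# [TRANSLATION_STABLE_IDS]: three routes to the same translation ID
tl_from_cds = apply_rule("Eg00001-PA", r"(.+)")                   # CDS->Name /(.+)/
tl_from_gene = apply_rule("Eg00001", r"(.+)", r"(.+)", r"\1-PA")  # gene->Name /(.+)/ /(.+)/$1-PA/
tl_from_mrna = apply_rule("Eg00001-RA", r"(.+)", r"-RA", "-PA")   # mRNA->Name /(.+)/ /-RA/-PA/

print(from_gene, from_mrna, from_cds)
print(tl_from_cds, tl_from_gene, tl_from_mrna)
```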

The SELF keyword

  • The keyword SELF always refers to the current gene/transcript feature.

  • For genes,

    [GENE_STABLE_IDS]
        GFF = [ gene->Name /(.+)/ ]
        GFF = [ SELF->Name /(.+)/ ]

    are equivalent ways of referring to the same ``gene`` attribute.

  • Transcripts may be of different types, so

    [TRANSCRIPT_STABLE_IDS]
        GFF = [ mRNA->Name /(.+)/ ]
        GFF = [ SELF->Name /(.+)/ ]

  are not equivalent:
  - ``GFF = [ mRNA->Name /(.+)/ ]`` will only return a stable_id for transcripts of type mRNA
  - ``GFF = [ SELF->Name /(.+)/ ]`` will return a stable_id for any transcript type.
  - See [Processing exceptions](doc:processing-exceptions) for an explanation of how to use this distinction when processing ``.gff`` with multiple transcript types.
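The difference between a type-specific selector and SELF for transcripts can be pictured with a small Python sketch. It is only a model of the behaviour described above, not the tool's code: `transcript_stable_id` is a hypothetical helper and the tRNA record is invented for illustration.

```python
def transcript_stable_id(transcript, selector):
    """Return the transcript's Name if `selector` applies to it.
    `selector` is either "SELF" or a feature type such as "mRNA".
    (Hypothetical helper modelling the rules described above.)"""
    if selector != "SELF" and transcript["type"] != selector:
        return None
    return transcript.get("Name")


transcripts = [
    {"type": "mRNA", "Name": "Eg00001-RA"},
    {"type": "tRNA", "Name": "Eg00002-RA"},  # hypothetical non-mRNA transcript
]

by_type = [transcript_stable_id(t, "mRNA") for t in transcripts]
by_self = [transcript_stable_id(t, "SELF") for t in transcripts]
print(by_type)
print(by_self)
```

Under these assumptions the `mRNA` selector leaves the tRNA transcript without a stable_id, while `SELF` names both.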

  • For translations,

    [TRANSLATION_STABLE_IDS]
        GFF = [ CDS->Name /(.+)/ ]
        GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]

    Here, SELF refers to the parent transcript, so to obtain the same name as the CDS, use Match and replace to substitute -PA for -RA (useful for files whose CDS features lack Name attributes).

The DAUGHTER keyword

  • The keyword DAUGHTER refers to the first child of the current feature. It is most useful for retrieving gene attributes from a daughter transcript of any type:

    [GENE_STABLE_IDS]
        GFF = [ mRNA->Name /(.+)/ ]
        GFF = [ DAUGHTER->Name /(.+)/ ]

  • GFF = [ mRNA->Name /(.+)/ ] will only return a stable_id for genes with daughter features of type mRNA.
  • GFF = [ DAUGHTER->Name /(.+)/ ] will return a stable_id for genes with a daughter transcript of any type.
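As a rough model of the two rules above, the following Python sketch resolves a gene's stable_id from its child features. The helper `gene_stable_id`, the nested-dict layout, and the ncRNA gene are all hypothetical and exist only to illustrate the distinction; the real parser may resolve children differently.

```python
def gene_stable_id(gene, selector):
    """Return the Name of the first child matching `selector`.
    "DAUGHTER" accepts the first child of any type; any other value is
    treated as a required feature type. (Hypothetical helper.)"""
    for child in gene["children"]:
        if selector == "DAUGHTER" or child["type"] == selector:
            return child.get("Name")
    return None


coding_gene = {"children": [{"type": "mRNA", "Name": "Eg00001-RA"}]}
noncoding_gene = {"children": [{"type": "ncRNA", "Name": "Eg00003-RA"}]}  # hypothetical

print(gene_stable_id(coding_gene, "mRNA"), gene_stable_id(coding_gene, "DAUGHTER"))
print(gene_stable_id(noncoding_gene, "mRNA"), gene_stable_id(noncoding_gene, "DAUGHTER"))
```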