Referencing gff attributes
Gene, transcript and translation [_STABLE_IDS], [_NAMES] and [_DESCRIPTIONS] can be set based on any attribute of a feature, or of a related feature, within a .gff file by following specific syntactic conventions in the .ini file.
Given a basic .gff:
scaffold1 . gene 1389 2804 . + . ID=gene1;Name=Eg00001
scaffold1 . mRNA 1389 2804 . + . ID=mrna1;Parent=gene1;Name=Eg00001-RA
scaffold1 . CDS 1389 1571 . + 0 ID=cds1;Parent=mrna1;Name=Eg00001-PA
scaffold1 . CDS 1881 2054 . + 0 ID=cds1;Parent=mrna1;Name=Eg00001-PA
scaffold1 . CDS 2657 2804 . + 2 ID=cds1;Parent=mrna1;Name=Eg00001-PA
scaffold1 . exon 1389 1571 . + . ID=exon1;Parent=mrna1
scaffold1 . exon 1881 2054 . + . ID=exon1;Parent=mrna1
scaffold1 . exon 2321 2469 . + . ID=exon1;Parent=mrna1
There is no information here for [_NAMES] (i.e. synonyms) or [_DESCRIPTIONS], and the [_STABLE_IDS] in each case should use the corresponding Name attribute:
[GENE_STABLE_IDS]
GFF = [ gene->Name /(.+)/ ]
[TRANSCRIPT_STABLE_IDS]
GFF = [ mRNA->Name /(.+)/ ]
[TRANSLATION_STABLE_IDS]
GFF = [ CDS->Name /(.+)/ ]
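As a rough illustration of what a rule such as `gene->Name /(.+)/` asks for, the Python sketch below picks the first feature of a given type, reads one attribute from its ninth column, and returns the first capture group of the match pattern. The helpers `parse_attributes` and `lookup` are hypothetical names introduced here, and returning capture group 1 is an assumption; this is not the import script's own code.

```python
import re

# Three lines from the example .gff above (whitespace-separated for brevity).
GFF_LINES = [
    "scaffold1 . gene 1389 2804 . + . ID=gene1;Name=Eg00001",
    "scaffold1 . mRNA 1389 2804 . + . ID=mrna1;Parent=gene1;Name=Eg00001-RA",
    "scaffold1 . CDS 1389 1571 . + 0 ID=cds1;Parent=mrna1;Name=Eg00001-PA",
]


def parse_attributes(column):
    """Turn 'ID=gene1;Name=Eg00001' into {'ID': 'gene1', 'Name': 'Eg00001'}."""
    return dict(pair.split("=", 1) for pair in column.split(";") if pair)


def lookup(lines, feature_type, attribute, match):
    """Hypothetical stand-in for a rule like `gene->Name /(.+)/`."""
    for line in lines:
        fields = line.split()
        if fields[2] != feature_type:
            continue  # wrong feature type, keep looking
        value = parse_attributes(fields[8]).get(attribute)
        if value is None:
            continue  # feature has no such attribute
        found = re.search(match, value)
        if found:
            # Assumption: the first capture group is the value that is kept.
            return found.group(1)
    return None


stable_ids = {
    "gene": lookup(GFF_LINES, "gene", "Name", r"(.+)"),
    "transcript": lookup(GFF_LINES, "mRNA", "Name", r"(.+)"),
    "translation": lookup(GFF_LINES, "CDS", "Name", r"(.+)"),
}
print(stable_ids)
```

With the example file, each stable ID is simply the Name attribute of the matching feature type.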
Nested feature types
When a .gff file is parsed, each gene is processed separately. While processing a gene, the script has access to all nested features of that gene. Similarly, while processing a transcript, the script has access to the parent gene and to the nested features of that transcript, but not to alternate transcripts. Translations are processed at the level of the associated transcript.
[GENE_STABLE_IDS]
GFF = [ gene->Name /(.+)/ ]
GFF = [ mRNA->Name /(.+)/ /-RA// ]
GFF = [ CDS->Name /(.+)/ /-PA// ]
[TRANSCRIPT_STABLE_IDS]
GFF = [ mRNA->Name /(.+)/ ]
GFF = [ gene->Name /(.+)/ /(.+)/$1-RA/ ]
GFF = [ CDS->Name /(.+)/ /-PA/-RA/ ]
[TRANSLATION_STABLE_IDS]
GFF = [ CDS->Name /(.+)/ ]
GFF = [ gene->Name /(.+)/ /(.+)/$1-PA/ ]
GFF = [ mRNA->Name /(.+)/ /-RA/-PA/ ]
are all valid (optionally using Match and replace to extract the same string in each case).
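The Match and replace step can be sketched in Python as a regular-expression match followed by an optional substitution. The function name `match_and_replace` is hypothetical, and the translation of Perl-style `$1` backreferences into Python syntax is an assumption about how the patterns are meant to behave, not a description of the script's internals.

```python
import re


def match_and_replace(value, match, find=None, replacement=None):
    """Apply `/match/` and then, optionally, `/find/replacement/` to a value."""
    found = re.search(match, value)
    if not found:
        return None
    result = found.group(1)
    if find is not None:
        # Rewrite Perl-style $1 backreferences as Python's \g<1> form.
        py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
        result = re.sub(find, py_replacement, result, count=1)
    return result


# mRNA->Name /(.+)/ /-RA//        : strip the transcript suffix for a gene ID
gene_from_mrna = match_and_replace("Eg00001-RA", r"(.+)", r"-RA", "")
# CDS->Name /(.+)/ /-PA/-RA/      : swap suffixes for a transcript ID
transcript_from_cds = match_and_replace("Eg00001-PA", r"(.+)", r"-PA", "-RA")
# gene->Name /(.+)/ /(.+)/$1-PA/  : append a suffix for a translation ID
translation_from_gene = match_and_replace("Eg00001", r"(.+)", r"(.+)", "$1-PA")

print(gene_from_mrna, transcript_from_cds, translation_from_gene)
```

Each call starts from a different feature's Name but arrives at the ID that the feature's own Name attribute would have given.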
The SELF keyword
The keyword SELF will always refer to the current gene/transcript feature.
[GENE_STABLE_IDS]
GFF = [ gene->Name /(.+)/ ]
GFF = [ SELF->Name /(.+)/ ]
are equivalent ways of referring to the same ``gene`` attribute.
Transcripts may have different types, so
[TRANSCRIPT_STABLE_IDS]
GFF = [ mRNA->Name /(.+)/ ]
GFF = [ SELF->Name /(.+)/ ]
are not equivalent:
- ``GFF = [ mRNA->Name /(.+)/ ]`` will only return a stable_id for transcripts of type mRNA
- ``GFF = [ SELF->Name /(.+)/ ]`` will return a stable_id for any transcript type.
- See [Processing exceptions](doc:processing-exceptions) for an explanation of how to use this distinction when processing ``.gff`` with multiple transcript types.
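The difference between a type-specific selector and SELF can be illustrated with a small Python sketch. The tRNA record, the dictionary layout and the function `transcript_stable_id` are all hypothetical and exist only for illustration; the sketch assumes a type-specific rule simply yields nothing for transcripts of another type.

```python
import re

# Hypothetical transcripts; the tRNA entry is not part of the example .gff.
transcripts = [
    {"type": "mRNA", "attributes": {"ID": "mrna1", "Name": "Eg00001-RA"}},
    {"type": "tRNA", "attributes": {"ID": "trna1", "Name": "Eg00002-RA"}},
]


def transcript_stable_id(transcript, selector):
    """Resolve `selector->Name /(.+)/` for the transcript being processed."""
    if selector != "SELF" and transcript["type"] != selector:
        # Assumption: a type-specific rule ignores other transcript types.
        return None
    found = re.search(r"(.+)", transcript["attributes"].get("Name", ""))
    return found.group(1) if found else None


by_type = [transcript_stable_id(t, "mRNA") for t in transcripts]
by_self = [transcript_stable_id(t, "SELF") for t in transcripts]
print(by_type)
print(by_self)
```

Only the SELF rule produces a stable_id for the tRNA transcript in this sketch.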
[TRANSLATION_STABLE_IDS]
GFF = [ CDS->Name /(.+)/ ]
GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]
Here, SELF refers to the parent transcript, so to achieve the same naming use Match and replace to substitute -PA for -RA (useful for files in which CDS features lack Name attributes).
The DAUGHTER keyword
The keyword DAUGHTER refers to the first child of the current feature and is most useful to retrieve gene attributes from any daughter transcript type:
[GENE_STABLE_IDS]
GFF = [ mRNA->Name /(.+)/ ]
GFF = [ DAUGHTER->Name /(.+)/ ]
- ``GFF = [ mRNA->Name /(.+)/ ]`` will only return a stable_id for genes with daughter features of type mRNA.
- ``GFF = [ DAUGHTER->Name /(.+)/ ]`` will return a stable_id for genes with a daughter transcript of any type.
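The same contrast for genes can be sketched in Python: a type-specific selector considers only children of that type, whereas DAUGHTER takes the first child whatever its type. The gene records (including the tRNA gene) and the function `gene_stable_id` are hypothetical and only illustrate the behaviour described above.

```python
# Hypothetical genes with their child transcripts; Eg00002 is invented here.
genes = [
    {"Name": "Eg00001", "children": [{"type": "mRNA", "Name": "Eg00001-RA"}]},
    {"Name": "Eg00002", "children": [{"type": "tRNA", "Name": "Eg00002-RA"}]},
]


def gene_stable_id(gene, selector):
    """Resolve `selector->Name /(.+)/` while processing a gene."""
    if selector == "DAUGHTER":
        candidates = gene["children"][:1]  # first child only, any type
    else:
        candidates = [c for c in gene["children"] if c["type"] == selector]
    for child in candidates:
        name = child.get("Name")
        if name:  # /(.+)/ needs at least one character to match
            return name
    return None


from_mrna = [gene_stable_id(g, "mRNA") for g in genes]
from_daughter = [gene_stable_id(g, "DAUGHTER") for g in genes]
print(from_mrna)
print(from_daughter)
```

In this sketch the gene whose only child is a tRNA gets a stable_id from the DAUGHTER rule but not from the mRNA rule.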