[_STABLE_IDS]
[GENE_STABLE_IDS]
GFF = [ gene->Name /(.+)/ ]
[TRANSCRIPT_STABLE_IDS]
GFF = [ SELF->Name /(.+)/ ]
[TRANSLATION_STABLE_IDS]
GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]
These correspond directly to the stable_id field in the Ensembl database, where they serve as the primary identifiers for each gene, transcript and translation. Each should be set to a value that will ideally remain stable across assembly versions. The location in the .gff file that should be used as a feature stable_id is controlled by the pattern [ feature->attribute /match/ /replace/ ] (see Referencing gff attributes and Match and replace for details).
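To make the pattern concrete, here is a minimal Python sketch of how a value such as `SELF->Name /(.+)/ /-RA/-PA/` might be resolved against a single GFF3 feature. It is an illustration only, not the importer's own code: the helper names `parse_attributes` and `apply_pattern`, the sample attribute string, and the assumption that the substitution is applied to the captured text are all hypothetical.

```python
import re

def parse_attributes(column9):
    """Split a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def apply_pattern(value, match, substitution=None):
    """Return capture group 1 of `match`, optionally after a find/replace.

    Returns None when the attribute is absent or the match fails.
    Assumption: the substitution acts on the captured text.
    """
    if value is None:
        return None
    found = re.search(match, value)
    if found is None:
        return None
    result = found.group(1)
    if substitution is not None:
        find, replace = substitution
        result = re.sub(find, replace, result)
    return result

# Hypothetical mRNA feature attributes
attrs = parse_attributes("ID=mrna1;Parent=gene1;Name=GENE0001-RA")

# [ SELF->Name /(.+)/ ]
transcript_id = apply_pattern(attrs.get("Name"), r"(.+)")

# [ SELF->Name /(.+)/ /-RA/-PA/ ]
translation_id = apply_pattern(attrs.get("Name"), r"(.+)", ("-RA", "-PA"))
```

Under these assumptions the transcript keeps its `Name` unchanged, while the translation stable_id is derived from the same attribute with `-RA` rewritten to `-PA`.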
missing stable_ids
If a gene or transcript stable_id cannot be found using the current patterns, the gene will not be processed further but will instead be written to a .exception.gff file. See Processing exceptions for details of how this behaviour can be used to extract information from different attributes for different transcript types.
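The routing described here can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the importer's implementation: the gene and transcript dictionaries, the function names `has_stable_id` and `partition_genes`, and the fixed `Name` attribute are hypothetical.

```python
import re

def has_stable_id(feature, attribute="Name", match=r"(.+)"):
    """True if the feature's attribute exists and satisfies the match pattern."""
    value = feature["attributes"].get(attribute)
    return value is not None and re.search(match, value) is not None

def partition_genes(genes):
    """Separate genes that can be processed from those bound for the exception file.

    A gene is treated as an exception if the gene itself or any of its
    transcripts lacks a resolvable stable_id.
    """
    processed, exceptions = [], []
    for gene in genes:
        features = [gene] + gene.get("transcripts", [])
        if all(has_stable_id(f) for f in features):
            processed.append(gene)
        else:
            exceptions.append(gene)
    return processed, exceptions

# Hypothetical input: the second gene has a transcript with no Name attribute
genes = [
    {"attributes": {"ID": "gene1", "Name": "GENE0001"},
     "transcripts": [{"attributes": {"ID": "mrna1", "Name": "GENE0001-RA"}}]},
    {"attributes": {"ID": "gene2", "Name": "GENE0002"},
     "transcripts": [{"attributes": {"ID": "mrna2"}}]},
]

processed, exceptions = partition_genes(genes)
```

Genes in `exceptions` correspond to those that would be written to the .exception.gff file, where they can be reprocessed with a different pattern.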
[_STABLE_IDS] stanzas also provide the link between annotations in files of different types when extracting [_NAMES] and/or [_DESCRIPTIONS] from locations other than .gff files. In this case the [_STABLE_IDS] stanza should include additional lines referencing the other files by their handles, as defined in the [FILES] stanza. Specific patterns are available for files of type fa and tsv/csv.
[FILES]
GFF = [ gff http://example.com/gene_models.gff3.gz ]
PROTEIN = [ fa http://example.com/proteins.fa.gz ]
ANNOTATION = [ tsv http://example.com/annotations.txt.gz ]
[GENE_STABLE_IDS]
GFF = [ gene->Name /(.+)/ ]
PROTEIN = [ DISPLAY_ID /(.+)-PA/ ]
ANNOTATION = [ FIELD_1 /(.+)/ ]
- for files of type fa, the keyword DISPLAY_ID retrieves the first part of the fasta header (before the first space) and DESCRIPTION retrieves the remainder of the fasta header (after the first space)
- files of type tsv and csv are split into fields on tab/comma separators; FIELD_1 indicates which field should be selected (1-indexed)
- free-text files can also be parsed by setting the type to tsv, in which case each line will be placed into FIELD_1 (assuming there are no tabs in the file)
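The file-type keywords above can be illustrated with a short Python sketch. The function names, sample header and sample rows are hypothetical, and the naive split on the separator is an assumption that ignores quoted fields in csv files.

```python
import re

def parse_fasta_header(line):
    """Split a fasta header into DISPLAY_ID (before the first space)
    and DESCRIPTION (everything after the first space)."""
    header = line.rstrip("\n")
    if header.startswith(">"):
        header = header[1:]
    display_id, _, description = header.partition(" ")
    return {"DISPLAY_ID": display_id, "DESCRIPTION": description}

def parse_delimited_line(line, sep="\t"):
    """Split a tsv/csv line into 1-indexed FIELD_n keys."""
    fields = line.rstrip("\n").split(sep)
    return {f"FIELD_{i}": value for i, value in enumerate(fields, start=1)}

# PROTEIN = [ DISPLAY_ID /(.+)-PA/ ] recovers the gene stable_id from a protein header
header = parse_fasta_header(">GENE0001-PA hypothetical protein\n")
gene_id_from_fasta = re.search(r"(.+)-PA", header["DISPLAY_ID"]).group(1)

# ANNOTATION = [ FIELD_1 /(.+)/ ] takes the first tab-separated field
row = parse_delimited_line("GENE0001\tkinase\n")
gene_id_from_tsv = re.search(r"(.+)", row["FIELD_1"]).group(1)

# A free-text line with no tabs ends up entirely in FIELD_1
free_text = parse_delimited_line("a line with no tabs\n")
```

Because both patterns resolve to the same gene stable_id, names or descriptions taken from the fasta or tsv file can be attached to the matching gene from the .gff file.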