{"_id":"5739b307e370590e0012e934","user":"573592b84b0ab120000b7d44","project":"5735936aafab441700723a50","version":{"_id":"5735936aafab441700723a53","__v":12,"project":"5735936aafab441700723a50","createdAt":"2016-05-13T08:42:18.615Z","releaseDate":"2016-05-13T08:42:18.615Z","categories":["5735936aafab441700723a54","5735a32931a73b1700887c94","5735b55beceb872200abbc6c","5735b56eb667601700d3bd6f","5735b9ba4b0ab120000b7dd4","5735b9c94b0ab120000b7dd5","5735cb131f16241700c8a0f7","5735e5c4e4824c3400aa1f21","5735e5d9e4824c3400aa1f23","5735e5f2ec67f6290013ac72","573ecfe0804f901700a9dfc7","573f276c7eeb8b190094ca7d"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":false,"codename":"","version_clean":"1.0.0","version":"1.0"},"__v":14,"parentDoc":null,"category":{"_id":"5735e5d9e4824c3400aa1f23","__v":0,"version":"5735936aafab441700723a53","project":"5735936aafab441700723a50","sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-05-13T14:34:01.858Z","from_sync":false,"order":9,"slug":"configuration-options-core-import","title":"Configuration Options (Core Import)"},"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-05-16T11:46:15.365Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":10,"body":"```\n[GENE_STABLE_IDS]\n  GFF = [ gene->Name /(.+)/ ]\n[TRANSCRIPT_STABLE_IDS]\n  GFF = [ SELF->Name /(.+)/ ]\n[TRANSLATION_STABLE_IDS]\n  GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]\n```\nThese correspond directly to the ``stable_id`` field in the Ensembl database, where they serve as the primary identifiers for each gene, transcript and translation.  Ideally, they should be set to values that will remain stable across assembly versions.  
The location in the ``.gff`` file that should be used as a feature ``stable_id`` is controlled by the pattern ``[ feature->attribute /match/ /replace/ ]`` (see [Referencing gff attributes](doc:referencing-gff-attributes) and [Match and replace](doc:match-and-replace) for details).\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"body\": \"If a gene or transcript ``stable_id`` cannot be found using the current patterns, the gene will not be processed further but instead will be written to a ``.exception.gff`` file.  See [Processing exceptions](doc:processing-exceptions) for details of how this behaviour can be used to extract information from different attributes for different transcript types.\",\n  \"title\": \"missing stable_ids\"\n}\n[/block]\n``_STABLE_IDS`` stanzas also provide the link between annotations in files of different types when extracting [[_NAMES]](doc:names-core) and/or [[_DESCRIPTIONS]](doc:descriptions-core) from locations other than ``.gff`` files.  In this case the ``[_STABLE_IDS]`` stanza should include additional lines referencing the other files by their handles as defined in the [[FILES]](doc:files-core) stanza.  
Specific patterns are available for files of type ``fa`` and ``tsv``/``csv``.\n\n```\n[FILES]\n  GFF = [ gff http://example.com/gene_models.gff3.gz ]\n  PROTEIN = [ fa http://example.com/proteins.fa.gz ]\n  ANNOTATION = [ tsv http://example.com/annotations.txt.gz ]\n[GENE_STABLE_IDS]\n  GFF = [ gene->Name /(.+)/ ]\n  PROTEIN = [ DISPLAY_ID /(.+)-PA/ ]  \n  ANNOTATION = [ FIELD_1 /(.+)/ ]  \n```\n\n- for files of type ``fa``, the keyword ``DISPLAY_ID`` retrieves the first part of the FASTA header (before the first space) and ``DESCRIPTION`` retrieves the remainder of the FASTA header (after the first space)\n- files of type ``tsv`` and ``csv`` are split into fields on tab/comma separators; the number in ``FIELD_1`` indicates which field should be selected (1-indexed)\n- free-text files can also be parsed by setting the type to ``tsv``, in which case each line will be placed into ``FIELD_1`` (assuming there are no tabs in the file)","excerpt":"","slug":"_stable_ids-core","type":"basic","title":"[_STABLE_IDS]"}
```
[GENE_STABLE_IDS]
  GFF = [ gene->Name /(.+)/ ]
[TRANSCRIPT_STABLE_IDS]
  GFF = [ SELF->Name /(.+)/ ]
[TRANSLATION_STABLE_IDS]
  GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]
```
These correspond directly to the ``stable_id`` field in the Ensembl database, where they serve as the primary identifiers for each gene, transcript and translation. Ideally, they should be set to values that will remain stable across assembly versions.
The location in the ``.gff`` file that should be used as a feature ``stable_id`` is controlled by the pattern ``[ feature->attribute /match/ /replace/ ]`` (see [Referencing gff attributes](doc:referencing-gff-attributes) and [Match and replace](doc:match-and-replace) for details).
[block:callout]
{
  "type": "warning",
  "body": "If a gene or transcript ``stable_id`` cannot be found using the current patterns, the gene will not be processed further but instead will be written to a ``.exception.gff`` file. See [Processing exceptions](doc:processing-exceptions) for details of how this behaviour can be used to extract information from different attributes for different transcript types.",
  "title": "missing stable_ids"
}
[/block]
``_STABLE_IDS`` stanzas also provide the link between annotations in files of different types when extracting [[_NAMES]](doc:names-core) and/or [[_DESCRIPTIONS]](doc:descriptions-core) from locations other than ``.gff`` files. In this case the ``[_STABLE_IDS]`` stanza should include additional lines referencing the other files by their handles as defined in the [[FILES]](doc:files-core) stanza. Specific patterns are available for files of type ``fa`` and ``tsv``/``csv``.

```
[FILES]
  GFF = [ gff http://example.com/gene_models.gff3.gz ]
  PROTEIN = [ fa http://example.com/proteins.fa.gz ]
  ANNOTATION = [ tsv http://example.com/annotations.txt.gz ]
[GENE_STABLE_IDS]
  GFF = [ gene->Name /(.+)/ ]
  PROTEIN = [ DISPLAY_ID /(.+)-PA/ ]
  ANNOTATION = [ FIELD_1 /(.+)/ ]
```

- for files of type ``fa``, the keyword ``DISPLAY_ID`` retrieves the first part of the FASTA header (before the first space) and ``DESCRIPTION`` retrieves the remainder of the FASTA header (after the first space)
- files of type ``tsv`` and ``csv`` are split into fields on tab/comma separators; the number in ``FIELD_1`` indicates which field should be selected (1-indexed)
- free-text files can also be parsed by setting the type to ``tsv``, in which case each line will be placed into ``FIELD_1`` (assuming there are no tabs in the file)
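
To make the linking concrete, the Python sketch below mimics how the three ``[GENE_STABLE_IDS]`` patterns above could resolve to the same identifier from a ``.gff`` attribute, a FASTA header and a ``tsv`` line. It is only an illustration, not the importer's implementation: the helper names (``fasta_header_parts``, ``tsv_field``, ``apply_pattern``) and the sample records are hypothetical, and it assumes that the first capture group of ``/match/`` becomes the value and that ``/replace/`` is a plain substitution applied afterwards.

```python
import re


def fasta_header_parts(header):
    """Split a FASTA header into (DISPLAY_ID, DESCRIPTION) at the first space."""
    text = header.lstrip(">")
    display_id, _, description = text.partition(" ")
    return display_id, description


def tsv_field(line, n, sep="\t"):
    """Return field n (1-indexed) of a delimited line, like FIELD_n."""
    return line.rstrip("\n").split(sep)[n - 1]


def apply_pattern(value, match, replace=None):
    """Apply /match/ and an optional /search/substitute/ step (assumed semantics).

    Returns None when the match fails, standing in for a missing stable_id.
    """
    m = re.search(match, value)
    if m is None:
        return None
    result = m.group(1)
    if replace is not None:
        search, substitute = replace
        result = re.sub(search, substitute, result)
    return result


# Hypothetical records, one per file handle in the [FILES] stanza.
gff_gene_name = "GENE0001"
fasta_header = ">GENE0001-PA hypothetical protein"
tsv_line = "GENE0001\tkinase\tsome description\n"

display_id, description = fasta_header_parts(fasta_header)

gene_ids = {
    "GFF": apply_pattern(gff_gene_name, r"(.+)"),
    "PROTEIN": apply_pattern(display_id, r"(.+)-PA"),
    "ANNOTATION": apply_pattern(tsv_field(tsv_line, 1), r"(.+)"),
}

# Translation id derived from a transcript Name, as in /(.+)/ /-RA/-PA/.
translation_id = apply_pattern("GENE0001-RA", r"(.+)", ("-RA", "-PA"))

print(gene_ids)
print(translation_id)
```

Because all three handles yield the same gene identifier in this sketch, values taken from the FASTA description or from other ``tsv`` fields could be attached to the matching gene. A line without tabs passed to ``tsv_field(line, 1)`` comes back whole, which mirrors the free-text case described in the last bullet.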