# Repairing gff
easy import is designed to be flexible enough to fix many common problems with ``.gff`` files, so that the ``.ini`` file can hold a complete record of the process needed to take a provided ``.gff`` file and import it successfully into an Ensembl database. Guidelines for some common problems are given below; there are probably more cases that can be "fixed" with creative use of the basic syntax.

[block:callout]
{
  "type": "warning",
  "title": "Very broken gff",
  "body": "In some cases, there is no practical way to automate repair of a file (particularly when attributes for just a few features have been edited incorrectly by hand), or repair would require the full feature set of the [gff parser](https://github.com/rjchallis/gff-parser), which is not practical to expose through the meta-syntax used in the ``.ini`` files. If manual edits are required, the commands used can be recorded as notes (prefixed by a semicolon) in the ``.ini`` file to ensure a complete record is preserved, e.g.:\n\n```\n[FILES]\n  GFF = [ gff http://example.com/example.gff3.gz ]\n  ; perl -p -i -e 's/Parent=mrna3/Parent=mrna2/' example.gff3\n```"
}
[/block]
## Missing ``ID``

``ID``s are automatically generated for all features that lack ``ID`` attributes.

## Missing ``Parent``

In order to link features hierarchically in a ``.gff`` file, features below the level of gene should each have a ``Parent`` attribute containing the ``ID`` of their parent feature. If these are missing, the coordinates of a feature's parent can be inferred from its own coordinates using the expectation ``hasParent``. A feature of the appropriate type spanning those coordinates can then either be created (keyword ``make``) or identified from the set of existing features (keyword ``find``).
```
[GFF]
  CONDITION = [ EXPECTATION mrna hasParent gene force ]
```
In this example, the keyword ``force`` first attempts to ``find`` an existing feature to use as a parent, but will ``make`` a new feature if no existing feature has the correct coordinates.

Exons may have valid ``Parent`` attributes of different types within the same file. To allow testing for multiple types, it is possible to specify a pipe-separated list of types to ``find``. If no matching feature is found, a new parent feature can be created using ``make`` or ``force`` with the type of the first item in the list.

```
[GFF]
  CONDITION = [ EXPECTATION exon hasParent transcript|mrna force ]
```

## Missing exons

The Ensembl schema assumes that all transcripts are composed of exons. Many ``.gff`` files that lack non-coding annotations omit exons, as they are essentially duplicates of CDS features. Exons can be inferred using the expectation ``hasSister``.

```
[GFF]
  CONDITION = [ EXPECTATION cds hasSister exon force ]
```

Alternatively, if only introns are present in the file, ``FILL_GAPS`` can be used to generate exon features between the introns (keyword ``internal``) and before the first and after the last intron (keyword ``external``).

```
[GFF]
  CONDITION1 = [ FILL_GAPS intron exon internal ]
  CONDITION2 = [ FILL_GAPS intron exon external ]
```

## Bad coordinates

Sometimes the start and end coordinates for a feature are reversed. This example compares the start and end coordinates for a set of feature types to check that the start is not after the end. If this expectation is violated, it is not always clear what to do without examining the file, so the behaviour is set to ``warn``.

```
[GFF]
  CONDITION = [ EXPECTATION cds|exon|mrna|gene <=[_start,_end] SELF warn ]
```

## Inconsistent types

Often different types are used to refer to functionally equivalent features and, while the reasons for this may be legitimate, it can be inconvenient when parsing.
This can be resolved by using ``MAP_TYPES`` to cause the [gff parser](https://github.com/rjchallis/gff-parser) to treat types as equivalent.

```
[GFF]
  CONDITION1 = [ MAP_TYPES initial exon ]
  CONDITION2 = [ MAP_TYPES terminal exon ]
```

## Many ``ID``s for a single feature

Valid ``.gff`` assumes that each feature has a single, unique ``ID`` attribute. In some files, ``ID``s are applied incorrectly to CDS features: all CDS feature lines that share a common transcript parent should share a single ``ID``, but they correspond approximately to a set of exons, which should correctly each have unique ``ID``s. To fix a file with unique CDS ``ID``s, it is possible to override the ``ID`` attribute and cause a new one to be generated.

```
[GFF]
  CONDITION1 = [ OVERRIDE cds ID ]
  CONDITION2 = [ LACKS_ID cds make ]
```

``OVERRIDE`` can also be used to override any other attribute for a given feature type. When [Processing exceptions](doc:processing-exceptions), the ``OVERRIDE`` can itself be overridden in a second ``.ini`` file by passing a feature type with no attribute specified.

```
[GFF]
  CONDITION = [ OVERRIDE cds ]
```

## Incorrect phase

Conflicting definitions for phase are used in different ``.gff`` files. [easy import](https://github.com/lepbase/easy-import) uses the [sequenceontology.org specification](http://www.sequenceontology.org/gff3.shtml), so for files that use the alternate definition (phase = frame - 1), it is necessary to invert the phase to convert 1 to 2 and vice versa. This is applied during [Step 2.4: Import gff from prepared file](doc:step-24-import-gff-from-prepared-file) (following [Step 2.3: Prepare the gff file for import](doc:step-23-prepare-the-gff-file-for-import)), so the appropriate control is located within the [[MODIFY]](doc:modify-core) rather than the [[GFF]](doc:gff-core) stanza.

```
[MODIFY]
  INVERT_PHASE = 1
```
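To make the effect of inverting the phase concrete, the swap described above (1 becomes 2, 2 becomes 1) can be sketched in a few lines of standalone Python. This is only an illustration and is not taken from easy import itself: the function names ``invert_phase`` and ``invert_cds_phase`` are hypothetical, and the sketch assumes a standard nine-column, tab-separated ``.gff`` line with the phase in column 8 and that only CDS features carry a phase worth inverting.

```python
def invert_phase(phase: str) -> str:
    """Swap phase 1 and 2; leave 0 and '.' (no phase) unchanged."""
    return {"1": "2", "2": "1"}.get(phase, phase)


def invert_cds_phase(line: str) -> str:
    """Return a gff line with the phase column inverted for CDS features.

    Assumption: nine tab-separated columns with the phase in column 8.
    Comments, directives and non-CDS features are returned unchanged.
    """
    if line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2].lower() != "cds":
        return line
    cols[7] = invert_phase(cols[7])
    # Preserve the trailing newline only if the input line had one.
    return "\t".join(cols) + ("\n" if line.endswith("\n") else "")


# Hypothetical CDS line with phase 1 in column 8.
example = "scaffold1\tsource\tCDS\t100\t200\t.\t+\t1\tID=cds1;Parent=mrna1"
print(invert_cds_phase(example).split("\t")[7])  # prints 2
```

Phase 0 is left untouched because it is its own counterpart under this swap, which is why the text above only mentions converting 1 to 2 and vice versa.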