Step 2.3: Prepare the gff file for import


Many ``.gff`` files deviate from the official specification, and even for those that are correctly formatted it is often useful to extract information from different types/attributes when assigning stable_ids, names, synonyms and descriptions for import to an Ensembl database.  It is also common for different data types in a single file to have different attributes specified, so a set of patterns suited to one feature type may not be suitable for another, requiring multiple passes across the same file to extract all of the data.  For these reasons, no attempt is made to import ``.gff`` directly to the core database; instead, an intermediate file is created with specific attributes ready for import in [Step 2.4: Import gff from prepared file](doc:step-24-import-gff-from-prepared-file).
[block:callout]
{
  "type": "info",
  "title": "Repairing GFF",
  "body": "For ``.gff`` files that contain errors and inconsistencies, it can be frustrating to use many of the available parsers, which output an error message and leave the user to manually repair the file, often with a set of one-liners, and then attempt to import the file again.  ``.gff`` handling in easy import uses a [gff parser](https://github.com/rjchallis/gff-parser) which embraces the diversity of real world gff by allowing full customisation of expected relationships and properties, with functions to repair, warn about or ignore errors during validation.  A subset of parameters for this parser can be controlled with the [[GFF]](doc:gff-core) stanza of the ``.ini`` file (see also [Repairing gff](doc:repairing-gff)).  This approach initially adds some complexity to the parameter specification, but many patterns can be reused across most ``.gff`` files and the benefit is that all modifications to the ``.gff`` can be preserved in the ``.ini`` file."
}
[/block]
```
cd ~/import
perl ../ei/core/prepare_gff.pl ../ei/conf/core-import.ini
```

Features that cannot be processed using the provided ``.ini`` file are written to a ``.exception.gff`` file, which can be processed using a second ``.ini`` file to overwrite specific parameters in the first file (see [Processing exceptions](doc:processing-exceptions)).
```
perl ../ei/core/prepare_gff.pl ../ei/conf/core-import.ini /path/to/exception.ini
```
[block:api-header]
{
  "type": "basic",
  "title": "Configuration options"
}
[/block]
- [[GFF]](doc:gff-core)
```
[GFF]
;	SPLIT = [ ##FASTA GFF CONTIG ]
	SORT = 1
	CHUNK = [ change region ]
;	CHUNK = [ separator ### ]
	CONDITION1 = [ MULTILINE CDS ]
	CONDITION1a = [ MULTILINE five_prime_UTR ]
	CONDITION1b = [ MULTILINE three_prime_UTR ]
	CONDITION2 = [ EXPECTATION cds hasSister exon force ]
	CONDITION3 = [ EXPECTATION cds hasParent mrna force ];
	CONDITION4 = [ EXPECTATION exon hasParent mrna force ];
	CONDITION4a = [ EXPECTATION five_prime_UTR hasParent mrna force ];
	CONDITION4b = [ EXPECTATION three_prime_UTR hasParent mrna force ];
	CONDITION5 = [ EXPECTATION mrna hasParent gene force ];
	CONDITION10 = [ EXPECTATION cds|exon|mrna|three_prime_UTR|five_prime_UTR|gene <=[_start,_end] SELF warn ];
```
  Meta-syntax to set options for [gff parser](https://github.com/rjchallis/gff-parser).

  - For files with fasta sequence included at the end, ``SPLIT`` will split the gff file on the specified keyword (``##FASTA``) and assign the resulting subfiles to the [[FILES]](doc:files-core) handles ``GFF`` and ``CONTIG``.
  - ``SORT`` is a flag to determine whether the file should be sorted prior to processing.  This is a basic sort which will result in each sequence region forming a block in the sorted file, allowing the file to be processed in chunks for much faster performance.
  - ``CHUNK`` causes the file to be processed in independent chunks, which is much more efficient than reading the entire file into memory, particularly if there are a large number of validation steps.
    - For sorted files, specifying ``change region`` will split the file into a separate chunk for each sequence region.
    - Alternatively, for files with additional formatting rows, the file may be split on specific ``separator``s.
  - Most other keys (e.g. ``CONDITION1``) can have any name and will be used to set validation conditions.
    - Each feature in a ``.gff`` file should have a unique ID.  Specifying ``MULTILINE`` allows individual CDS features, for example, to be defined across multiple lines.
    - ``EXPECTATION``s can be set for individual feature types (or pipe-separated sets of feature types) and may be of type ``hasParent <type>`` (feature has a parent feature of the named type), ``hasSister <type>`` (feature shares a parent with a feature of the named type at overlapping coordinates), or one of a set of comparison operators ``<``, ``<=``, ``==``, ``>=``, ``>``.
    - For each expectation, the behaviour of the validator can be set to ``ignore``, ``warn``, ``find`` a matching feature, ``make`` a matching feature, ``force`` (``find`` followed by ``make``), or ``die``.

- [[FILES]](doc:files-core)
```
[FILES]
	GFF = [ gff3 http://www.bioinformatics.nl/wintermoth/data_files/Obru_genes.gff.gz ]
	PROTEIN = [ fa http://www.bioinformatics.nl/wintermoth/data_files/ObruPep.fasta.gz ]
```
  A ``GFF`` file must be specified, and optionally additional files (e.g. ``PROTEIN``) may be specified as sources of additional information.

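  The effect of ``SORT = 1`` combined with ``CHUNK = [ change region ]`` in the [[GFF]](doc:gff-core) stanza can be pictured with a short Python sketch.  This is an illustration only, not the gff parser's own code: the function name ``chunk_by_region`` and the example rows are hypothetical, and the sketch assumes that a "sequence region" corresponds to the first (seqid) column of each row.

```python
from itertools import groupby


def chunk_by_region(gff_lines):
    """Yield (seqid, rows) pairs, one chunk per sequence region."""
    # Skip blank lines, directives and comments; split each row into columns.
    rows = [
        line.rstrip("\n").split("\t")
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    ]
    # Basic sort on the seqid column so each region forms one block.
    # list.sort() is stable, so rows keep their original order within a region.
    rows.sort(key=lambda cols: cols[0])
    # Each run of identical seqids becomes an independent chunk.
    for seqid, group in groupby(rows, key=lambda cols: cols[0]):
        yield seqid, list(group)


# Hypothetical input: two regions, interleaved in the unsorted file.
example = [
    "##gff-version 3\n",
    "scaffold2\tsrc\tgene\t10\t90\t.\t+\t.\tID=g2\n",
    "scaffold1\tsrc\tgene\t5\t50\t.\t+\t.\tID=g1\n",
    "scaffold2\tsrc\tmRNA\t10\t90\t.\t+\t.\tID=t2;Parent=g2\n",
]
chunks = list(chunk_by_region(example))
```

  Because every parent/child relationship in a well-formed file stays within one sequence region, validation conditions can be checked one chunk at a time instead of against the whole file.
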
- [[_STABLE_IDS]](doc:_stable_ids-core)
```
[GENE_STABLE_IDS]
    GFF = [ gene->Name /(.+)/ ]
[TRANSCRIPT_STABLE_IDS]
    GFF = [ SELF->Name /(.+)/ ]
[TRANSLATION_STABLE_IDS]
    GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]
```
  These are used within the Ensembl database as the primary identifiers for each gene, transcript and translation, and the expectation is that they will be set to values that will ideally remain stable across assembly versions.  The stable_id is also used by this script to link records across files, so the same identifier (or a pattern that can be linked by [Match and replace](doc:match-and-replace)) must be present in each of the [[FILES]](doc:files-core) referred to above.
  - Within the GFF file, attributes may be specified by the pattern ``feature->attribute``.  Specifying ``gene->Name`` for a transcript stable_id would select the ``Name`` attribute of the parent ``gene``.  Since transcripts may be of many types, the current feature may be specified by the keyword ``SELF`` and transcripts of a ``gene`` may be retrieved using the ``DAUGHTER`` keyword.
  - If ``TRANSLATION_STABLE_IDS`` is not defined, the ``TRANSCRIPT_STABLE_IDS`` pattern will be reused.

- [[_DESCRIPTIONS]](doc:_descriptions-core)
```
[GENE_DESCRIPTIONS]
    GFF = [ 1 DAUGHTER->product /(.+)/ ]
[TRANSCRIPT_DESCRIPTIONS]
    GFF = [ 1 SELF->product /(.+)/ ]
```
  Descriptions are displayed in the Ensembl database and included in the search index (optional [Step 2.8](doc:step-28-generate-search-index)).  Each set of descriptions may be sourced from any number of files, in which case the first number in the value array indicates the priority accorded to descriptions from that source.  Descriptions from sources with lower numbers will overwrite those from sources with higher numbers.  If the priority is set to 1, this will also cause any existing descriptions in the database to be overwritten.

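  The ``attribute /regex/ /match/replace/`` part of the patterns above can be pictured with a short Python sketch.  This is an illustration under stated assumptions, not the script's own code: the helper names ``parse_attributes`` and ``extract`` and the example identifier ``OBRU01_00001-RA`` are hypothetical, the first regex is assumed to capture the value and the optional second pair to rewrite it, and resolving ``gene->`` or ``DAUGHTER->`` to a related feature is left out.

```python
import re


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    pairs = (pair.split("=", 1) for pair in column9.strip().split(";") if "=" in pair)
    return {key: value for key, value in pairs}


def extract(attributes, attribute, capture, replace=None):
    """Return the first capture group of `capture` applied to one attribute.

    `replace` is an optional (match, replacement) pair applied afterwards,
    mirroring the /-RA/-PA/ step in the TRANSLATION_STABLE_IDS example.
    """
    value = attributes.get(attribute)
    if value is None:
        return None
    found = re.search(capture, value)
    if not found:
        return None
    result = found.group(1)
    if replace is not None:
        result = re.sub(replace[0], replace[1], result)
    return result


# Hypothetical mRNA feature; SELF->Name reads the feature's own Name attribute.
mrna = parse_attributes("ID=mrna1;Parent=gene1;Name=OBRU01_00001-RA")
transcript_id = extract(mrna, "Name", r"(.+)")
translation_id = extract(mrna, "Name", r"(.+)", ("-RA", "-PA"))
```

  Under these assumptions the transcript keeps its ``-RA`` suffix while the translation derived from the same ``Name`` receives ``-PA``, which is one way to give the two records distinct but linkable identifiers.
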
- [[_NAMES]](doc:_names-core)
```
[GENE_NAMES]
    GFF = [ 1 gene->Name /(.+)/ ]
[TRANSCRIPT_NAMES]
    GFF = [ 1 SELF->Name /(.+)/ ]
```
  These are used to set synonyms for each stable_id.  If multiple files are specified, each separate synonym will be added to the database.  In this case if the first number in the value array is 1, the first synonym from this source will be added to the database as a display_name, shown in preference to the stable_id.

- [[DBXREFS]](doc:dbxrefs-core)
```
[DBXREFS]
    ;   KEY = [ EXTERNAL_DB_ID NAME ACCESSION_REGEX DISPLAY_NAME_REGEX ]
    GO = [ 1000 GO /^goslim_goa:GO:(.+)/ ]
    INTERNAL = [ 9999 Internal /^Internal:(.+)/ /^Internal:(.+)/ ]
    REFSEQ_MRNA = [ 1801 RefSeq_mRNA /^Genbank:(NM_.+)/ ]
    REFSEQ_MRNA_PRED = [ 1806 RefSeq_mRNA_predicted /^Genbank:(XM_.+)/ ]
    REFSEQ_PEPTIDE = [ 1810 RefSeq_peptide /^Genbank:(NP_.+)/ ]
    REFSEQ_PEPTIDE_PRED = [ 1815 RefSeq_peptide_predicted /^Genbank:(XP_.+)/ ]
    REFSEQ_RNA = [ 1820 RefSeq_rna /^Genbank:(XR_.+)/ ]
    REFSEQ_RNA_PRED = [ 1825 RefSeq_rna_predicted /^Genbank:(XR_.+)/ ]
    ENTREZGENE = [ 1300 EntrezGene /^GeneID:(.+)/ ]
    UNIPROT = [ 2250 UniProtKB_all /^UniProtKB:(.+)/ ]
```
  Pattern matching to associate Dbxref attributes in the GFF with the correct database.  For each Dbxref in the ``.gff`` file, the value array contains the Ensembl external_db_id, the display_name for the external_db and regular expressions to extract the database accession and display name from any additional information in the string.

- [[EXTERNAL_DBS]](doc:external_dbs-core)
```
[EXTERNAL_DBS]
```
  Used to add any additional external databases to the external_db table in the Ensembl database if required to support [[DBXREFS]](doc:dbxrefs-core).
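
  The way a [[DBXREFS]](doc:dbxrefs-core) entry ties a Dbxref string to an external database can be pictured with a short Python sketch.  This is an illustration only, not the script's own code: the function ``classify_dbxref`` and the example accessions are hypothetical, only a subset of the entries above is copied into the table, and the optional display-name regex is ignored.

```python
import re

# key -> (external_db_id, db name, accession regex), copied from the example stanza
DBXREF_PATTERNS = {
    "GO": (1000, "GO", r"^goslim_goa:GO:(.+)"),
    "REFSEQ_MRNA": (1801, "RefSeq_mRNA", r"^Genbank:(NM_.+)"),
    "REFSEQ_PEPTIDE_PRED": (1815, "RefSeq_peptide_predicted", r"^Genbank:(XP_.+)"),
    "ENTREZGENE": (1300, "EntrezGene", r"^GeneID:(.+)"),
    "UNIPROT": (2250, "UniProtKB_all", r"^UniProtKB:(.+)"),
}


def classify_dbxref(value):
    """Return (key, external_db_id, db_name, accession), or None if no regex matches."""
    for key, (db_id, db_name, accession_regex) in DBXREF_PATTERNS.items():
        found = re.search(accession_regex, value)
        if found:
            # The first capture group is taken as the accession.
            return key, db_id, db_name, found.group(1)
    return None


# Hypothetical Dbxref attribute value; GFF3 separates multiple values with commas.
dbxref = "GeneID:123456,Genbank:XP_000001.1,UniProtKB:A0A000"
hits = [classify_dbxref(value) for value in dbxref.split(",")]
```

  A Dbxref whose prefix matches none of the patterns returns ``None`` in this sketch, which corresponds to the case where a new [[DBXREFS]](doc:dbxrefs-core) entry (and possibly an [[EXTERNAL_DBS]](doc:external_dbs-core) entry) would need to be added.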