Step 2.3: Prepare the gff file for import

Many .gff files deviate from the official specification, and even for correctly formatted files it is often useful to extract information from different types/attributes when assigning stable_ids, names, synonyms and descriptions for import to an Ensembl database. It is also common for different feature types in a single file to have different attributes specified, so a set of patterns suited to one feature type may not be suitable for another, and multiple passes across the same file may be needed to extract all of the data. For these reasons, no attempt is made to import .gff directly to the core database. Instead, an intermediate file is created with specific attributes ready for import in Step 2.4: Import gff from prepared file.

📘
Repairing GFF
For .gff files that contain errors and inconsistencies, many of the available parsers can be frustrating to use: they output an error message and leave the user to repair the file manually, often with a set of one-liners, and then attempt the import again. .gff handling in easy import uses a gff parser that embraces the diversity of real-world gff by allowing full customisation of expected relationships and properties, with functions to repair, warn about or ignore errors during validation. A subset of parameters for this parser can be controlled with the [GFF] stanza of the .ini file (see also Repairing gff). This approach initially adds some complexity to the parameter specification, but many patterns can be reused across most .gff files, and all modifications to the .gff are preserved in the .ini file.

cd ~/import
perl ../ei/core/prepare_gff.pl ../ei/conf/core-import.ini

Features that cannot be processed using the provided .ini file are written to a .exception.gff file which can be processed using a second .ini file to overwrite specific parameters in the first file (see Processing exceptions).

perl ../ei/core/prepare_gff.pl ../ei/conf/core-import.ini /path/to/exception.ini

Configuration options

[GFF]

[GFF]
;	SPLIT = [ ##FASTA GFF CONTIG ]
	SORT = 1
	CHUNK = [ change region ]
;	CHUNK = [ separator		### ]
	CONDITION1 = [ MULTILINE   CDS ]
	CONDITION1a = [ MULTILINE  five_prime_UTR ]
	CONDITION1b = [ MULTILINE  three_prime_UTR ]
	CONDITION2 = [ EXPECTATION cds	 hasSister exon force ]
	CONDITION3 = [ EXPECTATION cds	 hasParent mrna force ];
	CONDITION4 = [ EXPECTATION exon	 hasParent mrna force ];
	CONDITION4a = [ EXPECTATION five_prime_UTR hasParent mrna force ];
	CONDITION4b = [ EXPECTATION three_prime_UTR  hasParent mrna force ];
	CONDITION5 = [ EXPECTATION mrna	 hasParent gene force ];
	CONDITION10 = [ EXPECTATION cds|exon|mrna|three_prime_UTR|five_prime_UTR|gene <=[_start,_end] SELF warn ];

Meta-syntax to set options for gff parser.

For files with fasta sequence included at the end, SPLIT will split the gff file on the specified keyword (##FASTA) and assign the resulting subfiles to the [FILES] handles GFF and CONTIG
SORT is a flag to determine whether the file should be sorted prior to processing. This is a basic sort which will result in each sequence region forming a block in the sorted file, allowing the file to be processed in chunks for much faster performance.
CHUNK causes the file to be processed in independent chunks, which is much more efficient than reading the entire file into memory, particularly if there are a large number of validation steps.
- for sorted files, specifying change region will split the file into a separate chunk for each sequence region.
- alternatively, for files with additional formatting rows, the file may be split on specific separators
Most other keys (e.g. CONDITION1) can have any name and will be used to set validation conditions.
- Each feature in a .gff file should have a unique ID. Specifying MULTILINE allows individual features (CDS, for example) to be defined across multiple lines.
- EXPECTATIONs can be set for individual feature types (or pipe-separated sets of feature types) and may be of type hasParent <type> (feature has a parent feature of the named type), hasSister <type> (feature shares a parent with a feature of the named type at overlapping coordinates), or one of a set of comparison operators: <, <=, ==, >=, >.
- For each expectation, the behaviour of the validator can be set to ignore, warn, find a matching feature, make a matching feature, force (find followed by make), or die.
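
As an illustration of what SPLIT, SORT and CHUNK = [ change region ] describe, the following Python sketch splits a combined file on ##FASTA, sorts feature lines by sequence region and yields one chunk per region. It is not the parser used by prepare_gff.pl; the function names split_gff_and_fasta and chunk_by_region are hypothetical, and the sketch assumes tab-separated GFF lines with the sequence region in the first column.

```python
from itertools import groupby


def split_gff_and_fasta(lines, keyword="##FASTA"):
    """Return (gff_lines, fasta_lines), split at the first line equal to keyword."""
    lines = list(lines)
    for i, line in enumerate(lines):
        if line.strip() == keyword:
            return lines[:i], lines[i + 1:]
    return lines, []  # no embedded FASTA section


def seqid(line):
    """Sequence region is the first tab-separated column of a GFF line."""
    return line.split("\t")[0]


def chunk_by_region(gff_lines):
    """Sort feature lines by sequence region and yield (region, lines) chunks."""
    # Skip blank lines and comment/directive lines (hypothetical simplification).
    features = [l for l in gff_lines if l.strip() and not l.startswith("#")]
    # Python's sort is stable, so the original order within a region is kept.
    features.sort(key=seqid)
    for region, group in groupby(features, key=seqid):
        yield region, list(group)
```

Each chunk can then be validated on its own, so only one sequence region's features need to be held in memory at a time.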
[FILES]

[FILES]
	GFF = [ gff3 http://www.bioinformatics.nl/wintermoth/data_files/Obru_genes.gff.gz ]
	PROTEIN = [ fa http://www.bioinformatics.nl/wintermoth/data_files/ObruPep.fasta.gz ]

A GFF file must be specified, and optionally additional files (e.g. PROTEIN) may be specified as sources of additional information.

[_STABLE_IDS]

[GENE_STABLE_IDS]
    GFF = [ gene->Name /(.+)/ ]
[TRANSCRIPT_STABLE_IDS]
    GFF = [ SELF->Name /(.+)/ ]
[TRANSLATION_STABLE_IDS]
    GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]

These are used within the Ensembl database as the primary identifiers for each gene, transcript and translation, and the expectation is that they will be set to a value that will ideally remain stable across assembly versions. The stable_id is also used by this script to link records across files, so the same identifier (or a pattern that can be linked by Match and replace) must be present in each of the [FILES] referred to above.

Within the GFF file, attributes may be specified by the pattern feature->attribute. Specifying gene->Name for a transcript stable_id would select the Name attribute of the parent gene. Since transcripts may be of many types, the current feature may be specified by the keyword SELF and transcripts of a gene may be retrieved using the DAUGHTER keyword.
If TRANSLATION_STABLE_IDS is not defined, the transcript stable_id will be reused.
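
The effect of a value such as [ SELF->Name /(.+)/ /-RA/-PA/ ] can be pictured with a short Python sketch: take the Name attribute of the current feature, apply the capture pattern, then apply the match-and-replace pair. This is only an approximation of the behaviour described above, not the script's own code; parse_attributes and extract_id are hypothetical helpers, and the attribute string is an invented example.

```python
import re


def parse_attributes(column9):
    """Parse a GFF3 attributes column into a dict (no URL-decoding, for brevity)."""
    pairs = (pair.split("=", 1) for pair in column9.strip().split(";") if "=" in pair)
    return {key: value for key, value in pairs}


def extract_id(value, match, replace=None):
    """Apply a capture pattern, then an optional (search, replacement) substitution."""
    m = re.search(match, value)
    if m is None:
        return None  # pattern did not match, so no identifier can be assigned
    result = m.group(1)
    if replace is not None:
        search, replacement = replace
        result = re.sub(search, replacement, result)
    return result


# Invented attribute string for a transcript feature (SELF).
attrs = parse_attributes("ID=mrna1;Parent=gene1;Name=OBRU01_00001-RA")

transcript_id = extract_id(attrs["Name"], r"(.+)")
translation_id = extract_id(attrs["Name"], r"(.+)", ("-RA", "-PA"))
```

Here the transcript keeps its Name unchanged, while the translation identifier is derived from the same attribute by swapping the -RA suffix for -PA.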
[_DESCRIPTIONS]

[GENE_DESCRIPTIONS]
    GFF = [ 1 DAUGHTER->product /(.+)/ ]
[TRANSCRIPT_DESCRIPTIONS]
    GFF = [ 1 SELF->product /(.+)/ ]

Descriptions are displayed in the Ensembl database and included in the search index (optional Step 2.8). Each set of descriptions may be sourced from any number of files, in which case the first number in the value array indicates the priority accorded to descriptions from that source. Descriptions from sources with lower numbers will overwrite those from sources with higher numbers. If set to 1 this will also cause any existing descriptions in the database to be overwritten.
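
The priority rule can be sketched in Python as follows. This is one reading of the paragraph above rather than the script's actual logic: it assumes that, for a given stable_id, the candidate with the lowest priority number wins, and that only priority 1 replaces a description already in the database. The function merge_descriptions and the example values are hypothetical.

```python
def merge_descriptions(candidates, existing=None):
    """Merge candidate descriptions into existing ones.

    candidates: iterable of (priority, stable_id, description) tuples.
    existing:   dict of stable_id -> description already in the database.
    Assumption: lower priority numbers win, and only priority 1
    overwrites a description that already exists.
    """
    existing = dict(existing or {})
    best = {}
    for priority, stable_id, description in candidates:
        if stable_id not in best or priority < best[stable_id][0]:
            best[stable_id] = (priority, description)

    merged = dict(existing)
    for stable_id, (priority, description) in best.items():
        if stable_id not in existing or priority == 1:
            merged[stable_id] = description
    return merged


merged = merge_descriptions(
    [(2, "g1", "from protein file"), (1, "g1", "from gff"), (2, "g2", "kinase")],
    existing={"g2": "old description", "g3": "kept"},
)
```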

[_NAMES]

[GENE_NAMES]
    GFF = [ 1 gene->Name /(.+)/ ]
[TRANSCRIPT_NAMES]
    GFF = [ 1 SELF->Name /(.+)/ ]

These are used to set synonyms for each stable_id. If multiple files are specified, each separate synonym will be added to the database. In this case if the first number in the value array is 1, the first synonym from this source will be added to the database as a display_name, shown in preference to the stable_id.

[DBXREFS]

[DBXREFS]
    ;   KEY = [ EXTERNAL_DB_ID NAME ACCESSION_REGEX DISPLAY_NAME_REGEX ]
    GO = [ 1000 GO /^goslim_goa:GO:(.+)/ ]
    INTERNAL = [ 9999 Internal /^Internal:(.+)/ /^Internal:(.+)/ ]
    REFSEQ_MRNA = [ 1801 RefSeq_mRNA /^Genbank:(NM_.+)/ ]
    REFSEQ_MRNA_PRED = [ 1806 RefSeq_mRNA_predicted /^Genbank:(XM_.+)/ ]
    REFSEQ_PEPTIDE = [ 1810 RefSeq_peptide /^Genbank:(NP_.+)/ ]
    REFSEQ_PEPTIDE_PRED = [ 1815 RefSeq_peptide_predicted /^Genbank:(XP_.+)/ ]
    REFSEQ_RNA = [ 1820 RefSeq_rna /^Genbank:(NR_.+)/ ]
    REFSEQ_RNA_PRED = [ 1825 RefSeq_rna_predicted /^Genbank:(XR_.+)/ ]
    ENTREZGENE = [ 1300 EntrezGene /^GeneID:(.+)/ ]
    UNIPROT = [ 2250 UniProtKB_all /^UniProtKB:(.+)/ ]

Pattern matching is used to associate Dbxref attributes in the GFF with the correct database. For each Dbxref in the .gff file, the value array contains the Ensembl external_db_id, the display_name for the external_db and regular expressions to extract the database accession and display name from any additional information in the string.
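
A minimal Python sketch of this kind of matching is shown below, using a few of the rules from the [DBXREFS] stanza above. It is illustrative only: classify_dbxref is a hypothetical function, the Dbxref strings are invented, and falling back to the accession when no display-name pattern is given is an assumption rather than documented behaviour.

```python
import re

# (external_db_id, db_name, accession_regex, display_name_regex or None),
# copied from a subset of the [DBXREFS] entries above.
DBXREF_RULES = [
    (1000, "GO", r"^goslim_goa:GO:(.+)", None),
    (1300, "EntrezGene", r"^GeneID:(.+)", None),
    (1806, "RefSeq_mRNA_predicted", r"^Genbank:(XM_.+)", None),
    (9999, "Internal", r"^Internal:(.+)", r"^Internal:(.+)"),
]


def classify_dbxref(value):
    """Return a dict describing the first rule that matches, or None."""
    for db_id, db_name, accession_re, display_re in DBXREF_RULES:
        m = re.search(accession_re, value)
        if m is None:
            continue
        accession = m.group(1)
        display = accession  # assumption: default display name is the accession
        if display_re is not None:
            d = re.search(display_re, value)
            if d is not None:
                display = d.group(1)
        return {
            "external_db_id": db_id,
            "db_name": db_name,
            "accession": accession,
            "display_name": display,
        }
    return None


# A Dbxref attribute may hold several comma-separated values.
xrefs = [classify_dbxref(v) for v in "GeneID:12345,Genbank:XM_0123.1".split(",")]
```

Values that match no rule return None, which makes unrecognised databases easy to spot before deciding whether an [EXTERNAL_DBS] entry is needed.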

[EXTERNAL_DBS]

[EXTERNAL_DBS]

Used to add any additional external databases to the external_db table in the Ensembl database if required to support [DBXREFS]
