`-p` Prepare the gff file for import
Many `.gff` files deviate from the official specification and, even for those that are correctly formatted, it is often useful to extract information from different types/attributes when assigning stable_ids, names, synonyms and descriptions for import to an Ensembl database. It is also common for different data types in a single file to have different attributes specified, so a set of patterns suited to one feature type may not be suitable for another, requiring multiple passes across the same file to extract all of the data. For these reasons, no attempt is made to import `.gff` directly to the core database; instead, an intermediate file is created with specific attributes ready for import in `-g` Import gff from prepared file.
For `.gff` files that contain errors and inconsistencies, it can be frustrating to use many of the available parsers, which output an error message and leave the user to repair the file manually, often with a set of one-liners, and then attempt to import the file again.
`.gff` handling in easy import uses a gff parser which embraces the diversity of real-world gff by allowing full customisation of expected relationships and properties, with functions to repair, warn or ignore errors during validation. A subset of parameters for this parser can be controlled with the [GFF] stanza of the `.ini` file (see also Repairing gff). This approach initially adds some complexity to the parameter specification, but many patterns can be reused across most `.gff` files, and the benefit is that all modifications to the `.gff` can be preserved in the `.ini` file.
    docker run --rm \
               --name easy-import-operophtera_brumata_v1_core_32_85_1 \
               --link genomehubs-mysql \
               -v ~/demo/genomehubs-import/import/conf:/import/conf \
               -v ~/demo/genomehubs-import/import/data:/import/data \
               -e DATABASE=operophtera_brumata_v1_core_32_85_1 \
               -e FLAGS="-p" \
               genomehubs/easy-import:latest
Features that cannot be processed using the provided `.ini` file are written to a `.exception.gff` file, which can be processed using a second `.ini` file to overwrite specific parameters in the first file (see Processing exceptions).
    [GFF]
    ; SPLIT = [ ##FASTA GFF CONTIG ]
    SORT = 1
    CHUNK = [ change region ]
    ; CHUNK = [ separator ### ]
    CONDITION1 = [ MULTILINE CDS ]
    CONDITION1a = [ MULTILINE five_prime_UTR ]
    CONDITION1b = [ MULTILINE three_prime_UTR ]
    CONDITION2 = [ EXPECTATION cds hasSister exon force ]
    CONDITION3 = [ EXPECTATION cds hasParent mrna force ];
    CONDITION4 = [ EXPECTATION exon hasParent mrna force ];
    CONDITION4a = [ EXPECTATION five_prime_UTR hasParent mrna force ];
    CONDITION4b = [ EXPECTATION three_prime_UTR hasParent mrna force ];
    CONDITION5 = [ EXPECTATION mrna hasParent gene force ];
    CONDITION10 = [ EXPECTATION cds|exon|mrna|three_prime_UTR|five_prime_UTR|gene <=[_start,_end] SELF warn ];
Meta-syntax to set options for gff parser.
For files with fasta sequence included at the end, `SPLIT` will split the gff file on the specified keyword (`##FASTA`) and assign the resulting subfiles to the [FILES] handles (`GFF` and `CONTIG` in the commented-out example above).
`SORT` is a flag to determine whether the file should be sorted prior to processing. This is a basic sort which will result in each sequence region forming a block in the sorted file, allowing the file to be processed in chunks for much faster performance.
`CHUNK` causes the file to be processed in independent chunks, which is much more efficient than reading the entire file into memory, particularly if there are a large number of validation steps.
- for sorted files, specifying `change region` will split the file into a separate chunk for each sequence region.
- alternatively, for files with additional formatting rows, the file may be split on a specific `separator` (e.g. `###`).
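The effect of `SORT` followed by `CHUNK = [ change region ]` can be pictured with a short Python sketch. This is only an illustration of the idea, not the parser's own code: the function name `chunk_by_region` and the example rows are hypothetical, and for brevity the sketch sorts in memory, whereas the point of chunking in practice is that validation then runs on one region at a time.

```python
from itertools import groupby

def chunk_by_region(gff_lines):
    """Yield (seqid, rows) chunks, one per sequence region."""
    # Keep only feature rows: skip blank lines, comments and directives.
    rows = [line.rstrip("\n").split("\t") for line in gff_lines
            if line.strip() and not line.startswith("#")]
    # Basic sort so that each sequence region forms one contiguous block
    # (column 1 is the seqid, column 4 the start coordinate).
    rows.sort(key=lambda cols: (cols[0], int(cols[3])))
    # A change of seqid marks the start of a new chunk.
    for seqid, group in groupby(rows, key=lambda cols: cols[0]):
        yield seqid, list(group)

# Hypothetical input rows, deliberately out of order.
example = [
    "##gff-version 3\n",
    "scaffold2\tsrc\tgene\t500\t900\t.\t+\t.\tID=g2\n",
    "scaffold1\tsrc\tgene\t100\t400\t.\t+\t.\tID=g1\n",
    "scaffold2\tsrc\tgene\t10\t90\t.\t-\t.\tID=g3\n",
]
chunks = list(chunk_by_region(example))
```

Each chunk can then be validated independently, which keeps the number of features held at once proportional to a single sequence region.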
Most other keys (e.g. `CONDITION1`) can have any name and will be used to set validation conditions.
- Each feature in a `.gff` file should have a unique ID. Specifying `MULTILINE` allows individual features (CDS, for example) to be defined across multiple lines.
- `EXPECTATION`s can be set for individual feature types (or pipe-separated sets of feature types) and may be of type `hasParent <type>` (feature has a parent feature of the named type) or `hasSister <type>` (feature shares a parent with a feature of the named type at overlapping coordinates), or one of a set of comparison operators (such as `<=` in `CONDITION10` above).
- For each expectation, the behaviour of the validator can be set, for example to `find` a matching feature or `make` a matching feature; the example above also uses `force` and `warn`.
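As a rough illustration of what a `hasParent` expectation checks, the following Python sketch reports features whose parent is missing or of the wrong type, which corresponds loosely to a `warn` behaviour. The function `check_has_parent`, the dictionary layout and the case-insensitive type comparison are all assumptions made for this example; the actual parser's data structures and its handling of `find`, `make` and `force` are not shown here.

```python
def check_has_parent(features, child_type, parent_type):
    """Return a warning for each child_type feature lacking a parent_type parent."""
    by_id = {f["ID"]: f for f in features}
    warnings = []
    for f in features:
        # Assumption: feature types are compared case-insensitively,
        # since the config uses "mrna"/"cds" while GFF uses "mRNA"/"CDS".
        if f["type"].lower() != child_type:
            continue
        parent = by_id.get(f.get("Parent"))
        if parent is None or parent["type"].lower() != parent_type:
            warnings.append(f"{f['ID']}: expected parent of type {parent_type}")
    return warnings

# Hypothetical features: exon e2 hangs directly off the gene.
features = [
    {"ID": "g1", "type": "gene"},
    {"ID": "t1", "type": "mRNA", "Parent": "g1"},
    {"ID": "e1", "type": "exon", "Parent": "t1"},
    {"ID": "e2", "type": "exon", "Parent": "g1"},
]
warnings = check_has_parent(features, "exon", "mrna")
```

A repairing behaviour would act on the same failing features instead of only reporting them.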
    [FILES]
    GFF = [ gff3 http://www.bioinformatics.nl/wintermoth/data_files/Obru_genes.gff.gz ]
    PROTEIN = [ fa http://www.bioinformatics.nl/wintermoth/data_files/ObruPep.fasta.gz ]
A `GFF` file must be specified; additional files (e.g. `PROTEIN`) may optionally be specified as sources of additional information.
    [GENE_STABLE_IDS]
    GFF = [ gene->Name /(.+)/ ]
    [TRANSCRIPT_STABLE_IDS]
    GFF = [ SELF->Name /(.+)/ ]
    [TRANSLATION_STABLE_IDS]
    GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]
These are used within the Ensembl database as the primary identifiers for each gene, transcript and translation, and the expectation is that they will be set to a value that will ideally remain stable across assembly versions. The stable_id is also used by this script to link records across files, so the same identifier (or a pattern that can be linked by Match and replace) must be present in each of the [FILES] referred to above.
Within the GFF file, attributes may be specified by a pattern; for example, `gene->Name` for a transcript stable_id would select the `Name` attribute of the parent `gene`. Since transcripts may be of many types, the current feature may be specified by the keyword `SELF`, and transcripts of a `gene` may be retrieved using the `DAUGHTER` keyword.
If `TRANSLATION_STABLE_ID` is not defined, the `TRANSCRIPT_STABLE_ID` will be reused.
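The value arrays above pair a capture pattern such as `/(.+)/` with an optional substitution such as `/-RA/-PA/`. A minimal Python sketch of how such a pair could be applied to an attribute value is given below. The helper `extract_id` and the identifier `OBRU01_00001-RA` are hypothetical, and replacing only the first occurrence is an assumption modelled on a Perl-style substitution without a global flag.

```python
import re

def extract_id(value, match, replace=None):
    """Apply a capture pattern, then an optional (search, replacement) pair."""
    m = re.search(match, value)
    if m is None:
        return None  # the pattern did not match this attribute value
    result = m.group(1)
    if replace is not None:
        search, replacement = replace
        # Assumption: only the first occurrence is substituted.
        result = re.sub(search, replacement, result, count=1)
    return result

name = "OBRU01_00001-RA"  # hypothetical Name attribute of a transcript
transcript_id = extract_id(name, r"(.+)")
translation_id = extract_id(name, r"(.+)", ("-RA", "-PA"))
```

In this sketch the transcript keeps its `-RA` suffix while the translation derived from the same `Name` receives `-PA`, so the two identifiers stay linked by a predictable pattern.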
    [GENE_DESCRIPTIONS]
    GFF = [ 1 DAUGHTER->product /(.+)/ ]
    [TRANSCRIPT_DESCRIPTIONS]
    GFF = [ 1 SELF->product /(.+)/ ]
Descriptions are displayed in the Ensembl database and included in the search index (optional Step 2.8). Each set of descriptions may be sourced from any number of files, in which case the first number in the value array indicates the priority accorded to descriptions from that source. Descriptions from sources with lower numbers will overwrite those from sources with higher numbers. If the priority is set to 1, any existing descriptions in the database will also be overwritten.
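The priority rule can be sketched in a few lines of Python: among the descriptions available for a feature, the one from the source with the lowest number is kept. The function `pick_description` and the example descriptions are hypothetical, and skipping empty descriptions is an assumption of this sketch rather than documented behaviour.

```python
def pick_description(candidates):
    """candidates: list of (priority, text); the lowest priority number wins."""
    # Assumption: sources that yielded no text are ignored.
    usable = [(priority, text) for priority, text in candidates if text]
    if not usable:
        return None
    return min(usable, key=lambda pair: pair[0])[1]

# Hypothetical descriptions for one gene from three sources.
chosen = pick_description([
    (2, "hypothetical protein"),
    (1, "cytochrome P450"),
    (3, ""),
])
```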
    [GENE_NAMES]
    GFF = [ 1 gene->Name /(.+)/ ]
    [TRANSCRIPT_NAMES]
    GFF = [ 1 SELF->Name /(.+)/ ]
These are used to set synonyms for each stable_id. If multiple files are specified, each separate synonym will be added to the database. In this case, if the first number in the value array is 1, the first synonym from this source will be added to the database as a display_name, shown in preference to the stable_id.
    [DBXREFS]
    ; KEY = [ EXTERNAL_DB_ID NAME ACCESSION_REGEX DISPLAY_NAME_REGEX ]
    GO = [ 1000 GO /^goslim_goa:GO:(.+)/ ]
    INTERNAL = [ 9999 Internal /^Internal:(.+)/ /^Internal:(.+)/ ]
    REFSEQ_MRNA = [ 1801 RefSeq_mRNA /^Genbank:(NM_.+)/ ]
    REFSEQ_MRNA_PRED = [ 1806 RefSeq_mRNA_predicted /^Genbank:(XM_.+)/ ]
    REFSEQ_PEPTIDE = [ 1810 RefSeq_peptide /^Genbank:(NP_.+)/ ]
    REFSEQ_PEPTIDE_PRED = [ 1815 RefSeq_peptide_predicted /^Genbank:(XP_.+)/ ]
    REFSEQ_RNA = [ 1820 RefSeq_rna /^Genbank:(XR_.+)/ ]
    REFSEQ_RNA_PRED = [ 1825 RefSeq_rna_predicted /^Genbank:(XR_.+)/ ]
    ENTREZGENE = [ 1300 EntrezGene /^GeneID:(.+)/ ]
    UNIPROT = [ 2250 UniProtKB_all /^UniProtKB:(.+)/ ]
Pattern matching to associate Dbxref attributes in the GFF with the correct database. For each Dbxref in the `.gff` file, the value array contains the Ensembl external_db_id, the display_name for the external_db, and regular expressions to extract the database accession and display name from any additional information in the string.
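The matching step can be illustrated with a small Python sketch that tests a Dbxref string against a table of accession patterns and returns the first hit. Three entries are copied from the stanza above; the function `classify_dbxref`, the dictionary layout and the example value `GeneID:12345` are hypothetical and are not taken from easy-import itself.

```python
import re

# key: (external_db_id, db display name, accession regex), from the [DBXREFS] example.
DBXREFS = {
    "GO": (1000, "GO", r"^goslim_goa:GO:(.+)"),
    "ENTREZGENE": (1300, "EntrezGene", r"^GeneID:(.+)"),
    "UNIPROT": (2250, "UniProtKB_all", r"^UniProtKB:(.+)"),
}

def classify_dbxref(value):
    """Return the database and accession for a Dbxref string, or None."""
    for external_db_id, db_name, accession_regex in DBXREFS.values():
        m = re.search(accession_regex, value)
        if m:
            return {
                "external_db_id": external_db_id,
                "db_name": db_name,
                "accession": m.group(1),
            }
    return None  # no configured database recognises this Dbxref

hit = classify_dbxref("GeneID:12345")   # hypothetical Dbxref value
miss = classify_dbxref("Unknown:1")
```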
Used to add any additional external databases to the external_db table in the Ensembl database, if required to support [DBXREFS].