`-p` Prepare the gff file for import
Many `.gff` files deviate from the official specification and, even for those that are correctly formatted, it is often useful to extract information from different types/attributes when assigning stable_ids, names, synonyms and descriptions for import to an Ensembl database. It is also common for different data types in a single file to have different attributes specified, so a set of patterns suited to one feature type may not be suitable for another, requiring multiple passes across the same file to extract all of the data. For these reasons, no attempt is made to import `.gff` directly to the core database. Instead, an intermediate file is created with specific attributes ready for import in the `-g` step (Import gff from prepared file).
Repairing GFF
For `.gff` files that contain errors and inconsistencies, it can be frustrating to use many of the available parsers, which output an error message and leave the user to manually repair the file, often with a set of one-liners, and then attempt to import the file again. `.gff` handling in easy import uses a gff parser which embraces the diversity of real world gff by allowing full customisation of expected relationships and properties, with functions to repair, warn or ignore errors during validation. A subset of parameters for this parser can be controlled with the `[GFF]` stanza of the `.ini` file (see also Repairing gff). This approach initially adds some complexity to the parameter specification, but many patterns can be reused across most `.gff` files and the benefit is that all modifications to the `.gff` can be preserved in the `.ini` file.
docker run --rm \
--name easy-import-operophtera_brumata_v1_core_32_85_1 \
--link genomehubs-mysql \
-v ~/demo/genomehubs-import/import/conf:/import/conf \
-v ~/demo/genomehubs-import/import/data:/import/data \
-e DATABASE=operophtera_brumata_v1_core_32_85_1 \
-e FLAGS="-p" \
genomehubs/easy-import:latest
Features that cannot be processed using the provided `.ini` file are written to a `.exception.gff` file, which can be processed using a second `.ini` file to overwrite specific parameters in the first file (see Processing exceptions).
Configuration options
[GFF]
; SPLIT = [ ##FASTA GFF CONTIG ]
SORT = 1
CHUNK = [ change region ]
; CHUNK = [ separator ### ]
CONDITION1 = [ MULTILINE CDS ]
CONDITION1a = [ MULTILINE five_prime_UTR ]
CONDITION1b = [ MULTILINE three_prime_UTR ]
CONDITION2 = [ EXPECTATION cds hasSister exon force ]
CONDITION3 = [ EXPECTATION cds hasParent mrna force ];
CONDITION4 = [ EXPECTATION exon hasParent mrna force ];
CONDITION4a = [ EXPECTATION five_prime_UTR hasParent mrna force ];
CONDITION4b = [ EXPECTATION three_prime_UTR hasParent mrna force ];
CONDITION5 = [ EXPECTATION mrna hasParent gene force ];
CONDITION10 = [ EXPECTATION cds|exon|mrna|three_prime_UTR|five_prime_UTR|gene <=[_start,_end] SELF warn ];
Meta-syntax to set options for the gff parser.
- For files with fasta sequence included at the end, `SPLIT` will split the gff file on the specified keyword (`##FASTA`) and assign the resulting subfiles to the `[FILES]` handles `GFF` and `CONTIG`.
- `SORT` is a flag to determine whether the file should be sorted prior to processing. This is a basic sort which will result in each sequence region forming a block in the sorted file, allowing the file to be processed in chunks for much faster performance.
- `CHUNK` causes the file to be processed in independent chunks, which is much more efficient than reading the entire file into memory, particularly if there are a large number of validation steps.
  - For sorted files, specifying `change region` will split the file into a separate chunk for each sequence region.
  - Alternatively, for files with additional formatting rows, the file may be split on specific `separator`s.
- Most other keys (e.g. `CONDITION1`) can have any name and will be used to set validation conditions.
  - Each feature in a `.gff` file should have a unique ID. Specifying `MULTILINE` allows individual CDS features, for example, to be defined across multiple lines.
  - `EXPECTATION`s can be set for individual feature types (or pipe-separated sets of feature types) and may be of type `hasParent <type>` (feature has a parent feature of the named type) or `hasSister <type>` (feature shares a parent with a feature of the named type at overlapping coordinates), or one of a set of comparison operators `<`, `<=`, `==`, `>=`, `>`.
  - For each expectation, the behaviour of the validator can be set to `ignore`, `warn`, `find` a matching feature, `make` a matching feature, `force` (`find` followed by `make`), or `die`.
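As a rough illustration of how `SORT`, `CHUNK = [ change region ]` and a `hasParent` expectation fit together, the following Python sketch groups features by sequence region and checks each chunk independently. It is not easy-import's own parser: `parse_line`, `chunk_by_region` and `expect_parent` are hypothetical names, the sample features are invented, and the `force` branch is only one plausible reading of "`find` followed by `make`".

```python
from itertools import groupby


def parse_line(line):
    """Parse one tab-separated GFF3 feature line into a dict."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
    return {"seqid": cols[0], "type": cols[2], "start": int(cols[3]),
            "end": int(cols[4]), "attrs": attrs}


def chunk_by_region(features):
    """Basic sort by sequence region, then yield one chunk per region."""
    ordered = sorted(features, key=lambda f: f["seqid"])
    for seqid, group in groupby(ordered, key=lambda f: f["seqid"]):
        yield seqid, list(group)


def expect_parent(chunk, child_type, parent_type, behaviour="warn"):
    """Check that each child_type feature in a chunk has a parent_type parent."""
    by_id = {f["attrs"]["ID"]: f for f in chunk if "ID" in f["attrs"]}
    warnings = []
    for feat in list(chunk):  # iterate over a copy, "force" may append to chunk
        if feat["type"].lower() != child_type.lower():
            continue
        parent = by_id.get(feat["attrs"].get("Parent"))
        if parent is not None and parent["type"].lower() == parent_type.lower():
            continue  # expectation met
        label = feat["attrs"].get("ID", "unnamed")
        if behaviour == "ignore":
            continue
        if behaviour == "die":
            raise ValueError(f"{label} has no {parent_type} parent")
        if behaviour == "warn":
            warnings.append(f"{label} has no {parent_type} parent")
        elif behaviour == "force":
            # "find": an existing feature of the right type that spans feat
            found = next((p for p in chunk
                          if p["type"].lower() == parent_type.lower()
                          and "ID" in p["attrs"]
                          and p["start"] <= feat["start"]
                          and feat["end"] <= p["end"]), None)
            if found is None:
                # "make": create a parent with the same coordinates (assumed)
                found = {"seqid": feat["seqid"], "type": parent_type,
                         "start": feat["start"], "end": feat["end"],
                         "attrs": {"ID": f"{parent_type}:{label}"}}
                chunk.append(found)
                by_id[found["attrs"]["ID"]] = found
            feat["attrs"]["Parent"] = found["attrs"]["ID"]
    return warnings


# Invented example features on two sequence regions
lines = [
    "scaf2\tdemo\tmRNA\t1\t100\t.\t+\t.\tID=t2",
    "scaf1\tdemo\tgene\t1\t500\t.\t+\t.\tID=g1",
    "scaf1\tdemo\tmRNA\t1\t500\t.\t+\t.\tID=t1;Parent=g1",
]
chunks = dict(chunk_by_region(parse_line(l) for l in lines))
for seqid, chunk in chunks.items():
    print(seqid, expect_parent(chunk, "mrna", "gene", "warn"))
```

Because each chunk holds only one sequence region, the parent lookup never has to consider features from the rest of the file, which is the efficiency gain that chunking is meant to provide.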
[FILES]
GFF = [ gff3 http://www.bioinformatics.nl/wintermoth/data_files/Obru_genes.gff.gz ]
PROTEIN = [ fa http://www.bioinformatics.nl/wintermoth/data_files/ObruPep.fasta.gz ]
A `GFF` file must be specified, and optionally additional files (e.g. `PROTEIN`) may be specified as sources of additional information.
[GENE_STABLE_IDS]
GFF = [ gene->Name /(.+)/ ]
[TRANSCRIPT_STABLE_IDS]
GFF = [ SELF->Name /(.+)/ ]
[TRANSLATION_STABLE_IDS]
GFF = [ SELF->Name /(.+)/ /-RA/-PA/ ]
These are used within the Ensembl database as the primary identifiers for each gene, transcript and translation, and the expectation is that they will be set to a value that will ideally remain stable across assembly versions. The stable_id is also used by this script to link records across files, so the same identifier (or a pattern that can be linked by Match and replace) must be present in each of the [FILES] referred to above.
- Within the GFF file, attributes may be specified by the pattern `feature->attribute`. Specifying `gene->Name` for a transcript stable_id would select the `Name` attribute of the parent `gene`. Since transcripts may be of many types, the current feature may be specified by the keyword `SELF`, and transcripts of a `gene` may be retrieved using the `DAUGHTER` keyword.
- If `TRANSLATION_STABLE_ID` is not defined, the `TRANSCRIPT_STABLE_ID` will be reused.
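To make the `feature->attribute` pattern concrete, here is a minimal Python sketch of how a value array such as `[ SELF->Name /(.+)/ /-RA/-PA/ ]` could be read: resolve the attribute, capture with the first regular expression, then apply the optional match and replace. The function names, the `parent` link on each feature and the example identifiers are all assumptions made for illustration, and the `DAUGHTER` keyword is not covered.

```python
import re


def resolve_attribute(feature, spec):
    """Resolve a 'feature->attribute' spec such as 'SELF->Name' or 'gene->Name'."""
    target, attribute = spec.split("->", 1)
    node = feature
    if target != "SELF":
        # walk up the parent links until a feature of the named type is found
        while node is not None and node["type"].lower() != target.lower():
            node = node.get("parent")
    return None if node is None else node["attrs"].get(attribute)


def extract_id(feature, spec, match, replace=None):
    """Capture group 1 of `match`, then apply an optional (pattern, replacement)."""
    value = resolve_attribute(feature, spec)
    if value is None:
        return None
    hit = re.search(match, value)
    if hit is None:
        return None
    result = hit.group(1)
    if replace is not None:
        pattern, replacement = replace
        result = re.sub(pattern, replacement, result)
    return result


# Invented features: a gene and one of its transcripts
gene = {"type": "gene", "attrs": {"Name": "GENE0001"}, "parent": None}
mrna = {"type": "mRNA", "attrs": {"Name": "GENE0001-RA"}, "parent": gene}

print(extract_id(mrna, "gene->Name", r"(.+)"))   # gene stable_id
print(extract_id(mrna, "SELF->Name", r"(.+)"))   # transcript stable_id
print(extract_id(mrna, "SELF->Name", r"(.+)", replace=(r"-RA", "-PA")))  # translation
```

Under this reading, the translation stable_id is derived from the transcript name by swapping the `-RA` suffix for `-PA`, which keeps the two identifiers linked without needing a separate attribute in the file.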
[GENE_DESCRIPTIONS]
GFF = [ 1 DAUGHTER->product /(.+)/ ]
[TRANSCRIPT_DESCRIPTIONS]
GFF = [ 1 SELF->product /(.+)/ ]
Descriptions are displayed in the Ensembl database and included in the search index (optional Step 2.8). Each set of descriptions may be sourced from any number of files, in which case the first number in the value array indicates the priority accorded to descriptions from that source. Descriptions from sources with lower numbers will overwrite those from sources with higher numbers. If set to 1, this will also cause any existing descriptions in the database to be overwritten.
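The priority rule can be sketched in a few lines of Python. This is an interpretation of the paragraph above rather than the tool's actual code: `choose_description` is a hypothetical helper, and it assumes an existing database description is kept unless the winning source has priority 1.

```python
def choose_description(candidates, existing=None):
    """Pick a description from (priority, text) pairs, one pair per source.

    Lower priority numbers win. An existing description is only replaced
    when the winning source has priority 1 (assumed reading of the docs).
    """
    usable = [(priority, text) for priority, text in candidates if text]
    if not usable:
        return existing
    priority, text = min(usable, key=lambda item: item[0])
    if existing and priority != 1:
        return existing
    return text


# Invented descriptions from two sources
print(choose_description([(2, "from blast hit"), (1, "from gff product")]))
print(choose_description([(2, "from blast hit")], existing="already in database"))
```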
[GENE_NAMES]
GFF = [ 1 gene->Name /(.+)/ ]
[TRANSCRIPT_NAMES]
GFF = [ 1 SELF->Name /(.+)/ ]
These are used to set synonyms for each stable_id. If multiple files are specified, each separate synonym will be added to the database. In this case, if the first number in the value array is 1, the first synonym from this source will be added to the database as a display_name, shown in preference to the stable_id.
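A small Python sketch of this behaviour, under stated assumptions: `assign_names` is a hypothetical helper, duplicate names are collapsed, and the name chosen as display_name is also kept in the synonym list (the source does not say whether it is).

```python
def assign_names(stable_id, sources):
    """Collect synonyms and a display name from (priority, [names]) pairs.

    Every distinct name becomes a synonym. The first name from a
    priority-1 source becomes the display name; otherwise the stable_id
    is shown.
    """
    synonyms = []
    display_name = None
    for priority, names in sources:
        for name in names:
            if name not in synonyms:
                synonyms.append(name)
        if priority == 1 and names and display_name is None:
            display_name = names[0]
    return (display_name or stable_id), synonyms


# Invented names from two sources for one gene
print(assign_names("GENE0001", [(2, ["abc1"]), (1, ["Hsp70", "abc1"])]))
```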
[DBXREFS]
; KEY = [ EXTERNAL_DB_ID NAME ACCESSION_REGEX DISPLAY_NAME_REGEX ]
GO = [ 1000 GO /^goslim_goa:GO:(.+)/ ]
INTERNAL = [ 9999 Internal /^Internal:(.+)/ /^Internal:(.+)/ ]
REFSEQ_MRNA = [ 1801 RefSeq_mRNA /^Genbank:(NM_.+)/ ]
REFSEQ_MRNA_PRED = [ 1806 RefSeq_mRNA_predicted /^Genbank:(XM_.+)/ ]
REFSEQ_PEPTIDE = [ 1810 RefSeq_peptide /^Genbank:(NP_.+)/ ]
REFSEQ_PEPTIDE_PRED = [ 1815 RefSeq_peptide_predicted /^Genbank:(XP_.+)/ ]
REFSEQ_RNA = [ 1820 RefSeq_rna /^Genbank:(NR_.+)/ ]
REFSEQ_RNA_PRED = [ 1825 RefSeq_rna_predicted /^Genbank:(XR_.+)/ ]
ENTREZGENE = [ 1300 EntrezGene /^GeneID:(.+)/ ]
UNIPROT = [ 2250 UniProtKB_all /^UniProtKB:(.+)/ ]
Pattern matching to associate Dbxref attributes in the GFF with the correct database. For each Dbxref in the `.gff` file, the value array contains the Ensembl external_db_id, the display_name for the external_db and regular expressions to extract the database accession and display name from any additional information in the string.
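The matching step can be pictured with the following Python sketch, which copies a few of the rules from the stanza above into a dictionary and tests a Dbxref string against each accession pattern in turn. `classify_dbxref` is a hypothetical name, the example Dbxref values are invented, and falling back to the accession when no display name pattern is given is an assumption.

```python
import re

# KEY: (external_db_id, name, accession regex, optional display name regex)
DBXREFS = {
    "GO": (1000, "GO", r"^goslim_goa:GO:(.+)", None),
    "INTERNAL": (9999, "Internal", r"^Internal:(.+)", r"^Internal:(.+)"),
    "REFSEQ_MRNA_PRED": (1806, "RefSeq_mRNA_predicted", r"^Genbank:(XM_.+)", None),
    "REFSEQ_PEPTIDE_PRED": (1815, "RefSeq_peptide_predicted", r"^Genbank:(XP_.+)", None),
    "ENTREZGENE": (1300, "EntrezGene", r"^GeneID:(.+)", None),
    "UNIPROT": (2250, "UniProtKB_all", r"^UniProtKB:(.+)", None),
}


def classify_dbxref(dbxref):
    """Return (external_db_id, db_name, accession, display_name) or None."""
    for db_id, name, accession_re, display_re in DBXREFS.values():
        hit = re.search(accession_re, dbxref)
        if hit is None:
            continue
        accession = hit.group(1)
        display = accession  # assumed fallback when no display regex is set
        if display_re is not None:
            shown = re.search(display_re, dbxref)
            if shown is not None:
                display = shown.group(1)
        return db_id, name, accession, display
    return None  # no rule matched this Dbxref


# Invented Dbxref values
for value in ["Genbank:XM_000001.1", "GeneID:12345", "Unknown:abc"]:
    print(value, classify_dbxref(value))
```

A Dbxref that matches no rule returns `None` here, so unexpected prefixes stand out rather than being attached to the wrong external database.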
[EXTERNAL_DBS]
Used to add any additional external databases to the external_db table in the Ensembl database, if required to support `[DBXREFS]`.