Description
A brief description of the easy import pipeline
Easy import approaches Ensembl data import from a file-based perspective, beginning with sequence (.fasta) and annotation (.gff) files and optionally loading additional data directly from program output files.
Configuration
All parameters for both EasyMirror and EasyImport are set and passed to scripts through .ini files. The configuration file format mirrors the use of .ini files in ensembl-webcode, which has the benefit that a user who sets up an Ensembl instance using EasyImport will gain familiarity with the syntax conventions needed to further customise the instance. Use of configuration files rather than command-line parameters also improves reproducibility, since all parameters must be saved for the pipeline to be run.
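As a sketch of the general shape (section and key names here are illustrative placeholders, not the exact EasyImport parameter names), a core import .ini might contain:

    ; Illustrative placeholders only, not the exact EasyImport schema
    [DATABASE_CORE]
      NAME = genus_species_core_38_91_1
      HOST = localhost
      PORT = 3306
      USER = importer
      PASS = secret

    [FILES]
      SCAFFOLD = assembly/scaffolds.fa
      GFF      = annotation/genes.gff3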
Server setup
Setting up an Ensembl server to host locally imported data is essentially the same as setting up a mirror site (with only a few changes to the configuration files), so the server setup and hosting steps simply reuse easy mirror.
Core import
The challenge for a file-based approach to data import is dealing with the fragmentation of the .gff format and the diverse ways that names, IDs and descriptions can be captured, both in .gff attributes and across a range of other file types. Most of the effort in designing easy import has gone into creating a mechanism that allows gene models and annotations to be extracted from diverse, potentially "broken" files while maintaining a full and clear record of the import procedure to ensure full reproducibility.
One drawback to this approach is that separate configuration files are required for conceptually separate parts of the pipeline. This means that database connection details are repeated across multiple .ini files, so care must be taken to use the correct template when making changes to the default settings.
Database creation
Consistency with the latest Ensembl database schema is ensured by importing the schema for each new core database from the ensembl table.sql file, then running the ensembl-production populate_production_db_tables.pl script with an existing database of the same schema version as a template.
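For illustration, the schema import reduces to commands along these lines (database name and credentials are placeholders; easy import performs the equivalent steps using values from the .ini files):

    # placeholders throughout
    mysql -u importer -p -e 'CREATE DATABASE genus_species_core_38_91_1'
    mysql -u importer -p genus_species_core_38_91_1 < ensembl/sql/table.sql
    # populate_production_db_tables.pl (from ensembl-production) is then run,
    # using an existing database of the same schema version as a template,
    # to fill the production (controlled-vocabulary) tables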
Sequence data import
Scaffold/contig data will almost always be available in .fasta format (or possibly .fasta plus .agp). This is also the starting point for the Ensembl import pipeline, so sequence import uses scripts directly from ensembl-pipeline.
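An invocation of the ensembl-pipeline loader looks roughly like this (connection details, coordinate system name and file names are placeholders; easy import supplies the real values from the .ini configuration):

    perl ensembl-pipeline/scripts/load_seq_region.pl \
         -dbhost localhost -dbport 3306 -dbuser importer -dbpass secret \
         -dbname genus_species_core_38_91_1 \
         -coord_system_name scaffold -rank 1 -default_version \
         -sequence_level -fasta_file assembly/scaffolds.fa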
Summary statistics
Basic summary statistics can be calculated at the beginning of the core import. A summary of .gff attribute counts by feature type is particularly useful in determining which fields may contain IDs, descriptions, etc., and whether any discrepancies in the counts of each attribute suggest that the .gff needs to be repaired by the gff parser.
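The tally itself is straightforward; a minimal standalone sketch of the idea (not easy import's own code) is:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Tally column-nine attribute keys per feature type in a GFF file.
    # Discrepancies between a feature count and its attribute counts can
    # indicate records that need repair.
    my %count;
    while (<>) {
        next if /^#/;
        chomp;
        my @col = split /\t/;
        next unless @col >= 9;
        my ( $type, $attrs ) = @col[ 2, 8 ];
        $count{$type}{'(features)'}++;
        for my $pair ( split /;/, $attrs ) {
            my ($key) = split /=/, $pair;
            next unless defined $key;
            $key =~ s/^\s+|\s+$//g;
            $count{$type}{$key}++ if length $key;
        }
    }
    for my $type ( sort keys %count ) {
        print "$type\n";
        printf "  %s: %d\n", $_, $count{$type}{$_}
            for sort keys %{ $count{$type} };
    }

If, for example, a file reports 10,000 mRNA features but only 9,800 ID attributes, the missing IDs will need to be repaired before import.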
Parsing and repairing GFF
Our flexible gff parser lies at the heart of easy import. It has been designed to embrace the diversity of real-world .gff files by allowing full customisation of expected relationships and properties, with functions to repair, warn about or ignore errors during validation. gff parser is a Perl module that provides a mechanism to assign expectations and validation rules to specific .gff feature types while having very little hard-coded awareness of any official gff specification, which gives it the flexibility to handle many more edge cases than other parsers. For easy import, a subset of the full functionality can be controlled through a meta-syntax in the core import .ini files, and while the gff parser can accommodate most forms of .gff, including .gtf, easy import requires a .gff3-compatible format in column nine. This approach initially adds some complexity to the parameter specification, but many patterns can be reused across most .gff3 files and the benefit is that all modifications to the .gff3 can be preserved in the .ini file.
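To give a flavour of the approach (the keys and values below are invented for this sketch and do not reproduce the exact meta-syntax; see the easy import documentation for the real parameter names), a rule set might say:

    ; Invented keys/values, for flavour only
    [MRNA]
      PARENT     = GENE     ; every mRNA must have a gene parent
      MISSING_ID = make     ; repair: generate an ID if none is present
      ORPHAN     = warn     ; report, but do not abort, on a missing parent

Because repairs are made under rules like these, the changes applied to the original .gff remain reproducible from the .ini file alone.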
Gene model import
While the gff parser takes care of standardising .gff, additional functionality in easy import allows gene IDs, synonyms and descriptions to be retrieved from .gff attributes as well as from .fasta headers and simple text files. For a community resource like Lepbase, this provides the flexibility to easily incorporate information from diverse sources as supplied by individual labs, rather than demanding a standardised format that could act as a deterrent to full data sharing. For others implementing this pipeline, it offers the flexibility to integrate with existing protocols without the need to reformat data prior to import.
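For example (again with invented key names, purely to illustrate the style), a configuration might point the importer at .fasta headers for descriptions and a two-column text file for synonyms:

    ; Invented key names, illustrative only
    [GENE_DESCRIPTIONS]
      FASTA = annotation/proteins.fa   ; description taken from header text
    [GENE_SYNONYMS]
      TSV   = annotation/synonyms.tsv  ; ID<TAB>synonym pairs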
Import verification
In our experience, most problems with data import from .gff files can be detected by comparing provider protein sequences with translations exported from an Ensembl database. easy import checks that the same IDs are present in each of these sets and that the sequence lengths are identical. The most common causes of differences are alternate interpretations of phase ([0,1,2] vs. [0,2,1]) and manual editing of the provided sequence file to terminate translations at the first stop codon.
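The check reduces to comparing two FASTA files by ID and length; a self-contained sketch of the same idea (not easy import's own code; file names are placeholders) is:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Compare two protein FASTA files by sequence ID and sequence length.
    my %provided = read_lengths('provider_proteins.fa');
    my %exported = read_lengths('ensembl_translations.fa');

    for my $id ( sort keys %provided ) {
        if ( !exists $exported{$id} ) {
            warn "$id missing from exported translations\n";
        }
        elsif ( $provided{$id} != $exported{$id} ) {
            warn "$id length mismatch: provider $provided{$id}, "
               . "export $exported{$id}\n";
        }
    }
    for my $id ( sort keys %exported ) {
        warn "$id missing from provider file\n" if !exists $provided{$id};
    }

    sub read_lengths {
        my ($file) = @_;
        my ( %len, $id );
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (<$fh>) {
            chomp;
            if (/^>(\S+)/) { $id = $1; $len{$id} = 0; }
            elsif ( defined $id ) { $len{$id} += length; }
        }
        close $fh;
        return %len;
    }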
Additional annotations
Some xrefs can be imported via Dbxref attributes in a .gff file; however, we have deliberately limited the extent to which additional annotations can be imported from .gff, owing to the complexity of validating additional feature types and of mapping from potentially variable attribute names to specific fields in the core database tables. Several xref types can be richly represented in the Ensembl database if all required attributes are provided, and this is easiest to ensure by working directly with program outputs. This also fits most closely with the Lepbase model of ensuring consistency across genomes from diverse sources by annotating features with consistent databases and parameters. easy import currently supports direct import of blastp, interproscan and repeatmasker output files.
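Schematically (key names invented for illustration), the configuration points each importer at the corresponding raw output file:

    ; Invented key names, illustrative only
    [XREFS]
      BLASTP       = annotation/swissprot.blastp.out
      INTERPROSCAN = annotation/proteins.fa.tsv
      REPEATMASKER = assembly/scaffolds.fa.out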
File export
Sequence file export (scaffold, protein and CDS) makes it simple to access bulk data files for analysis or to provide bulk downloads. In addition to the basic formats, further files are exported for use with our comparative analysis pipeline. Detailed summary statistics can also be exported in .json format; these are used on ensembl.lepbase.org to populate the assembly statistics tables and in our assembly stats and codon usage visualisations.
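The exported statistics are nested counts and lengths along these lines (the shape and key names shown are illustrative, not the exact schema):

    { "assembly": { "scaffold_count": 1234,
                    "span": 456789012,
                    "N50": 987654 },
      "codon_usage": { "TTT": 123456, "TTC": 98765 } }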
Search indexing
Ensembl supports only a very basic (direct MySQL) search out of the box; this is best replaced with a search plugin, so we have made index_database.pl available to generate an autocomplete/search index compatible with the GenomeHubs gh-ensembl-plugin.
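Like the rest of the pipeline, the indexer is driven by the same .ini files; an invocation would look something like the following (the script path and file name are placeholders; see the easy import documentation for the exact argument list):

    perl easy-import/core/index_database.pl genus_species_core_38_91_1.ini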
The final round of summary statistic generation also produces template files for use on the species pages of an Ensembl instance, including our highly detailed assembly stats plots as used on ensembl.lepbase.org.
Compara import
A powerful feature of Ensembl is the integration of single- and cross-species analyses in a single genome browser. At lepbase.org, we have recently implemented our own compara analysis, imported using easy import, and will soon add a full description of a containerised version of the analysis pipeline.