Description

Easy import takes a file-based approach to importing data into Ensembl: it begins with sequence and .gff files and can optionally load additional data directly from program output files.

Configuration

All parameters for both easy mirror and easy import are set and passed to scripts through .ini files. The configuration file format mirrors the use of .ini files in ensembl-webcode, which has the benefit that a user who sets up an Ensembl instance using easy import will gain familiarity with the syntax conventions needed to further customise the instance. Use of configuration files rather than command line parameters also improves reproducibility since all parameters must be saved for the pipeline to be run.
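As a rough illustration of the .ini-driven approach, the sketch below reads settings with Python's standard configparser. The section and key names (DATABASE_CORE, FILES, HOST and so on) are hypothetical placeholders rather than the actual easy import parameter names, and layering a second .ini over the first is an assumption about how shared defaults might be overridden.

```python
import configparser

# Hypothetical example only: section and key names are illustrative,
# not the actual easy import parameter names.
EXAMPLE_INI = """
[DATABASE_CORE]
NAME = example_species_core_1
HOST = localhost
PORT = 3306
RW_USER = importer

[FILES]
SCAFFOLD = example.scaffolds.fa.gz
GFF = example.genes.gff3.gz
"""


def load_settings(*ini_texts):
    """Merge one or more .ini documents; later ones override earlier ones."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep keys case-sensitive instead of lowercasing
    for text in ini_texts:
        parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


# Layer a species-specific override on top of the defaults (assumed usage).
settings = load_settings(EXAMPLE_INI, "[DATABASE_CORE]\nHOST = db.example.org\n")
print(settings["DATABASE_CORE"]["HOST"])
```

Because every value lives in a file, the exact parameters used for a run can be archived alongside the data.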

Server setup

Setting up an Ensembl server to host locally imported data is essentially the same as setting up a mirror site (with just a couple of changes to config files) so the server setup and hosting steps simply reuse easy mirror.

Core import

The challenge for a file-based approach to data import is dealing with the fragmentation of the .gff format and the diverse ways that names, ids and descriptions can be captured both in .gff attributes and across a range of other file types. Most of the effort in designing easy import has been in creating a mechanism to allow gene models and annotations to be extracted from diverse, potentially "broken" files while maintaining a full and clear record of the import procedure to ensure full reproducibility.

One drawback to this approach is that separate configuration files are required for conceptually separate parts of the pipeline. This means that database connection data are repeated across multiple .ini files so care must be taken to use the correct template when making changes to the default settings.

Database creation

Consistency with the latest Ensembl database schema is ensured in two steps. The schema for each new core database is imported from the ensembl table.sql file, and an existing database with the same schema version is then used as a template by the ensembl-production populate_production_db_tables.pl script.

Sequence data import

Scaffold/contig data will almost always be available in .fasta format (or possibly .fasta plus .agp). This is also the starting point for the Ensembl import pipeline, so sequence import uses scripts directly from ensembl-pipeline.

Summary statistics

Basic summary statistics can be calculated at the beginning of the core import. A summary of .gff attribute counts by feature type is particularly useful in determining which fields may contain IDs, descriptions, etc. and whether any discrepancies in the counts of each attribute suggest that the .gff needs to be repaired by the gff parser.
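A minimal sketch of this kind of attribute summary is given below. It is not the easy import implementation: the function name and the example records are invented for illustration, and it assumes tab-separated rows with key=value pairs in column nine.

```python
from collections import Counter, defaultdict


def attribute_counts(gff_lines):
    """Count feature types and column-nine attribute keys per feature type."""
    feature_counts = Counter()
    attr_counts = defaultdict(Counter)
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # this sketch simply skips malformed rows
        ftype = cols[2]
        feature_counts[ftype] += 1
        for pair in cols[8].split(";"):
            key, sep, _value = pair.strip().partition("=")
            if sep and key:
                attr_counts[ftype][key] += 1
    return feature_counts, attr_counts


# Invented example records
example = [
    "##gff-version 3",
    "scaf1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1;Name=abc",
    "scaf1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=t1;Parent=g1",
    "scaf1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=t2",
]
features, attrs = attribute_counts(example)
for ftype, total in features.items():
    for key, n in sorted(attrs[ftype].items()):
        print(f"{ftype}\t{key}\t{n}/{total}")
```

In this example, Parent is present on only one of the two mRNA features, which is the sort of discrepancy that suggests the file needs repair.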

Parsing and repairing GFF

Our flexible gff parser lies at the heart of easy import. It has been designed to embrace the diversity of real-world .gff files by allowing full customisation of expected relationships and properties, with functions to repair, warn about or ignore errors during validation. gff parser is a Perl module that provides a mechanism to assign expectations and validation rules to specific .gff feature types. It has very little hard-coded awareness of any official gff specification, which gives it the flexibility to handle many more edge cases than other parsers. For easy import, a subset of the full functionality can be controlled through a meta-syntax in the core import .ini files. While the gff parser can accommodate most forms of .gff, including .gtf, easy import requires a .gff3-compatible format in column nine. This approach initially adds some complexity to the parameter specification, but many patterns can be reused across most .gff3 files, and the benefit is that all modifications to the .gff3 can be preserved in the .ini file.
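The idea of attaching expectations to feature types, each with a repair, warn or ignore action, can be sketched in a few lines of Python. This is a conceptual illustration only: it does not reproduce the gff parser API or the .ini meta-syntax, and the rule table, check and repair functions are all hypothetical.

```python
import warnings


def check_has_parent(feature):
    """Expectation: the feature carries a Parent attribute."""
    return "Parent" in feature["attributes"]


def make_parent(feature):
    """Hypothetical repair: derive a parent ID from the feature's own ID."""
    feature["attributes"]["Parent"] = feature["attributes"]["ID"] + "_gene"


# feature type -> list of (check, action on failure, optional repair function)
RULES = {
    "mRNA": [(check_has_parent, "repair", make_parent)],
    "exon": [(check_has_parent, "warn", None)],
    "region": [(check_has_parent, "ignore", None)],
}


def validate(feature, rules=RULES):
    """Apply the rules for this feature type; return a log of actions taken."""
    log = []
    for check, action, repair in rules.get(feature["type"], []):
        if check(feature):
            continue
        if action == "repair":
            repair(feature)
            log.append(f"repaired {check.__name__}")
        elif action == "warn":
            warnings.warn(f"{feature['type']} failed {check.__name__}")
            log.append(f"warned {check.__name__}")
        # "ignore": accept the feature unchanged and record nothing
    return log


mrna = {"type": "mRNA", "attributes": {"ID": "t1"}}
print(validate(mrna), mrna["attributes"])
```

Keeping the rules in a data structure rather than in code is what makes it possible to record every repair alongside the other import parameters.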

Gene model import

While the gff parser takes care of standardising .gff, additional functionality in easy import allows for retrieval of gene IDs, synonyms and descriptions from .gff attributes as well as from .fasta headers and simple text files. For a community resource like Lepbase, this provides the flexibility to easily incorporate information from diverse sources as supplied by individual labs rather than demanding a standardised format, which could act as a deterrent to full data sharing. For others implementing this pipeline, it offers the flexibility to integrate with existing protocols without the need to reformat data prior to import.
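As an example of pulling names and descriptions from a source other than the .gff, the sketch below extracts an ID and free-text description from .fasta headers. The header layout (ID, whitespace, description) and the function name are assumptions made for illustration; real provider files vary, which is why the patterns need to be configurable.

```python
import re

# Assumed header layout: ">ID optional free-text description"
HEADER = re.compile(r"^>(?P<id>\S+)(?:\s+(?P<description>.*))?$")


def parse_headers(fasta_lines):
    """Map each sequence ID to its description (empty string if absent)."""
    descriptions = {}
    for line in fasta_lines:
        match = HEADER.match(line.rstrip("\r\n"))
        if match:
            description = match.group("description") or ""
            descriptions[match.group("id")] = description.strip()
    return descriptions


# Invented example records
print(parse_headers([">g1 heat shock protein", "MKT", ">g2"]))
```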

Import verification

In our experience, most problems with data import from .gff files can be detected through comparison of provider protein sequences with translations exported from an Ensembl database. easy import checks that the same IDs are present in each of these sets, and that the sequence lengths are identical. The most common causes of differences are alternate interpretations of phase ([0,1,2] vs. [0,2,1]), and manual editing of the provided sequences file to terminate translations at the first stop codon.
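The comparison described above can be sketched as follows. This is a simplified stand-in for the easy import check, with invented function names and example sequences; it only reports which IDs are missing from either set and which shared IDs differ in length.

```python
def read_fasta(lines):
    """Parse FASTA lines into {id: sequence}, using the first header token as ID."""
    seqs, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            seqs[current] = []
        elif current is not None and line:
            seqs[current].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in seqs.items()}


def compare_proteins(provider, exported):
    """Report IDs missing from either set and IDs whose lengths differ."""
    missing_from_export = sorted(set(provider) - set(exported))
    missing_from_provider = sorted(set(exported) - set(provider))
    length_mismatch = {
        seq_id: (len(provider[seq_id]), len(exported[seq_id]))
        for seq_id in sorted(set(provider) & set(exported))
        if len(provider[seq_id]) != len(exported[seq_id])
    }
    return missing_from_export, missing_from_provider, length_mismatch


# Invented example: p1 is truncated in the export, p2 is missing, p3 is extra.
provider = read_fasta([">p1 desc", "MKT", "AA", ">p2", "MM"])
exported = read_fasta([">p1", "MKT", ">p3", "M"])
print(compare_proteins(provider, exported))
```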

Additional annotations

Some xrefs can be imported via Dbxref attributes in a .gff file. However, we have deliberately limited the extent to which additional annotations can be imported from .gff, due to the complexity of validating additional feature types and of mapping from potentially variable attribute names to specific fields in the core database tables. Several xref types can be richly represented in the Ensembl database if all required attributes are provided, and this is easiest to ensure by working directly with program outputs. This also fits most closely with the Lepbase model of ensuring consistency across genomes from diverse sources by annotating features with consistent databases/parameters. easy import currently supports direct import of blastp, interproscan and repeatmasker output files.

File export

Sequence file export (scaffold, protein and cds) makes it simple to access the bulk data files for analysis or to provide bulk downloads. In addition to the basic formats, further files are exported for use with our comparative analysis pipeline. Detailed summary statistics can also be exported in .json format; these are used on ensembl.lepbase.org to populate the assembly statistics tables and in our assembly stats and codon usage visualisations.
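To give a sense of what a .json statistics export might contain, the sketch below computes a few common assembly metrics and serialises them. The field names (scaffold_count, total_length, n50, gc_percent) are hypothetical and do not represent the actual Lepbase .json schema; the scaffold sequences are invented.

```python
import json


def assembly_stats(sequences):
    """Compute basic assembly statistics from {scaffold_id: sequence}."""
    lengths = sorted((len(seq) for seq in sequences.values()), reverse=True)
    total = sum(lengths)
    # N50: length of the scaffold at which the running total of
    # length-sorted scaffolds first reaches half of the assembly span.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    gc = sum(
        seq.upper().count("G") + seq.upper().count("C")
        for seq in sequences.values()
    )
    return {
        "scaffold_count": len(lengths),
        "total_length": total,
        "n50": n50,
        "gc_percent": round(100 * gc / total, 2) if total else 0.0,
    }


scaffolds = {"s1": "ACGTACGTAC", "s2": "GGCC", "s3": "AT"}
print(json.dumps(assembly_stats(scaffolds), indent=2))
```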

Search indexing

Ensembl supports a very basic (direct MySQL) search out of the box. This is best replaced with a search plugin, so we have made index_database.pl available to generate an autocomplete/search index compatible with the lepbase search plugin.

The final round of summary statistic generation also produces template files for use on the species pages of an Ensembl instance, including our highly detailed assembly stats plots as used on ensembl.lepbase.org.

Compara import

A powerful feature of Ensembl is the integration of single and cross-species analyses in a single genome browser. At lepbase.org, we have recently finished implementing our own compara analysis for import using easy import and will add a full description of the analysis pipeline soon.