EasyImport approaches [Ensembl](πŸ”—) data import from a file-based perspective, beginning with sequence and `.gff` files and optionally loading additional data directly from program output files.

## Configuration

All parameters for both EasyMirror and EasyImport are set and passed to scripts through `.ini` files. The configuration file format mirrors the use of `.ini` files in [ensembl-webcode](πŸ”—), which has the benefit that a user who sets up an [Ensembl](πŸ”—) instance using EasyImport will gain familiarity with the syntax conventions needed to further customise the instance. Use of configuration files rather than command-line parameters also improves reproducibility, since all parameters must be saved to file before the pipeline can be run.
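
As an illustration of the format, a minimal configuration might contain a database connection block like the one below; the section and key names here are hypothetical placeholders rather than EasyImport's actual parameter names:

```ini
; Hypothetical example: section and key names are illustrative only,
; not EasyImport's actual parameters
[DATABASE_CORE]
  NAME    = genus_species_core_32_85_1
  HOST    = localhost
  PORT    = 3306
  RW_USER = importer
  RW_PASS = CHANGEME
```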

## Server setup

Setting up an [Ensembl](πŸ”—ο»Ώ) server to host locally imported data is essentially the same as setting up a mirror site (with just a couple of changes to config files) so the server setup and hosting steps simply reuse [easy mirror](πŸ”—ο»Ώ).

## Core import

The challenge for a file-based approach to data import is dealing with the fragmentation of the `.gff` format and the diverse ways that names, IDs and descriptions can be captured, both in `.gff` attributes and across a range of other file types. Most of the effort in designing [easy import](πŸ”—) has gone into creating a mechanism that allows gene models and annotations to be extracted from diverse, potentially "broken" files while maintaining a clear record of the import procedure to ensure full reproducibility.

One drawback of this approach is that separate configuration files are required for conceptually separate parts of the pipeline. This means that database connection details are repeated across multiple `.ini` files, so care must be taken to use the correct template when making changes to the default settings.

### Database creation

Consistency with the latest [Ensembl](πŸ”—) database schema is ensured by importing the schema for each new core database from the [ensembl](πŸ”—) `table.sql` file, then running the [ensembl-production](πŸ”—) `populate_production_db_tables.pl` script with an existing database of the same schema version as a template.
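
A minimal sketch of the first part of this step, assuming a local `mysql` client and an `ensembl` repository checkout; the database name, path and credentials are placeholders:

```python
# Sketch of core database creation: make an empty database, then load the
# schema shipped with the ensembl repository (placeholder credentials/paths).
import subprocess

DB = "genus_species_core_32_85_1"  # hypothetical core database name
MYSQL = ["mysql", "-h", "localhost", "-u", "importer", "-pCHANGEME"]

# create an empty core database
subprocess.run(MYSQL + ["-e", f"CREATE DATABASE IF NOT EXISTS {DB}"], check=True)

# load the Ensembl core schema from the ensembl repository checkout
with open("ensembl/sql/table.sql") as schema:
    subprocess.run(MYSQL + [DB], stdin=schema, check=True)
```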

### Sequence data import

Scaffold/contig data will almost always be available in `.fasta` format (or possibly `.fasta` plus `.agp`). This is also the starting point for the [Ensembl](πŸ”—) import pipeline, so sequence import uses scripts directly from [ensembl-pipeline](πŸ”—).
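
The essence of what this step consumes can be sketched in a few lines: scaffold names and lengths read from a `.fasta` file. The real import is handled by the [ensembl-pipeline](πŸ”—) scripts, and the file name here is a placeholder:

```python
# Sketch only: report scaffold names and lengths from a .fasta file,
# the basic information a sequence-level import needs per seq_region.
def seq_region_lengths(fasta_path):
    lengths, name = {}, None
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.rstrip()
            if line.startswith(">"):
                name = line[1:].split()[0]  # scaffold name up to first space
                lengths[name] = 0
            elif name:
                lengths[name] += len(line)
    return lengths

for name, length in seq_region_lengths("assembly.fa").items():
    print(name, length)
```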

### Summary statistics

Basic summary statistics can be calculated at the beginning of the core import. A summary of `.gff` attribute counts by feature type is particularly useful in determining which fields may contain IDs, descriptions, etc., and whether any discrepancies in the counts of each attribute suggest that the `.gff` needs to be repaired by the [gff parser](πŸ”—).
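
A minimal sketch of such a summary, tallying column-nine attribute keys per feature type so that mismatched counts (for example, fewer `Name` than `ID` attributes on mRNA features) stand out; the file name is a placeholder:

```python
# Sketch: count .gff column-nine attribute keys per feature type.
from collections import Counter, defaultdict

counts = defaultdict(Counter)
with open("annotations.gff3") as gff:
    for line in gff:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature_type, attributes = cols[2], cols[8]
        for attribute in attributes.split(";"):
            key = attribute.split("=")[0].strip()
            if key:
                counts[feature_type][key] += 1

for feature_type, counter in sorted(counts.items()):
    print(feature_type, dict(counter))
```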

### Parsing and repairing GFF

Our flexible [gff parser](πŸ”—) lies at the heart of [easy import](πŸ”—). It has been designed to embrace the diversity of real-world `.gff` files by allowing full customisation of expected relationships and properties, with functions to repair, warn about or ignore errors during validation. [gff parser](πŸ”—) is a Perl module that provides a mechanism to assign expectations and validation rules to specific `.gff` feature types while having very little hard-coded awareness of any official [gff specification](πŸ”—), giving it the flexibility to handle many more edge cases than other parsers. For [easy import](πŸ”—), a subset of the full functionality can be controlled through a meta-syntax in the core import `.ini` files; while the [gff parser](πŸ”—) can accommodate most forms of `.gff`, including `.gtf`, [easy import](πŸ”—) requires a `.gff3`-compatible format in column nine. This approach initially adds some complexity to the parameter specification, but many patterns can be reused across most `.gff3` files, and the benefit is that all modifications to the `.gff3` are preserved in the `.ini` file.
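
The toy sketch below illustrates the general idea of attaching expectations to feature types, each with an action on failure; it is not the [gff parser](πŸ”—) module's actual API, and the rule and field names are invented for illustration:

```python
# Toy illustration of per-feature-type validation with configurable actions
# (not the gff parser's actual API; rule and field names are invented).
WARN, REPAIR = "warn", "repair"

# hypothetical rules: mRNA features must carry a Parent attribute; exon
# features missing an ID have one synthesised from their Parent
rules = {
    "mRNA": [("Parent", WARN)],
    "exon": [("ID", REPAIR)],
}

def validate(feature, rules):
    for attribute, action in rules.get(feature["type"], []):
        if attribute in feature["attributes"]:
            continue  # expectation met
        if action == REPAIR and "Parent" in feature["attributes"]:
            # repair: synthesise the missing attribute from the Parent ID
            feature["attributes"][attribute] = (
                feature["attributes"]["Parent"] + "." + feature["type"])
        else:
            print(f"warning: {feature['type']} lacks {attribute}")

exon = {"type": "exon", "attributes": {"Parent": "mRNA0001"}}
validate(exon, rules)
print(exon["attributes"])  # {'Parent': 'mRNA0001', 'ID': 'mRNA0001.exon'}
```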

### Gene model import

While the [gff parser](πŸ”—ο»Ώ) takes care of standardising `.gff`, additional functionality in [easy import](πŸ”—ο»Ώ) allows for retrieval of gene IDs, synonyms and descriptions from `.gff` attributes as well as from `.fasta` headers and simple text files. For a community resource like [Lepbase](πŸ”—ο»Ώ), this provides the flexibility to easily incorporate information from diverse sources as supplied by individual labs rather than demanding a standardised format, which could act as a deterrent to full data sharing. For others implementing this pipeline, it offers the flexibility to integrate with existing protocols without the need to reformat data prior to import.
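
As a sketch of one such alternative source, the snippet below pulls gene descriptions from `.fasta` headers; the header layout assumed here (`>ID description`) is a hypothetical example:

```python
# Sketch: map sequence IDs to free-text descriptions taken from .fasta
# headers, assuming headers of the form ">GENE0001 some description".
import re

def descriptions_from_fasta(path):
    descriptions = {}
    with open(path) as fasta:
        for line in fasta:
            match = re.match(r">(\S+)\s+(.+)", line)
            if match:
                descriptions[match.group(1)] = match.group(2).strip()
    return descriptions

for gene_id, description in descriptions_from_fasta("proteins.fa").items():
    print(gene_id, description)
```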

### Import verification

In our experience, most problems with data import from `.gff` files can be detected by comparing provider protein sequences with translations exported from an [Ensembl](πŸ”—) database. [easy import](πŸ”—) checks that the same IDs are present in each of these sets, and that the sequence lengths are identical. The most common causes of differences are alternate interpretations of phase ([0,1,2] vs. [0,2,1]), and manual editing of the provided sequence file to terminate translations at the first stop codon.
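
A minimal sketch of this check, comparing the two sets by ID and by sequence length; the file names are placeholders:

```python
# Sketch: compare provider protein sequences against translations exported
# from the new core database, reporting ID and length discrepancies.
def read_fasta(path):
    seqs, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = ""
            elif name:
                seqs[name] += line
    return seqs

provided = read_fasta("provider_proteins.fa")      # placeholder file names
exported = read_fasta("exported_translations.fa")

missing = provided.keys() - exported.keys()
extra = exported.keys() - provided.keys()
length_mismatch = [pid for pid in provided.keys() & exported.keys()
                   if len(provided[pid]) != len(exported[pid])]

print(f"{len(missing)} missing, {len(extra)} extra, "
      f"{len(length_mismatch)} length mismatches")
```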

### Additional annotations

Some xrefs can be imported via `Dbxref` attributes in a `.gff` file; however, we have deliberately limited the extent to which additional annotations can be imported from `.gff`, owing to the complexity of validating additional feature types and of mapping from potentially variable attribute names to specific fields in the core database tables. Several xref types can be richly represented in the Ensembl database if all required attributes are provided, and this is easiest to ensure by working directly with program outputs. This also fits most closely with the [Lepbase](πŸ”—) model of ensuring consistency across genomes from diverse sources by annotating features with consistent databases/parameters. [easy import](πŸ”—) currently supports direct import of `blastp`, `interproscan` and `repeatmasker` output files.
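
As an illustration of working directly with program output, the sketch below reduces `blastp` tabular output (`-outfmt 6`) to one best hit per query; the best-hit criterion and the printed record layout are illustrative, not [easy import](πŸ”—)'s actual behaviour:

```python
# Sketch: pick the lowest-evalue hit per query from blastp -outfmt 6 output
# (columns: qseqid sseqid pident length mismatch gapopen qstart qend
#  sstart send evalue bitscore).
best_hits = {}
with open("blastp.tsv") as blast:  # placeholder file name
    for line in blast:
        cols = line.rstrip("\n").split("\t")
        query, subject, evalue = cols[0], cols[1], float(cols[10])
        if query not in best_hits or evalue < best_hits[query][1]:
            best_hits[query] = (subject, evalue)

for query, (subject, evalue) in best_hits.items():
    print(f"xref: {query} -> {subject} (evalue {evalue:g})")
```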

### File export

Sequence file export (scaffold, protein and cds) makes it simple to access the bulk data files for analysis or to provide bulk downloads. Beyond these basic formats, further files are exported for use with our comparative analysis pipeline. Detailed summary statistics can also be exported in `.json` format; these are used on [ensembl.lepbase.org](πŸ”—) to populate the assembly statistics tables and in our [assembly stats](πŸ”—) and [codon usage](πŸ”—) visualisations.
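
A sketch of the kind of statistic exported, here computing assembly span and N50 and emitting `.json`; the key names are placeholders, not the exact schema used on [ensembl.lepbase.org](πŸ”—):

```python
# Sketch: compute basic assembly statistics and export them as .json
# (key names are illustrative placeholders).
import json

def assembly_stats(lengths):
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running, n50 = 0, None
    for length in lengths:
        running += length
        if running * 2 >= total:  # first scaffold to pass half the span
            n50 = length
            break
    return {"span": total, "scaffolds": len(lengths), "n50": n50}

print(json.dumps(assembly_stats([500, 400, 300, 200, 100]), indent=2))
```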

### Search indexing

[Ensembl](πŸ”—) supports only a very basic (direct `MySQL`) search out of the box; this is best replaced with a search plugin, so we have made `index_database.pl` available to generate an autocomplete/search index compatible with the GenomeHubs [gh-ensembl-plugin](πŸ”—).
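
A toy sketch of the prefix-based autocomplete lookup such an index supports; this is not `index_database.pl`'s actual output format:

```python
# Toy sketch: build a prefix -> terms index for autocomplete, of the kind a
# search plugin can query (not index_database.pl's actual format).
from collections import defaultdict

def build_index(terms, min_prefix=3):
    index = defaultdict(set)
    for term in terms:
        lower = term.lower()
        for end in range(min_prefix, len(lower) + 1):
            index[lower[:end]].add(term)
    return index

index = build_index(["cortex", "corin", "wingless"])
print(sorted(index["cor"]))  # ['corin', 'cortex']
```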

The final round of summary statistic generation also produces template files for use on the species pages of an [Ensembl](πŸ”—ο»Ώ) instance, including our highly detailed [assembly stats](πŸ”—ο»Ώ) plots as used on [ensembl.lepbase.org](πŸ”—ο»Ώ).

## Compara import

A powerful feature of [Ensembl](πŸ”—) is the integration of single- and cross-species analyses in a single genome browser. At [lepbase.org](πŸ”—), we have recently finished implementing our own compara analysis for import using [easy import](πŸ”—) and will soon add a full description of a containerised version of the analysis pipeline.