Background

Typical genome projects generate large volumes of data that remain valuable well beyond the initial funding cycle. To maximise this value, the data must be made accessible to the widest possible community of users. Standardisation of data formats, and of both programmatic and user interfaces, is essential to reduce the training required to access new datasets and to facilitate comparative analyses at a variety of scales.

These considerations are central to the Lepbase project. Lepbase is a taxon-oriented genomic resource for the Lepidoptera, and one of the core services we provide is to standardise and make accessible genome data from past and present genome projects. By using an Ensembl instance for our genome browser, we are able to offer a familiar and standardised interface to each of the lepidopteran genomes that we host. From an archival perspective, we are able to store data in a format that will have long-term support, with a mature database structure and codebase. This codebase also gives us access to a powerful application programming interface (API) to facilitate large-scale comparative analysis and data-mining.

Setting up an Ensembl server, even to create a local mirror of existing content, has long been considered non-trivial. The number of dependencies, the complexity of the code and the interconnected configuration files can make it difficult to trace the cause of problems during installation.

One of our earliest tasks at Lepbase was to make it easy to set up an Ensembl webserver, so that we could run multiple instances for development and testing and move our site between virtual machines without worrying about missing dependencies. A related project, easy mirror, is the result of generalising this approach to simplify setting up a mirror of any Ensembl or Ensembl Genomes (including Bacteria, Metazoa, Fungi, Plants and Protists) species, with none, some or all of the data hosted locally.

With the setup of an Ensembl mirror reduced to four simple steps, we then set about making the import of sequence, gene model and annotation data into a core database similarly straightforward in easy import. As we add further datatypes to Lepbase, we are continuing to extend easy import beyond the core database. Most recently we have added compara import, based on our own orthology pipeline, and we are working on variation import.