`-s` Create database and load sequence data

This step consists of two scripts. import_sequences.pl must be run to set up a core database and load sequence data. import_sequence_synonyms.pl is optional and will be run if you have a list of alternate scaffold/contig names or if alternate names are generated by the first script (see configuration options).

docker run --rm \
           --name easy-import-operophtera_brumata_v1_core_32_85_1 \
           --link genomehubs-mysql \
           -v ~/demo/genomehubs-import/import/conf:/import/conf \
           -v ~/demo/genomehubs-import/import/data:/import/data \
           -e DATABASE=operophtera_brumata_v1_core_32_85_1 \
           -e FLAGS="-s" \

Configuration options

    LOCAL = /ensembl

LOCAL is the path to the Ensembl repositories on the local host and should be set to the same value as [WEBSITE] SERVER_ROOT.

    NAME = bombyx_mori_core_31_84_1
    HOST = localhost
    PORT = 3306
    RO_USER = anonymous
    RO_PASS =

Connection details for an existing local (or remote) database that uses the same schema version as the current import. This database is used as a datasource so that tables containing data that do not change across species are filled consistently.

    NAME = operophtera_brumata_v1_core_31_84_1
    HOST = localhost
    PORT = 3306
    RW_USER = importer
    RW_PASS = importpassword
    RO_USER = anonymous
    RO_PASS =

Contains the name and connection parameters for the core database that will be created for the current species/assembly. The numbering after _core_ should follow the pattern of [DATABASE_TEMPLATE]. Connection parameters should be as defined in Step 1.2: Setup database connections.
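
The naming rule above can be checked before running the import. The following is a minimal Python sketch and is not part of easy-import: the helper name core_suffix is hypothetical, and it assumes that the version numbering is everything that follows the last `_core_` in a database name.

```python
def core_suffix(db_name: str) -> str:
    """Return the numbering that follows '_core_' in a database name."""
    marker = "_core_"
    if marker not in db_name:
        raise ValueError(f"{db_name!r} does not contain {marker!r}")
    # rpartition splits on the last occurrence of the marker,
    # so the species/assembly part of the name is left untouched.
    return db_name.rpartition(marker)[2]


# Names taken from the example configuration in this section.
template_name = "bombyx_mori_core_31_84_1"
core_name = "operophtera_brumata_v1_core_31_84_1"

# True when both databases use the same numbering after _core_.
same_numbering = core_suffix(template_name) == core_suffix(core_name)
print(same_numbering)
```

If the two suffixes differ, the template and the new core database are probably on different schema versions, which is worth resolving before importing.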

    NAME = ncbi_taxonomy
    HOST = localhost
    PORT = 3306
    RO_USER = anonymous
    RO_PASS =

Connection details for a copy of the (Ensembl format) ncbi_taxonomy database, used to fill in the taxonomic hierarchy in the meta table during import.

    SPECIES.PRODUCTION_NAME = Operophtera_brumata_v1
    SPECIES.SCIENTIFIC_NAME = Operophtera brumata
    SPECIES.COMMON_NAME = Winter moth
    SPECIES.DISPLAY_NAME = Operophtera brumata v1
    SPECIES.DIVISION = EnsemblMetazoa
    SPECIES.URL = Operophtera_brumata_v1
    SPECIES.ALIAS = [ operophtera_brumata operophtera_brumata_v1 operophtera%20brumata winter%20moth ]
    ASSEMBLY.DATE = 2015-08-11
    ASSEMBLY.ACCESSION = GCA_001266575.1
    PROVIDER.NAME = Wageningen University
    PROVIDER.URL = http://www.bioinformatics.nl/wintermoth

Metadata for the current import. These fields should be edited to suit the current import and are used either during this import pipeline or as a datasource for parts of the Ensembl website.

    SCAFFOLD = [ fa http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz ]

Details of the sequence file(s) to be imported.

  • If a SCAFFOLD file of type fa is provided, then a CONTIG file is optional and vice versa.

  • SCAFFOLD data can also be imported from an agp file provided CONTIG sequences are provided.

  • If no CONTIG file is provided, contigs will be imputed from runs of N in the SCAFFOLD sequence.

Options in the [MODIFY] section control how an existing database and its sequence tables are handled:

  • If OVERWRITE_DB is set to 1, running this script will cause any existing database with the same [DATABASE_CORE] NAME to be dropped and recreated before any data are imported.

  • Setting TRUNCATE_SEQUENCE_TABLES to 1 will truncate any existing sequence tables before importing.

  • If these values are left unset, additional data will be added to the existing database/sequence tables. This may have unintended consequences, so proceed with caution.
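
To illustrate what imputing contigs from runs of N in a SCAFFOLD sequence means, here is a minimal Python sketch. It is not the code used by import_sequences.pl: the function name impute_contigs is hypothetical, and it assumes that any run of one or more N characters is treated as a gap, whereas the real script may apply a minimum run length.

```python
import re


def impute_contigs(scaffold_seq: str):
    """Split a scaffold into contigs at runs of N.

    Returns a list of (start, end, sequence) tuples using 1-based,
    inclusive scaffold coordinates.
    """
    contigs = []
    # Each match is a maximal stretch of sequence containing no N or n.
    for match in re.finditer(r"[^Nn]+", scaffold_seq):
        contigs.append((match.start() + 1, match.end(), match.group()))
    return contigs


for start, end, seq in impute_contigs("ACGTNNNNGGCCNNA"):
    print(start, end, seq)
```

Each tuple gives the position of one imputed contig on the scaffold, which is the information needed to describe how contigs are assembled into that scaffold.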


    HEADER = 1
    SCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]
    CONTIG = [ /(.+)/ /ctg_/contig/ ]

  • To use a file as a source of scaffold name synonyms for import_sequence_synonyms.pl, [FILES] SCAFFOLD_NAMES must be set. The HEADER flag may be used to indicate that the file has a header row that should be skipped during import.

  • Alternatively, match-and-replace regular expressions may be defined for SCAFFOLD and/or CONTIG names to automatically generate a file of synonyms during sequence import.
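
The effect of a match-and-replace pair such as `/(.+)/ /scaf_/scaffold/` can be sketched in Python. This is an illustration only and makes an assumption about the semantics: the first pattern is taken to select the name through its first capture group, and the second is taken to be a single substitution applied to that captured text. The function name make_synonym is hypothetical; the actual behaviour is defined by the Perl import scripts.

```python
import re


def make_synonym(name, match_pattern, search, replacement):
    """Return a synonym for a sequence name, or None if there is none."""
    match = re.search(match_pattern, name)
    if match is None:
        return None
    captured = match.group(1)
    # count=1 mirrors a substitution without a global flag.
    synonym = re.sub(search, replacement, captured, count=1)
    # A name that is unchanged by the substitution has no synonym.
    return synonym if synonym != name else None


# Patterns taken from the example configuration above.
print(make_synonym("scaf_00001", r"(.+)", r"scaf_", "scaffold"))
print(make_synonym("ctg_42", r"(.+)", r"ctg_", "contig"))
```

Under these assumptions, a scaffold named scaf_00001 would gain the synonym scaffold00001, and names that do not contain the search pattern would gain no synonym.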