`-s` Create database and load sequence data

This step consists of two scripts. The first, import_sequences.pl, must be run to set up a core database and load sequence data. The second, import_sequence_synonyms.pl, is optional and is run only if you supply a list of alternate scaffold/contig names or if alternate names are generated by the first script (see configuration options).

docker run --rm \
           --name easy-import-operophtera_brumata_v1_core_32_85_1 \
           --link genomehubs-mysql \
           -v ~/demo/genomehubs-import/import/conf:/import/conf \
           -v ~/demo/genomehubs-import/import/data:/import/data \
           -e DATABASE=operophtera_brumata_v1_core_32_85_1 \
           -e FLAGS="-s" \
           genomehubs/easy-import:latest

Configuration options

[ENSEMBL]
    LOCAL = /ensembl

LOCAL is the path to the Ensembl repositories on the local host and should be set to the same value as [WEBSITE] SERVER_ROOT.

[DATABASE_TEMPLATE]
    NAME = bombyx_mori_core_31_84_1
    HOST = localhost
    PORT = 3306
    RO_USER = anonymous
    RO_PASS =

Connection details for an existing local (or remote) database that uses the same schema version as the current import. It is used as a data source so that tables whose contents do not change across species are filled consistently.

[DATABASE_CORE]
    NAME = operophtera_brumata_v1_core_31_84_1
    HOST = localhost
    PORT = 3306
    RW_USER = importer
    RW_PASS = importpassword
    RO_USER = anonymous
    RO_PASS =

Contains the name and connection parameters for the core database that will be created for the current species/assembly. The numbering after _core_ should follow the pattern of [DATABASE_TEMPLATE]. Connection parameters should be as defined in Step 1.2: Setup database connections.
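
The naming rule above can be checked before running the import. The following is a minimal Python sketch, not part of easy-import: the helper name core_suffix is hypothetical, and it assumes that "follow the pattern" means the numeric suffix after _core_ is the same in both names, which is an interpretation rather than something the pipeline documents.

```python
import re

def core_suffix(db_name):
    """Return the numeric suffix after '_core_' (e.g. '31_84_1'), or None.

    Hypothetical helper for illustration; easy-import does not provide it.
    """
    match = re.search(r"_core_(\d+_\d+_\d+)$", db_name)
    return match.group(1) if match else None

# Names taken from the [DATABASE_TEMPLATE] and [DATABASE_CORE] examples.
template_name = "bombyx_mori_core_31_84_1"
core_name = "operophtera_brumata_v1_core_31_84_1"

# Assumption: matching suffixes indicate matching schema versions.
print(core_suffix(template_name) == core_suffix(core_name))  # True
```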

[DATABASE_TAXONOMY]
    NAME = ncbi_taxonomy
    HOST = localhost
    PORT = 3306
    RO_USER = anonymous
    RO_PASS =

Connection details for a copy of the (Ensembl format) ncbi_taxonomy database, used to fill in the taxonomic hierarchy in the meta table during import.

[META]
    SPECIES.PRODUCTION_NAME = Operophtera_brumata_v1
    SPECIES.SCIENTIFIC_NAME = Operophtera brumata
    SPECIES.COMMON_NAME = Winter moth
    SPECIES.DISPLAY_NAME = Operophtera brumata v1
    SPECIES.DIVISION = EnsemblMetazoa
    SPECIES.URL = Operophtera_brumata_v1
    SPECIES.TAXONOMY_ID = 472141
    SPECIES.ALIAS = [ operophtera_brumata operophtera_brumata_v1 operophtera%20brumata winter%20moth ]
    ASSEMBLY.NAME = v1
    ASSEMBLY.DATE = 2015-08-11
    ASSEMBLY.ACCESSION = GCA_001266575.1
    ASSEMBLY.DEFAULT = v1
    PROVIDER.NAME = Wageningen University
    PROVIDER.URL = http://www.bioinformatics.nl/wintermoth
    GENEBUILD.ID = 1
    GENEBUILD.START_DATE = 2015-08
    GENEBUILD.VERSION = 1
    GENEBUILD.METHOD = import

Metadata for the current import. These fields should be edited to suit the current import and are used either during this import pipeline or as a datasource for parts of the Ensembl website.

[FILES]
    SCAFFOLD = [ fa http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz ]

Details of the sequence file(s) to be imported.

  • If a SCAFFOLD file of type fa is provided, then a CONTIG file is optional and vice versa.

  • SCAFFOLD data can also be imported from an agp file, as long as CONTIG sequences are supplied.

  • If no CONTIG file is provided, contigs will be imputed from runs of N in the SCAFFOLD sequence.
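
To illustrate what imputing contigs from runs of N involves, here is a minimal Python sketch. It is not the code used by import_sequences.pl: the function name impute_contigs is hypothetical, and it assumes that any run of one or more N characters separates two contigs. The real script may apply a minimum gap length or other rules.

```python
import re

def impute_contigs(scaffold_seq):
    """Split a scaffold into contigs at runs of N.

    Returns a list of (start, end, sequence) tuples with 1-based,
    inclusive scaffold coordinates. Assumption: every run of N
    (of any length) is treated as a gap between contigs.
    """
    contigs = []
    for match in re.finditer(r"[^Nn]+", scaffold_seq):
        contigs.append((match.start() + 1, match.end(), match.group()))
    return contigs

print(impute_contigs("ACGTNNNNGGCCNAT"))
# [(1, 4, 'ACGT'), (9, 12, 'GGCC'), (14, 15, 'AT')]
```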

[MODIFY]
    OVERWRITE_DB = 1
    TRUNCATE_SEQUENCE_TABLES = 1

  • If OVERWRITE_DB is set to 1, running this script will cause any existing database with the same [DATABASE_CORE] NAME to be dropped and recreated before any data are imported.

  • Setting TRUNCATE_SEQUENCE_TABLES to 1 will truncate any existing sequence tables before importing.

  • If these values are left unset, data will be added to any existing database/sequence tables, which may have unintended consequences, so proceed with caution.

[SCAFFOLD_NAMES]
    HEADER = 1
    SCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]
    CONTIG = [ /(.+)/ /ctg_/contig/ ]

  • To use a file as the source of scaffold name synonyms for import_sequence_synonyms.pl, set [FILES] SCAFFOLD_NAMES. The HEADER flag indicates that the file has a header row that should be skipped during import.

  • Alternatively, match-and-replace regular expressions may be defined for SCAFFOLD and/or CONTIG names to automatically generate a file of synonyms during sequence import.
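
The match-and-replace approach can be sketched in Python as follows. This is an illustration only, not the logic of the import scripts: the function make_synonym is hypothetical, and it assumes that the first expression in the configuration selects the part of the name to work on (via its first capture group) and the second is a search/replacement pair applied to that part.

```python
import re

def make_synonym(name, match_pattern, search, replacement):
    """Derive a synonym for a sequence name, or return None if unchanged.

    Assumed reading of e.g. SCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]:
    match_pattern captures the text to transform, then every occurrence
    of `search` in the captured text is replaced with `replacement`.
    """
    match = re.search(match_pattern, name)
    if match is None:
        return None
    synonym = re.sub(search, replacement, match.group(1))
    return synonym if synonym != name else None

# Patterns taken from the [SCAFFOLD_NAMES] example; the names are made up.
print(make_synonym("scaf_00012", r"(.+)", r"scaf_", "scaffold"))  # scaffold00012
print(make_synonym("ctg_7", r"(.+)", r"ctg_", "contig"))          # contig7
```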