Step 2.2: Create database and load sequence data

This step uses two scripts. import_sequences.pl must be run to set up a core database and load sequence data. import_sequence_synonyms.pl is optional and may be run if you have a list of alternate scaffold/contig names, or if alternate names are generated by the first script (see Configuration options).

cd ~/import
perl ../ei/core/import_sequences.pl ../ei/conf/core-import.ini
perl ../ei/core/import_sequence_synonyms.pl ../ei/conf/core-import.ini

Configuration options

Eight stanzas of core-import.ini are used to create the core database and import sequence data:

[ENSEMBL]
	LOCAL = /ensembl

LOCAL is the path to the Ensembl repositories on the local host and should be set to the same value as [WEBSITE] SERVER_ROOT.

[DATABASE_TEMPLATE]
	NAME = bombyx_mori_core_31_84_1
	HOST = localhost
	PORT = 3306
	RO_USER = anonymous
	RO_PASS =

Connection details for an existing local (or remote) database that uses the same schema version as the current import. It is used as a data source so that tables whose contents do not change across species are filled consistently.

[DATABASE_CORE]
	NAME = operophtera_brumata_v1_core_31_84_1
	HOST = localhost
	PORT = 3306
	RW_USER = importer
	RW_PASS = importpassword
	RO_USER = anonymous
	RO_PASS =

Contains the name and connection parameters for the core database that will be created for the current species/assembly. The numbering after _core_ should follow the pattern of [DATABASE_TEMPLATE] NAME. Connection parameters should be as defined in Step 1.2: Setup database connections.
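
The naming rule can be checked before running the import. The following is a minimal sketch in Python and is not part of the pipeline: the function name core_suffix is hypothetical, and it assumes database names end in _core_ followed by underscore-separated numbers, as in the two example names above.

```python
import re

def core_suffix(db_name):
    """Return the numbering after '_core_' (e.g. '31_84_1'), or None if absent."""
    match = re.search(r"_core_(\d+(?:_\d+)*)$", db_name)
    return match.group(1) if match else None

template = "bombyx_mori_core_31_84_1"
core = "operophtera_brumata_v1_core_31_84_1"

# The new core database should reuse the numbering of the template database.
if core_suffix(template) != core_suffix(core):
    print("Warning: core and template database numbering differ")
```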

[DATABASE_TAXONOMY]
	NAME = ncbi_taxonomy
	HOST = localhost
	PORT = 3306
	RO_USER = anonymous
	RO_PASS =

Connection details for a copy of the (Ensembl format) ncbi_taxonomy database, used to fill in the taxonomic hierarchy in the meta table during import.

[META]
	SPECIES.PRODUCTION_NAME = Operophtera_brumata_v1
	SPECIES.SCIENTIFIC_NAME = Operophtera brumata
	SPECIES.COMMON_NAME = Winter moth
	SPECIES.DISPLAY_NAME = Operophtera brumata v1
	SPECIES.DIVISION = EnsemblMetazoa
	SPECIES.URL = Operophtera_brumata_v1
	SPECIES.TAXONOMY_ID = 472141
	SPECIES.ALIAS = [ operophtera_brumata operophtera_brumata_v1 operophtera%20brumata winter%20moth ]
	ASSEMBLY.NAME = v1
	ASSEMBLY.DATE = 2015-08-11
	ASSEMBLY.ACCESSION = GCA_001266575.1
	ASSEMBLY.DEFAULT = v1
	PROVIDER.NAME = Wageningen University
	PROVIDER.URL = http://www.bioinformatics.nl/wintermoth
	GENEBUILD.ID = 1
	GENEBUILD.START_DATE = 2015-08
	GENEBUILD.VERSION = 1
	GENEBUILD.METHOD = import

Metadata for the current import. Edit these fields to suit the current import; they are used either during this import pipeline or as a data source for parts of the Ensembl website.
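
Values such as SPECIES.ALIAS are written as a space-separated list inside square brackets, with %20 standing in for a space within a single item. The sketch below illustrates one way such a value could be read. It is an assumption about the format based only on the example above, not the parser used by the import scripts, and the function name parse_list is hypothetical.

```python
from urllib.parse import unquote

def parse_list(value):
    """Split a '[ a b c ]' value on whitespace and percent-decode each item."""
    inner = value.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    # Items are separated by whitespace, so spaces inside an item
    # must be written as %20 in the configuration file.
    return [unquote(item) for item in inner.split()]

aliases = parse_list("[ operophtera_brumata operophtera_brumata_v1 operophtera%20brumata ]")
print(aliases)
```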

[FILES]
	SCAFFOLD = [ fa http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz ]

Details of the sequence file(s) to be imported.

  • If a SCAFFOLD file of type fa is provided, then a CONTIG file is optional and vice versa.

  • SCAFFOLD data can also be imported from an agp file, provided that CONTIG sequences are supplied.

  • If no CONTIG file is provided, contigs will be imputed from runs of N in the SCAFFOLD sequence.
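
Imputing contigs from runs of N can be pictured as splitting each scaffold wherever a gap occurs. The sketch below is illustrative only and is not the code used by import_sequences.pl: the function name impute_contigs and the min_gap parameter are hypothetical, and the minimum gap length the script actually uses is not stated here.

```python
import re

def impute_contigs(scaffold_seq, min_gap=1):
    """Return (start, end, sequence) for each stretch between runs of N.

    Coordinates are 1-based and inclusive, relative to the scaffold.
    """
    contigs = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_gap, scaffold_seq):
        if gap.start() > pos:
            contigs.append((pos + 1, gap.start(), scaffold_seq[pos:gap.start()]))
        pos = gap.end()
    # Keep any sequence that follows the final gap.
    if pos < len(scaffold_seq):
        contigs.append((pos + 1, len(scaffold_seq), scaffold_seq[pos:]))
    return contigs

print(impute_contigs("ACGTNNNNGGCCNTT"))
# prints [(1, 4, 'ACGT'), (9, 12, 'GGCC'), (14, 15, 'TT')]
```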

[MODIFY]
	OVERWRITE_DB = 1
	TRUNCATE_SEQUENCE_TABLES = 1

  • If OVERWRITE_DB is set to 1, running this script will cause any existing database with the same [DATABASE_CORE] NAME to be dropped and recreated before any data are imported.

  • Setting TRUNCATE_SEQUENCE_TABLES to 1 will truncate any existing sequence tables before importing.

  • If these values are left unset, data will be added to any existing database or sequence tables, which may have unintended consequences, so proceed with caution.

[SCAFFOLD_NAMES]
	HEADER = 1
	SCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]
	CONTIG = [ /(.+)/ /ctg_/contig/ ]

  • To use a file as a source of scaffold name synonyms for import_sequence_synonyms.pl, [FILES] SCAFFOLD_NAMES must be set; the HEADER flag may be used to indicate that the file has a header row that should be skipped during import.

  • Alternatively, match-and-replace regular expressions may be defined for SCAFFOLD and/or CONTIG names to automatically generate a file of synonyms during sequence import.
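
As a rough illustration of the match-and-replace approach, the sketch below captures part of a sequence name with the first expression and then applies the substitution from the second. This reading of the two expressions is an assumption, as is the example name scaf_0001; the function name make_synonym is hypothetical, and the import scripts may apply the patterns differently.

```python
import re

def make_synonym(name, match_pattern, search, replacement):
    """Capture part of a name, then substitute within the captured text."""
    found = re.search(match_pattern, name)
    if not found:
        return None
    synonym = re.sub(search, replacement, found.group(1))
    # Only report a synonym when it differs from the original name.
    return synonym if synonym != name else None

# Mirrors SCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]
print(make_synonym("scaf_0001", r"(.+)", r"scaf_", "scaffold"))
# prints scaffold0001
```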