Step 2.2: Create database and load sequence data

This step uses two scripts. import_sequences.pl must be run to set up a core database and load sequence data. import_sequence_synonyms.pl is optional; run it if you have a list of alternate scaffold/contig names, or if alternate names are generated by the first script (see configuration options).

cd ~/import
perl ../ei/core/import_sequences.pl ../ei/conf/core-import.ini

perl ../ei/core/import_sequence_synonyms.pl ../ei/conf/core-import.ini

Configuration options

Eight stanzas of core-import.ini are used to create the core database and import sequence data:
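
Before running the scripts it can help to confirm that all eight stanzas are present. The following is a minimal sketch, assuming the file can be read by Python's configparser; the import scripts use their own parser, and the function name missing_stanzas is hypothetical.

```python
import configparser

# The eight stanzas described in this step.
EXPECTED_STANZAS = [
    "ENSEMBL", "DATABASE_TEMPLATE", "DATABASE_CORE", "DATABASE_TAXONOMY",
    "META", "FILES", "MODIFY", "SCAFFOLD_NAMES",
]

def missing_stanzas(ini_text):
    """Return the expected stanza names that are absent from ini_text."""
    # interpolation=None keeps values such as operophtera%20brumata literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(ini_text)
    return [name for name in EXPECTED_STANZAS if not parser.has_section(name)]

if __name__ == "__main__":
    # Assumption: the path matches the one used in the commands above.
    with open("../ei/conf/core-import.ini") as handle:
        print(missing_stanzas(handle.read()))
```

An empty list means every stanza was found; it says nothing about whether the values inside each stanza are valid.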

[ENSEMBL]

[ENSEMBL]
	LOCAL = /ensembl

LOCAL is the path to the Ensembl repositories on the local host and should be set to the same value as [WEBSITE] SERVER_ROOT.

[DATABASE_TEMPLATE]

[DATABASE_TEMPLATE]
	NAME = bombyx_mori_core_31_84_1
	HOST = localhost
	PORT = 3306
	RO_USER = anonymous
	RO_PASS =

Connection details for an existing local (or remote) database that uses the same schema version as the current import. It is used as a data source so that tables whose contents do not change across species are filled consistently.

[DATABASE_CORE]

[DATABASE_CORE]
	NAME = operophtera_brumata_v1_core_31_84_1
	HOST = localhost
	PORT = 3306
	RW_USER = importer
	RW_PASS = importpassword
	RO_USER = anonymous
	RO_PASS =

Contains the name and connection parameters for the core database that will be created for the current species/assembly. The numbering after _core_ should follow the pattern of [DATABASE_TEMPLATE] NAME. Connection parameters should be as defined in Step 1.2: Setup database connections.
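
The naming rule can be checked with a short script. This is only a sketch: it assumes that "follow the pattern" means the numbering after _core_ is identical in both names, as it is in the example values above, and the helper names are hypothetical.

```python
import re

# Assumption: core database names end in "_core_<numbering>", e.g. "_core_31_84_1".
CORE_SUFFIX = re.compile(r"_core_(\d+(?:_\d+)*)$")

def core_numbering(db_name):
    """Return the numbering after _core_ (e.g. '31_84_1'), or None if absent."""
    match = CORE_SUFFIX.search(db_name)
    return match.group(1) if match else None

def follows_template(core_name, template_name):
    """True when both names carry the same numbering after _core_."""
    numbering = core_numbering(template_name)
    return numbering is not None and core_numbering(core_name) == numbering

print(follows_template("operophtera_brumata_v1_core_31_84_1",
                       "bombyx_mori_core_31_84_1"))
```

If your setup allows part of the numbering to differ between template and core database, relax the comparison accordingly.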

[DATABASE_TAXONOMY]

[DATABASE_TAXONOMY]
	NAME = ncbi_taxonomy
	HOST = localhost
	PORT = 3306
	RO_USER = anonymous
	RO_PASS =

Connection details for a copy of the (Ensembl format) ncbi_taxonomy database, used to fill in the taxonomic hierarchy in the meta table during import.

[META]

[META]
	SPECIES.PRODUCTION_NAME = Operophtera_brumata_v1
	SPECIES.SCIENTIFIC_NAME = Operophtera brumata
	SPECIES.COMMON_NAME = Winter moth
	SPECIES.DISPLAY_NAME = Operophtera brumata v1
	SPECIES.DIVISION = EnsemblMetazoa
	SPECIES.URL = Operophtera_brumata_v1
	SPECIES.TAXONOMY_ID = 472141
	SPECIES.ALIAS = [ operophtera_brumata operophtera_brumata_v1 operophtera%20brumata winter%moth ]
	ASSEMBLY.NAME = v1
	ASSEMBLY.DATE = 2015-08-11
	ASSEMBLY.ACCESSION = GCA_001266575.1
	ASSEMBLY.DEFAULT = v1
	PROVIDER.NAME = Wageningen University
	PROVIDER.URL = http://www.bioinformatics.nl/wintermoth
	GENEBUILD.ID = 1
	GENEBUILD.START_DATE = 2015-08
	GENEBUILD.VERSION = 1
	GENEBUILD.METHOD = import

Metadata for the current import. These fields should be edited to suit the current import and are used either during this import pipeline or as a datasource for parts of the Ensembl website.

[FILES]

[FILES]
	SCAFFOLD = [ fa http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz ]

Details of the sequence file(s) to be imported.

If a SCAFFOLD file of type fa is provided, then a CONTIG file is optional, and vice versa.
SCAFFOLD data can also be imported from an agp file, as long as CONTIG sequences are also provided.
If no CONTIG file is provided, contigs will be imputed from runs of N in the SCAFFOLD sequence.

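
To illustrate what imputing contigs from runs of N involves, here is a minimal sketch. It is not the code used by import_sequences.pl: the function name and the min_gap parameter are hypothetical, and the minimum run length that the script treats as a gap is not stated here.

```python
import re

def impute_contigs(scaffold_seq, min_gap=1):
    """Split a scaffold at runs of N.

    Returns a list of (start, end, sequence) tuples with 1-based,
    inclusive coordinates on the scaffold. min_gap is a hypothetical
    threshold: runs of N shorter than this stay inside a contig.
    """
    contigs = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_gap, scaffold_seq):
        if gap.start() > pos:
            contigs.append((pos + 1, gap.start(), scaffold_seq[pos:gap.start()]))
        pos = gap.end()
    if pos < len(scaffold_seq):
        contigs.append((pos + 1, len(scaffold_seq), scaffold_seq[pos:]))
    return contigs

for start, end, seq in impute_contigs("ACGTNNNNGGCCNA"):
    print(start, end, seq)
```

Each tuple corresponds to one contig placed on the scaffold; the gaps between consecutive tuples are the runs of N.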
[MODIFY]

[MODIFY]
	OVERWRITE_DB = 1
	TRUNCATE_SEQUENCE_TABLES = 1

If OVERWRITE_DB is set to 1, running this script will cause any existing database with the same [DATABASE_CORE] NAME to be dropped and recreated before any data are imported.
Setting TRUNCATE_SEQUENCE_TABLES to 1 will truncate any existing sequence tables before importing.
If these values are left unset, additional data will be added to an existing database/sequence table, which may have unintended consequences, so proceed with caution.

[SCAFFOLD_NAMES]

[SCAFFOLD_NAMES]
	HEADER = 1
	SCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]
	CONTIG = [ /(.+)/ /ctg_/contig/ ]

To use a file as a source of scaffold name synonyms for import_sequence_synonyms.pl, [FILES] SCAFFOLD_NAMES must be set; the HEADER flag may be used to indicate that the file has a header row that should be skipped during import.
Alternatively, match and replace regular expressions may be defined for SCAFFOLD and/or CONTIG names to automatically generate a file of synonyms during sequence import.
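
The effect of a match and replace pair can be sketched as follows. This assumes that the first expression, such as /(.+)/, captures the part of the name to keep and that the second, such as /scaf_/scaffold/, is a substitution applied to the captured text; the exact semantics in import_sequences.pl may differ, and make_synonym is a hypothetical helper.

```python
import re

def make_synonym(name, match_pattern, search, replacement):
    """Derive a synonym from name, or return None if nothing changes.

    match_pattern must contain one capture group; search/replacement
    are applied once to the captured text (an assumed reading of the
    [SCAFFOLD_NAMES] syntax).
    """
    match = re.search(match_pattern, name)
    if match is None:
        return None
    synonym = re.sub(search, replacement, match.group(1), count=1)
    return synonym if synonym != name else None

# Patterns taken from the example stanza above.
print(make_synonym("scaf_0001", r"(.+)", r"scaf_", "scaffold"))
print(make_synonym("ctg_12", r"(.+)", r"ctg_", "contig"))
```

Names that the substitution leaves unchanged yield None, so no redundant synonym would be recorded for them.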