{"__v":14,"_id":"5735af544b0ab120000b7db2","category":{"__v":0,"_id":"5735a32931a73b1700887c94","project":"5735936aafab441700723a50","version":"5735936aafab441700723a53","sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-05-13T09:49:29.176Z","from_sync":false,"order":2,"slug":"quick-start","title":"Stage 2 - Core Import"},"parentDoc":null,"project":"5735936aafab441700723a50","user":"573592b84b0ab120000b7d44","version":{"__v":12,"_id":"5735936aafab441700723a53","project":"5735936aafab441700723a50","createdAt":"2016-05-13T08:42:18.615Z","releaseDate":"2016-05-13T08:42:18.615Z","categories":["5735936aafab441700723a54","5735a32931a73b1700887c94","5735b55beceb872200abbc6c","5735b56eb667601700d3bd6f","5735b9ba4b0ab120000b7dd4","5735b9c94b0ab120000b7dd5","5735cb131f16241700c8a0f7","5735e5c4e4824c3400aa1f21","5735e5d9e4824c3400aa1f23","5735e5f2ec67f6290013ac72","573ecfe0804f901700a9dfc7","573f276c7eeb8b190094ca7d"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":false,"codename":"","version_clean":"1.0.0","version":"1.0"},"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-05-13T10:41:24.357Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":2,"body":"This step consists of two scripts, ``import_sequences.pl`` must be run to set up a core database and load sequence data. 
``import_sequence_synonyms.pl`` is optional and may be run if you have a list of alternate scaffold/contig names or alternate names are generated by the first script (see configuration options).\n\n```\ncd ~/import\nperl ../ei/core/import_sequences.pl ../ei/conf/core-import.ini\n```\n\n```\nperl ../ei/core/import_sequence_synonyms.pl ../ei/conf/core-import.ini\n```\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Configuration options\"\n}\n[/block]\nEight stanzas of ``core-import.ini`` are used to create the core database and import sequence data:\n- [[ENSEMBL]](doc:ensembl-core)\n```\n[ENSEMBL]\n\tLOCAL = /ensembl\n```\n  ``LOCAL`` is the path to the Ensembl repositories on the ``localhost`` and should be set to the same value as [[WEBSITE]](doc:website) ``SERVER_ROOT``\n\n- [[DATABASE_TEMPLATE]](doc:database_template-core)\n```\n[DATABASE_TEMPLATE]\n\tNAME = bombyx_mori_core_31_84_1\n\tHOST = localhost\n\tPORT = 3306\n\tRO_USER = anonymous\n\tRO_PASS =\n```\n  Connection details for an existing local (or remote) database using the same schema version as the current import used as a datasource to ensure tables containing data that does not change across species are filled consistently.\n\n- [[DATABASE_CORE]](doc:database_core-core)\n```\n[DATABASE_CORE]\n\tNAME = operophtera_brumata_v1_core_31_84_1\n\tHOST = localhost\n\tPORT = 3306\n\tRW_USER = importer\n\tRW_PASS = importpassword\n\tRO_USER = anonymous\n\tRO_PASS =\n```\n  Contains the name and connection parameters for the core database that will be created for the current species/assembly.  the numbering after ``_core_`` should follow the pattern of [[DATABASE_TEMPLATE]](doc:database_template-core).  
Connection parameters should be as defined in [Step 1.2: Setup database connections](doc:step-12-setup-database-connections).\n\n- [[DATABASE_TAXONOMY]](doc:database_taxonomy-core)\n```\n[DATABASE_TAXONOMY]\n\tNAME = ncbi_taxonomy\n\tHOST = localhost\n\tPORT = 3306\n\tRO_USER = anonymous\n\tRO_PASS =\n```\n  Connection details for a copy of the (Ensembl format) ncbi_taxonomy database, used to fill in the taxonomic hierarchy in the ``meta`` table during import.\n\n- [[META]](doc:meta-core)\n```\n[META]\n\tSPECIES.PRODUCTION_NAME = Operophtera_brumata_v1\n\tSPECIES.SCIENTIFIC_NAME = Operophtera brumata\n\tSPECIES.COMMON_NAME = Winter moth\n\tSPECIES.DISPLAY_NAME = Operophtera brumata v1\n\tSPECIES.DIVISION = EnsemblMetazoa\n\tSPECIES.URL = Operophtera_brumata_v1\n\tSPECIES.TAXONOMY_ID = 472141\n\tSPECIES.ALIAS = [ operophtera_brumata operophtera_brumata_v1 operophtera%20brumata winter%moth ]\n\tASSEMBLY.NAME = v1\n\tASSEMBLY.DATE = 2015-08-11\n\tASSEMBLY.ACCESSION = GCA_001266575.1\n\tASSEMBLY.DEFAULT = v1\n\tPROVIDER.NAME = Wageningen University\n\tPROVIDER.URL = http://www.bioinformatics.nl/wintermoth\n\tGENEBUILD.ID = 1\n\tGENEBUILD.START_DATE = 2015-08\n\tGENEBUILD.VERSION = 1\n\tGENEBUILD.METHOD = import\n```\n  Metadata for the current import.  These fields should be edited to suit the current import and are used either during this import pipeline or as a datasource for parts of the Ensembl website.\n\n- [[FILES]](doc:files-core)\n```\n[FILES]\n\tSCAFFOLD = [ fa http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz ]\n```\n  Details of the sequence file(s) to be imported.  \n  - If a ``SCAFFOLD`` file of type ``fa`` is provided, then a ``CONTIG`` file is optional and *vice versa*. \n  - ``SCAFFOLD`` data can also be imported from an ``agp`` file provided ``CONTIG`` sequences are provided.  
\n  - If no ``CONTIG`` file is provided, contigs will be imputed from runs of ``N`` in the ``SCAFFOLD`` sequence\n\n- [[MODIFY]](doc:modify-core)\n```\n[MODIFY]\n\tOVERWRITE_DB = 1\n\tTRUNCATE_SEQUENCE_TABLES = 1\n```\n  - If ``OVERWRITE_DB`` is set to 1, running this script will cause any existing database with the same [[DATABASE_CORE]](doc:database_core-core) ``NAME`` to be dropped and recreated before any data are imported.\n  - Setting ``TRUNCATE_SEQUENCE_TABLES`` to 1 will truncate any existing sequence tables before importing.\n  - if these values are left unset, additional data will be added to an existing database/sequence table, which may have unintended consequences so proceed with caution.\n\n- [[SCAFFOLD_NAMES]](doc:scaffold_names-core)\n```\n[SCAFFOLD_NAMES]\n\tHEADER = 1\n\tSCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]\n\tCONTIG = [ /(.+)/ /ctg_/contig/ ]\n```\n  - To use a file as a source of scaffold name synonyms for ``import_sequence_synonyms.pl``, [[FILES]](doc:files-core) ``SCAFFOLD_NAMES`` must be set and the ``HEADER`` flag may be used to indicate that the file has a header row that should be skipped during import.\n  - Alternatively [Match and replace](doc:match-and-replace) regular expressions may be defined for ``SCAFFOLD`` and/or ``CONTIG`` names to automatically generate a file of synonyms during sequence import.","excerpt":"","slug":"step-22-create-database-and-load-sequence-data","type":"basic","title":"Step 2.2: Create database and load sequence data"}

Step 2.2: Create database and load sequence data


This step consists of two scripts. ``import_sequences.pl`` must be run to set up a core database and load sequence data. ``import_sequence_synonyms.pl`` is optional and may be run if you have a list of alternate scaffold/contig names, or if alternate names are generated by the first script (see configuration options).

```
cd ~/import
perl ../ei/core/import_sequences.pl ../ei/conf/core-import.ini
```

```
perl ../ei/core/import_sequence_synonyms.pl ../ei/conf/core-import.ini
```
[block:api-header]
{
  "type": "basic",
  "title": "Configuration options"
}
[/block]
Eight stanzas of ``core-import.ini`` are used to create the core database and import sequence data:
- [[ENSEMBL]](doc:ensembl-core)
```
[ENSEMBL]
	LOCAL = /ensembl
```
  ``LOCAL`` is the path to the Ensembl repositories on ``localhost`` and should be set to the same value as [[WEBSITE]](doc:website) ``SERVER_ROOT``.

- [[DATABASE_TEMPLATE]](doc:database_template-core)
```
[DATABASE_TEMPLATE]
	NAME = bombyx_mori_core_31_84_1
	HOST = localhost
	PORT = 3306
	RO_USER = anonymous
	RO_PASS =
```
  Connection details for an existing local (or remote) database that uses the same schema version as the current import. It is used as a data source so that tables whose contents do not change across species are filled consistently.

- [[DATABASE_CORE]](doc:database_core-core)
```
[DATABASE_CORE]
	NAME = operophtera_brumata_v1_core_31_84_1
	HOST = localhost
	PORT = 3306
	RW_USER = importer
	RW_PASS = importpassword
	RO_USER = anonymous
	RO_PASS =
```
  Contains the name and connection parameters for the core database that will be created for the current species/assembly. The numbering after ``_core_`` should follow the pattern of [[DATABASE_TEMPLATE]](doc:database_template-core). Connection parameters should be as defined in [Step 1.2: Setup database connections](doc:step-12-setup-database-connections).

- [[DATABASE_TAXONOMY]](doc:database_taxonomy-core)
```
[DATABASE_TAXONOMY]
	NAME = ncbi_taxonomy
	HOST = localhost
	PORT = 3306
	RO_USER = anonymous
	RO_PASS =
```
  Connection details for a copy of the (Ensembl format) ``ncbi_taxonomy`` database, used to fill in the taxonomic hierarchy in the ``meta`` table during import.

- [[META]](doc:meta-core)
```
[META]
	SPECIES.PRODUCTION_NAME = Operophtera_brumata_v1
	SPECIES.SCIENTIFIC_NAME = Operophtera brumata
	SPECIES.COMMON_NAME = Winter moth
	SPECIES.DISPLAY_NAME = Operophtera brumata v1
	SPECIES.DIVISION = EnsemblMetazoa
	SPECIES.URL = Operophtera_brumata_v1
	SPECIES.TAXONOMY_ID = 472141
	SPECIES.ALIAS = [ operophtera_brumata operophtera_brumata_v1 operophtera%20brumata winter%moth ]
	ASSEMBLY.NAME = v1
	ASSEMBLY.DATE = 2015-08-11
	ASSEMBLY.ACCESSION = GCA_001266575.1
	ASSEMBLY.DEFAULT = v1
	PROVIDER.NAME = Wageningen University
	PROVIDER.URL = http://www.bioinformatics.nl/wintermoth
	GENEBUILD.ID = 1
	GENEBUILD.START_DATE = 2015-08
	GENEBUILD.VERSION = 1
	GENEBUILD.METHOD = import
```
  Metadata for the current import. Edit these fields to suit the current import; they are used either during this import pipeline or as a data source for parts of the Ensembl website.

- [[FILES]](doc:files-core)
```
[FILES]
	SCAFFOLD = [ fa http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz ]
```
  Details of the sequence file(s) to be imported.
  - If a ``SCAFFOLD`` file of type ``fa`` is provided, then a ``CONTIG`` file is optional and *vice versa*.
  - ``SCAFFOLD`` data can also be imported from an ``agp`` file, provided that ``CONTIG`` sequences are also supplied.
  - If no ``CONTIG`` file is provided, contigs will be imputed from runs of ``N`` in the ``SCAFFOLD`` sequence.

- [[MODIFY]](doc:modify-core)
```
[MODIFY]
	OVERWRITE_DB = 1
	TRUNCATE_SEQUENCE_TABLES = 1
```
  - If ``OVERWRITE_DB`` is set to 1, running this script will cause any existing database with the same [[DATABASE_CORE]](doc:database_core-core) ``NAME`` to be dropped and recreated before any data are imported.
  - Setting ``TRUNCATE_SEQUENCE_TABLES`` to 1 will truncate any existing sequence tables before importing.
  - If these values are left unset, data will be added to the existing database/sequence tables, which may have unintended consequences, so proceed with caution.

- [[SCAFFOLD_NAMES]](doc:scaffold_names-core)
```
[SCAFFOLD_NAMES]
	HEADER = 1
	SCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]
	CONTIG = [ /(.+)/ /ctg_/contig/ ]
```
  - To use a file as a source of scaffold name synonyms for ``import_sequence_synonyms.pl``, [[FILES]](doc:files-core) ``SCAFFOLD_NAMES`` must be set. The ``HEADER`` flag may be used to indicate that the file has a header row that should be skipped during import.
  - Alternatively, [Match and replace](doc:match-and-replace) regular expressions may be defined for ``SCAFFOLD`` and/or ``CONTIG`` names to automatically generate a file of synonyms during sequence import.
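
To make the ``SCAFFOLD`` and ``CONTIG`` patterns in [SCAFFOLD_NAMES] more concrete, here is a minimal Python sketch of one plausible reading of them. It assumes the first expression (``/(.+)/``) captures part of the sequence name and the second (``/scaf_/scaffold/``) is a substitution applied to the captured text. The function ``make_synonym`` and the example names are hypothetical and are not part of the import scripts; the [Match and replace](doc:match-and-replace) page remains the authority on the actual behaviour.

```python
import re


def make_synonym(name, match, search, replace):
    """Return a synonym for `name`, or None if no distinct synonym results.

    Assumed semantics (not confirmed by the pipeline documentation):
    `match` captures part of the name in group 1, then every occurrence of
    `search` in the captured text is replaced with `replace`.
    """
    m = re.search(match, name)
    if m is None:
        return None
    synonym = re.sub(search, replace, m.group(1))
    # A synonym identical to the original name adds no information.
    return synonym if synonym != name else None


# Patterns taken from the example [SCAFFOLD_NAMES] stanza.
scaffold_rule = (r"(.+)", r"scaf_", "scaffold")
contig_rule = (r"(.+)", r"ctg_", "contig")

for name, rule in [("scaf_0001", scaffold_rule), ("ctg_42", contig_rule)]:
    print(name, make_synonym(name, *rule), sep="\t")
```

Under these assumptions a name such as ``scaf_0001`` would gain the synonym ``scaffold0001``, while a name that the substitution leaves unchanged would gain none.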
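
The [FILES] stanza notes that, without a ``CONTIG`` file, contigs are imputed from runs of ``N`` in the ``SCAFFOLD`` sequence. The Python sketch below illustrates that general idea only. The function ``split_scaffold``, the ``min_gap`` parameter, and the 1-based inclusive coordinates are assumptions made for illustration; the rules that ``import_sequences.pl`` actually applies (for example, any minimum gap length) are not described here.

```python
import re


def split_scaffold(seq, min_gap=1):
    """Split a scaffold into contigs at runs of N.

    Returns a list of (start, end, bases) tuples with 1-based inclusive
    coordinates on the scaffold. `min_gap` is a hypothetical threshold:
    only runs of at least this many N characters are treated as gaps.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs = []
    pos = 0  # 0-based index where the current contig starts
    for m in gap.finditer(seq):
        if m.start() > pos:
            contigs.append((pos + 1, m.start(), seq[pos:m.start()]))
        pos = m.end()
    if pos < len(seq):
        # Sequence after the last gap (or the whole scaffold if no gaps).
        contigs.append((pos + 1, len(seq), seq[pos:]))
    return contigs


for start, end, bases in split_scaffold("ACGTNNNNGGCCNTTAA"):
    print(start, end, bases)
```

Raising ``min_gap`` keeps short runs of ``N`` inside a contig rather than splitting on them, which is one of the choices a real importer has to make.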