
# `-s` Create database and load sequence data


This step consists of two scripts: ``import_sequences.pl`` must be run to set up a core database and load sequence data, while ``import_sequence_synonyms.pl`` is optional and will be run if you have a list of alternate scaffold/contig names or if alternate names are generated by the first script (see configuration options).

```
docker run --rm \
           --name easy-import-operophtera_brumata_v1_core_32_85_1 \
           --link genomehubs-mysql \
           -v ~/demo/genomehubs-import/import/conf:/import/conf \
           -v ~/demo/genomehubs-import/import/data:/import/data \
           -e DATABASE=operophtera_brumata_v1_core_32_85_1 \
           -e FLAGS="-s" \
           genomehubs/easy-import:latest
```

## Configuration options

- [[ENSEMBL]](doc:ensembl-core)
```
[ENSEMBL]
  LOCAL = /ensembl
```
  ``LOCAL`` is the path to the Ensembl repositories on the ``localhost`` and should be set to the same value as [[WEBSITE]](doc:website) ``SERVER_ROOT``.

- [[DATABASE_TEMPLATE]](doc:database_template-core)
```
[DATABASE_TEMPLATE]
  NAME = bombyx_mori_core_31_84_1
  HOST = localhost
  PORT = 3306
  RO_USER = anonymous
  RO_PASS =
```
  Connection details for an existing local (or remote) database that uses the same schema version as the current import. This database is used as a datasource to ensure that tables containing data that do not change across species are filled consistently.

- [[DATABASE_CORE]](doc:database_core-core)
```
[DATABASE_CORE]
  NAME = operophtera_brumata_v1_core_31_84_1
  HOST = localhost
  PORT = 3306
  RW_USER = importer
  RW_PASS = importpassword
  RO_USER = anonymous
  RO_PASS =
```
  Contains the name and connection parameters for the core database that will be created for the current species/assembly. The numbering after ``_core_`` should follow the pattern of [[DATABASE_TEMPLATE]](doc:database_template-core).
  Connection parameters should be as defined in [Step 1.2: Setup database connections](doc:step-12-setup-database-connections).

- [[DATABASE_TAXONOMY]](doc:database_taxonomy-core)
```
[DATABASE_TAXONOMY]
  NAME = ncbi_taxonomy
  HOST = localhost
  PORT = 3306
  RO_USER = anonymous
  RO_PASS =
```
  Connection details for a copy of the (Ensembl format) ``ncbi_taxonomy`` database, used to fill in the taxonomic hierarchy in the ``meta`` table during import.

- [[META]](doc:meta-core)
```
[META]
  SPECIES.PRODUCTION_NAME = Operophtera_brumata_v1
  SPECIES.SCIENTIFIC_NAME = Operophtera brumata
  SPECIES.COMMON_NAME = Winter moth
  SPECIES.DISPLAY_NAME = Operophtera brumata v1
  SPECIES.DIVISION = EnsemblMetazoa
  SPECIES.URL = Operophtera_brumata_v1
  SPECIES.TAXONOMY_ID = 472141
  SPECIES.ALIAS = [ operophtera_brumata operophtera_brumata_v1 operophtera%20brumata winter%moth ]
  ASSEMBLY.NAME = v1
  ASSEMBLY.DATE = 2015-08-11
  ASSEMBLY.ACCESSION = GCA_001266575.1
  ASSEMBLY.DEFAULT = v1
  PROVIDER.NAME = Wageningen University
  PROVIDER.URL = http://www.bioinformatics.nl/wintermoth
  GENEBUILD.ID = 1
  GENEBUILD.START_DATE = 2015-08
  GENEBUILD.VERSION = 1
  GENEBUILD.METHOD = import
```
  Metadata for the current import. These fields should be edited to suit the current import and are used either during this import pipeline or as a datasource for parts of the Ensembl website.

- [[FILES]](doc:files-core)
```
[FILES]
  SCAFFOLD = [ fa http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz ]
```
  Details of the sequence file(s) to be imported.
  - If a ``SCAFFOLD`` file of type ``fa`` is provided, then a ``CONTIG`` file is optional and *vice versa*.
  - ``SCAFFOLD`` data can also be imported from an ``agp`` file, provided ``CONTIG`` sequences are supplied.
  - If no ``CONTIG`` file is provided, contigs will be inferred from runs of ``N`` in the ``SCAFFOLD`` sequences.

- [[MODIFY]](doc:modify-core)
```
[MODIFY]
  OVERWRITE_DB = 1
  TRUNCATE_SEQUENCE_TABLES = 1
```
  - If ``OVERWRITE_DB`` is set to 1, running this script will drop and recreate any existing database with the same [[DATABASE_CORE]](doc:database_core-core) ``NAME`` before any data are imported.
  - Setting ``TRUNCATE_SEQUENCE_TABLES`` to 1 will truncate any existing sequence tables before importing.
  - If these values are left unset, additional data will be added to any existing database/sequence table, which may have unintended consequences, so proceed with caution.

- [[SCAFFOLD_NAMES]](doc:scaffold_names-core)
```
[SCAFFOLD_NAMES]
  HEADER = 1
  SCAFFOLD = [ /(.+)/ /scaf_/scaffold/ ]
  CONTIG = [ /(.+)/ /ctg_/contig/ ]
```
  - To use a file as a source of scaffold name synonyms for ``import_sequence_synonyms.pl``, [[FILES]](doc:files-core) ``SCAFFOLD_NAMES`` must be set. The ``HEADER`` flag may be used to indicate that the file has a header row that should be skipped during import.
  - Alternatively, [Match and replace](doc:match-and-replace) regular expressions may be defined for ``SCAFFOLD`` and/or ``CONTIG`` names to automatically generate a file of synonyms during sequence import.
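The contig inference described under [[FILES]](doc:files-core) amounts to splitting each scaffold at runs of ``N`` and recording the coordinates of the remaining stretches of real sequence. The actual pipeline is implemented in Perl (``import_sequences.pl``), and its exact rules (e.g. any minimum gap length) are not documented here, but the core idea can be sketched in a few lines of Python; the function name ``infer_contigs`` is illustrative, not part of easy-import:

```python
import re


def infer_contigs(scaffold_seq):
    """Return (start, end) coordinates of inferred contigs in a scaffold.

    Contigs are the maximal runs of non-N bases; any run of N/n is
    treated as a gap. Coordinates are 0-based, end-exclusive.
    Illustrative sketch only: import_sequences.pl may apply extra
    rules, such as a minimum gap length.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"[^Nn]+", scaffold_seq)]


# Example: two contigs separated by a 10 bp gap of Ns
seq = "ACGTACGT" + "N" * 10 + "GGCCTTAA"
print(infer_contigs(seq))  # [(0, 8), (18, 26)]
```

A scaffold consisting entirely of ``N`` yields no contigs, which is why supplying a real ``CONTIG`` file is preferable when one is available.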