`-b -r -c` Import additional annotations

Some xrefs can be imported via Dbxref attributes in a .gff file; however, several xref types can be represented more richly in the Ensembl database if they are imported directly from program outputs.

docker run --rm \
           --name easy-import-operophtera_brumata_v1_core_32_85_1 \
           --link genomehubs-mysql \
           -v ~/demo/genomehubs-import/import/conf:/import/conf \
           -v ~/demo/genomehubs-import/import/data:/import/data \
           -e DATABASE=operophtera_brumata_v1_core_32_85_1 \
           -e FLAGS="-b" \
           genomehubs/easy-import:latest

docker run --rm \
           --name easy-import-operophtera_brumata_v1_core_32_85_1 \
           --link genomehubs-mysql \
           -v ~/demo/genomehubs-import/import/conf:/import/conf \
           -v ~/demo/genomehubs-import/import/data:/import/data \
           -e DATABASE=operophtera_brumata_v1_core_32_85_1 \
           -e FLAGS="-r" \
           genomehubs/easy-import:latest

Summaries of assembly quality based on conserved gene sets, as reported by CEGMA and BUSCO, can also be imported into the meta table of the core database. If present, these values will be exported by the script export_json.pl during Step 2.6: Export files for use in summary tables/visualisation.

docker run --rm \
           --name easy-import-operophtera_brumata_v1_core_32_85_1 \
           --link genomehubs-mysql \
           -v ~/demo/genomehubs-import/import/conf:/import/conf \
           -v ~/demo/genomehubs-import/import/data:/import/data \
           -e DATABASE=operophtera_brumata_v1_core_32_85_1 \
           -e FLAGS="-c" \
           genomehubs/easy-import:latest
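
The three commands in this section differ only in the value of FLAGS, so they can be wrapped in a small loop. The following is a minimal sketch, not part of easy-import itself: the function names `run_import` and `run_all_imports` and the `DOCKER` override variable are hypothetical, and the paths assume the `~/demo/genomehubs-import` layout used above.

```shell
#!/bin/sh
# Sketch: run the import container once per flag, stopping at the first failure.
# DOCKER is a hypothetical override; set DOCKER=echo to preview the commands.
DATABASE=operophtera_brumata_v1_core_32_85_1
IMPORT_DIR="$HOME/demo/genomehubs-import/import"

run_import() {
    # $1 is a single easy-import flag, e.g. -b
    ${DOCKER:-docker} run --rm \
        --name "easy-import-$DATABASE" \
        --link genomehubs-mysql \
        -v "$IMPORT_DIR/conf:/import/conf" \
        -v "$IMPORT_DIR/data:/import/data" \
        -e DATABASE="$DATABASE" \
        -e FLAGS="$1" \
        genomehubs/easy-import:latest
}

run_all_imports() {
    # Same order as the commands above: -b, then -r, then -c
    for flag in -b -r -c; do
        run_import "$flag" || return 1
    done
}
```

Calling `run_all_imports` runs the three imports in sequence; `DOCKER=echo run_all_imports` prints the commands without starting any container.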

Example commands

To obtain the correct output format, use commands similar to the following:

blastp vs uniprot

parallel -j $NSLOTS --pipe --block 10k --recstart '>' \
    "nice blastp -query - -db /exports/blast_db/uniprot_sprot.fasta -evalue 1e-10 -outfmt '6 std qlen slen stitle btop'"

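The `-outfmt '6 std qlen slen stitle btop'` string above yields tab-separated output with 16 columns per hit (the 12 standard columns plus qlen, slen, stitle and btop). As a quick sanity check before import, something like the following could be used; the function name `check_blastp_columns` is hypothetical and the check only counts fields, it does not validate their contents.

```shell
# check_blastp_columns [FILE]: report lines that do not have 16 tab-separated
# fields. Reads FILE, or standard input when no file is given.
# Exits non-zero if any such line is found.
check_blastp_columns() {
    awk -F'\t' '
        NF != 16 { bad++; printf "line %d: %d fields\n", NR, NF }
        END      { exit (bad > 0) }
    ' "$@"
}
```

For a gzipped result file, the output can be piped in, for example `gzip -dc results.blastp.gz | check_blastp_columns`, where `results.blastp.gz` is a placeholder filename.
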
repeatmasker

RepeatMasker -pa $NSLOTS -lib /path/to/repeat.library -dir . -xsmall /path/to/seqfile

interproscan

cat $PROTEIN | paste - - | grep -v "\*" | sed 's/\t/\n/g' \
| parallel -j $NSLOTS --pipe --block 100k --recstart '>' \
    "nice interproscan.sh -T /run/shm/ -i - -d $OUTDIR -dp -t p -appl TIGRFAM-13.0,ProDom-2006.1,SMART-6.2,SignalP-EUK-4.0,PrositePatterns-20.97,PRINTS-42.0,SuperFamily-1.75,Gene3d-3.5.0,PfamA-27.0,PrositeProfiles-20.97,Phobius-1.01,TMHMM-2.0c,Coils-2.2 -f TSV"
cat -- $OUTDIR/* > $PROTEIN.interproscan
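
The first line of the interproscan pipeline removes protein records that contain a `*` (stop codon) character before they are passed to interproscan.sh. The sketch below isolates that step so it can be tried on a small file; the function name `drop_stop_codon_records` is hypothetical, `tr` is used in place of the `sed` call for portability, and, like the original, it assumes each FASTA record occupies exactly two lines (header, then sequence).

```shell
# drop_stop_codon_records: read two-line FASTA records on standard input and
# print only the records that contain no '*' character.
drop_stop_codon_records() {
    paste - - |         # join header and sequence into one tab-separated line
        grep -v '\*' |  # drop any record containing a literal asterisk
        tr '\t' '\n'    # split header and sequence back onto separate lines
}
```

Usage would be along the lines of `drop_stop_codon_records < "$PROTEIN" > filtered.fa`, where `filtered.fa` is a placeholder output name.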

Configuration options

[FILES]

[FILES]
    BLASTP =  [ BLASTP  http://download.lepbase.org/current/blastp/Operophtera_brumata_v1_-_proteins.fa.blastp.uniprot_sprot.1e-10.gz ]
    IPRSCAN = [ IPRSCAN http://download.lepbase.org/current/interproscan/Operophtera_brumata_v1_-_proteins.fa.interproscan.gz ]
    REPEATMASKER = [ REPEATMASKER http://download.lepbase.org/current/repeats/Operophtera_brumata_v1_-_scaffolds.fa.out.gz ]

Specify the (remote) locations of the BLASTP, IPRSCAN and REPEATMASKER files as appropriate.

[XREF]

[XREF]
    BLASTP = [ 2000 Uniprot/swissprot/TrEMBL UniProtKB/TrEMBL ]

Set the external db id for BLASTP. The final value in the array is used when links to the original data source are added to the description.