Step 2.6: Import additional annotations

(optional)

Some xrefs can be imported via Dbxref attributes in a .gff file. However, several xref types can be represented more richly in the Ensembl database if they are imported directly from program outputs.

  • blastp
cd ~/import
perl ../ei/core/import_blastp.pl ../ei/conf/core-import.ini ../ei/conf/core-import-extra.ini
  • repeatmasker
perl ../ei/core/import_repeatmasker.pl ../ei/conf/core-import.ini ../ei/conf/core-import-extra.ini
  • interproscan
perl ../ei/core/import_interproscan.pl ../ei/conf/core-import.ini ../ei/conf/core-import-extra.ini
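
The three imports above take the same two ini files, so they can be wrapped in a small loop. The following is a minimal sketch rather than part of the pipeline: `run_imports` is a hypothetical helper and `DRY_RUN` is a hypothetical switch that prints each command instead of running it. It assumes the script names follow the `import_<tool>.pl` pattern shown above and that it is called from ~/import.

```shell
# Hypothetical helper: run the three import scripts in order and
# stop at the first one that fails.
# Usage: run_imports [ini] [extra-ini]
run_imports() {
    ini=${1:-../ei/conf/core-import.ini}
    extra=${2:-../ei/conf/core-import-extra.ini}
    for tool in blastp repeatmasker interproscan; do
        script="../ei/core/import_${tool}.pl"
        if [ -n "${DRY_RUN:-}" ]; then
            # DRY_RUN is an assumption of this sketch, not a pipeline option
            echo "perl $script $ini $extra"
        else
            perl "$script" "$ini" "$extra" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 run_imports` first prints the three commands so the paths can be checked before anything is written to the database.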

Summaries of assembly quality based on conserved gene sets, produced with CEGMA and BUSCO, can also be imported into the meta table of the core database. If present, these values will be exported by the script export_json.pl during Step 2.6: Export files for use in summary tables/visualisation.

perl ../ei/core/import_cegma_busco.pl ../ei/conf/core-import.ini ../ei/conf/core-import-extra.ini

Example commands

To obtain the correct output format, use commands similar to the following:

  • blastp vs uniprot
parallel -j $NSLOTS --pipe --block 10k --recstart '>' \
    "nice blastp -query - -db /exports/blast_db/uniprot_sprot.fasta -evalue 1e-10 -outfmt '6 std qlen slen stitle btop'"
  • repeatmasker
RepeatMasker -pa $NSLOTS -lib /path/to/repeat.library -dir . -xsmall /path/to/seqfile
  • interproscan
cat $PROTEIN | paste - - | grep -v "\*" | sed 's/\t/\n/g' \
| parallel -j $NSLOTS --pipe --block 100k --recstart '>' \
    "nice interproscan.sh -T /run/shm/ -i - -d $OUTDIR -dp -t p -appl TIGRFAM-13.0,ProDom-2006.1,SMART-6.2,SignalP-EUK-4.0,PrositePatterns-20.97,PRINTS-42.0,SuperFamily-1.75,Gene3d-3.5.0,PfamA-27.0,PrositeProfiles-20.97,Phobius-1.01,TMHMM-2.0c,Coils-2.2 -f TSV"
cat -- $OUTDIR/* > $PROTEIN.interproscan
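
As a quick sanity check on the blastp output, note that `-outfmt '6 std qlen slen stitle btop'` produces tab-separated lines with 16 columns: the 12 standard columns plus qlen, slen, stitle and btop. The sketch below flags any line with a different column count. `check_blastp_columns` is a hypothetical helper name, and whether the import script itself enforces this layout is not stated here.

```shell
# Hypothetical helper: report lines that do not have exactly 16
# tab-separated columns. Exits 0 if every line is well formed, 1 otherwise.
# Reads the named file(s), or standard input if none are given.
check_blastp_columns() {
    awk -F '\t' '
        NF != 16 { printf "line %d: %d columns\n", NR, NF; bad = 1 }
        END      { exit bad }
    ' "$@"
}
```

For a compressed file, the output can be piped in, for example `gunzip -c proteins.blastp.gz | check_blastp_columns`, where the file name is a placeholder.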

Configuration options

[FILES]
    BLASTP =  [ BLASTP  http://download.lepbase.org/current/blastp/Operophtera_brumata_v1_-_proteins.fa.blastp.uniprot_sprot.1e-10.gz ]
    IPRSCAN = [ IPRSCAN http://download.lepbase.org/current/interproscan/Operophtera_brumata_v1_-_proteins.fa.interproscan.gz ]
    REPEATMASKER = [ REPEATMASKER http://download.lepbase.org/current/repeats/Operophtera_brumata_v1_-_scaffolds.fa.out.gz ]

Specify the (remote) locations of the BLASTP, IPRSCAN and REPEATMASKER files as appropriate.
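
Before running the imports it can be useful to list the URLs that are actually configured. The following is a rough sketch, assuming the ini layout shown above (a `[FILES]` header on its own line, followed by entries whose bracketed value contains an http or https URL). `list_file_urls` is a hypothetical helper, not part of the pipeline.

```shell
# Hypothetical helper: print every URL found in the [FILES] section
# of the given ini file, one per line.
list_file_urls() {
    awk '
        # A line holding only [NAME] starts a new section
        /^[ \t]*\[[A-Za-z_]+\][ \t]*$/ { in_files = ($0 ~ /\[FILES\]/); next }
        # Inside [FILES], print the first URL on each line
        in_files && match($0, /https?:\/\/[^] \t]+/) {
            print substr($0, RSTART, RLENGTH)
        }
    ' "$1"
}
```

The output can then be passed to a download tool to confirm each file is reachable, for example `list_file_urls my.ini | while read -r url; do curl -sfI "$url" > /dev/null || echo "unreachable: $url"; done`, where `my.ini` stands for whichever ini file holds the [FILES] section.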

[XREF]
    BLASTP = [ 2000 Uniprot/swissprot/TrEMBL UniProtKB/TrEMBL ]

Set the external db id for BLASTP. The final value in the array is used when links to the original data source are added to the description.