Processing exceptions

For .gff files that lack consistent sets of attributes, or that contain distinct classes of gene, it is often not possible to find a single mapping of feature attributes to [_STABLE_IDS], [_NAMES] and [_DESCRIPTIONS] that applies to all genes and transcripts in the file. Features for which a gene or transcript stable_id could not be set using the given .ini file are written to a new file with the suffix .exception.gff during the -p (Prepare the gff file for import) step.

This file can then be processed in a second run of the -p (Prepare the gff file for import) step by passing an additional .ini file to the script using the $INI environment variable. The original .ini can remain unchanged; only the settings to be altered need to be included in the second file.

Part of the original .ini file

...
[FILES]
  SCAFFOLD = [ fa scaffold.fa ]
  GFF = [ gff original.gff ]
[GENE_STABLE_IDS]
  GFF = [ gene->Name /(.+)/ ]
...

exception.ini

[FILES]
  GFF = [ gff original.gff.exception.gff ]
[GENE_STABLE_IDS]
  GFF = [ gene->ID /(.+)/ ]

docker run --rm \
           --name easy-import-operophtera_brumata_v1_core_32_85_1 \
           --link genomehubs-mysql \
           -v ~/demo/genomehubs-import/import/conf:/import/conf \
           -v ~/demo/genomehubs-import/import/data:/import/data \
           -e DATABASE=operophtera_brumata_v1_core_32_85_1 \
           -e FLAGS="-p" \
           -e INI=operophtera_brumata_v1_core_32_85_1.exception.ini \
           genomehubs/easy-import:latest

Exceptions are not bad; they simply reflect the heterogeneity of data in some .gff files. In some cases, deliberately choosing settings that will generate exceptions makes it possible to extract more data from the files than would otherwise be possible. In the example above, all genes will have ID attributes, but the Name attribute is likely to be more suitable for use as a stable_id where it is present. It is also likely that features with a Name attribute will carry additional attributes that could be used to set gene/transcript [_NAMES] and [_DESCRIPTIONS].
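The two-pass idea can be illustrated with a small Python sketch. This is not the easy-import implementation (which is not shown here); it is a simplified model, assuming GFF3-style `key=value` attributes and considering only gene lines. The function names, the rule tuples and the example gene IDs are all hypothetical and chosen for illustration only. The first pass tries gene->Name, and anything it cannot handle falls through as an exception to a second pass that uses gene->ID.

```python
import re

# Hypothetical rules mirroring the two .ini files above:
# (feature type, attribute, pattern with one capture group).
FIRST_PASS = ("gene", "Name", r"(.+)")
SECOND_PASS = ("gene", "ID", r"(.+)")


def parse_attributes(column9):
    """Parse a GFF3 attribute column such as 'ID=g1;Name=abc' into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def stable_id(line, rule):
    """Return the stable_id the rule assigns to a GFF line, or None."""
    feature_type, attribute, pattern = rule
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != feature_type:
        return None
    value = parse_attributes(fields[8]).get(attribute)
    if value is None:
        return None
    match = re.search(pattern, value)
    return match.group(1) if match else None


def split_pass(lines, rule):
    """Assign stable_ids where the rule matches; collect the rest as exceptions."""
    assigned, exceptions = {}, []
    for line in lines:
        sid = stable_id(line, rule)
        if sid is None:
            exceptions.append(line)
        else:
            assigned[sid] = line
    return assigned, exceptions


# Two made-up gene lines: only the first has a Name attribute.
GENES = [
    "scaf1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1;Name=geneA",
    "scaf1\tsrc\tgene\t1500\t2400\t.\t-\t.\tID=g2",
]

first, leftover = split_pass(GENES, FIRST_PASS)          # uses gene->Name
second, unresolved = split_pass(leftover, SECOND_PASS)   # uses gene->ID
print(sorted(first), sorted(second), len(unresolved))
```

In this model the gene with a Name attribute gets the more informative identifier in the first pass, while the gene without one is still recovered in the second pass using its ID, so no feature is lost.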