Step 2.1: Fetch/summarise assembly/annotation files

📘

Optional?

This step may be considered optional, as the files will be retrieved in subsequent stages if a local copy is not already present in the working directory. However, it is useful to have a local copy of the files, together with the summary statistics generated by this step, when deciding how to process the .gff file and which information to assign to stable_ids, synonyms and descriptions in subsequent steps.

Sequence, annotation and other files can be retrieved from a variety of locations using wget, scp or cp, as appropriate to the location. Compressed files are unzipped automatically. Because retrieval and decompression are handled for you, the original file locations can be stored in the .ini file.
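As a rough illustration of how a location might map to one of these tools, the sketch below classifies a location string by its form. The function name `choose_fetch_tool` and the matching rules are assumptions made for this example; they are not taken from summarise_files.pl, whose own logic may differ.

```shell
# Hypothetical helper: guess which tool suits a given file location.
# Illustration only; this is not the logic used by summarise_files.pl.
choose_fetch_tool() {
  case "$1" in
    http://*|https://*|ftp://*) echo wget ;;  # remote URL
    /*|./*|../*)                echo cp ;;    # explicit local path
    *:*)                        echo scp ;;   # host:path or user@host:path
    *)                          echo cp ;;    # bare relative file name
  esac
}

choose_fetch_tool http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz  # prints wget
choose_fetch_tool /data/Obru1.fsa.gz                                               # prints cp
```

URLs are tested first because they also contain a colon and would otherwise match the scp pattern.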

🚧

Working directory

All scripts in stage 2 assume that data files are present in the current working directory. It is therefore important to cd to the directory in which you want local copies of the files to be created, and to refer to the scripts and config files by relative or absolute path.

mkdir ~/import
cd ~/import
perl ../ei/core/summarise_files.pl ../ei/conf/core-import.ini

summarise_files.pl will create a summary subdirectory in the current working directory containing a summary of the attributes associated with each feature type in the .gff. This is useful when setting options in Step 2.3: Prepare the gff file for import, either to retrieve information from particular attributes or to fix broken files.

Configuration options

Only the [FILES] stanza of core-import.ini is used at this stage.

[FILES]
  SCAFFOLD = [ fa http://www.bioinformatics.nl/wintermoth/data_files/Obru1.fsa.gz ]
  GFF = [ gff3 http://www.bioinformatics.nl/wintermoth/data_files/Obru_genes.gff.gz ]
  PROTEIN = [ fa http://www.bioinformatics.nl/wintermoth/data_files/ObruPep.fasta.gz ]
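
To check which local copies to expect after this step, the [FILES] stanza can be read with a few lines of shell. This is a minimal sketch: the function name `list_expected_files` is hypothetical, it assumes entries are whitespace-separated in the form `KEY = [ type location ]` as in the example above, and it assumes the local copy is named after the last path component with any .gz suffix removed.

```shell
# Hypothetical helper: for each entry in the [FILES] stanza, print the key,
# the file type and the assumed local file name.
# Usage: list_expected_files ../ei/conf/core-import.ini
list_expected_files() {
  awk '
    # A line starting with "[" opens a new stanza; note whether it is [FILES].
    /^\[/ { in_files = ($1 == "[FILES]"); next }
    # Entries look like: KEY = [ type location ]
    in_files && $2 == "=" && $3 == "[" {
      name = $5
      sub(/.*\//, "", name)    # drop the directory or URL prefix
      sub(/\.gz$/, "", name)   # compressed files are unzipped on retrieval
      print $1, $4, name
    }
  ' "$1"
}
```

Run against the stanza above, this would list Obru1.fsa, Obru_genes.gff and ObruPep.fasta alongside their keys and types, which can then be compared with the contents of the working directory.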