Providing Inputs to CUBScout
Genomes and CDSs
First, CUBScout
only works with coding sequences. CUBScout
does not identify ORFs, pause at stop codons, or parse non-nucleotide characters. It is assumed the coding sequences you provide are in-frame and don't contain 5' or 3' untranslated regions. Codons which have non-specific nucleotides, like "W", are skipped. Sequences with characters outside of those recognized by BioSequences will throw an error.
Some CUBScout
functions, like count_codons
, are meaningful when applied to a single nucleotide sequence. However, most CUBScout
functions are designed to work at the genome-level, and calculate metrics that rely on comparisons between multiple genes. Specifically, none of the codon usage bias or expressivity functions accept a single nucleotide sequence; all expect to operate across a set of sequences, whether in a fasta file or vector of BioSequences.
FASTA Files
Most functions in CUBScout
accept any FASTA-formatted file (e.g. .fa, .fna, .fasta) where each entry corresponds to coding sequences or open readings frames. CUBScout
accepts either a String
which is the complete filepath to a fast-formatted file, or objects of type FASTAReader
or IO
which point to a fasta-formatted file. There is no significant performance advantage between these three options, unless you already have an IOStream
or FASTAReader
open for another purpose.
BioSequences
CUBScout
functions also accept nucleotide sequences from BioSequences (<:NucSeq
). Keep in mind that most CUBScout
functions are designed to operate across genomes, and so accept a vector of nucleotide sequences. The vector corresponds to a genome, with each DNA or RNA string corresponding to a coding sequence.
While there is a slight performance advantage in CUBScout
functions when supplying BioSequences as an input rather than a filepath, supplying filepaths will still be faster than the cumulative time spent reading in a BioSequence and then running a CUBScout
function. This will also use less memory and so is generally recommended, unless you already have BioSequences loaded into Julia's environment for a separate reason.