Index of Functions

CUBScout.EXAMPLE_DATA_PATHConstant
EXAMPLE_DATA_PATH

The path to an example dataset, stored as an artifact within the package. This is an .fna file containing coding sequences from Bacillus subtilis subsp. subtilis str. 168, NCBI Accession # NC_000964.3.

source
CUBScout.CodonDictType
CodonDict

The CodonDict type defines how codons are translated, and is essential for calculating codon usage bias as it identifies stop codons and each amino acid's degeneracy. A default codon dictionary is provided (default_codon_dict), or a user can make their own using the make_codon_dict function.

Fields

  • codons: the 64 codons, in alphabetical order
  • AA: corresponding amino acid for each codon (64 entries long)
  • AA_nostops: same as AA, but with stop codons removed
  • uniqueAA: unique amino acid names including stop codons. Under a standard translation table, this is 21 amino acids long
  • uniqueAA: same as uniqueAA, but with stop codons removed
  • uniqueI: a vector of the same length as uniqueAA, containing vectors of the indices of each codon for that amino acid. For instance, the first entry corresponds to Lysine, and contains the vector [1, 3], corresponding to the positions of codons AAA and AAG in the codons field
  • uniqueI_nostops: same as uniqueI, but with stop codons removed
  • deg: a vector of the same length as uniqueAA, containing the degeneracy for each amino acid.
  • deg_nostops: same as deg, but with stop codons removed
  • stop_mask: a Boolean vector of length 64 which is false for stop codons. This is used to remove stop codons when calculating codon usage bias.

Notes

Generally, CUBScout users shouldn't need to interact with the CodonDict type, as the standard genetic code is applied by default. Details for constructing a custom CodonDict are documented under the make_CodonDict function.

source
CUBScout.all_cubFunction
all_cub(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
all_cub(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)

Calculate all codon usage bias measures at once. Because many measures require the same initial calculations, this is more efficient than calculating them individually.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> all_cub_results = all_cub(EXAMPLE_DATA_PATH); # Calculate all six codon usage measures on example dataset

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> all_cub(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate all measures using ribosomal genes as a reference subset
source
CUBScout.bFunction
b(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
b(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)

Calculate B from Karlin and Mrazek, 1996.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • ref_seqs: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes, ref_seqs takes a named tuple in the form ("subset_name" = Bool[],), where Bool[] is the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, and CUBScout will return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets, ref_seqs should be a vector of named tuples corresponding to the vector of sequences.
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> result = b(EXAMPLE_DATA_PATH); # Calculate measure on example dataset

julia> result_300 = b(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length

julia> length(result.self)
3801

julia> length(result_300.self)
1650

julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
 0.209127
 0.328976
 0.223653
 0.539114
 0.249196

julia> b(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> b(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons

julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> b(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset

julia> b(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subset
source
CUBScout.caiFunction
cai(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
cai(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)

Calculate CAI from Sharp and Li, 1987.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • ref_vector: reference subset, which is required for cai. Bool[] the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. If providing multiple filepaths and want custom reference sets, ref_vectors should be a vector of vectors corresponding to the vector of filepaths.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> result = cai(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate CAI on example dataset

julia> round.(result.CAI[1:5], digits = 6)
5-element Vector{Float64}:
 0.844967
 0.88548
 0.817348
 1.072675
 0.834179

julia> cai(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> cai(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codons
source
CUBScout.codon_frequencyFunction
codon_frequency(codon_counts::Matrix{<:Integer}, form::String, dict::CodonDict = DEFAULT_CodonDict)

Calculate codon frequency from a matrix of codon counts. Accepts as its first argument a Matrix{<:Integer} which is a product of count_codons(). form can be one of four options: -net_genomic: Frequency of each codon across entire genome (matrix). -net_gene: Frequency of each codon within each gene (column). -byAA_genomic: Frequency of each codon within each amino acid across the entire genome (matrix). -byAA_gene: Frequency of each codon within each amino acid within each gene (column).

If using an alternative genetic code, a custom CodonDict can be provided.

Examples

julia> codon_counts = count_codons(EXAMPLE_DATA_PATH);

julia> count_matrix = codon_counts[1];

julia> codon_frequency(count_matrix, "net_genomic")[1:5]
5-element Vector{Float64}:
 0.04941242971710299
 0.017114892645228374
 0.021009352696846777
 0.022269444158755328
 0.022257296747490142

julia> codon_frequency(count_matrix, "net_genomic") |> size
(64, 1)

julia> codon_frequency(count_matrix, "net_gene") |> size
(64, 4237)
source
CUBScout.count_codonsFunction
count_codons(path::AbstractString, remove_start::Bool = false, threshold::Integer = 0)
count_codons(stream::IO, remove_start::Bool = false, threshold::Integer = 0)
count_codons(reader::FASTAReader, remove_start::Bool = false, threshold::Integer = 0)
count_codons(sequences::Vector{<:NucSeq}, names::Vector{String} = String[], remove_start::Bool = false, threshold::Integer = 0)
count_codons(sequence::NucSeq, remove_start::Bool = false, threshold::Integer = 0)

Read a fasta file or BioSequence and return the occurence of each codon for each gene or sequence.

Arguments

  • path or stream or reader or sequence(s): Fasta sequence to analyze. This can be a path to a fasta file of sequences, an IOStream, an open FASTAReader, or a BioSequences nucleotide sequence, or a vector of nucleotide sequences. Note that count_codons isn't identifying ORFs - make sure these are actual CDSs in frame.
  • remove_start: Whether to ignore the initial start codon
  • threshold: Minimum length of the sequence in codons to be returned in the results.

Output

If providing a single sequence, the result will be a 64x1 Matrix, which corresponds to the 64 codons in alphabetical order. If you want a list of the codons in alphabetical order, this is stored in CUBScout.DEFAULT_CodonDict.codons. If analyzing a fasta file or a vector of sequences, the result will be a tuple. The first element of the tuple is a 64xn matrix, where n = # of sequences above the threshold. The second element is a list of corresponding names for each column. The third element is a Boolean vector where true corresponds to sequences which did pass the threshold, and false is sequences which did not pass the threshold and so are not included in the results matrix. Names are pulled from fasta files and IO streams by default; if you would like to provide a vector of IDs or names when providing a Vector{<:NucSeq}, you can.

Examples

julia> using BioSequences: @dna_str

julia> example_dna = dna"ATGAAAATGAACTTTTGA"
18nt DNA Sequence:
ATGAAAATGAACTTTTGA

julia> count_codons(example_dna) |> first
1

julia> result = count_codons(EXAMPLE_DATA_PATH);

julia> first(result[1], 5)
5-element Vector{Int32}:
 32
  7
  6
 14
 11
source
CUBScout.eFunction
e(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
e(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)

Calculate E from Karlin and Mrazek, 1996.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • ref_vector: reference subset, which is required for e. Bool[] the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. If providing multiple filepaths and want custom reference sets, ref_vectors should be a vector of vectors corresponding to the vector of filepaths.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> result = e(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate E on example dataset

julia> round.(result.E[1:5], digits = 6)
5-element Vector{Float64}:
 0.762317
 1.025839
 0.875954
 0.986498
 1.111275

julia> e(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> e(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codons
source
CUBScout.encFunction
enc(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
enc(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)

Calculate ENC from Wright, 1990.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> result = enc(EXAMPLE_DATA_PATH); # Run ENC on example dataset

julia> round.(result.ENC[1:5], digits = 6)
5-element Vector{Float64}:
 56.787282
 52.725947
 59.287949
 52.296686
 55.262981

julia> result_300 = enc(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length

julia> length(result.ENC)
3801

julia> length(result_300.ENC)
1650

julia> enc(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> enc(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
source
CUBScout.enc_pFunction
enc_p(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
enc_p(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)

Calculate ENC' from Novembre, 2002.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • ref_seqs: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes, ref_seqs takes a named tuple in the form ("subset_name" = Bool[],), where Bool[] is the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, and CUBScout will return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets, ref_seqs should be a vector of named tuples corresponding to the vector of sequences.
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> result = enc_p(EXAMPLE_DATA_PATH); # Calculate measure on example dataset

julia> result_300 = enc_p(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length

julia> length(result.self)
3801

julia> length(result_300.self)
1650

julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
 61.0
 59.369798
 60.749462
 61.0
 61.0

julia> enc_p(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> enc_p(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons

julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> enc_p(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset

julia> enc_p(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subset
source
CUBScout.find_seqsMethod
find_seqs(path::AbstractString, match_pattern::Regex)
find_seqs(stream::IO, match_pattern::Regex)
find_seqs(reader::FASTAReader, match_pattern::Regex)

Read a fasta file at path (or reader or IO which points to a fasta file) and query the description field for a given Regex match_pattern. These results can be supplied in either the reference tuples (for codon usage bias functions) or reference vectors (for expressivity measures).

Examples

jldoctest julia> find_seqs(EXAMPLE_DATA_PATH, r"ribosomal")[1:5] 5-element Vector{Bool}: 0 0 0 0 0

source
CUBScout.fopFunction
fop(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
fop(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)

Calculate FOP from Ikemura, 1981.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • ref_vector: reference subset, which is required for fop. Bool[] the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. If providing multiple filepaths and want custom reference sets, ref_vectors should be a vector of vectors corresponding to the vector of filepaths.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> result = fop(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate CAI on example dataset

julia> round.(result.FOP[1:5], digits = 6)
5-element Vector{Float64}:
 0.567816
 0.566845
 0.509695
 0.725
 0.653784

julia> fop(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> fop(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codons
source
CUBScout.gcbFunction
gcb(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_vector = [], perc = 0.05, rm_start = false, rm_stop = false, threshold = 80)
gcb(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_vectors = [], perc = 0.05, rm_start = false, rm_stop = false, threshold = 80)

Calculate GCB from Merkl, 2003.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of BioSequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each fasta file or vector of BioSequences. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • ref_vector: optional reference subset; by default gcb begins calculations using all genes as a seed. If you want to provide a custom reference set, it should be a vector Bool[] the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. If providing multiple filepaths and want custom reference sets, ref_vectors should be a vector of vectors corresponding to the vector of filepaths.
  • perc: percentage of "top hits" which should be used as a reference set in the next iteration. By default set to 0.05.
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> result = gcb(EXAMPLE_DATA_PATH); # Calculate GCB on example dataset

julia> round.(result.GCB[1:5], digits = 6)
5-element Vector{Float64}:
 -0.058765
 -0.08659
 -0.005496
 -0.065659
 -0.032062

julia> ribo_result = gcb(EXAMPLE_DATA_PATH, ref_vector = ribosomal_genes); # Calculate GCB with ribosomal genes as reference seed example dataset

julia> round.(ribo_result.GCB[1:5], digits = 6)
5-element Vector{Float64}:
 -0.135615
 -0.036687
 -0.169136
 -0.186104
 -0.01653

julia> gcb(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> gcb(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
source
CUBScout.make_CodonDictFunction
make_CodonDict(filepath::AbstractString, delimiter::AbstractChar = '	')

Make a custom codon dictionary for organisms with non-standard genetic code. filepath points to a delimited file with two columns and no header. The first column should be codons, and the second column their corresponding amino acid. Avoid spaces and special characters (e.g., write GlutamicAcid instead of Glutamic Acid). Stop codons can be coded as Stop, stop, STOP, or *. If delimited using any character outside of tab, supply the delimiter as the second argument as Char, not a string (e.g. ',' not ","). make_CodonDict uses readdlm from DelimitedFiles; it's a good idea to check whether readdlm parses your file correctly before passing to make_CodonDict

Examples

julia> my_CodonDict = make_CodonDict(CUBScout.CodonDict_PATH)
CodonDict(BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}[AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC  …  TCG, TCT, TGA, TGC, TGG, TGT, TTA, TTC, TTG, TTT], ["Lysine", "Asparagine", "Lysine", "Asparagine", "Threonine", "Threonine", "Threonine", "Threonine", "Arginine", "Serine"  …  "Serine", "Serine", "Stop", "Cysteine", "Tryptophan", "Cysteine", "Leucine", "Phenylalanine", "Leucine", "Phenylalanine"], ["Lysine", "Asparagine", "Lysine", "Asparagine", "Threonine", "Threonine", "Threonine", "Threonine", "Arginine", "Serine"  …  "Serine", "Serine", "Serine", "Cysteine", "Tryptophan", "Cysteine", "Leucine", "Phenylalanine", "Leucine", "Phenylalanine"], ["Lysine", "Asparagine", "Threonine", "Arginine", "Serine", "Isoleucine", "Methionine", "Glutamine", "Histidine", "Proline"  …  "Glutamicacid", "Asparticacid", "Alanine", "Glycine", "Valine", "Stop", "Tyrosine", "Cysteine", "Tryptophan", "Phenylalanine"], ["Lysine", "Asparagine", "Threonine", "Arginine", "Serine", "Isoleucine", "Methionine", "Glutamine", "Histidine", "Proline", "Leucine", "Glutamicacid", "Asparticacid", "Alanine", "Glycine", "Valine", "Tyrosine", "Cysteine", "Tryptophan", "Phenylalanine"], Vector{Int32}[[1, 3], [2, 4], [5, 6, 7, 8], [9, 11, 25, 26, 27, 28], [10, 12, 53, 54, 55, 56], [13, 14, 16], [15], [17, 19], [18, 20], [21, 22, 23, 24]  …  [33, 35], [34, 36], [37, 38, 39, 40], [41, 42, 43, 44], [45, 46, 47, 48], [49, 51, 57], [50, 52], [58, 60], [59], [62, 64]], Vector{Int32}[[1, 3], [2, 4], [5, 6, 7, 8], [9, 11, 25, 26, 27, 28], [10, 12, 51, 52, 53, 54], [13, 14, 16], [15], [17, 19], [18, 20], [21, 22, 23, 24], [29, 30, 31, 32, 58, 60], [33, 35], [34, 36], [37, 38, 39, 40], [41, 42, 43, 44], [45, 46, 47, 48], [49, 50], [55, 57], [56], [59, 61]], Int32[2, 2, 4, 6, 6, 3, 1, 2, 2, 4  …  2, 2, 4, 4, 4, 3, 2, 2, 1, 2], Int32[2, 2, 4, 6, 6, 3, 1, 2, 2, 4, 6, 2, 2, 4, 4, 4, 2, 2, 1, 2], Bool[1, 1, 1, 1, 1, 1, 1, 1, 1, 1  …  1, 1, 0, 1, 1, 1, 1, 1, 1, 1])

julia> typeof(my_CodonDict)
CodonDict

julia> fieldnames(CodonDict)
(:codons, :AA, :AA_nostops, :uniqueAA, :uniqueAA_nostops, :uniqueI, :uniqueI_nostops, :deg, :deg_nostops, :stop_mask)
source
CUBScout.mcbFunction
mcb(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
mcb(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)

Calculate MCB from Urrutia and Hurst, 2001.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • ref_seqs: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes, ref_seqs takes a named tuple in the form ("subset_name" = Bool[],), where Bool[] is the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, and CUBScout will return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets, ref_seqs should be a vector of named tuples corresponding to the vector of sequences.
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> result = mcb(EXAMPLE_DATA_PATH); # Calculate measure on example dataset

julia> result_300 = mcb(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length

julia> length(result.self)
3801

julia> length(result_300.self)
1650

julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
 0.087211
 0.178337
 0.189682
 0.24012
 0.149869

julia> mcb(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> mcb(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons

julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> mcb(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset

julia> mcb(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subset
source
CUBScout.melpFunction
melp(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
melp(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)

Calculate MELP from Supek and Vlahovicek, 2005.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • ref_vector: reference subset, which is required for melp. Bool[] the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. If providing multiple filepaths and want custom reference sets, ref_vectors should be a vector of vectors corresponding to the vector of filepaths.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> result = melp(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate MELP on example dataset

julia> round.(result.MELP[1:5], digits = 6)
5-element Vector{Float64}:
 0.929414
 1.007671
 0.922357
 0.951239
 1.029531

julia> melp(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> melp(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codons
source
CUBScout.milcFunction
milc(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
milc(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)

Calculate MILC from Supek and Vlahovicek, 2005.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • ref_seqs: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes, ref_seqs takes a named tuple in the form ("subset_name" = Bool[],), where Bool[] is the same length as the number of sequences in your fasta file, and contains true for sequences you want as your reference subset and false for those you don't. You can use find_seqs() to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, and CUBScout will return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets, ref_seqs should be a vector of named tuples corresponding to the vector of sequences.
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> result = milc(EXAMPLE_DATA_PATH); # Calculate measure on example dataset

julia> result_300 = milc(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length

julia> length(result.self)
3801

julia> length(result_300.self)
1650

julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
 0.494826
 0.583944
 0.499472
 0.635493
 0.543935

julia> milc(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> milc(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons

julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes

julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes

julia> milc(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset

julia> milc(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subset
source
CUBScout.scuoFunction
scuo(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
scuo(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)

Calculate SCUO from Wan et al., 2004.

Arguments

  • sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences), sequences could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome. CUBScout is multithreaded; if there are multiple threads available, CUBScout will allocate a thread for each filepath. As such, providing a vector of filepaths (or Vector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.
  • dict: codon dictionary of type CodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary using make_CodonDict
  • names: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. If sequences is of type Vector{<:Vector{<:NucSeq}}, names should be of type Vector{Vector{String}}
  • rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default, CUBScout keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you set rm_start to true, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supply ALTSTART_CodonDict to the dict argument, and keep rm_start as false.
  • rm_stop: whether to remove stop codons from calculations of codon usage bias.
  • threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, set threshold to 0.

Examples

julia> result = scuo(EXAMPLE_DATA_PATH); # Run SCUO on example dataset

julia> round.(result.SCUO[1:5], digits = 6)
5-element Vector{Float64}:
 0.143121
 0.191237
 0.096324
 0.345211
 0.105744

julia> result_300 = scuo(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length

julia> length(result.SCUO)
3801

julia> length(result_300.SCUO)
1650

julia> scuo(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine

julia> scuo(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
source
CUBScout.seq_descriptionsMethod
seq_descriptions(path::AbstractString)
seq_descriptions(reader::FASTAReader)
seq_descriptions(stream::IO)

Read a fasta file at path and return the description fields. Just adds convenience on top of FASTX functions.

Examples

julia> seq_descr = seq_descriptions(EXAMPLE_DATA_PATH);

julia> seq_descr[1]
"lcl|NC_000964.3_cds_NP_387882.1_1 [gene=dnaA] [locus_tag=BSU_00010] [db_xref=EnsemblGenomes-Gn:BSU00010,EnsemblGenomes-Tr:CAB11777,GOA:P05648,InterPro:IPR001957,InterPro:IPR003593,InterPro:IPR010921,InterPro:IPR013159,InterPro:IPR013317,InterPro:IPR018312,InterPro:IPR020591,InterPro:IPR024633,InterPro:IPR027417,PDB:4TPS,SubtiList:BG10065,UniProtKB/Swiss-Prot:P05648] [protein=chromosomal replication initiator informational ATPase] [protein_id=NP_387882.1] [location=410..1750] [gbkey=CDS]"
source
CUBScout.seq_namesMethod
seq_names(path::AbstractString)
seq_names(reader::FASTAReader)
seq_names(stream::IO)

Read a fasta file at path and return the name fields. Just adds convenience on top of FASTX functions.

Examples

julia> seq_name_vector = seq_names(EXAMPLE_DATA_PATH);

julia> seq_name_vector[1]
"lcl|NC_000964.3_cds_NP_387882.1_1"
source