Index of Functions
CUBScout.EXAMPLE_DATA_PATHCUBScout.CodonDictCUBScout.all_cubCUBScout.bCUBScout.caiCUBScout.codon_frequencyCUBScout.count_codonsCUBScout.eCUBScout.encCUBScout.enc_pCUBScout.find_seqsCUBScout.fopCUBScout.gcbCUBScout.make_CodonDictCUBScout.mcbCUBScout.melpCUBScout.milcCUBScout.scuoCUBScout.seq_descriptionsCUBScout.seq_names
CUBScout.EXAMPLE_DATA_PATH — ConstantEXAMPLE_DATA_PATHThe path to an example dataset, stored as an artifact within the package. This is an .fna file containing coding sequences from Bacillus subtilis subsp. subtilis str. 168, NCBI Accession # NC_000964.3.
CUBScout.CodonDict — TypeCodonDictThe CodonDict type defines how codons are translated, and is essential for calculating codon usage bias as it identifies stop codons and each amino acid's degeneracy. A default codon dictionary is provided (default_codon_dict), or a user can make their own using the make_codon_dict function.
Fields
codons: the 64 codons, in alphabetical orderAA: corresponding amino acid for each codon (64 entries long)AA_nostops: same as AA, but with stop codons removeduniqueAA: unique amino acid names including stop codons. Under a standard translation table, this is 21 amino acids longuniqueAA: same as uniqueAA, but with stop codons removeduniqueI: a vector of the same length as uniqueAA, containing vectors of the indices of each codon for that amino acid. For instance, the first entry corresponds to Lysine, and contains the vector[1, 3], corresponding to the positions of codons AAA and AAG in the codons fielduniqueI_nostops: same as uniqueI, but with stop codons removeddeg: a vector of the same length as uniqueAA, containing the degeneracy for each amino acid.deg_nostops: same as deg, but with stop codons removedstop_mask: a Boolean vector of length 64 which is false for stop codons. This is used to remove stop codons when calculating codon usage bias.
Notes
Generally, CUBScout users shouldn't need to interact with the CodonDict type, as the standard genetic code is applied by default. Details for constructing a custom CodonDict are documented under the make_CodonDict function.
CUBScout.all_cub — Functionall_cub(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
all_cub(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)Calculate all codon usage bias measures at once. Because many measures require the same initial calculations, this is more efficient than calculating them individually.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> all_cub_results = all_cub(EXAMPLE_DATA_PATH); # Calculate all six codon usage measures on example dataset
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> all_cub(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate all measures using ribosomal genes as a reference subsetCUBScout.b — Functionb(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
b(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)Calculate B from Karlin and Mrazek, 1996.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}ref_seqs: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes,ref_seqstakes a named tuple in the form("subset_name" = Bool[],), whereBool[]is the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, andCUBScoutwill return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets,ref_seqsshould be a vector of named tuples corresponding to the vector of sequences.rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> result = b(EXAMPLE_DATA_PATH); # Calculate measure on example dataset
julia> result_300 = b(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.self)
3801
julia> length(result_300.self)
1650
julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
0.209127
0.328976
0.223653
0.539114
0.249196
julia> b(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> b(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> b(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset
julia> b(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subsetCUBScout.cai — Functioncai(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
cai(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)Calculate CAI from Sharp and Li, 1987.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.ref_vector: reference subset, which is required forcai.Bool[]the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectorsshould be a vector of vectors corresponding to the vector of filepaths.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = cai(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate CAI on example dataset
julia> round.(result.CAI[1:5], digits = 6)
5-element Vector{Float64}:
0.844967
0.88548
0.817348
1.072675
0.834179
julia> cai(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> cai(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codonsCUBScout.codon_frequency — Functioncodon_frequency(codon_counts::Matrix{<:Integer}, form::String, dict::CodonDict = DEFAULT_CodonDict)Calculate codon frequency from a matrix of codon counts. Accepts as its first argument a Matrix{<:Integer} which is a product of count_codons(). form can be one of four options: -net_genomic: Frequency of each codon across entire genome (matrix). -net_gene: Frequency of each codon within each gene (column). -byAA_genomic: Frequency of each codon within each amino acid across the entire genome (matrix). -byAA_gene: Frequency of each codon within each amino acid within each gene (column).
If using an alternative genetic code, a custom CodonDict can be provided.
Examples
julia> codon_counts = count_codons(EXAMPLE_DATA_PATH);
julia> count_matrix = codon_counts[1];
julia> codon_frequency(count_matrix, "net_genomic")[1:5]
5-element Vector{Float64}:
0.04941242971710299
0.017114892645228374
0.021009352696846777
0.022269444158755328
0.022257296747490142
julia> codon_frequency(count_matrix, "net_genomic") |> size
(64, 1)
julia> codon_frequency(count_matrix, "net_gene") |> size
(64, 4237)CUBScout.count_codons — Functioncount_codons(path::AbstractString, remove_start::Bool = false, threshold::Integer = 0)
count_codons(stream::IO, remove_start::Bool = false, threshold::Integer = 0)
count_codons(reader::FASTAReader, remove_start::Bool = false, threshold::Integer = 0)
count_codons(sequences::Vector{<:NucSeq}, names::Vector{String} = String[], remove_start::Bool = false, threshold::Integer = 0)
count_codons(sequence::NucSeq, remove_start::Bool = false, threshold::Integer = 0)Read a fasta file or BioSequence and return the occurence of each codon for each gene or sequence.
Arguments
pathorstreamorreaderorsequence(s): Fasta sequence to analyze. This can be a path to a fasta file of sequences, an IOStream, an open FASTAReader, or a BioSequences nucleotide sequence, or a vector of nucleotide sequences. Note that count_codons isn't identifying ORFs - make sure these are actual CDSs in frame.remove_start: Whether to ignore the initial start codonthreshold: Minimum length of the sequence in codons to be returned in the results.
Output
If providing a single sequence, the result will be a 64x1 Matrix, which corresponds to the 64 codons in alphabetical order. If you want a list of the codons in alphabetical order, this is stored in CUBScout.DEFAULT_CodonDict.codons. If analyzing a fasta file or a vector of sequences, the result will be a tuple. The first element of the tuple is a 64xn matrix, where n = # of sequences above the threshold. The second element is a list of corresponding names for each column. The third element is a Boolean vector where true corresponds to sequences which did pass the threshold, and false is sequences which did not pass the threshold and so are not included in the results matrix. Names are pulled from fasta files and IO streams by default; if you would like to provide a vector of IDs or names when providing a Vector{<:NucSeq}, you can.
Examples
julia> using BioSequences: @dna_str
julia> example_dna = dna"ATGAAAATGAACTTTTGA"
18nt DNA Sequence:
ATGAAAATGAACTTTTGA
julia> count_codons(example_dna) |> first
1
julia> result = count_codons(EXAMPLE_DATA_PATH);
julia> first(result[1], 5)
5-element Vector{Int32}:
32
7
6
14
11CUBScout.e — Functione(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
e(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)Calculate E from Karlin and Mrazek, 1996.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.ref_vector: reference subset, which is required fore.Bool[]the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectorsshould be a vector of vectors corresponding to the vector of filepaths.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = e(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate E on example dataset
julia> round.(result.E[1:5], digits = 6)
5-element Vector{Float64}:
0.762317
1.025839
0.875954
0.986498
1.111275
julia> e(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> e(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codonsCUBScout.enc — Functionenc(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
enc(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)Calculate ENC from Wright, 1990.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> result = enc(EXAMPLE_DATA_PATH); # Run ENC on example dataset
julia> round.(result.ENC[1:5], digits = 6)
5-element Vector{Float64}:
56.787282
52.725947
59.287949
52.296686
55.262981
julia> result_300 = enc(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.ENC)
3801
julia> length(result_300.ENC)
1650
julia> enc(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> enc(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codonsCUBScout.enc_p — Functionenc_p(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
enc_p(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)Calculate ENC' from Novembre, 2002.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}ref_seqs: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes,ref_seqstakes a named tuple in the form("subset_name" = Bool[],), whereBool[]is the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, andCUBScoutwill return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets,ref_seqsshould be a vector of named tuples corresponding to the vector of sequences.rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> result = enc_p(EXAMPLE_DATA_PATH); # Calculate measure on example dataset
julia> result_300 = enc_p(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.self)
3801
julia> length(result_300.self)
1650
julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
61.0
59.369798
60.749462
61.0
61.0
julia> enc_p(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> enc_p(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> enc_p(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset
julia> enc_p(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subsetCUBScout.find_seqs — Methodfind_seqs(path::AbstractString, match_pattern::Regex)
find_seqs(stream::IO, match_pattern::Regex)
find_seqs(reader::FASTAReader, match_pattern::Regex)Read a fasta file at path (or reader or IO which points to a fasta file) and query the description field for a given Regex match_pattern. These results can be supplied in either the reference tuples (for codon usage bias functions) or reference vectors (for expressivity measures).
Examples
jldoctest julia> find_seqs(EXAMPLE_DATA_PATH, r"ribosomal")[1:5] 5-element Vector{Bool}: 0 0 0 0 0
CUBScout.fop — Functionfop(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
fop(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)Calculate FOP from Ikemura, 1981.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.ref_vector: reference subset, which is required forfop.Bool[]the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectorsshould be a vector of vectors corresponding to the vector of filepaths.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = fop(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate CAI on example dataset
julia> round.(result.FOP[1:5], digits = 6)
5-element Vector{Float64}:
0.567816
0.566845
0.509695
0.725
0.653784
julia> fop(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> fop(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codonsCUBScout.gcb — Functiongcb(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_vector = [], perc = 0.05, rm_start = false, rm_stop = false, threshold = 80)
gcb(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_vectors = [], perc = 0.05, rm_start = false, rm_stop = false, threshold = 80)Calculate GCB from Merkl, 2003.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of BioSequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each fasta file or vector of BioSequences. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}ref_vector: optional reference subset; by default gcb begins calculations using all genes as a seed. If you want to provide a custom reference set, it should be a vectorBool[]the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectorsshould be a vector of vectors corresponding to the vector of filepaths.perc: percentage of "top hits" which should be used as a reference set in the next iteration. By default set to 0.05.rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = gcb(EXAMPLE_DATA_PATH); # Calculate GCB on example dataset
julia> round.(result.GCB[1:5], digits = 6)
5-element Vector{Float64}:
-0.058765
-0.08659
-0.005496
-0.065659
-0.032062
julia> ribo_result = gcb(EXAMPLE_DATA_PATH, ref_vector = ribosomal_genes); # Calculate GCB with ribosomal genes as reference seed example dataset
julia> round.(ribo_result.GCB[1:5], digits = 6)
5-element Vector{Float64}:
-0.135615
-0.036687
-0.169136
-0.186104
-0.01653
julia> gcb(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> gcb(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codonsCUBScout.make_CodonDict — Functionmake_CodonDict(filepath::AbstractString, delimiter::AbstractChar = ' ')Make a custom codon dictionary for organisms with non-standard genetic code. filepath points to a delimited file with two columns and no header. The first column should be codons, and the second column their corresponding amino acid. Avoid spaces and special characters (e.g., write GlutamicAcid instead of Glutamic Acid). Stop codons can be coded as Stop, stop, STOP, or *. If delimited using any character outside of tab, supply the delimiter as the second argument as Char, not a string (e.g. ',' not ","). make_CodonDict uses readdlm from DelimitedFiles; it's a good idea to check whether readdlm parses your file correctly before passing to make_CodonDict
Examples
julia> my_CodonDict = make_CodonDict(CUBScout.CodonDict_PATH)
CodonDict(BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}[AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC … TCG, TCT, TGA, TGC, TGG, TGT, TTA, TTC, TTG, TTT], ["Lysine", "Asparagine", "Lysine", "Asparagine", "Threonine", "Threonine", "Threonine", "Threonine", "Arginine", "Serine" … "Serine", "Serine", "Stop", "Cysteine", "Tryptophan", "Cysteine", "Leucine", "Phenylalanine", "Leucine", "Phenylalanine"], ["Lysine", "Asparagine", "Lysine", "Asparagine", "Threonine", "Threonine", "Threonine", "Threonine", "Arginine", "Serine" … "Serine", "Serine", "Serine", "Cysteine", "Tryptophan", "Cysteine", "Leucine", "Phenylalanine", "Leucine", "Phenylalanine"], ["Lysine", "Asparagine", "Threonine", "Arginine", "Serine", "Isoleucine", "Methionine", "Glutamine", "Histidine", "Proline" … "Glutamicacid", "Asparticacid", "Alanine", "Glycine", "Valine", "Stop", "Tyrosine", "Cysteine", "Tryptophan", "Phenylalanine"], ["Lysine", "Asparagine", "Threonine", "Arginine", "Serine", "Isoleucine", "Methionine", "Glutamine", "Histidine", "Proline", "Leucine", "Glutamicacid", "Asparticacid", "Alanine", "Glycine", "Valine", "Tyrosine", "Cysteine", "Tryptophan", "Phenylalanine"], Vector{Int32}[[1, 3], [2, 4], [5, 6, 7, 8], [9, 11, 25, 26, 27, 28], [10, 12, 53, 54, 55, 56], [13, 14, 16], [15], [17, 19], [18, 20], [21, 22, 23, 24] … [33, 35], [34, 36], [37, 38, 39, 40], [41, 42, 43, 44], [45, 46, 47, 48], [49, 51, 57], [50, 52], [58, 60], [59], [62, 64]], Vector{Int32}[[1, 3], [2, 4], [5, 6, 7, 8], [9, 11, 25, 26, 27, 28], [10, 12, 51, 52, 53, 54], [13, 14, 16], [15], [17, 19], [18, 20], [21, 22, 23, 24], [29, 30, 31, 32, 58, 60], [33, 35], [34, 36], [37, 38, 39, 40], [41, 42, 43, 44], [45, 46, 47, 48], [49, 50], [55, 57], [56], [59, 61]], Int32[2, 2, 4, 6, 6, 3, 1, 2, 2, 4 … 2, 2, 4, 4, 4, 3, 2, 2, 1, 2], Int32[2, 2, 4, 6, 6, 3, 1, 2, 2, 4, 6, 2, 2, 4, 4, 4, 2, 2, 1, 2], Bool[1, 1, 1, 1, 1, 1, 1, 1, 1, 1 … 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
julia> typeof(my_CodonDict)
CodonDict
julia> fieldnames(CodonDict)
(:codons, :AA, :AA_nostops, :uniqueAA, :uniqueAA_nostops, :uniqueI, :uniqueI_nostops, :deg, :deg_nostops, :stop_mask)CUBScout.mcb — Functionmcb(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
mcb(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)Calculate MCB from Urrutia and Hurst, 2001.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}ref_seqs: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes,ref_seqstakes a named tuple in the form("subset_name" = Bool[],), whereBool[]is the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, andCUBScoutwill return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets,ref_seqsshould be a vector of named tuples corresponding to the vector of sequences.rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> result = mcb(EXAMPLE_DATA_PATH); # Calculate measure on example dataset
julia> result_300 = mcb(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.self)
3801
julia> length(result_300.self)
1650
julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
0.087211
0.178337
0.189682
0.24012
0.149869
julia> mcb(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> mcb(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> mcb(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset
julia> mcb(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subsetCUBScout.melp — Functionmelp(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
melp(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)Calculate MELP from Supek and Vlahovicek, 2005.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.ref_vector: reference subset, which is required formelp.Bool[]the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectorsshould be a vector of vectors corresponding to the vector of filepaths.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = melp(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate MELP on example dataset
julia> round.(result.MELP[1:5], digits = 6)
5-element Vector{Float64}:
0.929414
1.007671
0.922357
0.951239
1.029531
julia> melp(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> melp(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codonsCUBScout.milc — Functionmilc(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
milc(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)Calculate MILC from Supek and Vlahovicek, 2005.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}ref_seqs: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes,ref_seqstakes a named tuple in the form("subset_name" = Bool[],), whereBool[]is the same length as the number of sequences in your fasta file, and containstruefor sequences you want as your reference subset and false for those you don't. You can usefind_seqs()to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, andCUBScoutwill return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets,ref_seqsshould be a vector of named tuples corresponding to the vector of sequences.rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> result = milc(EXAMPLE_DATA_PATH); # Calculate measure on example dataset
julia> result_300 = milc(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.self)
3801
julia> length(result_300.self)
1650
julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
0.494826
0.583944
0.499472
0.635493
0.543935
julia> milc(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> milc(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> milc(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset
julia> milc(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subsetCUBScout.scuo — Functionscuo(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
scuo(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)Calculate SCUO from Wan et al., 2004.
Arguments
sequences: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequencescould instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScoutis multithreaded; if there are multiple threads available,CUBScoutwill allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict: codon dictionary of typeCodonDict. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDictnames: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequencesis of typeVector{<:Vector{<:NucSeq}},namesshould be of typeVector{Vector{String}}rm_start: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScoutkeeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_starttotrue, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDictto thedictargument, and keeprm_startasfalse.rm_stop: whether to remove stop codons from calculations of codon usage bias.threshold: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthresholdto 0.
Examples
julia> result = scuo(EXAMPLE_DATA_PATH); # Run SCUO on example dataset
julia> round.(result.SCUO[1:5], digits = 6)
5-element Vector{Float64}:
0.143121
0.191237
0.096324
0.345211
0.105744
julia> result_300 = scuo(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.SCUO)
3801
julia> length(result_300.SCUO)
1650
julia> scuo(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> scuo(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codonsCUBScout.seq_descriptions — Methodseq_descriptions(path::AbstractString)
seq_descriptions(reader::FASTAReader)
seq_descriptions(stream::IO)Read a fasta file at path and return the description fields. Just adds convenience on top of FASTX functions.
Examples
julia> seq_descr = seq_descriptions(EXAMPLE_DATA_PATH);
julia> seq_descr[1]
"lcl|NC_000964.3_cds_NP_387882.1_1 [gene=dnaA] [locus_tag=BSU_00010] [db_xref=EnsemblGenomes-Gn:BSU00010,EnsemblGenomes-Tr:CAB11777,GOA:P05648,InterPro:IPR001957,InterPro:IPR003593,InterPro:IPR010921,InterPro:IPR013159,InterPro:IPR013317,InterPro:IPR018312,InterPro:IPR020591,InterPro:IPR024633,InterPro:IPR027417,PDB:4TPS,SubtiList:BG10065,UniProtKB/Swiss-Prot:P05648] [protein=chromosomal replication initiator informational ATPase] [protein_id=NP_387882.1] [location=410..1750] [gbkey=CDS]"CUBScout.seq_names — Methodseq_names(path::AbstractString)
seq_names(reader::FASTAReader)
seq_names(stream::IO)Read a fasta file at path and return the name fields. Just adds convenience on top of FASTX functions.
Examples
julia> seq_name_vector = seq_names(EXAMPLE_DATA_PATH);
julia> seq_name_vector[1]
"lcl|NC_000964.3_cds_NP_387882.1_1"