Index of Functions
CUBScout.EXAMPLE_DATA_PATH
CUBScout.CodonDict
CUBScout.all_cub
CUBScout.b
CUBScout.cai
CUBScout.codon_frequency
CUBScout.count_codons
CUBScout.e
CUBScout.enc
CUBScout.enc_p
CUBScout.find_seqs
CUBScout.fop
CUBScout.gcb
CUBScout.make_CodonDict
CUBScout.mcb
CUBScout.melp
CUBScout.milc
CUBScout.scuo
CUBScout.seq_descriptions
CUBScout.seq_names
CUBScout.EXAMPLE_DATA_PATH
— ConstantEXAMPLE_DATA_PATH
The path to an example dataset, stored as an artifact within the package. This is an .fna file containing coding sequences from Bacillus subtilis subsp. subtilis str. 168, NCBI Accession # NC_000964.3.
CUBScout.CodonDict
— TypeCodonDict
The CodonDict
type defines how codons are translated, and is essential for calculating codon usage bias as it identifies stop codons and each amino acid's degeneracy. A default codon dictionary is provided (default_codon_dict
), or a user can make their own using the make_codon_dict
function.
Fields
codons
: the 64 codons, in alphabetical orderAA
: corresponding amino acid for each codon (64 entries long)AA_nostops
: same as AA, but with stop codons removeduniqueAA
: unique amino acid names including stop codons. Under a standard translation table, this is 21 amino acids longuniqueAA
: same as uniqueAA, but with stop codons removeduniqueI
: a vector of the same length as uniqueAA, containing vectors of the indices of each codon for that amino acid. For instance, the first entry corresponds to Lysine, and contains the vector[1, 3]
, corresponding to the positions of codons AAA and AAG in the codons fielduniqueI_nostops
: same as uniqueI, but with stop codons removeddeg
: a vector of the same length as uniqueAA, containing the degeneracy for each amino acid.deg_nostops
: same as deg, but with stop codons removedstop_mask
: a Boolean vector of length 64 which is false for stop codons. This is used to remove stop codons when calculating codon usage bias.
Notes
Generally, CUBScout users shouldn't need to interact with the CodonDict
type, as the standard genetic code is applied by default. Details for constructing a custom CodonDict
are documented under the make_CodonDict
function.
CUBScout.all_cub
— Functionall_cub(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
all_cub(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
Calculate all codon usage bias measures at once. Because many measures require the same initial calculations, this is more efficient than calculating them individually.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> all_cub_results = all_cub(EXAMPLE_DATA_PATH); # Calculate all six codon usage measures on example dataset
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> all_cub(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate all measures using ribosomal genes as a reference subset
CUBScout.b
— Functionb(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
b(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
Calculate B from Karlin and Mrazek, 1996.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
ref_seqs
: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes,ref_seqs
takes a named tuple in the form("subset_name" = Bool[],)
, whereBool[]
is the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, andCUBScout
will return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets,ref_seqs
should be a vector of named tuples corresponding to the vector of sequences.rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> result = b(EXAMPLE_DATA_PATH); # Calculate measure on example dataset
julia> result_300 = b(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.self)
3801
julia> length(result_300.self)
1650
julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
0.209127
0.328976
0.223653
0.539114
0.249196
julia> b(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> b(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> b(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset
julia> b(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subset
CUBScout.cai
— Functioncai(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
cai(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
Calculate CAI from Sharp and Li, 1987.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.ref_vector
: reference subset, which is required forcai
.Bool[]
the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectors
should be a vector of vectors corresponding to the vector of filepaths.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = cai(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate CAI on example dataset
julia> round.(result.CAI[1:5], digits = 6)
5-element Vector{Float64}:
0.844967
0.88548
0.817348
1.072675
0.834179
julia> cai(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> cai(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codons
CUBScout.codon_frequency
— Functioncodon_frequency(codon_counts::Matrix{<:Integer}, form::String, dict::CodonDict = DEFAULT_CodonDict)
Calculate codon frequency from a matrix of codon counts. Accepts as its first argument a Matrix{<:Integer}
which is a product of count_codons()
. form
can be one of four options: -net_genomic
: Frequency of each codon across entire genome (matrix). -net_gene
: Frequency of each codon within each gene (column). -byAA_genomic
: Frequency of each codon within each amino acid across the entire genome (matrix). -byAA_gene
: Frequency of each codon within each amino acid within each gene (column).
If using an alternative genetic code, a custom CodonDict
can be provided.
Examples
julia> codon_counts = count_codons(EXAMPLE_DATA_PATH);
julia> count_matrix = codon_counts[1];
julia> codon_frequency(count_matrix, "net_genomic")[1:5]
5-element Vector{Float64}:
0.04941242971710299
0.017114892645228374
0.021009352696846777
0.022269444158755328
0.022257296747490142
julia> codon_frequency(count_matrix, "net_genomic") |> size
(64, 1)
julia> codon_frequency(count_matrix, "net_gene") |> size
(64, 4237)
CUBScout.count_codons
— Functioncount_codons(path::AbstractString, remove_start::Bool = false, threshold::Integer = 0)
count_codons(stream::IO, remove_start::Bool = false, threshold::Integer = 0)
count_codons(reader::FASTAReader, remove_start::Bool = false, threshold::Integer = 0)
count_codons(sequences::Vector{<:NucSeq}, names::Vector{String} = String[], remove_start::Bool = false, threshold::Integer = 0)
count_codons(sequence::NucSeq, remove_start::Bool = false, threshold::Integer = 0)
Read a fasta file or BioSequence and return the occurence of each codon for each gene or sequence.
Arguments
path
orstream
orreader
orsequence(s)
: Fasta sequence to analyze. This can be a path to a fasta file of sequences, an IOStream, an open FASTAReader, or a BioSequences nucleotide sequence, or a vector of nucleotide sequences. Note that count_codons isn't identifying ORFs - make sure these are actual CDSs in frame.remove_start
: Whether to ignore the initial start codonthreshold
: Minimum length of the sequence in codons to be returned in the results.
Output
If providing a single sequence, the result will be a 64x1 Matrix, which corresponds to the 64 codons in alphabetical order. If you want a list of the codons in alphabetical order, this is stored in CUBScout.DEFAULT_CodonDict.codons
. If analyzing a fasta file or a vector of sequences, the result will be a tuple. The first element of the tuple is a 64xn matrix, where n = # of sequences above the threshold. The second element is a list of corresponding names for each column. The third element is a Boolean vector where true
corresponds to sequences which did pass the threshold, and false
is sequences which did not pass the threshold and so are not included in the results matrix. Names are pulled from fasta files and IO streams by default; if you would like to provide a vector of IDs or names when providing a Vector{<:NucSeq}
, you can.
Examples
julia> using BioSequences: @dna_str
julia> example_dna = dna"ATGAAAATGAACTTTTGA"
18nt DNA Sequence:
ATGAAAATGAACTTTTGA
julia> count_codons(example_dna) |> first
1
julia> result = count_codons(EXAMPLE_DATA_PATH);
julia> first(result[1], 5)
5-element Vector{Int32}:
32
7
6
14
11
CUBScout.e
— Functione(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
e(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
Calculate E from Karlin and Mrazek, 1996.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.ref_vector
: reference subset, which is required fore
.Bool[]
the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectors
should be a vector of vectors corresponding to the vector of filepaths.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = e(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate E on example dataset
julia> round.(result.E[1:5], digits = 6)
5-element Vector{Float64}:
0.762317
1.025839
0.875954
0.986498
1.111275
julia> e(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> e(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codons
CUBScout.enc
— Functionenc(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
enc(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
Calculate ENC from Wright, 1990.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> result = enc(EXAMPLE_DATA_PATH); # Run ENC on example dataset
julia> round.(result.ENC[1:5], digits = 6)
5-element Vector{Float64}:
56.787282
52.725947
59.287949
52.296686
55.262981
julia> result_300 = enc(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.ENC)
3801
julia> length(result_300.ENC)
1650
julia> enc(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> enc(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
CUBScout.enc_p
— Functionenc_p(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
enc_p(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
Calculate ENC' from Novembre, 2002.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
ref_seqs
: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes,ref_seqs
takes a named tuple in the form("subset_name" = Bool[],)
, whereBool[]
is the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, andCUBScout
will return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets,ref_seqs
should be a vector of named tuples corresponding to the vector of sequences.rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> result = enc_p(EXAMPLE_DATA_PATH); # Calculate measure on example dataset
julia> result_300 = enc_p(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.self)
3801
julia> length(result_300.self)
1650
julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
61.0
59.369798
60.749462
61.0
61.0
julia> enc_p(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> enc_p(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> enc_p(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset
julia> enc_p(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subset
CUBScout.find_seqs
— Methodfind_seqs(path::AbstractString, match_pattern::Regex)
find_seqs(stream::IO, match_pattern::Regex)
find_seqs(reader::FASTAReader, match_pattern::Regex)
Read a fasta file at path
(or reader
or IO
which points to a fasta file) and query the description field for a given Regex match_pattern
. These results can be supplied in either the reference tuples (for codon usage bias functions) or reference vectors (for expressivity measures).
Examples
jldoctest julia> find_seqs(EXAMPLE_DATA_PATH, r"ribosomal")[1:5] 5-element Vector{Bool}: 0 0 0 0 0
CUBScout.fop
— Functionfop(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
fop(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
Calculate FOP from Ikemura, 1981.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.ref_vector
: reference subset, which is required forfop
.Bool[]
the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectors
should be a vector of vectors corresponding to the vector of filepaths.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = fop(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate CAI on example dataset
julia> round.(result.FOP[1:5], digits = 6)
5-element Vector{Float64}:
0.567816
0.566845
0.509695
0.725
0.653784
julia> fop(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> fop(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codons
CUBScout.gcb
— Functiongcb(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_vector = [], perc = 0.05, rm_start = false, rm_stop = false, threshold = 80)
gcb(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_vectors = [], perc = 0.05, rm_start = false, rm_stop = false, threshold = 80)
Calculate GCB from Merkl, 2003.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of BioSequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each fasta file or vector of BioSequences. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
ref_vector
: optional reference subset; by default gcb begins calculations using all genes as a seed. If you want to provide a custom reference set, it should be a vectorBool[]
the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectors
should be a vector of vectors corresponding to the vector of filepaths.perc
: percentage of "top hits" which should be used as a reference set in the next iteration. By default set to 0.05.rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = gcb(EXAMPLE_DATA_PATH); # Calculate GCB on example dataset
julia> round.(result.GCB[1:5], digits = 6)
5-element Vector{Float64}:
-0.058765
-0.08659
-0.005496
-0.065659
-0.032062
julia> ribo_result = gcb(EXAMPLE_DATA_PATH, ref_vector = ribosomal_genes); # Calculate GCB with ribosomal genes as reference seed example dataset
julia> round.(ribo_result.GCB[1:5], digits = 6)
5-element Vector{Float64}:
-0.135615
-0.036687
-0.169136
-0.186104
-0.01653
julia> gcb(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> gcb(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
CUBScout.make_CodonDict
— Functionmake_CodonDict(filepath::AbstractString, delimiter::AbstractChar = ' ')
Make a custom codon dictionary for organisms with non-standard genetic code. filepath
points to a delimited file with two columns and no header. The first column should be codons, and the second column their corresponding amino acid. Avoid spaces and special characters (e.g., write GlutamicAcid instead of Glutamic Acid). Stop codons can be coded as Stop, stop, STOP, or *. If delimited using any character outside of tab, supply the delimiter as the second argument as Char, not a string (e.g. ','
not ","
). make_CodonDict
uses readdlm
from DelimitedFiles
; it's a good idea to check whether readdlm
parses your file correctly before passing to make_CodonDict
Examples
julia> my_CodonDict = make_CodonDict(CUBScout.CodonDict_PATH)
CodonDict(BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}[AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC … TCG, TCT, TGA, TGC, TGG, TGT, TTA, TTC, TTG, TTT], ["Lysine", "Asparagine", "Lysine", "Asparagine", "Threonine", "Threonine", "Threonine", "Threonine", "Arginine", "Serine" … "Serine", "Serine", "Stop", "Cysteine", "Tryptophan", "Cysteine", "Leucine", "Phenylalanine", "Leucine", "Phenylalanine"], ["Lysine", "Asparagine", "Lysine", "Asparagine", "Threonine", "Threonine", "Threonine", "Threonine", "Arginine", "Serine" … "Serine", "Serine", "Serine", "Cysteine", "Tryptophan", "Cysteine", "Leucine", "Phenylalanine", "Leucine", "Phenylalanine"], ["Lysine", "Asparagine", "Threonine", "Arginine", "Serine", "Isoleucine", "Methionine", "Glutamine", "Histidine", "Proline" … "Glutamicacid", "Asparticacid", "Alanine", "Glycine", "Valine", "Stop", "Tyrosine", "Cysteine", "Tryptophan", "Phenylalanine"], ["Lysine", "Asparagine", "Threonine", "Arginine", "Serine", "Isoleucine", "Methionine", "Glutamine", "Histidine", "Proline", "Leucine", "Glutamicacid", "Asparticacid", "Alanine", "Glycine", "Valine", "Tyrosine", "Cysteine", "Tryptophan", "Phenylalanine"], Vector{Int32}[[1, 3], [2, 4], [5, 6, 7, 8], [9, 11, 25, 26, 27, 28], [10, 12, 53, 54, 55, 56], [13, 14, 16], [15], [17, 19], [18, 20], [21, 22, 23, 24] … [33, 35], [34, 36], [37, 38, 39, 40], [41, 42, 43, 44], [45, 46, 47, 48], [49, 51, 57], [50, 52], [58, 60], [59], [62, 64]], Vector{Int32}[[1, 3], [2, 4], [5, 6, 7, 8], [9, 11, 25, 26, 27, 28], [10, 12, 51, 52, 53, 54], [13, 14, 16], [15], [17, 19], [18, 20], [21, 22, 23, 24], [29, 30, 31, 32, 58, 60], [33, 35], [34, 36], [37, 38, 39, 40], [41, 42, 43, 44], [45, 46, 47, 48], [49, 50], [55, 57], [56], [59, 61]], Int32[2, 2, 4, 6, 6, 3, 1, 2, 2, 4 … 2, 2, 4, 4, 4, 3, 2, 2, 1, 2], Int32[2, 2, 4, 6, 6, 3, 1, 2, 2, 4, 6, 2, 2, 4, 4, 4, 2, 2, 1, 2], Bool[1, 1, 1, 1, 1, 1, 1, 1, 1, 1 … 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
julia> typeof(my_CodonDict)
CodonDict
julia> fieldnames(CodonDict)
(:codons, :AA, :AA_nostops, :uniqueAA, :uniqueAA_nostops, :uniqueI, :uniqueI_nostops, :deg, :deg_nostops, :stop_mask)
CUBScout.mcb
— Functionmcb(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
mcb(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
Calculate MCB from Urrutia and Hurst, 2001.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
ref_seqs
: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes,ref_seqs
takes a named tuple in the form("subset_name" = Bool[],)
, whereBool[]
is the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, andCUBScout
will return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets,ref_seqs
should be a vector of named tuples corresponding to the vector of sequences.rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> result = mcb(EXAMPLE_DATA_PATH); # Calculate measure on example dataset
julia> result_300 = mcb(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.self)
3801
julia> length(result_300.self)
1650
julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
0.087211
0.178337
0.189682
0.24012
0.149869
julia> mcb(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> mcb(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> mcb(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset
julia> mcb(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subset
CUBScout.melp
— Functionmelp(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, ref_vector::Vector{Bool}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
melp(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, ref_vectors::Vector{Vector{Bool}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
Calculate MELP from Supek and Vlahovicek, 2005.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.ref_vector
: reference subset, which is required formelp
.Bool[]
the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. If providing multiple filepaths and want custom reference sets,ref_vectors
should be a vector of vectors corresponding to the vector of filepaths.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> result = melp(EXAMPLE_DATA_PATH, ribosomal_genes); # Calculate MELP on example dataset
julia> round.(result.MELP[1:5], digits = 6)
5-element Vector{Float64}:
0.929414
1.007671
0.922357
0.951239
1.029531
julia> melp(EXAMPLE_DATA_PATH, ribosomal_genes, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> melp(EXAMPLE_DATA_PATH, ribosomal_genes, rm_start = true); # Remove start codons
CUBScout.milc
— Functionmilc(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
milc(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, ref_seqs = (), rm_start = false, rm_stop = false, threshold = 80)
Calculate MILC from Supek and Vlahovicek, 2005.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
ref_seqs
: by default, codon usage bias for each gene is calculated using the whole genome ("self") as a reference subset. If you would like to specify your own subsets to calculate against, such as ribosomal genes,ref_seqs
takes a named tuple in the form("subset_name" = Bool[],)
, whereBool[]
is the same length as the number of sequences in your fasta file, and containstrue
for sequences you want as your reference subset and false for those you don't. You can usefind_seqs()
to generate this vector. You can provide multiple reference subsets as separate entries in the named tuple, andCUBScout
will return the calculated measure using each subset. If providing multiple sets of sequences and want custom reference sets,ref_seqs
should be a vector of named tuples corresponding to the vector of sequences.rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> result = milc(EXAMPLE_DATA_PATH); # Calculate measure on example dataset
julia> result_300 = milc(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.self)
3801
julia> length(result_300.self)
1650
julia> round.(result.self[1:5], digits = 6)
5-element Vector{Float64}:
0.494826
0.583944
0.499472
0.635493
0.543935
julia> milc(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> milc(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
julia> all_genes = find_seqs(EXAMPLE_DATA_PATH, r""); # Get a vector which is true for all genes
julia> ribosomal_genes = find_seqs(EXAMPLE_DATA_PATH, r"ribosomal"); # Get a vector which is true for ribosomal genes
julia> milc(EXAMPLE_DATA_PATH, ref_seqs = (ribosomal = ribosomal_genes,)); # Calculate using ribosomal genes as a reference subset
julia> milc(EXAMPLE_DATA_PATH, ref_seqs = (self = all_genes, ribosomal = ribosomal_genes,)); # Calculate using all genes and ribosomal genes as a reference subset
CUBScout.scuo
— Functionscuo(sequences::Union{String, IO, FASTAReader, Vector{<:NucSeq}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{String}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
scuo(sequences::Union{Vector{String}, Vector{<:IO}, Vector{<:FASTAReader}, Vector{<:Vector{<:NucSeq}}}, dict::CodonDict = DEFAULT_CodonDict; names::Union{Vector{Vector{String}}, Nothing} = nothing, rm_start = false, rm_stop = false, threshold = 80)
Calculate SCUO from Wan et al., 2004.
Arguments
sequences
: DNA or RNA sequences to be analyzed, which should be coding sequences only. This can take quite a few forms depending on your use case. It can be a path to fasta file of coding sequences (e.g. .fasta, .fna, .fa), or a IO or FASTAReader pointing to these fasta files. It can also be a vector of BioSequences, if you've already brought them into Julia's environment. There are no quality checks, so it's assumed that each entry is assumed to be an individual coding sequence, in the correct frame, without 5' or 3' untranslated regions. If you are analyzing multiple genomes (or sets of sequences),sequences
could instead be a vector of filepaths, IOStreams, FASTAReaders, or vectors of sequences, with each vector corresponding to a genome.CUBScout
is multithreaded; if there are multiple threads available,CUBScout
will allocate a thread for each filepath. As such, providing a vector of filepaths (orVector{<:Vector{<:NucSeq}}
) as an argument will be faster than broadcasting across a vector of paths. Because a single file is only accessed by a single thread, it's never worth using more threads than the total number of files being analyzed.dict
: codon dictionary of typeCodonDict
. The standard genetic code is loaded by default, but if necessary you can create your own codon dictionary usingmake_CodonDict
names
: An optional vector of names for each sequence. Only relevant if providing a vector of BioSequences, as names are automatically pulled from fasta files. Ifsequences
is of typeVector{<:Vector{<:NucSeq}}
,names
should be of typeVector{Vector{String}}
rm_start
: whether to ignore the first codon of each sequence. Many organisms use alternative start codons such as TTG and CTG, which in other locations would generally code for leucine. There are a few approaches to deal with this. By default,CUBScout
keeps each start codon and assigns it as though it were any other codon. Of course, this would slightly change leucine's contribution to codon usage bias. If you setrm_start
totrue
, the first codon of every sequence is simply discarded. This will also affect the gene's length, which means it could be removed if it falls under the threshold. Other CUB packages (such as R's coRdon, alt.init = TRUE), assign all TTG and CTG codons to methionine, regardless of their location. I disagree with this approach from a biological perspective; those codons still code for leucine most of the time they are used. However, if you want matching output as you would get from coRdon, you can supplyALTSTART_CodonDict
to thedict
argument, and keeprm_start
asfalse
.rm_stop
: whether to remove stop codons from calculations of codon usage bias.threshold
: minimum length of a gene (in codons) to be used in codon usage bias calculations. By default this is set to 80 codons; any genes less than or equal to that length are discarded. If you want no genes discarded, setthreshold
to 0.
Examples
julia> result = scuo(EXAMPLE_DATA_PATH); # Run SCUO on example dataset
julia> round.(result.SCUO[1:5], digits = 6)
5-element Vector{Float64}:
0.143121
0.191237
0.096324
0.345211
0.105744
julia> result_300 = scuo(EXAMPLE_DATA_PATH, threshold = 300); # Increase threshold length
julia> length(result.SCUO)
3801
julia> length(result_300.SCUO)
1650
julia> scuo(EXAMPLE_DATA_PATH, ALTSTART_CodonDict); # Code TTG and CTG as methionine
julia> scuo(EXAMPLE_DATA_PATH, rm_start = true); # Remove start codons
CUBScout.seq_descriptions
— Methodseq_descriptions(path::AbstractString)
seq_descriptions(reader::FASTAReader)
seq_descriptions(stream::IO)
Read a fasta file at path
and return the description fields. Just adds convenience on top of FASTX functions.
Examples
julia> seq_descr = seq_descriptions(EXAMPLE_DATA_PATH);
julia> seq_descr[1]
"lcl|NC_000964.3_cds_NP_387882.1_1 [gene=dnaA] [locus_tag=BSU_00010] [db_xref=EnsemblGenomes-Gn:BSU00010,EnsemblGenomes-Tr:CAB11777,GOA:P05648,InterPro:IPR001957,InterPro:IPR003593,InterPro:IPR010921,InterPro:IPR013159,InterPro:IPR013317,InterPro:IPR018312,InterPro:IPR020591,InterPro:IPR024633,InterPro:IPR027417,PDB:4TPS,SubtiList:BG10065,UniProtKB/Swiss-Prot:P05648] [protein=chromosomal replication initiator informational ATPase] [protein_id=NP_387882.1] [location=410..1750] [gbkey=CDS]"
CUBScout.seq_names
— Methodseq_names(path::AbstractString)
seq_names(reader::FASTAReader)
seq_names(stream::IO)
Read a fasta file at path
and return the name fields. Just adds convenience on top of FASTX functions.
Examples
julia> seq_name_vector = seq_names(EXAMPLE_DATA_PATH);
julia> seq_name_vector[1]
"lcl|NC_000964.3_cds_NP_387882.1_1"