Codon Usage and Bias in Julia with CUBScout

Author

Augustus Pendleton

CUBScout helps you analyze codon usage in Julia. It also happens to be my first official package so I’m freakin’ excited about it! If you think CUBScout can help you in your research, feel free to read through the documentation here, look at the source code, and connect with me!. The source code and files for this analysis are available here.

The documentation has a good, step-by-step walk-through of what CUBScout can do and how to use it. On this project, I want to show off an example of how I’ve been using CUBScout, by exploring some real-life questions concerning codon usage. If you just want to see how CUBScout performs compared to its R equivalent, feel free to skip to the end. With that, let’s get right into the nitty-gritty of genome analysis!

Let’s start off with the basics

CUBScout’s primary function is to calculate codon usage bias (CUB…get it?). CUBScout input’s are usually fasta-formatted files of coding sequences for a genome (though this is flexible - we’ll show some advanced examples later). For my example analyses here, I randomly downloaded 997 .fna files containing coding sequences as annotated in RefSeq. While these files aren’t available on the Github, you can find the accession numbers used in “genome_accessions.txt”, and the code I used to download data in “genome_downloads.sh”.

Once you’ve got your genomes, we can generate a codon usage matrix, using the function count_codons. Let’s try that:

using CUBScout # Load the package :)

filename = joinpath(pwd(), "genomes", "cds_1.fna") # Let's specify a filepath

results = count_codons(filename) # Let's count codons

count_matrix = results[1] # Our count matrix is the first element in the tuple
64×4237 Matrix{Int32}:
 32  21  1  17  7  33  38  21  29  36  …  9  9  11  15  33  12  13  14  11  6
  7   8  2   7  2  20  19   7   8  13     4  1   4   4  10   7   5   3   1  2
  6   3  2   9  1  10   9   8   8  13     6  6  10   6   7   8  10   8   8  0
 14   7  2   7  1  13  16   4   6  11     7  4   4   9  12   7   4   4   5  1
 11   9  2   8  2  20  23  11  18  17     8  1   9   6  25  11   5   2   2  1
  4   5  0   4  0   6   5   3   1   4  …  0  0   4   0   4   3   0   2   0  0
  4   6  0   8  2  16  12   0   4   5     6  2   4   6   8   8   4   3   2  0
  5   6  0   1  0   4  15   0  12   9     2  2   0   3   1   7   6   2   0  0
  6   7  1   5  0  11  17   1   7   2     4  2   7   3   8   6   5   3   3  3
  4   6  1   4  1   7   4   4   5   6     2  1   2   2   7   7   3   2   2  2
  2   0  0   1  0   1   3   4   0   0  …  2  0   2   0   0   2   1   1   0  0
  2   6  0   1  0   5   6   2   2   5     2  1   1   2   1   2   0   4   1  1
  5   1  0   3  2   2   2   3   1   1     4  6   2   5   1   0   1   1   0  0
  ⋮                 ⋮                  ⋱             ⋮                   ⋮  
  9   5  2   5  1   6  12   6   9   6     3  1   4   3   5   5   1   3   3  2
  4   7  0   4  2   3   6   2   3   3     1  2   1   2   5   1   4   0   1  0
  0   1  0   2  1   3   3   3   0   0     1  0   1   1   0   2   0   2   1  0
  8   6  1   5  4   7  10   5   9   8  …  1  3   5   5  13   9   2   2   2  0
  0   0  0   1  0   0   1   0   0   0     0  1   0   0   0   0   0   1   0  0
  0   1  0   0  0   1   1   2   1   1     1  0   0   3   3   1   0   1   0  0
  4   1  1   2  0   2   1   5   1   5     0  4   2   1   2   1   0   4   0  0
  0   0  0   0  0   1   0   1   2   2     3  0   0   1   1   2   0   1   0  0
 12  11  3   6  3  10  12  15   7   7  …  8  2   8   5   4  11   3  10   3  2
  4   6  1   3  0   6  14   5   6   6     0  1   5   4   3   2   1   8   1  2
  6   6  0   7  1   5   8   3   5   5     8  2   2   6  12   6   7   9   2  0
 17   8  3   6  6   9   2  18   6   9     1  6   3   8   5   5   1   4   4  0

Count codons returns a three-element tuple. The first element is the count matrix. The second element is the gene identifiers, and the third element is a Boolean vector showing which genes passed a specific length threshold. This is by default 0, so all genes are included.

Our count matrix will always have 64 rows (for each codon in alphabetical order) and as many columns as there were CDS in our file. Each number in the matrix corresponds to how many times that codon occurred in that CDS. So we can see that our first gene used the codon AAA 32 times.

Personally, I don’t have the 64 codons memorized in alphabetical order, and I don’t remember which codons codes for which amino acid. CUBScout stores this information in a special object called a CodonDict. CUBScout comes loaded with a default CodonDict, which follows the standard genetic code.

DEFAULT_CodonDict.codons # Let's get our 64 codons in alphabetical order
64-element Vector{LongSequence{DNAAlphabet{2}}}:
 AAA
 AAC
 AAG
 AAT
 ACA
 ACC
 ACG
 ACT
 AGA
 AGC
 AGG
 AGT
 ATA
 ⋮
 TCA
 TCC
 TCG
 TCT
 TGA
 TGC
 TGG
 TGT
 TTA
 TTC
 TTG
 TTT
DEFAULT_CodonDict.AA # And their corresponding amino acids
64-element Vector{String}:
 "Lysine"
 "Asparagine"
 "Lysine"
 "Asparagine"
 "Threonine"
 "Threonine"
 "Threonine"
 "Threonine"
 "Arginine"
 "Serine"
 "Arginine"
 "Serine"
 "Isoleucine"
 ⋮
 "Serine"
 "Serine"
 "Serine"
 "Serine"
 "Stop"
 "Cysteine"
 "Tryptophan"
 "Cysteine"
 "Leucine"
 "Phenylalanine"
 "Leucine"
 "Phenylalanine"

If you work with some particular organism that uses an alternative genetic code, you can generate your own custom CodonDict.

Now that we have a count matrix, we can look at codon frequency within this genome. Let’s do that, using the codon_frequency function. Here, we’ll calculate the frequency of each codon across the entire genome.

codon_frequency(count_matrix, "net_genomic") # Let's calculate codon frequency
64×1 Matrix{Float64}:
 0.04941242971710299
 0.017114892645228374
 0.021009352696846777
 0.022269444158755328
 0.022257296747490142
 0.00858093131772688
 0.014475665091012455
 0.008714552841643916
 0.010720495355234845
 0.014114482062727612
 0.0038159067921035347
 0.006614670347602223
 0.009409384766012515
 ⋮
 0.014793117438742629
 0.00796141334320243
 0.0062664445580002445
 0.012826046641200293
 0.0007774343209718577
 0.00428965583144576
 0.010302624407712473
 0.0035519030539401747
 0.01916456583937397
 0.01413310809333423
 0.015401297829419573
 0.030715944125147488

CUBScout supports four methods of calculating codon frequency, which you can read about in the documentation. For now, let’s use our codon_frequency function to explore a fun hypothesis: codon frequency is markedly different in the beginning of genes versus the middle and the end.

Is codon frequency different in the beginning of a gene?

This paper included the intriguing finding that E. coli tends to use different codons in the beginning of its genes. However, they only tested this in one species, and they didn’t control for the possibility that there is a bias towards specific amino acids in the first few residues. I’m going to test these hypotheses using CUBScout. I’ll also demonstrate some of the ways I’ve learnd to do bioinformatics in Julia, using other packages like FASTX, BioSequences, and Gadfly.

First, let’s find all the files we downloaded.

using Glob

genome_dir = joinpath(pwd(), "genomes")

files = readdir(glob"*.fna", genome_dir); # Let's find our files

Now, we want to test if the codon frequency of the first 8 codons in a gene tends to be from the codon frequency of a randomly selected eight codons somewhere else in the gene. In order to test this, I’m going to read in each .fna file, create a concatenated string which is all of the codons at positions 2-9 (don’t need to measure ATG thousands of times) compared to eight contiguous codons at a random position. Then I’m going to calculate the codon frequency for each of those groups.

using FASTX # We'll use FASTX to read in FASTA files
using BioSequences # We'll use BioSequences to parse nucleotide sequences
using Random # We'll use some randon numbers too when sampling
Random.seed!(31491)
function sample_codons(filepath)
  open(FASTA.Reader, filepath) do reader
        start = "" # Initialize empty strings
        random = ""
        for record in reader
          seq = sequence(record) # Get fasta sequence of a single gene
          len = length(seq) # Find it's length
          len < 60 && continue # Make sure it's not too short
          start = start * seq[4:27] # Index codons 2 - 9 and append to our "start" string
          r = 0 # Get ready to randomly sample a codon position
          while true
          r = rand(28:(len - 26)) # Find a random position in the gene, and avoiding any stop codons
          if (r-1) % 3 == 0 break end # Make sure this is in frame
          end
          random = random * seq[r:(r+23)] # Starting from our random position, index 8 codons and append to our "random" string
      end
      return (LongDNA{4}(start), LongDNA{4}(random)) # Return as tuple of nucleotide sequences
      end
  end

codons = sample_codons(files[1]) # Sample our first genome (B. subtilis)

codon_counts = count_codons.(codons) # Generate codon count matrix for start and random codons

freqs = codon_frequency.(codon_counts, "net_genomic") # Calculate frequency of each codon

# Let's plot to see if the frequency of codons differs in "early" codons versus "random" codons
labels = map(x->"$x", DEFAULT_CodonDict.codons) # Make alphabetical codon list in String format (for plotting)
using Gadfly
set_default_plot_size(5inch, 5inch)
plot(
  layer(x->x, 0, .1, Geom.line),
  layer(x = freqs[2], y = freqs[1], Geom.point),
  layer(x = freqs[2], y = freqs[1], label = labels, Geom.label),
  Coord.cartesian(xmin=0, xmax=.1, ymin = 0, ymax = .1),
  Guide.xlabel("Random Codon Frequency"),
  Guide.ylabel("Early Codon Frequency"),
  Theme(discrete_highlight_color = c->nothing, alphas=[0.4], point_label_color = "white", grid_line_width=0mm)
  )
Random Codon Frequency -0.2 -0.1 0.0 0.1 0.2 0.3 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 -0.1 0.0 0.1 0.2 -0.100 -0.095 -0.090 -0.085 -0.080 -0.075 -0.070 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -0.2 -0.1 0.0 0.1 0.2 0.3 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 -0.1 0.0 0.1 0.2 -0.100 -0.095 -0.090 -0.085 -0.080 -0.075 -0.070 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 Early Codon Frequency

Cool finding! We have the average frequency of each codon in a random location of a gene on the x axis, and the average frequency of each codon in the first 8-positions of a gene on the y axis. Codons below the line are less frequent in the beginning of genes, while codons above the line are more frequent. We can conclude that in our first genome (which happens to be B. subtilis), there is a distinctive pattern of codon usage in the beginning of the gene, versus the rest of the gene.

Moving forward, however, we are going to be combingin results from many genomes, which will likely have very different codon usage patterns. As such, maybe it would be better if we plotted the difference between these frequencies, rather than their frequency alone. This time, I’m going to calcualte the difference (I refer to it as “deviation” in my code) for each codon for our first genome.

deviations = Float64[]
for (start, random) in zip(freqs...)
  dev = start-random # Calculate a scaled % Deviation
  push!(deviations,dev) # Add that to our vector
end

plot(
  layer(x = freqs[2], y = deviations, Geom.point),
  layer(yintercept = [0], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
  layer(x = freqs[2], y = deviations, label = labels, Geom.label, order = 2),
  Guide.xlabel("Random Codon Frequency"),
  Guide.ylabel("Deviation in Frequency at Start"),
  Theme(discrete_highlight_color = c->nothing, alphas=[0.4], point_label_color = "white", grid_line_width=0mm)
)
Random Codon Frequency -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 -0.1 0.0 0.1 0.2 -0.062 -0.060 -0.058 -0.056 -0.054 -0.052 -0.050 -0.048 -0.046 -0.044 -0.042 -0.040 -0.038 -0.036 -0.034 -0.032 -0.030 -0.028 -0.026 -0.024 -0.022 -0.020 -0.018 -0.016 -0.014 -0.012 -0.010 -0.008 -0.006 -0.004 -0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 0.020 0.022 0.024 0.026 0.028 0.030 0.032 0.034 0.036 0.038 0.040 0.042 0.044 0.046 0.048 0.050 0.052 0.054 0.056 0.058 0.060 0.062 0.064 0.066 0.068 0.070 0.072 0.074 0.076 0.078 0.080 0.082 0.084 0.086 0.088 0.090 0.092 0.094 0.096 0.098 0.100 0.102 0.104 0.106 0.108 0.110 0.112 0.114 0.116 0.118 0.120 0.122 AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 -0.2 -0.1 0.0 0.1 0.2 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 Deviation in Frequency at Start

Okay, this is cool! What we’re observing is that in B. subtilis, there are some codons that are preferentially used in the beginning of genes. These appear to be AT-rich codons, especially AAA and ATA. There are also codons that are used less, especially a GC-rich codons, like GGC and GCC.

I wonder if this generalizes across many different genomes? Let’s test it across our 997 randomly sampled genomes from NCBI’s database:

all_genome_results = map(files) do file
  codons = sample_codons(file)
  codon_counts = count_codons.(codons)
  freqs = codon_frequency.(codon_counts, "net_genomic")
  deviations = Float64[]
  for (start, random) in zip(freqs...)
    dev = start-random
    push!(deviations,dev)
  end
  return deviations
end
997-element Vector{Vector{Float64}}:
 [0.04041774840689167, 0.0017406183620486196, 0.008555581779561009, 0.011387774368657066, 0.007227991503422231, -0.0007375501534104311, -0.002448666509322633, 0.0007080481472740135, 0.007877035638423414, -0.0017406183620486196  …  -0.0013570922822751944, -0.00064904413500118, 0.0, -0.0010915742270474391, 0.0012685862638659434, 0.00023601604909133812, 0.012036818503658248, -0.0028026905829596424, 0.002625678546141137, 0.0028616945952324724]
 [0.005907960199004974, 0.0012696932006633493, -0.0012437810945273645, -0.00036276948590381435, 0.004793739635157546, 0.01508084577114428, 0.004094112769485906, 0.004638266998341625, 0.0019693200663349914, 0.011790008291873962  …  0.002591210613598672, 0.005001036484245439, -7.77363184079602e-5, -0.003472222222222221, -0.005882048092868989, 2.5912106135986723e-5, 0.00041459369817578774, -0.005856135986732999, 0.00396455223880597, 0.004560530679933665]
 [0.039405311778290986, 0.007144919168591224, 0.004618937644341802, 0.012918591224018477, 0.011114318706697459, 0.0023816397228637407, 7.217090069283963e-5, 0.004907621247113166, 0.005773672055427252, 0.0020929561200923787  …  -0.00043302540415704385, 0.0019486143187066977, -0.0007217090069284065, -0.001371247113163972, -0.004546766743648961, -0.00216512702078522, 0.0032476905311778284, 0.0007217090069284067, -0.0057015011547344105, 0.009021362586605075]
 [0.0031497035573122526, 0.0011425395256917, 0.0004168725296442681, 0.002300518774703557, 0.007380187747035574, 0.010637969367588929, 0.007827939723320156, 0.007287549407114624, 0.00447751976284585, 0.006283967391304348  …  -0.001960844861660078, 0.0037981719367588935, 1.5439723320158102e-5, -0.0037827322134387355, -0.003365859683794466, 7.71986166007905e-5, 0.0007102272727272727, -0.004909832015810276, 0.004338562252964428, 0.001667490118577075]
 [0.06491357116586981, -0.0011952923869069515, 0.012550570062522984, 0.011171386539168809, 0.013194189040088268, -0.005654652445752115, -0.0041835233541743285, 0.0001379183523354168, 0.015125045972784112, -0.0016550202280250096  …  -0.007447591026112543, -0.0001379183523354168, 0.0, -0.0029422581831555724, -0.005011033468186833, -0.0011952923869069515, 0.01875689591761677, -0.005470761309304892, -0.005332842956969473, 0.012596542846634792]
 [0.04137851544490548, 0.003515444905486399, 0.007895343476256339, 0.006108805901337021, -0.0016136468418626107, 0.0001728907330567081, 0.00011526048870447213, -0.004783310281235593, 0.0010949746426924846, 0.000922083909635777  …  0.0007491931765790686, -0.004783310281235591, -5.763024435223605e-5, 0.0001728907330567082, -0.002881512217611803, -0.0034001844167819264, 0.015387275242047027, -0.000922083909635777, 0.001325495620101428, 0.009970032272936837]
 [0.04054261189454322, 0.004521765787860208, 0.005977927651747397, 0.013488657265481301, 0.009886572654812995, 0.0012262415695892103, 0.0025291232372777438, 0.004828326180257511, 0.005441446965052115, 0.0012262415695892103  …  0.0002299202942979764, 0.0019926425505824644, -0.0006897608828939301, -0.0006897608828939305, -0.002452483139178417, -0.0032955242182709987, 0.007050889025137952, 0.0009196811771919073, -0.0020692826486817875, 0.009426732066217046]
 [0.040482826829706886, 0.00391116705628484, 0.009755439669124257, 0.009215968351016006, 0.003663909368818559, -0.00020230174429059552, -0.0010789426362165077, 0.000944074806689444, 0.007260384822873585, -8.991188635137492e-5  …  0.00015734580111490676, -0.0027423125337169563, -0.0003371695738176587, -0.0013711562668584786, 0.0004945153749325668, -0.0012812443805071023, 0.006091530300305702, -0.0016408919259126055, -0.0008316849487502252, 0.005462147095846075]
 [0.001393383674937073, 0.0052139518158935624, 0.004314994606256742, 0.004270046745774901, 0.008674937072995326, 0.016855447680690402, 0.010787486515641856, 0.01348435814455232, 0.0036407766990291263, 0.017259978425026967  …  0.003011506652283351, 0.005438691118302768, 0.0, -0.004180151024811219, -0.007281553398058253, 0.00031463502337288746, 0.0005393743257820927, -0.006472491909385113, -0.0004944264653002515, 0.0006292700467457749]
 [0.0343891973750631, 0.005678950025239778, 0.005237253912165573, 0.010159010600706717, 0.007130237253912163, 0.0032811711256940943, 0.002082281675921253, 0.00315497223624432, 0.005237253912165574, 0.0041014639071176185  …  0.0015774861181221608, 0.0020822816759212513, -0.00018929833417465926, -0.0005047955577990914, -0.0029025744573447764, -0.0016405855628470467, 0.0037859666834931874, 0.003091872791519434, -0.002397778899545684, 0.004101463907117617]
 [0.045793927327028375, 0.010639621702339471, 0.002519910403185664, 0.007124191139870582, 0.0006222000995520158, 0.004137630662020906, -0.0002799900447984071, 0.004137630662020907, -0.0018043802887008457, 0.0013688402190144356  …  -0.0007466401194624189, 0.002768790443006472, -3.1110004977600795e-5, 0.00046665007466401205, -0.005537580886012942, -0.0035465405674464908, 0.019630413140866103, 0.01060851169736187, -0.012288451966152313, 0.005257590841214535]
 [0.016005154639175255, 0.00631443298969072, 0.004432989690721652, 0.0035567010309278356, 0.010644329896907216, 0.014845360824742266, 0.0035051546391752578, 0.006314432989690721, 0.006417525773195876, 0.008762886597938143  …  -0.0019845360824742274, 0.005051546391752577, -0.00010309278350515464, -0.0034536082474226804, -0.006237113402061857, -0.0004123711340206185, 0.0009020618556701031, -0.0037113402061855656, 0.002680412371134021, 0.0075515463917525776]
 [0.019113283295514052, 0.005498341769942398, 0.003054634316634667, 0.005411066503752837, 0.011869436201780416, 0.00918572176645139, 0.0037964740792459407, 0.0056728923023215225, 0.0057383487519636935, 0.00689474602897539  …  0.00013091289928434356, 0.006742014313143654, -0.00017455053237912376, -0.00375283644615116, -0.006152906266364114, -0.0009600279280851806, 0.0023564321871181702, -0.002116425205096875, 0.0012218537266538675, 0.006262000349101064]
 ⋮
 [0.0015020653398422831, 0.0023939166353736385, 0.004318437852046563, 0.0015020653398422833, 0.009575666541494556, 0.03205970709725873, 0.008073601201652271, 0.0064776567780698464, 0.0021592189260232824, 0.013471648516710476  …  -0.008824633871573416, 0.004083740142696207, 0.0, -0.004365377393916636, -0.0042714983101764924, 0.00032857679309049944, 9.38790837401427e-5, -0.009200150206533989, 0.0002816372512204281, 0.0014081862561021404]
 [0.03271205781666032, 0.008035374667173831, 0.006846709775580068, 0.01531000380372765, 0.008605933815138836, 0.0018067706352225162, 0.00023773297831875158, 0.007702548497527578, 0.002995435526816281, 0.0023773297831875227  …  0.001521491061240015, 0.005420311905667554, 4.754659566375048e-5, -0.0013313046785850134, -0.0023297831875237745, -0.0019494104222137695, 9.509319132749994e-5, 0.0002852795739825033, -0.0029003423354887793, 0.005848231266641305]
 [0.002345132743362832, 0.0015929203539822995, -0.0008407079646017696, 0.000663716814159292, 0.007699115044247788, 0.02814159292035398, 0.01345132743362832, 0.005486725663716815, 0.003097345132743363, 0.01238938053097345  …  -0.008008849557522124, 0.0026106194690265487, 0.0, -0.002256637168141593, -0.007920353982300886, 0.00013274336283185836, 0.00013274336283185842, -0.0068141592920353995, -0.0010619469026548678, 0.00022123893805309734]
 [0.02881125226860254, 0.0038188142770719906, 0.003213853599516033, 0.00555807622504537, 0.0102843315184513, 0.004196914700544466, 0.00117211131276467, 0.006730187537810043, 0.005142165759225651, 0.001512401693889897  …  -3.781004234724734e-5, 0.005860556563823351, -7.562008469449486e-5, -0.0012477313974591647, -0.0022686025408348454, -0.0008318209316394432, 0.01935874168179068, 0.0, -0.0022307924984876007, 0.005142165759225648]
 [0.026326849183477427, 0.006003842459173871, 0.005463496637848224, 0.014289145052833813, 0.009396013448607111, 0.003932516810758886, 0.001200768491834774, 0.005853746397694525, 0.003152017291066283, 0.002731748318924112  …  0.0006904418828049949, 0.0021313640730067224, -3.0019212295869356e-5, -0.0021914024975984636, -0.003121998078770413, -0.0012608069164265134, 0.006394092219020174, -0.0004502881844380413, -0.004562920268972141, 0.005463496637848224]
 [0.05927734375, 0.004687500000000001, 0.011230468750000002, 0.01767578125, 0.0095703125, 0.0038085937499999995, -0.0002929687500000007, -0.001757812499999999, 0.0074218750000000005, -0.0003906249999999995  …  -0.00048828125, -0.0038085937500000003, 0.0, -0.0005859375, 0.0014648437499999991, -0.0013671875000000003, -0.009082031250000004, -0.005664062500000001, -0.00439453125, -0.0002929687500000007]
 [0.05673363095238096, 0.003766741071428572, 0.005998883928571428, 0.0009300595238095205, 0.009579613095238096, 0.0007440476190476199, 0.000232514880952381, -0.0006975446428571404, 0.00255766369047619, 0.001255580357142857  …  -4.6502976190476025e-5, -0.00465029761904762, 0.0, -0.00013950892857142873, -0.003952752976190476, -0.0020461309523809525, 0.02180989583333333, 0.003348214285714286, -0.002418154761904762, -0.0024181547619047603]
 [0.027606951871657754, 0.005147058823529411, 0.002840909090909092, 0.007820855614973261, 0.010294117647058822, 0.0029077540106951863, 0.0022727272727272717, 0.008556149732620321, 0.005414438502673796, 0.0014371657754010704  …  -0.00036764705882352984, 0.005381016042780748, 0.0, -0.003977272727272727, -0.002573529411764707, -0.0007687165775401069, 0.008923796791443851, -0.0038770053475935835, 0.004344919786096257, 0.0063168449197860965]
 [0.017996800568787774, 0.001466405972271597, 0.007509776039815147, 0.0018218983291859222, 0.0033771773906861, 0.007598649129043726, 0.002888375399928901, 0.0029772484891574832, 0.00559900462140064, 0.0009331674369001085  …  -0.0019996445076430856, 0.003021685033771774, -0.00039992890152861714, -0.001910771418414504, 0.0030216850337717732, -0.0003554923569143262, 0.0024884464984002846, -0.00039992890152861643, 0.0017774617845716344, 0.005776750799857804]
 [0.03810848400556329, 0.003894297635605006, 0.006119610570236439, 0.014812239221140475, 0.006641168289290683, -0.00017385257301808128, 0.000799721835883171, 0.0035118219749652284, 0.008936022253129346, 0.0038942976356050076  …  -0.0010083449235048676, -0.002468706536856746, -3.4770514603616136e-5, -0.00288595271210014, -0.0037899860917941595, -0.0020166898470097352, 0.012795549374130737, 0.0004867872044506255, -0.00034770514603616083, 0.0051460361613351845]
 [0.005994496855345912, 0.00963050314465409, 0.009581367924528301, 0.0029481132075471696, 0.003537735849056604, 0.019899764150943397, 0.005896226415094342, 0.0036851415094339623, 0.001179245283018868, 0.009532232704402514  …  -0.004864386792452831, 0.0011792452830188679, 0.0, -0.0027024371069182384, -0.0030955188679245293, 0.0005404874213836478, 0.00044221698113207554, -0.0050609276729559755, 0.0030955188679245285, 0.0014740566037735848]
 [0.003930817610062893, 0.0014412997903563915, 0.006158280922431866, 0.0041928721174004195, 0.010394828791055206, 0.023104472396925223, 0.005022711390635919, 0.011574074074074073, 0.002795248078266946, 0.010438504542278126  …  -0.005022711390635919, 0.006726065688329839, -8.735150244584206e-5, -0.0027952480782669465, -0.004935359888190076, -8.735150244584215e-5, 0.00013102725366876308, -0.0067697414395527615, 0.002009084556254368, 0.0019217330538085255]

Great! That’s returned a a vector of 997 vectors, one for each genome, with the deviation for each codon from the start vs. the random codon samples. Now let’s average that out. I’m going to also try it by averaging the absolute value, in case there are some codons that tend to be very different in the start, but maybe in different directions depending on genomic GC content.

deviation_matrix = reduce(hcat, all_genome_results) # Convert out vector of vectors to a matrix

using Statistics
average_deviation = vec(mean(deviation_matrix, dims = 2)) # Calculate average deviation
average_sd = vec(std(deviation_matrix, dims = 2)) # It's variance
average_abs_deviation = vec(mean(abs.(deviation_matrix), dims = 2)) # Repeat, but taking the absolute value first
average_abs_sd = vec(std(abs.(deviation_matrix), dims = 2))


using DataFrames

dev_results = DataFrame(codon = labels, avg_dev = average_deviation, avg_abs_dev = average_abs_deviation, avg_std = average_sd, avg_abs_std = average_abs_sd)


sorted_avgdev = sort(dev_results, :avg_dev) # Sort in order of deviation

# Plotting the average deviations
avg_devplot1 = plot(sorted_avgdev,
  layer(yintercept=[0], Geom.hline(style = :dot, color=[colorant"ivory4"])),
  layer(x = :codon, y = :avg_dev, ymin=(sorted_avgdev.avg_dev.-sorted_avgdev.avg_std), ymax=(sorted_avgdev.avg_dev.+sorted_avgdev.avg_std), color = [colorant"darkorange1"], Geom.point, Geom.errorbar),
  Guide.xlabel("Codon"),
  Guide.ylabel("Deviation in Frequency at Start"),
  Guide.title("Averaged Deviation of Codon Frequency"),
  Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm
  )
)

# Plotting the average of absolute deviations
avg_abs_devplot1 = plot(sorted_avgdev,
  layer(yintercept=[0], Geom.hline(style = :dot, color=[colorant"ivory4"])),
  layer(x = :codon, y = :avg_abs_dev, ymin=(sorted_avgdev.avg_abs_dev.-sorted_avgdev.avg_abs_std), ymax=(sorted_avgdev.avg_abs_dev.+sorted_avgdev.avg_abs_std), color = [colorant"deepskyblue"], Geom.point, Geom.errorbar),
  Guide.xlabel("Codon"),
  Guide.ylabel("Deviation in Aboslute Frequency at Start"),
  Guide.title("Averaged Absolute Deviation of Codon Frequency"),
  Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm
  )
)
set_default_plot_size(7inch, 7inch)
vstack(avg_devplot1, avg_abs_devplot1)
Codon GGC GCC GTG GAG GCG CTG ATG GGT GGG GAC GTC GAT CAG CGC CCG GGA TGG CGG GCT TTC TAC TGC TAT GAA TGT CAC TCG TTG CAT CTC TGA GTT TAG TAA CCT CGT AGG CCA ATC GTA ACG TCT CCC GCA TCC CTT CTA CGA AGC AGT AAC TCA ACT CAA ACC TTT AGA AAG AAT ATT TTA ACA ATA AAA h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 -0.070 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 -0.1 0.0 0.1 0.2 -0.070 -0.068 -0.066 -0.064 -0.062 -0.060 -0.058 -0.056 -0.054 -0.052 -0.050 -0.048 -0.046 -0.044 -0.042 -0.040 -0.038 -0.036 -0.034 -0.032 -0.030 -0.028 -0.026 -0.024 -0.022 -0.020 -0.018 -0.016 -0.014 -0.012 -0.010 -0.008 -0.006 -0.004 -0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 0.020 0.022 0.024 0.026 0.028 0.030 0.032 0.034 0.036 0.038 0.040 0.042 0.044 0.046 0.048 0.050 0.052 0.054 0.056 0.058 0.060 0.062 0.064 0.066 0.068 0.070 0.072 0.074 0.076 0.078 0.080 0.082 0.084 0.086 0.088 0.090 0.092 0.094 0.096 0.098 0.100 0.102 0.104 0.106 0.108 0.110 0.112 Deviation in Aboslute Frequency at Start Averaged Absolute Deviation of Codon Frequency Codon GGC GCC GTG GAG GCG CTG ATG GGT GGG GAC GTC GAT CAG CGC CCG GGA TGG CGG GCT TTC TAC TGC TAT GAA TGT CAC TCG TTG CAT CTC TGA GTT TAG TAA CCT CGT AGG CCA ATC GTA ACG TCT CCC GCA TCC CTT CTA CGA AGC AGT AAC TCA ACT CAA ACC TTT AGA AAG AAT ATT TTA ACA ATA AAA h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 -0.2 -0.1 0.0 0.1 0.2 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 Deviation in Frequency at Start Averaged Deviation of Codon Frequency

Wow, we see some of the same results! GC-rich codons are rarer in the beginning of genes, while AT-rich codons, especially AAA, are higher.

Is this just because amino acids are different at the beginning of genes? To see if there is an actual difference in which codons are preferred for each amino acid (synonymous codons), let’s calculate codon frequency within each amino acid, rather than across all the codons we counted. Notice how I’ve changed the argument in our codon_frequency call.

all_genome_byAA = map(files) do file
  codons = sample_codons(file)
  codon_counts = count_codons.(codons)
  freqs = codon_frequency.(codon_counts, "byAA_genomic")
  deviations = Float64[]
  for (start, random) in zip(freqs...)
    dev = start - random
    push!(deviations,dev)
  end
  return deviations
end
997-element Vector{Vector{Float64}}:
 [0.044839090240977875, -0.07431749423525902, -0.04483909024097782, 0.07431749423525891, 0.07896617943416473, -0.033132191804723155, -0.05727762656267721, 0.011443638933235634, 0.08690493711834729, -0.032654116719878556  …  -0.014346165385252616, -0.030418097715844, 0.0, -0.09998357818677922, 0.0, 0.09998357818677917, 0.11185585892376804, -0.0768592635837404, 0.021516144180230956, 0.0768592635837404]
 [0.15436364031707456, 0.0009427118385877398, -0.1543636403170745, -0.0009427118385877953, 0.0493930624875963, -0.014720766565800236, -0.07877479029183271, 0.044102494370036646, 0.023291355788251485, 0.01840835481525549  …  -0.07783065635534406, 0.04530372132989623, -0.5, -0.07870370370370372, 0.0, 0.0787037037037037, 0.0028495655006716765, -0.13270794987212897, 0.03609535981240914, 0.13270794987212897]
 [0.02148793867776555, 0.018158783783783772, -0.021487938677765522, -0.018158783783783772, 0.04458440110614026, -0.0005820266689831877, -0.039577813490856956, -0.00442456094630006, 0.1292735257635187, 0.004181740381997118  …  -0.012247751241936847, -0.0024690002036524827, -0.36363636363636365, 0.018021472392638016, 0.0, -0.01802147239263807, 0.03473553147938102, -0.03530797881774647, -0.043744540488389966, 0.03530797881774639]
 [0.1283415784363586, -0.10851949715749865, -0.12834157843635863, 0.10851949715749859, 0.06736178264453119, -0.10300221242517871, -0.03655113891604678, 0.07219156869669424, 0.044460152521102005, 0.014102730101183958  …  -0.1268017563435152, 0.04801002319196012, 0.0, -0.15698603685102996, 0.0, 0.15698603685103002, 0.008070504729797031, -0.07592403689636462, 0.046412360252692075, 0.07592403689636462]
 [0.05675599806697362, -0.08640445527323354, -0.05675599806697368, 0.0864044552732336, 0.23676786811386444, -0.13197189227933645, -0.0831264079459349, -0.02166956788859309, 0.28333852212761323, -0.017037927553062965  …  -0.08435510320799752, -0.017896389324960765, 0.0, -0.11684181028443325, 0.0, 0.11684181028443319, 0.16141840599154253, -0.15419221975752695, -0.09487543093932035, 0.1541922197575269]
 [-0.026290043013845854, 0.014392964139630648, 0.02629004301384584, -0.014392964139630648, 0.07442311387496287, 0.004209827589218649, 0.014396678719700018, -0.09302962018388156, -0.03896405647734391, 0.02677568804101616  …  0.013985408204516141, -0.06792981307871485, 0.0, 0.026372585145022914, 0.0, -0.026372585145022942, 0.023389836656228402, -0.028705015720344854, -0.004585353468183631, 0.028705015720344895]
 [0.037615203200334846, -0.016097736255839018, -0.037615203200334874, 0.01609773625583899, 0.045625382396643654, -0.004913747209001118, -0.013894208230367633, -0.026817426957274876, 0.09841425016505001, 0.011645446777987567  …  -0.025877891583422116, -0.030529741247455222, -0.3888888888888889, 0.0532258064516129, 0.0, -0.0532258064516129, 0.047713330941254295, -0.024819580906008165, -0.02374210505838259, 0.02481958090600811]
 [0.01931047894394855, 0.015399280190998843, -0.01931047894394855, -0.015399280190998899, 0.06357886275173463, -0.004807285496699953, -0.0529607907429345, -0.00581078651210018, 0.10330604767683929, 0.008519673348181142  …  -0.0021781737193763973, -0.05608611729769861, -0.18199233716475094, -0.08686041100252281, 0.0, 0.08686041100252284, 0.04642640111769536, -0.03226132955355254, 0.006614008218435399, 0.03226132955355254]
 [0.02521359725136063, -0.0538764877739879, -0.02521359725136063, 0.05387648777398785, 0.07340407154121067, -0.09706497190810903, -0.052266607498038714, 0.07592750786493703, 0.0403967344322794, 0.07006623696903125  …  -0.07889752972358219, 0.021845893259421455, -1.0, -0.051506790023264526, 0.0, 0.05150679002326461, 0.007517201314154277, -0.057142017744873375, -0.0009175692873039307, 0.057142017744873375]
 [0.031685583271000684, 0.026836245458814217, -0.031685583271000684, -0.026836245458814245, -0.009262799504770214, -0.0008602796591653805, -0.01590379433398878, 0.026026873497924402, 0.08870151382689918, 0.03608114985350139  …  0.0014081653828480478, -0.03162113598595459, -0.2857142857142857, 0.04967254967254969, 0.0, -0.04967254967254964, 0.04001501725052331, 0.0472809738830829, -0.031157978095416677, -0.047280973883082944]
 [0.015497466350197331, 0.11423856769448218, -0.015497466350197386, -0.11423856769448215, -0.021854467516953024, 0.045429555821485335, -0.018354321806724166, -0.005220766497808116, -0.011604102692653162, 0.01712652471218784  …  -0.01506338203338762, -0.031539425455125725, 0.0, 0.06035177228786254, 0.0, -0.06035177228786259, 0.08667403310741129, 0.12437986832051928, -0.1695937324470299, -0.12437986832051928]
 [0.20984686954836207, 0.007063166189705927, -0.20984686954836218, -0.007063166189705927, 0.09864532019704433, -0.019909688013136306, -0.12524630541871923, 0.046510673234811166, 0.08692316084024365, 0.0239515287242289  …  -0.15341936431305245, 0.04279827081402072, -1.0, -0.09586822277177609, 0.0, 0.09586822277177609, 0.010216257600753455, -0.13933729821580287, 0.03303152721948184, 0.1393372982158029]
 [0.1760510748707969, 0.019142326845117852, -0.17605107487079696, -0.019142326845117852, 0.08446099112601177, -0.05184925977217475, -0.07434063887269698, 0.04172890751885995, 0.09353877601467542, -0.013563043440228129  …  -0.08054287163876206, 0.05061263127948962, -1.0, -0.04324772971389512, 0.0, 0.04324772971389512, 0.026592867045965876, -0.1085621106258084, 0.015078805802144768, 0.10856211062580828]
 ⋮
 [0.048581957586965624, -0.07418564535886596, -0.048581957586965596, 0.07418564535886603, 0.07959672953041654, 0.07553633369283236, -0.2136976418806657, 0.058564578657416856, 0.024415415946266172, 0.06818448163813218  …  -0.2832425982549034, 0.04360077986000874, -1.0, -0.1562078272604588, 0.0, 0.15620782726045884, 0.00110803324099723, -0.05763126997769796, 0.00834571907678014, 0.057631269977698005]
 [0.06716873560977987, -0.03220672131933461, -0.06716873560977984, 0.03220672131933455, 0.061492336192521424, -0.0528303962279717, -0.01956264671985039, 0.010900706755300693, 0.05261008610086103, -0.03985190787199  …  0.011516953729306105, -0.00802562561494713, 0.0, -0.0017398869073510514, 0.0, 0.0017398869073510514, -0.03108038435825089, -0.019880940419973875, -0.04032503880975853, 0.019880940419973903]
 [0.08367871096995334, -0.029291398256915513, -0.08367871096995338, 0.0292913982569155, 0.06119963894530368, -0.048981929906785404, -0.04478119738235342, 0.03256348834383517, 0.03389976395399012, 0.06279536579193934  …  -0.2558320851884528, 0.027901452407494795, 0.0, -0.06630406630406638, 0.0, 0.0663040663040663, 0.0015982951518380393, -0.04841747685841613, 0.00033791004573162364, 0.048417476858416186]
 [0.08514576059000123, -0.01529984264932066, -0.08514576059000123, 0.015299842649320605, 0.09568196573419097, -0.09232369743254132, -0.02691742431945962, 0.023559156017809946, 0.09937037068317527, -0.07030808062726002  …  -0.030093504230663662, 0.046975881040182355, 0.0, -0.035427884064612225, 0.0, 0.03542788406461217, 0.11017448523779971, -0.046011727849626455, -0.046696025742702374, 0.04601172784962648]
 [0.05800033896870538, -0.03287746428266908, -0.058000338968705434, 0.03287746428266913, 0.05493313593575572, -0.0162889760850721, -0.05600972467954485, 0.017365564828861257, 0.07582611606208944, -0.009975898116059867  …  -0.012183573639099249, -0.03874128267659263, -0.4, -0.03516103273888743, 0.0, 0.03516103273888749, 0.021512081117880166, -0.018092322568206504, -0.04536169113378244, 0.018092322568206476]
 [-0.03475097423703599, 0.015260568513119521, 0.034750974237035964, -0.015260568513119632, 0.09018507300939804, 0.028810150006006383, -0.022131572826650917, -0.09686365018875351, 0.19627749576988154, -0.009815780326729237  …  -0.019899200556134866, -0.09841849148418491, 0.0, -0.12966476913345984, 0.0, 0.1296647691334598, -0.03396190716448033, -0.0753968253968254, -0.019061640975677654, 0.07539682539682535]
 [0.031099198091184443, 0.04877833327068376, -0.03109919809118447, -0.048778333270683816, 0.06081681205392542, -0.00698123182659266, 0.0012001057361882006, -0.05503568596352104, 0.12629703385389368, -0.0014367816091954144  …  -0.004818474908887017, -0.1186109942995982, 0.0, 0.0977236048585747, 0.0, -0.09772360485857479, -0.023654234082755687, 0.06672862830452682, -0.05766128992235136, -0.06672862830452686]
 [0.18624221034945637, -0.059983307861563695, -0.18624221034945637, 0.05998330786156364, 0.11456502953863738, -0.1420020683408506, -0.047376052823667, 0.07481309162588017, 0.07919232875980281, -0.06045889503925711  …  -0.0551573884760939, 0.035553339206657916, 0.0, -0.15485510930350793, 0.0, 0.15485510930350782, 0.06731247594642036, -0.128422922459067, 0.02112459676031525, 0.12842292245906706]
 [0.14418649842192194, -0.03016482362153594, -0.14418649842192188, 0.030164823621535913, 0.04776915947826409, -0.09921163091991414, 0.018048506720582697, 0.03339396472106744, 0.07061498694541715, -0.029833054452094165  …  -0.07710491469640264, 0.050994818343998945, -1.0, -0.12887989203778683, 0.0, 0.12887989203778677, 0.013037558652296187, -0.11654922365246445, 0.019650310410151778, 0.1165492236524644]
 [0.053595851392629124, -0.048445134538859624, -0.05359585139262921, 0.04844513453885957, 0.058322981828460074, -0.036965409857607906, 0.0045881340761622785, -0.025945706047014405, 0.215884981041794, 0.026481164064537766  …  -0.011503707567909978, -0.052220814547080524, 0.0, 0.003117848970251691, 0.0, -0.003117848970251691, 0.10471912080148502, -0.012339230036376214, 0.0036517508236346558, 0.01233923003637627]
 [0.15763132652485473, -0.06628348449694155, -0.15763132652485468, 0.06628348449694157, 0.043203124783116845, -0.04053181832062347, -0.04577796161758052, 0.04310665515508707, 0.015053568978240426, -0.028928499727536083  …  -0.13896019158516648, 0.022419479737287407, 0.0, -0.022075716711617233, 0.0, 0.02207571671161722, 0.0056250582112778585, -0.05487662214000055, 0.024280848379657854, 0.054876622140000546]
 [0.08222297183336144, -0.1378347914765713, -0.08222297183336147, 0.13783479147657132, 0.06502001488912484, 0.03810334517140801, -0.18661260205762825, 0.0834892419970954, 0.03242936364912802, 0.028588225254354616  …  -0.19619202943212313, 0.06453258488759091, -1.0, -0.04215899423385949, 0.0, 0.04215899423385955, 0.003349711645777351, -0.07551023579742866, 0.024312266429190856, 0.07551023579742867]

And again, let’s average that out and plot.

deviation_matrix = reduce(hcat, all_genome_byAA)

average_deviation = vec(mean(deviation_matrix, dims = 2))
average_sd = vec(std(deviation_matrix, dims = 2))
average_abs_deviation = vec(mean(abs.(deviation_matrix), dims = 2))
average_abs_sd = vec(std(abs.(deviation_matrix), dims = 2))

dev_results = DataFrame(codon = labels, avg_dev = average_deviation, avg_abs_dev = average_abs_deviation, avg_std = average_sd, avg_abs_std = average_abs_sd)


sorted_avgdev = sort(dev_results, :avg_dev)

# Plotting the average deviations
avg_devplot = plot(sorted_avgdev,
  layer(yintercept=[0], Geom.hline(style = :dot, color=[colorant"ivory4"])),
  layer(x = :codon, y = :avg_dev, ymin=(sorted_avgdev.avg_dev.-sorted_avgdev.avg_std), ymax=(sorted_avgdev.avg_dev.+sorted_avgdev.avg_std), color = [colorant"darkorange1"], Geom.point, Geom.errorbar),
  Guide.xlabel("Codon"),
  Guide.ylabel("Deviation in Frequency at Start"),
  Guide.title("Averaged Deviation of Codon Frequency"),
  Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm
  )
)

# Plotting the average of absolute deviations
avg_abs_devplot = plot(sorted_avgdev,
  layer(yintercept=[0], Geom.hline(style = :dot, color=[colorant"ivory4"])),
  layer(x = :codon, y = :avg_abs_dev, ymin=(sorted_avgdev.avg_abs_dev.-sorted_avgdev.avg_abs_std), ymax=(sorted_avgdev.avg_abs_dev.+sorted_avgdev.avg_abs_std), color = [colorant"deepskyblue"], Geom.point, Geom.errorbar),
  Guide.xlabel("Codon"),
  Guide.ylabel("Deviation in Aboslute Frequency at Start"),
  Guide.title("Averaged Aboslute Deviation of Codon Frequency"),
  Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm
  )
)
set_default_plot_size(7inch, 7inch)
vstack(avg_devplot, avg_abs_devplot)
Codon TGA GTG CAG GAG AAG CCG TAA CTG GGC ATC TTC TAG CGC ACC TGC GCC TCG CGG GCG ACG AAC GAC CAC GGG TCC TAC TTG CGT AGC GTC CTC ATG TGG TCT AGG GGT GCT CCC CTT CCT TAT ATT CTA ACT CAT GAT AAT CGA AGT TCA GTT ATA TTA CCA TGT GTA TTT GGA ACA AGA GCA AAA GAA CAA h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -1.50 -1.25 -1.00 -0.75 -0.50 -0.25 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 -2 -1 0 1 2 -1.25 -1.20 -1.15 -1.10 -1.05 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65 -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75 Deviation in Aboslute Frequency at Start Averaged Aboslute Deviation of Codon Frequency Codon TGA GTG CAG GAG AAG CCG TAA CTG GGC ATC TTC TAG CGC ACC TGC GCC TCG CGG GCG ACG AAC GAC CAC GGG TCC TAC TTG CGT AGC GTC CTC ATG TGG TCT AGG GGT GCT CCC CTT CCT TAT ATT CTA ACT CAT GAT AAT CGA AGT TCA GTT ATA TTA CCA TGT GTA TTT GGA ACA AGA GCA AAA GAA CAA h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 -2.5 -2.4 -2.3 -2.2 -2.1 -2.0 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 -4 -2 0 2 -2.50 -2.45 -2.40 -2.35 -2.30 -2.25 -2.20 -2.15 -2.10 -2.05 -2.00 -1.95 -1.90 -1.85 -1.80 -1.75 -1.70 -1.65 -1.60 -1.55 -1.50 -1.45 -1.40 -1.35 -1.30 -1.25 -1.20 -1.15 -1.10 -1.05 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65 -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 Deviation in Frequency at Start Averaged Deviation of Codon Frequency

Fascinating! We do still see that AT-rich codons are favored in the beginning of genes, regardless of what amino acid they encode, especially the codons that end with AA like CAA, GAA, and AAA (UAA is a stop codon). Don’t know why - but I’m interested now!

Is codon usage more “biased” at the beginning of genes?

We’ve gotten a lot done so far using CUBScout to look at codon frequency at the beginning of genes. This has shown us that there are codons that are favored at the beginning of genes as opposed to their middle. However, our approach above isn’t very robust. GC content will strongly influence codon frequency, and we didn’t scale our data, so our results are biased towards highly-expressed codons. Luckily, many people have worked to develop measures of codon usage bias that address these biases.

There are multiple measures of codon usage bias; CUBScout is able to calculate six. These measures balance how biased a gene is in its codon usage, specifically focusing on how evenly genes use synonymous codons. Many codon usage bias measures also seek to normalize these calculations based on over-all GC-content and gene length. Finally, these measures calculate an overall bias across all codons, instead of comparing frequencies for each codon. As such, these measures are a good tool for us to explore codon usage bias as we move deeper into a gene.

Usually, codon usage bias is calculated for each gene in a genome, and compares how biased that gene’s codon usage bias is compared to the rest of the genome. As an example, let’s calculate a metric call ENC’ for our B. subtilis genome:

enc_p(filename).self
3801-element Vector{Float64}:
 61.0
 59.36979815371983
 60.7494622549966
 61.0
 61.0
 56.353402323266224
 55.025304341802055
 57.30607996896261
 61.0
 49.80272180663614
 60.68966700271204
 58.30580344890954
 53.80997034839706
  ⋮
 58.44990355902287
 61.0
 58.275539315744126
 61.0
 58.28913535421893
 61.0
 61.0
 59.45750784609185
 61.0
 59.947884174402645
 59.43051836466144
 61.0

Each element in these vector corresponds to the codon usage bias of a gene in the genome. ENC’ is completely unbiased at a value of 61, and values less than that indicate more bias. There are many ways to tune our codon usage calculations, including gene length thresholds, and removing start or stop codons. All CUBScout codon usage bias functions are multi-threaded as well; they will use all the threads you supply when you initialize Julia. We can also supply a custom reference set of genes to calculate codon usage bias against. For this, we’ll use the find_seqs function to find ribosomal genes.

filename

ribos = find_seqs(filename, r"ribosomal")
4237-element Vector{Bool}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 1

And let’s re-run, using ribosomal genes as a reference set.

enc_p_results = enc_p(filename, ref_seqs = (ribosomal = ribos,))
enc_p_results.ribosomal
3801-element Vector{Float64}:
 61.0
 58.88817312982425
 56.41038374603565
 61.0
 61.0
 59.49029071488768
 52.12248950358573
 60.0279336224004
 61.0
 55.161189431573774
 61.0
 60.41871444187302
 55.72676805318774
  ⋮
 56.331800645054436
 60.0250271210913
 57.45707941079257
 58.99998348905379
 57.92451521804812
 61.0
 61.0
 60.8417465879344
 61.0
 56.532483690129425
 55.66872760680574
 61.0

Now that we’ve practiced calculating codon usage bias, let’s use it to test the bias of the early codons in a gene. I am going to try a different approach in how we sample our codons. I’m going to use a 8-codon sliding window. So for each genome, we’ll take the first 8 codons for every gene and join them into one big “gene”. Then we’ll slide our window three codons to down, and take the next 8 codons from every gene and join them into one big “gene”. We’ll do this all the way to the 80-th codon, and then compare how biased the codon usage is between these big “genes”. I’m going to use the function all_cub which calculates all six codon usage measures at once.

function sample_sliding_codons(filepath)
  open(FASTA.Reader, filepath) do reader
        codons = []
        for record in reader
          seq = sequence(record)
          len = length(seq)
          len <= 240 && continue # Skip if smaller than 80 codons
          for i in 1:9:217
            push!(codons, seq[i:(i+23)]) # Grab each 8-codon string 
          end 
      end
      matrix = reshape(codons, 25, :) # Reshape into a matrix, where each row is a "window", each column is a gene
      return matrix
      end
end

sliding_seqs = sample_sliding_codons(files[1])

joined_seqs = LongDNA{4}.(map(join, eachrow(sliding_seqs))) # Join each row into one long string of DNA

cub_results = all_cub(joined_seqs) # Calculate all SIX CUB measures.

# Turn into a dataframe
df_results = DataFrame(B = cub_results.B.self,
          ENCP = cub_results.ENCP.self,
          ENC = cub_results.ENC.ENC,
          MCB = cub_results.MCB.self,
          MILC = cub_results.MILC.self,
          SCUO = cub_results.SCUO.SCUO,
          Start = collect(1:9:217))
bplot1 = plot(df_results,
            x = :Start,
            y = :B,
            Geom.point, Geom.line,
            layer(yintercept = [0], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("Karlin and Mrazek's B"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
encplot1 = plot(df_results,
            layer(x = :Start,
            y = :ENC,
            Geom.point, Geom.line),
            layer(yintercept = [61], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("ENC"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
encpplot1 = plot(df_results,
            layer(x = :Start,
            y = :ENCP,
            Geom.point, Geom.line),
            layer(yintercept = [61], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("ENC'"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
mcbplot1 = plot(df_results,
            layer(x = :Start,
            y = :MCB,
            Geom.point, Geom.line),
            layer(yintercept = [0], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("MCB"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
milcplot1 = plot(df_results,
            layer(x = :Start,
            y = :MILC,
            Geom.point, Geom.line),
            layer(yintercept = [.5], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("MILC"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
scuoplot1 = plot(df_results,
            layer(x = :Start,
            y = :SCUO,
            Geom.point, Geom.line),
            layer(yintercept = [0], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("SCUO"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )

set_default_plot_size(7.5inch, 5inch)
gridstack([bplot1 encplot1 encpplot1; mcbplot1 milcplot1 scuoplot1])
Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 -0.2 0.0 0.2 0.4 -0.155 -0.150 -0.145 -0.140 -0.135 -0.130 -0.125 -0.120 -0.115 -0.110 -0.105 -0.100 -0.095 -0.090 -0.085 -0.080 -0.075 -0.070 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 0.210 0.215 0.220 0.225 0.230 0.235 0.240 0.245 0.250 0.255 0.260 0.265 0.270 0.275 0.280 0.285 0.290 0.295 0.300 0.305 SCUO Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58 0.60 0.62 0.64 0.66 0.400 0.405 0.410 0.415 0.420 0.425 0.430 0.435 0.440 0.445 0.450 0.455 0.460 0.465 0.470 0.475 0.480 0.485 0.490 0.495 0.500 0.505 0.510 0.515 0.520 0.525 0.530 0.535 0.540 0.545 0.550 0.555 0.560 0.565 0.570 0.575 0.580 0.585 0.590 0.595 0.600 0.605 0.610 0.615 0.620 0.625 0.630 0.635 0.640 0.4 0.5 0.6 0.7 0.400 0.405 0.410 0.415 0.420 0.425 0.430 0.435 0.440 0.445 0.450 0.455 0.460 0.465 0.470 0.475 0.480 0.485 0.490 0.495 0.500 0.505 0.510 0.515 0.520 0.525 0.530 0.535 0.540 0.545 0.550 0.555 0.560 0.565 0.570 0.575 0.580 0.585 0.590 0.595 0.600 0.605 0.610 0.615 0.620 0.625 0.630 0.635 0.640 MILC Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 -0.25 0.00 0.25 0.50 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 MCB Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 57.5 58.0 58.5 59.0 59.5 60.0 60.5 61.0 61.5 62.0 62.5 63.0 58.0 58.1 58.2 58.3 58.4 58.5 58.6 58.7 58.8 58.9 59.0 59.1 59.2 59.3 59.4 59.5 59.6 59.7 59.8 59.9 60.0 60.1 60.2 60.3 60.4 60.5 60.6 60.7 60.8 60.9 61.0 61.1 61.2 61.3 61.4 61.5 61.6 61.7 61.8 61.9 62.0 62.1 62.2 62.3 62.4 62.5 58 60 62 64 57.95 58.00 58.05 58.10 58.15 58.20 58.25 58.30 58.35 58.40 58.45 58.50 58.55 58.60 58.65 58.70 58.75 58.80 58.85 58.90 58.95 59.00 59.05 59.10 59.15 59.20 59.25 59.30 59.35 59.40 59.45 59.50 59.55 59.60 59.65 59.70 59.75 59.80 59.85 59.90 59.95 60.00 60.05 60.10 60.15 60.20 60.25 60.30 60.35 60.40 60.45 60.50 60.55 60.60 60.65 60.70 60.75 60.80 60.85 60.90 60.95 61.00 61.05 61.10 61.15 61.20 61.25 61.30 61.35 61.40 61.45 61.50 61.55 61.60 61.65 61.70 61.75 61.80 61.85 61.90 61.95 62.00 62.05 62.10 62.15 62.20 62.25 62.30 62.35 62.40 62.45 62.50 ENC' Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 45.0 45.5 46.0 46.5 47.0 47.5 48.0 48.5 49.0 49.5 50.0 50.5 51.0 51.5 52.0 52.5 53.0 53.5 54.0 54.5 55.0 55.5 56.0 56.5 57.0 57.5 58.0 58.5 59.0 59.5 60.0 60.5 61.0 61.5 62.0 62.5 63.0 63.5 64.0 64.5 65.0 65.5 66.0 66.5 67.0 67.5 68.0 68.5 69.0 40 50 60 70 45.0 45.5 46.0 46.5 47.0 47.5 48.0 48.5 49.0 49.5 50.0 50.5 51.0 51.5 52.0 52.5 53.0 53.5 54.0 54.5 55.0 55.5 56.0 56.5 57.0 57.5 58.0 58.5 59.0 59.5 60.0 60.5 61.0 61.5 62.0 62.5 63.0 63.5 64.0 64.5 65.0 65.5 66.0 66.5 67.0 67.5 68.0 68.5 69.0 ENC Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 -0.25 0.00 0.25 0.50 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 Karlin and Mrazek's B

Okay! These plots can be a lot to interpret off the bat. We calculated six measures of codon usage bias, each of which is displayed on the y-axis. The dashed orange line indicates “no bias” for that measure. So any deviation away from the line does imply some bias for the codons from that window. So we are observing an increase in codon usage bias in the beginning of genes in Bacillus subtilis. Does this generalize? Let’s do the same analysis for all genes, for all of our genomes. I’m excluding ENC and SCUO from these calculations; they can get noisier in situations like this and so aren’t very helpful in this application.

# We'll map across our list of files, creating CUB dataframes for each one
big_cub_results_vec = map(files) do file
  codons = sample_sliding_codons(file)
  joined_seqs = LongDNA{4}.(map(join, eachrow(codons)))
  cub_results = all_cub(joined_seqs)
  df_results = DataFrame(B = cub_results.B.self,
          ENCP = cub_results.ENCP.self,
          MCB = cub_results.MCB.self,
          MILC = cub_results.MILC.self,
          Start = collect(1:9:217))
  return df_results
end

big_cub_results = reduce(vcat, big_cub_results_vec) # Combine dataframes into 1

grouped_cub = groupby(big_cub_results, :Start) # Group by the codon window position

combined_cub = combine(grouped_cub, valuecols(grouped_cub) .=> mean, valuecols(grouped_cub) .=> std) # Calculate average codon usage bias in the window

# Plot :)
bplot = plot(combined_cub,
            x = :Start,
            y = :B_mean,
            ymin = (combined_cub.B_mean .- combined_cub.B_std),
            ymax = (combined_cub.B_mean .+ combined_cub.B_std),
            Geom.point, Geom.errorbar,
            layer(yintercept = [0], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("Karlin and Mrazek's B"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
encpplot = plot(combined_cub,
            layer(x = :Start,
            y = :ENCP_mean,
            ymin = (combined_cub.ENCP_mean .- combined_cub.ENCP_std),
            ymax = (combined_cub.ENCP_mean .+ combined_cub.ENCP_std),
            Geom.point, Geom.errorbar),
            layer(yintercept = [61], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("ENC'"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
mcbplot = plot(combined_cub,
            layer(x = :Start,
            y = :MCB_mean,
            ymin = (combined_cub.MCB_mean .- combined_cub.MCB_std),
            ymax = (combined_cub.MCB_mean .+ combined_cub.MCB_std),
            Geom.point, Geom.errorbar),
            layer(yintercept = [0], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("MCB"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
milcplot = plot(combined_cub,
            layer(x = :Start,
            y = :MILC_mean,
            ymin = (combined_cub.MILC_mean .- combined_cub.MILC_std),
            ymax = (combined_cub.MILC_mean .+ combined_cub.MILC_std),
            Geom.point, Geom.errorbar),
            layer(yintercept = [.5], Geom.hline(color = colorant"darkorange1", style = :dot), order = 1),
            Guide.xlabel("Codon Position"),
            Guide.ylabel("MILC"),
            Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm, errorbar_cap_length=0mm)
            )
set_default_plot_size(5inch, 5inch)
gridstack([bplot encpplot; mcbplot milcplot])
Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78 0.80 0.82 0.84 0.86 0.0 0.5 1.0 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.80 0.81 0.82 0.83 0.84 0.85 MILC Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65 -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 -1 0 1 2 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65 -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 MCB Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 46.0 46.5 47.0 47.5 48.0 48.5 49.0 49.5 50.0 50.5 51.0 51.5 52.0 52.5 53.0 53.5 54.0 54.5 55.0 55.5 56.0 56.5 57.0 57.5 58.0 58.5 59.0 59.5 60.0 60.5 61.0 61.5 62.0 62.5 63.0 63.5 64.0 64.5 65.0 65.5 66.0 66.5 67.0 67.5 68.0 68.5 69.0 69.5 70.0 40 50 60 70 46.0 46.5 47.0 47.5 48.0 48.5 49.0 49.5 50.0 50.5 51.0 51.5 52.0 52.5 53.0 53.5 54.0 54.5 55.0 55.5 56.0 56.5 57.0 57.5 58.0 58.5 59.0 59.5 60.0 60.5 61.0 61.5 62.0 62.5 63.0 63.5 64.0 64.5 65.0 65.5 66.0 66.5 67.0 67.5 68.0 68.5 69.0 69.5 70.0 ENC' Codon Position -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 -250 0 250 500 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 -0.30 -0.28 -0.26 -0.24 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58 0.60 -0.5 0.0 0.5 1.0 -0.30 -0.29 -0.28 -0.27 -0.26 -0.25 -0.24 -0.23 -0.22 -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 Karlin and Mrazek's B

We once again see clear trends of stronger codon usage bias at the beginning of genes as opposed to the “middle”, across all of our genomes.

Conclusions

Using CUBScout we’ve shown that codon usage bias is different at the beginning of genes compared to the rest of a gene’s length. Moreover, it appears that this is especially biased towards AT-rich codons at the beginning of genes, regardless of amino-acid biases at the beginning of genes or GC content.

Comparing performance

I based CUBScout off a wonderful package from R called coRdon. One of my goals in learning Julia was to write faster code for genomic analyses. As such, I wanted to compare the performance of CUBScout to its R equivalent.

I tested these packages’ speed on three functions; ENC, MILC, and the combined process of calculating all six CUB measures. To note, in R this required customized functions, as you first read in a fasta file as a BioString, calculate codon frequency, and then calculate the metrics. In comparison, these are single-functions in CUBScout. I timed each language in running these functions for the 997 genomes I include in this analysis. You can see the code I ran in the files “r_rests.R” and “small_julia.jl” / “julia_runs.sh” on this project’s repo. All analyses were run on an AMD EPYC 64-core processor with 1014GiB system memory hosted by the Cornell BioHPC Core.

using CSV
performance_results = CSV.read(joinpath(pwd(), "Combined_Results.csv"), DataFrame)
julia_results = subset(performance_results, :Language => ByRow(x -> x == "Julia"))
grouped = groupby(julia_results, [:Threads, :Task])
combined = combine(grouped, :Time => mean, :Time => std)
r_results = subset(performance_results, :Language => ByRow(x -> x == "R"))
small_r = select(r_results, Not([:Threads, :Language, :Time]), :Time => :R_Time)
joined = leftjoin(combined, small_r, on = :Task)
transform!(joined, [:R_Time, :Time_mean] => ByRow((x,y)->x/y) => :Speed_Up)

plot(joined, 
  x = :Threads, y = :Speed_Up, color = :Task, Geom.point, Geom.line,
  Guide.ylabel("Times Faster than R"),
  Theme(discrete_highlight_color = c->nothing, alphas=[0.8], grid_line_width = 0mm
  ))
Threads -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 -10 0 10 20 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 ENC MILC ALL_CUB Task h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 -26 -24 -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 -25 0 25 50 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 Times Faster than R

This plot shows the speed-up of CUBScout compared to R for each task. CUBScout is consistently faster (5X up to 25X), especially as it is able to utilize multiple threads when analyzing multiple genomes.

Thank you for reading!

If you’ve come this far, thank you! CUBScout was a personal passion project of mine, and it’s the first time I’ve developed software. I’d love input, collaborations, or questions!