Structure of the Genetic Code

Biophysics:

Mathematical-Physics Message in the Genetic Code?

ISBN: 978-0-9849696-9-2

Version 1.0 September 23, 2020

Douglas C. Youvan

doug@youvan.com

Download Matrices in Wolfram Mathematica .nb Format

The universal genetic code is the only known look-up table in Nature. Herein, I propose that the universal genetic code arrived on Earth via directed panspermia and contains an embedded mathematical-physics message relevant to current scientific investigations. With ~10^85 possible permutations of the code, the information content of any particular code is sufficient both to satisfy biological constraints and to encode a message to physicists. I review some of the basic features of the code, show logical (binary) matrix representations, present a pedagogical model of a hypothetical 3-manifold space-time continuum, and then suggest that a large number of mathematicians and physicists should examine the code for any embedded message. Originally described as a “frozen accident”, the genetic code could very well be the SETI signal, albeit in biological form rather than as a radio signal.
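As a rough check on the order of magnitude quoted above, the sketch below counts the ways to assign each of the 64 codons to one of 21 meanings (20 amino acids plus stop). Treating every assignment as allowed, including ones that leave some meanings unused, is an assumption made here for illustration; the text does not state how the ~10^85 figure was obtained.

```python
import math

N_CODONS = 4 ** 3      # 64 triplets over the alphabet {U, C, A, G}
N_MEANINGS = 21        # 20 amino acids plus the stop signal

# Assumption: each codon may be assigned to any meaning independently,
# so this is an upper bound that also counts codes missing some meanings.
n_codes = N_MEANINGS ** N_CODONS

log10_codes = math.log10(n_codes)   # order of magnitude of the count
bits = math.log2(n_codes)           # information needed to single out one code

print(f"possible codes ~ 10^{log10_codes:.1f}")
print(f"information content ~ {bits:.0f} bits")
```

Under this assumption the count falls between 10^84 and 10^85, or roughly 281 bits per code, which is consistent with the order of magnitude quoted above.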

The universal genetic code presented as a 3-D cube. Axes 1, 2, and 3 represent the first, second, and third positions of the codon. U/T, C, A, and G are the possible nucleotides at each position. The single-letter code for the 20 amino acids is given in the table below. Stop codons are symbolized as X.
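The cube described above can be sketched as a 4 x 4 x 4 nested list indexed by the nucleotides at the first, second, and third codon positions. The axis order U, C, A, G and the helper names `lookup` and `logical_matrix` are assumptions made for illustration, not the author's Mathematica matrices; the logical matrix here simply marks with 1 every cell that encodes a chosen amino acid.

```python
BASES = "UCAG"  # assumed axis order along each edge of the cube

# Standard genetic code, third position varying fastest; X marks stop.
AMINO = ("FFLLSSSSYYXXCCXW"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")

# cube[i][j][k] is the symbol encoded by BASES[i] + BASES[j] + BASES[k].
cube = [[[AMINO[16 * i + 4 * j + k] for k in range(4)]
         for j in range(4)]
        for i in range(4)]


def lookup(codon):
    """Translate one codon; T is accepted as a synonym for U."""
    i, j, k = (BASES.index(b) for b in codon.upper().replace("T", "U"))
    return cube[i][j][k]


def logical_matrix(symbol):
    """Return a 4 x 4 x 4 array of 0/1 marking the cells that encode symbol."""
    return [[[int(cube[i][j][k] == symbol) for k in range(4)]
             for j in range(4)]
            for i in range(4)]


print(lookup("AUG"))           # start codon, methionine
print(logical_matrix("L")[0])  # the slice with U at the first position
```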

Properties of the amino acids. The name of each amino acid is followed by its three-letter code, single-letter code, hydropathy value, and molar volume. The hydropathy values are from Kyte and Doolittle: more negative values indicate more hydrophilic residues, and more positive values indicate more hydrophobic residues. Hydrophobic residues tend to partition to the interior of globular proteins or to the transmembrane region of membrane proteins. Molar volume is given in cubic Angstroms.

The figures below address the degeneracy of the code, which necessarily arises because 64 codons encode only 21 entities (20 amino acids plus stop).

Two amino acids, M and W, have a degeneracy equal to one. AUG is also the start codon.

Nine amino acids (F, Y, C, H, Q, N, K, D, E) have a degeneracy equal to two. In each case the degeneracy is confined to the third position, which is either U/C or A/G.

Only one amino acid (I) has a degeneracy of three, with variation limited to the third position (AUU, AUC, AUA).

Five amino acids (P, T, V, A, G) have a degeneracy of four. Again, all degeneracy is confined to the third position of the codon. No amino acid has a degeneracy of five.

Three amino acids (L, S, R) have a degeneracy of six, which overfills the four-fold degeneracy of the third position and therefore requires variation at the first or second position as well, while the two additional codons maintain the (U/C) or (A/G) partitioning of the third position. For L and R the additional codons differ from the four-fold block only at the first position. Serine (S) is unique in that its encoding requires changes at both the first and second positions (UCU/UCC/UCA/UCG versus AGU/AGC).
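The degeneracy classes listed above can be tallied directly from the standard codon table. The following is a minimal sketch in Python; the 64-letter string is the standard table written in U, C, A, G order with X for stop (the symbol used in this document), and the names `CODON_TABLE`, `by_degeneracy`, and `varying_positions` are illustrative rather than taken from the author's Mathematica notebooks.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, third position varying fastest; X marks stop.
AMINO = ("FFLLSSSSYYXXCCXW"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per encoded symbol (amino acid or stop).
degeneracy = Counter(CODON_TABLE.values())

# Group the 20 amino acids (stop excluded) by their degeneracy.
by_degeneracy = {}
for aa, n in degeneracy.items():
    if aa != "X":
        by_degeneracy.setdefault(n, []).append(aa)


def varying_positions(aa):
    """Return the 1-based codon positions at which the codons for aa differ."""
    codons = [c for c, a in CODON_TABLE.items() if a == aa]
    return [i + 1 for i in range(3) if len({c[i] for c in codons}) > 1]


for n in sorted(by_degeneracy):
    for aa in sorted(by_degeneracy[n]):
        print(n, aa, varying_positions(aa))
```

In this tally, L and R vary at positions 1 and 3, whereas S varies at all three positions, which is the sense in which serine is unique among the six-fold degenerate amino acids.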

There may be other biological constraints on the encoding of the genetic code. The strongest constraint is probably at the second position of the codon, where U encodes hydrophobic residues and A encodes hydrophilic residues, as shown below in blue and red, respectively. A Singular Value Decomposition (SVD) correlation coefficient was found to be c = 0.91, as discussed below.
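To illustrate the second-position pattern described above, the sketch below averages Kyte and Doolittle hydropathy over the codons that share each second-position nucleotide. It is a simple codon-weighted mean that skips stop codons, not the SVD analysis that yields c = 0.91, and the names `KD` and `mean_hydropathy_by_second_base` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, third position varying fastest; X marks stop.
AMINO = ("FFLLSSSSYYXXCCXW"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Kyte and Doolittle hydropathy index (positive = hydrophobic).
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5,
    "M": 1.9, "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8,
    "W": -0.9, "Y": -1.3, "P": -1.6, "H": -3.2, "E": -3.5,
    "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def mean_hydropathy_by_second_base():
    """Average hydropathy over sense codons grouped by second-position base."""
    means = {}
    for base in BASES:
        values = [KD[aa] for codon, aa in CODON_TABLE.items()
                  if codon[1] == base and aa != "X"]  # skip stop codons
        means[base] = sum(values) / len(values)
    return means


means = mean_hydropathy_by_second_base()
for base in BASES:
    print(base, round(means[base], 2))
```

With these values the mean is positive for second-position U and most negative for second-position A, in line with the hydrophobic and hydrophilic assignment described above.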

In addition to amino acid residue hydropathy, it is possible that the structure of the code is also correlated with amino acid molar volume, as shown below.

The figure (above) shows the M subunit from the photosynthetic reaction center, known to have 5 transmembrane alpha helices. The upper plot is from the standard Kyte and Doolittle method based on 20 amino acid parameters. The lower plot is simply a running average of second position (T