Structure of the Genetic Code


Mathematical-Physics Message in the Genetic Code?

ISBN: 978-0-9849696-9-2

Version 1.0 September 23, 2020

Douglas C. Youvan

Download Matrices in Wolfram Mathematica .nb Format

The universal genetic code is the only known look-up table in Nature. Herein, I propose that the universal genetic code arrived on Earth via directed panspermia and contains an embedded mathematical-physics message relevant to current scientific investigations. With ~10^84 possible assignments of the 64 codons to 20 amino acids and a stop signal, the information content of any particular code is sufficient both to satisfy biological constraints and to encode a message to physicists. I review some of the basic features of the code, show logical (binary) matrix representations, present a pedagogical model of a hypothetical 3-manifold space-time continuum, and then suggest that a large number of mathematicians and physicists should examine the code for any embedded message. Originally presented as a “frozen accident”, the genetic code could very well be the SETI signal, albeit in biological format rather than as a radio signal.

The universal genetic code presented as a 3-D cube. Axes 1, 2, 3 represent the first, second, and third positions of the code. U/T, C, A, and G are the nucleotides possible at each position. The single-letter code for the 20 amino acids is given in the table, below. The stop codon is symbolized as X.

Properties of the amino acids. The name of each amino acid is followed by the three-letter code, single-letter code, hydropathy value, and molar volume. The hydropathy values are from Kyte and Doolittle, with larger negative values indicative of hydrophilic residues and positive values indicative of hydrophobic residues. Hydrophobic residues tend to partition to the interior of globular proteins or to the transmembrane region of membrane proteins. Molar volume is given in cubic Angstroms.

The figures shown below will address the degeneracy of the code which necessarily arises from 64 codons coding for only 21 entities (20 amino acids plus stop).

Two amino acids, M and W, have a degeneracy equal to one. AUG, the sole codon for M, is also the start codon.

Nine amino acids have a degeneracy equal to two. Degeneracy involves the third position and is split between U/C and A/G.

Only one amino acid (I) has a degeneracy of three with variations limited to the third position.

Five amino acids (P, T, V, A, G) have a degeneracy of four. Again, all degeneracy is confined to the third position of the codon. No amino acid has a degeneracy of five.

Three amino acids (L, S, R) have a degeneracy of six which overfills the four-fold degeneracy of the third position and requires encoding at the first and second positions while maintaining the (U/C) and (A/G) partitioning of the third position. Serine (S) is unique in that its encoding requires changes at both the first and second positions.
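The degeneracy classes described above can be tallied directly from the standard codon table. The following is a minimal Python sketch, not code from the original work; the names `CODE`, `by_degeneracy`, and `varying_positions` are illustrative choices, and the stop signal is written as X to match the figures.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
# (first position varies slowest, third fastest). X marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Degeneracy = number of codons assigned to each of the 21 entities.
degeneracy = Counter(CODE.values())

# Group the single-letter codes by their degeneracy.
by_degeneracy = defaultdict(list)
for aa, n in sorted(degeneracy.items()):
    by_degeneracy[n].append(aa)


def varying_positions(aa):
    """Return the codon positions (1-3) at which the codons for aa differ."""
    codons = [c for c, a in CODE.items() if a == aa]
    return [i + 1 for i in range(3) if len({c[i] for c in codons}) > 1]


for n in sorted(by_degeneracy):
    print(n, by_degeneracy[n])
for aa in "LSR":
    print(aa, "varies at positions", varying_positions(aa))
```

Note that the stop signal (X) also occupies three codons, so it appears alongside I in the degeneracy-three group even though it is not an amino acid.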

There may be other biological constraints on the encoding of the genetic code. The strongest constraint is probably at the second position of the codon, where U encodes hydrophobic residues and A encodes hydrophilic residues, as shown below in blue and red, respectively. A Singular Value Decomposition (SVD) correlation coefficient was found to be c = 0.91 as discussed below.
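The second-position pattern can be checked against the published Kyte and Doolittle hydropathy scale. The sketch below is an illustration only and does not reproduce the SVD analysis; the helper name `second_base_values` is hypothetical, and the means are weighted by codon count with stop codons excluded, which is an assumption of this sketch.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Kyte and Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def second_base_values(base):
    """Hydropathy of every sense codon whose second position is base."""
    return [KD[aa] for codon, aa in CODE.items()
            if codon[1] == base and aa != "X"]


# Codon-weighted mean hydropathy for each second-position nucleotide.
means = {b: sum(second_base_values(b)) / len(second_base_values(b))
         for b in BASES}

for base in BASES:
    print(base, round(means[base], 2))
```

With this scale, every codon with U in the second position encodes a residue with positive hydropathy, and every sense codon with A in the second position encodes a residue with negative hydropathy.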

In addition to amino acid residue hydropathy, it is possible that the structure of the code is also correlated with amino acid molar volume, as shown below.

The figure (above) shows hydropathy plots for the M subunit of the photosynthetic reaction center, which is known to have 5 transmembrane alpha helices. The upper plot is from the standard Kyte and Doolittle method based on 20 amino acid parameters. The lower plot is simply a running average of the second codon position (T – A) over a span of 19 amino acids, used to predict cyclic hydropathy patterns: it assigns +1 to T in the second position of the codon, -1 to A, and zero to all other positions and nucleotides. The T-A method uses only half of the information from the second position of the codon, or 1/6th of the total DNA sequence. The lower plot is additionally smoothed by a second running average over 19 residues, i.e., further low-pass filtering.

Our 1990 SVD work on hydropathy originally used a 61×12 matrix to represent the genetic code in binary. The SVD code from Numerical Recipes in C was used to evaluate this over-determined matrix using Penrose’s methods. Mathematica uses the same method to find a pseudo-inverse of the matrix. More recently, I found equations that can both construct and solve such matrices if the stop codons are included and padded with average values (64×12). Mathematica code and output are shown below.

64×12 “scroll” matrix representing the genetic code. The column vector on the right is for reference purposes only, relating each row to an encoded amino acid. From left to right, each row is read in three groups of four numbers representing the three positions of the codon, with the four nucleotides in alphabetical order (A, C, G, T). For example, row 1 is read as AAA, which encodes lysine (K). Row 64 reads TTT, which encodes phenylalanine (F).

Mathematica code shows that a scroll matrix (D) can be generated for the genetic code by partitioning tuples, with an alphabet, a=4 (four nucleotides), and a word size, w=3 (three codon positions). Mathematica code also shows that an exact equation for the pseudoinverse of D (YI) is equal to the pseudoinverse found by SVD (PI).
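For readers without Mathematica, the construction of the scroll matrix from tuples can be sketched in Python. This is not a translation of the original notebook: the function name `scroll_matrix` is hypothetical, and the sketch only builds D, leaving out the pseudoinverse because the exact equation is not reproduced in the text.

```python
from itertools import product

ALPHABET = "ACGT"  # a = 4 nucleotides, in alphabetical order
WORD_SIZE = 3      # w = 3 codon positions


def scroll_matrix(alphabet=ALPHABET, word_size=WORD_SIZE):
    """Build the binary scroll matrix: one row per word, one-hot block per position."""
    labels, rows = [], []
    for word in product(alphabet, repeat=word_size):
        row = []
        for letter in word:
            # Append a one-hot block of length a for this codon position.
            row.extend(1 if letter == symbol else 0 for symbol in alphabet)
        labels.append("".join(word))
        rows.append(row)
    return labels, rows


labels, D = scroll_matrix()
print(len(D), "x", len(D[0]))   # number of rows and columns
print(labels[0], D[0])          # first row
print(labels[-1], D[-1])        # last row
```

Each row contains exactly three 1's (one per codon position), and each column contains sixteen, since every nucleotide appears at a given position in a quarter of the 64 codons.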

The column vector shown above can be mapped to form another matrix (64×21) by using the following substitutions.

Key (above) for translating the 64×1 column vector into the 64×21 binary matrix (below).

64×21 “connection” matrix, wherein each row corresponds to a scroll matrix row. From left to right, columns are organized alphabetically by the single-letter code: A, C, D … X, Y.

Thus, the 64×12 scroll matrix and the 64×21 connection matrix are correlated by row, and they form the simplest, complete representation of all the information carried by the genetic code in binary (logical) format. Vertical alignment of 1’s in the connection matrix represents degeneracy. One caveat is that the connection matrix rows could be reduced to 7-bit binary numbers (64×7), but this would obscure degeneracy.
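The connection matrix can be built the same way. The Python sketch below is illustrative rather than the original Mathematica code; the names `CODE`, `COLUMNS`, `ROW_LABELS`, and `connection_matrix` are hypothetical, rows follow the scroll matrix order (AAA to TTT), and X denotes stop.

```python
from itertools import product

# Standard genetic code in DNA letters, enumerated in T, C, A, G order.
AMINO = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}

# Columns: the 21 single-letter codes in alphabetical order (A, C, D ... X, Y).
COLUMNS = sorted(set(AMINO))

# Rows: codons in the same order as the scroll matrix (AAA ... TTT).
ROW_LABELS = ["".join(w) for w in product("ACGT", repeat=3)]


def connection_matrix():
    """One row per codon, with a single 1 in the column of the encoded entity."""
    return [[1 if CODE[codon] == col else 0 for col in COLUMNS]
            for codon in ROW_LABELS]


C = connection_matrix()

# Column sums recover the degeneracy of each amino acid (and of stop).
column_sums = dict(zip(COLUMNS, (sum(col) for col in zip(*C))))
print(column_sums)
```

Because every codon has exactly one meaning, each row sums to one, and the vertical alignment of 1's in a column is the degeneracy of that column's amino acid.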

Similar pairs of scroll and connection matrices can represent any of the possible encodings that assign all 20 amino acids and 1 stop codon at least once. There are ~1.5 × 10^84 such pairs, as calculated by Mathematica code (from Daniel Lichtblau, Wolfram Research).
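The count quoted above can be cross-checked independently. An encoding that uses every one of the 21 meanings at least once is a surjection from the 64 codons onto the 21 meanings, and the number of surjections follows from inclusion-exclusion. The Python sketch below is my own illustration of that standard formula, not the Mathematica code referred to above; the function name `count_surjections` is hypothetical.

```python
from math import comb


def count_surjections(n_codons=64, n_meanings=21):
    """Count maps from n_codons onto n_meanings that use every meaning at least once.

    Inclusion-exclusion: sum over k of (-1)^k * C(m, k) * (m - k)^n,
    where k is the number of meanings deliberately left unused.
    """
    return sum(
        (-1) ** k * comb(n_meanings, k) * (n_meanings - k) ** n_codons
        for k in range(n_meanings + 1)
    )


total = count_surjections()
print(f"{total:.3e}")
```

The result is an exact integer on the order of 10^84, consistent with the figure given above and smaller than the 21^64 unrestricted assignments.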

An advanced civilization would not necessarily have to use a degenerate code or a code highly correlated with hydropathy. Therefore, I think it would be a mistake to assume that such properties restrict the encoding of another message. What is needed at this time is a large number of information scientists, mathematicians, and physicists to look at the code with unbiased eyes. Perhaps someone will see the solution to a current problem. From the standpoint of an advanced civilization or designer, any recipient civilization that can decode its own genetic code should also be able to decode an advanced mathematical/physics message.

No single person is likely to solve this problem – especially me, with my limited knowledge of physics. However, I can form a hypothesis and follow it through simply to illustrate the pedagogical method:

Four variables in one dimension remind me of Minkowski space-time. Looking at the unique degeneracy of I and M, I pick (U, C, A) as (x, y, z), in unknown order, and G as -t. Given that there are three axes, I postulate three (orthogonal) space-time continuums. S, as described above, is in a unique position, casting degeneracy across all three axes (manifolds). I further postulate that these degeneracies are connections between the three manifolds, as shown below.

The two figures (above) show the special position of serine (S), which uses degeneracy in all three positions of the genetic code, and the use of such connections (gold lines) among the three manifolds in the hypothetical model of space-time continuums, where (U, C, A) can be mapped to (x, y, z) and G to -t in this 3-manifold model.

I will readily admit that I am not the best person to make the ingenious guess that would result in some unifying theory in mathematics and physics. However, the binary matrices given above are the most concise representation of the genetic code. In the search for a theory of everything (ToE), theoretical physicists are currently looking to information theory for possible holographic projections within a binary mathematical universe that explains quantum entanglement, general relativity, quantum mechanics, and gravity.

In my opinion, directed panspermia is far more likely than a cascade of sequential events that includes abiogenesis, the formation of a Prigogine dissipative structure (carrying information), an RNA world, the Central Dogma replacing ribozymes, and the evolution of the genetic code from a more basic code. Our odds are greatly increased by looking to the estimated 10^24 stars in the universe and to processes completely unknown to us. Notable proponents of (directed) panspermia include Svante Arrhenius, Fred Hoyle, and Francis Crick.

Finally, given that SETI has expended so much effort on radio signals, should this problem not be examined as well? The downside of ignoring the possibility of a “code within the code” is potentially ignoring First Contact.

You are free to use any material on this website with attribution.

Work in Progress:

S Matrix

C Matrix