Restriction enzymes use a 24 dimensional coding space to recognize 6 base long DNA sequences

Restriction enzymes recognize and bind to specific sequences on invading bacteriophage DNA. Like a key in a lock, these proteins require many contacts to specify the correct DNA sequence. Using information theory we develop an equation that defines the number of independent contacts, which is the dimensionality of the binding. We show that EcoRI, which binds to the sequence GAATTC, functions in 24 dimensions. Information theory represents messages as spheres in high dimensional spaces. Better sphere packing leads to better communications systems. The densest known packing of hyperspheres occurs on the Leech lattice in 24 dimensions. We suggest that the single protein EcoRI molecule employs a Leech lattice in its operation. Optimizing density of sphere packing explains why 6 base restriction enzymes are so common.




1 Introduction

Restriction enzymes provide a common defense mechanism in procaryotes against foreign DNA injected by bacteriophages [1, 2]. These proteins bind to specific sequences on DNA and cleave the DNA, rendering it susceptible to attack by exonucleases [3, 4] and preventing viral replication. The bacterial genome is protected by modification enzymes that methylate the same pattern that the restriction enzymes cut. Though the sequences bound by restriction enzymes usually consist of only 4 or 6 base pairs, even the least drastic base change of the GAATTC EcoRI binding site decreases EcoRI binding by 1000 fold [5]. How can restriction enzymes have such precise recognition? Why do we find the majority of restriction enzymes have exactly 4 or 6 base pair long recognition sequences?

To model recognition of EcoRI binding to DNA, we distinguish two different energy flows in time. The first is the energy dissipated during the binding or ‘operation’ of the molecules (power) [6]. For EcoRI we consider two states, and in both states the protein is associated with the DNA. The dissipation proceeds from a high energy ‘before’ state in which the EcoRI molecule is somewhere on the DNA but not bound specifically. Then, after a Brownian motion search, when EcoRI encounters a binding site it may begin to form specific bonds. As these bonds form, energy is dissipated to the surrounding water until EcoRI has formed all its bonds to the DNA. We call this latter low energy state the ‘after’ state. The energetic difference between these two states is the specific binding energy [7]. The time of this dissipation may vary, but the total energy dissipated is constant between the two states, as indicated by the measurability of the binding energy for the operation [8]. The second important energy involved in EcoRI binding is the thermal noise that passes through the molecule during the binding operation. This noise, $N_y$, interferes with bond formation. These concepts are parallel to the communications model developed by Shannon in which the power of the communication signal is absorbed into and then dissipates from the receiver while it selects a particular message. The receiver must also handle additional energy caused by thermal noise added to the power. The ratio of power to noise is called the ‘signal-to-noise’ ratio, but this term is not appropriate for EcoRI since there is no external signal. However, in both models there are two energies dissipated in time, the power $P_y$ and the noise $N_y$, and there is a selection of specific states. Of course, a sufficiently strong thermal noise will eventually dislodge EcoRI from its specific binding, but this reversal is not the selection process we are interested in. Having set up these concepts allows us to apply powerful theorems from information theory to the recognition problem [9, 6, 10, 7].

However, the problem of recognition cannot be explained by thinking about the protein-DNA contact as a single interaction. Instead, there are multiple interactions including hydrogen, van der Waals and electrostatic bonds. To describe this set of interactions takes a series of numbers. Some of these interactions could be independent like the pins in a lock. As in a lock, it is advantageous for the pins to be independent because that way the lock can represent more combinations and is more secure [11]. Unlike a lock, the microscopic EcoRI molecule is continuously impacted and violently jostled by thermal noise ($N_y$). Each molecular ‘pin’ has a velocity that is the sum of many small impacts, so the central limit theorem from statistics tells us that the velocity will be approximately Gaussian [12]. So the moving parts of a molecule (the ‘pins’) that help EcoRI select GAATTC can be modeled as a set of independent molecular oscillators moving under the influence of thermal noise [6].

EcoRI potentially has many pins, each with a particular velocity. When one has a set of independent numbers they can be described as a point in a high dimensional space. Furthermore, when two independent Gaussian distributions are combined at right angles to represent their independence, the resulting 2-dimensional distribution is circular [13]. With three independent Gaussian distributions the combined distribution is a sphere and when there are more than three the distribution is still spherical, a hypersphere. The radius of the sphere is proportional to the square root of the thermal noise impacting on the molecule [6].
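The spherical shape of combined Gaussian distributions can be checked numerically. The following is a minimal sketch, not taken from the paper: it draws points whose coordinates are independent unit-variance Gaussians (standing in for independent ‘pin’ velocities) and reports how tightly the radii cluster. The function name, sample count and seed are illustrative assumptions.

```python
import math
import random
import statistics


def radius_spread(dim, samples=2000, seed=0):
    """Return (mean radius, relative standard deviation of the radius) for
    points whose `dim` coordinates are independent unit-variance Gaussians."""
    rng = random.Random(seed)
    radii = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(samples)
    ]
    mean_radius = statistics.fmean(radii)
    return mean_radius, statistics.pstdev(radii) / mean_radius


# The relative spread shrinks as the number of independent coordinates grows.
for dim in (2, 3, 24, 100):
    mean_radius, spread = radius_spread(dim)
    print(f"D={dim:3d}  mean radius={mean_radius:6.3f}  relative spread={spread:.3f}")
```

With unit variance per coordinate the mean radius grows roughly as the square root of the dimension, while the relative spread of the radius falls, so the points concentrate on a thin spherical shell.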

The higher the dimension of the sphere, the more the distribution converges to a single radius and the sphere skin or ‘thickness’ becomes smaller [6]. To see this, consider the volume of a ball (the region enclosed by a sphere) of radius $r$ embedded in a $D$ dimensional space,

$$V(r) = \frac{\pi^{D/2}}{\Gamma(D/2 + 1)} r^D \tag{1}$$

For a radius $r$, let half the volume lie in the shell between $r_{1/2}$ and $r$, that is $V(r_{1/2})/V(r) = (r_{1/2}/r)^D = 1/2$. Rearranging gives $r_{1/2}/r = 2^{-1/D}$ and as $D \to \infty$, $r_{1/2} \to r$, so the volume is densest near the surface.
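To make the concentration of volume near the surface concrete, here is a small sketch that relies only on the fact that the volume of a ball scales as the radius raised to the power of the dimension. The function name and the chosen dimensions are illustrative assumptions.

```python
def inner_radius_fraction(dim, inner_volume_fraction=0.5):
    """Return r_inner / r such that the ball of radius r_inner holds the given
    fraction of the volume of the ball of radius r.

    Because volume scales as r**dim, the fraction equals (r_inner / r)**dim.
    """
    if dim < 1:
        raise ValueError("dimension must be at least 1")
    if not 0.0 < inner_volume_fraction < 1.0:
        raise ValueError("volume fraction must lie strictly between 0 and 1")
    return inner_volume_fraction ** (1.0 / dim)


# Half of the volume lies outside this fraction of the radius.
for dim in (2, 3, 24, 100, 1000):
    print(f"D={dim:4d}  half the volume lies beyond {inner_radius_fraction(dim):.4f} r")
```

In 24 dimensions the outer half of the volume already sits in roughly the outermost 3% of the radius.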

So the state of EcoRI bound to GAATTC after dissipation can be represented as a hypersphere. Because the pins of EcoRI have an instantaneous position and velocity, at any one instant EcoRI is at a particular point on the sphere and moves by Brownian motion across the sphere surface. EcoRI bound to a different DNA sequence, such as CAATTC, is on a different hypersphere. If these two spheres were to intersect, then EcoRI would be able to bind sites other than GAATTC and this error would be fatal to the bacterium whose DNA is only protected at GAATTC. Thus the hyperspheres should not intersect. When EcoRI binds to DNA, that provides a finite amount of energy that can be dissipated per binding ($P_y$), so there is a finite set of hyperspheres that can be bound. During evolution EcoRI will tend to minimize the binding energy, while the number of hypersphere states it selects between remains constant [7], so the hyperspheres become tightly packed together without intersecting (Fig. 1). Thus EcoRI can evolve to bind efficiently, using the minimum energy to select between the maximum number of binding states. In addition, by using many interactions in a high dimensional space, the hyperspheres become sharply defined because the distribution around the sphere radius (the thickness) becomes smaller [6]. This allows EcoRI to evolve to reduce the number of times it cuts the wrong sequence, giving it a low error rate.

Figure 1: Sphere packing. Circles demonstrate square and hexagonal sphere packing in two dimensions. The hexagonal packing is about 15% more dense. In higher dimensional spaces sphere packing is less intuitive. When hyperspheres pack together there is an odd property diagrammed on the right side of the figure (which is derived from Shannon’s proof of the channel capacity theorem, Theorem 2 in his figure 5 [9]). The vertical arrow represents moving from the center of one hypersphere to the center of a second hypersphere. For Shannon, working with electrical communications, this voltage is the square root of the power dissipation, $\sqrt{P_y}$. In a 100 dimensional space, the thermal noise in the second sphere (green circle) disturbs the signal in all directions, shown by splayed arrows with lengths $\sqrt{N_y}$. However, 99 of those dimensions do not perturb in the direction of the power dissipation. In his proof, Shannon neglected the 1% of the noise in the direction of the power since this represents the error, and it can be made as small as one may desire by increasing the dimensionality; in 1000 dimensions the error is only 0.1%. So relative to the direction of the power, the received hypersphere can be treated as a flat surface since all the other directions (splayed arrows) are at right angles to the power direction. If two hyperspheres are to be separated with as low an error as desired, then the power to get from one to the next must just exceed the thermal noise power of the first sphere, so $\sqrt{P_y} > \sqrt{N_y}$ and $P_y > N_y$.

Shannon dealt with a closely related problem regarding maximizing the information that could be sent over a phone line for a given power ($P$, joules per second) [14, 9]. Messages from a transmitter can be broken into a series of independent voltages and so the set of numbers describing a particular message can be represented as a point in a high dimensional space which we call the ‘coding space’ since the message is represented by a code, the set of voltage values along each dimension. In addition, thermal noise on the phone wire causes the received voltage pulses to vary according to a Gaussian distribution. So if a message were repeated many times the received message points would form a sphere in the high dimensional space. When the receiver gets one of these noise-disturbed points, it can determine which of the possible transmitted messages is closest and thereby ‘decode’ the message to produce a clear noise-free signal for the person. Shannon recognized that the received spheres should not intersect if the receiver is to avoid ambiguity in decoding.
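The decoding step described above, picking the transmitted message closest to the received point, can be sketched as follows. This is an illustration under stated assumptions rather than Shannon's construction: the three 8-dimensional codewords, the noise level and the seed are hypothetical values chosen only to show the idea.

```python
import math
import random


def transmit(codeword, noise_sigma, rng):
    """Add independent Gaussian noise to every coordinate of a codeword."""
    return [x + rng.gauss(0.0, noise_sigma) for x in codeword]


def nearest_codeword(received, codebook):
    """Return the index of the codeword closest (Euclidean) to the received point."""
    return min(range(len(codebook)), key=lambda i: math.dist(received, codebook[i]))


# Hypothetical codebook: three messages, each a point in an 8-dimensional space.
codebook = [
    [+1.0] * 8,
    [-1.0] * 8,
    [+1.0] * 4 + [-1.0] * 4,
]

rng = random.Random(0)
received = transmit(codebook[2], noise_sigma=0.3, rng=rng)
print("decoded message index:", nearest_codeword(received, codebook))
```

Decoding succeeds as long as the noise sphere around each codeword does not reach closer to another codeword, which is the non-intersection condition stated above.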

The receiver in a communications system selects particular symbols from all possible symbols that the transmitter might send. Similarly, a molecule such as EcoRI selects a particular state (binding to GAATTC) from an array of possible states (binding to any arbitrary 6 base long sequence). This concept applies to many other biological macromolecules. These ‘molecular machines’ include proteins that bind DNA, proteins that detect light such as rhodopsin in the eye and proteins that cause motion such as myosin moving on actin in muscle [6]. In every case the molecular machines dissipate energy in order to settle into one of several possible lower energy states and they do this despite the presence of violent thermal noise.

This paper answers the question of why restriction enzymes have such high fidelity despite being disturbed by thermal noise by showing how to calculate the coding space dimensionality of nucleic-acid recognizing molecular machines. The measured dimensionalities imply that restriction enzymes have evolved to exploit coding techniques only recently developed for modern communication systems. This in turn suggests that humans should also be able to build nanometer scale molecules that decode signals.

Background concepts important for understanding these results are basic information theory [15, 16], general molecular biology [17], and how to measure the information content of binding sites on DNA or RNA in bits [18, 19]. In addition, messages in a communications system and the states of molecular machines can be represented by spheres packing together in a high dimensional coding space [9, 20, 6]. The isothermal efficiency is described in [7, 21]. For reviews, see [22, 23, 21].

2 A lower bound on the dimensionality of molecular machines

Molecular machines are molecules that select specific states while dissipating energy [6]. The information, in bits, that a molecular machine can gain is the base 2 logarithm of the number of states it selects amongst [18, 24, 19]. The maximum number of bits that can be gained for the energy dissipated in a communications system is the channel capacity [9]. For molecular machines, we call the corresponding measure the molecular machine capacity [6]. Formulas for the capacities contain a term that represents the dimensionality of the coding space in which the state spheres are packed. Therefore a rearrangement of the formula leads to an equation for the dimensionality. This provides a step towards understanding the nature of the coding space of molecular machines.

The maximum number of distinct choices that a molecular machine can make in the presence of thermal noise $N_y$ by dissipating energy $P_y$ depends on these two factors and also on the number of independent moving parts of the machine or ‘pins,’ $d_{space}$, following the lock and key analogy of molecular machines [6]. In communications, the channel capacity sets the upper bound on the rate that information can be faithfully transmitted [14, 9]. Corresponding to the channel capacity of communications systems, a molecular machine’s capacity is:

$$C_y = d_{space} \log_2\left(\frac{P_y}{N_y} + 1\right) \quad \text{(bits per operation)} \tag{2}$$

where a molecular machine operation is, for example, the process of going from non-specific to specific DNA binding by a nucleic acid recognizer [6]. This formula was derived by counting the maximum number of distinct molecular states, represented as spheres in a high dimensional space (see [22, 23, 21] for reviews). Shannon’s channel capacity theorem [9] implies that the sequence information a nucleic acid recognizing molecular machine uses to locate its binding sites, $R_{sequence}$ [18], can evolve up to but not beyond this capacity:

$$R_{sequence} \leq C_y \tag{3}$$

For nucleic acid recognizers, $R_{sequence}$ is the area under a sequence logo [24]. The dimensionality of the coding space used to describe these states is:

$$D = 2 d_{space} \tag{4}$$

since there are both a phase and an amplitude for each of the independent oscillator pins that describe the motions of a molecule at thermal equilibrium [6]. Combining equations (2), (3), and (4) gives a lower bound for the dimensionality:

$$D \geq \frac{2 R_{sequence}}{\log_2(P_y/N_y + 1)} \tag{5}$$

This lower bound is a function of the information gain $R_{sequence}$ and the ratio $P_y/N_y$.
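As a worked illustration of the lower bound, the sketch below assumes the capacity has the Shannon form $C = d \log_2(P/N + 1)$ with $d$ independent pins, that the coding space has $2d$ dimensions, and that the site information cannot exceed the capacity. The function and parameter names are hypothetical.

```python
import math


def machine_capacity_bits(pins, power_to_noise):
    """Capacity in bits per operation for `pins` independent pins (assumed form)."""
    return pins * math.log2(power_to_noise + 1.0)


def dimension_lower_bound(site_information_bits, power_to_noise):
    """Smallest coding space dimension compatible with the site information.

    Follows from information <= capacity and dimension = 2 * pins.
    """
    if power_to_noise <= 0.0:
        raise ValueError("power-to-noise ratio must be positive")
    return 2.0 * site_information_bits / math.log2(power_to_noise + 1.0)


# A 12 bit site (6 fully specified bases) at several power-to-noise ratios.
for ratio in (1.0, 3.0, 7.0):
    print(f"P/N={ratio:3.0f}  D >= {dimension_lower_bound(12.0, ratio):.1f}")
```

The bound falls as the power-to-noise ratio rises, because each pin can then carry more than one bit.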

3 Applying the dimensional lower bound to restriction enzyme coding space

The maximum theoretical isothermal efficiency of a molecular machine is defined entirely by the dissipated energy and the thermal noise in terms of the normalized energy dissipation $\rho$:

$$\epsilon_t = \frac{\ln(\rho + 1)}{\rho} \tag{6}$$

where $\rho = P_y/N_y$ [7]. This expression was first used to describe the efficiency of satellite communications in terms of the ‘signal-to-noise’ ratio, $P/N$ [25].

The efficiency of EcoRI and other molecular machines is observed to be close to 70% [7]. This can be explained if $P_y = N_y$, in which case equation (6) shows $\epsilon_t = \ln 2 \approx 0.69$. The relationship between $P_y$ and $N_y$ measures the distance between hyperspheres so $P_y > N_y$ implies that the state of being bound to one sequence is distinct from the state of being bound to a different sequence.
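The efficiency curve can be evaluated directly. This sketch assumes the theoretical isothermal efficiency has the form $\ln(\rho + 1)/\rho$, where $\rho$ is the ratio of dissipated energy to thermal noise; the function name is a hypothetical choice.

```python
import math


def theoretical_efficiency(rho):
    """Theoretical isothermal efficiency ln(rho + 1) / rho (assumed form)."""
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    # log1p keeps precision when rho is very small.
    return math.log1p(rho) / rho


# Efficiency falls as more energy is dissipated relative to the noise.
for rho in (0.01, 0.5, 1.0, 2.0, 10.0):
    print(f"rho={rho:5.2f}  efficiency={theoretical_efficiency(rho):.4f}")
```

At $\rho = 1$ the value is $\ln 2$, about 0.69, and the curve decreases monotonically for larger $\rho$.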

When there is a choice to be made among several molecular states, such as the strong discrimination EcoRI makes between GAATTC and single base changes of that sequence [5], then the molecular machine operates under the condition that its states are separated, which has been shown geometrically to be equivalent to

$$P_y \geq N_y \tag{7}$$

(Fig. 1) [7]. This inequality limits the efficiency to about 70%. Substituting the limiting value $P_y/N_y = 1$ allowed by inequality (7) into equation (5) gives

$$D \geq 2 R_{sequence} \tag{8}$$

For fully evolved bistate molecular machines the dimensionality of the coding space is at least twice the information content of a binding site when the latter is expressed in bits. Thus, a lower bound of the dimensionality for the EcoRI coding space is found by noting that GAATTC is 6 bases or 12 bits, so EcoRI operates in a coding space of at least 24 dimensions. Similarly, as a consequence of the inequality (7), $d_{space}$ supplies a lower bound for the channel capacity in equation (2),

$$C_y \geq d_{space} \tag{9}$$
For example, given , if then , so .

4 An upper bound on the dimensionality of molecular machines

The higher the dimension that a molecular machine can work in, the more the probability density tightens around the radius of the hyperspheres [6]. This suggests that biological systems may tend to evolve to extremely high dimensions to reduce the error rate caused by switching between the hyperspheres. So having determined a lower bound on the dimensionality of a molecular machine (equation (5)) is tantalizing but unsatisfying because biological systems may have much higher dimensionality. For this reason we sought an upper bound on the dimensionality.

The dimensionality of a molecule is related to the number of degrees of freedom that a molecule has. For $n$ atoms there are 3 independent axes each atom can move on, giving $3n$ coordinates, but the three translational motions and three rotations about the axes do not contribute to the functioning of the machine, so there are only

$$3n - 6 \tag{10}$$

degrees of freedom. For water $n = 3$, so there are $3n - 6 = 3$ normal modes that can be observed in the vibrational spectrum of the molecule. These three motions can be described by common arm exercises with the head representing oxygen and the fists hydrogen: pushup/pullup ($\nu_1$ symmetric stretch), jumping jack ($\nu_2$ bending mode) and one-two punch ($\nu_3$ asymmetric stretch) [26].

Although the number of degrees of freedom of an entire molecule consisting of $n$ atoms is $3n - 6$, the relevant number of degrees of freedom involved in the molecular machine selection process coding space ($D$) is most likely much smaller [6]:

$$D \leq 3n - 6 \tag{11}$$

because, to be able to evolve, each molecular machine ‘pin’ is likely to consist of several atoms. For a large molecule like EcoRI with thousands of atoms, the relevant degrees of freedom ($D$) for DNA binding will be much smaller than given by equation (11), so that relationship does not give a useful upper bound.

As Jaynes pointed out [27, 28], based on the classical equipartition theorem, the energy per degree of freedom of a single thermal oscillator in a molecular machine (lock pin) is $\frac{1}{2} k_B T$, where $k_B$ is Boltzmann's constant and $T$ is the absolute temperature; with $D$ degrees of freedom the total thermal noise flowing through a molecule during one dissipation step of $P_y$ that selects a specific molecular state is

$$N_y = \frac{1}{2} k_B T D \tag{12}$$

(see also equation (31) in [6]).

For molecules that make distinct decisions by selecting between nonoverlapping hyperspherical states, the inequality (7) applies. Substituting (7) into equation (12),

$$D \leq \frac{2 P_y}{k_B T} \tag{13}$$

This provides an upper bound on the functional dimensionality, whereas equation (5) provides a lower bound.

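We convert equation (13) to a more useful form by noting that the energy available in coding space for making selections at one temperature and pressure [6, 7, 21] is the Gibbs free energy:

$$P_y = -\Delta G^\circ \tag{14}$$

[7]. The maximum number of bits that can be gained for that free energy dissipation is

$$R_{energy} = -\frac{\Delta G^\circ}{E_{min}} \tag{15}$$

[7]. $E_{min}$ can be derived from information theory [29] or the second law of thermodynamics [30, 10]. It serves as an ideal conversion factor between energy and bits:

$$E_{min} = k_B T \ln 2 \quad \text{(joules per bit)} \tag{16}$$

[7]. Further, a ‘real’ isothermal efficiency $\epsilon_r$, that may be less than the theoretical efficiency $\epsilon_t$ of equation (6),

$$\epsilon_r \leq \epsilon_t \tag{17}$$

can be measured by the information gained, $R_{sequence}$, versus the information that could have been gained for the given energy dissipation, $R_{energy}$:

$$\epsilon_r = \frac{R_{sequence}}{R_{energy}} \tag{18}$$

[7]. Successively combining equations (14) to (18) gives

$$P_y = \frac{R_{sequence} \, k_B T \ln 2}{\epsilon_r} \tag{19}$$

Inserting this result into equation (13) gives

$$D \leq \frac{2 R_{sequence} \ln 2}{\epsilon_r} \tag{20}$$

which we recognize as an upper bound on the coding space dimensionality as a function of the information gain $R_{sequence}$ and the isothermal efficiency $\epsilon_r$.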
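The chain of conversions from free energy to bits to an upper bound on the dimensionality can be followed numerically. This is a sketch under stated assumptions: the minimum energy per bit is taken as $k_B T \ln 2$, the real efficiency as site information divided by the bits available from the free energy, and the bound as $2 R \ln 2 / \epsilon$. The free energy value and temperature below are hypothetical inputs for illustration, not measurements.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23   # exact SI value
AVOGADRO_PER_MOL = 6.02214076e23   # exact SI value


def e_min_joules_per_bit(temperature_k):
    """Minimum energy dissipation needed to gain one bit at this temperature."""
    return BOLTZMANN_J_PER_K * temperature_k * math.log(2)


def energy_bits(delta_g_kj_per_mol, temperature_k):
    """Bits that could be gained from a (negative) free energy change."""
    joules_per_molecule = -delta_g_kj_per_mol * 1000.0 / AVOGADRO_PER_MOL
    return joules_per_molecule / e_min_joules_per_bit(temperature_k)


def real_efficiency(site_information_bits, delta_g_kj_per_mol, temperature_k):
    """Information actually gained divided by the information the energy allows."""
    return site_information_bits / energy_bits(delta_g_kj_per_mol, temperature_k)


def dimension_upper_bound(site_information_bits, efficiency):
    """Upper bound on the coding space dimension (assumed form 2 R ln2 / efficiency)."""
    return 2.0 * site_information_bits * math.log(2) / efficiency


temperature_k = 298.15      # assumed temperature
delta_g_kj_per_mol = -40.0  # hypothetical free energy, for illustration only
efficiency = real_efficiency(12.0, delta_g_kj_per_mol, temperature_k)
print(f"available bits = {energy_bits(delta_g_kj_per_mol, temperature_k):.2f}")
print(f"efficiency     = {efficiency:.3f}")
print(f"D <= {dimension_upper_bound(12.0, efficiency):.1f}")
```

A less efficient machine wastes more energy per bit, so its upper bound on the dimensionality is looser; at an efficiency of $\ln 2$ the bound for a 12 bit site is exactly 24.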

5 Pincers on the dimensionality of molecular machines

Having determined both a lower bound (equation (5)) and an upper bound (equation (20)) on the dimensionality of the coding space, we have the opportunity to determine what will happen as the molecular machine evolves to be optimally efficient.

Combining equations (5) and (20) gives

$$\frac{2 R_{sequence} \ln 2}{\ln(\rho + 1)} \leq D \leq \frac{2 R_{sequence} \ln 2}{\epsilon_r} \tag{21}$$

where we have recast the left hand side in terms of $\rho = P_y/N_y$ and expressed the logarithm in base $e$ to emphasize the striking symmetry of the two sides. The dimensionality is constrained to lie between these bounds, which form closing ‘pincers’ as the molecular machine evolves to become optimal. In the limit as $\rho \to 1$ and $\epsilon_r$ evolves to its maximum value of $\ln 2$ [7], both sides converge to $2 R_{sequence}$, and $D$ is squeezed between them. An optimally evolved molecular machine will operate in $D$ dimensions:

$$D = 2 R_{sequence} \tag{22}$$

However, because the state hyperspheres have a finite thickness [6], $P_y$ must at least slightly exceed $N_y$, and the left hand side of (21) remains slightly smaller than $2 R_{sequence}$. Likewise, because of equations (6) and (17), $P_y > N_y$ also means that the efficiency $\epsilon_r$ is slightly below $\ln 2$, which makes the right hand side of (21) slightly larger than $2 R_{sequence}$. So the dimensionality is restricted to a small interval. This allowed variation in the coding space dimensionality is caused by the effective thickness of the sphere surface fuzziness (which depends on the dimensionality itself) and the evolutionarily acceptable error rate determined by the environment, which limits how closely the coding spheres can approach each other and still allow survival [6].
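The closing of the pincers can be watched numerically. The sketch below assumes a lower bound of $2 R \ln 2 / \ln(\rho + 1)$ and an upper bound of $2 R \ln 2 / \epsilon$, and for illustration sets the real efficiency equal to the theoretical efficiency $\ln(\rho + 1)/\rho$ at each $\rho$. Function names and the sampled $\rho$ values are hypothetical.

```python
import math


def theoretical_efficiency(rho):
    """Theoretical isothermal efficiency ln(rho + 1) / rho (assumed form)."""
    return math.log1p(rho) / rho


def dimension_interval(site_information_bits, rho, efficiency):
    """Return (lower, upper) bounds on the coding space dimension."""
    if rho < 1.0:
        raise ValueError("state spheres overlap when rho < 1")
    if not 0.0 < efficiency <= theoretical_efficiency(rho):
        raise ValueError("real efficiency cannot exceed the theoretical efficiency")
    scaled_bits = 2.0 * site_information_bits * math.log(2)
    return scaled_bits / math.log1p(rho), scaled_bits / efficiency


# A 12 bit site: the interval tightens around 24 as rho approaches 1.
for rho in (2.0, 1.5, 1.1, 1.01, 1.0):
    lower, upper = dimension_interval(12.0, rho, theoretical_efficiency(rho))
    print(f"rho={rho:4.2f}  {lower:6.2f} <= D <= {upper:6.2f}")
```

Using the best possible efficiency at each $\rho$ gives the narrowest interval; a real efficiency below the theoretical value would only widen the upper side.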

For example, DNA polymerase has a certain error rate, and of course if that error rate were to increase the organism would experience a higher number of mutations and be at a selective disadvantage. However, there are also mutations of the polymerase that decrease its error rate [31]. This would reduce the mutation load on the organism, but presumably it does not occur in the wild because the organism would then be less able to evolutionarily adapt to changing conditions compared to siblings that have the higher error rate. In many but not all cases they would also replicate more slowly. Likewise, the measured error rate for translation per amino acid [32] means that roughly one in every three proteins has an error. Yet organisms survive quite well at this error rate. The error rate set by the environment of the organism in turn determines the acceptable placement and thickness of the spheres.

Note that the lower bound constraint comes from equation (3), the channel capacity theorem of Shannon [9], which limits the efficiency $\epsilon_r$. The upper bound comes from equation (7), the finite energy available to perform state selections ($P_y$) relative to the thermal noise ($N_y$), which satisfies the biologically required separation of molecular states [7]. These two independent bounds are plotted on a graph of the efficiency curve (Fig. 2, equation (6)) [7].

Figure 2: Isothermal efficiency curve for molecular machines showing bounds that constrain the coding space dimensionality $D$. Real molecular machines that select between two or more distinct states may have parameters anywhere in the shaded (green) area in which the real isothermal efficiency $\epsilon_r$ is bounded above by the theoretical isothermal efficiency $\epsilon_t$ (equation (6)) and to the left by the power to noise ratio $\rho = P_y/N_y \geq 1$ (equation (7)). During evolution, they tend to lose unnecessary energy dissipation, which decreases $\rho$ towards the lower limit of 1. Independently, they tend to increase their information use ($R_{sequence}$) for the energy dissipated, which increases $\epsilon_r$ toward the theoretical maximum determined by the channel capacity. These factors lead to an ‘optimal’ molecular machine in which $\rho \to 1$ and $\epsilon_r \to \ln 2$. At that point the dimensionality has been squeezed in a pincers (equation (21)) until it reaches $D = 2 R_{sequence}$.

The efficiency curve is an upper bound representing functioning at the channel capacity (equation (3)). Points below the curve have $R_{sequence} < C_y$. Since the dimensionality parameter $d_{space}$ is part of the upper bound for $R_{sequence}$ in equation (2), this leads to the lower bound on $D$ in equation (5). Independently of that, the ratio $\rho$ on the horizontal axis of Fig. 2 is orthogonal to the efficiency and channel capacity of the vertical axis. The thermal noise $N_y$ is determined by the absolute temperature $T$ of the molecular machine and the dimensionality $D$ (equation (12)). Since $P_y$ is an upper bound on the noise, equation (7) leads to an upper bound on the dimensionality in equation (20).

These two independent constraints on $\rho$ and $\epsilon_r$ determine the possible range of the dimensionality. As shown previously [7], the normalized energy dissipation $\rho$ will tend to decrease over evolutionary time because excess contacts that are not required for maintaining information will be lost by mutation. This decrease will continue until at $\rho = 1$ the distinctness of molecular states is threatened by a large error rate caused by the increasing intersection of after state spheres. Meanwhile, the efficiency $\epsilon_r$ will tend to increase up to $\epsilon_t$, squeezing the dimensionality towards a single value, by equation (21). This result is consistent with the dimensional analysis of Collier [33], who showed that information is related to the degrees of freedom, which is of course the dimensionality of the space.

6 Dimensionality of restriction enzyme coding space

Now that we know that the dimensionality of an optimal molecular machine is simply twice the number of bits that it selects amongst (equation (22)) we can determine the dimensionalities of the thousands of known restriction enzymes if we assume that they too are optimal.

In the case of restriction enzyme EcoRI, the efficiency is close to 70% [7], so that $\rho$ must be close to 1. Since a 6 base cutting restriction enzyme recognizes 12 bits, from equation (21), we see that the EcoRI coding space must be close to 24 dimensions.

If we characterize a restriction enzyme by the number of bases $L$ it recognizes, then there are a maximum of two bits per base, so the dimensionality of a 70% efficient molecule is:

$$D = 4L \tag{23}$$

Thus, EcoRI, which has a 6 base recognition site GAATTC, works in 24 dimensions, while TaqI, which recognizes only the 4 bases TCGA, should work in 16 dimensions. The highest known dimension used by a restriction enzyme is 32 dimensions for restriction enzymes such as NotI (GCGGCCGC) and SfiI (GGCCNNNNNGGCC), which cut DNA at patterns 8 base pairs long [34, 35].

In the case of restriction enzymes that digest at partially variable patterns such as GT(T/C)(A/G)AC (HincII), we can use the information needed to describe the pattern to predict the dimension. In this case, for the first two and last two bases (GT and AC), a total of 8 bits are required, while for each of the middle two bases only 1 bit is required to distinguish two of the four bases, so there are 10 bits per site [18], even though the binding site is 6 bases long. Thus, if HincII is optimal at 70% efficiency, it should operate in 20 dimensions. In the special case where a base is avoided by a restriction enzyme, we record the information as $\log_2(4/3) \approx 0.42$ bits for that base [18]. So formula (23) has only a limited application. The number of bits in a binding site is not strictly computed from the physical length of the site, but rather from the average number of bases if all the information were compressed into the smallest region possible [18]. This is the ‘area’ under a sequence logo [24].
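The bit counting described above can be sketched in a few lines. The code assumes each position contributes $\log_2(4/k)$ bits, where $k$ is the number of bases its IUPAC code allows, and that an optimal enzyme works in twice as many dimensions as it has bits. The function names are hypothetical, and cleavage annotations such as "(11/9)" are not parsed; only plain IUPAC letters are handled.

```python
import math

# Number of bases allowed by each IUPAC nucleotide code.
IUPAC_ALLOWED_BASES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "M": 2, "K": 2, "S": 2, "W": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def site_information_bits(recognition_sequence):
    """Sum of log2(4 / allowed bases) over the positions of a consensus pattern."""
    return sum(
        math.log2(4 / IUPAC_ALLOWED_BASES[base])
        for base in recognition_sequence.upper()
    )


def predicted_dimension(recognition_sequence):
    """Coding space dimension of an optimal enzyme: twice the site information."""
    return 2.0 * site_information_bits(recognition_sequence)


for name, pattern in [("EcoRI", "GAATTC"), ("TaqI", "TCGA"),
                      ("HincII", "GTYRAC"), ("SfiI", "GGCCNNNNNGGCC")]:
    bits = site_information_bits(pattern)
    print(f"{name:7s} {pattern:14s} {bits:5.2f} bits  D = {predicted_dimension(pattern):5.2f}")
```

Because the patterns are consensus sequences, these values are upper estimates of the information a real binding site carries.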

The predicted dimensionality of over 4000 restriction enzymes in Roberts’ database [35] is given in Table 1 and Fig. 3A. There are two major peaks at 16 and 24 dimensions, corresponding to 4 and 6 base cutters. There is also a minor peak at 20 dimensions for the 5 base cutters.

| Example restriction enzyme | Sequence | Compressed bases | Bits (pins) | Dimension | Number |
|---|---|---|---|---|---|
| AbaSI | C(11/9) | 1.00 | 2.00 | 4.00 | 20 |
| MspJI | CNNR(9/13) | 1.50 | 3.00 | 6.00 | 1 |
| RlaI | VCW | 1.71 | 3.42 | 6.83 | 1 |
| SgeI | CNNGNNNNNNNNN | 2.00 | 4.00 | 8.00 | 6 |
| AspBHI | YSCNS(8/12) | 2.50 | 5.00 | 10.00 | 1 |
| PsuGI | BBCGD | 2.62 | 5.25 | 10.49 | 1 |
| SgrTI | CCDS(10/14) | 2.71 | 5.42 | 10.83 | 2 |
| CviJI | RGCY | 3.00 | 6.00 | 12.00 | 9 |
| LpnPI | CCDG(10/14) | 3.21 | 6.42 | 12.83 | 1 |
| EcoBLMcrX | RCSRC(-3/-2) | 3.50 | 7.00 | 14.00 | 1 |
| M.NgoDCXV | GCCHR | 3.71 | 7.42 | 14.83 | 1 |
| TaqI | TCGA | 4.00 | 8.00 | 16.00 | 1210 |
| Bsp1286I | GDGCHC | 4.42 | 8.83 | 17.66 | 16 |
| AvaII | GGWCC | 4.50 | 9.00 | 18.00 | 396 |
| Pin17FIII | GGYGAB | 4.71 | 9.42 | 18.83 | 2 |
| HincII | GTYRAC | 5.00 | 10.00 | 20.00 | 507 |
| Cco14983V | GGGTDA | 5.21 | 10.42 | 20.83 | 1 |
| PpuMI | RGGWCCY | 5.50 | 11.00 | 22.00 | 55 |
| EcoRI | GAATTC | 6.00 | 12.00 | 24.00 | 1864 |
| Rba2021I | CACGAGH | 6.21 | 12.42 | 24.83 | 10 |
| PspXI | VCTCGAGB | 6.42 | 12.83 | 25.66 | 1 |
| RsrII | CGGWCCG | 6.50 | 13.00 | 26.00 | 54 |
| SgrAI | CRCCGGYG | 7.00 | 14.00 | 28.00 | 99 |
| KpnBI | CAAANNNNNNRTCA | 7.50 | 15.00 | 30.00 | 2 |
| SfiI | GGCCNNNNNGGCC | 8.00 | 16.00 | 32.00 | 36 |

Table 1: Coding space dimensionality ($D$) and number of restriction enzymes. The information content in bits, $R_{sequence}$, of the recognition sequence of 4297 restriction enzymes from REBASE (restriction enzyme database) version allenz.801 (Dec 27 2017) [35] was computed. A fully conserved base (A, C, G, T) contributes 2 bits, two possibilities (R=G/A, Y=C/T, M=A/C, K=G/T, S=C/G, W=A/T) contribute 1 bit, three possibilities (B=C/G/T, D=A/G/T, H=A/C/T, V=A/C/G) contribute $\log_2(4/3) \approx 0.42$ bits and any allowed base (N) contributes 0 bits [36, 18]. The sum of the information at each base, $R_{sequence}$, was used to find the corresponding number of compressed bases ($R_{sequence}/2$) and then the coding dimension ($D = 2 R_{sequence}$), assuming that each enzyme has an efficiency of $\ln 2$ (about 70%) and $P_y = N_y$ so that there is a unique dimension according to equation (21). The most commercially available enzymes and their reported recognition sequences are given as examples. When the DNA backbone cleavage site is known it is indicated by an arrow. The distance to cleavage sites outside the given sequence is shown in parentheses for the corresponding and complementary strands. Star activity (variation within the canonical site) and flanking sequence effects are found for many restriction enzymes [37]. However, the patterns in the database are reported as consensus sequences that may distort the information content [38], and so may affect the results given here.

Figure 3: Comparison of restriction enzyme frequency and best known sphere packing density in different dimensions. A. Coding dimensions used by restriction enzymes. The number of enzymes at each dimensionality is plotted from Table 1. B. Best known sphere packings in high dimensions were given by Conway and Sloane [20, 39]. The graph is equivalent to their Figure 1.5; see Table I.1(a), Table I.1(b) on pages xix and xx; and pages 14 to 16. The updated sphere center density formulas used here were taken from an online source (last modified Feb. 2012, accessed Jan 06, 2018). The sphere center density, $\delta$, is the number of sphere centers per unit volume when sphere radii are set to 1. Without the logarithm, a graph of $\delta$ versus $D$ appears nearly flat at low dimensions. Circles represent lattice packings; x’s represent nonlattice packings.

7 Biological lattices in high dimensional spaces

Ever since Shannon published his theory of hypersphere packing as a description of a communications system [9] mathematicians and engineers have been determining how best to pack spheres together in high dimensional spaces [20]. The restriction enzymes appear to favor particular dimensions for their coding spaces, so we can compare their preferences to the best known packings that humans have determined.

In two dimensions there are two ways to regularly pack circles: in a square lattice or in a hexagonal lattice (Fig. 1). The square packing fills $\pi/4 \approx 78.5$% of the plane, while a hexagonal packing fills $\pi/(2\sqrt{3}) \approx 90.7$% [20]. Hexagonal packing is denser than square packing. In general, the sphere packing density is

$$\Delta = \frac{V_n \rho^n}{\sqrt{\det \Lambda}} \qquad (24)$$

where $V_n \rho^n$ is the volume of an $n$ dimensional sphere with radius $\rho$ [40, 41, 42], and $\det \Lambda$ is the determinant of the lattice $\Lambda$. The square root of the determinant provides the volume of the polytope that the sphere is encased in. For convenience, Leech introduced the concept of the sphere center density

$$\delta = \frac{\Delta}{V_n} = \frac{\rho^n}{\sqrt{\det \Lambda}} \qquad (25)$$

which counts the average number of sphere centers per unit volume of the space [43]. Leech, Conway and Sloane rescaled or normalized the sphere center density in several different ways to emphasize the symmetries of the sphere center density as a function of dimension [43, 44, 20]. As these rescalings do not have any biological significance that we are aware of, we only take the logarithm of $\delta$ to graph the best known sphere packings (Fig. 3B). Since each sphere center represents one sphere, and in a biological context spheres represent biological states, the center density is a measure of the number of states available to the system.
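The two dimensional densities quoted above can be checked numerically. The following is a minimal sketch, assuming the Conway and Sloane convention in which the lattice determinant is that of the Gram matrix and the packing radius is half the shortest distance between centers; the function names are illustrative and are not taken from the paper.

```python
import math


def unit_ball_volume(n: int) -> float:
    """Volume of the n dimensional ball of radius 1: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def center_density(n: int, rho: float, gram_det: float) -> float:
    """Sphere centers per unit volume, rescaled to unit radius: rho^n / sqrt(det)."""
    return rho ** n / math.sqrt(gram_det)


def packing_density(n: int, rho: float, gram_det: float) -> float:
    """Fraction of space covered by the spheres: V_n * rho^n / sqrt(det)."""
    return unit_ball_volume(n) * center_density(n, rho, gram_det)


# Both lattices have nearest centers 1 apart, so the packing radius rho is 1/2.
# Square lattice: Gram matrix [[1, 0], [0, 1]], determinant 1.
square = packing_density(2, 0.5, 1.0)
# Hexagonal lattice: Gram matrix [[1, 1/2], [1/2, 1]], determinant 3/4.
hexagonal = packing_density(2, 0.5, 0.75)

print(f"square:    {square:.4f}")     # pi / 4
print(f"hexagonal: {hexagonal:.4f}")  # pi / (2 * sqrt(3))
```

The same two functions apply in any dimension once the packing radius and Gram determinant of a lattice are known.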

Shannon recognized that the packing of spheres in higher dimensions corresponds to the problem of faithfully transmitting a series of distinct messages over a noisy communications channel [9, 45, 20, 46]. In this model, transmitted messages are points in a high dimensional space, and each sphere represents a message received with Gaussian noise added along each dimension. Most of the sphere density is on the surface in high dimensions [6]. To avoid message ambiguity, the spheres must not intersect, which spaces the transmitted messages and allows a decoding that removes the noise from the received signal. The total volume available in which to pack spheres is a large sphere whose radius is determined by the power and thermal noise absorbed by the receiver, while the volume of a smaller message sphere is determined by the thermal noise alone [9].

A corresponding theory also describes the states of molecular machines as spheres in high dimensional space [6]. In both theories, the maximum number of possible messages (or molecular states), known as the capacity, is determined from the number of small spheres that can be packed together inside the larger sphere. As in two dimensions, there are many possible ways to pack high dimensional spheres; the more spheres that can be packed together, the more the channel can be utilized. Because of its application to communications and a variety of mathematical and physics fields, the highest sphere packing densities in various dimensions have been determined, as shown in Fig. 3B. The most dense known sphere packing is in 24 dimensions, a packing known as the Leech lattice (symbolized as $\Lambda_{24}$) [43, 20, 47, 48, 49, 50, 51, 52, 53]. This packing has been extensively studied, and because of its density it was used in a commercial Motorola modem [54].

Surprisingly, the most common dimensionality of the restriction enzymes is also 24 dimensions, as shown in Fig. 3A. This suggests that EcoRI and the other restriction enzymes may be 6 base cutters because that takes advantage of the dense packing of the Leech lattice. In other words, we hypothesize that the reason so many restriction enzymes are 6 base cutters is that they have discovered the Leech lattice packing by Darwinian evolution. How the Leech lattice is implemented by the atomic structure of restriction enzyme proteins is not known.

For restriction enzymes, the next most commonly used dimensionality is 16 dimensions (Fig. 3A), and we see in Fig. 3B that a good packing, the Barnes–Wall lattice (BW), has also been found for this dimension relative to the other dimensions of similar magnitude [39]. A huge number of good packings equivalent to BW have been estimated [55]. Thus, the 4 base cutting restriction enzymes may be using 8 bit recognition to take advantage of the good hypersphere packings possible in 16 dimensions.

The longest known restriction enzyme sites have 8 bases, and so these enzymes should use a 32 dimensional space. Correspondingly, 32 dimensions also represents a peak in the known dense lattices, the Quebbemann lattice $Q_{32}$ [39]. Cohn and Elkies report a peak in dimensions [48] and, intriguingly, this corresponds to cases of base cutters (Table 1). Transcription factors in E. coli have information contents in the range to bits [18]; these may function with the high density packings known to exist above dimensions.

On the lower dimensional end, it is worth noting that there is a small local maximum for the density of sphere packings at 12 dimensions. This would correspond to a 3 base long biological object for which the obvious candidate is the codon of the genetic code. It may be that the genetic code functions in a 12 dimensional space, but the coding is probably not performed via the cubic lattice suggested by Sadegh–Zadeh [56] since there are better packings such as the Coxeter–Todd lattice $K_{12}$ [20]. Finally, in 8 dimensions the most efficient possible sphere packing is on an $E_8$ lattice [52]; this corresponds to 2 base pair recognition. Biologically this code might be used for precisely recognizing and methylating CpG base pairs, the basis of an important epigenetic control [57, 58]. Thus, all of the peaks in Fig. 3B could correspond to known biological systems.

Restriction enzymes can evolve from one dimension to the next since this only requires increasing or decreasing the number of base contacts. So there is some fluidity in the dimensions chosen, but for several reasons we do not expect a complete correspondence between Fig. 3A and Fig. 3B. First, because restriction enzymes have evolved, there is a good deal of history in the current choices and some of this may be locked in. Some patterns will be common simply because that particular bacterial species is prevalent and their restriction enzymes were discovered more easily than others. Second, the 8 base enzymes (32 dimensions) won’t attack an invading DNA as frequently as shorter ones, so bacteria may tend to avoid using higher dimensions. Third, short patterns that cut frequently would necessitate more self-protective methylation and so would be expensive. Finally, unknown effects could come into play to eliminate, for example, most of the 5 base (20 dimensional) restriction enzymes even though the 4 (16 dimensional) and 6 (24 dimensional) base sites are quite common.

The information content of transcription factor DNA binding sites evolves based predominantly on the size of the genome and the number of binding sites [18, 19]. Unlike transcription factors, the information content of restriction enzyme sites cannot evolve based on invading genomes because there are no regular specific sequences to bind to. However, the size of the intruding genome does provide some criterion since restriction enzymes protect bacteria from invading bacteriophage. Typical sizes are on the order of base pairs, such as ( base pairs) [59] and T7 ( base pairs) [60]. The restriction enzyme must cut the invader at least once, and preferably more, to disable the phage genome. Thus, it requires approximately bits in the site to attack once. A bit (6 base pairs) site such as EcoRI would cut times; sites are observed. Perhaps this number is lower than expected because phage evolve away from restriction sites; EcoRI would cut T7 times but none are observed. An bit ( base pairs) site would cut more frequently than a bit site but the cell would then have to methylate times as many sites. Perhaps this is one reason that 6 base restriction sites are more abundant than 4 base cutters: 6 bases is short enough that phage are killed but also sufficiently long that methylation is minimized.
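The arithmetic behind this argument can be sketched in a few lines. This is only an illustration under stated assumptions: bases are taken to be equiprobable and independent, the number of positions is approximated by the genome length, and the genome sizes used for bacteriophage lambda and T7 (48,502 and 39,937 base pairs) are commonly quoted values entered here by hand rather than values taken from the text above.

```python
import math

# Assumed genome sizes in base pairs (not taken from the text above).
LAMBDA_BP = 48502
T7_BP = 39937


def expected_cuts(genome_bp: int, site_bases: int) -> float:
    """Expected number of occurrences of one fully specified site of the given
    length in a random, equiprobable sequence of genome_bp positions."""
    return genome_bp / 4 ** site_bases


def bits_to_cut_once(genome_bp: int) -> float:
    """Information (bits) of a site that is expected to occur exactly once."""
    return math.log2(genome_bp)


for name, size in [("lambda", LAMBDA_BP), ("T7", T7_BP)]:
    print(
        f"{name}: {bits_to_cut_once(size):.1f} bits to cut once, "
        f"6 base site expected {expected_cuts(size, 6):.1f} times, "
        f"4 base site expected {expected_cuts(size, 4):.1f} times"
    )

# Each removed base multiplies the number of sites (and so the number of
# sites the cell must methylate) by 4; going from 6 to 4 bases gives 4**2.
methylation_ratio = 4 ** (6 - 4)
print(f"4 base vs 6 base sites: {methylation_ratio} times as many")
```

Observed site counts can differ substantially from these expectations, for example when phage evolve away from restriction sites as discussed above.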

Only the best choices of sphere packings in biologically useful dimensions may be reflected in the restriction enzymes. The central suggestion of this paper is that most of the restriction enzymes have discovered that sphere packing in 16 and 24 dimensions is more dense than packings in other dimensions. This provides an explanation for why 4 and 6 base cutters have been found so frequently. In addition, the significant peaks at and dimensions in Fig. 3A suggest the biological use of dense codes in those dimensions that may be consistent with known packings.

8 Mechanism of high dimensional coding

For an evolved molecular machine the dimensionality is twice the number of pins (from equation (4)) and also twice the information in bits (from equation (21)). So, curiously, for an optimal molecular machine the number of bits is the number of pins. However, how the independent pins are implemented in molecular architecture is a difficult open problem. The situation resembles early genetics, in which the underlying mechanism of DNA recombination was not initially known but the results, such as the linearity of genes, were still valid. Here, we know the dimensionality from the theory, but we would also like to know how the molecule works.

There are at least two basic mechanisms by which high dimensional coding could be implemented by molecules: direct contacts and vibrational modes. For example, EcoRI cuts double stranded DNA at the sequence GAATTC. In the co-crystal between EcoRI and this sequence, McClarin et al. [61, 62] observed that each of the 6 bases is contacted with two hydrogen bonds, for a total of 12 specific hydrogen bonds. If each hydrogen bond corresponds to a single ‘pin’ of the molecular machine, with two degrees of freedom per pin [6], there would be 24 dimensions. That such contacts often act independently, and so could be coding space dimensions, is suggested by experiments on several other recognizers [63, 64, 65, 66, 67, 6]. However, experiments with mutant EcoRI imply that it uses more than just hydrogen bonding in recognition [68], and bases of DNA recognition proteins are not entirely independent [69, 70]. Though including dinucleotides may be sufficient [71, 72], finding the important independent dimensions may be challenging. All such pairwise correlations can be displayed with a dimensional sequence logo [73]. Alternatively, the coding space could consist of normal modes of molecular vibration since these are by definition independent [74]. In particular, localized vibrational modes called ‘discrete breathers’ [75] may represent the molecular machine pins.

9 Coding spaces

In classical information theory, a continuous communications signal, such as a song, can be represented by a series of independent numerical values [9]. An analog signal of duration $T$ seconds that has a range of frequencies (bandwidth) $W$ is described by $2TW$ Fourier components. Since these sine wave amplitudes are independent, they define $2TW$ numbers and hence a single point in a $2TW$ dimensional coding space. Because they are designed from scratch, the dimensionality in communications systems is known a priori. By contrast, molecular systems, which also have been shown to use coding spaces [7], do not have a known dimensionality, so determining this parameter is an important step towards fully characterizing and understanding their function.

Equations (5) and (20) establish lower and upper bounds on the dimensionality of molecular machines. These constraints can be represented geometrically (Fig. 2). The restriction that the information cannot be larger than the machine capacity (equation (3)) ultimately comes from Shannon’s 1949 model of communication in which he divided the volume of a large ‘before’ sphere, representing the space of all possible messages, by the volume of a small ‘after’ sphere, representing a single message expanded in all possible directions by thermal noise, to determine the maximum number of possible distinct messages in time $T$ and hence the channel capacity in bits per second,

$$C = W \log_2\left(\frac{P}{N} + 1\right) \qquad (26)$$

[9]. The corresponding model for molecular states (equation (2)) [6] leads to the lower bound on the dimensionality (equation (5)). In this case there are two geometrical constraints: state spheres must not intersect, and the state spheres are confined to the larger sphere defined by the available energy. The observation of 70% efficient molecular machines comes from the restriction that for biological states to be distinct, the after state spheres must avoid intersecting each other [7], as expressed by equation (7). The two constraints on the dimensionality therefore come from the after state spheres bumping into each other and from them being compressed within the larger before sphere.
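The sphere-division argument can be made concrete with a short numerical sketch. It assumes Shannon's standard setup, in which a signal of duration T and bandwidth W lives in 2TW dimensions, the before sphere has radius proportional to the square root of P + N and the after sphere has radius proportional to the square root of N; the function and parameter names are illustrative.

```python
import math


def capacity_bits_per_second(bandwidth_hz: float, signal_power: float,
                             noise_power: float) -> float:
    """Shannon capacity C = W * log2(P/N + 1)."""
    return bandwidth_hz * math.log2(signal_power / noise_power + 1.0)


def log2_sphere_count(duration_s: float, bandwidth_hz: float,
                      signal_power: float, noise_power: float) -> float:
    """log2 of (before sphere volume / after sphere volume) in D = 2TW dimensions.

    The dimension-dependent volume constants cancel in the ratio, which leaves
    (r_before / r_after) ** D with r_before / r_after = sqrt((P + N) / N).
    """
    dimension = 2.0 * duration_s * bandwidth_hz
    radius_ratio = math.sqrt((signal_power + noise_power) / noise_power)
    return dimension * math.log2(radius_ratio)


# Arbitrary example values: T = 2 s, W = 3 Hz, P = 3, N = 1.
T, W, P, N = 2.0, 3.0, 3.0, 1.0
bits_in_time_T = log2_sphere_count(T, W, P, N)
print(f"distinct messages in time T: 2^{bits_in_time_T:.1f}")
print(f"bits per second from sphere counting: {bits_in_time_T / T:.1f}")
print(f"bits per second from the capacity formula: {capacity_bits_per_second(W, P, N):.1f}")
```

Dividing the logarithm of the sphere count by the duration recovers the capacity formula, which is why the two printed rates agree.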

We found that when a DNA binding protein evolves to be optimally efficient, the upper and lower dimensional bounds converge to twice the information content of the binding site as measured in bits (Fig. 2). Using this result, we found that the common 6 base pair recognizing restriction enzymes, which require 12 bits to describe their pattern, use a 24 dimensional coding space. When EcoRI is bound to a DNA sequence its state can be described as a sphere in the high dimensional coding space with each of the $4^6 = 4096$ possible hexamer sequences represented by a different sphere [6]. If the sphere for EcoRI bound to GAATTC were to overlap with any other sequence sphere, then EcoRI could bind to and cut at inappropriate locations that are unprotected by the corresponding methylase, leading to death [4]. Since EcoRI binding to sequences other than GAATTC is at least fold down in digestion [5], these spheres effectively do not intersect. Excess binding energy that retains the same binding pattern will be lost by mutational changes in the EcoRI protein structure, so the spheres must be tightly packed together in the 24 dimensional space. Remarkably, it has already been shown by coding theorists that the best known sphere packing is the Leech lattice in 24 dimensions [20, 39, 53]. Likewise the 4 base pair restriction enzymes use a 16 dimensional coding space, and there are good packings known in that space (Fig. 3B) [55]. Thus, there is a correlation between commonly observed restriction enzyme DNA site sizes and the best packing of spheres in high dimensional spaces. Could this be a coincidence? We believe it is not for the following additional reasons.

First, the data sets are large. The entire collection represents nearly 4300 restriction enzyme sites (Table 1). Restriction enzymes are initially discovered by their ability to digest DNA, and this method does not indicate the sequence of the binding site, which is unknown until after the enzyme has been isolated, purified, and characterized. Odd classes of sites are noticed and publicized because these are eagerly sought as research reagents. Likewise, the data on different kinds of high dimensional sphere packings (Fig. 3B) represent research efforts spanning the years since Shannon’s publication in 1948, and there are strong economic incentives to discover and publicize new packings because they can be used to improve communications.

Second, the correlation between sphere packing and restriction enzymes was derived without introducing any free parameters to the boundary equations for the dimensionality. It is a natural consequence of previously established molecular machine theory [6, 10, 7, 21].

Presumably natural systems discovered the Leech lattice long ago, but the details of how a small protein can implement such a code are unknown. However, the fact that restriction enzymes have apparently discovered good codes should help us to understand how they can recognize short DNA sequences so precisely. Conversely, understanding the molecular mechanism of restriction enzyme decoding could lead to single-molecule communications devices [6].

The distribution of restriction enzyme choice of dimensionality is well explained by the best packings of spheres in various dimensions (Fig. 3). The major peak of 6 base cutting restriction enzymes is most likely explained by their use of the Leech lattice in 24 dimensions. Likewise, the peaks at and dimensions correspond to the and base cutters respectively. In addition, restriction enzyme use of , , and dimensions appears to correspond to good nonlattice packings that are known in those dimensions. This leaves three holes in the distribution at , , and dimensions which are rarely used by restriction enzymes but which have decent sphere packings. We suggest that restriction enzymes with dimensions close to a peak evolve into the peak. For dimension 30, Table 1 gives the example of KpnBI with the recognition sequence CAAANNNNNNRTCA. Notably this is an asymmetric recognition sequence with a part that recognizes exactly 4 bases, CAAA on the 5′ side and RTCA on the 3′ side. Recognition of a purine R is typically accomplished by a single hydrogen bond to the N7 position of either A or G [76, 77]. If an additional contact or hydrogen bond into the major groove evolves (for example to the N6 of A or O6 of G and on the complementary strand the methyl of T, O4 of T or N4 of C), then the enzyme could specify exactly A or G and the dimensionality would increase to 32. This could improve the sphere packing according to our current knowledge of lattices in 30 and 32 dimensions, so the lack of 30 dimensional enzymes is likely to be because most have already evolved to the nearby better packing. Since there are two known 30 dimensional enzymes according to Table 1, we can test this idea by inspection of the other one. Indeed, that enzyme is Eco851I with recognition sequence GTCANNNNNNTGAY. Since Y pairs to R on the complementary strand, alteration of the terminal Y to a specific base would switch this enzyme from 30 to 32 dimensions by the same mechanism.
Indeed, a similar explanation for an evolutionarily easy switch from 22 to the 24 dimensional Leech lattice is suggested by the enzyme RsrII CGWCCG, while the 22 dimensional PpuMI RGGWCCY has three such opportunities. In each of these cases, merely having the disfavored dimension leads to a pattern vulnerable to evolution to a nearby dimension. Whether there are biological constraints that prevent the rare cases from evolving to the best packing dimensions is unknown.
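The dimensionalities discussed for these enzymes can be recomputed from their recognition sequences. The sketch below is an illustration rather than the procedure used to build Table 1: it assumes the per-base information rule from the Table 1 caption (2 bits minus the logarithm of the number of allowed bases) and the optimally efficient case in which the dimension is twice the information in bits. The function names are hypothetical.

```python
import math

# Number of bases allowed by each IUPAC nucleotide code.
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "M": 2, "K": 2, "S": 2, "W": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def site_information_bits(site: str) -> float:
    """Sum over positions of 2 - log2(number of allowed bases)."""
    return sum(2.0 - math.log2(IUPAC_CHOICES[base]) for base in site.upper())


def coding_dimension(site: str) -> float:
    """Assumed optimal case: dimension equals twice the information in bits."""
    return 2.0 * site_information_bits(site)


examples = [
    ("EcoRI", "GAATTC"),
    ("KpnBI", "CAAANNNNNNRTCA"),
    ("KpnBI with R fixed to A (hypothetical)", "CAAANNNNNNATCA"),
    ("Eco851I", "GTCANNNNNNTGAY"),
    ("RsrII", "CGWCCG"),
    ("PpuMI", "RGGWCCY"),
]
for name, site in examples:
    bits = site_information_bits(site)
    print(f"{name:40s} {site:16s} {bits:5.1f} bits  D = {coding_dimension(site):.1f}")
```

Each two-fold degenerate position (R, Y, W and so on) that becomes fully specified adds 1 bit and therefore 2 dimensions, which is the single-step move between neighboring even dimensions described above.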

Two additional biological constraints on the distribution of restriction enzyme dimensionalities were mentioned earlier. We expect few if any restriction enzymes to function at low dimensionality (below 4 bases or 16 dimensions) since such enzymes would digest DNA frequently and so would require extensive methylation protection, which may be disadvantageous. Restriction sites longer than 8 bases or 32 dimensions would only be found rarely on invading DNA and so presumably these too would not have much advantage. These factors limit the range of functionally useful dimensions.

10 Coding space as a fitness landscape

Considering how well the restriction enzyme frequency (Fig. 3A) and best sphere packing center density (Fig. 3B) distributions match overall given the biological constraints on the range that restriction enzymes can function in, the sphere packing center density distribution appears to be a measure of fitness for restriction enzymes evolving over a high dimensional adaptive fitness landscape, similar to the high dimensional spaces described by Wright [78, 79, 80, 81].

For biological systems, the number of sphere centers corresponds to the number of distinct states the system can be in. The center density (equation (25)) is therefore a more appropriate measure than the filled volume of the lattice defined by the packing density (equation (24)) since biological systems evolve to have distinct states [7]. The volume of the state is itself irrelevant. So we propose that the logarithm of the center density versus dimension represents the biological coding landscape. Plotting the logarithm instead of the center density itself emphasizes the detailed features of the curve. A biological system can evolve to obtain the highest number of distinct states by maximizing the sphere center packing density. Because the capacity is the logarithm of the number of states, this also maximizes the information and the efficiency.
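To give a feel for this landscape, the sketch below compares the logarithm of the center density of a few named lattices with that of the simple cubic lattice in the same dimension. The center densities are standard values from the sphere packing literature [20] entered here by hand, so they are assumptions of this example and are worth rechecking against the tables before use; the variable names are illustrative.

```python
import math

# Center densities (sphere centers per unit volume at unit radius).
# Hand-entered values, assumed from standard tables; not computed here.
BEST_KNOWN_CENTER_DENSITY = {
    8: ("E8", 1 / 16),
    12: ("K12 (Coxeter-Todd)", 1 / 27),
    16: ("BW16 (Barnes-Wall)", 1 / 16),
    24: ("Leech", 1.0),
}


def cubic_center_density(n: int) -> float:
    """Simple cubic lattice Z^n: packing radius 1/2 and determinant 1."""
    return 0.5 ** n


for n, (name, delta) in sorted(BEST_KNOWN_CENTER_DENSITY.items()):
    cubic = cubic_center_density(n)
    print(
        f"D = {n:2d}  {name:20s} log2(delta) = {math.log2(delta):6.2f}  "
        f"cubic log2(delta) = {math.log2(cubic):6.2f}  "
        f"gain = 2^{math.log2(delta / cubic):.2f} more states per unit volume"
    )
```

Because each center stands for one distinguishable state, the gain column is the factor by which a dense packing increases the number of states over a naive cubic arrangement in the same dimension.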

Since the logarithm is monotonic, if , . Examining Fig. 3B, we notice several important features. For lattice packings, since , there is no more than one center per unit sphere packing volume for . There is exactly one in and : these are the densest sparse packings. In higher dimensions (), there can be more than one center per unit sphere packing volume.

Comparing Fig. 3A to Fig. 3B, we encounter a puzzle. The center density in is larger than the center density in . Why then are there more base cutters than base cutters? There are at least three explanations. Evolutionary selection should increase the molecular machine capacity by finding relative maxima of the sphere packing center density but evolution also minimizes the expenditure of resources by a cell. On average, an enzyme that recognizes a shorter sequence requires less protein structure and so requires less energy to synthesize. Perhaps the gain in information density at dimension compared to dimension is insufficient to offset the greater energy cost. In contrast, the significant improvement of information density at dimension may provide a superior benefit to the organism despite the extra expenditure in energy and this may explain why there are more base enzymes than base enzymes. A second consideration is the number of target restriction sites needed on foreign DNA. Larger target DNAs would be best digested less frequently so that the restriction enzyme spends less time on small regions and conversely some enzymes may be targeted to smaller DNAs, leading to a smaller dimensionality. A third factor is the ease of evolving recognition patterns. Protein dimerization allows the creation of a 4 base cutting restriction site from two half sites. Five base recognition is probably more difficult to evolve since the central base has to be handled separately or the entire site has to become asymmetric.

The best center density in dimensions has the same value as the best center density in dimensions (Fig. 3B) [20]. Since higher dimensionality allows lower error rates [9], why isn’t dimensions used to provide more accurate restriction? A dimensional packing would take bases and the bases could be contacted in the center of the site. Dimeric proteins use less protein structure than a monomer and so require smaller DNA coding, but only one of the two monomers could contact the center at a time. Perhaps this awkward wasteful situation is unfavorable compared to using dimensions. In fact, odd dimensions are avoided by restriction enzymes in general (Table 1). Another possibility is that the huge number of known packings in dimensional space (, [55]) overwhelms a smaller number of packings in dimensions. Similar considerations apply to dimensions and , where the center densities are equal. Here, because is even, the lattice is known to be highly symmetric, and the error rates are smaller, the preference for over is clear.

According to Fig. 3A there are several hundred restriction enzymes that operate in , and dimensional coding spaces. Just as the and Leech lattices allow for dense sphere packings in and dimensions, evolution may select similarly dense packings in other even dimensions, especially those divisible by . There may be best packings in these dimensions that have not been discovered yet; this is an open problem in mathematics.

Evidently many restriction enzymes have discovered the Leech lattice, but does this merely reflect divergent evolution from a common ancestor? Many restriction enzymes have widely different structures [2], suggesting convergent evolution. Perhaps a deeper understanding of the coding spaces will help to classify these enzymes. In addition, the coding spaces of restriction enzymes may provide a fertile ground for precise quantitative analysis of population genetics and theoretical evolutionary biology since much is known about sphere packing in high dimensions [20]. Though we have found a strong correlation between the high dimensional sphere center packing landscape and restriction enzyme information content preferences, this still leaves the important task of understanding how the codes are implemented by the protein structures as a major problem for coding theorists and biologists.

11 Transcription factors use high dimensional fractal coding

Experimental evidence has been obtained indicating that for the transcription factor Fis, the specific DNA binding mode differs sharply from non-specific DNA binding since there is a break in the binding curve at zero bit binding sites [10, 82, 83]. Shannon pointed out that a mapping from a high dimensional space to lower dimensions creates discontinuities [9], so this result suggests that Fis functions in a high dimensional coding space close to dimensions [84].

For nucleic acid recognizing molecules that have specific sites on the genome, such as transcription factors, the information in the binding sites, $R_{sequence}$, evolves to match the information needed to locate the binding sites,

$$R_{sequence} \approx R_{frequency} = \log_2\left(\frac{G}{\gamma}\right) \qquad (27)$$

where $G$ is the number of potential binding sites in the genome and $\gamma$ is the number of specific binding sites [18, 19]. In the case of restriction enzymes equation (27) does not apply since there are no specific binding sites in the foreign DNA attacked by these enzymes and $\gamma$ does not have a particular value. However, the principle that $R_{sequence} \approx R_{frequency}$ does apply to many other genetic systems such as transcription factors [18], promoters [85, 86], ribosome binding sites ([87] and unpublished observations), and mRNA splicing [88]. In general $G/\gamma$ is not an exact power of two, so $R_{frequency}$, and therefore $R_{sequence}$, is usually not an integer. Equation (21) then implies that for an optimal molecular machine, in which the dimensionality is twice the information in bits, the dimension of the binding sites will not be an integer. Objects with non-integer dimensions are called fractals [89]. As shown in Table 1, many restriction enzymes may also have fractal dimensions, although closer inspection using sequencing technologies may be required to confirm the observation [90]. How a molecular coding system can have non-integer dimensions and the possible applications of such high dimensional fractal codes for communication systems remain to be investigated.
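A short numerical sketch shows why non-integer dimensions arise. It assumes the standard definition of the information needed to locate sites as the base 2 logarithm of the ratio of potential to specific binding sites, together with the optimal case in which the dimension is twice the information; the genome numbers below are hypothetical values chosen only for illustration.

```python
import math


def r_frequency(potential_sites: float, specific_sites: float) -> float:
    """Bits needed to pick out the specific sites among all potential sites."""
    return math.log2(potential_sites / specific_sites)


def coding_dimension(bits: float) -> float:
    """Assumed optimal case: dimension equals twice the information in bits."""
    return 2.0 * bits


# Hypothetical example: 4,000,000 potential positions and 50 specific sites.
bits = r_frequency(4_000_000, 50)
print(f"R_frequency = {bits:.2f} bits, D = {coding_dimension(bits):.2f}")

# An exact power of two gives an integer: 4096 positions and 1 site.
print(f"R_frequency = {r_frequency(4096, 1):.2f} bits, "
      f"D = {coding_dimension(r_frequency(4096, 1)):.2f}")
```

Only when the ratio of potential to specific sites is an exact power of two does the information, and hence the dimension, come out as an integer.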


TDS thanks N. J. A. Sloane for useful discussions about the normalization of the sphere packing density function, Rich Roberts for useful discussions about REBASE, Michael Smith for suggesting that sphere packing density may be an adaptive landscape, Eckart Bindewald, Misha Kashlev, Ryan Shultzaberger and Randall Johnson for comments on the manuscript and the Advanced Biomedical Computing Center (ABCC) for support. Funding: This research was supported in part by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research. VJ thanks the U.S. National Cancer Institute Werner H. Kirsten Student Intern Program, Soren Brunak, and the Technical University of Denmark for hospitality during initial stages of this project in 1994. The research of VJ is supported by the South African Research Chairs Initiative of the Department of Science and Technology and National Research Foundation. Competing interests: None. Data and materials availability: see Table 1 and Fig. 3.


  • [1] R. J. Roberts. How restriction enzymes became the workhorses of molecular biology. Proc. Natl. Acad. Sci. USA, 102:5905–5908, 2005.
  • [2] A. Pingoud, G. G. Wilson, and W. Wende. Type II restriction endonucleases-a historical perspective and more. Nucleic Acids Res., 42:7489–7527, 2014.
  • [3] V. F. Simmon and S. Lederberg. Degradation of bacteriophage lambda deoxyribonucleic acid after restriction by Escherichia coli K-12. J. Bacteriol., 112:161–169, 1972.
  • [4] J. Heitman, N. D. Zinder, and P. Model. Repair of the Escherichia coli chromosome after in vivo scission by the EcoRI endonuclease. Proc. Natl. Acad. Sci. USA, 86:2281–2285, 1989.
  • [5] D. R. Lesser, M. R. Kurpiewski, and L. Jen-Jacobson. The energetic basis of specificity in the Eco RI endonuclease–DNA interaction. Science, 250:776–786, 1990.
  • [6] T. D. Schneider. Theory of molecular machines. I. Channel capacity of molecular machines. J. Theor. Biol., 148:83–123, 1991.
  • [7] T. D. Schneider. 70% efficiency of bistate molecular machines explained by information theory, high dimensional geometry and evolutionary convergence. Nucleic Acids Res., 38:5995–6006, 2010.
  • [8] G. M. Clore, A. M. Gronenborn, and R. W. Davies. Theoretical aspects of specific and non-specific equilibrium binding of proteins to DNA as studied by the nitrocellulose filter binding assay: Co-operative and non-co-operative binding to a one-dimensional lattice. J. Mol. Biol., 155:447–466, 1982.
  • [9] C. E. Shannon. Communication in the Presence of Noise. Proc. IRE, 37:10–21, 1949.
  • [10] T. D. Schneider. Theory of molecular machines. II. Energy dissipation from molecular machines. J. Theor. Biol., 148:125–137, 1991.
  • [11] David Macaulay and Neil Ardley. The New Way Things Work. Houghton Mifflin Company, 1998.
  • [12] G. E. Uhlenbeck and L. S. Ornstein. On the theory of the Brownian motion. Phys. Rev., 36:823–841, 1930.
  • [13] T. D. Schneider. Claude Shannon: Biologist. IEEE Engineering in Medicine and Biology Magazine, 25(1):30–33, 2006.
  • [14] C. E. Shannon. A Mathematical Theory of Communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
  • [15] J. R. Pierce. An Introduction to Information Theory: Symbols, Signals and Noise. Dover Publications, Inc., NY, 2nd edition, 1980.
  • [16] T. D. Schneider. Information theory primer, with an appendix on logarithms. Published on the web, 2013.
  • [17] J. D. Watson, N. H. Hopkins, J. W. Roberts, J. A. Steitz, and A. M. Weiner. Molecular Biology of the Gene. The Benjamin/Cummings Publishing Co., Inc., Menlo Park, California, fourth edition, 1987.
  • [18] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht. Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188:415–431, 1986.
  • [19] T. D. Schneider. Evolution of biological information. Nucleic Acids Res., 28:2794–2799, 2000.
  • [20] J. H. Conway and N. J. A. Sloane. Sphere Packings, Lattices and Groups. Springer-Verlag, New York, third edition, 1998.
  • [21] T. D. Schneider. A brief review of molecular information theory. Nano Communication Networks, 1:173–180, 2010.
  • [22] T. D. Schneider. Sequence logos, machine/channel capacity, Maxwell’s demon, and molecular computers: a review of the theory of molecular machines. Nanotechnology, 5:1–18, 1994.
  • [23] T. D. Schneider. Twenty Years of Delila and Molecular Information Theory: The Altenberg-Austin Workshop in Theoretical Biology Biological Information, Beyond Metaphor: Causality, Explanation, and Unification Altenberg, Austria, 11-14 July 2002. Biol. Theory, 1:250–260, 2006.
  • [24] T. D. Schneider and R. M. Stephens. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res., 18:6097–6100, 1990.
  • [25] J. R. Pierce and C. C. Cutler. Interplanetary communications. In F. I. Ordway, III, editor, Advances in Space Science, Vol. 1, pages 55–109, N. Y., 1959. Academic Press, Inc.
  • [26] Martin Chaplin. Water absorption spectrum, 2000. last updated 2018 October 3, last accessed 2018 Oct 11.
  • [27] E. T. Jaynes. The Muscle As An Engine. unpublished manuscript, pages 1–5, 1983.
  • [28] E. T. Jaynes. The evolution of Carnot’s principle. In G. J. Erickson and C. R. Smith, editors, Maximum-Entropy and Bayesian Methods in Science and Engineering, volume 1, pages 267–281, Dordrecht, The Netherlands, 1988. Kluwer Academic Publishers.
  • [29] J. H. Felker. A link between information and energy. Proc. IRE, 40:728–729, 1952.
  • [30] L. Szilard. Über die Entropieverminderung in einem thermodynamischen System bei Eingriffen intelligenter Wesen. Z. Phys., 53:840–856, 1929.
  • [31] A. J. Herr, L. N. Williams, and B. D. Preston. Antimutator variants of DNA polymerases. Crit Rev Biochem Mol Biol, 46:548–570, 2011.
  • [32] I. Wohlgemuth, C. Pohl, and M. V. Rodnina. Optimization of speed and accuracy of decoding in translation. EMBO J, 29:3701–3709, 2010.
  • [33] J. Collier. What We Can Discover from Dimensional Analysis of the Information Concept. In Embodied, Embedded, Networked, Empowered through Information, Computation & Cognition!, pages 1–3. In Proceedings of the DIGITALISATION FOR A SUSTAINABLE SOCIETY 12-16 June 2017; Gothenburg, Sweden, 2017.
  • [34] B. Q. Qiang and I. Schildkraut. NotI and SfiI: restriction endonucleases with octanucleotide recognition sequences. Methods Enzymol, 155:15–21, 1987.
  • [35] R. J. Roberts, T. Vincze, J. Posfai, and D. Macelis. REBASE–a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res., 43:D298–9, 2015.
  • [36] H. B. F. Dixon, H. Bielka, C. R. Cantor, C. Liebecq, N. Sharon, S. F. Velick, J. F. G. Vliegenthart, F. Blattner, N. L. Brown, D. L. Brutlag, W. M. Fitch, W. Goad, R. Grantham, G. Hamm, L. H. Kedes, R. Lathe, D. W. Mount, J. Schroeder, R. Staden, and P. A. Stockwell. Nomenclature Committee of the International Union of Biochemistry (NC-IUB). Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. Eur. J. Biochem., 150:1–5, 1985.
  • [37] N. Kamps-Hughes, A. Quimby, Z. Zhu, and E. A. Johnson. Massively parallel characterization of restriction endonucleases. Nucleic Acids Res., 41:e119, 2013.
  • [38] T. D. Schneider. Consensus Sequence Zen. Applied Bioinformatics, 1:111–119, 2002.
  • [39] N. J. A. Sloane. The Sphere Packing Problem. Documenta Mathematica, 3:387–396, 1998.
  • [40] D. M. Y. Sommerville. An Introduction to the Geometry of N Dimensions. E. P. Dutton, NY., NY, 1929.
  • [41] M. G. Kendall. A Course in the Geometry of n Dimensions. Hafner Publishing Company, New York, 1961.
  • [42] B. Hayes. An Adventure in the Nth Dimension. Amer. Sci., 99(4):442–446, 2011.
  • [43] J. Leech. Some sphere packings in higher space. Canad. J. Math., 16:657–682, 1964.
  • [44] J. H. Conway and N. J. A. Sloane. Laminated lattices. Annals of Mathematics, 116:593–620, 1982.
  • [45] N. J. A. Sloane. The packing of spheres. Sci. Am., 250(1):116–125, January 1984.
  • [46] B. Cipra. Packing Your n-Dimensional Marbles. Science, 247:1035, 1990.
  • [47] I. Stewart. Mathematics: the 24-dimensional greengrocer. Nature, 424:895–896, 2003.
  • [48] H. Cohn and N. Elkies. New upper bounds on sphere packings I. Annals of Mathematics, 157:689–714, 2003.
  • [49] H. Cohn and A. Kumar. The densest lattice in twenty-four dimensions. Electronic Research Announcements of the American Mathematical Society, 10:58–67, 2004.
  • [50] H. Cohn and A. Kumar. Optimality and uniqueness of the Leech lattice among lattices. Annals of Mathematics, 170:1003–1050, 2009.
  • [51] Erica Klarreich. Sphere Packing Solved in Higher Dimensions, A Ukrainian mathematician has solved the centuries-old sphere-packing problem in dimensions eight and 24. Quanta Magazine, 20160330:1–6, 2016.
  • [52] M. Viazovska. The sphere packing problem in dimension 8. Annals of Mathematics, 185:991–1015, 2017.
  • [53] H. Cohn, A. Kumar, S. D. Miller, D. Radchenko, and M. Viazovska. The sphere packing problem in dimension 24. Annals of Mathematics, 185:1017–1033, 2017.
  • [54] G. R. Lang and F. M. Longstaff. A Leech Lattice Modem. IEEE Journal on Selected Areas in Communications, 7:968–973, 1989.
  • [55] J. H. Conway and N. J. A. Sloane. What are all the best sphere packings in low dimensions? Discrete Comput Geom, 13:383–403, 1995.
  • [56] K. Sadegh-Zadeh. Fuzzy genomes. Artif Intell Med, 18:1–28, 2000.
  • [57] T. B. Miranda and P. A. Jones. DNA methylation: the nuts and bolts of repression. J Cell Physiol, 213:384–390, 2007.
  • [58] J. Song, M. Teplova, S. Ishibe-Murakami, and D. J. Patel. Structure-based mechanistic insights into DNMT1-mediated maintenance DNA methylation. Science, 335:709–712, 2012.
  • [59] F. Sanger, A. R. Coulson, G. F. Hong, D. F. Hill, and G. B. Petersen. Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol, 162:729–773, 1982.
  • [60] J. J. Dunn and F. W. Studier. Complete nucleotide sequence of bacteriophage T7 DNA and the locations of T7 genetic elements. J. Mol. Biol., 166:477–535, 1983.
  • [61] J. A. McClarin, C. A. Frederick, B. C. Wang, P. Greene, H. W. Boyer, J. Grable, and J. M. Rosenberg. Structure of the DNA-Eco RI endonuclease recognition complex at 3 Å resolution. Science, 234:1526–1541, 1986.
  • [62] Y. Kim, J. C. Grable, R. Love, P. J. Greene, and J. M. Rosenberg. Refinement of Eco RI endonuclease crystal structure: a revised protein chain tracing. Science, 249:1307–1309, 1990.
  • [63] J. Childs, K. Villanueba, D. Barrick, T. D. Schneider, G. D. Stormo, L. Gold, M. Leitner, and M. Caruthers. Ribosome binding site sequences and function. In R. Calendar and L. Gold, editors, Sequence Specificity in Transcription and Translation, UCLA Symposia on Molecular and Cellular Biology, Vol. 30, pages 341–350, New York, 1985. Alan R. Liss, Inc.
  • [64] G. D. Stormo, T. D. Schneider, and L. Gold. Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res., 14:6661–6679, 1986.
  • [65] D. Barrick, K. Villanueba, J. Childs, R. Kalil, T. D. Schneider, C. E. Lawrence, L. Gold, and G. D. Stormo. Quantitative analysis of ribosome binding sites in E. coli. Nucleic Acids Res., 22:1287–1295, 1994.
  • [66] Y. Takeda, A. Sarai, and V. M. Rivera. Analysis of the sequence-specific interactions between Cro repressor and operator DNA by systematic base substitution experiments. Proc. Natl. Acad. Sci. USA, 86:439–443, 1989.
  • [67] N. Lehming, J. Sartorius, B. Kisters-Woike, B. von Wilcken-Bergmann, and B. Müller-Hill. Mutant lac repressors with new specificities hint at rules for protein–DNA recognition. EMBO J., 9:615–621, 1990.
  • [68] J. Heitman and P. Model. Substrate recognition by the EcoRI endonuclease. Proteins, 7:185–197, 1990.
  • [69] T. K. Man, J. S. Yang, and G. D. Stormo. Quantitative modeling of DNA-protein interactions: effects of amino acid substitutions on binding specificity of the Mnt repressor. Nucleic Acids Res., 32:4026–4032, 2004.
  • [70] M. L. Bulyk, P. L. Johnson, and G. M. Church. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res., 30:1255–1261, 2002.
  • [71] G. D. Stormo. Maximally efficient modeling of DNA sequence motifs at all levels of complexity. Genetics, 187:1219–1224, 2011.
  • [72] Y. Zhao, S. Ruan, M. Pandey, and G. D. Stormo. Improved models for transcription factor binding site identification using nonindependent interactions. Genetics, 191:781–790, 2012.
  • [73] E. Bindewald, T. D. Schneider, and B. A. Shapiro. CorreLogo: An online server for 3D sequence logos of RNA and DNA alignments. Nucleic Acids Res., 34:w405–w411, 2006.
  • [74] P. Doruker, L. Nilsson, and O. Kurkcuoglu. Collective dynamics of EcoRI-DNA complex by elastic network model and molecular dynamics simulations. J. Biomol. Struct. Dyn., 24:1–16, 2006.
  • [75] P. Csermely, R. Palotai, and R. Nussinov. Induced fit, conformational selection and independent dynamic segments: an extended view of binding events. Trends Biochem Sci, 35:539–546, 2010.
  • [76] N. C. Seeman, J. M. Rosenberg, and A. Rich. Sequence-specific recognition of double helical nucleic acids by proteins. Proc. Natl. Acad. Sci. USA, 73:804–808, 1976.
  • [77] T. D. Schneider. Reading of DNA sequence logos: Prediction of major groove binding by information theory. Meth. Enzym., 274:445–455, 1996.
  • [78] S. Wright. The roles of mutation, inbreeding, crossbreeding, and selection in evolution. Proceedings of the Sixth International Congress on Genetics, I:355–366, 1932.
  • [79] S. Gavrilets. Evolution and speciation on holey adaptive landscapes. Trends Ecol Evol, 12:307–312, 1997.
  • [80] S. Gavrilets. High-Dimensional Fitness Landscapes and Speciation. In M. Pigliucci and G. Muller, editors, Evolution - the Extended Synthesis, pages 45–79, Cambridge, MA, 2010. MIT Press. gavrila/PAPS/altenberg.pdf.
  • [81] M. Pigliucci. Sewall Wright’s adaptive landscapes: 1932 vs. 1988. Biol Philos, 23:591–603, 2008.
  • [82] T. D. Schneider and J. Spouge. Information content of individual genetic sequences. J. Theor. Biol., 189:427–441, 1997.
  • [83] R. K. Shultzaberger, L. R. Roberts, I. G. Lyakhov, I. A. Sidorov, A. G. Stephen, R. J. Fisher, and T. D. Schneider. Correlation between binding rate constants and individual information of E. coli Fis binding sites. Nucleic Acids Res., 35:5275–5283, 2007.
  • [84] P. N. Hengen, S. L. Bartram, L. E. Stewart, and T. D. Schneider. Information analysis of Fis binding sites. Nucleic Acids Res., 25:4994–5002, 1997.
  • [85] R. K. Shultzaberger, Zehua Chen, Karen A. Lewis, and T. D. Schneider. Anatomy of Escherichia coli promoters. Nucleic Acids Res., 35:771–788, 2007.
  • [86] F. E. Penotti. Human DNA TATA boxes and transcription initiation sites. A statistical study. J. Mol. Biol., 213:37–52, 1990.
  • [87] R. K. Shultzaberger, R. E. Bucheimer, K. E. Rudd, and T. D. Schneider. Anatomy of Escherichia coli Ribosome Binding Sites. J. Mol. Biol., 313:215–228, 2001.
  • [88] R. M. Stephens and T. D. Schneider. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol., 228:1124–1136, 1992.
  • [89] B. B. Mandelbrot. The fractal geometry of nature. W. H. Freeman and Co., San Francisco, 1983.
  • [90] D. Cohen-Karni, D. Xu, L. Apone, A. Fomenkov, Z. Sun, P. J. Davis, S. R. Kinney, M. Yamada-Mabuchi, S. Y. Xu, T. Davis, S. Pradhan, R. J. Roberts, and Y. Zheng. The MspJI family of modification-dependent restriction endonucleases for epigenetic studies. Proc. Natl. Acad. Sci. USA, 108:11040–11045, 2011.