The term “functional information” refers to the information that is necessary and sufficient to predict the function of a symbolic sequence within a particular environment. Typically, the sequence is one of nucleic or amino acids, but it could also be a sequence of neuronal firings or even a string of letters written in a particular alphabet. The function could be the biochemical activity (for biological sequences), or the level of performance of a neuronal network that produces the firings. For words in a language, the functionality could simply be the likelihood that the word is understood by someone familiar with the language. In some sense, we often simply use the word “information” to indicate functional information. For example, the hemoglobin DNA sequence carries the information necessary to make the protein hemoglobin (given the right cellular environment), which is functional in the sense that it binds oxygen with a particular affinity. For recordings from animal brains, the functionality of the sequence of neuronal activations could be the accuracy with which the animal performs a rewarded task. Functional information can be defined quantitatively in terms of the likelihood that a random sequence performs the function or task at the specified level [1, 2]. Specifically, if we require a particular sequence to perform function $x$ at least at the level $E_x$, then the functional information about $x$ as defined by Szostak is given by
$$I(E_x) = -\log F(E_x), \tag{1}$$
where $F(E_x)$ is the fraction of all sequences (say, of a particular length $L$) that perform the function at a level at least $E_x$.
Such a definition of functional information makes eminent sense (it implies that functional sequences are exponentially rare) and accurately links information content to functionality. For example, in the evolution of RNA molecules that encode an enzymatic activity [3], functional information is correlated with both activity and structural complexity. In the evolution of HIV protease [4], functional information can be seen to decrease when anti-viral drugs are given (reflecting the reduced activity of the protease), and then to increase again as anti-viral resistance emerges. In fact, functional information defined in this manner turns out to be just the coarse-grained Shannon information content of the sequences [5].
A shortcoming of this measure is that only large groups of sequences can be characterized in this manner, because accurately estimating the fraction of functional sequences necessitates large ensembles of functional sequences. Here we develop an approach that allows us to attach a score to every individual sequence in an ensemble, which then allows us to predict the level of functionality of sequences not in the ensemble. In particular, this approach allows us to study how functional information emerges from multi-site correlations within the sequence.
2 Information content of functional symbolic sequences
Sequences $s$ are written in terms of the state of a joint random variable $X = X_1X_2\cdots X_L$ (one for each monomer), that is, $X = s$, while the environment can be thought of as the state of an environment variable, that is, $E = e$. With this notation, the information content can be written as
$$I = H(X) - H(X|e). \tag{2}$$
In Eq. (2), we defined two ensembles: the unconditional (unconstrained) ensemble $X$ and the constrained (by function within environment $e$) ensemble $X|e$. Typically, the unconstrained ensemble is all the possible sequences of a given length $L$, while the ensemble $X|e$ consists of only those sequences that are functional in $e$ at a specified level. Further, $H(X)$ is the Shannon entropy of the unconstrained ensemble which, if sequences of length $L$ are drawn from an alphabet of $D$ symbols, is simply $H(X) = L\log D$. The entropy of the constrained ensemble requires the knowledge of the likelihood $p(s|e)$ to find a functional sequence $s$ within the ensemble. Then, we can calculate the Shannon entropy of sequences in $X|e$ as
$$H(X|e) = -\sum_{s=1}^{N_e} p(s|e)\log p(s|e), \tag{3}$$
where $N_e$ is the number of different functional sequences within $X|e$. If we coarse-grain entropy (3) by assuming that each sequence appears with equal probability, then $p(s|e) = 1/N_e$ and
$$I = L\log D - \log N_e = -\log\frac{N_e}{D^L}, \tag{4}$$
which is Szostak’s functional information since $F(E_x) = N_e/D^L$.
Because most functional sequences of interest are exponentially rare, finding a sufficient number of them to accurately estimate the fraction of functional sequences is difficult or nearly impossible. For example, while tens of thousands of variants of the 99-mer HIV-1 protease are known, they are not obtained by randomly sampling the 99-mer space, and therefore cannot be used to calculate functional information in this manner. However, it is possible to relate the information to monomer, dimer, and trimer entropies (and so on) as follows.
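In general, a multi-variable entropy such as $H(X|e) = H(X_1X_2\cdots X_L|e)$ can be decomposed into multi-mer correlation entropies using Fano’s entropy and information decomposition theorems [9]. Specifically,$^1$
$$H(X_1X_2\cdots X_L) = \sum_{k=1}^{L}(-1)^{k-1}\sum_{i_1<\cdots<i_k} H(X_{i_1}{:}X_{i_2}{:}\cdots{:}X_{i_k}), \tag{5}$$
$$H(X_1{:}X_2{:}\cdots{:}X_L) = \sum_{k=1}^{L}(-1)^{k-1}\sum_{i_1<\cdots<i_k} H(X_{i_1}X_{i_2}\cdots X_{i_k}). \tag{6}$$
Footnote 1: Fano’s decomposition theorems are just specific information-theoretic examples of the more general inclusion-exclusion theorem that has been known for some time. For example, [10] refers to it as a “théorème logique bien connu”.
Here, we write $H(X_i{:}X_j)$ for informations (shared entropies) for ease of reading. In these formulæ, the inner sums run over all distinct choices of $k$ indices out of $L$. For example, the entropy decomposition theorem (5) can also be written as
$$H(X_1X_2\cdots X_L) = \sum_{i} H(X_i) - \sum_{i<j} H(X_i{:}X_j) + \sum_{i<j<k} H(X_i{:}X_j{:}X_k) - \cdots. \tag{7}$$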
These theorems allow us to approximate the exact information content $I$. To start with, if we neglect all correlations between variables, then the first-order approximation to $H(X|e)$ is
$$H_1 = \sum_{i=1}^{L} H(X_i|e), \tag{8}$$
where $H(X_i|e)$ is the (functional) monomer entropy of site $i$, that is,
$$H(X_i|e) = -\sum_{x} p_i(x)\log p_i(x), \tag{9}$$
with $p_i(x)$ the likelihood to find symbol $x$ (one out of $D$ possible ones) at position $i$.
If we keep correlations between pairs of sites but neglect correlations between three sites, we obtain the second-order approximation of the entropy as
$$H_2 = \sum_{i=1}^{L} H(X_i|e) - \sum_{i<j} H(X_i{:}X_j|e), \tag{10}$$
where $H(X_i{:}X_j|e)$ is the entropy shared between variables $X_i$ and $X_j$. We can deduce how this is written in terms of marginal entropies (entropies obtained by summing over one or more of the monomers) using the decomposition theorem (6) for $L=2$ as
$$H(X_i{:}X_j|e) = H(X_i|e) + H(X_j|e) - H(X_iX_j|e), \tag{11}$$
the standard Shannon formula for information.
Because the monomer entropies $H(X_i|e)$ and the dimer entropies $H(X_iX_j|e)$ can be estimated from finite ensembles of functional sequences, approximations such as (10) are sufficient to estimate the information content as long as correlations between three or more variables can be neglected.
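The first- and second-order approximations can be estimated directly from an alignment by counting. The following is a minimal sketch, assuming the MSA is a list of equal-length strings and using plain maximum-likelihood frequencies without pseudocounts; the function names and the toy alignment are illustrative and not taken from the original work.

```python
import math
from collections import Counter
from itertools import combinations


def column_entropy(msa, sites, base):
    """Shannon entropy of the joint symbol distribution at the given sites."""
    counts = Counter(tuple(seq[i] for i in sites) for seq in msa)
    n = len(msa)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def entropy_approximations(msa, alphabet_size):
    """Return (H_1, H_2): entropy estimates without and with pair correlations."""
    length = len(msa[0])
    h = [column_entropy(msa, (i,), alphabet_size) for i in range(length)]
    h1 = sum(h)
    # Shared entropy of a pair: H(i:j) = H(i) + H(j) - H(ij)
    shared = sum(
        h[i] + h[j] - column_entropy(msa, (i, j), alphabet_size)
        for i, j in combinations(range(length), 2)
    )
    return h1, h1 - shared


# Toy alignment: two perfectly anti-correlated binary sites.
toy_msa = ["ab", "ba", "ab", "ba"]
h1, h2 = entropy_approximations(toy_msa, alphabet_size=2)
info_first_order = len(toy_msa[0]) - h1   # information in mers, ignoring correlations
info_second_order = len(toy_msa[0]) - h2  # information in mers, including pairs
```

In this toy case each site looks random on its own (first-order information of zero), while the pair term recovers the full one mer of information carried by the correlation.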
While these approximations are useful for calculating the information content of sequences from finite ensembles, they do not allow us to correlate individual characteristics of a sequence (for example, its particular level of functionality) with information because, as we have emphasized repeatedly, information is not a property of a single sequence. However, each sequence is also a pattern, so it should be possible in principle to relate the individual pattern in the sequence to its function. Indeed, methods of this sort have been developed before: they use sophisticated techniques to link the patterns to function, either by regression [11, 12] or by using a multiple-sequence alignment (MSA) to train a probabilistic model (usually called a “Potts” or “Ising” model) [13, 14, 15, 16]. We now show that there is a much simpler approach that allows us to link patterns to function directly, without regression or fitting a Potts model.
3 Model-free linkage of pattern to function
A common technique to model correlations between variables in a sequence is the so-called Potts or Ising model Hamiltonian (energy functional)$^2$ that can be written as [13, 14]
$$E(s) = \sum_{i=1}^{L} h_i(s_i) + \sum_{i<j} J_{ij}(s_i, s_j). \tag{12}$$
Footnote 2: Technically speaking, the model is neither a Potts nor an Ising model, as those models only have nearest-neighbor interactions. It is best called an infinite-range Ising-like model.
Here as before, the $s_i$ are the sequence monomers taken from an alphabet of size $D$. For neuronal firings, the $s_i$ are binary, while for much of the literature on molecules, the $s_i$ are reduced to a small alphabet, for example $s_i = 0$ for the wild-type monomer and $s_i = 1$ for a mutant. The parameters $h_i$ and $J_{ij}$ are Lagrange multipliers obtained by maximizing the entropy
$$H = -\sum_{s} P(s)\ln P(s) \tag{13}$$
constrained to reproduce the monomer and dimer probabilities $p_i(x)$ and $p_{ij}(x,y)$ that can be obtained from the MSA. The probability $P(s)$ is incidentally the same as the one defined in (3): it is the likelihood to find sequence $s$ by chance. Maximization under the constraints yields the familiar Boltzmann distribution
$$P(s) = \frac{1}{Z}\,e^{-E(s)} \tag{14}$$
with $E(s)$ from (12) above, and where
$$Z = \sum_{s} e^{-E(s)} \tag{15}$$
is the partition function.
Now, let us take a closer look at the entropy $H$. Using (14), we can write this as
$$H = \sum_{s} P(s)\,E(s) + \ln Z. \tag{16}$$
It is clear that the right-hand side of Eq. (16) is an approximation of the “true” entropy $H(X|e)$, fitted to the MSA using the first- and second-order Lagrange multipliers. Higher-order terms are neglected. But it is a strange formula, because $\ln Z$ also appears on the right-hand side, and we do not have an approximation for that. It turns out, however, that we can approximate the left-hand side using averages taken from the MSA as well, up to any order we want. This is possible because we can decompose $H(X|e)$ exactly using the entropy decomposition theorem, for example written in the form (7). Using this approximation, we will be able to determine $\ln Z$, and eliminate the Lagrange multipliers.
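For very short sequences, the relation between entropy, average energy, and partition function can be checked by brute force. The sketch below assumes the standard Boltzmann form $P(s) = e^{-E(s)}/Z$ with a pairwise Hamiltonian in which fields and couplings enter with a plus sign (the sign convention is an assumption), and verifies numerically that the entropy equals the average energy plus $\ln Z$. The parameter values are arbitrary illustrative numbers, not fitted to any data.

```python
import math
from itertools import product


def boltzmann_summary(h, J, alphabet_size):
    """Enumerate all sequences; return (Z, mean energy, entropy in nats)."""
    length = len(h)

    def energy(s):
        e = sum(h[i][s[i]] for i in range(length))
        e += sum(J[(i, j)][s[i]][s[j]] for (i, j) in J)
        return e

    states = list(product(range(alphabet_size), repeat=length))
    weights = [math.exp(-energy(s)) for s in states]
    Z = sum(weights)
    probs = [w / Z for w in weights]
    mean_energy = sum(p * energy(s) for p, s in zip(probs, states))
    entropy = -sum(p * math.log(p) for p in probs)
    return Z, mean_energy, entropy


# Hypothetical parameters for three binary sites (symbol 0 has zero field).
h = [[0.0, 1.0], [0.0, 0.5], [0.0, 2.0]]
J = {
    (0, 1): [[0.0, 0.3], [0.3, -0.2]],
    (1, 2): [[0.0, 0.0], [0.0, 0.7]],
}
Z, mean_energy, entropy = boltzmann_summary(h, J, alphabet_size=2)
```

The enumeration grows as $D^L$, so this is only a consistency check for toy systems; it is exactly this cost that makes fitting such models to realistic alphabets expensive.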
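Because the Potts model parameters only consider up to bi-partite correlations, we will neglect terms of order three or higher in (7) for now, but it is clear that we could keep them as long as we fit an appropriate third-order term in the Hamiltonian (12). Using $H(X_i|e)$ directly from the MSA, along with the $H(X_i{:}X_j|e)$, and so forth, we can insert those terms in (16) to yield
$$\sum_{i} H(i) - \sum_{i<j} H(i{:}j) = \sum_{i}\langle h_i\rangle + \sum_{i<j}\langle J_{ij}\rangle + \ln Z. \tag{17}$$
In (17), we defined the average Lagrange coefficients
$$\langle h_i\rangle = \sum_{x} p_i(x)\,h_i(x), \tag{18}$$
$$\langle J_{ij}\rangle = \sum_{x,y} p_{ij}(x,y)\,J_{ij}(x,y). \tag{19}$$
We also introduced the simplified notation $H(i)\equiv H(X_i|e)$ and $H(i{:}j)\equiv H(X_i{:}X_j|e)$ and so on, which we will use from now on.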
In order to understand the term $\ln Z$, we turn to Berg-von Hippel (BvH) theory [17, 18]. BvH theory uses an MSA to define an energy functional for a DNA sequence that is related to the binding affinity of a transcription factor. Indeed, BvH theory is very similar to the max-entropy approach described above, except that it assumes that the distribution at each site is at maximum entropy. In fact, the energy score of a DNA binding site is given precisely by an energy functional of the form (12), but without the second-order term (as correlations between sites are close to irrelevant in DNA binding).
Unlike in the Potts-model approach, the counterparts to the $h_i$ are not fitted to the MSA in the BvH approach. Rather, they are directly extracted from the MSA in terms of a matrix $\epsilon_{ib}$ as$^3$
$$\epsilon_{ib} = \frac{1}{\lambda}\ln\frac{p_{i0}}{p_{ib}}, \tag{20}$$
because this is the expression that maximizes the single-site entropy (subject to constraints). We can see this by writing
$$p_{ib} = p_{i0}\,e^{-\lambda\epsilon_{ib}}.$$
Footnote 3: We changed notation slightly in Eq. (20), from $p_i(x)$ in Eq. (9) to $p_{ib}$ here, to conform to BvH theory. The difference is that in BvH theory it is convenient to index the symbol, so that it is obvious that $\epsilon_{ib}$ is a matrix element, while this was implicit in the form $p_i(x)$. In the latter formulation, rather than summing over an index $b$ we sum over the symbols $x$.
In Eq. (20), $\lambda$ is a Lagrange multiplier that allows us to set the scale of the energy just as we did in the earlier maximization (we set $\lambda = 1$ here). Furthermore, $p_{i0}$ is the probability (likelihood) to find the consensus monomer at site $i$. The consensus monomer is the one that has the highest frequency in the MSA at that position. This ensures that the consensus sequence is the one with vanishing energy (the “ground state”). The matrix $\epsilon_{ib}$ is called the “position-weight matrix” (PWM) of the aligned sequences, and it is fairly successful at predicting DNA binding sites in a genome (see, e.g., [19, 20]), and even at detecting correlations between binding sites [21].
We now introduce a second-order PWM (as BvH in fact did in the appendix to [17], but for adjacent monomers only)
$$\epsilon_{ij,ab} = \frac{1}{\lambda}\ln\frac{p_{ij,0}}{p_{ij,ab}}.$$
Just as before, $p_{ij,ab}$ is the likelihood to find symbol combination $ab$ at the pair of sites $(i,j)$, while $p_{ij,0}$ is the likelihood to find the consensus pair there$^4$. Defining this matrix implicitly assumes that any pair of monomers is also in statistical equilibrium. While assuming that even higher-order correlations are in statistical equilibrium might be increasingly doubtful, it is certainly more likely than assuming that the entire sequence is at maximum entropy, as (14) does.
Footnote 4: A more precise (but also more cumbersome) notation would make explicit that, in pair-wise probabilities, the index $0$ refers to the consensus dimer.
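We can now use BvH theory to calculate $\ln Z$ and insert it into Eq. (17). In standard BvH theory, $\ln Z$ is given by (recall we are setting $\lambda = 1$ here)
$$\ln Z = \sum_{i=1}^{L}\ln\sum_{b=1}^{D} e^{-\epsilon_{ib}} = -\sum_{i=1}^{L}\ln p_{i0},$$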
but this is only the first-order approximation. In general, for any sequence we must have
Inserting this into (17) we can now identify
This implies that in the BvH model, there is a relationship between the energy contribution of a site and the entropy of that site, and it is precisely the relationship that is summarized by the thermodynamical Helmholtz free-energy relation
$$F = \langle E\rangle - TS,$$
where $F$ is the free energy, $\langle E\rangle$ is the average energy, and $S$ is the Gibbs entropy ($T$, of course, is the temperature). As a consequence, we see that the parameters of the Potts model are given entirely by quantities that can be extracted directly from the MSA, without the need of any optimization:
Thus, the energy of a particular sequence can now be calculated from the first- and second-order PWMs (just as in BvH theory), without fitting any parameter to a model. In the next section, we extend this approach to take into account arbitrary-order correlations between monomers, in order to construct predictive energy and information scores for arbitrary sequences.
4 Information and energy scores
Using the PWM (20) obtained from an alignment of functional sequences, we can define the first-order energy score of an arbitrary sequence $s$ as
$$E_1(s) = \sum_{i=1}^{L}\sum_{b=1}^{D}\epsilon_{ib}\,s_{ib}, \tag{30}$$
where $s_{ib}$ is the matrix representing sequence $s$, which has a ‘1’ at position $i$ if symbol $b$ is found there, and a ‘0’ otherwise. With this definition, the consensus sequence has $E_1 = 0$, and sequences that carry symbols that rarely occur in the MSA at the corresponding position score badly (have high energy). For cases where the symbol does not occur at all in the MSA at that position, we have to introduce the concept of pseudocounts in order to deal with zeros in the denominator of (20), as we discuss in more detail below.
The average first-order energy score over the ensemble then becomes (for $\lambda = 1$ and natural logarithms)
$$\langle E_1\rangle = \sum_{i=1}^{L}\sum_{b=1}^{D} p_{ib}\,\epsilon_{ib} = H_1 + \sum_{i=1}^{L}\ln p_{i0}, \tag{31}$$
where $H_1$ is the first-order approximation to the Shannon entropy of the functional ensemble, defined in (8). This is the first-order version of the Helmholtz free energy relationship, with $\sum_i \ln p_{i0}$ playing the role of the free energy.
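A first-order energy score of this kind takes only a few lines to compute. The sketch below assumes the PWM elements have the log-ratio form $\ln(p_{i0}/p_{ib})$ with $\lambda = 1$, where $p_{i0}$ is the probability of the most frequent symbol at site $i$, and it uses a small additive pseudocount so that unseen symbols receive a large but finite energy. The function names, the toy alignment, and the pseudocount value are illustrative assumptions.

```python
import math
from collections import Counter


def site_probabilities(msa, alphabet, pseudocount):
    """Per-site symbol probabilities with an additive pseudocount."""
    n, d = len(msa), len(alphabet)
    probs = []
    for i in range(len(msa[0])):
        counts = Counter(seq[i] for seq in msa)
        probs.append(
            {b: (counts[b] + pseudocount) / (n + d * pseudocount) for b in alphabet}
        )
    return probs


def energy_pwm(probs):
    """Energy matrix: log-ratio of consensus probability to symbol probability."""
    pwm = []
    for p in probs:
        p0 = max(p.values())  # probability of the consensus symbol at this site
        pwm.append({b: math.log(p0 / pb) for b, pb in p.items()})
    return pwm


def energy_score(seq, pwm):
    """First-order energy score: sum of PWM entries selected by the sequence."""
    return sum(pwm[i][b] for i, b in enumerate(seq))


alphabet = "abc"
msa = ["aab", "aab", "abb", "aac"]
pwm = energy_pwm(site_probabilities(msa, alphabet, pseudocount=0.01))
consensus_energy = energy_score("aab", pwm)   # consensus sequence
variant_energy = energy_score("abb", pwm)     # one rare symbol
unseen_energy = energy_score("ccc", pwm)      # symbols absent at sites 0 and 1
```

Looking symbols up by position is equivalent to multiplying the PWM with the 0/1 indicator matrix of the sequence; the dictionary form simply avoids building that matrix.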
In preparation for a discussion of information scores, let us also introduce the first-order information matrix$^6$
$$I_{ib} = 1 + \log_D p_{ib}, \tag{32}$$
defined in such a way that its average is the first-order Shannon information (if we take logs to the base of the alphabet, $D$)
$$\sum_{i=1}^{L}\sum_{b=1}^{D} p_{ib}\,I_{ib} = L - \sum_{i=1}^{L} H(i). \tag{33}$$
The corresponding information score for sequence $s$ is
$$I_1(s) = \sum_{i=1}^{L}\sum_{b=1}^{D} I_{ib}\,s_{ib}. \tag{34}$$
Footnote 6: This is in fact Schneider’s information score introduced in [22], but with a uniform prior and without finite sample size correction, which we will take care of using pseudocounts.
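The defining property of the first-order information score, namely that its average over the alignment returns the first-order Shannon information, can be checked numerically. The sketch below assumes the per-site matrix element is $1 + \log_D p_{ib}$ (uniform prior, logs to the base of the alphabet size $D$), which is one form consistent with that property; the names, the toy alignment, and the pseudocount value are illustrative assumptions.

```python
import math
from collections import Counter


def information_matrix(msa, alphabet, pseudocount=1e-6):
    """Per-site information matrix 1 + log_D(p), in units of mers."""
    n, d = len(msa), len(alphabet)
    matrix = []
    for i in range(len(msa[0])):
        counts = Counter(seq[i] for seq in msa)
        matrix.append(
            {
                b: 1.0 + math.log((counts[b] + pseudocount) / (n + d * pseudocount), d)
                for b in alphabet
            }
        )
    return matrix


def information_score(seq, matrix):
    """First-order information score of a single sequence."""
    return sum(matrix[i][b] for i, b in enumerate(seq))


alphabet = "ab"
msa = ["aa", "aa", "aa", "ab"]
matrix = information_matrix(msa, alphabet)

# Averaging the score over the alignment approximates L minus the summed
# per-site entropies (here 2 - 0.811 mers, up to the tiny pseudocount).
average_score = sum(information_score(s, matrix) for s in msa) / len(msa)
unseen_score = information_score("ba", matrix)  # 'b' never occurs at site 0
```

A symbol that never occurs at a site pushes the score far into the negative, which is what later makes such scores usable as classifiers.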
We now construct energy and information scores that include higher-order corrections. In general,
Analogous relations hold for information scores for monomers, dimers, and so on. We now construct approximations to the exact energy score (35). The first-order approximation was already given in (30), but we rewrite it here:
We can rewrite the correction terms as a function of joint energy scores using the decomposition theorem. For example, using (38), we find
Likewise, if we neglect fourth-order correlations,
Turning now to information scores (here, , and so on, and logs are to the base ) we can define
where the shared information scores are defined analogously to the shared energy scores. Averaging these over the sequences in the functional ensemble will return the approximations (8), (10), and so on.
5 Application to sequences of self-replicating programs
In the previous section we saw that energy and information scores are built up hierarchically, adding pairwise, three-part, and higher correlations until the full functionality emerges. But what level of correlations do we need to keep track of in order to accurately capture a sequence’s functionality? To test this, we will use a data set of sequences in which correlations are important, but that is also large enough so that higher-order PWMs are statistically significant. Previously, we analyzed a set of 36,171 sequences of self-replicating computer programs from the digital life system Avida (see, e.g., [23, 24, 25]). Avida simulates a computer within a standard computer, within which self-replicating programs written in a custom programming language live and thrive. The programming language typically uses only 26 instructions (conveniently labeled by the lowercase letters a-z) that are chosen in such a way that programs are highly evolvable. Because the replication of instructions is noisy (giving rise to mutations), populations of programs evolve and adapt to user-specified fitness landscapes. The sequences that we analyze here represent the complete set of all possible self-replicating programs of length $L = 9$ (out of a possible $26^9 \approx 5.4\times 10^{12}$) obtained via an exhaustive search [26]. “Self-replication” is defined here as the capacity to form a growing colony of programs (in the absence of mutations) and does not necessarily require perfect copying of information. Understanding the structure of the fitness landscape of self-replicators is an important element in trying to understand how replication could have emerged from non-replication, i.e., the origin of life [5, 27].
The 36,171 sequences form complex clusters in genotype space (see Fig. 1).
We can immediately calculate the information content of these 9-mers using Szostak’s formula (1), which gives$^7$
$$I = 9 - \log_{26} 36{,}171 \approx 5.78\ \text{mers}, \tag{47}$$
that is, about 64% of the sequence is information (the smallest possible replicator in this world is shorter, with an even more compressed code).
Footnote 7: Choosing a base for the logarithm gives units to entropy and information. In the following, we will take logs to the base of the alphabet, giving units of “mers”. An $L$-mer has $L$ mers of potential information (entropy).
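The value quoted here follows directly from the counts given in the text (36,171 replicators among all sequences of length 9 over 26 instructions). A minimal sketch of the calculation, with logarithms taken to the base of the alphabet so that the result is in mers; the function name is illustrative.

```python
import math


def functional_information(n_functional, alphabet_size, length):
    """Szostak's functional information in mers (log base = alphabet size)."""
    fraction = n_functional / alphabet_size ** length
    return -math.log(fraction, alphabet_size)


info = functional_information(36171, 26, 9)
share = info / 9
print(f"{info:.2f} mers, {share:.0%} of the sequence")  # roughly 5.78 mers, about 64%
```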
5.1 Emergence of information from multi-partite correlations
We can now ask how this information emerges from correlations, by averaging the information scores (44) over the functional ensemble. In Fig. 2 we show the approximations to the exact information score (47) as a function of the order of the approximation. As expected from the information decomposition theorem, the approximations move closer to the exact value as more correlations are incorporated.
5.2 Prediction of function from information scores
In this section we will test how much “correlation information” (information stored in correlations between sites, as opposed to bias in per-site frequency) is necessary to distinguish functional from non-functional sequences. Because we will now be scoring sequences that are not in the functional set, we have to introduce pseudocounts in the PWMs. As is clear from (20), if a particular symbol does not occur at a given position in the MSA, then the maximum likelihood estimator of its probability vanishes, leading to an infinite matrix element. As a consequence, the score of a sequence that contains such a symbol at that position is not well-defined. A common cure for this defect is to introduce pseudocounts, which add a constant to every possible symbol count. While the common Laplace pseudocount adds 1 to each count, the pseudocount is often used as a variable that can be adapted to the statistics of the dataset in question (see, e.g., [28, 29]). In what follows, we adopt a single fixed pseudocount throughout, but note that the classification accuracy depends only extremely weakly on its value over the 12 orders of magnitude of pseudocounts we tested. Generally speaking, an infinitesimal pseudocount moves information scores of un-represented symbols far out into the negative, and energy scores far to the positive. The pseudocount for higher-order PWMs (we construct those explicitly up to the highest order of correlations that is included) is adjusted so that the marginal probabilities remain correct.
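One simple way to keep marginals consistent across orders is to scale the pseudocount with the number of symbol combinations it is spread over. The sketch below is an assumption about how such an adjustment could be done, not a description of the scheme actually used: monomer counts receive `alpha`, pair counts receive `alpha / D`, and the normalization is shared, so that summing a pair probability over one symbol returns the monomer probability exactly. All names and values are illustrative.

```python
from collections import Counter


def monomer_prob(msa, i, alphabet, alpha):
    """Monomer probabilities at site i with additive pseudocount alpha."""
    n, d = len(msa), len(alphabet)
    counts = Counter(seq[i] for seq in msa)
    return {a: (counts[a] + alpha) / (n + d * alpha) for a in alphabet}


def pair_prob(msa, i, j, alphabet, alpha):
    """Pair probabilities at sites (i, j) with pseudocount alpha / D per pair."""
    n, d = len(msa), len(alphabet)
    counts = Counter((seq[i], seq[j]) for seq in msa)
    return {
        (a, b): (counts[(a, b)] + alpha / d) / (n + d * alpha)
        for a in alphabet
        for b in alphabet
    }


alphabet = "abc"
msa = ["aab", "aab", "abb", "aac"]
alpha = 0.5
mono = monomer_prob(msa, 0, alphabet, alpha)
pair = pair_prob(msa, 0, 1, alphabet, alpha)

# Marginalizing the pair table over the second site recovers the monomer table.
marginal = {a: sum(pair[(a, b)] for b in alphabet) for a in alphabet}
```

The same idea extends to triplets and higher orders by dividing the pseudocount by a further factor of $D$ at each order.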
In order to predict the function of an arbitrary sequence in light of an MSA of functional ones, we need to construct a classifier in such a manner that, given a particular threshold, the score divides functional from non-functional sequences. For energy scores, we would deem those with a score below the threshold to be functional, while for information scores we would deem those with a score above the threshold to be replicators. Note that because we have exhaustively classified this dataset already, we can determine the accuracy of classification (given a subset of sequences for a given classifier) with perfect accuracy.
In principle, we can construct classifiers directly from the information score functions (44), but because the correction terms involve the summation of many terms with alternating signs, the pseudocount prevents these classifiers from being effective$^8$. Indeed, it appears to be impossible to construct pseudocounts in a manner that does not introduce spurious correlations at all levels (while it is of course possible to make sure no spurious correlations are introduced for a chosen given level). However, it turns out that sums over the multiple-site energy and information functionals do work well, because they do not rely on cancellations. For example, we can define the following information classifiers (similar classifiers can be built from energy functionals):
Footnote 8: This holds equally when using energy score functions (39).
To test the quality of the classifiers, we create sets of test sequences. Specifically, we create three ensembles: one of randomly generated sequences (all sequences are 9-mers) that we made sure are all non-functional, as well as a set of one-mutants and a set of two-mutants that are one or two mutations away from functional sequences, but known not to be functional. We expect the one-mutant set in particular to be the hardest to classify.
We show typical density distributions of scores for the classifiers in Fig. 3. These classifiers were all “trained” on the full set of replicators, meaning that the PWMs were created with the entire set of replicators in the MSA. As expected, the set of random sequences has a density distribution (purple in Figs. 3(a-d)) far removed from the functional sequences (light blue), but the one-mutant and two-mutant sequences have a significant overlap with functional sequences if the first-order score is used as a classifier. Including higher-order correlations reduces this overlap, until the overlap (mis-classification) almost completely disappears for the highest-order classifier.
We can characterize the accuracy of the classifier as a function of the size of the training set by measuring the quality score of an ROC (receiver operating characteristic, see, e.g., [30]) plot. ROCs are generated by plotting the true positive rate (TPR) against the false positive rate (FPR) as a function of the threshold in the classifier. A typical ROC (for a classifier trained on 1% of the functional set and used on 100,000 non-functional single-point mutants of the functional set) is shown in Fig. 4. The quality score of the classifier is obtained by calculating the area under the curve (AUC) of the classifier, as compared to the AUC=0.5 of a random classifier where the TPR increases linearly with the FPR.
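The AUC can be computed from two lists of scores without drawing the curve, using its interpretation as the probability that a randomly chosen functional sequence outscores a randomly chosen non-functional one. The sketch below assumes that higher scores indicate function (as for information scores; energy scores would be negated first). The function names and example scores are illustrative, and the pairwise loop is quadratic, so a rank-based method would be preferable for large test sets.

```python
def roc_auc(functional_scores, nonfunctional_scores):
    """AUC as the fraction of (functional, non-functional) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for f in functional_scores:
        for g in nonfunctional_scores:
            if f > g:
                wins += 1.0
            elif f == g:
                wins += 0.5
    return wins / (len(functional_scores) * len(nonfunctional_scores))


def roc_points(functional_scores, nonfunctional_scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold from high to low."""
    thresholds = sorted(set(functional_scores) | set(nonfunctional_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in functional_scores) / len(functional_scores)
        fpr = sum(s >= t for s in nonfunctional_scores) / len(nonfunctional_scores)
        points.append((fpr, tpr))
    return points


perfect = roc_auc([3.0, 4.0], [1.0, 2.0])      # fully separated scores
uninformative = roc_auc([1.0, 2.0], [1.0, 2.0])  # identical score distributions
curve = roc_points([3.0, 4.0], [1.0, 2.0])
```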
It is clear from Fig. 5 that all classifiers work well even when only 0.1% of the true replicators (36 sequences) are used to create the MSA, but for training sets that small, the classifiers that use three-partite and four-partite correlations perform less well than the first- and second-order classifiers on the test data for one- and two-mutants of functional sequences, which can be attributed to overfitting. Indeed, for data sets this small, the equilibrium assumption behind the max entropy approach will be violated for those PWMs. However, for larger MSAs, including higher-order correlations significantly improves the performance of the classifier. When using the full set of replicators, taking into account up to four-way correlations appears to capture almost all of the information encoded in the sequences. On random sequences, all classifiers perform close to perfectly even on the smallest training set size, and including multivariate correlations is clearly overkill. The quality of classifiers based on energy functionals is the same as for those described here (data not shown), as they are intimately connected via the Helmholtz free energy relation.
Fano’s entropy and information decomposition theorems allow us to understand how functional information emerges from the combination of multivariate correlations. Here we have shown that defining energy and information functionals built from multivariate position weight matrices allows us to capture the correlations inherent in multiple sequence alignments in a model-free manner. Standard approaches to fitting MSAs rely on expensive inverse models to fit the bi-partite correlations in the data, but must reduce the alphabet size considerably to reduce the number of parameters in the Potts or Ising models. It is clear from this analysis that such a fit is entirely unnecessary, as the marginal probabilities can be extracted from the data as long as the maximum entropy assumption is used on single sites, pairs of sites, triples, and so on.
We tested this approach on the largest computational genotype-phenotype map created to date. Having a complete genotype-phenotype map allows us to calculate the information content of functional sequences exactly, and to study how this information is built up by lower-order correlations. Further, a complete data set helps in creating ensembles that can be used to test the classification accuracy of function prediction. In particular, we used the model-free approach to construct classifiers that can distinguish functional from non-functional sequences even when the non-functional sequences are single-site polymorphisms of the functional ones. Of course, this approach needs to be tested on biological data sets with significantly longer sequences. However, since the computational cost of creating these complex classifiers scales mainly with the level of multivariate correlations that are included, we expect to be able to classify sequences of several hundreds of sites with full alphabet size (for example, $D = 20$ or 21 for proteins) taking into account all tri-partite correlations. For binary data such as neuronal spike trains, we expect to be able to handle sequences of thousands of sites, possibly including even more than 4-part correlations.
Sequence models of the Ising or Potts type perform markedly better than the standard Hidden Markov Models (HMMs) that have been developed and used since the early 1970s [31], because they can take into account arbitrary correlations between pairs of symbols, rather than just those between adjacent symbols as in HMMs. However, going beyond pair-wise correlations is deemed impossible for Ising models because of the associated explosion in the number of parameters. While more general pattern classification tasks (of which sequence classification is but a subset) can nowadays be achieved by training deep neural networks on annotated data, these methods have also displayed severe drawbacks [32, 33], for example by being easily fooled by distracting patterns. Furthermore, training such models on a large corpus of data is expensive. The model-free classifiers that we have described here are different from deep networks in important ways, as they do not require any training, and their vulnerability to overfitting is easily controlled by adjusting the level of correlation to be included to the size of the MSA.
Acknowledgements This work was supported in part by funds provided by Michigan State University in support of the BEACON Center for the Study of Evolution in Action. We acknowledge computational resources provided by the Institute for Cyber-Enabled Research (iCER) at Michigan State University.
- [1] Szostak JW. 2003 Functional information: Molecular messages. Nature 423, 689.
- [2] Hazen RM, Griffin PL, Carothers JM, Szostak JW. 2007 Functional information and the emergence of biocomplexity. Proc. Natl. Acad. Sci. USA 104, 8574–8581.
- [3] Carothers JM, Oestreich SC, Davis JH, Szostak JW. 2004 Informational complexity and functional activity of RNA structures. J. American Chem. Society 126, 5130–5137.
- [4] Gupta A, Adami C. 2016 Strong selection significantly increases epistatic interactions in the long-term evolution of a protein. PLoS Genet 12, e1005960.
- [5] Adami C, LaBar T. 2017 From entropy to information: Biased typewriters and the origin of life. In Information and Causality: From Matter to Life (ed. S Walker, P Davies, G Ellis), pp. 95–112. Cambridge, MA: Cambridge University Press.
- [6] Adami C, Cerf NJ. 2000 Physical complexity of symbolic sequences. Physica D 137, 62–69.
- [7] Adami C. 2004 Information theory in molecular biology. Phys. Life Reviews 1, 3–22.
- [8] Adami C. 2012 The use of information theory in evolutionary biology. Ann. N.Y. Acad. Sci. 1256, 49–65.
- [9] Fano RM. 1961 Transmission of Information: A Statistical Theory of Communication. New York and London: MIT Press and John Wiley.
- [10] Sylvester J. 1883 Note sur le théorème de Legendre. Comptes Rendus Acad. Sci. Paris 96, 463–465.
- [11] Hinkley T, Martins J, Chappey C, Haddad M, Stawiski E, Whitcomb JM, Petropoulos CJ, Bonhoeffer S. 2011 A systems analysis of mutational effects in HIV-1 protease and reverse transcriptase. Nat Genet 43, 487–9.
- [12] Kouyos RD, von Wyl V, Hinkley T, Petropoulos CJ, Haddad M, Whitcomb JM, Böni J, Yerly S, Cellerai C, Klimkait T, Günthard HF, Bonhoeffer S, Swiss HIV Cohort Study. 2011 Assessing predicted HIV-1 replicative capacity in a clinical setting. PLoS Pathog 7, e1002321.
- [13] Schneidman E, Berry MJ 2nd, Segev R, Bialek W. 2006 Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440, 1007–12.
- [14] Mora T, Bialek W. 2011 Are biological systems poised at criticality? J. Stat. Phys. 144, 268–302.
- [15] Ferguson AL, Mann JK, Omarjee S, Ndung’u T, Walker BD, Chakraborty AK. 2013 Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity 38, 606–17.
- [16] Biswas A, Haldane A, Levy RM. 2021 The role of epistasis in determining the fitness landscape of HIV proteins. BioRxiv 448646.
- [17] Berg OG, von Hippel PH. 1987 Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 193, 723–750.
- [18] Berg OG, von Hippel PH. 1988 Selection of DNA binding sites by regulatory proteins II. The binding specificity of cyclic AMP receptor protein to recognition sites. J. Mol. Biol. 200, 709–723.
- [19] Stormo GD. 2000 DNA binding sites: Representation and discovery. Bioinformatics 14, 16–23.
- [20] Brown CT, Callan Jr CG. 2004 Evolutionary comparisons suggest many novel cAMP response protein binding sites in Escherichia coli. Proc. Natl. Acad. Sci. USA 101, 2404–2409.
- [21] Clifford J, Adami C. 2015 Discovery and information-theoretic characterization of transcription factor binding sites that act cooperatively. Phys Biol 12, 056004.
- [22] Schneider TD. 1997 Information content of individual genetic sequences. J. theor. Biol. 189, 427–441.
- [23] Adami C. 1998 Introduction to Artificial Life. New York: Springer Verlag.
- [24] Adami C. 2006 Digital genetics: Unravelling the genetic basis of evolution. Nat Rev Genet 7, 109–118.
- [25] Ofria C, Bryson DM, Wilke CO. 2009 Avida: A software platform for research in computational evolutionary biology. In Artificial Life Models in Software (ed. M Komosinski, A Adamatzky), pp. 3–35. Springer London.
- [26] C G N, Adami C. 2021 Information-theoretic characterization of the complete genotype-phenotype map of a complex pre-biotic world. Phys Life Rev 38, 111–114.
- [27] C G N, LaBar T, Hintze A, Adami C. 2017 Origin of life in a digital microcosm. Philos Trans R Soc Lond A 375, 20160350.
- [28] Nemenman I, Shafee F, Bialek W. 2002 Entropy and inference, revisited. In Adv Neural Inf Process Syst. vol. 14 (ed. G Dietterich, S Becker, Z Ghahramani), pp. 471–478.
- [29] Nemenman I. 2011 Coincidences and estimation of entropies of random variables with large cardinalities. Entropy 13, 2013–2023.
- [30] Fawcett T. 2006 An introduction to ROC analysis. Pattern Recognition Letters 27, 861–874.
- [31] Rabiner LR. 1989 A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286.
- [32] Nguyen A, Yosinski J, Clune J. 2015 Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition (CVPR ’15). IEEE.
- [33] Jo J, Bengio Y. 2018 Measuring the tendency of CNNs to learn surface statistical regularities. arXiv:1711.11561.