The IMP game: Learnability, approximability and adversarial learning beyond Σ^0_1

02/07/2016 · Michael Brand et al. · Monash University

We introduce a problem set-up we call the Iterated Matching Pennies (IMP) game and show that it is a powerful framework for the study of three problems: adversarial learnability, conventional (i.e., non-adversarial) learnability and approximability. Using it, we are able to derive the following theorems. (1) It is possible to learn by example all of Σ^0_1 ∪ Π^0_1 as well as some supersets; (2) in adversarial learning (which we describe as a pursuit-evasion game), the pursuer has a winning strategy (in other words, Σ^0_1 can be learned adversarially, but Π^0_1 cannot); (3) some languages in Π^0_1 cannot be approximated by any language in Σ^0_1. We show corresponding results also for Σ^0_i and Π^0_i for arbitrary i.


1 Introduction

This paper deals with three widely-discussed topics: approximability, conventional learnability and adversarial learnability, and introduces a unified framework in which all three can be studied.

First, consider approximability. Turing’s seminal 1936 result [21] demonstrated that some languages that can be accepted by Turing machines (TMs) are not decidable. Otherwise stated, some R.E. languages are not recursive. Equivalently: some co-R.E. languages are not R.E.; any R.E. language must differ from them by at least one word. However, the diagonalisation process by which this result was originally derived makes no stronger claim regarding the number of words differentiating a co-R.E. language and an R.E. one. It merely shows one example of a word where a difference must exist.

We extend this original result by showing that some co-R.E. languages are, in some sense, as different from any R.E. language as it is possible to be.

To formalise this statement, consider an arbitrary (computable) enumeration, $e = e_1, e_2, e_3, \ldots$, over the complete language (the language that includes all words over the chosen alphabet). Over this enumeration, $e$, we define a distance metric, dissimilarity, between two languages, $L_1$ and $L_2$, as follows:

$$D(L_1, L_2) = \limsup_{n \to \infty} \frac{|\{ i \le n : e_i \in L_1 \triangle L_2 \}|}{n},$$

where $\triangle$ is the symmetric difference. We note that the value of $D$ depends on the enumeration chosen, and therefore, technically, the metric should be written $D_e$. However, all results in this paper are true for all possible choices of the enumeration, for which reason we omit the choice of enumeration, opting for this more simplified notation.

$D$ ranges between $0$ (the languages are essentially identical) and $1$ (the languages are completely dissimilar).
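As a purely illustrative aside, the proportions whose limit superior defines $D$ can be computed over any finite prefix of the enumeration. The following Python sketch is not part of the formal development: the length-lexicographic enumeration and the membership predicates are hypothetical stand-ins supplied as plain functions.

    def words():  # e_1, e_2, ...: length-lexicographic enumeration of {0,1}*
        length = 0
        while True:
            for k in range(2 ** length):
                yield format(k, "0" + str(length) + "b") if length else ""
            length += 1

    def dissimilarity_prefix(in_l1, in_l2, n):
        """Proportion of the first n enumerated words lying in the symmetric
        difference of the two languages (a finite stand-in for the limsup)."""
        diff = 0
        for i, w in enumerate(words()):
            if i >= n:
                break
            if in_l1(w) != in_l2(w):
                diff += 1
        return diff / n

    # Example: L1 = words of even length, L2 = the complete language.
    print(dissimilarity_prefix(lambda w: len(w) % 2 == 0, lambda w: True, 10000))

The dissimilarity itself is the limit superior of these proportions as $n$ grows.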

We prove:

Theorem 1.

There is a co-R.E. language $L$ such that every R.E. language has a dissimilarity distance of $1$ from $L$.

Consider now learnability. Learnability is an important concept in statistics, econometrics, machine learning, inductive inference, data mining and other fields. This has been discussed by E. M. Gold and by L. G. Valiant in terms of language identification in the limit [7, 22], and also in statistics via the notion of statistical consistency, also known as “completeness” (converging arbitrarily closely in the limit to an underlying true model).

Following upon his convergence results in [17], Solomonoff writes [20, sec. 2 (Completeness and Incomputability)]:

“It is notable that completeness and incomputability are complementary properties: It is easy to prove that any complete prediction method must be incomputable. Moreover, any computable prediction method can not be complete – there will always be a large space of regularities for which its predictions are catastrophically poor.”

In other words, in Solomonoff’s problem set-up it is impossible for a Turing machine to learn every R.E. language: every computable learner is limited.

Nevertheless, in the somewhat different context within which we study learnability, we are able to show that this tension does not exist: a Turing machine can learn any computable language. Moreover, we will consider a set of languages that includes, as a proper subset of it, the languages Σ^0_1 ∪ Π^0_1, and will prove that while no deterministic learning algorithm can learn every language in the set, a probabilistic one can (with probability $1$), and a mixed strategy involving several deterministic learning algorithms can approximate this arbitrarily well.¹

¹ Here and elsewhere we use the standard notations for language families in the arithmetical hierarchy [15]: Σ^0_1 is the set of recursively enumerable languages, Π^0_1 is the set of co-R.E. languages.

Lastly, consider adversarial learning [12, 11, 9]. This is different from the conventional learning scenario described above in that while in conventional learning we attempt to converge to an underlying “true model” based on given observations, adversarial learning is a multi-player process in which each participant can observe (to some extent) other players’ predictions and adjust their own actions accordingly. This game-theoretic set-up becomes of practical importance in many scenarios. For example, in online bidding bidders use information available to them (e.g., whether they won a particular auction) to learn the strategy used by competing bidders, so as to be able to optimise their own strategy accordingly.

We consider, specifically, an adversarial learning scenario in which one player (the pursuer) attempts to copy a second player, while the second player (the evader) is attempting to avoid being copied. Specifically, each player generates a bit ($0$ or $1$) and the pursuer wins if the two bits are equal while the evader wins if they are not. Though on the face of it this scenario may seem symmetric, we show that the pursuer has a winning strategy.

To attain all these results (as well as their higher-Turing-degree equivalents), we introduce a unified framework in which these questions and related ones can all be studied. The set-up used is an adaptation of one initially introduced by Scriven [16] of a predictor and a contrapredictive (or avoider) effectively playing what we might nowadays describe as a game of iterated matching pennies. In Section 2, we give a formal description of this problem set-up and briefly describe its historical evolution. In Section 3, we explain the relevance of the set-up to the learnability and approximability problems and analyse, as an example case, adversarial learning in the class of decidable languages. In Section 4, we extend the analysis to adversarial learning in all other classes in the arithmetical hierarchy, and in particular to Turing machines.

In Sections 5 and 6 we then return to conventional learnability and to approximability, respectively, and prove the remaining results by use of the set-up developed, showing how it can be adapted to these problems.

2 Matching Pennies

The matching pennies game is a zero-sum two-player game where each player is required to output a bit. If the two bits are equal, this is a win for Player “=”; if they differ, this is a win for Player “≠”. The game is a classic example used in teaching mixed strategies (see, e.g., [6, pp. 283–284]): its only Nash equilibrium [14, 13] is a mixed strategy wherein each player chooses each of the two options with probability $1/2$.

Consider, now, an iterative version of this game, where at each round the players choose a new bit with perfect information of all previous rounds. Here, too, the best strategy is to choose at each round a new bit with probability $1/2$ for each option, and with the added caveat that each bit must be independent of all previous bits. In the iterative variation, we define the payoff (of the entire game) to be

$$u_= \;=\; \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} r_i \tag{1}$$

for Player “=”, where $r_i$ is $1$ if the bits output in the $i$'th round are equal and $0$ if they are different. The payoff for Player “≠” is

$$u_{\neq} \;=\; 1 - u_= \;=\; \limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} (1 - r_i). \tag{2}$$

These payoff functions were designed to satisfy the following criteria:

  • They are always defined.

  • The game is zero-sum and strategically symmetric, except for the essential distinction between a player aiming to copy (Player “=”, the pursuer) and a player aiming for dissimilarity (Player “≠”, the evader).

  • The payoff is a function solely of the $r_i$ sequence. (This is important because in the actual IMP game being constructed players will only have visibility into past $x_i$ values, not full information regarding the game’s evolution.)

  • Where a limit exists (in the $n \to \infty$ sense) to the percentage of rounds to be won by a player, the payoff is this percentage.

In particular, note that when the payoff functions take the value $0$ or $1$, there exists a limit (in the $n \to \infty$ sense) to the percentage of rounds to be won by a player, and in this case the payoff is this limit.

In the case of the strategy pair described above, for example, where bits are determined by independent, uniform-distribution coin tosses, the limit exists and the payoff is $1/2$ for both players, indicating that the game is not biased towards either. This is a Nash equilibrium of the game: neither player can ensure a higher payoff for herself as long as the other persists in the equilibrium strategy. The game has other Nash equilibria, but all share the same payoffs.

Above, we describe the players in the game as agents capable of randomisation: they choose a random bit at each new round. However, the game can be played, with the same strategies, also by deterministic agents. For this, consider every possible infinite bit-string as a possible strategy for each of the players. In this case, the game’s Nash equilibrium would be a strategy pair where each player draws a bit-string from a uniform distribution among all options.

We formalise this deterministic outlook on the matching pennies game as follows.

Definition 1 (Iterated Matching Pennies game).

An Iterated Matching Pennies game (or IMP), denoted $\mathrm{IMP}(C_=, C_{\neq})$, is a two-player game where each player chooses a language: Player “=” chooses $L_= \in C_=$ and Player “≠” chooses $L_{\neq} \in C_{\neq}$, where $C_=$ and $C_{\neq}$ are two collections of languages over the binary alphabet.

Where $C_= = C_{\neq} = C$, we denote the game $\mathrm{IMP}(C)$.

Define $x^{(0)}$ to be the empty string and define, for every natural $i$,

$$x^{(i)} = x^{(i-1)} \cdot \left( [x^{(i-1)} \in L_=] \oplus [x^{(i-1)} \in L_{\neq}] \right).$$

Then the payoffs $u_{\neq}$ and $u_=$ are as defined in (2) and (1), respectively. The notation “$\cdot$” indicates string concatenation.

Player (mixed) strategies in this game are described as distributions, $\mu_=$ and $\mu_{\neq}$, over $C_=$ and $C_{\neq}$, respectively. In this case, we define

$$u_=(\mu_=, \mu_{\neq}) = \mathbb{E}_{L_= \sim \mu_=,\, L_{\neq} \sim \mu_{\neq}}\left[ u_=(L_=, L_{\neq}) \right],$$

and similarly for $u_{\neq}$.

Note again that the game is zero sum: any pair of strategies, pure or mixed, satisfies

$$u_=(\mu_=, \mu_{\neq}) + u_{\neq}(\mu_=, \mu_{\neq}) = 1. \tag{3}$$

To better illustrate the dynamics embodied by Definition 1, let us add two more definitions: let

$$p_i = [x^{(i-1)} \in L_=] \tag{4}$$

and let

$$q_i = [x^{(i-1)} \in L_{\neq}], \tag{5}$$

noting that by Definition 1, $x_i = p_i \oplus q_i$, where “$\oplus$” denotes the exclusive or (“xor”) function.

The scenario encapsulated by the IMP game is that of a competition between two players, Player “=” and Player “≠”, where the strategy of the players is encoded in the form of the languages $L_=$ and $L_{\neq}$, respectively (or distributions over these in the case of mixed strategies).

After $i$ rounds, each player has visibility to the set of results so far. This is encoded by means of $x^{(i)}$, a word composed of the characters $x_1 \cdots x_i$, where each $x_j$ is $0$ if the bits that were output by the two players in round $j$ are equal and $1$ if they are not. It is based on this history that the players now generate a new bit: Player “=” generates $p_{i+1}$ and Player “≠” generates $q_{i+1}$. The players’ strategies are therefore functions from a word ($x^{(i)}$) to a bit ($p_{i+1}$ for Player “=”, $q_{i+1}$ for Player “≠”). To encode these strategies in the most general form, we use languages: $L_=$ and $L_{\neq}$ are simply sets containing all the words to which the response is “$1$”. Our choice of how weak or how strong a player can be is then ultimately in the question of what language family, $C_=$ or $C_{\neq}$, its strategy is chosen from.

Once $p_i$ and $q_i$ are determined, $x_i$ is simply their xor ($1$ if the bits differ, $0$ if they are the same), and in this way the definition generates the infinite list of $x_i$ that is ultimately used to compute the game’s overall payoff for each player.
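To make these dynamics concrete, here is a minimal Python sketch of the round loop. It is illustrative only: the two strategies are supplied as ordinary callables from a history word to a bit, which stands in for membership in $L_=$ and $L_{\neq}$.

    def play_imp(strat_eq, strat_neq, rounds):
        """strat_eq, strat_neq: history word such as "0110" -> bit (0 or 1).
        Returns the final history x^(rounds) and the fraction of rounds
        won by Player "="."""
        x, wins_eq = "", 0
        for _ in range(rounds):
            p = strat_eq(x)        # Player "="'s bit for this round
            q = strat_neq(x)       # Player "≠"'s bit for this round
            xi = p ^ q             # 1 iff the bits differ
            wins_eq += 1 - xi
            x += str(xi)           # both players see only this history
        return x, wins_eq / rounds

    # Example: "=" always plays 0; "≠" plays the parity of its past wins.
    hist, frac = play_imp(lambda h: 0, lambda h: h.count("1") % 2, 1000)
    print(frac)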

Were we to actually try to run a real-world IMP competition by directly implementing the definitions above, and were we to try to implement the Nash equilibrium player strategies, we would immediately run into two elements in the set-up that are incomputable: first, the choice of a uniform infinitely-long bit-string, our chosen distribution among the potential strategies, is incomputable (it is a choice among uncountably many elements); second, for a deterministic player (an agent) to output all the bits of an arbitrary (i.e., general) bit-string, that player cannot be a Turing machine. There are only countably many Turing machines, so only countably many bit-strings that can thus be output.

In this paper, we examine the IMP game with several choices for $C_=$ and $C_{\neq}$. The main case studied is where $C_= = C_{\neq} = \Sigma^0_1$. In this case, we still allow player mixed strategies to be incomputable distributions, but any $L_=$ and $L_{\neq}$ are languages accepted by TMs.

The set-up described here, where Iterated Matching Pennies is essentially described as a pursuit-evasion game, was initially introduced informally by Scriven [16] in order to prove that unpredictability is innate to humans. Lewis and Richardson [10], without explicitly mentioning Turing machines or any (equivalent) models of computation, reinvestigated the model and used it to refute Scriven’s claim, with a proof that hinges on the halting problem, but references it only implicitly.

The set-up was redeveloped independently by Dowe, first in the context of the avoider trying to choose the next number in an integer sequence to be larger (by one) than the (otherwise) best inference that one might expect [1, sec. 0.2.7, p. 545, col. 2 and footnote 211], and then, as in [16], in the context of predicting bits in a sequence [2, p. 455] [4, pp. 16–17]. Dowe was the first to introduce the terminology of TMs into the set-up. His aim was to elicit a paradox, which he dubbed “the elusive model paradox”, whose resolution relies on the undecidability of the halting problem. Thus, it would provide an alternative to the method of [21] to prove this undecidability. Variants of the elusive model paradox and of the “red herring sequence” (the optimal sequence to be used by an avoider) are discussed in [3, sec. 7.5], with the paradox also mentioned in [5, sec. 2.2] and [8, footnote 9].

Yet a third independent incarnation of the model was by Solomonoff, who discussed variants of the elusive model paradox and the red herring sequence in [19, Appendix B] and [18, sec. 3].

We note that the more formal investigations of Dowe and of Solomonoff were in contexts in which the “game” character of the set-up was not explored. Rather, the set-up was effectively a one-player game, where regardless of the player’s choice of next bit, the red herring sequence’s next bit was its complement. We, on the other hand, return to the original spirit of Scriven’s formulation, investigating the dynamics of the two player game, but do so in a formal setting.

Specifically, we investigate the question of which of the two players (if either) has an advantage in this game, and, in particular, we will be interested in the game’s Nash equilibria, which are the pairs of strategies $(\mu_=, \mu_{\neq})$ for which

$$u_=(\mu_=, \mu_{\neq}) = \sup_{\mu'_=} u_=(\mu'_=, \mu_{\neq})$$

and

$$u_{\neq}(\mu_=, \mu_{\neq}) = \sup_{\mu'_{\neq}} u_{\neq}(\mu_=, \mu'_{\neq}).$$

We define

$$\mathrm{maxmin}(C_=, C_{\neq}) = \sup_{\mu_=} \inf_{\mu_{\neq}} u_=(\mu_=, \mu_{\neq})$$

and

$$\mathrm{minmax}(C_=, C_{\neq}) = \inf_{\mu_{\neq}} \sup_{\mu_=} u_=(\mu_=, \mu_{\neq}),$$

where $\mu_=$ is a (potentially incomputable) distribution over $C_=$ and $\mu_{\neq}$ is a (potentially incomputable) distribution over $C_{\neq}$. Where $C_= = C_{\neq} = C$, we will abbreviate this to $\mathrm{maxmin}(C)$ and $\mathrm{minmax}(C)$.

A Nash equilibrium must satisfy

$$\sup_{\mu'_=} u_=(\mu'_=, \mu_{\neq}) + \sup_{\mu'_{\neq}} u_{\neq}(\mu_=, \mu'_{\neq}) = 1, \tag{6}$$

where, as before, $u_{\neq} = 1 - u_=$.

We note that while it may seem, at first glance, that the introduction of game dynamics into the problems of learnability and approximability inserts an unnecessary complication into their analysis, in fact, we will show that the ability to learn and/or approximate languages, when worded formally, involves a large number of interlocking “∀”, “∃”, “sup”, “inf” and “lim” clauses that are most naturally expressed in terms of minmax and maxmin solutions, Nash equilibria and mixed strategies.

3 Halting Turing machines

The IMP game serves as a natural platform for investigating adversarial learning: each of the players has the opportunity to learn from all previous rounds, extrapolate from this to the question of what algorithm their adversary is employing and then choose their own course of action to best counteract the adversary’s methods.

Furthermore, where $C_= = C_{\neq} = C$, IMP serves as a natural arena to differentiate between the learning of a language (e.g., one selected from R.E.) and its complement (e.g., a language selected from co-R.E.), because Player “=”, the copying player, is essentially trying to learn a language from $C$, namely that chosen by Player “≠”, whereas Player “≠” is attempting to learn a language from co-$C$, namely the complement to that chosen by Player “=”. Any advantage to Player “=” can be attributed solely to the difficulty to learn co-$C$ by an algorithm from $C$, as opposed to the ability to learn $C$.

To exemplify IMP analysis, consider first the game $\mathrm{IMP}(R)$, where $R$ is the set of decidable languages. Because decidable languages are a set known to be closed under complement, we expect Player “≠” to be equally as successful as Player “=” in this variation. Consider, therefore, what would be the Nash equilibria in this case.

Theorem 2.

Let $R$ be the set of decidable languages over the binary alphabet. The game $\mathrm{IMP}(R)$ does not have any Nash equilibria.

We remark here that most familiar and typically-studied games belong to a family of games where the space of mixed strategies is compact and convex, such as those having a finite number of pure strategies, and such games necessarily have at least one Nash equilibrium. However, the same is not true for arbitrary games. (For example, the game of “guess the highest number” does not have a Nash equilibrium.) IMP, specifically, does not belong to a game family that guarantees the existence of Nash equilibria.

Proof.

We begin by showing that for any (mixed) strategy $\mu_{\neq}$,

$$\sup_{\mu_=} u_=(\mu_=, \mu_{\neq}) = 1. \tag{7}$$

Let $T_1, T_2, \ldots$ be any (necessarily incomputable) enumeration over those Turing machines that halt on every input, and let $L_1, L_2, \ldots$ be the sequence of languages that is accepted by them. The sequence enumerates (with repetitions) over all languages in $R$. Under this enumeration we have

$$\lim_{n \to \infty} \mu_{\neq}(\{L_1, \ldots, L_n\}) = 1.$$

For this reason, for any $\epsilon > 0$ there exists an $n$ such that

$$\mu_{\neq}(\{L_1, \ldots, L_n\}) \ge 1 - \epsilon.$$

We devise a strategy, $A_n$, to be used by Player “=”. This strategy will be pure: the player will always choose language $L_{A_n}$, which we will now describe. The language $L_{A_n}$ is the one accepted by Algorithm 1.

1:function calculate bit($x$)
2:     $k \gets$ the number of “1”s in $x$. Number of prediction errors so far.
3:     if $k \ge n$ then
4:         Accept.
5:     else if $T_{k+1}$ accepts $x$ then
6:         Accept.
7:     else
8:         Reject.
9:     end if
10:end function
Algorithm 1 Algorithm for learning a mixed strategy

Note that while the enumeration $T_1, T_2, \ldots$ is not computable, Algorithm 1 only requires $T_1, \ldots, T_n$ to be accessible to it, and this can be done because any such finite set of TMs can be hard coded into Algorithm 1.

Consider the game, on the assumption that Player “≠”’s strategy is $L_j$ for some $j \le n$. After at most $j - 1$ prediction errors, Algorithm 1 will begin mimicking a strategy equivalent to $T_j$ and will win every round from that point on.
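This hypothesis-switching scheme is easy to express in code. The sketch below is illustrative only: `predictors` is a hypothetical stand-in for the hard-coded total machines $T_1, \ldots, T_n$, each supplied as a Python function from a history word to a bit.

    def make_algorithm1(predictors):
        """predictors: hard-coded total predictors standing in for the
        machines T_1, ..., T_n (each maps a history word to a bit)."""
        n = len(predictors)
        def strategy(history):
            k = history.count("1")          # prediction errors so far
            if k >= n:
                return 1                    # "Accept": arbitrary once the list is exhausted
            return predictors[k](history)   # mimic hypothesis T_{k+1}
        return strategy

    # Example with two hypotheses.
    strat = make_algorithm1([lambda h: 0, lambda h: len(h) % 2])

Each prediction error advances the error count by one, so the strategy walks through the hypothesis list until it reaches the opponent's true machine.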

We see, therefore, that for any $\epsilon > 0$ we have $\sup_{\mu_=} u_=(\mu_=, \mu_{\neq}) \ge 1 - \epsilon$, from which we conclude that $\sup_{\mu_=} u_=(\mu_=, \mu_{\neq}) = 1$ (or, equivalently, $\inf_{\mu_=} u_{\neq}(\mu_=, \mu_{\neq}) = 0$), in turn proving that for any Nash equilibrium we necessarily must have

$$u_=(\mu_=, \mu_{\neq}) = 1. \tag{8}$$

For exactly the symmetric reasons, when $C_= = C_{\neq} = R$ we also have, for any (mixed) strategy $\mu_=$, $\sup_{\mu_{\neq}} u_{\neq}(\mu_=, \mu_{\neq}) = 1$, and therefore at any Nash equilibrium

$$u_{\neq}(\mu_=, \mu_{\neq}) = 1. \tag{9}$$

Player “≠” can follow a strategy identical to that described in Algorithm 1, except reversing the condition in Step 5.

Because we now have that (8) and (9) must hold simultaneously, while by (3) the payoffs sum to $1$, we know that Equation (6) cannot be satisfied for any strategy pair. In particular, there are no Nash equilibria. ∎

This result is not restricted to $R$, the decidable languages, but extends to any set of languages that is powerful enough to encode Algorithm 1 and its complement. It is true, for example, for $\Delta^0_1$ as well as for the decidable languages with any set of Oracles, i.e., specifically, for any $\Delta^0_i$.

Definition 2.

We say that a collection of languages $C$ is adversarially learnable by a collection of strategies $S$ if $\mathrm{maxmin}(S, C) = 1$.

If a collection is adversarially learnable by $\Sigma^0_1$, we simply say that it is adversarially learnable.

Corollary 3.

$R$ is not adversarially learnable by $R$.

Proof.

As was shown in the proof of Theorem 2, $\mathrm{maxmin}(R) = 0$. ∎

We proceed, therefore, to the question of how well each player fares when the common collection of languages includes non-decidable R.E. languages, and is therefore no longer closed under complement.

4 Adversarial learning

We claim that R.E. languages are adversarially learnable, and that it is therefore not possible to learn the complement of R.E. languages in general, in the adversarial learning scenario.

Theorem 4.

The game $\mathrm{IMP}(\Sigma^0_1)$ has a strategy, $L_A$, for Player “=” that guarantees $u_=(L_A, L_{\neq}) = 1$ for all $L_{\neq} \in \Sigma^0_1$ (and, consequently, also for all distributions among potential candidates).

In particular, $\Sigma^0_1$ is adversarially learnable.

Proof.

We describe $L_A$ explicitly by means of an algorithm accepting it. This is given in Algorithm 2.

1:function calculate bit($x$)
2:     Let $T_1, T_2, \ldots$ be an enumeration over all Turing machines.
3:     $k \gets$ the number of “1”s in $x$. Number of prediction errors so far.
4:     Simulate $T_{k+1}$ on $x$.
5:end function
Algorithm 2 Algorithm for learning an R.E. language

Note that Algorithm 2 does not have any “Accept” or “Reject” statements. It returns a bit only if $T_{k+1}$ returns a bit and does not terminate if $T_{k+1}$ fails to terminate. To actually simulate $T_{k+1}$ and to encode the enumeration, Algorithm 2 can simply use a universal Turing machine, $U$, and define the enumeration in a way such that $T_k$ accepts the input “$x$” if and only if $U$ accepts the input “$\langle k \rangle x$”.

To show that Algorithm 2 cannot be countered, consider any R.E. language $L_{\neq}$ to be chosen by Player “≠”. This language, $L_{\neq}$, necessarily corresponds to the output of $T_j$ for some (finite) $j$. In total, Player “=” can lose at most $j - 1$ rounds. In every subsequent round, its output will be identical to that of $T_j$, and therefore identical to the bit chosen by Player “≠”. ∎

We see, therefore, that the complement of Algorithm 2’s language cannot be learned by any R.E. language. Player “≠” cannot hope to win more than a finite number of rounds.
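The universal-simulation strategy can be sketched in code if partiality is modelled explicitly. In the following illustration, `machine_at` is a hypothetical stand-in for the indexed universal simulation; nothing here decides the halting problem, and a diverging simulation simply makes the call diverge.

    def make_algorithm2(machine_at):
        """machine_at: index k -> a partial function (history -> bit),
        standing in for T_{k+1} as run by a universal machine. If the
        simulated machine never returns, the strategy itself never
        returns, which the IMP game reads as non-acceptance (bit 0)."""
        def strategy(history):
            k = history.count("1")          # prediction errors so far
            return machine_at(k)(history)   # may diverge, exactly like T_{k+1}
        return strategy

The crucial asymmetry is that non-termination is itself a usable answer for the pursuer: a word on which the simulation diverges is simply not in the pursuer's language.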

Note that these results do not necessitate that $C = \Sigma^0_1$, the R.E. languages. As long as $C$ is rich enough to allow implementing Algorithm 2, the results hold. This is true, for example, for sets that allow Oracle calls. In particular:

Corollary 5.

For all $i \ge 1$, $\Sigma^0_i$ is adversarially learnable by $\Sigma^0_i$ but not by $\Pi^0_i$; $\Pi^0_i$ is adversarially learnable by $\Pi^0_i$ but not by $\Sigma^0_i$.

Proof.

To show the learnability results, we use Algorithm 2. To show the non-learnability results, we appeal to the symmetric nature of the game: if Player “=” has a winning learning strategy, Player “≠” does not. ∎

5 Conventional learnability

To adapt the IMP game for the study of conventional (i.e., non-adversarial) learning and approximation, we introduce the notion of nonadaptive strategies.

Definition 3.

A nonadaptive strategy is a language, $L$, over the binary alphabet such that

$$\forall w_1, w_2: \quad |w_1| = |w_2| \implies (w_1 \in L \iff w_2 \in L),$$

where $|w|$ is the bit length of $w$.

Respective to an arbitrarily chosen (computable) enumeration over the complete language, we define the function $N$ such that, for any language $L$, $N(L)$ is the nonadaptive language such that

$$w \in N(L) \iff e_{|w|+1} \in L.$$

Furthermore, for any collection of languages, $C$, we define

$$N(C) = \{ N(L) : L \in C \}.$$

$N(C)$ is the nonadaptive application of $C$.

To elucidate this definition, consider once again a (computable) enumeration, $e = e_1, e_2, \ldots$, over the complete language.

In previous sections, we have analysed the case where the two competing strategies are adaptive (i.e., general). This was the case of adversarial learning. Modelling the conventional learning problem is simply done by restricting $C_{\neq}$ to nonadaptive strategies. The question of whether a strategy $L_=$ (or $\mu_=$) can learn $C$ is the question of whether it can learn adversarially $N(C)$. The reason this is so is because the bit output at any round by a nonadaptive strategy is independent of any response made by either player at any previous round: at each round $i+1$, $q_{i+1}$, the response of Player “≠”, as defined in (5), is a function of $x^{(i)}$, a word composed of exactly $i$ bits. Definition 3 now adds to this the restriction that the response must be invariant to the value of these bits and must depend only on the bit length, $i$, which is to say on the round number. Regardless of what the strategy of Player “=” is, the sequence output by Player “≠” will always remain the same. Thus, a nonadaptive strategy for Player “≠” is one where the player’s output is a predetermined, fixed string of bits, and it is this string that the opposing strategy of Player “=” must learn to mimic.
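In code, a nonadaptive strategy is simply one whose answer depends only on the length of the history. The sketch below illustrates the $N(\cdot)$ operator under an assumed enumeration; the enumeration and the membership predicate are hypothetical stand-ins.

    import itertools

    def words():  # e_1, e_2, ...: a stand-in enumeration of the complete language
        length = 0
        while True:
            for k in range(2 ** length):
                yield format(k, "0" + str(length) + "b") if length else ""
            length += 1

    def nonadaptive(in_l):
        """N(L): the response at round i+1 is [e_{i+1} in L], depending
        only on the length of the history, never on its content."""
        def strategy(history):
            word = next(itertools.islice(words(), len(history), None))
            return int(in_l(word))
        return strategy

    # Example: the nonadaptive strategy for L = words containing a "1".
    strat = nonadaptive(lambda w: "1" in w)

Played against such a strategy, the IMP game reduces to predicting the fixed characteristic sequence of $L$, which is exactly the conventional learning task.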

Note, furthermore, that if $NA$ is the set of all nonadaptive languages, then for every $i \ge 1$ we have

$$N(\Sigma^0_i) = \Sigma^0_i \cap NA \quad\text{and}\quad N(\Pi^0_i) = \Pi^0_i \cap NA. \tag{10}$$

The equality stems from the fact that calculating $N(L)$ from $L$ and vice versa (finding any $L$ that matches $N(L)$) is, by definition, recursive, so there is a reduction from any $L$ to $N(L)$ and back. If a language can be computed over the input by means of a certain nonempty set of quantifiers, no additional unbounded quantifiers are needed to compute it from the reduced input.

This leads us to Definition 4.

Definition 4.

We say that a collection of languages $C$ is (conventionally) learnable by a collection of strategies $S$ if $\mathrm{maxmin}(S, N(C)) = 1$.

If a collection is learnable by $\Sigma^0_1$, we simply say that it is learnable.

Corollary 6.

For all $i \ge 1$, $\Sigma^0_i$ is learnable by $\Sigma^0_i$. In particular, $\Sigma^0_1$ is learnable.

Proof.

We have already shown (Corollary 5) that $\Sigma^0_i$ is adversarially learnable by $\Sigma^0_i$, and $N(\Sigma^0_i)$ is a subset of $\Sigma^0_i$, as demonstrated by (10). ∎

Constraining Player “≠” to only be able to choose nonadaptive strategies can only lower the minmax value for Player “≠”. Because it is already at $0$, it makes no change: we are weakening the player that is already weaker. It is more rewarding to constrain Player “=” and to consider the game $\mathrm{IMP}(N(\Sigma^0_1), \Sigma^0_1)$. Note, however, that this is equivalent to the game $\mathrm{IMP}(\Sigma^0_1, N(\Pi^0_1))$ under role reversal.

Theorem 7.

$\Pi^0_1$ is learnable.

Proof.

To begin, let us consider a simpler scenario than was discussed so far. Specifically, we will consider a scenario in which the feedback available to the learning algorithm at each point is not only $x^{(i)}$, the information of which rounds it had “won” and which it had “lost”, but also $p_1 \cdots p_i$ and $q_1 \cdots q_i$, what the bit output by each machine was, at every step.² (² Because $x_i = p_i \oplus q_i$, using any two of these as input to the TM is equivalent to using all three, because the third can always be calculated from the others.)

In this scenario, Player “=” can calculate a co-R.E. function by calculating its complement in round $i$ and then reading the result as the complement to $p_i$, which is given to it in all later rounds.

For example, at round $i$ Player “=” may simulate a particular Turing machine, $T$, in order to test whether it halts. If it does halt, the player halts and accepts the input, but it may also continue indefinitely. The end effect is that if $T$ halts then $p_i = 1$ and otherwise it is $0$. At round $i+1$, Player “=” gets new inputs. (Recall that if one views the player as a Turing machine, it is effectively restarted at each round.) The new input in the real IMP game is $x^{(i)}$, but for the moment we are assuming a simpler version where the input is the pair of strings $(p_1 \cdots p_i,\ q_1 \cdots q_i)$. This being the case, though whether $T$ halts or not is in general not computable by a player, once a simulation of the type described here is run at round $i$, starting with round $i+1$ the answer is available to the player in the form of $p_i$, which forms part of its input.

More concretely, one algorithm employable by Player “=” against a known nonadaptive language $N(L)$, with $L$ co-R.E., is one that calculates “$e_{2i} \notin L$?” (which is an R.E. function) in every $(2i-1)$'th round, and then uses this information in the next round in order to make the correct prediction. This guarantees $u_= \ge 1/2$. However, it is possible to do better.

To demonstrate how, consider that Player “=” can determine the answer to the question “$|\{ j : a < j \le b,\ e_j \notin L \}| \ge m$?” for any chosen $a$, $b$ and $m$. The way to do this is to simulate simultaneously all Turing machine runs that calculate “$e_j \notin L$?” for each $a < j \le b$ and to halt if $m$ of them halt. As with the previous example, by performing this algorithm at any stage $i$, the algorithm will then be able to read out the result as $p_i$ in all later rounds.

Consider, now, that this ability can be used to determine $|\{ j : a < j \le b,\ e_j \notin L \}|$ exactly (rather than simply bounding it) by means of a binary search, starting with the question “$|\{ j : a < j \le b,\ e_j \notin L \}| \ge (b-a)/2$?” in the first round, and proceeding to increasingly finer determination of the actual set size on each later round. Player “=” can therefore determine the number of “$0$” bits in a set of $n$ outputs of a co-R.E. function in this way in only $\lceil \log_2(n+1) \rceil$ queries, after which the number will be written in binary form, from most significant bit to least significant bit, in its input. Once this cardinality has been determined, Player “=” can compute via a terminating computation the value of each of “$e_j \in L$?”: the player will simulate, in parallel, all machines, and will terminate the computation either when the desired bit value is found via a halting of the corresponding machine, or until the full cardinality of halting machines has been reached, at which point, if the desired bit is not among the machines that halted, then the player can safely conclude that its computation will never halt.
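The final step, turning a known halting count into a terminating decision procedure for every individual question, can be illustrated with step-bounded dovetailing. In this sketch the machines are modelled as Python generator factories, and `count` is presumed already known (e.g., obtained via the binary search described above); these are illustrative assumptions, not the paper's formal construction.

    def decide_halting(machines, count):
        """machines: generator factories standing in for the parallel runs;
        exactly `count` of them eventually halt. Dovetails all runs and
        stops once `count` halts are seen, at which point every remaining
        machine is known never to halt."""
        runs = [m() for m in machines]
        halted = [False] * len(runs)
        seen = 0
        while seen < count:
            for idx, run in enumerate(runs):
                if halted[idx]:
                    continue
                try:
                    next(run)               # advance this run by one step
                except StopIteration:
                    halted[idx] = True
                    seen += 1
        return halted

    def halts_after(n):                      # a machine that halts after n steps
        def gen():
            for _ in range(n):
                yield
        return gen

    def loops():                             # a machine that never halts
        def gen():
            while True:
                yield
        return gen

    print(decide_halting([halts_after(3), loops(), halts_after(1)], count=2))

Because the loop exits as soon as the known number of halts is observed, the procedure terminates even though some of the simulated machines do not.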

Let $n_1, n_2, \ldots$ be an arbitrary (computable) sequence with $\lim_{j \to \infty} n_j = \infty$. If Player “=” repeatedly uses $\lceil \log_2(n_{j+1}+1) \rceil$ bits (each time picking the next value in the sequence) of its own output in order to determine Player “≠”’s next $n_{j+1}$ bits, the proportion of bits determined correctly by this will approach $1$.

However, the actual problem at hand is one where Player “=” does not have access to its own output bits, $p_i$. Rather, it can only see $x_i$, the exclusive or (xor) values of its bits and those of Player “≠”. To deal with this situation, we use a variation over the strategy described above.

First, for convenience, assume that Player “=” knows the first $n_1$ bits to be output by Player “≠”. Knowing Player “≠”’s bits and having visibility as to whether they are the same or different to Player “=”’s bits give, together, Player “=” access to its own past bits.

Now, it can use these first $n_1$ bits in order to encode, as before, the cardinality of the next $n_2$ bits, and by this also their individual values (as was demonstrated previously with the calculation of “$e_j \in L$?”). This now gives Player “=” the ability to win every one of the next $n_2$ rounds. However, instead of utilising this ability to the limit, Player “=” will only choose to win the next $n_2 - \lceil \log_2(n_3+1) \rceil$, leaving the remaining $\lceil \log_2(n_3+1) \rceil$ bits free to be used for encoding the cardinality of the next $n_3$. This strategy can be continued to all $n_j$. The full list of criteria required of the sequence for this construction to work and to ultimately lead to $u_= = 1$ is:

  1. $n_j \ge \lceil \log_2(n_{j+1}+1) \rceil$ for all $j$.

  2. $\lim_{j \to \infty} n_j = \infty$.

  3. $\lim_{j \to \infty} \dfrac{\sum_{i \le j} \lceil \log_2(n_{i+1}+1) \rceil}{\sum_{i \le j} n_i} = 0$.

A sequence satisfying all these criteria can easily be found, e.g. $n_j = 2^{j+1}$.
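As a quick numeric sanity check (relative to the criteria as reconstructed above, whose exact form is a presentational choice), one can verify in a few lines that for $n_j = 2^{j+1}$ the encoding overhead has vanishing density:

    from math import ceil, log2

    n = [2 ** (j + 1) for j in range(1, 31)]              # n_j = 2^(j+1)
    overhead = [ceil(log2(n[j + 1] + 1)) for j in range(len(n) - 1)]
    assert all(n[j] >= overhead[j] for j in range(len(overhead)))
    for j in (5, 15, 25):
        print(sum(overhead[:j]) / sum(n[:j]))             # density of sacrificed rounds -> 0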

Two problems remain to be solved: (1) How to determine the value of the first $n_1$ bits, and (2) how to deal with the fact that $L_{\neq}$ is not known.

We begin by tackling the second of these problems. Because $L_{\neq}$ is not known, we utilise a strategy of enumerating over the possible languages, similar to what is done in Algorithm 2. That is to say, we begin by assuming that the target is co-$L_1$ and respond accordingly. Then, if we detect that the responses from Player “≠” do not match those of co-$L_1$ we progress to assume that the target is co-$L_2$, etc.. We are not always in a position to tell if our current hypothesis of $L_{\neq}$ is correct, but we can verify that it matches at least the first $n_j$ bits of each set. If Player “=” makes any incorrect predictions during any of these rounds, it can progress to the next hypothesis. We note that it is true that Player “=” can remain mistaken about the identity of $L_{\neq}$ forever, as long as $L_{\neq}$ is such that the first predictions of every set are correct, but because these correct predictions alone are enough to ensure $u_= = 1$, the question of whether the correct $L_{\neq}$ is ultimately found or not is moot.

To tackle the remaining problem, that of determining the first $n_1$ bits in order to bootstrap the process, we make use of mixed strategies.

Consider a mixed strategy involving probability $2^{-n_1}$ for each of $2^{n_1}$ strategies, differing only by the bits they assign as the first $n_1$ bits for each language in order to bootstrap the learning process. If the target is co-$L_1$, of the $2^{n_1}$ strategies one will make the correct guess regarding the first input bits, after which that strategy can ensure $u_= = 1$. However, note that, if implemented as described so far, this is not the case for any other $L_{\neq}$. Suppose, for example, that the target is co-$L_2$. All strategies begin by assuming, falsely, that the target is co-$L_1$, and all may discover later on that this assumption is incorrect, but they may do so at different rounds. Because of this, a counter-strategy can be designed to fool all $2^{n_1}$ learner strategies.

To avoid this pitfall, all strategies must use the same bit positions in order to bootstrap learning for each hypothesis, so these bit positions must be pre-allocated. We will use $n_1$ bits in order to bootstrap the $j$'th hypothesis, at positions fixed by some known, computable allocation, regardless of whether the hypothesis is known to require checking before these rounds, after, or not at all. The full set of rounds pre-allocated in this way still has only density zero among the integers, so even without a win for Player “=” in any of these rounds its final payoff remains $1$.

Suppose, now, that hypothesis $j$ is still not the assumption currently being verified (or falsified) at its pre-allocated rounds. The Hamming weight (number of “$1$”s) of which bits should be encoded by Player “=” in these rounds’ bits? To solve this, we will pre-allocate to each hypothesis an infinite number of bit positions, which, altogether for all hypotheses, still amount to a set of density $0$ among the integers. The hypothesis will continuously predict the values of this pre-allocated infinite sequence of bits until it becomes the “active” assumption. If and when it does, it will expand its predictions to all remaining bit positions.

This combination of $2^{n_1}$ strategies, of which one guarantees a payoff of $1$, therefore guarantees in total an expected payoff of at least $2^{-n_1}$. We want to show, however, that $\mathrm{maxmin}(\Sigma^0_1, N(\Pi^0_1)) = 1$. To raise the guarantee from $2^{-n_1}$ to $1$, we describe a sequence of mixed strategies for which the expected payoff for Player “=” converges to $1$.

The $m$'th element in the sequence of mixed strategies will be composed of $2^{m n_1}$ equal probability pure strategies. The strategies will follow the algorithm so far, but instead of moving from the hypothesis co-$L_j$ to co-$L_{j+1}$ after a single failed attempt (which may be due to incorrect bootstrap bits), the algorithm will try each language $m$ times. In total, it will guess at most $m n_1$ bits for each language, which are the $m n_1$ bits defining the strategy.

This strategy ensures a payoff of at least $1 - (1 - 2^{-n_1})^m$, so converges to $1$, as desired, for an asymptotically large $m$.

The full algorithm is described in Algorithm 3. It uses the function triangle, defined as follows: let

$$T_k = \frac{k(k+1)}{2}$$

and

$$\mathrm{triangle}(n) = n - T_k, \quad\text{where } k \text{ is the largest integer with } T_k \le n. \tag{11}$$

The value of $\mathrm{triangle}(n)$ for $n = 0, 1, 2, \ldots$ equals

$$0,\ 0, 1,\ 0, 1, 2,\ 0, 1, 2, 3,\ \ldots,$$

describing a triangular walk through the nonnegative integers.
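A direct transcription of this walk into code (under the reconstruction of (11) above):

    def triangle(n):
        """Walk 0; 0,1; 0,1,2; ... through the nonnegative integers."""
        k = 0
        while (k + 1) * (k + 2) // 2 <= n:
            k += 1                    # k = largest k with k(k+1)/2 <= n
        return n - k * (k + 1) // 2

    print([triangle(n) for n in range(10)])   # [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]

The walk visits every nonnegative integer infinitely often, which is what allows each hypothesis to be revisited an unbounded number of times.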

The algorithm is divided into two stages. In Step 1, the algorithm simulates its actions in all previous rounds, but without simulating any (potentially non-halting) Turing machine associated with any hypothesis. The purpose of this step is to determine which hypothesis (choice of Turing machine and bootstrapping) is to be used for predicting the next bit. Once the hypothesis is determined, Step 2 once again simulates all previous rounds, only this time simulating the chosen hypothesis wherever it is the active hypothesis. In this way, the next bit predicted by the hypothesis can be determined.

The specific sequence used in Algorithm 3 is $n_j = 2^{j+1}$ (which was previously mentioned as an example of a sequence satisfying all necessary criteria).

Algorithm 3 Algorithm for learning any co-R.E. language

1: The strategy is a uniform mixture of algorithms.
2: We describe the ’th algorithm.
3:function calculate bit()
4:      length of The round number. Let .
5:      Step 1: Identify , the current hypothesis.
6:     
7:      A set managing which positions are predicted by which hypothesis.
8:     for  do
9:         if  such that  then
10:              Let be as above.
11:               hypothesis number.
12:               predicted positions.
13:               next positions to be predicted.
14:              Let be such that . We only construct that have such an .
15:         else if  such that , and  then
16:               First bootstrap bit for hypothesis .
17:              Let be as above.
18:              
19:              
20:              
21:              
22:         else if  then Unusable bits.
23:              Accept input. Arbitrary choice.
24:         else
25:              Next .
26:         end if
27:         
28:         if  then
29:               These bits are predicted accurately for the correct hypothesis.
30:              if  and  then
31:                   Incorrect prediction, so hypothesis is false.
32:                  
33:                  
34:              end if
35:         else if  then Bits with are used to encode next bit counts.
36:               New positions to predict on.
37:              
38:              while  do
39:                  
40:                  if  such that , and or and such that , , ,  then
41:                       ” is the minimum nonnegative integer not appearing in .
42:                       
43:                  end if
44:              end while
45:              
46:         end if
47:     end for
48:      Step 2: Predict, assuming .
49:     
50:     
51:      is the machine to be simulated. .
52:      The try number of this machine.
53:     
54:     
55:     for  do
56:         if  and  then
57:              Let be such that .
58:              
59:              if  then
60:                   Number of ’s in .
61:              end if
62:              if  then
63:                  if  then
64:                       if  then
65:                           Accept input.
66:                       else
67:                           Reject input.
68:                       end if
69:                  end if
70:              else if  then
71:                  Simulate simultaneously on all inputs in until are accepted.
72:                   If this simulation does not terminate, this is a rejection of the input.
73:                  Accept input.
74:              else
75:                  if  then Previous simulation terminated.
76:                        Binary search.
77:                  end if
78:                  if  then counter holds the number of terminations in .
79:                       Simulate simultaneously on all inputs in until counter are accepted. Guaranteed to halt, if hypothesis is correct.
80:                       Let be on all that terminated, otherwise.
81:                  end if
82:              end if
83:         end if
84:     end for
85:end function

Some corollaries follow immediately.

Corollary 8.

There exists a probabilistic Turing machine that is able to learn any language in $\Sigma^0_1 \cup \Pi^0_1$ with probability $1$.

Proof.

Instead of using a mixed strategy, it is possible to use probabilistic Turing machines in order to generate the guessed bits that bootstrap each hypothesis. In this case, there is neither a need for a mixed strategy nor a need to consider asymptotic limits: a single probabilistic Turing machine can perform a triangular walk over the hypotheses for $L_{\neq}$, investigating each option an unbounded number of times. The probability that for the correct $L_{\neq}$ at least one bootstrap guess will be correct in this way equals $1$.

The method for doing this is essentially the same as was described before. The only caveat is that because the probabilistic TM is re-initialised at each round and because it needs, as part of the algorithm, to simulate its actions in all previous rounds, the TM must have a way to store its random choices, so as to make them accessible in all later rounds.

The way to do this is to extend the hypothesis “bootstrap” phase from $n_1$ bits to $2 n_1$ bits. In each of the first $n_1$ bits, the TM outputs a uniform random bit. The bit available to it in all future rounds is then this random bit xor the output of Player “≠”. It is therefore also a uniform random bit. In this way, in all future rounds the TM has access to these consistent random bits. It can then use these in the second set of $n_1$ bootstrap bits as was done with the guessed bits in the deterministic set-up. ∎
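The storage trick itself is just xor algebra, as the following two-line illustration shows (purely a toy; `q` stands in for the opponent's unknown bit at a bootstrap position):

    import random

    q = random.randint(0, 1)   # Player "≠"'s bit at a bootstrap position
    p = random.randint(0, 1)   # the TM's fresh coin flip, output as its own bit
    x = p ^ q                  # the only value visible in later rounds
    # x = p xor q is itself uniform, and it is reproducible: in any later
    # round the (restarted) TM reads the same x from its input, so x can
    # serve directly as the stored random bit.
    print(x)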

We note, as before, that the construction described continues to hold, and therefore the results remain true, even if Oracles that are accessible to both players are allowed, and, in particular, the results hold for any $\Sigma^0_i$ and $\Pi^0_i$ with $i \ge 1$:

Corollary 9.

For all $i \ge 1$, $\Pi^0_i$ is learnable by $\Sigma^0_i$.

Furthermore:

Corollary 10.

For all $i \ge 1$, the collection of languages learnable by $\Sigma^0_i$ is a strict superset of $\Sigma^0_i \cup \Pi^0_i$.

Proof.

We have already shown that $\Sigma^0_i$ and $\Pi^0_i$ are both learnable by $\Sigma^0_i$. Adding the $\Sigma^0_i$ languages as additional hypotheses to Algorithm 3 we can see that the set $\Sigma^0_i \cup \Pi^0_i$ is also learnable.

To give one example of a family of languages beyond this set which is also learnable by $\Sigma^0_i$, consider the following. Let $D_k$, for a fixed $k$, be the set of languages recognisable by a Turing machine which can make at most $k$ calls to a $\Sigma^0_i$ Oracle.

This set contains $\Sigma^0_i$ and $\Pi^0_i$, but it also contains, for example, the xor of any two languages in $\Sigma^0_i$, which is outside of $\Sigma^0_i \cup \Pi^0_i$, and therefore strictly beyond the $i$'th level of the arithmetic hierarchy.

We will adapt Algorithm 3 to learn $D_k$. The core of Algorithm 3 is its ability to use $\lceil \log_2(n_{j+1}+1) \rceil$ bits of its own output in order to predict $n_{j+1}$ bits. We will, instead, use $k \lceil \log_2(n_{j+1}+1) \rceil$ bits in order to predict the same amount. Specifically, we will use the first $\lceil \log_2(n_{j+1}+1) \rceil$ bits in order to predict the result of the first Oracle call in each of the predicted positions, the next $\lceil \log_2(n_{j+1}+1) \rceil$ bits in order to predict the second Oracle call in each of the predicted positions, and so on.

In total, for this to work, all we need is to replace criterion 1 in our list of criteria for the sequence with the new criterion

$$n_j \ge k \lceil \log_2(n_{j+1}+1) \rceil.$$

An example of such a sequence is $n_j = 2^{k(j+1)}$. ∎

In fact, Algorithm 3 can be extended even beyond what was described in the proof to Corollary 10. For example, instead of using a constant $k$, it is possible to adapt the algorithm to languages that use $k(i)$ Oracle calls at the $i$'th round, for a sufficiently low-complexity $k(\cdot)$, by similar methods.

Altogether, it seems that R.E. learning is significantly more powerful than being able to learn merely the first level of the arithmetic hierarchy, but we do not know whether it can learn every language in $\Delta^0_2$. Indeed, we have no theoretical result that implies R.E. learning cannot be even more powerful than the second level of the arithmetic hierarchy.

A follow-up question which may be asked at this point is whether it was necessary to use a mixed strategy, as was used in the proof of Theorem 7, or whether a pure strategy could have been designed to do the same.

In fact, no pure strategy would have sufficed:

Lemma 10.1.

For all $L_= \in \Sigma^0_1$,

$$\inf_{L_{\neq} \in N(\Pi^0_1)} u_=(L_=, L_{\neq}) = 0.$$

This result is most interesting in the context of Corollary 8, because it describes a concrete task that is accomplishable by a probabilistic Turing machine but not by a deterministic Turing machine.

Proof.

We devise for each $L_=$ a specific antidote. The main difficulty in doing this is that we cannot choose, as before, $L_{\neq} = \overline{L_=}$, because $L_{\neq}$ is now restricted to be nonadaptive, whereas $L_=$ is general.

However, consider the $L_{\neq}$ such that its bit for round $i$ is the complement of $L_=$’s response on $x^{(i-1)}$. This is a nonadaptive strategy, but it ensures that $x_i$ will be $1$ for every $i$. Effectively, $L_{\neq}$ describes $L_=$’s “red herring sequence”. ∎
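Because both strategies in the lemma are deterministic, the antidote's fixed bit string can be generated by simulating the whole run. A sketch (with `pursuer` a history-to-bit callable standing in for $L_=$):

    def red_herring(pursuer, rounds):
        """pursuer: history -> bit, a stand-in for the pure strategy L_=.
        Returns the predetermined bit string of the nonadaptive antidote:
        at every round it plays the complement of the pursuer's response,
        so the bits always differ and the history is the all-ones word."""
        bits, history = [], ""
        for _ in range(rounds):
            bits.append(1 - pursuer(history))
            history += "1"                      # x_i = 1: the bits differed
        return bits

    print(red_herring(lambda h: len(h) % 2, 8))

Note that the only incomputable ingredient is complementing the pursuer's (R.E.) responses, which is exactly why the antidote lives in $N(\Pi^0_1)$ rather than $N(\Sigma^0_1)$.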

6 Approximability

When both players’ strategies are restricted to be nonadaptive, they have no means of learning each other’s behaviours: determining whether their next output bit will be $0$ or $1$ is done solely based on the present round number, not on any previous outputs. The output of the game is therefore solely determined by the dissimilarity of the two independently-chosen output strings.

Definition 5.

We say that a collection of languages $C$ is approximable by a collection of strategies $S$ if $\mathrm{maxmin}(N(S), N(C)) > 0$.

If a collection is approximable by $\Sigma^0_1$, we simply say that it is approximable.

In this context it is clear that for any pure strategy $L_{\neq} \in N(C)$,

$$\sup_{L_= \in N(C)} u_=(L_=, L_{\neq}) = 1,$$

because $L_=$ can always be chosen to equal $L_{\neq}$, but unlike in the case of adversarial learning, here mixed strategies do make a difference.

Though we do not know exactly what the value of $\mathrm{maxmin}(N(\Sigma^0_1))$ is, we do know the following.

Lemma 10.2.

If $\mu_=$ and $\mu_{\neq}$ are mixed strategies from $N(\Sigma^0_1)$, then

$$\sup_{\mu_{\neq}} \inf_{\mu_=} u_{\neq}(\mu_=, \mu_{\neq}) \ge \frac{1}{2} \tag{12}$$

and

$$\inf_{\mu_=} \sup_{\mu_{\neq}} u_{\neq}(\mu_=, \mu_{\neq}) \ge \frac{1}{2}, \tag{13}$$

where $u_{\neq}$ is as in the definition of the IMP game.

In other words, Player “≠” can always at the very least break even, from a maxmin as well as a minmax perspective.

Proof.

Let $\mu_{\neq}$ be a mixture of the following two strategies: all zeros (the empty language), with probability $1/2$; all ones (the complete language), with probability $1/2$. By the triangle inequality, we have that for any language $L$,

$$D(L, \emptyset) + D(L, \{0,1\}^*) \ge D(\emptyset, \{0,1\}^*) = 1.$$

For nonadaptive strategies $u_{\neq}$ is exactly the dissimilarity between the two output strings, so the expected payoff of this mixture is at least $1/2$ against any pure $L_=$, and because this is true for each $L_=$, it is also true in expectation over all $\mu_=$. The fact that $\mu_{\neq}$ is independent of $\mu_=$ in the construction means that this bound is applicable for both (12) and (13). ∎

Just as interesting (and with tighter results) is the investigation of $\mathrm{IMP}(N(\Sigma^0_1), N(\Pi^0_1))$. We show

Lemma 10.3.
$$\mathrm{maxmin}(N(\Sigma^0_1), N(\Pi^0_1)) = \mathrm{minmax}(N(\Sigma^0_1), N(\Pi^0_1)) = 0, \tag{14}$$

where maxmin and minmax are as in the definition of the IMP game.

Proof.

Let triangle be as in (11), and let $m(i)$ be the maximum integer, $m$, such that $m! \le i$.

The language $L_B$ will be defined by

$$w \in L_B \iff e_{|w|+1} \in L_{\mathrm{triangle}(m(|w|))+1},$$

where $L_1, L_2, \ldots$ is an enumeration over all R.E. languages. Note that $L_B$ is itself R.E., and that its bitwise complement is a nonadaptive co-R.E. strategy, i.e. an element of $N(\Pi^0_1)$.

To prove that the claim holds for any $L_= = N(L_j) \in N(\Sigma^0_1)$, let us first join the rounds into “super-rounds”, this being the partition of the rounds set according to the value of $m(\cdot)$. At each super-round, $\mathrm{triangle}(m)$ equals a specific value, $j - 1$, and by the end of the super-round, a total of $1 - 1/m$ of the total rounds will have been rounds in which $\mathrm{triangle}(m)$ equals this $j - 1$. Hence, at the end of each super-round dedicated to $j$, the Hamming distance between the two (the number of differences) between $L_B$’s output string and $L_j$’s characteristic string is at most $1/m$ of the string lengths. Because each choice of $j$ repeats an infinite number of times, the liminf of this proportion is $0$. Consequently, against the complement of $L_B$, the proportion of rounds in which $L_=$ differs from its opponent has limit superior $1$, so $u_= = 0$ for every $L_= \in N(\Sigma^0_1)$, and hence for every mixed $\mu_=$ as well. ∎

With this lemma, we can now prove Theorem 1.

Proof.

The theorem is a direct corollary of the proof of Lemma 10.3, because the complement of the language $L_B$ that was constructed in the proof to attain the infimum can be used as $L$: every R.E. language matches $L_B$ arbitrarily closely infinitely often, and therefore lies at dissimilarity $1$ from its complement. ∎

Combining Lemma 10.2 and 10.3 with the definition of the payoff function in (1), we get, in total:

Corollary 11.

$$\mathrm{maxmin}(N(\Sigma^0_1)) \le \mathrm{minmax}(N(\Sigma^0_1)) \le \frac{1}{2}$$

and

$$\mathrm{maxmin}(N(\Sigma^0_1), N(\Pi^0_1)) = \mathrm{minmax}(N(\Sigma^0_1), N(\Pi^0_1)) = 0.$$

Though we have the exact value of neither maxmin nor minmax in this case, we do see that the case is somewhat unusual in that neither player has a decisive advantage.

7 Conclusions and further research

We have introduced the IMP game as an arena within which to test the ability of algorithms to learn and be learnt, and specifically investigated three scenarios:

Adversarial learning,

where both algorithms are simultaneously trying to learn each other by observations.

Non-adversarial (conventional) learning,

where an algorithm is trying to learn a language by examples.

Approximation,

where languages (or language distributions) try to mimic each other without having any visibility to their opponent’s actions.

In the case of adversarial learning, we have shown that $\Sigma^0_1$ can learn $\Sigma^0_1$ but not $\Pi^0_1$.

In conventional learning, however, we have shown that $\Sigma^0_1$ can learn $\Sigma^0_1 \cup \Pi^0_1$, and beyond into the second level of the arithmetic hierarchy, but this learnability is yet to be upper-bounded. Our conjecture is that the class of learnable languages is strictly a subset of $\Delta^0_2$. If so, then this defines a new class of languages between the first and second levels of the arithmetic hierarchy, and, indeed, between any consecutive levels of it.

Regarding approximability, we have shown that (unlike in the previous results) no side has the absolute upper hand in the game, with the game value for Player “≠”, if it exists, lying somewhere between $1/2$ and $1$. We do not know, however, whether the game is completely unbiased or not.

An investigation of adversarial learning in the context of recursive languages was given as a demonstration of the fact that in IMP it may be the case that no Nash equilibrium exists at all, and pure-strategy learning was given as a concrete example of a task where probabilistic Turing machines have a provable advantage over deterministic ones.

References

  • [1] D.L. Dowe. Foreword re C. S. Wallace. Computer Journal, 51(5):523–560, September 2008. Christopher Stewart WALLACE (1933-2004) memorial special issue.
  • [2] D.L. Dowe. Minimum Message Length and statistically consistent invariant (objective?) Bayesian probabilistic inference – from (medical) “evidence”. Social Epistemology, 22(4):433–460, Oct–Dec 2008.
  • [3] D.L. Dowe. MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness. In P.S. Bandyopadhyay and M.R. Forster, editors, Handbook of the Philosophy of Science – Volume 7: Philosophy of Statistics, pages 901–982. Elsevier, 2011.
  • [4] D.L. Dowe. Introduction to Ray Solomonoff 85th Memorial Conference. In Proceedings of Solomonoff 85th Memorial Conference – Lecture Notes in Artificial Intelligence (LNAI), volume 7070, pages 1–36. Springer, 2013.
  • [5] D.L. Dowe, J. Hernández-Orallo, and P.K. Das. Compression and intelligence: Social environments and communication. In AGI: 4th Conference on Artificial General Intelligence – Lecture Notes in Artificial Intelligence (LNAI), pages 204–211, 2011.
  • [6] G.W. Flake. The Computational Beauty of Nature: Computer Explorations of Fractals, Chaos, Complex Systems, and Adaptation. A Bradford Book, MIT Press, Cambridge, Massachusetts, 1998.
  • [7] E.M. Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.
  • [8] J. Hernández-Orallo, D.L. Dowe, S. España-Cubillo, M.V. Hernández-Lloreda, and J. Insa-Cabrera. On more realistic environment distributions for defining, evaluating and developing intelligence. In AGI: 4th Conference on Artificial General Intelligence – Lecture Notes in Artificial Intelligence (LNAI), volume 6830, pages 82–91. Springer, 2011.
  • [9] Ling Huang, Anthony D. Joseph, Blaine Nelson, Benjamin I.P. Rubinstein, and J. D. Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec ’11, pages 43–58, New York, NY, USA, 2011. ACM.
  • [10] D.K. Lewis and J.S. Richardson. Scriven on human unpredictability. Philosophical Studies: An International Journal for Philosophy in the Analytic Tradition, 17(5):69–74, October 1966.
  • [11] Wei Liu and Sanjay Chawla. A game theoretical model for adversarial learning. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops (ICDMW 2009), pages 25–30. IEEE, 2009.
  • [12] Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pages 641–647, New York, NY, USA, 2005. ACM.
  • [13] J. Nash. Non-cooperative Games. The Annals of Mathematics, 54(2):286–295, 1951.
  • [14] J.v. Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 1944.
  • [15] Hartley Rogers, Jr. Theory of Recursive Functions and Effective Computability. MIT Press, Cambridge, MA, second edition, 1987.
  • [16] M. Scriven. An essential unpredictability in human behavior. In B.B. Wolman and E. Nagel, editors, Scientific Psychology: Principles and Approaches, pages 411–425. Basic Books (Perseus Books), 1965.
  • [17] R.J. Solomonoff. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transaction on Information Theory, IT-24(4):422–432, 1978.
  • [18] R.J. Solomonoff. Algorithmic probability: Theory and applications. In F. Emmert-Streib and M. Dehmer, editors, Information Theory and Statistical Learning, pages 1–23. Springer, New York, NY, USA, 2009.
  • [19] R.J. Solomonoff. Algorithmic probability, heuristic programming and AGI. In Proceedings of the Third Conference on Artificial General Intelligence, AGI 2010, pages 251–257, Lugano, Switzerland, March 2010. IDSIA.
  • [20] R.J. Solomonoff. Algorithmic probability – its discovery – its properties and application to strong AI. In H. Zenil, editor, Randomness Through Computation: Some Answers, More Questions, pages 1–23. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2011.
  • [21] A.M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proc. London Math. Soc., 42:230–265, 1936.
  • [22] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.