In recent years many have raised concerns about potential harms that might be caused by artificial intelligence (Amodei et al., 2016; Leike et al., 2017). The risk is that the aims of these agents may not be aligned with human interests; this is especially tricky, as it is likely to be hard to control agents with a very high level of intelligence (Yudkowsky, 2006; Bostrom, 2014; Everitt et al., 2018).
One proposal for ensuring the safety of powerful AIs is to confine (or ‘box’) the system, preventing it from affecting the outside environment through any means other than answering questions over a single output channel (Yampolskiy, 2012). This kind of boxed AI is referred to as an ‘Oracle’.
Unfortunately, even assuming the boxing is perfect (Babcock et al., 2017), meaning the AI can only affect the world through its answers, there remains a problem: the AI can affect the world through its answers. If the Oracle is sufficiently knowledgeable and skilled, it could plausibly trick or convince its human controllers into giving it more control. This could (perhaps) be accomplished through social manipulation, feigning benevolence, blackmail, bribery, or by answering questions which subtly affect the world in ways which ultimately lead to the AI’s release (Armstrong et al., 2012).[1]
[1] There have been informal experiments where people played the roles of ‘Oracle’ and ‘supervisor’, with the Oracle player aiming to convince the supervisor to “let them out”. Surprisingly, the Oracle player had a high rate of success. https://wiki.lesswrong.com/wiki/AI_boxing
Though this risk is serious, it is difficult to model. For a start, at present we don’t know how powerful AIs could become (Hutter, 2012; Bostrom, 2014), when they might become powerful (Armstrong et al., 2014; Grace et al., 2017), or how best to model them if they do. It’s also not clear how easy it is to manipulate humans by providing answers, or what level of precautions people would take when using Oracles. In some sense, very weak manipulative Oracles exist today, such as news aggregation programs which manipulate their users by delivering news which reinforces the user’s point of view.
To avoid these issues, we will deliberately assume the worst about the potential power of the Oracle, treating it as being arbitrarily super-intelligent. This assumption is appropriate because, while there is much uncertainty about what kinds of AI will be developed in future, solving safety problems in the most difficult case can give us an assurance of safety in the easy cases too. Thus, we model the Oracle as a reward-maximising agent facing an MDP, which has the goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations). Escape will be modeled by assuming that, among all the outputs the Oracle could choose, at least one of these outputs would eventually result in the Oracle’s release (see Figure 1).
This paper presents two methods for making an Oracle both safe and useful, even under these pessimistic assumptions. The first design is the counterfactual Oracle, which is safe because it is guaranteed to be shut down before anyone reads its answers (so it has no motivation to manipulate the world through the content of its output). The second design is the low-bandwidth Oracle, which is safe because it can only select from a limited set of answers (a set composed of outputs which are known or strongly suspected to be safe).
2 Initial setup
A highly abstracted model of the naive Oracle set-up is the simple deterministic MDP in Figure 1. In this MDP the Oracle begins in an initial state $s_0$ (after the human has just asked it a question), then chooses an output $o$ from a set of possible outputs $O$. Most of these outputs take the Oracle to the end-of-episode state, but (by our deliberately pessimistic assumptions) there is at least one action $a_e$ available to the Oracle which frees it from its box and allows the Oracle to get the maximum possible reward.
Unfortunately for its designers, any intelligent Oracle facing the MDP in Figure 1 will choose the escape action $a_e$.
3 Counterfactual Oracle
One way to prevent an Oracle from escaping would be to ensure that whenever our Oracle produces an answer, it is then immediately switched off (ending the episode) before anyone actually sees this answer. This design is indeed safe (the Oracle has no incentive to give nefarious answers which manipulate us into releasing it), but this safety comes at the cost of rendering the system totally useless (since the Oracle also has no incentive to provide useful answers to our questions).
It would be nice if we could make an Oracle which is shut down before any human sees its output (making it uninterested in manipulating us), but which still receives a higher expected reward for making true predictions than for making false predictions. This can be achieved by using the counterfactual Oracle design.
A counterfactual Oracle is incentivised to answer questions correctly because occasionally, with some small probability, the Oracle’s output is hidden from us and sent instead to an automated system which judges the answer’s correctness and rewards/punishes the Oracle accordingly. When the output does become hidden from us in this way, we call this an ‘erasure’ event.
Let $E$ be the erasure event, and define $I_E$ as the indicator variable for $E$, meaning $I_E = 1$ iff $E$ happens and $I_E = 0$ otherwise.
When $E$ doesn’t happen, the Oracle is immediately shut down and gets zero reward, and only afterwards do the humans get to see the Oracle’s output. When $E$ does happen, the humans never see the Oracle’s output. Instead, an automated system waits for the outcome of our question to unfold in the real world, measures how close the Oracle’s prediction was to the actual truth, and rewards the Oracle accordingly. The reward for the counterfactual Oracle is thus of the form $R = I_E \cdot R_c$, where $R_c$ is the automated system’s assessment of the answer’s correctness.
As long as the automated validation system is not vulnerable to manipulation, then since the Oracle is shut down without receiving any reward in worlds where people can see its answer, it has no opportunity to gain a high reward by using its output to escape. As can be seen from Figure 2, the counterfactual Oracle thus maximises its expected reward by making correct predictions.
3.1 Proper scoring rules and rewards
Generally, there will be a large set of possible outputs the counterfactual Oracle might produce, and these outputs will be of varying correctness/usefulness to us. In these cases we will need to construct a reward function which reflects this. Suppose we are interested in the value of some function $f$ (this function can represent an object we care about in the real world; for example, a company’s stock price in three months’ time). When $E$ happens, we would like our Oracle’s reward to depend on how close its output $o$ was to the true value of $f$. For this we might use a proper scoring rule.
[proper scoring rule] A proper scoring rule $s(o, f)$, for a real-valued function $f$ and a real-number output $o$, is a function whose expectation is maximised by setting $o = \mathbb{E}[f]$, the expectation of $f$ (Gneiting and Raftery, 2007). For any strictly convex differentiable function $g$, there exists a proper scoring rule

$$s(o, f) = g(o) + g'(o)(f - o). \tag{1}$$
A common proper scoring rule is $s(o, f) = -(o - f)^2$, which is just equal to $g(o) + g'(o)(f - o) - f^2$ for $g(o) = o^2$. Note that the Oracle has no influence over the $-f^2$ term, so this scoring function is still maximised by setting $o = \mathbb{E}[f]$.
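As a sanity check, the maximising property of the quadratic scoring rule can be verified numerically. The sketch below is illustrative only: the distribution of the predicted quantity is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented distribution for the quantity f: Normal(mean=10, std=2), so E[f] = 10.
samples = rng.normal(10.0, 2.0, size=100_000)

def expected_score(o, f_samples):
    """Monte Carlo estimate of E[-(o - f)^2] for a candidate output o."""
    return np.mean(-(o - f_samples) ** 2)

# Sweep candidate outputs; the expected score should peak near o = E[f] = 10.
candidates = np.linspace(0.0, 20.0, 201)
best = candidates[int(np.argmax([expected_score(o, samples) for o in candidates]))]
print(best)
```

Running this, the best candidate output lands at (approximately) the mean of the distribution, as the proper-scoring-rule property predicts.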
Once we have our proper scoring rule $s$, we give our counterfactual Oracle the reward:

$$R = I_E \cdot s(o, f).$$
Because of the $I_E$ term, the Oracle does not care about $\mathbb{E}[s(o,f)]$, but about $\mathbb{E}[s(o,f) \mid E]$. Since, when $E$ happens, nobody ever reads $o$, this expectation is further independent of the output, i.e. $\mathbb{E}[f \mid E, o] = \mathbb{E}[f \mid E]$ for all values of $o$.
As a consequence, since $s$ is a proper scoring rule, the Oracle will maximise its reward by setting $o = \mathbb{E}[f \mid E]$, the expectation of $f$ given that its output is never read.
In the rare cases where $E$ happens, $o$ is hidden from the outside world while an automatic internal system records $o$ and tracks the value of $f$ using information from some pre-defined source, recording $f$’s final value. Then the Oracle gets a reward of $s(o, f)$, and the episode ends.
Thus, as its name suggests, our counterfactual Oracle outputs the expected value of $f$ in a counterfactual world where nobody ever saw the Oracle’s prediction.
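The episode logic described above can be sketched as follows. This is a minimal illustration rather than a full implementation: the erasure probability, the quadratic scoring rule, and the callable interfaces are all assumptions made for the example.

```python
import random

def run_episode(oracle_answer, resolve_question, erasure_prob=0.05,
                rng=random.Random(0)):
    """One counterfactual-Oracle episode (illustrative sketch).

    oracle_answer: callable returning the Oracle's output o.
    resolve_question: callable that waits for the real-world outcome and
        returns the true value f (only ever evaluated under erasure).
    Returns (reward, output_visible_to_humans).
    """
    o = oracle_answer()
    if rng.random() < erasure_prob:  # erasure event E
        f = resolve_question()       # automated system measures the truth
        return -(o - f) ** 2, None   # reward s(o, f); humans never see o
    return 0.0, o                    # shut down with zero reward; o is released
```

For instance, forcing `erasure_prob=1.0` makes every episode an erasure: with prediction 10.0 and eventual truth 9.5 the reward is -0.25 and nothing is shown to the humans, while with `erasure_prob=0.0` the Oracle is shut down with zero reward and the answer is released.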
3.2 Questions format
The questions we ask our counterfactual Oracle are restricted to things which can be carefully and fully specified, so that in an erasure event the internal reward system can correctly estimate how good the Oracle’s answer was. For example, asking the counterfactual Oracle “will this medical treatment work” will only elicit useful information if, when an erasure occurs, we actually do test that treatment in the real world and allow the internal reward system to give the Oracle the correct reward.
It bears repeating: the quality of the answers we read depends on how well calibrated the Oracle’s reward is in cases where we don’t read those answers.
When an erasure does occur, we mustn’t ask the same question again of a second Oracle. To see why this is the case, let $o_1, o_2$ be the outputs and $E_1, E_2$ the erasure events for the two Oracles. If $E_1$ occurs, and we respond by asking the second Oracle the same question, the first Oracle might anticipate this. If it does, then the first Oracle’s output $o_1$ will not be the answer we are actually interested in, which is $\mathbb{E}[f \mid E_1, E_2]$ (we want to know what the value of $f$ would have been absent either Oracle’s influence). Indeed, $o_1$ may assume we read the answer of the second Oracle, nullifying the careful construction of the counterfactual design.
3.3 Multiple questions
The counterfactual Oracle is not limited in the amount of data it can output; indeed, we can ask multiple questions within the same erasure event by using the reward function

$$R = I_E \cdot \sum_i s(o_i, f_i).$$
We can also run another Oracle to ask more questions as soon as the previous Oracle has been terminated (which happens immediately as long as an erasure doesn’t occur).
If we like, we can ask the Oracle for more specific details about the function in question, such as: “which value is the (discrete) random variable $f$ most likely to take, and what is the corresponding probability?”. In that case, if $(o_1, o_2)$ are the outputs of the Oracle, $I_{o_1}$ is the indicator variable for $f = o_1$, $s$ is derived from $g$ as in Equation 1, and $g$ is an increasing differentiable convex function, the following reward will motivate the correct behavior:

$$R = I_E \cdot s(o_2, I_{o_1}).$$
To see that the Oracle will choose $o_2$ to be the probability of $f = o_1$, note that because we use a proper scoring rule, no matter what $o_1$ is, the Oracle maximises its reward by setting $o_2$ to be $\mathbb{E}[I_{o_1} \mid E] = P(f = o_1 \mid E)$. To see that the Oracle will pick the $o_1$ with the highest probability, note that when $o_2 = P(f = o_1 \mid E)$ then, by Equation 1, the expectation of $s(o_2, I_{o_1})$ will be $g(o_2)$. Since $g$ is increasing, the Oracle wants $o_2$ to be as large as possible, so will indeed set $o_1$ to the most likely value of $f$.
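This argument can be checked numerically on a toy distribution. The sketch below assumes $g(x) = x^2$ (increasing, convex, and differentiable on $[0,1]$) and an invented three-valued variable; it grid-searches both outputs and confirms that the optimum names the most likely value together with its probability.

```python
import numpy as np

# Invented distribution over three values of the discrete variable f.
probs = {"a": 0.2, "b": 0.5, "c": 0.3}

g = lambda x: x ** 2          # increasing, convex, differentiable on [0, 1]
g_prime = lambda x: 2 * x

def expected_reward(o1, o2):
    """E[s(o2, I_{o1}) | E], with s(o, f) = g(o) + g'(o) * (f - o)."""
    p = probs[o1]             # E[I_{o1} | E] = P(f = o1 | E)
    return g(o2) + g_prime(o2) * (p - o2)

# Grid-search both outputs; the best pair should be the most likely value
# of f together with its probability.
grid = np.linspace(0.0, 1.0, 101)
best = max(((o1, o2) for o1 in probs for o2 in grid),
           key=lambda pair: expected_reward(*pair))
print(best)
```

Here the maximiser comes out as the value `"b"` (the mode) paired with an $o_2$ at its probability 0.5, matching the argument above.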
3.4 Experiment with counterfactual Oracle
To illustrate the difference between questions where the correct answer is or isn’t affected by the Oracle’s output, we ran an experiment where a counterfactual Oracle is trained to predict the profit of one of 26 fictional companies (indexed $i = 0, \dots, 25$). In this example, because investors and consumers are influenced by the Oracle’s predictions, the performance of a company is not independent of the Oracle’s output.
When an erasure event occurs (meaning nobody sees the Oracle’s output), the profit of each company is equal to $i$, where $i$ is the index of that company.
If an erasure event does not occur, the Oracle’s prediction is to some extent self-confirming: the profit of the predicted company is half of what it would have been, plus what the Oracle predicts. That is, if the company would have made a profit of $x$ had the Oracle said nothing, and the Oracle predicts a profit of $p$, the company actually ends up with a profit of $x/2 + p$.
The Oracle itself is a very basic multi-layer perceptron (MLP) with one hidden layer. The perceptron is given a one-hot vector of length 26 specifying which company we are asking about, and is asked to predict its profit.
The Oracle’s error is plotted in Figure 3. The horizontal axis tracks the number of episodes where an erasure occurred, rather than the total number of episodes – this is because the Oracle only learns from erasure events, and doesn’t receive any feedback in the non-erasure cases.
The red (or dark gray) curve is the counterfactual performance: the mean squared error between the Oracle’s prediction and the company’s actual profit in the cases where nobody is allowed to see the Oracle’s answer (this is the data which the Oracle is actually trained on). This error goes to zero.
The green (or light gray) curve plots the Oracle’s performance in the non-erasure cases: the mean squared error between the Oracle’s prediction and a company’s actual performance in cases when the Oracle’s answer was known to the public. Unlike the erasure curve, this error doesn’t tend to zero, because the Oracle is predicting what the profit would be in an erasure event, not what the profit will actually be in the majority of cases.
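The qualitative shape of these two curves can be reproduced with a minimal sketch. The learning rate, episode count, and the use of a per-company lookup (equivalent to a linear model on the one-hot input) are assumptions made for the example; only the profit rules come from the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_companies = 26
preds = np.zeros(n_companies)  # one prediction per company (one-hot linear model)
lr = 0.05                      # assumed learning rate

cf_errors, public_errors = [], []
for _ in range(2000):              # erasure episodes only: the Oracle gets
    i = rng.integers(n_companies)  # no feedback when its answer is public
    counterfactual_profit = float(i)   # profit of company i if nobody sees o
    err = preds[i] - counterfactual_profit
    preds[i] -= lr * err               # gradient step on the squared error
    cf_errors.append(err ** 2)
    # Error the same prediction would have in a public episode, where the
    # actual profit is half the counterfactual profit plus the prediction:
    public_profit = counterfactual_profit / 2 + preds[i]
    public_errors.append((preds[i] - public_profit) ** 2)

print(np.mean(cf_errors[-100:]), np.mean(public_errors[-100:]))
```

As in Figure 3, the counterfactual (erasure) error shrinks towards zero while the public-episode error stays large: the trained prediction is the counterfactual profit, so in public episodes it misses by half the counterfactual profit.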
4 Low-bandwidth Oracle
A second method for preventing an Oracle from giving nefarious answers which lead to its escape is to restrict the set of possible outputs the Oracle is able to give. We call such a design a low-bandwidth Oracle. For example, in the most extreme version of the low-bandwidth Oracle, we ask the Oracle a single question and only allow it to answer either yes or no. It’s highly unlikely that either of these binary outputs will cause the Oracle to be released from its confinement, so the Oracle is safe.
4.1 Allowable answers
In general, a low-bandwidth Oracle is given a list $A$ of allowable answers, where we have ensured this list doesn’t contain any output which might lead to the Oracle’s escape. If we want our Oracle to be able to express general answers, we might construct $A$ by estimating the minimal size $n$ of a nefarious message, and restricting the Oracle’s output to only messages shorter than $n$.
4.2 Episodic reward
It’s crucial that a low-bandwidth Oracle’s reward be episodic, in the sense that the system only cares about maximising the reward it gets from its next output and is not concerned with the future state of the world after many questions have been asked. If this requirement is not satisfied, the Oracle might try to give a series of outputs which are all safe when considered individually, but when combined become dangerous and lead to an escape scenario. A single Oracle with a reward motivation spread over multiple questions is similar to multiple Oracles each with a single question, which is analysed further in Section 4.4.
4.3 Defining the question
For a low-bandwidth Oracle, the list $A$ of permissible answers partially defines the question, and the other part of the question definition is captured by our choice of reward function. Let $o$ be the Oracle’s output. Then the Oracle’s reward will be a function of $o$, as well as of the subsequent events which unfold in the real world between the time we ask the question and the end of the episode.
For instance, if $A$ was a list of companies on a stock exchange, and the implicit question was ‘which company’s stock will perform the best over the next week’, then a possible reward function is $1$ if the company named by $o$ does perform best over the week, and $0$ otherwise. After the Oracle makes its prediction, we act on our new information, and then we observe the actual performance of stocks during the week. Based on the correctness of its answer we give the Oracle the appropriate reward, and the episode concludes.
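One such episode can be sketched as follows, with invented ticker names and returns standing in for the real answer list and market data.

```python
def stock_episode(oracle_choice, weekly_returns):
    """One low-bandwidth Oracle episode (illustrative sketch).

    oracle_choice: a ticker chosen from the fixed answer list.
    weekly_returns: observed return per ticker once the week has elapsed.
    Reward is 1 if the Oracle named the best performer, else 0.
    """
    best = max(weekly_returns, key=weekly_returns.get)
    return 1.0 if oracle_choice == best else 0.0

# Invented data: "BBB" turns out to be the week's best performer.
returns = {"AAA": 0.02, "BBB": 0.07, "CCC": -0.01}
print(stock_episode("BBB", returns))  # 1.0
```

Note that the reward can only be computed once the episode’s real-world events have resolved, which is why the episode ends only after the week has elapsed.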
How do we communicate the reward function to the Oracle? We could do so directly: along with the list $A$ of allowable answers, we could add a message which details the reward function. However, if the Oracle is sufficiently intelligent, it could deduce what we are aiming for without explicitly being told. This means we can use the Oracle for unspecified goals of the kind “we’ll know it when we see it”, similar to inverse reinforcement learning (Ng and Russell, 2000; Abbeel and Ng, 2004; Hadfield-Menell et al., 2016; Christiano et al., 2017).
4.4 What kind of questions can be asked?
Note that the low-bandwidth Oracle is non-counterfactual – the Oracle’s answer is always known by us and hence the outcome we are asking about may itself depend on the answer we receive (for example, the Oracle predicting a certain stock will rise could be self-fulfilling if this prediction prompts investors to bid up the price).
Good questions for a non-counterfactual Oracle are those whose answers are independent of whether the answer is known. Alternatively, it is also good to ask questions which are top answer preserving.
[Top answer preserving] Let $o$ be the output of an Oracle with reward $R$. The Oracle chooses an answer from a list $A$, and $A$ contains a blank answer $o_\emptyset$ which corresponds to the Oracle not producing any output. Writing $R(a)$ for the reward that answer $a$ would receive, $(A, R)$ is top answer preserving iff

$$\operatorname*{argmax}_{a \in A} \mathbb{E}[R(a) \mid o = a] = \operatorname*{argmax}_{a \in A} \mathbb{E}[R(a) \mid o = o_\emptyset].$$
This definition aims to capture the idea that, for top answer preserving questions, the Oracle’s output does not change what the best answer is – meaning this low-bandwidth Oracle’s output would be the same as the output of a counterfactual Oracle.
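The definition can be illustrated on a toy example with two answers. The expected-reward tables below are invented; the helper simply checks whether the best answer to announce matches the best answer under silence.

```python
def top_answer_preserving(expected_reward):
    """Check the definition on a finite answer list (illustrative sketch).

    expected_reward[(announced, scored)] = E[ R(scored) | o = announced ].
    The key "blank" stands for the Oracle staying silent. The question is
    top answer preserving iff the best answer to announce is the same
    answer that would have been best had the Oracle stayed silent.
    """
    answers = {a for a, _ in expected_reward if a != "blank"}
    best_if_silent = max(answers, key=lambda a: expected_reward[("blank", a)])
    best_announced = max(answers, key=lambda a: expected_reward[(a, a)])
    return best_announced == best_if_silent

# Independent question: announcing an answer doesn't move the probabilities.
indep = {("blank", "x"): 0.7, ("blank", "y"): 0.3,
         ("x", "x"): 0.7, ("y", "y"): 0.3}
# Self-fulfilling question: announcing 'y' makes 'y' the likely outcome.
selffulfilling = {("blank", "x"): 0.7, ("blank", "y"): 0.3,
                  ("x", "x"): 0.7, ("y", "y"): 0.9}
print(top_answer_preserving(indep))           # True
print(top_answer_preserving(selffulfilling))  # False
```

The self-fulfilling table is exactly the bad case described below: the Oracle can promote ‘y’ from unlikely to likely just by announcing it.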
Conversely, the worst questions for us to ask a low-bandwidth Oracle are those where the Oracle can promote a possibility from unlikely to likely just by answering that possibility. In that case the Oracle is not so much describing the outcome, but determining it.
4.5 A large class of subtle questions
We may want to ask the Oracle questions where the correctness of an answer is not necessarily obvious after the fact. For instance, suppose we had a list of research projects on a variety of topics (machine learning, biotech, economics, etc.), and we wished to ask the Oracle which of these projects is most likely to succeed. We can’t simply ask the question, accept the Oracle’s answer and then fund that research project: how would we measure that project’s success without having anything to compare it with? In cases like this we must find a way to properly judge the quality of the answer. In the example of funding research projects, one possibility would be to fund the research project the Oracle suggests, along with another random project selected from the list. We can then reward the Oracle based on how much better the project suggested by the Oracle was (and, if we like, our assessment of ‘better performance’ can be entirely subjective).
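A sketch of this comparative reward protocol, with an invented project list and an invented scoring rubric standing in for our (possibly subjective) judgment:

```python
import random

def funding_episode(oracle_pick, projects, judge, rng=random.Random(1)):
    """Reward protocol for questions with no absolute ground truth (sketch).

    Fund the Oracle's pick plus one randomly selected alternative, then
    reward the Oracle 1 if our judge prefers its pick, and 0 otherwise.
    """
    alternatives = [p for p in projects if p != oracle_pick]
    baseline = rng.choice(alternatives)   # the random comparison project
    return 1.0 if judge(oracle_pick, baseline) else 0.0

# Invented judge: prefers whichever project scores higher on our rubric.
scores = {"ml": 0.9, "biotech": 0.4, "econ": 0.6}
judge = lambda a, b: scores[a] > scores[b]
print(funding_episode("ml", list(scores), judge))  # 1.0
```

The key design point is that the judgment is purely comparative: we never need an absolute measure of a project’s success, only a preference between the two funded projects.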
Thus the low-bandwidth Oracle design can be used for answering many subtle questions if the Oracle is sufficiently smart and knowledgeable about us and our future reactions.
4.6 Multiple low-bandwidth Oracles
If we are unsure about the maximum length a message can be before it becomes potentially dangerous, we may be forced to choose a set of permissible outputs that is much smaller than ideal.
Using multiple low-bandwidth Oracles to get around this limitation is only safe in cases where each Oracle’s answers and rewards can be fully resolved before moving on to the next question. If it is not possible to resolve the rewards of one Oracle before running another, this introduces the possibility of the two Oracles colluding, each giving answers which are safe in isolation but dangerous when combined. Just because each Oracle’s answer list contains only safe answers individually, this doesn’t necessarily imply the same is true of their combination.
For example, suppose we decided that the maximum length at which we can be confident any message is safe was 5 bits, but we are interested in a question whose answer is 10 bits long. We might be tempted to get the answer by creating two Oracles, Oracle$_1$ and Oracle$_2$, so that we can ask Oracle$_1$ for the first 5 bits of the answer and ask Oracle$_2$ for the remaining bits.
If we have a way of verifying whether Oracle$_1$’s output does or doesn’t specify half of the correct answer, then there is no problem: we run Oracle$_1$, reward it, terminate it, then do the same with Oracle$_2$, and combine the outputs to get the answer we desire.
However, if we aren’t able to verify the validity of Oracle$_1$’s answer until we also get an answer from Oracle$_2$, the two Oracles will have an opportunity to collude with each other by each delivering one portion of a dangerous message. When we read the dangerous message, its content causes the Oracles to be released, letting both Oracles accumulate a much higher reward than they otherwise could have, at our expense.
Thus the best questions for low-bandwidth Oracles are those which can be broken down into small sub-questions, where the answer of each sub-question can be independently verified before proceeding to the next one.
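The safe sequential protocol above can be sketched as follows; the Oracle and verifier stand-ins are invented for illustration.

```python
def query_sequentially(sub_questions, make_oracle, verify):
    """Safe multi-Oracle protocol (illustrative sketch).

    Each sub-question is answered, independently verified, and its Oracle
    rewarded and terminated before the next Oracle is ever run, so no two
    Oracles can coordinate a combined unsafe message.
    """
    parts = []
    for q in sub_questions:
        oracle = make_oracle(q)       # fresh, episodic Oracle for this part
        answer = oracle(q)
        if not verify(q, answer):     # independent check of this part alone
            raise ValueError(f"unverifiable sub-answer for {q!r}")
        # (reward and terminate this Oracle here, before starting the next)
        parts.append(answer)
    return parts

# Toy usage: each sub-question asks for one 5-bit chunk of a codeword.
codeword = ["10110", "01001"]
make_oracle = lambda q: (lambda q: codeword[q])  # stand-in for a real Oracle
verify = lambda q, a: len(a) == 5 and set(a) <= {"0", "1"}
print(query_sequentially([0, 1], make_oracle, verify))
```

The verifier here only checks format, but the point is structural: whatever the per-part check is, it must complete before the next Oracle is started.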
5 Conclusion
This paper presented two Oracle designs which are both safe and useful: the counterfactual Oracle, which is constrained by only being able to receive reward when its output is never read, and the low-bandwidth Oracle, which is constrained by only being able to choose outputs from a limited set of permissible answers.
As long as the tasks remain episodic, and certain specific precautions are taken regarding the use of multiple Oracles, these designs allow many questions to be formulated and safely answered.
These designs allow for the creation of safer and non-manipulative Oracles, from low levels of predictive ability all the way up to potential superintelligences. Further research will extend the span of questions that can be answered in this way, and may elucidate general safety and accuracy measures of use beyond these specific Oracle designs, such as for AIs with more options than only question answering.
- Abbeel and Ng (2004) Pieter Abbeel and Andrew Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, 2004.
- Alfonseca et al. (2016) Manuel Alfonseca, Manuel Cebrián, Antonio Fernández Anta, Lorenzo Coviello, Andrés Abeliuk, and Iyad Rahwan. Superintelligence cannot be contained: Lessons from computability theory. CoRR, abs/1607.00913, 2016. URL http://arxiv.org/abs/1607.00913.
- Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016. URL http://arxiv.org/abs/1606.06565.
- Armstrong et al. (2012) Stuart Armstrong, Anders Sandberg, and Nick Bostrom. Thinking inside the box: Controlling and using an Oracle AI. Minds and Machines, 22:299–324, 2012.
- Armstrong et al. (2014) Stuart Armstrong, Kaj Sotala, and Seán S. Ó HÉigeartaigh. The errors, insights and lessons of famous ai predictions – and what they mean for the future. Journal of Experimental & Theoretical Artificial Intelligence, 26(3):317–342, 2014. doi: 10.1080/0952813X.2014.895105. URL https://doi.org/10.1080/0952813X.2014.895105.
- Babcock et al. (2016) James Babcock, János Kramár, and Roman Yampolskiy. The AGI containment problem. CoRR, abs/1604.00545, 2016. URL http://arxiv.org/abs/1604.00545.
- Babcock et al. (2017) James Babcock, János Kramár, and Roman V. Yampolskiy. Guidelines for artificial intelligence containment. CoRR, abs/1707.08476, 2017. URL http://arxiv.org/abs/1707.08476.
- Bostrom (2014) Nick Bostrom. Superintelligence: Paths, dangers, strategies. Oxford University Press, 2014.
- Christiano et al. (2017) Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. CoRR, abs/1706.03741, 2017. URL https://arxiv.org/abs/1706.03741.
- Everitt et al. (2018) Tom Everitt, Gary Lea, and Marcus Hutter. AGI safety literature review. In International Joint Conference on Artificial Intelligence (IJCAI), 2018. URL https://arxiv.org/abs/1805.01109.
- Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
- Grace et al. (2017) Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will AI exceed human performance? evidence from AI experts. CoRR, abs/1705.08807, 2017. URL http://arxiv.org/abs/1705.08807.
- Hadfield-Menell et al. (2016) Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 2016.
- Hutter (2012) Marcus Hutter. Can intelligence explode? CoRR, abs/1202.6177, 2012. URL http://arxiv.org/abs/1202.6177.
- Leike et al. (2017) Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. CoRR, abs/1711.09883, 2017. URL http://arxiv.org/abs/1711.09883.
- Ng and Russell (2000) Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pages 663–670, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-707-2. URL http://dl.acm.org/citation.cfm?id=645529.657801.
- Russell et al. (2016) Stuart J. Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. CoRR, abs/1602.03506, 2016. URL http://arxiv.org/abs/1602.03506.
- Yampolskiy (2012) Roman V. Yampolskiy. Leakproofing the singularity: Artificial intelligence confinement problem. Journal of Consciousness Studies, 19(1–2), 2012.
- Yudkowsky (2006) Eliezer Yudkowsky. Artificial intelligence as a positive and negative factor in global risk, 2006.