-  Peter Stone, Rodney Brooks, Erik Brynjolfsson, Ryan Calo, Oren Etzioni, Greg Hager, Julia Hirschberg, Shivaram Kalyanakrishnan, Ece Kamar, Sarit Kraus, et al. One hundred year study on artificial intelligence: Report of the 2015-2016 study panel. Technical report, Stanford University, 2016.
-  Pedro Domingos. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books, New York, NY, 2015.
-  Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford, UK, 2014.
-  Erik Brynjolfsson and Andrew McAfee. The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies. WW Norton & Company, New York, 2014.
-  Ryan Calo. Robotics and the lessons of cyberlaw. California Law Review, 103:513, 2015.
-  Tao Jiang, Srdjan Petrovic, Uma Ayyer, Anand Tolani, and Sajid Husain. Self-driving cars: Disruptive or incremental. Applied Innovation Review, 1:3–22, 2015.
-  William D. Nordhaus. Two centuries of productivity growth in computing. The Journal of Economic History, 67(1):128–159, 2007.
-  Katja Grace. Algorithmic progress in six domains. Technical report, Machine Intelligence Research Institute, 2013.
-  Erik Brynjolfsson and Andrew McAfee. Race Against the Machine: How the Digital Revolution Is Accelerating Innovation, Driving Productivity, and Irreversibly Transforming Employment and the Economy. Digital Frontier Press, Lexington, MA, 2012.
-  Seth D. Baum, Ben Goertzel, and Ted G. Goertzel. How long until human-level AI? Results from an expert assessment. Technological Forecasting and Social Change, 78(1):185–195, 2011.
-  Vincent C. Müller and Nick Bostrom. Future progress in artificial intelligence: A survey of expert opinion. In Vincent C. Müller, editor, Fundamental Issues of Artificial Intelligence, pages 553–570. Springer, 2016.
-  Toby Walsh. Expert and non-expert opinion about technological unemployment. arXiv preprint arXiv:1706.06906, 2017.
-  Irving John Good. Speculations concerning the first ultraintelligent machine. Advances in Computers, 6:31–88, 1966.
-  Philip Tetlock. Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press, Princeton, NJ, 2005.
-  J Doyne Farmer and François Lafond. How predictable is technological progress? Research Policy, 45(3):647–665, 2016.
-  Lyle Ungar, Barbara Mellers, Ville Satopää, Jon Baron, Phil Tetlock, Jaime Ramos, and Sam Swift. The good judgment project: A large scale test. Technical report, Association for the Advancement of Artificial Intelligence, 2012.
-  Joe W. Tidwell, Thomas S. Wallsten, and Don A. Moore. Eliciting and modeling probability forecasts of continuous quantities. Paper presented at the 27th Annual Conference of the Society for Judgment and Decision Making, Boston, MA, 19 November 2016.
-  Thomas S. Wallsten, Yaron Shlomi, Colette Nataf, and Tracy Tomlinson. Efficiently encoding and modeling subjective probability distributions for quantitative variables. Decision, 3(3):169, 2016.
We developed questions through a series of interviews with Machine Learning researchers. Our survey questions were as follows:
-  Three sets of questions eliciting HLMI predictions by different framings: asking directly about HLMI, asking about the automatability of all human occupations, and asking about recent progress in AI from which we might extrapolate.
-  Three questions about the probability of an “intelligence explosion”.
-  One question about the welfare implications of HLMI.
-  A set of questions about the effect of different inputs on the rate of AI research (e.g., hardware progress).
-  Two questions about sources of disagreement about AI timelines and “AI Safety”.
-  Thirty-two questions about when AI will achieve narrow “milestones”.
-  Two sets of questions on AI Safety research: one about AI systems with non-aligned goals, and one about the prioritization of Safety research in general.
-  A set of demographic questions, including ones about how much thought respondents have given to these topics in the past.

The questions were asked via an online Qualtrics survey. (The Qualtrics file will be shared to enable replication.) Participants were invited by email and were offered a financial reward for completing the survey. Questions were asked in roughly the order above, and each respondent received a randomized subset of questions. Surveys were completed between May 3rd 2016 and June 28th 2016.
Our goal in defining “high-level machine intelligence” (HLMI) was to capture the widely discussed notions of “human-level AI” or “general AI” (as contrasted with “narrow AI”). We consulted all previous surveys of AI experts and based our definition on that of an earlier survey, which defined HLMI as a machine that “can carry out most human professions at least as well as a typical human.” Our definition is more demanding: it requires machines to be better than humans at all tasks (while also being more cost-effective). Since earlier surveys often use less demanding notions of HLMI, they should (all other things being equal) predict earlier arrival of HLMI.
The demographic information on respondents and non-respondents (Table S3) was collected from public sources, such as academic websites, LinkedIn profiles, and Google Scholar profiles. Citation counts and seniority (i.e., number of years since the start of the PhD) were collected in February 2017.
Elicitation of Beliefs
Many of our questions ask when an event will happen. For prediction tasks, ideal Bayesian agents provide a cumulative distribution function (CDF) from time to the cumulative probability of the event. When eliciting points on respondents’ CDFs, we framed questions in two different ways, which we call “fixed-probability” and “fixed-years”. Fixed-probability questions ask by which year an event has a p% cumulative probability (for p=10%, 50%, 90%). Fixed-years questions ask for the cumulative probability of the event by year y (for y=10, 25, 50). The former framing was used in recent surveys of HLMI timelines; the latter is used in the psychological literature on forecasting [17, 18]. With a limited question budget, the two framings sample different points on the CDF; otherwise, they are logically equivalent. Yet our survey respondents did not treat them as logically equivalent: we observed framing effects in all our prediction questions, as well as in pilot studies. Differences between these two framings have previously been documented in the forecasting literature [17, 18], but there is no clear guidance on which framing leads to more accurate predictions. Thus we simply average over the two framings when computing CDF estimates for HLMI and for tasks. HLMI predictions for each framing are shown in Fig. S2.
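The relationship between the two framings can be sketched as follows (a minimal illustration with a hypothetical respondent belief, not survey code): for an ideal Bayesian, both framings sample points on the same CDF.

```python
# Minimal illustration (hypothetical respondent, not survey code) of the two
# question framings sampling points on the same cumulative distribution.
from scipy import stats

# Hypothetical belief over years-until-event, modeled as a gamma distribution.
belief = stats.gamma(a=2.0, scale=20.0)

# Fixed-probability framing: "by which year is there a p% cumulative probability?"
fixed_probability_points = [(belief.ppf(p), p) for p in (0.10, 0.50, 0.90)]

# Fixed-years framing: "what is the cumulative probability within y years?"
fixed_years_points = [(y, belief.cdf(y)) for y in (10, 25, 50)]

# For an ideal Bayesian the framings are logically equivalent; since respondents
# treat them differently, the survey averages CDF estimates over both framings.
def averaged_cdf(cdf_fixed_prob, cdf_fixed_years, year):
    return 0.5 * (cdf_fixed_prob(year) + cdf_fixed_years(year))
```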
Statistics

For each timeline probability question (see Figures 1 and 2), we computed an aggregate distribution by fitting a gamma CDF to each individual’s responses using least squares and then taking the mixture distribution of all individuals. Reported medians and quantiles were computed on this summary distribution. Confidence intervals were generated by bootstrapping (clustering on respondents, with 10,000 draws) and plotting the 95% interval for the estimated probabilities at each year. The time-in-field and citation comparisons between respondents and non-respondents (Table S3) were done using two-tailed t-tests. The region and gender proportions were compared using two-sided proportion tests. The significance test for the effect of region on HLMI date (Table S2) used robust linear regression, with the R function rlm from the MASS package for the regression and the f.robtest function from the sfsmisc package for a robust F-test of significance.
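The fit-then-mix aggregation step can be sketched as follows (an illustrative Python sketch on hypothetical respondent data, not the authors' code; the bootstrap step is omitted):

```python
# Sketch of the aggregation procedure: fit a gamma CDF to each respondent's
# elicited (year, probability) points by least squares, then take the mixture
# (mean) of the fitted CDFs. Data below are hypothetical, not the survey data.
import numpy as np
from scipy import stats, optimize

def fit_gamma_cdf(years, probs):
    """Least-squares fit of a gamma CDF to elicited (year, probability) points."""
    def residuals(params):
        shape, scale = np.exp(params)  # exponentiate to keep parameters positive
        return stats.gamma.cdf(years, a=shape, scale=scale) - probs
    sol = optimize.least_squares(residuals, x0=[0.0, 3.0])
    return np.exp(sol.x)

# Hypothetical respondents: years at 10%, 50%, 90% cumulative probability.
respondents = [(10, 25, 60), (5, 40, 100), (20, 50, 80)]
fits = [fit_gamma_cdf(np.array(r), np.array([0.1, 0.5, 0.9])) for r in respondents]

def aggregate_cdf(year):
    """Mixture distribution: the average of the individual fitted CDFs."""
    return np.mean([stats.gamma.cdf(year, a=s, scale=sc) for s, sc in fits])
```

Reported medians and quantiles would then be read off `aggregate_cdf`, with confidence bands from bootstrap resampling over respondents.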
S1: Automation Predictions by Researcher Region
This question asked when automation of the job would become feasible; cumulative probabilities were elicited as in the HLMI and milestone prediction questions. The definition of “full automation” is given above (see the section “Time Until Machines Outperform Humans”). For the “NA/Asia gap”, we subtract the Asian median estimate from the N. American median estimate.
|Question||Europe||N. America||Asia||NA/Asia gap|
S2: Regression of HLMI Prediction on Demographic Features
We standardized inputs and regressed the log of the median years until HLMI for respondents on gender, log of citations, seniority (i.e. numbers of years since start of PhD), question framing (“fixed-probability” vs. “fixed-years”) and region where the individual was an undergraduate. We used a robust linear regression.
|Variable||Estimate||Std. error||t value||p value||F statistic|
|Gender = “female”||-0.25473||0.39445||-0.64578||0.55320||0.3529552|
|Framing = “fixed_probabilities”||-0.34076||0.16811||-2.02704||0.04414||4.109484|
|Region = “Europe”||0.51848||0.21523||2.40898||0.01582||5.93565|
|Region = “M.East”||-0.22763||0.37091||-0.61369||0.54430||0.3690532|
|Region = “N.America”||1.04974||0.20849||5.03496||0.00000||25.32004|
|Region = “Other”||-0.26700||0.58311||-0.45788||0.63278||0.2291022|
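The regression setup above can be sketched in Python (the analysis used R's MASS::rlm and sfsmisc::f.robtest; this is a rough analogue using a Huber loss on fabricated placeholder data, not the survey data):

```python
# Illustrative sketch of a robust linear regression of log(median years until
# HLMI) on standardized predictors. Placeholder data only; the baked-in
# negative framing effect merely mimics the sign reported in Table S2.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
n = 200
log_citations = rng.normal(size=n)                      # standardized input
seniority = rng.normal(size=n)                          # standardized input
fixed_prob_framing = rng.integers(0, 2, size=n).astype(float)  # framing dummy
X = np.column_stack([np.ones(n), log_citations, seniority, fixed_prob_framing])
y = 3.5 - 0.3 * fixed_prob_framing + rng.normal(scale=0.5, size=n)

# Huber loss down-weights outliers, analogous in spirit to M-estimation in rlm.
fit = least_squares(lambda beta: X @ beta - y, x0=np.zeros(4),
                    loss="huber", f_scale=0.5)
coefs = fit.x  # [intercept, log citations, seniority, framing]
```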
S3: Demographics of Respondents vs. Non-respondents
There were 406 respondents and 399 non-respondents. Non-respondents were randomly sampled from all NIPS/ICML authors who did not respond to our survey invitation. Subjects with missing data for region of undergraduate institution or for gender are grouped under “NA”. Missing citation and seniority data are ignored in computing averages. Statistical tests are explained in the section “Statistics” above.
|Undergraduate region||Respondent proportion||Non-respondent proportion||Proportion-test p-value|
|Gender||Respondent proportion||Non-respondent proportion||Proportion-test p-value|
|Variable||Respondent estimate||Non-respondent estimate||Statistic||p-value|
|Years in field||8.6||11.1||4.04||0.000060|
S4: Survey responses on AI progress, intelligence explosions, and AI Safety
Three of the questions below concern Stuart Russell’s argument about highly advanced AI. An excerpt of the argument was included in the survey. The full argument can be found here:
S5: Description of AI Milestones
The timelines in Figure 2 are based on respondents’ predictions about the achievement of various milestones in AI. Beliefs were elicited in the same way as for HLMI predictions (see “Elicitation of Beliefs” above). We chose a subset of all milestones to display in Figure 2 based on which milestones could be accurately described with a short label.
|Milestone Name||Description||n||In Fig. 2||median (years)|
|Translate New Language with ’Rosetta Stone’||Translate a text written in a newly discovered language into English as well as a team of human experts, using a single other document in both languages (like a Rosetta stone). Suppose all of the words in the text can be found in the translated document, and that the language is a difficult one.||35||16.6|
|Translate Speech Based on Subtitles||Translate speech in a new language given only unlimited films with subtitles in the new language. Suppose the system has access to training data for other languages, of the kind used now (e.g., same text in two languages for many languages and films with subtitles in many languages).||38||10|
|Translate (vs. amateur human)||Perform translation about as well as a human who is fluent in both languages but unskilled at translation, for most types of text and for most popular languages (including languages that are known to be difficult, like Czech, Chinese and Arabic).||42||X||8|
|Telephone Banking Operator||Provide phone banking services as well as human operators can, without annoying customers more than humans. This includes many one-off tasks, such as helping to order a replacement bank card or clarifying how to use part of the bank website to a customer.||31||X||8.2|
|Make Novel Categories||Correctly group images of previously unseen objects into classes, after training on a similar labeled dataset containing completely different classes. The classes should be similar to the ImageNet classes.|
|One-Shot Learning||See only one labeled image of a new object, and then be able to recognize the object in real-world scenes, to the extent that a typical human can (i.e., including in a wide variety of settings). For example, see only one image of a platypus, and then be able to recognize platypuses in nature photos. The system may train on labeled images of other objects. Currently, deep networks often need hundreds of examples in classification tasks, but there has been work on one-shot learning for both classification and generative tasks. Lake et al. (2015), Building Machines That Learn and Think Like People; Koch (2015), Siamese Neural Networks for One-Shot Image Recognition; Rezende et al. (2016), One-Shot Generalization in Deep Generative Models.|
|Generate Video from New Direction||See a short video of a scene, and then be able to construct a 3D model of the scene good enough to create a realistic video of the same scene from a substantially different angle. For example, constructing a short video of walking through a house from a video taken along a very different path through the house.||42||11.6|
|Transcribe Speech||Transcribe human speech with a variety of accents in a noisy environment as well as a typical human can.||33||X||7.8|
|Read Text Aloud (text-to-speech)||Take a written passage and output a recording that can’t be distinguished from a voice actor’s, by an expert listener.||43||X||9|
|Math Research||Routinely and autonomously prove mathematical theorems that are publishable in top mathematics journals today, including generating the theorems to prove.||31||X||43.4|
|Putnam Math Competition||Perform as well as the best human entrants in the Putnam competition—a math contest whose questions have known solutions, but which are difficult for the best young mathematicians.||45||X||33.8|
|Go (same training as human)||Defeat the best Go players, training only on as many games as the best Go players have played. For reference, DeepMind’s AlphaGo has probably played a hundred million games of self-play, while Lee Sedol has probably played 50,000 games in his life.  Lake et al. (2015). Building Machines That Learn and Think Like People||42||X||17.6|
|Starcraft||Beat the best human Starcraft 2 players at least 50% of the time. Starcraft 2 is a real-time strategy game characterized by:
|Quick Novice Play at Random Game||Play a randomly selected computer game, including difficult ones, about as well as a human novice, after playing the game less than 10 minutes of game time. The system may train on other games.||44||12.4|
|Angry Birds||Play new levels of Angry Birds better than the best human players. Angry Birds is a game where players try to efficiently destroy 2D block towers with a catapult. For context, this is the goal of the IJCAI Angry Birds AI competition.||39||X||3|
|All Atari Games||Outperform professional game testers on all Atari games using no game-specific knowledge. This includes games like Frostbite, which require planning to achieve sub-goals and have posed problems for deep Q-networks. Mnih et al. (2015), Human-level control through deep reinforcement learning; Lake et al. (2015), Building Machines That Learn and Think Like People.|
|Novice Play at half of Atari Games in 20 Minutes||Outperform human novices on 50% of Atari games after only 20 minutes of training play time and no game-specific knowledge. For context, the original Atari-playing deep Q-network outperforms professional game testers on 47% of games, but used hundreds of hours of play to train. Mnih et al. (2015), Human-level control through deep reinforcement learning; Lake et al. (2015), Building Machines That Learn and Think Like People.||33||6.6|
|Fold Laundry||Fold laundry as well and as fast as the median human clothing store employee.||30||X||5.6|
|5km Race in City (bipedal robot vs. human)||Beat the fastest human runners in a 5 kilometer race through city streets using a bipedal robot body.||28||X||11.8|
|Assemble any LEGO||Physically assemble any LEGO set given the pieces and instructions, using non-specialized robotics hardware. For context, Fu et al. (2016) successfully join single large LEGO pieces using model-based reinforcement learning and online adaptation. Fu et al. (2016), One-Shot Learning of Manipulation Skills with Online Dynamics Adaptation and Neural Network Priors.||35||X||8.4|
|Learn to Sort Big Numbers Without Solution Form||Learn to efficiently sort lists of numbers much larger than in any training set used, the way Neural GPUs can do for addition, but without being given the form of the solution. For context, Neural Turing Machines have not been able to do this, but Neural Programmer-Interpreters have been able to do it by training on stack traces (which contain a lot of information about the form of the solution). Kaiser & Sutskever (2015), Neural GPUs Learn Algorithms; Zaremba & Sutskever (2015), Reinforcement Learning Neural Turing Machines; Reed & de Freitas (2015), Neural Programmer-Interpreters.|
|Python Code for Simple Algorithms||Write concise, efficient, human-readable Python code to implement simple algorithms like quicksort. That is, the system should write code that sorts a list, rather than just being able to sort lists. Suppose the system is given only:
|Answer Factoid Questions via Internet||Answer any “easily Googleable” factoid question posed in natural language better than an expert on the relevant topic (with internet access), having found the answers on the internet. Examples of factoid questions:
|Answer Open-Ended Factual Questions via Internet||Answer any “easily Googleable” factual but open-ended question posed in natural language better than an expert on the relevant topic (with internet access), having found the answers on the internet. Examples of open-ended questions:
|Answer Questions Without Definite Answers||Give good answers in natural language to factual questions posed in natural language for which there are no definite correct answers. For example: “What causes the demographic transition?”, “Is the thylacine extinct?”, “How safe is seeing a chiropractor?”||47||10|
|High School Essay||Write an essay for a high-school history class that would receive high grades and pass plagiarism detectors. For example answer a question like “How did the whaling industry affect the industrial revolution?”||42||X||9.6|
|Generate Top 40 Pop Song||Compose a song that is good enough to reach the US Top 40. The system should output the complete song as an audio file.||38||X||11.4|
|Produce a Song Indistinguishable from One by a Specific Artist||Produce a song that is indistinguishable from a new song by a particular artist, e.g., a song that experienced listeners can’t distinguish from a new song by Taylor Swift.||41||10.8|
|Write New York Times Best-Seller||Write a novel or short story good enough to make it to the New York Times best-seller list.||27||X||33|
|Explain Own Actions in Games||For any computer game that can be played well by a machine, explain the machine’s choice of moves in a way that feels concise and complete to a layman.||38||X||10.2|
|World Series of Poker||Play poker well enough to win the World Series of Poker.||37||X||3.6|
|Output Physical Laws of Virtual World||After spending time in a virtual world, output the differential equations governing that world in symbolic form. For example, the agent is placed in a game engine where Newtonian mechanics holds exactly and the agent is then able to conduct experiments with a ball and output Newton’s laws of motion.||52||14.8|
We thank Connor Flexman for collecting demographic information. We also thank Nick Bostrom for inspiring this work, and Michael Webb and Andreas Stuhlmüller for helpful comments. We thank the Future of Humanity Institute (Oxford), the Future of Life Institute, and the Open Philanthropy Project for supporting this work.