Companion repository to the g-index benchmark paper for general machine intelligence.
To build increasingly general-purpose artificial intelligence systems that can deal with unknown variables across unknown domains, we need benchmarks that measure precisely how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty, or how dissimilar it is from the system's prior knowledge and experience. If the skill of an intelligence system in a particular domain is defined as its ability to consistently generate a set of instructions (or programs) to solve tasks in that domain, current benchmarks do not quantitatively measure the efficiency of acquiring new skills, making it possible to brute-force skill acquisition by training with unlimited amounts of data and compute power. With this in mind, we first propose a common language of instruction, i.e. a programming language that allows the expression of programs in the form of directed acyclic graphs across a wide variety of real-world domains and computing platforms. Using programs generated in this language, we demonstrate a match-based method to both score performance and calculate the generalization difficulty of any given set of tasks. We use these to define a numeric benchmark called the g-index to measure and compare the skill-acquisition efficiency of any intelligence system on a set of real-world tasks. Finally, we evaluate the suitability of some well-known models as general intelligence systems by calculating their g-index scores.
The concept of intelligence has been expressed in informal terms from time immemorial, but to date there has been no consensus on a formal definition GOTTFREDSON199713. In psychology, definitions of intelligence include “the faculty of adapting one’s self to circumstances” binet1961development, and “the aggregate or global capacity ... to act purposefully, to think rationally, and to deal effectively with [the] environment” wechsler1944measurement. The definition of artificial intelligence shows similar variety, with common reliance on a human reference, e.g. “machines capable of performing tasks that would require intelligence if done by humans” quillian1968semantic, minsky1982semantic. Developments in the field of AI have led to further refinements, with terms such as skill-based, narrow, or weak AI versus general or strong AI Pennachin2007, searle1980minds, Searle:2009.
The Turing test, or “imitation game” turing1950computing, one of the first measures of artificial intelligence, was qualitative in nature. It required that an artificial intelligence convince a human judge that it was a human. This created an inaccurate perception of an AI’s capabilities due to variance in the judge’s knowledge dennett1998brainchildren. The improvements to the Turing test (such as the Lovelace test bringsjord2003creativity and its successor riedl2014lovelace) maintain the requirement of a human or human-like judge, focusing on the judge’s available resources and formal descriptions of what the judge can use for measuring the AI’s capabilities. To date, subjective human judgment plays a key role in evaluating any intelligence system.
Recent breakthroughs in the capabilities of machine intelligence systems have relied upon the scaling of computational methods such as support vector machines, random forests and neural networks. While this was assisted by the increased availability of raw computing power, we note that framing machine intelligence in terms of computation enabled quantitative methods for evaluation, as measuring the performance of a system was simplified to computing a numerical benchmark score on a publicly available dataset. The MNIST dataset mnist served as a benchmark for digit classification oliveira2004support, manjunath2007unconstrained, keysers2007comparison, and is used today in introductory texts to showcase the power of deep neural networks. Over the years, designing a good benchmark has developed into a specialized problem, involving the collection of large diverse datasets spanning multiple years beyer2020.
The presence of quantitative benchmarks was a boon for developing machine learning-based intelligence systems, as one could compare different methods and design targeted improvements to build upon a consistent, computable, numerical score. The state-of-the-art in image classification improved every year since the ImageNet benchmark deng2009imagenet became widely available, starting with AlexNet in 2012 krizhevsky2012imagenet, to SENet hu2019senet in 2017 (see Figure 1), with new techniques such as ResNet skip-connections gaining prominence on the back of their ImageNet performance. Similar improvements were also seen in the field of language understanding, with the GLUE benchmark wang2018glue and the development of well-known transformer models like BERT devlin2018bert and GPT-2 radford2019language.
Deep learning-based methods have achieved a high level of skill in specialized tasks, but it is still hard to quantitatively measure the “generalization capability” of an intelligence system, or its “ability to handle situations (or tasks) that differ from previously encountered situations” chollet2019. This is due to the difficulty of measuring the variables involved in current definitions of artificial intelligence. legg2007universal informally define intelligence as measuring an “agent’s ability to achieve goals in a wide range of environments”. They further list the properties desirable in a measure of intelligence, such as a formal mathematical definition, applicability across different methods without bias, and an informative, numerical score to enable comparison across agents. However, an agent’s intelligence cannot be computed practically according to this definition, as it relies on finding the Kolmogorov complexity li2008introduction of each environment.
orallo2017 bifurcates the measurement of intelligence systems into task-oriented and ability-oriented evaluations. Both are important for evaluating a system, but the former is far more common. Task-oriented evaluations include human judgments of AI performance, direct benchmark comparisons, and assessment of adversarial situations in games such as playing Chess or Go. Ability-oriented evaluations take the form of psychometrics in the case of human intelligence, and extend to artificial systems dowe2012iq when intelligence is viewed as a form of information processing chandrasekaran1990. This allows the use of algorithmic information theory (AIT) chaitin1982godel to perform ability-oriented evaluations of artificial systems. Building on this, chollet2019 provides an outline for the measurement of intelligence via a framework that is easily mapped to current methods in machine learning. The intelligence of a system here is defined as “the measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty”, and involves testing via a benchmark dataset called the Abstract Reasoning Corpus (ARC). However, this framework does not offer a quantitative measure of generalization difficulty, and all evaluation is close-ended and binary.
In this paper, we define the g-index, a quantitative benchmark to measure the intelligence of an artificial system as a computable, numerical value. The naming is inspired by the g-factor, which is a measure of general ability in the field of psychometrics jensen1999g. It accounts for performance, generalization difficulty and sample efficiency across a wide range of real-world tasks. Section 2 describes the experimental setup for the benchmark, and showcases the components that enable calculating numerical values for evaluation. Section 3 formally defines the parameters on which the g-index depends, constructs a mathematical formula, and shows how the properties of the g-index follow the guidelines in current literature. Section 4 evaluates some well-known transformer models as candidates for a general intelligence system by computing their g-index scores, and provides a sample dataset of real-world tasks and their associated skill programs that can be used with the g-index (https://github.com/mayahq/g-index-benchmark). Finally, Section 5 notes possible shortcomings in the current design and components of the g-index and indicates directions for improvement.
In this section, we describe the details of the components required to compute the g-index:
A task is specified to an intelligence system.
The intelligence system generates a skill program to solve the task. The intelligence system has been trained on a training set (or a curriculum) of tasks that may or may not be related to the specified task.
A scoring function evaluates the responses of the skill program against possible situations of the task, and provides a score along with some feedback if available.
The intelligence system can be updated based on the evaluation and feedback of the scoring function.
The generalization difficulty of a task measures how different the task is from the curriculum of the intelligence system. It can be used to weight the system’s performance for varying degrees of unseen task specifications.
While previous definitions of general intelligence rely on quantities like Kolmogorov complexity li2008introduction which are difficult to compute, the components in our setup together enable computing numerical values that can be combined to measure the capabilities of an intelligence system. The skill program is expressed in a custom programming language that can be extended to construct new programs without additional syntax complexity, which streamlines collecting and augmenting data for the intelligence system. The scoring function does not require running the program to compute the score, which means it can also be used as a loss function or reward function for training the intelligence system. Finally, we construct an intuitive formulation of generalization difficulty based on nearest neighbors that reuses the scoring function. These features are described in the subsections that follow.
Human beings follow a sequence of steps to perform a specified task. For an artificial intelligence system, the equivalent sequence of steps is the program. Hence to perform a task, the system would need to generate (or synthesize) programs, when provided a specification via examples of inputs, in natural language, images, audio, or video.
The field of program synthesis deals with the automatic construction of programs that are consistent with a given task specification waldinger1969. The common form of a task specification is a set of input-output examples amarel1970, summers1977 from which the necessary program(s) can be synthesized. The application of deep learning to program synthesis is called neural program synthesis (this is different from neural program induction, where the neural network learns the program from the given examples, but does not provide an explicit formulation of the program). Many neural program synthesis techniques have the task specified via a set of input-output examples parisotto2016neurosymbolic, but some also use natural language text prompts lin2017program, demonstration videos sun2018neural, or combinations of these as well tfcoder, shu2021agent.
In our current setup, the tasks submitted to the intelligence system are specified in English without listing any input-output examples. When scoring the generated program, an associated reference program is provided.
The choice of target programming language for synthesis varies across implementations. yin2017syntactic describe the generation of Python code snippets from a given description. lin2017program use recurrent neural networks (RNNs) to produce shell scripts. Codex copilot, which uses a large language model similar to GPT-3 brown2020gpt3, generates entire functions in Python from documentation strings, with similar capabilities being extended to other common programming languages. It is also common to use a domain-specific language (DSL). A DSL for program synthesis may be designed from scratch for a specific purpose raza2015compositional, may be a restricted subset of an existing language tfcoder, gramRLprog, or may be an extended version of an existing language bfplus.
Extensible via custom nodes: Every node in a flow-based program is assigned a special type attribute, and all nodes with the same type contain the same attribute keys. Most FBP implementations provide a default library of node types for constructing programs, but we can also create and add new node types that encapsulate a custom functionality for our own use (for examples of reusable nodes, see the library in Appendix B). This allows for maximum extensibility and reusability within the same level of expressivity: adding a new node type increases the number of possible skill programs without changing the language syntax or bloating program size.
Perform a variety of tasks: Each node JSON description in the FBP is an abstraction linked to a particular black box process. The functionality encapsulated in each node can be implemented in any programming language across platforms. This means that flow-based programs can be deployed on desktop computers, on servers in the cloud, and even on embedded devices such as the Raspberry Pi and Arduino. This allows us to specify tasks that may be performed across multiple devices, locally or over the internet. Figure 3 shows four different tasks with their associated flow-based skill programs.
Constrained program design: The design of flow-based programs can be restricted in three ways:
The limited syntax of JSON and the DAG construction make it difficult to use advanced programming constructs like recursion and loops.
If necessary, the library of available nodes can be customized to prevent the intelligence system from using specific nodes.
If certain node properties are confusing or risky (such as allowing arbitrary code execution), they can be restricted by designing new nodes that do not allow such modifications.
Proper restrictions along these axes can limit program aliasing – the existence of multiple valid programs that satisfy the task specification – by providing one obvious direction to solve a given task.
Efficient program generation: Program synthesis methods based on deep learning may have limits on the size of synthesized programs. For instance, neural network architectures like transformers have a fixed upper bound on the number of tokens that can be generated, so it is important to use the available token space efficiently. With flow-based programs, we can succinctly specify nodes that perform complex tasks due to the power of encapsulation. This allows for a larger space from which the intelligence system can generate programs.
While programs in other programming languages can be analysed as graphs by examining their Abstract Syntax Tree (AST) representations, the DAGs of flow-based programs are easier to compare with one another for the following reasons:
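As a rough illustration of the properties listed above, the following sketch parses a small flow-based program from JSON into a DAG and checks that it is acyclic. The node schema (`id`, `type`, `wires`) is an assumption loosely modeled on Node-RED-style flows and is not taken from the paper; only the node types `schedule-trigger` and `submit-order` and their attributes `frequency` and `tickername` appear in the text.

```python
import json

# Hypothetical flow: a trigger node wired to an order node.
FLOW_JSON = """
[
  {"id": "n1", "type": "schedule-trigger", "frequency": 10, "wires": ["n2"]},
  {"id": "n2", "type": "submit-order", "tickername": "TSLA", "wires": []}
]
"""

def parse_flow(text):
    """Return (nodes, edges): nodes maps id -> attributes, edges is a set of (src, dst)."""
    nodes, edges = {}, set()
    for entry in json.loads(text):
        # Everything except the id and the wiring is treated as a node attribute.
        nodes[entry["id"]] = {k: v for k, v in entry.items() if k not in ("id", "wires")}
        for target in entry.get("wires", []):
            edges.add((entry["id"], target))
    return nodes, edges

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if all nodes can be removed in topological order."""
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        current = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == current:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == len(nodes)

nodes, edges = parse_flow(FLOW_JSON)
print(nodes["n2"]["type"], sorted(edges), is_acyclic(nodes, edges))
```

Adding a new node type in this representation only adds a new value for `type` and its attribute keys; the JSON syntax itself does not change.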
The order of nodes in the AST is dependent on the order of text in the program source. For our program DAGs, the order of the program is specified implicitly in the program source by the edges between nodes. DAG comparisons are less affected by the order of information specified in the program source compared to AST comparisons.
AST representations suffer from program aliasing–two programs that satisfy the same task can have completely different ASTs. For our program DAGs, we can minimize program aliasing by constraining the kinds of nodes that are allowed for use.
For complex specifications, the program size (and therefore AST size) can grow arbitrarily large if encapsulation is not used, which makes it harder to compare two given programs. Our program DAGs are designed to benefit from encapsulation: each node can encapsulate a “black-box” process of arbitrary complexity, and the attributes of the node are used to examine and control the behavior of the process.
The AST is a low-level representation of the program: it is used to ensure that the program source is syntactically valid, check for minor semantic errors (like dereferencing null pointers) and perform program optimizations. It is difficult to reason about the behavior of the program from looking at its AST representation. The DAG is a high-level representation of the flow-based program: we can infer the program’s general behavior from the DAG structure and the node types, and examine node attributes to understand or change the behavior of any component.
After a skill program has been synthesized by the intelligence system, it is evaluated by a scoring function. The program can be evaluated by its success on a special set of input-output pairs, or by match-based metrics. For a given task, match-based metrics compute similarity by comparing the structure of a generated program with a known reference program. This reference could be provided by a human, generated via a fixed set of rules, augmented from existing data, or synthesized by another intelligence system. The BLEU score bleu2002score can be used to compare the text of the two given programs, but it does not consider the structured syntax or the semantic features of the programs. CodeBLEU ren2020codebleu improves upon the BLEU score by comparing the abstract syntax trees (AST) and the semantic dataflow of the programs. While match-based metrics do not need to run the generated program for evaluation, they are affected by program aliasing. Recent synthesis methods kulal2019spoc, lachaux2020unsupervised, copilot evaluate programs via functional correctness, wherein a generated program is considered correct if it passes a set of unit tests (unit tests include a set of known input-output pairs, but may also contain collections of inputs that together test for the presence of certain properties such as types of failures in a given program). Functional correctness is useful because it is similar to how humans evaluate programs written by each other, but it requires running the generated program to obtain a score.
In Subsection 2.3 we noted that flow-based programs can be constrained to minimize program aliasing, and that their DAG representations are easier to compare than ASTs (Subsubsection 2.3.2). Hence for our scoring function, we use a match-based divergence metric to compare the DAG of the generated program with a known reference.
Consider a known reference skill program and a skill program generated by the intelligence system.
The reference program is denoted by a DAG with its own set of vertices and edges.
The generated program is likewise denoted by a DAG with its own set of vertices and edges.
Given the two programs, the divergence metric accepts their DAGs as input and is constrained as follows:
Consider the simple case where the two DAGs each contain only one attributed node and zero edges. We can compare the attributes of the single node in the generated DAG with those of the single node in the reference DAG to obtain a node similarity value.
If the two nodes have different types, the node similarity takes its minimum value.
If both nodes have the same type, they will have the same attribute keys. Node attribute values follow a binary comparison: each value is either equal in both nodes or it is not. The node similarity is then computed from the number of attributes for the given node type and the number of attribute values that are equal for both nodes.
If both nodes have the same type, and all node properties are equal, the node similarity takes its maximum value.
Thus, when both DAGs contain only one node and zero edges, the divergence metric is defined in terms of this node similarity value.
We can see that the value of the metric follows the constraints given in Equation 1 for the simple case.
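The single-node case can be sketched as follows. The exact formulas are not available in this text, so the sketch assumes that node similarity is the fraction of attribute values that match (0 for different types, 1 for identical nodes) and that divergence is one minus similarity. Both are illustrative assumptions, and the function names are hypothetical.

```python
def node_similarity(a, b):
    """Similarity in [0, 1] between two attributed nodes given as dicts with a 'type' key."""
    if a["type"] != b["type"]:
        return 0.0  # assumption: nodes of different types share nothing
    keys = [k for k in a if k != "type"]
    if not keys:
        return 1.0  # same type and no attributes left to compare
    # Binary comparison per attribute: a value either matches or it does not.
    equal = sum(1 for k in keys if k in b and a[k] == b[k])
    return equal / len(keys)

def single_node_divergence(a, b):
    """Assumed divergence for one-node DAGs: low when the nodes are similar."""
    return 1.0 - node_similarity(a, b)

reference = {"type": "submit-order", "tickername": "TSLA", "quantity": 1}
generated = {"type": "submit-order", "tickername": "AAPL", "quantity": 1}
print(single_node_divergence(generated, reference))
```

Here `quantity` is a made-up attribute used only to show a partial match.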
When computing the value of the metric between two arbitrary DAGs, we would need to consider the common substructure between the DAGs in addition to the node-wise similarity values. The value of the metric should be low when the DAGs have highly similar structure, so we compute it by finding the largest subgraph common to both DAGs. This is known as the maximum common edge subgraph problem (MCES) bokhari1981mapping, an extension of the subgraph isomorphism problem ullmann1976algorithm. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic to each other if there exists a bijection $f: V_1 \to V_2$ that preserves the graph structure:
$$ (u, v) \in E_1 \iff (f(u), f(v)) \in E_2 \quad \text{for all } u, v \in V_1 $$
We need to find a subgraph of each DAG such that the two subgraphs are isomorphic to each other. This requires that the two subgraphs have the same number of vertices, and that there is a bijection between the subgraphs that satisfies Equation 3. The two subgraphs also need to be the largest common subgraphs.
We obtain the two common subgraphs and the bijective mapping between them by constructing an association graph between the nodes of the two DAGs, and finding a maximum clique in it barrow1976subgraph, kozen1978clique. Appendix C explains the process of obtaining the metric in detail, including the rules for construction of the association graph and ensuring the properties of the metric from Equation 1 are satisfied. Figure 4 provides a visual overview of the process. Once we obtain the mapping, we calculate the value of the metric using each node in one common subgraph and its image under the mapping in the other.
While the metric has exponential complexity due to the use of subgraph isomorphism, in practice its computation is sped up, as Equation 11 reduces the number of vertices in the association graph and Equation 12 tends to produce sparser graphs. As the metric does not require the execution of the generated programs, it can be used in the training loop of supervised or reinforcement learning methods.
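The association-graph idea can be sketched in a simplified form. The version below pairs only nodes of the same type and requires paired nodes to agree on the presence of directed edges, which yields a maximum common induced subgraph; the paper's own construction rules (Appendix C, Equations 11 and 12) are not reproduced here, so this is an assumption-laden stand-in and not the authors' algorithm. The node type `send-email` is hypothetical.

```python
from itertools import combinations

def association_graph(types_a, edges_a, types_b, edges_b):
    """Vertices are pairs (u, v) of same-type nodes; two pairs are adjacent when
    mapping u1->v1 and u2->v2 preserves edges in both directions."""
    vertices = [(u, v) for u in types_a for v in types_b if types_a[u] == types_b[v]]
    adjacency = {p: set() for p in vertices}
    for (u1, v1), (u2, v2) in combinations(vertices, 2):
        if u1 == u2 or v1 == v2:
            continue  # a bijection cannot reuse a node
        forward = ((u1, u2) in edges_a) == ((v1, v2) in edges_b)
        backward = ((u2, u1) in edges_a) == ((v2, v1) in edges_b)
        if forward and backward:
            adjacency[(u1, v1)].add((u2, v2))
            adjacency[(u2, v2)].add((u1, v1))
    return adjacency

def maximum_clique(adjacency):
    """Bron-Kerbosch with pivoting; returns one largest clique as a set of pairs."""
    best = set()

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = set(clique)
            return
        pivot = max(candidates | excluded, key=lambda p: len(adjacency[p] & candidates))
        for vertex in list(candidates - adjacency[pivot]):
            expand(clique | {vertex}, candidates & adjacency[vertex], excluded & adjacency[vertex])
            candidates = candidates - {vertex}
            excluded = excluded | {vertex}

    expand(set(), set(adjacency), set())
    return best

ref_types = {"a1": "schedule-trigger", "a2": "submit-order"}
ref_edges = {("a1", "a2")}
gen_types = {"b1": "schedule-trigger", "b2": "submit-order", "b3": "send-email"}
gen_edges = {("b1", "b2"), ("b2", "b3")}

mapping = dict(maximum_clique(association_graph(ref_types, ref_edges, gen_types, gen_edges)))
print(mapping)
```

Each pair in the returned clique is one entry of the bijection between the common subgraphs, so node-wise similarity values can then be accumulated over `mapping.items()`.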
With the match-based DAG divergence metric, we now have a scoring function to compare a generated flow-based program with a reference program. Flow-based programs can be expressed in JSON, so a JSON parser can be used to ensure valid syntax. The DAGs constructed from the valid JSON can then be input to the metric to compare the semantic dataflow of the programs. Since the metric enables the comparison of any two flow-based programs, it can be extended to measure the dissimilarity of a program from a given set of programs, and more generally measure the dissimilarity between two sets of programs, by computing all necessary pairwise comparisons.
The values of the metric are based on the structural differences between the DAGs and consider degrees of errors in the generated program:
Syntax Errors: The generated program has incorrect JSON syntax, resulting in a reduced or incomplete DAG after parsing.
Function Errors: Some nodes are incorrectly specified or missing from the generated DAG. If a node in the generated DAG has only one differing attribute compared to the reference, this is reflected in the metric when computing the node similarity.
Dataflow Errors: The generated program has the same nodes as the reference, but has different edges, denoting different semantics.
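To make the three error categories concrete, the sketch below labels a generated program against a reference. It is a coarse illustration only: it assumes the hypothetical `id`/`wires` JSON schema and identical node ids in both programs, and it returns a label instead of the graded divergence value that the metric produces (the metric can still score the reduced DAG left after a syntax error).

```python
import json

REFERENCE = json.dumps([
    {"id": "n1", "type": "schedule-trigger", "frequency": 10, "wires": ["n2"]},
    {"id": "n2", "type": "submit-order", "tickername": "TSLA", "wires": []},
])

def classify(generated_text, reference_text):
    """Return 'syntax', 'function', 'dataflow', or 'match'."""
    try:
        generated = json.loads(generated_text)
    except json.JSONDecodeError:
        return "syntax"  # invalid JSON: the DAG cannot be fully recovered
    reference = json.loads(reference_text)

    def attributes(flow):
        # Node attributes without the wiring information.
        return {n["id"]: {k: v for k, v in n.items() if k != "wires"} for n in flow}

    def wiring(flow):
        return {(n["id"], target) for n in flow for target in n.get("wires", [])}

    if attributes(generated) != attributes(reference):
        return "function"  # a node is missing or has a differing attribute
    if wiring(generated) != wiring(reference):
        return "dataflow"  # same nodes, different edges
    return "match"

print(classify(REFERENCE[:-5], REFERENCE))
```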
Humans have the capability to generalize; that is, they are able to use past learning experience to navigate new situations in the present. A machine intelligence would need to possess similar capabilities to deal with new or unseen task specifications. In machine learning, generalization deals with a model’s performance on previously unseen data samples that are similar to the distribution of the training set of the model. A model is said to have overfit the training data if it predicts well on the training data, but fails to predict on new unseen samples. In deep learning, it is common to use some regularization techniques to avoid overfitting.
In the context of the g-index and our experimental setup, we focus on the generalization difficulty of a specified task. For a given task, chollet2019 informally defines generalization difficulty as “a measure of how much the shortest training-time solution program needs to be edited in order to become an appropriate evaluation-time solution program”, and relies on Relative Algorithmic Complexity to compute the edit difference between the programs. We rephrase this statement for our experimental setup as follows: Suppose an intelligence system, trained on a curriculum of tasks, is given a new task. Given an appropriate evaluation-time program that solves this task, how much does it differ from the training-time solution programs for the curriculum tasks?
Since the skill programs generated by the intelligence system in our experimental setup are flow-based programs, they can be expressed as DAGs. Therefore, given an optimal flow-based program that solves the task, we can use the metric defined in Subsection 2.4 to quantify the difference between this program and the programs generated for tasks in the curriculum by a nearest neighbor search. Thus, we can define the domain distance of a single task for an intelligence system trained on a curriculum of tasks as:
Note that the domain distance is bounded. If it is zero, it means that the task can be found in the training set, and hence no generalization is required.
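A minimal sketch of the nearest neighbor idea follows. It assumes that the domain distance is the smallest divergence between the task's program and any curriculum program, which matches the observation that a task already in the training set needs no generalization. The `toy_divergence` function is a stand-in (Jaccard distance over node types), not the DAG divergence metric, and the node type names other than `schedule-trigger` and `submit-order` are made up.

```python
def domain_distance(task_program, curriculum_programs, divergence):
    """Assumed form: divergence to the nearest curriculum program."""
    return min(divergence(task_program, p) for p in curriculum_programs)

def toy_divergence(a, b):
    """Stand-in metric in [0, 1]: Jaccard distance between sets of node types."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

curriculum = [
    ["schedule-trigger", "submit-order"],
    ["http-request", "parse-json"],
]

seen = domain_distance(["schedule-trigger", "submit-order"], curriculum, toy_divergence)
unseen = domain_distance(["schedule-trigger", "send-email"], curriculum, toy_divergence)
print(seen, unseen)
```

Because the divergence function is passed in as a parameter, the same search works unchanged with any metric that compares two programs.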
If the curriculum is too large, the value can be approximated by clustering its elements into task domains using the divergence metric: two tasks within the same domain are “closer” to each other (they have a lower divergence) than tasks in different domains. Figure 5 shows an example of task domains and their related divergence scores.
The definition of the domain distance in Equation 6 enables the computation of generalization difficulty within the training-set/test-set paradigm that is commonly used for training machine learning models. By computing it, a model’s performance on an unseen dataset can be weighted in the context of the generalization difficulty of the tasks in the set.
We now define the formula for computing the g-index based on the experimental setup defined in Section 2. First, we describe the necessary variables:
An intelligence system generates a flow-based skill program when given the specification for a task. This formulation is method-agnostic, and allows the g-index benchmark to apply not only to the deep learning-based systems of today, but also to be extended to any new techniques in the future, by plugging in the variables measured below.
The intelligence system is trained on a curriculum, consisting of tasks and their associated reference skill programs. We expect that the ideal intelligence system would also be the most sample-efficient in terms of training: it would learn to solve a large variety of tasks from just a single demonstration. Hence the value of the g-index should decrease as the number of training samples increases.
A curriculum domain refers to a subset of the curriculum where all tasks belong to the same task domain. We expect that an ideal intelligence system would be able to generalize from the least number of tasks provided per domain. So, the value of the g-index for the system should decrease as more tasks are provided for a given domain. We define the weight of considering curriculum tasks from a given domain as:
The priors of the system refer to knowledge built into the system before any training. Examples of priors include previously learnt weights, neural network architectures, data pre-processing techniques, program optimizations etc. We expect that the ideal intelligence system would be able to generalize from having the least amount of built-in priors. Hence the value of the g-index should decrease as more priors are encoded into the system. For our experiments, we consider the value of the priors as a fixed constant, but we expect this to change as more complex systems are evaluated.
When training the intelligence system on a curriculum, it is important to measure the experience of the system with that curriculum. The units of measure for compute power are FLOPS (Floating Point Operations Per Second) or MIPS (Million Instructions Per Second), which are reflective of the rate at which the system is exposed to the data. We expect that the ideal intelligence system would be one that exhibits high performance after being trained for the least amount of time, with the least amount of compute power. So, the value of the g-index should decrease as the intelligence system is trained for longer and on larger amounts of compute power. We define the experience of the system for a given curriculum in terms of the compute power used for training on the curriculum (measured in teraFLOPS) multiplied by the amount of time the system was trained on it (measured in seconds).
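The experience term described above is a simple product, sketched below. The function name and the sample figures are hypothetical and serve only to show the units involved.

```python
def experience(compute_teraflops, training_seconds):
    """Compute power (teraFLOPS) multiplied by training time (seconds)."""
    return compute_teraflops * training_seconds

# Two hypothetical training runs on the same curriculum.
small_run = experience(compute_teraflops=10.0, training_seconds=3600)
large_run = experience(compute_teraflops=100.0, training_seconds=36000)
print(small_run, large_run)
```

All else being equal, the run with the larger experience value would be expected to receive the lower g-index, since the benchmark penalises compute and training time.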
A scoring function evaluates the skill program for the task to measure the performance of the intelligence system. We expect that an ideal intelligence system would produce a perfect skill program for the task. So, the value of the g-index for the system should increase as its performance on the scoring function improves. We use the divergence metric from Subsection 2.4 and compute the difference of the generated skill program from a known reference program to calculate the performance of the system on the task as:
When testing the capabilities of the system after training on a curriculum, we wish to see how it performs on tasks of varying dissimilarity from that curriculum. We expect that the ideal intelligence system would be able to generalize to tasks that are highly dissimilar from those on which it was trained. Hence the value of the g-index should increase non-linearly if the system performs well on tasks of increasing generalization difficulty. We define the generalization difficulty of a task for a system trained on a curriculum using the domain distance defined in Equation 6 and the exponential function:
We use a set of tasks to evaluate an intelligence system trained on a curriculum. The task contribution of a single task to the g-index is computed using the performance, the generalization difficulty, the priors, and the experience:
The constants and component functions used to calculate each variable’s impact are determined by trends seen during experimentation. Thus we obtain the formula for the g-index by averaging over the evaluations for the set of tasks:
If the skill of an intelligence system in a particular domain is defined as its ability to consistently generate a set of instructions (or programs) to solve tasks in that domain, the ideal benchmark should aim to measure the efficiency of acquiring new skills. It should penalise the amount of data and compute power required to train the system, and should reward the performance of the system on tasks of increasing generalization difficulty. If an intelligence system trained using the least amount of training data and compute power obtains the highest performance on tasks of high generalization difficulty, it should have the highest score on this benchmark. Keeping these constraints in mind, we observe the responsiveness of the g-index benchmark to variations in the number of training samples (Figure 6), compute power (Figure 7), performance and generalization difficulty (Figure 8) by running simulations across these variables.
We see that the value of the g-index decreases with an increase in the number of training samples (Figure 6), decreases with an increase in compute usage (Figure 7), and increases with an increase in performance and generalization difficulty (Figure 8), which makes it useful for measuring skill-acquisition efficiency.
From our definition of the g-index, we reason that general intelligence is not a binary property that a system either possesses or lacks, but is better described as a continuous spectrum. Where an intelligence system lies on this spectrum depends on the following factors of the evaluation setup:
The diversity of the evaluation scope of tasks: whether or not they lie within similar domains, where all tasks have a low divergence score relative to each other.
The generalization difficulty of tasks with respect to curriculum C, or how different the tasks in the test scope are from the curriculum it has seen. We use the distance score to refer to this.
The efficiency with which a system converts its priors, experience with curriculum to a high performance on the task scope
We qualitatively categorise intelligence systems into four levels of generalization power based on the properties considered above. Each is harder to achieve than the previous one, and the formulation of the g-index makes it difficult to brute-force a higher score by utilising unlimited amounts of priors, data and compute. The aim of these levels is to aid subjective discussions about the strengths and weaknesses of different approaches to build intelligence systems. We note that these levels merely represent approximate demarcations of generalization difficulty; as the measurement of general intelligence systems becomes more refined, these demarcations may change, or a new categorisation of intelligence systems may be formulated.
Level 0 - No generalization. L0 broadly describes a system which encounters zero uncertainty in the tasks on which it is evaluated. This is because all edge cases are handled internally via hard-coded rules and the system does not act upon any learned heuristic. For example, a sorting algorithm that outputs a sorted array every time in a rule-based manner, or a chess playing algorithm that iterates through all possible moves to win a game of chess, can be said to not display any generalization.
Level 1 - Generalization to known unknowns in known domains. An L1 general intelligence is able to generalize across predictable amounts of uncertainty within a set of related tasks in the same domain. Consider a set of task specifications of the form “Buy stock every hours”. The variables and here are ‘known unknowns’. In the program DAG for any of these tasks, the variable maps to the tickername attribute of the submit-order node type and the variable maps to the frequency attribute of the schedule-trigger node type, both of which are wired together (see Appendix B). The degree of uncertainty is only in the attributes of the nodes submit-order and schedule-trigger - there are no new nodes or wires that require adapting to, hence the generalization difficulty is low ( ). If an intelligence system trained only on program DAG samples with different combinations of and , or tasks of the form “Buy TSLA stock every 10 hours” and “Every 20 hours, buy AAPL stock” is able to learn to generate the correct program DAG for unknown combinations of and , then it can be said to have reached L1 generalization. Current deep learning-based approaches are reasonably successful at attaining this degree of generalization power.
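The L1 case above can be illustrated with a minimal sketch. The node types (schedule-trigger, submit-order) and attribute names (frequency, tickername) come from the text; the `Node` and `ProgramDAG` classes, the helper names, and the wire direction (trigger feeding the order) are assumptions made only for illustration and are not the benchmark's actual program format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    type: str
    attrs: dict = field(default_factory=dict)


@dataclass
class ProgramDAG:
    nodes: list  # list of Node
    wires: list  # list of (source_id, target_id) pairs


def buy_stock_program(tickername, frequency_hours):
    """Hypothetical DAG for 'Buy <ticker> stock every <n> hours'."""
    trigger = Node("n1", "schedule-trigger", {"frequency": frequency_hours})
    order = Node("n2", "submit-order", {"tickername": tickername})
    # Assumed direction: the trigger fires the order.
    return ProgramDAG(nodes=[trigger, order], wires=[("n1", "n2")])


def structure(program):
    """Node types and wires only, ignoring attribute values."""
    return [n.type for n in program.nodes], list(program.wires)


def attributes(program):
    """Attribute values keyed by node type."""
    return {n.type: n.attrs for n in program.nodes}


seen = buy_stock_program("TSLA", 10)    # in the training curriculum
unseen = buy_stock_program("AAPL", 20)  # unseen combination at test time

# Only attribute values differ; no new nodes or wires are needed.
same_structure = structure(seen) == structure(unseen)
different_attrs = attributes(seen) != attributes(unseen)
```

Under these assumptions, the two programs share their nodes and wires and differ only in attribute values, which is why the generalization difficulty of such tasks is low.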
Level 2 - Generalization to unknown unknowns within known domains. An L2 intelligence system is able to generalize to unknown amounts of uncertainty, but within similar task domains. For example, an intelligence system that has been shown two different DAG programs in its training curriculum, one for the task specification “Search for query on Google” and another for “Save a list of items to file”, may be able to successfully generate a program DAG for the new task specification “Search for a query on Google and save results to file”. This task has mid-range generalization difficulty ( ) because the system has to learn how to combine two different program DAGs it has seen before. Unlike L1, the uncertainty here is not just with the attributes of nodes, but also with the extra nodes and wires needed to solve the task. This sort of composite program synthesis within known domains can be said to be an outcome of L2 generalization. While some deep learning systems are occasionally able to demonstrate L2 generalization when exposed to large amounts of data and compute power, we expect that sample-efficient methods of reaching L2 that score high on the g-index will require entirely new approaches to the problem.
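A rough sketch of what combining two known programs might involve is given below. The dictionary layout, the `compose` helper, and the specific nodes chosen for each program are hypothetical (the node type names are borrowed from the node tables in the appendix, but the real programs for these tasks may look different). The sketch assumes node ids are unique across the two programs.

```python
def compose(first, second):
    """Chain two program DAGs by wiring each sink of `first`
    to each source of `second` (an assumed composition rule)."""
    nodes = {**first["nodes"], **second["nodes"]}
    wires = set(first["wires"]) | set(second["wires"])
    # Sinks have no outgoing wire; sources have no incoming wire.
    sinks = set(first["nodes"]) - {src for src, _ in first["wires"]}
    sources = set(second["nodes"]) - {dst for _, dst in second["wires"]}
    wires |= {(s, t) for s in sinks for t in sources}
    return {"nodes": nodes, "wires": wires}


# Hypothetical program for "Search for query on Google".
search = {
    "nodes": {"s1": "inject", "s2": "open", "s3": "scrape"},
    "wires": {("s1", "s2"), ("s2", "s3")},
}
# Hypothetical program for "Save a list of items to file".
save = {
    "nodes": {"f1": "csv", "f2": "file"},
    "wires": {("f1", "f2")},
}

combined = compose(search, save)
# Wires that appear in neither of the programs seen during training.
new_wires = combined["wires"] - (search["wires"] | save["wires"])
```

Here `new_wires` contains the single connection from the scrape node to the csv node. It is this kind of structural novelty, beyond changed attribute values, that separates L2 from L1.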
Level 3 - Generalization to unknown unknowns across unknown domains. An L3 intelligence system is able to adapt to arbitrary amounts of uncertainty in the tasks on which it is evaluated. This is the most challenging level of generalization because it requires the system to perform well on tasks with high generalization difficulty (), i.e. the nodes and wire connections in the program DAG required to solve a task are highly dissimilar to any task it has seen before in its training curriculum. For example, if an intelligence system that is shown only web navigation tasks of the form “Summarize page from Wikipedia.com” is asked to “Learn how to drive a Toyota on a given city street”, it has to find novel ways to convert its experience in one domain into mastery in another completely unknown domain. For instance, it could do this by using its web navigation skills to watch an online city-driving video tutorial, create a web-based simulation sandbox of the street with virtual car controls, program new node types to interface with the controls on a real car, and then generate a program DAG to drive a car down a street. Current learning methods are insufficiently equipped to create or scale up to such an L3 intelligence system. We expect new methods will need to emerge which incorporate high sample-efficiency, embodied agency, long-term memory and elements of self-improvement.
In this section, we compute the values of the g-index and its components for some well-known large language models. We use a small dataset of text prompts and their associated flow-based programs to finetune transformer-based models before measuring their g-index scores. We construct a small dataset of real-world tasks from 16 task domains to train the models. The task domains are described in Appendix A. A sample of the dataset is available at https://github.com/mayahq/g-index-benchmark.
We finetune four transformer models: GPT2-345M, GPT2-774M, and GPT2-1.5B from radford2019language, and GPT-Neo-2.7B from gpt-neo, gao2020pile. We use the HuggingFace implementations huggingface of the transformer models in the experiments.
With the current set of experiments, we aim to measure skill-acquisition efficiency via the g-index with tasks of low generalization difficulty. The average domain distance between the training set and test sets across all experiments is 0.09. The training samples range from 640 to 10240 across all the experiments. In every experiment, the training samples were distributed equally across all 16 task domains. After training, the models are tested with 5 unseen samples per task to obtain their average performance. In every experiment, the number of training epochs was held constant at 30. When synthesizing the programs, we hold the temperature of the models at a constant 0.7, and allow only one attempt at synthesis. We expect more attempts for a given task specification will yield better performance copilot. We examine the following relationships:
average performance vs program size (Figure 9): The programs generated for each task are of different sizes. The size of the program (number of characters in the program text) affects how easily it can be generated. For instance, the number of tokens transformer models can generate is bounded by their context window. We expect model performance to falter as the size of the program to generate increases.
skill levels vs program size (Figure 10): The skill of an intelligence system is its ability to consistently generate correct programs to solve tasks in a given domain. In addition to performance, a measure of the system’s skill in a particular domain helps contextualize its potential for real-world use. When choosing an intelligence system to deploy in real-world tasks, we would prefer a system that generates correct programs more often.
average performance vs number of training samples (Figure 11): The g-index penalizes increments in training samples, but rewards improvements in performance. We expect that the ideal intelligence system with high g-index values would occur at an optimal tradeoff point between these two quantities.
average performance vs compute used (Figure 12): The g-index penalizes high compute usage, but rewards improvements in performance. Compute is measured in terms of available compute power and training time, so systems that use multiple processing units to reduce time are penalized accurately. We expect that the ideal intelligence system with high g-index values would occur at an optimal tradeoff point between compute usage and performance.
|Model Name||# Training Samples||Compute Used||g-index|
Different intelligence systems may obtain the best measurement at any individual component of the g-index. However, the best-performing system may not be the most resource-efficient, and vice-versa. The ideal system would be one that has the right combination of priors, experience, sample-efficiency, and maximal performance. Table 1 shows the top 8 models according to their g-index values, along with the models’ best component scores.
In this paper, we describe an experiment framework to obtain a quantitative measurement of skill-acquisition efficiency of machine intelligence systems. We model the intelligence system to accept a wide range of real-world tasks specified in natural language and synthesize programs representable as directed acyclic graphs in a flow-based programming syntax. We define a match-based metric to compare the DAG structures of two given programs to score the performance of the system, and use this metric to measure the generalization difficulty of any task provided to the system. We formulate the g-index benchmark and show that its changes with respect to dataset size, available compute, and performance align with intuitive expectations of an intelligence system with high generalization power. We then measure and compare the g-index scores of some fine-tuned transformer models and estimate their suitability for general-purpose intelligence systems.
While the g-index benchmark shifts the evaluation of intelligence systems into a quantitative context akin to skill-based evaluations, it is not yet a complete measure of general machine intelligence. However, we believe that future evaluations of intelligence systems will require a similar framework, one that reflects the potential real-world use of such systems. Over the course of our experiments, we have found some possible directions for improving the g-index measurement. We describe these possibilities and their effects below.
The evaluation framework can be improved to represent more real-world use cases for a machine intelligence system. The task specification can be expanded to include different natural languages, audio, video, and input-output examples. The task may even be specified in parts across an interaction between a human and the system austin2021program. Flow-based programs face limitations with maintaining state, so there is potential for improving the language of synthesized programs to account for multi-stage tasks. It is also possible to create more nodes with new functionality, enabling the construction of larger, diverse datasets of task specifications and their associated programs. With larger datasets, better data augmentation techniques can be built to generate reference programs to evaluate tasks. As more nodes are designed and flow-based programs grow larger, the subgraph isomorphism computation with may slow down, and the chance of program aliasing issues may also increase. In such cases, we may also need to use functional correctness like copilot to score the programs synthesized by the system.
With a given set of tasks, the components of the g-index such as compute and domain distance can be refined by testing with a wider variety of intelligence systems, to ensure that the g-index value accurately represents their capabilities. Additionally, since the intelligence systems in our framework are evaluated on real-world tasks, human feedback can be incorporated when attempting to understand or improve the calculation of these components.
After calculating the scores for a wide range of systems, the g-index formula would need to be updated with additional specifications. If a system’s g-index score is unnaturally high, either the system actually exhibits high skill-acquisition efficiency, or the formula contains some poorly specified variables which unfairly portray the system’s ability. For example, we consider the weightage for the priors encoded in the system ( in Equation 8) to be a constant negligible value for the current set of experiments that use transformer models fine-tuned from pre-trained weights. When comparing the g-index of a fine-tuned model to that of a model trained from scratch, the priors/compute tradeoff would be different, but it is not clear how such differences can be measured. Going further, it is difficult to quantify the benefit of priors like hyperparameters, data preprocessing, or hardcoded rules, in a manner that translates fairly across different kinds of intelligence systems. Perhaps the weight of some priors can be calculated by comparing the system’s performance with and without the prior:
The arrangement of Equation 10 along with the levels of generalization discussed in Subsection 3.3 leads to an interesting question. When comparing machine intelligence with human intelligence, given that human beings have had thousands of years to build priors, should human priors have a weight of ? Is it necessary to compare the built-in priors of humans with that of machines?
The overarching aim of this work is not only to propose a method-agnostic way to compare different techniques quantitatively, but also to spark a conversation on the relative merits of different approaches that could help reach higher levels of general machine intelligence. While we don’t expect our g-index definition to be complete, it anchors a previously abstract concept in a mathematical formulation made up of quantities that can be measured during experiments. The flow-based programming language we propose can act as a common language to express and compare programs of real-world utility across a wide variety of domains. The g-index explicitly rewards resource-efficiency, which we hope incentivises sustainable ways of achieving generalization which do not rely on unlimited amounts of compute and data. A good outcome of this would be the patronage and competitive development of new methods and technologies to reach higher g-index levels, similar in spirit to the Hutter compression challenge hutter-prize. Ultimately, our belief is that intelligent machines will only be able to contribute to real technological progress when they can learn how to do more from less, not the other way around.
|Domain Name||Example Text Prompt||Nodes||Program size (chars)||Best Avg Performance||Skill Level **|
|template-slider||Two sliders with values ranging from 0 to 20 changing in steps of 2||6||1493||1.00||1.00|
|template-form||Create a form with fields for entering Name, Benchmark Score and Date of Submission||5||1473||1.00||1.00|
|cron-reminder||Set a reminder for ’Send Daily Digest’ every first day of the week, Tuesday through Saturday, only in January||3||935||1.00||1.00|
|cron-schedule||Repeat At 48, 9, 14, 7, 6, 56, 46, 39, 15, 3, 37, and 30 minutes past the hour, between 05:00 AM and 08:59 PM, on day 1,4,8 and 9 of the month, only on Tuesday, every 3 months, November through December||3||896||1.00||1.00|
| ||On Facebook, when user says ’Hello’, reply with ’Hi there! How can I help you?’, and when user says ’Bye’, reply with ’Goodbye!’||8||2253||1.00||1.00|
|gmail-send||Send email with body ’Upcoming Meeting’ and subject ’Discuss paper appendices’ to email email@example.com||12||4708||0.63||0.00|
|google-search||Search Google for ’How to make tables on LaTeX’ and scrape results||9||2963||0.60||0.0|
|googlesearch-to-csv||Search Google for ”Papers on measuring general intelligence”, scrape results and put into singularity.csv||12||3894||0.59||0.0|
|http||Create a HTTP POST endpoint called /agents||4||857||1.0||1.0|
|telegram-2-reply||Reply ’Yes, detective?’ to ’Sonny!’, and ’Of course, Dave’, to ’Open the pod bay doors’ on Telegram||8||2148||1.0||1.0|
|telegram-3-reply||On Telegram, when user says ’Are friends electric?’, reply with ’No, only sheep’, when user says ’How deep is your love?’ reply with ’6.5 meters’, and when user says ’Is that all there is?’, reply with ’Yes’||10||2910||1.0||1.0|
| ||Obtain tweets about the topic #Alignment||4||892||1.0||1.0|
|url-skill||Create a skill called ’Open LessWrong’ which opens url https://www.lesswrong.com/||4||1067||1.0||1.0|
|youtube-pause||Pause Youtube video||3||919||1.0||1.0|
|youtube-play||Find and play Vivaldi Four Seasons on Youtube||9||3050||0.89||0.0|
|youtube-resume||Resume Youtube Video||3||916||1.0||1.0|
|common utility||Custom triggers, catch bugs, add comments||inject, debug, complete, catch, status, link in, link out, comment|
|functional||Change, switch, filter or delay the passed message object or add custom logic to manipulate it||function, switch, change, range, template, delay, trigger, exec, filter|
|network||Different kinds of network interfaces to send and receive data||mqtt in, mqtt out, http in, http response, http request, websocket in, websocket out, tcp in, tcp out, tcp request, udp in, udp out|
|sequence||Manipulate sequences and arrays in predictable ways||split, join, sort, batch|
|parser||Parse data from different files into fixed formats for easy processing||csv, html, json, xml, yaml|
|storage||Read and write to files||file, file in, watch|
|dashboard||Make dynamic dashboards with forms and charts by linking UI elements to other pieces of logic||button, dropdown, switch, slider, numeric, text input, date picker, colour picker, form, text, gauge, chart, audio out, notification|
|browser-automation||Interact with the browser to navigate the web and scrape websites||open, click, type, press, execute function, find tab, scrape, query, bookmark|
|spotify-automation||Integrate with Spotify and control music played on any device||play, search, control playback, get playback state, control playlist|
|gdrive-automation||Search through, read, export and append to files on your Google Drive||search gdrive, gsheet append, gdrive-export-file|
|scheduling||Schedule triggers to run events at any fixed interval||schedule-trigger|
|zoom-automation||Create, view and attend Zoom meetings||create-meeting, list-meetings, list-meetings-registrants|
|system utilities||Interact with various system level utilities on the desktop||clipboard-add, clipboard-get, open-target, file-search, desktop-notify|
|stock-automation||Buy, sell and view orders on the stock market via third party API||submit-order, get-order, get-bars, get-account|
|home-automation||Control lights and switches remotely||light-control, switch-control|
When computing the value of between two arbitrary DAGs, we
use the similarity function to compare individual nodes, and
account for structural similarity by computing the largest subgraph common to both DAGs.
This is known as the maximum common edge subgraph problem, bokhari1981mapping,
an extension of the subgraph isomorphism problem ullmann1976algorithm. In this
section, we explain in detail the calculation of outlined in
Two graphs and are isomorphic to each other () if there exists a bijection that preserves the graph structure:
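In standard notation, with the symbol names assumed here for illustration, two directed graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic when a bijection $f$ between their node sets maps edges to edges and non-edges to non-edges:

```latex
G_1 \cong G_2 \iff \exists\, f : V_1 \to V_2 \text{ bijective, such that }
\forall\, u, v \in V_1 : \; (u, v) \in E_1 \iff \bigl(f(u), f(v)\bigr) \in E_2
```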
We need to find the largest subgraphs and that are isomorphic to each other (), and the isomorphism between the nodes of the two subgraphs. We obtain these by constructing the association graph and finding a maximum clique in . The association graph for given graphs , contains the vertices : a vertex associates a node to a node with their node similarity . Only nodes of the same type are considered for association:
An edge in connects a vertex to another vertex provided the below structure-preserving condition is satisfied (note the resemblance to Equation 3):
It can be shown that finding a maximum clique in the association graph is equivalent to finding the largest common subgraph between and barrow1976subgraph, kozen1978clique. (Note that the node similarities are used to filter out elements from and when computing the maximum clique; a node-weight maximum clique is computed.) Once a maximum clique in has been obtained, we can construct the common subgraphs and . Since nodes in the DAGs and cannot have self-loops, Equation 12 ensures that any clique in the association graph will always provide a one-to-one mapping between the corresponding vertices of and . If
is the maximum clique in the association graph , then
are the required largest common subgraphs. The maximum subgraph isomorphism is the mapping expressed via the pairs in the elements of of the clique .
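The association-graph construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `node_similarity` is a hypothetical stand-in for the node similarity function, the edge rule is the textbook structure-preserving condition (two associations are compatible when they map distinct nodes to distinct nodes and agree on whether a wire exists in each direction), which may differ in detail from Equation 12, and the clique search is a plain exhaustive branch-and-bound suitable only for small graphs.

```python
from itertools import combinations, product


def node_similarity(a, b):
    """Hypothetical similarity: fraction of attributes with equal values."""
    keys = set(a["attrs"]) | set(b["attrs"])
    if not keys:
        return 1.0
    return sum(a["attrs"].get(k) == b["attrs"].get(k) for k in keys) / len(keys)


def association_graph(nodes1, edges1, nodes2, edges2):
    # Vertices: pairs (u, v) of same-type nodes, weighted by similarity.
    weight = {
        (u, v): node_similarity(nodes1[u], nodes2[v])
        for u, v in product(nodes1, nodes2)
        if nodes1[u]["type"] == nodes2[v]["type"]
    }
    adj = {p: set() for p in weight}
    for (u1, v1), (u2, v2) in combinations(weight, 2):
        if u1 == u2 or v1 == v2:
            continue  # would break the one-to-one mapping
        forward = ((u1, u2) in edges1) == ((v1, v2) in edges2)
        backward = ((u2, u1) in edges1) == ((v2, v1) in edges2)
        if forward and backward:
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return weight, adj


def max_weight_clique(weight, adj):
    """Exhaustive search for a clique of maximum total node weight."""
    best, best_w = [], 0.0

    def expand(clique, cand, w):
        nonlocal best, best_w
        if w > best_w:
            best, best_w = clique, w
        if w + sum(weight[p] for p in cand) <= best_w:
            return  # bound: this branch cannot beat the best clique
        remaining = set(cand)
        for p in sorted(cand):
            remaining.discard(p)
            expand(clique + [p], remaining & adj[p], w + weight[p])

    expand([], set(weight), 0.0)
    return best, best_w


nodes1 = {
    "a1": {"type": "schedule-trigger", "attrs": {"frequency": 10}},
    "a2": {"type": "submit-order", "attrs": {"tickername": "TSLA"}},
}
edges1 = {("a1", "a2")}
nodes2 = {
    "b1": {"type": "schedule-trigger", "attrs": {"frequency": 10}},
    "b2": {"type": "submit-order", "attrs": {"tickername": "TSLA"}},
    "b3": {"type": "desktop-notify", "attrs": {}},
}
edges2 = {("b1", "b2"), ("b2", "b3")}

weight, adj = association_graph(nodes1, edges1, nodes2, edges2)
clique, total = max_weight_clique(weight, adj)
```

In this toy example the clique pairs a1 with b1 and a2 with b2, so the common subgraph is the trigger-to-order wire shared by both programs; the extra desktop-notify node in the second program has no same-type counterpart and is left unmatched.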
Thus with we can compute the value of the metric as provided in Equation 5:
is symmetric because the node similarities are symmetric, and finding the largest common subgraph between two graphs is also symmetric: since , we can construct a bijection between and , and use its inverse for the symmetric case.