LambdaNet: Probabilistic Type Inference using Graph Neural Networks

04/29/2020 ∙ by Jiayi Wei, et al. ∙ The University of Texas at Austin

As gradual typing becomes increasingly popular in languages like Python and TypeScript, there is a growing need to infer type annotations automatically. While type annotations help with tasks like code completion and static error catching, these annotations cannot be fully determined by compilers and are tedious to annotate by hand. This paper proposes a probabilistic type inference scheme for TypeScript based on a graph neural network. Our approach first uses lightweight source code analysis to generate a program abstraction called a type dependency graph, which links type variables with logical constraints as well as name and usage information. Given this program abstraction, we then use a graph neural network to propagate information between related type variables and eventually make type predictions. Our neural architecture can predict both standard types, like number or string, as well as user-defined types that have not been encountered during training. Our experimental results show that our approach outperforms prior work in this space by 14% (absolute) on library types, while having the ability to make type predictions that are out of scope for existing techniques.

1 Introduction

Dynamically typed languages like Python, Ruby, and JavaScript have gained enormous popularity over the last decade, yet their lack of a static type system comes with certain disadvantages in terms of maintainability (type-maintain), the ability to catch errors at compile time, and code completion support (type-or-not). Gradual typing can address these shortcomings: program variables have optional type annotations so that the type system can perform static type checking whenever possible (Siek2007GradualTF; chung2018kafka). Support for gradual typing now exists in many popular programming languages (understanding-typescript; vitousek2014design). However, because these languages make heavy use of dynamic language constructs and lack principal types (ancona2004principal), compilers cannot perform type inference using standard algorithms from the programming languages community (understanding-typescript; hindley-milner; pierce2000local), and manually adding type annotations to existing codebases is a tedious and error-prone task. As a result, legacy programs in these languages do not reap all the benefits of gradual typing.

To reduce the human effort involved in transitioning from untyped to statically typed code, this work focuses on a learning-based approach to automatically inferring likely type annotations for untyped (or partially typed) codebases. Specifically, we target TypeScript, a gradually typed variant of JavaScript for which plenty of training data is available in terms of type-annotated programs. While there has been some prior work on inferring type annotations for TypeScript using machine learning (DeepTyper; JSNice), prior work in this space has several shortcomings. First, inference is restricted to a finite dictionary of types that have been observed during training time; that is, these systems cannot predict any user-defined data types. Second, even without considering user-defined types, the accuracy of these systems is relatively low, with the current state-of-the-art achieving 56.9% accuracy for primitive/library types (DeepTyper). Finally, these techniques can produce inconsistent results in that they may predict different types for different token-level occurrences of the same variable.

In this paper, we propose a new probabilistic type inference algorithm for TypeScript to address these shortcomings using a graph neural network architecture (GNN) (GAT; Gated-GNN; mou2016convolutional). Our method uses lightweight source code analysis to transform the program into a new representation called a type dependency graph, where nodes represent type variables and labeled hyperedges encode relationships between them. In addition to expressing logical constraints (e.g., subtyping relations) as in traditional type inference, a type dependency graph also incorporates contextual hints involving naming and variable usage.

Given such a type dependency graph, our approach uses a GNN to compute a vector embedding for each type variable and then performs type prediction using a pointer-network-like architecture (vinyals_pointer_2015). The graph neural network itself requires handling a variety of hyperedge types (some with variable numbers of arguments), for which we define appropriate graph propagation operators. Our prediction layer compares the vector embedding of a type variable with vector representations of candidate types, allowing us to flexibly handle user-defined types that have not been observed during training. Moreover, our model predicts consistent type assignments by construction because it makes variable-level rather than token-level predictions.
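To make the prediction step concrete, the following is a minimal sketch of comparing a type variable's embedding against candidate type embeddings and normalizing the scores into a distribution. It assumes a plain dot product as the comparison function and uses hand-written three-dimensional vectors; the names `predictType` and `Candidate`, and all numeric values, are hypothetical and are not taken from the LambdaNet implementation.

```typescript
// Illustrative sketch only: embeddings are hand-written, not learned.
type Vec = number[];

interface Candidate {
  typeName: string;
  embedding: Vec;
}

function dot(a: Vec, b: Vec): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

// Numerically stable softmax over raw scores.
function softmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Score every candidate type against the type variable's embedding.
// The candidate list is supplied per project, so it may contain
// user-defined types that were never seen during training.
function predictType(
  variable: Vec,
  candidates: Candidate[]
): { typeName: string; probability: number }[] {
  const scores = candidates.map((c) => dot(variable, c.embedding));
  const probs = softmax(scores);
  return candidates.map((c, i) => ({ typeName: c.typeName, probability: probs[i] }));
}

const candidates: Candidate[] = [
  { typeName: "number", embedding: [1, 0, 0] },
  { typeName: "string", embedding: [0, 1, 0] },
  { typeName: "MyNetwork", embedding: [0, 0, 1] }, // assumed embedding of a user-defined class
];

// Assumed embedding of the type variable for `network`.
const distribution = predictType([0.1, 0.2, 2.0], candidates);
```

Because the candidates are passed in as data rather than fixed as output classes, adding a new user-defined type only requires computing one more embedding.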

We implemented our new architecture as a tool called LambdaNet and evaluated its performance on real-world TypeScript projects from GitHub. When predicting only library types, LambdaNet achieves a significant improvement in top-1 accuracy over DeepTyper. In terms of overall accuracy (including user-defined types), LambdaNet's top-1 accuracy is also higher, in absolute terms, than that of the TypeScript compiler.

Contributions.   This paper makes the following contributions: (1) We propose a probabilistic type inference algorithm for TypeScript that uses deep learning to make predictions from the type dependency graph representation of the program. (2) We describe a technique for computing vector embeddings of type variables using GNNs and propose a pointer-network-like method to predict user-defined types. (3) We experimentally evaluate our approach on hundreds of real-world TypeScript projects and show that our method significantly improves upon prior work.

2 Motivating Example and Problem Setting

Figure 1: A motivating example: Given an unannotated version of this TypeScript program, a traditional rule-based type inference algorithm cannot soundly deduce the true type annotations (shown in green).

Figure 1 shows a (type-annotated) TypeScript program. Our goal in this work is to infer the types shown in the figure, given an unannotated version of this code. We now justify various aspects of our solution using this example.

Typing constraints.   The use of certain functions/operators in Figure 1 imposes hard constraints on the types that can be assigned to program variables. For example, in the forward function, variables x, y must be assigned a type that supports a concat operation; hence, x, y could have types like string, array, or Tensor, but not, for example, boolean. This observation motivates us to incorporate typing constraints into our model.
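As a rough illustration of how such a hard constraint prunes the candidate set, the sketch below keeps only the types that provide every member used on a variable. The `members` table is a hypothetical, heavily abbreviated stand-in for real type definitions, and `compatibleTypes` is an illustrative helper, not part of our system.

```typescript
// Hypothetical, abbreviated member tables for a few candidate types.
const members: { [typeName: string]: string[] } = {
  string: ["concat", "length", "slice"],
  Array: ["concat", "length", "push"],
  Tensor: ["concat", "shape"],
  boolean: ["valueOf"],
};

// Keep only the candidate types that provide every member used on the variable.
function compatibleTypes(usedMembers: string[]): string[] {
  return Object.keys(members).filter((typeName) =>
    usedMembers.every((m) => members[typeName].indexOf(m) !== -1)
  );
}

// In `forward`, x and y are only used through `concat`.
const candidatesForX = compatibleTypes(["concat"]);
```

Here `candidatesForX` retains string, Array, and Tensor and excludes boolean, which mirrors the reasoning above; the constraint narrows the choice but does not pick a single type.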

Contextual hints.   Typing constraints are not always sufficient for determining the intended type of a variable. For example, for variable network in function restore, the typing constraints require network’s type to be a class with a field called time, but there can be many classes that have such an attribute (e.g., Date). However, the similarity between the variable name network and the class name MyNetwork hints that network might have type MyNetwork. Based on this belief, we can further propagate the return type of the library function readNumber (assuming we know it is number) to infer that the type of the time field in MyNetwork is likely to be number.

Need for type dependency graph.   There are many ways to view programs—e.g., as token sequences, abstract syntax trees, control flow graphs, etc. However, none of these representations is particularly helpful for inferring the most likely type annotations. Thus, our method uses static analysis to infer a set of predicates that are relevant to the type inference problem and represents these predicates using a program abstraction called the type dependency graph.

Handling user-defined types.   As mentioned in Section 1, prior techniques can only predict types seen during training. However, the code from Figure 1 defines its own class called MyNetwork and later uses a variable of type MyNetwork in the restore method. A successful model for this task therefore must dynamically make inferences about user-defined types based on their definitions.

2.1 Problem Setting

Our goal is to train a type inference model that can take as input an entirely (or partially) unannotated TypeScript project $g$ and output a probability distribution of types for each missing annotation. The prediction space is $\mathcal{Y}(g) = \mathcal{Y}_{lib} \cup \mathcal{Y}_{user}(g)$, where $\mathcal{Y}_{user}(g)$ is the set of all user-defined types (classes/interfaces) declared within $g$, and $\mathcal{Y}_{lib}$ is a fixed set of commonly-used library types.
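The per-project prediction space can be sketched as the union of a fixed library vocabulary and the types declared in the project. In the sketch below, the `libraryTypes` list is an assumed and abbreviated vocabulary, the `Checkpoint` interface is a made-up example, and declarations are found with a regular expression purely for brevity; a real tool would use a parser.

```typescript
// Assumed, abbreviated stand-in for the fixed library-type vocabulary.
const libraryTypes: string[] = ["number", "string", "boolean", "Array", "Function"];

// Collect names declared with `class` or `interface` in the project's sources.
// A regular expression is a simplification of real declaration analysis.
function userDefinedTypes(sources: string[]): string[] {
  const found: string[] = [];
  const pattern = /\b(?:class|interface)\s+([A-Za-z_$][\w$]*)/g;
  for (const source of sources) {
    pattern.lastIndex = 0; // restart the global regex for each file
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) {
      if (found.indexOf(match[1]) === -1) found.push(match[1]);
    }
  }
  return found;
}

// Prediction space: library types followed by the project's own types.
function predictionSpace(sources: string[]): string[] {
  const user = userDefinedTypes(sources).filter((t) => libraryTypes.indexOf(t) === -1);
  return libraryTypes.concat(user);
}

const projectSources = [
  "class MyNetwork { name; time; forward(x, y) { return x.concat(y); } }",
  "interface Checkpoint { path }", // hypothetical second declaration
];
const predictionTypes = predictionSpace(projectSources);
```

The library part is shared across projects, while the user-defined part changes with each input project, which is why a fixed output vocabulary is not sufficient.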

Following prior work in this space (DeepTyper; JSNice; xu2016python), we limit the scope of our prediction to non-polymorphic and non-function types. That is, we do not distinguish between types such as List<T>, List<number>, List<string> etc., and consider them all to be of type List. Similarly, we also collapse function types like number → string and string → string into a single type called Function. We leave the extension to predicting structured types as future work.
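The collapsing rule can be sketched as a small normalization function over annotation strings. This sketch assumes annotations are available as text in TypeScript's arrow syntax (for example `(x: number) => string`); the helper names `isFunctionType` and `collapseType` are illustrative only.

```typescript
// True if the annotation has an arrow "=>" outside of any brackets,
// i.e. the annotation as a whole is a function type.
function isFunctionType(text: string): boolean {
  let depth = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "=" && text[i + 1] === ">") {
      if (depth === 0) return true;
      i++; // skip the ">" of the arrow so it is not counted as a closing bracket
    } else if ("(<[{".indexOf(ch) !== -1) {
      depth++;
    } else if (")>]}".indexOf(ch) !== -1) {
      depth--;
    }
  }
  return false;
}

// Collapse an annotation into the coarse label used for prediction.
function collapseType(annotation: string): string {
  const text = annotation.trim();
  if (isFunctionType(text)) return "Function";
  // Drop type arguments: "List<number>" becomes "List".
  const angle = text.indexOf("<");
  return angle === -1 ? text : text.slice(0, angle).trim();
}

const collapsed = [
  "List<T>",
  "List<number>",
  "(x: number) => string",
  "(s: string) => string",
].map(collapseType);
```

Tracking bracket depth keeps a generic type whose argument is a function, such as `Array<(x: number) => string>`, labeled as Array instead of Function.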

Figure 2: An intermediate representation of the (unannotated version of the) program from Figure 1. The $\tau_i$ represent type variables, some of which are newly introduced for intermediate expressions.

Figure 3: Example hyperedges for Figure 2. Edge labels in gray (resp. red) are positional arguments (resp. identifiers). (A) The return statement at line 6 induces a subtype relationship between two type variables. (B) MyNetwork declares attributes name and time and method forward. (C) A type variable is associated with a variable whose name is restore. (D) The usage hyperedge for line 10 connects the type variables involved in the access to all classes with a time attribute.
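The four kinds of hyperedge illustrated in Figure 3 could be represented roughly as follows. This is a sketch under stated assumptions: the `Hyperedge` union, its field names, the numeric type-variable indices, and the second class (index 4) are all hypothetical, and the sketch covers only the edge kinds mentioned in the caption.

```typescript
type TypeVar = number; // index of a type variable

type Hyperedge =
  | { kind: "subtype"; sub: TypeVar; sup: TypeVar }
  | { kind: "object"; self: TypeVar; members: { [memberName: string]: TypeVar } }
  | { kind: "name"; node: TypeVar; identifier: string }
  | { kind: "usage"; receiver: TypeVar; result: TypeVar; member: string; candidates: TypeVar[] };

// All declared classes that have a member with the given name.
function classesWithMember(edges: Hyperedge[], member: string): TypeVar[] {
  const result: TypeVar[] = [];
  for (const edge of edges) {
    if (edge.kind === "object" && Object.prototype.hasOwnProperty.call(edge.members, member)) {
      result.push(edge.self);
    }
  }
  return result;
}

// Usage edge for an access such as `network.time`: it links the receiver
// and the result to every class that declares the accessed member.
function usageEdge(edges: Hyperedge[], receiver: TypeVar, result: TypeVar, member: string): Hyperedge {
  return { kind: "usage", receiver, result, member, candidates: classesWithMember(edges, member) };
}

const edges: Hyperedge[] = [
  { kind: "object", self: 0, members: { name: 1, time: 2, forward: 3 } }, // MyNetwork
  { kind: "object", self: 4, members: { time: 5 } }, // hypothetical second class with `time`
  { kind: "name", node: 6, identifier: "network" },
  { kind: "subtype", sub: 7, sup: 8 },
];
const timeAccess = usageEdge(edges, 6, 9, "time");
```

In this toy graph the usage edge for `time` lists both classes as candidates, which is the ambiguity that contextual hints such as the name edge are meant to help resolve.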

3 Type Dependency Graph

Table 1: Different types of hyperedges used in a type dependency graph.