Do Names Echo Semantics? A Large-Scale Study of Identifiers Used in C++'s Named Casts

11/02/2021
by   Constantin Cezar Petrescu, et al.
University of Surrey

Developers relax restrictions on a type to reuse methods with other types. While type casts are prevalent, in weakly typed languages such as C++, they are also extremely permissive. If type conversions are performed without care, they can lead to software bugs. Therefore, there is a clear need to check whether a type conversion is essential and used adequately according to the developer's intent. In this paper, we propose a technique to judge the fidelity of type conversions from an explicit cast operation, using the identifiers in an assignment. We measure accord in the identifiers using entropy and use it to check if the semantics of the source expression in the cast match the semantics of the variable it is being assigned. We present the results of running our tool on 34 components of the Chromium project, which collectively account for 27MLOC. Our tool identified 1,368 cases of discord indicating potential anti-patterns in the usage of explicit casts. We performed a manual evaluation of a random-uniform sample of these cases. Our evaluation shows that our tool identified 25.6 and 28.04

1 Introduction

Developers value flexibility when using programming language features during software development [flexibility flexibility]. Type casts allow developers to work around the restrictions imposed on a specific type and use methods written for other types. While casting offers flexibility, it can lead to undefined behaviour in weakly typed languages like C/C++. For example, consider the cast operation a=(T)b. The outcome of this statement is unclear unless we know what T stands for and what the types of a and b are. If a and b are scalars, this could be a value conversion. If they are objects, this could be a downcast from b to a, provided that a's class is derived from b's class. Finally, a and b could be unrelated pointer types, in which case the set of permissible operations is so vast that compilers might struggle to identify semantic errors.

typesafety_study (typesafety_study) studied the safety of type casts and found that a quarter of them were guarded with type checks that protect the cast against run-time errors. This was corroborated in a later study by casting_java_explicit (casting_java_explicit), who examined how developers use type casts in Java and classified 26 usage patterns. Importantly, they discovered that half of the casts they inspected were not guarded locally, which could potentially cause run-time errors. A study of implicit casting in JavaScript [js_study js_study] found most implicit casts to be harmless and useful, implying that developers use them judiciously. Thus, there is a need to vet type casts to understand if they are being used carefully.

Type casts come in two forms: implicit and explicit. Implicit casts or coercions are conversions from one type to another without explicitly specifying the new type, and they are usually limited to numeric types. Compilers have multiple checks to vet implicit casts on numerics. Even so, it is not possible to categorically enforce checks on casts for several mainstream languages with user-defined types. Therefore, for languages like C++, that are permissive in how memory is used at a low-level, several primitives for explicit type conversion have been introduced. These primitives, which are called named casts, come with a unique set of checks on the cast operation. They are the recommended technique for explicitly changing one type to another in C++ and have two placeholders in the primitive: a source expression that needs to be cast and the destination type for the cast.

In this paper, we propose a lightweight approach to check if casts are used judiciously. dual_channel (dual_channel) presented source code as being dual channel. One channel is represented by the algorithmic channel, comprised of instructions understood and executed by computers. The second channel is the natural language channel, consisting of identifiers and comments to provide semantics for the instructions. In line with the recent work that uses meaning in identifiers in programs [refinym refinym; flexeme flexeme], we propose a dual channel approach to analyse named casts. Our assumption is that developers leave hints about their intent in the identifiers they choose and that this information can be used to check the fidelity of an explicit type conversion. In particular, we are interested in knowing if the source expression that is being cast is related to the destination variable to which the result of the cast is being assigned. Our main contributions are as follows:

  1. We propose a lightweight approach to detect cast misuse and imprecise names for the source and destination used in named casts.

  2. We extract the identifier information of the variables used in named casts from the Chromium project, which is an aggregation of over 34 components with nearly 27 million lines of C++ code.

  3. We perform an information-theoretic analysis of the identifiers used in the named casts. We check if identifiers in the source expression are in accord with the identifier for destination variable to which the recast source expression is assigned.

  4. We perform an in-depth investigation of the cases where there is a discord in the variable names. We evaluate the effectiveness of our approach through a manual evaluation on a sampled dataset.

  5. We identify several cases where the discord was innocuous and essential, but also two anti-patterns where named casts are being used without sufficient care. Two instances of these issues have already been patched in a recent release of the software. In addition, we discover two cases where the named casts were part of code with a high complexity that eventually led to bugs. After the bugs were fixed, the named casts were completely removed.

We discuss an overview of casting in C++, along with an example of imprecise named cast usage and the motivation for our research in Section 2. We describe our methodology in Section 3 and the results of our evaluation in Section 4. Next, we summarise the analysis findings based on the named cast’s type in Section 5. Section 6 discusses some threats to validity. Section 7 presents the related work and Section 8 concludes this work.

2 Cast operations, their use and the motivation of the work

C++ provides several ways in which a type conversion can be effected. We first provide an overview of these ways. Then, we show through an example how, despite clear guidelines on how casts should be used, type casts can be used imprecisely. Finally, we present the motivation behind this work.

2.1 Implicit and Explicit Casts

Type conversions are operations where the type of an expression is changed from one type to another. There are two types of conversions: implicit and explicit casts. In implicit casts, the conversion is done without the developers explicitly specifying the type to which a value needs to be converted. Implicit casts are performed automatically by the compiler if there is a viable conversion. For example, in C/C++, it is possible to pass a float as an argument to a method which expects a double [implicitCastCPP implicitCastCPP]. Implicit conversions, also known as standard conversions [implicitCastSTD implicitCastSTD], are generally applied on built-in numerical data types, booleans and some pointer conversions [implicitCastDocumentation implicitCastDocumentation]. The implicit conversions between numerical types are called promotions [implicitCastSTD implicitCastSTD] and are allowed from smaller size types to larger size types.
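As a small illustration of the float-to-double case described above, the following sketch passes a float to a function that expects a double without writing any cast. The function names half, half_of_float and demo_implicit are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper that expects a double.
double half(double value) { return value / 2.0; }

// No cast is written here: the compiler inserts the float -> double
// promotion at the call site.
double half_of_float(float f) { return half(f); }

// Integral promotion: operands narrower than int are promoted to int
// before the addition is performed.
static_assert(std::is_same<decltype(short{1} + short{1}), int>::value,
              "short + short is computed as int");

void demo_implicit() {
    assert(half_of_float(3.0f) == 1.5);  // 3.0f and 1.5 are exactly representable
}
```

Because the conversion is invisible at the call site, only the declaration of half reveals that a conversion takes place at all.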

C/C++ also allows explicit conversion using syntactic constructs. Unlike implicit conversions, these constructs tell the compiler to perform a type conversion to a type that the developer names. There are two ways to perform explicit casts, which are presented in Figure 1. Here, a variable x of type double is converted to an int type. The first is the functional style, where the target type is treated as a method and the variable that will be converted is passed as an argument. The other is commonly referred to as the C-style syntax, where the variable is prefixed by the target type within parentheses.

double x = 10.3;
int y;
y = int (x);    // functional notation
y = (int) x;    // C-style cast notation
Figure 1: Functional and C-style syntax for explicit type conversion.

The functional and C-style explicit casts can handle conversions of built-in types such as numeric types. However, using those operators on user-defined types, particularly class hierarchies, requires additional language constructs. Thus, C++ introduced the following four named cast operators: static_cast, dynamic_cast, const_cast and reinterpret_cast. Out of the four, static_cast, dynamic_cast and const_cast perform additional checks either statically or at runtime to avoid undefined behaviour resulting from incorrect usage of type casts [explicitCastCPP explicitCastCPP]. reinterpret_cast is the most permissive with no checks on the validity of the type conversion. It merely reinterprets the memory holding an object as another type.

The static_cast operator.

static_cast vets the casts by statically checking the validity of the conversions against the class hierarchies [staticCastCPP staticCastCPP]. As shown in Figure 2, a downcast of an object a typed as base class Base to a derived class Derived is allowed, but the developer needs to be confident that a will never be an object of another derived class of Base. If the latter happens, accessing a field of the Derived class through b would lead to undefined behaviour. This is because static_cast does not apply runtime checks to validate if a is an object of type Derived or another derived class Derived2 of Base. Therefore, the correctness of a static_cast is reliant on the developer. static_cast operations are also used for converting enum and void types where the developer is sure of the type of the data pointed to by a void pointer.

class Base {};
class Derived: public Base {};
Base * a = new Base;
Derived * b = static_cast<Derived*>(a);
Figure 2: Example of static_cast.
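The following is a minimal sketch of the situation in which a static_cast downcast is well-defined, namely when the object behind the base pointer really is of the derived type. The type and function names are illustrative assumptions and do not come from the studied code base.

```cpp
#include <cassert>

struct Base { int id = 1; };
struct Derived : Base { int extra = 42; };

// The caller must guarantee that `b` refers to a Derived object;
// static_cast performs no runtime check to confirm it.
int read_extra(Base* b) { return static_cast<Derived*>(b)->extra; }

int demo_static_cast() {
    Derived d;
    Base* as_base = &d;          // implicit upcast, always safe
    return read_extra(as_base);  // well-defined: the dynamic type is Derived
    // Passing the address of a plain Base object to read_extra would
    // compile just as well, but reading `extra` would be undefined behaviour.
}
```

The compiler accepts both the valid and the invalid call, which is why the correctness of the cast rests on the developer.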
1  class Base { virtual void vf(){} };
2
3  class Derived : public Base { };
4  int main()
5  {
6          Base *pBDerived = new Derived;
7          Derived *pd;
8          pd = dynamic_cast<Derived*>(pBDerived);
9          return 0;
10 }
Figure 3: Example of dynamic_cast.
The dynamic_cast operator.

dynamic_cast is an operator used for casting pointers and references to classes. Unlike static_cast, dynamic_cast checks at runtime whether the named cast is permissible. If it is not, a pointer cast returns a null pointer [dynamicCastCPP dynamicCastCPP]. A non-null result is therefore guaranteed to point to a valid object of the new type. Figure 3 presents an example of dynamic_cast for a pointer pBDerived. The pointer has the static type Base* and it points to a Derived object. Through the cast on line 8, pd receives a pointer of type Derived* to that same object. dynamic_cast operations perform validity checks using Run-Time Type Identification (RTTI), which is a feature in C++ to inspect types of objects at runtime. Naturally, the runtime checks introduce overheads, which makes dynamic_cast an expensive operation for performance-sensitive applications.
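The null-pointer behaviour can be sketched as follows. The class Other and the helper value_or_minus_one are hypothetical names introduced for this example, and the sketch assumes RTTI is enabled, which is the compiler default.

```cpp
#include <cassert>

struct Base { virtual ~Base() = default; };  // polymorphic, so RTTI applies
struct Derived : Base { int value = 7; };
struct Other : Base {};

// Returns the Derived payload, or -1 when `b` does not point to a Derived.
int value_or_minus_one(Base* b) {
    if (auto* d = dynamic_cast<Derived*>(b)) {
        return d->value;  // the cast succeeded, `d` is safe to use
    }
    return -1;            // the cast failed and produced a null pointer
}

void demo_dynamic_cast() {
    Derived d;
    Other o;
    assert(value_or_minus_one(&d) == 7);
    assert(value_or_minus_one(&o) == -1);
}
```

Checking the result before use is what turns a potentially invalid downcast into a recoverable condition.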

The reinterpret_cast operator.

This operator’s role is to reinterpret memory holding an object of one type as another type, thus converting from one type to another. The pointer to the memory is recast into a new pointer type without any checks if the content can be of the new type. In general, this cast is used on low-level conversions based on a reinterpretation of the binary values of the variables [reinterpretCastCPP reinterpretCastCPP]. In Figure 4, a reinterpret_cast example is shown on line 5. The variable a of class A is reinterpreted to type B and assigned to pointer b even though A and B are unrelated in the class hierarchy. The reinterpret_cast has a lower overhead than the other operators since it does not perform validity checks. Like the static_cast, though, the correctness for this conversion relies entirely on the developer.

1  class A { /* ... */ };
2  class B { /* ... */ };
3
4  A * a = new A;
5  B * b = reinterpret_cast<B*>(a);
Figure 4: Example of reinterpret_cast.
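For contrast with the unrelated-class conversion in Figure 4, the sketch below shows two uses of reinterpret_cast whose behaviour is well-defined: a pointer round trip through an integer, and inspecting an object as raw bytes. The struct A and both helper functions are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>

struct A { int x; };

// Pointer -> integer -> pointer: converting back yields the original pointer.
bool round_trips(A* a) {
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(a);
    A* back = reinterpret_cast<A*>(bits);
    return back == a;
}

// Reading an object through unsigned char* is permitted by the aliasing rules.
unsigned char first_byte(const A& a) {
    return *reinterpret_cast<const unsigned char*>(&a);
}

void demo_reinterpret_cast() {
    A a{0};
    assert(round_trips(&a));
    assert(first_byte(a) == 0);  // every byte of a zero int is zero
}
```

Dereferencing the pointer b from Figure 4 as a B, by comparison, is not covered by either of these cases.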
The const_cast operator.

This operator makes it possible to add or remove the type qualifiers const, which directs the compiler not to allow any modification of a variable, and volatile, which prevents the compiler from applying certain optimisations on the variable. An example is presented in Figure 5. The variable c of type const char* is passed as an argument to a method print which only accepts char*. This makes the const_cast on line 11 necessary to match the actual type to the formal parameter type. The C++ standard states that the const_cast operator can introduce undefined behaviour in programs. This situation arises if the constness is removed from a variable that was declared const and the variable is subsequently modified [constCastCPP constCastCPP].

1  #include <iostream>
2
3  void print (char * str)
4  {
5    std::cout << str << '\n';
6  }
7
8  int main ()
9  {
10   const char * c = "sampletext";
11   print ( const_cast<char *> (c) );
12   return 0;
13 }
Figure 5: Example of const_cast.
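The boundary between safe and unsafe uses of const_cast can be sketched as below. The functions legacy_length, length_of and bump are hypothetical and only illustrate the two well-defined cases: removing const to satisfy a signature without writing, and writing to an object that was never declared const.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical legacy API: takes char* but only reads from it.
std::size_t legacy_length(char* str) { return std::strlen(str); }

// Safe: constness is removed only to match the signature; nothing is written.
std::size_t length_of(const char* text) {
    return legacy_length(const_cast<char*>(text));
}

// Well-defined only because callers pass objects that are not declared const.
void bump(const int& counter) { ++const_cast<int&>(counter); }

void demo_const_cast() {
    assert(length_of("sampletext") == 10);
    int n = 1;   // not const, so modifying it through the cast is allowed
    bump(n);
    assert(n == 2);
}
```

Calling bump on a variable declared as const int would compile in the same way but would have undefined behaviour.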

2.2 An example of imprecise named cast usage

1  // Add information on the relationship between QUIC error codes
2  // and their symbolic names.
3  std::unique_ptr<base::DictionaryValue> dict(new base::DictionaryValue());
4
5  for (QuicErrorCode error = QUIC_NO_ERROR;
6       error < QUIC_LAST_ERROR;
7       error = static_cast<QuicErrorCode>(error + 1)) {
8    dict->SetInteger(QuicErrorCodeToString(error),
9                     static_cast<int>(error));
10 }
Figure 6: An example where two static_cast operators are used to iterate over an enumeration and store integer values in a dictionary. The snippet is from the file net_log_util.cc of the Net component, taken from an open source implementation of the QUIC protocol in the Chromium project.

Named casts were proposed initially to provide semantic clarity. However, developers sometimes use them to bypass type system restrictions at the cost of increased code complexity. Consider Figure 6 as an example. The code is a snippet taken from the implementation of QUIC protocol [quic quic]. QUIC is a general-purpose transport layer network protocol open sourced as a part of the Chromium project. There are two uses of the static_cast operator in this snippet, which populates a dictionary dict with key-value pairs, which are strings representing an error description and an integer representing the error code. It is important to note here that error itself is neither an integer nor a string but an unscoped enum type QuicErrorCode.

The type enum or enumeration is a user-defined type which consists of a set of named integral constants [enumerationsCPP enumerationsCPP]. Enumerations are generally used in three situations: a single choice where the developer filters through the choices with a switch statement, a multiple choice through C-style bitsets, or as a type definition for integral types. In Figure 6, the type enum is not used for any of the three situations, but it is used to iterate over the enumeration values and populate dict. By design, C++ does not encourage the iteration over objects of type enum since it does not provide an iterator. In the example, the iteration is achieved by implicitly casting the loop control variable error into an integer, incrementing it and casting it back to QuicErrorCode using a static_cast in line 7. In the loop expression, QUIC_NO_ERROR and QUIC_LAST_ERROR are the first and last elements of the enumeration. The second static_cast in line 9 converts the variable error of type QuicErrorCode to an int. It is used as a parameter for the function SetInteger, which populates the dictionary dict with key-value pairs. This is the second time that the developers chose to cross the boundaries between an enum type to an int to be able to use operators of the type int.

The iteration on enum objects can be pernicious, as enum types are not guaranteed to be contiguous. The Clang++ compiler would replace QUIC_NO_ERROR and QUIC_LAST_ERROR with their actual values in the loop from the snippet. This means that error would take all the values in the corresponding range. The enumeration QuicErrorCode is not contiguous and the values for each entry are defined by the developers. This means that the dict could contain error codes that were not described originally in QuicErrorCode. However, the developers handle those cases explicitly in the function QuicErrorCodeToString, which contains a switch over all the values from QuicErrorCode. This function returns the string of the error or an invalid error code for any other values. This implementation is not erroneous; however, it is suboptimal.

One may wonder at this stage, what could be a better solution and what should the solution aim to achieve? Type systems came about to ensure type safety and casts typically should be avoided wherever possible. The aim of a better solution should be to keep the enum and int types separate and implement all operators essential to iterate or operate in the enum space. The developers used an enumeration to generate a dictionary object type used later by the rest of the application. The enumeration implementation consists of the QuicErrorCode declaration along with a set of functions of switch cases such as QuicErrorCodeToString that allow the return of the string for an error. We believe a better solution would be to declare and use a dictionary from the start rather than declaring and using the enumeration to create the dictionary.

This solution would not require the crossing of type boundaries, since the type of the dictionary can be declared accordingly to the types of the values. Also, the solution would bring improved efficiency. Enumerations are efficient since they are resolved at compile time and converted into integral literals at the bitcode level. The enumerations are used along with switch cases and iterations over the enumerations, which present a linear efficiency. This efficiency performs well on a small number of cases, which is not the case for QuicErrorCode since it consists of 199 cases. On the other hand, the selection of a key in a dictionary would have a logarithmic efficiency. We are not sure if QuicErrorCode is used in any other part of the application, but dictionaries should generally perform better than large enumerations. Our solution would also ease the code maintainability process. Each time QuicErrorCode needs to be updated, it requires modifications at the declaration and at each function with switch cases. It would be easier to maintain a dictionary since the only modification required would be at the declaration. This example shows a need for tools that identify if the cast of types is essential and if the cast is done correctly. It is crucial to ensure that the crossing type boundaries are beneficial from a software engineering point of view, allowing code reuse without confounding the uses of types and operators for those types.
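One possible shape of the dictionary-first alternative is sketched below. It is only an illustration under stated assumptions: std::map is chosen because the text mentions logarithmic lookup, and the codes, names and function names are hypothetical placeholders rather than real QuicErrorCode entries or Chromium APIs.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical error table; gaps between codes are harmless because
// only declared codes are stored.
const std::map<int, std::string>& error_names() {
    static const std::map<int, std::string> table = {
        {0, "EXAMPLE_NO_ERROR"},
        {1, "EXAMPLE_INTERNAL_ERROR"},
        {25, "EXAMPLE_TIMEOUT"},
    };
    return table;
}

// O(log n) lookup; codes that were never declared map to a sentinel.
std::string error_name(int code) {
    const auto it = error_names().find(code);
    if (it == error_names().end()) return "EXAMPLE_INVALID_CODE";
    return it->second;
}

// Iteration visits declared entries only and needs no enum <-> int casts.
std::map<std::string, int> names_to_codes() {
    std::map<std::string, int> dict;
    for (const auto& entry : error_names()) dict[entry.second] = entry.first;
    return dict;
}

void demo_error_table() {
    assert(error_name(25) == "EXAMPLE_TIMEOUT");
    assert(error_name(2) == "EXAMPLE_INVALID_CODE");
    assert(names_to_codes().size() == 3);
}
```

In this arrangement, adding an error requires a change in one place, the table declaration, which is the maintainability argument made above.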

2.3 Motivation

In this research, we hypothesise that in large and mature projects such as Chromium, where code is reviewed before it is merged in the application, there are hints in program identifiers that point to their purpose. We aim to use this natural language information in identifiers to understand if named casts are being used for good software engineering reasons. If not, we aim to identify poor practices. For example, the actual to formal binding for the method SetInteger binds error of type QuicErrorCode to a formal named in_value of type int. A perfunctory check of the names for the variables and the types may seem that these variables are disparate. However, one may notice upon close inspection that SetInteger is a modifier of a dictionary. Therefore, it is essential that formal arguments of this modifier are named generically. In this work, we combine an automated analyser with human inspection to classify cases where named casts are used to point out both good and poor practices in using named casts.

In a named cast situation, precise names are meaningful names that reflect the relation between the source and destination. The choice of the identifiers is not only vital during development, but also during maintenance. Precise names reflect that the developers had a good understanding of the problem that they solved. The same precise names allow other developers to gain a faster and more comprehensive understanding of the code. Thus, the reusability and maintenance of the code is made easier. If the relation between source and destination does not exist, developers may be misled by the names and overlook some cases which could be dangerous during code testing and maintenance. Those cases need to be identified and refactored with meaningful names. Our tool uses the information-theoretic analysis to discover imprecise names given to the source expression and destination variable.

3 Methodology

Figure 7: Software architecture diagram of our tool which extracts named casts from a C++ codebase and analyses them using information theory.

Our objective is to analyse if natural language identifiers are indicative of the purpose of the cast. For this, we focus on assignment expressions where the right hand side is a named cast expression and on actual-to-formal bindings in method calls where the argument to the method is a named cast expression. In both cases, the expression that is cast to a new type is referred to as the source and the identifier to which the cast expression is bound is called the destination.

3.1 Overview of Software Architecture

Figure 7 presents an overview of our tooling. We rely on a Clang plugin to traverse the abstract syntax tree (AST) of source files. Our plugin traverses every node to discover named cast expressions and then determines if the expression is part of a larger sub-tree representing an assignment operation or a method call expression. Details of this process can be found in Section 3.2. From the set of named casts, we prioritise those casts where the source and destination are significantly different for manual investigation. Details of our prioritisation process can be found in Section 3.3 and the results of our manual investigation can be found in Section 4.3.

Our corpus is generated from the Chromium project [chrome chrome]. Chromium is an extensive system written in C++ and it only supports the Clang compiler for building. Chromium uses the Ninja build system and GN [gn gn] as a meta-build system that generates Ninja build files. The Ninja files run the Clang compiler, for which our analysis plugin is written, on the C++ files. Therefore, we modified the meta-build system to use a local version of Clang that is compatible with our plugin. The output generated by our modified compilation phase is a JSON file containing the named cast information for every C++ file that is compiled. These named casts constitute the dataset for our analysis which is described next.

3.2 Extraction of Named Casts

In Figure 8, we present an example of how our plugin analyses a named cast from the Net sub-system in Chromium. After Clang parses the source file and produces an AST for the file net_log_util.cc, the plugin traverses the tree and searches for named casts that are a part of either assignments or call expressions. On the left in Figure 8, the syntax tree for the function call SetInteger is shown. The node CallExpr has a child CXXStaticCastExpr, which represents the node for static_cast, implying that the named cast is used as an argument for a function call. The plugin then follows the call to find the method definition. A projection of the AST for the method definition is shown on the right in Figure 8. The plugin then links the formal parameter to the actual parameter for SetInteger and discovers that the source variable is error and the destination variable is in_value.

All the macro names in the code are replaced with actual code at the compilation stage [preprocessor preprocessor]. However, the physical location of a named cast inside a macro would still point to the macro's call. To solve this, our plugin is designed to follow macro definitions, after their expansion, to discover named casts inside macro definitions as well.

For each C++ file analysed, the Clang plugin generates a JSON file with information about named casts. Each JSON entry in the file records the kind of named cast, i.e. static_cast, dynamic_cast, reinterpret_cast or const_cast. It additionally contains the type and the subtokens for the source and the destination expression. To generate the subtokens, we extract all tokens from each expression and preserve only identifier, keyword and literal tokens. Those tokens are split into subtokens based on the camel case and snake case separators.
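The subtoken splitting step can be sketched as follows. This is an assumed implementation for illustration, not the plugin's actual code: the function name split_subtokens is hypothetical, the original casing is preserved, and only underscores and lower-to-upper case transitions are treated as separators.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits an identifier on '_' and on lower-to-upper camel case boundaries.
std::vector<std::string> split_subtokens(const std::string& identifier) {
    std::vector<std::string> subtokens;
    std::string current;
    for (std::size_t i = 0; i < identifier.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(identifier[i]);
        const bool underscore = (c == '_');
        const bool camel_boundary =
            i > 0 && std::isupper(c) &&
            std::islower(static_cast<unsigned char>(identifier[i - 1]));
        // A separator closes the subtoken collected so far.
        if ((underscore || camel_boundary) && !current.empty()) {
            subtokens.push_back(current);
            current.clear();
        }
        if (!underscore) current.push_back(static_cast<char>(c));
    }
    if (!current.empty()) subtokens.push_back(current);
    return subtokens;
}

void demo_split() {
    assert((split_subtokens("in_value") ==
            std::vector<std::string>{"in", "value"}));
    assert((split_subtokens("fooBar") ==
            std::vector<std::string>{"foo", "Bar"}));
}
```

Applied to the motivating example, the source error yields a single subtoken while the destination in_value yields in and value.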

Figure 8: Abstract syntax tree representation for our motivating example; we selected only the nodes of interest. The left side shows the function call, SetInteger. The right side presents the mapping between the function call and the function definition.

3.3 Data analysis

In this research, we study whether the identifiers convey the reason for the use of a named cast. We do this by comparing the source expression subtokens with the destination variable subtokens. Our comparison is based on a notion of entropy, that is, the amount of information in names. We find cases where source subtokens are significantly different from the destination subtokens. The difference is measured using conditional entropy, which computes the number of additional bits that would be required to represent the destination given the subtokens in the source. While we have access to the type information, we do not use it in the calculation of the conditional entropy. The reason is that, during development and sometimes in static time, the type of a variable is not always visible to the human. Including the type in our analysis would therefore make it differ from the way a human would view code.

Next, we show how we compute the conditional entropy of fooBar given the entropy for bazGoo in the named cast fooBar = static_cast<Quux*>(bazGoo). Equation 1 presents the standard Shannon's formula for computing the entropy [shannonEntropy shannonEntropy], which is the negative sum of the probabilities multiplied with the logarithm value of the probability.

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \tag{1}$$

Here, $X$ represents bazGoo and $P(x_i)$ represents the probabilities for baz and Goo, which are the subtokens of the identifier. The subtokens' probabilities have a value of $\frac{1}{2}$ since there are only two possible options. Thus, $H(X) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$. In other words, we need only one bit to represent the two possible options for the source subtokens.

We then compute the conditional entropy as shown in Equation 2 [condEnt_ent condEnt_ent]. The conditional entropy is the amount of information (in bits) required to express the outcome of a random variable knowing the outcome of another random variable.

$$H(Y \mid X) = H(X, Y) - H(X) \tag{2}$$

In Equation 2, $Y$ is a placeholder for the subtokens foo and Bar in our example. We compute the conditional entropy of $Y$ given $X$ based on the chain rule. Thus, the conditional entropy value is the entropy value of the source's subtokens subtracted from the joint entropy value of both source and destination subtokens. In the current example, the joint entropy is computed for all the subtokens baz, Goo, foo and Bar, so $H(X, Y) = 2$. The conditional entropy tells how many more bits are needed to represent the additional subtokens that the destination identifiers bring, knowing the source's subtokens. In the example, the conditional entropy equals the difference between the joint entropy and the entropy of bazGoo, and it has value one: $H(Y \mid X) = 2 - 1 = 1$. Thus, the destination identifier fooBar will require an additional bit in order to represent the two new additional subtokens.

The role of conditional entropy value is to discover how different a destination expression is, compared to the source expression used in a named cast. Therefore, we compare the subtokens of the destination expression with the subtokens of the source expression for each named cast operation we collected from Chromium. If we were to consider the subtokens across multiple named cast cases in the conditional entropy calculation for each case, then the result would not be the difference between source and destination. The comparison would instead identify if the destination expression contains unique subtokens compared to source subtokens from all the cases. The chances that some of the destination subtokens appear in the subtokens from source expression increases with the addition of multiple source expressions in the calculation of the conditional entropy.

The conditional entropy values of the destination given the source enables the identification of cases where the source looks significantly different from the destination. A low conditional entropy value implies that source and destination subtokens are similar. On the other hand, a high conditional entropy value means they have few subtokens in common. If identifiers are used for different purposes, under the assumption that names are chosen carefully, their subtokens will also be different. We are interested in the cases where the conditional entropy is high. Those cases should generally point to clear instances where disparate names are used in the source and the destination expressions. This is indicative of the destination variable serving a different purpose than the source expression.
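The calculation for a single named cast can be sketched as below. The sketch rests on an assumption that the text does not spell out: probabilities are taken as relative frequencies over the subtokens of one cast, and the joint entropy is computed over the pooled source and destination subtokens. The names Subtokens, entropy and conditional_entropy are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

using Subtokens = std::vector<std::string>;

// Shannon entropy, in bits, of the empirical subtoken distribution.
double entropy(const Subtokens& tokens) {
    if (tokens.empty()) return 0.0;
    std::map<std::string, int> counts;
    for (const auto& t : tokens) ++counts[t];
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / tokens.size();
        h -= p * std::log2(p);
    }
    return h;
}

// H(Y|X) = H(X,Y) - H(X), with the joint taken over the pooled subtokens
// (an assumption of this sketch).
double conditional_entropy(const Subtokens& source,
                           const Subtokens& destination) {
    Subtokens pooled = source;
    pooled.insert(pooled.end(), destination.begin(), destination.end());
    return entropy(pooled) - entropy(source);
}

void demo_entropy() {
    // fooBar = static_cast<Quux*>(bazGoo): no shared subtokens, one extra bit.
    const double discord = conditional_entropy({"baz", "Goo"}, {"foo", "Bar"});
    assert(std::fabs(discord - 1.0) < 1e-9);
    // Identical subtokens on both sides carry no additional information.
    const double accord = conditional_entropy({"baz", "Goo"}, {"baz", "Goo"});
    assert(std::fabs(accord) < 1e-9);
}
```

Under these assumptions, a destination that reuses the source subtokens scores zero, while a destination built from entirely new subtokens scores highest.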

We generate a ranked list for the named casts based on their conditional entropy value in order to select identifiers where the expressions in source and destination are disparate. We performed this for all four categories of named casts: const_cast, dynamic_cast, reinterpret_cast and static_cast. Additionally, we manually sampled cases from the top quartile in the ranked list for each of these categories. Our samples are randomly selected from the outlier dataset using the central limit theorem [central_limit_theorem central_limit_theorem] with a 90% confidence.

One may wonder why we did not use a simpler distance metric such as Levenshtein Distance (LD) instead of conditional entropy. LD uses three operations: insertion, deletion and substitution. The edit distance is the number of operations used to transform the input string into the output string. It is sensitive to the ordering of subtokens. Subtoken ordering is not important to us as we want only to check if the subtokens are being reused from the source in the destination. Whether an identifier is called thrown_type or type_thrown is immaterial to us, but it affects the Levenshtein distance.
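The ordering sensitivity of the Levenshtein distance can be checked with a short sketch. The implementation below is the textbook dynamic-programming algorithm, and subtoken_set is a hypothetical helper that handles only snake case, which is enough for the thrown_type example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Edit distance using insertion, deletion and substitution.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Order-insensitive view of a snake case identifier: its set of subtokens.
std::set<std::string> subtoken_set(const std::string& identifier) {
    std::set<std::string> tokens;
    std::string current;
    for (char c : identifier) {
        if (c == '_') {
            if (!current.empty()) tokens.insert(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) tokens.insert(current);
    return tokens;
}

void demo_ordering() {
    // The strings differ, so the edit distance is non-zero ...
    assert(levenshtein("thrown_type", "type_thrown") > 0);
    // ... yet both identifiers are built from the same subtokens.
    assert(subtoken_set("thrown_type") == subtoken_set("type_thrown"));
}
```

A measure defined over subtokens treats the two identifiers as equivalent, whereas a character-level edit distance penalises the reordering.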

4 Evaluation

| Name | Description | Lines of Code | Assign S | Assign R | Assign D | Assign C | Call S | Call R | Call D | Call C | Total |
| V8 | JavaScript Engine | 1,359,009 | 1262 | 1649 | 0 | 8 | 1592 | 353 | 0 | 4 | 4868 |
| Net | Networking Protocols | 765,964 | 616 | 1153 | 0 | 26 | 693 | 770 | 0 | 15 | 3273 |
| gpu | Graphics Stack | 277,035 | 1386 | 307 | 0 | 10 | 171 | 100 | 0 | 56 | 2030 |
| UI | UI Frameworks | 178,843 | 197 | 823 | 0 | 5 | 689 | 36 | 0 | 4 | 1754 |
| Media | Media Components | 370,069 | 450 | 700 | 0 | 20 | 358 | 207 | 0 | 3 | 1738 |
| Blink | Browser Engine | 1,524,213 | 1081 | 120 | 0 | 0 | 138 | 0 | 0 | 0 | 1339 |
| Chrome | Application Layer | 2,385,043 | 776 | 199 | 0 | 22 | 256 | 3 | 0 | 0 | 1256 |
| Webrtc | Communications API | 634,428 | 482 | 78 | 0 | 9 | 541 | 33 | 0 | 1 | 1144 |
| Skia | Graphics Library | 665,350 | 349 | 274 | 0 | 20 | 208 | 179 | 0 | 33 | 1063 |
| Device | Sensor Communication | 133,831 | 469 | 376 | 0 | 0 | 116 | 30 | 0 | 0 | 991 |
| Policy | Policy Settings | 38,532 | 121 | 34 | 0 | 353 | 314 | 34 | 0 | 0 | 856 |
| Perfetto | Tracing Service | 205,355 | 297 | 7 | 0 | 54 | 454 | 1 | 0 | 0 | 813 |
| Safe Browsing | URL Check Protocol | 9,046 | 162 | 57 | 0 | 79 | 440 | 46 | 0 | 0 | 784 |
| Dawn | WebGPU | 66,458 | 125 | 542 | 0 | 0 | 25 | 3 | 0 | 0 | 695 |
| Protobuf | Serializing Struct Data | 227,475 | 160 | 77 | 0 | 17 | 394 | 10 | 0 | 15 | 673 |
| Common | Application Layer | 39,981 | 341 | 319 | 0 | 1 | 9 | 0 | 0 | 0 | 670 |
| Base | Core Components | 278,364 | 192 | 220 | 0 | 7 | 129 | 102 | 0 | 6 | 656 |
| Pdfium | PDF Library | 483,545 | 369 | 62 | 0 | 1 | 181 | 20 | 0 | 0 | 633 |
| ICU | Unicode Components | 325,354 | 285 | 63 | 75 | 40 | 79 | 14 | 1 | 5 | 562 |
| VIZ | Visual Subservices | 83,767 | 176 | 235 | 0 | 0 | 51 | 57 | 0 | 0 | 519 |
| Metrics Proto | Data Analysis | 75,204 | 165 | 0 | 0 | 47 | 304 | 0 | 0 | 0 | 516 |
| Sync | Sync Implementation | 139,526 | 92 | 1 | 0 | 84 | 313 | 3 | 0 | 0 | 493 |
| Angle | Graphics Engine | 2,381,153 | 175 | 28 | 0 | 3 | 230 | 19 | 0 | 0 | 455 |
| Buildtools | Buildtools Chromium | 510,018 | 187 | 153 | 13 | 2 | 25 | 7 | 0 | 3 | 390 |
| Audio | Audio System | 34,120 | 43 | 202 | 0 | 0 | 33 | 50 | 0 | 0 | 328 |
| Swiftshader | Graphics Library | 2,166,480 | 160 | 87 | 0 | 5 | 62 | 6 | 0 | 0 | 320 |
| Extensions | Core Parts Extension | 223,979 | 312 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 316 |
| CC | Compositor Renderer | 198,390 | 117 | 17 | 0 | 0 | 167 | 6 | 0 | 2 | 309 |
| Remote Cocoa | Cocoa Front-End | 4,255 | 137 | 158 | 0 | 0 | 5 | 1 | 0 | 0 | 301 |
| Logging | Logs Implementation | 42,865 | 90 | 0 | 0 | 6 | 176 | 0 | 0 | 0 | 272 |
| Rest of Corpus | Components < 250 | - | 2238 | 1284 | 0 | 247 | 1925 | 545 | 0 | 42 | 6281 |
| Total Casts | - | - | 13012 | 9229 | 88 | 1066 | 10078 | 2635 | 1 | 189 | 36298 |
Table 1: C++ corpus from Google Chromium, showing the distribution of cast types and the frequency of usage of each conversion operator in assignment expressions (Assign) and call expressions (Call) (S - static_cast, R - reinterpret_cast, D - dynamic_cast, C - const_cast).

Our corpus consists of casts collected from the Chromium project. We give a quantitative overview of the type of named casts in our corpus in Section 4.1. A human evaluation is performed on a sampled set of named casts where the subtokens in source and destination are significantly different and the results are presented in Section 4.2. We selected and described the most interesting named cast cases in Section 4.3.

(a) Static Cast Assignment Cases.
(b) Reinterpret Cast, Const Cast and Dynamic Cast Assignment Cases.
(c) Static Cast Function Call Cases.
(d) Reinterpret Cast, Const Cast and Dynamic Cast Function Call Cases.
Figure 9: Type conversions represented by source expression length and conditional entropy. The star cases are the outliers.

4.1 Quantitative analysis

Table 1 shows the distribution of named casts across the components of Chromium, including the frequency of each category of named cast for individual modules. Our corpus consists of 36,298 named casts. Overall, 63.62% are static_casts, 32.68% are reinterpret_casts, 0.25% are dynamic_casts and 3.45% are const_casts. As discussed in Section 3.2, we consider named casts that are a part of either assignments or actual-to-formal parameter binding in function calls. The proportion of named casts that are a part of assignments is 64.46% (23,395 casts) while only 35.54% (12,903 casts) are in call expressions.

Table 1 shows that the dynamic_cast and const_cast operators are used rarely. The dynamic_cast operator uses Run-Time Type Identification (RTTI) to verify at runtime that the types can be converted, which is an expensive operation. It is likely that the cost of checking prohibits its widespread use. const_cast operators are used to set or remove the constness or volatility of variables. Such variables are themselves rare, which explains why so few instances of const_cast are present in our dataset. static_cast can be used to cast objects up or down a class hierarchy. A check on the class inheritance hierarchy evaluates if the conversion between the object and destination type is possible. Therefore, static_cast is safer than reinterpret_cast, which is extremely permissive, allowing arbitrary type conversions. Indeed, best practice is to use static_cast over reinterpret_cast and this is reflected in the prevalence of static_cast operations in our corpus. Table 1 also shows that the larger and performance-critical modules such as the JavaScript compiler V8, networking (Net), GPU, user interface (UI), the Media libraries, etc. have the most casts. Interestingly, none of these modules uses the runtime-intensive dynamic_cast operator. Only the International Components for Unicode (ICU) and Buildtools components contain dynamic_cast operators, a total of 88. Neither of these components is central to the user experience of the browser and thus they can potentially tolerate runtime overheads.

Figures 9(a), 9(b), 9(c) and 9(d) plot the conditional entropy against the length of the source expression. For named casts that are a part of assignments, Figure 9(a) shows a graph for static_cast and Figure 9(b) shows a graph for const_cast, dynamic_cast and reinterpret_cast. For named casts that are a part of actual-to-formal parameter bindings, Figure 9(c) shows a graph for static_cast and Figure 9(d) shows the graph for the remaining casts. As expected, with longer source expressions, conditional entropy stays low. It is interesting to note that some named casts with short source expressions show noticeably high conditional entropy. These are the cases we investigated further to understand if the named cast operation is used carefully and there is ample justification to cross type boundaries. These outliers are highlighted in Figures 9(a), 9(b), 9(c) and 9(d) by using colours that are different from the rest of the points in the dataset. They were identified by using an estimated Gaussian distribution and selecting the upper quartile. The outliers consist of 991 static_cast, 319 reinterpret_cast, 11 dynamic_cast and 47 const_cast operations. The share of each category of named casts in the outliers is proportional to the constitution of the original dataset.

4.2 Manual evaluation

Figure 10: Classification of the manually evaluated results (TP - True Positive, FP - False Positive)

For the manual investigation of named casts, we selected a subset of the outliers using random uniform sampling, targeting a 90% confidence using the central limit theorem [central_limit_theorem]. The sampled dataset consists of 164 data points with a breakdown of 126 static_cast, 32 reinterpret_cast, 5 const_cast and 1 dynamic_cast operations. Three raters analysed the sampled dataset and classified each case as a true positive or a false positive. The raters classified 50, 90 and 94 cases out of 164 as true positives. The true positive rate represents how often our approach correctly identified a case that presents an incorrect implementation or imprecise names for the identifiers. Our raters agreed on a total of 83 true positive cases, which gives our technique a true positive rate of 50.6%. The results of the manual analysis are presented in Figure 10. The raters categorised each true positive by type. The results indicate that 25.6% of the cases represent incorrect implementations while 28.04% represent imprecise names for the source and destination. Some examples of source/destination pairs with imprecise names are: tag with chars[i], levels with fparams[0], param with bufSize, t with output_cursor, val with p[i], frames with out_trace. The raters judged names to be imprecise when the source and destination names are not meaningful and when the names obscure rather than clarify the meaning of the code.

Analysis of False Positives

The false positives fall into two categories. The first consists of named casts that are used both correctly and efficiently, with identifier names that are related and show the purpose of the code. These cases account for only 24.4% of the dataset. The remaining quarter of the dataset consists of correctly implemented named conversions with generic names. In some cases, generic names do not cause code quality problems and provide enough information about the named casts. For instance, one case converts buffer[buffer_pos] to current, which means that the code extracts the element of buffer at index buffer_pos and stores it in the variable current. Even though this named cast is sound and the identifier names are reasonable, our tool still reports such cases, producing false positives, because the generic names differ.

As part of our evaluation, we measured the degree of agreement between raters. Cohen's Kappa coefficient [kappa] is a robust metric of inter-rater agreement. Kappa takes values between -1 and 1. A value of 1 means that the raters are in perfect agreement, while a negative value means that the raters agree less often than would be expected by chance. Since Cohen's Kappa is defined for a pair of raters, we calculated the coefficient as the mean of the Kappa coefficients over all pairs of raters. The Cohen's Kappa coefficient for this evaluation was approximately 0.62, which means that our raters were in substantial agreement about the nature of the type conversions.

4.3 Qualitative analysis

We split the named casts cases from the sampled set based on their operator and present the analysis of the most interesting cases in the following subsections.

Analysis of static_cast examples

An example of a static_cast where the source and destination look different is presented in Figure 11. The listing contains a call to CompareAndSwapPtr as well as the definition for the same. This method is actually called from within a macro function definition, RTC_HISTOGRAM_COMMON_BLOCK. The purpose of this macro function is to add the information passed to the histogram_pointer safely. If the memory where histogram_pointer points is empty, then the pointer will be changed to point to the new memory address. Otherwise, the code from lines 1-4 will ensure that it points to a nullptr.

The static_cast used on line 3 in Figure 11 is passed as a parameter to the function CompareAndSwapPtr. The function call is part of a pointer declaration. The newly declared pointer prev_pointer receives the output of the method CompareAndSwapPtr. This function makes use of the Windows API InterlockedCompareExchangePointer, which performs a pointer comparison and swap atomically. The code has to clear atomic_histogram_pointer, so the API call ultimately compares the pointer with a nullptr. If those two pointers contain different values, then it will store the value of nullptr in the address of atomic_histogram_pointer. The static_cast converts the nullptr to the type webrtc::metrics::Histogram* for consistency.

Since the code from Figure 11 tries to validate if atomic_histogram_pointer is null, it is required to compare the pointer with a null pointer literal: nullptr. In order to compare two pointers, they need to be of the same type and therefore a static_cast is used, as it is the only named cast operator which allows casts from nullptr to a different type. The destination identifier to which the named cast is bound is old_value. While old_value looks different to nullptr, which is why our information-theoretic analysis identified it, the method CompareAndSwapPtr is likely designed to be generic and to accept many different pointer types. Therefore, this use of a named cast is sound.

1 webrtc::metrics::Histogram* prev_pointer =
2   rtc::AtomicOps::CompareAndSwapPtr(&atomic_histogram_pointer,
3   static_cast<webrtc::metrics::Histogram*>(nullptr),
4   histogram_pointer);
5
6 static T* CompareAndSwapPtr(T* volatile* ptr, T* old_value, T* new_value)
7 { return static_cast<T*>(::InterlockedCompareExchangePointer(
8         reinterpret_cast<PVOID volatile*>(ptr), old_value, new_value));
9 }
Figure 11: A static_cast example which presents a good utilisation of the operator for efficiency and portability.
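A portable sketch of the pattern in Figure 11 follows. It replaces the Windows API with std::atomic, and the Histogram type and InstallHistogram helper are hypothetical stand-ins rather than the WebRTC code. It shows one reason the cast is needed with a function template: a bare nullptr has type std::nullptr_t, which does not match the parameter type T*.

```cpp
#include <atomic>

// Stores new_value in *ptr only if *ptr currently equals old_value, and
// returns the value *ptr held before the call.
template <typename T>
T* CompareAndSwapPtr(std::atomic<T*>* ptr, T* old_value, T* new_value) {
  // On failure compare_exchange_strong overwrites old_value with the current
  // contents of *ptr; on success old_value already equals the prior contents.
  ptr->compare_exchange_strong(old_value, new_value);
  return old_value;
}

struct Histogram {  // hypothetical stand-in type
  int samples = 0;
};

// Installs candidate only if the slot is still empty.
Histogram* InstallHistogram(std::atomic<Histogram*>* slot,
                            Histogram* candidate) {
  // Passing nullptr directly would make template argument deduction fail,
  // because std::nullptr_t cannot be matched against T*.
  return CompareAndSwapPtr(slot, static_cast<Histogram*>(nullptr), candidate);
}
```

In this sketch a second call to InstallHistogram leaves the slot unchanged and returns the pointer that was installed first.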

Figure 12 presents a use of the static_cast operator in the component Base, in the file ip_address.cc, inside the method ParseV4. This method is used as part of the constructor for the class IPAddress to extract the IPv4 address from a string. The named cast operation in Figure 12 is part of a variable assignment. Although the source and destination identifiers were selected because they look different, we need to understand how they are used to assess whether a named cast is necessary here. We studied how the source and destination identifiers are used and found that the input string for ParseV4 is split into octets, which are parsed and added to the IPv4 address. The source identifier is next_octet of type uint16_t, which represents one byte of the IPv4 address. The destination variable is address.bytes_, where bytes_ is a member of the class IPv4. Specifically, it is an array of type array<uint8_t, 16>. The array has length 16 since an IPAddress can also have the IPv6 format. The implementation of ParseV4 does not seem to be erroneous. However, the use of the static_cast operator is unnecessary, since the conversion from string to octets can be done with built-in formatted input. Developers can use functions such as sscanf to read parts of the formatted string and directly return the desired output. In fact, this is exactly what the developers did in later versions of the implementation: the ParseV4 function has been refactored [commit_ipadd] and updated to use sscanf.

address.bytes_[i++] = static_cast<uint8_t>(next_octet);
Figure 12: An example of static_cast operator used in function ParseV4. This function has been refactored and replaces the conversion with a function that reads the values.
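The narrowing in Figure 12 is only lossless if next_octet has been checked against 255 before the cast. The following sketch shows that check in a self-contained parser. ParseDottedQuad is a hypothetical helper written for this illustration and is not Chromium's ParseV4.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Parses "a.b.c.d" into four bytes. Returns false on malformed input or when
// an octet does not fit in one byte.
bool ParseDottedQuad(const std::string& text,
                     std::array<std::uint8_t, 4>* bytes) {
  std::size_t index = 0;         // octet currently being filled
  std::uint16_t next_octet = 0;  // wide enough to detect values above 255
  bool has_digit = false;
  for (char c : text) {
    if (c >= '0' && c <= '9') {
      next_octet = static_cast<std::uint16_t>(next_octet * 10 + (c - '0'));
      if (next_octet > 255) return false;
      has_digit = true;
    } else if (c == '.' && has_digit && index < 3) {
      // Safe narrowing: the range check above guarantees the value fits.
      (*bytes)[index++] = static_cast<std::uint8_t>(next_octet);
      next_octet = 0;
      has_digit = false;
    } else {
      return false;
    }
  }
  if (!has_digit || index != 3) return false;
  (*bytes)[index] = static_cast<std::uint8_t>(next_octet);
  return true;
}
```

Keeping the wider type for accumulation and narrowing only after the range check makes the intent of the cast visible at the point of use.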

The code from Figure 13 presents a set of four static_cast conversions collected from the component Swiftshader from the file Surface.cpp. We identified the cast because the source identifier is very short compared to the destination identifier. These casts are inside a method write which contains a switch statement that writes the colour values (RGBA format) to a data structure. The source identifiers are r, g, b and a of type float which are short for the colours red, green, blue and alpha, the last of which represents the opacity. The destination identifier, which is originally a void pointer, has a generic name element because it may point to arbitrary data types. However, notice in Figure 13, element has been implicitly cast to point at an unsigned int to match the type for the desired destination type. Implicitly casting void pointers at the point of use can be confusing. This could lead to the variable element being treated differently, assuming it has another type. We have found 45 such conversions in the switch statement.

Another static_cast we analysed is presented in line 4 of Figure 14, which belongs to the file Context.cpp from the component libANGLE. This case was identified because the source and destination expressions are different. The source variable is a pointer of type const void* with the identifier paths, and it represents a vector of paths from the Render Tree. The destination variable is a pointer of type const auto* with the identifier nameArray. The cast is required to convert the paths vector to a target template type. The template type is used as an argument to the named cast operator in line 4 and it appears in the function template declaration on lines 1-2 in Figure 14. The role of the function GatherPaths is to iterate through all the paths and return their names. This case belongs to a larger and more complex piece of code that validates the command buffer during path rendering. The developers decided to stop supporting this feature since this rendering method performed worse than the other rendering methods [nameArray_bug1]. In addition, under specific circumstances this functionality tried to retrieve information from an empty pointer, which led to a crash [nameArray_bug2]. This example shows that a named cast conversion can be used correctly, but it might also add complexity to the code, leading to inefficient and error-prone code.

((unsigned int*)element)[0] = static_cast<unsigned int>(r);
((unsigned int*)element)[1] = static_cast<unsigned int>(g);
((unsigned int*)element)[2] = static_cast<unsigned int>(b);
((unsigned int*)element)[3] = static_cast<unsigned int>(a);
Figure 13: Example of how a static_cast is used on primitive types. The destination variable is originally a void pointer and may potentially be misused if the developer is unaware of the various types it can represent.
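One way to make the pointer conversion in Figure 13 easier to follow is to convert the void pointer once, into a named and typed pointer, before writing through it. The sketch below assumes a hypothetical function WriteRgba and is not the Swiftshader implementation. Note that converting a float to unsigned int truncates the fractional part and is only defined for values that fit in the target type.

```cpp
// Writes four colour channels through a pointer converted from void* once.
// The caller must pass storage for at least four unsigned int values.
void WriteRgba(void* element, float r, float g, float b, float a) {
  // A single named conversion documents the type the buffer is treated as.
  unsigned int* channels = static_cast<unsigned int*>(element);
  channels[0] = static_cast<unsigned int>(r);
  channels[1] = static_cast<unsigned int>(g);
  channels[2] = static_cast<unsigned int>(b);
  channels[3] = static_cast<unsigned int>(a);
}
```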

Analysis of reinterpret_cast examples

Figure 15 presents two similar cases of reinterpret_cast with high conditional entropy. In these cases, two different source identifiers are bound to the same destination identifier even though the conversions appear in different components. Figure 15 contains the calls and the signatures for the functions host_statistics and host_info. These method calls have been collected from the files process_metrics_mac.cc from the Base component and audio_low_latency_input_mac.cc from the Media component. The functions host_statistics and host_info are defined in the Mach library, which contains services and primitives for the OS X kernel.

The role of the functions host_statistics and host_info is to retrieve host-specific information. The function host_statistics in line 2 obtains information about virtual memory for a host. The host_info method in line 10 retrieves basic information about a host, such as the number of current physical processors for the host. Both methods return a variable kr of type kern_return_t. This variable is an integer which maps to a list of generic errors. If the method is successful, kr holds the value that denotes success. Otherwise, it holds a different value which represents a specific error. In fact, not only those two functions but most of the methods from the Mach library follow the same coding conventions and have a similar format.

The source variable for the first case has the identifier &data which is a generic name. Its type is vm_statistics_data_t which is a pointer to the structure vm_statistics and contains statistics on the kernel’s use of virtual memory. The source identifier for the casts from line 10 is &hbi which is an acronym for the source variable type. Just like &data, &hbi is the address of a structure host_basic_info which is used to present basic information about a host. The two casts from Figure 15 have similarly named destination identifiers: host_info_out with type host_info_t.

1 template <typename T>
2 std::vector<Path *> GatherPaths(..., const void *paths
3 ...
4 const auto *nameArray = static_cast<const T *>(paths);
Figure 14: Example of a static_cast which was part of code which is not supported anymore.

host_statistics can fill two different types of structures: vm_statistics for virtual memory information and host_load_info for the host's processor load information. The flavor parameter keeps track of the type of statistics desired. In this way, the functions treat each destination variable differently based on the flavor, which allows them to perform different operations depending on the parameters passed. The destination identifiers are identical since the functions host_statistics and host_info follow the same coding conventions and have a similar format. Unfortunately, if the developer is not careful to pass a matching type and flavor as parameters to the functions, it may lead to a crash. This is a case where rigorously adhering to a coding convention can cause confusion during development.

We noticed a further point while analysing the use of similar casts involving the Mach library. Those type conversions are designed for a specific platform, Mac OS. According to the developers' comments [problems_kernel_count], the implementation for this platform caused the developers the most problems, so they had to build a specific solution for it. This is also supported by the fact that those functions are defined in the Mach library. Even if this conversion pattern can cause confusion, it seems vital for Chromium's execution on the Mac OS platform, since it was designed for a specific system.

1  // check the total number of pages currently in use and pageable.
2  kern_return_t kr = host_statistics(host.get(), HOST_VM_INFO,
3      reinterpret_cast<host_info_t>(&data), &count);
4
5  kern_return_t host_statistics(host_t host_priv, host_flavor_t
6      flavor, host_info_t host_info_out,
7      mach_msg_type_number_t *host_info_outCnt);
8
9  // retrieve the number of current physical processors
10 kern_return_t kr = host_info(mach_host.get(), HOST_BASIC_INFO,
11     reinterpret_cast<host_info_t>(&hbi), &info_count);
12
13 kern_return_t host_info (host_t host, host_flavor_t flavor,
14     host_info_t host_info_out,
15     mach_msg_type_number_t *host_info_outCnt)
Figure 15: An example of how reinterpret_cast operator is used to allow functions with pointer parameters which can point to two different data structures.
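The flavor-driven pattern in Figure 15 can be mimicked in portable code. Every name in the sketch below (info_t, Flavor, BasicInfo, VmInfo, QueryHost) is a hypothetical stand-in, not a Mach declaration. It shows how a single untyped output parameter serves several structures, and how a capacity count can catch one kind of mismatch between the flavor and the buffer passed by the caller.

```cpp
#include <cstring>

using info_t = int*;  // untyped "array of integers" view of the output buffer
enum Flavor { kBasicInfo, kVmInfo };
struct BasicInfo { int max_cpus; int avail_cpus; };
struct VmInfo { int free_count; int active_count; int wire_count; };

// Copies info into the caller's buffer if it has room. *count is the buffer
// capacity in ints on entry and the number of ints written on success.
template <typename Info>
bool CopyOut(const Info& info, info_t out, unsigned* count) {
  const unsigned needed = sizeof(Info) / sizeof(int);
  if (*count < needed) return false;  // buffer too small for this flavor
  std::memcpy(out, &info, sizeof(Info));
  *count = needed;
  return true;
}

// The flavor alone decides which structure is written through host_info_out.
// The values are placeholders for data a real kernel call would report.
bool QueryHost(Flavor flavor, info_t host_info_out, unsigned* count) {
  switch (flavor) {
    case kBasicInfo: return CopyOut(BasicInfo{8, 4}, host_info_out, count);
    case kVmInfo: return CopyOut(VmInfo{100, 50, 25}, host_info_out, count);
  }
  return false;
}
```

A caller passes reinterpret_cast<info_t>(&hbi) together with the capacity of hbi in ints. If the structure is at least as large as the one implied by the flavor but of the wrong type, the call still succeeds and silently fills it with unrelated data, which is the hazard discussed above.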

A second case of reinterpret_cast use that we studied is presented in Figure 16. This snippet is from component Dawn in file WireCmd_autogen.cpp and is one of 13 similar cases. The file is generated from WireCmd.cpp using the build system and contains serialisation and deserialisation functions. The generated file is large with 14,000 lines of code and has a total of 200 type conversions which have the same identifier for source variables and also for the destination variables. The source identifier is the string buffer and in most cases, it is a pointer to a pointer for char. There are cases when the source variables have additional type qualifiers such as const volatile. The destination variable is memberBuffer and it is declared with the type auto. We observed that the destination type varies from pointers to numeric types such as unsigned long long to pointers for structures and enumerations. The casts are part of assignment expressions in which the memberBuffer is initialised with a part of the buffer.

The purpose of these casts is to serialise and deserialise a variety of different structures for the component Dawn. In other words, the methods provide the functionality to convert objects to streams of bytes and recreate the objects when needed. Since the universe of types to be serialised is large, developers have relied on macros to serialise and deserialise objects. The example selected in Figure 16 presents the buffer being converted to the type DawnTextureFormat. The target type is an enumeration. Similar to the example from Section 2.2, lines 3-5 iterate over the enumeration. While the use of macros is preferred for serialisation and deserialisation, given the massive number of types that need to be handled, macros provide little insight into the actual role of the casts. Nonetheless, the generated file can be created from only 700 lines of code which contain macros. The use of reinterpret_cast in this case is clearly beneficial from a software reuse point of view and leads to a decrease in the amount of code. On the other hand, the named cast operator is used to bypass the lack of an iterator for the enumeration type. If not done correctly, this can be pernicious, as reinterpret_cast comes with no semantic checks at all and, as discussed above, enum types may not be contiguous in the first place.

1 auto memberBuffer = reinterpret_cast<DawnTextureFormat*>(*buffer);
2
3 for (size_t i = 0; i < memberLength; ++i) {
4     memberBuffer[i] = record.colorFormats[i];
5 }
Figure 16: An example of reinterpret_cast which is used to enable iteration over an enumeration.

The code from Figure 17 presents the use of a reinterpret_cast in line 4. The snippet is collected from the component V8, in the file api.cc. The source variable is a void* pointer with the identifier info, while the destination variable is a pointer to a shared pointer, with the identifier bs_indirection and the type std::shared_ptr<i::BackingStore>*. To understand this case, we first need to understand what the type BackingStore is. In caching, a backing store is represented by the copy of the data in memory, more specifically in our case, a copy to an ArrayBuffer [backingData]. The named cast operator is used to retrieve the shared pointer for the BackingStore data, which is deleted later in the same function. The BackingStore pointer is a shared pointer that can be accessed from the V8 and the Embedder components of Chromium, and it creates a lifetime management problem when both components hold pointers to the backing store data. The code complexity is increased since the components can resize the shared memory or transfer ownership from one component to another. The unsafe ownership model of BackingStore is prone to errors, such as memory leaks and access to pointers after they have been deleted, which eventually led to various bugs [backingData_bug1; backingData_bug2].

1 // The backing store deleter just deletes the indirection, which downrefs
2 // the shared pointer. It will get collected normally.
3 void BackingStoreDeleter(... void* info) {
4         std::shared_ptr<i::BackingStore>* bs_indirection = reinterpret_cast<std::shared_ptr<i::BackingStore>*>(info);
5         ...
6         delete bs_indirection;
7 }
Figure 17: An example of reinterpret_cast which was removed from code.
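The indirection in Figure 17 can be reproduced with a small sketch: a heap-allocated copy of a shared pointer travels through a void* argument and is deleted by a callback, which drops one reference. BackingStore here is a hypothetical empty stand-in, and MakeIndirection and DeleteIndirection are illustrative names rather than V8 functions. The sketch uses static_cast, which is sufficient for converting a void* back to the pointer type it was created from.

```cpp
#include <memory>

struct BackingStore {  // hypothetical stand-in for i::BackingStore
  int byte_length = 0;
};

// Boxes a copy of the shared pointer on the heap so that it can be passed
// through a void* callback argument. The copy holds one extra reference.
void* MakeIndirection(const std::shared_ptr<BackingStore>& store) {
  return new std::shared_ptr<BackingStore>(store);
}

// Recovers the boxed shared pointer and deletes it, releasing its reference.
// Calling this twice on the same pointer would be a double delete.
void DeleteIndirection(void* info) {
  auto* bs_indirection = static_cast<std::shared_ptr<BackingStore>*>(info);
  delete bs_indirection;
}
```

Nothing in the signature of the deleter records which type info really points to or who is responsible for calling it, which is where the lifetime management burden described above comes from.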

The problems were solved by refactoring the ownership model so that the BackingStore owns the shared pointers [backingData; backingData_commit]. The previous implementation required each component to delete its shared pointer instance through the method BackingStoreDeleter. The new version of the BackingStore class counts the shared pointer references and, if the count reaches zero, the BackingStore deletes the pointer. The named cast operation, along with the function BackingStoreDeleter, was removed in the new implementation [backingData_commit]. While the named cast operation was not directly causing the bugs, it added complexity to the code by requiring each component to delete its shared pointer instance, and the code eventually led to bugs. Our approach identified this case because the source and destination identifiers (info and bs_indirection) are different. There is, however, a semantic relation between the identifiers, since info refers to the data and bs_indirection refers to the backing store pointer, which is the copy of the data. Had this semantic relation been taken into account, it is likely that this case would not have been identified.

Figure 18 presents two versions of a macro function F collected from the file ast-value-factory.cc of the component AST. The first version contains a reinterpret_cast on line 6. We identified this named cast because the source and destination expressions are very different. The source expression is an integer literal representing the value 1. The destination variable is a void* pointer with the identifier entry->value and it points to the value of an entry in a HashMap. The function F is used in the initialisation of HashMap objects and each entry is initialised with the value 1. The second version of the macro function F, which is a refactored version [refactored_hashmap], does not contain the named cast operation. From the absence of the named cast operation, together with the information from the commit, we can tell that the new implementation of the HashMap supports entries with empty values without causing any errors. The named cast operation in the first version was a workaround, without a proper way of defining the behaviour if the entries did not have values. This means that the code in the first version was error-prone in the case of empty values. A proper implementation shows that the named cast operation is not needed in this case.

1  // Old implementation
2  #define F(name, str)
3  ...
4    HashMap::Entry* entry =
5          string_table_.InsertNew(name##_string_, name##_string_->Hash());
6    entry->value = reinterpret_cast<void*>(1);
7
8  // New Implementation
9  #define F(name, str)
10 ...
11   string_table_.InsertNew(name##_string_, name##_string_->Hash());
Figure 18: An example of reinterpret_cast which was removed from code.

Analysis of const_cast examples

There were only five cases of const_cast operators in the sampled dataset. Four cases belong to the library ICU, in two different files: tznames_impl.cpp and tzfmt.cpp. For these cases, the source identifiers are generic and partially different compared to the destination identifiers. Figure 19 presents one of the four cases, from the file tznames_impl.cpp. The source variable is the pointer this, which points to an instance of the class encapsulating the statement and has the type const TimeZoneNamesImpl*. The destination variable is a pointer called nonConstThis, which does not have the qualifier const in its type. The identifiers chosen for source and destination reinforce our hypothesis that identifiers carry meaning. Here, the getters in the encapsulating class need to maintain the integrity of the original object. Thus, the desired values need to be extracted from a non-const object derived from the pointer this using a const_cast operator. This is an instance where explicit casting is being used judiciously, clearly indicating its purpose through meaningful identifiers.

TimeZoneNamesImpl *nonConstThis = const_cast<TimeZoneNamesImpl *>(this);
Figure 19: A const_cast example used to obtain a non const object from the const pointer this.
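A common reason for the pattern in Figure 19, assumed here purely for illustration, is a getter that is logically const but fills a cache on first use. The class NameCache and its members are hypothetical and are not taken from ICU. Modifying an object through const_cast is only well defined when the object itself was not declared const.

```cpp
#include <string>

class NameCache {  // hypothetical class
 public:
  // Logically const for callers, but the first call fills the cache.
  const std::string& DisplayName() const {
    if (!loaded_) {
      NameCache* nonConstThis = const_cast<NameCache*>(this);
      nonConstThis->name_ = "UTC";  // placeholder for an expensive lookup
      nonConstThis->loaded_ = true;
    }
    return name_;
  }

  bool loaded() const { return loaded_; }

 private:
  std::string name_;
  bool loaded_ = false;
};
```

Declaring the cached members mutable achieves the same effect without a cast, at the price of losing the explicit nonConstThis marker at the point of mutation.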

Figure 20 presents a second example of a const_cast. This example is taken from component Base and belongs to the method CaptureStackTrace which is used to collect frames in the execution stack. It is interesting and complements the one discussed in Figure 19 because the type qualifier const is being added to a value in this case. In this case, the type conversion is a parameter for the function call TraceStackFramePointers. The function in lines 1-3 returns the total number of the frames for the stack. The source identifier is frames which has the type void** and it represents the pointer to the stack frames. Line 5 of Figure 20 shows the function declaration. The destination identifier is out_trace with the type const void**. Being able to check the stack is vital for debugging but at the same time, the stack should be protected during debugging. The const_cast is required in this case to protect the stack frames from inadvertent manipulation while the developer is inspecting the stack. Here, we see an instance where the cast is necessary but the identifier for the destination is not descriptive enough. The advantage of our approach is we are able to bring this to the notice of the developer who may choose to use a more meaningful identifier for the destination.

1 size_t frame_count = base::debug::TraceStackFramePointers(
2       const_cast<const void**>(frames),
3       max_entries, skip_frames);
4
5 size_t TraceStackFramePointers(const void** out_trace,
6       size_t max_depth, size_t skip_initial)
Figure 20: An example of how a const_cast operator is used to add the const qualifier to a variable.
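The cast in Figure 20 is needed because void** does not convert implicitly to const void**. The sketch below reproduces the shape of the call with hypothetical functions (TraceFramePointers and CaptureFrames) and fake frame addresses. It does not walk a real stack.

```cpp
#include <cstddef>

// Hypothetical callee: records up to max_depth addresses in out_trace and
// returns how many were written. The addresses are fake placeholders.
std::size_t TraceFramePointers(const void** out_trace, std::size_t max_depth) {
  static const int kFakeFrames[3] = {0, 0, 0};
  std::size_t depth = 0;
  for (; depth < max_depth && depth < 3; ++depth) {
    out_trace[depth] = &kFakeFrames[depth];
  }
  return depth;
}

// Hypothetical caller that owns a buffer of plain void* slots.
std::size_t CaptureFrames(void** frames, std::size_t max_entries) {
  // The const qualifier is added at the inner level explicitly, because the
  // language has no implicit conversion from void** to const void**.
  return TraceFramePointers(const_cast<const void**>(frames), max_entries);
}
```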

Analysis of dynamic_cast examples

Since the sampled dataset had only one instance of dynamic_cast, we expanded our investigation to the entire dataset and analysed a total of 11 cases. We present two of them below. The first instance can be found in Figure 21. It has been extracted from private_typeinfo.cpp and it is part of the libc++abi library. The dynamic_cast operator appears in variable declarations in the methods can_catch and can_catch_nested. These methods are used for exception handling and report mismatches during type conversions by checking if the result is null or not. If not, the methods return an exception. The source variable in our example has the identifier __pointee, which is of the type const __shim_type_info*. The destination variable is member_ptr_type, which is a const pointer to __pointer_to_member_type_info, which itself is derived from the class __pbase_type_info, a sub-class of std::type_info which contains information about the types of variables. The names in this cast are generic, and understandably so. libc++abi implements the Application Binary Interface for C++ and is expected to be generic to fit in with a wide spectrum of low-level transactions between the application, libraries and the operating system. The dynamic_cast operator is used in this case to check at runtime if the destination variable can take the source's type while keeping the natural language identifiers as generic as possible.

const __pointer_to_member_type_info* member_ptr_type =
    dynamic_cast<const __pointer_to_member_type_info*>(__pointee);
Figure 21: An example of how a dynamic_cast is used in the implementation of an exception handler
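The null-check idiom behind Figure 21 can be sketched with a small, hypothetical hierarchy. TypeInfoBase, PointerInfo, MemberPointerInfo and IsMemberPointer are illustrative names, not the libc++abi classes or functions; the sketch only shows that a failed pointer dynamic_cast yields a null pointer that the caller can test.

```cpp
#include <cassert>

// Hypothetical type-description hierarchy (illustrative only).
struct TypeInfoBase {
  virtual ~TypeInfoBase() = default;  // polymorphic, so dynamic_cast works
};
struct PointerInfo : TypeInfoBase {};
struct MemberPointerInfo : TypeInfoBase {};

// Returns true only if `pointee` really describes a pointer-to-member.
// A mismatching dynamic type, or a null input, produces a null result.
bool IsMemberPointer(const TypeInfoBase* pointee) {
  const MemberPointerInfo* member_ptr_type =
      dynamic_cast<const MemberPointerInfo*>(pointee);
  return member_ptr_type != nullptr;
}
```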

A second example of dynamic_cast is presented in Figure 22. The snippet is from the file upluralrules.cpp in the ICU (International Components for Unicode) module. The source variable is fmt, with the type const class icu_64::NumberFormat*, which captures the format of the expression. The destination variable is decFmt, which has the type const class icu_64::DecimalFormat*. The destination's type, class DecimalFormat, inherits from the source's type, class NumberFormat [ICU_doc], so this is an example of a down-cast operation which is verified at runtime. If the checks fail and decFmt is NULL, the method continues to check for other known formats. The ICU module handles a wide variety of data types. Even for numerics, which is the focus of our example, there are several different types that need checking: int32_t, double and FixedDecimal. Most of these values are only available at runtime and therefore the developers prefer to insert explicit checks through the dynamic_cast operator. The identifiers in this case reflect the type specialisation that is happening through the dynamic_cast operator. This is an example where type conversions are used judiciously, with clear objectives and with names reflecting the type conversion that is taking place. Further, the use of the dynamic_cast operator makes the type conversion safe at runtime.

const DecimalFormat *decFmt = dynamic_cast<const DecimalFormat *>(&fmt);
Figure 22: An example of dynamic_cast used to perform a down-cast conversion.
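The checked down-cast with a fallback, as described for Figure 22, might look roughly as follows. NumberFormatSketch, DecimalFormatSketch and SelectPath are hypothetical stand-ins and do not reproduce the ICU classes or their API; the point is only the control flow: take the specialised path when the down-cast succeeds, otherwise continue with the generic interface.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for a base format and a specialised format.
struct NumberFormatSketch {
  virtual ~NumberFormatSketch() = default;
  virtual std::string Describe() const { return "generic"; }
};
struct DecimalFormatSketch : NumberFormatSketch {
  // Only available on the derived type, hence the need for a down-cast.
  int MinimumFractionDigits() const { return 2; }
};

std::string SelectPath(const NumberFormatSketch& fmt) {
  // Runtime-checked down-cast: null if fmt is not a DecimalFormatSketch.
  const DecimalFormatSketch* decFmt =
      dynamic_cast<const DecimalFormatSketch*>(&fmt);
  if (decFmt != nullptr) {
    return "decimal:" + std::to_string(decFmt->MinimumFractionDigits());
  }
  // Fallback: keep using the base-class interface.
  return fmt.Describe();
}
```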

5 Discussion

In this work, we presented a summary of the findings from our study of the named cast operators. We have identified: two cases of iteration over enumeration types (Figures 6 and 16), two cases of poorly named variables (Figures 13 and 16), two instances of anti-patterns that have been refactored in later versions of the software so that the named cast operators were no longer used (Figures 12 and 18), two cases that increased the complexity of the code, which led to poor quality code and bugs (Figures 14 and 17), two cases that enabled a function to change behaviour based on the types of the pointer (Figure 15) and two good programming practices for protecting values stored in variables (Figures 19 and 20).

The operator static_cast is the most versatile and most widely used operator for explicit type conversions. In Figure 6, we discovered the use of static_cast to iterate over an enumeration, which is an abuse of the enumeration type and an inefficient implementation. Figure 11 presents a good use of static_cast, demonstrating how it can be used to provide safety during pointer initialisations. We also found examples where named casts were used as a quick workaround. The case from Figure 12 showed a cast which has been removed in recent versions. The case from Figure 13 shows conversions between primitive types, which in most cases are harmless. However, the destination variable is a void pointer, which can point to many types and lead to type confusion. The last case, from Figure 14, shows a correct use of the static_cast operator that is nonetheless part of complex code, which led to inefficiency and even to a bug.
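The enumeration-iteration pattern mentioned above has roughly the following shape. This is an illustrative sketch rather than the code from Figure 6: the Channel enumeration and both functions are hypothetical, and the cast-free alternative is our assumption of one possible refactoring, not a recommendation taken from the study.

```cpp
#include <array>
#include <cassert>

// Hypothetical enumeration with a sentinel enumerator used as a loop bound.
enum class Channel { kRed, kGreen, kBlue, kCount };

// Pattern described in the text: an integer counter is cast to the
// enumeration type on every iteration.
int CountByCasting() {
  int visited = 0;
  for (int i = 0; i < static_cast<int>(Channel::kCount); ++i) {
    Channel c = static_cast<Channel>(i);
    assert(c != Channel::kCount);  // holds only while enumerators are contiguous
    ++visited;
  }
  return visited;
}

// Cast-free alternative (an assumption): list the enumerators explicitly.
constexpr std::array<Channel, 3> kAllChannels = {
    Channel::kRed, Channel::kGreen, Channel::kBlue};

int CountByTable() {
  int visited = 0;
  for (Channel c : kAllChannels) {
    (void)c;  // a real loop body would use the enumerator here
    ++visited;
  }
  return visited;
}
```

The casting version silently depends on the enumerators being contiguous and starting at zero; the table version makes the set of values explicit at the cost of keeping the table in sync with the enumeration.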

The reinterpret_cast operator is used mostly for pointer to pointer conversions as it is the most permissive. Figure 15 presented two examples of conversions of two different pointer types bound to a destination which has the same name. Using the same name to store data of different kinds is not desirable and we believe the code can benefit from variable renaming. In Figure 16, we presented an example of serialisation/deserialisation where the developers have relied on reinterpret_cast to be able to deal with a diversity of objects. There is a strong software engineering reason to do so, as it is essential to keep the interface to the serialiser and deserialiser generic to be able to deal with any data type. The case from Figure 17 shows another example where complex code led to bugs. After the bugs were solved, the code was refactored and the named cast was completely removed. The last case shows the use of a reinterpret_cast as a quick workaround to avoid implementing the empty-value case for entries of a HashMap. This named cast operation was also removed in recent versions.

dynamic_cast operators are used infrequently. They are used when the developer is unsure if a conversion is possible or not. In this way, the runtime checks will confirm whether the casts are valid. An example where it is mandatory to prove a cast is valid appears in the implementation of an exception handler shown in Figure 21. Another essential use-case of the dynamic_cast operator is for downcasts. The component ICU contains the most dynamic conversions and they are used for downcasts. Section 7 discusses some solutions to avoid the expensive dynamic cast. However, the question of why, among all of Chromium's components, only ICU has implemented its downcasts with dynamic_cast remains unanswered.

The operator const_cast is used for software engineering reasons and security reasons. Even if this operator can introduce undefined behaviour as presented in Section 2, the analysed cases were adequately implemented. We have identified two const_cast usage patterns from the analysis. One pattern appears when an object tries to access itself through the pointer this in a function declared with the qualifier const. The const functions will make the pointer this also have the qualifier const. However, there are times when the const this pointer needs to be passed as a parameter to non-const functions. Figure 19 shows an example where an explicit conversion was performed in a getter to obtain information from an object. Another use-case appears when some non-const variables need to be protected against modification in specific methods. In order to do so, the const_cast will be used to add the const qualifier. Figure 20 shows how a stack is passed as a parameter to a function after the conversion. The motivation behind the use of some const type conversions comes from the use of third party libraries.
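The first pattern, passing a const this pointer to a non-const function from a getter, can be sketched as follows. Node and LegacyDepth are hypothetical names and do not reproduce the code in Figure 19. The sketch assumes the callee only reads through the pointer; modifying an object that was originally declared const through such a pointer would be undefined behaviour.

```cpp
#include <cassert>

class Node;

// Hypothetical helper with a non-const parameter, although it only reads.
int LegacyDepth(Node* node);

class Node {
 public:
  explicit Node(int depth) : depth_(depth) {}

  // Inside a const member function, `this` has type `const Node*`, so the
  // qualifier must be removed explicitly before calling LegacyDepth.
  int depth() const { return LegacyDepth(const_cast<Node*>(this)); }

 private:
  friend int LegacyDepth(Node* node);
  int depth_;
};

int LegacyDepth(Node* node) {
  assert(node != nullptr);
  return node->depth_;  // read-only access, so the cast above is harmless
}
```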

6 Threats to Validity

Internal threats

The results of the manual evaluation and the findings on the usage of named cast operators are influenced by the subjective experience of the raters. We tried to minimise this bias by using three raters with experience in C++. Each rater consulted the ISO C++ Standard to understand how the named cast operators should be used, and only afterwards provided feedback on the sample data. After each rater performed an initial evaluation, they jointly selected the interesting cases presented in Section 4.3.

External threats

Our tool may be applied to code where variable names are chosen carelessly. In an ideal world, the natural language channel provides enough context to understand the code's purpose. Our approach relies on the connection between the identifiers to detect cast misuses, and the tool performs better if the identifiers are meaningful. In a scenario where the names are chosen carelessly, our tool might identify fewer cases of cast misuse, but it will identify more cases of imprecise names. In many cases, cast misuse can be overshadowed by imprecise naming. This is overcome by initially identifying imprecise naming, essentially forming the first stage of a two-stage refactoring: clarification of intent followed by validation of intent. However, our tool will also report some false positives because of the nature of the approach. Developers might decide in some cases that generic or different names are appropriate for the source and destination identifiers. In such cases, these casts would be flagged despite the identifiers being meaningful to the code.

7 Related Work

Research into type systems accelerated with Luca Cardelli's seminal and accessible papers on type theory [typefulPaper; understanding_types; types_data]. He lucidly explained how type systems could help us write better programs with fewer bugs. Some of that research also discusses properties of types in object-oriented programming. explicit_casting_research (explicit_casting_research) presented an analysis of the explicit type cast operators for C++ with details of each type of operator. fastDynCastPaper (fastDynCastPaper) proposed a method to implement dynamic casts, which are expensive operations, for systems where performance is critical. dynCastPaper (dynCastPaper) have demonstrated the efficiency of the Gibbs and Stroustrup implementation by using it as a baseline while also improving the performance by a factor of two.

Type casting studies.

In terms of effects, a significant number of research papers study the undefined behaviour introduced by type conversions [UB1; UB2; UB3]. Undefined behaviour can have many causes and some of them are due to type conversions. For instance, during the execution of a dynamic_cast, the program needs to check the pointer's type. This is done by dereferencing the pointer, which can result in undefined behaviour [UBblog; UBblog2]. Compilers will capture some cases of undefined behaviour, for which they will generate warnings, but not all of them [UB1]. For this reason, developers need tools and techniques to verify their code.

js_study (js_study) conducted an empirical study of implicit casts in JavaScript. They showed that those type conversions are in general harmless and that developers use them correctly. In other words, implicit casts are safe to use most of the time. However, there is contradicting evidence that unrestrained named casts or explicit casts can have undesirable effects. Tools have been researched and developed to detect such casts. detectPaper_caver (detectPaper_caver) present CAVER, a tool to identify poor practices in casting, and also discuss their security implications. The tool analyses C++ code and focuses on the unsafe uses of static_cast and dynamic_cast. This work has provided a good background to understand how named casts can go wrong. Their tool's evaluation, much like ours, is performed on the code from Chromium. detectPaper_hextype (detectPaper_hextype) provide another tool, HexType, that performs well at detecting badly implemented casts. They have implemented HexType using low-overhead data structures and compiler optimisations to minimise the required resources. casting_java_explicit (casting_java_explicit) provided an empirical study of type conversions for Java. The target of their research is to discover when and how developers use an explicit cast. This is done through discovering and presenting 25 patterns of cast usage from real-life Java code. This paper is the closest to our work but, unlike ours, it does not use any signal from the natural language identifiers to detect anti-patterns.

Dual-Channel Research.

knuth (knuth) proposed a paradigm shift in programming, commonly known as Literate Programming, where writing code to instruct a computer is secondary to presenting it to human beings. In Literate Programming, each program contains its explanation in natural language intermixed with sections of code. Knuth presented the system WEB, which is a literate programming language comprising a document formatting language (TeX) and a programming language (Pascal). Literate programs contain a human-readable explanation interspersed with code, which is automatically picked up by the WEB system to produce an executable. At the same time, WEB enables the inclusion of powerful features such as pictures, equations, tables, and others in the natural language part of a literate program. Thus, the natural language information remains in harmony with the software itself.

Literate programming laid the foundation for novel research directions in Software Engineering that drew upon advances in Natural Language Processing.

naturalness (naturalness) proposed the naturalness hypothesis for software, which noted that large programs can be repetitive and can be modeled with techniques that capture repetition, such as n-grams. They noted that code is analogous to natural languages in the way it tends to repeat. Such repetitive patterns can be harvested and interpreted as statistical properties that can be used to develop better software engineering tools. They used this observation to build a statistical language model over a large corpus to improve code completion. An n-gram language model was built using token sequences, which included natural language information in the form of identifiers, from open source code. The model was used in a code completion plugin for the Eclipse IDE, which performed better than Eclipse's completion system at that time.

Source code is normally written to run on a device. However, the same code is also written for developers who maintain or improve the application. Therefore, a large part of the code semantics is embedded in the communication channels between developers, i.e. the natural language identifiers that are chosen and the comments that are written in the code. Based on this insight, dual_channel (dual_channel) described two communication channels in source code: the algorithmic channel (AL) and the natural language channel (NL). The algorithmic channel comprises all the instructions written by the developers which will be executed by a computer. The natural language channel, which consists of identifiers and comments, provides information about the purpose of the code in a human-readable format. The relation between the AL and NL channels can be utilised to improve software analysis tools.

flexeme (flexeme) have developed a tool called HEDDLE to detect and separate tangled commits into atomic concerns. HEDDLE generates a graph data structure that encodes different versions of the program and annotates the data flow edges using the natural language information from the source code. HEDDLE performs faster and is more accurate in the detection of tangled commits than the previous state-of-the-art. posit (posit) have also developed a technique called POSIT, which adapts NLP techniques for tagging between code and natural language. POSIT can generate more accurate tags for both source code tokens and natural language words than the previous state-of-the-art.

Dual-channel Research On Extracting Meaning From Names.

Identifier names represent the majority of tokens in source code. identifiers_code_quality (identifiers_code_quality) have shown through an empirical study on Java applications that there is a direct relation between the naming quality of identifiers and source code quality. Thus, poorly named identifiers show a lack of understanding of the problem, which translates into poor quality software. The authors measured the quality of identifiers based on identifier naming guidelines and on a comparison of subtokens to Java and application-specific terms. Even though the subtokens' semantic meaning is ignored in the analysis, this empirical study shows that the relation between the dual-channel information is not yet fully harvested and applied in software analysis tools.

refinym (refinym) used dual-channel constraints to mine conceptual types from identifiers and assignment flows between them. Conceptual types are types that are latent in the program but not explicitly declared by the developer. Generally, conceptual types correspond to the actual types, but there are cases where they remain latent. For instance, password and username may have the same type, string, but their conceptual types are different. If a password, which is generally a highly protected field, was declared the same way as the username, it would lead to a vulnerability.
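To make the password/username example concrete, the sketch below shows what declaring conceptual types explicitly could look like. It is our illustration, not the mining technique of the cited work: Username, Password and Greeting are hypothetical, and wrapping strings in distinct structs is only one possible encoding.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Both wrap the same actual type (std::string) but are distinct
// conceptual types, so the compiler keeps them apart.
struct Username { std::string value; };
struct Password { std::string value; };

// Compile-time checks: a Password can be neither converted nor assigned
// to a Username without an explicit, visible step.
static_assert(!std::is_convertible<Password, Username>::value,
              "Password must not convert implicitly to Username");
static_assert(!std::is_assignable<Username&, Password>::value,
              "Password must not be assignable to Username");

// Accepts only a Username; passing a Password would not compile.
std::string Greeting(const Username& user) { return "Hello, " + user.value; }
```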

deepbugs (deepbugs) developed a learning approach, called DeepBugs, for discovering bugs based on the semantic meaning of the identifier names. This approach uses embeddings, a vector representation for identifiers, which preserve the semantic similarities between identifiers. The bug detection is treated as a binary classification problem. The DeepBugs approach trains a classifier to distinguish correct code from incorrect code. The training data consist of correct code and incorrect code generated by the authors. The bug detectors use the embeddings from the training phase to discover bugs. Three bug detectors were built based on this approach to discover accidentally swapped function arguments, incorrect binary operators, and incorrect operands in binary operations. The bug detectors achieve a high accuracy, between 89% and 95%, in distinguishing correct from incorrect code. The bug detectors are also very efficient, taking less than 20 milliseconds to analyse a file. False positives are inevitable in static analysis tools; however, the bug detectors have a 68% true positive rate.

Another approach that makes use of the semantic meaning of the identifier names is presented by context2name (context2name) and is called Context2Name. JavaScript code is usually deployed in a minified version in which the identifiers are replaced with short and random names. Context2Name is a deep learning-based technique that predicts identifier names for variables that have a minified name. This technique generates context vectors for each identifier by inspecting five tokens before and after the identifier's occurrence. The context vectors are then summarised in embeddings. Those embeddings are used by a recurrent neural network to predict natural names for the minified variables. Context2Name predicts the correct identifier for 47.5% of all minified names and it predicts 5.3% additional identifiers missed by the state-of-the-art tools.
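The context extraction step described above (five tokens before and after an occurrence) can be sketched as a simple window over a token sequence. This is an illustrative simplification, not the Context2Name implementation: ContextWindow is a hypothetical helper, it handles a single occurrence only, and it assumes the code has already been tokenised.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collects up to `radius` tokens on each side of tokens[index],
// excluding the token at `index` itself.
std::vector<std::string> ContextWindow(const std::vector<std::string>& tokens,
                                       std::size_t index,
                                       std::size_t radius = 5) {
  assert(index < tokens.size());
  const std::size_t begin = index > radius ? index - radius : 0;
  const std::size_t end = std::min(tokens.size(), index + radius + 1);
  std::vector<std::string> context;
  for (std::size_t i = begin; i < end; ++i) {
    if (i != index) {
      context.push_back(tokens[i]);
    }
  }
  return context;
}
```

Near the start or end of the token sequence the window is simply truncated; a real system would also have to decide how to combine the windows of several occurrences of the same identifier.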

The improvements made by dual-channel research show how much potential dual-channel information holds for software analysis. Our study applies approaches similar to those of dual-channel research to a different problem. Hints of the developer's intent have been extracted from natural language information to guide the detection of anti-patterns of named casts.

8 Conclusion

Our study provides insight into how developers use named casts. Our technique provides the opportunity to prioritise refactorings for named cast operators. The results have shown that identifiers can add insights into program semantics. This is beneficial in several ways, one of which is to build novel representations of programs for a variety of software analysis tasks. One such task is sanity checking cast operations where the developers cross type boundaries for a variety of reasons, such as code reuse, time-to-market pressure, coding standards etc. We believe that the approaches presented in this work are lightweight enough to be used by developers as an IDE plugin during development. This work also provides a strong foundation to help richer forms of static analysis scale by using a novel form of program representation that draws from the natural language channel.

References