Binscatter Regressions

We introduce the Stata (and R) package Binsreg, which implements the binscatter methods developed in Cattaneo-Crump-Farrell-Feng_2019_Binscatter. The package includes the commands binsreg, binsregtest, and binsregselect. The first command (binsreg) implements binscatter for the regression function and its derivatives, offering several point estimation, confidence intervals and confidence bands procedures, with particular focus on constructing binned scatter plots. The second command (binsregtest) implements hypothesis testing procedures for parametric specification and for nonparametric shape restrictions of the unknown regression function. Finally, the third command (binsregselect) implements data-driven number of bins selectors for binscatter implementation using either quantile-spaced or evenly-spaced binning/partitioning. All the commands allow for covariate adjustment, smoothness restrictions, weighting and clustering, among other features. A companion R package with the same capabilities is also available.

0.1 Introduction

Binscatter has become a very popular methodology in applied microeconomics since its introduction (Chetty-Szeidl_2006_wp; Chetty-Looney-Kroft_2009_AER; Chetty-Friedman-Olsen-Pistaferri_2011_QJE; Chetty-Friedman-Hilger-Saez-Schanzenbach-Yagan_2011_QJE). See Stepner_2014_StataConf for a popular Stata implementation of canonical binscatter, and Starr-Goldfarb_2018_wp for a very recent heuristic overview of these methods. Binscatter techniques offer flexible, yet parsimonious ways of visualizing and summarizing “big data” in regression settings. They can also be used for formal estimation and inference, including testing of substantive hypotheses, such as linearity or monotonicity, about the regression function and its derivatives. Despite its popularity among empirical researchers, little was known about the statistical properties of binscatter until very recently: Cattaneo-Crump-Farrell-Feng_2019_Binscatter offered the first foundational, thorough analysis of binscatter, giving an array of theoretical and practical results that aid both in understanding current practices (i.e., their validity or lack thereof) and in offering theory-based guidance for future applications.

This paper introduces the Stata (and R) package Binsreg, which includes three commands implementing the main methodological results in Cattaneo-Crump-Farrell-Feng_2019_Binscatter: binsreg, binsregtest, and binsregselect. The first command (binsreg) implements binscatter for the regression function and its derivatives, offering several point estimation, confidence intervals and confidence bands procedures, with particular focus on constructing binned scatter plots. The second command (binsregtest) implements hypothesis testing procedures for parametric specification and for nonparametric shape restrictions of the unknown regression function and its derivatives. Finally, the third command (binsregselect) implements data-driven number of bins selectors for binscatter implementation using either quantile-spaced or evenly-spaced binning.

There exists another very popular Stata command also implementing binscatter methods: binscatter. See Stepner_2014_StataConf for an introduction to this alternative command. The command binsreg offers several new capabilities relative to binscatter, in addition to implementing covariate adjustment in a different, more principled way. First, the command binsreg implements binscatter methods allowing for within-bin higher-order polynomial fitting and for across-bins smoothness restrictions, which enables derivative estimation and also produces smooth approximations of the regression function and its derivatives. Further, the command binsreg implements valid confidence intervals and confidence bands. None of these features is available in the command binscatter. Second, both commands allow for covariate adjustment, but each command does it in a very different way: binscatter employs a residualization approach, while binsreg employs a semi-linear approach. As shown in Cattaneo-Crump-Farrell-Feng_2019_Binscatter, the residualization approach is, in general, inconsistent for the target function of interest, while the semi-linear approach is either consistent (under correct specification) or has a clear probability limit interpretation (under misspecification). Finally, the other two commands in the package Binsreg, binsregtest and binsregselect, are novel implementations in Stata (and in R).

The rest of the article is organized as follows. Section 0.2 gives an overview of the main methods available in the package Binsreg and discusses some implementation details. Sections 0.3, 0.4 and 0.5 discuss, respectively, the syntax of the commands binsreg, binsregtest and binsregselect. Section 0.6 gives a numerical illustration, while Section 0.7 concludes. The latest version of this software, as well as other related software and materials, can be found at:

https://sites.google.com/site/nppackages/binscatter/

0.2 Overview of Methods and Implementation Details

This section summarizes the main methods implemented in the package Binsreg. For further methodological and theoretical details see Cattaneo-Crump-Farrell-Feng_2019_Binscatter (CCFF hereafter) and references therein.

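Given a random sample $(y_i, x_i, \mathbf{w}_i')$, $i = 1, \dots, n$, binscatter seeks to flexibly approximate the function $\mu(x)$ in the partially linear regression model:

$$y_i = \mu(x_i) + \mathbf{w}_i'\boldsymbol{\gamma} + \epsilon_i, \qquad E[\epsilon_i \mid x_i, \mathbf{w}_i] = 0, \tag{1}$$

where $y_i$ is an outcome of interest and $x_i$ is a covariate of interest, while $\mathbf{w}_i$ represents other covariates possibly entering the semi-linear regression model. In particular, if the latter covariates are not present, then $\mu(x) = E[y_i \mid x_i = x]$, while otherwise $\mu(x)$ denotes the average (partial) relationship between $y_i$ and $x_i$ after controlling for $\mathbf{w}_i$. More general interpretations are also possible; see CCFF for more discussion.

In some applications, interest may be on the $v$-th derivative of $\mu(x)$, which we denote by $\mu^{(v)}(x) = d^v\mu(x)/dx^v$. We employ the usual notation $\mu(x) = \mu^{(0)}(x)$.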

0.2.1 Binscatter Construction

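To approximate $\mu(x)$ and its derivatives in model (1), binscatter first partitions the support of $x_i$ into $J$ quantile-spaced bins, leading to the partitioning scheme:

$$\hat{\Delta} = \{\hat{\mathcal{B}}_1, \dots, \hat{\mathcal{B}}_J\}, \qquad \hat{\mathcal{B}}_j = \begin{cases} \big[x_{(1)},\, x_{(\lfloor n/J \rfloor)}\big) & \text{if } j = 1, \\ \big[x_{(\lfloor n(j-1)/J \rfloor)},\, x_{(\lfloor nj/J \rfloor)}\big) & \text{if } j = 2, \dots, J-1, \\ \big[x_{(\lfloor n(J-1)/J \rfloor)},\, x_{(n)}\big] & \text{if } j = J, \end{cases}$$

where $x_{(i)}$ denotes the $i$-th order statistic of the sample $\{x_1, \dots, x_n\}$, $\lfloor \cdot \rfloor$ is the floor operator, and $J < n$. Each estimated bin $\hat{\mathcal{B}}_j$ contains roughly the same number of observations $N_j = \sum_{i=1}^n \mathbb{1}_{\hat{\mathcal{B}}_j}(x_i)$, where $\mathbb{1}_{\mathcal{A}}(x) = \mathbb{1}(x \in \mathcal{A})$ with $\mathbb{1}(\cdot)$ denoting the indicator function. This binning approach is the most popular in empirical work but, for completeness, all commands in the package Binsreg also allow for evenly-spaced binning. See below for more implementation details.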

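Given the quantile-spaced partitioning/binning scheme, for a choice of number of bins $J$, a generalized binscatter estimator of the $v$-th derivative of $\mu(x)$, employing a $p$-th order polynomial approximation within each bin, imposing $(s-1)$-times differentiability across bins, and adjusting for additional covariates $\mathbf{w}_i$, is given by

$$\hat{\mu}^{(v)}(x) = \hat{\mathbf{b}}_s^{(v)}(x)'\hat{\boldsymbol{\beta}}, \qquad \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} = \arg\min_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \sum_{i=1}^n \big(y_i - \hat{\mathbf{b}}_s(x_i)'\boldsymbol{\beta} - \mathbf{w}_i'\boldsymbol{\gamma}\big)^2,$$

where $s \le p$, $v \le p$, and $\hat{\mathbf{b}}_s(x) = \hat{\mathbf{T}}_s \hat{\mathbf{b}}(x)$ with

$$\hat{\mathbf{b}}(x) = \big[\mathbb{1}_{\hat{\mathcal{B}}_1}(x), \dots, \mathbb{1}_{\hat{\mathcal{B}}_J}(x)\big]' \otimes \big[1, x, \dots, x^p\big]'$$

being the $p$-th order polynomial basis of approximation within each bin, hence of dimension $(p+1)J$, and $\hat{\mathbf{T}}_s$ being a $[(p+1)J - (J-1)s] \times (p+1)J$ matrix of linear restrictions ensuring that the $(s-1)$-th derivative of $\hat{\mu}(x)$ is continuous.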

When $s = 0$, $\hat{\mathbf{T}}_0 = \mathbf{I}_{(p+1)J}$, the identity matrix of dimension $(p+1)J$, and therefore no restrictions are imposed: $\hat{\mathbf{b}}_0(x) = \hat{\mathbf{b}}(x)$ is the basis used for (disjoint) piecewise $p$-th order polynomial fits. Consequently, the binscatter is discontinuous at the bins’ edges whenever $s = 0$. On the other hand, $p \ge s$ implies that a least squares $p$-th order polynomial fit is constructed within each bin $\hat{\mathcal{B}}_j$, in which case setting $s = 1$ forces these fits to be connected at the boundaries of adjacent bins, $s = 2$ forces these fits to be connected and continuously differentiable at the boundaries of adjacent bins, and so on for each $s = 3, \dots, p$.
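To make the unrestricted case ($s = 0$) concrete, the following minimal Python sketch evaluates the piecewise polynomial basis at a single point and counts the free coefficients left after imposing $s$ restrictions at each interior knot. It is an illustration only, not the package's implementation: the function names, the list-based representation, and the toy bin edges are all assumptions made for this example.

```python
from bisect import bisect_right

def bin_index(x, edges):
    """Index (0-based) of the bin containing x, given J+1 sorted edges.
    Bins are half-open [e_j, e_{j+1}), except the last one, which is closed."""
    J = len(edges) - 1
    j = bisect_right(edges, x) - 1
    return min(max(j, 0), J - 1)

def piecewise_poly_basis(x, edges, p):
    """Unrestricted basis (s = 0): bin indicators times (1, x, ..., x^p).
    Only the p+1 entries belonging to the bin of x are non-zero."""
    J = len(edges) - 1
    j = bin_index(x, edges)
    b = [0.0] * ((p + 1) * J)
    for k in range(p + 1):
        b[j * (p + 1) + k] = x ** k
    return b

def restricted_dim(p, s, J):
    """Free coefficients after s restrictions at each of the J-1 interior knots."""
    if not 0 <= s <= p:
        raise ValueError("need 0 <= s <= p")
    return (p + 1) * J - (J - 1) * s

edges = [0.0, 1.0, 2.0, 3.0]                      # hypothetical partition, J = 3
basis_at_x = piecewise_poly_basis(1.5, edges, p=2)
```

In this toy partition the point 1.5 falls in the second bin, so only the middle three entries of the nine-dimensional basis are non-zero. The count `restricted_dim(3, 3, 20)` equals $J + p = 23$, the usual dimension of a cubic spline with $J - 1$ interior knots.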

Enforcing smoothness on binscatter boils down to incorporating restrictions on the basis of approximation. The resulting constrained basis, $\hat{\mathbf{b}}_s(x)$, corresponds to a choice of spline basis for approximation of $\mu(x)$, with estimated quantile-spaced knots according to the partition $\hat{\Delta}$. The package Binsreg employs $\hat{\mathbf{T}}_s$ leading to B-splines, which tend to have very good finite sample properties. When $p = s = 0$, binscatter reduces to the canonical binscatter commonly found in empirical work.

Specifically, in canonical binscatter the basis $\hat{\mathbf{b}}(x)$ is a $J$-dimensional vector of orthogonal dummy variables, that is, the $j$-th component of $\hat{\mathbf{b}}(x)$ records whether the evaluation point $x$ belongs to the $j$-th bin in the partition $\hat{\Delta}$. Therefore, canonical binscatter can be expressed as the collection of $J$ sample averages of the response variable $y_i$, one for each bin: $\bar{y}_j = \frac{1}{N_j}\sum_{i=1}^n \mathbb{1}_{\hat{\mathcal{B}}_j}(x_i)\, y_i$ for $j = 1, \dots, J$. Empirical work employing canonical binscatter typically plots these binned sample averages along with some other estimate(s) of the regression function $\mu(x)$.
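The canonical construction can be sketched in a few lines of Python. The sketch below assumes bin edges placed at the order statistics $x_{(\lfloor nj/J \rfloor)}$, half-open bins with a closed last bin, and no covariates; the function names are hypothetical and the code is not taken from the package.

```python
from bisect import bisect_right

def quantile_edges(x, J):
    """J+1 edges for quantile-spaced bins, placed at order statistics of x."""
    xs = sorted(x)
    n = len(xs)
    # Order statistics are 1-indexed in the text; Python lists are 0-indexed.
    inner = [xs[(n * j) // J - 1] for j in range(1, J)]
    return [xs[0]] + inner + [xs[-1]]

def canonical_binscatter(x, y, J):
    """Within-bin sample averages of y (the 'dots') and the bin counts N_j."""
    edges = quantile_edges(x, J)
    sums = [0.0] * J
    counts = [0] * J
    for xi, yi in zip(x, y):
        j = min(max(bisect_right(edges, xi) - 1, 0), J - 1)
        sums[j] += yi
        counts[j] += 1
    means = [s / c if c else float("nan") for s, c in zip(sums, counts)]
    return means, counts

x = list(range(1, 13))            # toy data: x = 1, ..., 12
y = [2.0 * xi for xi in x]        # y = 2x
means, counts = canonical_binscatter(x, y, J=3)
```

With these toy data the edges are 1, 4, 8 and 12. Because the bins are half-open, the counts are only roughly equal (3, 4 and 5 observations), which is consistent with the "roughly the same number of observations" qualifier in the text.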

Finally, covariate adjustment in $\hat{\mu}^{(v)}(x)$ is justified via model (1), and does not coincide with the one commonly used by most practitioners and other software implementations of binscatter (cf. Stepner_2014_StataConf). An alternative covariate-adjustment approach is based on residualization, that is, first regressing out the additional covariates and then constructing binscatter based on the residuals only. This alternative covariate-adjustment approach is very hard to rationalize or justify in general, and will lead to an inconsistent estimator of $\mu(x)$ in model (1) unless very special assumptions hold. Furthermore, even when model (1) is misspecified, the approach to covariate adjustment employed by the package Binsreg enjoys a natural probability limit interpretation, while the residualization approach does not. See CCFF for more discussion, numerical examples, and technical details.
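For intuition, the semi-linear adjustment has a simple closed form in the canonical case with a single additional covariate. The Python sketch below is an illustration under stated assumptions, not the package's code: bin membership is taken as given, every bin is non-empty, the scalar covariate varies within bins, and the function name is hypothetical. It uses the Frisch-Waugh-Lovell result that, in a regression on bin dummies plus the covariate, the covariate's coefficient equals the slope from within-bin demeaned data.

```python
def semilinear_canonical(bins, y, w):
    """Least squares fit of y_i = beta[bins[i]] + gamma * w_i + error.
    bins[i] in {0, ..., J-1} is the bin of observation i; w is a scalar covariate.
    Returns (beta, gamma)."""
    J = max(bins) + 1
    n_j = [0] * J
    sum_y = [0.0] * J
    sum_w = [0.0] * J
    for j, yi, wi in zip(bins, y, w):
        n_j[j] += 1
        sum_y[j] += yi
        sum_w[j] += wi
    ybar = [sum_y[j] / n_j[j] for j in range(J)]
    wbar = [sum_w[j] / n_j[j] for j in range(J)]
    # gamma: slope of within-bin demeaned y on within-bin demeaned w
    num = sum((wi - wbar[j]) * (yi - ybar[j]) for j, yi, wi in zip(bins, y, w))
    den = sum((wi - wbar[j]) ** 2 for j, wi in zip(bins, w))
    gamma = num / den
    # beta_j: bin mean of y net of the covariate contribution
    beta = [ybar[j] - gamma * wbar[j] for j in range(J)]
    return beta, gamma

bins = [0, 0, 0, 1, 1, 1]
w = [0.0, 1.0, 2.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * wi if j == 0 else 5.0 + 2.0 * wi for j, wi in zip(bins, w)]
beta, gamma = semilinear_canonical(bins, y, w)
```

In this toy example the function recovers the bin levels 1 and 5 and the covariate slope 2 exactly, even though the covariate is correlated with bin membership. The raw bin means of $y$ (3 and 9) would mix the two effects.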

Main implementation details

The command binsreg implements multiple versions of $\hat{\mu}^{(v)}(x)$ for a common choice of partitioning/binning $\hat{\Delta}$. The option deriv() is used to set a common value of $v$ across all implementations. The options dots(p,s) and line(p,s) generate “dots” and a “line” tracing out two distinct implementations of $\hat{\mu}^{(v)}(x)$ with the corresponding choices of $p$ and $s$ selected in each case. Defaults are deriv(0) so that the object of interest is $\mu(x)$, and dots(0,0) so that the “dots” represent canonical binscatter (i.e., sample averages within each bin). The line option is off by default, and needs to be set explicitly to appear in the resulting plot: for example, the option line(3,3) adds a line tracing out $\hat{\mu}(x)$, implemented with $p = 3$ and $s = 3$, a cubic B-spline estimator of $\mu(x)$.

The common partitioning/binning used by the command binsreg across all implementations is set to be quantile-spaced for some choice of $J$. The option nbins() sets $J$ manually (e.g., nbins(20) corresponds to $J = 20$ quantile-spaced bins), but if this option is not supplied then the companion command binsregselect is used to choose $J$ in a fully data-driven way, as described below. As an alternative, an evenly-spaced partitioning/binning can be implemented via the option binspos().

Several other options are available for the command binsreg, including optimal evenly-spaced binning via the command binsregselect and substantive specification or shape restrictions hypothesis testing via the command binsregtest, as discussed in detail further below. See Section 0.3 for the full syntax of binsreg.

0.2.2 Choosing the number of bins

CCFF develops valid integrated mean square error (IMSE) approximations for binscatter and its extensions in the context of model (1). These expansions give IMSE-optimal selection of the number of bins $J$ forming the quantile-spaced (or other) partitioning scheme $\hat{\Delta}$, depending on the polynomial order $p$ within bins and the smoothness level $s$ across bins, and on the target estimand set by the derivative order $v$. Specifically, the IMSE-optimal choice of $J$ is given by:

$$J_{\mathrm{IMSE}} = \left\lceil \left( \frac{2(p - v + 1)\, \mathscr{B}_n(p, s, v)}{(1 + 2v)\, \mathscr{V}_n(p, s, v)} \right)^{\frac{1}{2p+3}} n^{\frac{1}{2p+3}} \right\rceil,$$

where $\lceil \cdot \rceil$ denotes the ceiling operator, and $\mathscr{B}_n(p, s, v)$ and $\mathscr{V}_n(p, s, v)$ represent an approximation to the integrated (square) bias and variance of $\hat{\mu}^{(v)}(x)$, respectively. These constants depend on the partitioning scheme and binscatter estimator used. Recall that these three integer choices must respect $p \ge s \ge 0$ and $p \ge v \ge 0$.

Both IMSE constants, $\mathscr{B}_n(p, s, v)$ and $\mathscr{V}_n(p, s, v)$, can be estimated consistently using a preliminary choice of $J$. Thus, our main implementation offers two selectors:

  • $\hat{J}_{\mathrm{ROT}}$: implements a rule-of-thumb (ROT) approximation for the constants $\mathscr{B}_n(p, s, v)$ and $\mathscr{V}_n(p, s, v)$, employing a trimmed-from-below Gaussian reference model for the density of $x_i$, and global polynomial approximations for the other two unknown features needed. This selector employs the correct rate but an inconsistent constant approximation.

  • $\hat{J}_{\mathrm{DPI}}$: implements a direct-plug-in (DPI) approximation for the constants $\mathscr{B}_n(p, s, v)$ and $\mathscr{V}_n(p, s, v)$, based on the desired binscatter, set by the choices $p$ and $s$, and employing a preliminary $J$. If a preliminary $J$ is not provided by the user, then $\hat{J}_{\mathrm{ROT}}$ is used for DPI implementation. Therefore, the selector employs the correct rate and a nonparametric consistent constant approximation in the latter case.

The precise form of the constants changes depending on whether quantile-spaced or evenly-spaced partitioning/binning is used, but the two methods for selecting $J$, ROT and DPI, remain conceptually the same. See CCFF and references therein for more details.
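The rate behind the IMSE-optimal choice can be illustrated with a short Python sketch. It assumes, as a stylized reading of the text and not a statement of the package's internals, that the IMSE behaves like $\mathscr{V} J^{1+2v}/n + \mathscr{B} J^{-2(p+1-v)}$; minimizing this expression over $J$ gives $J = \big(2(p-v+1)\mathscr{B}\, n / ((1+2v)\mathscr{V})\big)^{1/(2p+3)}$, rounded up. The function name and the constants passed in are hypothetical; in practice the constants would come from a ROT or DPI estimate.

```python
from math import ceil

def j_imse(bias_const, var_const, n, p, v=0):
    """Minimizer over J of var_const * J**(1+2v) / n + bias_const * J**(-2(p+1-v)),
    rounded up to an integer. bias_const and var_const are assumed known here."""
    if not 0 <= v <= p:
        raise ValueError("need 0 <= v <= p")
    ratio = 2 * (p - v + 1) * bias_const / ((1 + 2 * v) * var_const)
    return ceil((ratio * n) ** (1.0 / (2 * p + 3)))

# Canonical binscatter (p = 0, v = 0): the number of bins grows like n^(1/3).
j_small = j_imse(1.0, 1.0, n=1000, p=0)
j_large = j_imse(1.0, 1.0, n=8000, p=0)
```

With both constants set to one, multiplying the sample size by eight doubles the unrounded choice (from about 12.6 to about 25.2), which reflects the $n^{1/3}$ rate for $p = 0$.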

Main implementation details

The command binsregselect implements ROT and DPI data-driven, IMSE-optimal selection of $J$ for all possible choices of $(p, s, v)$, and for both quantile-spaced and evenly-spaced partitioning/binning. For DPI implementation, the user can provide the initialization value of $J$ or, if not provided, then $\hat{J}_{\mathrm{ROT}}$ is used.

Several other options are available for the command binsregselect, including the possibility of generating an output file with the IMSE-optimal partitioning/binning structure selected and the corresponding grid of evaluation points, which can be used by the other companion commands (binsreg and binsregtest) for plotting, simulation, testing, and other calculations. See Section 0.5 for the full syntax of binsregselect.

0.2.3 Confidence intervals

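Both confidence intervals and confidence bands for the unknown function $\mu^{(v)}(x)$ are constructed employing the same type of Studentized $t$-statistic:

$$\hat{T}_p(x) = \frac{\hat{\mu}^{(v)}(x) - \mu^{(v)}(x)}{\sqrt{\hat{\Omega}(x)/n}}, \qquad 0 \le v, s \le p,$$

where the binscatter variance estimator is of the usual “sandwich” form

$$\hat{\Omega}(x) = \hat{\mathbf{b}}_s^{(v)}(x)'\, \hat{\mathbf{Q}}^{-1} \hat{\boldsymbol{\Sigma}}\, \hat{\mathbf{Q}}^{-1}\, \hat{\mathbf{b}}_s^{(v)}(x),$$

with $\hat{\mathbf{Q}} = \frac{1}{n}\sum_{i=1}^n \hat{\mathbf{b}}_s(x_i)\hat{\mathbf{b}}_s(x_i)'$ and $\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^n \hat{\mathbf{b}}_s(x_i)\hat{\mathbf{b}}_s(x_i)'\big(y_i - \hat{\mathbf{b}}_s(x_i)'\hat{\boldsymbol{\beta}} - \mathbf{w}_i'\hat{\boldsymbol{\gamma}}\big)^2$.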

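CCFF showed that $\hat{T}_p(x) \to_d \mathsf{N}(0, 1)$ pointwise in $x \in \mathcal{X}$, that is, for each evaluation point on the support $\mathcal{X}$ of $x_i$, provided the misspecification error introduced by binscatter is removed from the distributional approximation. Such a result justifies asymptotically valid confidence intervals for $\mu^{(v)}(x)$, pointwise in $x \in \mathcal{X}$, after bias correction. Specifically, for each $x \in \mathcal{X}$, the $100(1-\alpha)\%$ confidence interval takes the form:

$$\hat{I}_p(x) = \Big[\, \hat{\mu}^{(v)}(x) \pm c \cdot \sqrt{\hat{\Omega}(x)/n}\, \Big], \qquad 0 \le v, s \le p,$$

where $c = \Phi^{-1}(1 - \alpha/2)$, with $\Phi(u)$ denoting the distribution function of a standard normal random variable (e.g., $c \approx 1.96$ for a 95% Gaussian confidence interval), and provided the choice of $J$ is such that the misspecification error can be ignored.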

However, employing an IMSE-optimal binscatter (i.e., setting $J = J_{\mathrm{IMSE}}$ for the selected polynomial order $p$) introduces a first-order misspecification error that invalidates these confidence intervals, and hence it cannot be directly used to form the confidence intervals in general. To address this problem, we rely on a simple application of robust bias-correction (Calonico-Cattaneo-Titiunik_2014_ECMA; Calonico-Cattaneo-Farrell_2018_JASA; Calonico-Cattaneo-Farrell_2019_CEOptimal; Cattaneo-Farrell-Feng_2018_wp) to form valid confidence intervals based on IMSE-optimal binscatter, that is, without altering the partitioning scheme $\hat{\Delta}$ used.

Our recommended implementation employs robust bias-corrected binscatter confidence intervals as follows. First, for a given choice of $p$, select the number of bins in $\hat{\Delta}$ according to $J = J_{\mathrm{IMSE}}$, which gives an IMSE-optimal binscatter (point estimator). Then, employ the confidence interval $\hat{I}_{p+q}(x)$ with $q \ge 1$, which satisfies $P\big[\mu^{(v)}(x) \in \hat{I}_{p+q}(x)\big] \to 1 - \alpha$ for all $x \in \mathcal{X}$.
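The form of a pointwise interval can be illustrated in the simplest setting. The Python sketch below assumes canonical binscatter with no covariates, in which case a heteroskedasticity-robust sandwich variance for the mean of bin $j$ reduces to the within-bin sum of squared residuals divided by $N_j^2$. It deliberately ignores misspecification error, so it shows the shape of the interval only and not the robust bias-corrected procedure recommended above; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def canonical_ci(bins, y, alpha=0.05):
    """Pointwise intervals ybar_j +/- c * se_j for each bin j, where
    se_j = sqrt(sum of squared within-bin residuals) / N_j."""
    J = max(bins) + 1
    groups = [[] for _ in range(J)]
    for j, yi in zip(bins, y):
        groups[j].append(yi)
    c = NormalDist().inv_cdf(1 - alpha / 2)   # Gaussian quantile, about 1.96
    intervals = []
    for g in groups:
        m = sum(g) / len(g)
        se = sqrt(sum((yi - m) ** 2 for yi in g)) / len(g)
        intervals.append((m - c * se, m + c * se))
    return intervals

intervals = canonical_ci([0, 0, 1, 1], [1.0, 3.0, 10.0, 10.0])
```

In the toy data the first bin has mean 2 and a non-degenerate interval, while the second bin has no within-bin variation and therefore a zero-width interval, a reminder that such intervals are only meaningful with enough observations per bin.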

Main implementation details

The command binsreg implements confidence intervals, and reports them as part of the final binned scatter plot. Specifically, the option ci(p,s) estimates confidence intervals with the corresponding choices of $p$ and $s$ selected, and plots them as vertical segments along the support of $x_i$. The implementation is done over a grid of evaluation points, which can be modified via the option cingrid(), and the desired level is set by the option level(). Notice that dots(p,s), line(p,s), and ci(p,s) may all take different choices of $p$ and $s$, which allows for robust bias-correction implementation of the confidence intervals and permits incorporating different levels of smoothness restrictions.

Several other options are available for the command binsreg, including substantive hypothesis testing via the companion command binsregtest, as described in detail further below. See Section 0.3 for the full syntax of binsreg.

0.2.4 Confidence bands

In many empirical applications of binscatter, the goal is to conduct inference about the entire function $\mu^{(v)}(x)$, simultaneously, that is, uniformly over all $x$ values on the support of $x_i$. This goal is fundamentally different from pointwise inference. A leading example of uniform inference is reporting confidence bands for $\mu(x)$ and its derivatives, which are different from (pointwise) confidence intervals. The package Binsreg offers asymptotically valid constructions of both confidence intervals, as discussed above, and confidence bands, which can be implemented with the same choices of $(p, s)$ used to construct $\hat{\mu}^{(v)}(x)$ or different ones.

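Following the theoretical work in CCFF, for a choice of $p$ and partition/binning of size $J$, the $100(1-\alpha)\%$ confidence band for $\mu^{(v)}(x)$ is:

$$\hat{I}_p = \Big\{ \Big[\, \hat{\mu}^{(v)}(x) \pm c \cdot \sqrt{\hat{\Omega}(x)/n}\, \Big] : x \in \mathcal{X} \Big\},$$

where the quantile value $c$ is now approximated via simulations using

$$c = \inf\Big\{ c \in \mathbb{R}_+ : P\Big[ \sup_{x \in \mathcal{X}} \big|\hat{Z}_p(x)\big| \le c \,\Big|\, \mathbf{D} \Big] \ge 1 - \alpha \Big\},$$

with $\mathbf{D}$ denoting the original data,

$$\hat{Z}_p(x) = \frac{\hat{\mathbf{b}}_s^{(v)}(x)'\, \hat{\mathbf{Q}}^{-1} \hat{\boldsymbol{\Sigma}}^{1/2}}{\sqrt{\hat{\Omega}(x)}}\, \mathbf{N}_K, \qquad K = (p+1)J - (J-1)s,$$

and $\mathbf{N}_K$ being a $K$-dimensional standard normal random vector. The distribution of $\sup_{x \in \mathcal{X}} |\hat{T}_p(x)|$, which is unknown, is approximated by that of $\sup_{x \in \mathcal{X}} |\hat{Z}_p(x)|$ conditional on the data $\mathbf{D}$, which can be simulated by taking repeated samples from $\mathbf{N}_K$ and recomputing the supremum each time. In other words, the quantiles used to construct confidence bands can be approximated by resampling from the standard normal random vector $\mathbf{N}_K$, keeping fixed the data $\mathbf{D}$ (and hence all quantities depending on it). See CCFF for more details.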

A confidence band covers the entire function $\mu^{(v)}(x)$ $100(1-\alpha)\%$ of the time in repeated sampling, whenever the misspecification error can be ignored. As before, we recommend employing robust bias correction to remove misspecification error introduced by binscatter, that is, following the same logic discussed above for the case of confidence intervals construction. To be more precise, first $p$ is chosen, along with $s$ and $v$, and the optimal partitioning/binning is selected according to $J = J_{\mathrm{IMSE}}$. Then, the confidence bands are constructed using $\hat{I}_{p+q}$ with $q \ge 1$. This ensures that $P\big[\mu^{(v)}(x) \in \hat{I}_{p+q}(x) \text{ for all } x \in \mathcal{X}\big] \to 1 - \alpha$.
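The simulation step can be sketched in Python for the simplest case. The sketch assumes canonical binscatter with a heteroskedasticity-robust variance, where the bin dummies are orthogonal and the Studentized Gaussian process is one independent standard normal draw per bin, so that its supremum is the maximum of $J$ absolute standard normals. For general $(p, s)$ the draws would instead be combined through the estimated basis and variance matrices, which is not shown here. The function names, number of simulations, and seed are hypothetical choices.

```python
import random
from math import ceil
from statistics import NormalDist

def simulated_band_quantile(J, alpha=0.05, nsims=20000, seed=12345):
    """Empirical (1 - alpha) quantile of max_j |N_j| over nsims Gaussian draws."""
    rng = random.Random(seed)
    sups = sorted(
        max(abs(rng.gauss(0.0, 1.0)) for _ in range(J))
        for _ in range(nsims)
    )
    # Smallest simulated value c with empirical P[sup <= c] >= 1 - alpha.
    k = min(nsims, ceil((1 - alpha) * nsims)) - 1
    return sups[k]

def exact_band_quantile(J, alpha=0.05):
    """Closed form for the same quantile: solves (2*Phi(c) - 1)**J = 1 - alpha."""
    return NormalDist().inv_cdf((1 + (1 - alpha) ** (1.0 / J)) / 2)

c_sim = simulated_band_quantile(J=10)
c_exact = exact_band_quantile(J=10)
```

With a single bin the closed form returns the familiar pointwise value of about 1.96. With ten bins the critical value rises to roughly 2.8, which is why a confidence band is wider than the corresponding pointwise confidence intervals.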

Main implementation details

The command binsreg implements confidence bands, and reports them as part of the final binned scatter plot. The option cb(p,s) estimates an asymptotically valid confidence band with the corresponding choices of $p$ and $s$ selected, and plots it as a shaded region along the support of $x_i$. The implementation is done over a grid of evaluation points, which can be modified via the option cbngrid(), and the desired level is set by the option level(). The options dots(p,s), line(p,s), ci(p,s), and cb(p,s) can all take different choices of $p$ and $s$, which allows for robust bias correction implementations, as well as many other practically relevant possibilities.

See Section 0.3 for the full syntax of binsreg.

0.2.5 Parametric specification testing

In addition to implementing binscatter and producing binned scatter plots, with both point and uncertainty estimators, the package Binsreg also allows for formal testing of substantive hypotheses. The command binsregtest implements all hypothesis tests available. This command can be used as a stand-alone command, or can be called via binsreg when constructing binned scatter plots.

The command binsregtest implements two types of substantive hypothesis tests about : (i) parametric specification testing, and (ii) nonparametric shape restriction testing. This subsection discusses the first type of hypothesis testing, while the next subsection discusses the second one.

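For a choice of $(p, s, v)$, and partitioning/binning scheme of size $J$, the implemented parametric specification testing approach contrasts a (nonparametric) binscatter approximation $\hat{\mu}^{(v)}(x)$ of $\mu^{(v)}(x)$ with a hypothesized parametric specification of the form $\mu(x) = m(x, \boldsymbol{\theta})$ for some $m(\cdot)$ known up to a finite parameter $\boldsymbol{\theta}$, which can be estimated using the available data. Formally, the null and alternative hypotheses are, respectively,

$$\dot{\mathsf{H}}_0: \sup_{x \in \mathcal{X}} \big|\mu^{(v)}(x) - m^{(v)}(x, \boldsymbol{\theta})\big| = 0 \text{ for some } \boldsymbol{\theta}, \qquad \dot{\mathsf{H}}_{\mathrm{A}}: \sup_{x \in \mathcal{X}} \big|\mu^{(v)}(x) - m^{(v)}(x, \boldsymbol{\theta})\big| > 0 \text{ for all } \boldsymbol{\theta},$$

for choice of derivative order $v$.

For example, $\hat{\mu}(x)$ is compared to the sample mean $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ in order to assess whether there is a relationship between $y_i$ and $x_i$ or, more formally, whether $\mu(x)$ is a constant function. Similarly, it is possible to formally test for a linear, quadratic, or even non-linear parametric relationship $\mu(x) = m(x, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ would be estimated from the data under the null hypothesis, that is, assuming that the postulated relationship is indeed correct.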

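Following CCFF, the command binsregtest employs the test statistic

$$\dot{T}_p(x) = \frac{\hat{\mu}^{(v)}(x) - m^{(v)}(x, \hat{\boldsymbol{\theta}})}{\sqrt{\hat{\Omega}(x)/n}}.$$

Then, a parametric specification hypothesis testing procedure is:

$$\text{Reject } \dot{\mathsf{H}}_0 \text{ if and only if } \sup_{x \in \mathcal{X}} \big|\dot{T}_p(x)\big| \ge c, \tag{2}$$

where $c$ is again computed by simulation from a standard Gaussian random vector, conditional on the data $\mathbf{D}$, as in the case of confidence bands already discussed. This testing procedure is an asymptotically valid $\alpha$-level test if the misspecification error is removed from the test statistic $\dot{T}_p(x)$.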

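The command binsregtest employs robust bias correction by default: first $p$ and $s$ are chosen, and the partitioning/binning scheme is selected by setting $J = J_{\mathrm{IMSE}}$ for these choices. Then, using this partitioning scheme, the testing procedure (2) is implemented with the choice $p + q$ instead of $p$, with $q \ge 1$. CCFF shows that, under regularity conditions, the resulting parametric specification testing approach controls Type I error with non-trivial power: for given $p$, $0 \le v, s \le p$, and $J = J_{\mathrm{IMSE}}$,

$$\lim_{n \to \infty} P\Big[ \sup_{x \in \mathcal{X}} \big|\dot{T}_{p+q}(x)\big| > c \Big] = \alpha \quad \text{under } \dot{\mathsf{H}}_0,$$

and

$$\lim_{n \to \infty} P\Big[ \sup_{x \in \mathcal{X}} \big|\dot{T}_{p+q}(x)\big| > c \Big] = 1 \quad \text{under } \dot{\mathsf{H}}_{\mathrm{A}}.$$

This testing approach formalizes the intuitive idea that if the confidence band for $\mu^{(v)}(x)$ does not contain the parametric fit considered entirely, then such parametric fit is incompatible with the data, i.e., should be rejected.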
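The simplest instance of this idea, testing whether the regression function is constant, can be sketched in Python. The sketch assumes canonical binscatter with no covariates, uses the sample mean of $y_i$ as the parametric fit under the null, ignores both the estimation error in that fit and the robust bias correction described above, and takes the critical value from the closed form available when the statistic is a maximum over $J$ independent bins. The function name is hypothetical, and each bin is assumed to have some within-bin variation so that standard errors are positive.

```python
from math import sqrt
from statistics import NormalDist

def constant_fit_test(bins, y, alpha=0.05):
    """Sup-t comparison of bin means with the overall sample mean.
    Returns (statistic, critical value, reject?)."""
    J = max(bins) + 1
    groups = [[] for _ in range(J)]
    for j, yi in zip(bins, y):
        groups[j].append(yi)
    ybar = sum(y) / len(y)                    # constant fit under the null
    stat = 0.0
    for g in groups:
        m = sum(g) / len(g)
        se = sqrt(sum((yi - m) ** 2 for yi in g)) / len(g)
        stat = max(stat, abs(m - ybar) / se)
    # Quantile of the maximum of J independent |N(0,1)| draws.
    crit = NormalDist().inv_cdf((1 + (1 - alpha) ** (1.0 / J)) / 2)
    return stat, crit, stat >= crit

bins = [0] * 4 + [1] * 4 + [2] * 4
y_trend = [0.9, 1.1, 1.0, 1.0, 5.1, 4.9, 5.0, 5.0, 9.0, 9.1, 8.9, 9.0]
y_flat = [1.1, 0.9, 1.0, 1.0] * 3
stat_trend, crit, reject_trend = constant_fit_test(bins, y_trend)
stat_flat, _, reject_flat = constant_fit_test(bins, y_flat)
```

The trending data produce bin means far from the overall mean relative to their standard errors, so the constant specification is rejected. The flat data produce a statistic near zero, so it is not.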

Main implementation details

The command binsregtest implements parametric specification testing in two ways. First, polynomial regression (parametric) specification testing is implemented directly via the option polyreg(P), where the null hypothesis is $m(x, \boldsymbol{\theta}) = \theta_0 + \theta_1 x + \cdots + \theta_P x^P$ and $\boldsymbol{\theta} = (\theta_0, \theta_1, \dots, \theta_P)'$ is estimated by least squares regression. For other parametrizations of $m(x, \boldsymbol{\theta})$, the command takes as input an auxiliary array/database (dta in Stata, or csv in R) via the option testmodelparfit(filename) containing the following columns/variables: grid of evaluation points in one column, and fitted values $m(x, \hat{\boldsymbol{\theta}})$ (over the evaluation grid) for each parametric model considered in other columns/variables. The ordering of these variables is arbitrary, but they have to follow a naming rule: the evaluation grid has the same name as the independent variable $x_i$, and the names of other variables storing fitted values take the form binsreg_fit*. The binscatter (nonparametric) estimate used to construct the testing procedure is set by the options testmodel(p,s) and deriv(v), and the partitioning/binning scheme selected.

See Section 0.4 for other options and more details on the syntax of this command.

0.2.6 Nonparametric shape testing

The second type of hypothesis tests implemented by the command binsregtest concerns nonparametric testing of shape restrictions. For a choice of $v$, the null and alternative hypotheses of these testing problems are:

$$\ddot{\mathsf{H}}_0: \sup_{x \in \mathcal{X}} \mu^{(v)}(x) \le 0 \qquad \text{vs.} \qquad \ddot{\mathsf{H}}_{\mathrm{A}}: \sup_{x \in \mathcal{X}} \mu^{(v)}(x) > 0,$$

that is, a one-sided testing problem to the right. For example, negativity, monotonicity and concavity of $\mu(x)$ correspond to $\mu(x) \le 0$, $\mu^{(1)}(x) \le 0$ and $\mu^{(2)}(x) \le 0$, respectively. Of course, the analogous testing problem to the left is also implemented, but not discussed here to avoid unnecessary repetition.

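The relevant Studentized test statistic for this class of testing problems is:

$$\ddot{T}_p(x) = \frac{\hat{\mu}^{(v)}(x)}{\sqrt{\hat{\Omega}(x)/n}}.$$

Then, the testing procedure is:

$$\text{Reject } \ddot{\mathsf{H}}_0 \text{ if and only if } \sup_{x \in \mathcal{X}} \ddot{T}_p(x) \ge c, \tag{3}$$

with $c = \inf\big\{ c \in \mathbb{R}_+ : P\big[ \sup_{x \in \mathcal{X}} \hat{Z}_p(x) \le c \mid \mathbf{D} \big] \ge 1 - \alpha \big\}$. As before, misspecification errors of binscatter need to be taken into account in order to control Type I error. As in previous cases, CCFF show that for given $p$, $0 \le v, s \le p$, and $J = J_{\mathrm{IMSE}}$ accordingly, then

$$\lim_{n \to \infty} P\Big[ \sup_{x \in \mathcal{X}} \ddot{T}_{p+q}(x) > c \Big] \le \alpha \quad \text{under } \ddot{\mathsf{H}}_0,$$

and

$$\lim_{n \to \infty} P\Big[ \sup_{x \in \mathcal{X}} \ddot{T}_{p+q}(x) > c \Big] = 1 \quad \text{under } \ddot{\mathsf{H}}_{\mathrm{A}},$$

for any $q \ge 1$, that is, using a robust bias-correction approach. These results imply that the testing procedure (3) is an asymptotically valid hypothesis test provided it is implemented with the choice $p + q$, $q \ge 1$, after the IMSE-optimal partitioning/binning scheme for binscatter of order $p$ is selected.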

Main implementation details

The command binsregtest implements one-sided and two-sided nonparametric shape restriction testing as follows. Option testshapel(a) implements one-sided testing to the left, option testshaper(a) implements one-sided testing to the right, and option testshape2(a) implements two-sided testing, in each case relative to the constant a, which needs to be specified by the user. The binscatter (nonparametric) estimate used to construct the testing procedure is set by the options testshape(p,s) and deriv(v), and the chosen partitioning/binning scheme.

See Section 0.4 for more details on the syntax of this command.

0.2.7 Extensions and other implementation details

Whenever possible, the package Binsreg is implemented using the general purpose least squares regression command regress in Stata. (In R, the function lm() is used as the building block for implementation.) This approach sacrifices speed of the implementation, but improves substantially in terms of stability and replicability.

This section reviews some specific extensions and other numerical issues of the package Binsreg and discusses related choices made for implementation, all of which can affect speed and/or robustness of the package.

Mass points and minimum effective sample size

All three commands in the package Binsreg incorporate specific implementation decisions to deal with mass points in the distribution of the independent variable $x_i$. The number of distinct values of $x_i$, denoted by $N$, is taken as the effective sample size as opposed to the total number of observations $n$. If $x_i$ is continuously distributed, then $N = n$. However, in many applications, $N$ can be substantially smaller than $n$, and this affects some of the implementations in the package.

First, assume that $J$ is set by the user (via the option nbins(J)). Then, given the choice $J$, the commands binsreg and binsregtest perform a degrees of freedom check to decide whether the data exhibit enough variation. Specifically, given $p$ and $s$ set by the option dots(p,s), both commands check whether $N > N_1 + (p+1)J - (J-1)s$, where the cutoff $N_1$ is preset by default. If this check is not passed, then the package Binsreg regards the data as having “too little” variation in $x_i$, and turns off all nonparametric estimation and inference results based on large sample approximations. Thus, in this extreme case, the command binsreg only allows for dots(0,0), ci(0,0), and polyreg(P), while the command binsregtest does not return any results and issues a warning message instead.

If, on the other hand, for given $J$, the numerical check is passed, then all nonparametric methods implemented by the commands binsreg and binsregtest become available. However, before implementing each method (dots(p,s), line(p,s), ci(p,s), cb(p,s), polyreg(p), and the hypothesis testing procedures), a degrees of freedom check is performed in each individual case. Specifically, each nonparametric procedure is implemented only if $N > N_1 + (p+1)J - (J-1)s$, where recall that $p$ and $s$ may change from one procedure to the next.

Second, as discussed above, whenever is not set by the user via the option nbins(), the command binsregselect is employed to select in a data-driven way, provided there is enough variation in . To determine the latter, an initial degrees of freedom check is performed to assess whether selection is possible or, alternatively, if the unique values of should be used as bins directly. Specifically, if , with set by the option dots(p,s) and by default, then the data is deemed appropriate for ROT selection of via the command binsregselect, and hence is implemented. If, in addition, , then is also implemented whenever requested. Furthermore, the command binsregselect employs the following alternative formula for selection:

with a slightly different constant, taking into account the frequency of data at each mass point. All other estimators in the package Binsreg, including bias and standard error estimators, automatically adapt to the presence of mass points. Once the final $J$ is estimated, the degrees of freedom checks discussed in the previous paragraphs are performed based on this choice.

If $J$ is not set by the user and the initial degrees of freedom check is not passed, so that not even ROT estimation of $J$ is possible, then $N$ is taken as “too small.” In this extreme case, the package Binsreg sets $J = N$ and constructs a partitioning/binning structure with each bin containing one unique value of $x_i$. In other words, the support of the raw data is taken as the binning structure itself. In this extreme case, the follow-up degrees of freedom checks fail by construction, and hence the nonparametric asymptotic methods are turned off as explained above.

Finally, the specific numerical checks mentioned in this subsection can be adjusted or omitted. First, checking and accounting for repeated values in $x_i$ can be turned off using the option nomassadj. Second, the default cutoff points $N_1$ and $N_2$, corresponding to the degrees of freedom checks for nonparametric binscatter and parametric global polynomial, respectively, can be changed using the option dfcheck(n1 n2).

Clustered data and minimum effective sample size

As discussed in CCFF, the main methodological results for binscatter can be extended to accommodate clustered data. All three commands in the package Binsreg allow for clustered data via the option vce(). In this case, the number of clusters $G$ is taken as the effective sample size, assuming $G < N$ (see below for the other case). The only substantive change occurs in the command binsregselect, which now employs the following alternative formula for selection:

with a variance constant accounting for the clustered structure of the data. Accordingly, cluster-robust variance estimators are used in this case.

Minimum effective sample size

The package Binsreg requires some minimal variation in $x_i$ in order to successfully implement nonparametric methods based on large sample approximations. The minimal variation is captured by the number of distinct values on the support of $x_i$, denoted by $N$, and the number of clusters, denoted by $G$. Thus, all three commands in the package perform degrees of freedom numerical checks using $\min\{n, N, G\}$ as the general definition of effective sample size, and proceeding as explained above for the case of mass points in the distribution of $x_i$.

0.3 binsreg syntax

The main purpose of the command binsreg is to produce binned scatter plots. This command implements multiple binscatter estimators, accompanying confidence intervals and confidence bands, and also a global polynomial approximation for completeness. It also implements hypothesis testing via the companion command binsregtest. A partitioning/binning structure is required but, if not provided, then one is selected in a data-driven way using the companion command binsregselect.

This section describes the syntax of the command binsreg, grouping its many options according to their use.

binsreg depvar indvar othercovs , deriv(v)
  dots(p s) dotsngrid(numeric) dotsplotopt(string)
  line(p s) linengrid(numeric) lineplotopt(string)
  ci(p s) cingrid(numeric) ciplotopt(string)
  cb(p s) cbngrid(numeric) cbplotopt(string)
  polyreg(p) polyregngrid(numeric) polyregcingrid(numeric)
  polyregplotopt(string)
  by() bycolors(colorstylelist) bysymbols(symbolstylelist)
  bylpatterns(linepatternstylelist)
  testmodel(p s) testmodelparfit(filename) testmodelpoly(p)
  testshape(p s) testshapel(numlist) testshaper(numlist)
  testshape2(numlist)
  nbins(J) binspos(numlist) binsmethod(string)
  nbinsrot(numeric) samebinsby
  nsims(S) simsngrid(numeric) simsseed(numeric)
  vce(vcetype) level(numeric) nomassadj noplot
  savedata(filename) replace dfcheck(n1 n2) twoway_options

depvar is the dependent variable ($y_i$).

indvar is the independent variable ($x_i$).

othercovs is a varlist for covariate adjustment ($\mathbf{w}_i$).

$p$, $s$ and $v$ are integers satisfying $0 \le s, v \le p$.

weights allow for fweights, aweights and pweights; see weights in Stata for more details. (In R, weights allows for the equivalent of fweights only; see lm() help for more details.)

Estimand

deriv(v) specifies the derivative order of the regression function for estimation, testing and plotting. The default is deriv(0), which corresponds to the function itself, $\mu(x)$.

Dots

dots(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints for point estimation and plotting as “dots”. The default is dots(0 0), which corresponds to piecewise constant (canonical binscatter).

dotsngrid(numeric) specifies the number of dots within each bin to be plotted. Given the choice, these dots are point estimates evaluated over an evenly-spaced grid within each bin. The default is dotsngrid(1), which corresponds to one dot per bin (canonical binscatter).

dotsplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the plotted dots.
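To make the default dots(0 0) concrete, the following Python sketch computes a canonical binscatter by hand: observations are split into quantile-spaced bins and each bin contributes one dot. It is a simplified illustration, not the package's implementation: there is no covariate adjustment, ties (mass points) are ignored, and placing each dot at the within-bin mean of x is an assumption made for the example. All function names are hypothetical.

```python
import statistics


def quantile_bin_index(x, nbins):
    """Assign each observation to one of nbins quantile-spaced bins (0-based),
    using the rank of x; ties are split arbitrarily in this sketch."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    idx = [0] * len(x)
    for rank, i in enumerate(order):
        idx[i] = min(rank * nbins // len(x), nbins - 1)
    return idx


def canonical_binscatter(x, y, nbins):
    """Return one (mean of x, mean of y) pair per bin: a piecewise constant fit.
    Assumes nbins <= len(x), so that no bin is empty."""
    idx = quantile_bin_index(x, nbins)
    dots = []
    for j in range(nbins):
        xs = [x[i] for i in range(len(x)) if idx[i] == j]
        ys = [y[i] for i in range(len(y)) if idx[i] == j]
        dots.append((statistics.fmean(xs), statistics.fmean(ys)))
    return dots


# Deterministic toy data: y is exactly linear in x.
x = [i / 100 for i in range(100)]
y = [2 * xi for xi in x]
for dot in canonical_binscatter(x, y, nbins=4):
    print(dot)
```

Raising the polynomial degree in dots(p s) replaces the within-bin mean by a within-bin polynomial fit, which this sketch does not attempt.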

Line

line(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints for point estimation and plotting as a “line”. By default, the line is not included in the plot unless explicitly specified. Recommended specification is line(3 3), which adds a cubic B-spline estimate of the regression function of interest to the binned scatter plot.

linengrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the line(p s) option. The default is linengrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the line.

lineplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the plotted line.

Confidence Intervals

ci(p s) specifies the piecewise polynomial of degree $p$ with $s$ smoothness constraints used for constructing confidence intervals. By default, the confidence intervals are not included in the plot unless explicitly specified. Recommended specification is ci(3 3), which adds confidence intervals based on a cubic B-spline estimate of the regression function of interest to the binned scatter plot.

cingrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the ci(p s) option. The default is cingrid(1), which corresponds to one evaluation point within each bin for confidence interval construction.

ciplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the confidence intervals.

Confidence Band

cb(p s) specifies the piecewise polynomial of degree $p$ with $s$ smoothness constraints used for constructing the confidence band. By default, the confidence band is not included in the plot unless explicitly specified. Recommended specification is cb(3 3), which adds a confidence band based on a cubic B-spline estimate of the regression function of interest to the binned scatter plot.

cbngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the cb(p s) option. The default is cbngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for confidence band construction.

cbplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the confidence band.

Global Polynomial Regression

polyreg(p) sets the degree of a global polynomial regression model for plotting. By default, this fit is not included in the plot unless explicitly specified. Recommended specification is polyreg(3), which adds a cubic (degree 3) global polynomial fit of the regression function of interest to the binned scatter plot.

polyregngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the polyreg(p) option. The default is polyregngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the global polynomial.

polyregcingrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for constructing confidence intervals based on polynomial regression set by the polyreg(p) option. The default is polyregcingrid(0), which corresponds to not plotting confidence intervals for the global polynomial regression approximation.

polyregplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the global polynomial regression fit.

Subgroup Analysis

by() specifies the variable containing the group indicator to perform subgroup analysis; both numeric and string variables are supported. When by() is specified, binsreg implements estimation and inference by each subgroup separately, but produces a common binned scatter plot. By default, the binning structure is selected for each subgroup separately, but see the option samebinsby below for imposing a common binning structure across subgroups.

bycolors(colorstylelist) specifies an ordered list of colors for plotting each subgroup series defined by the option by().

bysymbols(symbolstylelist) specifies an ordered list of symbols for plotting each subgroup series defined by the option by().

bylpatterns(linepatternstylelist) specifies an ordered list of line patterns for plotting each subgroup series defined by the option by().

Parametric Model Specification Testing

testmodel(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints for parametric model specification testing, implemented via the companion command binsregtest. The null hypothesis is that the regression function (or its derivative of order v) coincides with the parametric model being tested. The default is testmodel(3 3), which corresponds to a cubic B-spline estimate of the regression function of interest for testing against the fit from a parametric model specification.

testmodelparfit(filename) specifies a dataset which contains the evaluation grid and fitted values of the model(s) to be tested against. The file must have a variable with the same name as indvar, which contains a series of evaluation points at which the binscatter model and the parametric model of interest are compared with each other. Each parametric model is represented by a variable named as binsreg_fit*, which must contain the fitted values at the corresponding evaluation points.

testmodelpoly(p) specifies the degree of a global polynomial model to be tested against.

Nonparametric Shape Restriction Testing

testshape(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints for nonparametric shape restriction testing, implemented via the companion command binsregtest. The default is testshape(3 3), which corresponds to a cubic B-spline estimate of the regression function of interest for one-sided or two-sided testing.

testshapel(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a one-sided hypothesis test to the left of the form $H_0: \sup_x \mu^{(v)}(x) \le a$.

testshaper(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a one-sided hypothesis test to the right of the form $H_0: \inf_x \mu^{(v)}(x) \ge a$.

testshape2(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a two-sided hypothesis test of the form $H_0: \sup_x |\mu^{(v)}(x) - a| = 0$.

Partitioning/Binning Selection

nbins(J) sets the number of bins for partitioning/binning of indvar. If not specified, the number of bins is selected via the companion command binsregselect in a data-driven, optimal way whenever possible.

binspos(numlist) specifies the position of binning knots. The default is binspos(qs), which corresponds to quantile-spaced binning (canonical binscatter). Other options are: es for evenly-spaced binning, or a numlist for manual specification of the positions of inner knots (which must be within the range of indvar).

binsmethod(string) specifies the method for data-driven selection of the number of bins via the companion command binsregselect. The default is binsmethod(dpi), which corresponds to the IMSE-optimal direct plug-in rule. The other option is rot, which corresponds to the rule-of-thumb implementation.

nbinsrot(numeric) specifies an initial number of bins value used to construct the DPI number of bins selector via the companion command binsregselect. If not specified, the data-driven ROT selector is used instead.

samebinsby forces a common partitioning/binning structure across all subgroups specified by the option by(). The knot positions are selected according to the option binspos() and using the full sample. If nbins() is not specified, then the number of bins is selected via the companion command binsregselect and using the full sample.
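The difference between the two automatic choices in binspos() can be seen in a small Python sketch that computes inner knots for a given number of bins. This is an illustration only: the empirical quantile rule used for qs (an order statistic) is an assumption and need not match the quantile definition used by the package, and the function name is hypothetical.

```python
import math


def inner_knots(x, nbins, position="qs"):
    """Inner knots for nbins bins: 'qs' uses empirical quantiles of x,
    'es' splits the range of x into intervals of equal length."""
    xs = sorted(x)
    lo, hi = xs[0], xs[-1]
    if position == "es":
        return [lo + (hi - lo) * j / nbins for j in range(1, nbins)]
    if position == "qs":
        n = len(xs)
        # j/nbins empirical quantile taken as an order statistic (assumed rule)
        return [xs[math.ceil(n * j / nbins) - 1] for j in range(1, nbins)]
    raise ValueError("position must be 'qs' or 'es'")


# Right-skewed toy data: the two schemes place the single inner knot differently.
x = [v ** 2 for v in range(1, 11)]
print(inner_knots(x, 2, "qs"))  # half of the observations fall in each bin
print(inner_knots(x, 2, "es"))  # each bin covers half of the range
```

Quantile spacing keeps roughly the same number of observations per bin, whereas even spacing keeps the bin width constant, so the two coincide only when the independent variable is approximately uniformly distributed.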

Simulation

nsims(S) specifies the number of random draws for constructing confidence bands and hypothesis testing. The default is nsims(500), which corresponds to 500 draws from a standard Gaussian random vector of size $(p+1)J - (J-1)s$.

simsngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (or infimum) operation needed to construct confidence bands and hypothesis testing procedures. The default is simsngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (or infimum) operator.

simsseed(numeric) sets the seed for simulations.
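The role of nsims(), simsngrid() and simsseed() can be sketched as follows. The Python function below approximates the critical value of a confidence band by drawing standard Gaussian vectors, mapping each draw into a t-statistic process over a grid of evaluation points, and taking a quantile of the simulated suprema. It is a sketch under assumptions: the matrix of normalized loadings (one row per evaluation point) is taken as given, whereas the package builds it from the estimated basis and variance, and the function name is hypothetical.

```python
import math
import random


def sup_t_critical_value(loadings, nsims=500, level=95.0, seed=None):
    """Simulated critical value for sup over grid points of |t-statistic|.

    loadings: list of rows, one per evaluation point; each row holds the
    normalized weights mapping a standard Gaussian vector into the
    t-statistic process at that point (assumed to be supplied by the user).
    """
    rng = random.Random(seed)
    size = len(loadings[0])
    sups = []
    for _ in range(nsims):
        z = [rng.gauss(0.0, 1.0) for _ in range(size)]
        sups.append(max(abs(sum(w * zk for w, zk in zip(row, z)))
                        for row in loadings))
    sups.sort()
    k = min(len(sups) - 1, math.ceil(level / 100.0 * len(sups)) - 1)
    return sups[k]


# One evaluation point reproduces (approximately) the pointwise critical value;
# ten independent points require a wider band.
identity10 = [[1.0 if i == j else 0.0 for j in range(10)] for i in range(10)]
print(sup_t_critical_value([[1.0]], nsims=20000, seed=1))
print(sup_t_critical_value(identity10, nsims=2000, seed=1))
```

A finer grid or more draws improves the approximation of the supremum and of its quantile at the cost of computing time, and fixing the seed makes the result reproducible.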

Other Options

vce(vcetype) specifies the vcetype for variance estimation used by the command regress. The default is vce(robust).

level(numeric) sets the nominal confidence level for confidence interval and confidence band estimation.

nomassadj omits mass point adjustments (in indvar) for estimation and inference.

noplot omits binscatter plotting.

savedata(filename) specifies a filename for saving all data underlying the binscatter plot (and more).

replace overwrites the existing file when saving the graph data.

dfcheck(n1 n2) sets adjustments for minimum effective sample size checks, which take into account the number of unique values of indvar (i.e., adjusting for the number of mass points), the number of clusters, and the degrees of freedom of the different statistical models considered. Specifically, n1 and n2 are the two cutoff points used in those checks. The default is dfcheck(20 30), as discussed above.

twoway_options any unrecognized options are appended to the end of the twoway command generating the binned scatter plot.

0.4 binsregtest syntax

The main purpose of the command binsregtest is to conduct hypothesis testing of parametric specifications and nonparametric shape restrictions for the regression function using binscatter methods. This stand-alone command is used by the companion command binsreg. A partitioning/binning structure is required but, if not provided, then one is selected in a data-driven way using the companion command binsregselect.

This section describes the syntax of the command binsregtest, grouping its many options according to their use.

binsregtest depvar indvar othercovs , deriv(v)
  testmodel(p s) testmodelparfit(filename) testmodelpoly(p)
  testshape(p s) testshapel(numlist) testshaper(numlist)
  testshape2(numlist)
  bins(p s) nbins(J) binspos(numlist) binsmethod(string)
  nbinsrot(numeric)
  nsims(S) simsngrid(numeric) simsseed(numeric)
  vce(vcetype) nomassadj dfcheck(n1 n2)

depvar is the dependent variable ($y_i$).

indvar is the independent variable ($x_i$).

othercovs is a varlist for covariate adjustment ($\mathbf{w}_i$).

$p$, $s$ and $v$ are integers satisfying $0 \le s, v \le p$.

weights allow for fweights, aweights and pweights; see weights in Stata for more details. (In R, weights allows for the equivalent of fweights only; see lm() help for more details.)

Estimand

deriv(v) specifies the derivative order of the regression function for estimation, testing and plotting. The default is deriv(0), which corresponds to the function itself, $\mu(x)$.

Parametric Model Specification Testing

testmodel(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints for parametric model specification testing. The null hypothesis is that the regression function (or its derivative of order v) coincides with the parametric model being tested. The default is testmodel(3 3), which corresponds to a cubic B-spline estimate of the regression function of interest for testing against the fit from a parametric model specification.

testmodelparfit(filename) specifies a dataset which contains the evaluation grid and fitted values of the model(s) to be tested against. The file must have a variable with the same name as indvar, which contains a series of evaluation points at which the binscatter model and the parametric model of interest are compared with each other. Each parametric model is represented by a variable named as binsreg_fit*, which must contain the fitted values at the corresponding evaluation points.

testmodelpoly(p) specifies the degree of a global polynomial model to be tested against.

Nonparametric Shape Restriction Testing

testshape(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints for nonparametric shape restriction testing. The default is testshape(3 3), which corresponds to a cubic B-spline estimate of the regression function of interest for one-sided or two-sided testing.

testshapel(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a one-sided hypothesis test to the left of the form $H_0: \sup_x \mu^{(v)}(x) \le a$.

testshaper(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a one-sided hypothesis test to the right of the form $H_0: \inf_x \mu^{(v)}(x) \ge a$.

testshape2(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a two-sided hypothesis test of the form $H_0: \sup_x |\mu^{(v)}(x) - a| = 0$.

Partitioning/Binning Selection

bins(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. The default is bins(0 0), which corresponds to piecewise constant (canonical binscatter).

nbins(J) sets the number of bins for partitioning/binning of indvar. If not specified, the number of bins is selected via the companion command binsregselect in a data-driven, optimal way whenever possible.

binspos(numlist) specifies the position of binning knots. The default is binspos(qs), which corresponds to quantile-spaced binning (canonical binscatter). Other options are: es for evenly-spaced binning, or a numlist for manual specification of the positions of inner knots (which must be within the range of indvar).

binsmethod(string) specifies the method for data-driven selection of the number of bins via the companion command binsregselect. The default is binsmethod(dpi), which corresponds to the IMSE-optimal direct plug-in rule. The other option is rot, which corresponds to the rule-of-thumb implementation.

nbinsrot(numeric) specifies an initial number of bins value used to construct the DPI number of bins selector via the companion command binsregselect. If not specified, the data-driven ROT selector is used instead.

Simulation

nsims(S) specifies the number of random draws for constructing confidence bands and hypothesis testing. The default is nsims(500), which corresponds to 500 draws from a standard Gaussian random vector of size $(p+1)J - (J-1)s$.

simsngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (or infimum) operation needed to construct confidence bands and hypothesis testing procedures. The default is simsngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (or infimum) operator.

simsseed(numeric) sets the seed for simulations.

Other Options

vce(vcetype) specifies the vcetype for variance estimation used by the command regress. The default is vce(robust).

nomassadj omits mass point adjustments (in indvar) for estimation and inference.

dfcheck(n1 n2) sets adjustments for minimum effective sample size checks, which take into account the number of unique values of indvar (i.e., adjusting for the number of mass points), the number of clusters, and the degrees of freedom of the different statistical models considered. Specifically, n1 and n2 are the two cutoff points used in those checks. The default is dfcheck(20 30), as discussed above.

0.5 binsregselect syntax

The main purpose of the command binsregselect is to implement data-driven (IMSE-optimal) selection of partitioning/binning structure for binscatter. This stand-alone command is used by the companion commands binsreg and binsregtest whenever the user does not specify the binning structure manually.

This section describes the syntax of the command binsregselect, grouping its many options according to their use.

binsregselect depvar indvar othercovs , deriv(v)
  bins(p s) binspos(numlist) binsmethod(string) nbinsrot(numeric)
  simsngrid(numeric) savegrid(filename) replace
  vce(vcetype) nomassadj dfcheck(n1 n2)

depvar is the dependent variable ($y_i$).

indvar is the independent variable ($x_i$).

othercovs is a varlist for covariate adjustment ($\mathbf{w}_i$).

$p$, $s$ and $v$ are integers satisfying $0 \le s, v \le p$.

weights allow for fweights, aweights and pweights; see weights in Stata for more details. (In R, weights allows for the equivalent of fweights only; see lm() help for more details.)

Estimand

deriv(v) specifies the derivative order of the regression function for estimation, testing and plotting. The default is deriv(0), which corresponds to the function itself, $\mu(x)$.

Partitioning/Binning Selection

bins(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. The default is bins(0 0), which corresponds to piecewise constant (canonical binscatter).

binspos(numlist) specifies the position of binning knots. The default is binspos(qs), which corresponds to quantile-spaced binning (canonical binscatter). Other options are: es for evenly-spaced binning, or a numlist for manual specification of the positions of inner knots (which must be within the range of indvar).

binsmethod(string) specifies the method for data-driven selection of the number of bins. The default is binsmethod(dpi), which corresponds to the IMSE-optimal direct plug-in rule. The other option is rot, which corresponds to the rule-of-thumb implementation.

nbinsrot(numeric) specifies an initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.

Evaluation Points Grid Generation

simsngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (or infimum) operation needed to construct confidence bands and hypothesis testing procedures. The default is simsngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (or infimum) operator.

savegrid(filename) specifies a filename for storing the simulation grid of evaluation points. The file contains the following variables: indvar, a sequence of evaluation points used in the approximation; all control variables in othercovs, which take the value zero for prediction purposes; binsreg_isknot, indicating whether the evaluation point is an inner knot; and binsreg_bin, indicating which bin the evaluation point belongs to.

replace overwrites the existing file when saving the grid.
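The structure of the file produced by savegrid() can be mimicked with a short Python sketch. The layout below is an assumption made for illustration: each bin receives a given number of evenly-spaced interior evaluation points, each inner knot is added as its own row and assigned to the bin on its left, and covariates are set to zero. The variable names binsreg_isknot and binsreg_bin follow the description above; the function name and the column names x and w are hypothetical.

```python
def evaluation_grid(knots, ngrid, covariates=()):
    """Build grid rows from sorted knots (both boundary knots included).

    Each row is a dict with the evaluation point x, zero-valued covariates,
    binsreg_isknot (1 for inner knots) and binsreg_bin (1-based bin number).
    """
    rows = []
    nbins = len(knots) - 1
    for j in range(nbins):
        left, right = knots[j], knots[j + 1]
        step = (right - left) / (ngrid + 1)
        for k in range(1, ngrid + 1):
            row = {"x": left + k * step, "binsreg_isknot": 0, "binsreg_bin": j + 1}
            row.update({name: 0.0 for name in covariates})
            rows.append(row)
        if j < nbins - 1:  # inner knot closing bin j (assumed to belong to it)
            row = {"x": right, "binsreg_isknot": 1, "binsreg_bin": j + 1}
            row.update({name: 0.0 for name in covariates})
            rows.append(row)
    return rows


for row in evaluation_grid([0.0, 1.0, 2.0], ngrid=3, covariates=["w"]):
    print(row)
```

Setting the covariates to zero means that predictions computed on this grid isolate the relation between the outcome and the independent variable.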

Other Options

vce(vcetype) specifies the vcetype for variance estimation used by the command regress. The default is vce(robust).

nomassadj omits mass point adjustments (in indvar) for estimation and inference.

dfcheck(n1 n2) sets adjustments for minimum effective sample size checks, which take into account the number of unique values of indvar (i.e., adjusting for the number of mass points), the number of clusters, and the degrees of freedom of the different statistical models considered. Specifically, n1 and n2 are the two cutoff points used in those checks. The default is dfcheck(20 30), as discussed above.

0.6 Illustration of Methods

We illustrate the package Binsreg using a simulated dataset, which is available in the file binsreg_simdata.dta. In this dataset, y is the outcome variable, x is the independent variable for binning, w is a continuously distributed covariate, t is a binary covariate, and id is a group identifier. Summary statistics of the simulated data are as follows.

. use binsreg_simdata, clear
. sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           x |      1,000    .4907072    .2932553   .0002281   .9985808
           w |      1,000    .0120224    .5799381  -.9993055   .9973198
           t |      1,000        .515     .500025          0          1
          id |      1,000       250.5    144.4095          1        500
           y |      1,000    .5283884    1.727878  -5.159858   5.751276

The basic syntax for binsreg is the following:

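. binsreg y x w
Binscatter plot
Bin selection method: IMSE-optimal plug-in choice
Placement: Quantile-spaced
Derivative: 0

--------------------------------------------
# of observations              |        1000
# of distinct values           |        1000
# of clusters                  |           .
--------------------------------------------
Bin selection:                 |
  Degree of polynomial         |           0
  # of smoothness constraints  |           0
  # of bins                    |          21
--------------------------------------------

-------------------------------
          |    p      s     df
-------------------------------
 dots     |    0      0     21
-------------------------------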

The main output is a binned scatter plot as shown in Figure 1. By default, the (nonparametric) mean relationship between y and x is approximated by piecewise constants (dots(0 0)). Each dot in the figure represents the point estimate corresponding to each bin, which is the canonical binscatter plot. The number of bins, whenever not specified, is automatically selected via the companion command binsregselect. In this case, 21 bins are used. Other useful information is also reported, including total sample size, the number of distinct values of x, bin selection results, and the degrees of freedom of the statistical model(s) employed.

Figure 1: Canonical Binned Scatter Plot.

Users may specify the number of bins manually rather than relying on the automatic data-driven procedures. For example, a popular ad-hoc choice in practice is setting 20 quantile-spaced bins:

. binsreg y x w, nbins(20) polyreg(1)
Binscatter plot
Bin selection method: User-specified
Placement: Quantile-spaced
Derivative: 0

--------------------------------------------
# of observations              |        1000
# of distinct values           |        1000
# of clusters                  |           .
--------------------------------------------
Bin selection:                 |
  Degree of polynomial         |           .
  # of smoothness constraints  |           .
  # of bins                    |          20
--------------------------------------------

-------------------------------
          |    p      s     df
-------------------------------
 dots     |    0      0     20
 polyreg  |    1     NA      2
-------------------------------

The option polyreg(1) adds a linear prediction line to the canonical binscatter plot, but the resulting binned scatter plot is not reported here to conserve space.

Figure 2: Binned Scatter Plot with Lines, Confidence Intervals and Bands. Panels: (a) add cubic B-spline fit; (b) add confidence intervals; (c) add confidence band; (d) add a polynomial fit of degree 4.

The command binsreg allows users to add a binscatter-based line approximating the unknown regression function, pointwise confidence intervals, a confidence band, and a global polynomial regression approximation. For example, the following syntax cumulatively adds in four distinct plots a fitted line, confidence intervals and a confidence band, all three based on cubic B-splines, and also a fitted line based on a global polynomial of degree 4. The results are shown in Figure 2.

. qui binsreg y x w, nbins(20) dots(0,0) line(3,3)
. qui binsreg y x w, nbins(20) dots(0,0) line(3,3) ci(3,3)
. qui binsreg y x w, nbins(20) dots(0,0) line(3,3) ci(3,3) cb(3,3)
. qui binsreg y x w, nbins(20) dots(0,0) line(3,3) ci(3,3) cb(3,3) polyreg(4)

By construction, a cubic B-spline fit is a piecewise cubic polynomial function which is continuous, and has continuous first- and second-order derivatives. Thus, the prediction line and confidence band generated are quite smooth. In this case, the fit is arguably under-smoothed because of the “large” choice of the number of bins ($J = 20$). The degree and smoothness of polynomials can be changed by adjusting the values of p and s in the options dots(), line(), ci() and cb().
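The degrees of freedom reported in the output tables of this section are consistent with a simple parameter count: a piecewise polynomial of degree p over J bins has (p+1)J coefficients, and each of the J-1 inner knots removes s of them through the smoothness constraints. The short Python sketch below (function names are hypothetical) reproduces the reported values, for instance 21 for dots(0 0) with 21 bins and 23 for a cubic B-spline with 20 bins.

```python
def binscatter_df(p, s, nbins):
    """Number of free coefficients of a piecewise polynomial of degree p
    with s smoothness constraints at each of the nbins - 1 inner knots."""
    return (p + 1) * nbins - s * (nbins - 1)


def global_poly_df(p):
    """Number of coefficients of a global polynomial of degree p."""
    return p + 1


print(binscatter_df(0, 0, 21))  # canonical binscatter with 21 bins
print(binscatter_df(3, 3, 20))  # cubic B-spline with 20 bins
print(global_poly_df(4))        # polyreg(4)
```

With maximal smoothness (s equal to p) the count reduces to J + p, so adding bins adds one parameter at a time.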

The command binsreg also allows for the standard weight options, vce options, factor variables, and twoway graph options, among other features. This is illustrated in the following code:

. binsreg y x w i.t, dots(0,0) line(3,3) ci(3,3) cb(3,3) polyreg(4) ///
>     vce(cluster id) savedata(output/graphdat) replace ///
>     title("Binned Scatter Plot")
Binscatter plot
Bin selection method: IMSE-optimal plug-in choice
Placement: Quantile-spaced
Derivative: 0
Output file: output/graphdat.dta

--------------------------------------------
# of observations              |        1000
# of distinct values           |        1000
# of clusters                  |         500
--------------------------------------------
Bin selection:                 |
  Degree of polynomial         |           0
  # of smoothness constraints  |           0
  # of bins                    |          20
--------------------------------------------

-------------------------------
          |    p      s     df
-------------------------------
 dots     |    0      0     20
 line     |    3      3     23
 CI       |    3      3     23
 CB       |    3      3     23
 polyreg  |    4     NA      5
-------------------------------

Specifically, a dummy variable based on the binary covariate t is added to the estimation, standard errors are clustered at the level of the group indicator id, and a graph title is added to the resulting binned scatter plot. Note that any unrecognized options for the command binsreg will be understood as twoway options and therefore appended to the final plot command. Thus, users may easily modify, for example, axis properties, legends, etc. The option savedata(output/graphdat) saves the underlying data used in the binned scatter plot in the file output/graphdat.dta.

In addition, the command binsreg can be used for subgroup analysis. The following command implements binscatter estimation and inference across two subgroups separately, defined by the variable t, and then produces a common binned scatter plot (Figure 3):

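. binsreg y x w, by(t) dots(0,0) line(3,3) cb(3,3) ///
>     bycolors(blue red) bysymbols(O T)
Binscatter plot
Bin selection method: IMSE-optimal plug-in choice
Placement: Quantile-spaced
Derivative: 0

Group: t = 0
--------------------------------------------
# of observations              |         485
# of distinct values           |         485
# of clusters                  |           .
--------------------------------------------
Bin selection:                 |
  Degree of polynomial         |           0
  # of smoothness constraints  |           0
  # of bins                    |          20
--------------------------------------------

-------------------------------
          |    p      s     df
-------------------------------
 dots     |    0      0     20
 line     |    3      3     23
 CB       |    3      3     23
-------------------------------

Group: t = 1
--------------------------------------------
# of observations              |         515
# of distinct values           |         515
# of clusters                  |           .
--------------------------------------------
Bin selection:                 |
  Degree of polynomial         |           0
  # of smoothness constraints  |           0
  # of bins                    |          15
--------------------------------------------

-------------------------------
          |    p      s     df
-------------------------------
 dots     |    0      0     15
 line     |    3      3     18
 CB       |    3      3     18
-------------------------------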

Figure 3: Binned Scatter Plot: Group Comparison

Figure 3 highlights a difference across the two subgroups defined by the variable t, which corresponds to the fact that our simulated data adds a constant shift to the outcome variable for the units in one of the two subgroups. The colors, symbols, and line patterns in Figure 3 can be modified via the options bycolors(), bysymbols(), and bylpatterns(). When the number of bins is unspecified, the command binsreg selects the number of bins for each subsample separately, via the companion command binsregselect. This means that, by default, the choice of binning/partitioning structure will be different across subgroups in general. However, if the option samebinsby is specified, then a common binning scheme for all subgroups is constructed based on the full sample.

Next, we illustrate the syntax of the command binsregtest. The basic syntax is the following:

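. binsregtest y x w, testmodelpoly(1)
Hypothesis tests based on binscatter estimates
Bin selection method: IMSE-optimal plug-in choice
Placement: Quantile-spaced
Derivative: 0

--------------------------------------------
# of observations              |        1000
# of distinct values           |        1000
# of clusters                  |           .
--------------------------------------------
Bin selection:                 |
  Degree of polynomial         |           0
  # of smoothness constraints  |           0
  # of bins                    |          21
--------------------------------------------

Model specification Tests:
Degree: 3   # of smoothness constraints: 3

------------------------------------------
 H0: mu =         |   sup |T|     p value
------------------------------------------
 poly. degree 1   |     6.503       0.000
------------------------------------------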

A test for linearity of the regression function is implemented using the binscatter estimator. By default, a cubic B-spline is employed in the inference procedure, which can be adjusted by the option testmodel(). In addition, when unspecified, the number of bins is selected using a data-driven procedure via the companion command binsregselect. The selected number of bins is IMSE-optimal for piecewise constant point estimates by default. A summary of the sample and binning scheme is displayed, and then the test statistic and p-value are reported. In this case, the test statistic is the supremum of the absolute value of the t-statistic evaluated over a sequence of grid points, and the p-value is calculated based on simulation. Clearly, the p-value is quite small, and thus the null hypothesis of linearity of the regression function is rejected.

The command binsregtest can implement testing for any parametric model specification by comparing the fitted values based on the binscatter estimator (computed by the command) and the parametric model of interest (provided by the user). For example, the following code creates an auxiliary dataset with a grid of evaluation points, implements a linear regression first, makes an out-of-sample prediction using the auxiliary dataset, and then tests for linearity based on the binscatter estimator by specifying the auxiliary file containing the fitted values.

. qui binsregselect y x w, simsngrid(30) savegrid(output/parfitval) replace
. qui reg y x w
. use output/parfitval, clear
. predict binsreg_fit_lm
(option xb assumed; fitted values)
. save output/parfitval, replace
file output/parfitval.dta saved
. use binsreg_simdata, clear
. binsregtest y x w, testmodelparfit(output/parfitval)
Hypothesis tests based on binscatter estimates
Bin selection method: IMSE-optimal plug-in choice
Placement: Quantile-spaced
Derivative: 0

--------------------------------------------
# of observations              |        1000
# of distinct values           |        1000
# of clusters                  |           .
--------------------------------------------
Bin selection:                 |
  Degree of polynomial         |           0
  # of smoothness constraints  |           0
  # of bins                    |          21
--------------------------------------------

Model specification Tests:
Degree: 3   # of smoothness constraints: 3
Input file: output/parfitval.dta

------------------------------------------
 H0: mu =         |   sup |T|     p value
------------------------------------------
 binsreg_fit_lm   |     6.503       0.000
------------------------------------------

The first command, qui binsregselect y x w, simsngrid(30) savegrid(output/parfitval) replace, generates the auxiliary file containing the grid of evaluation points. Since the parameter of interest is only the mean relation between y and x, i.e., $\mu(x)$, at the out-of-sample prediction step the testing dataset parfitval.dta must contain a variable x holding a sequence of evaluation points at which the binscatter and parametric models are compared, and the covariate w with its values set to zero. In addition, the variable containing fitted values has to follow a specific naming rule, i.e., it must take the form binsreg_fit*. The companion command binsregselect can be used to construct the required auxiliary dataset, as illustrated above. We discuss this other command further below.
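The workflow above (fit a parametric model, then predict on a grid where the covariate is zero and store the result under a binsreg_fit* name) can be mirrored in a language-agnostic way. The Python sketch below is only an illustration of the data preparation step, not of the test itself: it fits a linear regression of y on x and w by solving the normal equations, and then adds the fitted values to a list of grid rows. All function names, the toy data, and the column names x and w are hypothetical.

```python
def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - tail) / m[r][r]
    return sol


def ols(y, regressors):
    """Least squares coefficients (intercept first) via the normal equations."""
    cols = [[1.0] * len(y)] + regressors
    k = len(cols)
    xtx = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(cols[i], y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def add_parametric_fit(grid_rows, coefs, name="binsreg_fit_lm"):
    """Store fitted values on the grid; w is zero there, so only x matters."""
    b0, bx, bw = coefs
    for row in grid_rows:
        row[name] = b0 + bx * row["x"] + bw * row["w"]
    return grid_rows


# Toy data generated exactly from y = 1 + 2x + 3w.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
w = [1.0, 0.0, 1.0, 0.0, 2.0]
y = [1 + 2 * xi + 3 * wi for xi, wi in zip(x, w)]
grid = [{"x": 0.5, "w": 0.0}, {"x": 1.5, "w": 0.0}]
print(add_parametric_fit(grid, ols(y, [x, w])))
```

Because the covariate is zero on the grid, the stored fitted values trace out the parametric model as a function of x alone, which is the object compared with the binscatter estimate.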

In addition to model specification tests, the command binsregtest can test for nonparametric shape restrictions on the regression function. For example, the following syntax tests whether the regression function is increasing:

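. binsregtest y x w, deriv(1) nbins(20) testshaper(0)
Hypothesis tests based on binscatter estimates
Bin selection method: User-specified
Placement: Quantile-spaced
Derivative: 1

--------------------------------------------
# of observations              |        1000
# of distinct values           |        1000
# of clusters                  |           .
--------------------------------------------
Bin selection:                 |
  Degree of polynomial         |           .
  # of smoothness constraints  |           .
  # of bins                    |          20
--------------------------------------------

Shape Restriction Tests:
Degree: 3   # of smoothness constraints: 3

------------------------------------------
 H0: inf mu >=    |     inf T     p value
------------------------------------------
 0                |    -2.680       0.202
------------------------------------------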

The null hypothesis here is that the infimum of the first-order derivative of the regression function is no less than 0. The output reports the test statistic, which is the infimum of the t-statistic over a sequence of evaluation points, and the corresponding simulation-based p-value.
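The opposite one-sided restriction can be sketched in the same way. The call below only recombines options that appear in this section (deriv(), nbins(), and testshapel()): with deriv(1), testshapel(0) sets the null hypothesis that the supremum of the first derivative is no greater than 0, i.e., that the regression function is non-increasing. No output is shown because it depends on the data.

```stata
* Sketch: test whether the regression function is non-increasing,
* i.e., H0: sup of the first derivative <= 0.
use binsreg_simdata, clear
binsregtest y x w, deriv(1) nbins(20) testshapel(0)
```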

The command binsregtest may implement many tests simultaneously (given the derivative of interest). For example,

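    . binsregtest y x w, nbins(20) testshaper(-2 0) testshapel(4) testmodelpoly(1) ///
    > nsims(1000) simsngrid(30)
    Hypothesis tests based on binscatter estimates
    Bin selection method: User-specified
    Placement: Quantile-spaced
    Derivative: 0
    -------------------------------------
    # of observations              1000
    # of distinct values           1000
    # of clusters                     .
    -------------------------------------
    Bin selection:
    Degree of polynomial              .
    # of smoothness constraints       .
    # of bins                        20
    -------------------------------------
    Shape Restriction Tests:
    Degree: 3   # of smoothness constraints: 3
    -------------------------------------
    H0: sup mu <=       sup T   p value
    -------------------------------------
    4                  -1.683     1.000
    -------------------------------------
    -------------------------------------
    H0: inf mu >=       inf T   p value
    -------------------------------------
    -2                  1.461     1.000
    0                  -9.694     0.000
    -------------------------------------
    Model specification Tests:
    Degree: 3   # of smoothness constraints: 3
    -------------------------------------
    H0: mu =          sup |T|   p value
    -------------------------------------
    poly. degree 1      6.108     0.000
    -------------------------------------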

The above syntax tests three shape restrictions and one model specification (linearity), employing 1000 random draws in the simulation, set via nsims(1000), and 30 evaluation points within each bin, set via simsngrid(30), to evaluate the supremum/infimum.

As already mentioned, the commands binsreg and binsregtest rely on data-driven bin selection procedures via the command binsregselect whenever the option nbins() is not employed by the user. Its basic syntax is as follows:

    . binsregselect y x w
    Bin selection for binscatter estimates
    Method: IMSE-optimal: plug-in choice
    Position: Quantile-spaced
    -------------------------------------
    # of observations              1000
    # of distince values           1000
    # of clusters                     .
    -------------------------------------
    Degree of polynomial              0
    # of smoothness constraint        0
    -------------------------------------
    -------------------------------------
    method          # of bins        df
    -------------------------------------
    ROT-POLY               18        18
    ROT-REGUL              18        18
    ROT-UKNOT              18        18
    DPI                    21        21
    DPI-UKNOT              21        21
    -------------------------------------

The following choices of number of bins are reported: ROT-POLY, the rule-of-thumb (ROT) choice based on global polynomial estimation; ROT-REGUL, the ROT choice regularized as discussed in Section 0.2, or the user’s choice specified in the option nbinsrot(); ROT-UKNOT, the ROT choice with unique knots; DPI, the direct plug-in (DPI) choice; and DPI-UKNOT, the DPI choice with unique knots.

The direct plug-in choice is implemented based on the rule-of-thumb choice, which can be set by users directly:

    . binsregselect y x w, nbinsrot(20) binspos(es)
    Bin selection for binscatter estimates
    Method: IMSE-optimal: plug-in choice
    Position: Evenly-spaced
    -------------------------------------
    # of observations              1000
    # of distince values           1000
    # of clusters                     .
    -------------------------------------
    Degree of polynomial              0
    # of smoothness constraint        0
    -------------------------------------
    -------------------------------------
    method          # of bins        df
    -------------------------------------
    ROT-POLY                .         .
    ROT-REGUL              20        20
    ROT-UKNOT              20        20
    DPI                    22        22
    DPI-UKNOT              22        22
    -------------------------------------

Notice that in the example above an evenly-spaced, rather than quantile-spaced, binning scheme is selected via the option binspos(es). The binning used in the commands binsreg and binsregtest may be adjusted similarly.
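To make the distinction between the two placements concrete, the sketch below builds 20 bins by hand with core Stata commands and then passes binspos(es) to the other two commands. The variables bin_qs and bin_es are hypothetical names introduced only for illustration, and the hand-made bins approximate the idea rather than reproduce the package's internal knot construction.

```stata
* Sketch: quantile-spaced versus evenly-spaced bins, built by hand.
use binsreg_simdata, clear

* Quantile-spaced: roughly equal numbers of observations per bin.
xtile bin_qs = x, nquantiles(20)

* Evenly-spaced: equal-width intervals over the range of x.
quietly summarize x
generate bin_es = min(floor(20*(x - r(min))/(r(max) - r(min))) + 1, 20) if !missing(x)

* Compare bin counts under the two schemes.
tabulate bin_qs
tabulate bin_es

* The same placement option in the other commands of the package.
binsreg y x w, nbins(20) binspos(es)
binsregtest y x w, nbins(20) binspos(es) testmodelpoly(1)
```

Under quantile spacing the counts in tabulate are nearly constant across bins, whereas under even spacing they follow the density of x, so sparse regions of x yield bins with few observations.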

In addition, as illustrated above, the command binsregselect also provides a convenient option savegrid(), which can be used to generate the auxiliary dataset needed for parametric specification testing of user-chosen models via the command binsregtest. Specifically, the following command was (quietly) used above:

    . binsregselect y x w, simsngrid(30) savegrid(output/parfitval) replace
    Bin selection for binscatter estimates
    Method: IMSE-optimal: plug-in choice
    Position: Quantile-spaced
    Output file: output/parfitval.dta
    -------------------------------------
    # of observations              1000
    # of distince values           1000
    # of clusters                     .
    -------------------------------------
    Degree of polynomial              0
    # of smoothness constraint        0
    -------------------------------------
    -------------------------------------
    method          # of bins        df
    -------------------------------------
    ROT-POLY               18        18
    ROT-REGUL              18        18
    ROT-UKNOT              18        18
    DPI                    21        21
    DPI-UKNOT              21        21
    -------------------------------------

The resulting file, parfitval.dta, includes x and w as well as some other variables related to the binning scheme. The variable x contains a sequence of evaluation points, in this case set to 30 within each bin via the option simsngrid(), and the values of w are set to zero on purpose, so that the fitted values of the parametric model are generated correctly at the prediction step.
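A quick way to confirm this structure is to open the saved grid and inspect it. This is a minimal sketch that assumes the file was created by the savegrid() call above; it relies only on the two facts stated in the text, namely that x and w are present and that w is identically zero.

```stata
* Sketch: inspect the auxiliary grid saved by savegrid().
use output/parfitval, clear
describe            // lists x, w, and the binning-related variables
summarize x w       // x spans the evaluation grid; w should be all zeros
assert w == 0       // stops with an error if any value of w is nonzero

* Return to the estimation sample.
use binsreg_simdata, clear
```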

Finally, the main command binsreg is highly integrated with the companion commands in the package Binsreg. Specifically, binsreg can simultaneously implement binscatter plotting and hypothesis testing, with the tests carried out via the command binsregtest and the number of bins automatically selected via the command binsregselect. For example,

    . binsreg y x w, dots(0,0) line(3,3) ci(3,3) cb(3,3) polyreg(4) ///
    > testmodelpoly(1) testshapel(4)
    Binscatter plot
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0
    -------------------------------------
    # of observations              1000
    # of distinct values           1000
    # of clusters                     .
    -------------------------------------
    Bin selection:
    Degree of polynomial              0
    # of smoothness constraints       0
    # of bins                        21
    -------------------------------------
    -------------------------------------
                    p       s        df
    -------------------------------------
    dots            0       0        21
    line            3       3        24
    CI              3       3        24
    CB              3       3        24
    polyreg         4      NA         5
    -------------------------------------
    Hypothesis tests based on binscatter estimates
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0
    -------------------------------------
    # of observations              1000
    # of distinct values           1000
    # of clusters                     .
    -------------------------------------
    Bin selection:
    Degree of polynomial              0
    # of smoothness constraints       0
    # of bins                        21
    -------------------------------------
    Shape Restriction Tests:
    Degree: 3   # of smoothness constraints: 3
    -------------------------------------
    H0: sup mu <=       sup T   p value
    -------------------------------------
    4                  -1.920     1.000
    -------------------------------------
    Model specification Tests:
    Degree: 3   # of smoothness constraints: 3
    -------------------------------------
    H0: mu =          sup |T|   p value
    -------------------------------------
    poly. degree 1      6.503     0.000
    -------------------------------------

As a general rule, the implementations within the command binsreg are based on the binning scheme either specified by the user via the option nbins() or selected by a data-driven procedure given the degree and smoothness chosen in the option dots(). Valid inference requires a careful choice of binning or, more specifically, of the number of bins relative to the sample size. It is recommended to use the data-driven method to select the IMSE-optimal number of bins for a given polynomial degree, and then to implement the inference methods with a higher-order polynomial degree, as in the example above, where the bins are selected for dots(0,0) while the confidence intervals, confidence band, and tests use degree 3. This corresponds to a simple application of robust bias-corrected inference (Calonico-Cattaneo-Titiunik_2014_ECMA; Calonico-Cattaneo-Farrell_2018_JASA; Calonico-Cattaneo-Farrell_2019_CEOptimal; Cattaneo-Farrell-Feng_2018_wp).
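The recommendation can also be carried out in two explicit steps, which makes the separation between bin selection and inference visible. In the sketch below, the value 21 passed to nbins() is copied by hand from the DPI row reported earlier for this simulated dataset; it is an assumption tied to that output and would need to be replaced for other data.

```stata
* Sketch: select bins for a degree-0 fit, then do inference at degree 3.
use binsreg_simdata, clear

* Step 1: IMSE-optimal number of bins for the default degree-0 binscatter.
binsregselect y x w

* Step 2: plot degree-0 dots with that number of bins, and build the
* confidence intervals and band from a cubic fit with 3 constraints.
binsreg y x w, nbins(21) dots(0,0) ci(3,3) cb(3,3)
```

Calling binsreg without nbins() performs the first step automatically, as in the example above, so the two-step form is mainly useful when the selected number of bins should be reported or reused across several calls.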

0.7 Conclusion

We introduced the Stata package Binsreg, which provides general-purpose software implementations of binscatter via three commands: binsreg, binsregtest, and binsregselect. A companion R package with similar syntax and the same capabilities is also available.

0.8 Acknowledgments

We thank John Friedman, Andreas Fuster, Paul Goldsmith-Pinkham, David Lucca, and Xinwei Ma for helpful comments and discussions.

References

About the Authors

Matias D. Cattaneo is a Professor of Economics and a Professor of Statistics at the University of Michigan.

Richard K. Crump is a Vice President and the Function Head of the Capital Markets Function at the Federal Reserve Bank of New York.

Max H. Farrell is an Associate Professor of Statistics and Econometrics at the University of Chicago Booth School of Business.

Yingjie Feng is a Ph.D. candidate in Economics at the University of Michigan.