0.1 Introduction
Binscatter has become a very popular methodology in applied microeconomics since its introduction (ChettySzeidl_2006_wp; ChettyLooneyKroft_2009_AER; ChettyFriedmanOlsenPistaferri_2011_QJE; ChettyFriedmanHilgerSaezSchanzenbachYagan_2011_QJE). See Stepner_2014_StataConf for a popular Stata implementation of canonical binscatter, and StarrGoldfarb_2018_wp for a recent heuristic overview of these methods. Binscatter techniques offer flexible yet parsimonious ways of visualizing and summarizing “big data” in regression settings. They can also be used for formal estimation and inference, including testing of substantive hypotheses, such as linearity or monotonicity, about the regression function and its derivatives. Despite its popularity among empirical researchers, little was known about the statistical properties of binscatter until very recently:
CattaneoCrumpFarrellFeng_2019_Binscatter offered the first foundational, thorough analysis of binscatter, giving an array of theoretical and practical results that aid both in understanding current practices (i.e., their validity or lack thereof) and in offering theory-based guidance for future applications.
This paper introduces the Stata (and R) package Binsreg, which includes three commands implementing the main methodological results in CattaneoCrumpFarrellFeng_2019_Binscatter: binsreg, binsregtest, and binsregselect. The first command (binsreg) implements binscatter for the regression function and its derivatives, offering several point estimation, confidence interval, and confidence band procedures, with particular focus on constructing binned scatter plots. The second command (binsregtest) implements hypothesis testing procedures for parametric specification and for nonparametric shape restrictions of the unknown regression function and its derivatives. Finally, the third command (binsregselect) implements data-driven number of bins selectors for binscatter implementation using either quantile-spaced or evenly-spaced binning.
There exists another very popular Stata command also implementing binscatter methods: binscatter. See Stepner_2014_StataConf for an introduction to this alternative command. The command binsreg offers several new capabilities relative to binscatter, in addition to implementing covariate adjustment in a different, more principled way. First, the command binsreg implements binscatter methods allowing for within-bin higher-order polynomial fitting and for across-bins smoothness restrictions, which enables derivative estimation and also produces smooth approximations of the regression function and its derivatives. Further, the command binsreg implements valid confidence intervals and confidence bands. None of these features are available in the command binscatter. Second, both commands allow for covariate adjustment, but each command does it in a very different way: binscatter employs a residualization approach, while binsreg employs a semilinear approach. As shown in CattaneoCrumpFarrellFeng_2019_Binscatter, the residualization approach is, in general, inconsistent for the target function of interest, while the semilinear approach is either consistent (under correct specification) or has a clear probability limit interpretation (under misspecification). Finally, the other two commands in the package Binsreg, binsregtest and binsregselect, are novel implementations in Stata (and in R).
The rest of the article is organized as follows. Section 0.2 gives an overview of the main methods available in the package Binsreg and discusses some implementation details. Sections 0.3, 0.4 and 0.5 discuss, respectively, the syntax of the commands binsreg, binsregtest and binsregselect. Section 0.6 gives a numerical illustration, while Section 0.7 concludes. The latest version of this software, as well as other related software and materials, can be found at:
0.2 Overview of Methods and Implementation Details
This section summarizes the main methods implemented in the package Binsreg. For further methodological and theoretical details see CattaneoCrumpFarrellFeng_2019_Binscatter and references therein.
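Given a random sample $(y_i, x_i, \mathbf{w}_i')$, $i = 1, 2, \dots, n$, binscatter seeks to flexibly approximate the function $\mu_0(x)$ in the partially linear regression model:
\[ y_i = \mu_0(x_i) + \mathbf{w}_i'\boldsymbol{\gamma}_0 + \epsilon_i, \qquad \mathbb{E}[\epsilon_i \mid x_i, \mathbf{w}_i] = 0, \tag{1} \]
where $y_i$ is an outcome of interest and $x_i$ is a covariate of interest, while $\mathbf{w}_i$ represents other covariates possibly entering the semilinear regression model. In particular, if the latter covariates are not present, then $\mu_0(x) = \mathbb{E}[y_i \mid x_i = x]$, while otherwise $\mu_0(x)$ denotes the average (partial) relationship between $y_i$ and $x_i$ after controlling for $\mathbf{w}_i$. More general interpretations are also possible; see CCFF for more discussion.
In some applications, interest may lie in the $v$th derivative of $\mu_0(x)$, which we denote by $\mu_0^{(v)}(x) = \frac{d^v}{dx^v}\mu_0(x)$. We employ the usual notation $\mu_0(x) = \mu_0^{(0)}(x)$.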
0.2.1 Binscatter Construction
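To approximate $\mu_0(x)$ and its derivatives in model (1), binscatter first partitions the support of $x_i$ into $J$ quantile-spaced bins, leading to the partitioning scheme:
\[ \widehat{\Delta} = \{\widehat{\mathcal{B}}_1, \dots, \widehat{\mathcal{B}}_J\}, \qquad \widehat{\mathcal{B}}_j = \begin{cases} \big[x_{(1)}, x_{(\lfloor n/J \rfloor)}\big) & \text{if } j = 1, \\ \big[x_{(\lfloor n(j-1)/J \rfloor)}, x_{(\lfloor nj/J \rfloor)}\big) & \text{if } j = 2, \dots, J-1, \\ \big[x_{(\lfloor n(J-1)/J \rfloor)}, x_{(n)}\big] & \text{if } j = J, \end{cases} \]
where $x_{(i)}$ denotes the $i$th order statistic of the sample $\{x_1, \dots, x_n\}$, $\lfloor \cdot \rfloor$ is the floor operator, and $J < n$. Each estimated bin contains roughly the same number of observations $N_j = \sum_{i=1}^n \mathbb{1}_{\widehat{\mathcal{B}}_j}(x_i)$, where $\mathbb{1}_{\mathcal{A}}(x) = \mathbb{1}(x \in \mathcal{A})$ with $\mathbb{1}(\cdot)$ denoting the indicator function. This binning approach is the most popular in empirical work but, for completeness, all commands in the package Binsreg also allow for evenly-spaced binning. See below for more implementation details.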
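The quantile-spaced partition described above can be sketched in a few lines of Python. This is only an illustration, not the package's code: the function names `quantile_bin_edges` and `assign_bin` are hypothetical, and the knots are assumed to sit at the $\lfloor nj/J \rfloor$th order statistics with half-open bins, the last bin being closed.

```python
def quantile_bin_edges(x, J):
    """Return the J + 1 edges of a quantile-spaced partition built from order statistics."""
    xs = sorted(x)
    n = len(xs)
    if not 1 <= J < n:
        raise ValueError("need 1 <= J < n")
    # Interior knots: the floor(n*j/J)-th order statistic (1-based), j = 1, ..., J-1.
    interior = [xs[(n * j) // J - 1] for j in range(1, J)]
    return [xs[0]] + interior + [xs[-1]]


def assign_bin(value, edges):
    """0-based index of the bin holding value; bins are [lo, hi) except the last, [lo, hi]."""
    J = len(edges) - 1
    for j in range(J - 1):
        if value < edges[j + 1]:
            return j
    return J - 1


# Toy data: 100 distinct values split into J = 4 bins.
x = list(range(1, 101))
edges = quantile_bin_edges(x, 4)
counts = [0] * 4
for value in x:
    counts[assign_bin(value, edges)] += 1
```

With distinct values the bin counts differ by at most a couple of observations. With many repeated values some bins can end up empty, which is one reason the package treats mass points separately.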
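Given the quantile-spaced partitioning/binning scheme, for a choice of number of bins $J$, a generalized binscatter estimator of the $v$th derivative of $\mu_0(x)$, employing a $p$th order polynomial approximation within each bin, imposing $s$-times differentiability across bins, and adjusting for additional covariates $\mathbf{w}_i$, is given by
\[ \widehat{\mu}^{(v)}(x) = \widehat{\mathbf{b}}_s^{(v)}(x)'\widehat{\boldsymbol{\beta}}, \qquad \begin{bmatrix} \widehat{\boldsymbol{\beta}} \\ \widehat{\boldsymbol{\gamma}} \end{bmatrix} = \arg\min_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \sum_{i=1}^n \big(y_i - \widehat{\mathbf{b}}_s(x_i)'\boldsymbol{\beta} - \mathbf{w}_i'\boldsymbol{\gamma}\big)^2, \]
where $s \leq p$, $v \leq p$, and $\widehat{\mathbf{b}}_s(x) = \widehat{\mathbf{T}}_s \widehat{\mathbf{b}}(x)$ with
\[ \widehat{\mathbf{b}}(x) = \big[\mathbb{1}_{\widehat{\mathcal{B}}_1}(x) \ \cdots \ \mathbb{1}_{\widehat{\mathcal{B}}_J}(x)\big]' \otimes \big[1 \ x \ \cdots \ x^p\big]' \]
being the $p$th order polynomial basis of approximation within each bin, hence of dimension $(p+1)J$, and $\widehat{\mathbf{T}}_s$ being a $[(p+1)J - (J-1)s] \times (p+1)J$ matrix of linear restrictions ensuring that the $(s-1)$th derivative of $\widehat{\mu}(x)$ is continuous.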
When $s = 0$, $\widehat{\mathbf{T}}_0 = \mathbf{I}_{(p+1)J}$, the identity matrix of dimension $(p+1)J$, and therefore no restrictions are imposed: $\widehat{\mathbf{b}}_0(x) = \widehat{\mathbf{b}}(x)$ is the basis used for (disjoint) piecewise $p$th order polynomial fits. Consequently, the binscatter $\widehat{\mu}^{(v)}(x)$ is discontinuous at the bins’ edges whenever $s = 0$. On the other hand, $p \geq s > 0$ implies that a least squares $p$th order polynomial fit is constructed within each bin $\widehat{\mathcal{B}}_j$, in which case setting $s = 1$ forces these fits to be connected at the boundaries of adjacent bins, $s = 2$ forces these fits to be connected and continuously differentiable at the boundaries of adjacent bins, and so on for each $s = 3, \dots, p$.
Enforcing smoothness on binscatter boils down to incorporating restrictions on the basis of approximation. The resulting constrained basis, $\widehat{\mathbf{b}}_s(x)$, corresponds to a choice of spline basis for approximation of $\mu_0(x)$, with estimated quantile-spaced knots according to the partition $\widehat{\Delta}$. The package Binsreg employs $\widehat{\mathbf{T}}_s$ leading to B-splines, which tend to have very good finite sample properties. When $p = s = 0$, binscatter reduces to the canonical binscatter commonly found in empirical work.
Specifically, in canonical binscatter the basis $\widehat{\mathbf{b}}_0(x)$ is a $J$-dimensional vector of orthogonal dummy variables, that is, the $j$th component of $\widehat{\mathbf{b}}_0(x)$ records whether the evaluation point $x$ belongs to the $j$th bin in the partition $\widehat{\Delta}$. Therefore, canonical binscatter can be expressed as the collection of $J$ sample averages of the response variable $y_i$, one for each bin: $\bar{y}_j = \frac{1}{N_j}\sum_{i=1}^n \mathbb{1}_{\widehat{\mathcal{B}}_j}(x_i) y_i$ for $j = 1, 2, \dots, J$. Empirical work employing canonical binscatter typically plots these binned sample averages along with some other estimate(s) of the regression function $\mu_0(x)$.
Finally, covariate adjustment in $\widehat{\mu}^{(v)}(x)$ is justified via model (1), and does not coincide with the one commonly used by most practitioners and other software implementations of binscatter (cf. Stepner_2014_StataConf). An alternative covariate-adjustment approach is based on residualization, that is, first regressing out the additional covariates $\mathbf{w}_i$ and then constructing binscatter based on the residuals only. This alternative covariate-adjustment approach is very hard to rationalize or justify in general, and will lead to an inconsistent estimator of $\mu_0(x)$ in model (1) unless very special assumptions hold. Furthermore, even when model (1) is misspecified, the approach to covariate adjustment employed by the package Binsreg enjoys a natural probability limit interpretation, while the residualization approach does not. See CCFF for more discussion, numerical examples, and technical details.
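To make the canonical case concrete, the following Python sketch computes within-bin sample averages for a given set of bin edges. It assumes no additional covariates (so no semilinear adjustment is performed), and the function name `canonical_binscatter` is hypothetical rather than part of the package. Returning the within-bin mean of the covariate alongside the mean of the outcome is only a plotting convenience assumed here.

```python
def canonical_binscatter(x, y, edges):
    """Within-bin means of x and y for bins [e_0, e_1), ..., [e_{J-1}, e_J].

    Returns one (mean_x, mean_y, count) tuple per non-empty bin.
    No covariate adjustment is performed in this sketch.
    """
    J = len(edges) - 1
    sum_x = [0.0] * J
    sum_y = [0.0] * J
    count = [0] * J
    for xi, yi in zip(x, y):
        # Locate the bin: move right while xi is at or beyond the next edge.
        j = 0
        while j < J - 1 and xi >= edges[j + 1]:
            j += 1
        sum_x[j] += xi
        sum_y[j] += yi
        count[j] += 1
    return [
        (sum_x[j] / count[j], sum_y[j] / count[j], count[j])
        for j in range(J)
        if count[j] > 0
    ]


# Toy data with three bins of two observations each.
x = [1, 2, 3, 4, 5, 6]
y = [1, 3, 2, 4, 10, 12]
dots = canonical_binscatter(x, y, edges=[1, 3, 5, 6])
```

Each tuple corresponds to one "dot" of a canonical binned scatter plot.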
Main implementation details
The command binsreg implements multiple versions of $\widehat{\mu}^{(v)}(x)$ for a common choice of partitioning/binning $\widehat{\Delta}$. The option deriv() is used to set a common value of $v$ across all implementations. The options dots(p,s) and line(p,s) generate “dots” and a “line” tracing out two distinct implementations of $\widehat{\mu}^{(v)}(x)$ with the corresponding choices of $p$ and $s$ selected in each case. Defaults are deriv(0) so that the object of interest is $\mu_0(x)$, and dots(0,0) so that the “dots” represent canonical binscatter (i.e., sample averages within each bin). The line option is muted by default, and needs to be set explicitly to appear in the resulting plot: for example, the option line(3,3) adds a line tracing out $\widehat{\mu}(x)$, implemented with $p = 3$ and $s = 3$, a cubic B-spline estimator of $\mu_0(x)$.
The common partitioning/binning used by the command binsreg across all implementations is set to be quantile-spaced for some choice of $J$. The option nbins() sets $J$ manually (e.g., nbins(20) corresponds to $J = 20$ quantile-spaced bins), but if this option is not supplied then the companion command binsregselect is used to choose $J$ in a fully data-driven way, as described below. As an alternative, an evenly-spaced partitioning/binning can be implemented via the option binspos().
Several other options are available for the command binsreg, including optimal evenly-spaced binning via the command binsregselect and substantive specification or shape restrictions hypothesis testing via the command binsregtest, as discussed in detail further below. See Section 0.3 for the full syntax of binsreg.
0.2.2 Choosing the number of bins
CCFF develops valid integrated mean square error (IMSE) approximations for binscatter and its extensions in the context of model (1). These expansions give IMSE-optimal selection of the number of bins $J$ forming the quantile-spaced (or other) partitioning scheme $\widehat{\Delta}$, depending on polynomial order $p$ within bins and smoothness level $s$ across bins, and on the target estimand set by derivative order $v$. Specifically, the IMSE-optimal choice of $J$ is given by:
\[ J_{\mathrm{IMSE}} = \left\lceil \left( \frac{2(p - v + 1)\, B_n(p, s, v)}{(1 + 2v)\, V_n(p, s, v)} \right)^{\frac{1}{2p+3}} n^{\frac{1}{2p+3}} \right\rceil, \]
where $\lceil \cdot \rceil$ denotes the ceiling operator, and $B_n(p, s, v)$ and $V_n(p, s, v)$ represent an approximation to the integrated (square) bias and variance of $\widehat{\mu}^{(v)}(x)$, respectively. These constants depend on the partitioning scheme and binscatter estimator used. Recall that these three integer choices must respect $s \leq p$ and $v \leq p$.
Both IMSE constants, $B_n(p, s, v)$ and $V_n(p, s, v)$, can be estimated consistently using a preliminary choice of $J$. Thus, our main implementation offers two selectors:

$\widehat{J}_{\mathrm{ROT}}$: implements a rule-of-thumb (ROT) approximation for the two IMSE constants, employing a trimmed-from-below Gaussian reference model for the density of $x_i$, and global polynomial approximations for the other two unknown features needed. This selector employs the correct rate but an inconsistent constant approximation.

$\widehat{J}_{\mathrm{DPI}}$: implements a direct-plug-in (DPI) approximation for the two IMSE constants, based on the desired binscatter, set by the choices of $p$ and $s$, and employing a preliminary $J$. If a preliminary $J$ is not provided by the user, then $\widehat{J}_{\mathrm{ROT}}$ is used for DPI implementation. Therefore, the selector $\widehat{J}_{\mathrm{DPI}}$ employs the correct rate and a nonparametric consistent constant approximation in the latter case.
The precise form of the constants changes depending on whether quantile-spaced or evenly-spaced partitioning/binning is used, but the two methods for selecting $J$, ROT and DPI, remain conceptually the same. See CCFF and references therein for more details.
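The rate formula behind the IMSE-optimal number of bins can be illustrated with a short Python sketch. It takes the bias and variance constants as given inputs (estimating them, via ROT or DPI, is the hard part and is not attempted here) and assumes the formula $J = \lceil (2(p-v+1)B / ((1+2v)V))^{1/(2p+3)} n^{1/(2p+3)} \rceil$ from CCFF. The function name `imse_optimal_bins` and its argument names are hypothetical.

```python
import math


def imse_optimal_bins(n, p, v, bias_const, var_const):
    """Number of bins from the IMSE rate formula, given the two constants.

    bias_const and var_const stand for the integrated squared bias and
    variance constants; they are inputs here, not estimated.
    """
    if not 0 <= v <= p:
        raise ValueError("need 0 <= v <= p")
    if n <= 0 or bias_const <= 0 or var_const <= 0:
        raise ValueError("n and both constants must be positive")
    ratio = 2 * (p - v + 1) * bias_const / ((1 + 2 * v) * var_const)
    # (ratio)^(1/(2p+3)) * n^(1/(2p+3)) equals (ratio * n)^(1/(2p+3)).
    return math.ceil((ratio * n) ** (1.0 / (2 * p + 3)))


# With unit constants and n = 1000: canonical binscatter versus a cubic fit.
J_canonical = imse_optimal_bins(n=1000, p=0, v=0, bias_const=1.0, var_const=1.0)
J_cubic = imse_optimal_bins(n=1000, p=3, v=0, bias_const=1.0, var_const=1.0)
```

The comparison shows the familiar pattern: for the same constants, a higher polynomial order calls for far fewer bins because the exponent $1/(2p+3)$ shrinks.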
Main implementation details
The command binsregselect implements ROT and DPI data-driven, IMSE-optimal selection of $J$ for all possible choices of $(p, s, v)$, and for both quantile-spaced and evenly-spaced partitioning/binning. For DPI implementation, the user can provide the initialization value of $J$ or, if not provided, then $\widehat{J}_{\mathrm{ROT}}$ is used.
Several other options are available for the command binsregselect, including the possibility of generating an output file with the IMSE-optimal partitioning/binning structure selected and the corresponding grid of evaluation points, which can be used by the other companion commands (binsreg and binsregtest) for plotting, simulation, testing, and other calculations. See Section 0.5 for the full syntax of binsregselect.
0.2.3 Confidence intervals
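Both confidence intervals and confidence bands for the unknown function $\mu_0^{(v)}(x)$ are constructed employing the same type of Studentized statistic:
\[ \widehat{T}_p(x) = \frac{\widehat{\mu}^{(v)}(x) - \mu_0^{(v)}(x)}{\sqrt{\widehat{\Omega}(x)/n}}, \qquad 0 \leq v, s \leq p, \]
where the binscatter variance estimator is of the usual “sandwich” form
\[ \widehat{\Omega}(x) = \widehat{\mathbf{b}}_s^{(v)}(x)' \widehat{\mathbf{Q}}^{-1} \widehat{\boldsymbol{\Sigma}} \widehat{\mathbf{Q}}^{-1} \widehat{\mathbf{b}}_s^{(v)}(x), \]
with $\widehat{\mathbf{Q}} = \frac{1}{n}\sum_{i=1}^n \widehat{\mathbf{b}}_s(x_i)\widehat{\mathbf{b}}_s(x_i)'$ and $\widehat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^n \widehat{\mathbf{b}}_s(x_i)\widehat{\mathbf{b}}_s(x_i)'\big(y_i - \widehat{\mathbf{b}}_s(x_i)'\widehat{\boldsymbol{\beta}} - \mathbf{w}_i'\widehat{\boldsymbol{\gamma}}\big)^2$.
CCFF showed that $\widehat{T}_p(x) \to_d \mathsf{N}(0, 1)$ pointwise in $x$, that is, for each evaluation point $x$ on the support of $x_i$, provided the misspecification error introduced by binscatter is removed from the distributional approximation. Such a result justifies asymptotically valid confidence intervals for $\mu_0^{(v)}(x)$, pointwise in $x$, after bias correction. Specifically, for each $x$, the $(1-\alpha)$ confidence interval takes the form:
\[ \widehat{I}_p(x) = \left[\, \widehat{\mu}^{(v)}(x) \pm c \cdot \sqrt{\widehat{\Omega}(x)/n} \,\right], \qquad 0 \leq v, s \leq p, \]
where $c = \Phi^{-1}(1 - \alpha/2)$ and $\Phi(u)$ denotes the distribution function of a standard normal random variable (e.g., $c \approx 1.96$ for a 95% Gaussian confidence interval), and provided the choice of $J$ is such that the misspecification error can be ignored.
However, employing an IMSE-optimal binscatter (i.e., setting $J = J_{\mathrm{IMSE}}$ for the selected polynomial order $p$) introduces a first-order misspecification error that invalidates these confidence intervals, so the IMSE-optimal binscatter cannot be used directly to form confidence intervals in general. To address this problem, we rely on a simple application of robust bias-correction (CalonicoCattaneoTitiunik_2014_ECMA; CalonicoCattaneoFarrell_2018_JASA; CalonicoCattaneoFarrell_2019_CEOptimal; CattaneoFarrellFeng_2018_wp) to form valid confidence intervals based on IMSE-optimal binscatter, that is, without altering the partitioning scheme $\widehat{\Delta}$ used.
Our recommended implementation employs robust bias-corrected binscatter confidence intervals as follows. First, for a given choice of $p$, select the number of bins in $\widehat{\Delta}$ according to $J = J_{\mathrm{IMSE}}$, which gives an IMSE-optimal binscatter (point estimator). Then, employ the confidence interval $\widehat{I}_{p+q}(x)$ with $q \geq 1$, which satisfies
\[ \mathbb{P}\big[\mu_0^{(v)}(x) \in \widehat{I}_{p+q}(x)\big] \to 1 - \alpha, \qquad \text{for all } x. \]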
Main implementation details
The command binsreg implements confidence intervals, and reports them as part of the final binned scatter plot. Specifically, the option ci(p,s) estimates confidence intervals with the corresponding choices of $p$ and $s$ selected, and plots them as vertical segments along the support of $x_i$. The implementation is done over a grid of evaluation points, which can be modified via the option cingrid(), and the desired level is set by the option level(). Notice that dots(p,s), line(p,s), and ci(p,s) may all take different choices of $p$ and $s$, which allows for robust bias-correction implementation of the confidence intervals and permits incorporating different levels of smoothness restrictions.
Several other options are available for the command binsreg, including substantive hypothesis testing via the companion command binsregtest, as described in detail further below. See Section 0.3 for the full syntax of binsreg.
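As an illustration of the pointwise interval construction, the Python sketch below forms a Gaussian interval of the form estimate $\pm\, c\sqrt{\Omega/n}$ with $c = \Phi^{-1}(1-\alpha/2)$. The point estimate and the variance term are taken as given inputs. For robust bias correction they would come from the higher-order fit on the same partition, which is an assumption about how the inputs are produced and is not done by this code. The function name `pointwise_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def pointwise_ci(estimate, omega_hat, n, level=95.0):
    """Gaussian interval: estimate +/- c * sqrt(omega_hat / n).

    omega_hat is the (already computed) sandwich variance term at the
    evaluation point; level is the nominal coverage in percent.
    """
    if not 0 < level < 100:
        raise ValueError("level must be strictly between 0 and 100")
    alpha = 1.0 - level / 100.0
    c = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 when level = 95
    half_width = c * math.sqrt(omega_hat / n)
    return estimate - half_width, estimate + half_width


# Example: estimate 2.0 with omega_hat = 4 and n = 100, so the standard error is 0.2.
lower, upper = pointwise_ci(2.0, 4.0, 100)
```

Repeating the call over a grid of evaluation points gives the vertical segments drawn in the plot. The intervals remain pointwise and do not provide uniform coverage.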
0.2.4 Confidence bands
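In many empirical applications of binscatter, the goal is to conduct inference about the entire function $\mu_0^{(v)}(x)$, simultaneously, that is, uniformly over all $x$ on the support of $x_i$. This goal is fundamentally different from pointwise inference. A leading example of uniform inference is reporting confidence bands for $\mu_0(x)$ and its derivatives, which are different from (pointwise) confidence intervals. The package Binsreg offers asymptotically valid constructions of both confidence intervals, as discussed above, and confidence bands, which can be implemented with the same choices of $(p, s)$ used to construct $\widehat{\mu}^{(v)}(x)$ or different ones.
Following the theoretical work in CCFF, for a choice of $p$ and partition/binning of size $J$, the $(1-\alpha)$ confidence band for $\mu_0^{(v)}(x)$ is:
\[ \widehat{I}_p(\cdot) = \left[\, \widehat{\mu}^{(v)}(\cdot) \pm c \cdot \sqrt{\widehat{\Omega}(\cdot)/n} \,\right], \qquad 0 \leq v, s \leq p, \]
where the quantile value $c$ is now approximated via simulations using
\[ c = \inf\Big\{ c \in \mathbb{R}_+ : \mathbb{P}\Big[ \sup_x \big|\widehat{Z}_p(x)\big| \leq c \ \Big|\ \mathbf{D} \Big] \geq 1 - \alpha \Big\}, \]
with $\mathbf{D} = \{(y_i, x_i, \mathbf{w}_i') : i = 1, 2, \dots, n\}$ denoting the original data,
\[ \widehat{Z}_p(x) = \frac{\widehat{\mathbf{b}}_s^{(v)}(x)' \widehat{\mathbf{Q}}^{-1} \widehat{\boldsymbol{\Sigma}}^{1/2}}{\sqrt{\widehat{\Omega}(x)}}\, \mathbf{N}_K, \qquad 0 \leq v, s \leq p, \]
and $\mathbf{N}_K$ being a $K = (p+1)J - (J-1)s$ dimensional standard normal random vector. The distribution of $\sup_x |\widehat{T}_p(x)|$, which is unknown, is approximated by that of $\sup_x |\widehat{Z}_p(x)|$ conditional on the data $\mathbf{D}$, which can be simulated by taking repeated samples from $\mathbf{N}_K$ and recomputing the supremum each time. In other words, the quantiles used to construct confidence bands can be approximated by resampling from the standard normal random vector $\mathbf{N}_K$, keeping fixed the data $\mathbf{D}$ (and hence all quantities depending on it). See CCFF for more details.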
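The simulation step can be illustrated in the simplest setting. For canonical binscatter (piecewise constants, no smoothness restrictions), with no additional covariates and a non-clustered variance estimator, the bin dummies are orthogonal, so the simulated Studentized process is just one independent standard normal draw per bin and its supremum is the largest absolute draw. The Python sketch below relies on that simplification, which is an assumption of this example and not the general algorithm (the general case requires the matrices in the formula above). The function name `sup_t_critical_value` is hypothetical.

```python
import math
import random


def sup_t_critical_value(J, level=95.0, nsims=20000, seed=12345):
    """Simulated quantile of max_j |N_j| over J independent standard normals.

    Valid as a band critical value only in the canonical case described
    in the lead-in (orthogonal bin dummies, no covariates, no clustering).
    """
    rng = random.Random(seed)
    sups = sorted(
        max(abs(rng.gauss(0.0, 1.0)) for _ in range(J)) for _ in range(nsims)
    )
    # Smallest simulated value whose empirical CDF reaches the target level.
    k = min(nsims, math.ceil(nsims * level / 100.0)) - 1
    return sups[k]


c_single = sup_t_critical_value(J=1)   # close to the pointwise value 1.96
c_band = sup_t_critical_value(J=10)    # larger, reflecting uniformity over 10 bins
```

The band is then each bin's estimate plus or minus `c_band` times its standard error. It is wider than the pointwise intervals because the critical value grows with the number of bins.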
A confidence band covers the entire function $\mu_0^{(v)}(x)$ in $100(1-\alpha)\%$ of repeated samples, whenever the misspecification error can be ignored. As before, we recommend employing robust bias correction to remove misspecification error introduced by binscatter, that is, following the same logic discussed above for the case of confidence intervals construction. To be more precise, first $v$ is chosen, along with $p$ and $s$, and the optimal partitioning/binning is selected according to $J = J_{\mathrm{IMSE}}$. Then, the confidence bands are constructed using $\widehat{I}_{p+q}(\cdot)$ with $q \geq 1$. This ensures that
\[ \mathbb{P}\big[\mu_0^{(v)}(x) \in \widehat{I}_{p+q}(x), \ \text{for all } x\big] \to 1 - \alpha. \]
Main implementation details
The command binsreg implements confidence bands, and reports them as part of the final binned scatter plot. The option cb(p,s) estimates an asymptotically valid confidence band with the corresponding choices of $p$ and $s$ selected, and plots it as a shaded region along the support of $x_i$. The implementation is done over a grid of evaluation points, which can be modified via the option cbngrid(), and the desired level is set by the option level(). The options dots(p,s), line(p,s), ci(p,s), and cb(p,s) can all take different choices of $p$ and $s$, which allows for robust bias correction implementations, as well as many other practically relevant possibilities.
See Section 0.3 for the full syntax of binsreg.
0.2.5 Parametric specification testing
In addition to implementing binscatter and producing binned scatter plots, with both point and uncertainty estimators, the package Binsreg also allows for formal testing of substantive hypotheses. The command binsregtest implements all hypothesis tests available. This command can be used as a standalone command, or can be called via binsreg when constructing binned scatter plots.
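The command binsregtest implements two types of substantive hypothesis tests about $\mu_0^{(v)}(x)$: (i) parametric specification testing, and (ii) nonparametric shape restriction testing. This subsection discusses the first type of hypothesis testing, while the next subsection discusses the second one.
For a choice of $(p, s, v)$, and partitioning/binning scheme of size $J$, the implemented parametric specification testing approach contrasts a (nonparametric) binscatter approximation $\widehat{\mu}^{(v)}(x)$ of $\mu_0^{(v)}(x)$ with a hypothesized parametric specification of the form $\mu_0(x) = m(x, \boldsymbol{\theta})$ for some $m(\cdot)$ known up to a finite parameter $\boldsymbol{\theta}$, which can be estimated using the available data. Formally, the null and alternative hypotheses are, respectively,
\[ \ddot{H}_0: \ \sup_x \big| \mu_0^{(v)}(x) - m^{(v)}(x, \boldsymbol{\theta}) \big| = 0 \ \text{ for some } \boldsymbol{\theta}, \qquad \text{vs.} \qquad \ddot{H}_A: \ \sup_x \big| \mu_0^{(v)}(x) - m^{(v)}(x, \boldsymbol{\theta}) \big| > 0 \ \text{ for all } \boldsymbol{\theta}, \]
for choice of derivative order $v$.
For example, $\widehat{\mu}(x)$ is compared to $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ in order to assess whether there is a relationship between $y_i$ and $x_i$ or, more formally, whether $\mu_0(x)$ is a constant function. Similarly, it is possible to formally test for a linear, quadratic, or even nonlinear parametric relationship $\mu_0(x) = m(x, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ would be estimated from the data under the null hypothesis, that is, assuming that the postulated relationship is indeed correct.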
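Following CCFF, the command binsregtest employs the test statistic
\[ \ddot{T}_p(x) = \frac{\widehat{\mu}^{(v)}(x) - m^{(v)}(x, \widehat{\boldsymbol{\theta}})}{\sqrt{\widehat{\Omega}(x)/n}}, \qquad 0 \leq v, s \leq p, \]
where $\widehat{\boldsymbol{\theta}}$ denotes the estimate of $\boldsymbol{\theta}$ obtained under the null hypothesis. Then, a parametric specification hypothesis testing procedure is:
\[ \text{Reject } \ddot{H}_0 \ \text{ if and only if } \ \sup_x \big| \ddot{T}_p(x) \big| \geq c, \tag{2} \]
where $c = \inf\big\{ c \in \mathbb{R}_+ : \mathbb{P}\big[ \sup_x |\widehat{Z}_p(x)| \leq c \mid \mathbf{D} \big] \geq 1 - \alpha \big\}$ is again computed by simulation from a standard Gaussian random vector, conditional on the data $\mathbf{D}$, as in the case of confidence bands already discussed. This testing procedure is an asymptotically valid $\alpha$-level test if the misspecification error is removed from the test statistic $\ddot{T}_p(x)$.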
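The command binsregtest employs robust bias correction by default: first $p$ and $s$ are chosen, and the partitioning/binning scheme is selected by setting $J = J_{\mathrm{IMSE}}$ for these choices. Then, using this partitioning scheme, the testing procedure (2) is implemented with the choice $p + q$ instead of $p$, with $q \geq 1$. CCFF shows that, under regularity conditions, the resulting parametric specification testing approach controls Type I error with nontrivial power: for given $p$, $0 \leq v, s \leq p$, and $J = J_{\mathrm{IMSE}}$,
\[ \lim_{n \to \infty} \mathbb{P}\Big[ \sup_x \big| \ddot{T}_{p+q}(x) \big| > c \Big] = \alpha, \qquad \text{under } \ddot{H}_0, \]
and
\[ \lim_{n \to \infty} \mathbb{P}\Big[ \sup_x \big| \ddot{T}_{p+q}(x) \big| > c \Big] = 1, \qquad \text{under } \ddot{H}_A, \]
where $c = \inf\big\{ c \in \mathbb{R}_+ : \mathbb{P}\big[ \sup_x |\widehat{Z}_{p+q}(x)| \leq c \mid \mathbf{D} \big] \geq 1 - \alpha \big\}$. This testing approach formalizes the intuitive idea that if the confidence band for $\mu_0^{(v)}(x)$ does not contain the parametric fit considered entirely, then such parametric fit is incompatible with the data, i.e., should be rejected.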
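The decision rule behind the specification test can be sketched in Python once all ingredients are available on a grid of evaluation points. The sketch below takes the binscatter estimates, the parametric fitted values, the standard errors, and a simulated critical value as given inputs and applies a sup-t rule: reject when the largest absolute Studentized gap reaches the critical value. Approximating the supremum by a maximum over a finite grid, and the function name `specification_test`, are assumptions of this illustration.

```python
def specification_test(nonpar_fit, par_fit, std_err, critical_value):
    """Sup-t specification test evaluated over a finite grid.

    nonpar_fit, par_fit and std_err are equal-length sequences holding the
    binscatter estimate, the parametric fit and the standard error at each
    grid point. Returns (statistic, reject).
    """
    if not (len(nonpar_fit) == len(par_fit) == len(std_err)) or not nonpar_fit:
        raise ValueError("inputs must be non-empty and of equal length")
    statistic = max(
        abs(m - g) / se for m, g, se in zip(nonpar_fit, par_fit, std_err)
    )
    return statistic, statistic >= critical_value


grid_fit = [1.0, 2.1, 2.9]   # binscatter estimates on a 3-point grid
se = [0.1, 0.1, 0.1]         # their standard errors
# A linear fit that tracks the estimates closely is not rejected ...
stat_lin, reject_lin = specification_test(grid_fit, [1.0, 2.0, 3.0], se, 2.35)
# ... while a constant fit at 2.0 is rejected.
stat_const, reject_const = specification_test(grid_fit, [2.0, 2.0, 2.0], se, 2.35)
```

This mirrors the confidence band interpretation: the constant fit lies outside the band at the first and third grid points, whereas the linear fit stays inside it everywhere.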
Main implementation details
The command binsregtest implements parametric specification testing in two ways. First, polynomial regression (parametric) specification testing is implemented directly via the option testmodelpoly(P), where the null hypothesis is $m(x, \boldsymbol{\theta}) = \theta_0 + \theta_1 x + \cdots + \theta_P x^P$ and $\boldsymbol{\theta} = (\theta_0, \theta_1, \dots, \theta_P)'$ is estimated by least squares regression. For other parametrizations of $m(x, \boldsymbol{\theta})$, the command takes as input an auxiliary array/database (dta in Stata, or csv in R) via the option testmodelparfit(filename) containing the following columns/variables: grid of evaluation points in one column, and fitted values (over the evaluation grid) for each parametric model considered in other columns/variables. The ordering of these variables is arbitrary, but they have to follow a naming rule: the evaluation grid has the same name as the independent variable $x_i$, and the names of other variables storing fitted values take the form binsreg_fit*. The binscatter (nonparametric) estimate used to construct the testing procedure is set by the options testmodel(p,s) and deriv(v), and the partitioning/binning scheme selected.
See Section 0.4 for other options and more details on the syntax of this command.
0.2.6 Nonparametric shape testing
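The second type of hypothesis tests implemented by the command binsregtest concerns nonparametric testing of shape restrictions. For a choice of $v$, the null and alternative hypotheses of these testing problems are:
\[ \dot{H}_0: \ \sup_x \mu_0^{(v)}(x) \leq 0, \qquad \text{vs.} \qquad \dot{H}_A: \ \sup_x \mu_0^{(v)}(x) > 0, \]
that is, a one-sided testing problem to the right. For example, negativity, monotonicity and concavity of $\mu_0(x)$ correspond to $\mu_0(x) \leq 0$, $\mu_0^{(1)}(x) \leq 0$ and $\mu_0^{(2)}(x) \leq 0$, respectively. Of course, the analogous testing problem to the left is also implemented, but not discussed here to avoid unnecessary repetition.
The relevant Studentized test statistic for this class of testing problems is:
\[ \dot{T}_p(x) = \frac{\widehat{\mu}^{(v)}(x)}{\sqrt{\widehat{\Omega}(x)/n}}, \qquad 0 \leq v, s \leq p. \]
Then, the testing procedure is:
\[ \text{Reject } \dot{H}_0 \ \text{ if and only if } \ \sup_x \dot{T}_p(x) \geq c, \tag{3} \]
with $c = \inf\big\{ c \in \mathbb{R}_+ : \mathbb{P}\big[ \sup_x \widehat{Z}_p(x) \leq c \mid \mathbf{D} \big] \geq 1 - \alpha \big\}$. As before, misspecification errors of binscatter need to be taken into account in order to control Type I error. As in previous cases, CCFF show that for given $p$, $0 \leq v, s \leq p$, and $J = J_{\mathrm{IMSE}}$ accordingly, then
\[ \lim_{n \to \infty} \mathbb{P}\Big[ \sup_x \dot{T}_{p+q}(x) > c \Big] \leq \alpha, \qquad \text{under } \dot{H}_0, \]
and
\[ \lim_{n \to \infty} \mathbb{P}\Big[ \sup_x \dot{T}_{p+q}(x) > c \Big] = 1, \qquad \text{under } \dot{H}_A, \]
for any $q \geq 1$, that is, using a robust bias-correction approach. These results imply that the testing procedure (3) is an asymptotically valid hypothesis test provided it is implemented with the choice $q \geq 1$ after the IMSE-optimal partitioning/binning scheme for binscatter of order $p$ is selected.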
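A one-sided shape test can be sketched along the same lines as the band construction. The Python example below tests a null of the form "the function (or its derivative) is bounded above by a constant a everywhere" by comparing the largest Studentized exceedance over a grid with a simulated one-sided critical value. As a simplifying assumption it uses the canonical case with orthogonal bins, no covariates and no clustering, where the simulated process is one independent standard normal per bin; the general case requires the full sandwich matrices. Both function names are hypothetical.

```python
import math
import random


def one_sided_critical_value(J, level=95.0, nsims=20000, seed=2019):
    """Simulated quantile of max_j N_j (no absolute value) over J standard normals."""
    rng = random.Random(seed)
    sups = sorted(max(rng.gauss(0.0, 1.0) for _ in range(J)) for _ in range(nsims))
    k = min(nsims, math.ceil(nsims * level / 100.0)) - 1
    return sups[k]


def shape_test_upper(estimates, std_err, a, critical_value):
    """Test H0: function <= a everywhere, using max over the grid of (estimate - a) / se."""
    statistic = max((m - a) / se for m, se in zip(estimates, std_err))
    return statistic, statistic >= critical_value


# Estimated first derivative in three bins; is the function weakly decreasing (derivative <= 0)?
deriv_hat = [-0.5, -0.2, 0.1]
se = [0.1, 0.1, 0.1]
c_one = one_sided_critical_value(J=1)   # close to the one-sided normal value 1.645
c_three = one_sided_critical_value(J=3)
stat, reject = shape_test_upper(deriv_hat, se, a=0.0, critical_value=c_three)
```

Here the only positive estimate is one standard error above zero, well below the simulated critical value, so the monotonicity null is not rejected.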
Main implementation details
The command binsregtest implements one-sided and two-sided nonparametric shape restriction testing as follows. Option testshapel(a) implements one-sided testing to the left: . Option testshaper(a) implements one-sided testing to the right: . Option testshape2(a) implements two-sided testing: . The constant a needs to be specified by the user. The binscatter (nonparametric) estimate used to construct the testing procedure is set by the options testshape(p,s) and deriv(v), and the chosen partitioning/binning scheme.
See Section 0.4 for more details on the syntax of this command.
0.2.7 Extensions and other implementation details
Whenever possible, the package Binsreg is implemented using the general purpose least squares regression command regress in Stata. (In R, the function lm() is used as the building block for implementation.) This approach sacrifices speed of the implementation, but improves substantially in terms of stability and replicability.
This section reviews some specific extensions and other numerical issues of the package Binsreg and discusses related choices made for implementation, all of which can affect speed and/or robustness of the package.
Mass points and minimum effective sample size
All three commands in the package Binsreg incorporate specific implementation decisions to deal with mass points in the distribution of the independent variable $x_i$. The number of distinct values of $x_i$, denoted by $N$, is taken as the effective sample size as opposed to the total number of observations $n$. If $x_i$ is continuously distributed, then $N = n$. However, in many applications, $N$ can be substantially smaller than $n$, and this affects some of the implementations in the package.
First, assume that $J$ is set by the user (via the option nbins(J)). Then, given the choice $J$, the commands binsreg and binsregtest perform a degrees of freedom check to decide whether the $x_i$ data exhibit enough variation. Specifically, given $p$ and $s$ set by the option dots(p,s), both commands check whether with by default. If this check is not passed, then the package Binsreg regards the data as having “too little” variation in $x_i$, and turns off all nonparametric estimation and inference results based on large sample approximations. Thus, in this extreme case, the command binsreg only allows for dots(0,0), ci(0,0), and polyreg(P) for any , while the command binsregtest does not return any results and issues a warning message instead.
If, on the other hand, for given $J$, the numerical check is passed, then all nonparametric methods implemented by the commands binsreg and binsregtest become available. However, before implementing each method (dots(p,s), line(p,s), ci(p,s), cb(p,s), polyreg(p), and the hypothesis testing procedures), a degrees of freedom check is performed in each individual case. Specifically, each nonparametric procedure is implemented only if , where recall that $p$ and $s$ may change from one procedure to the next.
Second, as discussed above, whenever $J$ is not set by the user via the option nbins(), the command binsregselect is employed to select $J$ in a data-driven way, provided there is enough variation in $x_i$. To determine the latter, an initial degrees of freedom check is performed to assess whether $J$ selection is possible or, alternatively, if the unique values of $x_i$ should be used as bins directly. Specifically, if , with $p$ set by the option dots(p,s) and by default, then the data is deemed appropriate for ROT selection of $J$ via the command binsregselect, and hence $\widehat{J}_{\mathrm{ROT}}$ is implemented. If, in addition, , then $\widehat{J}_{\mathrm{DPI}}$ is also implemented whenever requested. Furthermore, the command binsregselect employs the following alternative formula for $J$ selection:
with a slightly different constant , taking into account the frequency of data at each mass point. All other estimators in the package Binsreg, including bias and standard error estimators, automatically adapt to the presence of mass points. Once the final $J$ is estimated, the degrees of freedom checks discussed in the previous paragraphs are performed based on this choice.
If $J$ is not set by the user and , so that not even ROT estimation of $J$ is possible, then $N$ is taken as “too small.” In this extreme case, the package Binsreg sets $J = N$ and constructs a partitioning/binning structure with each bin containing one unique value of $x_i$. In other words, the support of the raw data is taken as the binning structure itself. In this extreme case, the follow-up degrees of freedom checks fail by construction, and hence the nonparametric asymptotic methods are turned off as explained above.
Finally, the specific numerical checks mentioned in this subsection can be adjusted or omitted. First, checking and accounting for repeated values in $x_i$ can be turned off using the option nomassadj. Second, the default cutoff points and , corresponding to the degrees of freedom checks for nonparametric binscatter and parametric global polynomial, respectively, can be changed using the option dfcheck( ).
Clustered data and minimum effective sample size
As discussed in CCFF, the main methodological results for binscatter can be extended to accommodate clustered data. All three commands in the package Binsreg allow for clustered data via the option vce(). In this case, the number of clusters $G$ is taken as the effective sample size, assuming $G < N$ (see below for the other case). The only substantive change occurs in the command binsregselect, which now employs the following alternative formula for $J$ selection:
with a variance constant accounting for the clustered structure of the data. Accordingly, cluster-robust variance estimators are used in this case.
Minimum effective sample size
The package Binsreg requires some minimal variation in $x_i$ in order to successfully implement nonparametric methods based on large sample approximations. The minimal variation is captured by the number of distinct values on the support of $x_i$, denoted by $N$, and the number of clusters, denoted by $G$. Thus, all three commands in the package perform degrees of freedom numerical checks using $\min\{n, N, G\}$ as the general definition of effective sample size, and proceeding as explained above for the case of mass points in the distribution of $x_i$.
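The notion of effective sample size can be illustrated with a small Python sketch: take the smallest of the number of observations, the number of distinct covariate values, and the number of clusters. The sketch also computes the number of free parameters of a piecewise polynomial of degree p on J bins with s smoothness constraints at each of the J - 1 interior knots, $(p+1)J - (J-1)s$. The final comparison, effective sample size exceeding a user-supplied cutoff plus that parameter count, is a hypothetical form of a degrees of freedom check written for illustration only. It is not claimed to be the package's exact rule or default cutoff, and all function names are hypothetical.

```python
def effective_sample_size(x, cluster=None, massadj=True):
    """min(n, number of distinct x values, number of clusters).

    massadj=False skips the distinct-value count, mimicking the idea of
    switching off the mass point adjustment.
    """
    sizes = [len(x)]
    if massadj:
        sizes.append(len(set(x)))
    if cluster is not None:
        sizes.append(len(set(cluster)))
    return min(sizes)


def spline_dof(p, s, J):
    """Free parameters of a degree-p piecewise polynomial on J bins with s constraints per interior knot."""
    return (p + 1) * J - (J - 1) * s


def passes_dof_check(eff_n, p, s, J, cutoff):
    """Hypothetical check: effective sample size must exceed cutoff + spline_dof."""
    return eff_n > cutoff + spline_dof(p, s, J)


x = [1, 1, 2, 2, 3, 3, 4, 5]
cluster = [1, 1, 1, 2, 2, 2, 3, 3]
n_eff_mass = effective_sample_size(x)                      # 5 distinct values
n_eff_cluster = effective_sample_size(x, cluster=cluster)  # 3 clusters
n_eff_raw = effective_sample_size(x, massadj=False)        # 8 observations
```

With only a handful of distinct values or clusters, any reasonable cutoff fails, which is the situation in which the large-sample nonparametric procedures are switched off.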
0.3 binsreg syntax
The main purpose of the command binsreg is to produce binned scatter plots. This command implements multiple binscatter estimators, accompanying confidence intervals and confidence bands, and also a global polynomial approximation for completeness. It also implements hypothesis testing via the companion command binsregtest. A partitioning/binning structure is required but, if not provided, then one is selected in a datadriven way using the companion command binsregselect.
This section describes the syntax of the command binsreg, grouping its many options according to their use.
binsreg depvar indvar othercovs , deriv(v)
dots(p s) dotsngrid(numeric) dotsplotopt(string)
line(p s) linengrid(numeric) lineplotopt(string)
ci(p s) cingrid(numeric) ciplotopt(string)
cb(p s) cbngrid(numeric) cbplotopt(string)
polyreg(p) polyregngrid(numeric) polyregcingrid(numeric)
polyregplotopt(string)
by() bycolors(colorstylelist) bysymbols(symbolstylelist)
bylpatterns(linepatternstylelist)
testmodel(p s) testmodelparfit(filename) testmodelpoly(p)
testshape(p s) testshapel(numlist) testshaper(numlist)
testshape2(numlist)
nbins(J) binspos(numlist) binsmethod(string)
nbinsrot(numeric) samebinsby
nsims(S) simsngrid(numeric) simsseed(num)
vce(vcetype) level(numeric) nomassadj noplot
savedata(filename) replace
dfcheck(n1 n2) twoway_options
depvar is the dependent variable ($y_i$).
indvar is the independent variable ($x_i$).
othercovs is a varlist for covariate adjustment ($\mathbf{w}_i$).
$p$, $s$ and $v$ are integers satisfying $0 \leq s, v \leq p$.
weights allow for fweights, aweights and pweights; see weights in Stata for more details. (In R, weights allows for the equivalent of fweights only; see lm() help for more details.)
Estimand
deriv(v) specifies the derivative order of the regression function for estimation, testing and plotting. The default is deriv(0), which corresponds to the function itself, $\mu_0(x)$.
Dots
dots(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints when constructing $\widehat{\mu}^{(v)}(x)$ for point estimation and plotting as “dots”. The default is dots(0 0), which corresponds to piecewise constant (canonical binscatter).
dotsngrid(numeric) specifies the number of dots within each bin to be plotted. Given the choice, these dots are point estimates evaluated over an evenly-spaced grid within each bin. The default is dotsngrid(1), which corresponds to one dot per bin (canonical binscatter).
dotsplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the plotted dots.
Line
line(p s) sets a piecewise polynomial of degree $p$ with $s$ smoothness constraints when constructing $\widehat{\mu}^{(v)}(x)$ for point estimation and plotting as a “line”. By default, the line is not included in the plot unless explicitly specified. Recommended specification is line(3 3), which adds a cubic B-spline estimate of the regression function of interest to the binned scatter plot.
linengrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the line(p s) option. The default is linengrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the line.
lineplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the plotted line.
Confidence Intervals
ci(p s) specifies the piecewise polynomial of degree $p$ with $s$ smoothness constraints used for constructing confidence intervals. By default, the confidence intervals are not included in the plot unless explicitly specified. Recommended specification is ci(3 3), which adds confidence intervals based on a cubic B-spline estimate of the regression function of interest to the binned scatter plot.
cingrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the ci(p s) option. The default is cingrid(1), which corresponds to 1 evenly-spaced evaluation point within each bin for confidence interval construction.
ciplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the confidence intervals.
Confidence Band
cb(p s) specifies the piecewise polynomial of degree p with s smoothness constraints used for constructing the confidence band. By default, the confidence band is not included in the plot unless explicitly specified. Recommended specification is cb(3 3), which adds a confidence band based on a cubic B-spline estimate of the regression function of interest to the binned scatter plot.
cbngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the cb(p s) option. The default is cbngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for confidence band construction.
cbplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the confidence band.
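The difference between the pointwise intervals of ci() and the uniform band of cb() comes down to the critical value applied to the same standard errors. A minimal Python sketch, assuming point estimates and standard errors are already available on a grid: the function names, the level of 95 and the critical value 2.9 are hypothetical placeholders, and in practice the critical value of the band comes from the simulation controlled by nsims().

```python
from statistics import NormalDist


def pointwise_ci(est, se, level=95.0):
    """Normal-based interval at each grid point, valid one point at a time."""
    z = NormalDist().inv_cdf(0.5 + level / 200.0)
    return [(m - z * s, m + z * s) for m, s in zip(est, se)]


def uniform_band(est, se, crit):
    """Band meant to cover the whole function; crit is a sup-type critical value."""
    return [(m - crit * s, m + crit * s) for m, s in zip(est, se)]


est = [0.10, 0.35, 0.60]  # hypothetical point estimates on a grid
se = [0.05, 0.04, 0.06]   # hypothetical standard errors
ci = pointwise_ci(est, se)
cb = uniform_band(est, se, crit=2.9)  # 2.9 is a placeholder, not a package value
```

Because the critical value of the band exceeds the pointwise normal quantile, the band is wider than the intervals at every grid point.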
Global Polynomial Regression
polyreg(p) sets the degree of a global polynomial regression model for plotting. By default, this fit is not included in the plot unless explicitly specified. Recommended specification is polyreg(3), which adds a global polynomial fit of degree 3 of the regression function of interest to the binned scatter plot.
polyregngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point estimate set by the polyreg(p) option. The default is polyregngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for plotting the global polynomial fit.
polyregcingrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for constructing confidence intervals based on polynomial regression set by the polyreg(p) option. The default is polyregcingrid(0), which corresponds to not plotting confidence intervals for the global polynomial regression approximation.
polyregplotopt(string) specifies standard graph options to be passed on to the twoway command to modify the appearance of the global polynomial regression fit.
Subgroup Analysis
by() specifies the variable containing the group indicator to perform subgroup analysis; both numeric and string variables are supported. When by() is specified, binsreg implements estimation and inference by each subgroup separately, but produces a common binned scatter plot. By default, the binning structure is selected for each subgroup separately, but see the option samebinsby below for imposing a common binning structure across subgroups.
bycolors(colorstylelist) specifies an ordered list of colors for plotting each subgroup series defined by the option by().
bysymbols(symbolstylelist) specifies an ordered list of symbols for plotting each subgroup series defined by the option by().
bylpatterns(linepatternstylelist) specifies an ordered list of line patterns for plotting each subgroup series defined by the option by().
Parametric Model Specification Testing
testmodel(p s) sets a piecewise polynomial of degree p with s smoothness constraints for parametric model specification testing, implemented via the companion command binsregtest. The null hypothesis is that the regression function (or its derivative of the order set by deriv(v)) equals the parametric specification being tested. The default is testmodel(3 3), which corresponds to a cubic B-spline estimate of the regression function of interest for testing against the fit from a parametric model specification.
testmodelparfit(filename) specifies a dataset which contains the evaluation grid and fitted values of the model(s) to be tested against. The file must have a variable with the same name as indvar, which contains a series of evaluation points at which the binscatter model and the parametric model of interest are compared with each other. Each parametric model is represented by a variable named as binsreg_fit*, which must contain the fitted values at the corresponding evaluation points.
testmodelpoly(p) specifies the degree of a global polynomial model to be tested against.
Nonparametric Shape Restriction Testing
testshape(p s) sets a piecewise polynomial of degree p with s smoothness constraints for nonparametric shape restriction testing, implemented via the companion command binsregtest. The default is testshape(3 3), which corresponds to a cubic B-spline estimate of the regression function of interest for one-sided or two-sided testing.
testshapel(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a one-sided hypothesis test to the left of the form H0: sup_x mu(x) <= a.
testshaper(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a one-sided hypothesis test to the right of the form H0: inf_x mu(x) >= a.
testshape2(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a two-sided hypothesis test of the form H0: sup_x |mu(x) - a| = 0.
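The left, right and two-sided tests differ only in how pointwise t-statistics are aggregated over the evaluation grid. The Python sketch below assumes estimates and standard errors on a grid are given; shape_statistics is a hypothetical helper, not part of the package, and the simulation-based p-values are omitted.

```python
def shape_statistics(est, se, a):
    """Aggregate the t-statistics (est - a) / se over the evaluation grid."""
    t = [(m - a) / s for m, s in zip(est, se)]
    return {
        "left": max(t),                       # H0: sup mu <= a, large values reject
        "right": min(t),                      # H0: inf mu >= a, small values reject
        "two_sided": max(abs(v) for v in t),  # H0: mu = a over the whole grid
    }


est = [1.0, 2.0, 4.0]  # hypothetical estimates on a grid
se = [0.5, 0.5, 0.5]   # hypothetical standard errors
stats = shape_statistics(est, se, a=2.0)
```

These three aggregates correspond to the columns labelled sup T, inf T and sup |T| in the test output.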
Partitioning/Binning Selection
nbins(J) sets the number of bins for partitioning/binning of indvar. If not specified, the number of bins is selected via the companion command binsregselect in a data-driven, optimal way whenever possible.
binspos(numlist) specifies the position of binning knots. The default is binspos(qs), which corresponds to quantile-spaced binning (canonical binscatter). Other options are: es for evenly-spaced binning, or a numlist for manual specification of the positions of inner knots (which must be within the range of indvar).
binsmethod(string) specifies the method for data-driven selection of the number of bins via the companion command binsregselect. The default is binsmethod(dpi), which corresponds to the IMSE-optimal direct plug-in rule. The other option is rot, for the rule-of-thumb implementation.
nbinsrot(numeric) specifies an initial number of bins used to construct the DPI number of bins selector via the companion command binsregselect. If not specified, the data-driven ROT selector is used instead.
samebinsby forces a common partitioning/binning structure across all subgroups specified by the option by(). The knot positions are selected according to the option binspos() and using the full sample. If nbins() is not specified, then the number of bins is selected via the companion command binsregselect and using the full sample.
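To see what binspos(qs) and binspos(es) imply for the bins, the following Python sketch places inner knots both ways on skewed toy data and counts the observations per bin. The quantile rule used here is a simple order-statistic convention chosen for illustration (the package may use a different one), bins are taken as closed on the left, and all function names are hypothetical.

```python
def quantile_spaced_knots(x, nbins):
    """Inner knots at order statistics, giving (roughly) equal counts per bin."""
    xs = sorted(x)
    n = len(xs)
    return [xs[(j * n) // nbins] for j in range(1, nbins)]


def evenly_spaced_knots(x, nbins):
    """Inner knots splitting the range of x into bins of equal width."""
    lo, hi = min(x), max(x)
    step = (hi - lo) / nbins
    return [lo + j * step for j in range(1, nbins)]


def bin_counts(x, knots):
    """Number of observations in each bin; bin j is [knot_j, knot_{j+1})."""
    counts = [0] * (len(knots) + 1)
    for v in x:
        counts[sum(v >= k for k in knots)] += 1
    return counts


# Skewed toy data (hypothetical): most mass is close to zero.
x = [(i / 100) ** 3 for i in range(100)]
qs = bin_counts(x, quantile_spaced_knots(x, 4))
es = bin_counts(x, evenly_spaced_knots(x, 4))
```

Quantile spacing equalizes the number of observations per bin, whereas even spacing equalizes bin width and leaves few observations in the bins where the data are sparse.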
Simulation
nsims(S) specifies the number of random draws for constructing confidence bands and hypothesis testing. The default is nsims(500), which corresponds to 500 draws from a standard Gaussian random vector of size (p+1)J - (J-1)s.
simsngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (or infimum) operation needed to construct confidence bands and hypothesis testing procedures. The default is simsngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (or infimum) operator.
simsseed(numeric) sets the seed for simulations.
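The role of nsims(), simsngrid() and simsseed() can be sketched as follows. This Python example is a simplified stand-in, not the package's algorithm: it takes a hypothetical matrix of loadings (one row per evaluation point, one column per Gaussian coordinate), simulates the supremum of the absolute value of the normalized Gaussian process over the grid, and reads off a critical value and a p-value. In the actual procedure the loadings would come from the estimated basis and variance.

```python
import math
import random


def sup_abs_draws(loadings, nsims, seed):
    """Simulate sup over grid points of |row . Z| / ||row|| with Z standard Gaussian."""
    rng = random.Random(seed)  # fixing the seed makes the draws reproducible
    size = len(loadings[0])
    rows = []
    for row in loadings:
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row])  # unit variance at every grid point
    draws = []
    for _ in range(nsims):
        z = [rng.gauss(0.0, 1.0) for _ in range(size)]
        draws.append(max(abs(sum(w * zk for w, zk in zip(row, z))) for row in rows))
    return draws


def critical_value(draws, level=95.0):
    """Empirical quantile of the simulated suprema at the given level."""
    ordered = sorted(draws)
    idx = min(len(ordered) - 1, math.ceil(level / 100.0 * len(ordered)) - 1)
    return ordered[idx]


def p_value(draws, observed):
    """Share of simulated suprema at least as large as the observed statistic."""
    return sum(d >= observed for d in draws) / len(draws)


# Hypothetical piecewise constant case with 20 bins: each point loads on one coordinate.
loadings = [[1.0 if j == k else 0.0 for k in range(20)] for j in range(20)]
draws = sup_abs_draws(loadings, nsims=500, seed=1234)
crit = critical_value(draws, level=95.0)
```

A finer grid adds rows to the loadings and sharpens the approximation of the supremum, while more draws reduce the simulation error of the critical value.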
Other Options
vce(vcetype) specifies the vcetype for variance estimation used by the command regress. The default is vce(robust).
level(numeric) sets the nominal confidence level for confidence interval and confidence band estimation.
nomassadj omits mass points (in indvar) adjustments for estimation and inference.
noplot omits binscatter plotting.
savedata(filename) specifies a filename for saving all data underlying the binscatter plot (and more).
replace overwrites the existing file when saving the graph data.
dfcheck(n1 n2) sets adjustments for minimum effective sample size checks, which take into account the number of unique values of indvar (i.e., adjusting for the number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is dfcheck(20 30), as discussed above.
twoway_options any unrecognized options are appended to the end of the twoway command generating the binned scatter plot.
0.4 binsregtest syntax
The main purpose of the command binsregtest is to conduct hypothesis testing of parametric specifications and nonparametric shape restrictions for the regression function (and its derivatives) using binscatter methods. This standalone command is used by the companion command binsreg. A partitioning/binning structure is required but, if not provided, then one is selected in a data-driven way using the companion command binsregselect.
This section describes the syntax of the command binsregtest, grouping its many options according to their use.
binsregtest depvar indvar othercovs , deriv(v)
testmodel(p s) testmodelparfit(filename) testmodelpoly(p)
testshape(p s) testshapel(numlist) testshaper(numlist)
testshape2(numlist)
bins(p s) nbins(J) binspos(numlist) binsmethod(string)
nbinsrot(numeric)
nsims(S) simsngrid(numeric) simsseed(num)
vce(vcetype) nomassadj dfcheck(n1 n2)
depvar is the dependent variable.
indvar is the independent variable.
othercovs is a varlist for covariate adjustment.
p, s and v are integers satisfying 0 <= s, v <= p.
weights allow for fweights, aweights and pweights; see weights in Stata for more details. (In R, weights allows for the equivalent of fweights only; see lm() help for more details.)
Estimand
deriv(v) specifies the derivative order of the regression function for estimation, testing and plotting. The default is deriv(0), which corresponds to the regression function itself.
Parametric Model Specification Testing
testmodel(p s) sets a piecewise polynomial of degree p with s smoothness constraints for parametric model specification testing. The null hypothesis is that the regression function (or its derivative of the order set by deriv(v)) equals the parametric specification being tested. The default is testmodel(3 3), which corresponds to a cubic B-spline estimate of the regression function of interest for testing against the fit from a parametric model specification.
testmodelparfit(filename) specifies a dataset which contains the evaluation grid and fitted values of the model(s) to be tested against. The file must have a variable with the same name as indvar, which contains a series of evaluation points at which the binscatter model and the parametric model of interest are compared with each other. Each parametric model is represented by a variable named as binsreg_fit*, which must contain the fitted values at the corresponding evaluation points.
testmodelpoly(p) specifies the degree of a global polynomial model to be tested against.
Nonparametric Shape Restriction Testing
testshape(p s) sets a piecewise polynomial of degree p with s smoothness constraints for nonparametric shape restriction testing. The default is testshape(3 3), which corresponds to a cubic B-spline estimate of the regression function of interest for one-sided or two-sided testing.
testshapel(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a one-sided hypothesis test to the left of the form H0: sup_x mu(x) <= a.
testshaper(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a one-sided hypothesis test to the right of the form H0: inf_x mu(x) >= a.
testshape2(numlist) specifies a numlist of null boundary values for hypothesis testing. Each number a in the numlist corresponds to one boundary of a two-sided hypothesis test of the form H0: sup_x |mu(x) - a| = 0.
Partitioning/Binning Selection
bins(p s) sets a piecewise polynomial of degree p with s smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. The default is bins(0 0), which corresponds to piecewise constant (canonical binscatter).
nbins(J) sets the number of bins for partitioning/binning of indvar. If not specified, the number of bins is selected via the companion command binsregselect in a data-driven, optimal way whenever possible.
binspos(numlist) specifies the position of binning knots. The default is binspos(qs), which corresponds to quantile-spaced binning (canonical binscatter). Other options are: es for evenly-spaced binning, or a numlist for manual specification of the positions of inner knots (which must be within the range of indvar).
binsmethod(string) specifies the method for data-driven selection of the number of bins via the companion command binsregselect. The default is binsmethod(dpi), which corresponds to the IMSE-optimal direct plug-in rule. The other option is rot, for the rule-of-thumb implementation.
nbinsrot(numeric) specifies an initial number of bins used to construct the DPI number of bins selector via the companion command binsregselect. If not specified, the data-driven ROT selector is used instead.
Simulation
nsims(S) specifies the number of random draws for constructing confidence bands and hypothesis testing. The default is nsims(500), which corresponds to 500 draws from a standard Gaussian random vector of size (p+1)J - (J-1)s.
simsngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (or infimum) operation needed to construct confidence bands and hypothesis testing procedures. The default is simsngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (or infimum) operator.
simsseed(numeric) sets the seed for simulations.
Other Options
vce(vcetype) specifies the vcetype for variance estimation used by the command regress. The default is vce(robust).
nomassadj omits mass points (in indvar) adjustments for estimation and inference.
dfcheck(n1 n2) sets adjustments for minimum effective sample size checks, which take into account the number of unique values of indvar (i.e., adjusting for the number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is dfcheck(20 30), as discussed above.
0.5 binsregselect syntax
The main purpose of the command binsregselect is to implement data-driven (IMSE-optimal) selection of the partitioning/binning structure for binscatter. This standalone command is used by the companion commands binsreg and binsregtest whenever the user does not specify the binning structure manually.
This section describes the syntax of the command binsregselect, grouping its many options according to their use.
binsregselect depvar indvar othercovs , deriv(v)
bins(p s) binspos(numlist) binsmethod(string) nbinsrot(numeric)
simsngrid(numeric) savegrid(filename)
replace
vce(vcetype) nomassadj dfcheck(n1 n2)
depvar is the dependent variable.
indvar is the independent variable.
othercovs is a varlist for covariate adjustment.
p, s and v are integers satisfying 0 <= s, v <= p.
weights allow for fweights, aweights and pweights; see weights in Stata for more details. (In R, weights allows for the equivalent of fweights only; see lm() help for more details.)
Estimand
deriv(v) specifies the derivative order of the regression function for estimation, testing and plotting. The default is deriv(0), which corresponds to the regression function itself.
Partitioning/Binning Selection
bins(p s) sets a piecewise polynomial of degree p with s smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. The default is bins(0 0), which corresponds to piecewise constant (canonical binscatter).
binspos(numlist) specifies the position of binning knots. The default is binspos(qs), which corresponds to quantile-spaced binning (canonical binscatter). Other options are: es for evenly-spaced binning, or a numlist for manual specification of the positions of inner knots (which must be within the range of indvar).
binsmethod(string) specifies the method for data-driven selection of the number of bins. The default is binsmethod(dpi), which corresponds to the IMSE-optimal direct plug-in rule. The other option is rot, for the rule-of-thumb implementation.
nbinsrot(numeric) specifies an initial number of bins used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
Evaluation Points Grid Generation
simsngrid(numeric) specifies the number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (or infimum) operation needed to construct confidence bands and hypothesis testing procedures. The default is simsngrid(20), which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (or infimum) operator.
savegrid(filename) specifies a filename for storing the simulation grid of evaluation points. It contains the following variables: indvar, which is a sequence of evaluation points used in approximation; all control variables in othercovs, which take values of zero for prediction purposes; binsreg_isknot, indicating whether the grid point is an inner knot; and binsreg_bin, indicating which bin the grid point belongs to.
replace overwrites the existing file when saving the grid.
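The layout of the file written by savegrid() can be pictured with a small Python sketch. It is only an approximation of the structure described above: the placement of grid points strictly inside each bin, the assignment of an inner knot to the bin on its right, bin numbering from 1, and the helper name build_grid are all assumptions made for illustration.

```python
def build_grid(knots, ngrid, indvar="x", covars=("w",)):
    """Rows of an evaluation grid: ngrid points per bin plus the inner knots.

    knots holds the boundary and inner knots in increasing order.
    """
    nbins = len(knots) - 1
    rows = []

    def add(value, isknot, bin_id):
        row = {indvar: value, "binsreg_isknot": isknot, "binsreg_bin": bin_id}
        row.update({name: 0.0 for name in covars})  # covariates fixed at zero
        rows.append(row)

    for b in range(nbins):
        left, right = knots[b], knots[b + 1]
        step = (right - left) / (ngrid + 1)
        for k in range(1, ngrid + 1):
            add(left + k * step, 0, b + 1)  # evenly-spaced interior point
        if b < nbins - 1:
            add(right, 1, b + 2)  # assumption: an inner knot opens the next bin
    return rows


grid = build_grid([0.0, 0.5, 1.0], ngrid=2)
```

A parametric model fitted on the original data can then be evaluated on these rows, with the covariates held at zero, to produce the fitted values required by testmodelparfit().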
Other Options
vce(vcetype) specifies the vcetype for variance estimation used by the command regress. The default is vce(robust).
nomassadj omits mass points (in indvar) adjustments for estimation and inference.
dfcheck(n1 n2) sets adjustments for minimum effective sample size checks, which take into account the number of unique values of indvar (i.e., adjusting for the number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is dfcheck(20 30), as discussed above.
0.6 Illustration of Methods
We illustrate the package Binsreg using a simulated dataset, which is available in the file binsreg_simdata.dta. In this dataset, y is the outcome variable, x is the independent variable for binning, w is a continuously distributed covariate, t is a binary covariate, and id is a group identifier. Summary statistics of the simulated data are as follows.
    . use binsreg_simdata, clear
    . sum

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
               x |      1,000    .4907072    .2932553   .0002281   .9985808
               w |      1,000    .0120224    .5799381  -.9993055   .9973198
               t |      1,000        .515     .500025          0          1
              id |      1,000       250.5    144.4095          1        500
               y |      1,000    .5283884    1.727878  -5.159858   5.751276
The basic syntax for binsreg is the following:
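    . binsreg y x w

    Binscatter plot
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0

    ------------------------------+---------------
     # of observations            |   1000
     # of distinct values         |   1000
     # of clusters                |   .
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   0
       # of smoothness constraints|   0
       # of bins                  |   21
    ------------------------------+---------------

    ---------+------------------------------
             |    p       s       df
    ---------+------------------------------
     dots    |    0       0       21
    ---------+------------------------------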
The main output is a binned scatter plot as shown in Figure 1. By default, the (nonparametric) mean relationship between y and x is approximated by piecewise constants (dots(0 0)). Each dot in the figure represents the point estimate corresponding to each bin, which is the canonical binscatter plot. The number of bins, whenever not specified, is automatically selected via the companion command binsregselect. In this case, 21 bins are used. Other useful information is also reported, including total sample size, the number of distinct values of x, bin selection results, and the degrees of freedom of the statistical model(s) employed.
Users may specify the number of bins manually rather than relying on the automatic data-driven procedures. For example, a popular ad hoc choice in practice is setting 20 quantile-spaced bins:
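    . binsreg y x w, nbins(20) polyreg(1)

    Binscatter plot
    Bin selection method: User-specified
    Placement: Quantile-spaced
    Derivative: 0

    ------------------------------+---------------
     # of observations            |   1000
     # of distinct values         |   1000
     # of clusters                |   .
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   .
       # of smoothness constraints|   .
       # of bins                  |   20
    ------------------------------+---------------

    ---------+------------------------------
             |    p       s       df
    ---------+------------------------------
     dots    |    0       0       20
     polyreg |    1       NA      2
    ---------+------------------------------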
The option polyreg(1) adds a linear prediction line to the canonical binscatter plot, but the resulting binned scatter plot is not reported here to conserve space.
The command binsreg allows users to add a binscatter-based line approximating the unknown regression function, pointwise confidence intervals, a confidence band, and a global polynomial regression approximation. For example, the following syntax cumulatively adds, in four distinct plots, a fitted line, confidence intervals and a confidence band, all three based on cubic splines, and also a fitted line based on a global polynomial of degree 4. The results are shown in Figure 2.
    . qui binsreg y x w, nbins(20) dots(0,0) line(3,3)
    . qui binsreg y x w, nbins(20) dots(0,0) line(3,3) ci(3,3)
    . qui binsreg y x w, nbins(20) dots(0,0) line(3,3) ci(3,3) cb(3,3)
    . qui binsreg y x w, nbins(20) dots(0,0) line(3,3) ci(3,3) cb(3,3) polyreg(4)
By construction, a cubic spline fit is a piecewise cubic polynomial function which is continuous and has continuous first- and second-order derivatives. Thus, the prediction line and confidence band generated are quite smooth. In this case, the fit is arguably undersmoothed because of the “large” choice of the number of bins (nbins(20)). The degree and smoothness of the polynomials can be changed by adjusting the values of p and s in the options dots(), line(), ci() and cb().
The command binsreg also allows for the standard weight options, vce options, factor variables, and twoway graph options, among other features. This is illustrated in the following code:
    . binsreg y x w i.t, dots(0,0) line(3,3) ci(3,3) cb(3,3) polyreg(4) ///
    >     vce(cluster id) savedata(output/graphdat) replace ///
    >     title("Binned Scatter Plot")

    Binscatter plot
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0
    Output file: output/graphdat.dta

    ------------------------------+---------------
     # of observations            |   1000
     # of distinct values         |   1000
     # of clusters                |   500
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   0
       # of smoothness constraints|   0
       # of bins                  |   20
    ------------------------------+---------------

    ---------+------------------------------
             |    p       s       df
    ---------+------------------------------
     dots    |    0       0       20
     line    |    3       3       23
     CI      |    3       3       23
     CB      |    3       3       23
     polyreg |    4       NA      5
    ---------+------------------------------
Specifically, a dummy variable based on the binary covariate t is added to the estimation, standard errors are clustered at the level of the group indicator id, and a graph title is added to the resulting binned scatter plot. Note that any unrecognized options for the command binsreg will be understood as twoway options and therefore appended to the final plot command. Thus, users may easily modify, for example, axis properties, legends, etc. The option savedata(output/graphdat) saves the underlying data used in the binned scatter plot in the file output/graphdat.dta.
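The df column in the output can be reproduced by counting coefficients: a piecewise polynomial of degree p has p + 1 coefficients in each of the J bins, and each of the s smoothness constraints removes one coefficient at each of the J - 1 inner knots. The Python helper below (spline_df is a hypothetical name) encodes this count. It matches the values reported in the output above, but it is a reading of the output rather than a statement about the package's internals.

```python
def spline_df(p, s, nbins):
    """Coefficients of a degree-p piecewise polynomial with s constraints per inner knot."""
    if not 0 <= s <= p:
        # Assumption of this sketch: the smoothness cannot exceed the degree.
        raise ValueError("need 0 <= s <= p")
    return (p + 1) * nbins - (nbins - 1) * s


print(spline_df(0, 0, 20))  # dots(0,0) with 20 bins
print(spline_df(3, 3, 20))  # line(3,3), ci(3,3) and cb(3,3) with 20 bins
```

With 20 bins this gives 20 for the piecewise constant dots and 23 for the cubic B-spline, in line with the reported table. The global polynomial of degree 4 simply has 5 coefficients.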
In addition, the command binsreg can be used for subgroup analysis. The following command implements binscatter estimation and inference across two subgroups separately, defined by the variable t, and then produces a common binned scatter plot (Figure 3):
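    . binsreg y x w, by(t) dots(0,0) line(3,3) cb(3,3) ///
    >     bycolors(blue red) bysymbols(O T)

    Binscatter plot
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0

    Group: t = 0
    ------------------------------+---------------
     # of observations            |   485
     # of distinct values         |   485
     # of clusters                |   .
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   0
       # of smoothness constraints|   0
       # of bins                  |   20
    ------------------------------+---------------

    ---------+------------------------------
             |    p       s       df
    ---------+------------------------------
     dots    |    0       0       20
     line    |    3       3       23
     CB      |    3       3       23
    ---------+------------------------------

    Group: t = 1
    ------------------------------+---------------
     # of observations            |   515
     # of distinct values         |   515
     # of clusters                |   .
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   0
       # of smoothness constraints|   0
       # of bins                  |   15
    ------------------------------+---------------

    ---------+------------------------------
             |    p       s       df
    ---------+------------------------------
     dots    |    0       0       15
     line    |    3       3       18
     CB      |    3       3       18
    ---------+------------------------------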
Figure 3 highlights a difference across the two subgroups defined by the variable t, which reflects the fact that our simulated data adds a shift to the outcome variable for the units in one of the two groups. The colors, symbols, and line patterns in Figure 3 can be modified via the options bycolors(), bysymbols(), and bylpatterns(). When the number of bins is unspecified, the command binsreg selects the number of bins for each subsample separately, via the companion command binsregselect. This means that, by default, the choice of binning/partitioning structure will in general be different across subgroups. However, if the option samebinsby is specified, then a common binning scheme for all subgroups is constructed based on the full sample.
Next, we illustrate the syntax of the command binsregtest. The basic syntax is the following:
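    . binsregtest y x w, testmodelpoly(1)

    Hypothesis tests based on binscatter estimates
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0

    ------------------------------+---------------
     # of observations            |   1000
     # of distinct values         |   1000
     # of clusters                |   .
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   0
       # of smoothness constraints|   0
       # of bins                  |   21
    ------------------------------+---------------

    Model specification Tests:
    Degree: 3    # of smoothness constraints: 3

    -------------------+------------------------------
     H0: mu =          |    sup |T|       p value
    -------------------+------------------------------
     poly. degree 1    |    6.503         0.000
    -------------------+------------------------------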
A test for linearity of the regression function is implemented using the binscatter estimator. By default, a cubic spline is employed in the inference procedure, which can be adjusted by the option testmodel(). In addition, when unspecified, the number of bins is selected using a data-driven procedure via the companion command binsregselect. The selected number of bins is IMSE-optimal for piecewise constant point estimates by default. A summary of the sample and binning scheme is displayed, and then the test statistic and p-value are reported. In this case, the test statistic is the supremum of the absolute value of the t-statistic evaluated over a sequence of grid points, and the p-value is calculated based on simulation. The p-value is reported as 0.000, and thus the null hypothesis of linearity of the regression function is rejected.
The command binsregtest can implement testing for any parametric model specification by comparing the fitted values based on the binscatter estimator (computed by the command) and the parametric model of interest (provided by the user). For example, the following code creates an auxiliary dataset with a grid of evaluation points, implements a linear regression first, makes an out-of-sample prediction using the auxiliary dataset, and then tests for linearity based on the binscatter estimator by specifying the auxiliary file containing the fitted values.
    . qui binsregselect y x w, simsngrid(30) savegrid(output/parfitval) replace
    . qui reg y x w
    . use output/parfitval, clear
    . predict binsreg_fit_lm
    (option xb assumed; fitted values)
    . save output/parfitval, replace
    file output/parfitval.dta saved
    . use binsreg_simdata, clear
    . binsregtest y x w, testmodelparfit(output/parfitval)

    Hypothesis tests based on binscatter estimates
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0

    ------------------------------+---------------
     # of observations            |   1000
     # of distinct values         |   1000
     # of clusters                |   .
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   0
       # of smoothness constraints|   0
       # of bins                  |   21
    ------------------------------+---------------

    Model specification Tests:
    Degree: 3    # of smoothness constraints: 3
    Input file: output/parfitval.dta

    -------------------+------------------------------
     H0: mu =          |    sup |T|       p value
    -------------------+------------------------------
     binsreg_fit_lm    |    6.503         0.000
    -------------------+------------------------------
The first command, qui binsregselect y x w, simsngrid(30) savegrid(output/parfitval) replace, generates the auxiliary file containing the grid of evaluation points. Since the parameter of interest is only the mean relation between y and x, at the out-of-sample prediction step the testing dataset parfitval.dta must contain a variable x with a sequence of evaluation points at which the binscatter and parametric models are compared, and the covariate w with its values set to zero. In addition, the variable containing the fitted values has to follow a specific naming rule: its name must take the form binsreg_fit*. The companion command binsregselect can be used to construct the required auxiliary dataset, as illustrated above. We discuss this other command further below.
In addition to model specification tests, the command binsregtest can test for nonparametric shape restrictions on the regression function. For example, the following syntax tests whether the regression function is increasing:
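    . binsregtest y x w, deriv(1) nbins(20) testshaper(0)

    Hypothesis tests based on binscatter estimates
    Bin selection method: User-specified
    Placement: Quantile-spaced
    Derivative: 1

    ------------------------------+---------------
     # of observations            |   1000
     # of distinct values         |   1000
     # of clusters                |   .
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   .
       # of smoothness constraints|   .
       # of bins                  |   20
    ------------------------------+---------------

    Shape Restriction Tests:
    Degree: 3    # of smoothness constraints: 3

    -------------------+------------------------------
     H0: inf mu >=     |    inf T         p value
    -------------------+------------------------------
     0                 |    2.680         0.202
    -------------------+------------------------------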
The null hypothesis here is that the infimum of the first-order derivative of the regression function is no less than 0. The output reports the test statistic, which is the infimum of the t-statistic over a sequence of evaluation points, and the corresponding simulation-based p-value.
The command binsregtest may implement many tests simultaneously (given the derivative of interest). For example,
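    . binsregtest y x w, nbins(20) testshaper(2 0) testshapel(4) testmodelpoly(1) ///
    >     nsims(1000) simsngrid(30)

    Hypothesis tests based on binscatter estimates
    Bin selection method: User-specified
    Placement: Quantile-spaced
    Derivative: 0

    ------------------------------+---------------
     # of observations            |   1000
     # of distinct values         |   1000
     # of clusters                |   .
    ------------------------------+---------------
     Bin selection:               |
       Degree of polynomial       |   .
       # of smoothness constraints|   .
       # of bins                  |   20
    ------------------------------+---------------

    Shape Restriction Tests:
    Degree: 3    # of smoothness constraints: 3

    -------------------+------------------------------
     H0: sup mu <=     |    sup T         p value
    -------------------+------------------------------
     4                 |    1.683         1.000
    -------------------+------------------------------

    -------------------+------------------------------
     H0: inf mu >=     |    inf T         p value
    -------------------+------------------------------
     2                 |    1.461         1.000
     0                 |    9.694         0.000
    -------------------+------------------------------

    Model specification Tests:
    Degree: 3    # of smoothness constraints: 3

    -------------------+------------------------------
     H0: mu =          |    sup |T|       p value
    -------------------+------------------------------
     poly. degree 1    |    6.108         0.000
    -------------------+------------------------------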
The above syntax tests three shape restrictions and one model specification (linearity), employing 1000 random draws and 30 evaluation points within each bin to evaluate the supremum/infimum in the simulation.
As already mentioned, the commands binsreg and binsregtest rely on data-driven bin selection procedures via the command binsregselect whenever the option nbins() is not employed by the user. Its basic syntax is as follows:
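    . binsregselect y x w

    Bin selection for binscatter estimates
    Method: IMSE-optimal: plug-in choice
    Position: Quantile-spaced

    ----------------------------+----------
     # of observations          |   1000
     # of distince values       |   1000
     # of clusters              |   .
    ----------------------------+----------
     Degree of polynomial       |   0
     # of smoothness constraint |   0
    ----------------------------+----------

    --------------+-------------+----------
     method       |  # of bins  |   df
    --------------+-------------+----------
     ROT-POLY     |     18      |   18
     ROT-REGUL    |     18      |   18
     ROT-UKNOT    |     18      |   18
     DPI          |     21      |   21
     DPI-UKNOT    |     21      |   21
    --------------+-------------+----------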
The following choices of number of bins are reported: ROT-POLY, the rule-of-thumb (ROT) choice based on global polynomial estimation; ROT-REGUL, the ROT choice regularized as discussed in Section 0.2, or the user’s choice specified in the option nbinsrot(); ROT-UKNOT, the ROT choice with unique knots; DPI, the direct plug-in (DPI) choice; and DPI-UKNOT, the DPI choice with unique knots.
The direct plug-in choice is implemented based on the rule-of-thumb choice, which can be set by users directly:
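    . binsregselect y x w, nbinsrot(20) binspos(es)

    Bin selection for binscatter estimates
    Method: IMSE-optimal: plug-in choice
    Position: Evenly-spaced

    ----------------------------+----------
     # of observations          |   1000
     # of distince values       |   1000
     # of clusters              |   .
    ----------------------------+----------
     Degree of polynomial       |   0
     # of smoothness constraint |   0
    ----------------------------+----------

    --------------+-------------+----------
     method       |  # of bins  |   df
    --------------+-------------+----------
     ROT-POLY     |     .       |   .
     ROT-REGUL    |     20      |   20
     ROT-UKNOT    |     20      |   20
     DPI          |     22      |   22
     DPI-UKNOT    |     22      |   22
    --------------+-------------+----------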
Notice that in the example above an evenly-spaced, rather than quantile-spaced, binning scheme is selected via the option binspos(es). The binning used in the commands binsreg and binsregtest may be adjusted similarly.
In addition, as illustrated above, the command binsregselect also provides a convenient option savegrid(), which can be used to generate the auxiliary dataset needed for parametric specification testing of user-chosen models via the command binsregtest. Specifically, the following command was (quietly) used above:
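    . binsregselect y x w, simsngrid(30) savegrid(output/parfitval) replace

    Bin selection for binscatter estimates
    Method: IMSE-optimal: plug-in choice
    Position: Quantile-spaced
    Output file: output/parfitval.dta

    ----------------------------+----------
     # of observations          |   1000
     # of distince values       |   1000
     # of clusters              |   .
    ----------------------------+----------
     Degree of polynomial       |   0
     # of smoothness constraint |   0
    ----------------------------+----------

    --------------+-------------+----------
     method       |  # of bins  |   df
    --------------+-------------+----------
     ROT-POLY     |     18      |   18
     ROT-REGUL    |     18      |   18
     ROT-UKNOT    |     18      |   18
     DPI          |     21      |   21
     DPI-UKNOT    |     21      |   21
    --------------+-------------+----------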
The resulting file, parfitval.dta, includes x and w as well as some other variables related to the binning scheme. The variable x contains a sequence of evaluation points, in this case 30 within each bin as set via the option simsngrid(30), and the values of w are set to zero on purpose (so that the fitted values of the user-chosen model are generated correctly).
Finally, the main command binsreg is highly integrated with the companion commands in the package Binsreg. Specifically, binsreg can simultaneously implement binscatter plotting and hypothesis testing with the number of bins automatically selected via the commands binsregtest and binsregselect. For example,
    . binsreg y x w, dots(0,0) line(3,3) ci(3,3) cb(3,3) polyreg(4) ///
    >         testmodelpoly(1) testshapel(4)

    Binscatter plot
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0

    ----------------------------------------
    # of observations            |   1000
    # of distinct values         |   1000
    # of clusters                |      .
    ----------------------------------------
    Bin selection:
    Degree of polynomial         |      0
    # of smoothness constraints  |      0
    # of bins                    |     21
    ----------------------------------------

    ------------------------------
               |   p    s    df
    ------------------------------
    dots       |   0    0    21
    line       |   3    3    24
    CI         |   3    3    24
    CB         |   3    3    24
    polyreg    |   4   NA     5
    ------------------------------

    Hypothesis tests based on binscatter estimates
    Bin selection method: IMSE-optimal plug-in choice
    Placement: Quantile-spaced
    Derivative: 0

    ----------------------------------------
    # of observations            |   1000
    # of distinct values         |   1000
    # of clusters                |      .
    ----------------------------------------
    Bin selection:
    Degree of polynomial         |      0
    # of smoothness constraints  |      0
    # of bins                    |     21
    ----------------------------------------

    Shape Restriction Tests:
    Degree: 3    # of smoothness constraints: 3

    ----------------------------------------
    H0: sup mu <=    |   sup T    p value
    ----------------------------------------
    4                |   1.920     1.000
    ----------------------------------------

    Model specification Tests:
    Degree: 3    # of smoothness constraints: 3

    ----------------------------------------
    H0: mu =         |  sup |T|   p value
    ----------------------------------------
    poly. degree 1   |   6.503     0.000
    ----------------------------------------

As a general rule, the implementations within the command binsreg are based on the binning scheme either specified by the user via the option nbins() or selected by a data-driven procedure given the degree and smoothness chosen in the option dots(). Valid inference requires a careful choice of binning or, more specifically, of the number of bins relative to the polynomial degree used for inference. It is recommended to use the data-driven method to select the IMSE-optimal number of bins for a given polynomial degree (degree 0 in dots() in the example above), and then to implement inference methods with a higher-order degree (degree 3 in line(), ci(), and cb() in the example above), which corresponds to a simple application of robust bias-corrected inference (CalonicoCattaneoTitiunik_2014_ECMA; CalonicoCattaneoFarrell_2018_JASA; CalonicoCattaneoFarrell_2019_CEOptimal; CattaneoFarrellFeng_2018_wp).
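The recommendation above can also be carried out in two explicit steps. The sketch below assumes that the Binsreg package is installed and that the variables y, x, and w from the running example are in memory; the value 21 is simply the DPI choice reported in the quantile-spaced output shown earlier and would need to be replaced by whatever the first step reports for other data. Only options that appear in this article are used.

```stata
* Step 1: select the IMSE-optimal number of bins for a piecewise-constant
* fit (polynomial degree 0), using quantile-spaced binning by default.
binsregselect y x w

* Step 2: hold the number of bins fixed at the DPI choice from step 1
* (21 in the output reported in this article) and construct confidence
* intervals and bands from a higher-degree (cubic) fit.
binsreg y x w, nbins(21) dots(0,0) ci(3,3) cb(3,3)
```

Letting binsreg select the number of bins automatically, as in the example above, should be equivalent in spirit; the two-step version only makes visible that the bins are chosen for the degree in dots() while inference relies on a higher degree.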
0.7 Conclusion
We introduced the Stata package Binsreg, which provides general-purpose software implementations of binscatter via three commands: binsreg, binsregtest, and binsregselect. A companion R package with similar syntax and the same capabilities is also available.
0.8 Acknowledgments
We thank John Friedman, Andreas Fuster, Paul Goldsmith-Pinkham, David Lucca, and Xinwei Ma for helpful comments and discussions.
References
About the Authors
Matias D. Cattaneo is a Professor of Economics and a Professor of Statistics at the University of Michigan.
Richard K. Crump is a Vice President and the Function Head of the Capital Markets Function at the Federal Reserve Bank of New York.
Max H. Farrell is an Associate Professor of Statistics and Econometrics at the University of Chicago Booth School of Business.
Yingjie Feng is a Ph.D. candidate in Economics at the University of Michigan.