A two-step approach to account for unobserved spatial heterogeneity [1]

Anna Gloria Billé a,*, Roberto Benedetti b, Paolo Postiglione b

a Department of Economics and Finance, University of Rome Tor Vergata
b Department of Economic Studies, University G. d’Annunzio of Chieti-Pescara

* Corresponding author. Email:

[email protected]

Abstract. Empirical analysis in economics often faces the difficulty that the data are correlated and heterogeneous in some unknown form. Spatial econometric models have been widely used to account for dependence structures, but the problem of directly dealing with unobserved spatial heterogeneity has been largely unexplored. The problem can be especially serious if we have no prior information justified by economic theory. In this paper we propose a two-step procedure that endogenously identifies spatial regimes in the first step and accounts for spatial dependence in the second step, with an application to hedonic house price analysis.

JEL Codes: C14, C31, C63

Keywords: spatial econometrics, two-step, spatial heterogeneity, hedonic house prices

1. Introduction and literature review

It is a well-established fact that if spatial models are correctly specified, then they can also be consistently and efficiently estimated by the commonly used estimators. However, incorrect functional forms, correlated omitted variables, models with near unit roots and row-normalized weighting matrices, and so on, typically produce spurious spatial autocorrelations (Fingleton, 1999; McMillen, 2003; Lauridsen and Kosfeld, 2006; Lee and Yu, 2009; Lauridsen and Kosfeld, 2011), which can lead to inconsistency of the usual estimators. Spatial heterogeneity is a particular form of heterogeneity, usually unobserved, that is related to geo-referenced data sets and would lead to misspecification of the model if not accounted for. Empirical analysis in economics often faces the difficulty that the data are correlated and heterogeneous in some unknown form.

[1] The title of the first version of the paper was: “Spatial Heterogeneity in House Price Models: An Iterative Locally Weighted Regression Approach”.
A first attempt to explicitly model discontinuities in space is, for example, the work by McDonald and Owen (1986), whose procedure was then used by McMillen (1994) to study potential discontinuities in the population density of Chicago in 1980. As Anselin (1988a, p. 119) stressed, there are two distinct aspects that pertain to spatial heterogeneity: the former is structural instability, as expressed by changing functional forms or varying parameters; the latter is heteroscedasticity, which follows from missing variables or other forms of misspecification that lead to error terms with non-constant variance. In this paper, we deal with the idea that coefficient estimates can vary over space, leading to spatial structural instability, i.e. the parameters of the model take on distinct values in subsets of the spatial sample. Moreover, if spatial heterogeneity can be categorized into a small number of regimes, each represented by different values of the regression coefficients, the phenomenon is also known as spatial regimes. In this case, if spatial heterogeneity is present, the functional form of the model will be misspecified because of the wrongly assumed constant relationships between the dependent variable and the regressors. The following statement by Anselin (2010, p. 5) is useful for understanding the issue: “Spatial heterogeneity becomes particularly challenging since it is often difficult to separate from spatial dependence. This is known in the literature as the inverse problem. It is also related to the impossible distinction between true and apparent contagion.
The essence of the problem is that cross-sectional data, while allowing the identification of clusters and patterns, do not provide sufficient information to identify the processes that led to the patterns.” The problem of spatial heterogeneity in terms of spatially varying parameters has been largely unexplored by spatial econometricians, typically because their main purpose has been to control only for spatial spillover effects. As Postiglione et al. (2013, p. 171) stressed, “the problem of spatial heterogeneity is often neglected in empirical analysis of geographic data and this negligence can affect sensibly model estimates”. Some authors have attempted to detect the presence of spatial heterogeneity by constructing statistical tests that are typically based on the LM statistic (Anselin, 1988b; de Graaff et al., 2001; Lauridsen and Kosfeld, 2006; Lauridsen and Kosfeld, 2011; Pede et al., 2014; among others). Unfortunately, once spatial heterogeneity has been detected, no test is able to suggest how to model the spatial data set correctly or in which direction to proceed for further analyses. Recently, Ibragimov and Müller (2010) have derived the small and large sample properties of the t statistic, also in the context of spatially correlated data, by assuming a reasonable partition of the data into q groups. As Ibragimov and Müller (2010, p. 454) emphasized, “some a priori knowledge about the correlation structure is required …”. However, in practical cases there is usually no reason, justified by economic theory, to accept one partition instead of another. Following a parametric approach, the typical starting point for estimating a spatial econometric model is the choice of a row-standardized spatial weighting matrix, W, which specifies the relationship between neighboring observations.
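To make the notion of a row-standardized weighting matrix concrete, the following is a minimal sketch in Python. It assumes a k-nearest-neighbour definition of neighbourhood, which is only one of many possible choices and is not prescribed by the text; the function name `knn_row_standardized_weights`, the coordinates and the values of `y` are hypothetical and serve only as illustration.

```python
import numpy as np


def knn_row_standardized_weights(coords, k=2):
    """Build a row-standardized k-nearest-neighbour weight matrix.

    Illustrative assumption: neighbours are the k closest units in
    Euclidean distance. Other neighbourhood definitions are possible.
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    # Pairwise Euclidean distances between all units.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # A unit is never its own neighbour.
    np.fill_diagonal(dist, np.inf)
    # Mark the k nearest units in each row with a 1.
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    # Row-standardize: every row sums to one.
    return W / W.sum(axis=1, keepdims=True)


# Hypothetical coordinates and outcome values.
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y = np.array([10.0, 20.0, 30.0, 40.0])

W = knn_row_standardized_weights(coords, k=2)
spatial_lag = W @ y  # each entry is the average of y over a unit's neighbours
```

With row standardization, the product `W @ y` is simply the neighbourhood average of `y`, which is why the structure imposed through W directly shapes any estimated spillover effect.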
In some cases the significance of the spatial spillover effects captured by the autoregressive coefficient might simply be due to omitted spatially correlated regressors, which can easily justify the use of the well-known and more flexible spatial Durbin models (see e.g. Corrado and Fingleton, 2011; LeSage, 2014, for comprehensive discussions). However, neighborhood influence is not calibrated in terms of the data but is prescribed by the specification of W [2]. Imposing a predefined spatial structure on the data can sometimes be too restrictive in practical cases, and it can bias results when inappropriate [3], so that McMillen (2012), among others, has criticized this approach. Although our purpose is not to criticize the parametric approach, it is reasonable to assume that for some economic phenomena there is no reason, justified by economic theory, to choose a priori a particular spatial structure for the underlying spatial process. The main purpose of the present paper is therefore to propose a possible partition of the spatial data, i.e. a classification of the data due to unobserved heterogeneity, with no a priori information on the true dependence structure.

[2] See the papers of LeSage and Pace (2014) and Getis (2007, 2009) for considerations on the spatial weight matrix and the autocorrelation coefficient.

[3] Recently, from both a theoretical and a computational perspective, some excellent works on the definition of the W matrix have been proposed (see LeSage and Pace, 2007; Seya et al., 2013; Bhattacharjee and Jensen-Butler, 2013; Qu and Lee, 2015). In particular, Qu and Lee (2015) defined a particular endogenous W matrix (of which the usual exogenous W matrix can be considered a particular case) and showed the consequences for the estimates obtained with commonly used estimators in SAR cross-sectional models when the true W is endogenous.
For instance, the advanced recent literature on hedonic house price models accounts for spatial spillover effects but still ignores the possibility of a spatial heterogeneity effect (Holly et al., 2010; Holly et al., 2011). Some researchers are now recognizing that spatial structures can be so different that the data should not be pooled and estimated together. Global spatial regression models usually fail to take into account potential variations over space, with the consequence that the resulting estimates are biased. Along the same lines, one of the purposes of the present paper is to show how the presence of spatial heterogeneity might modify (and generally reduce) the significance of the spatial effects in a global spatial autoregressive model, simply by estimating spatial autoregressive models with spatial regimes (which introduce a degree of flexibility into spatial autoregressive models). This can be interpreted as meaning that spatial heterogeneity might generate part of the spatial autocorrelation effect or, in other words, that the autoregressive coefficient is sometimes overestimated. In line with this view is the work of Basile et al. (2014). In order to account simultaneously for spatial dependence, unknown functional form and unobserved heterogeneity, they proposed the use of the so-called Spatial Autoregressive Semiparametric Geoadditive Models, which combine spatial parametric autoregressive models with unknown smooth functions. Semiparametric and nonparametric estimation methods, such as geographically weighted regressions (GWRs) (Brunsdon et al., 1996), are proving to be valid alternatives to parametric approaches and should be used as diagnostic tools to detect the presence of spatial heterogeneity [4].
The above methods allow coefficient estimates to vary over space by calibrating the global model separately for each spatial unit in order to produce n sets of parameter estimates, the underlying idea being that simple econometric models represent the data best in small geographic areas. They have the attractive advantage of controlling for misspecified spatial effects while using highly flexible functional forms, with the only condition that nearby observations receive more weight when constructing an estimate for a target point. McMillen and Redfearn (2010) showed that the GWR specification can be viewed as a special case of the already known locally weighted regression (LWR) method [5]. GWRs and Geoadditive Models usually interpret the spatially varying parameter problem as a smooth change over space. As a matter of fact, pairs of GWR beta coefficient estimates that are proximal in space may not exhibit statistically significant differences. As Anselin (2010, p. 17) underlined, “…in models of spatial heterogeneity, the spatial regimes or spatially varying coefficients show evidence of the heterogeneity, but do not explain it. Ideally, one would want to make the structure of dependence and/or the structure of heterogeneity endogenous”. This is the aim of the present paper. Our purpose is to find iteratively a way in which spatial parameter variations can be described by breaks in continuity over space (or spatial regimes). In the spatial statistics literature, an adaptive weights smoothing (AWS) algorithm has been proposed by Polzehl and Spokoiny (2000, 2006) to describe, in a data-driven iterative way, a maximal possible local neighborhood of every point in space in which the local parametric assumption is justified by the data. The basic assumption of that approach is that for every point there exists a vicinity of the point in which the underlying model can be well approximated by a parametric model with a constant set of parameters.
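The local estimation idea described above, in which nearby observations receive more weight when an estimate is constructed for a target point, can be sketched as a weighted least squares fit. The sketch below is only illustrative: the Gaussian kernel, the fixed bandwidth, the simulated two-regime data and the function name `local_wls` are assumptions made here, and the code is neither the adaptive procedure of Polzehl and Spokoiny nor the algorithm proposed in this paper.

```python
import numpy as np


def local_wls(X, y, coords, target, bandwidth):
    """Weighted least squares centred on `target`.

    Illustrative assumption: weights decay with Euclidean distance
    through a Gaussian kernel with a fixed bandwidth.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # Distance of every unit from the target point.
    d = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1)
    # Nearby observations get weights close to one, distant ones close to zero.
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    # Weighted least squares via ordinary least squares on rescaled rows.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


# Hypothetical data: 20 units on a line, with a break in the slope
# between the first ten units ("west") and the last ten ("east").
n = 20
coords = np.column_stack([np.arange(n, dtype=float), np.zeros(n)])
z = (np.arange(n) % 5) + 1.0
slope = np.where(np.arange(n) < 10, 1.0, 3.0)
y = 2.0 + slope * z
X = np.column_stack([np.ones(n), z])

beta_west = local_wls(X, y, coords, target=[0.0, 0.0], bandwidth=1.5)
beta_east = local_wls(X, y, coords, target=[19.0, 0.0], bandwidth=1.5)
beta_global, *_ = np.linalg.lstsq(X, y, rcond=None)  # pooled fit, for contrast
```

In this toy setting the local fits at the two ends of the line recover slopes close to the regime-specific values, whereas the pooled fit averages over the two regimes. Repeating the local fit at every unit yields the n sets of parameter estimates mentioned in the text.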
The method of Polzehl and Spokoiny has been applied in the field of image analysis. It is based both on a successive increase of the local neighborhoods around every point and on a description of the local models obtained by assigning to every spatial unit weights that depend on the result of the previous step of the procedure.

[4] A valid alternative to the GWR approach is the NCSTAR model, as pointed out by Lebreton (2005).

[5] The main difference concerns how the distance measure is conceived. In GWRs, distance is thought of as a mere geographic distance (Euclidean, great circle, etc.), whereas in LWRs an economic interpretation can also be assigned.

The potential of this method in an econometric setting lies in the possibility of endogenously obtaining clusters of observations that arise from unobserved spatial heterogeneity (i.e. spatial regimes) and exhibit similar coefficient estimates. Andreano et al. (2016) defined a first algorithm based on the work by Polzehl and Spokoiny (2000, 2006) for the identification of economic convergence clubs. In this paper we substantially modify the previous contribution by proposing a two-step procedure (see Section 3 for further details and comparisons), which combines the LWR approach with the AWS procedure. A two-step procedure to deal with both spatial dependence and spatial heterogeneity in the estimation of hedonic house price functions has been proposed, for instance, by Beron et al. (2004). However, their method estimates a first set of parameters related to environmental characteristics in the first step, and then, in the second step, a spatial econometric model that accounts for both spatial dependence and spatial heterogeneity, assuming a spatial trend as a quadratic form of latitudes and longitudes for the heterogeneity effects. In our two-step procedure, instead, we propose a first step that focuses on the estimation of unobserved discrete spatial heterogeneity (i.e.
spatial regimes) and a second step that estimates the effects of both spatial dependence and the identified spatial regimes. Our proposal is based on the idea that we can combine the potential of local estimation with the usefulness of a modified AWS procedure, which is able to identify spatial regimes, i.e. subsamples over space with an estimated set of beta coefficients for each of them. The paper is structured as follows. In Section 2 we introduce the LWR, with the GWR as a special case. In Section 3 we explain the first step of our procedure, i.e. the algorithm, whereas Section 4 explains the second step, which estimates both spatial dependence and spatial regimes. Section 5 illustrates the data set used and the main estimation results in terms of the marginal effects obtained by using different spatial econometric models. Finally, Section 6 concludes.

2. Locally Weighted Regressions

Spatial econometric models may not be appropriate in the presence of (unobserved) spatial heterogeneity. Locally weighted regressions (LWRs) (Cleveland and Devlin, 1988) or geographically weighted regressions (GWRs) (Brunsdon et al., 1996; Fotheringham et al., 1998; Fotheringham et al., 2002), which are recognized to be natural evolutions of the expansion method (Casetti, 1972), allow us to estimate local rather than global parameters. Residual terms usually exhibit a nonzero spatial autocorrelation parameter that might, in fact, not be statistically different from zero if the true cause of the error autocorrelation is something other than a genuine contagion process (i.e. spurious autocorrelation). The first law of geography (Tobler, 1970, p. 236) states: “everything is related to everything else, but near things are more related th