25 Sep 2024

IASS Webinar 44: Random Forests and Mixed Effects Random Forests for Small Area Estimation of General Parameters

Date 25 Sep 2024
Time 12:00 GMT+02:00 - 13:30 GMT+02:00
Level of instruction Intermediate
Instructor
Nikos Tzavidis

Webinar Abstract

Random forests demonstrate excellent predictive performance. Their few tuning parameters, automated model selection, and ability to detect higher-order interactions and complex relationships make them appealing. Examples of research on tree-based methods for the analysis of complex survey data and survey estimation include Toth & Eltinge (2011), Breidt & Opsomer (2017), Bilton et al. (2017), and Bilton et al. (2020). More recently, Dagdoug et al. (2021) studied the theoretical properties of random forests for complex survey data.

In this work we study the use of random forests and their extensions for estimating general small area parameters. Conventionally, random forests do not include random effects, yet random effects play a central role in small area estimation. We propose an extension of random forests to mixed effects random forests that combines the random forest fitting algorithm with a mixed effects model to exploit clustering in out-of-bag residuals. The proposed fitting algorithm extends previous work by Krennmair & Schmid (2022). It uses a non-parametric bootstrap to correct the bias in the estimated residual variance that arises from estimating the random forest (Mendez & Lohr, 2011) before proceeding to estimate the variance components and the random effects. Ignoring this bias affects both point and mean squared error (MSE) estimation. Estimators of general small area parameters are derived by using a smearing estimator of the area-specific distribution function (Chambers et al., 2014).

MSE estimation under machine learning methods remains a largely unexplored research area. In this work we study MSE estimation using a non-parametric block bootstrap with appropriate scaling of the residuals. The proposed methods are evaluated in model-based simulations and with real data from a poverty assessment case study in Mozambique. We present comparisons with industry-standard methods under a linear mixed model, e.g., the Empirical Best Predictor, also with data-driven transformations, and with a synthetic estimator under the random forest. The empirical evaluations and the real data application inform us about (a) the impact of including random effects in random forests, (b) the importance of using data transformations with random forests, and (c) the performance of point and MSE estimators.

The current approach to including random effects in random forests is more in line with the data modelling culture than the algorithmic modelling culture (Breiman, 2001; Efron, 2020). We critically discuss the proposed approach and outline possible alternatives.
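
To make the smearing step mentioned in the abstract more concrete, here is a minimal sketch of one common form of a smearing-type estimator of an area-specific distribution function. It is an illustration under stated assumptions, not the estimator presented in the webinar: the function name `smearing_cdf`, its arguments, and the toy numbers are all hypothetical, and the predictions for non-sampled units are assumed to already include the fitted forest value plus the predicted area random effect.

```python
def smearing_cdf(t, y_sampled, pred_nonsampled, residuals):
    """Smearing-type estimate of an area distribution function F_i(t).

    Hypothetical helper for illustration only.
    y_sampled       -- observed outcomes of the sampled units in the area
    pred_nonsampled -- model predictions for the non-sampled units in the area
                       (assumed: forest prediction + predicted random effect)
    residuals       -- empirical unit-level residuals from the fitted model
    """
    n_sampled = len(y_sampled)
    n_nonsampled = len(pred_nonsampled)

    # Sampled units: the outcome is observed, so use the indicator directly.
    observed_part = sum(1 for y in y_sampled if y <= t)

    # Non-sampled units: add every empirical residual to the prediction and
    # average the indicator, which "smears" the residual distribution.
    predicted_part = sum(
        sum(1 for e in residuals if mu + e <= t) / len(residuals)
        for mu in pred_nonsampled
    )

    return (observed_part + predicted_part) / (n_sampled + n_nonsampled)


# Toy numbers (made up): with t set to a poverty line, F_i(t) plays the role
# of an area-level head count ratio.
estimate = smearing_cdf(
    t=2.5,
    y_sampled=[1.0, 3.0],
    pred_nonsampled=[2.0, 4.0],
    residuals=[-1.0, 0.0, 1.0],
)
print(round(estimate, 4))  # 0.4167
```

Once the distribution function is estimated, other general parameters such as quantiles can be read off it, which is why a single estimator of this kind can serve several small area indicators.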

Instructors

Nikos Tzavidis

About the instructor

Nikos Tzavidis is Professor of Statistical Methodology at the University of Southampton. His research focuses on small area estimation, outlier-robust estimation, quantile regression, applications of machine learning to official statistics, and the integration of survey and geospatial data. Nikos’s work is supported by funding from the UK Economic and Social Research Council, the European Commission, the UK Office for National Statistics, FCDO, the United Nations and the World Bank. Nikos has served as Head of the Department of Social Statistics and Demography and Head of the School of Economic, Social and Political Sciences at Southampton. Between 2021 and 2023, he served as Vice-President of the International Association of Survey Statisticians, and he is currently a member of the Advisory Board on Ethics of the International Statistical Institute. Nikos currently serves on the editorial boards of the Journal of the Royal Statistical Society Series C, the Journal of Official Statistics, and International Statistical Review. From 1 January 2025, he will be joint Editor of the Journal of the Royal Statistical Society Series A.