RIDIR: Collaborative Research: Bayesian analytical tools to improve survey estimates for subpopulations and small areas


Andrew Gelman
Higgins Professor of Statistics and Professor of Political Science


This project will build a set of tools for in-depth analysis of survey data, applying and extending statistical methods for estimation in small subgroups. Classical survey methods focus on aggregate population-level estimates, but we can learn much more using small-area estimation. The goal is a user-accessible platform for modeling survey data that gives estimates for arbitrary subgroups of the population, along with visualization tools to display estimates of interest. The model will be fit in Stan, a state-of-the-art open-source platform for Bayesian inference, and implemented for the Cooperative Congressional Election Study (CCES). An example of the sort of analysis that these methods make possible is a study of how demographic gaps in voting vary by age, education, and state.

The statistical method of multilevel regression and poststratification (MRP) allows inferences for narrow slices of the population. In the terminology of survey methods, MRP is "model-based" in that it uses regression to do partial pooling (smoothing) for small areas and demographic slices, and "design-based" in that it adjusts for variables such as age, sex, ethnicity, and education that are predictive of inclusion in the sample. One reason to extract inferences for population subgroups with a flexible tool rather than with one-time analyses is that key variables can change over time. Multilevel modeling gives the flexibility to adjust for large numbers of predictors, which makes poststratification more effective. As a bonus, this modeling and adjustment yields estimates of average survey responses for small slices of the population. These are often the very inferences that consumers particularly want, and they are typically unavailable from surveys without huge sample sizes.
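
The two steps of MRP described above can be illustrated with a minimal Python sketch. It is not the project's model: in place of a multilevel regression fit in Stan, it uses a simple shrinkage rule with a fixed, assumed `prior_strength` to mimic partial pooling, and all cell counts and population figures are made-up values for illustration. The names `Cell`, `partial_pool`, and `poststratify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One poststratification cell (hypothetical example data)."""
    state: str
    age_group: str
    n_sample: int    # survey respondents in this cell
    n_yes: int       # respondents in this cell answering "yes"
    population: int  # known population count for this cell


def partial_pool(cells, prior_strength=20.0):
    """Shrink each cell's raw proportion toward the overall sample proportion.

    Stand-in for multilevel regression: cells with few respondents are
    pulled strongly toward the overall mean, large cells barely move.
    prior_strength is an assumed constant, not an estimated parameter.
    """
    total_n = sum(c.n_sample for c in cells)
    total_yes = sum(c.n_yes for c in cells)
    overall = total_yes / total_n
    return {
        (c.state, c.age_group):
            (c.n_yes + prior_strength * overall) / (c.n_sample + prior_strength)
        for c in cells
    }


def poststratify(cells, estimates, keep=lambda c: True):
    """Population-weighted average of cell estimates over the selected cells."""
    selected = [c for c in cells if keep(c)]
    total_pop = sum(c.population for c in selected)
    weighted = sum(estimates[(c.state, c.age_group)] * c.population
                   for c in selected)
    return weighted / total_pop


# Made-up cells: note the empty TX 18-29 cell, which still gets an estimate.
cells = [
    Cell("NY", "18-29", n_sample=10, n_yes=8, population=1000),
    Cell("NY", "65+", n_sample=40, n_yes=16, population=3000),
    Cell("TX", "18-29", n_sample=0, n_yes=0, population=2000),
    Cell("TX", "65+", n_sample=50, n_yes=16, population=4000),
]

estimates = partial_pool(cells)
ny_estimate = poststratify(cells, estimates, keep=lambda c: c.state == "NY")
young_estimate = poststratify(cells, estimates,
                              keep=lambda c: c.age_group == "18-29")
national_estimate = poststratify(cells, estimates)
```

Passing a different `keep` function gives an estimate for any subgroup that can be built from the cells, which is the sense in which such a tool supports arbitrary slices of the population. A real multilevel model would also borrow strength along predictors (for example, across states within an age group) rather than shrinking every cell toward a single overall mean.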