Flexible Log-Likelihood Functions


Benjamin Goodrich
Associate Research Scholar; Lecturer in the Department of Political Science
Andrew Gelman
Higgins Professor of Statistics and Professor of Political Science


Scientific models often predict their outcomes poorly because they underfit or overfit the data. Just as the radio in an old car has only two adjustable dials, for frequency and volume, the probability distributions used in scientific models, such as the bell-shaped normal distribution, typically have only one or two adjustable dials. Underfitting occurs when no combination of dial settings ever produces a good signal. Overfitting occurs when the dials produce an extremely good signal in particular circumstances that becomes poor in slightly different ones, such as when the car changes location or a new group of observations is added to a scientist’s data set. Scientists should be able to add as many adjustable dials to their models as they need to avoid underfitting the data, but they also need a way to avoid overfitting when adjusting more dials simultaneously. Doing so will promote the progress of science by allowing scientists to make decent predictions of their outcomes that readily generalize when circumstances change moderately. Because scientists in all fields face these problems, providing software to help combat them will have a broad impact.

Those metaphorical dials are the parameters of the likelihood function that scientists use in maximum likelihood or Bayesian estimation. Almost all likelihood functions in common use are taken from the exponential family of probability distributions and have only one or two unknown parameters. In the past few years, a new continuous probability distribution from outside the exponential family has been derived for which scientists can specify any fixed number of parameters, but this metalog(istic) distribution is difficult to use as a likelihood function because its density lacks an explicit expression. Moreover, although the parameter combinations that produce a valid probability distribution form a convex set, other combinations do not produce one, and it is presumably impossible to characterize the admissible set explicitly. Nevertheless, it is quite possible to evaluate a metalog likelihood function numerically while imposing the required constraints on the unknown parameters, which would allow scientists to use it in many situations. The investigators will implement the metalog likelihood function in Stan, free and open-source software whose algorithms have become the workhorse for Bayesian analysis in many scientific fields. The Bayesian approach would also provide a principled way to choose a model with the appropriate number of likelihood parameters, so that the model does not overfit and generalizes to future data. The project will also advance the interdisciplinary professional development of the next generation of researchers in statistics, data science, and other STEM disciplines.
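
As an illustration of what evaluating a metalog likelihood numerically can look like, the following is a minimal Python sketch and not the investigators' Stan implementation. It assumes the three-term metalog, whose quantile function is Q(p) = a1 + a2 ln(p/(1-p)) + a3 (p - 0.5) ln(p/(1-p)). The function names, the bisection solver, and the grid-based validity check are all hypothetical choices made for this example. The density at x = Q(p) is 1/Q'(p), so each observation first requires solving Q(p) = x for p.

```python
import math


def metalog3_quantile(p, a):
    """Three-term metalog quantile function Q(p) for 0 < p < 1."""
    a1, a2, a3 = a
    logit = math.log(p / (1.0 - p))
    return a1 + a2 * logit + a3 * (p - 0.5) * logit


def metalog3_quantile_slope(p, a):
    """dQ/dp; the density at x = Q(p) is the reciprocal of this slope."""
    _, a2, a3 = a
    logit = math.log(p / (1.0 - p))
    return a2 / (p * (1.0 - p)) + a3 * ((p - 0.5) / (p * (1.0 - p)) + logit)


def is_feasible(a, grid_size=999):
    """Approximate validity check: Q must be increasing at every grid point.

    Assumption: a finite grid stands in for the exact constraint, so this
    can miss violations that occur between or beyond the grid points.
    """
    return all(
        metalog3_quantile_slope(k / (grid_size + 1), a) > 0.0
        for k in range(1, grid_size + 1)
    )


def solve_cdf(x, a, iterations=100):
    """Find p with Q(p) = x by bisection (valid because Q is increasing)."""
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if metalog3_quantile(mid, a) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def metalog3_log_likelihood(data, a):
    """Sum of log densities; -inf when the parameters fail the grid check."""
    if not is_feasible(a):
        return -math.inf
    total = 0.0
    for x in data:
        p = solve_cdf(x, a)
        total -= math.log(metalog3_quantile_slope(p, a))
    return total
```

With a = (0, 1, 0) the three-term metalog reduces to the standard logistic distribution, which gives a simple check on the sketch: the log density at x = 0 should equal log(1/4). Returning negative infinity for invalid parameters is one simple way to impose the constraint. A gradient-based sampler would more likely need a smooth reparameterization.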