• Journal: Statistics in Medicine
• Date: May 4, 2017
• DOI: 10.1002/sim.7320
• Category: Scientific Research


David Benkeser, Cheng Ju, Sam Lendle, and Mark van der Laan of the Biostatistics Group at the University of California, Berkeley, created flexible, ensemble-based online estimators and tested their performance by predicting hepatitis A incidence with data from Project Tycho.


David Benkeser

Cheng Ju

Sam Lendle

Mark van der Laan

Related Project Tycho Datasets

United States of America - Viral hepatitis

United States of America - Viral hepatitis, type A

United States of America - Acute type A viral hepatitis


Online estimators update a current estimate with a new incoming batch of data without having to revisit past data, thereby providing streaming estimates that are scalable to big data. We develop flexible, ensemble-based online estimators of an infinite-dimensional target parameter, such as a regression function, in the setting where data are generated sequentially by a common conditional data distribution given summary measures of the past. This setting encompasses a wide range of time-series models and, as a special case, models for independent and identically distributed data. Our estimator considers a large library of candidate online estimators and uses online cross-validation to identify the algorithm with the best performance. We show that by basing estimates on the cross-validation-selected algorithm, we are asymptotically guaranteed to perform as well as the true, unknown best-performing algorithm. We provide extensions of this approach, including online estimation of the optimal ensemble of candidate online estimators. We illustrate excellent performance of our methods using simulations and a real data example where we make streaming predictions of infectious disease incidence using data from a large database.
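The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (RunningMean, SGDLinear, OnlineCVSelector), the two candidate estimators, the squared-error loss, and the simulated independent data stream are all assumptions made for illustration. The idea shown is that each new batch is first used to score every candidate, before any candidate has seen it, and only then used to update the candidates; the candidate with the smallest accumulated loss is selected.

```python
import random


class RunningMean:
    """Hypothetical candidate: predicts the running mean of past outcomes."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self, x):
        return self.mean

    def update(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class SGDLinear:
    """Hypothetical candidate: one-covariate linear regression fit by SGD."""

    def __init__(self, step):
        self.step = step
        self.intercept = 0.0
        self.slope = 0.0

    def predict(self, x):
        return self.intercept + self.slope * x

    def update(self, x, y):
        # Gradient step for the squared-error loss on one observation.
        err = self.predict(x) - y
        self.intercept -= self.step * err
        self.slope -= self.step * err * x


class OnlineCVSelector:
    """Tracks the online cross-validated loss of each candidate."""

    def __init__(self, candidates):
        self.candidates = candidates
        self.cum_loss = [0.0] * len(candidates)

    def process_batch(self, batch):
        # Step 1: score every candidate on the new batch BEFORE updating,
        # so the batch serves as a validation sample for the current fits.
        for k, cand in enumerate(self.candidates):
            self.cum_loss[k] += sum((cand.predict(x) - y) ** 2 for x, y in batch)
        # Step 2: update every candidate with the batch; past data are
        # never revisited.
        for cand in self.candidates:
            for x, y in batch:
                cand.update(x, y)

    def best_index(self):
        # Candidate with the smallest accumulated validation loss.
        return min(range(len(self.candidates)), key=self.cum_loss.__getitem__)

    def predict(self, x):
        return self.candidates[self.best_index()].predict(x)


def run_demo(seed=0, n_batches=200, batch_size=10):
    """Stream simulated i.i.d. batches with y = 1 + 2x + noise (assumed)."""
    rng = random.Random(seed)
    selector = OnlineCVSelector([RunningMean(), SGDLinear(step=0.1)])
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = rng.uniform(-1.0, 1.0)
            y = 1.0 + 2.0 * x + rng.gauss(0.0, 0.1)
            batch.append((x, y))
        selector.process_batch(batch)
    return selector


if __name__ == "__main__":
    sel = run_demo()
    print("selected candidate index:", sel.best_index())
    print("cumulative losses:", sel.cum_loss)
```

In this toy stream the linear candidate should accumulate far less loss than the running mean, so the selector ends up predicting with it. The sketch covers only the selection of a single best candidate on independent data; the time-series setting and the ensemble extension mentioned in the abstract are not shown.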
