Researchers led by Hyun Ah Song of the Machine Learning Department at Carnegie Mellon University developed an algorithm that reconstructs time series counts from aggregated reports by carefully infusing domain knowledge. Its effectiveness was assessed by comparison with Project Tycho data.


Hyun Ah Song
Fan Yang
Zongge Liu
Wilbert van Panhuis
Nicholas Sidiropoulos
Christos Faloutsos
Vladimir Zadorozhny

Related Project Tycho Datasets

United States of America - Acute nonparalytic poliomyelitis
United States of America - Acute paralytic poliomyelitis
United States of America - Acute poliomyelitis
United States of America - Acute type A viral hepatitis
United States of America - Congenital rubella syndrome
United States of America - Measles
United States of America - Mumps
United States of America - Pertussis
United States of America - Rubella
United States of America - Smallpox
United States of America - Smallpox without rash
United States of America - Viral hepatitis, type A


Given some (but not all) monthly totals of people with measles (or counts of product units sold, or counts of retweets), how can we recover the weekly counts? Requiring smoothness between successive weeks is reasonable, but can we do better if we have some domain knowledge? For example, we know that measles (like flu, retweet counts, etc.) follows a specific cascade model, such as the so-called 'SIS' (susceptible-infected-susceptible) model. The answer is 'yes'. With our proposed GB-R, we show how to inject domain knowledge to create a gray-box model, and how to set up and efficiently solve the appropriate optimization problem. The desirable properties of GB-R are: (a) effectiveness, outperforming the best competitors on real epidemiology data, often by 3x to 25x in reconstruction error; (b) scalability, being linear in the sequence length; and (c) interpretability, accurately estimating the parameters of the gray-box model.
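To make the problem setup concrete, here is a minimal sketch of the smoothness-only baseline that the abstract describes as the reasonable starting point. It is not GB-R itself, which additionally injects an epidemic model. The function name `reconstruct_smooth`, the `(start, stop, total)` report format, and the toy numbers are assumptions made for illustration. The sketch finds weekly counts whose sums match the available aggregated reports while minimizing the squared differences between successive weeks.

```python
import numpy as np

def reconstruct_smooth(n_weeks, reports):
    """Smoothness-only reconstruction (illustrative baseline, not GB-R).

    reports: list of (start_week, stop_week_exclusive, total) tuples,
             a hypothetical format for the aggregated reports.
    Returns the weekly series x that minimizes sum_t (x[t+1] - x[t])**2
    subject to every report total being matched exactly.
    """
    m = len(reports)
    # Aggregation matrix A: row i has a 1 for each week covered by report i.
    A = np.zeros((m, n_weeks))
    y = np.zeros(m)
    for i, (start, stop, total) in enumerate(reports):
        A[i, start:stop] = 1.0
        y[i] = total

    # First-difference matrix D, so that (D @ x)[t] = x[t+1] - x[t].
    D = np.diff(np.eye(n_weeks), axis=0)

    # KKT system of: minimize ||D x||^2 subject to A x = y.
    K = np.block([[2.0 * D.T @ D, A.T],
                  [A, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n_weeks), y])
    solution = np.linalg.lstsq(K, rhs, rcond=None)[0]
    return solution[:n_weeks]

# Toy example: 12 weeks, with the middle four-week "month" unreported.
weekly = reconstruct_smooth(12, [(0, 4, 40.0), (8, 12, 80.0)])
print(np.round(weekly, 2))
```

This dense formulation is only meant to show the structure of the optimization problem; it does not reproduce the linear scalability claimed for GB-R. A gray-box variant would add a term penalizing deviation from the chosen epidemic model's dynamics.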
