  • Journal: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14)
  • Date: Aug. 24, 2014
  • DOI: 10.1145/2623330.2623624
  • Category: Scientific Research


Researchers from Kumamoto University in Japan, the University of Pittsburgh, and Carnegie Mellon University, led by Dr. Yasuko Matsubara, used Project Tycho historical time series data to develop the FUNNEL model and its fitting algorithm, FUNNELFIT. The research used weekly case counts from Project Tycho for US states across a range of epidemic diseases (diphtheria, scarlet fever, mumps, rubella, chickenpox, whooping cough, influenza, smallpox, typhoid fever, cryptosporidiosis, Lyme disease, typhus fever, gonorrhea, rabies in animals, and brucellosis). FUNNEL correctly identified unusual patterns in the data, outlier values, seasonality, and spatial patterns. The project shows how curated historical epidemiological data can inform modern data mining algorithms.


Yasuko Matsubara

Yasushi Sakurai

Willem G. van Panhuis

Christos Faloutsos

Related Project Tycho Datasets

United States of America - Diphtheria

United States of America - Scarlet Fever

United States of America - Mumps

United States of America - Rubella

United States of America - Varicella (Chicken Pox)

United States of America - Pertussis (Whooping Cough)

United States of America - Influenza

United States of America - Smallpox

United States of America - Typhoid Fever

United States of America - Typhoid and Paratyphoid Fevers

United States of America - Cryptosporidiosis

United States of America - Lyme Disease

United States of America - Murine Typhus

United States of America - Typhus Group Rickettsial Disease

United States of America - Gonorrhea

United States of America - Brucellosis


Given a large collection of epidemiological data consisting of the counts of d contagious diseases for l locations over a duration n, how can we find patterns, rules and outliers? For example, Project Tycho provides open access to infection counts for U.S. states from 1888 to 2013, for 56 contagious diseases (e.g., measles, influenza), which include missing values, possible recording errors, sudden spikes (or dives) of infections, etc. So how can we find a combined model for all these diseases, locations, and time-ticks? In this paper, we present FUNNEL, a unifying analytical model for large-scale epidemiological data, as well as a novel fitting algorithm, FUNNELFIT, which solves the above problem. Our method has the following properties: (a) Sense-making: it detects important patterns of epidemics, such as periodicities, the appearance of vaccines, external shock events, and more; (b) Parameter-free: our modeling framework frees the user from providing parameter values; (c) Scalable: FUNNELFIT is carefully designed to be linear in the input size; (d) General: our model is general and practical, and can be applied to various types of epidemics, including computer-virus propagation as well as human diseases. Extensive experiments on real data demonstrate that FUNNELFIT does indeed discover important properties of epidemics: (P1) disease seasonality, e.g., influenza spikes in January, Lyme disease spikes in July, and the absence of yearly periodicity for gonorrhea; (P2) disease reduction effects, e.g., the appearance of vaccines; (P3) local/state-level sensitivity, e.g., many measles cases in NY; (P4) external shock events, e.g., historical flu pandemics; (P5) incongruous values, i.e., data reporting errors.
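To make properties (P1) and (P5) concrete, here is a minimal Python sketch of what "seasonality" and "incongruous values" mean for a single weekly series of one disease in one location. It is not FUNNEL or FUNNELFIT. It swaps in a much simpler technique (a per-week median profile plus a median-absolute-deviation threshold), and every name in it (`seasonal_profile`, `peak_week`, `flag_outliers`, the 52-week year, the threshold of 5) is an assumption made for illustration. The data are synthetic, not Project Tycho counts.

```python
import math
from statistics import median

WEEKS_PER_YEAR = 52  # simplifying assumption: every year has exactly 52 weekly ticks


def seasonal_profile(counts):
    """Median count for each week-of-year, skipping missing values (None)."""
    buckets = [[] for _ in range(WEEKS_PER_YEAR)]
    for t, c in enumerate(counts):
        if c is not None:
            buckets[t % WEEKS_PER_YEAR].append(c)
    return [median(b) if b else 0.0 for b in buckets]


def peak_week(counts):
    """Week-of-year (0-based) at which the seasonal profile is highest."""
    profile = seasonal_profile(counts)
    return max(range(WEEKS_PER_YEAR), key=profile.__getitem__)


def flag_outliers(counts, threshold=5.0):
    """Time-ticks whose residual from the profile exceeds threshold * MAD."""
    profile = seasonal_profile(counts)
    residuals = {
        t: c - profile[t % WEEKS_PER_YEAR]
        for t, c in enumerate(counts)
        if c is not None
    }
    # Fall back to 1.0 when the series fits the profile exactly (MAD of 0).
    mad = median(abs(r) for r in residuals.values()) or 1.0
    return [t for t, r in residuals.items() if abs(r) > threshold * mad]


# Synthetic example: 10 years of weekly counts peaking in week 2 (mid-January),
# with one missing value and one implausible spike (a mock reporting error).
counts = [
    round(100 + 80 * math.cos(2 * math.pi * ((t % WEEKS_PER_YEAR) - 2) / WEEKS_PER_YEAR))
    for t in range(10 * WEEKS_PER_YEAR)
]
counts[130] = None   # missing report
counts[300] = 5000   # incongruous value

print(peak_week(counts))      # 2
print(flag_outliers(counts))  # [300]
```

The median is used instead of the mean so that a single bad report does not distort the seasonal profile. The full problem in the abstract spans d diseases, l locations and n time-ticks jointly, which this single-series sketch does not attempt.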
