About Project Tycho® Data

The Project Tycho® database aims to advance the availability and use of public health data for science and policy. We do this by acquiring new data; by building infrastructure for data standardization, integration, quality control, and redistribution; by developing innovative analytics; and by advocacy. Read more about aims and activities.

We named the Project Tycho® database after the Danish nobleman Tycho Brahe (1546-1601), who is known for his detailed astronomical and planetary observations. Tycho was not able to use all of his data for breakthrough discoveries, but his assistant Johannes Kepler (1571-1630) used Tycho's data to derive the laws of planetary motion. Similarly, this project aims to advance the availability of large-scale public health data to the worldwide community to accelerate scientific discovery and technological progress.

Currently, we have completed digitization of the entire history of weekly National Notifiable Disease Surveillance System (NNDSS) reports for the United States (1888-2013) into a database in computable format (Level 3 data). We have standardized a major part of these data for online access (Level 2 data). A subset of the U.S. data was cleaned further and used for a study on the impact of vaccination programs in the United States that was recently published in the NEJM (Level 1 data).

Levels of data

The Project Tycho® data are organized as counts. A count is the number of cases or deaths due to a disease in a specific location and time period; each count is one data point. During the 126-year period of weekly disease reporting, the types of reports changed regularly, leading to different types of counts across time. This makes integrating and standardizing these data a complex task.

Currently, available data are categorized in three levels based on the type of counts included. Level 1 data include different types of counts that have been standardized into a common format for a specific analysis published recently in the NEJM. Level 2 data include only counts that have been reported in a common format, e.g. diseases reported for a one-week period and without disease subcategories. These data can be used immediately for analysis and cover a wide range of diseases and locations, but this level does not include data that have not yet been standardized. Level 3 data include all the different types of counts ever reported. Although these are the most complete data, the large number of different count types requires extensive standardization and various judgment calls before they can be used for analysis.

 

Data level | Type of counts | Number of counts | Content
Level 1 | Different types of counts but standardized in a common format | 759,483 | 8 diseases, 50 states, 122 cities, 1916-2009
Level 2 | Informational counts that have been reported in a common format | 3,666,141 | 50 diseases, 50 states, 1284 cities, 1888-2014
Level 3 | Many different types of counts that have not been standardized and that need extensive harmonization work | 14,047,623 | 58 diseases, 81 disease subcategories, 3026 cities, etc.

 

Types of data counts

This table describes the different types of counts that have been used at any point during the 125 year history of weekly U.S. disease surveillance. Each count represents one reported number of cases or deaths for a disease, location and time period.

Type of count (included in level) | Description
Weekly (L1-3) | Counts for a one-week time period
Contemporaneous (L1-3) | Counts reported for the first time in a concurrent year
Reported (L1-3) | Counts of the number of reported cases or deaths, as opposed to counts that were calculated from reported counts, such as a median or average
States (L1-3) | Counts at the state level
Cities, harmonized (L1-3) | Counts at the city level for which various name and spelling variants have been harmonized into a standard (current) name
Diseases without subcategories (L1-3) | Counts for diseases for which no subcategory was listed
Non-weekly (L3) | Counts for a time period that is not one week (month, year, etc.)
Cumulative (L3) | Counts for a time period from week 1 to the current week of a year
Updated, delayed (L3) | Counts reported a second time in the following year (these are listed for comparison and are often updated from their first appearance as a contemporaneous count)
Calculated (L3) | Counts representing computed summary statistics
Re-calculated (L1,3) | Counts recalculated by the Project Tycho® team for standardization purposes
Regions (L3) | Counts at the level of epidemiological regions
Cities, not harmonized (L3) | Counts at the city level for which name and spelling variants have not yet been harmonized
Counties (L3) | Counts at the county level (these are for smallpox only and names have not been harmonized)
Diseases with subcategory (L1,3) | Counts for diseases for which a subcategory was reported; disease subcategories have not been standardized formally
Non-diseases | Counts that contain non-disease information, such as population number, comments, etc.
Missing information (L3) | Counts for which the location, disease, or time period could not be established from contextual information
Smallpox average (L2) | Different counts that have been averaged into one count for a specific location and week. Many duplicate reports for smallpox cases or deaths were found in the original data, often reporting the same number for a location and week, but sometimes reporting a different number. If multiple counts reported a different number for the same location and week, these were averaged in level 2. This is an intermediate solution until more extensive research and standardization have been completed.
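
The smallpox averaging described in the last row can be illustrated with a minimal Python sketch. It is not the project's actual code: the function name, the tuple layout, and the sample values are hypothetical, and the sketch assumes that every duplicate report for a location and week enters the average (the source does not say whether identical repeats are collapsed first).

```python
from collections import defaultdict
from statistics import mean


def average_duplicates(reports):
    """Average all counts that share the same (location, week) key.

    `reports` is an iterable of (location, week, count) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    grouped = defaultdict(list)
    for location, week, count in reports:
        grouped[(location, week)].append(count)
    # One averaged count per location and week.
    return {key: mean(values) for key, values in grouped.items()}


# Made-up example reports: two conflicting values for one city-week,
# and two identical values for another.
reports = [
    ("PITTSBURGH", "1900-W01", 4),
    ("PITTSBURGH", "1900-W01", 6),
    ("BOSTON", "1900-W01", 3),
    ("BOSTON", "1900-W01", 3),
]
averaged = average_duplicates(reports)
```

Identical duplicates leave the value unchanged, so only conflicting reports are affected by the averaging.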

 

Methodology

Detailed methods for digitization, standardization, and inclusion of U.S. National Notifiable Disease Surveillance System (NNDSS) data in Project Tycho® data have been described in the online supplementary material of the paper by van Panhuis et al., Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158. A summary is provided below.


Project Tycho® protocol schematic


Data selection S1

Tables containing weekly reports of the U.S. NNDSS published in the Morbidity and Mortality Weekly Report (MMWR) and precursor journals were selected from various locations online. For many weeks, no PDF file could be found and paper copies from the University of Pittsburgh Library were used. The following online sources were used:

  1. PubMed Central
  2. Hathi Trust Digital Library
  3. Morbidity and Mortality Weekly Report

More recently, another repository became available that provides PDF files of MMWR issues.

We included all tables that provided disease specific information by week for U.S. cities, townships, counties or states. Tables that provided summary or aggregated information by month, year or at the national level were not included. Similarly, tables that did not contain disease specific information (such as all-cause mortality) were not included.

Data Entry S2

Weekly reports were downloaded or scanned as PDF files and selected tables were double entered (independently) into computer spreadsheets in a highly standardized process. Data entry was carried out by Digital Divide Data, a social venture that provides jobs and education opportunities to disadvantaged families in Southeast Asia.
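
Independent double entry is useful because the two versions can be compared cell by cell. The Python sketch below shows one way such a comparison could work; the source does not describe how the two entries were reconciled, so the function name and the row-and-column layout are assumptions made for illustration.

```python
def find_discrepancies(entry_a, entry_b):
    """Return (row, column, value_a, value_b) for every cell that differs.

    Each entry is a list of rows, and each row is a list of cell values.
    Cells missing from one entry are reported as None.
    """
    discrepancies = []
    for r in range(max(len(entry_a), len(entry_b))):
        row_a = entry_a[r] if r < len(entry_a) else []
        row_b = entry_b[r] if r < len(entry_b) else []
        for c in range(max(len(row_a), len(row_b))):
            a = row_a[c] if c < len(row_a) else None
            b = row_b[c] if c < len(row_b) else None
            if a != b:
                discrepancies.append((r, c, a, b))
    return discrepancies


# Made-up example: the second keyer mistyped one cell and skipped another.
first_entry = [["BOSTON", 12, 3], ["CHICAGO", 40, 7]]
second_entry = [["BOSTON", 12, 8], ["CHICAGO", 40]]
mismatches = find_discrepancies(first_entry, second_entry)
```

Each reported mismatch would then be checked against the PDF source file.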

Quality control for data entry S3

First, data completeness was verified by comparing the content of entered data with PDF source files. Secondly, the accuracy of data entry was verified by multiple rounds of comparing random samples of entered data with PDF source files. Thirdly, data formatting was verified by various checks to ensure appropriate formatting for data loading.

Data re-entry S4

During the quality control process, missing data or data with incorrect file formatting were identified. These data were re-entered.

Scanning S5

Paper copies of weekly reports were scanned to create PDF documents for data entry if tables for certain weeks were missing from online repositories.

Data loading S6

All data were entered in Excel spreadsheets and various components (table body, title, column headers, footnotes, etc.) of these spreadsheets were loaded separately in data files.

Extraction of disease information S7

Information from table titles and column headers was used to extract the names of diseases and subcategories, the outcome (cases or deaths) and the indicator used (reported number or summary statistics).

Extraction of reporting periods S8

Information from table titles, column headers and specific columns in the dataset was used to extract the reporting period for each reported number of cases or deaths. These reporting periods were standardized to a start date and an end date.

Standardization of place names S9

Misspellings and changes over time in the names of reporting locations were standardized to current names.
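
A lookup table from known variants to the current name is one simple way to express this step. The following Python sketch is illustrative only: the variant list and the function name are hypothetical, and the project's actual mapping is not given in the source.

```python
# Hypothetical variant table: historical or misspelled name -> current name.
NAME_VARIANTS = {
    "PITTSBURG": "PITTSBURGH",
    "NEW YORK CITY": "NEW YORK",
}


def standardize_place(name, variants=NAME_VARIANTS):
    """Normalize case and whitespace, then map known variants to the current name."""
    key = " ".join(name.upper().split())
    # Names without a known variant are returned in normalized form.
    return variants.get(key, key)


standardized = [standardize_place(n) for n in ["Pittsburg", " new  york city ", "Boston"]]
```

Names that are not in the table pass through unchanged (apart from case and spacing), which makes unharmonized names easy to spot later.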

Inclusion of publication dates S10

The publication date of each weekly report was retrieved from the online sources of weekly tables or from the University of Pittsburgh Library. These dates were used to compute the delay between the reporting period and the publication of a report for each record in the database. This made it possible to distinguish contemporaneous counts from delayed/updated counts.
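
As a rough Python illustration, the delay can be computed as the number of days between the end of the reporting period and the publication date. The classification rule and the 90-day threshold below are assumptions made for this sketch; the source does not state the rule the project used.

```python
from datetime import date


def publication_delay(period_end, published):
    """Days between the end of the reporting period and publication."""
    return (published - period_end).days


def classify(period_end, published, max_delay_days=90):
    """Label a count by its delay.

    `max_delay_days` is a hypothetical threshold, not a value from the source.
    """
    if publication_delay(period_end, published) <= max_delay_days:
        return "contemporaneous"
    return "delayed/updated"


# Made-up example: the same week reported shortly afterwards and again a year later.
week_end = date(1950, 1, 7)
first_report = classify(week_end, date(1950, 1, 20))
second_report = classify(week_end, date(1951, 1, 19))
```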

Integration S11

All reported numbers and extracted information were integrated in one MySQL database with a unique record per reported number and associated information.

Data computation S12

Various new indicators were calculated based on extracted and standardized information including:

  1. Assignment of epidemiological weeks
  2. Calculation of reporting period
  3. Calculation of publication delay
  4. Re-calculation of cumulative reports in weekly values where possible

Post-processing quality control S13

After integration of all data in one database, checks were performed to detect duplicate reports and data inconsistencies. Duplicate records were removed and inconsistencies were resolved by verification with the original PDF source files.
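
A minimal Python sketch of such a check follows, assuming each record is a dictionary with the hypothetical keys shown. Exact duplicates are collapsed to one record, while groups whose values disagree are set aside for verification against the source files. This illustrates the idea and is not the project's implementation.

```python
def split_duplicates(records):
    """Separate clean records from conflicting duplicate groups.

    Records are dicts with hypothetical keys: disease, location, start,
    end, outcome and value. Records sharing every field except `value`
    are treated as reports of the same count.
    """
    groups = {}
    for rec in records:
        key = (rec["disease"], rec["location"], rec["start"], rec["end"], rec["outcome"])
        groups.setdefault(key, []).append(rec)

    unique, conflicts = [], []
    for group in groups.values():
        if len({rec["value"] for rec in group}) == 1:
            unique.append(group[0])   # identical duplicates: keep one copy
        else:
            conflicts.append(group)   # differing values: needs manual review
    return unique, conflicts


def make(location, value):
    """Build a made-up measles record for the example."""
    return {"disease": "MEASLES", "location": location, "start": "1930-01-05",
            "end": "1930-01-11", "outcome": "cases", "value": value}


unique, conflicts = split_duplicates(
    [make("OHIO", 10), make("OHIO", 10), make("IOWA", 4), make("IOWA", 7)]
)
```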

Data filtering S14

Due to the extensive heterogeneity and data complexities in over a century of weekly surveillance data, additional processing will be required for many disease counts before these can be used for analysis. All standardized data are included in the level 2 data provided online. Counts that have not yet been standardized will be available through the Project Tycho® level 3 data request form.

 

Aims and Activities

Data for Health

We aim to advance the use of public health data for the improvement of public health. Oftentimes, restricted access to public health data limits opportunities for scientific discovery and technological innovation in disease control programs. A free flow of data and information maximizes opportunities for more efficient and effective public health programs leading to higher impact and better health. Our activities are focused on accelerating the availability and use of public health data as listed below.

Focus area 1 — Acquisition of new data

The Project Tycho® team is continuously engaging in new partnerships with scientists, funding and public health agencies around the world to add or connect new historical and current datasets to the system. This involves addressing common barriers such as privacy and ownership concerns through transparent negotiations and data use agreements. We are making every effort to fully engage data contributors during the entire process from data identification to data redistribution and advocacy.

Focus area 2 — Data infrastructure

Advancing the availability and use of public health data from many sources and in many formats requires an innovative data processing and warehouse infrastructure. Project Tycho® investigators are conducting active research on new algorithms to digitize, standardize, integrate, and store public health data using combinations of automated processes and manual verification. The Project Tycho® Web development team is dedicated to an optimal user Web interface to explore and download public health data for various types of use.

Focus area 3 — Analytics

The Project Tycho® team is collaborating with international partners from a wide variety of scientific disciplines to create innovative analytical approaches that add value to public health data. Analytics range from creative data visualizations, which reveal population-level patterns of disease spread that help to understand disease causality and lead to better control strategies, to spatial and temporal statistics and data mining methods for hypothesis-generating research, as well as classical epidemiological and statistical methods.

Focus area 4 — Advocacy

Many barriers have been identified that currently limit the availability of public health data for research and policy. We are actively engaged in advocacy for better data availability. The project is continuously improving tools for data use in public health training and education at the high school, undergraduate, graduate, and post-graduate level. We are also developing new analytical and modeling tools for use by policy makers to improve public health programs. The project is involved in research and consultations on principles and guidelines to make public health data more widely available and will continue to disseminate data and analyses through scientific presentations, publications, and other outreach activities.

 

How to cite

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

 

People

Core Team

  • Wilbert van Panhuis, MD, PhD
    Project Tycho® lead investigator
    Assistant Professor of Epidemiology
  • Donald Burke, MD
    Dean, Graduate School of Public Health
    Distinguished University Professor of Health Science and Policy
  • John Grefenstette, PhD
    Director, Public Health Dynamics Laboratory
    Professor of Biostatistics
  • Anne Cross, MLIS
    Project Tycho® lead programmer
     
  • Sharon Crow, MEd
    Project Tycho® coordinator
     

Faculty

Staff

  • Anne Cross, Programmer, Public Health Dynamics Laboratory, University of Pittsburgh
  • Sharon Crow, Project Coordinator, Public Health Dynamics Laboratory, University of Pittsburgh
  • Mary Krauland, Programmer, Public Health Dynamics Laboratory, University of Pittsburgh

Graduate Students

  • Stephen Liu, Department of Epidemiology, University of Pittsburgh
  • Proma Paul, Department of Epidemiology, University of Pittsburgh
  • Wenjing Qi, Department of Statistics, University of Pittsburgh

Former Faculty, Staff and Students

  • Chantz Anderson
  • Suzanne Cake
  • Nian Shong Chok
  • Kate Colligan
  • Heather Eng
  • Ying-Feng Hsu
  • Xi Huang
  • Erin Jenkins
  • Su Yon Jung
  • Hanseul Kim
  • Tiffany Kinney
  • Raaka Kumbhakar
  • Bruce Lee
  • Patrick Manning
  • Victor Martinez-Cassmeyer, University of Missouri
  • Elizabeth Mitgang, University of Pittsburgh
  • Yewande Olugbade
  • Irene Ruberto
  • Divyasheel Sharma
  • Justin Smith
  • Stephen Wisniewski
  • Vladimir Zadorozhny
  • Yongxu Zang
  • Lifan Zhang, Emory University

 

Partners

 

Sources of current data

 

 

 

The Project Tycho® database is funded by the Bill & Melinda Gates Foundation and the National Institutes of Health

© 2013, University of Pittsburgh. All Rights Reserved.