About Project Tycho® Data
The aims of the Project Tycho® database are to advance the availability and use of public health data for science and policy. We do this by acquiring new data; by building infrastructure for data standardization, integration, quality control, and redistribution; by developing innovative analytics; and by advocacy.
We named the Project Tycho® database after the Danish nobleman Tycho Brahe (1546-1601), who is known for his detailed astronomical and planetary observations. Tycho was not able to use all of his data for breakthrough discoveries, but his assistant Johannes Kepler (1571-1630) used Tycho's data to derive the laws of planetary motion. Similarly, this project aims to advance the availability of large-scale public health data to the worldwide community to accelerate scientific discovery and technological progress.
Currently, we have completed digitization of the entire history of weekly National Notifiable Disease Surveillance System (NNDSS) reports for the United States (1888-2013) into a database in computable format (Level 3 data). We have standardized a major part of these data for online access (Level 2 data). A subset of the U.S. data was cleaned further and used for a study on the impact of vaccination programs in the United States that was recently published in the NEJM (Level 1 data).
The Project Tycho® data are organized as counts. A count is defined as the number of cases or deaths due to a disease in a specific location and time period. A count is equivalent to a data point. During the 126-year period of weekly disease reporting, the types of reports changed regularly, leading to different types of counts across time. This makes the integration and standardization of these data a complex task. Currently, available data are categorized in three levels based on the type of counts included. Level 1 data include different types of counts that have been standardized into a common format for a specific analysis published in the NEJM. Level 2 data include only counts that have been reported in a common format, e.g. diseases reported for a one-week period and without disease subcategories. These data can be used immediately for analysis and cover a wide range of diseases and locations, but this level does not include data that have not yet been standardized. Level 3 data include all the different types of counts ever reported. Although these are the most complete data, the large number of different count types requires extensive standardization and various judgment calls before they can be used for analysis.
|Data level||Type of counts||Number of counts||Content|
|Level 1||Different types of counts but standardized in a common format||759,483||8 diseases, 50 states, 122 cities, 1916-2009|
|Level 2||Informational counts that have been reported in a common format||3,666,141||50 diseases, 50 states, 1284 cities, 1888-2014|
|Level 3||Many different types of counts that have not been standardized and that need extensive harmonization work||14,047,623||58 diseases, 81 disease subcategories, 3026 cities, etc.|
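To make the definition of a count concrete, the following is a minimal Python sketch of one data point. The class name `Count` and all field names are assumptions chosen for illustration, not the Project Tycho® schema, and the example values are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Count:
    """One data point: cases or deaths for a disease, location, and period.

    Field names are hypothetical and used only for illustration.
    """
    disease: str
    location: str
    outcome: str        # "cases" or "deaths"
    period_start: date
    period_end: date
    number: int

    def period_days(self) -> int:
        """Length of the reporting period in days, counting both end dates."""
        return (self.period_end - self.period_start).days + 1


# Made-up example values, not taken from the database.
example = Count("MEASLES", "PENNSYLVANIA", "cases",
                date(1930, 1, 5), date(1930, 1, 11), 120)
print(example.period_days())  # 7, i.e. a one-week count
```

Keeping the reporting period as an explicit start and end date, rather than a week number, lets weekly and non-weekly counts share one structure.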
This table describes the different types of counts that have been used at any point during the 125-year history of weekly U.S. disease surveillance. Each count represents one reported number of cases or deaths for a disease, location, and time period.
|Type of count (included in level)||Description|
|Weekly (L1-3)||Counts for a one week time period|
|Contemporaneous (L1-3)||Counts reported for the first time in a concurrent year|
|Reported (L1-3)||Counts on the number of reported cases or deaths, contrary to counts that were calculated based on reported counts, such as a median or average|
|States (L1-3)||Counts at the state level|
|Cities, harmonized (L1-3)||Counts at the city level for which various name and spelling variants have been harmonized into a standard (current) name.|
|Diseases without subcategories (L1-3)||Counts for diseases for which no subcategory was listed.|
|Non-weekly (L3)||Counts for a time period that is not one week (month, year, etc.)|
|Cumulative (L3)||Counts for a time period from week 1 to the current week of a year|
|Updated, delayed (L3)||Counts reported a second time in the following year (these are listed as comparison and often updated from their first appearance as contemporaneous count)|
|Calculated (L3)||Counts representing computed summary statistics|
|Re-calculated (L1,3)||Counts recalculated by the Project Tycho® team for standardization purposes|
|Regions (L3)||Counts at the level of epidemiological regions|
|Cities, not harmonized (L3)||Counts at the city level for which name and spelling variants have not yet been harmonized|
|Counties (L3)||Counts at the county level (these are for smallpox only and names have not been harmonized)|
|Diseases with subcategory (L1,3)||Counts for diseases for which a subcategory was reported; disease subcategories have not been standardized formally|
|Non-diseases||Counts that contain non-disease information, such as population number, comments, etc.|
|Missing information (L3)||Counts for which the location, disease, or time period could not be established from contextual information|
|Smallpox average (L2)||Different counts that have been averaged into one count for a specific location and week. Many duplicate reports for smallpox cases or deaths were found in the original data, often reporting the same number for a location and week, but sometimes reporting a different number. If multiple counts reported a different number for the same location and week, these were averaged in level 2. This is an intermediate solution until more extensive research and standardization have been completed.|
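The smallpox averaging described in the last row can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the tuple layout, the function name `average_conflicting_counts`, and the choice to weight every report equally are all hypothetical, and the sample reports are made up.

```python
from collections import defaultdict
from statistics import mean


def average_conflicting_counts(records):
    """Collapse multiple reports per (location, week) into one value.

    `records` is an iterable of (location, week, number) tuples; this
    layout is an assumption made for the example.
    """
    grouped = defaultdict(list)
    for location, week, number in records:
        grouped[(location, week)].append(number)
    # Identical duplicates average to themselves; differing reports are averaged.
    return {key: mean(values) for key, values in grouped.items()}


# Made-up reports for illustration only.
reports = [
    ("CITY A", "1925-W10", 4),
    ("CITY A", "1925-W10", 4),   # duplicate with the same number
    ("CITY B", "1925-W10", 2),
    ("CITY B", "1925-W10", 6),   # duplicate with a different number
]
print(average_conflicting_counts(reports))
```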
Detailed methods for digitization, standardization, and inclusion of U.S. National Notifiable Disease Surveillance System (NNDSS) data in Project Tycho® data have been described in the online supplementary material of the paper by van Panhuis et al., Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158. A summary is provided below.
Data selection S1
Tables containing weekly reports of the U.S. NNDSS published in the Morbidity and Mortality Weekly Report (MMWR) and precursor journals were selected from various locations online. For many weeks, no PDF file could be found and paper copies from the University of Pittsburgh Library were used. The following online sources were used:
More recently, another repository became available that provides PDF files of MMWR issues.
We included all tables that provided disease specific information by week for U.S. cities, townships, counties or states. Tables that provided summary or aggregated information by month, year or at the national level were not included. Similarly, tables that did not contain disease specific information (such as all-cause mortality) were not included.
Data Entry S2
Weekly reports were downloaded or scanned as PDF files and selected tables were double entered (independently) into computer spreadsheets in a highly standardized process. Data entry was implemented by a social venture named Digital Divide Data that provides jobs and education opportunities to disadvantaged families in Southeast Asia.
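The summary does not say how the two independent entries were reconciled. One common approach, assumed here purely for illustration, is a cell-by-cell comparison that flags every position where the entries disagree so it can be checked against the PDF source. The function name and the list-of-rows layout are hypothetical, and the sample values are made up.

```python
def compare_double_entry(entry_a, entry_b):
    """Return the cell positions where two independent entries disagree.

    Each entry is a list of rows, and each row is a list of cell strings
    (a hypothetical in-memory form of one entered table).
    """
    mismatches = []
    for r in range(max(len(entry_a), len(entry_b))):
        row_a = entry_a[r] if r < len(entry_a) else []
        row_b = entry_b[r] if r < len(entry_b) else []
        for c in range(max(len(row_a), len(row_b))):
            cell_a = row_a[c] if c < len(row_a) else None
            cell_b = row_b[c] if c < len(row_b) else None
            if cell_a != cell_b:
                mismatches.append((r, c, cell_a, cell_b))
    return mismatches


# Made-up table contents for illustration only.
first = [["Boston", "12", "3"], ["Chicago", "40", "7"]]
second = [["Boston", "12", "3"], ["Chicago", "46", "7"]]
print(compare_double_entry(first, second))  # [(1, 1, '40', '46')]
```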
Quality control for data entry S3
First, data completeness was verified by comparing the content of entered data with PDF source files. Secondly, the accuracy of data entry was verified by multiple rounds of comparing random samples of entered data with PDF source files. Thirdly, data formatting was verified by various checks to ensure appropriate formatting for data loading.
Data re-entry S4
During the quality control process, missing data or data with incorrect file formatting were identified. These data were re-entered.
Paper copies of weekly reports were scanned to create PDF documents for data entry if tables for certain weeks were missing from online repositories.
Data loading S6
All data were entered in Excel spreadsheets and various components (table body, title, column headers, footnotes, etc.) of these spreadsheets were loaded separately in data files.
Extraction of disease information S7
Information from table titles and column headers was used to extract the names of diseases and subcategories, the outcome (cases or deaths), and the indicator used (reported number or summary statistics).
Extraction of reporting periods S8
Information from table titles, column headers, and specific columns in the dataset was used to extract the reporting period for each reported number of cases or deaths. These reporting periods were standardized to a start and end date.
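As a rough illustration of standardizing reporting periods to a start and end date, the sketch below converts two common period descriptions into date pairs. The function names and the inputs (a "week ending" date, a calendar month) are assumptions; the source does not specify how periods appeared in the original tables.

```python
import calendar
from datetime import date, timedelta


def week_ending_to_period(week_ending: date) -> tuple[date, date]:
    """A one-week period ending on `week_ending`, counting both end dates."""
    return week_ending - timedelta(days=6), week_ending


def month_to_period(year: int, month: int) -> tuple[date, date]:
    """A calendar-month period, as might apply to a non-weekly count."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


print(week_ending_to_period(date(1950, 3, 11)))
print(month_to_period(1950, 2))
```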
Standardization of place names S9
Misspellings and changes over time in the names of reporting locations were standardized to current names.
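A simple way to picture this step is a lookup table from known variants to a standard name, applied after basic cleanup of case, spacing, and punctuation. The sketch below is illustrative only: the variant table, the chosen standard forms, and the function names are hypothetical, and a real table would need to be curated by hand.

```python
def normalize_key(name: str) -> str:
    """Reduce a raw place name to a comparison key (case, spaces, periods)."""
    return " ".join(name.replace(".", " ").upper().split())


# Hypothetical variant table, for illustration only.
VARIANTS = {
    "PITTSBURG": "PITTSBURGH",
    "ST LOUIS": "SAINT LOUIS",
}


def harmonize(name: str) -> str:
    """Map a raw name to its standard form, or return the cleaned name."""
    key = normalize_key(name)
    return VARIANTS.get(key, key)


print(harmonize("Pittsburg"))   # PITTSBURGH
print(harmonize("St. Louis"))   # SAINT LOUIS
print(harmonize("  boston "))   # BOSTON
```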
Inclusion of publication dates S10
The publication dates of each weekly report were retrieved from the online sources of weekly tables or from the University of Pittsburgh Library. These dates were used to compute the delay between reporting period and publication of a report for each record in the database. This allowed the distinction between contemporaneous and delayed/updated counts.
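The delay computation can be sketched as a date difference followed by a cutoff. The function names and the 180-day threshold below are assumptions made for the example; the source states only that the delay was used to separate contemporaneous from delayed/updated counts, not which cutoff was applied.

```python
from datetime import date


def publication_delay_days(period_end: date, published: date) -> int:
    """Days between the end of a reporting period and publication."""
    return (published - period_end).days


def classify_report(period_end: date, published: date,
                    threshold_days: int = 180) -> str:
    """Label a count by its delay; the 180-day cutoff is an arbitrary assumption."""
    delay = publication_delay_days(period_end, published)
    return "contemporaneous" if delay < threshold_days else "delayed/updated"


# Made-up dates for illustration only.
period_end = date(1940, 6, 1)
print(publication_delay_days(period_end, date(1940, 6, 14)))  # 13
print(classify_report(period_end, date(1940, 6, 14)))
print(classify_report(period_end, date(1941, 6, 13)))
```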
All reported numbers and extracted information were integrated in one MySQL database with a unique record per reported number and associated information.
Data computation S12
Various new indicators were calculated based on extracted and standardized information including:
- Assignment of epidemiological weeks
- Calculation of reporting period
- Calculation of publication delay
- Re-calculation of cumulative reports into weekly values where possible
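The last item in the list above, turning cumulative reports into weekly values, amounts to differencing consecutive year-to-date totals. The sketch below is one possible reading of "where possible" and not the project's documented rule: it assumes the series starts at week 1, and it leaves a week empty when a neighbouring value is missing or when the cumulative total decreases.

```python
def decumulate(cumulative):
    """Convert cumulative year-to-date values into weekly values.

    `cumulative` is a list ordered by week starting at week 1, with None
    for missing weeks. A weekly value is produced only when the previous
    cumulative value is known and the difference is not negative.
    """
    weekly = []
    previous = 0  # cumulative total before week 1
    for value in cumulative:
        if value is None or previous is None or value < previous:
            weekly.append(None)
        else:
            weekly.append(value - previous)
        previous = value
    return weekly


# Made-up series: week 3 is missing and week 5 was revised downward.
print(decumulate([3, 5, None, 12, 11, 15]))  # [3, 2, None, None, None, 4]
```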
Post-processing quality control S13
After integration of all data in one database, checks were performed to detect duplicate reports and data inconsistencies. Duplicate records were removed and inconsistencies were resolved by verification with the original PDF source files.
Data filtering S14
Due to the extensive heterogeneity and data complexities in over a century of weekly surveillance data, additional processing will be required for many disease counts before these can be used for analysis. All standardized data are included in level 2 data provided online. Counts that have not yet been standardized will be available through the Project Tycho® level 3 data request form.
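The filtering step can be pictured as a predicate over the count types listed in the table of count types. The sketch below is a hedged illustration: the class `CountFlags`, its fields, and the function `is_level2_candidate` are hypothetical, and the rule only encodes the common-format types named on this page (weekly, contemporaneous, reported, state or harmonized city, no subcategory). It leaves out special cases such as the smallpox averages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CountFlags:
    """Hypothetical descriptors of one count, for illustration only."""
    period_days: int
    contemporaneous: bool
    reported: bool              # False for calculated summary statistics
    location_type: str          # "state", "city", "county", or "region"
    name_harmonized: bool
    subcategory: Optional[str]  # None when no subcategory was listed


def is_level2_candidate(c: CountFlags) -> bool:
    """True when a count has the common format described for level 2."""
    return (
        c.period_days == 7
        and c.contemporaneous
        and c.reported
        and c.location_type in ("state", "city")
        and c.name_harmonized
        and c.subcategory is None
    )


weekly_state = CountFlags(7, True, True, "state", True, None)
monthly_county = CountFlags(30, True, True, "county", False, None)
print(is_level2_candidate(weekly_state), is_level2_candidate(monthly_county))
```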
Data for Health
We aim to advance the use of public health data for the improvement of public health. Oftentimes, restricted access to public health data limits opportunities for scientific discovery and technological innovation in disease control programs. A free flow of data and information maximizes opportunities for more efficient and effective public health programs leading to higher impact and better health. Our activities are focused on accelerating the availability and use of public health data as listed below.
Focus area 1 — Acquisition of new data
The Project Tycho® team is continuously engaging in new partnerships with scientists, funding and public health agencies around the world to add or connect new historical and current datasets to the system. This involves addressing common barriers such as privacy and ownership concerns through transparent negotiations and data use agreements. We are making every effort to fully engage data contributors during the entire process from data identification to data redistribution and advocacy.
Focus area 2 — Data infrastructure
Advancing the availability and use of public health data from many sources and in many formats requires an innovative data processing and warehouse infrastructure. Project Tycho® investigators are conducting active research on new algorithms to digitize, standardize, integrate, and store public health data using combinations of automated processes and manual verification. The Project Tycho® Web development team is dedicated to an optimal user Web interface to explore and download public health data for various types of use.
Focus area 3 — Analytics
The Project Tycho® team is collaborating with international partners from a large variety of scientific disciplines to create innovative analytical approaches that add value to public health data. Analytics include creative data visualizations that reveal population-level patterns of disease spread and help to understand disease causality, leading to better control strategies. Analytics also include spatial and temporal statistics, data mining methods for hypothesis-generating research, and classical epidemiological and statistical methods.
Focus area 4 — Advocacy
Many barriers have been identified that currently limit the availability of public health data for research and policy. We are actively engaged in advocacy for better data availability. The project is continuously improving tools for data use in public health training and education at the high school, undergraduate, graduate, and post-graduate level. We are also developing new analytical and modeling tools for use by policy makers to improve public health programs. The project is involved in research and consultations on principles and guidelines to make public health data more widely available and will continue to disseminate data and analyses through scientific presentations, publications, and other outreach activities.
Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.
Assistant Professor of Epidemiology
Distinguished University Professor of Health Science and Policy
Anne Cross, MLIS, Project Tycho® lead programmer
Sharon Crow, MEd, Project Tycho® coordinator
- Dan Bain, Assistant Professor, Department of Geology and Planetary Science, University of Pittsburgh
- Shawn Brown, Director of Public Health Applications, Pittsburgh Supercomputing Center
- Donald Burke, Dean, Graduate School of Public Health, University of Pittsburgh
- Derek Cummings, Associate Professor, Johns Hopkins University
- John Grefenstette, Professor of Health Policy & Management, University of Pittsburgh
- Hasan Guclu, Assistant Professor of Biostatistics, University of Pittsburgh
- Ernesto Marques, Associate Professor of Infectious Disease and Microbiology, Center for Vaccine Research
- Willem van Panhuis, Assistant Professor of Epidemiology, University of Pittsburgh
- Anne Cross, Programmer, Public Health Dynamics Laboratory, University of Pittsburgh
- Sharon Crow, Project Coordinator, Public Health Dynamics Laboratory, University of Pittsburgh
- Mary Krauland, Programmer, Public Health Dynamics Laboratory, University of Pittsburgh
- Stephen Liu, Department of Epidemiology, University of Pittsburgh
- Proma Paul, Department of Epidemiology, University of Pittsburgh
Former Faculty, Staff and Students
- Chantz Anderson
- Suzanne Cake
- Nian Shong Chok
- Kate Colligan
- Heather Eng
- Ying-Feng Hsu
- Xi Huang
- Erin Jenkins
- Su Yon Jung
- Hanseul Kim
- Tiffany Kinney
- Raaka Kumbhakar
- Bruce Lee
- Patrick Manning
- Victor Martinez-Cassmeyer, University of Missouri
- Elizabeth Mitgang, University of Pittsburgh
- Yewande Olugbade
- Wenjing Qi
- Irene Ruberto
- Divyasheel Sharma
- Justin Smith
- Stephen Wisniewski
- Vladimir Zadorozhny
- Yongxu Zang
- Lifan Zhang, Emory University
- Benter Foundation
- Bill & Melinda Gates Foundation
- Brazil Ministry of Health
- Cambodia Ministry of Health
- Council of State and Territorial Epidemiologists (CSTE)
- Digital Divide Data
- Johns Hopkins University Bloomberg School of Public Health
- Laos Ministry of Health
- NIH National Institute of General Medical Sciences (NIGMS)
- Pan American Health Organization (PAHO)
- Taiwan Ministry of Health
- Thailand Ministry of Health
- University of Pittsburgh Department of History
- University of Pittsburgh School of Information Sciences
- U.S. Department of Health & Human Services
- U.S. Open Government Initiative
- Vietnam Ministry of Health
- World Health Organization (WHO)
- Public Health Reports from PubMed Central
- Morbidity and Mortality Weekly Report from the Hathi Trust Digital Library
- U.S. Centers for Disease Control MMWR Past Volumes
- U.S. Centers for Disease Control Stacks