Katrina Sostek
Katrina Sostek is a Software Engineer at Google.
Authored Publications
Discovering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search
Daniel Russell
Stella Dugall
Harvard Data Science Review (2024)
With the rise of open data in the last two decades, more datasets are online and more people are using them for projects and research. But how do people find datasets? We present the first user study of Google Dataset Search, a dataset-discovery tool that uses a web crawl and open ecosystem to find datasets. Google Dataset Search contains a superset of the datasets in other dataset-discovery tools—a total of 45 million datasets from 13,000 sources. We found that the tool addresses a previously identified need: a search engine for datasets across the entire web, including datasets in other tools. However, the tool introduced new challenges due to its open approach: building a mental model of the tool, making sense of heterogeneous datasets, and learning how to search for datasets. We discuss recommendations for dataset-discovery tools and open research questions.
How complete are the CDC's COVID-19 Case Surveillance and NCHS datasets for deaths with race/ethnicity at the state and county levels?
Google, Inc. (2021)
The Covid Tracking Project was the most reliable source for COVID-19 data with race/ethnicity at the state level until it stopped collecting data on March 7, 2021. The CDC's Case Surveillance Restricted Access and National Center for Health Statistics provisional deaths datasets are the best available replacements for the Covid Tracking Project's dataset, and they additionally include county-level data and age along with race/ethnicity. This paper evaluates the completeness of the CDC datasets at the state and county levels in terms of (1) the total number of deaths included compared to the New York Times, and (2) the number of deaths included with race/ethnicity data compared to the Covid Tracking Project.
The CDC's Restricted Access dataset contains 79% of the deaths in the New York Times up to April 15, and 84% of deaths have race/ethnicity information vs. 93% in the Covid Tracking Project. At the state and county levels, the dataset's completeness is highly variable with 11 states reporting fewer than 10% of deaths and eight reporting 0% of the deaths included in the New York Times. The National Center for Health Statistics' dataset is highly complete in all states except for North Carolina. At the county level, the National Center for Health Statistics' dataset is more complete within the counties it contains, but it only contains counties with at least 100 COVID-19 deaths, which are generally counties with larger populations.
How complete are the CDC's COVID-19 Case Surveillance datasets for race/ethnicity at the state and county levels?
Google, Inc. (2021)
The Covid Tracking Project was the most reliable source for COVID-19 data with race/ethnicity at the state level until it stopped collecting data on March 7, 2021. The CDC's Case Surveillance Restricted Access and Public Use with Geography datasets are the only available replacements for the Covid Tracking Project's dataset, and they additionally include county-level data and age along with race/ethnicity. This paper evaluates the completeness of the CDC datasets at the state and county levels in terms of (1) the total number of cases included compared to the New York Times, and (2) the number of cases included with race/ethnicity data compared to the Covid Tracking Project.
The CDC's Restricted Access dataset contains 78% of the cases in the New York Times up to April 15, 2021, and 65% of cases have race/ethnicity information vs. 67% in the Covid Tracking Project. The dataset's completeness has steadily and gradually improved over time; e.g., the first available version from May 2020 had race/ethnicity information for only 43% of cases. At the state and county levels, the dataset's completeness has also improved with a state-level average of 62% of cases with race/ethnicity in April 2021 vs. 46% in June 2020. However, the dataset's completeness at the state level is highly variable; for example, Minnesota has 102% of the cases included in the New York Times, while Louisiana has only 4% of the cases in the New York Times. Minnesota has 91% of cases with race/ethnicity, while Louisiana has only 19% with race/ethnicity (vs. 94% in the Covid Tracking Project). Texas alone is missing 2.8M cases, accounting for more than a third of the total 7.1M missing cases. New York is missing race/ethnicity for 1.3M cases and California for 1.1M cases, accounting for more than a quarter of the 8.6M cases missing race/ethnicity when combined.
The CDC's Public Use with Geography dataset is similar to the Restricted Access dataset for total case counts, but is less complete due to more privacy suppression; e.g., only 49% of cases have race/ethnicity information.
Evaluating the Accuracy of Google Surveys
Google Inc. (2019)
Google Surveys is a market research platform that surveys internet and smartphone users. Our methodology whitepaper (g.co/SurveysWhitepaper) explains how Google Surveys works and discusses its ability to mitigate different kinds of biases. This paper evaluates the accuracy of Google Surveys by comparing its survey results against benchmarks and other online survey platforms.
Preview abstract
Google Surveys is a market research platform that surveys internet and smartphone users. Since its launch in 2012, Google Surveys has evolved in several ways: the maximum number of questions per survey has increased from two to 10, the online panel has expanded to tens of millions of unique daily users, and a new mobile app panel has 4M active users and additional targeting capabilities. This paper will explain how Google Surveys works as of May 2017, while also discussing its advantages and limitations for mitigating different kinds of biases.