Emily Reif

I'm a software engineer on Google's People and AI Research team. I make tools for researchers, students, laypeople and other end users to better understand the ML models that are now ubiquitous in our lives. A short list of these projects includes the Embedding Projector, this recent paper on interpretability for language models, the Waterfall of Meaning, and SMILY, a tool for pathologists.
Authored Publications
    LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
    Michael Xieyang Liu
    Krystal Kallarackal
    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024)
    Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at Google. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
    Automatic Histograms: Leveraging Language Models for Text Dataset Exploration
    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM, Honolulu, HI, USA (2024), pp. 9
    Making sense of unstructured text datasets is perennially difficult, yet increasingly relevant with Large Language Models. Data practitioners often rely on dataset summaries, especially distributions of various derived features. Some features, like toxicity or topics, are relevant to many datasets, but many interesting features are domain specific, e.g., instruments and genres for a music dataset, or diseases and symptoms for a medical dataset. Accordingly, data practitioners often run custom analyses for each dataset, which is cumbersome and difficult, or use unsupervised methods. We present AutoHistograms, a visualization tool leveraging LLMs. AutoHistograms automatically identifies relevant entity-based features, visualizes their distributions, and allows the user to interactively query the dataset for new categories of entities. In a user study with data practitioners (n=10), we observe that participants were able to quickly onboard to AutoHistograms, use the tool to identify actionable insights, and conceptualize a broad range of applicable use cases. We also describe a variety of usage scenarios from different types of users to highlight how this app can provide value in many different contexts. Finally, we present a quantitative evaluation of the tool. Together, this tool and user study contribute to the growing field of LLM-assisted sensemaking tools.
    Understanding the Dataset Practitioners Behind Large Language Models
    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM, Honolulu, HI, USA (2024)
    As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What is it to be a dataset practitioner doing this work? We approach this in two parts: first, we define the role of "dataset practitioners" by performing a retrospective analysis on the responsibilities of teams contributing to LLM development at a technology company, Google. Then, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it. Consequently, practitioners either rely on their own intuition or write custom code to evaluate their data. We discuss potential reasons for this phenomenon and opportunities for alignment.
    Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. The live demo is available at https://shorturl.at/zHOUV.
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    arxiv:2204.02311 (2022)
    Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
    A Gentle Introduction to Graph Neural Networks
    Adam Pearce
    Alex Wiltschko
    Benjamin Sanchez-Lengeling
    Distill (2021)
    Neural networks have been adapted to leverage the structure and properties of graphs. We explore the components needed for building a graph neural network and motivate the design choices behind them.
    The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models
    Andy Coenen
    Sebastian Gehrmann
    Ellen Jiang
    Carey Radebaugh
    Ann Yuan
    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics (to appear)
    We present the Language Interpretability Tool (LIT), an open-source platform for visualization and understanding of NLP models. We focus on core questions about model behavior: Why did my model make this prediction? When does it perform poorly? What happens under a controlled change in the input? LIT integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis. We include case studies for a diverse set of workflows, including exploring counterfactuals for sentiment analysis, measuring gender bias in coreference systems, and exploring local behavior in text generation. LIT supports a wide range of models (including classification, seq2seq, and structured prediction) and is highly extensible through a declarative, framework-agnostic API. LIT is under active development, with code and full documentation available at https://github.com/pair-code/lit.
    Evaluating Attribution for Graph Neural Networks
    Alexander B Wiltschko
    Benjamin Sanchez-Lengeling
    Brian Lee
    Jennifer Wei
    Wesley Qian
    Yiliu Wang
    Advances in Neural Information Processing Systems 33 (2020)
    Interpretability of machine learning models is critical to scientific understanding, AI safety, and debugging. Attribution is one approach to interpretability, which highlights input dimensions that are influential to a neural network’s prediction. Evaluation of these methods is largely qualitative for image and text models, because acquiring ground truth attributions requires expensive and unreliable human judgment. Attribution has been comparatively understudied for graph neural networks (GNNs), a model class of growing importance that makes predictions on arbitrarily-sized graphs. Graph-valued data offer an opportunity to quantitatively benchmark attribution methods, because challenging synthetic graph problems have computable ground-truth attributions. In this work we adapt commonly-used attribution methods for GNNs and quantitatively evaluate them using the axes of attribution accuracy, stability, faithfulness and consistency. We make concrete recommendations for which attribution methods to use, and provide the data and code for our benchmarking suite. Rigorous and open source benchmarking of attribution methods in graphs could enable new methods development and broader use of attribution in real-world ML tasks.
    Machine learning (ML) is increasingly being used in image retrieval systems for medical decision making. One application of ML is to retrieve visually similar medical images from past patients (e.g. tissue from biopsies) to reference when making a medical decision with a new patient. However, no algorithm can perfectly capture an expert's ideal notion of similarity for every case: an image that is algorithmically determined to be similar may not be medically relevant to a doctor's specific diagnostic needs. In this paper, we identified the needs of pathologists when searching for similar images retrieved using a deep learning algorithm, and developed tools that empower users to cope with the search algorithm on-the-fly, communicating what types of similarity are most important at different moments in time. In two evaluations with pathologists, we found that these refinement tools increased the diagnostic utility of images found and increased user trust in the algorithm. The tools were preferred over a traditional interface, without a loss in diagnostic accuracy. We also observed that users adopted new strategies when using refinement tools, re-purposing them to test and understand the underlying algorithm and to disambiguate ML errors from their own errors. Taken together, these findings inform future human-ML collaborative systems for expert decision-making.
    Similar Image Search for Histopathology: SMILY
    Jason Hipp
    Michael Emmert-Buck
    Daniel Smilkov
    Mahul Amin
    Craig Mermel
    Lily Peng
    Martin Stumpe
    Nature Partner Journal (npj) Digital Medicine (2019)
    The increasing availability of large institutional and public histopathology image datasets is enabling the searching of these datasets for diagnosis, research, and education. Although these datasets typically have associated metadata such as diagnosis or clinical notes, even carefully curated datasets rarely contain annotations of the location of regions of interest on each image. As pathology images are extremely large (up to 100,000 pixels in each dimension), further laborious visual search of each image may be needed to find the feature of interest. In this paper, we introduce a deep-learning-based reverse image search tool for histopathology images: Similar Medical Images Like Yours (SMILY). We assessed SMILY’s ability to retrieve search results in two ways: using pathologist-provided annotations, and via prospective studies where pathologists evaluated the quality of SMILY search results. As a negative control in the second evaluation, pathologists were blinded to whether search results were retrieved by SMILY or randomly. In both types of assessments, SMILY was able to retrieve search results with similar histologic features, organ site, and prostate cancer Gleason grade compared with the original query. SMILY may be a useful general purpose tool in the pathologist’s arsenal, to improve the efficiency of searching large archives of histopathology images, without the need to develop and implement specific tools for each application.