Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10119 publications
    Preview abstract We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omni-directional microphones without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time difference of arrival (TDOA) cues, or what we call delay contrast, to separate target and interference sources while remaining robust in various reverberation environments. We demonstrate the model is not only generalizable to a commercially available device with a slightly different microphone geometry, but also outperforms our previous work which uses one additional microphone on the same device. The model runs in real-time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing. View details
    Preview abstract With growing machine learning (ML) and large language model applications in healthcare, there have been calls for fairness in ML to understand and mitigate ethical concerns these systems may pose. Fairness has implications for health in Africa, which already has inequitable power imbalances between the Global North and South. This paper seeks to explore fairness for global health, with Africa as a case study. We conduct a scoping review to propose fairness attributes for consideration in the African context and delineate where they may come into play in different ML-enabled medical modalities. We then conduct qualitative research studies with 625 general population study participants in 5 countries in Africa and 28 experts in ML, Health, and/or policy focussed on Africa to obtain feedback on the proposed attributes. We delve specifically into understanding the interplay between AI, health and colonialism. Our findings demonstrate that among experts there is a general mistrust that technologies that are solely developed by former colonizers can benefit Africans, and that associated resource constraints due to pre-existing economic and infrastructure inequities can be linked to colonialism. General population survey responses found about an average of 40% of people associate an undercurrent of colonialism to AI and this was most dominant amongst participants from South Africa. However the majority of the general population participants surveyed did not think there was a direct link between AI and colonialism.Colonial history, country of origin, National income level were specific axes of disparities that participants felt would cause an AI tool to be biased This work serves as a basis for policy development around Artificial Intelligence for health in Africa and can be expanded to other regions. View details
    Preview abstract Use of Text-to-Image models is expanding beyond generating generic objects, as they are increasingly being adopted by diverse global communities to create visual representations of their unique culture. Current T2I benchmarks primarily evaluate image-text alignment, aesthetics and fidelity of generations for complex prompts with generic objects, overlooking the critical dimension of cultural understanding. In this work, we address this gap by defining a framework to evaluate cultural competence of T2I models, and present a scalable approach to collect cultural artifacts unique to a particular culture from Knowledge Graphs and Large Language Models in tandem. We assess the ability of state-of-the-art T2I models to generate culturally faithful and realistic images across 8 countries and 3 cultural domains. Furthermore, we emphasize the importance of T2I models reflecting a culture's diversity and introduce cultural diversity as a novel metric for T2I evaluation, drawing inspiration from the Vendi Score. We introduce T2I-GCube, a first-of-its-kind benchmark for T2I evaluation. T2I-GCube includes cultural prompts, metrics, and cultural concept spaces, enabling comprehensive assessment of T2I models' cultural knowledge and diversity. Our evaluations reveal significant gaps in the cultural knowledge of existing models and provide valuable insights into the diversity of image outputs for under-specified prompts. By introducing a novel approach to evaluating cultural diversity and knowledge in T2I models, T2I-GCube will be instrumental in fostering the development of models with enhanced cultural competence. View details
    A Decentralized SDN Architecture for the WAN
    Nitika Saran
    Ashok Narayanan
    Sylvia Ratnasamy
    Ankit Singla
    Hakim Weatherspoon
    2024 ACM Special Interest Group on Data Communication (SIGCOMM) (2024)
    Preview abstract Motivated by our experiences operating a global WAN, we argue that SDN’s reliance on infrastructure external to the data plane has significantly complicated the challenge of maintaining high availability. We propose a new decentralized SDN (dSDN) architecture in which SDN control logic instead runs within routers, eliminating the control plane’s reliance on external infrastructure and restoring fate sharing between control and data planes. We present dSDN as a simpler approach to realizing the benefits of SDN in the WAN. Despite its much simpler design, we show that dSDN is practical from an implementation viewpoint, and outperforms centralized SDN in terms of routing convergence and SLO impact. View details
    Preview abstract Help documents are supposed to aid smartphone users in resolving queries such as "How to block calls from unknown numbers?". However, given a query, identifying the right help document, understanding instructions from the document, and using them to resolve the issue at hand is challenging. The user experience may be enhanced by converting the instructions in the help document to a step-by-step tutorial overlaid on the phone UI. Successful execution of this task requires overcoming research challenges in retrieval, parsing, and grounding in the multilingual-multimodal setting. For example, user queries in one language may have to be matched against instructions in another language, which in turn needs to be grounded in a multimodal UI in yet another language. Moreover, there isn’t any relevant dataset for such a task. In order to bridge this gap, we introduce UGIF-DataSet, a multi-lingual, multi-modal UI grounded dataset for step-by-step task completion on the smartphone, containing 4,184 tasks across 8 languages. The instruction steps in UGIF-DataSet are available only in English, so the challenge involves operations in the cross-modal, cross-lingual setting. We compare the performance of different large language models for this task and find that the end-to-end task completion rate drops from 48% in English to 32% for other languages, demonstrating significant overall headroom for improvement. We are hopeful that UGIF-DataSet and our analysis will aid further research on the important problem of sequential task completion in the multilingual and multimodal setting. View details
    Towards Conversational Diagnostic AI
    Anil Palepu
    Khaled Saab
    Jan Freyberg
    Ryutaro Tanno
    Amy Wang
    Brenna Li
    Nenad Tomašev
    Karan Singhal
    Le Hou
    Albert Webson
    Kavita Kulkarni
    Sara Mahdavi
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Arxiv (2024) (to appear)
    Preview abstract At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI. View details
    DORSal: Diffusion for Object-centric Representations of Scenes et al.
    Allan Jabri
    Emiel Hoogeboom
    Thomas Kipf
    International Conference on Learning Representations (2024)
    Preview abstract Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing, is now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches. View details
    Large Scale Self-Supervised Pretraining for Active Speaker Detection
    Alice Chuang
    Keith Johnson
    Tony (Tuấn) Nguyễn
    Wei Xia
    Yunfan Ye
    ICASSP 2024 (2024) (to appear)
    Preview abstract In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD) on an unlabeled dataset consisting of over 125k hours of YouTube videos. When compared to a baseline trained from scratch on much smaller in-domain labeled datasets we show that with pretraining we not only have a more stable supervised training due to better audio-visual features used for initialization, but also improve the ASD mean average precision by 23\% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions. View details
    Preview abstract Generative AI (GAI) is proliferating, and among its many applications are to support creative work (e.g., generating text, images, music) and to enhance accessibility (e.g., captions of images and audio). As GAI evolves, creatives must consider how (or how not) to incorporate these tools into their practices. In this paper, we present interviews at the intersection of these applications. We learned from 10 creatives with disabilities who intentionally use and do not use GAI in and around their creative work. Their mediums ranged from audio engineering to leatherwork, and they collectively experienced a variety of disabilities, from sensory to motor to invisible disabilities. We share cross-cutting themes of their access hacks, how creative practice and access work become entangled, and their perspectives on how GAI should and should not fit into their workflows. In turn, we offer qualities of accessible creativity with responsible AI that can inform future research. View details
    Preview abstract Browser fingerprinting is often associated with cross-site user tracking, a practice that many browsers (e.g., Safari, Brave, Edge, Firefox and Chrome) want to block. However, less is publicly known about its uses to enhance online safety, where it can provide an additional security layer against service abuses (e.g., in combination with CAPTCHAs) or during user authentication. To the best of our knowledge, no fingerprinting defenses deployed thus far consider this important distinction when blocking fingerprinting attempts, so they might negatively affect website functionality and security. To address this issue we make three main contributions. First, we propose and evaluate a novel machine learning-based method to automatically identify authentication pages (i.e. sign-in and sign-up pages). Our algorithm -- which relies on a hybrid unsupervised/supervised approach -- achieves 96-98% precision and recall on a large, manually-labelled dataset of 10,000 popular sites. Second, we compare our algorithm with other methods from prior works on the same dataset, showing that it significantly outperforms all of them (+83% F1-score). Third, we quantify the prevalence of fingerprinting scripts across sign-in and sign-up pages (9.2%) versus those executed on other pages (8.9%); while the rates of fingerprinting are similar, home pages and authentication pages differ in the third-party scripts they include and how often these scripts are labeled as tracking. We also highlight the substantial differences in fingerprinting behavior on login and sign-up pages. Our work sheds light on the complicated reality that fingerprinting is used to both protect user security and invade user privacy, and that this dual nature must be considered by fingerprinting mitigations. View details
    Leveraging Foundation Models for Zero-Shot IoT Sensing
    Dinghao Xue
    Tao Chen
    Guohao Lan
    Qun Song
    ECAI 2024: The 27th European Conference on Artificial Intellegence (to appear)
    Preview abstract Zero-shot learning (ZSL) aims to classify data of unseen classes with the help of semantic information. Foundation models (FMs) trained on web-scale data have shown impressive ZSL capability in natural language processing and visual understanding. However, leveraging FMs' generalized knowledge for zero-shot Internet of Things (IoT) sensing using signals such as mmWave, IMU, and Wi-Fi has not been fully investigated. In this work, we align the IoT data embeddings with the semantic embeddings generated by an FM's text encoder for zero-shot IoT sensing. To utilize the physics principles governing the generation of IoT sensor signals to derive more effective prompts for semantic embedding extraction, we propose to use cross-attention to combine a learnable soft prompt that is optimized automatically on training data and an auxiliary hard prompt that encodes domain knowledge of the IoT sensing task. To address the problem of IoT embeddings biasing to seen classes due to the lack of unseen class data during training, we propose a data augmentation method to synthesize unseen class IoT data for fine-tuning IoT feature extractor and embedding projector. We evaluate our approach on multiple IoT sensing tasks. Experiment results show that our approach achieves superior open-set detection and generalized zero-shot learning performance compared with various baselines. View details
    Preview abstract To tackle the challenge of optimizing middle-mile logistics, the crucial link between warehouses and final deliveries, we introduce a novel instance generator that aims to create a rich and adaptable dataset of diverse instances to empower researchers and developers. The instance defines a logistics network with hubs, vehicles, routes, lines, and rotations. Additionally, it specifies a list of shipments that need to be transported through this network. To customize the instance, the user can adjust various parameters, such as the number of hubs, density of the space graphs, distribution of shipment weights, or the maximum number of vehicles. The generator reflects real-world complexities through variations in network size and structure. We developed a random graph generator to mimic real-world middle mile networks, by generating space graphs for hubs. Subsequently, lines and routes are randomly constructed on the generated space graphs, while adhering to user-defined constraints. The tool is in the form of an optimized C++ library that enables the generation of instances with a large number of hubs and shipments. It offers the immense potential for advancing middle-mile logistics optimization by providing a comprehensive and adaptable dataset for benchmarking optimization approaches, training machine learning models, and analyzing the impact of network configurations and shipments characteristics on overall efficiency. View details
    Preview abstract We explore the boundaries of scaling up a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix. View details
    Preview abstract We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained endto-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text pairs, enabling a ‘cross-modal’ chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM as demonstrated through spoken QA datasets. We release our audio samples and spoken QA dataset via our website. View details
    Preview abstract Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing large language models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs. View details