Santiago Ontañón


PhD at the Artificial Intelligence Research Institute (IIIA-CSIC) in Barcelona (Spain), advised by Dr. Enric Plaza (2005). Postdoctoral fellow at Georgia Tech under Ashwin Ram (2006-2009). Juan de la Cierva researcher at IIIA-CSIC (2009-2011). Associate Professor of Computer Science at Drexel University (2012-2023). Research Faculty at Drexel University (2023-present). Research Scientist at Google (2019-present).
Authored Publications
    Data Bootstrapping for Interactive Recommender Systems
    Ajay Joshi
    Ajit Apte
    Anand Kesari
    Anushya Subbiah
    Dima Kuzmin
    John Anderson
    Li Zhang
    Marty Zinkevich
    The 2nd International Workshop on Online and Adaptive Recommender Systems (2022)
    Modifying recommender systems for new kinds of user interactions is costly, and exploration is slow, since machine learning models can be trained and evaluated on live data only after a product supporting these new interactions is deployed. Our data bootstrapping approach moves the task of developing models for new interactions into the input representation, allowing a standard machine learning model (e.g. a transformer model) to be used to train a model capturing the new interactions. More specifically, we use data obtained from a launched system to generate simulated data that includes the new interaction options. This approach helps accelerate model and algorithm development, and reduce the time to launch new interaction experiences. We present machine learning methods designed specifically to work well with limited and noisy data produced via data bootstrapping.
    Making Transformers Solve Compositional Tasks
    Joshua Ainslie
    Vaclav Cvicek
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics (2022), pp. 3591-3607
    Several studies have reported the inability of Transformer models to generalize compositionally. In this paper, we explore the design space of Transformer models, showing that several design decisions, such as the position encodings, decoder type, model architecture, and encoding of the target task, imbue Transformers with different inductive biases, leading to better or worse compositional generalization. In particular, we show that Transformers can generalize compositionally significantly better than previously reported in the literature if configured appropriately.
    LogicInference: A New Dataset for Teaching Logical Inference to seq2seq Models
    Joshua Ainslie
    Vaclav Cvicek
    ICLR 2022 Workshop on Elements of Reasoning: Objects, Structure and Causality
    Machine learning models such as Transformers or LSTMs struggle with tasks that are compositional in nature, such as those involving reasoning/inference. Although many datasets exist to evaluate compositional generalization, when it comes to evaluating inference abilities, options are more limited. This paper presents LogicInference, a new dataset to evaluate the ability of models to perform logical inference. The dataset focuses on inference using propositional logic and a small subset of first-order logic, represented both in semi-formal logical notation and in natural language. We also report initial results using a collection of machine learning models to establish an initial baseline on this dataset.
    FNet: Mixing Tokens with Fourier Transforms
    Ilya Eckstein
    James Patrick Lee-Thorp
    Joshua Ainslie
    NAACL 2022 (Association for Computational Linguistics)
    We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains nearly seven times faster on GPUs and twice as fast on TPUs. The resulting model, FNet, also scales very efficiently to long inputs. Specifically, when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, but is faster than the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
    LongT5: Efficient Text-To-Text Transformer for Long Sequences
    Joshua Ainslie
    David Uthus
    Jianmo Ni
    Yinfei Yang
    Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics
    Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
    Mondegreen: A Post-Processing Solution to Speech Recognition Error Correction for Voice Search Queries
    Ajit Apte
    Ambarish Jash
    Amol H Wankhede
    Ankit Kumar
    Ayooluwakunmi Jeje
    Dima Kuzmin
    Ellie Ka In Chio
    Harry Fung
    Jon Effrat
    Nitin Jindal
    Pei Cao
    Senqiang Zhou
    Sukhdeep S. Sodhi
    Tameen Khan
    Tarush Bali
    KDD (2021)
    As more and more online search queries come from voice, automatic speech recognition becomes a key component to deliver relevant search results. Errors introduced by automatic speech recognition (ASR) lead to irrelevant search results returned to the user, thus causing user dissatisfaction. In this paper, we introduce an approach, Mondegreen, to correct voice queries in text space without depending on audio signals, which may not always be available due to system constraints or to privacy or bandwidth considerations (for example, some ASR systems run on-device). We focus on voice queries transcribed via several proprietary commercial ASR systems. These queries come from users making internet or online service search queries. We first present an analysis showing how different the language distribution coming from user voice queries is from that in traditional text corpora used to train off-the-shelf ASR systems. We then demonstrate that Mondegreen can achieve significant improvements in increased user interaction by correcting user voice queries in one of the largest search systems in Google. Finally, we see Mondegreen as complementing existing highly-optimized production ASR systems, which may not be frequently retrained and thus lag behind due to vocabulary drifts.
    Transformer-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (in terms of memory mainly) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis demonstrates the need for having O(1) global tokens, such as CLS, that attend to the entire sequence as part of the sparse attention. We show that the proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering.
    ETC: Encoding Long and Structured Inputs in Transformers
    Anirudh Ravula
    Joshua Ainslie
    Li Yang
    Qifan Wang
    Vaclav Cvicek
    2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
    Transformer models have advanced the state of the art in many NLP tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key limitations of existing architectures, namely: scaling input length, and ingesting structured inputs. The main innovation is a new global-local attention mechanism between a global memory and the input tokens, which allows scaling attention to longer inputs. We show that combining global-local attention with relative position encodings and a Contrastive Predictive Coding (CPC) pre-training task allows ETC to naturally handle structured data. We achieve new state-of-the-art results on two natural language datasets requiring long and/or structured inputs.