Ji Ma

I was born in the northeastern part of China, where I could enjoy heavy snow in winter. I completed my B.Sc. at Jiangsu University, where heavy snow rarely falls. After that, I went back north to pursue the snow together with my PhD. In 2015, I graduated from Northeastern University, where I worked with Jingbo Zhu and Tong Xiao. Since then, I have been working at Google. My research mainly focuses on natural language processing and information retrieval.
Authored Publications
    RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses
    Jianmo Ni
    Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2023)
    Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but also can be fine-tuned with "pairwise" or "listwise" ranking losses to optimize ranking performance. Our experiments show that the proposed models with ranking losses can achieve substantial ranking performance gains on different public text ranking data sets. Moreover, ranking models fine-tuned with listwise ranking losses have better zero-shot ranking performance on out-of-domain data than models fine-tuned with classification losses.
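The "pairwise" and "listwise" ranking losses named in the RankT5 abstract can be illustrated with a minimal sketch. It assumes one common form of each loss (a pairwise logistic loss and a listwise softmax cross-entropy) computed over the scores of a single candidate list; the function names and toy inputs are illustrative and are not taken from the paper's code.

```python
import math


def pairwise_logistic_loss(scores, labels):
    """Sum of log(1 + exp(-(s_i - s_j))) over pairs where item i is more relevant than item j."""
    loss = 0.0
    for s_i, y_i in zip(scores, labels):
        for s_j, y_j in zip(scores, labels):
            if y_i > y_j:
                loss += math.log1p(math.exp(-(s_i - s_j)))
    return loss


def listwise_softmax_loss(scores, labels):
    """Softmax cross-entropy between the label vector and the softmax of the list's scores."""
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    # Each relevant item contributes its negative log-probability under the softmax.
    return -sum(y * (s - log_z) for s, y in zip(scores, labels))


# Toy list: the first document is relevant, the second is not.
good = listwise_softmax_loss([2.0, 0.0], [1, 0])  # relevant document scored higher
bad = listwise_softmax_loss([0.0, 2.0], [1, 0])   # relevant document scored lower
```

Both losses operate on raw ranking scores rather than class probabilities, which is why they require a model that outputs one score per query-document pair.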
    Learning List-Level Domain-Invariant Representations for Ranking
    Ruicheng Xian
    Hamed Zamani
    Han Zhao
    37th Conference on Neural Information Processing Systems (NeurIPS 2023)
    Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied on classification and regression problems, its adoption on ranking problems is sporadic, and the few existing implementations lack theoretical justifications. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that they implement what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems where the data and the metrics are defined and computed on lists, not the items by themselves. To close this discrepancy, we propose list-level alignment: learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking.
    The availability of large, high-quality datasets has been one of the main drivers of recent progress in question answering (QA). Such annotated datasets, however, are difficult and costly to collect, and rarely exist in languages other than English, rendering QA technology inaccessible to underrepresented languages. An alternative to building large monolingual training datasets is to leverage pre-trained language models (PLMs) under a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained, thus avoiding costly annotation. Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines, bridges nearly 60% of the gap between an English-only baseline and a fully supervised upper bound trained on almost 50,000 hand-labeled examples, and always leads to substantial improvements compared to fine-tuning a QA model directly on labeled examples in low-resource settings. Experiments on the TyDiQA-GoldP and MLQA benchmarks show that few-shot prompt tuning for data synthesis scales across languages and is a viable alternative to large-scale annotation.
    Large Dual Encoders Are Generalizable Retrievers
    Jianmo Ni
    Zhuyun Dai
    Vincent Zhao
    Yi Luan
    Keith B. Hall
    Ming-Wei Chang
    Yinfei Yang
    (2022)
    It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck embedding size fixed. With multi-stage training, surprisingly, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), outperform existing sparse and dense retrievers on the BEIR dataset (Thakur et al., 2021) significantly. Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS Marco supervised data to achieve the best out-of-domain performance.
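The dot-product bottleneck described in the abstract above can be made concrete with a small sketch. It assumes the query and passages have already been encoded into fixed-size vectors; the toy three-dimensional embeddings and the `retrieve` helper are hypothetical and stand in for real encoder outputs.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, passage_vecs, k=2):
    """Return the indices of the k passages with the highest dot-product score."""
    scored = [(dot(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    # Sort by descending score; break ties by passage index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]


# Hypothetical embeddings; a real system would obtain these from the two encoders.
query = [1.0, 0.0, 0.5]
passages = [
    [0.0, 1.0, 0.0],  # unrelated passage
    [0.9, 0.1, 0.4],  # closest to the query
    [0.5, 0.5, 0.0],  # partially related
]
top = retrieve(query, passages, k=2)
```

Because the query and each passage interact only through this single score, passage vectors can be computed once offline, whatever the size of the encoders that produced them.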
    Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
    Zero-shot Hybrid Retrieval and Reranking Models for Biomedical Literature
    Keith B. Hall
    CLEF 2022: Conference and Labs of the Evaluation Forum
    We describe our participating system in the document retrieval sub-task (Task B Phase A) at the 10th BioASQ challenge. We designed and implemented a zero-shot hybrid model using only synthetic training data. The model consists of two stages: retrieval and reranking. The retrieval model is a hybrid of sparse and dense retrieval models, which is an extension of our participating system at the 8th BioASQ challenge. We improved the dense retrieval model with a T5-based synthetic question generation model and an iterative training strategy involving techniques to filter low-quality synthetic data. In the second stage, we proposed a hybrid reranking model, which is trained using the candidates retrieved from the first stage. We further study whether the knowledge from the hybrid reranking model can be transferred to the dense retrieval model through distillation. Our experiments show the proposed hybrid ranking model is effective with different first-stage retrieval models and applying reciprocal rank fusion on them brings additional boosts. Evaluation shows that our model compares favorably with other top participating systems, achieving MAP scores of 0.4696, 0.3984, 0.4586, 0.4089, 0.4065 and 0.1704 on six batches.
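Reciprocal rank fusion, mentioned in the abstract above, combines several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below assumes the standard formulation with the conventional constant k = 60; the abstract does not state which constant or variant the system used, and the document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into a single list, best document first."""
    fused = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical outputs of a sparse and a dense retriever for the same query.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused_ranking = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

The method uses only rank positions, so the retrievers' scores do not need to be on comparable scales.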
    State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms, however, are not without flaws, i.e., running the model on all query-document pairs at inference-time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model in the form of document-to-query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference time speedups since the decoder-only architecture only needs to learn to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results that are comparable to the more expensive cross-attention ranking approaches while being up to 6.8X faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models.
    We provide the first exploration of sentence embeddings from text-to-text transformers (T5) including the effects of scaling up sentence encoders to 11B parameters. Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods to construct Sentence-T5 (ST5) models: two utilize only the T5 encoder and one uses the full T5 encoder-decoder. We establish a new sentence representation transfer benchmark, SentGLUE, which extends the SentEval toolkit to nine tasks from the GLUE benchmark. Our encoder-only models outperform the previous best models on both SentEval and SentGLUE transfer tasks, including semantic textual similarity (STS). Scaling up ST5 from millions to billions of parameters is shown to consistently improve performance. Finally, our encoder-decoder method achieves a new state-of-the-art on STS when using sentence embeddings.
    Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval
    Jianmo Ni
    Yinfei Yang
    EMNLP 2021, Association for Computational Linguistics (2021), pp. 6091-6103
    In this paper we explore the effects of negative sampling in dual encoder models used to retrieve passages in automatic question answering tasks. We explore four negative sampling strategies that complement the straightforward random sampling of negatives, typically used to train dual encoder models. Out of the four strategies, three are based on retrieval and one on heuristics. Of the three retrieval-based strategies, two are based on the semantic similarity between the actual passage and its alternatives and another one is based on the lexical overlap between them. In our experiments we train the dual encoder models in two stages: pre-training with synthetic data and fine-tuning with domain-specific data. Negative sampling is applied in both stages. Our negative sampling is particularly useful when we augment the generic data for pre-training with synthetic examples. We evaluate our approach in three passage retrieval tasks for open-domain question answering. Even though it is not evident that there is one single sampling strategy that works best in all three tasks, it is clear that they all contribute to improving the contrast between the actual retrieval and its alternatives. Furthermore, mixing the negatives from different strategies can achieve performance on par with the best performing strategy in all tasks. Our results establish a new state-of-the-art level of performance on two of the open-domain question answering tasks that we evaluated.
    Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation
    Ivan Korotkov
    Yinfei Yang
    Keith B. Hall
    Ryan Thomas Mcdonald
    Association for Computational Linguistics, Online, pp. 1075-1088
    A major obstacle to the widespread adoption of neural retrieval models is that they require large supervised training sets to surpass traditional term-based techniques, which are constructed from raw corpora. In this paper, we propose an approach to zero-shot learning for passage retrieval that uses synthetic question generation to close this gap. The question generation system is trained on general domain data, but is applied to documents in the targeted domain. This allows us to create arbitrarily large, yet noisy, question-passage relevance pairs that are domain specific. Furthermore, when this is coupled with a simple hybrid term-neural model, first-stage retrieval performance can be improved further. Empirically, we show that this is an effective strategy for building neural passage retrieval models in the absence of large training corpora. Depending on the domain, this technique can even approach the accuracy of supervised models.