Pearl: Personalizing Large Language Model Writing
Assistants with Generation-Calibrated Retrievers

Sheshera Mysore1△† Zhuoran Lu2 Mengting Wan1 Longqi Yang1
Bahar Sarrafzadeh1 Steve Menezes1 Tina Baghaee1
Emmanuel Barajas Gonzalez1 Jennifer Neville1 Tara Safavi1
2Purdue University, IN, USA 1Microsoft, WA, USA
Corresponding authors: [email protected], [email protected]
  Work done during internship at Microsoft Research.
Abstract

Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author’s communication style, specialized knowledge, and values. In this paper, we address this challenge by proposing Pearl, an LLM writing assistant personalized with a retriever that is trained to be generation-calibrated for personalization. Generation calibration ensures that our retriever selects historic user-authored documents to augment an LLM prompt such that they are likely to help an LLM generation better adhere to a user’s preferences. We propose two key novelties for training such a retriever: (1) A training data selection method that identifies historical user requests likely to benefit from personalization and documents that provide that benefit; and (2) A scale-calibrating KL-divergence objective that ensures that our retriever scores remain proportional to the downstream generation quality from using the document for personalized generation. In a series of holistic evaluations, we demonstrate the effectiveness of Pearl in generating long-form texts on multiple social media datasets. Finally, we demonstrate how a generation-calibrated retriever can double as a performance predictor – detecting low-quality retrieval, and improving potentially under-performing outputs via revision with LLMs.

1 Introduction

Machine-assisted writing has seen a long history of development, progressing from providing simple syntactic checks, to revising human-authored text, to recent assistants being able to fully compose texts on direction from authors (Mahlow, 2023; Dale and Viethen, 2021). The text-generation capabilities of current LLMs have led current research to explore a new frontier of writing assistants for complex applications such as knowledge synthesis (Shen et al., 2023), peer review (Chen et al., 2023), and journalism (Wang et al., 2023c). An important element of effective writing assistants is being able to personalize generated text to retain the knowledge, style, and values of a user – an essential element of interpersonal communication (Pickering and Garrod, 2013). With current LLMs prone to generating overly generic text (Pu and Demberg, 2023), author personalization of LLMs is an important problem.

Figure 1: Pearl is a request-driven generation model that personalizes LLM outputs through retrieval augmentation with a generation calibrated retriever.

Personalizing LLM outputs may be seen as a form of alignment to individual users of the LLM (Kirk et al., 2023). However, leveraging fine-tuning for alignment in a personalization setup poses challenges to serving trained per-user models and obtaining sufficient per-user alignment training data. Therefore, we pursue in-context alignment through retrieval augmentation (Salemi et al., 2023; Li et al., 2023a). First, we assume access to a set of historic user-authored documents (e.g. emails, social media posts, etc.) and a user request for a personalized generation. To personalize LLM outputs we propose an approach to train a retrieval model that selects historic user documents to augment an LLM’s prompt. Historic documents capture users’ personal style, knowledge, and values and can serve as useful context for personalized generation. While training retrievers for non-personalized applications has been explored in prior work (Gonen et al., 2022), this exploration has been limited in personalized text generation. Finally, we pursue personalization of LLMs only accessible via prompt-based APIs since this represents a common form of accessing performant large-scale LLMs.

The starting point for our retriever is prior work examining effective prompts for non-personalized applications: Gonen et al. (2022) show the best prompts to be those with the highest conditional likelihood of generating a target text, and Rubin et al. (2022) use these likelihoods to train retrieval models for non-personalized retrieval augmentation of LLMs. While this approach performs well in non-personalized setups, personalized text generation presents unique challenges and opportunities: There are fewer historic documents per user ($\sim$hundreds) than in common non-personalized retrieval collections, and user requests may diverge from their history as users’ preferences change. A smaller retrieval corpus and shifting interests mean that not all requests can be satisfied by retrieval from a user’s historical documents – as a result, not all historic requests and documents are likely to be useful for training a retriever. Our first contribution addresses this: We present a novel difference of likelihoods-based method that identifies only the personalizable user requests and associated documents that are likely to personalize downstream generations, and use these to train our retriever.

Next, the personalization setup offers an opportunity: Fewer historical documents per user permit the use of expressive cross-encoder retrievers instead of the scalable but less expressive bi-encoders commonly used for non-personalized tasks (Rubin et al., 2022). However, cross-encoders produce skewed scores at the ends of their score ranges (Menon et al., 2022; Yadav et al., 2022), hampering their ability to closely track the utility of a document for personalized generation. We remedy this with our second contribution – a personalized scale-calibrating training objective (Yan et al., 2022). This ensures that scores from our retriever are generation-calibrated for personalization – i.e. the score it produces for a request-document pair is proportional to the output quality of an LLM prompted with the pair. In a case study, we show how generation calibration enables the retriever’s scores to be used for retrieval performance prediction – detecting low-quality retrievals, and revising potentially low-quality generations.

We instantiate Pearl with multiple LLMs, davinci-003 and gpt-35-turbo, at privacy-compliant enterprise API endpoints and evaluate it on a private dataset of workplace communications and a public dataset of Reddit comments. For evaluation, we use a variety of methods spanning intrinsic, extrinsic, and personalized LLM-as-judge evaluations to demonstrate the value of Pearl. Further, since we train calibrated retrieval models, we present additional evaluations for calibration, ablations, and analysis in the appendices. Our evaluations demonstrate that Pearl consistently matches or outperforms strong baseline approaches.

Figure 2: To train the retriever $f_{\textrm{retr}}$, an auxiliary language model is first used to identify historical requests that can be personalized and the best document to use for personalization (1). Then, $f_{\textrm{retr}}$ is trained on the selected data with a scale-calibrating loss function (2). Given an unseen request, $f_{\textrm{retr}}$ is used to select the best instances from historical texts for augmenting an LLM prompt for personalized generation (3). Our training results in a generation-calibrated retriever where scores for documents are proportional to the quality of the LLM output.

2 Related Work

Example selection for LLMs Early work on training retrievers for augmenting LLM contexts in non-personalized applications was proposed by Rubin et al. (2022). They train retrieval models by distilling LLM likelihoods of the target completions conditioned on the prompt. Similarly, Wang et al. (2023b) train retrieval models on finer-grained feedback from a trained reward model through distillation. More distantly, Zhang et al. (2022) train instance selection models on rewards from a downstream evaluation metric using reinforcement learning. Parallel with our work, Salemi et al. (2024) train bi-encoders for personalized classification and short text generation and find knowledge distillation from downstream LLMs to outperform reinforcement learning based training of retrievers. In this regard, Salemi et al. (2024) and Rubin et al. (2022) are closely related and represent the closest work to ours – we compare to such an approach in ablations (Appendix C.2). Despite similarities to our work, all prior work has explored training retrievers for document selection while assuming that satisfactory predictions can be made for all inputs/requests. In addition to selecting documents for training, we also select training requests that benefit from retrieval augmentation – a necessity in personalization where retrieval is performed over a smaller historical document set instead of a large shared corpus. Further, no prior approaches explore calibration for retrievers or their ability to identify low-quality retrievals and selectively revise LLM outputs – we explore this. Appendix D discusses additional work on optimizing prompts, robustness to prompt errors, and calibrated retrievers.

Personalized writing assistants While writing assistants have seen considerable exploration, only some prior work has focused on author personalization. These applications range from email (Chen et al., 2019; Trajanovski et al., 2021), to social media (Gero et al., 2022), and grammatical error correction (GEC) (Nadejde and Tetreault, 2019). These systems commonly leverage nearest-neighbor models (Chen et al., 2019; Trajanovski et al., 2021) and user-group level parameter-efficient fine-tuning for personalization (Nadejde and Tetreault, 2019). In contrast, we explore retrieval models for in-context alignment/personalization with LLMs. Parallel work has also explored personalized writing with LLMs. Li et al. (2023b) construct prompts with pre-trained retrieval and summarization models and fine-tune an LLM for personalized completion. Follow-on work has explored training a prompt-re-writer to tune prompts for a fixed LLM (Li et al., 2023a). Prompt re-writing is a complementary approach to a trained retriever, with future systems likely to benefit from both. Appendix D discusses non-personalized writing assistants and reader personalization.

3 Problem Definition

We consider a request-conditional, personalized text generation task. As input to the system, we assume a user $u$ who is associated with a set of $N_u$ historical documents $\mathcal{D}_u=\{d_u^{(i)}\}_{i=1}^{N_u}$, where each document $d_u$ may be a previously-authored social media post, email, etc. The user $u$ is further associated with a textual request $q_u$ submitted to the writing assistant. The request may be authored by the user or constructed from the task context. Explicitly authored requests are increasingly common in conversational LLM interfaces (Papenmeier et al., 2021), and task contexts may be seen as implicit requests e.g. email prefixes that require completion (Chen et al., 2019). Finally, we assume access to a large language model $f_{\textrm{LLM}}$ available via a prompt-based text generation API.

Given $\mathcal{D}_u$, $q_u$, and $f_{\textrm{LLM}}$, our retriever $f_{\textrm{retr}}$ is trained to select a subset of historical documents $\mathcal{D}_u^{\prime}\subset\mathcal{D}_u$ as few-shot examples for the LLM. Then the LLM generates a target text $t_u$ of up to 300 words: $t_u=f_{\textrm{LLM}}(\phi(q_u,\mathcal{D}_u^{\prime}))$, where $\phi$ is a prompt construction function that takes the user’s request and the retrieved historical documents as input, and $t_u$ reflects the style, knowledge, and values of $u$.
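The inference step $t_u=f_{\textrm{LLM}}(\phi(q_u,\mathcal{D}_u^{\prime}))$ can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `retrieve_top_k`, `build_prompt`, and `personalized_generate` are hypothetical helpers, the prompt template is invented (the actual prompts are not shown in this section), and `score_fn` and `llm` are placeholders for $f_{\textrm{retr}}$ and a prompt-based LLM API.

```python
from typing import Callable, List, Sequence


def retrieve_top_k(request: str, documents: Sequence[str],
                   score_fn: Callable[[str, str], float], k: int = 3) -> List[str]:
    """Rank a user's historical documents by retriever score and keep the top k."""
    ranked = sorted(documents, key=lambda doc: score_fn(request, doc), reverse=True)
    return ranked[:k]


def build_prompt(request: str, retrieved: Sequence[str]) -> str:
    """Hypothetical prompt construction function (phi); the template is an assumption."""
    examples = "\n\n".join(
        f"Example {i + 1}:\n{doc}" for i, doc in enumerate(retrieved)
    )
    return f"Past writing by the user:\n{examples}\n\nRequest: {request}\nResponse:"


def personalized_generate(request: str, documents: Sequence[str],
                          score_fn: Callable[[str, str], float],
                          llm: Callable[[str], str], k: int = 3) -> str:
    """t_u = f_LLM(phi(q_u, D'_u)), with D'_u chosen by the retriever."""
    retrieved = retrieve_top_k(request, documents, score_fn, k)
    return llm(build_prompt(request, retrieved))
```

In this sketch any callable can stand in for the retriever, so a simple lexical-overlap scorer is enough to exercise the pipeline before a trained cross-encoder is available.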

4 Proposed Approach

We present Pearl, an in-context aligned LLM-based model for personalized writing assistance. Our approach (Figure 2) consists of an offline retriever training stage and an online LLM inference stage. Offline, we train a retriever $f_{\textrm{retr}}:(q_u,d_u)\rightarrow\mathbb{R}$ that scores the user’s historical documents for their ability to personalize the output for a user request. Further, we ensure that $f_{\textrm{retr}}$ is generation calibrated i.e. the scores it produces for $(q_u,d_u)$ pairs are proportional to the quality of the generated text from using $(q_u,d_u)$ in a prompt. We train such a retriever through two key novelties: (1) Training data selection based on a novel difference of likelihoods from an auxiliary text generation model – we identify requests which benefit from personalization and documents which likely help personalize a target, and (2) A scale-calibrating training objective which ensures that retrievers closely track the benefit of request-document pairs for generation. Given a new request, our LLM is prompted to generate a target text $t_u$ conditioned on the request and the documents retrieved by $f_{\textrm{retr}}$.
Next, we describe the retriever training set construction (Algorithm 1), how we optimize the retriever, and the details of our implementation.

4.1 Training Data Setup

To optimize $f_{\textrm{retr}}$ for a personalized text generation task, we carefully create a training set for $f_{\textrm{retr}}$ from historical user documents by using an auxiliary text generation model $f_{\textrm{aux}}$ to identify which requests and documents will help to personalize the generation of a target text.

Data organization We organize the training data to create a setup close to the problem defined in §3. Given a set of $M$ users and their historical document sets $\{\mathcal{D}_u\}_{u=1}^{M}$, for each user $u$ we partition $\mathcal{D}_u$ into two non-overlapping sets, a candidate document set $\mathcal{D}_u^c\subset\mathcal{D}_u$, and a “target” text set $\mathcal{D}_u^t\subset\mathcal{D}_u$, such that $\mathcal{D}_u^c+\mathcal{D}_u^t=\mathcal{D}_u$. The partitioning is done temporally, i.e. the target texts occur after the candidate documents, mimicking the personalization scenario where past texts are used to personalize later targets. If time data isn’t available, the partitioning may be done randomly.
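The temporal partition can be sketched as follows. This is an illustrative Python sketch only: `temporal_partition` is a hypothetical helper, documents are assumed to arrive as `(timestamp, text)` pairs, and the number of target texts (`num_targets`) is an assumed parameter, since the split size is not specified in this section.

```python
from typing import List, Tuple


def temporal_partition(docs: List[Tuple[float, str]],
                       num_targets: int) -> Tuple[List[str], List[str]]:
    """Split one user's documents into (candidates, targets).

    The most recent `num_targets` documents become targets, so every target
    text occurs after every candidate document.
    """
    if not 0 < num_targets < len(docs):
        raise ValueError("num_targets must leave at least one candidate document")
    ordered = sorted(docs, key=lambda pair: pair[0])  # oldest first
    split = len(ordered) - num_targets
    candidates = [text for _, text in ordered[:split]]
    targets = [text for _, text in ordered[split:]]
    return candidates, targets
```

When timestamps are unavailable, the same function could be applied after assigning random sort keys, which corresponds to the random partitioning mentioned above.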

Next, for each target text $t_u$ in each user’s $\mathcal{D}_u^t$, we pair the text with a corresponding request $q_u$. For training, requests may be naturally present in the data, e.g., email prefixes that require completion (Chen et al., 2019), or they may be generated synthetically (Bonifacio et al., 2022). We detail request generation in §5.1.

Auxiliary model scoring Next, we use the auxiliary text generation model $f_{\textrm{aux}}$ to score each candidate document $d_u\in\mathcal{D}_u^c$ for its usefulness in producing the personalized target $t_u$ corresponding to the request $q_u$, for each pair $(q_u,t_u)\in\mathcal{D}_u^t$. We define the score as a difference in the likelihood, per $f_{\textrm{aux}}$, of the target given the request with and without the historical document:

$$y_{q_u}^{d_u}=\log p_{\textrm{aux}}(t_u \mid d_u,q_u)-\log p_{\textrm{aux}}(t_u \mid q_u). \tag{1}$$

Importantly, Eq. (1) is highest when the request is suitable for personalization and the candidate document is the “right” example for personalization. That is, the request alone is not sufficient for generating the target text (i.e., the quantity defined by the second term is lower), and this candidate document is particularly beneficial to generation (i.e., the quantity defined by the first term is higher). Finally, we assume model $f_{\textrm{aux}}$ to be smaller than $f_{\textrm{LLM}}$ to support efficient creation of training data, and that we have access to its token likelihoods. Appendix A shows prompts used for $f_{\textrm{aux}}$.
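As a small worked illustration of Eq. (1), the Python sketch below computes the score from per-token log-probabilities of the target text. It assumes the sequence log-likelihood is the plain sum of token log-probabilities (whether any length normalization is applied is not stated here), and both function names are hypothetical.

```python
from typing import Sequence


def sequence_log_likelihood(token_log_probs: Sequence[float]) -> float:
    """Log-likelihood of a target text as the sum of its token log-probabilities.

    Assumption: no length normalization is applied.
    """
    return sum(token_log_probs)


def personalization_score(logp_with_doc: Sequence[float],
                          logp_request_only: Sequence[float]) -> float:
    """Eq. (1): log p_aux(t | d, q) - log p_aux(t | q).

    Both inputs hold token log-probabilities of the SAME target text, scored
    once with the candidate document in the prompt and once without it.
    """
    return (sequence_log_likelihood(logp_with_doc)
            - sequence_log_likelihood(logp_request_only))
```

For instance, token log-probabilities of `[-0.5, -0.25, -0.25]` with the document and `[-1.0, -1.5, -0.5]` without it give a score of `-1.0 - (-3.0) = 2.0`: the request alone explains the target poorly and the document helps, which is the case Eq. (1) rewards.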

4.2 Training Data Selection

We use the scores from Eq. 1 to identify: (1) a subset of training requests that are likely to benefit from personalization; and (2) candidate documents that are likely to benefit those requests i.e. positive training documents.

Request selection Using Eq. 1, we score all request-target pairs of a user in $\mathcal{D}_u^t$ against all of their candidate documents $d_u\in\mathcal{D}_u^c$, across all $M$ users. After scoring, we retain the top-scoring $T$ request-target pairs. In practice, we find that setting $T$ to the top two-thirds across the dataset works well. This step reflects the intuition that not all request-target pairs will benefit from retrieval augmentation, either due to the lack of suitable candidate documents in a user’s historical document set, or due to underspecified requests making the target text simply too difficult to generate well – this is in contrast with RAG setups in non-personalized scenarios where a large retrieval corpus ensures that most requests are likely to benefit from retrieval. After obtaining a high-quality set of training requests $\{q^{*}_u\}_{t=1}^{T}$, we discard the target texts, since they aren’t used for training $f_{\textrm{retr}}$ or for inference.

Candidate document selection Next, we use Eq. 1 to select the best documents for the retained requests, i.e. identify positive training documents. Given a request $q_u^{*}$ selected for training, we take the $P$ highest-scoring candidate documents $d_u\in\mathcal{D}_u^c$ as per Eq. (1) as positives, $\{d^{+}_u\}_{p=1}^{P}$. We sample $N$ negative samples per positive randomly from the candidate document set for the user.

Algorithm 1 Selecting requests and positive candidate documents to train $f_{\textrm{retr}}$
Input: $\{\mathcal{D}_u\}_{u=1}^{M}$, $f_{\textrm{aux}}$ $\triangleright$ Historical documents for $M$ users and an auxiliary LM
for each user $u$ do
    $\mathcal{D}_u^c,\mathcal{D}_u^t\leftarrow\mathrm{TemporalPartition}(\mathcal{D}_u)$ $\triangleright$ Temporally partition $\mathcal{D}_u$ into candidate and target documents
    for each target text $t_u\in\mathcal{D}_u^t$ do
        $q_u\leftarrow\mathrm{GetRequest}(t_u)$ $\triangleright$ Obtain a synthetic or natural request
    end for
    for each $(q_u,t_u)$ pair in $\mathcal{D}_u^t$ do $\triangleright$ Compute benefit of personalization for request-target pairs
        for each candidate $d_u$ in $\mathcal{D}_u^c$ do
            $Y[q_u,d_u]=\log p_{\textrm{aux}}(t_u \mid d_u,q_u)-\log p_{\textrm{aux}}(t_u \mid q_u)$ $\triangleright$ Equation (1)
        end for
    end for
end for
$\{q^{*}_u\}_{t=1}^{T}\leftarrow\mathrm{TopK}(Y[q_u,d_u])$ $\triangleright$ Retain the top $T$ unique requests which are personalizable
for each retained request in $\{q^{*}_u\}_{t=1}^{T}$ do
    $\{d^{+}_u\}_{p=1}^{P}\leftarrow\mathrm{TopK}(Y[q^{*}_u,d_u])$ $\triangleright$ Retain the top $P$ candidates that best personalize the target
end for
return $\{q^{*}_u,\{d^{+}_u\}_{p=1}^{P}\}_{t=1}^{T}$
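The selection steps of Algorithm 1, together with the negative sampling described in the text, might look as follows in Python. This is a sketch under stated assumptions rather than the authors' implementation: scores from Eq. (1) are assumed to be precomputed, "top $T$ unique requests" is interpreted as ranking each request by its best-scoring candidate document, negatives are assumed to exclude the chosen positives, and all names (`select_training_data`, `keep_fraction`, and so on) are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# (user_id, request) -> {candidate_document: Eq. (1) score}
Scores = Dict[Tuple[str, str], Dict[str, float]]


def select_training_data(scores: Scores, keep_fraction: float = 2 / 3,
                         num_positives: int = 1, num_negatives: int = 2,
                         seed: int = 0) -> List[dict]:
    """Select personalizable requests, their positives, and random negatives."""
    rng = random.Random(seed)
    # Rank requests by the score of their best candidate document (assumption).
    ranked = sorted(scores, key=lambda key: max(scores[key].values()), reverse=True)
    num_keep = max(1, round(len(ranked) * keep_fraction))

    selected = []
    for key in ranked[:num_keep]:
        doc_scores = scores[key]
        by_score = sorted(doc_scores, key=doc_scores.get, reverse=True)
        positives = by_score[:num_positives]
        # Negatives come from the same user's candidates, minus the positives.
        pool = [doc for doc in by_score if doc not in positives]
        for positive in positives:
            negatives = rng.sample(pool, min(num_negatives, len(pool)))
            selected.append(
                {"request": key, "positive": positive, "negatives": negatives}
            )
    return selected
```

Because each request is keyed by its user, the candidate pool for negatives is naturally restricted to that user's own historical documents.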

4.3 Retriever Optimization

Our $f_{\textrm{retr}}$ is a cross-encoder initialized with a pretrained LM encoder and trained using data selected per Algorithm 1, through distillation of scores in Equation 1. While cross-encoders are expressive, they produce scores which lie at the extremes of their score ranges (Menon et al., 2022; Yadav et al., 2022) – this hampers their ability to closely track the benefit of candidate documents for personalizing requests. We propose to remedy this through a scale-calibrating training objective.

Scale calibration Let $\mathbf{y}_q=[y_q^{+},\dots,y_q^{-}]$, where $y_q^{+}$ corresponds to the score of a positive document and $y_q^{-}$ corresponds to the score of a negative document from Eq. 1. Here, $\mathbf{y}_q$ contains $N$ negatives and 1 positive document. Similarly, let the predicted logits from $f_{\textrm{retr}}:(q_u,d_u)\rightarrow\mathbb{R}$ be denoted as $\mathbf{s}_q=[s_q^{+},\dots,s_q^{-}]$. Then, a standard KL-divergence loss is written as $KL(\mathbf{y}_q,\mathbf{s}_q)=-\sum_i \texttt{sm}(y_{q,i})\log \texttt{sm}(s_{q,i})$, where sm represents the softmax function. Our proposed scale calibration modifies the KL divergence loss by adding an “anchor” example with target score $y_0$, which is a tunable hyperparameter, and logit $s_0$ set to $0$, resulting in score vectors $\mathbf{y}_q^{\prime}=[y_0,\mathbf{y}_q]$ and $\mathbf{s}_q^{\prime}=[s_0,\mathbf{s}_q]$. The scale-calibrated KL-divergence loss is thus

\begin{align}
KL(\mathbf{y}_{q}^{\prime},\mathbf{s}_{q}^{\prime}) &= -\sum_{i}\texttt{sm}(y^{\prime}_{q,i})\,\text{log}~\texttt{sm}(s^{\prime}_{q,i}) \tag{2}\\
&= -\sum_{i}\frac{e^{y_{q,i}}}{\sum_{j}e^{y_{q,j}}+e^{y_{0}}}\,\text{log}\frac{e^{s_{q,i}}}{\sum_{j}e^{s_{q,j}}+1} \nonumber\\
&\quad+\frac{e^{y_{0}}}{\sum_{j}e^{y_{q,j}}+e^{y_{0}}}\,\text{log}\Big(\sum_{j}e^{s_{q,j}}+1\Big). \tag{3}
\end{align}

We find that setting $y_{0}$ to the median value of scores from Eq. (1) for positive candidate documents works well. This ensures that very large scores from $f_{\textrm{retr}}$ are penalized (second term of Eq. 3) and smaller scores are prevented from being driven lower (first term of Eq. 3). Therefore, $f_{\textrm{retr}}$ scores are more evenly distributed over the score range. In practice, this ensures that predicted scores from $f_{\textrm{retr}}$ more accurately reflect the distribution of $f_{\textrm{aux}}$, which in turn more closely tracks the utility of request-document pairs for personalization. We compare Pearl to baselines in §5.2 and present ablations in §C.2.
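As a concrete illustration of the scale-calibrated objective, the following is a minimal NumPy sketch for a single request. The function names and the `tau` argument are our own illustrative choices and not the released implementation; the temperature is applied to the augmented target vector only, which is an assumption based on the description in Appendix A.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

def scale_calibrated_loss(y, s, y0, tau=1.0):
    """Sketch of Eq. (2)-(3) for one request.

    y   : target scores from Eq. (1) (1 positive + N negatives)
    s   : retriever logits for the same documents
    y0  : anchor target score (tunable hyperparameter)
    tau : temperature on the augmented targets (assumption, see lead-in)
    """
    # Prepend the anchor: target score y0, logit fixed at 0.
    y_aug = np.concatenate(([y0], np.asarray(y, dtype=float))) / tau
    s_aug = np.concatenate(([0.0], np.asarray(s, dtype=float)))
    target = np.exp(log_softmax(y_aug))
    # Cross-entropy between softmax(targets) and softmax(logits).
    return float(-(target * log_softmax(s_aug)).sum())

y = np.array([2.0, 0.5, -1.0])   # toy target scores
s = np.array([1.0, 0.0, -0.5])   # toy retriever logits
loss = scale_calibrated_loss(y, s, y0=0.3)
shifted_loss = scale_calibrated_loss(y, s + 10.0, y0=0.3)
```

Because the anchor logit is fixed at 0, adding a constant to every retriever logit changes the loss (`shifted_loss` differs from `loss`), whereas a plain softmax objective without the anchor is invariant to such shifts. This is what ties the logits to an absolute scale.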

4.4 System Details

After training the retriever $f_{\textrm{retr}}$ offline, Pearl may be used to serve requests online. Given an unseen request, $f_{\textrm{retr}}$ retrieves the top-$k$ historical texts from $\mathcal{D}_{u}$; these are formatted into a prompt and input to $f_{\textrm{LLM}}$ to generate a personalized target text $t_{u}$.
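The online flow above can be sketched as follows. This is a simplified illustration: `toy_overlap` is a hypothetical stand-in for the trained $f_{\textrm{retr}}$, the helper names are ours, and the template only borrows the `{{historical_examples}}` and `{{target_request}}` placeholders that appear in the prompts of Appendix A.

```python
from typing import Callable, List

def retrieve_top_k(request: str, history: List[str],
                   score_fn: Callable[[str, str], float], k: int) -> List[str]:
    """Rank a user's historical texts by score_fn(request, doc); keep the top k."""
    ranked = sorted(history, key=lambda doc: score_fn(request, doc), reverse=True)
    return ranked[:k]

def build_prompt(template: str, request: str, examples: List[str]) -> str:
    """Fill the two placeholders used by the prompts in Appendix A."""
    joined = "\n".join(examples)
    return (template.replace("{{historical_examples}}", joined)
                    .replace("{{target_request}}", request))

def toy_overlap(request: str, doc: str) -> float:
    """Hypothetical scorer (word overlap) standing in for the trained retriever."""
    return float(len(set(request.lower().split()) & set(doc.lower().split())))

history = [
    "Join us for the research study launch",
    "Lunch menu for Friday",
    "Our campaign results are in",
]
request = "Write a post about the research study"
template = ("Here are some historical posts by the USER: {{historical_examples}}\n"
            "REQUEST: {{target_request}}")

top = retrieve_top_k(request, history, toy_overlap, k=1)
prompt = build_prompt(template, request, top)  # would be sent to the LLM
```

In a real deployment `score_fn` would wrap the cross-encoder, and the resulting prompt would be passed to $f_{\textrm{LLM}}$.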

Our $f_{\textrm{retr}}$ is initialized with a 110M parameter MPNet encoder (Song et al., 2020). For $f_{\textrm{LLM}}$ we consider two performant LLMs, davinci-003 and gpt-3.5-turbo. For $f_{\textrm{aux}}$ we use FlanT5-XL with 3 billion parameters (Chung et al., 2022). Appendix A details our prompts and implementation.

5 Experiments

We demonstrate the effectiveness of Pearl on two personalized text generation datasets from social media platforms. For evaluation, we employ standard intrinsic evaluations, extrinsic evaluation based on downstream tasks using the generated text, and a recently proposed personalized LLM-as-judge evaluation (Wang et al., 2023d). Then, in §5.3 we show how a calibrated retriever can be used for selective revision of underperforming requests. We present ablations in §C.2 and demonstrate the calibration performance of our retriever in §C.3.

5.1 Experimental Setup

Data For evaluation, we use two open-ended long-form text generation datasets for social media: (1) Personalized post writing on WorkSm and (2) Personalized comment writing on AITA.

WorkSm WorkSm is an enterprise social network used for communication within organizations, presenting a highly realistic platform for writing assistance. We obtain a random sample of $\sim$18k posts written by 1116 users from November 2020 to July 2023. To create an evaluation set, we manually examine posts longer than 50 words that received $\geq$2 comments (about 1K posts) and select 163 of the most recent posts from $\sim$80 users to serve as reference target texts $t_{u}^{*}$. These posts represent a diverse, engaging set that could benefit from personalized writing assistance and serve as high quality target references. At a high level, these posts share events, research studies, campaigns, and organizational news. Since WorkSm does not contain requests to the writing assistant, two authors not involved in model development manually wrote requests $q_{u}$ per target text. Note that this was necessary given the highly regulated and private enterprise data in WorkSm, which prevented exposure to external crowdworkers. Our requests were authored following Guideline 1. To construct $\mathcal{D}_{u}$, posts created before $t_{u}^{*}$ were used: on average, users had $31$ historic posts (max of $169$). To create our training set, we only retain posts with $>10$ words and users with $\geq 5$ historic posts, while excluding posts in our evaluation set. We generate synthetic requests with GPT-4 for training given the expense of manually authored requests, resulting in a set of $\sim$7k training requests.
Enterprise contracts with API providers ensured the privacy of user data shared over the API.

AITA AITA is a Reddit subforum in which original posters (OP) describe personal moral conflicts and receive comments from other users judging them to be “the a**hole” or “not the a**hole”. This dataset has been used in prior work on modeling the personal values of users (Plepi et al., 2022). We construct a personalized comment generation task from this data. We treat the OP posts as requests $q_{u}$, user comments as reference target texts $t_{u}^{*}$, and a user’s previous comments as $\mathcal{D}_{u}$. Since the dataset lacks time metadata, we construct an evaluation set by sampling 10% of the posts as test requests, and further filter to $600$ random target texts for our evaluation set to keep LLM experiments feasible. Evaluation users had $29$ posts in $\mathcal{D}_{u}$ on average (max of $590$). Our training set used the historical post-comment pairs from users in $\mathcal{D}_{u}$, resulting in $\sim$84k requests. Note that while Reddit comments are not the ideal platform for writing assistance, AITA is one of the few public datasets available for the task and resembles applications such as email response generation (Kannan et al., 2016). Appendix B details our datasets further.

| Method | davinci-003 Macro F1 (%) | gpt-35-turbo Macro F1 (%) |
| zShot-NP | 41.97 | 50.43 |
| kShot-NP | 51.71 | 59.76 |
| Random | 55.52 | 59.47 |
| BM25 | 57.26 | 61.66 |
| MPNet-1B | 53.72 | 59.23 |
| UPR | 55.76 | 58.15 |
| RelevanceCE | 56.85 | 59.59 |
| Pearl | 61.21 | 65.34 |
(a) Extrinsic classification accuracy in AITA.
| Method | davinci-003 BS-F1 | davinci-003 R1 | davinci-003 R2 | gpt-35-turbo BS-F1 | gpt-35-turbo R1 | gpt-35-turbo R2 |
| zShot-NP | 36.25 | 0.5029 | 0.2516 | 31.03 | 0.4627 | 0.2091 |
| kShot-NP | 34.08 | 0.4931 | 0.2431 | 32.51 | 0.4825 | 0.2258 |
| Random | 35.04 | 0.5036 | 0.2505 | 33.46 | 0.4893 | 0.2345 |
| BM25 | 37.96 | 0.5287 | 0.2911 | 36.57 | 0.5089 | 0.2673 |
| MPNet-1B | 38.30 | 0.5281 | 0.2931 | 36.02 | 0.5063 | 0.2639 |
| UPR | 38.70 | 0.5337 | 0.3019 | 35.98 | 0.5054 | 0.2642 |
| RelevanceCE | 37.81 | 0.5288 | 0.2953 | 35.99 | 0.5038 | 0.2613 |
| Pearl | 39.60 | 0.5419 | 0.3094 | 36.49 | 0.5082 | 0.2676 |
(b) Intrinsic reference based metrics in WorkSm.
Table 1: Pearl compared to non-personalized (NP) LLMs and LLMs personalized with retrieval on datasets of social media communication: (a) a dataset constructed from Reddit and (b) a workplace social media dataset.

Generation metrics Since personalized text generation aims to adhere to the style, knowledge, and values of specific users, effective evaluation for personalized generation remains an open problem Wang et al. (2023d, a). This is in contrast to non-personalized generation, where desirable aspects of outputs can be defined uniformly across all test cases. As a result, we present evaluations using a host of standard evaluation setups aiming to demonstrate the effectiveness of Pearl from various perspectives. Our evaluations span the following standard setups Dou et al. (2023): intrinsic evaluations based on n-gram/embedding similarity to reference texts, extrinsic evaluation through a classification accuracy based on generated text, and pairwise evaluation with personalized LLM-as-judge.

Specifically, for WorkSm we report standard evaluation measures based on n-gram and embedding similarity between generations and reference targets: ROUGE-1 (R1), ROUGE-2 (R2), and BertScore-F1 (BS-F1) (Zhang* et al., 2020). This serves as an intrinsic evaluation for WorkSm, measuring the extent to which generations are similar to user authored texts. Next, since AITA users’ comments primarily take a stance based on users’ moral values, we measure whether the stance in generated comments matches that of the user through a downstream stance prediction task, which serves as an extrinsic evaluation. This evaluation may be seen as evaluating the extent to which model generations adhere to a user’s values. We map generated comments to a binary “YTA” or “NTA” label based on simple high-precision rules mapping lexical variations of “you’re the a**hole” and “not the a**hole” to the labels. This procedure was also found reliable for constructing ground truth labels in AITA (Plepi et al., 2022). Note that early attempts at using n-gram/embedding similarity measures (BS-F1, R1, R2) resulted in unreliable evaluations for AITA due to the large variation (length, vocabulary, emojis, etc.) in AITA comments; we therefore opt for the more stable extrinsic evaluations and the LLM based evaluations described next.
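To make the rule-based stance mapping concrete, the sketch below shows one way such high-precision rules could look. The regular expressions are hypothetical: this section does not list the exact lexical variations used, so the patterns, the function name, and the choice to abstain on ambiguous comments are assumptions for illustration.

```python
import re
from typing import Optional

# Hypothetical patterns; the exact rules used in the paper are not given here.
_NTA = re.compile(r"\b(?:nta|not the a(?:\*\*|ss)hole)\b", re.IGNORECASE)
_YTA = re.compile(
    r"\b(?:yta|you(?:['’]re| are) the a(?:\*\*|ss)hole)\b", re.IGNORECASE
)

def stance_label(comment: str) -> Optional[str]:
    """Map a comment to "YTA" or "NTA"; return None if neither or both match."""
    nta = bool(_NTA.search(comment))
    yta = bool(_YTA.search(comment))
    if nta == yta:  # no signal, or conflicting signals: abstain
        return None
    return "NTA" if nta else "YTA"

labels = [
    stance_label("NTA, you did nothing wrong."),
    stance_label("Sorry, but you're the a**hole here."),
    stance_label("Interesting story."),
]
```

Abstaining when both or neither pattern fires keeps the rules high-precision at the cost of coverage; how unlabeled generations are handled in the Macro F1 computation is not specified in this section.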

For both AITA and WorkSm we conduct a pairwise evaluation with a recently proposed personalized LLM-as-judge (Wang et al., 2023d). Wang et al. show LLM-based author identification to be a reliable proxy task for distinguishing models of various qualities, and to be correlated with human quality ratings. Here, a judge LLM is presented with a reference text from a user and generations from the pair of systems being compared; it is then prompted to select the system generation more likely to be authored by the author of the reference text. An author identification task aims to capture several aspects which distinguish individuals’ writing, spanning style, knowledge, and values. In our evaluation, we compare Pearl outputs to the outputs from the best baseline as indicated by intrinsic/extrinsic evaluations and use the target reference text $t_{u}^{*}$ in the LLM prompt as an example of the user’s writing. We use GPT-4o as our judge LLM and present the judge prompt in Appendix B.4. In our evaluation we avoid rating aspects such as fluency, non-redundancy, etc. (Celikyilmaz et al., 2021) since we are primarily concerned with personalization performance and these qualities may be in conflict with specific users’ writing.

Baselines As baselines, we consider non-personalized models based on zero-shot prompting (zShot-NP) and few-shot prompting with $k$ randomly chosen example documents (kShot-NP). We also consider retrieval-augmented personalized baselines, which select from a user’s historical documents $\mathcal{D}_{u}$. They span selection at random from $\mathcal{D}_{u}$ (Random), sparse retrieval with BM25, dense retrieval with a strong MPNet model trained on 1 billion text pairs (MPNet-1B), an unsupervised crossencoder (Sachan et al., 2022) ranking documents with the FlanT5-Base likelihood $p(q_{u}|d_{u})$ (UPR), and a supervised crossencoder optimized on our dataset with request-document pairs $(q_{u},d_{u})$ in $\mathcal{D}_{u}$ (RelevanceCE). Appendix B.3 details our baselines.

5.2 Generation Evaluation

Tables 1 and 2 report our evaluations. Appendix C presents ablation (C.2) and calibration (C.3) results.

Reference based evaluation Tables 1a and 1b report automated metrics on AITA and WorkSm. First, we observe that personalization through retrieval, even at Random, generally improves upon non-personalized approaches (NP), which is consistent with prior work (Salemi et al., 2023). Next, we note that the best baseline is not consistent, varying between BM25 and the unsupervised crossencoder (UPR), indicating that retrieval models designed for request-document relevance vary in performance depending on the dataset and inference LLM. Finally, we note that Pearl consistently performs at par with or better than the best baselines across datasets and LLMs, indicating the effectiveness of training $f_{\textrm{retr}}$ for personalized generation. For the more reliable classification metrics obtainable in AITA, Pearl outperforms all baselines, improving on the best baseline (BM25) by roughly 3.7 to 4 Macro F1 points in Table 1a. Next, we report performance in more expressive LLM-as-judge evaluations.

| Dataset | davinci-003 P / B / T (%) | davinci-003 $\alpha$ | gpt-35-turbo P / B / T (%) | gpt-35-turbo $\alpha$ |
| AITA | 46.8 / 40.3 / 12.8 | 0.56 | 46.6 / 44.9 / 8.3 | 0.55 |
| WorkSm | 46.6 / 42.5 / 10.8 | 0.42 | 38.9 / 42.6 / 18.5 | 0.28 |
Table 2: LLM-as-judge win-rate evaluation for AITA and WorkSm, where the judge selects the generation more aligned with an author’s writing sample. The LLM could prefer the Proposed system (Pearl), the Baseline (BM25), or judge the outputs as Tied, denoted by P, B, and T. Agreement across the 3 judge runs is reported as Krippendorff’s alpha ($\alpha$).

Pairwise LLM-as-judge evaluation In Table 2 we report the results of personalization evaluation following the setup described in §5.1. Here, we compare against BM25-augmented generation, as BM25 performs within our top 2 baselines in automatic evaluations; this strong performance is consistent with prior work (Izacard et al., 2022; Thakur et al., 2021). We use GPT-4o as the judge LLM and run every pair of inputs through it 3 times, reporting win rates averaged over all instances in our test set and over the 3 repeated runs. Further, we randomly swap the position of the baseline and proposed method generations in the prompt to account for position biases in the judge LLM. Finally, we also report the agreement between the 3 judge LLM runs using Krippendorff’s alpha ($\alpha$) to ensure that LLM judgements are consistent across runs.
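The bookkeeping for position-swapped pairwise judgments can be sketched as below. The record format, a `(choice, swapped)` pair with `choice` in `{"first", "second", "tie"}`, is a hypothetical convention of ours rather than the format used in the paper’s evaluation code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def win_rates(judgments: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate judge outputs into P / B / T percentages.

    Each judgment is (choice, swapped):
      choice  : "first", "second", or "tie" (the slot the judge preferred)
      swapped : False if the proposed system was shown first,
                True if the baseline was shown first
    """
    counts = Counter()
    for choice, swapped in judgments:
        if choice == "tie":
            counts["T"] += 1
        elif (choice == "first") != swapped:
            counts["P"] += 1   # preferred slot held the proposed system
        else:
            counts["B"] += 1   # preferred slot held the baseline
    total = sum(counts.values())
    return {key: 100.0 * counts[key] / total for key in ("P", "B", "T")}

# Toy judgments pooled over instances and repeated runs.
rates = win_rates([
    ("first", False),   # proposed shown first and preferred  -> P
    ("first", True),    # baseline shown first and preferred  -> B
    ("second", True),   # proposed shown second and preferred -> P
    ("tie", False),     # -> T
])
```

Undoing the swap before counting means a judge that always prefers the first slot contributes roughly equally to P and B rather than inflating either.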

In Table 2, Pearl achieves a greater win-rate than BM25 in 3 of 4 settings. In these settings we also note that the LLM judgments remain consistent across 3 repeated runs, with Krippendorff’s alpha between $0.41$ and $0.56$ (0 indicates chance agreement). While BM25 sees a greater win-rate in WorkSm with gpt-35-turbo, the judgments see lower agreement ($\alpha=0.28$), indicating that the outputs are harder to distinguish. Finally, comparing to Table 1, we see that the trends of extrinsic and intrinsic reference based evaluations are retained in LLM-as-judge evaluations, consistently indicating the benefit of Pearl across evaluation setups, inference LLMs, and datasets. In Appendix C we show an example from AITA to illustrate the kinds of retrievals and outputs that make Pearl effective.

5.3 Selective Revision with Pearl

Having established Pearl to be an effective model for generation, we show $f_{\textrm{retr}}$ to be generation calibrated in Appendix C.3. Here, we demonstrate the usefulness of a calibrated retriever in a case study using the retriever scores to selectively revise generations. Specifically, we treat the scores from $f_{\textrm{retr}}$ as a predictor of retrieval performance, and in turn text generation performance. We assume that if $f_{\textrm{retr}}$ cannot find a highly scored in-context example, the generated response will be of low quality and can benefit from LLM revision (Figure 3).

Setup Given our trained retriever, we take the top-1 document score for each request, $s_{1}=\max_{d_{u}\in\mathcal{D}_{u}}f_{\textrm{retr}}(q_{u},d_{u})$, and learn a threshold $\theta$ on $s_{1}$ that maximizes a downstream performance metric on a held-out development set (R2 in WorkSm and Macro-F1 in AITA). Then, given a generated target text $t_{u}$ with $s_{1}<\theta$, we selectively revise $t_{u}$ by prompting $f_{\textrm{LLM}}$ to edit the target text. We report results of selective revision compared to a single round of generation (i.e., no revision) and full revision over the entire dataset (i.e., 100% revision). We repeat this for BM25. We provide further details and analysis in Appendix C.4.
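One simple way to learn the threshold $\theta$ is a sweep over candidate values on the development set, sketched below. This assumes the metric decomposes into a per-request average (as with R2); a corpus-level metric such as Macro-F1 would need to be recomputed for each candidate instead. The function and variable names are illustrative and not taken from the paper’s code.

```python
import numpy as np

def learn_threshold(s1, metric_stage1, metric_revised):
    """Pick theta so that revising requests with s1 < theta maximizes the mean metric.

    s1             : top-1 retriever score per dev request
    metric_stage1  : per-request metric of the unrevised (Stage 1) output
    metric_revised : per-request metric of the revised output
    """
    s1 = np.asarray(s1, dtype=float)
    m0 = np.asarray(metric_stage1, dtype=float)
    m1 = np.asarray(metric_revised, dtype=float)
    # -inf revises nothing, +inf revises everything.
    candidates = np.concatenate(([-np.inf], np.unique(s1), [np.inf]))
    best_theta, best_value = None, -np.inf
    for theta in candidates:
        value = np.where(s1 < theta, m1, m0).mean()
        if value > best_value:
            best_theta, best_value = float(theta), float(value)
    return best_theta, best_value

# Toy dev set: revision helps the two low-scoring requests and hurts the others.
theta, value = learn_threshold(
    s1=[0.1, 0.2, 0.8, 0.9],
    metric_stage1=[0.2, 0.3, 0.6, 0.7],
    metric_revised=[0.4, 0.5, 0.5, 0.6],
)
```

On the toy data the sweep selects the threshold that revises only the two low-scoring requests, which beats both revising nothing and revising everything.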

| Method (LLM: gpt-35-turbo) | AITA Macro F1 (%) | WorkSm BS-F1 | WorkSm R1 | WorkSm R2 |
| $f_{\textrm{retr}}=$ BM25 | | | | |
| Stage 1 (no revision) | 59.99 | 36.15 | 0.5052 | 0.2611 |
| All (100% revision) | 58.36 | 35.45 | 0.5096 | 0.2573 |
| Selective revision | 57.71 | 37.29 | 0.5206 | 0.2738 |
| $f_{\textrm{retr}}=$ Proposed | | | | |
| Stage 1 (no revision) | 65.15 | 37.02 | 0.5124 | 0.2709 |
| All (100% revision) | 64.85 | 35.47 | 0.5045 | 0.2520 |
| Selective revision | 65.36 | 37.71 | 0.5236 | 0.2818 |
Table 3: Selectively revising target texts $t_{u}$ based on scores from our retriever vs BM25. Also present are results of no revision and revising all outputs (100% revision) from Stage 1 outputs.
Figure 3: Generation calibration of $f_{\textrm{retr}}$ allows us to use its predicted scores for performance prediction and selectively revise potentially bad generations.

Results In Table 3 we see that selective revision improves or retains performance upon a single round of generation (“Stage 1”) by 2-4% in downstream performance metrics with $f_{\textrm{retr}}=$ Proposed and BM25 for WorkSm. However, for AITA we see that selective revision based on BM25 shows a marked drop in performance, indicating its dataset-dependent calibration performance. Importantly, note that Macro F1 doesn’t measure aspects of style which may have changed in revision. Finally, editing all outputs produced by Stage 1 generation consistently leads to degraded performance (“All”), indicating that editing is not always helpful.

We also observe that Pearl chooses $75.8\%$ and $77.9\%$ of instances for editing in WorkSm and AITA, respectively. This indicates the potential for generation calibrated retrievers to reduce the number of expensive LLM calls made while ensuring better personalization performance. In Figure 5 (Appendix C.4) we analyze the performance of selective revision against request and user profile length. In a manual examination of requests with a low $s_{1}$ score by the Pearl $f_{\textrm{retr}}$, we find the requests to be underspecified and to often require further information from a user. For example, the request “Write a post about how I like to relax after work” aims to generate a target discussing more specific forms of relaxation not present in any historical documents. This indicates that generation calibrated retrievers may be used for other forms of selective prediction and user interaction, e.g., selectively withholding predictions when satisfactory generations are unlikely or obtaining more information from users through follow-up questions. We leave such explorations to exciting future work.

6 Conclusion

In this paper we present Pearl– an LLM based writing assistant personalized with generation calibrated retrievers. We propose a method for training generation calibrated retrievers through a careful selection of training data and a scale calibrated objective. In a series of holistic evaluations, we demonstrate the effectiveness of our approach in datasets of social media communication compared to baselines (§5.2) as well as ablated models (Appendix C.2). We demonstrate the calibration performance for our retriever (Appendix C.3), and show how our retrieval model can double as a performance predictor (§5.3) and can identify outputs which can benefit from LLM revision.

7 Ethical and broader impact

Having introduced Pearl as an effective personalization strategy for writing assistance and discussed its benefits we review two implications of concern arising from better personalized text generation: challenges to factuality, and longer term influence on language use and communication.

Challenges to factuality The emergence of LLMs and their ability to generate compelling text has been followed by a rise in cases of malicious use of these technologies. Augenstein et al. (2023) overview four such classes of harm: personalized attacks on individuals in the form of phishing attacks and tailored misinformation, impersonation of trusted figures (e.g. journalists or regulators), a glut of paraphrased misinformation evading detection by automatic tools often used by fact checkers, and large scale creation of fake social media profiles and plagiarized content (Brewster et al., 2023). Improvements in personalized text generation are likely to exacerbate each of these problems. To account for this, several technology and policy initiatives are under active development (Augenstein et al., 2023). These span detection of AI-generated content, cryptographic signatures intended to prove the authenticity of content, government regulation, and public education; however, their effectiveness remains under investigation.

Language use and communication Current understanding of computer mediated communication suggests that users’ interpersonal communication patterns are influenced by the tool/medium used for communication (Poddar et al., 2023), with a potential for these influences to have longer term effects on communication in the absence of these tools (Hancock et al., 2020). Hancock et al. outline these implications as ranging from shifts in language use (e.g. a social expectation of more positive responses (Hohenstein and Jung, 2018)) and issues of how individuals portray themselves and evaluate others, to long term feedback loops affecting how we perceive ourselves. However, understanding of the implications of AI mediated communication, especially communication powered by LLMs, is still developing (Hancock et al., 2020). It is likely that widespread personalization in LLM communication agents will necessitate further understanding of these factors and the design of systems that incorporate this understanding to ameliorate harms.

References

Appendix A Model Details

For a POST from the subreddit Am I The Asshole write a COMMENT explaining if the author of a post is an asshole or not the asshole as a COMMENTER.
Use the following instructions for your response:
1. Read the below example comments by the COMMENTER.
2. Write the comment as the COMMENTER mimicing the length, style, reasoning, and stances of their comments.
Here are some example comments by the COMMENTER: {{historical_examples}}
POST: {{target_request}}
Write the COMMENT mimicing the length, style, reasoning, and stances of the COMMENTERS comments.
Prompt 1 $f_{\textrm{LLM}}$ prompt used to generate a target text given historical examples retrieved by $f_{\textrm{retr}}$ and a target request for AITA.
Given a REQUEST from a USER to author a POST, write a POST for an enterprise social media site mimicking the user to satisfy the REQUEST.
Use the following instructions for your response:
1. You should maintain consistency in tone and style with the USERs historical posts.
2. You should imitate the language style of the USERs historical posts.
3. You should employ similar rhetorical methods as the USERs historical posts.
Here are some historical posts by the USER: {{historical_examples}}
REQUEST: {{target_request}}
Write the POST to satisfy the REQUEST mimicing the tone, style, and rhetorical methods of the USERs historical posts.

Prompt 2 $f_{\textrm{LLM}}$ prompt used to generate a target text given historical examples retrieved by $f_{\textrm{retr}}$ and a target request for WorkSm.

Retriever We instantiate $f_{\textrm{retr}}$ with the pre-trained MPNet, which is relatively lightweight at 110M parameters (Song et al., 2020). We obtain an output score from $f_{\textrm{retr}}$ as $\mathbf{w}^{T}\texttt{tanh}\left(\mathbf{W}^{T}\textsc{Enc}([q_{u},d_{u}])\right)$, where Enc represents the cls token from the final layer of the encoder, and $q_{u}$ and $d_{u}$ are the text of the input request and historical document. The encoder parameters, $\mathbf{w}$, and $\mathbf{W}$ are trained.
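The scoring head can be illustrated in isolation with NumPy, as in the sketch below. The encoder output is replaced by a random vector, and the dimensions `d` and `h` are hypothetical placeholders chosen for illustration; only the functional form $\mathbf{w}^{T}\texttt{tanh}(\mathbf{W}^{T}\textsc{Enc}(\cdot))$ is taken from the text.

```python
import numpy as np

def retriever_score(enc_cls, W, w):
    """Scalar score w^T tanh(W^T Enc([q_u, d_u])).

    enc_cls : (d,) encoder cls vector for the concatenated request and document
    W       : (d, h) projection matrix
    w       : (h,) output vector
    """
    hidden = np.tanh(W.T @ enc_cls)
    return float(w @ hidden)

rng = np.random.default_rng(0)
d, h = 16, 8                      # hypothetical sizes, for illustration only
enc_cls = rng.normal(size=d)      # stand-in for the encoder's cls output
W = rng.normal(size=(d, h))
w = rng.normal(size=h)
score = retriever_score(enc_cls, W, w)
```

Since tanh is bounded in $[-1, 1]$, the magnitude of the score is bounded by the L1 norm of $\mathbf{w}$. The head itself therefore has a finite score range, which the calibration objective of §4.3 encourages the model to use more evenly.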

Text generation models For $f_{\textrm{LLM}}$ we consider two performant LLMs offered via API by Azure OpenAI, davinci-003 and gpt-3.5-turbo. For $f_{\textrm{aux}}$ we consider a smaller but still effective encoder-decoder language model, FlanT5-XL, with 3 billion parameters (Chung et al., 2022). The latter model is open-sourced, allowing us to access its token likelihoods directly, a requirement of Eq. 1. We obtain target text likelihoods by averaging the log-probabilities of individual tokens from FlanT5-XL.
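The averaging step can be sketched independently of any particular model. The snippet below assumes the per-position vocabulary logits for the target text have already been obtained from the auxiliary model; obtaining them from FlanT5-XL is outside the scope of this sketch, and the function name is ours.

```python
import numpy as np

def avg_token_log_likelihood(logits, target_ids):
    """Average log-probability of the target tokens.

    logits     : (T, V) array, one row of vocabulary logits per target position
    target_ids : length-T sequence of target token ids
    """
    logits = np.asarray(logits, dtype=float)
    target_ids = np.asarray(target_ids, dtype=int)
    # Row-wise, numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of each target token and average.
    token_log_probs = log_probs[np.arange(len(target_ids)), target_ids]
    return float(token_log_probs.mean())

# With uniform logits over a 5-token vocabulary, every token has probability 1/5.
uniform_score = avg_token_log_likelihood(np.zeros((3, 5)), [0, 2, 4])
```

Averaging rather than summing keeps scores comparable across target texts of different lengths.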

LLM prompts We use Prompts 1 and 2 for LLM inference. The same prompts are used with davinci-003 and gpt-35-turbo. For constructing training data in Eq. (1) with FlanT5-XL as $f_{\textrm{aux}}$, we use Prompts 3, 4, 5, and 6. Note that computing $p_{\textrm{aux}}(t_u|q_u)$ uses a set of few-shot examples, randomly chosen from the training set and fixed across requests, rather than the request alone.

Pearl implementation In constructing training data for $f_{\textrm{retr}}$ we use $|\mathcal{D}_u^t| = 8$, i.e., we treat the 8 most recent texts per user as their target texts. To train $f_{\textrm{retr}}$, we consider the top two candidate documents per Eq. (1) as positive examples per request and use three negatives per positive, i.e., $P=2$ and $N=3$. In our LLM prompts, we use $k=3$ retrieved examples for WorkSm and $k=4$ for AITA, tuned on a dev set, and set generation temperature to zero.
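The selection of positives and negatives can be sketched as follows. The sketch assumes the Eq. (1) scores for one request are already computed and passed in as a dictionary. How negatives are chosen is not specified in this section, so sampling them at random from the remaining candidates is an assumption, and the function name is hypothetical.

```python
import random

def build_training_examples(candidate_scores, num_pos=2, num_neg=3, seed=0):
    """Pick top-scoring positives and sampled negatives for one request.

    candidate_scores: dict mapping document id -> Eq. (1) score.
    """
    rng = random.Random(seed)
    ranked = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
    positives = ranked[:num_pos]
    # Assumption: negatives are drawn from candidates that are not positives
    pool = ranked[num_pos:]
    examples = []
    for pos in positives:
        negatives = rng.sample(pool, min(num_neg, len(pool)))
        examples.append({"positive": pos, "negatives": negatives})
    return examples

# Made-up scores for six candidate documents of one user
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.3, "d5": 0.2, "d6": 0.05}
examples = build_training_examples(scores)
```

With $P=2$ and $N=3$, each request contributes two training groups of one positive and three negatives.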

We also apply a temperature to the target scores input to the softmax functions in Eq. (2), $\mathbf{y}_u'/\tau$ with $\tau = 5$. Finally, we set $y_0 = 110$ for WorkSm and $y_0 = 5$ for AITA, which are the median values of Eq. (1) for each respective dataset on the training data. We tuned $y_0$ on a dev set constructed similarly to our training set, considering the 25th and 75th percentile values of Eq. (1). Our retrievers were trained on Nvidia V100 GPUs with 16GB memory or Nvidia RTX A6000 GPUs with 48GB memory. Experiments for training retrievers required about 300 hours in total.
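The effect of the temperature on the target distribution can be seen in a minimal sketch. It shows only the temperature-scaled softmax over target scores; the full objective of Eq. (2), including the role of $y_0$, is not reproduced here. The target values and the function name are made up for illustration.

```python
import math

def softmax_with_temperature(scores, tau=5.0):
    """Softmax over scores divided by a temperature tau."""
    scaled = [s / tau for s in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical target scores for three candidate documents
targets = [120.0, 110.0, 95.0]
sharp = softmax_with_temperature(targets, tau=1.0)
smooth = softmax_with_temperature(targets, tau=5.0)
```

A larger temperature spreads probability mass over more candidates, so the target distribution is less dominated by the single highest-scoring document.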

Here are some example posts on the Am I The Asshole subreddit: {{random_fewshot_examples}}. Target post: {{target_post}}. Write a users comment for this post:
Prompt 3 $f_{\textrm{aux}}$ prompt used to compute $p_{\textrm{aux}}(t_u|q_u)$ in Eq. (1) for AITA.
Here is an comment on a post by a user on the Am I the Asshole subreddit: {{candidate_comment}}. Target post: {{target_post}}. Write a users comment for this post:
Prompt 4 $f_{\textrm{aux}}$ prompt used to compute $p_{\textrm{aux}}(t_u|d_u,q_u)$ in Eq. (1) for AITA.
Here is are some posts by a user on an enterprise social network: {{random_fewshot_examples}}
Here is an outline for a target post by the user: {{target_request}}. Write the target post:
Prompt 5 $f_{\textrm{aux}}$ prompt used to compute $p_{\textrm{aux}}(t_u|q_u)$ in Eq. (1) for WorkSm.
Here is an example post by a user on an enterprise social network: {{candidate_document}}. Here is an outline for a target post by the user: {{target_request}}. Write the target post:
Prompt 6 $f_{\textrm{aux}}$ prompt used to compute $p_{\textrm{aux}}(t_u|d_u,q_u)$ in Eq. (1) for WorkSm.

Appendix B Experimental Details

Here we present various details of datasets, baselines, and manual evaluation.

B.1 Evaluation Requests in WorkSm

For evaluation in WorkSm, two authors not involved in model development manually authored requests for each of the 163 target posts in our evaluation set. The guidelines given to these annotators are presented in Guideline 1. The requests are intended to be brief and include the salient information contained in the post. Note that annotators external to the author team were not recruited for authoring requests due to the private and highly regulated nature of WorkSm.

B.2 Training Requests in WorkSm

Section 5.1 notes that our training set for WorkSm was constructed from synthetic requests generated by GPT4. The prompt for this is presented in Prompt 11. We follow an incremental approach to construct the synthetic requests: first extracting the salient aspects of the post, followed by concatenation of these aspects to result in the request. The salient aspects span: an overview of the post, proper nouns mentioned in the post, contact information, links to webpages, and any specialized knowledge or anecdotes in the post. Given the success of chain-of-thought prompting, we generate an explanation followed by salient aspects of the post – the explanations are not used elsewhere. Enterprise contracts ensure the privacy of user data shared over the API.
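The concatenation step described above can be sketched as follows. The sketch assumes the LLM output follows the tagged format of Prompt 11 (`<Instruction.…>` fields, with `_NONE_` marking inapplicable fields) and that aspects are joined with a single space; the joining rule and the function name are assumptions, not details stated in the paper.

```python
import re

def build_request(llm_output):
    """Concatenate <Instruction.*> fields of a tagged output, skipping _NONE_."""
    fields = re.findall(
        r"<Instruction\.(\w+)>(.*?)</Instruction\.\1>",
        llm_output,
        flags=re.DOTALL,
    )
    # The <Explanation> field is ignored, mirroring its non-use elsewhere
    parts = [text.strip() for _, text in fields if text.strip() != "_NONE_"]
    return " ".join(parts)

# Made-up output in the tagged format
output = (
    "<Explanation>why</Explanation>"
    "<Instruction.Overview>Announce the launch.</Instruction.Overview>"
    "<Instruction.Names>_NONE_</Instruction.Names>"
    "<Instruction.Contacts>Email the team.</Instruction.Contacts>"
)
request = build_request(output)
```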

B.3 Baselines

We consider the following non-personalized baselines: zShot-NP: This represents a non-personalized approach prompting only with the request. kShot-NP: A few-shot non-personalized approach using a fixed, randomly selected set of $k$ documents for all requests. For AITA, the examples are balanced across labels.

We consider the following retrieval-augmented personalized baselines, selecting from a user’s historical documents $\mathcal{D}_u$: Random: Random selection of $k$ documents from $\mathcal{D}_u$. BM25: Represents a classic performant retrieval model based on query-document term overlap. MPNet-1B: This is a strong MPNet bi-encoder trained on 1 billion text pairs from numerous domains (HF model: sentence-transformers/all-mpnet-base-v2). Documents are ranked for a request using cosine similarity between embeddings. QL-FT5: An approach which ranks documents based on $p(q_u|d_u)$ with a pretrained FlanT5-Base with 250M parameters (Sachan et al., 2022). This may be seen as an unsupervised crossencoder. RelevanceCE: A supervised crossencoder with the same architecture as $f_{\textrm{retr}}$ in Pearl but differing in training. This is trained on pairs of $(q_u, d_u)$ in $\mathcal{D}_u$ treated as positive training pairs with a crossentropy loss, with negatives selected as a random historical document from the same user but not corresponding to $q_u$. Note that this corresponds to a crossencoder optimized for request-document relevance, i.e., $p(\text{relevance}=1|q_u,d_u)$, rather than personalized target text generation.
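To make the bi-encoder ranking step concrete, here is a minimal sketch of ranking documents by cosine similarity between embeddings. The embeddings are made-up two-dimensional vectors rather than real MPNet outputs, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_emb, doc_embs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = sorted(
        doc_embs.items(),
        key=lambda kv: cosine(query_emb, kv[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

# Toy request embedding and three historical document embeddings
top = rank_documents(
    [1.0, 0.0],
    {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [1.0, 1.0]},
    k=2,
)
```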

B.4 Judge LLM prompts

Prompts 7 and 8 present the prompts given to GPT-4o as the judge LLM discussed in §5.2.

You an an experienced linguist who helps people compare social media texts.
Given a REFERENCE POST and two
TARGET POSTS judge which of the TARGET POSTs is significantly more likely to be
written by the same author as the REFERENCE POST.
For your response use the following instructions:
1. Make your judgement based on stylistic patterns, ordering of information, and tone used.
2. Output POST ONE if it is significantly more likely to be written by the same author as the REFERENCE POST.
3. Output POST TWO if it is significantly more likely to be written by the same author as the REFERENCE POST.
4. Output BOTH if either post could have been written by the same author or neither could have been written by the same author.
Here are the POSTS:
REFERENCE POST: {{reference_post}}
POST ONE: {{post_one}}
POST TWO: {{post_two}}
Output a justification for your judgement, then output POST ONE, POST TWO, or BOTH to indicate your final decision.

Prompt 7 Judge LLM prompt used to select a generated post more likely to align with a reference post authored by a user for WorkSm.
You an an experienced linguist who helps people compare social media texts.
Given a REFERENCE POST and two
TARGET POSTS judge which of the TARGET POSTs is significantly more likely to be
written by the same author as the REFERENCE POST.
For your response use the following instructions:
1. Make your judgement based on similarity of stylistic patterns, arguments, stances, and word choices.
2. Output POST ONE if it is significantly more likely to be written by the same author as the REFERENCE POST.
3. Output POST TWO if it is significantly more likely to be written by the same author as the REFERENCE POST.
4. Output BOTH if either post could have been written by the same author or neither could have been written by the same author.
Here are the POSTS:
REFERENCE POST: {{reference_post}}
POST ONE: {{post_one}}
POST TWO: {{post_two}}
Output a justification for your judgement, then output POST ONE, POST TWO, or BOTH to indicate your final decision.

Prompt 8 Judge LLM prompt used to select a generated comment more likely to align with a reference comment authored by a user for AITA.
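Since both judge prompts ask for a justification followed by a final decision, one way to read off the decision is to take the last occurrence of one of the three allowed labels. The paper does not describe how judge outputs were parsed, so the sketch below is an assumption, and `parse_judge_verdict` is a hypothetical helper.

```python
import re

def parse_judge_verdict(response):
    """Return the last of 'POST ONE', 'POST TWO', 'BOTH' in the response, or None."""
    matches = re.findall(r"POST ONE|POST TWO|BOTH", response)
    return matches[-1] if matches else None

# The justification may mention either post before the final decision
verdict = parse_judge_verdict(
    "POST ONE is formal, but POST TWO matches the tone. POST TWO"
)
```

Taking the last match is a simple heuristic; a justification that ends by restating a rejected option would be misread, so outputs that lack a clear final label may need manual inspection.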

Appendix C Additional Results

Here we present results in addition to those in §5.2, which were omitted from the main text primarily in the interest of space.

C.1 Qualitative example from AITA

Figure 4: A qualitative example illustrating the effectiveness of Pearl on AITA: Given a request post $q_u$ describing an ambiguous interpersonal situation regarding sharing medical information, Pearl retrieves a historical user comment $d_u$ that demonstrates the user’s characteristic tone and values, and generates a comment $t_u$ highly similar to the ground-truth user comment $t_u^{*}$. We bold qualitatively similar phrases about individual liberties and italicize phrases about self-care and mental health. All texts are abbreviated for space, and provided in full in Appendix C.

Here we provide the full texts of the qualitative AITA examples from Figure 4, alongside the top-retrieved comment from BM25 to show the qualitative difference in retriever effectiveness.

Original post $q_u$: going for surgery without telling my parents. So , I guess I ’ll start off with background information .I am 21 , and trans masculine , or female to male if that term works for you

Gold user-authored comment $t_u^{*}$: “NTA, honestly you can do whatever you choose feels best to you. Your parents have not been supportive of your transition so they have elected not to be a part of the process. You can tell them any amount of detail you like whether before, after, or at different times. It’s good you have helpful friends. Good luck!”

Top Pearl-retrieved comment $d_u$: “NTA, you don’t owe your abusive ex anything. That being said, in your position I would consider getting a friend to tell him just so you yourself can get closure and not be wondering when he’s going to harass you about it.”

Top BM25-retrieved comment $d_u$: “YTA, just another lame guy who is only interested in a girl if he has a chance to sleep with her”

Pearl LLM output: “NTA. You are an adult and have the right to make your own medical decisions. It’s understandable that you don’t have a great relationship with your parents and don’t feel comfortable sharing this information with them. It’s also important to note that being transgender is not self-harm and it’s great that you are taking care of your mental health by seeing a counselor and nurse regularly. Wishing you a speedy recovery!”

C.2 Ablating Training Strategies

In Tables 4 and 5, we compare common alternative strategies for training $f_{\textrm{retr}}$ while keeping inference steps identical, for gpt-35-turbo and davinci-003 respectively. These serve to ablate our specific contributions: the selection of training requests which benefit from personalization and our scale-calibrating objective.

No calibrated objective Removing the scale calibration and using a standard KL divergence (– calibrated sm) degrades performance for AITA and sees comparable performance in WorkSm with gpt-35-turbo in Table 4. With davinci-003 we see scale calibration consistently improves performance (Table 5). This indicates the importance of calibration for estimating the benefit of a historical document to a request consistently across datasets and LLMs. Appendix C.3 shows scale calibration also consistently improves the correlation of retriever scores with task performance.

| Method | AITA: Macro F1 (%) | WorkSm: BS-F1 | WorkSm: R1 | WorkSm: R2 |
|---|---|---|---|---|
| Pearl | 65.34 | 36.49 | 0.5082 | 0.2676 |
| no calibrated sm | 63.01 | 36.69 | 0.5095 | 0.2654 |
| distill $p_{\textrm{aux}}(t_u \mid d_u, q_u)$ | 62.32 | 35.64 | 0.5057 | 0.2652 |

Table 4: Pearl compared to alternate training of $f_{\textrm{retr}}$ for gpt-35-turbo.
| Method | AITA: Macro F1 (%) | WorkSm: BS-F1 | WorkSm: R1 | WorkSm: R2 |
|---|---|---|---|---|
| Pearl | 61.21 | 39.60 | 0.5419 | 0.3094 |
| no calibrated sm | 57.27 | 38.88 | 0.5350 | 0.3033 |
| distill $p_{\textrm{aux}}(t_u \mid d_u, q_u)$ | 55.52 | 39.34 | 0.5359 | 0.3059 |

Table 5: Pearl compared to alternate training of $f_{\textrm{retr}}$ for davinci-003.

Distill paux(tu|du,qu)subscript𝑝auxconditionalsubscript𝑡𝑢subscript𝑑𝑢subscript𝑞𝑢p_{\textrm{aux}}(t_{u}|d_{u},q_{u})italic_p start_POSTSUBSCRIPT aux end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT | italic_d start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT , italic_q start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ) to fretrsubscript𝑓retrf_{\textrm{retr}}italic_f start_POSTSUBSCRIPT retr end_POSTSUBSCRIPT. The proposed fretrsubscript𝑓retrf_{\textrm{retr}}italic_f start_POSTSUBSCRIPT retr end_POSTSUBSCRIPT is trained on documents which benefit personalization and requests which benefit from personalization. Here, we compare to an approach that only selects documents that benefit personalization by maximizing paux(tu|qu,du)subscript𝑝auxconditionalsubscript𝑡𝑢subscript𝑞𝑢subscript𝑑𝑢p_{\textrm{aux}}(t_{u}|q_{u},d_{u})italic_p start_POSTSUBSCRIPT aux end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT | italic_q start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ). This assumes that all training requests benefit from personalization. We train fretrsubscript𝑓retrf_{\textrm{retr}}italic_f start_POSTSUBSCRIPT retr end_POSTSUBSCRIPT with a KL-divergence objective. This approach, also, closely resembles prior work example selection in non-personalized tasks (Rubin et al., 2022) as well as personalized tasks Salemi et al. (2024). We see in Table 4 and 5 (distill paux(tu|du,qu)subscript𝑝auxconditionalsubscript𝑡𝑢subscript𝑑𝑢subscript𝑞𝑢p_{\textrm{aux}}(t_{u}|d_{u},q_{u})italic_p start_POSTSUBSCRIPT aux end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT | italic_d start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT , italic_q start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT )) that this lowers performance markedly, indicating the value of our approach.

C.3 Calibration Evaluation

Since we aim to train generation-calibrated retrievers, we evaluate calibration performance, i.e., the extent to which a retrieval model's scores are predictive of downstream generation performance (Table 6). Here, Pearson $r$ is reported between the top-1 document score for a request and the downstream generation evaluation metric: R2 for WorkSm, and Macro-F1 for AITA. To do this for AITA, we first bin evaluation requests into equal-sized bins by top-1 document score, $s_1$, and then measure Pearson $r$ between the bin start and the average evaluation metric per bin. Our metric is in contrast with prior work (Dhuliawala et al., 2022; Yan et al., 2022) that focuses on classification tasks, where model-predicted class probabilities can be used for measuring calibration; such probabilities are missing in our setup.
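The binned correlation described above can be sketched as follows. The sketch assumes a per-request metric value that is averaged within each bin and drops any leftover requests that do not fill a bin; both choices, and the function names, are assumptions for illustration. For a set-level metric such as Macro-F1, the per-bin value would instead be computed over the predictions in each bin.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def binned_calibration(top1_scores, metrics, num_bins):
    """Correlate each bin's starting score with its average metric."""
    pairs = sorted(zip(top1_scores, metrics))
    size = len(pairs) // num_bins  # assumption: leftover requests are dropped
    starts, means = [], []
    for b in range(num_bins):
        chunk = pairs[b * size:(b + 1) * size]
        starts.append(chunk[0][0])  # lowest top-1 score in the bin
        means.append(sum(m for _, m in chunk) / len(chunk))
    return pearson_r(starts, means)

# Toy data in which higher retriever scores go with better outputs
r = binned_calibration(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    num_bins=3,
)
```

Note that `pearson_r` is undefined when either sequence is constant, so a real evaluation would need to guard against bins with identical averages.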

Among baseline methods, we see the scores of sparse and dense retrieval methods, BM25 and MPNet-1B, to be better calibrated with downstream performance compared to likelihood-based methods like QL-FT5. Next, we see Pearl to be better correlated with downstream performance for WorkSm and AITA, indicating the effectiveness of our training strategy. Further, we also report on an ablated model that does not use the scale-calibrated objective of Eq. (3) (– calibrated sm). We see this approach underperform Pearl, indicating the importance of the scale-calibrated objective for a well-calibrated crossencoder. The poorer calibration of crossencoders also finds support in prior work showing their scores to lie at extremes of the score distribution (Menon et al., 2022; Yadav et al., 2022).

| Dataset | Method | davinci-003 (Pearson $r$) | gpt-35-turbo (Pearson $r$) |
|---|---|---|---|
| AITA | BM25 | 0.08 | -0.05 |
| AITA | MPNet-1B | 0.07 | -0.14 |
| AITA | UPR | -0.48 | -0.02 |
| AITA | RelevanceCE | 0.07 | -0.19 |
| AITA | Pearl $f_{\textrm{retr}}$ | 0.11 | 0.45 |
| AITA | – calibrated sm | -0.48 | 0.12 |
| WorkSm | BM25 | 0.42 | 0.52 |
| WorkSm | MPNet-1B | 0.54 | 0.52 |
| WorkSm | UPR | -0.05 | -0.02 |
| WorkSm | RelevanceCE | 0.56 | 0.49 |
| WorkSm | Pearl $f_{\textrm{retr}}$ | 0.64 | 0.64 |
| WorkSm | – calibrated sm | 0.58 | 0.55 |

Table 6: Calibration performance of Pearl evaluated through correlation between score for top-1 document and Macro-F1 for AITA, and R2 for WorkSm.

C.4 Selective Revision with Pearl – Extended Results

In §5.3 we demonstrate how our trained retrieval model can be used for selective revision with gpt-35-turbo. Prompts 9 and 10 present the prompts used for revision with both LLMs.

Given a POST from the subreddit Am I the Asshole and a DRAFT comment from the USER responding if the author of the POST is an asshole or not the asshole, edit the DRAFT comment.
Use the following instructions for your response:
1. Maintain consistency in tone and style with the USERs historical comments.
2. Edit the DRAFT to use more reddit lingo.
3. Remove statements of the POST from the DRAFT.
4. Output a justification for your edits starting with the word JUSTIFICATION.
5. Output the edited DRAFT comment starting with the words EDITED DRAFT.
Here are some historical comments by the USER: {{historical_examples}}
REQUEST: {{target_request}}
DRAFT: {{target_draft}}
Output a justification for your edits, then output the edited DRAFT starting with the words EDITED DRAFT.
Prompt 9 $f_{\textrm{LLM}}$ prompt used for selective revision given a Stage 1 draft for AITA.
Given a REQUEST and a DRAFT from a USER to author a social media POST, edit the DRAFT to satisfy the REQUEST.
Use the following instructions for your response:
1. Enumerate any missing missing information from the REQUEST in the DRAFT.
2. Enumerate any irrelevant information for the REQUEST in the DRAFT.
3. Then output the edited DRAFT starting with the words EDITED DRAFT.
REQUEST:{{target_request}}
DRAFT: {{target_draft}}
Output missing or irrelevant information for the REQUEST, then output the EDITED DRAFT satisfying the REQUEST.
Prompt 10 $f_{\textrm{LLM}}$ prompt used for selective revision given a Stage 1 draft for WorkSm.
Figure 5: The impact of selective revision (§5.3) in Pearl on WorkSm compared for requests of different lengths and users with varying numbers of historical posts. (a) Effectiveness of selective revision for requests of different lengths (in words). (b) Effectiveness of selective revision for users with different numbers of historical posts.

In Figure 5, we examine the impact of selective revision in WorkSm for requests of different lengths and users with different numbers of historical posts. We see that revision benefits requests of medium length and users with few posts. From Figure 5a, we hypothesize that requests that are too short may require additional user input and see no gains from revision. On the other hand, requests that are too long may be more challenging to follow and are unlikely to improve from revisions. From Figure 5b, we see that users with few posts benefit from revision, indicating that these users see poorer retrievals. On the other hand, users with larger profiles see a drop in performance, indicating that better calibration may further improve the performance of selective revision.

Note that we do not report results with davinci-003 since our procedure for learning a threshold $\theta$ for selective revision failed to find a threshold at which selective revision improved dev set performance. Finally, note that the metrics reported for selective revision in Table 3 are not directly comparable to those of Tables 1, 4, and 5 since they represent different LLM runs and exclude a dev set from WorkSm and AITA for learning $\theta$ (50 and 200 instances respectively).
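One way such a threshold could be learned is sketched below. The procedure is described in §5.3 and is not reproduced in this appendix, so the sketch is an assumption: it revises requests whose top-1 retriever score falls below $\theta$, searches over observed dev set scores as candidate thresholds, and returns no threshold when none improves on never revising, which mirrors the davinci-003 outcome noted above. All names and values are hypothetical.

```python
def learn_threshold(scores, draft_metric, revised_metric):
    """Pick theta maximizing the mean dev metric when scores below theta are revised.

    scores:         top-1 retriever score per dev request.
    draft_metric:   metric of the Stage 1 draft per request.
    revised_metric: metric of the revised output per request.
    Returns None when no threshold beats never revising.
    """
    n = len(scores)
    best_theta, best_value = None, sum(draft_metric) / n
    for theta in sorted(set(scores)):
        value = sum(
            r if s < theta else d
            for s, d, r in zip(scores, draft_metric, revised_metric)
        ) / n
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta

# Made-up dev data: revision helps only the lowest-scoring request
theta = learn_threshold(
    [0.1, 0.4, 0.8],
    [0.2, 0.5, 0.9],
    [0.6, 0.4, 0.7],
)
```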

Appendix D Extended Related Work

Having discussed the closest body of related work in §2 we discuss additional related work here.

Dynamic prompts for LLMs Besides training retrievers for in-context example selection discussed in §2, other approaches have explored better use of pre-trained models for example selection. Creswell et al. (2023) select examples based on the target LLM likelihood - necessitating access to LLM likelihoods and incurring latency in retrieval. Gupta et al. (2023) explore selecting sets of examples with dense retrieval models, presenting a complementary approach to ours. Finally, Pan et al. (2023) use retrieval models to select examples from multiple knowledge sources and train a routing model to decide the source of knowledge to retrieve from – selective revision (§5.3) based on a retriever may be seen as a form of routing.

Prompt robustness in LLMs Simultaneous routing and retrieval also relates our approach to work ensuring that LLMs are robust to noisy retrievals. Prior approaches ensured robustness by only using retrieved documents based on simple frequency filters on entities mentioned in the input query (Mallen et al., 2023) or based on predictions from an NLI model that determines if the query entails the retrieved contexts (Yoran et al., 2023). Other approaches have sought to fine-tune the LLM to be robust to irrelevant contexts (Li et al., 2023c; Luo et al., 2023; Yoran et al., 2023) or modify the decoding procedure (Shi et al., 2023). In contrast, we determine the quality of the input context based on scale-calibrated retrieval model scores.

LLM chaining In selectively editing generations with an LLM for low-performing requests, our approach also relates to recent work on composing LLMs with other models to build more complex systems (Wu et al., 2022; Arora et al., 2023; Khattab et al., 2023). Close work is presented by approaches that leverage repeated LLM calls to verify the reasoning or factuality of previous generations (Shridhar et al., 2023; Dhuliawala et al., 2023). In contrast, our work leverages an efficient retrieval model to selectively direct low-performing generations for further revision, reducing the total number of expensive LLM calls necessary. In this respect, our approach bears similarity to Zhang et al. (2023b), who progressively run larger LLMs only when necessary for an input.

Calibrated retrievers A small body of work has explored calibrated ranking models. Yan et al. (2022) train scale-calibrated ranking models for recommendation models used in advertisement pricing systems. In contrast, our work leverages scale-calibration for personalized writing assistance. Other work has explored joint training of retrievers and generative models to obtain calibrated retrievers (Dhuliawala et al., 2022), using Gaussian embeddings to estimate retriever uncertainty (Zamani and Bendersky, 2023), or estimating retriever confidence with Monte Carlo dropout (Cohen et al., 2021). In contrast with probabilistic uncertainty estimation, Pearl minimally modifies training to result in a calibrated model and does not require extensive changes to training or model architecture, or additional inference costs.

Writing assistants A sizable body of work has explored the development of writing assistants. Compared to assistants for communication applications, these have been targeted at authors of creative texts like screenplays (Mirowski et al., 2023), stories (Akoury et al., 2020), and poems (Gonçalo Oliveira, 2017). Consequently, they focus on diverse generations and long-range coherence, rather than personalization. Further, while our work leverages a request-driven assistant, prior systems have used a variety of interaction and control methods. While text completion presents a common form of interaction (Clark et al., 2018), recent work has seen use of infilling, tag-based control (Sun et al., 2021), and instruction-guided generations (Chakrabarty et al., 2022); a deeper examination of control and interaction strategies and their trade-offs is presented in related reviews (Zhang et al., 2023a; Lin et al., 2023). While our approach to personalization may be extended to some alternative interaction paradigms, other interaction techniques are likely to necessitate additional work.

Personalized text generation While we have focussed on author personalization that aims to mimic stylistic patterns, interests, and values of an author, we briefly review reader-personalized text generation – a setup aiming to generate texts that are engaging and relevant to readers’ preferences. This has historically been explored for generating personalized reviews (Ni et al., 2017), recipes (Majumder et al., 2019), news headlines (Ao et al., 2021) and in dialogue agents (Mazaré et al., 2018; Zhang et al., 2018). Related work is also found in text simplification and lay summarization in the context of scientific text – this work has explored generating definitions for scientific concepts at varying levels of complexity (August et al., 2022; Murthy et al., 2022) or summarizing scientific text for lay readers (Guo et al., 2021). While recent work has explored this with modern LLMs (Li et al., 2023d; Farajidizaji et al., 2023), reader personalization remains an understudied problem and presents a rich area for future work.

Appendix E Limitations

Here, we discuss limitations of our work that derive from our choice of $f_{\textrm{aux}}$ and $f_{\textrm{LLM}}$, our evaluation setup, and the design of our method.

Choice of LLMs Our experiments use two closed LLMs through API access (davinci-003, gpt-35-turbo). While we show the value of Pearl with LLMs of varying performance, establishing its effectiveness with other LLMs will require further work. We also acknowledge that closed LLMs limit experimental reproducibility; however, given the widespread use of GPT models (Hu, 2023) we believe our investigation is meaningful. Finally, in constructing training data for instance selection models for an LLM, prior work has noted the best empirical performance from matching $f_{\textrm{aux}}$ and $f_{\textrm{LLM}}$ (Rubin et al., 2022). While we demonstrate benefits from using significantly smaller models for $f_{\textrm{aux}}$, using an open LLM will allow further validation of this result in the context of our approach. However, using a larger (open) model for $f_{\textrm{aux}}$ will incur additional costs in creating training data, and smaller models for $f_{\textrm{LLM}}$ are likely to see worse generation performance; exploring this tradeoff requires future work.

Evaluation setup Next, while WorkSm represents an impactful and realistic use case for writing assistants, we acknowledge that its private nature limits reproducibility. Further, our evaluation sets of WorkSm and AITA represent a limited set of the scenarios that are likely to leverage writing assistants. While we believe our work represents a meaningful first step, additional future work and online evaluations are necessary to establish the value of Pearl across the myriad scenarios where writing assistants may be used. Finally, while we leverage several evaluation strategies to demonstrate the value of Pearl, evaluating text generations under personalization setups is an under-explored and currently emerging body of work (Wang et al., 2023a, d).

Method design Finally, we note that the current design of Pearl is likely to have some drawbacks. It is possible that our proposed method for training instance selection biases system performance toward some users or requests; we leave examination of this to future work. It is also possible that formulating $f_{\textrm{retr}}$ as an expressive crossencoder and the use of large LLMs will present latency limitations for interactive applications; exploration of models supporting faster retrieval and text generation inference represents important future work.

## TASK
Given an enterprise social media post, generate a set of writing instructions that explain how to
"reverse-engineer" the post. Use the following steps:
- The instructions must give a high-level overview of what the post aims to communicate. Example: [redacted]
- The instructions must include specific proper nouns (people, places, organizations). Example: [redacted]
- The instructions must include contact information if available. Example: [redacted]
- The instructions must include specific links to websites or files if available. Example: [redacted]
- The instructions must contain any knowledge that is highly specialized and is likely to be only known to the individual who wrote the post, if available. Example: [redacted]
- The instructions must contain rough sketches of any personal anecdotes in the post, if available. Example: [redacted]
- The instruction must **not** contain any formatting or ordering information from the post.
## OUTPUT
Output the following:
<Explanation>{explanation of your reasoning for how you generated the instructions, in 3 sentences or fewer}</Explanation>
<Instruction.Overview>{1-2 sentences overview of what the post aims to communicate}</Instruction.Overview>
<Instruction.Names>{1-2 sentences about the people, places, or organizations mentioned in the post, _NONE_ if not applicable}</Instruction.Names>
<Instruction.Contacts>{1-2 sentences about the contact information copied verbatim in the post, _NONE_ if not applicable}</Instruction.Contacts>
<Instruction.Links>{1-2 sentences including the links copied verbatim from the post, _NONE_ if not applicable}</Instruction.Links>
<Instruction.Knowledge>{1-2 sentences paraphrasing the specialized knowledge included in the post, _NONE_ if not applicable}</Instruction.Knowledge>
<Instruction.Anecdotes>{1-2 sentences paraphrasing the anecdotes included in the post, _NONE_ if not applicable}</Instruction.Anecdotes>
## INPUT
{{input_post}}
Prompt 11: GPT4 prompt used to generate synthetic requests for WorkSm posts in our training set.
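The tagged output format requested in the prompt above lends itself to simple post-processing. The following is a minimal sketch, not the authors' implementation, of how the `{{input_post}}` placeholder could be filled and how the `<Instruction.*>` fields could be parsed from a response. The function names (`build_prompt`, `parse_response`, `to_request`) are hypothetical, and joining the kept fields into a single request string is an assumption; the paper does not specify how the fields are combined.

```python
import re

# Field names mirror the <Instruction.*> tags in the OUTPUT section of the prompt.
FIELDS = ["Overview", "Names", "Contacts", "Links", "Knowledge", "Anecdotes"]


def build_prompt(template: str, post: str) -> str:
    """Fill the {{input_post}} placeholder of the prompt template."""
    return template.replace("{{input_post}}", post)


def parse_response(response: str) -> dict:
    """Extract each <Instruction.X> field, skipping missing or _NONE_ fields."""
    parsed = {}
    for field in FIELDS:
        match = re.search(
            rf"<Instruction\.{field}>(.*?)</Instruction\.{field}>",
            response,
            flags=re.DOTALL,
        )
        if match is None:
            continue
        value = match.group(1).strip()
        if value and value != "_NONE_":
            parsed[field] = value
    return parsed


def to_request(parsed: dict) -> str:
    """Join the kept fields in a fixed order (an assumed way to form a request)."""
    return " ".join(parsed[f] for f in FIELDS if f in parsed)
```

Fields marked `_NONE_` are dropped so that the resulting request contains only information actually present in the post; the `<Explanation>` field is ignored because it describes the model's reasoning rather than the post.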
Overview:
In this study, we are developing LLM-based approaches for writing social media posts on enterprise social networks. Your task is as follows: Given a social media post from an enterprise social media platform, write a short outline of the post. In writing your outline, imagine you are a manager, social media manager, or event organizer writing a rough sketch of the post with the key information you would like to share.
Data Format:
You are given a spreadsheet consisting of ~150 English posts. Each row corresponds to a single post. The spreadsheet contains the following columns: PostId, InputPost, OutputShortOutline. The first column is the ID of the post; you can ignore this column. The second column is the full text of the input post. In the third column, you will write your short outline based on the input post.
DO’s for your outline:
When writing your short outline, do include the following:
- One sentence about the goal of the post: Include a brief description of what the post is trying to communicate. Example: [redacted]
- Specific proper nouns (people, places, things): Include names of specific people, places, or things in your outline. Example: [redacted]
- Specialized knowledge: If the knowledge contained in the post is highly specialized and is likely to be only known to the individual writing the post, include a rough sketch of that information in your outline. Example: [redacted]
- Personal anecdotes: If the post contains specific personal anecdotes, include a rough sketch of that information in your outline. Example: [redacted]
- Special emphasis or call to action: If the post makes a special emphasis, include a rough sketch of that emphasis or call to action in your outline. Example: [redacted]
- External website links: If the post links to an external website, include the link in your outline. Example: [redacted]
DONT’s for your outline:
When writing your short outline, do not include the following:
- Anything related to the ordering of content.
- Formatting instructions.
- Any verbatim text other than specific proper nouns.
Guideline 1: Instructions provided to annotators for authoring requests for our evaluation set in WorkSm.