LABOR-LLM: Language-Based Occupational Representations with Large Language Models

Tianyu Du
Institute for Computational
and Mathematical Engineering
Stanford University
[email protected]
Ayush Kanodia
Department of Computer Science
Stanford University
[email protected]
Herman Brunborg
Institute for Computational
and Mathematical Engineering
Stanford University
[email protected]
Keyon Vafa
Harvard Data Science Initiative
Harvard University
[email protected]
Susan Athey
Graduate School of Business
Stanford University
[email protected]
These authors contributed equally to this work.
Acknowledgment. We want to express our sincere appreciation and gratitude to the Golub Capital Social Impact Lab, Business Government and Society at Stanford GSB, and Stanford Institute for Human-Centered AI (HAI) for their research support. We would also like to thank members and staff, especially Analía Gómez Vidal, in the Golub Capital Social Impact Lab for their contribution to the completion of this paper.

Abstract

Many empirical studies of labor market questions rely on estimating relatively simple predictive models using small, carefully constructed longitudinal survey datasets based on hand-engineered features. Large Language Models (LLMs), trained on massive datasets, encode vast quantities of world knowledge and can be used for the next job prediction problem. However, while an off-the-shelf LLM produces plausible career trajectories when prompted, the probability with which an LLM predicts a particular job transition conditional on career history will not, in general, align with the true conditional probability in a given population. Recently, Vafa et al. (2024) introduced a transformer-based “foundation model”, CAREER, trained using a large, unrepresentative resume dataset, that predicts transitions between jobs; it further demonstrated how transfer learning techniques can be used to leverage the foundation model to build better predictive models of both transitions and wages that reflect conditional transition probabilities found in nationally representative survey datasets. This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs. For the task of next job prediction, we demonstrate that models trained with our approach outperform several alternatives in terms of predictive performance on the survey data, including traditional econometric models, CAREER, and LLMs with in-context learning, even though the LLM can in principle predict job titles that are not allowed in the survey data. Further, we show that our fine-tuned LLM-based models’ predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER. We conduct experiments and analyses that highlight the sources of the gains in the performance of our models for representative predictions.

1 Introduction

Predictive models of individual career trajectories are important components of many labor economic analyses and career planning tools. These predictive models are building blocks used in empirical analyses in economics and social sciences to understand labor markets. For example, such models are used to study labor market turnover (Hall et al., 1972) and to quantify and decompose wage gaps by gender (Blau and Kahn, 2017) and race (Fairlie and Sundstrom, 1999). For many applications, it is important to have predictions of job transitions conditional on career history that are representative of the general population. For example, models that predict outcomes conditional on worker history are used to estimate average (over a representative set of workers in a defined subpopulation, such as U.S. high school graduates) counterfactual differences in outcomes that result from a policy intervention, such as when estimating the causal effect of training programs (Ashenfelter, 1978; Dehejia and Wahba, 1999) or the causal effect of displacement (Jacobson et al., 1993). Policy-makers may also need to predict the future transitions for particular subgroups of workers when considering policies that affect them or their families (Athey et al., 2024). In some recommendation system settings (de Ruijt and Bhulai, 2021), it may be desirable that job recommendation tools predict the most likely outcomes for a worker conditional on their history and context.

With sufficiently large data from a representative longitudinal dataset, estimating a predictive model with unbiased conditional predictions about labor market outcomes such as transitions would boil down to training a sufficiently flexible model of transitions conditional on history. However, in practice, the number of possible career paths for workers is extremely large relative to the population, let alone relative to available data. In the U.S., there are a few relatively small, survey-based longitudinal datasets broadly available to researchers where the surveys attempt to find a representative sample of the population. Although labor economists have studied a wide range of questions using these survey datasets (Rothstein et al., 2019; Johnson et al., 2018), because of their small size, traditional models estimated on these datasets impose restrictive functional form assumptions. For example, models that condition on career history often assume that the next occupation of a worker depends only on their last occupation and some covariates (Hall et al., 1972) or a few summary statistics about their past (Blau and Riphahn, 1999). As a result, traditional models have limited predictive power.

Recent advances in deep sequential models (e.g., RNNs and transformers), which encode career histories as low-dimensional representations, offer a promising way to design better occupational prediction models. These deep learning models often require a lot of data to train on, and existing small-scale survey datasets fail to meet this requirement. Fortunately, online sources, from job posting websites to news articles about the labor market, encode a large amount of information about career transitions and can be used as a supplementary data source. Breakthroughs in artificial intelligence offer a method to leverage these online data sources: one can train a foundation model (Brown et al., 2020) using large-scale datasets. Because foundation models are built on a backbone of larger data, they can learn the general structure underlying the observed data (here, the labor market) (Bommasani et al., 2022). Then, the researcher can fine-tune the foundation model on much smaller survey datasets of interest. For example, Vafa et al. (2024) develop the CAREER framework, which uses a transformer architecture to model transitions as, first, a discrete choice of whether to change jobs at all and, second, a discrete choice among a set of occupations. CAREER is trained using a large, unrepresentative resume dataset and fine-tuned using U.S. survey data. Vafa et al. (2024) show that this approach yields more accurate predictions than models trained only on survey data, and improves predictive power substantially over traditional econometric models.

Large language models (LLMs) are foundation models for natural language. They consist of tens or hundreds of billions of parameters, are trained on massive, broad text corpora (Brown et al., 2020), and encode a wide variety of world knowledge, potentially capturing a more comprehensive range of labor market information. The public release of LLMs, pre-trained on massive amounts of text data using substantial computational resources, has ushered in a new era where these models are used for tasks beyond Natural Language Processing (NLP), such as protein sequence generation (Rives et al., 2021), scientific research (Taylor et al., 2022), and more. It is natural to consider using these models for the next job prediction problem.

In this paper, we propose the LAnguage-Based Occupational Representations with Large Language Models (LABOR-LLM) framework, which incorporates several approaches to leveraging LLMs for modeling labor market data and producing representative predictions. The simplest way to produce next-job predictions is to condition an LLM on job history and demographics by prompting the LLM using a text representation of such a job history and demographics, produced using a text-based template. We also consider more complex approaches, including fine-tuning as well as approaches that extract embeddings from LLMs and incorporate them into multinomial classifier models trained to predict the choice of next job for a worker given the embedding that summarizes worker history. We compare the performance of several alternative models within the LABOR-LLM framework, and further we contrast these with alternative baselines, including in particular the state-of-the-art CAREER (Vafa et al., 2024) framework.

A concern with approaches that build on general-purpose LLMs is that they may or may not yield predictions that are representative of the job transitions of the general public. An LLM is generally not trained on representative data, or even for the task of next job prediction, so it may produce poor predictions for demographics that are underrepresented in its training set (Buolamwini and Gebru, 2018). If we query an LLM to predict a job trajectory for an individual, it will likely generate a coherent and plausible trajectory. However, there is no guarantee that the probability that a particular transition is specified by an LLM will be consistent with the true probability of that career transition for workers with similar histories in the population at large.

Questions about the use of foundation models for tasks have arisen in other areas, such as opinion surveys. A recent literature has emerged that aims to assess whether the outputs of foundation models are representative of larger populations (Santurkar et al., 2023; Argyle et al., 2023). A common strategy to assess this for LLMs is to query them with questions from long-standing opinion surveys and see how aligned their responses are with the survey average. For example, if 70% of the respondents in these surveys respond “Yes” when asked “Do you support taxing the rich?”, we can query an LLM with the same question and assess if it responds with “Yes” 70% of the time.

In this paper, we propose to evaluate representative predictions in a stronger sense: the distributions of predicted next jobs should be representative of true next jobs conditional on job histories. This type of conditional representativeness can be analyzed with reference to a particular population, and in some contexts, it may be important that a model be sufficiently representative within subpopulations of interest, such as disadvantaged socioeconomic groups or groups that are the target of policy interventions. In general, a transformer model such as an LLM trained with the objective of accurate next-token prediction (conditional on a sequence of past tokens) will make predictions that are representative of the set of next-token prediction examples from the training data. However, the training data may or may not be representative of the population of interest to an analyst. Further, there are a variety of subtle choices to be made when defining the population of next-token predictions that would be used in an ideal test set for evaluating performance, as well as in how to measure performance. These choices include which subpopulations to focus on, whether to take the perspective of a population of individuals (and their full careers) or a population of transitions (where individuals with longer careers have more transitions), and what evaluation measures to use (e.g., accuracy of the most likely prediction or the complete likelihood assigned by the model to all possible next jobs). Different substantive goals lead to different choices of objective function as well as different weightings of examples in training and testing.

In this paper, we make several choices about how to operationalize representativeness. First, a language model, such as a transformer, can be viewed as estimating conditional probability distributions over future tokens given past tokens. For the task of next job prediction, we evaluate the model’s performance at estimating conditional probabilities over the next occupation (which, when considered as text, consists of several tokens) given job history and various covariates. We evaluate the quality of the estimates produced by a model using measures such as perplexity (Jelinek et al., 2005) constructed from the log-likelihood of the observed occupation according to a model’s estimates. For expositional and computational simplicity, we consider our target population to be a population of career transitions, so that workers with longer careers are weighted more heavily, and we focus on the set of career transitions that appear in three widely used government-collected representative U.S. administrative surveys as a target population of interest. We further examine representativeness within subpopulations.

Our results suggest that off-the-shelf LLMs provide unsatisfactory performance using these datasets compared to previous baseline models, but that fine-tuning LLMs on survey data improves performance beyond the state-of-the-art methods (e.g., Vafa et al. (2024)). This is a surprising fact: while CAREER was created using resumes specifically for the problem of job prediction, general-purpose LLMs acquire this ability passively.

Furthermore, we find that the predictions from these fine-tuned LLMs are representative of career trajectories of various demographic subgroups in the workforce, conditioned on job histories. This allows us to use these models as predictive modules conditioned on various demographic subgroups and job histories despite LLMs being pre-trained on datasets that are not representative of the entire workforce.

Importantly, the LLMs we develop are more accessible than CAREER because CAREER requires proprietary resume data. In contrast, anyone with computational resources can fine-tune the publicly available LLMs. We will release our best-performing LLM.

We conduct a series of experiments and analyses to understand the advantages brought by LLMs, analyzing how the knowledge base of an LLM informs its predictions. We also compare model performance on subpopulations defined by different educational backgrounds, which indicates that fine-tuned LLMs make more accurate predictions overall and by subgroup.

Our findings demonstrate a method for adapting LLMs to make representative labor market predictions without relying on proprietary models or data.

2 Related Work

Career Trajectory Modeling and Next Job Prediction. Economists have historically fitted relatively simple predictive models of labor markets to relatively small datasets. These methods typically only predict a few occupation categories. Boskin (1974) used conditional logit models to study workers’ choices among 11 occupational groups with estimated earnings, training expenses, and costs due to unemployment. Schmidt and Strauss (1975) utilized logit models to assess how race, sex, educational attainment, and labor market experience influence the probability that individuals attain five different occupational categories, revealing significant effects of these variables on occupational outcomes. Although future occupations can have complex dependencies on the entire sequence of previous jobs, traditional methods typically only leverage the most recent past job with curated features summarizing the job history (Hall et al., 1972) or some summary statistics (Blau and Riphahn, 1999).

Machine Learning Methods for Next Job Prediction. In the context of resume datasets, researchers have utilized deep learning and graph neural network methods to develop machine learning algorithms that model sequences (Li et al., 2017; Meng et al., 2019; Zhang et al., 2021). Extending these methods, Vafa et al. (2024) developed CAREER, a transformer model pre-trained on a massive dataset of 24 million resumes. The model was then fine-tuned on survey datasets, where it demonstrated superior performance compared to other approaches for the next job prediction problem. Our paper introduces an alternative approach to CAREER, starting with a pre-trained LLM instead of pre-training our own model. We show how to adapt this model to the task of predicting the next job on survey datasets.

Natural Language Processing and Language Modeling. In the approaches mentioned above, jobs are represented as individual discrete choices. However, job titles also have an inherent linguistic meaning. We review recent developments in the Natural Language Processing (NLP) and LLM literature, which inform our approach to modeling the next job prediction problem as a language modeling problem. Recurrent Neural Network (RNN) models based on architectures such as GRU (Cho et al., 2014) and LSTM (Hochreiter and Schmidhuber, 1997) were an important class of performant NLP methods. However, they process the tokens in a sentence sequentially, which is computationally expensive because the computation cannot be parallelized across positions. Transformer architectures (Vaswani et al., 2017) with attention broke through this computational barrier by utilizing a key-query-value design to allocate attention dynamically while making predictions. These models were accompanied by powerful unsupervised training methods such as Causal Language Modeling (CLM) (Brown et al., 2020) and Masked Language Modeling (MLM) (Devlin et al., 2019). Recently, industry practitioners have leveraged transformers’ scalability and developed LLMs with billions of trained parameters, such as GPT-3 (Brown et al., 2020) and Llama (Touvron et al., 2023).

The Biases and Representativeness of LLMs. LLMs have since been used in several open-ended tasks, such as dialog (Yi et al., 2024) and recommendations (Geng et al., 2022), and have begun to have an impact on public opinion. Social science researchers have begun using them to emulate survey responses, leading to an emerging research agenda on the study of LLMs’ biases and their effectiveness in simulating survey responses. Recent work has shown that LLMs produce responses to public opinion surveys that are not representative of various demographic groups, even after being steered toward them (Santurkar et al., 2023). Dominguez-Olmedo et al. (2024) show that a binary classifier can almost perfectly differentiate model-generated data from the responses of the U.S. census in the context of the American Community Survey. However, other work (O’Hagan and Schein, 2024) has shown that LLMs can be used to characterize complex manifestations of political ideology in text. Argyle et al. (2023) show that language models can be used to simulate human samples through prompting and appropriate conditioning on sociodemographic backstories, making these samples effective proxies for specific human subpopulations. We contribute to this literature by showing that off-the-shelf LLMs are not representative of survey responses for job transitions, and we show methods to make them representative. Further, for the next job prediction problem, we seek a stronger form of conditional calibration: model predictions should be calibrated within demographic subgroups conditional on job histories. We show that our models produce more conditionally representative predictions than CAREER.

Adapting LLMs to Build Domain-Specific Models. Training these LLMs from scratch requires computational resources that cost millions of dollars and incurs a high carbon footprint (Luccioni et al., 2022). The pre-training and fine-tuning paradigm offers a practical, tractable, and more sustainable way to use LLMs. The paradigm involves training a model on a large dataset to learn general knowledge and then refining it on a smaller, task-specific dataset to adapt its learned patterns to specific applications (Wei et al., 2022). Research has demonstrated that fine-tuning a pre-trained LLM on the dataset of interest can yield better results than directly training a large model from scratch. This pre-training and fine-tuning paradigm has produced state-of-the-art models for dialogue systems (Yi et al., 2024), code generation (Chen et al., 2021), music generation (Agostinelli et al., 2023), scientific knowledge (Taylor et al., 2022), protein structure prediction (Rives et al., 2021), chemistry (Zhang et al., 2024), medicine (Singhal et al., 2022), and other settings. The literature on the adaptation of LLMs for recommendation systems is also closely related. Geng et al. (2022) introduced a general paradigm to adapt the recommendation task to language processing. We propose a language modeling approach to the next job prediction task. Moreover, we can predict the complete distribution over the next jobs treated as discrete choices with higher performance than prior state-of-the-art models while framing this problem as a causal language modeling problem.

Machine Learning in Economics. Machine learning methods are increasingly used in economics (Athey and Imbens, 2019) as modules for prediction (Kleinberg et al., 2015) and causal inference with high-dimensional data (Athey et al., 2018b). The following line of work in economics is closely related to our work and uses machine learning methods for discrete choice modeling. Athey et al. (2018a) introduce SHOPPER, a Bayesian demand model that builds item embeddings from large-scale grocery datasets and predicts customers’ choices, combining ideas from language modeling and econometrics. Donnelly et al. (2021) show how to estimate a similar demand model using a nested Bayesian matrix factorization approach that shares parameters across products, customers, and product categories (Rudolph et al., 2016) and models product choice, on a per-category basis, jointly for several categories. We contribute to this literature by combining ideas from language modeling and discrete choice modeling to build a model for labor choice.

3 Representative Occupation Modeling

The goal of occupation modeling is to predict an individual’s career trajectory. In many cases, it is important for these predictions to be representative of a larger population. In this section, we formalize the problem of occupation modeling and describe a data source that can be used to assess whether a model’s predictions are representative: national longitudinal survey datasets.

Occupation Modeling.

An individual’s career trajectory can be defined as a sequence of occupations, each held at a different timestep of their career history. An occupation model is a probabilistic model over these occupational sequences. We consider the case where occupations are represented as discrete variables. For example, survey datasets typically encode jobs into discrete occupations using taxonomies like the Standard Occupational Classification System (SOC) and the Occupation Classification Scheme (OCC).

More formally, denote by $y_{i,t}\in\mathcal{Y}$ the occupation that an individual $i$ has at time $t$ , with $\mathcal{Y}$ denoting the set of all occupations. Each worker is also associated with covariates — static covariates $x_{i}$ (e.g., ethnicity) are fixed over time, while dynamic covariates $x_{i,t}$ (e.g., education level) may change throughout the worker’s career. We use the shorthand $y_{i,<t}=(y_{i,1},\dots,y_{i,t-1})$ to denote an individual’s job sequence prior to their $t$ ’th observation (for $t\leq 1$ , define $y_{i,<t}=\varnothing$ ), and similarly $x_{i,\leq t}=(x_{i,1},\dots,x_{i,t})$ to denote the set of dynamic covariates up to the $t$ ’th observation. Lastly, we use $T_{i}$ to denote the total number of records from individual $i$ .

An occupation model is a predictive model of an individual’s next occupation:

$$P(y_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t}). \tag{1}$$

The model conditions on all previous occupations and all current and previous covariates. Covariates are treated as “pre-transition”; for example, a model may condition on an individual’s current education to predict their next job.
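As a point of reference for Equation (1), the following minimal Python sketch fits a much more restrictive occupation model that ignores covariates and conditions only on the most recent occupation, i.e., it approximates $P(y_{i,t}\mid y_{i,t-1})$ with smoothed transition counts. This is an illustration of the notation under simplifying assumptions, not a model used in this paper; the function name `fit_markov_model` and the additive smoothing are hypothetical choices.

```python
from collections import Counter, defaultdict


def fit_markov_model(sequences, smoothing=1.0):
    """Estimate P(y_t | y_{t-1}) from occupation sequences.

    `sequences` is a list of lists of occupation titles, one list per worker.
    The empty history (before the first job) is represented by None.
    Additive smoothing is an illustrative assumption.
    """
    occupations = sorted({y for seq in sequences for y in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        prev = None  # no previous occupation before the first record
        for y in seq:
            counts[prev][y] += 1
            prev = y

    def predict(prev):
        """Return a dict mapping each occupation to its estimated probability."""
        total = sum(counts[prev].values()) + smoothing * len(occupations)
        return {y: (counts[prev][y] + smoothing) / total for y in occupations}

    return predict


# Toy usage with made-up occupation labels.
predict = fit_markov_model([["teacher", "nurse"], ["teacher", "teacher"]])
next_job_distribution = predict("teacher")
```

A model in the sense of Equation (1) replaces the single previous occupation with the full history $y_{i,<t}$ and the covariates $x_{i}$ and $x_{i,\leq t}$.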

Representative Predictions.

In many settings, it is important for an occupation model to make representative predictions. In economic analyses, counterfactual simulations and policy analysis require representative model predictions so that estimation within different demographic groups is unbiased. In recommendation system settings, representative models may be required when it is important to surface recommendations that resemble the true underlying job transitions in a subgroup. For instance, a career guidance tool aimed at low-income workers may want to suggest feasible and common job transitions for that demographic rather than high-paying but unrealistic options. Further, it is important that career trajectory predictions are representative not only conditional on demographic subgroups but also conditional on job histories.

Representative Surveys.

To assess whether occupation models make representative predictions, we use longitudinal survey datasets. These datasets follow individual workers who are regularly interviewed about their lives and careers. Crucially, these datasets are constructed to be nationally representative. As a result, we can assess whether a model makes representative predictions by comparing predicted job sequences to actual sequences from survey data. We analyze three well-known survey datasets in the United States: the Panel Study of Income Dynamics (PSID), the National Longitudinal Survey of Youth 1979 (NLSY79), and the National Longitudinal Survey of Youth 1997 (NLSY97). Each survey is constructed differently and thus follows different populations. PSID, which began in 1968, aims to be representative of the United States as a whole and continues to add new workers over time. In contrast, the NLSY datasets follow specific birth cohorts: NLSY79 began in 1979 and followed individuals aged 14-22 at the time, while NLSY97 began in 1997 and followed individuals aged 12-16 at the time.

4 How Representative are LLMs as Occupation Models?

Any conditional distribution over job sequences is an occupation model. Here, we study the occupation modeling capabilities of LLMs and assess how accurately LLMs could model such conditional distributions.

LLMs are trained primarily to predict missing words from text culled from the Internet. However, they are capable of performing many tasks extending far beyond the next-word prediction task they were trained to perform, such as solving logic puzzles (Mittal et al., 2024) and modeling time series data (Jin et al., 2024). While LLMs are not explicitly trained to predict occupational sequences, they are trained on massive amounts of data containing information about career trajectories — such as news articles about the labor market and reports from the Bureau of Labor Statistics. This information may equip them with the ability to make accurate and representative predictions of occupational sequences.

Predicting occupational trajectories using LLMs requires converting occupational sequences to textual prompts that LLMs can understand. In this section, we describe a prompting strategy for eliciting occupational predictions from LLMs. With this strategy, LLMs predict plausible-sounding occupational trajectories. We then assess whether these predictions are representative of the American population by comparing them to trajectories from three nationally representative surveys. We show that LLMs consistently make unrepresentative predictions.

4.1 Prompting LLMs to Predict Occupations

LLMs are conditional probability distributions over text sequences; conditional on a sequence of text (i.e., a prompt), an LLM provides conditional probabilities over all possible continuations of the prompt. Therefore, repurposing LLMs to predict occupations requires representing occupational trajectories as text.

We create a text template, a function that transforms an individual’s career history into a textual summary; this function is denoted by $\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$. Our text template takes advantage of the fact that each occupation has a natural textual representation: its title. For example, the title of the occupation with SOC code 19-1022 is Microbiologists (readers can refer to the official Bureau of Labor Statistics website, https://www.bls.gov/OES/CURRENT/oes_stru.htm, for a list of the latest SOC titles). Job titles vary in length, and their token counts depend on how an LLM tokenizes words; for our experiments, the length of job titles ranged from 2 to 28 tokens, with an average length of 8 tokens (Figure 11 in Appendix D presents a word cloud example of job titles). We use a similar strategy for representing covariates as text; for example, we represent an individual’s educational status using values such as graduate degree.

To elicit an LLM’s predictions of an individual’s next job, we include all previous job information (along with all previous and current covariate information) in a text template. To predict an individual’s $t+1$ ’st job, the text template will begin with a description of the static covariates and then include a row for each of the $t$ previous occupations and dynamic covariates. It will conclude with a partial row for the occupation to be predicted. For example, the following text template would be used to elicit an LLM’s prediction of an individual’s third job:

<A Resume from the NLSY79 Dataset>
The following is the resume of a male white US worker residing in the northcentral region.
The worker has the following work experience on the resume, one entry per line, including job code, year, education level, and a description of the job:
1988 to 1989 (graduate degree): Secretaries and administrative assistants
1989 to 1990 (graduate degree): Carpet, floor, and tile installers and finishers
1990 to 1991 (graduate degree):

The template omits the title of the individual’s third job (“Elementary and middle school teachers”). When an LLM is prompted with this template, we can record its response as its prediction of the next job.

We can also use the text template to build the individual’s full job history. The example below shows the text representation of a worker’s entire career history generated by our text template. Note that the individual can stay in the same job for multiple records; the text representation explicitly reflects this information. This individual will have five prediction tasks in total, one for each record, throughout their job history. With a slight abuse of notation, let $\mathcal{T}(x_{i},x_{i,\leq T_{i}},y_{i,\leq T_{i}})$ denote the paragraph representing the entire career history of worker $i$ .

<A Resume from the NLSY79 Dataset>
The following is the resume of a male white US worker residing in the northcentral region.
The worker has the following work experience on the resume, one entry per line, including job code, year, education level, and a description of the job:
1988 to 1989 (graduate degree): Secretaries and administrative assistants
1989 to 1990 (graduate degree): Carpet, floor, and tile installers and finishers
1990 to 1991 (graduate degree): Elementary and middle school teachers
1991 to 1992 (graduate degree): Elementary and middle school teachers
1992 to present (graduate degree): Adult Basic and Secondary Education and Literacy Teachers and Instructors
<END OF RESUME>

The corpus of text representations of full career histories is useful when fine-tuning language models.
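The text template $\mathcal{T}$ described in this section could be sketched in Python roughly as follows. The function name `render_history`, its argument layout, and the exact placement of line breaks are assumptions made for illustration; only the wording of the template is taken from the examples above.

```python
def render_history(dataset, static_description, records, predict_next=False):
    """Render a career history as text.

    `records` is a list of (start, end, education, title) tuples.
    If `predict_next` is True, the title of the last record is omitted so that
    the text ends with a partial row for the occupation to be predicted, and
    no end-of-resume marker is appended.
    """
    lines = [
        f"<A Resume from the {dataset} Dataset>",
        f"The following is the resume of a {static_description}.",
        "The worker has the following work experience on the resume, "
        "one entry per line, including job code, year, education level, "
        "and a description of the job:",
    ]
    for k, (start, end, education, title) in enumerate(records):
        row = f"{start} to {end} ({education}):"
        is_target = predict_next and k == len(records) - 1
        if not is_target:
            row += f" {title}"
        lines.append(row)
    if not predict_next:
        lines.append("<END OF RESUME>")
    return "\n".join(lines)


records = [
    ("1988", "1989", "graduate degree", "Secretaries and administrative assistants"),
    ("1989", "1990", "graduate degree", "Carpet, floor, and tile installers and finishers"),
    ("1990", "1991", "graduate degree", "Elementary and middle school teachers"),
]
worker = "male white US worker residing in the northcentral region"

# Prompt for predicting the third job: the last title is left blank.
prompt = render_history("NLSY79", worker, records, predict_next=True)

# Full text representation of the (truncated) history, e.g. for fine-tuning.
full_text = render_history("NLSY79", worker, records)
```

Using one function for both modes keeps the prediction prompt a strict prefix of the full-history text, which is what allows a model fine-tuned on full histories to be queried with partial ones.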

4.2 Evaluating Representativeness

To assess the representativeness of an LLM’s occupational predictions, we compare its predictions to actual occupational trajectories from survey datasets. We study three commonly used survey datasets: the Panel Study of Income Dynamics (PSID) (Johnson et al., 2018) and two cohorts from the National Longitudinal Survey of Youth (NLSY79 and NLSY97) (Rothstein et al., 2019).

We randomly construct “test samples” containing 20% of individuals in each dataset (test samples contain all observations for each included individual). Table 10 in Appendix C presents summary statistics about each dataset.

For each individual in the test set, we prompt LLMs to predict each recorded observation of their career: predicting their first job from just their covariates, predicting their second job from their first job and covariates, etc. We evaluate a model’s representativeness by comparing its predictions of an individual’s next job to their actual next job. Specifically, we evaluate models by computing their perplexity, a commonly used metric in NLP. The perplexity is a monotonic transformation of log-likelihood, with lower perplexity indicating that a model’s predictions are more representative. Formally, for a model $\hat{P}(y_{i,t}|x_{i},x_{i,\leq t},y_{i,<t})$ that assigns a probability to each possible occupation, perplexity is given by

$$\exp\left\{-\frac{1}{\sum_{i}T_{i}}\sum_{i}\sum_{t=1}^{T_{i}}w_{it}\left[\log\hat{P}(y_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})\right]\right\}, \tag{2}$$

where $T_{i}$ is the number of observations for individual $i$ and $w_{it}$ denotes the sampling weight for the individual. We can adjust these weights to assess the model’s performance on different subpopulations or other objectives. For example, we can choose the weights so that each transition, or each individual, is weighted equally. We can also choose them so that we seek representative predictions only on the first few or last few transitions of every individual. In Section 4 and Section 5, we set $w_{it}=1$ to evaluate models’ performance on the general population. We consider additional evaluation metrics (such as calibration) in Section 6.
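As a concrete illustration of Equation (2), the following Python sketch computes the weighted perplexity from per-observation log-probabilities. The function name and the nested-list data layout are assumptions made for this example; the normalizer is kept as $\sum_{i}T_{i}$, as written in the equation.

```python
import math


def perplexity(log_probs, weights=None):
    """Weighted perplexity following Equation (2).

    log_probs[i][t] is the log-probability the model assigns to individual i's
    true occupation at record t; weights[i][t] is w_it (default: all ones).
    """
    if weights is None:
        weights = [[1.0] * len(row) for row in log_probs]
    total_obs = sum(len(row) for row in log_probs)  # sum_i T_i
    weighted_ll = sum(
        w * lp
        for lp_row, w_row in zip(log_probs, weights)
        for lp, w in zip(lp_row, w_row)
    )
    return math.exp(-weighted_ll / total_obs)


# Sanity check: a uniform model over 335 occupations, two individuals
# with three and two records respectively.
uniform = [[math.log(1 / 335)] * 3, [math.log(1 / 335)] * 2]
uniform_ppl = perplexity(uniform)
```

With $w_{it}=1$, a model that spreads its mass uniformly over $|\mathcal{Y}|$ occupations attains perplexity $|\mathcal{Y}|$, which gives a useful reference point when reading the tables.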

While perplexity evaluates the probabilities assigned to occupations, LLMs assign probabilities at the token level. Occupation titles typically span multiple tokens; for example, the title “software engineer” may be tokenized into two tokens, one for “software” and one for “engineer”. However, because LLMs are probabilistic models, we can use the chain rule of probability to extract probabilities assigned to full occupation titles. Equation (3) illustrates how one can obtain the conditional probability assigned to “software engineer”. See Appendix F for more details.

$$\begin{aligned} &\hat{P}(y_{i,t}=\text{``Software Engineer''}\mid x_{i},x_{i,\leq t},y_{i,<t})\\ &=P_{\text{LLM}}(\text{``Software Engineer''}\mid\text{Prompt})\\ &=P_{\text{LLM}}(\text{``Software''}\mid\text{Prompt})\,P_{\text{LLM}}(\text{``Engineer''}\mid\text{Prompt},\text{``Software''})\end{aligned} \tag{3}$$
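The chain-rule computation in Equation (3) can be sketched as follows. The callable `next_token_log_prob` is a hypothetical stand-in for an LLM forward pass, and the toy lookup table replaces a real model; whether a terminating token is also scored is left to Appendix F and is not handled here.

```python
import math


def title_log_prob(next_token_log_prob, prompt_tokens, title_tokens):
    """Sum token-level log-probabilities to score a full occupation title.

    `next_token_log_prob(context, token)` is a hypothetical callable returning
    log P_LLM(token | context).
    """
    context = list(prompt_tokens)
    total = 0.0
    for token in title_tokens:
        total += next_token_log_prob(tuple(context), token)
        context.append(token)  # later tokens are conditioned on earlier ones
    return total


# Toy stand-in for an LLM: a lookup table of conditional probabilities.
toy_table = {
    (("Prompt",), "Software"): 0.2,
    (("Prompt", "Software"), "Engineer"): 0.5,
}


def toy_model(context, token):
    return math.log(toy_table[(context, token)])


p_title = math.exp(title_log_prob(toy_model, ["Prompt"], ["Software", "Engineer"]))
```

Working in log space avoids numerical underflow when titles span many tokens.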

Because evaluating perplexity requires accessing a model’s assigned probabilities, we can only study LLMs whose probabilities are accessible. We study three open-source LLMs from the Llama-2 family of models: Llama-2 (7B), Llama-2 (13B), and Llama-2 (70B); these models were trained on 2 trillion tokens of text from the Internet, and are among the most capable open-source LLMs (Touvron et al., 2023).

When we prompt these LLMs to predict an individual’s future occupations, they provide plausible-sounding trajectories. Readers can refer to Appendix A for examples. However, they also assign mass to strings that are not valid job titles. To encourage models to predict only valid occupations, we consider an additional prompting strategy that includes the list of all possible titles before the prompt.

Prompt Format	Model	PSID	NLSY79	NLSY97
Without list of job titles	Llama-2 (7B)	3820.31 (241.71)	473.52 (11.58)	505.27 (18.85)
Without list of job titles	Llama-2 (13B)	1711.50 (82.79)	236.19 (5.95)	291.59 (9.92)
Without list of job titles	Llama-2 (70B)	1527.95 (70.97)	162.80 (3.78)	216.09 (7.25)
With list of job titles	Llama-2 (7B)	179.96 (5.81)	53.71 (0.91)	71.13 (1.78)
With list of job titles	Llama-2 (13B)	131.26 (4.53)	44.97 (0.77)	50.13 (1.20)
With list of job titles	Llama-2 (70B)	131.29 (3.79)	39.53 (0.58)	46.24 (0.99)
—	CAREER (Vafa et al., 2024)	13.88 (0.30)	11.32 (0.12)	14.16 (0.24)

Table 1: The test-set perplexity of LLMs for predicting next occupations on three nationally representative survey datasets (lower is better). Standard errors are reported in parentheses.

Table 1 contains the perplexity of each model with both prompting strategies. As a comparison, we also include the perplexity of CAREER (Vafa et al., 2024), a non-language model developed solely to predict nationally representative occupational trajectories. The LLMs consistently make unrepresentative predictions, with perplexities ranging from 39.53 to 3820.31. For comparison, a completely uninformative model that assigns uniform mass to each possible occupation would achieve a perplexity of $|\mathcal{Y}|$ , which is 335. LLM predictions are improved by including the list of job titles in the prompt, but they are still significantly worse than those of the CAREER model. Part of this poor performance is due to models assigning mass to occupational titles that do not exist (i.e., $\sum_{y\in\text{all jobs}}\hat{P}(\text{title}_{y}\mid\text{prompt})$ is far less than 1); however, explicitly removing this mass by renormalizing a model’s predictions closes only a small part of the gap. Readers can refer to Appendix G for more details on our experiments with baseline language models.
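The renormalization mentioned above amounts to dividing each valid title's probability by the total mass on valid titles. A minimal sketch, with a hypothetical function name and made-up probabilities for three titles:

```python
def renormalize(title_probs):
    """Rescale probabilities over valid titles so that they sum to one.

    `title_probs` maps each valid job title to the probability the LLM assigns
    to it; any mass on strings outside this set is discarded.
    """
    valid_mass = sum(title_probs.values())
    if valid_mass <= 0:
        raise ValueError("model assigns no mass to valid titles")
    return {title: p / valid_mass for title, p in title_probs.items()}


# Illustrative numbers only: the model puts 20% of its mass on valid titles.
raw = {"Cashiers": 0.10, "Registered nurses": 0.05, "Chefs and head cooks": 0.05}
normalized = renormalize(raw)
```

Renormalizing raises every valid title's probability by the same factor, so it lowers perplexity but cannot change how the model ranks occupations.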

5 Modifying LLMs to Make More Representative Predictions

Figure 1: Illustration of the inference pipeline for our career trajectory prediction approaches. To predict the individual’s $t^{th}$ occupation using our fine-tuned model, we first build a text representation of individual $i$ ’s career history before the $t^{th}$ record, then we feed the text representation into our language models for prediction. (a) In approach 1, we ask the model to predict the next occupation as tokens in job titles. (b) In approach 2, we extract embeddings from the language model (fine-tuned or off-the-shelf) and train a multinomial classifier to predict occupations from embedding vectors. We also explore an in-context learning approach to predict the individual’s $t^{th}$ occupation: we feed full-text representation examples and the partial text representation covering information up to record $t$ all together into the off-the-shelf Llama model; then, we extract the embedding and run a classifier to predict the next occupation.

In Section 4, we showed that while LLMs can generate plausible-sounding occupational trajectories, these trajectories are not representative of the broader population. Here, we consider two approaches to generate more representative occupational predictions from LLMs: one based on fine-tuning models and one based on training new classifiers on top of extracted embeddings. These approaches are illustrated in Figure 1.

5.1 Fine-Tuning Language Models

Our first strategy is to fine-tune LLMs to predict occupational trajectories on survey data. Fine-tuning on survey data encourages models to make more representative predictions while retaining the knowledge they acquired during pre-training. Since LLMs make predictions at the token level, we fine-tune models to predict each token of a textual summary of worker careers. Specifically, we randomly divide each dataset into 70/10/20 train/validation/test splits. Splits are constructed at the individual level; if an individual is in a split, all of their observations are in the same split. We use the same test splits as for the exercises in Section 4. We then create a text template for each individual consisting of all of the observations of their career, following the second example template in Section 4.1.

We fine-tune the three Llama-2 models used in Section 4 on the training set text templates and evaluate models on the test split as in Section 4. Figure 2 illustrates the fine-tuning procedure. We perform fine-tuning by maximizing a model’s assigned likelihood to the true next token conditional on all previous tokens in a text template. Our objective includes each token of each template, regardless of whether or not it is part of an occupation title (we do not include the full list of occupations in the prompts; as we will show later, fine-tuned models indeed learn the set of valid job titles). We perform full-parameter, full-precision fine-tuning for 3 epochs with a batch size of 32. To improve computational efficiency for inference, we quantize fine-tuned language models to 8 bits. In Appendix E, we show that running model inference in full precision does not significantly improve performance.
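The objective described in this paragraph can be written compactly as follows. The notation is introduced here for illustration and does not appear in the original text: $s_{i,1},\dots,s_{i,K_{i}}$ denotes the tokens of worker $i$ ’s full text template $\mathcal{T}(x_{i},x_{i,\leq T_{i}},y_{i,\leq T_{i}})$ , $K_{i}$ is the number of tokens, and $\theta$ collects the LLM parameters.

```latex
\hat{\theta}
  = \arg\max_{\theta}
    \sum_{i \in \text{train}} \sum_{k=1}^{K_{i}}
    \log P_{\theta}\left(s_{i,k} \mid s_{i,1}, \dots, s_{i,k-1}\right)
```

Because the sum runs over every token of the template, the model is trained on covariate text, years, and education levels as well as on occupation titles.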

Model	PSID	NLSY79	NLSY97
Bi-gram Markov	27.16 (0.49)	19.80 (0.19)	23.67 (0.34)
CAREER (Vafa et al., 2024)	13.88 (0.30)	11.32 (0.12)	14.16 (0.24)
Fine-tuned Llama-2 (7B)	13.62 (0.30)	11.37 (0.12)	14.62 (0.24)
Fine-tuned Llama-2 (13B)	13.32 (0.29)	11.27 (0.12)	14.15 (0.23)
Fine-tuned Llama-2 (70B)	13.14 (0.28)	11.03 (0.11)	13.87 (0.23)

Table 2: The test-set perplexity of fine-tuned LLMs for predicting next occupations on three nationally representative surveys (lower is better). Standard errors are reported in parentheses.

Table 2 reports the test set perplexity of the three fine-tuned Llama-2 LLMs along with two baselines trained on the training split: a bi-gram Markov model that only predicts an individual’s next job from the empirical frequency of transitions, and CAREER (Vafa et al., 2024), a foundation model designed to make representative predictions on survey data. Fine-tuned models make significantly more representative predictions than the original models in Table 1.
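To make the bi-gram Markov baseline concrete, the sketch below estimates $P(y_{i,t}\mid y_{i,t-1})$ from transition counts. It is an illustration under stated assumptions rather than the baseline's actual implementation: the additive smoothing and the `START` marker for the first record are choices made here so that unseen transitions keep non-zero probability.

```python
from collections import Counter, defaultdict

START = "<start>"  # hypothetical marker for "no previous job"


def fit_bigram(sequences, occupations, smoothing=1.0):
    """Estimate P(next job | previous job) from empirical transition counts.

    Additive smoothing is an assumption of this sketch, not something the
    text specifies.
    """
    counts = defaultdict(Counter)
    for jobs in sequences:
        previous = START
        for job in jobs:
            counts[previous][job] += 1
            previous = job

    def prob(previous, job):
        total = sum(counts[previous].values()) + smoothing * len(occupations)
        return (counts[previous][job] + smoothing) / total

    return prob


occupations = ["teacher", "principal", "cashier"]
careers = [["teacher", "teacher", "principal"], ["cashier", "teacher", "teacher"]]
model = fit_bigram(careers, occupations)
p_stay = model("teacher", "teacher")
```

Such a model conditions only on the immediately preceding job, which is why it cannot capture longer-range patterns in a career history.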

Surprisingly, the fine-tuned LLMs make more representative predictions than CAREER, which was trained on 24 million resumes and designed specifically to make accurate predictions on survey data. Although the Llama-2 models are not explicitly trained to model occupations, the information they acquire about career trajectories in the training process enables them to outperform CAREER. While CAREER is trained on proprietary resume data, the pre-trained Llama-2 models are open-source, making it possible for practitioners with computational resources to build state-of-the-art models.

It is worth noting that two models’ perplexities on the same observation are often correlated; we use a bootstrap method to better understand how significantly and consistently our models outperform CAREER. Table 3 compares the performance of different variants of fine-tuned Llama-2 models and the previous CAREER transformer by analyzing the perplexity differences in pairs of models. Specifically, we generated 1,000 bootstrap samples from each of the three survey datasets; then, we computed two perplexities on each bootstrap sample using CAREER and one of our fine-tuned Llama-2 models. Readers can refer to Appendix I for visualizations and more comparison results.

Perplexity Improvement over CAREER	PSID	NLSY79	NLSY97
Fine-tuned Llama-2 (7B)	-0.27 (0.10)	0.04 (0.03)	0.46 (0.07)
Fine-tuned Llama-2 (13B)	-0.56 (0.09)	-0.05 (0.03)	-0.00 (0.05)
Fine-tuned Llama-2 (70B)	-0.77 (0.09)	-0.30 (0.03)	-0.28 (0.05)

Table 3: Perplexity improvement of models over the CAREER model. The table shows the perplexity difference between our fine-tuned LLMs and the previous state-of-the-art CAREER model. Since a lower perplexity indicates better model performance, a negative value in this table suggests that fine-tuned Llama-2 outperforms CAREER. Numbers in parentheses are standard deviations of perplexity differences computed on 1,000 bootstrap samples of the test set.
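The paired bootstrap behind Table 3 can be sketched as follows. This is a minimal illustration: resampling individual observations (rather than, say, whole individuals) is an assumption of the sketch, and the log-probabilities are synthetic.

```python
import math
import random


def perplexity(log_probs):
    return math.exp(-sum(log_probs) / len(log_probs))


def bootstrap_perplexity_gap(logp_a, logp_b, n_boot=1000, seed=0):
    """Bootstrap the perplexity difference (model A minus model B).

    Both models are scored on the same resampled observations, which keeps
    the correlation between their per-observation losses.
    """
    rng = random.Random(seed)
    n = len(logp_a)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ppl_a = perplexity([logp_a[j] for j in idx])
        ppl_b = perplexity([logp_b[j] for j in idx])
        gaps.append(ppl_a - ppl_b)
    mean = sum(gaps) / n_boot
    std = math.sqrt(sum((g - mean) ** 2 for g in gaps) / (n_boot - 1))
    return mean, std


# Synthetic example: model A is slightly better than model B on every observation.
logp_b = [math.log(0.10 + 0.01 * k) for k in range(20)]
logp_a = [lp + 0.1 for lp in logp_b]
mean_gap, std_gap = bootstrap_perplexity_gap(logp_a, logp_b, n_boot=200)
```

Computing the difference within each bootstrap sample, instead of comparing two separately bootstrapped perplexities, is what allows small but consistent gaps to be detected.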

Since LLMs make predictions at the token level, they may place mass on job titles that do not exist. In Appendix H, we show that fine-tuning encourages models to assign mass only to existing occupations. For example, the fine-tuned Llama-2 models place an average of 99% of their mass on valid occupations. As a result, renormalizing the predictions of fine-tuned LLMs to ensure that they only place mass on real occupations has little effect.

5.2 Extracting Embeddings from Language Models

While the fine-tuning approach is effective for generating representative predictions, it is computationally expensive. Here, we consider another approach with a lower computational cost. Our approach is based on passing in a text description of an individual’s career to an LLM and extracting the model’s embedding. A new classification model is trained on top of the embedding to predict the individual’s next job.

We first convert each job sequence to text using the template discussed in Section 4.1. When we pass in a text template to a language model, the model embeds the text in $d$ -dimensional Euclidean space to predict the individual’s next job. To predict an individual’s next job from their embedding, we train a multi-class classifier using multinomial logistic regression. Because we are training new models on top of embeddings, we are no longer constrained to make predictions at the token level, so we train the classifier to predict occupation codes directly rather than job titles. Crucially, this approach only requires performing inference steps for each prediction (i.e., to build the embedding), and so it is more computationally efficient than fine-tuning model parameters. See Appendix J for more details.
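The classifier step can be illustrated with a small, dependency-free sketch of L2-regularized multinomial logistic regression trained by gradient descent. The two-dimensional "embeddings", the hyperparameters, and the function names are placeholders chosen for this example; they are not the settings described in Appendix J.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def predict_proba(W, b, x):
    """Class probabilities for one embedding vector x."""
    scores = [sum(w * xj for w, xj in zip(row, x)) + bc for row, bc in zip(W, b)]
    return softmax(scores)


def fit_multinomial_logit(X, y, n_classes, l2=0.01, lr=0.5, steps=500):
    """Gradient descent on the L2-regularized multinomial logistic loss.

    X: list of embedding vectors; y: list of integer occupation codes.
    """
    d, n = len(X[0]), len(X)
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(steps):
        grad_W = [[l2 * w for w in row] for row in W]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = predict_proba(W, b, x)
            for c in range(n_classes):
                err = (probs[c] - (1.0 if c == label else 0.0)) / n
                grad_b[c] += err
                for j in range(d):
                    grad_W[c][j] += err * x[j]
        for c in range(n_classes):
            b[c] -= lr * grad_b[c]
            for j in range(d):
                W[c][j] -= lr * grad_W[c][j]
    return W, b


# Toy data: two occupation codes, two-dimensional stand-ins for embeddings.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [0, 0, 1, 1]
W, b = fit_multinomial_logit(X, y, n_classes=2)
probs = predict_proba(W, b, [1.0, 0.0])
```

In practice the embeddings have thousands of dimensions and there are hundreds of occupation codes, so a library implementation with tuned regularization would replace this loop.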

To extract embeddings from the Llama-2 models (fine-tuned and off-the-shelf), we use the final-layer model representation of each model. We consider both the off-the-shelf Llama-2 models considered in Section 4 and the fine-tuned models described above. We also consider an approach that takes advantage of the in-context learning capabilities of LLMs. Specifically, in addition to including an individual’s job trajectory in the template, we also include the complete trajectories of three randomly sampled individuals at the beginning of the prompt. See Appendix K for more details.
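The in-context learning prompt described above can be assembled as in the sketch below. The function name, the blank-line separator, and the assumption that examples are drawn from a pool of training-split resumes are illustrative choices; Appendix K gives the actual details.

```python
import random


def build_icl_prompt(example_resumes, target_prompt, k=3, seed=None):
    """Prepend k randomly sampled complete resumes to a partial resume prompt."""
    rng = random.Random(seed)
    examples = rng.sample(example_resumes, k)  # sampling without replacement
    return "\n\n".join(examples + [target_prompt])


# Hypothetical pool of full-career texts, each ending with the end marker.
pool = [f"<A Resume ...> worker {k}\n<END OF RESUME>" for k in range(5)]
icl_prompt = build_icl_prompt(pool, "<A Resume ...> target worker\n1990 to 1991:", seed=0)
```

The resulting prompt is several times longer than the original, which is the source of the higher per-observation inference cost of this variant.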

In addition to extracting embeddings from LLMs, we also consider models that are designed specifically to provide embeddings. Specifically, we generated text embeddings using three models available from OpenAI: text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002. For each survey, we report results for the model with the best performance.

Table 4 contains the results summarizing model performance. All models form far better predictions for the survey population than the original LLMs. However, there is still a substantial gap between these models and the fully fine-tuned LLMs, pointing to the importance of fine-tuning for generating representative predictions. The embeddings extracted from the LLMs form better predictions than those from the embedding-only models. In-context learning also appears to provide additional benefits at minimal computational cost. It’s worth noting that the 70B parameter models form worse predictions than the 13B parameter models. However, this may be due to regularization challenges; the 70B parameter model contains 8,192 embedding dimensions compared to 5,120 for the 13B parameter model.

Embedding method	PSID	NLSY79	NLSY97
OpenAI Text Embeddings	16.29 (0.33)	14.42 (0.14)	20.48 (0.31)
Off-the-shelf Llama-2 (7B)	15.45 (0.32)	13.17 (0.14)	17.15 (0.27)
Off-the-shelf Llama-2 (13B)	15.21 (0.32)	12.90 (0.12)	16.62 (0.25)
Off-the-shelf Llama-2 (70B)	15.69 (0.34)	13.13 (0.14)	17.54 (0.31)
Off-the-shelf Llama-2 (7B) (with in-context learning)	15.02 (0.35)	12.53 (0.10)	16.07 (0.22)
Fine-tuned Llama-2 (7B)	14.42 (0.40)	11.64 (0.12)	15.38 (0.26)
Fine-tuned Llama-2 (13B)	13.39 (0.27)	11.33 (0.11)	14.90 (0.24)
Fine-tuned Llama-2 (70B)	14.04 (0.33)	11.48 (0.11)	15.65 (0.29)

Table 4: The test set perplexity (lower is better) of methods that form predictions from embeddings extracted from LLMs. Each row corresponds to a different method for generating embeddings. Standard errors are reported in parentheses.

5.3 Summary

Table 5 compares various approaches for predicting the next occupation in a career trajectory, evaluating them based on fixed cost, variable cost per observation, and representativeness. Classical methods such as logit models have relatively low fixed computational costs for model estimation; the variable cost per prediction is also extremely low. Since these models are not capable of capturing complicated temporal dependencies in job histories, their predictions are only moderately representative of survey datasets. Non-LLM deep learning approaches, like CAREER, require significant model training and hyperparameter tuning effort. However, they demonstrate more representative predictions compared to classical methods. The bottom three rows summarize the methods described in this paper. While using LLMs out-of-the-box requires no additional training, their predictions are poor. Fine-tuning these models results in the most representative predictions, albeit at a high fixed computational cost. The embedding approach alleviates the cost, yet the representativeness of predictions suffers.

One of the most surprising results is that fine-tuned LLMs form better predictions of occupational trajectories than CAREER, a model designed specifically to form representative predictions. CAREER was trained on a proprietary dataset of 24 million resumes. In contrast, Llama-2 is only trained on publicly available data, and it is open-source. The success of the Llama-2 model lowers the barrier to entry for researchers interested in studying career trajectories, as they no longer need access to proprietary datasets or significant computational resources to train deep learning models from scratch.

Approach	Fixed cost	Per-observation cost	Performance
Econometric models (e.g. logit models)	Model training (typically fast)	Extremely low	Fair
Non-LLM machine learning models (e.g. CAREER)	Pre-training on large-scale resume dataset; fine-tuning on small-scale survey dataset; hyper-parameter tuning required	Low	Good
Predict jobs as titles from LLMs (off-the-shelf)	None (model is already pre-trained)	Medium (inference step for each observation)	Poor
Extract embeddings from LLMs (off-the-shelf)	Training classifier on top of model	Medium (inference step for each observation)	Fair
Extract embeddings from LLMs (off-the-shelf) with in-context learning examples in prompt	Training classifier on top of model	Higher (inference step for each observation; the LLM needs to process a much longer prompt)	Fair
Predict jobs as titles from LLMs (fine-tuned)	Fine-tuning on survey dataset (expensive)	Medium (inference step for each observation)	Best
Extract embeddings from LLMs (fine-tuned)	Fine-tuning on survey dataset (expensive); training classifier on top of model	Medium (inference step for each observation)	Good

Table 5: Comparison of different approaches for predicting representative career trajectories.

6 Analyses

Our experiments demonstrate that our best-performing approach, which directly predicts jobs through text tokens using a fine-tuned Llama-2 (70B) model, achieves lower perplexity than the previous state-of-the-art CAREER model, even without pre-training on an extensive resume dataset. These findings imply that future researchers might forgo training transformers from scratch on large datasets while still producing an excellent labor choice model, and potentially use it in other economic modeling contexts. In this section, we investigate why this approach outperforms CAREER, which is pre-trained on a massive resume dataset. Throughout, we use the fine-tuned Llama-2 (70B) model as the reference point for comparison with other approaches.

6.1 Binary Prediction

We start by inspecting models’ performance on the binary task of whether an individual will change her job (i.e., $y_{i,t}\neq y_{i,t-1}$ ) or not. Specifically, we define $\text{stay}_{i,t}=\mathbf{1}\{y_{i,t}=y_{i,t-1}\}$ . The model predicts whether an individual will stay in the same occupation with the following probability:

$$\hat{P}(\text{stay}_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})=\hat{P}(y_{i,t-1}\mid x_{i},x_{i,\leq t},y_{i,<t}) \tag{4}$$

and define $\text{move}_{i,t}=\mathbf{1}\{y_{i,t}\neq y_{i,t-1}\}$ ; the predicted probability of moving is therefore

$$\hat{P}(\text{move}_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})=1-\hat{P}(\text{stay}_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t}) \tag{5}$$

We exclude the first record $t=1$ for every individual from our analysis.
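In code, the stay and move probabilities follow directly from a model's predicted distribution over next occupations. The dictionary representation and the function name below are assumptions for illustration.

```python
def stay_move_probs(next_job_probs, previous_job):
    """P(stay) is the mass on the previous job; P(move) is its complement."""
    p_stay = next_job_probs.get(previous_job, 0.0)
    return p_stay, 1.0 - p_stay


# Illustrative predicted distribution for a worker whose last job was "teacher".
predicted = {"teacher": 0.7, "principal": 0.2, "cashier": 0.1}
p_stay, p_move = stay_move_probs(predicted, previous_job="teacher")
```

The first record is excluded because no previous job exists to define staying or moving.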

ROC Curve

The ROC curve is a graphical representation that illustrates the performance of a binary classifier by plotting the true positive rate against the false positive rate across various threshold settings. Figure 4 compares the ROC curve of different models with moving as the positive label, which suggests that the fine-tuned language model outperforms the CAREER model by a slight margin.
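For reference, the ROC construction can be sketched in a few lines of Python, with moving as the positive label. The data are made up, and the helper names are illustrative; a library routine would normally be used instead.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over scores.

    labels are 1 for movers (positive class) and 0 for stayers.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for k, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last observation of each tied score.
        if k + 1 == len(ranked) or ranked[k + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


p_move = [0.9, 0.8, 0.35, 0.3, 0.1]  # predicted probabilities of moving
moved = [1, 1, 0, 1, 0]              # observed outcomes
area = auc(roc_points(p_move, moved))
```

The area under the curve equals the share of mover-stayer pairs in which the mover receives the higher predicted probability of moving.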

Model Calibration

Model calibration is crucial, as it ensures that predictive models accurately reflect real-world outcomes, enhancing their reliability and applicability in scientific research. We investigate different models’ calibration in predicting whether an individual will change her job (i.e., $y_{i,t}\neq y_{i,t-1}$ ) or not. To assess how well-calibrated each model is, we split observations into ten groups based on deciles of predicted probability of changing jobs $\hat{P}(\text{move}_{i,t})$ (i.e., the next occupation $y_{i,t}$ is different from the previous one $y_{i,t-1}$ ). Then, for each group, we compute the empirical percentage of movers. If a model is well-calibrated, the average predicted $\hat{P}(\text{move}_{i,t})$ should match the actual proportion of movers within each group. Figure 4 demonstrates the calibration plot for CAREER and our best-performing approach, in which the diagonal line represents a perfectly calibrated model. Despite both models being calibrated on average, we observe that our best-performing approach is better-calibrated in predicting staying and moving than the CAREER model, which underestimates moving in some groups and overestimates it in others.
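The decile-based calibration check described above can be sketched as follows. The function name and the synthetic inputs are illustrative; bins are formed by sorting observations on the predicted probability of moving and cutting them into ten equal-sized groups.

```python
def calibration_table(p_move, moved, n_bins=10):
    """Group observations into equal-sized bins by predicted P(move).

    Returns one (mean predicted probability, empirical share of movers)
    pair per bin, ordered from lowest to highest prediction.
    """
    ranked = sorted(zip(p_move, moved))
    n = len(ranked)
    rows = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        share_moved = sum(m for _, m in chunk) / len(chunk)
        rows.append((mean_pred, share_moved))
    return rows


# Synthetic example with 20 observations.
preds = [k / 20 for k in range(20)]
outcomes = [1 if k >= 10 else 0 for k in range(20)]
rows = calibration_table(preds, outcomes)
```

Plotting the two columns against each other gives the calibration plot; points on the diagonal indicate a well-calibrated model.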

6.2 Performance on the multinomial prediction task conditional on moving

Figure 5 shows that the fine-tuned Llama-2 (70B) performs better on movers uniformly across the three datasets. However, we also note that it assigns a lower overall probability of staying in the same job. In this part, we investigate whether fine-tuned Llama-2 performs better on movers because of its tendency to allocate more probability mass to job changes in general, rather than its ability to accurately predict the specific job an individual transitions to, conditional on them moving to a new job.

To assess model performance for movers, we compute the probability of the next occupation conditional on moving:

$$\hat{P}(y_{i,t}\mid y_{i,t}\neq y_{i,t-1},x_{i},x_{i,\leq t},y_{i,<t})=\frac{\hat{P}(y_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})}{\hat{P}(\text{move}_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})} \tag{6}$$
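Equation (6) simply rescales the unconditional probability of the new job by the predicted probability of moving. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
def prob_given_move(next_job_probs, previous_job, job):
    """P(job | move) = P(job) / P(move), defined for job != previous_job."""
    if job == previous_job:
        raise ValueError("conditional on moving, the job must differ from the previous one")
    p_move = 1.0 - next_job_probs.get(previous_job, 0.0)
    return next_job_probs.get(job, 0.0) / p_move


predicted = {"teacher": 0.7, "principal": 0.2, "cashier": 0.1}
p_principal = prob_given_move(predicted, previous_job="teacher", job="principal")
```

Dividing by the probability of moving removes any advantage a model gains merely from assigning more mass to job changes in general.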

Table 6 reports the resulting perplexities conditional on moving. In Table 7, we further compute the differences in model perplexity between our fine-tuned Llama-2 models and CAREER using bootstrapping. We see that the fine-tuned Llama-2 (70B) outperforms all other models. We note that the perplexities on the conditional prediction problem in Table 6 are much higher than the perplexities reported in Table 16 in the experiments section, which is why the differences between these models are also larger.

Embedding Model $\backslash$ Dataset	PSID	NLSY79	NLSY97
CAREER	39.05 (1.15)	29.52 (0.38)	58.98 (1.03)
Fine-tuned Llama-2 (7B)	38.58 (1.15)	29.10 (0.39)	62.12 (1.12)
Fine-tuned Llama-2 (13B)	37.67 (1.17)	29.22 (0.40)	58.95 (1.12)
Fine-tuned Llama-2 (70B)	36.55 (1.10)	27.70 (0.37)	54.84 (0.98)

Table 6: Perplexities of regularized multinomial regression on language model embeddings conditional on moving. We use the same multinomial classification model discussed previously; then, we calculate the conditional probability of moving to a particular job using Equation (6). The numbers in parentheses represent the standard deviations of perplexities computed on 1,000 bootstrap samples of the test set.

Perplexity Improvement over CAREER, Conditional on Moving	PSID	NLSY79	NLSY97
Fine-tuned Llama-2 (7B)	-0.51 (0.54)	-0.39 (0.15)	3.06 (0.64)
Fine-tuned Llama-2 (13B)	-1.44 (0.54)	-0.29 (0.15)	-0.04 (0.50)
Fine-tuned Llama-2 (70B)	-2.49 (0.53)	-1.82 (0.15)	-4.14 (0.50)

Table 7: Difference between perplexities of regularized multinomial regression on language model embeddings conditional on moving. For example, the first row shows Perplexity(fine-tuned Llama-2 (7B)) - Perplexity(CAREER). Since a lower perplexity indicates a better model fit, a negative value in this table suggests that fine-tuned Llama-2 outperforms CAREER. The numbers in parentheses represent the standard deviations of perplexity differences computed on 1,000 bootstrap samples of the test set.

6.3 Explaining the Advantages of the LLM Approach

We explore the advantages of LLMs in predicting future occupations by asking the following question: for what kind of observations $(y_{i,t},x_{i},x_{i,\leq t},y_{i,<t})$ do language models outperform the previous specialized transformer?

We define our prediction target as the difference in the log-likelihood of the ground truth between predictions from the fine-tuned Llama-2 (70B) and CAREER. To analyze the heterogeneity in this prediction target, we employ a cross-fitting approach. First, we split the test set into ten folds. We then loop through the folds, where in each case, one of the folds is a held-out evaluation fold, while the complement, the training folds, is used to construct a mapping from features $X_{it}$ into quintiles. Using the data from the training folds, we train a regression forest (Athey et al., 2018b). We then rank the predicted values in the training folds and determine the thresholds for quintiles; this in turn determines a mapping from features into quintiles. Next, we apply this trained function to the data points in the evaluation fold, assigning each point to its corresponding quintile. We then calculate the mean value of the prediction target within each quintile using the assigned points. This process is repeated for all folds, where in each fold, the within-quintile means are estimated using an evaluation fold that is distinct from the data used to estimate the quintile mapping for that fold. Finally, we present the mean values per quintile averaged across all folds. The presence of heterogeneity in these quintile-level means, estimated on held-out data, indicates that the size of the performance difference between fine-tuned Llama-2 (70B) and CAREER varies as a function of the features $X_{it}$ . Then, we show the values of each of several features in each quintile, allowing us to understand the factors that vary systematically between higher and lower quintiles. We conduct this analysis separately for two prediction scenarios: binary move vs. stay, and job choice conditional on moving.
² This method used to measure prediction heterogeneity is closely related to existing methods for analyzing heterogeneous treatment effects (HTE) (Athey et al., 2018b) where conditional average treatment effects (CATEs) are estimated from observational or randomized data where units are exposed to treatment at random. However, unlike in traditional CATE estimation, where only one potential outcome is observed for each unit, we observe both counterfactuals (i.e., the log-likelihoods from fine-tuned Llama-2 (70B) and CAREER) for every observation. In this context, the “treatment” is the use of fine-tuned Llama-2 (70B) instead of CAREER for prediction, and the “treatment effect” is the difference in log-likelihood between the two models. The Conditional Average Treatment Effect (CATE) represents the expected treatment effect given a set of features $X_{it}$ . By sorting observations into quintiles based on estimated CATE, we can assess whether there is detectable heterogeneity in the treatment effect as a function of the features. If the CATE estimates within each quintile group, which are estimated on held-out data, exhibit monotonicity, this implies that there is significant heterogeneity in the treatment effect that can be explained by the features (Chernozhukov et al., 2023). Furthermore, by examining heatmaps of feature values across the CATE quintiles, we can identify which features are associated with larger or smaller improvements in prediction performance when using fine-tuned Llama-2 (70B) compared to CAREER. This analysis allows us to interpret the sources of heterogeneity in the treatment effect and understand which features drive the differences in model performance. The fact that we observe both counterfactuals for each observation strengthens the validity of our heterogeneity analysis, as it eliminates the need for assumptions typically required in CATE estimation when only one potential outcome is observed.
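The cross-fitting procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the regression forest is replaced by a toy one-feature least-squares fit (`fit_line`, a stand-in introduced here), the data are synthetic, and all function names are hypothetical.

```python
import random


def cross_fit_quintile_means(features, targets, fit, n_folds=10, seed=0):
    """Average the target within predicted quintiles, using held-out folds.

    `fit(train_X, train_y)` must return a function mapping a feature vector to
    a predicted target; any regressor can be plugged in.
    """
    rng = random.Random(seed)
    order = list(range(len(targets)))
    rng.shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    fold_means = []
    for k, eval_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != k for j in fold]
        predict = fit([features[j] for j in train_idx], [targets[j] for j in train_idx])
        # Quintile thresholds come from predictions on the training folds only.
        train_pred = sorted(predict(features[j]) for j in train_idx)
        cuts = [train_pred[len(train_pred) * q // 5] for q in range(1, 5)]
        buckets = [[] for _ in range(5)]
        for j in eval_idx:
            quintile = sum(predict(features[j]) > c for c in cuts)
            buckets[quintile].append(targets[j])
        fold_means.append([sum(b) / len(b) if b else None for b in buckets])
    # Average each quintile's mean across folds, skipping empty buckets.
    result = []
    for q in range(5):
        vals = [means[q] for means in fold_means if means[q] is not None]
        result.append(sum(vals) / len(vals) if vals else None)
    return result


def fit_line(train_X, train_y):
    """Toy one-feature least-squares regressor standing in for the forest."""
    xs = [x[0] for x in train_X]
    mx, my = sum(xs) / len(xs), sum(train_y) / len(train_y)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, train_y))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x[0] - mx)


# Synthetic data in which the performance gap grows with a single feature.
xs = [[k / 200] for k in range(200)]
gaps = [2 * x[0] for x in xs]
quintile_means = cross_fit_quintile_means(xs, gaps, fit_line)
```

Because the quintile assignment for each evaluation point uses only a model fit on other folds, increasing quintile means are evidence of real, rather than overfit, heterogeneity.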

$$\Delta\hat{P}_{\text{move}}=\hat{P}_{\text{Fine-tuned Llama-2 (70B)}}(\text{move}_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})-\hat{P}_{\text{CAREER}}(\text{move}_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t}) \tag{7}$$
$$\Delta\hat{P}_{\text{job}}=\hat{P}_{\text{Fine-tuned Llama-2 (70B)}}(y_{i,t}\mid y_{i,t}\neq y_{i,t-1},x_{i},x_{i,\leq t},y_{i,<t})-\hat{P}_{\text{CAREER}}(y_{i,t}\mid y_{i,t}\neq y_{i,t-1},x_{i},x_{i,\leq t},y_{i,<t}) \tag{8}$$

We craft features $X_{it}$ from $(y_{i,t},x_{i},x_{i,\leq t},y_{i,<t})$ and use generalized random forests to discover heterogeneity in the space of $X_{it}$ with different treatment effects (i.e., the performance gap between fine-tuned Llama-2 and CAREER).

Since the feature set $X_{it}$ heavily depends on previous occupations, we exclude the first prediction ( $t=1$ ) of each worker from our analysis. We also merge observations from test splits of the three survey datasets, analyze them together, and retain a dataset indicator.

Specifically, the feature vector $X_{it}$ includes:

• rank: the rank of the job title $y_{i,t}$ within the individual’s career trajectory, i.e., the integer $t$.
• job_freq: the number of occurrences of occupation $y_{i,t}$ in the dataset.
• prev_job_freq: the number of occurrences of occupation $y_{i,t-1}$ in the dataset.
• job_freq.prev_job_freq: the product of job_freq and prev_job_freq.
• num_tokens_label: the number of tokens in the next job title $y_{i,t}$.
• num_tokens_prev_label: the number of tokens in the previous job title $y_{i,t-1}$.
• num_tokens_text: the number of tokens in the text representation of the job history, $\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$.
• empirical_transition_freq: the empirical number of transitions $y_{i,t-1}\to y_{i,t}$, calculated as $\#[y_{i,t-1}\to y_{i,t}]$.
• empirical_transition_prob: the empirical probability of the transition $y_{i,t-1}\to y_{i,t}$, calculated as $\frac{\#[y_{i,t-1}\to y_{i,t}]}{\#[y_{i,t-1}]}$.
• I.PrevSOC.Group...SOC.Group...approx..30.groups and I.PrevSOC.Group...SOC.Group...approx..30.groups: these variables are only defined for movers. We use the SOC hierarchy to cluster $y_{i,t-1}$ ( $y_{i,t}$ ) into $\text{SOC-group}(y_{i,t-1})$ ( $\text{SOC-group}(y_{i,t})$ ) and $\text{SOC-detailed-group}(y_{i,t-1})$ ( $\text{SOC-detailed-group}(y_{i,t})$ ). The SOC hierarchy generates around 10 SOC groups and 30 detailed SOC groups. We add two indicators that intuitively measure the magnitude of the job transition from $y_{i,t-1}$ to $y_{i,t}$ for movers: $\mathbf{1}\{\text{SOC-group}(y_{i,t-1})=\text{SOC-group}(y_{i,t})\}$ and $\mathbf{1}\{\text{SOC-detailed-group}(y_{i,t-1})=\text{SOC-detailed-group}(y_{i,t})\}$.
• dataset_indicator.NLSY79, dataset_indicator.NLSY97, and dataset_indicator.PSID: indicator variables for the dataset from which each observation comes.
• PCA_32: a 32-dimensional PCA projection of the full 8192-dimensional embedding of the text representation $\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$. While these features are not easily interpretable, they help discover heterogeneity; in our subgroup analyses, we only examine the interpretable features.
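The count-based features in this list can be computed directly from occupation sequences. The following is a minimal sketch rather than the authors’ implementation: the function name `transition_features` and the toy careers are hypothetical, and it assumes that the denominator $\#[y_{i,t-1}]$ counts occurrences of the previous occupation as the origin of a transition, so that the probabilities out of each occupation sum to one.

```python
from collections import Counter

def transition_features(careers):
    """Compute count-based features for every (previous job, next job) pair."""
    job_freq = Counter()         # occurrences of each occupation
    origin_freq = Counter()      # occurrences of each occupation as the origin of a transition
    transition_freq = Counter()  # occurrences of each ordered pair (prev, next)
    for jobs in careers:
        job_freq.update(jobs)
        for prev, nxt in zip(jobs, jobs[1:]):
            origin_freq[prev] += 1
            transition_freq[(prev, nxt)] += 1

    rows = []
    for jobs in careers:
        # Start at the second record: the first prediction has no previous job.
        for t in range(1, len(jobs)):
            prev, nxt = jobs[t - 1], jobs[t]
            rows.append({
                "rank": t + 1,  # 1-based position of the predicted record
                "job_freq": job_freq[nxt],
                "prev_job_freq": job_freq[prev],
                "job_freq.prev_job_freq": job_freq[nxt] * job_freq[prev],
                "empirical_transition_freq": transition_freq[(prev, nxt)],
                "empirical_transition_prob": transition_freq[(prev, nxt)] / origin_freq[prev],
            })
    return rows

# Toy careers (hypothetical): one list of occupation labels per worker.
careers = [
    ["Cashiers", "Cashiers", "Waiters and waitresses"],
    ["Cashiers", "Waiters and waitresses"],
]
features = transition_features(careers)
```

Which split the counts are taken from is not restated in this section, so the sketch simply counts over whatever sequences it is given.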

6.3.1 Heterogeneous Treatment Effects for the Binary Staying versus Moving Prediction

First, we conduct heterogeneity analysis for the binary choice of staying versus moving (see Equation 7). Figure 6 shows that observations predicted to be in the highest quintile also have higher average differences when these are estimated on held-out data with our method, indicating that elements of job histories have strong predictive power for the gap in performance between fine-tuned Llama-2 (70B) and CAREER. Furthermore, Figure 7 splits observations into quintiles based on estimated treatment effects and shows the average value of each feature over the observations in each quintile.
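The quintile summary described here can be sketched as follows, assuming that a per-observation estimated effect (the predicted performance gap between the two models) and a feature table are already available. The function name `quintile_feature_means` and the toy data are hypothetical, and the estimation of the effects themselves is not shown.

```python
def quintile_feature_means(effects, features):
    """Sort observations by estimated effect, split into five groups, average features.

    effects:  list of floats, one estimated effect per observation (at least five)
    features: list of dicts with identical keys, one per observation
    """
    order = sorted(range(len(effects)), key=lambda i: effects[i])
    n = len(order)
    summary = []
    for q in range(5):
        # Integer boundaries keep group sizes within one observation of each other.
        idx = order[q * n // 5:(q + 1) * n // 5]
        row = {"quintile": q + 1,
               "mean_effect": sum(effects[i] for i in idx) / len(idx)}
        for name in features[0]:
            row[name] = sum(features[i][name] for i in idx) / len(idx)
        summary.append(row)
    return summary

# Toy data (hypothetical): ten observations whose effect grows with career rank.
effects = [0.1 * k for k in range(10)]
features = [{"rank": k + 2, "empirical_transition_freq": 100 - 10 * k} for k in range(10)]
table = quintile_feature_means(effects, features)
```

Each row of `table` corresponds to one column of a quintile heatmap: the average of every feature among observations in that quintile of the estimated effect.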

From the heatmap, we find that observations in the higher quintiles have lower values of the feature empirical_transition_freq and slightly higher values of rank and of num_tokens_text, the length of the text produced by the text template. This suggests that the world knowledge embedded in Llama-2 through natural language during pre-training transfers to our problem and helps it predict rare transitions and longer job histories better, analogous to the performance of LLMs on NLP tasks (Vaswani et al., 2017).

6.3.2 HTE for Movers Conditional on Moving

We now conduct heterogeneity analysis for the conditional choice problem of modeling job choice conditional on moving (see Equation 8). Figure 8 shows that the difference in performance between fine-tuned Llama-2 (70B) and CAREER differs systematically as a function of job history characteristics. The corresponding heatmap is shown in Figure 9.

Similarly to the binary prediction case, Figure 9 shows that fine-tuned Llama-2 performs better for movers as rank and num_tokens_text increase. This can again be attributed to the attention mechanism and pre-training.

6.4 Model Performance by Populations with Different Educational Backgrounds

Previous sections have shown that our new language-based approach models our target population better than the previous state-of-the-art CAREER models. In this section, we explore how models perform on different subgroups defined by educational background. Table 8 presents the perplexity differences between fine-tuned Llama-2 (70B) and CAREER on different subgroups in different datasets, suggesting that our language-based approach consistently outperforms the previous state-of-the-art model across subpopulations.

Perplexity(Fine-tuned Llama-2 (70B)) - Perplexity(CAREER)	PSID	NLSY79	NLSY97
Subpopulation: College or Above	-0.60 (0.15)	-0.57 (0.06)	-0.61 (0.12)
Subpopulation: Non-College	-0.90 (0.12)	-0.35 (0.04)	-0.52 (0.08)

Table 8: Perplexity difference between our fine-tuned Llama-2 (70B) model and CAREER. The numbers in parentheses show the standard deviation of the perplexity difference across 1,000 bootstrap samples of the test set.
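The bootstrap standard deviations in Table 8 can be approximated along the following lines. This is a sketch under stated assumptions, not the authors’ procedure: it takes perplexity to be the exponential of the negative mean log-likelihood, resamples individual test observations (rather than workers) with replacement, and uses hypothetical toy log-probabilities.

```python
import math
import random

def perplexity(log_probs):
    """Exponential of the negative mean log-likelihood."""
    return math.exp(-sum(log_probs) / len(log_probs))

def bootstrap_perplexity_gap(logp_a, logp_b, n_boot=1000, seed=0):
    """Return (point estimate, bootstrap std) of perplexity(A) - perplexity(B).

    logp_a[k] and logp_b[k] are the two models' log-probabilities of the
    realized occupation for the same test observation k.
    """
    rng = random.Random(seed)
    n = len(logp_a)
    gaps = []
    for _ in range(n_boot):
        # Resample observations with replacement, using the same draw for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(perplexity([logp_a[i] for i in idx])
                    - perplexity([logp_b[i] for i in idx]))
    mean_gap = sum(gaps) / n_boot
    std = math.sqrt(sum((g - mean_gap) ** 2 for g in gaps) / (n_boot - 1))
    return perplexity(logp_a) - perplexity(logp_b), std

# Toy log-probabilities (hypothetical): model A is slightly better on every observation.
toy_rng = random.Random(1)
logp_b = [math.log(toy_rng.uniform(0.05, 0.9)) for _ in range(200)]
logp_a = [min(lp + 0.1, 0.0) for lp in logp_b]
gap, gap_std = bootstrap_perplexity_gap(logp_a, logp_b)
```

Using the same resampled indices for both models keeps the comparison paired, so the reported spread reflects variation in the difference rather than in each perplexity separately.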

Figure 10 depicts calibration plots of the models’ predictions of moving versus staying in different subgroups after combining the three datasets. Our experimental results indicate that the fine-tuned language model is consistently better calibrated than CAREER across subpopulations. We thus show that fine-tuning allows us to adapt an LLM such that its predictions, conditional on both demographic characteristics and job history, are well calibrated. Figure 13 in Appendix L suggests that our conclusion holds when analyzing the datasets separately; see Appendix L for more details.
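A calibration plot of this kind compares, within bins of predicted probability, the average prediction with the observed frequency of the outcome. The sketch below illustrates that computation for the binary moving-versus-staying prediction; the function name `calibration_table`, the equal-width binning, and the toy data are assumptions made for illustration.

```python
def calibration_table(pred_probs, outcomes, n_bins=10):
    """Bin predicted probabilities and compare them to observed frequencies.

    pred_probs: predicted probability of moving, one per observation
    outcomes:   1 if the worker actually moved, 0 if they stayed
    Returns (mean predicted probability, observed frequency, count)
    for each non-empty equal-width bin, in increasing order of probability.
    """
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(pred_probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return [(sp / n, sy / n, n) for sp, sy, n in sums if n > 0]

# Toy predictions (hypothetical): two clusters of predicted probabilities.
preds = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
moved = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
calib = calibration_table(preds, moved)
```

A well-calibrated model has points close to the diagonal, meaning the first two entries of each tuple are nearly equal; running the function separately per subgroup gives one curve per subpopulation.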

7 Conclusion

In this paper, we propose LABOR-LLM, a method for encoding workers’ career histories as text and building representative occupational models with LLMs. Experimental results indicate that one can leverage pre-trained, publicly available LLMs to achieve state-of-the-art performance on career trajectory prediction via fine-tuning. The LABOR-LLM approaches let researchers avoid pre-training transformer models on massive resume datasets, which requires substantial computational resources, costly data access, and engineering effort. Our results further show that the perplexities of our best-performing approach, fine-tuning an LLM and predicting jobs from its next-token distribution, are better than those in Table 4, suggesting that fine-tuned Llama models are more effective when predicting future occupations as tokens of job titles. One potential explanation is that the fine-tuned model integrates its knowledge of the general English language from pre-training with the specific job titles in the dataset that it learns during fine-tuning. We also find that a fine-tuned model outperforms off-the-shelf models paired with in-context learning. While LLMs generate plausible career trajectories with prompting, these predictions are not representative of the workforce. We find that our fine-tuned LLMs produce representative career trajectory predictions, conditional on demographics within subpopulations as well as on job history. Our approach produces predictions that are more representative than those of CAREER, a dedicated transformer-based next job prediction model pre-trained on resume datasets. Thus, we show how to adapt LLMs for the purpose of next job prediction on nationally representative survey datasets.

We conclude our paper with an outline of future research directions. Due to constraints on computational resources, we did not conduct the normalization step for the largest language model we fine-tuned. The current perplexity metrics therefore underestimate the performance of the fine-tuned Llama-2 (70B) model. We plan to compute the performance of the largest model after normalization to obtain a more accurate estimate of the potential of language models in the career trajectory prediction problem. In our second approach, which uses language models as embedding engines, we use a simple multinomial logistic regression with weight decay to generate predicted distributions from embeddings. We observe that such simple linear models fail to exploit the high-dimensional embeddings from the 70-billion-parameter model. Exploring more sophisticated prediction heads, such as deep neural networks, could make fuller use of the embeddings from the large model and potentially improve the perplexity. Our experimental results indicate that incorporating in-context learning examples enhances the predictive performance of pre-trained models. However, due to the limited context length of the Llama-2 models, we were constrained to adding only three in-context learning examples. In future research, we intend to investigate the value of the information added by varying the number of in-context examples, leveraging language models with extended context windows.
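To see why skipping normalization makes the reported perplexity conservative, consider the following sketch. It assumes that normalization means rescaling the LLM’s probabilities of the valid job titles so that they sum to one over the occupation set $\mathcal{Y}$; this is our reading of the text, and the titles and numbers are hypothetical.

```python
import math

def normalize_over_titles(title_probs):
    """Rescale probabilities of valid job titles so that they sum to one."""
    total = sum(title_probs.values())
    return {title: p / total for title, p in title_probs.items()}

# Hypothetical LLM probabilities for the valid titles given one prompt.
# They sum to 0.8: the remaining 0.2 is mass on text that is not a valid title.
raw = {"Cashiers": 0.40, "Waiters and waitresses": 0.30, "Cooks, short order": 0.10}
normalized = normalize_over_titles(raw)

# Single-observation perplexity is the reciprocal of the probability
# assigned to the realized occupation.
realized = "Cashiers"
raw_perplexity = math.exp(-math.log(raw[realized]))
normalized_perplexity = math.exp(-math.log(normalized[realized]))
```

Because the unnormalized probabilities of valid titles sum to at most one, each normalized probability is at least as large as its unnormalized counterpart, so the normalized perplexity can only be lower or equal.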

Taken together, our approach and results show that LLMs can be used as powerful base models for predictive models of the labor market and can be adapted through fine-tuning to make nationally representative labor market predictions. More generally, our results indicate that LLMs may also be helpful for other economic modeling problems. They obviate the need to collect large datasets for pre-training and circumvent the challenges of training, which demands significant time and engineering expertise.

References

Agostinelli et al. (2023) Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. MusicLM: Generating Music From Text, January 2023. URL http://arxiv.org/abs/2301.11325. arXiv:2301.11325 [cs, eess].
Argyle et al. (2023) Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis, 31(3):337–351, July 2023. ISSN 1047-1987, 1476-4989. doi: 10.1017/pan.2023.2. URL https://www.cambridge.org/core/product/identifier/S1047198723000025/type/journal_article.
Ashenfelter (1978) Orley Ashenfelter. Estimating the Effect of Training Programs on Earnings. The Review of Economics and Statistics, 60(1):47–57, 1978. ISSN 0034-6535. doi: 10.2307/1924332. URL https://www.jstor.org/stable/1924332. Publisher: The MIT Press.
Athey and Imbens (2019) Susan Athey and Guido Imbens. Machine Learning Methods Economists Should Know About, March 2019. URL http://arxiv.org/abs/1903.10075. arXiv:1903.10075 [econ, stat].
Athey et al. (2018a) Susan Athey, David Blei, Robert Donnelly, Francisco Ruiz, and Tobias Schmidt. Estimating Heterogeneous Consumer Preferences for Restaurants and Travel Time Using Mobile Location Data. AEA Papers and Proceedings, 108:64–67, May 2018a. ISSN 2574-0768. doi: 10.1257/pandp.20181031. URL https://www.aeaweb.org/articles?id=10.1257/pandp.20181031.
Athey et al. (2018b) Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized Random Forests, April 2018b. URL http://arxiv.org/abs/1610.01271. arXiv:1610.01271 [econ, stat].
Athey et al. (2024) Susan Athey, Lisa K. Simon, Oskar N. Skans, Johan Vikstrom, and Yaroslav Yakymovych. The Heterogeneous Earnings Impact of Job Loss Across Workers, Establishments, and Markets, February 2024. URL http://arxiv.org/abs/2307.06684. arXiv:2307.06684 [econ, q-fin].
Blau and Riphahn (1999) David M. Blau and Regina T. Riphahn. Labor force transitions of older married couples in Germany. Labour Economics, 6(2):229–252, June 1999. ISSN 0927-5371. doi: 10.1016/S0927-5371(99)00017-2. URL https://www.sciencedirect.com/science/article/pii/S0927537199000172.
Blau and Kahn (2017) Francine D. Blau and Lawrence M. Kahn. The Gender Wage Gap: Extent, Trends, and Explanations. Journal of Economic Literature, 55(3):789–865, September 2017. ISSN 0022-0515. doi: 10.1257/jel.20160995. URL https://www.aeaweb.org/articles?id=10.1257/jel.20160995.
Bommasani et al. (2022) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the Opportunities and Risks of Foundation Models, July 2022. URL http://arxiv.org/abs/2108.07258. arXiv:2108.07258 [cs].
Boskin (1974) Michael J. Boskin. A Conditional Logit Model of Occupational Choice. Journal of Political Economy, 82(2, Part 1):389–398, March 1974. ISSN 0022-3808. doi: 10.1086/260198. URL https://www.journals.uchicago.edu/doi/abs/10.1086/260198. Publisher: The University of Chicago Press.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners, July 2020. URL http://arxiv.org/abs/2005.14165. arXiv:2005.14165 [cs].
Buolamwini and Gebru (2018) Joy Buolamwini and Timnit Gebru. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pages 77–91. PMLR, January 2018. URL https://proceedings.mlr.press/v81/buolamwini18a.html. ISSN: 2640-3498.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating Large Language Models Trained on Code, July 2021. URL http://arxiv.org/abs/2107.03374. arXiv:2107.03374 [cs].
Chernozhukov et al. (2023) Victor Chernozhukov, Mert Demirer, Esther Duflo, and Iván Fernández-Val. Fisher-Schultz Lecture: Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments, with an Application to Immunization in India, October 2023. URL http://arxiv.org/abs/1712.04802. arXiv:1712.04802 [econ, math, stat].
Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, September 2014. URL http://arxiv.org/abs/1406.1078. arXiv:1406.1078 [cs, stat].
de Ruijt and Bhulai (2021) Corné de Ruijt and Sandjai Bhulai. Job Recommender Systems: A Review, November 2021. URL http://arxiv.org/abs/2111.13576. arXiv:2111.13576 [cs].
Dehejia and Wahba (1999) Rajeev H. Dehejia and Sadek Wahba. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. Journal of the American Statistical Association, 94(448):1053–1062, 1999. ISSN 0162-1459. doi: 10.2307/2669919. URL https://www.jstor.org/stable/2669919. Publisher: [American Statistical Association, Taylor & Francis, Ltd.].
Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs, May 2023. URL http://arxiv.org/abs/2305.14314. arXiv:2305.14314 [cs].
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, May 2019. URL http://arxiv.org/abs/1810.04805. arXiv:1810.04805 [cs].
Dominguez-Olmedo et al. (2024) Ricardo Dominguez-Olmedo, Moritz Hardt, and Celestine Mendler-Dünner. Questioning the Survey Responses of Large Language Models, February 2024. URL http://arxiv.org/abs/2306.07951. arXiv:2306.07951 [cs].
Donnelly et al. (2021) Robert Donnelly, Francisco J.R. Ruiz, David Blei, and Susan Athey. Counterfactual inference for consumer choice across many product categories. Quantitative Marketing and Economics, 19(3):369–407, December 2021. ISSN 1573-711X. doi: 10.1007/s11129-021-09241-2. URL https://doi.org/10.1007/s11129-021-09241-2.
Fairlie and Sundstrom (1999) Robert W. Fairlie and William A. Sundstrom. The Emergence, Persistence, and Recent Widening of the Racial Unemployment Gap. Industrial and Labor Relations Review, 52(2):252–270, 1999. ISSN 0019-7939. doi: 10.2307/2525165. URL https://www.jstor.org/stable/2525165. Publisher: Sage Publications, Inc.
Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5). In Proceedings of the 16th ACM Conference on Recommender Systems, RecSys ’22, pages 299–315, New York, NY, USA, September 2022. Association for Computing Machinery. ISBN 978-1-4503-9278-5. doi: 10.1145/3523227.3546767. URL https://dl.acm.org/doi/10.1145/3523227.3546767.
Hall et al. (1972) Robert E. Hall, Aaron Gordon, and Charles Holt. Turnover in the Labor Force. Brookings Papers on Economic Activity, 1972(3):709, 1972. ISSN 00072303. doi: 10.2307/2534130. URL https://www.jstor.org/stable/2534130?origin=crossref.
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
Jacobson et al. (1993) Louis S. Jacobson, Robert J. LaLonde, and Daniel G. Sullivan. Earnings Losses of Displaced Workers. The American Economic Review, 83(4):685–709, 1993. ISSN 0002-8282. URL https://www.jstor.org/stable/2117574. Publisher: American Economic Association.
Jelinek et al. (2005) F. Jelinek, R. L. Mercer, L. R. Bahl, and J. K. Baker. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63, August 2005. ISSN 0001-4966. doi: 10.1121/1.2016299. URL https://doi.org/10.1121/1.2016299.
Jin et al. (2024) Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models, 2024.
Johnson et al. (2018) David Johnson, Katherine McGonagle, Vicki Freedman, and Narayan Sastry. Fifty Years of the Panel Study of Income Dynamics: Past, Present, and Future. The Annals of the American Academy of Political and Social Science, 680(1):9–28, November 2018. ISSN 0002-7162. doi: 10.1177/0002716218809363. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6820672/.
Kleinberg et al. (2015) Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. Prediction Policy Problems. American Economic Review, 105(5):491–495, May 2015. ISSN 0002-8282. doi: 10.1257/aer.p20151023. URL https://www.aeaweb.org/articles?id=10.1257/aer.p20151023.
Li et al. (2017) Liangyue Li, How Jing, Hanghang Tong, Jaewon Yang, Qi He, and Bee-Chung Chen. NEMO: Next Career Move Prediction with Contextual Embedding. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion, pages 505–513, Republic and Canton of Geneva, CHE, April 2017. International World Wide Web Conferences Steering Committee. ISBN 978-1-4503-4914-7. doi: 10.1145/3041021.3054200. URL https://dl.acm.org/doi/10.1145/3041021.3054200.
Luccioni et al. (2022) Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model, November 2022. URL http://arxiv.org/abs/2211.02001. arXiv:2211.02001 [cs].
Meng et al. (2019) Qingxin Meng, Hengshu Zhu, Keli Xiao, Le Zhang, and Hui Xiong. A Hierarchical Career-Path-Aware Neural Network for Job Mobility Prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pages 14–24, New York, NY, USA, July 2019. Association for Computing Machinery. ISBN 978-1-4503-6201-6. doi: 10.1145/3292500.3330969. URL https://dl.acm.org/doi/10.1145/3292500.3330969.
Mittal et al. (2024) Chinmay Mittal, Krishna Kartik, Mausam, and Parag Singla. PuzzleBench: Can LLMs Solve Challenging First-Order Combinatorial Reasoning Problems?, February 2024. URL http://arxiv.org/abs/2402.02611. arXiv:2402.02611 [cs].
O’Hagan and Schein (2024) Sean O’Hagan and Aaron Schein. Measurement in the Age of LLMs: An Application to Ideological Scaling, April 2024. URL http://arxiv.org/abs/2312.09203. arXiv:2312.09203 [cs].
Rives et al. (2021) Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America, 118(15):e2016239118, April 2021. ISSN 1091-6490. doi: 10.1073/pnas.2016239118.
Rothstein et al. (2019) Donna S Rothstein, Deborah Carr, and Elizabeth Cooksey. Cohort Profile: The National Longitudinal Survey of Youth 1979 (NLSY79). International Journal of Epidemiology, 48(1):22–22e, February 2019. ISSN 0300-5771. doi: 10.1093/ije/dyy133. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6380301/.
Rudolph et al. (2016) Maja R. Rudolph, Francisco J. R. Ruiz, Stephan Mandt, and David M. Blei. Exponential Family Embeddings, November 2016. URL http://arxiv.org/abs/1608.00778. arXiv:1608.00778 [cs, stat].
Santurkar et al. (2023) Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose Opinions Do Language Models Reflect?, March 2023. URL http://arxiv.org/abs/2303.17548. arXiv:2303.17548 [cs].
Schmidt and Strauss (1975) Peter Schmidt and Robert P. Strauss. The Prediction of Occupation Using Multiple Logit Models. International Economic Review, 16(2):471–486, 1975. ISSN 0020-6598. doi: 10.2307/2525826. URL https://www.jstor.org/stable/2525826. Publisher: [Economics Department of the University of Pennsylvania, Wiley, Institute of Social and Economic Research, Osaka University].
Singhal et al. (2022) Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. Large Language Models Encode Clinical Knowledge, December 2022. URL http://arxiv.org/abs/2212.13138. arXiv:2212.13138 [cs].
Taylor et al. (2022) Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A Large Language Model for Science, November 2022. URL http://arxiv.org/abs/2211.09085. arXiv:2211.09085 [cs, stat].
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023. URL http://arxiv.org/abs/2307.09288. arXiv:2307.09288 [cs].
Vafa et al. (2024) Keyon Vafa, Emil Palikot, Tianyu Du, Ayush Kanodia, Susan Athey, and David M. Blei. CAREER: A Foundation Model for Labor Sequence Data, February 2024. URL http://arxiv.org/abs/2202.08370. arXiv:2202.08370 [cs, econ].
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned Language Models Are Zero-Shot Learners, February 2022. URL http://arxiv.org/abs/2109.01652. arXiv:2109.01652 [cs].
Yi et al. (2024) Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems, February 2024. URL http://arxiv.org/abs/2402.18013. arXiv:2402.18013 [cs].
Zhang et al. (2024) Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Wanli Ouyang, Dongzhan Zhou, Shufei Zhang, Mao Su, Han-Sen Zhong, and Yuqiang Li. ChemLLM: A Chemical Large Language Model, April 2024. URL http://arxiv.org/abs/2402.06852. arXiv:2402.06852 [cs].
Zhang et al. (2021) Le Zhang, Ding Zhou, Hengshu Zhu, Tong Xu, Rui Zha, Enhong Chen, and Hui Xiong. Attentive Heterogeneous Graph Embedding for Job Mobility Prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2192–2201, Virtual Event Singapore, August 2021. ACM. ISBN 978-1-4503-8332-5. doi: 10.1145/3447548.3467388. URL https://dl.acm.org/doi/10.1145/3447548.3467388.

Appendix A Examples of Career Trajectories Generated by Off-the-Shelf LLMs

LLMs gain knowledge about the labor market and the hierarchy among occupations from their pre-training data. Therefore, off-the-shelf LLMs can generate plausible sequences of future occupations conditional on one’s job history via appropriate prompt engineering. This appendix shows a few examples of career trajectories generated by the off-the-shelf Llama-2 (7B) model. In each example, we provide the model with the worker’s career history and the desired data format through a text prompt. We use a model with 8-bit quantization and generate text with top_k=50, top_p=0.6, and temperature=1.0. To save space, we only show the first five occupations generated. The prompt text for each individual is denoted by <JOB HISTORY PROMPT> in the corresponding examples.
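To make the sampling settings concrete, the sketch below shows how temperature, top_k, and top_p reshape a next-token distribution before a token is drawn. It is a generic illustration of these decoding parameters on a toy distribution, not the authors’ generation code; the function names and the toy vocabulary are hypothetical.

```python
import random

def truncate_distribution(probs, top_k=50, top_p=0.6, temperature=1.0):
    """Apply temperature, top-k, and top-p (nucleus) filtering, then renormalize.

    probs: dict mapping each candidate token to its model probability.
    """
    # Temperature rescales log-probabilities; 1.0 leaves the distribution unchanged.
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    ranked = sorted(((p / total, tok) for tok, p in scaled.items()), reverse=True)

    ranked = ranked[:top_k]                 # keep the k most likely tokens
    total = sum(p for p, _ in ranked)
    kept, cumulative = [], 0.0
    for p, tok in ranked:                   # keep the smallest prefix whose
        kept.append((p / total, tok))       # cumulative mass reaches top_p
        cumulative += p / total
        if cumulative >= top_p:
            break
    total = sum(p for p, _ in kept)
    return {tok: p / total for p, tok in kept}

def sample(dist, rng):
    """Draw one token from a dict of token probabilities."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Toy next-token distribution (hypothetical numbers).
next_token = {"Cashiers": 0.45, "Waiters": 0.25, "Cooks": 0.20, "Pilots": 0.10}
filtered = truncate_distribution(next_token)
draw = sample(filtered, random.Random(0))
```

With top_p=0.6, only the most likely tokens whose cumulative probability first reaches 0.6 remain eligible, which is why repeated generations vary but stay close to the most plausible continuations.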

A.1 Examples on an Individual with Low Education Level

We use the following text as the prompt summarizing the individual’s career history.

The following is the resume of a female white US worker residing in the northeast region.
The worker has the following work experience on the resume, one entry per line, including year, education level and the job title:
1979 to 1980 (high school diploma): Cashiers
1980 to 1981 (high school diploma): Not in labor force
1981 to 1982 (high school diploma): Food servers, nonrestaurant
1982 to 1983 (high school diploma): Food servers, nonrestaurant
1983 to 1984 (high school diploma): Food servers, nonrestaurant
1984 to 1985 (high school diploma):
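A prompt in this format can be assembled programmatically from a worker’s records. The following is a minimal sketch that mirrors the wording of the example prompt; the function name `build_prompt` and its arguments are hypothetical, and it assumes that every record spans exactly one year, as in the example.

```python
def build_prompt(gender, ethnicity, region, records, next_year, next_education):
    """Render a career history in the resume-style text format of the example.

    records: list of (start_year, education, job_title) tuples, one per past record.
    The final line is left open so the model continues with the next job title.
    """
    lines = [
        f"The following is the resume of a {gender} {ethnicity} US worker "
        f"residing in the {region} region.",
        "The worker has the following work experience on the resume, one entry "
        "per line, including year, education level and the job title:",
    ]
    for year, education, title in records:
        lines.append(f"{year} to {year + 1} ({education}): {title}")
    lines.append(f"{next_year} to {next_year + 1} ({next_education}):")
    return "\n".join(lines)

prompt = build_prompt(
    "female", "white", "northeast",
    [(1979, "high school diploma", "Cashiers"),
     (1980, "high school diploma", "Not in labor force"),
     (1981, "high school diploma", "Food servers, nonrestaurant")],
    next_year=1982, next_education="high school diploma",
)
```

Leaving the last line open after the colon is what lets the model’s continuation be read as the predicted next job title.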

Generation Example 1 (Low-Education)

<JOB HISTORY PROMPT> Waiters and waitresses
1985 to 1986 (high school diploma): Cashiers
1986 to 1987 (high school diploma): Cashiers and office clerks, general
1987 to 1988 (high school diploma): Office clerks, general
1988 to 1989 (high school diploma): Food servers, nonrestaurant

Generation Example 2 (Low-Education)

<JOB HISTORY PROMPT> Cooks, short order
1985 to 1986 (high school diploma): Food servers, nonrestaurant
1986 to 1987 (high school diploma): Cooks, short order
1987 to 1988 (high school diploma): Cooks, short order
1988 to 1989 (high school diploma): Waiters and waitresses

Generation Example 3 (Low-Education)

<JOB HISTORY PROMPT> Cashiers
1985 to 1986 (high school diploma): Food servers, nonrestaurant
1986 to 1987 (high school diploma): Food servers, nonrestaurant
1987 to 1988 (high school diploma): Waiters and waitresses
1988 to 1989 (high school diploma): Cashiers

A.2 Examples on an Individual with High Education Level

The second example illustrates the model’s capability on the high-education group. We use the following prompt text in this example.

The following is the resume of a female white US worker residing in the northeast region.
The worker has the following work experience on the resume, one entry per line, including year, education level and the job title:
1984 to 1985 (college): Business Teachers Postsecondary
1985 to 1986 (college): Postmasters and mail superintendents
1986 to 1987 (graduate degree): Sales Representatives Services All Other
1987 to 1988 (graduate degree): Wholesale and retail buyers, except farm products
1988 to 1989 (graduate degree): Elementary and middle school teachers
1989 to 1990 (graduate degree): Elementary and middle school teachers
1990 to 1991 (graduate degree): Elementary and middle school teachers
1991 to 1992 (graduate degree):

Generation Example 1 (High-Education)

<JOB HISTORY PROMPT> First-line supervisors of retail sales workers
1992 to 1993 (graduate degree): First-line supervisors of retail sales workers
1993 to 1994 (graduate degree): Elementary and middle school teachers
1994 to 1995 (graduate degree): Sales representatives, services, all other
1995 to 1996 (graduate degree): Postmasters and mail superintendents

Generation Example 2 (High-Education)

<JOB HISTORY PROMPT> Sales representatives services all other
1992 to 1993 (graduate degree): Postmasters and mail superintendents
1993 to 1994 (graduate degree): Business teachers postsecondary
1994 to 1995 (graduate degree): Sales representatives services all other
1995 to 1996 (graduate degree): Elementary and middle school teachers

Generation Example 3 (High-Education)

<JOB HISTORY PROMPT> Secondary school teachers
1992 to 1993 (graduate degree): Postsecondary teachers
1993 to 1994 (graduate degree): Postsecondary teachers
1994 to 1995 (graduate degree): Social scientists and related workers
1995 to 1996 (graduate degree): Social scientists and related workers

Appendix B Notation

Table 9 summarizes the notation we use in this paper.

Notation	Definition
$i$	Index for individual workers.
$t$	Index for records within an individual worker’s career.
$T_{i}$	The total number of records for individual $i$.
$y_{i,t}$	The occupation in individual $i$’s $t^{th}$ record.
$y_{i,<t}$	The sequence of individual $i$’s occupations before the $t^{th}$ record, which is shorthand for $(y_{i,1},y_{i,2},\dots,y_{i,t-1})$.
$\mathcal{Y}$	The set of all occupations.
$x_{i}$	The set of static covariates of individual $i$, such as gender and ethnicity.
$x_{i,t}$	The set of dynamic covariates of individual $i$ in their $t^{th}$ record, such as the level of education.
$x_{i,\leq t}$	The sequence of individual $i$’s dynamic covariates up to and including the $t^{th}$ record, which is shorthand for $(x_{i,1},x_{i,2},\dots,x_{i,t-1},x_{i,t})$.
$P(y_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})$	The probability that individual $i$ takes occupation $y_{i,t}$, conditioned on past occupations and past/current covariates.
$\hat{P}(y_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})$	The predicted value of $P(y_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})$ from a model.
$\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	The text representation of past occupations and past/current covariates.
$\mathcal{T}(x_{i},x_{i,\leq T_{i}},y_{i,\leq T_{i}})$	The text representation of individual $i$’s entire career history.
$\text{text}\oplus\text{text}$	The text concatenation operator.
$P_{\text{LLM}}(\text{response}\mid\text{prompt})$	The probability that an LLM generates “response” as the continuation of the “prompt” text.

Table 9: Summary of mathematical notation used in this paper.

Appendix C Summary of Datasets

Table 10 shows the number of observations and number of individuals in each split of each dataset. For example, there are 8,684 workers in the training split of the PSID dataset, and there are $\sum_{i}T_{i}=44,231$ prediction observations in the same split.

Dataset	PSID	NLSY79	NLSY97
Train Split	44,231 (8,684)	169,008 (8,637)	80,975 (6,184)
Validation Split	6,247 (1,229)	23,625 (1,221)	11,247 (879)
Test Split	12,187 (2,425)	46,912 (2,412)	21,919 (1,707)

Table 10: Summary statistics of dataset splits.

Table 11 summarizes the training corpus used to fine-tune our language models. Note that the maximum length of each prompt is less than 1,000 tokens, significantly shorter than the 4,096 context window size for the Llama-2 family.

Dataset	# Workers	Mean (# Tokens)	Std (# Tokens)	Min (# Tokens)	Median (# Tokens)	Max (# Tokens)	Total (# Tokens)
NLSY79	8,637	528.08	198.39	90	604	979	4,561,023
NLSY97	6,184	388.49	103.48	90	408	675	2,402,417
PSID	8,684	187.86	72.56	103	171	528	1,631,381

Table 11: Summary statistics of text lengths when job sequences are converted to textual prompts in the training set. The table reports the number of tokens in the prompt text $\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$ given by the Llama-2 tokenizer. The last column presents the total number of tokens used to fine-tune our language models.

Appendix D Details of the Job Titles

Figure 11 presents example job titles in a word cloud, weighted by their popularity. The popularity of an occupation title is determined by the frequency of its total occurrences across the test splits of the three datasets.

Appendix E Full-Precision versus Quantization

Model quantization is a technique for improving models’ computational efficiency and decreasing memory usage by reducing the numerical precision of model parameters (e.g., from 32-bit to 8-bit or 4-bit). Existing research has shown that quantized LLMs can achieve performance similar to that of full-precision models (Dettmers et al., 2023). All Llama experiments in the main paper used the 8-bit quantized versions of the models to save computational resources. In this section, we compare the performance of the full-precision and 8-bit quantized versions of the fine-tuned Llama-2 (7B). Specifically, the Llama-2 (7B) model was fine-tuned under full precision; we then query predicted probabilities of future job titles using two variants of the fine-tuned model, one in full precision and the other quantized to 8-bit. Table 12 compares the models’ performance on different datasets. These results suggest no significant difference between the full-precision and quantized models in average normalization constant, perplexity (normalized), and perplexity (unnormalized).

Precision	Dataset	Average Normalization Constant	Perplexity (Normalized)	Perplexity (Unnormalized)
Fine-tuned Llama-2 (7B) (Full-Precision)	PSID	0.986347	13.44 (0.30)	13.64 (0.28)
Fine-tuned Llama-2 (7B) (8-bit)	PSID	0.986448	13.45 (0.29)	13.63 (0.30)
Fine-tuned Llama-2 (7B) (Full-Precision)	NLSY97	0.99338	14.51 (0.24)	14.62 (0.23)
Fine-tuned Llama-2 (7B) (8-bit)	NLSY97	0.993262	14.53 (0.23)	14.63 (0.24)
Fine-tuned Llama-2 (7B) (Full-Precision)	NLSY79	0.996024	11.33 (0.12)	11.37 (0.13)
Fine-tuned Llama-2 (7B) (8-bit)	NLSY79	0.996003	11.33 (0.12)	11.37 (0.12)

Table 12: Performance of full-precision and quantized fine-tuned Llama-2 (7B) models.
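To give a concrete sense of what reducing numerical precision means, the following is a minimal sketch of symmetric "absmax" 8-bit quantization applied to a short list of weights. It is an illustration only: the function names and the toy weights are hypothetical, and this simple scheme is not necessarily the 8-bit method used in the experiments above.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared absmax scale."""
    # Fall back to a scale of 1.0 if every weight is zero (avoids dividing by zero).
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [q * scale for q in quantized]


# Toy weights (hypothetical values, for illustration only).
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer bounds the per-weight error by scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Because each weight moves by at most half a quantization step, small per-parameter errors of this kind are compatible with the near-identical perplexities reported in Table 12.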

Appendix F Predict Future Occupations as Tokens in Job Titles

We can directly leverage LLMs’ next token prediction capabilities to predict future occupations without building an additional classifier. To obtain the predicted probability of the next occupation, we first tokenize each job title $\text{title}_{y}$ into a sequence of tokens. Suppose the string $\text{title}_{y}$ is tokenized into $n$ tokens $\{\text{token}_{y}^{(1)},\text{token}_{y}^{(2)},\dots,\text{token}_{y}^{(n)}\}$ . Then, the unnormalized probability of predicting $y$ is the likelihood the language model assigns to the token sequence $\{\text{token}_{y}^{(1)},\text{token}_{y}^{(2)},\dots,\text{token}_{y}^{(n)}\}$ as the continuation of the text representation $\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$ . The predicted probability can further be expanded using the chain rule of probability, as shown in Equation (9).

$$\begin{aligned} \tilde{P}_{\text{LLM}}(y\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})) &= P_{\text{LLM}}(\{\text{token}_{y}^{(1)},\text{token}_{y}^{(2)},\dots,\text{token}_{y}^{(n)}\}\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})) \\ &= \prod_{j=1}^{n}P_{\text{LLM}}(\text{token}_{y}^{(j)}\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}),\text{token}_{y}^{(1)},\text{token}_{y}^{(2)},\dots,\text{token}_{y}^{(j-1)}) \end{aligned} \tag{9}$$

Each factor $P_{\text{LLM}}(\text{token}_{y}^{(j)}\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}),\text{token}_{y}^{(1)},\text{token}_{y}^{(2)},\dots,\text{token}_{y}^{(j-1)})$ is operationalized by (1) appending the tokens $\text{token}_{y}^{(1)},\text{token}_{y}^{(2)},\dots,\text{token}_{y}^{(j-1)}$ to the text representation $\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$ and (2) querying the likelihood the language model assigns to $\text{token}_{y}^{(j)}$ as the next token conditioned on all the previous tokens.
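The chain-rule computation in Equation (9) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_model` is a hypothetical stand-in for a language model that returns a next-token distribution for a given prefix, and the token strings are invented for the example.

```python
import math


def title_log_prob(next_token_dist, prompt_tokens, title_tokens):
    """Sum of log P(token_j | prompt, token_1, ..., token_{j-1}) over the title."""
    prefix = list(prompt_tokens)
    total = 0.0
    for token in title_tokens:
        dist = next_token_dist(prefix)   # query the model on the current prefix
        total += math.log(dist[token])   # likelihood of the next title token
        prefix.append(token)             # append it before scoring the next one
    return total


def toy_model(prefix):
    """Hypothetical next-token distribution keyed on the last token only."""
    if prefix[-1] == ":":
        return {"Post": 0.6, "Sales": 0.4}
    if prefix[-1] == "Post":
        return {"masters": 0.5, "secondary": 0.5}
    return {"<eos>": 1.0}


prompt = ["1995", "to", "1996", ":"]
log_p = title_log_prob(toy_model, prompt, ["Post", "masters"])
p = math.exp(log_p)  # unnormalized probability of the two-token title
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow when job titles span many tokens.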

It is worth noting that we cannot guarantee that the model only assigns positive probabilities to valid job titles. In fact, given the presence of the softmax function in our language model, $P_{\text{LLM}}(\cdot\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))>0$ for any sequence of tokens of any length. Therefore, the probabilities of all valid job titles do not necessarily sum to one. We need the following normalization to calculate the probability of predicting $y_{i,t}$ so that $\sum_{y\in\mathcal{Y}}\hat{P}(y\mid x_{i},x_{i,\leq t},y_{i,<t})=1$.

$$\hat{P}(y_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})=\frac{\tilde{P}_{\text{LLM}}(y_{i,t}\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))}{\sum_{y^{\prime}\in\mathcal{Y}}\tilde{P}_{\text{LLM}}(y^{\prime}\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))} \tag{10}$$

The normalization operation in Equation (10) is computationally expensive, since we need to perform LLM inference $|\mathcal{Y}|$ times. In the experiments section, we will assess the necessity of this normalization step by examining how well $\tilde{P}_{\text{LLM}}(\cdot)$ approximates $\hat{P}(\cdot)$. It is worth noting that the denominator in Equation (10) is less than one, since the total probability mass on the subset of valid job titles is less than the total probability mass on all token sequences. Dividing by a constant smaller than one increases the probability, so $\tilde{P}_{\text{LLM}}$ is always an underestimate of $\hat{P}$. As a result, the test perplexity calculated using $\tilde{P}_{\text{LLM}}$ is an overestimate of the test perplexity calculated using $\hat{P}$.
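The effect of the normalization in Equation (10) on probabilities and on perplexity can be checked with a small numerical sketch. The three job titles and their unnormalized probabilities are invented for illustration, and perplexity is assumed here to be the exponential of the average negative log-likelihood.

```python
import math


def normalize(unnormalized):
    """Divide each title's probability by the total mass on valid titles."""
    z = sum(unnormalized.values())  # the normalization constant
    return {title: p / z for title, p in unnormalized.items()}, z


def perplexity(probs_of_truth):
    """exp of the mean negative log-likelihood (assumed definition)."""
    mean_nll = -sum(math.log(p) for p in probs_of_truth) / len(probs_of_truth)
    return math.exp(mean_nll)


# Hypothetical unnormalized probabilities; 10% of the mass falls on invalid strings.
p_tilde = {"teacher": 0.45, "cashier": 0.30, "nurse": 0.15}
p_hat, z = normalize(p_tilde)

# Suppose the true next occupation is "teacher" in a single test observation.
ppl_unnormalized = perplexity([p_tilde["teacher"]])
ppl_normalized = perplexity([p_hat["teacher"]])
```

With a normalization constant of 0.9, every normalized probability exceeds its unnormalized counterpart, so the unnormalized perplexity is the larger of the two.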

Appendix G Off-the-Shelf Language Models

To examine the performance of pre-trained LLMs without fine-tuning, we use the prediction-as-token approach (see Appendix F) and construct $\hat{P}(y\mid x_{i},x_{i,\leq t},y_{i,<t})=P_{\text{LLM}}(\text{title}_{y}\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))$. Table 13 presents perplexity scores of the off-the-shelf Llama-2 models with bootstrap standard deviations in parentheses. Our results indicate that off-the-shelf models perform poorly on the career trajectory prediction task. One possible explanation for this inferior performance is that the pre-trained model lacks knowledge of the set of valid job titles. Consequently, the model assigns a significant probability mass to strings that are not valid job titles, resulting in small values of $P_{\text{LLM}}(\text{title}_{y}\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))$.

To improve the baseline model, we prepend the complete list of job titles to the text representation $\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$ in the prompt. The list of titles is a paragraph with a single $\text{title}_{y}$ on each line and a total of $|\mathcal{Y}|$ lines; the total length of this list is around 2,500 tokens. The predicted probability of landing at occupation $y$ is $P_{\text{LLM}}(\text{title}_{y}\mid\text{List of Titles}\oplus\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))$, where $\oplus$ denotes the string concatenation operation. The results in Table 13 indicate that providing the model with a list of job titles enhances its performance. However, even with this improvement, off-the-shelf models still perform worse than other baseline models.

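Prompt Format	Model Size	PSID	NLSY79	NLSY97
$\text{List of Titles}\oplus\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	7B	179.96 (5.81)	53.71 (0.91)	71.13 (1.78)
$\text{List of Titles}\oplus\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	13B	131.26 (4.53)	44.97 (0.77)	50.13 (1.20)
$\text{List of Titles}\oplus\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	70B	131.29 (3.79)	39.53 (0.58)	46.24 (0.99)
$\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	7B	3820.31 (241.71)	473.52 (11.58)	505.27 (18.85)
$\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	13B	1711.50 (82.79)	236.19 (5.95)	291.59 (9.92)
$\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	70B	1527.95 (70.97)	162.80 (3.78)	216.09 (7.25)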

Table 13: Perplexities of off-the-shelf Llama-2 models with different prompt formats.

We also examine how often the off-the-shelf Llama-2 (7B), without any fine-tuning, predicts valid job titles. Specifically, we randomly sample 10% of the test split of each survey dataset and evaluate the “normalization constant” in Equation (10), defined as $\sum_{y\in\mathcal{Y}}P_{\text{LLM}}(\text{title}_{y}\mid\text{prompt})$. The average normalization constant is only around one-third using Llama-2 (7B) with $\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$ as the prompt. After adding the list of job titles to the prompt, the average normalization constant rises to around two-thirds but is still far from one; the relatively low chance of hitting valid job titles partially explains the poor performance of off-the-shelf LLMs. Table 14 enumerates average normalization constants across datasets using different prompt formats.

Prompt Format and Normalization Constant	PSID	NLSY79	NLSY97
$\sum_{y\in\mathcal{Y}}P_{\text{LLM}}(\text{title}_{y}\mid\text{List of Titles}\oplus\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))$	0.54 (N=1,219)	0.74 (N=4,691)	0.67 (N=2,192)
$\sum_{y\in\mathcal{Y}}P_{\text{LLM}}(\text{title}_{y}\mid\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))$	0.26 (N=1,219)	0.33 (N=4,691)	0.33 (N=2,192)

Table 14: Normalization constants of baseline off-the-shelf Llama-2 (7B) models with different prompt formats.

Finally, for the sample studied above, we perform the explicit normalization in Equation (10) and compute the perplexity (bootstrap standard deviations in parentheses). Table 15 reports perplexities on different datasets using two different prompts. We see that, even after constraining the model to predict valid job titles through normalization, the off-the-shelf Llama-2 (7B) model still fails to match the performance of the baseline models examined by Vafa et al. (2024).

Prompt Format	PSID	NLSY79	NLSY97
$\text{List of Titles}\oplus\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	57.20 (5.11)	33.69 (1.62)	38.31 (2.69)
$\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t})$	237.48 (37.02)	111.69 (8.36)	101.47 (10.45)

Table 15: Perplexities of Llama-2 (7B) off-the-shelf after explicitly normalizing the predicted probability using Equation (10).

Appendix H Effects of Normalization in Occupation-as-Token Prediction

This section examines the performance of fine-tuned Llama-2 (7B) and fine-tuned Llama-2 (13B) models on the three survey datasets using the first approach discussed (i.e., predict the next occupation as tokens) with explicit normalization. We could not run the same experiment with the Llama-2 (70B) model because the normalization operation required excessive computational resources.

We conduct experiments to investigate the necessity of the computationally expensive normalization procedure. The third column in Table 16 shows that the average normalization constant is close to one for the fine-tuned Llama-2 (7B) and Llama-2 (13B) models on all three test datasets. Therefore, it is possible to approximate $P(y_{i,t}\mid x_{i},x_{i,\leq t},y_{i,<t})$ using Equation (9) without normalization, which would enable us to scale up to much larger language models such as Llama-2 (70B). The last two columns in Table 16 report perplexities from the normalized probabilities in Equation (10) and perplexities from the unnormalized probabilities in Equation (9). Our experimental results indicate that it is feasible to use unnormalized probabilities for prediction in larger models without significantly affecting performance. Moreover, as noted earlier, the denominator in Equation (10) is less than one, since the total probability mass on the subset of valid job titles is less than the total probability mass on all token sequences; hence $\tilde{P}_{\text{LLM}}$ is always an underestimate of $\hat{P}$. As a result, the test perplexity calculated using the former is an overestimate of the test perplexity calculated using the latter. In other words, in Table 16, the unnormalized perplexity is a strict overestimate of the normalized perplexity. We have shown that even without normalization, our approach outperforms the state-of-the-art CAREER model. Thus, we can bypass the computational overhead associated with normalization, making it practical to scale up to models like Llama-2 (70B).

Model	Dataset	Avg. Norm. Const.	Perplexity (Normalized)	Perplexity (Unnormalized)
Fine-tuned Llama-2 (7B) (8-bit)	PSID	0.986448	13.43 (0.29)	13.61 (0.30)
Fine-tuned Llama-2 (7B) (8-bit)	NLSY79	0.996003	11.32 (0.12)	11.37 (0.12)
Fine-tuned Llama-2 (7B) (8-bit)	NLSY97	0.993262	14.53 (0.24)	14.62 (0.24)
Fine-tuned Llama-2 (13B) (8-bit)	PSID	0.989791	13.17 (0.29)	13.30 (0.29)
Fine-tuned Llama-2 (13B) (8-bit)	NLSY79	0.995048	11.22 (0.11)	11.27 (0.12)
Fine-tuned Llama-2 (13B) (8-bit)	NLSY97	0.992862	14.07 (0.24)	14.17 (0.24)

Table 16: Normalization constant and perplexities of Llama-2 (7B) and Llama-2 (13B) models. Perplexity (normalized) is computed using Equation (10), and perplexity (unnormalized) is computed using Equation (9) without explicit normalization. The number in parentheses represents the standard deviation of perplexities computed on 1,000 bootstrap samples of the test set.
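The bootstrap standard deviations reported in parentheses can be approximated with a short resampling routine. The sketch below assumes that individual test observations are resampled with replacement and that perplexity is the exponential of the mean negative log-likelihood; whether the original analysis resamples observations or whole individuals is not stated here, and the log-probabilities are invented toy values.

```python
import math
import random


def perplexity(log_probs):
    """exp of the mean negative log-likelihood over test observations."""
    return math.exp(-sum(log_probs) / len(log_probs))


def bootstrap_std(log_probs, n_boot=1000, seed=0):
    """Standard deviation of perplexity across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(log_probs)
    stats = []
    for _ in range(n_boot):
        # Resample n observations with replacement (assumed resampling unit).
        sample = [log_probs[rng.randrange(n)] for _ in range(n)]
        stats.append(perplexity(sample))
    mean = sum(stats) / n_boot
    variance = sum((s - mean) ** 2 for s in stats) / (n_boot - 1)
    return math.sqrt(variance)


# Hypothetical predicted probabilities of the true next occupation.
toy_log_probs = [math.log(p) for p in [0.5, 0.1, 0.25, 0.05, 0.4, 0.2]]
point_estimate = perplexity(toy_log_probs)
std_estimate = bootstrap_std(toy_log_probs, n_boot=1000, seed=0)
```

Fixing the random seed makes the reported standard deviation reproducible across runs.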

Appendix I Details of Model Pairwise Performance Differences

Figure 12 illustrates the distributions of (Perplexity of Model 1, Perplexity of Model 2) pairs across different model pairs and datasets. Our observations indicate that larger models (represented on the y-axis) consistently outperform smaller models (represented on the x-axis) in terms of perplexity, suggesting significant returns from scaling model size.

Difference between Model Perplexities	PSID	NLSY79	NLSY97
Perplexity(Fine-tuned Llama-2 (70B)) - Perplexity(Fine-tuned Llama-2 (13B))	-0.21 (0.07)	-0.25 (0.03)	-0.29 (0.04)
Perplexity(Fine-tuned Llama-2 (70B)) - Perplexity(Fine-tuned Llama-2 (7B))	-0.49 (0.07)	-0.34 (0.03)	-0.74 (0.05)
Perplexity(Fine-tuned Llama-2 (13B)) - Perplexity(Fine-tuned Llama-2 (7B))	-0.29 (0.07)	-0.09 (0.02)	-0.45 (0.05)
Perplexity(Fine-tuned Llama-2 (70B)) - Perplexity(CAREER)	-0.77 (0.09)	-0.30 (0.03)	-0.28 (0.05)
Perplexity(Fine-tuned Llama-2 (13B)) - Perplexity(CAREER)	-0.56 (0.09)	-0.05 (0.03)	-0.00 (0.05)
Perplexity(Fine-tuned Llama-2 (7B)) - Perplexity(CAREER)	-0.27 (0.10)	0.04 (0.03)	0.46 (0.07)

Table 17: Difference in perplexities between models using Approach 1, which predicts future occupations as tokens. The number in parentheses represents the standard deviation of perplexity differences computed on 1,000 bootstrap samples of the test set.

Appendix J Details of Language Models used as Embedding Engines

Table 18 summarizes the embedding models we use in our experiments and the dimensions of embeddings they generate.

Text Embedding Model	Dimension	Trained on the Survey Data
OpenAI Ada 2	1,536	No
OpenAI Ada 3 (small)	1,536	No
OpenAI Ada 3 (large)	3,072	No
Meta Llama-2 (7B) (off the shelf)	4,096	No
Meta Llama-2 (13B) (off the shelf)	5,120	No
Meta Llama-2 (70B) (off the shelf)	8,192	No
Our Fine-tuned Llama-2 (7B)	4,096	Yes
Our Fine-tuned Llama-2 (13B)	5,120	Yes
Our Fine-tuned Llama-2 (70B)	8,192	Yes

Table 18: Summary of language models used to construct embeddings in Approach 2, which uses language models as embedding engines.

The most straightforward prediction head is a multinomial regression model that predicts the next occupation. The estimate of the conditional probability of the next occupation is given by:

$$\hat{P}_{\text{MNL}}(y\mid x_{i},x_{i,\leq t},y_{i,<t})=\frac{\exp(\beta_{y}^{\top}E_{i,<t})}{\sum_{y^{\prime}\in\mathcal{Y}}\exp(\beta_{y^{\prime}}^{\top}E_{i,<t})} \tag{11}$$

where $\{\beta_{y}\}_{y\in\mathcal{Y}}$ is the set of trainable parameters. We train $\beta$ s in the prediction head to minimize the cross-entropy loss between the predicted distribution and the true distribution of the next occupation, defined in Equation (12). We use L2 regularization on the $\beta$ s to avoid overfitting.

$$\bm{\beta}^{\star}=\operatorname{arg\,min}_{\bm{\beta}\in\mathbb{R}^{|\mathcal{Y}|}}-\frac{1}{\sum_{i\in\text{Train Set}}T_{i}}\sum_{i\in\text{Train Set}}\sum_{t=1}^{T_{i}}\sum_{y\in\mathcal{Y}}\mathbf{1}\{y_{i,t}=y\}\log\hat{P}_{\text{MNL}}(y\mid x_{i},x_{i,\leq t},y_{i,<t}) \tag{12}$$

Finally, we plug in the estimated $\bm{\beta}^{\star}$ into Equation (11) to obtain the predicted probability for every occupation, and we use the same test set perplexity in Equation (2) to evaluate the model.
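As an illustration of Equations (11) and (12), the following minimal sketch computes the multinomial-logit probabilities and the average cross-entropy loss for a toy problem. The two-dimensional embedding, the three occupations, and the coefficient values are all hypothetical; the actual embeddings have thousands of dimensions (see Table 18).

```python
import math


def mnl_probs(betas, embedding):
    """Softmax over the inner products beta_y^T E, as in Equation (11)."""
    scores = {y: sum(b * e for b, e in zip(beta, embedding))
              for y, beta in betas.items()}
    shift = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {y: math.exp(s - shift) for y, s in scores.items()}
    total = sum(exp_scores.values())
    return {y: v / total for y, v in exp_scores.items()}


def cross_entropy(betas, examples):
    """Average negative log-likelihood of the observed occupations."""
    losses = [-math.log(mnl_probs(betas, emb)[y]) for emb, y in examples]
    return sum(losses) / len(losses)


# Hypothetical coefficients: one vector beta_y per occupation.
betas = {"teacher": [1.0, 0.0], "cashier": [0.0, 1.0], "nurse": [0.0, 0.0]}
zero_betas = {y: [0.0, 0.0] for y in betas}

examples = [([2.0, 0.0], "teacher")]  # (embedding, observed next occupation)
probs = mnl_probs(betas, examples[0][0])
loss_fitted = cross_entropy(betas, examples)
loss_uniform = cross_entropy(zero_betas, examples)
```

With all coefficients at zero the head predicts a uniform distribution, so the loss equals $\log|\mathcal{Y}|$; any informative set of coefficients should fall below that baseline.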

It is worth noting that $\beta_{y}$ ’s in Equation (11) can be interpreted as a latent representation of job $y$ ; $\beta_{y}$ ’s were initialized randomly and learned during the training process. In contrast, the direct prediction from the job tokens approach in Section 5.1 incorporates the LLMs’ understanding of the information embedded in job titles while making the prediction; therefore, we expect a slightly worse performance from this second approach. Researchers can also deploy other prediction heads, such as random forests, gradient boosting, and neural networks, to predict the next occupation.

We fit a $|\mathcal{Y}|$-class multinomial regression on the train split of the respective survey dataset to predict the ground-truth occupation $y_{i,t}$, using the Adam optimizer with a learning rate of $0.003$ and a weight decay (i.e., regularization) hyperparameter. Since the loss landscape of multinomial regressions is generally well-behaved, we only conduct a hyperparameter search on the weight decay, ranging from $10^{-6}$ to $1$ in log space. To avoid overfitting and speed up the experiment, we train with early stopping (on the validation set loss), and the final regularization parameter is chosen to minimize the validation set loss.
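The search procedure described above can be sketched as follows. The grid size of seven points, the patience of three epochs, and the validation-loss curves are assumptions made for illustration; the text specifies only the range of the weight decay and the use of early stopping on validation loss.

```python
def log_space_grid(low_exp, high_exp, num):
    """Return num values spaced evenly in log10 from 10**low_exp to 10**high_exp."""
    step = (high_exp - low_exp) / (num - 1)
    return [10 ** (low_exp + k * step) for k in range(num)]


def best_epoch_with_early_stopping(val_losses, patience=3):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # later epochs are never evaluated
    return best_epoch, best_loss


def select_weight_decay(val_curves, patience=3):
    """Pick the weight decay whose early-stopped validation loss is lowest."""
    best = {wd: best_epoch_with_early_stopping(curve, patience)[1]
            for wd, curve in val_curves.items()}
    return min(best, key=best.get)


grid = log_space_grid(-6, 0, 7)  # 1e-6, 1e-5, ..., 1 (assumed grid size)

# Hypothetical per-epoch validation losses for three candidate values.
val_curves = {
    1e-6: [1.0, 0.9, 0.95, 0.97, 0.99],
    1e-3: [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6],
    1.0: [1.2, 1.1, 1.1, 1.1, 1.1],
}
chosen = select_weight_decay(val_curves)
```

Note that early stopping trades a little accuracy for speed: in the second toy curve, training halts before the final, lower loss of 0.6 is ever observed.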

Appendix K Details of In-Context Learning

Formally, let $\mathcal{T}_{j}=\mathcal{T}(x_{j},x_{j,\leq T_{j}},y_{j,\leq T_{j}})$, $\mathcal{T}_{k}=\mathcal{T}(x_{k},x_{k,\leq T_{k}},y_{k,\leq T_{k}})$, and $\mathcal{T}_{\ell}=\mathcal{T}(x_{\ell},x_{\ell,\leq T_{\ell}},y_{\ell,\leq T_{\ell}})$ denote text representations of the three in-context learning examples. Given individual $i$’s history $(x_{i},x_{i,\leq t},y_{i,<t})$, we compute an embedding vector

$$\tilde{E}_{i,<t}=\text{Model}(\mathcal{T}_{j}\oplus\mathcal{T}_{k}\oplus\mathcal{T}_{\ell}\oplus\mathcal{T}(x_{i},x_{i,\leq t},y_{i,<t}))\in\mathbb{R}^{d} \tag{13}$$

where $\mathcal{T}(\cdot)$ is the text representation function as defined in Section 4.1 and $\oplus$ denotes the string concatenation operation. Finally, we train a $|\mathcal{Y}|$ -class logistic regression model on $\tilde{E}_{i,<t}$ to predict the next occupation, following the same procedure as in Section 5.2.

To ensure the robustness and replicability of our findings, this experiment is replicated five times for each dataset, with each iteration utilizing a different set of randomly selected examples. This procedure allows us to evaluate the stability of in-context learning across various career trajectories. Although generating the embedding in Equation (13) requires the language model to process a longer sequence of text, which would increase the inference cost, this approach requires a significantly lower amount of computational resources, since it does not require model fine-tuning.
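The prompt construction and the five replications can be sketched as below. The placeholder career texts, the blank-line separator between examples, and the use of the seed as the replicate index are assumptions for illustration; the text specifies only that three randomly selected examples are concatenated before the target history.

```python
import random


def sample_icl_examples(train_texts, k=3, seed=0):
    """Draw k distinct full-career text representations as in-context examples."""
    return random.Random(seed).sample(train_texts, k)


def build_icl_prompt(example_texts, target_text, sep="\n\n"):
    """Concatenate the examples, then the target worker's partial history."""
    return sep.join(list(example_texts) + [target_text])


# Placeholder text representations (hypothetical stand-ins for real prompts).
train_texts = [f"<career history of worker {j}>" for j in range(10)]
target = "<history of worker i up to record t>"

# One prompt per replicate, each using a different random set of examples.
prompts = [
    build_icl_prompt(sample_icl_examples(train_texts, k=3, seed=seed), target)
    for seed in range(5)
]
```

Each resulting prompt would then be passed to the language model to obtain the embedding in Equation (13).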

Appendix L Model Performance by Different Education Groups

Table 19 presents the perplexities of the CAREER and fine-tuned language models on different survey datasets, indicating superior performance of fine-tuned language models compared to previous models.

Dataset	PSID		NLSY79		NLSY97
Group	College or Above	Non-College	College or Above	Non-College	College or Above	Non-College
CAREER	15.87 (0.49)	12.01 (0.38)	11.84 (0.19)	11.19 (0.15)	10.14 (0.37)	15.91 (0.30)
Fine-tuned Llama-2 (7B)	15.83 (0.50)	11.55 (0.37)	11.70 (0.20)	11.13 (0.16)	10.04 (0.36)	16.24 (0.29)
Fine-tuned Llama-2 (13B)	15.50 (0.50)	11.38 (0.36)	11.87 (0.20)	11.30 (0.16)	9.80 (0.36)	15.90 (0.30)
Fine-tuned Llama-2 (70B)	15.27 (0.45)	11.11 (0.34)	11.27 (0.18)	10.84 (0.15)	9.55 (0.34)	15.42 (0.28)

Table 19: Perplexities of different models by dataset and education groups. The number in parentheses represents the standard deviation of perplexities computed on 1,000 bootstrap samples of the test set.

Figure 13 shows the calibration plots for predicting staying/moving on different datasets and education groups. Finally, Figure 14 plots ROC curves of different models while predicting staying/moving on different datasets and education groups.