Language Generation in the Limit
Abstract
Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don’t already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without further assumptions. In particular, suppose that an adversary enumerates the strings of an unknown target language $K$ that is known only to come from one of a possibly infinite list of candidates. A computational agent is trying to learn to generate from this language; we say that the agent generates from $K$ in the limit if after some finite point in the enumeration of $K$, the agent is able to produce new elements that come exclusively from $K$ and that have not yet been presented by the adversary. Our main result is that there is an agent that is able to generate in the limit for every countable list of candidate languages. This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it.
1 Introduction
The recent advances in large language models (LLMs) have been remarkable, sparking active lines of theoretical work into their performance. These investigations implicitly revolve around two fundamental questions: how do we formally reason about the effectiveness of LLMs; and within such a framework, what are the core mathematical ideas that enable their performance?
Answers to these questions must begin by formalizing the specification for what a generative algorithm for language should be doing. Here, we propose starting from a very basic, assumption-free statement of such a specification: there is an unknown target language $K$, over time the algorithm sees a sequence of strings from $K$, and eventually we would like the algorithm to generate new strings from $K$ that it has not seen before. [Footnote 1: We will formalize the definition of a language more precisely below, but for now we can think of a language as simply any set of strings over a fixed alphabet; for example, the strings of the language could be the set of all grammatical sentences (or all well-formed expressions) according to a given grammar.]
Viewed this way, it is also clear why it seems so remarkable for LLMs to be doing well at such a problem. The fully general statement of the problem feels unsolvable: if we know nothing about the unknown target language $K$, then how can a generative algorithm reliably produce valid strings from $K$ that it hasn’t seen before?
Language Learning in the Limit.
In fact, there is a well-established formalism that allows us to phrase this point precisely: the classical model of language learning in the limit, formulated by Mark Gold in 1967 and fully characterized by Dana Angluin in 1980 [6, 2]. In this model, there is an unknown language $K$ that is known only to be produced by one of a list of candidate representations $R_1, R_2, R_3, \ldots$, where $R_i$ produces some language $L_i$. We can think of this list of representations as the set of all possible context-free grammars, or the set of all possible finite automata, or the set of all Turing machines with a fixed space bound, or any other generative model that produces strings; in fact, the formal result is much more general than this, in that it is sufficient to suppose that the unknown language $K$ simply comes from a countable list of candidate languages $L_1, L_2, L_3, \ldots$, and we can dispense with explicit representations altogether. [Footnote 2: In this more general view, we will assume that the family of languages is presented simply via a black box that for a string $w$ and an index $i$ can answer the question, “Is $w \in L_i$?”]
In the Gold-Angluin model, an adversary enumerates the strings of $K$ one by one, and the algorithm is required after each new string to guess a language $L_i$ from the list such that $L_i = K$. If there is some finite step $t$ after which the algorithm’s guess is always correct, then we say the algorithm has identified $K$ in the limit. Gold proved that this kind of language identification in the limit is impossible in general, even for simple language families such as the regular languages (i.e. those produced by finite automata), and Angluin characterized precisely those families for which it is possible, further establishing how limited they are [1, 2]. Note, crucially, that in the Gold-Angluin model, the adversary enumerates strings in $K$, but does not provide examples of strings that do not belong to $K$, nor does it allow the algorithm to ask questions about a string’s membership in $K$; their point with this formalism was to focus on cases where an algorithm tries inferring a language purely from seeing a sufficient number of examples of strings that belong to the language.
Our Results: Language Generation in the Limit.
These negative results of Gold and Angluin feel intuitive — how should we be able to identify a language from a finite sample when we are allowed to make essentially no assumptions about its structure? Because of this intuition, both via the Gold-Angluin model and for more informal reasons as well, the focus in language generation has naturally turned to distributional assumptions; one posits that large language models are so effective because they are able to exploit distributional probabilities of language, and from a finite set of samples they are able to estimate conditional probabilities of strings with increasing accuracy. In this way, the question moves from adversaries to probability distributions, and one seeks explanations for the effectiveness of LLMs through underlying probabilistic models.
In this paper, we offer a sharply different view: we show that in the Gold-Angluin model of adversarially produced examples, language generation is always possible. We will provide full details on the result and its proof beginning in the next section, but the key point at a high level is that even in an adversarial model with an unknown language $K$, language generation is a fundamentally different task than language identification: where identification asks an algorithm to eventually name a language $L_i = K$ after seeing a large enough finite sample from $K$, generation instead asks an algorithm to eventually output strings in $K$ after seeing a large enough finite sample from $K$. Our main result is that this difference in specifications leads to dramatic differences in what is possible in the limit; whereas the Gold-Angluin results establish that identification in the limit is impossible except in highly constrained cases, we show that generation in the limit is possible for every countable list of candidate languages.
General Connections to Language Modeling.
Clearly, methods to design large language models in practice make extensive use of the empirical distributional properties of language, as they should. Our results don’t question this design methodology; when there are genuine empirical regularities in the training data, there is no reason not to use them. Rather, our results argue that if we are looking for the essential reasons why language generation is tractable, we do not fundamentally require any empirical regularities, or indeed any probabilistic assumptions at all; there is instead a formal sense in which language generation — unlike broader learning tasks such as language identification — is possible even against an adversary presenting positive training examples in a worst-case fashion. In some crucial sense, the generation problem is therefore different from these other learning tasks in ways that more detailed formalisms may potentially obscure.
Despite the generality of the model, the generation algorithm that proves our main theorem makes use of subtle structural properties of the given list of candidate languages. Again, we defer detailed descriptions to subsequent sections, but the idea at a high level is to maintain a sequence of “provisional languages” that are consistent with the finite sample $S$ from $K$ seen so far, and to continually refine this sequence of provisional languages as the adversary adds strings to $S$. Since the Gold-Angluin results say that the algorithm can never be sure which is the true language $K$, there is a sense in which this refinement process may need to continue indefinitely, and in general it leads the algorithm to generate from provisional languages that may be increasingly “thin” subsets of $K$. This does not cause trouble for the specification of language generation, since it is acceptable to produce any unseen string from $K$, but it does mean that while the algorithm is able to eventually produce an infinite sequence of unseen strings all from $K$, it might do so from a narrow part of $K$.
This property of the solution in the presence of an adversary suggests interesting connections to the problem of generation in practice as well. First, and most directly, our basic model — viewed at a high level given the abstract setting — is engaging in the basic loop that essentially any method for language generation must perform: iterating over a sequence of possible representations for the target language, and repeatedly refining these representations in the presence of new examples so as to eventually achieve successful generation.
Beyond just this basic point, our model also encounters a key trade-off that appears in applications as well: specifically, that any method for generation has to deal with the tension between an underlying validity problem — producing valid outputs — and an underlying breadth problem — producing outputs that represent the full range of valid outputs in some reasonable way. The breadth problem is notoriously difficult, and it manifests itself in numerous ways in the methodology of machine learning and generative models.
The approach that proves our main result helps illustrate the tension between validity and breadth even in settings with worst-case assumptions rather than probabilistic ones, and this tension shows up in both the early phases of our algorithm’s execution and the later phases. In the early phases, before the algorithm has refined its provisional language sufficiently, it is generating too broadly and producing strings that are not part of the target language $K$. This is an analogue, at a high level, of a kind of hallucination in which the generated strings belong to some consistent candidate language, but not to the actual target language [7, 8, 10]. In the later phases, on the other hand, the algorithm continuously shrinks its range of possible outputs so as to ensure that they will be contained within $K$, sacrificing breadth for validity in a manner analogous to the issues that arise in the problem of mode collapse for generative models [3, 4]. Our model therefore suggests interesting questions about the fundamental trade-offs that may exist between validity and breadth even in settings without an underlying probabilistic model.
2 Formal Model and Results
We now provide a formal description of the model and the statement of our results. To begin with, we have a countable list of candidate languages $\mathcal{C} = \{L_1, L_2, L_3, \ldots\}$, where each $L_i$ is a subset of some countable set $U$. All we assume about the list of languages is that it is specified through a black box that can answer questions of the form “Is $w \in L_i$?” for any string $w \in U$ and language $L_i \in \mathcal{C}$. (If the reader finds it helpful for concreteness, they can consider the results that follow in the context of a specific list of languages $\mathcal{C}$, such as the set of all context-free languages or the set of all regular languages; but everything we say applies to general collections of languages.) We will allow the collection $\mathcal{C}$ to contain repetitions, in that we may have $L_i = L_j$ for different indices $i$ and $j$. We will assume that all the languages $L_i$ are infinite; while the original Gold-Angluin framework did not require this, it becomes important in specifying the generation problem: if we require an algorithm to output unseen strings forever, then this is not possible from a finite language, where the algorithm would eventually run out of new strings to generate.
An adversary and an algorithm now play the following game. The adversary chooses a language $K$ from $\mathcal{C}$ without revealing it to the algorithm, and it begins enumerating the strings of $K$ one by one over a sequence of steps $t = 1, 2, 3, \ldots$. The adversary can repeat strings in its enumeration, but the crucial point is that for every string $w \in K$, there must be at least one time step $t$ in which it appears. Let $S_t$ be the set of strings that the adversary has enumerated in steps 1 through $t$.
Identification and Generation.
In this framework, we can now specify both the Gold-Angluin problem of identification and the contrasting problem of generation that we study in this paper.
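• Identification (from the Gold-Angluin model [6, 2]): In each step $t$, the algorithm observes $S_t$ and must output an index $i$ (its guess for the true language $K$). The algorithm identifies $K$ in the limit if there is some $t^*$ such that for all steps $t \geq t^*$, the algorithm’s guess in step $t$ is an index $i$ for which $L_i = K$.
• Generation (from the present paper): In each step $t$, the algorithm observes $S_t$ and must output a string $a_t$ (its guess for an unseen string in $K$). The algorithm generates from $K$ in the limit if there is some $t^*$ such that for all steps $t \geq t^*$, the algorithm’s guess $a_t$ belongs to $K - S_t$.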
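To make the generation game concrete, the following is a minimal Python sketch of a referee that replays a finite prefix of an enumeration and records whether each output is an unseen element of the target language. The names `run_generation_game`, `first_step_correct_from`, and the toy target (the even natural numbers) are illustrative assumptions, not notation from the paper. A finite prefix can only suggest, never certify, behavior in the limit, and the referee's membership test is never shown to the generator.

```python
def run_generation_game(enumeration, generate, in_target):
    """Replay a finite prefix of the adversary's enumeration.

    `generate` sees only the strings enumerated so far. `in_target` is a
    referee-side membership test for the true language; the generator
    never calls it. Returns one boolean per step: was the output an
    unseen element of the target language?
    """
    seen = []
    verdicts = []
    for w in enumeration:
        seen.append(w)
        guess = generate(list(seen))  # pass a copy of the sample so far
        verdicts.append(in_target(guess) and guess not in seen)
    return verdicts


def first_step_correct_from(verdicts):
    """1-based step from which every recorded output is correct, else None."""
    t_star = None
    for t, ok in enumerate(verdicts, start=1):
        if ok and t_star is None:
            t_star = t
        elif not ok:
            t_star = None
    return t_star


# Toy target language (an assumption for illustration): even naturals.
is_even = lambda w: w % 2 == 0
good = run_generation_game([0, 2, 4, 6], lambda s: max(s) + 2, is_even)
bad = run_generation_game([0, 2, 4, 6], lambda s: max(s) + 1, is_even)
```

Here `good` records a correct output at every step, while `bad` never produces a valid string, so `first_step_correct_from(bad)` is `None`.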
Recall that a key point about the Gold-Angluin framework is that the algorithm is not provided with feedback about whether its outputs are correct: in the case of identification it is not told if its guesses about the identity of the language are correct, and correspondingly our model of generation does not provide the algorithm with feedback about whether the string it generates in step $t$ belongs to the target language $K$.
We know from the Gold-Angluin results that there is no algorithm that can achieve identification in the limit for an arbitrary countable collection of languages (or even for specific countable collections, like the set of all regular languages or the set of all context-free languages). In contrast, our main result is a dramatically different answer for language generation; it is possible for every countable collection:
(2.1)
There is an algorithm with the property that for any countable collection of languages $\mathcal{C} = \{L_1, L_2, L_3, \ldots\}$, and any enumeration of one of these languages $K$, the algorithm generates from $K$ in the limit.
A Result for Finite Collections.
We prove a second result as well, focusing on the variant of the problem in which the collection of languages $\mathcal{C}$ is finite. In this case, it follows from Angluin’s characterization that every finite collection $\mathcal{C}$ allows for identification in the limit. Given this, what more might we ask for? A natural question is whether there is a uniform bound on the number of samples needed to ensure that the algorithm can correctly identify the true language $K$; that is, for any finite collection $\mathcal{C}$, is there a bound $t(\mathcal{C})$ and an algorithm with the property that after seeing any $t(\mathcal{C})$ distinct strings from $K$, the algorithm is guaranteed to correctly report $K$ as its guess for the true language?
It is easy to see that for the Gold-Angluin model of language identification, this is not possible. For example, suppose that $\mathcal{C}$ is the collection consisting of two languages $L_1$ and $L_2$: $L_1$ consists of all possible strings, and $L_2$ consists of all strings of even length. Suppose there were a bound $t(\mathcal{C})$ and an algorithm that was guaranteed to guess correctly after seeing $t(\mathcal{C})$ distinct samples. Then an adversary could present $t(\mathcal{C})$ distinct strings of even length, and then ask the algorithm to guess whether the true language is $L_1$ or $L_2$: if the algorithm guesses $L_1$ at this point, then the adversary could announce that the answer is $L_2$, and conversely if the algorithm guesses $L_2$. This does not prevent the algorithm from learning the true language in the limit, since the algorithm could simply keep guessing $L_2$ until the first time (if ever) when a string of odd length is presented, at which point it switches to $L_1$. But there is no fixed bound $t(\mathcal{C})$ by which it can be guaranteed to output the correct guess.
However, for the problem of generation with a finite collection of candidate languages, it is possible to provide this much stronger type of uniform bound, via an algorithm that can generate correctly after seeing a finite sample whose size is specified at the outset. In fact, we can achieve more: after seeing this finite sample, the algorithm can correctly generate an infinite sequence of unseen elements from the true language.
(2.2)
There is an algorithm with the property that for any finite collection of languages $\mathcal{C}$, there is a number $t(\mathcal{C})$, such that for any language $K$ in $\mathcal{C}$, and any sequence $S$ of at least $t(\mathcal{C})$ distinct elements from $K$, the algorithm given $S$ can produce an infinite sequence of distinct strings from $K - S$.
Extensions and Generalizations.
Following these two main results in our basic model of generation, we provide (in Section 7) some extensions in a generalization of the model. Specifically, a familiar issue from language generation applications is the role of the prompt: a user provides an input string $p$, and a generation algorithm is then supposed to produce a “continuation” string $c$ to come after the prompt, so that the concatenation of $p$ and $c$ is a valid utterance. We offer a way of extending our model to incorporate the notion of prompting, while maintaining the general structure of the model, and we show how to formulate and prove a generalization of our first main result in a setting where at each time step the adversary is allowed to specify a prompt that must be completed.
3 Initial Observations
To begin with, we make a few additional notational points beyond what was spelled out in the introduction. First, the languages we consider are all subsets of a countable set of elements $U$, and for our two main results (2.1) and (2.2), it is not important what $U$ corresponds to: we can choose to think of $U$ as the natural numbers, or the collection of all finite strings over a fixed alphabet, or any other explicitly enumerated countable set. In particular, it will not be crucial for these results whether we talk about a set of strings listed as $u_1, u_2, u_3, \ldots$ or simply the natural numbers $1, 2, 3, \ldots$. As a result, at different times in examples, for expositional simplicity we will take $U$ to be different countable sets. (In contrast, when we consider extensions of our results to handle constructions like prompts, it will be necessary to focus on the case in which $U$ is the set of all strings over a finite alphabet, so that the notion of string concatenation makes sense.)
Recall that we think of the algorithm as having knowledge of the sequence of languages $\mathcal{C} = \{L_1, L_2, L_3, \ldots\}$ in the following sense: given an index $i$ and a string $w$, it can evaluate in finite time whether or not $w \in L_i$. We will refer to this evaluation of $w \in L_i$ as a membership query, and assume henceforth that our algorithms have the power to perform membership queries on the languages in $\mathcal{C}$. The true language $K$ appears in the sequence $L_1, L_2, L_3, \ldots$; suppose that $K = L_z$ for an index $z$. Going forward, we will often refer to $K$ as $L_z$. We observe that because languages can appear more than once in the sequence, it is possible that $K$ is also equal to $L_i$ for one or more indices $i \neq z$. Finally, and crucially, we note that the algorithm does not have the power to pose queries about the membership of a string $w$ in the true language $K$: since the algorithm can only pose membership queries of the form “$w \in L_i$?” for strings $w$ and indices $i$ that it provides, it cannot ask “$w \in K$?” because it does not know the index $z$ for which $K = L_z$.
We prove our main results beginning in Section 4, but first we discuss a few points that provide useful background for thinking about the problem.
3.1 Review of Negative Results for Identification
The first point is a review of why language identification in the limit is not possible in general, adapting the exposition from [6, 9]. It is useful to go through the proof of this result, so as to get a better sense for the contrast with our positive results for generation.
There are many ways to show the negative result for language identification using very simple language families, and we choose one that highlights some intuitive contrasts with generation. For the argument, let $U$ be the set of all integers, and let the collection of languages $\mathcal{C}$ (each of which is a subset of $U$) be the set of all infinite arithmetic progressions of integers. (The choice of which countable ground set $U$ we use is not crucial for any of these results, and examples are often easier to describe over the set of integers than over the set of finite strings.) Formally, for an arbitrary integer $a$ and a positive integer $b$, let $P_{a,b}$ be the arithmetic progression consisting of all integers of the form $a + bi$ for $i = 0, 1, 2, \ldots$; and let $Q_{a,b}$ be the “bidirectional” arithmetic progression consisting of all integers of the form $a + bi$ for $i \in \mathbb{Z}$. Let the collection $\mathcal{C}$ consist of all arithmetic progressions $P_{a,b}$ and $Q_{a,b}$.
Now, suppose by way of contradiction that there is an algorithm that can identify in the limit an adversarially chosen arithmetic progression $K \in \mathcal{C}$. We construct an enumeration of $K$ that causes the algorithm to fail, as follows. First, for integers $a \leq b$, let $I[a,b]$ be the interval of all integers $n$ for which $a \leq n \leq b$. We enumerate elements of $U$ in stages, where each stage $s \geq 0$ consists of a set of consecutive steps. If by induction stage $s$ has enumerated the elements of the interval $I[-s, j(s)]$ for some $j(s) \geq 0$, then stage $s+1$ will enumerate additional elements so that by the end of the stage we have enumerated exactly $I[-(s+1), j(s+1)]$ for some $j(s+1) > j(s)$. In particular, stage $s+1$ first enumerates $-(s+1)$ and $j(s)+1$, and then it begins enumerating $j(s)+2, j(s)+3, \ldots$ in increasing order. At some point during this process, the algorithm must output $P_{-(s+1),1}$ as its guess for $K$, since we can continue in this way to produce a full enumeration of $P_{-(s+1),1}$, at which point the true language is $K = P_{-(s+1),1}$. Once the algorithm outputs $P_{-(s+1),1}$ as its guess during stage $s+1$, we end stage $s+1$, defining $j(s+1)$ to be the largest integer we’ve enumerated up to that point, and we begin stage $s+2$.
In this way, the successive stages extend the interval $I[-s, j(s)]$ unboundedly in both directions. We are therefore enumerating $Q_{0,1} = U$, and so in fact $K = Q_{0,1}$. But we have also produced an unbounded sequence of steps $t_1 < t_2 < t_3 < \cdots$ such that the algorithm guesses $P_{-s,1}$ at step $t_s$. Thus, there is no time step $t^*$ for which the algorithm outputs the (correct) guess $Q_{0,1}$ at every $t \geq t^*$.
3.2 Generation and Closure
In contrast, the particular collection $\mathcal{C}$ defined above is easy for an algorithm attempting to perform generation in the limit. Once the algorithm has seen two elements $i < j$ from the true language $K$, then setting $b = j - i$, it knows that $K$ must contain not just $i$ and $j$, but also $i + 2b, i + 3b, \ldots$, and in particular the entire arithmetic progression $P_{i,b}$. Therefore, for every step $t$ from then on, it can always generate a string in $P_{i,b} - S_t$, since $P_{i,b}$ is infinite and $S_t$ is finite. Given that $P_{i,b} \subseteq K$, this is guaranteed to be a string in $K - S_t$.
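The generation rule for arithmetic progressions just described can be sketched in a few lines of Python. The function name `generate_from_progression` and the choice of the two smallest sample elements as the pair $i < j$ are illustrative assumptions; any two distinct sample elements would serve.

```python
def generate_from_progression(sample):
    """Return an unseen integer lying in every arithmetic progression
    (one-directional or bidirectional) that contains `sample`.

    Returns None until two distinct elements have been seen.
    """
    seen = set(sample)
    distinct = sorted(seen)
    if len(distinct) < 2:
        return None
    i, j = distinct[0], distinct[1]  # assumed choice: the two smallest elements
    b = j - i                        # the true common difference divides b
    candidate = j + b                # i + 2b
    while candidate in seen:         # seen is finite, so this loop terminates
        candidate += b
    return candidate


# Sample drawn from the progression 3, 7, 11, 15, ... (hypothetical target).
guess = generate_from_progression([7, 3, 11])
```

The algorithm never learns the start or common difference of the true progression; it only uses the fact that every candidate consistent with the sample contains $i, i + b, i + 2b, \ldots$.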
A key point in this example is that the algorithm never needs to find out the identity of the true language $K$ in order to be able to generate an infinite sequence of strings from it (and by the argument above, we know that in fact it provably can’t find out the identity of the true language). Rather, it just needs to answer the question, “What is a string that is guaranteed to belong to every language in $\mathcal{C}$ consistent with what I’ve seen so far?”
This highlights a sense in which generation is closely related to a certain kind of closure operation on the languages in $\mathcal{C}$, which we define as follows. Given the set of strings $S_t$ seen up to step $t$, we say that a language $L_i \in \mathcal{C}$ is consistent with $S_t$ if $S_t \subseteq L_i$. We now define the closure of $S_t$ in $\mathcal{C}$, denoted $\langle S_t \rangle$, to be the intersection of all languages in $\mathcal{C}$ that are consistent with $S_t$. We observe that $S_t \subseteq \langle S_t \rangle$, and so the closure is non-empty.
Now, the more general point to take away from our example here is that in any step $t$ where $\langle S_t \rangle$ contains an element not in $S_t$, the algorithm can always safely output an element in $\langle S_t \rangle - S_t$ and be sure that it is outputting a new element in the true language $K$. This is simply because $\langle S_t \rangle \subseteq L_i$ for every consistent language $L_i$, by the definition of the closure operation, and in particular this holds for the true language $K$. This is what the algorithm did in our example: once $S_t$ contained the elements $i$ and $j$, then we could conclude that the full arithmetic progression $P_{i,j-i}$ was a subset of $\langle S_t \rangle$ and would be for all future steps $t$, so this provided an infinite set that the algorithm could safely output elements from.
We refer to $\langle S_t \rangle$ as the “closure” of $S_t$ by analogy with other forms of “closure.” For example, the convex hull of a set of points $S$ in the plane follows a similar idea: it is simply the intersection of all convex sets that contain $S$. And although we won’t attempt to define a formal contrast between learning and generation for convex sets here, it is clear informally that the contrast we’ve been discussing in this section has an analogue in this geometric setting as well. For example, suppose an adversary is thinking of a hidden convex set $X$, and it shows a finite set of examples $S \subseteq X$ to an algorithm. From any finite set, the algorithm has no chance of identifying the true convex set that the adversary is thinking of; but the algorithm can correctly generate an infinite sequence of new points from $X$ by simply enumerating points in the convex hull of $S$ that do not already belong to $S$.
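As a small illustration of the closure operation, the sketch below computes the closure of a sample restricted to a finite window of the ground set, for a finite list of languages given as membership predicates. The restriction to a window, the function name `closure_in_window`, and the three example languages are assumptions made so the example is runnable; the closure in the text is taken over the whole ground set and over a possibly infinite collection.

```python
def closure_in_window(languages, sample, window):
    """Intersect all languages consistent with `sample`, restricted to `window`.

    `languages` is a list of membership predicates (one per candidate),
    standing in for the black-box membership queries of the model.
    """
    consistent = [member for member in languages
                  if all(member(w) for w in sample)]
    if not consistent:
        # Cannot happen in the model: the true language is always consistent.
        raise ValueError("no candidate language contains the sample")
    return {w for w in window if all(member(w) for member in consistent)}


# Hypothetical collection: multiples of 2, of 3, and of 5.
languages = [lambda w: w % 2 == 0, lambda w: w % 3 == 0, lambda w: w % 5 == 0]
sample = {6, 12}
closure = closure_in_window(languages, sample, range(1, 31))
safe_outputs = sorted(closure - sample)
```

In this toy instance the consistent candidates are the multiples of 2 and the multiples of 3, so every element of `safe_outputs` is a multiple of 6 and therefore belongs to the true language whichever of the two it is.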
The insufficiency of closure.
Closure is a first useful idea in designing an algorithm that can generate in the limit, and we will see in Section 6 that for the case of finite collections $\mathcal{C}$ it is the main idea that we need. Unfortunately, it is insufficient on its own to achieve generation in the limit for arbitrary countable collections $\mathcal{C}$, for the simple reason that $\langle S_t \rangle - S_t$ can be empty in general, providing the algorithm with no guidance on which element to generate next.
To see how this can happen in a simple example, consider the following slightly more complicated collection of languages $\mathcal{C}$. $U$ is still the integers, and now for every arithmetic progression $P_{a,b}$ and every finite set of integers $V$, we define the language $L(a,b,V) = P_{a,b} \cup V$. Our collection $\mathcal{C}$ consists of every $L(a,b,V)$ for an arbitrary integer $a$, an arbitrary positive integer $b$, and an arbitrary finite set of integers $V$. (Think of $L(a,b,V)$ as a copy of the arithmetic progression $P_{a,b}$ that has been obscured by an arbitrarily large finite set $V$ so that its structure is harder to discern.) We could also include the languages $Q_{a,b} \cup V$ in $\mathcal{C}$ for every integer $a$, positive integer $b$, and finite set $V$, but this won’t be crucial for our purposes in this example.
Now, suppose the adversary is enumerating a language $K = L(a,b,V) \in \mathcal{C}$, and consider the set of samples $S_t$ after $t$ steps. We claim that $\langle S_t \rangle = S_t$. Intuitively, this is because $S_t$ might have come completely from the finite set $V$ that is part of $L(a,b,V)$; more formally, there cannot be any element $j \in \langle S_t \rangle - S_t$ because $L(j+1, 1, S_t) = P_{j+1,1} \cup S_t$ is consistent with $S_t$, and it does not contain $j$.
This example illustrates the sense in which we mean that closure by itself is insufficient to provide a strategy for generation in the limit; in this instance, the algorithm has to repeatedly output elements outside the closure of $S_t$ and yet somehow eventually generate from $K$. To see how it does so in this current example, suppose that in any step $t$, the algorithm finds the two largest elements $m_t < m'_t$ in $S_t$, and with $d_t = m'_t - m_t$, it outputs $m'_t + d_t$. The point is that if the true language $K$ is $P_{a,b} \cup V$, there will come a time when the adversary has enumerated every element of $V$, and within a finite number of steps after this, the two largest elements of $S_t$ will have to come from $P_{a,b}$. From this step onward, $d_t$ is a multiple of $b$, so the algorithm’s strategy of outputting $m'_t + d_t$ is guaranteed to produce an element in $K - S_t$. This particular strategy of course seems highly specific to the particular example we are discussing here, but at a very high level it does contain ideas that are more easily discernible in retrospect from the general solution, which we describe in the next two sections.
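The “two largest elements” strategy can be simulated directly. The sketch below is an illustration under assumed parameters: the target is the progression starting at 5 with common difference 3, obscured by the finite set {-40, 100, 103}, and the adversary is assumed to enumerate the finite set first and then the progression in increasing order. The names `generate_two_largest` and `in_target` are hypothetical.

```python
def generate_two_largest(sample):
    """Output m' + (m' - m), where m < m' are the two largest sample elements."""
    distinct = sorted(set(sample))
    if len(distinct) < 2:
        return None
    m, m_prime = distinct[-2], distinct[-1]
    return m_prime + (m_prime - m)


# Assumed target language: {5, 8, 11, ...} together with V = {-40, 100, 103}.
A, B = 5, 3
V = {-40, 100, 103}


def in_target(w):
    """Referee-side membership test; the generator never calls it."""
    return w in V or (w >= A and (w - A) % B == 0)


# Assumed enumeration order: the finite set first, then the progression.
enumeration = [100, 103, -40] + [A + B * k for k in range(40)]
outputs = [generate_two_largest(enumeration[:t])
           for t in range(1, len(enumeration) + 1)]
valid = [g is not None and in_target(g) for g in outputs]
```

In this run the early outputs (such as 106, computed from 100 and 103) fall outside the target, and the outputs become valid only once the two largest sample elements both come from the progression above 103. Each output exceeds the current maximum of the sample, so it is automatically unseen.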
3.3 The Failure of Direct Hypothesis Enumeration
We now discuss a final point that is useful to highlight before turning to the main proof. In thinking about approaches to generation in the limit, there is a natural strategy that at first seems to solve the problem directly, but in fact does not work. Its failure is useful to discuss, since it motivates the more involved solution that follows.
The strategy is to move through the list of languages $L_1, L_2, L_3, \ldots$ in order, treating each language $L_i$ as a hypothesis for $K$ until the sample $S_t$ proves otherwise. That is, we start with $L_1$, and we generate strings from $L_1 - S_t$ until we encounter (if ever) a step $t$ in which $S_t \not\subseteq L_1$. At this point we know that $L_1$ cannot be the true language $K$, and so we continue the process with $L_2$. The nice idea that underpins this strategy is that the true language $K$ is equal to $L_z$ for some index $z$. So if our process were to reach $L_z$ at some step $t^*$, it would never move on from $L_z$, and so we would be generating from $L_z - S_t$ for all $t \geq t^*$. (Since $\mathcal{C}$ can contain repetitions, $K$ might appear several times, but we can take $L_z$ as the first appearance.)
Unfortunately, there is a deep problem with this approach: there may be a language $L_i \in \mathcal{C}$ with the property that $L_i$ comes before $L_z$ and $L_i$ properly contains $L_z$ (that is, $i < z$, and $L_z \subsetneq L_i$). In this case, our procedure would stop at the first such $L_i$ forever: since it is only ever shown samples in $S_t$ that come from the language $L_z$, and since $L_z \subseteq L_i$, it would never encounter a string in $S_t$ that didn’t belong to $L_i$, and so it would never move on to $L_{i+1}$. And when this procedure generated from $L_i - S_t$, there is no guarantee that it would choose strings from $L_z$. (Recall that the algorithm is not provided with feedback about whether the string it generates in step $t$ belongs to the target language $K$; its only knowledge of $K$ comes from the sample $S_t$ that the adversary is enumerating.)
This problem is not easily avoided, since if this approach worked as written, it would also solve identification in the limit, which we know is impossible. So we need to extract some of the useful ideas from this failed approach (in particular, the point that $K$ appears at some finite index in the list $\mathcal{C}$, as the language $L_z$) but add important further ideas as well. Specifically, if the algorithm is maintaining hypotheses for the true language $K$ over time, it can provably never know whether its current hypothesis is correct; instead, it must be always moving further down the collection of languages, potentially considering languages that are not $K$, but in such a way that it is eventually always generating from $K - S_t$. This is what our proof beginning in the next section will have to accomplish.
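The failure mode is easy to reproduce. The following sketch implements the direct enumeration strategy for languages given as membership predicates over the natural numbers; the two-language collection (all naturals listed first, then the even naturals as the true language) and the names `first_consistent_index` and `naive_generate` are assumptions chosen for illustration.

```python
from itertools import count


def first_consistent_index(languages, sample):
    """Index of the first listed language that contains every sample element."""
    for i, member in enumerate(languages):
        if all(member(w) for w in sample):
            return i
    return None


def naive_generate(languages, sample):
    """Generate the smallest unseen natural number from the first consistent
    language. This is the flawed strategy, shown only to exhibit its failure."""
    i = first_consistent_index(languages, sample)
    if i is None:
        raise ValueError("no candidate language contains the sample")
    seen = set(sample)
    for w in count(0):  # terminates because every language is infinite
        if languages[i](w) and w not in seen:
            return w


# Hypothetical collection: a superset of the target is listed before it.
all_naturals = lambda w: True
evens = lambda w: w % 2 == 0        # the true language in this example
languages = [all_naturals, evens]

stuck_guess = naive_generate(languages, [0, 2, 4])
lucky_guess = naive_generate([evens, all_naturals], [0, 2, 4])
```

With the superset listed first, no sample from the even numbers ever rules it out, so the strategy keeps producing outputs such as `stuck_guess`, which is odd. Reversing the order of the list happens to work here, but the algorithm has no way of knowing which order it was given.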
4 Generation in the Limit via a Function
We prove our main result in two parts. We first give a method for language generation in the limit that is not concerned with the computational power required by the agent performing the generation. Thus, rather than an algorithm to generate the string, we ask whether we can construct a function $f_{\mathcal{C}}$ based on the given language collection that maps a finite set of strings to a new string; this function takes the strings $S_t$ seen so far and outputs a string $f_{\mathcal{C}}(S_t)$ intended to be in $K - S_t$. We will prove the following:
(4.1)
For every countable collection of languages $\mathcal{C}$, there is a function $f_{\mathcal{C}}$ from finite subsets of $U$ to elements of $U$, such that for every enumeration of a language $K \in \mathcal{C}$, there is a $t^*$ such that for all $t \geq t^*$, we have $f_{\mathcal{C}}(S_t) \in K - S_t$.
Note that while this positive result is not concerned with the computational power required to evaluate $f_{\mathcal{C}}$, it already contains the core contrast with language identification, which remains impossible even if we simply ask for a function $f_{\mathcal{C}}$, by the same argument given in Section 3.1. In the next section, we will then go on to prove (2.1) by using an algorithm that only performs standard computational steps and membership queries of the form “$w \in L_i$?”
Minimal and critical languages.
As before, we will suppose $z$ is an index such that $L_z = K$. We say that a language $L_i$ is consistent with the sample at step $t$ if $S_t \subseteq L_i$. An important idea, which is implicit in our discussion of the failed approach at the end of Section 3, is that if $L_i \subseteq L_j$ are both consistent with $S_t$, then it is safer for an algorithm to generate from $L_i$ than from $L_j$: if $w \in L_i - S_t$ then we must also have $w \in L_j - S_t$. This suggests that it would be useful to find consistent languages that are minimal with respect to inclusion: we say that $L$ is minimal if $L \in \mathcal{C}$ is consistent with $S_t$, and there is no $L' \neq L$ such that $L'$ is consistent with $S_t$ and $L' \subseteq L$. Unfortunately, this is too much to ask for, since there exist instances of the problem for which there might not be any languages that are minimal with respect to inclusion. (In a finite collection of languages there would need to be a minimal language, but it is easy to construct infinite collections without one.)
Therefore, we define a related concept that only involves examining the inclusion of a given language with respect to a finite set of other languages. Specifically, we look for languages $L_n$ that are consistent with $S_t$ in a given step $t$, such that $L_n$ is a subset of every consistent language that precedes it in the indexing of $\mathcal{C}$. We will say that such a language is critical at step $t$. To define this formally, we first let $\mathcal{C}_n$ denote the finite collection of languages $\{L_1, L_2, \ldots, L_n\}$. We now have the following definition.
(4.2)
A language $L_n$ is critical at step $t$ if $L_n$ is consistent with $S_t$, and for every language $L_i \in \mathcal{C}_n$ that is consistent with $S_t$, we have $L_n \subseteq L_i$.
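Definition (4.2) can be made concrete with a minimal sketch. The names `critical_indices`, `languages`, and `sample` are hypothetical, and we assume toy languages given as finite Python sets so that subset tests are directly available; the languages in the text are infinite, where such tests correspond to subset queries.

```python
from typing import List, Sequence, Set


def critical_indices(languages: Sequence[Set[int]], sample: Set[int]) -> List[int]:
    """Return the 1-based indices n for which languages[n-1] is critical.

    A language L_n is critical if it contains the sample and is a subset of
    every earlier language L_i (i < n) that also contains the sample.
    """
    critical: List[int] = []
    earlier_consistent: List[Set[int]] = []  # consistent languages seen so far
    for n, lang in enumerate(languages, start=1):
        if not sample <= lang:
            continue  # L_n is not consistent with the sample
        if all(lang <= earlier for earlier in earlier_consistent):
            critical.append(n)
        earlier_consistent.append(lang)
    return critical


if __name__ == "__main__":
    toy = [{1, 2, 3, 4, 5, 6}, {2, 4, 6}, {1, 3, 5}, {2, 4}]
    print(critical_indices(toy, {2}))
```

In this toy collection the third language is inconsistent with the sample, and the remaining three form a chain under inclusion, so all three are critical.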
Finding critical languages.
At any given step $t$, there is at least one language consistent with $S_t$, since the language $L_z = K$ is always consistent with $S_t$. It follows that there is also at least one critical language at any step $t$: for any $t$, the consistent language $L_i$ with the lowest index $i$ must be critical at step $t$, as it is the only consistent language in $\mathcal{C}_i$.
Note that there can be choices of $t$ for which the language $L_z$ is not critical at step $t$. But a crucial fact is that $L_z$ will eventually become critical at some step $t$ and remain critical forever after that; we prove this next.
(4.3)
There exists a time step $t^+$ such that for all $t \geq t^+$, the language $L_z$ is critical at step $t$.
Proof.
Let $A$ be the indices $i < z$ for which $L_z \not\subseteq L_i$. For each $i \in A$, let $v_i$ be an element of $L_z - L_i$. Let $t_i$ be the step in which $v_i$ first appears in the enumeration of $L_z$, and let $t^+ = \max_{i \in A} t_i$.
Now, suppose by way of contradiction that for some $t \geq t^+$, the language $L_z$ is not critical at step $t$. In this case, there must be some $L_i \in \mathcal{C}_z$ such that $L_i$ is consistent with $S_t$ and $L_z \not\subseteq L_i$. But we know that $v_i \in S_t$ and $v_i \notin L_i$, contradicting the consistency of $L_i$ with $S_t$.
There can be multiple critical languages at a given step $t$; for example, if on the step $t^+$ in (4.3) the first consistent language $L_i$ is not equal to $L_z$, then both $L_i$ and $L_z$ will be critical at step $t^+$. [Footnote 3: This is also a useful moment to recall that there can be multiple languages $L_i \in \mathcal{C}$ that are equal to $K$. A direct analogue of the proof of (4.3) shows that for any $z'$ for which $L_{z'} = K$, there is a step $t^+_{z'}$ such that for all $t \geq t^+_{z'}$, the language $L_{z'}$ is critical at step $t$. But these steps $t^+_{z'}$ may be different for different $z'$. In particular, if $z' < z''$ are indices with the property that $L_{z'} = L_{z''} = K$, then we must have $t^+_{z'} \leq t^+_{z''}$, but it is possible that $t^+_{z'} < t^+_{z''}$, simply because at a given time step $t$, there might be a language $L_i$ with $z' < i < z''$ such that $L_i$ is consistent with $S_t$ and $L_{z''} \not\subseteq L_i$. For our purposes in the arguments to follow, it is sufficient to consider the step $t^+_z$ at which a specific copy $L_z$ of the true language $K$ first becomes critical forever; we do not need to worry about whether other copies are critical in this step or not.] Despite the potential multiplicity of critical languages, the collection of all critical languages at step $t$ has a useful structure that follows directly from the definition of criticality.
(4.4)
Let $i < j$, and suppose that $L_i$ and $L_j$ are both critical at step $t$. Then $L_j \subseteq L_i$.
Proof.
$L_i$ belongs to $\mathcal{C}_j$ and is consistent with $S_t$, and $L_j$ is critical at step $t$, so by definition (4.2), we have $L_j \subseteq L_i$.
A function for generation in the limit.
At a given step $t$, suppose that the critical languages are $L_{n_1}, L_{n_2}, L_{n_3}, \ldots$ where $n_1 < n_2 < n_3 < \cdots$. (This list of critical languages might be finite or infinite.) Then (4.4) tells us that this sequence is nested by inclusion: $L_{n_1} \supseteq L_{n_2} \supseteq L_{n_3} \supseteq \cdots$.
By (4.3) we know that the language $L_z$ will eventually appear on this nested list from some step $t^+$ onward, but even then we do not know which index it corresponds to at any given step $t$. Indeed, to recall a point from earlier, the Gold-Angluin results for learning in the limit tell us that we can never know for sure which index corresponds to $K$. But we now arrive at the crucial point, which is that beyond some finite index, all the critical languages are subsets of $L_z$, so it is safe to generate from any of them.
Given this, we are prepared to construct our function $f_{\mathcal{C}}$.
(4.5)
$f_{\mathcal{C}}(S_t)$ is defined as follows. We first identify all languages in $\mathcal{C}_t$ that are critical at step $t$. (If no such languages exist, which can only happen if none of them are consistent with $S_t$, we define $f_{\mathcal{C}}(S_t)$ arbitrarily.) Among these critical languages, let $L_{n_t}$ be the one with the largest index $n_t \leq t$. We define $f_{\mathcal{C}}(S_t)$ to be the lowest-indexed element of $L_{n_t} - S_t$.
To prove our initial claim (4.1), it is sufficient to verify the following property of $f_{\mathcal{C}}$.
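The construction in (4.5) can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the name `f_C` and its parameters are hypothetical, languages are toy finite sets (so the subset tests stand in for subset queries), and returning `None` stands in for the "arbitrary" output when no critical language exists or when a toy language has no unseen element left.

```python
from typing import List, Optional, Sequence, Set, Tuple


def f_C(languages: Sequence[Set[int]], sample: Set[int], t: int) -> Optional[int]:
    """Generate from the highest-indexed critical language among L_1..L_t."""
    consistent: List[Tuple[int, Set[int]]] = []
    critical: List[int] = []
    for n, lang in enumerate(languages[:t], start=1):
        if sample <= lang:
            # Critical: a subset of every earlier consistent language.
            if all(lang <= earlier for _, earlier in consistent):
                critical.append(n)
            consistent.append((n, lang))
    if not critical:
        return None  # stands in for an arbitrary output
    n_t = critical[-1]  # largest critical index
    unseen = languages[n_t - 1] - sample
    return min(unseen) if unseen else None  # lowest-indexed unseen element


if __name__ == "__main__":
    toy = [set(range(1, 21)), set(range(2, 21, 2)), set(range(4, 21, 4))]
    seen: Set[int] = set()
    for step, w in enumerate([4, 8, 12], start=1):
        seen.add(w)
        print(step, f_C(toy, seen, step))
```

With the even numbers as the true language, the first output (1) is a mistake, while later outputs fall inside the true language even when the function generates from a strictly smaller critical language.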
(4.6)
For any language $L_z \in \mathcal{C}$ and any enumeration of $L_z$, there is a $t^*$ such that for all $t \geq t^*$, we have $f_{\mathcal{C}}(S_t) \in L_z - S_t$.
Proof.
In the given enumeration of $L_z$, (4.3) tells us that there will come a step $t^+$ such that for all $t \geq t^+$, the language $L_z$ is critical at step $t$. Let $t^* = \max(z, t^+)$. In every step $t \geq t^*$, our construction of $f_{\mathcal{C}}$ will include $L_z$ among its critical languages in $\mathcal{C}_t$. Therefore, the highest-indexed critical language $L_{n_t} \in \mathcal{C}_t$ satisfies $n_t \geq z$, and so by (4.4) we have $L_{n_t} \subseteq L_z$. Since $f_{\mathcal{C}}(S_t) \in L_{n_t} - S_t$, we have $f_{\mathcal{C}}(S_t) \in L_z - S_t$ as required.
As a final note, we observe that the current formulation of $f_{\mathcal{C}}$ allows it to generate the same string more than once, provided that this string is in $K - S_t$. However, it is not hard to modify $f_{\mathcal{C}}$ so that it generates a different string each time, essentially by defining it so that it generates the lowest-indexed element that it hasn’t already generated.
The computational power required to produce $f_{\mathcal{C}}$.
Our plan was to construct $f_{\mathcal{C}}$ without worrying about the computational power required to do so (and recalling that for comparison, in the corresponding problem of identification in the limit, no function achieving identification could exist regardless of the computational power required to produce it). Now that we’ve constructed an appropriate $f_{\mathcal{C}}$, we can ask what was in fact required computationally.
In addition to standard computational steps and membership queries of the form “$w \in L_i$?”, the definition of $f_{\mathcal{C}}$ requires that we identify the critical languages in $\mathcal{C}_t$. From the definition, we can do this provided we can answer a finite number of subset queries of the form “$L_i \subseteq L_j$?”. So an algorithm augmented with the power to perform such subset queries can perform generation in the limit.
In the next section, we will show how to remove the necessity for subset queries, so that generation in the limit can be performed by an algorithm using only standard computational steps and membership queries.
5 Generation in the Limit via an Algorithm
We now prove (2.1) by giving an algorithm that generates in the limit for any countable collection of languages $\mathcal{C}$, using only standard computational steps and membership queries of the form “$w \in L_i$?”
The set of possible strings $U$ can be written as $U = \{u_1, u_2, u_3, \ldots\}$, and for simplicity we will sometimes use the language of the positive integers to describe $U$, treating $u_i$ as the number $i$. In an enumeration of the true language $K$, let the sequence of strings that are enumerated step by step be denoted $w_1, w_2, w_3, \ldots$.
Extending definitions to finite subsets of languages .
The notion of a critical language was crucial to our approach in the previous section, and since the direct approach to verifying whether a language is critical involves subset queries, an important part of designing an algorithm that avoids subset queries is to work with finite subsets of the languages in $\mathcal{C}$. Thus, for a language $L_i \in \mathcal{C}$ and a number $m$, we will use $L_i[m]$ to denote the finite set $L_i \cap \{1, 2, \ldots, m\}$. Note that deciding whether $L_i[m] \subseteq L_j[m]$ for a fixed value of $m$ requires only that we make at most $2m$ membership queries: we simply ask whether $n \in L_i$ implies $n \in L_j$ for all $n \in \{1, 2, \ldots, m\}$. This allows us to replace the definition of critical languages from (4.2) with a variation tailored to finite sets, and to be able to verify that this finite version is satisfied using only membership queries. By gradually expanding the value of $m$ over the steps $t$ of the algorithm, we will eventually get to large enough values of $t$ and $m$ for which this finite version of the definition is sufficient for generation.
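To illustrate how a subset test on finite prefixes reduces to membership queries, here is a minimal sketch. We assume, purely for illustration, that each language is exposed as a membership predicate on positive integers; the names `Language`, `prefix`, and `prefix_subset` are hypothetical.

```python
from typing import Callable, Set

# A language is modelled as a membership oracle on positive integers.
Language = Callable[[int], bool]


def prefix(lang: Language, m: int) -> Set[int]:
    """Return the finite prefix of lang restricted to {1, ..., m}."""
    return {i for i in range(1, m + 1) if lang(i)}


def prefix_subset(a: Language, b: Language, m: int) -> bool:
    """Decide whether the m-prefix of a is contained in the m-prefix of b.

    Uses at most 2m membership queries: one to a and (when needed) one to b
    for each integer in {1, ..., m}.
    """
    return all(b(i) for i in range(1, m + 1) if a(i))


if __name__ == "__main__":
    evens: Language = lambda i: i % 2 == 0
    fours: Language = lambda i: i % 4 == 0
    print(prefix(fours, 10), prefix_subset(evens, fours, 1), prefix_subset(evens, fours, 2))
```

The last two calls show why the prefix size has to grow: with a prefix of size 1 the even numbers look like a subset of the multiples of 4, and the counterexample 2 only becomes visible at size 2.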
Our extension of definition (4.2) for critical languages to finite sets is as follows.
(5.1)
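Let $t$ and $m$ be positive integers. A language $L_n$ is $(t,m)$-critical if $L_n$ is consistent with $S_t$, and for every language $L_i \in \mathcal{C}_n$ such that $L_i$ is consistent with $S_t$, we have $L_n[m] \subseteq L_i[m]$.
Since $L_z \subseteq L_i$ implies $L_z[m] \subseteq L_i[m]$ for any $m \geq 1$, we have the following analogue of (4.3).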
(5.2)
There exists a time step $t^+$ such that for all $t \geq t^+$ and all $m \geq 1$, the language $L_z$ is $(t,m)$-critical.
The analogue of (4.4) also still holds with this definition, using the same proof.
(5.3)
Let $i < j$ and suppose that $L_i$ and $L_j$ are both $(t,m)$-critical. Then $L_j[m] \subseteq L_i[m]$.
Finally, there is a basic monotonicity property of $(t,m)$-criticality that is useful to write down.
(5.4)
Suppose that $L_n$ is $(t,m)$-critical, and $m' < m$. Then $L_n$ is $(t,m')$-critical.
Proof.
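Since $L_n$ is $(t,m)$-critical, we know it is consistent with $S_t$, and that $L_n[m] \subseteq L_i[m]$ for all languages $L_i \in \mathcal{C}_n$ such that $L_i$ is consistent with $S_t$. Now, if $L_i$ is a language in $\mathcal{C}_n$ that is consistent with $S_t$, then since $L_n[m] \subseteq L_i[m]$ and $m' < m$, we have $L_n[m'] \subseteq L_i[m']$. It follows that $L_n[m'] \subseteq L_i[m']$ for all $L_i \in \mathcal{C}_n$ such that $L_i$ is consistent with $S_t$, and so $L_n$ is $(t,m')$-critical.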
5.1 An algorithm for generation in the limit
We now describe an algorithm for generation in the limit. As before, $S_t$ is the subset of $K$ enumerated through step $t$, treating the $u_i$ as integers. We will consider the languages in $\mathcal{C}_t$ in step $t$, and maintain an auxiliary variable $m_t$, roughly corresponding to how large a prefix $L_i[m_t]$ we consider from each language $L_i \in \mathcal{C}_t$.
At the start of step $t$, we set $m_t^{(0)} = \max(m_{t-1}, w_t)$; note that by induction this implies $m_t^{(0)} \geq \max_{t' \leq t} w_{t'}$. (At the end of step $t$, we will define $m_t$ to be a number that is at least as large as $m_t^{(0)}$, via the process described below.) We then determine which $L_i \in \mathcal{C}_t$ are consistent with $S_t$; note that by the definition of $m_t^{(0)}$, it is sufficient to perform membership queries for only the finite set of elements in $L_i[m_t^{(0)}]$ in order to do this. If there are no consistent languages in $\mathcal{C}_t$, then we output a string arbitrarily.
Otherwise, there is at least one language $L_i \in \mathcal{C}_t$ consistent with $S_t$, and so there is at least one $(t,m)$-critical language for any choice of $m$, since the first consistent language is $(t,m)$-critical for all $m$. Our goal is to imitate the plan from (4.5) and generate a new string from the highest-indexed critical language. But to do this, we have to find a new string, and this will in general require performing additional membership queries.
Generating a string.
For any choice of $m$, let $n_t(m)$ be the maximum index of a $(t,m)$-critical language from $\mathcal{C}_t$; as noted above, $n_t(m)$ is well-defined since we are in the case where at least one language in $\mathcal{C}_t$ is consistent with $S_t$, and so the first consistent language is $(t,m)$-critical for all $m$. We now search for a string to generate as follows.
We maintain a counter $m$ that begins at the initial value $m_t^{(0)}$ set at the start of step $t$ and gets incremented iteratively, with each iteration doing the following:
(i) Increment $m$ by 1.
(ii) Perform membership queries to determine $L_i[m]$ for each $L_i \in \mathcal{C}_t$. Note that since $m \geq m_t^{(0)}$, the determination of which languages in $\mathcal{C}_t$ are consistent with $S_t$ does not change (relative to the initial iteration) when we do this.
(iii) Determine which languages are $(t,m)$-critical, and from this determine $n_t(m)$. Note that this only requires consulting the results of membership queries already performed; also, since we are working with a value of $t$ for which $\mathcal{C}_t$ contains at least one consistent language, the value of $n_t(m)$ is well-defined.
(iv) If there is any string $u_i$ for $i \leq m$ such that $u_i \in L_{n_t(m)} - S_t$, then choose the minimum $i$ with this property; output the string $u_i$ and define $m_t = m$. If there is no such $u_i$, then continue to the next iteration.
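The iteration above can be sketched in code as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: strings are identified with positive integers, each language is a hypothetical membership predicate, the names `generation_step` and `run` are ours, the arbitrary output is fixed to 1, and membership queries are recomputed in each iteration rather than cached.

```python
from typing import Callable, List, Sequence, Set, Tuple

Language = Callable[[int], bool]  # membership oracle on positive integers


def generation_step(languages: Sequence[Language], sample: Set[int],
                    t: int, m_start: int) -> Tuple[int, int]:
    """One step t: return (output string, final prefix size m_t)."""
    # Indices of languages among L_1..L_t that contain every sample string.
    consistent = [n for n, lang in enumerate(languages[:t], start=1)
                  if all(lang(w) for w in sample)]
    if not consistent:
        return 1, m_start  # arbitrary output
    m = m_start
    while True:
        m += 1  # (i) grow the prefix
        # (ii)+(iii) find the (t, m)-critical languages via membership queries:
        # L_n is kept if its m-prefix lies inside every earlier consistent L_i.
        critical = [
            n for pos, n in enumerate(consistent)
            if all(languages[i - 1](u)
                   for i in consistent[:pos]
                   for u in range(1, m + 1) if languages[n - 1](u))
        ]
        top = languages[critical[-1] - 1]  # highest-indexed critical language
        # (iv) output the first unseen string of the top language, if any.
        for u in range(1, m + 1):
            if top(u) and u not in sample:
                return u, m


def run(languages: Sequence[Language], enumeration: Sequence[int]) -> List[int]:
    """Feed an enumeration to the algorithm and collect its outputs."""
    sample: Set[int] = set()
    m, outputs = 0, []
    for t, w in enumerate(enumeration, start=1):
        sample.add(w)
        out, m = generation_step(languages, sample, t, max(m, w))
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    collection = [lambda i: True, lambda i: i % 2 == 0, lambda i: i % 4 == 0]
    print(run(collection, [4, 8, 12, 2]))
```

In the usage example the true language is the even numbers. The first output is a mistake, and at the third step the algorithm generates from the multiples of 4, which is a strictly smaller language but still yields a valid unseen string.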
It is useful to give an example of the step-by-step execution of this algorithm, and Figures 1 and 2 do this for a sample input. In the notation of these figures, each language $L_i$ is a vertical column with a cell for each string $u_j$, and an “X” in the cell indicates that $u_j \in L_i$. Each vertical column only goes up to the height $m_t$ in step $t$, indicating that by the end of step $t$, the algorithm has only considered finite prefixes of the form $L_i[m_t]$.
[Figure 1: Step-by-step execution of the algorithm on a sample input.]
[Figure 2: Step-by-step execution of the algorithm on a sample input (continued).]
Analyzing the algorithm.
As written, it is not immediately clear that the algorithm’s iterations in step $t$ will come to an end with a string $u_i$, rather than running forever. We prove this now.
(5.5)
In step $t$, the algorithm outputs a string after a finite sequence of operations.
Proof.
We identify each iteration with the value of $m$ after the initial increment of the iteration; so the iterations begin at $m_t^{(0)} + 1$ and continue upward from there. Suppose by way of contradiction that the algorithm performs an infinite sequence of iterations.
Let us call an iteration $m$ disruptive if $n_t(m) \neq n_t(m-1)$. Since $n_t(m-1)$ is the maximum index of a $(t,m-1)$-critical language, and since our monotonicity property (5.4) implies that $L_{n_t(m)}$ is also $(t,m-1)$-critical, it follows that $n_t(m) < n_t(m-1)$. Since $n_t(m)$ starts at a value upper-bounded by $t$ and decreases by at least one with every disruptive iteration, there can be at most $t - 1$ disruptive iterations.
The sequence of iterations must therefore contain a last disruptive iteration $m^*$. For all iterations $m \geq m^*$, the language $L_{n_t(m)}$ does not change. If there is an index $i \leq m^*$ for which $u_i \in L_{n_t(m^*)} - S_t$, then the algorithm terminates in iteration $m^*$ with the first such $u_i$. Otherwise, since the language $L_{n_t(m^*)}$ is infinite, we must eventually reach an iteration $m > m^*$ for which $u_m \in L_{n_t(m^*)} - S_t$, and the algorithm will stop and output $u_m$ at this point.
Given that (5.5) establishes that the algorithm outputs a string in step $t$, it is useful to record an additional property of the algorithm that follows directly from its construction.
(5.6)
In step $t$, if at least one language in $\mathcal{C}_t$ is consistent with $S_t$, then there is an $m_t$ and an $n_t$ such that the algorithm outputs a string from $L_{n_t}[m_t] - S_t$, where $L_{n_t}$ is the $(t,m_t)$-critical language with maximum index in $\mathcal{C}_t$.
(5.7)
For any language $L_z \in \mathcal{C}$ and any enumeration of $L_z$, there is a $t^*$ such that for all $t \geq t^*$, the algorithm generates a string in $L_z - S_t$.
Proof.
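In the given enumeration of $L_z$, (5.2) tells us that there is a $t^+$ such that for all $t \geq t^+$ and all $m \geq 1$, the language $L_z$ is $(t,m)$-critical. Let $t^* = \max(z, t^+)$. In every step $t \geq t^*$, by (5.6) there is an $m_t$ such that the algorithm generates a string from $L_{n_t}[m_t] - S_t$, where $L_{n_t}$ is the $(t,m_t)$-critical language with maximum index in $\mathcal{C}_t$. In each such step $t$, $L_z$ is a $(t,m_t)$-critical language in $\mathcal{C}_t$, and so $n_t \geq z$. From (5.3), it follows that $L_{n_t}[m_t] \subseteq L_z[m_t]$. Since the algorithm’s output comes from $L_{n_t}[m_t] - S_t$, it follows that it comes from $L_z - S_t$ as well.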
As in the discussion at the end of Section 4, it is straightforward to modify the algorithm so that it generates strings without repetition.
6 Generation for Finite Collections of Languages
We now turn to our second main result, (2.2), which derives a stronger conclusion for finite collections of languages.
The finite case illustrates the power of the closure operator described in Section 3; this turns out to be sufficient to obtain the result. To review the definition from Section 3, for a sequence of strings from a language in , we define the closure of in , denoted , to be the intersection of all languages in that are consistent with . If there is a string in , then it is always safe to generate such a string; by definition, it must be an unseen string from the true language . The challenge in the previous sections was that there are simple instances with infinite collections for which . But for finite collections , we will be able to make much more progress using the closure operator.
Informal version of the argument.
We start by giving the basic idea behind the proof, and then the proof itself. Let us write the finite collection of candidate languages as , and suppose that after the adversary has enumerated a set of strings, the languages consistent with are . Note that the true language must be one of these languages. Now, the closure is equal to the mutual intersection , and there are two possibilities: either is infinite, or it is finite. If it is infinite, then the algorithm can safely generate all of the strings in , and thus achieve the goal specified by (2.2). On the other hand, if is finite, then it has size equal to some natural number ; in this case, after the adversary enumerates at most more distinct strings, the algorithm will learn that at least one of is no longer consistent. We will then have a set of at most consistent languages, and we can iterate this argument at most more times until (i) there is only a single consistent language, which must be , or (ii) more generally, the set of all consistent languages has a mutual intersection that is infinite, in which case the algorithm can safely generate from this infinite set.
This argument conveys the key point underlying the proof, that as the adversary enumerates strings from , it cannot prevent itself from reaching a point where the set of strings it has enumerated has an infinite closure. To turn this into an argument that produces a uniform bound on how many strings are needed before the closure must become infinite, we replace the iterative argument in the previous paragraph with one that is shorter and more direct. Specifically, consider all sub-collections of languages from (where we think of a sub-collection as any way of choosing some of the languages from but not others). Note that since is finite, there are only finitely many possible sub-collections of languages from . For each sub-collection , the mutual intersection of the languages in is either finite or infinite. Consider the sub-collections that have a finite mutual intersection, and let be the maximum size of such a mutual intersection. Now, suppose the adversary produces a set of distinct strings from . If we consider the sub-collection of all languages in that are consistent with , its mutual intersection must contain and therefore it has cardinality greater than . By the definition of , this means that its cardinality must be infinite. So the closure is infinite, and therefore the algorithm can safely generate all the strings in .
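As a small worked illustration of this bound, consider the following hypothetical two-language collection over the positive integers. The languages $L_1$, $L_2$ and the symbol $r$ for the largest finite mutual intersection are assumptions chosen for the example.

```latex
% Hypothetical collection of two languages over the positive integers.
\[
L_1 = \{1, 2, 3\} \cup \{10, 20, 30, \ldots\}, \qquad
L_2 = \{1, 2, 3\} \cup \{15, 25, 35, \ldots\}.
\]
% Non-empty sub-collections and their mutual intersections.
\[
\{L_1\} \mapsto L_1 \ (\text{infinite}), \qquad
\{L_2\} \mapsto L_2 \ (\text{infinite}), \qquad
\{L_1, L_2\} \mapsto L_1 \cap L_2 = \{1, 2, 3\} \ (\text{finite, size } 3).
\]
% The largest finite mutual intersection has size r = 3.
\[
r = 3, \qquad r + 1 = 4 .
\]
```

Any 4 distinct strings enumerated from the true language must include a string outside $\{1, 2, 3\}$, which rules out the other language; the closure of the sample is then the true language itself, which is infinite.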
The proof.
The argument above, in informal terms, is the complete proof of (2.2). We now formalize this argument in the remainder of the section.
Proof of (2.2). We begin with some additional definitions. For any subset of the indices , let be the intersection of the languages whose indices are in ; in other words, For any sequence of strings from a language in , let be the set of indices of the languages in that contain ; that is, We observe that the closure operator can be written in terms of this notation, in that .
If is infinite, then the algorithm can generate arbitrary strings from as its output without seeing any sample of strings at all; since for every language , in particular for the true language , and this satisfies the requirements of (2.2).
For the rest of the proof, we therefore suppose is finite. Let be the collection of all sets of indices with finite. Finally, let ; since , we observe that is the maximum of a finite set of positive integers, and hence a positive integer.
We now define and claim that this choice of satisfies the required guarantee of (2.2). Indeed, consider the true language and any sequence of distinct elements from . Recall that denotes the set of indices of all languages in that contain . We have . If were finite, then by the definition of , the cardinality of would be at most . But this would contradict the fact that contains , which has cardinality .
Therefore is infinite, and it is a subset of the true language . To conclude the proof, we therefore need only show that there is an algorithm that can enumerate all of using only membership queries. To do this, the algorithm begins by querying whether each belongs to each . From this, it can determine the set of indices of languages that contain . Now, it enumerates every string in ascending order, skipping the elements of . For each such string , it queries whether for each , and it outputs if it belongs to each of these languages. In this way, the algorithm enumerates the infinite set after seeing a sample of strings in .
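The enumeration procedure at the end of the proof can be sketched as a generator. We assume, for illustration only, that strings are identified with positive integers and that each language is a membership predicate; the name `closure_stream` is hypothetical. The generator only keeps producing output when the closure is infinite, which the proof guarantees once the sample is large enough, and it assumes the sample comes from at least one language in the collection.

```python
from itertools import count, islice
from typing import Callable, Iterator, Sequence, Set

Language = Callable[[int], bool]  # membership oracle on positive integers


def closure_stream(languages: Sequence[Language], sample: Set[int]) -> Iterator[int]:
    """Yield, in ascending order, unseen strings lying in every consistent language."""
    # Membership queries on the sample identify the consistent languages.
    consistent = [lang for lang in languages if all(lang(w) for w in sample)]
    for u in count(1):
        # Skip sample strings; keep u only if all consistent languages contain it.
        if u not in sample and all(lang(u) for lang in consistent):
            yield u


if __name__ == "__main__":
    collection = [lambda i: i % 2 == 0, lambda i: i % 3 == 0, lambda i: i % 4 == 0]
    print(list(islice(closure_stream(collection, {6}), 3)))
```

With the sample {6}, the multiples of 4 are ruled out, and the generator enumerates the unseen common elements of the even numbers and the multiples of 3.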
7 Extensions and Generalizations
As discussed at the end of Section 2, a natural direction for generalization is to consider whether we can preserve the general structure of the model while adding in a notion of prompting — an idea familiar from language generation systems in practice, where the algorithm is provided with a prompt string and it must complete it to a valid output.
To explore how we might add prompts to our model, let’s first recall that there is a countable collection of languages $\mathcal{C} = \{L_1, L_2, L_3, \ldots\}$, the adversary chooses a true language $K$ from $\mathcal{C}$, and it begins enumerating the strings of $K$ one by one, over a sequence of steps $t = 1, 2, 3, \ldots$. We continue to use $S_t$ to denote the set of strings enumerated by the adversary up through step $t$, and $z$ for an index such that $K = L_z$.
A model of prompting.
The new feature of the problem in our generalization is that in every step , the adversary provides the algorithm with two things: a string from the true language , and a string that serves as a prompt. (The adversary is allowed to use the same prompt in multiple steps.) The algorithm in step must then produce a string with the goal that the concatenation of and is a string belonging to , where ; that is, it should be an unseen string from . In what follows, we will use to denote the concatenation of and .
We observe that it leads to an equivalent problem whether we ask the algorithm to output so that , or whether we ask the algorithm to output the full concatenated string . In this latter formulation, we can phrase the algorithm’s task as follows: given a prompt , output a string with the properties that is a prefix of , and . Because it makes the exposition slightly simpler, we will take this latter formulation as our default version of the problem (that the algorithm is supposed to output the full string , with as a prefix of ), but we will refer to both versions in our discussion.
To establish a positive result for prompted generation in the limit, we need to impose some type of restriction on the prompts the adversary can provide. For example, if the adversary were simply to provide an arbitrary string and ask the algorithm if there exists a string and a language for which , this is not a problem that could be solved by an algorithm that must terminate with a yes/no answer and whose only access to comes in the form of membership queries of the form “Is ?” So as a minimal restriction on the adversary, we can at least require that its prompt at step must have the property that there exists a string for which . This weak assumption raises interesting open questions that we will consider later in this section. But first, we will establish a positive result with a stronger restriction on the adversary, as follows. We say that a prompt is robust if for all languages , there exist arbitrarily long strings for which . We will start by considering adversaries that only provide robust prompts.
We say that the algorithm achieves prompted generation from in the limit if there is some such that for all steps , the algorithm’s output has the property that is a prefix of and . We now prove the following.
(7.1)
There is an algorithm with the property that for any countable collection of languages , and any enumeration of one of these languages accompanied by a sequence of robust prompts, the algorithm achieves prompted generation from in the limit.
We make two initial observations about this result. First, (7.1) is a strict generalization of our first main result (2.1), since if the adversary always provides the empty string as its prompt , then the problem of finding continuations for which is simply the problem of finding strings in , as in the original definition of generation in the limit. Moreover, the empty string is a robust prompt, since each of the languages is infinite, and so there are arbitrarily long continuation strings that belong to when concatenated to the empty string.
Second, we observe that there is no requirement that the algorithm has ever seen a string beginning with the prefix among the adversary’s examples before the first step in which the adversary provides . An important point to notice about (7.1) is that the algorithm can achieve prompted generation in the limit despite this challenge.
7.1 A First Result for Prompted Generation
We now describe how to prove our result (7.1). The proof is a direct adaptation of the proof of (2.1) from Section 5; as we will see, the structure of critical languages built up there is sufficiently strong that not much more is needed to handle the prompted version of the problem with robust prompts.
As in Section 5, we will work with a specific enumeration of all strings in , and work with finite subsets of the languages , defined via the notation . The algorithm for prompted generation will closely follow the algorithm from Section 5, in that in every step , it will increment a counter and maintain knowledge of the maximum index of a -critical language from . Maintaining knowledge of does not require knowledge of the prompts, and so this part of the algorithm is the same as before. What changes is the stopping condition for the algorithm in step : rather than continue increasing until any valid output is found — that is, until — the algorithm must increase potentially even further, until it finds a string for which is a prefix of , and . However, since is a robust prompt, the algorithm is guaranteed to eventually find such a string, and so we can be sure that its iterations in step will terminate. If we let be the value of at the end of step , then once is large enough, we know that , where is the true language, and so the string that it outputs has as a prefix and belongs to as required.
Detailed analysis.
The discussion above provides the entire set of modifications to the algorithm; for completeness we now describe these in more detail, together with a proof of correctness.
First, the facts (5.2) through (5.4) still hold in the prompted case, since they are structural properties of the language that are not affected by the adversary’s use of prompts. The algorithm for generating an output string uses an iteration in step for which parts (i), (ii), and (iii) of each iteration are the same as in Section 5. Step (iv) of each iteration is replaced by
(iv) If there is any string for such that has as a prefix and , then choose the minimum with this property; output the string and define . If there is no such , then continue to the next iteration.
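The modified step (iv) can be sketched as follows for robust prompts. This is an illustration under assumptions rather than the authors' code: strings over the hypothetical alphabet "ab" are enumerated in length-then-lexicographic order by `nth_string`, languages are membership predicates on strings, the sample is kept as a set of indices into that enumeration, and the names `prompted_step`, `m_start`, and `prompt` are ours.

```python
from typing import Callable, Sequence, Set, Tuple

Language = Callable[[str], bool]  # membership oracle on strings


def nth_string(i: int, alphabet: str = "ab") -> str:
    """Return the i-th string (i >= 1) in length-then-lexicographic order."""
    k, chars = len(alphabet), []
    while i > 0:
        i, r = divmod(i - 1, k)
        chars.append(alphabet[r])
    return "".join(reversed(chars))


def prompted_step(languages: Sequence[Language], sample: Set[int], t: int,
                  m_start: int, prompt: str) -> Tuple[str, int]:
    """One step with a robust prompt: return (output string, final prefix size)."""
    consistent = [n for n, lang in enumerate(languages[:t], start=1)
                  if all(lang(nth_string(w)) for w in sample)]
    if not consistent:
        return nth_string(1), m_start  # arbitrary output
    m = m_start
    while True:
        m += 1
        strings = [nth_string(u) for u in range(1, m + 1)]
        # Same (t, m)-criticality test as in the unprompted algorithm.
        critical = [
            n for pos, n in enumerate(consistent)
            if all(languages[i - 1](s) for i in consistent[:pos]
                   for s in strings if languages[n - 1](s))
        ]
        top = languages[critical[-1] - 1]
        # Modified step (iv): the output must also extend the prompt.
        for u, s in enumerate(strings, start=1):
            if u not in sample and s.startswith(prompt) and top(s):
                return s, m


if __name__ == "__main__":
    collection = [lambda s: len(s) > 0, lambda s: len(s) > 0 and len(s) % 2 == 0]
    print(prompted_step(collection, {3}, 2, 3, "b"))
```

The loop terminates here because the prompt "b" has arbitrarily long continuations in both toy languages; for a prompt that is not robust, this sketch may loop forever.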
Now, the proof of termination works as before, by establishing that there are only finitely many disruptive iterations in which the identity of changes; this part does not depend on the structure of prompts but only on the definition of a -critical language, and so it uses (5.2) through (5.4) exactly as before. After the last disruptive iteration, either there is a string with for which is a prefix, or else the algorithm will eventually reach one, since the prompt is robust. It declares this to be its output string. We therefore have
(7.2)
In step , if at least one language in is consistent with , then there is an and an such that the algorithm terminates with a string for which is a prefix of and , where is the -critical language with maximum index in .
Finally, we establish the basic correctness property of the algorithm, from which (7.1) follows directly.
(7.3)
For any language and any enumeration of with robust prompts , there is a such that for all , the algorithm generates a string for which is a prefix of and .
Proof.
In the given enumeration of , (5.2) tells us that there is a such that for all and all , the language is -critical. Let . In every step , by (7.2) there is an such that the algorithm generates a string such that is a prefix of , and , where is the -critical language with maximum index in . In each such step , is a -critical language in , and so . From (5.3), it follows that . Since , it follows that as well.
7.2 Prompted Generation with a More Powerful Adversary
Having proved (7.1), let us go back to the question of what we must require about the adversary’s prompts in order to establish a positive result for prompted generation. Arguably the weakest requirement we might picture placing on the adversary — and therefore, the most power we could give the adversary — would be to require only that each of its prompts has at least one valid continuation at the time that it poses the prompt: that is, its prompt in step must have the property that there is at least one string for which . Let us call such a prompt non-trivial. While requiring non-trivial prompts is a weaker restriction than requiring robust prompts as in Section 7.1, we reiterate the point that even the assumption of robust prompts is sufficient to provide a strict generalization of our first main result (2.1), given that the empty string, when used as a prompt, is robust.
We now establish some results with this weaker restriction on the adversary, though we leave a fully general characterization of the power of this restriction as an open question. In particular, going back to the style of results from Section 4, where we allow algorithms with additional computational power, we will show that there is a positive result for non-trivial prompts via an algorithm that can ask not only membership queries of the form “Is $w \in L_i$?” but is also augmented with one extra form of computational power: if $L_i \in \mathcal{C}$ and $R$ is a regular language (i.e. one that is accepted by a finite automaton), then the algorithm can correctly answer the query “Is $L_i \subseteq R$?”. [Footnote 4: Equivalently, we could say that the algorithm’s added power is to answer questions of the form “Is $L_i \cap R$ empty?” for a regular language $R$, since $R$ is a regular language if and only if its complement $\overline{R}$ is, and $L_i$ is a subset of $R$ if and only if $L_i$ is disjoint from $\overline{R}$.] The fact that such an algorithm can perform prompted generation in the limit for any sequence of non-trivial prompts has an interesting implication: we can establish this same result without any augmentation of the algorithm by additional power (that is, using an algorithm that can only perform standard computational steps) in the case where $\mathcal{C}$ is the set of all context-free languages. This is simply because the problem “Is $L_i \subseteq R$?” is decidable when $L_i$ is context-free and $R$ is regular, and so we don’t need to augment the algorithm with any additional power to answer these queries when $\mathcal{C}$ consists of the set of context-free languages. [Footnote 5: To see why this is decidable, we begin with the point in the previous footnote, that for the algorithm to answer “Is $L_i \subseteq R$?” it can equivalently answer “Is $L_i \cap \overline{R}$ empty?” The intersection of a context-free language with a regular language is context-free, and the question of whether a context-free language is empty is decidable.]
Prompted generation using regular subset queries.
To make the discussion above concrete, we write the added computational power of the algorithm as the following assumption.
(7.4)
Assumption: The algorithm is able to answer subset queries of the form “Is $L_i \subseteq R$?” where $L_i \in \mathcal{C}$ and $R$ is a regular language.
We will call the type of query specified in (7.4) a regular subset query, and again, we note that for some important families of languages — for example the context-free languages — an algorithm does not need to be augmented with anything additional in order to decide regular subset queries.
We will prove the following.
(7.5)
There is an algorithm, augmented with the power to perform regular subset queries of the form in Assumption (7.4), with the property that for any countable collection of languages , and any enumeration of one of these languages accompanied by a sequence of non-trivial prompts, the algorithm achieves prompted generation from in the limit.
As in Subsection 7.1, we prove this by adapting the algorithm from Section 5. However, this time the modifications need to be a bit more extensive. As a first new definition, we call a language -valid if there exists a string such that . Observe that is -valid for all , by our requirement that the adversary has to provide non-trivial prompts.
An algorithm for generation with non-trivial prompts.
We now give the algorithm that proves (7.5). As in Section 5, in step we maintain a prefix size and work with prefixes for . As before, we write the sequence of examples provided so far by the adversary as .
We let be the maximum index of a -critical language in , if such an index exists. We let be the maximum index of a language in that is both -critical and -valid, if such an index exists. We continue to use (5.2), which establishes that there is some time step such that for all , and all , the language is -critical. If we set , then for all , the quantities and are both well-defined for all , since when , there is at least one language in (namely ) that is both -critical for all and also -valid.
In step , the algorithm is looking for a string to output that has as a prefix; we will call this an acceptable output. As a first phase of step , the algorithm determines which languages in are -valid. To do this, it defines to be the language consisting of all strings that have as a prefix, and it then checks whether is empty. Since consists of all acceptable outputs, this check determines whether or not is -valid. Moreover, since is a regular language, this check can be done using the algorithm’s augmented power to determine whether is a subset of a given regular language, or equivalently whether has an empty intersection with a given regular language.
Operating by analogy with Section 5, the algorithm then defines a counter that it initializes to , where is the most recent string in . If fails to contain a language that is both -critical and -valid, then the algorithm can output an arbitrary string in step . (This will not pose a problem, since, as we have observed above, (5.2) establishes that will always contain such languages once is large enough.) Henceforth, let us assume we are at a step for which contains at least one language that is both -critical and -valid, which means that and are well-defined. In this case, the algorithm performs a sequence of iterations to find a string to output, where each iteration consists of the following steps.
- (i) Increment by 1.
- (ii) Perform membership queries to determine for each . Note that since , the determination of which languages in are consistent with does not change (relative to the initial iteration) when we do this.
- (iii) Determine which languages are -critical, and from this determine and . Note that this only requires consulting the results of membership queries already performed, together with the earlier determination of which languages are -valid.
- (iv) If there is any string for such that has as a prefix and , then choose the minimum with this property; output the string and define . If there is no such , then continue to the next iteration.
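The iteration in steps (i) through (iv) can be sketched as follows. This is a simplified illustration under stated assumptions, not a definitive implementation. Languages are modelled as membership predicates. The phase-one validity results are passed in as a list of booleans. The enumeration of all strings is truncated to a finite list. A language is taken to be critical when it is consistent with the sample and its restriction to the strings examined so far is contained in the restriction of every earlier consistent language; this is the Section 5 notion as assumed here, since that definition lies outside this section. The function name `prompted_step` and its parameters are hypothetical.

```python
def prompted_step(languages, valid, sample, prompt, strings, m_start):
    """Run the iterations of one step and return (output, counter).

    languages: membership predicates for the candidates, in index order
    valid:     valid[i] is True if language i is valid for the prompt
    sample:    set of strings shown so far by the adversary
    strings:   finite prefix of a fixed enumeration of all strings (assumption)
    m_start:   initial value of the counter
    Returns (None, counter) if no output is found within `strings`.
    """
    consistent = [all(member(w) for w in sample) for member in languages]
    # Restriction of each language to the first m_start enumerated strings.
    restricted = [{u for u in strings[:m_start] if member(u)} for member in languages]

    for m in range(m_start + 1, len(strings) + 1):
        # (i)-(ii): advance the counter and query membership of the new string.
        newest = strings[m - 1]
        for i, member in enumerate(languages):
            if member(newest):
                restricted[i].add(newest)

        # (iii): highest-index language that is both critical and valid.
        best = None
        for n in range(len(languages)):
            if not consistent[n]:
                continue
            critical = all(restricted[n] <= restricted[i]
                           for i in range(n) if consistent[i])
            if critical and valid[n]:
                best = n
        if best is None:
            # Simplification: stop here; the text permits an arbitrary output
            # when no language is both critical and valid.
            return None, m

        # (iv): earliest enumerated string that extends the prompt, lies in
        # the chosen language, and has not already been shown.
        for u in strings[:m]:
            if u.startswith(prompt) and u in restricted[best] and u not in sample:
                return u, m
    return None, len(strings)


# Hypothetical usage: language 0 is all strings, language 1 is even-length strings.
demo = prompted_step(
    languages=[lambda s: True, lambda s: len(s) % 2 == 0],
    valid=[True, True],
    sample={"aa", "bb"},
    prompt="b",
    strings=["a", "b", "aa", "ab", "ba", "bb"],
    m_start=0,
)
print(demo)  # ('ba', 5)
```

In the usage example both languages are consistent with the sample and the even-length language is the higher-indexed critical one, so the sketch skips `"b"` and outputs `"ba"` once the counter reaches 5.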
The proof that the iterations in step terminate with an output string again follows the general structure of the proof of (5.5). In particular, we say that an iteration is disruptive if . Since (5.4) implies that the indices of -critical languages form a subset of the indices of -critical languages, and since the identities of the -valid languages do not depend on the value of , we have after a disruptive iteration, and therefore there can be at most disruptive iterations. Now, consider the final disruptive iteration . Step (iv) of this iteration checks whether any string among has as a prefix and belongs to , and if so the iterations terminate with the earliest among these strings as output. If none of has this property, then the fact that is -valid implies that there is some with that will have this property. Therefore, by the time iteration is reached, the algorithm will terminate with an acceptable output.
The following claim now establishes (7.5).
(7.6)
For any language and any enumeration of with non-trivial prompts , there is a such that for all , the algorithm generates a string for which is a prefix of and .
Proof.
In the given enumeration of , (5.2) tells us that there is a such that for all and all , the language is -critical. Let . As argued above, is well-defined in every step , and the algorithm generates a string such that is a prefix of and . In each such step , is a language in that is both -critical and -valid, and so . From (5.3), it follows that . Since , it follows that as well.
7.3 Prompted Generation for a Finite Collection of Languages
When the collection of languages is finite, we can ask a final question, which is whether an analogue of our result (2.2) might hold for prompted generation: could there be an algorithm, together with some fixed bound , such that after seeing strings from the true language , the algorithm is able to achieve prompted generation?
There is a short argument that such a result cannot hold. To see why, suppose that is the collection of two languages and , each over the two-letter alphabet , where consists of all strings that begin with and all odd-length strings that begin with ; and consists of all strings that begin with and all even-length strings that begin with . Suppose there were a bound and an algorithm that guaranteed correct prompted generation after seeing distinct samples from the true language . Then an adversary could present distinct strings all beginning with , and then provide the single-letter string as a prompt: if the algorithm outputs an even-length string, then the adversary could declare this to be incorrect because the true language is , and conversely if the algorithm outputs an odd-length string.
This does not prevent the algorithm from achieving prompted generation in the limit, because the adversary must eventually output a string beginning with , after which the algorithm knows whether to respond to prompts that begin with using even-length or odd-length strings. But this guarantee cannot be achieved after any fixed number of samples that must be specified in advance.
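The two-language argument above can be checked mechanically. The sketch below is illustrative only: it assumes the alphabet's two letters are `a` and `b`, with `a` the shared first letter and `b` the letter whose strings are split by parity, and the names `in_first`, `in_second`, `consistent_candidates`, and `adversary_choice` are hypothetical.

```python
def in_first(s):
    """All strings beginning with 'a', plus odd-length strings beginning with 'b'."""
    return s.startswith("a") or (s.startswith("b") and len(s) % 2 == 1)


def in_second(s):
    """All strings beginning with 'a', plus even-length strings beginning with 'b'."""
    return s.startswith("a") or (s.startswith("b") and len(s) % 2 == 0)


def consistent_candidates(sample):
    """Names of the candidate languages that contain every string in the sample."""
    candidates = {"first": in_first, "second": in_second}
    return [name for name, member in candidates.items()
            if all(member(w) for w in sample)]


def adversary_choice(output):
    """Given a reply to the prompt 'b', name a true language that excludes it."""
    if not output.startswith("b"):
        raise ValueError("an acceptable output must have the prompt 'b' as a prefix")
    # Odd-length replies lie in the first language, so declare the second, and vice versa.
    return "second" if in_first(output) else "first"


# Any number of samples beginning with 'a' leaves both candidates consistent.
a_only_sample = ["a" * k for k in range(1, 6)]
```

With only `a`-initial samples, `consistent_candidates` returns both names however long the sample is, and `adversary_choice` always finds a language that excludes the reply. Once a single `b`-initial string appears in the sample, only one candidate remains, which matches the in-the-limit guarantee.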
8 Concluding Remarks
Our results suggest that generating from a language based on observed samples is a fundamentally different, more tractable problem than identifying the language from observed samples. It is so tractable, in fact, that it can be accomplished provided only that we are told the samples come from a language in a known countable collection of candidate languages.
It is therefore interesting to ask what stylized conclusions we might draw from these general results about generation as a task, and its relation to other learning processes. In the case of finite collections of languages, the basic idea underlying the proof is that a large “core” to the language (the closure of the sample, in our terminology) emerges at a known time after a finite set of observations, and it is then enough to generate from this core even though there might always remain peripheral parts of the language, disjoint from this core, that we can never be sure about. In the case of infinite collections of languages, the task is more complex, because there is never a known time at which a core to the language emerges. Instead, the algorithm may need to continually shrink the set it is using for generation; through this infinite process of shrinkage, the algorithm can be sure that beyond a certain point, it is always generating from the true language , even if it cannot be sure when it has reached this point or what the true language is.
In this way, as noted earlier in the paper, the solutions we develop highlight some interesting tensions between the problem of producing valid strings that belong to the target language, and the problem of maintaining breadth by not restricting to only a limited subset of the target language. Our approaches achieve validity through a strategy that implicitly gives up on breadth as a goal, and it is interesting to ask whether this is essentially necessary for any method that achieves language generation in the limit.
This tension, as it arises in our solution, also creates an interesting echo of the human process by which people acquire the vernacular within a new community [5]. As with our solution in this abstract model, people encountering the dialect in a new community may pass through a sequence of conceptually distinct phases: an initial phase in which they are generating too adventurously and producing invalid utterances; then a phase where the utterances are approximately aligned with the scope of the language; and finally a phase in which the range of utterances they are willing to generate shrinks further over their lifetime, as they become increasingly conservative in what they consider valid. Again, it is interesting to consider whether this type of structure is inherent in any solution to the task of generation in the limit.
Acknowledgements.
We thank Bobby Kleinberg, Lillian Lee, Marios Papachristou, and Kenny Peng for helpful discussions on these questions and on early drafts of this paper. The work has been supported in part by a Vannevar Bush Faculty Fellowship, a Simons Collaboration grant, a grant from the MacArthur Foundation, and the Center for Applied AI at the University of Chicago Booth School of Business.
References
- [1] Dana Angluin. Finding patterns common to a set of strings. In Proceedings of the 11th annual ACM Symposium on Theory of Computing, pages 130–141, 1979.
- [2] Dana Angluin. Inductive inference of formal languages from positive data. Information and Control, 45(2):117–135, 1980.
- [3] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
- [4] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 214–223. PMLR, 2017.
- [5] Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. No country for old members: User lifecycle and linguistic change in online communities. In Proceedings of the 22nd international conference on World Wide Web, pages 307–318, 2013.
- [6] E Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.
- [7] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- [8] Adam Tauman Kalai and Santosh S Vempala. Calibrated language models must hallucinate. In Proceedings of the 56th annual ACM Symposium on Theory of Computing, 2024.
- [9] Lillian Lee. Learning of context-free languages: A survey of the literature. Technical Report TR-12-96, Harvard University, 1996. Available via ftp, ftp://deas-ftp.harvard.edu/techreports/tr-12-96.ps.gz.
- [10] Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817, 2024.