As the industry’s various AI markets continue their trundling paths towards maturity, not to mention the trough of disillusionment and the attendant and growing AI exhaustion, patterns have continued to unfold as has been discussed previously. The end game of many of these is and has been clear from the start. As but one example, enterprises are proving predictably unwilling to trust newly minted AI startups characterized by a startling lack of process and legal documentation with their private corporate data.
In other areas, however, the outcome is far less clear. The following are brief thoughts on three such cases in which the answer isn’t obvious and, in at least one instance, may not exist at all.
Which Model for Which User?
As noted last month, for all of the understandable industry focus and attention paid to large, expansively capable models, there are important roles for more limited medium and even small models to play. As conversations proceed beyond model capability alone and issues of data exfiltration and token costs come to the fore, things become more complicated. Couple those concerns with enterprises focused on more narrowly drawn use cases, and the state of the art large models aren’t the obvious and preordained winners that industry perception might suggest.
What this means, in practical terms, is that the large commercial models like ChatGPT, Claude, Gemini and others that their respective vendors would dearly like to sell to less price-sensitive large enterprises can look, in some cases, less useful or cost-effective than smaller, cheaper models that can be trained more easily to solve very specific problems. Which presents an interesting economics problem, because as fast as the individual consumer business of players like ChatGPT has grown – which is very fast indeed, and $20 a month at scale isn’t nothing – the vendors are never going to get rich off the consumer market. Or even, in all likelihood, cover their exorbitant chip-driven hardware costs.
One early hypothesis of the market has been that individuals would end up using lower-end, cheaper models while enterprises would require the largest, most highly trained, state of the art models. Instead, the reality in many cases may very well be the opposite.
AI Gateways: The New Default Interface?
One of the least surprising developments of the AI space to date has been the emergence of AI gateways from players like Cloudflare, Fastly or Kong. Much like their API gateway predecessors, AI gateways are instantiated in between users and AI endpoints like OpenAI’s. The primary justification for these to date has been issues like improving query performance or preventing the wild escalation of token-based costs.
One potential use case that has generally been underappreciated to date is that of the new AI interface abstraction.
AI models, at present, are proliferating wildly. Businesses are experimenting with multiple models, attempting to arrive at a workable strategic approach that delivers the required capabilities, eliminates the possibility of data exfiltration and minimizes costs. Users, meanwhile, are actively and aggressively imprinting on particular models – even models, in many cases, that they have been forbidden to use.
What if an interface existed, however, that decoupled the user experience from the model behind it? What if a single interface could deliver unified access to both large, public models and internal private ones – much as someone might use Ollama on a laptop – but in a scalable, enterprise-friendly way?
Enterprises would potentially tick boxes like public/private model support; scalable, centralized data on usage and therefore costs; and benefits in compliance and other areas. Users, for their part, would potentially get a single interface with the capabilities of every model behind it.
AI gateways already have many of the requisite capabilities to execute on this functionally, if not the messaging and vision.
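To make the abstraction concrete, what follows is a minimal sketch of what such a unified routing layer might look like, not a description of any particular vendor’s gateway. It assumes, purely for illustration, that both the public and private endpoints speak an OpenAI-style chat completions protocol, which hosted APIs and local runtimes like Ollama commonly expose; the route table, model names and environment variable are hypothetical.

```python
import os
import requests

# Hypothetical route table: one logical interface, many backing models.
# Each upstream is assumed to expose an OpenAI-style /chat/completions API.
ROUTES = {
    # Large public model: metered per token, data leaves the building.
    "public-large": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-4o",
    },
    # Small private model served locally, e.g. via Ollama's OpenAI-compatible endpoint.
    "private-small": {
        "base_url": "http://localhost:11434/v1",
        "api_key": "unused",
        "model": "llama3.1:8b",
    },
}


def complete(route_name: str, prompt: str) -> str:
    """Send a chat completion request through the selected route."""
    route = ROUTES[route_name]
    resp = requests.post(
        f"{route['base_url']}/chat/completions",
        headers={"Authorization": f"Bearer {route['api_key']}"},
        json={
            "model": route["model"],
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    # A real gateway would also log usage here for cost and compliance
    # reporting before returning the response to the user.
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Same interface, different models behind it.
    print(complete("private-small", "Summarize our internal style guide."))
    print(complete("public-large", "Draft a press release about our new product."))
```

The point is less the code than the shape of it: the user calls one interface, and decisions about which model sits behind it, what gets logged for cost and compliance purposes, and where data is allowed to travel become centralized policy rather than individual habit.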
Model vendors would presumably resist this sort of commoditization and disintermediation, preferring direct routes to their customers. But short of knee-capping its APIs, it would be difficult for a given vendor to deny users their interface of choice without limiting its own addressable market.
Open Source, AI and Data
At present, the OSI is working aggressively to try to finalize a proposed draft definition of open source as it pertains to AI technologies. This is necessary because while it’s easy to understand how copyright-based licenses apply to software, it’s much more difficult to determine where, how and even whether they apply to the unique combination of software, data, inferences, embeddings and so on that makes up large AI models.
While there are a range of issues, one of the most pressing at present concerns training data. More specifically, whether or not training data needs to be included alongside other project components like software for a system to be considered “open source.”
The current OSI draft of the definition does not mandate release of the accompanying training data. It instead requires:
“Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.”
Many smart and reasonable individuals with decades of open source experience regard this as insufficient. Jason Warner, CEO of Poolside, discussed the importance of training data with us recently. Julia Ferraioli, meanwhile, explicitly made the case last week that the current definition falls short, arguing in essence that without the data – and a sure way to replicate it – the draft AI definition cannot fulfill the four key freedoms that the traditional OSI definition does.
This is correct. If data is key to AI, and data may or may not be replicable under these more lenient terms, then while users would be able to use or distribute the system, the draft definition clearly could not guarantee the right to study the full system or modify it deeply.
Clearly, then, an OSI definition that does not require the inclusion of training data is problematic. A definition that requires full training set availability, however, is arguably equally problematic.
Practicality
First, and most obvious, is practicality. For large models that are trained on datasets the size of the internet, dropping those datasets into a repository the way we would with source code is challenging for a single project and essentially impossible at scale. It’s also not clear that datasets of that scale would be practically navigable or could be reasonably evaluated.
Legality
The second issue is legality. Obviously, there are reasons for commercial parties not to want to release their training data. One of those reasons might be that they are training on data of questionable legality.
But setting simple cases like that aside, the questions here are anything but simple.
Many years ago, we were briefed on a particular data use case in which two separate datasets that could not legally be combined on disk were instead leveraged in the only fashion the lawyers authorized – in memory. According to the way the licenses to these datasets were written, disk meant spinning disks – which was regarded as legally distinct from RAM.
When law intersects with data, in other words, things get complicated quickly. It’s not clear, in fact, whether requiring the release of data in a license is itself legal. Questions like that are perhaps best deferred to actual lawyers like Luis Villa and Van Lindberg. But even with resources and input from such experts helping to illuminate and clarify some of the questions around all of this, there will presumably be a large number of corner cases with no simple answer.
Authors may rely on training data that cannot be legally released under an open source license. More commonly, they may rely on training data that they cannot say unequivocally whether they are able to release, because sufficient case law has yet to decide the issue.
What this means in practical terms is that when in doubt – in the best-case scenario, at least, as opposed to the worst-case scenario we’ll get to momentarily – authors will simply default to non-open source licenses.
Outcomes
Which brings us to the last issue, which concerns outcomes – desired and undesired. On the one hand, strict adherence to the four freedoms clearly necessitates a full release of training data. Whether those training datasets are big and unwieldy – and whether that dramatically narrows the funnel of available open source models due to questions of data and law – is immaterial, if the desired outcome is a pure definition of open source and AI.
It seems at least possible, however, that such adherence could lead to undesirable outcomes. If we assume, for example, that the definition requires full release of datasets, one thing is certain: in Julia’s words, it would be “a definition for which few existing systems qualify.”
In and of itself, that would not necessarily be a negative if there were a plausible pathway for some reasonable number of AI projects to comply in some reasonable timeframe. It’s not clear, however, that that is the case here. It seems more likely, in fact, that a tight definition would have the reverse effect. If the goal seems fundamentally unachievable, why try? At which point, each project would have a choice to make: follow in Google’s footsteps with Gemma or Meta’s with Llama.
Google was explicit that while Gemma was an open model, it did not and would not qualify for the term open source because the company respects the OSI definition. The majority of the press ignored this important bit of what it considered semantics and called it open source. Meta, on the other hand, as it has for years, willfully and counterfactually described and continues to describe Llama as open source – in spite of use restrictions which mean peers like Amazon, Google and Microsoft cannot leverage the project – and the press, with the exception of the odd tech reporter here and there and, of all publications, Nature, has not seen fit to question the claim.
In a world, then, where companies provably want the benefits of the open source brand, but where that brand is seen as difficult if not impossible to achieve – particularly a world in which the open source label is already under siege from other quarters – the most likely course of action, from this perspective, is for vendors to abandon any pretense of consideration for the OSI and the term it’s charged with guarding. In attempting to more closely protect the open source definition with respect to AI, then, it’s possible that the outcome would be the opposite of the intention.
Which is why the question is such a challenge, and the task of answering it not one to be envied.
Disclosure: Amazon, Cloudflare, Fastly, Google and Microsoft are all RedMonk customers. Anthropic, Kong, Meta, OpenAI and Poolside are not currently RedMonk customers.