Expanding our open source large language models responsibly

Open Source

July 23, 2024•

7 minute read

Takeaways:

Meta is committed to openly accessible AI. Read Mark Zuckerberg’s letter detailing why open source is good for developers, good for Meta, and good for the world.
Open source has multiple benefits: It helps ensure that more people around the world can access the opportunities that AI provides, guards against concentrating power in the hands of a small few, and deploys technology more equitably. And we believe it will lead to more safe AI outcomes across society. That’s why we continue to advocate for making open access to AI the industry standard.
We’re bringing open intelligence to all by introducing the Llama 3.1 collection of models, which expand context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first frontier-level open source AI model.
As we improve the capabilities of our models, we’re also scaling our evaluations, red teaming, and mitigations, including for catastrophic risks.
We’re bolstering our system-level safety approach with new security and safety tools, which include Llama Guard 3 (an input and output multilingual moderation tool), Prompt Guard (a tool to protect against prompt injections), and CyberSecEval 3 (evaluations that help AI model and product developers understand and reduce generative AI cybersecurity risk). We’re also continuing to work with a global set of partners to create industry-wide standards that benefit the open source community.
We prioritize responsible AI development and want to empower others to do the same. As part of our responsible release efforts, we’re giving developers new tools and resources to implement the best practices we outline in our Responsible Use Guide.

How Meta is scaling AI safety

We’re closely following as governments around the world seek to define AI safety. Meta supports new safety institutes and works with established entities—including the National Institute of Standards and Technology (NIST) and ML Commons—to drive toward common definitions, threat models, and evaluations. Working with bodies such as Frontier Model Forum (FMF) and Partnership on AI (PAI), we seek to develop common definitions and best practices, while also engaging with civil society and academics to help inform our approach. For this release, we’ve continued to build on our efforts to evaluate and red team our models in areas of public safety and critical infrastructure, which includes cybersecurity, catastrophic risks, and child safety.

It’s important to note that before releasing a model, we work to identify, evaluate, and mitigate potential risks through several measures:

We conduct pre-deployment risk assessments, safety evaluations and fine-tuning, and extensive red teaming with both external and internal experts to stress test these models and find unexpected ways they may be used. As we expanded Llama 3.1 capabilities to include multilinguality or expanding the context window, we similarly scaled our evaluations and mitigations across those abilities in these areas. More about our safety evaluations and fine-tuning work is included in our Llama 3.1 research paper.

We’re committed to helping developers protect against potential adversarial users who aim to exploit Llama capabilities. This includes proactively incorporating mitigations throughout model development and building a suite of safeguards at the system level so developers can customize the generative AI application to fit their needs.

We work closely with partners including AWS, NVIDIA, Databricks, and others to ensure safety solutions are provided as part of the distribution of Llama models, facilitating responsible deployment of Llama systems.

In line with our open approach, we’re sharing model weights, recipes, and safety tools. In the Llama 3.1 research paper, we're also detailing the advancements we’ve made in our research, and outlining how we’ve measured model and system-level safety, and mitigated safety mapped to each stage of LLM model and system development. By sharing these artifacts, we aim to support and provide developers with the ability to deploy helpful, safe, and flexible experiences for their target audiences—and for the use cases supported by Llama.

System safety: New resources, security, and safety tools for developers

Our vision for Llama is to give developers a powerful foundation to build on by providing pieces of a broader system that gives them the flexibility to design and create custom offerings that align with their goals and needs. As part of the Llama reference system, we’re integrating a safety layer to facilitate adoption and deployment of the best practices outlined in the Responsible Use Guide. We’re excited to release new safety components for developers to power this safety layer and enable responsible implementation of their use cases.

The first, Llama Guard 3, is a high-performance input and output moderation model designed to support developers in detecting various common types of violating content, supporting even longer context across eight languages. It was built by fine-tuning the Llama 3.1 model and optimized to support the detection of emerging standards of hazard taxonomy. We believe that aligning on a common hazard taxonomy across the industry is an important step toward cultivating collaboration on safety. Llama Guard 3 is seamlessly integrated into our reference implementations and empowers the developer community to build responsibly from the start. Other resources to get started with Llama Guard 3, including how to fine-tune it for specific use cases, are available in the Llama-recipe GitHub repository.

Our second tool, Prompt Guard, is a multi-label model that categorizes inputs into three categories—benign, injection, and jailbreak—to help developers detect and respond to prompt injection and jailbreak inputs:

Prompt Injections are data from untrusted sources that attempt to induce models to execute unintended instructions.
Jailbreaks are malicious instructions designed to override the safety and security features built into a model.

Prompt Guard is capable of detecting explicitly malicious prompts and data that contain injected inputs. As-is, the model is useful for identifying and guardrailing against risky inputs to LLM-powered applications. For optimal results, we recommend that AI developers fine-tune Prompt Guard with application-specific data.

We’ve heard from developers that these tools are most effective and helpful when they can be tailored to the application. That’s why we’re eager to provide developers with an open solution so they can help create the safest and most effective experience based on their needs. We provide instructions in the Llama-recipe GitHub repository on how to do this.

Red teaming

Using both human and AI-enabled red teaming, we seek to understand how our models perform against different types of adversarial actors and activities. We partner with subject matter experts in critical risk areas and have also assembled a team of experts from a variety of backgrounds. Our red-teaming efforts incorporate experts across various disciplines, including cybersecurity, adversarial machine learning, and responsible AI, in addition to multilingual content specialists with backgrounds in AI security and safety in specific geographic markets.

We conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting, and we used our learnings to improve our benchmark measurements and fine-tuning datasets.

We also continued our fine-tuning work in post-training, where we produced final chat models by doing several rounds of alignment on top of the pre-trained model. Each round involved supervised fine-tuning, direct preference optimization, and reinforcement learning with a human feedback step. We produced the majority of our supervised fine-tuning examples through synthetic data generation. We also invested in multiple data processing techniques to filter the synthetic data we used to maintain high quality for our training datasets. This allowed us to scale the amount of fine-tuning data across capabilities.

Measuring Llama 3.1 capabilities and mitigating risks

We’ve assessed and mitigated against many areas of potential risk associated with the open source release of Llama 3.1 405B—for example, risks related to cybersecurity, chemical and biological weapons, and child safety:

Cybersecurity

We evaluated cybersecurity risks to third parties in the context of Llama 3.1 405B’s propensity to automate social engineering via spear-phishing and scale manual offensive cyber operations. This work also focused on potential risks for Llama 3.1 405B to be used for autonomous offensive cyber operations, along with autonomous software vulnerability discovery and exploitation. For all of the evaluations, we have not detected a meaningful uplift in actor abilities using Llama 3.1 405B.

In our research and testing, we covered the most prevalent categories of potential risks to application developers. These include prompt injection attempts, code interpreter abuse to execute malicious code, assistance in facilitating a cyber attack, and suggesting or autonomously writing insecure code.

As part of our commitment to openness and safety, we’re also releasing CyberSecEval 3, which has been updated with new evaluations for social engineering via spear phishing, autonomous offensive cyber operations, and image-based prompt injection. We discuss our approach to cybersecurity in our latest research paper.

Chemical and biological weapons

In order to assess risks related to the proliferation of chemical and biological weapons, we performed uplift testing designed to determine whether the use of the Llama 3.1 405B model could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons, compared to the use of the internet. In our research and testing, carried out with the assistance of external experts, we included the evaluation of threat models that we believe meaningfully increase ecosystem risk for low and moderate skilled actors, consistent with the research we have seen for other high-performing LLMs. Our testing included modeling of multiple stages of attack plans, with expert review of the outputs, and included examining how tool integration could aid an adversary. We have not detected a meaningful uplift in malicious actor abilities using Llama 3.1 405B.

Child safety

We’re committed to developing AI models in compliance with the Safety by Design principles published by Thorn and All Tech is Human. We incorporated these principles by responsibly sourcing training datasets and safeguarding them from CSAM and CSEM. Alongside a team of experts, we also conducted adversarial risk discovery exercises to assess against child safety risks. We used these insights to deploy appropriate risk mitigations by fine-tuning our model. These expert red teaming sessions were used to expand the coverage of our evaluation benchmarks through Llama 3.1 model development. For this latest release, we conducted new in-depth sessions using objective-based methodologies to assess the model risks along multiple attack vectors. We also partnered with content specialists to perform red teaming exercises to assess potentially violating content, while taking market-specific nuances and experiences into account.

Privacy

Llama 3.1 405B underwent privacy evaluations at various points in the training, including at the data level. We employed different techniques to reduce memorization including deduplication and reduced epochs. Using both manual and AI-assisted techniques, we red-teamed the model for memorization of information about private individuals and took steps to mitigate the model against those risks. We’re excited to see how the developer and research community further advance this space using Llama Guard 3 and other tools.

By open sourcing this work, we’re empowering developers to deploy systems aligned with their preferences and customize the safety of their systems for their particular use cases and needs.

As these technologies continue to evolve, we look forward to improving these features and models. And in the months and years to come, we’ll continue to help people build, create, and connect in new and exciting ways.

Our latest updates delivered to your inbox

Subscribe to our newsletter to keep up with Meta AI news, events, research breakthroughs, and more.

Join us in the pursuit of what’s possible with AI.

See all open positions