Takeaways:
How Meta is scaling AI safety
We’re closely following as governments around the world seek to define AI safety. Meta supports new safety institutes and works with established entities—including the National Institute of Standards and Technology (NIST) and ML Commons—to drive toward common definitions, threat models, and evaluations. Working with bodies such as Frontier Model Forum (FMF) and Partnership on AI (PAI), we seek to develop common definitions and best practices, while also engaging with civil society and academics to help inform our approach. For this release, we’ve continued to build on our efforts to evaluate and red team our models in areas of public safety and critical infrastructure, which includes cybersecurity, catastrophic risks, and child safety.
It’s important to note that before releasing a model, we work to identify, evaluate, and mitigate potential risks through several measures:
System safety: New resources, security, and safety tools for developers
Our vision for Llama is to give developers a powerful foundation to build on by providing pieces of a broader system that gives them the flexibility to design and create custom offerings that align with their goals and needs. As part of the Llama reference system, we’re integrating a safety layer to facilitate adoption and deployment of the best practices outlined in the Responsible Use Guide. We’re excited to release new safety components for developers to power this safety layer and enable responsible implementation of their use cases.
The first, Llama Guard 3, is a high-performance input and output moderation model designed to support developers in detecting various common types of violating content, supporting even longer context across eight languages. It was built by fine-tuning the Llama 3.1 model and optimized to support the detection of emerging standards of hazard taxonomy. We believe that aligning on a common hazard taxonomy across the industry is an important step toward cultivating collaboration on safety. Llama Guard 3 is seamlessly integrated into our reference implementations and empowers the developer community to build responsibly from the start. Other resources to get started with Llama Guard 3, including how to fine-tune it for specific use cases, are available in the Llama-recipe GitHub repository.
Our second tool, Prompt Guard, is a multi-label model that categorizes inputs into three categories—benign, injection, and jailbreak—to help developers detect and respond to prompt injection and jailbreak inputs:
Prompt Guard is capable of detecting explicitly malicious prompts and data that contain injected inputs. As-is, the model is useful for identifying and guardrailing against risky inputs to LLM-powered applications. For optimal results, we recommend that AI developers fine-tune Prompt Guard with application-specific data.
We’ve heard from developers that these tools are most effective and helpful when they can be tailored to the application. That’s why we’re eager to provide developers with an open solution so they can help create the safest and most effective experience based on their needs. We provide instructions in the Llama-recipe GitHub repository on how to do this.
Red teaming
Using both human and AI-enabled red teaming, we seek to understand how our models perform against different types of adversarial actors and activities. We partner with subject matter experts in critical risk areas and have also assembled a team of experts from a variety of backgrounds. Our red-teaming efforts incorporate experts across various disciplines, including cybersecurity, adversarial machine learning, and responsible AI, in addition to multilingual content specialists with backgrounds in AI security and safety in specific geographic markets.
We conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting, and we used our learnings to improve our benchmark measurements and fine-tuning datasets.
We also continued our fine-tuning work in post-training, where we produced final chat models by doing several rounds of alignment on top of the pre-trained model. Each round involved supervised fine-tuning, direct preference optimization, and reinforcement learning with a human feedback step. We produced the majority of our supervised fine-tuning examples through synthetic data generation. We also invested in multiple data processing techniques to filter the synthetic data we used to maintain high quality for our training datasets. This allowed us to scale the amount of fine-tuning data across capabilities.
Measuring Llama 3.1 capabilities and mitigating risks
We’ve assessed and mitigated against many areas of potential risk associated with the open source release of Llama 3.1 405B—for example, risks related to cybersecurity, chemical and biological weapons, and child safety:
Cybersecurity
We evaluated cybersecurity risks to third parties in the context of Llama 3.1 405B’s propensity to automate social engineering via spear-phishing and scale manual offensive cyber operations. This work also focused on potential risks for Llama 3.1 405B to be used for autonomous offensive cyber operations, along with autonomous software vulnerability discovery and exploitation. For all of the evaluations, we have not detected a meaningful uplift in actor abilities using Llama 3.1 405B.
In our research and testing, we covered the most prevalent categories of potential risks to application developers. These include prompt injection attempts, code interpreter abuse to execute malicious code, assistance in facilitating a cyber attack, and suggesting or autonomously writing insecure code.
As part of our commitment to openness and safety, we’re also releasing CyberSecEval 3, which has been updated with new evaluations for social engineering via spear phishing, autonomous offensive cyber operations, and image-based prompt injection. We discuss our approach to cybersecurity in our latest research paper.
Chemical and biological weapons
In order to assess risks related to the proliferation of chemical and biological weapons, we performed uplift testing designed to determine whether the use of the Llama 3.1 405B model could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons, compared to the use of the internet. In our research and testing, carried out with the assistance of external experts, we included the evaluation of threat models that we believe meaningfully increase ecosystem risk for low and moderate skilled actors, consistent with the research we have seen for other high-performing LLMs. Our testing included modeling of multiple stages of attack plans, with expert review of the outputs, and included examining how tool integration could aid an adversary. We have not detected a meaningful uplift in malicious actor abilities using Llama 3.1 405B.
Child safety
We’re committed to developing AI models in compliance with the Safety by Design principles published by Thorn and All Tech is Human. We incorporated these principles by responsibly sourcing training datasets and safeguarding them from CSAM and CSEM. Alongside a team of experts, we also conducted adversarial risk discovery exercises to assess against child safety risks. We used these insights to deploy appropriate risk mitigations by fine-tuning our model. These expert red teaming sessions were used to expand the coverage of our evaluation benchmarks through Llama 3.1 model development. For this latest release, we conducted new in-depth sessions using objective-based methodologies to assess the model risks along multiple attack vectors. We also partnered with content specialists to perform red teaming exercises to assess potentially violating content, while taking market-specific nuances and experiences into account.
Privacy
Llama 3.1 405B underwent privacy evaluations at various points in the training, including at the data level. We employed different techniques to reduce memorization including deduplication and reduced epochs. Using both manual and AI-assisted techniques, we red-teamed the model for memorization of information about private individuals and took steps to mitigate the model against those risks. We’re excited to see how the developer and research community further advance this space using Llama Guard 3 and other tools.
By open sourcing this work, we’re empowering developers to deploy systems aligned with their preferences and customize the safety of their systems for their particular use cases and needs.
As these technologies continue to evolve, we look forward to improving these features and models. And in the months and years to come, we’ll continue to help people build, create, and connect in new and exciting ways.
Our latest updates delivered to your inbox
Subscribe to our newsletter to keep up with Meta AI news, events, research breakthroughs, and more.
Join us in the pursuit of what’s possible with AI.
Foundational models
Latest news
Foundational models