Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon
arXiv preprint arXiv:2406.12094 (CoRR abs/2406.12094), 17 Jun 2024

Despite investments in improving model safety, studies show that misaligned capabilities remain latent in safety-tuned models. In this work, we shed light on the mechanics of this phenomenon. We investigate why certain personas break model safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries.