Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon
arXiv preprint arXiv:2406.12094 (CoRR abs/2406.12094), 17 Jun 2024

Despite investments in improving model safety, studies show that misaligned capabilities remain latent in safety-tuned models. In this work, we shed light on the mechanics of this phenomenon. We investigate why certain personas break model safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries.