ROME: Memorization Insights from Text, Logits and Representation

Li, Bo; Zhao, Qinghua; Wen, Lijie

arXiv:2403.00510 (cs)

[Submitted on 1 Mar 2024 (v1), last revised 16 Jun 2024 (this version, v3)]

Abstract: Previous works have evaluated memorization by comparing model outputs with training corpora, examining how factors such as data duplication, model size, and prompt length influence memorization. However, analyzing these extensive training corpora is highly time-consuming. To address this challenge, this paper proposes an innovative approach named ROME that bypasses direct processing of the training data. Specifically, we select datasets categorized into three distinct types -- context-independent, conventional, and factual -- and redefine memorization as the ability to produce correct answers under these conditions. Our analysis then focuses on disparities between memorized and non-memorized samples by examining the logits and representations of generated texts. Experimental findings reveal that longer words are less likely to be memorized, higher confidence correlates with greater memorization, and representations of the same concepts are more similar across different contexts. Our code and data will be publicly available when the paper is accepted.

Comments:	Submitted to EMNLP, 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2403.00510 [cs.CL]
	(or arXiv:2403.00510v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.00510

Submission history

From: Bo Li
[v1] Fri, 1 Mar 2024 13:15:30 UTC (778 KB)
[v2] Mon, 4 Mar 2024 06:36:01 UTC (773 KB)
[v3] Sun, 16 Jun 2024 13:53:44 UTC (717 KB)
