CP-KGC: Constrained-Prompt Knowledge Graph Completion With Large Language Models

ABSTRACT
Knowledge graph completion (KGC) aims to utilize existing knowledge to deduce and infer missing connections within knowledge graphs. Methods for knowledge graph completion (KGC) are pivotal in the automated construction and validation of KGs.

The model may occasionally experience generation hallucination [14], leading to the introduction of erroneous information. Leveraging recent advancements in LLMs [29], we introduce the Constrained-Prompt Knowledge Graph Completion (CP-KGC) method. This method formulates constrained prompts for KGC datasets, enabling LLMs to regenerate textual entity descriptions. LLMs possess robust text generation capabilities, making them exceedingly valuable for generating textual descriptions of entities in KGs. LLMs are also in a state of continuous evolution, undergoing ongoing optimization and updates that are expected to further enhance their capabilities. Consequently, we introduce a constrained-prompt approach to delimit the scope of model outputs, thereby mitigating the issue of generation hallucination in LLMs to a certain extent.

Additionally, it is possible to automatically generate higher-quality prompts for KGC [15], presenting an alternative approach to integrating large models with KG tasks. In this study, we compare GPT-3.5 and GPT-4 with open-source, low-parameter models to investigate the synergistic integration of large-scale models with KG-related tasks. Specifically, we employ the Qwen-7B-chat [1] and LLaMA2-7B-chat [32] models to enhance data-driven text-based methods and attain efficient inference under resource-constrained conditions.

We evaluated CP-KGC using three widely recognized datasets: WN18RR, FB15K237, and UMLS. To assess the efficacy of the generated data, we employed the state-of-the-art SimKGC [38] as our baseline model. CP-KGC exhibited superior performance across multiple indicators, as demonstrated by the evaluation metrics (MRR, Hits@1, 3, 10), thus confirming the efficacy of the generated data. Furthermore, we validated the effectiveness of low-parameter models compared to large models with varying parameters. We hope that CP-KGC will contribute to developing improved KGC systems.

2 RELATED WORK
2.1 Knowledge Graph Completion
KGC techniques primarily aim to resolve the problem of missing links in KGs, thereby enhancing their comprehensiveness. In embedding-based approaches, like TransE [4], models map entities and relationships within KGs to a low-dimensional vector space. These models then employ calculations involving these vectors to predict missing triples. To augment the expressive capability of models, the ComplEx [33] model introduces complex-valued embeddings, and the RotatE [31] model utilizes rotation angles within a complex space to quantify the similarity between triples. Text-based methodologies instead encode entity text descriptions with pre-trained language models, as in KG-BERT [43], StAR [37], and BLP [9]; these methods heavily depend on textual entity descriptions. Building upon the in-context learning approach from GPT-3 [5], GenKGC [41] employed the robust language model BART [19] as its foundational framework to enhance the learning process. However, the performance of these methods frequently lagged behind embedding-based approaches [22] until the advent of SimKGC, which, for the first time, showcased superiority over embedding-based methods on specific dataset metrics. Furthermore, improving the quality of KGs [44] can in turn further supplement language model training.

2.2 Large Language Models
The evolution of language models has progressed from early rule-based stages to the contemporary era of data-driven methodologies [12, 23]. Currently, it is advancing toward large-scale multi-task learning [11], pre-trained models, and heightened language comprehension capabilities. Prominent pre-trained models exemplify robust language comprehension and generation capabilities. These LLMs employ transformer-like architectures [35], using context-aware language representations and self-attention mechanisms to capture linguistic nuances, consequently achieving heightened precision in language understanding and generation. This transformation has significantly influenced the evolution of the natural language processing learning paradigm.

With the expansion of training data, models have seen a surge in parameters, from early models like BERT [11] and RoBERTa [28] with hundreds of millions of parameters to higher-parameter models such as GPT-3 (175B parameters) and PaLM (540B parameters). The rise in language model parameters has endowed them with remarkable problem-solving abilities. GPT-3, for instance, excels in few-shot tasks through in-context learning, a feat not achieved by GPT-2 [26]. A notable application is ChatGPT, which uses GPT-3.5 for conversations and showcases remarkably human-like conversational abilities. Subsequent to the release of GPT-3.5, OpenAI introduced GPT-4, which exhibits enhanced capabilities. Notably, there are models with fewer parameters that maintain commendable performance, including Alibaba's Qwen-7B [1] and MetaAI's LLaMA2 series [32].

2.3 Prompt Engineering
Prompt engineering focuses on creating and refining prompts to maximize the effectiveness of LLMs [21, 40, 45]. Effectively harnessing the text generated by LLMs as a knowledge base represents a significant concern. Studies [25, 28] introduced manually designed prompts to probe language models. Researchers [5] initially identified the effectiveness of prompts, which are text templates crafted manually, and subsequent studies [28, 36] further expanded upon this approach. However, selecting discrete prompts necessitates human intervention and presents challenges in optimizing them seamlessly for downstream tasks. Addressing the limitations of discrete templates, recent studies [18, 20] utilized trainable continuous vectors within frozen pre-trained language models (PLMs). With the emergence of large models like GPT-4, the advantages of prompts have been further demonstrated. Users can configure input prompts to include intent, content guidance, style, tone, and contextual constraints, thereby obtaining model outputs that yield outstanding results in zero-shot inference.

3 METHODOLOGY
We formulate prompts with constraints tailored to each dataset, employ LLMs for zero-shot inference, and create cleaning scripts to extract textual entity descriptions. These extracted descriptions are denoted as "stronger data." In Figure 2, we present the architecture of CP-KGC.
Figure 2: An example of CP-KGC stronger-data generation for <Spider-Man 3>. CP-KGC employs entities and their associated descriptions as context to regulate the model's generated outcomes, as directly providing entities might induce hallucinations. Leveraging LLMs, CP-KGC facilitates zero-shot inference. After data cleansing, the refined description is condensed significantly, from 278 tokens to 29 tokens. CP-KGC efficiently produces concise textual representations that aptly characterize the entity. The entity-descriptive text restricts the generation scope of the LLM, so the generated text refers to the movie "Spider-Man 3" instead of the video game.
3.1 Notation
A knowledge graph can be represented as G, where entities are denoted as E and relationships as R. Each edge in G can be expressed as a triple (h, r, t), where h, r, and t correspond respectively to the head entity, the relationship, and the tail entity. The task of KGC involves inferring the missing triples when provided with an incomplete G. In this paper, we define the textual description corresponding to a head or tail entity as E-text.
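For concreteness, this notation can be mirrored in a few lines of Python; the entity and relation names in the sketch below are invented for illustration and are not from the datasets.

```python
# Minimal illustration of Section 3.1's notation: a KG G as a set of
# (h, r, t) triples plus a lookup from entities to their E-text.
from typing import NamedTuple

class Triple(NamedTuple):
    h: str  # head entity
    r: str  # relation
    t: str  # tail entity

G = {Triple("Spider-Man 3", "film/directed_by", "Sam Raimi")}  # invented relation name
e_text = {"Spider-Man 3": "Spider-Man 3 is a 2007 American superhero film ..."}

# KGC asks: given an incomplete G, rank candidate tails for (h, r, ?)
# and candidate heads for (?, r, t).
```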
3.2 Prompt for Longtext
FB15K237, an extensive and refined knowledge graph dataset, was released by Facebook AI Research as an extension and enhancement of its predecessor, FB15K. This dataset predominantly covers a wide range of real-world concepts and entities, including individuals, locations, and items. At present, most text-based KGC methods make use of the E-text data from [43], and this paper does likewise. In FB15K237, the E-text descriptions are presented as extensive textual narratives, with an average length of 139 tokens.

Most text-based KGC methods depend on pre-trained language models built on the transformer architecture. Models of this kind apply self-attention when processing text, allowing them to dynamically focus on different parts of the input based on context. This capability empowers them to efficiently capture essential information within the text while attenuating redundancy and less crucial details. Models such as BERT, when confronted with longer and more redundant texts (the E-text from FB15K237), may identify repetitive information within the text. This can result in a partial dispersion of attention, yielding a somewhat fragmented representation of the entire text. Moreover, redundant expressions increase the intricacy of text processing for the model, as it must sift through repeated information to extract pivotal details. Although pre-training helps the model manage redundancy to a certain extent, excessively long or repetitive text can pose a performance bottleneck.

Prompt for Longtext: Please summarize the following text in one sentence: "the long text from entity."

Generally, E-text denotes text that is difficult to refine concisely and comprehensively through preprocessing or manual editing. LLMs, such as GPT-3.5, exhibit outstanding text generation capabilities but may not explain all entities correctly (e.g., repeated person names or polysemy). LLMs can only generate text based on the statistical knowledge they acquire from large-scale corpora, lacking the ability to understand entities deeply. Moreover, since LLMs generate text relying on statistical probabilities, there are instances where they may generate inaccurate or irrational text, a phenomenon referred to as the "hallucination problem" in large models. To improve the quality of generated text, we supply contextual constraints alongside input entities to direct LLMs toward producing more pertinent and rational text.

Consequently, we instruct the model to summarize the textual descriptions (E-text) associated with the entities, thereby acquiring stronger data, an operation well-suited for LLMs.
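As a rough sketch of how the Prompt for Longtext can be applied, the snippet below builds the constrained prompt for one entity description and passes it to an arbitrary LLM callable. The function names and the one-sentence cleaning heuristic are our own illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch of one CP-KGC regeneration step: constrained prompt ->
# zero-shot LLM -> cleaned text. `llm` stands in for any chat model call.
from typing import Callable

def build_longtext_prompt(description: str) -> str:
    # "Prompt for Longtext" from Section 3.2, applied to one E-text.
    return f'Please summarize the following text in one sentence: "{description}"'

def clean_output(raw: str) -> str:
    # Toy cleaning script: strip whitespace and truncate at the first full
    # sentence, mirroring the goal of short, self-contained descriptions.
    raw = raw.strip()
    end = raw.find(". ")
    return raw if end == -1 else raw[: end + 1]

def regenerate_description(description: str, llm: Callable[[str], str]) -> str:
    return clean_output(llm(build_longtext_prompt(description)))

if __name__ == "__main__":
    # Stand-in for a real LLM (an API call or a local 7B chat model).
    echo_llm = lambda prompt: "Spider-Man 3 is a 2007 American superhero film. Extra text."
    print(regenerate_description("...long redundant FB15K237 description...", echo_llm))
```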
3.3 Prompt for Synonyms
WN18RR serves as a standard dataset for link prediction within KGs and is derived from WordNet. It constitutes an extended and refined iteration of the WN18 dataset. The dataset comprises a range of relationship types, encompassing hypernym, part-whole, synonym, and other connections.

In contrast to FB15K237, WN18RR features shorter E-text, predominantly comprising concise explanations or example sentences for the entities. Additionally, certain E-text entries contain synonyms. Consequently, we formulate prompts for inferring synonym tables by leveraging entities and their explanatory content as a complementary resource to E-text.

Prompt for Synonyms: Give synonyms for "entity" based on the content of "the description of entity."

In the Prompt for Synonyms, "entity" refers to the entities in WN18RR, and "the description of entity" represents the E-text. Taking the term "restitution" as an illustrative example, it possesses two distinct meanings within the dataset. The first refers to "a sum of money paid as compensation for loss or injury," while the second denotes "the act of restoring something to its original state." Employing the Prompt for Synonyms, we can deduce synonyms corresponding to these meanings: ['compensation', 'recompense', 'indemnification'] for the former and ['renewal', 'reinstatement', 'restoration'] for the latter.

Including entities and their associated explanatory text in the Prompt for Synonyms provides several advantages. Firstly, an entity such as "restitution" has multiple meanings, and offering only the entity may yield synonyms that do not align with the current context, potentially diminishing the quality of the generated text. Secondly, there are 12873 repeated entities in the WN18RR dataset, and deriving synonyms from the explanations of the original entities in WN18RR ensures higher accuracy. Pre-trained language models are trained on extensive datasets and possess robust generalization capabilities. To assess the accuracy of content produced by LLMs, we employ the "bert-base-uncased" model to determine the most analogous synonyms, imposing a similarity threshold for refinement.

Similarity Calculation Entities encoded with bert-base-uncased are denoted as $e_t$, while the set of synonyms inferred through LLM inference is represented as $S = (s_1, s_2, \ldots, s_i)$. Because the embeddings are normalized, the cosine similarity $\mathrm{Cos}(e_t, s_i)$ is simply the dot product of the two embeddings:

$$\mathrm{Cos}(e_t, s_i) = \frac{e_t \cdot s_i}{\|e_t\|\,\|s_i\|} = e_t \cdot s_i \quad (1)$$

We compute the cosine similarity between all synonyms and entities, keeping the synonyms with similarity scores greater than or equal to 0.9:

$$\operatorname*{arg\,max}_{s_i} \mathrm{Cos}(e_t, s_i), \quad s_i \in S \quad (2)$$

The acquired synonyms are then incorporated into the text, resulting in stronger data.
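A possible implementation of this filtering step with the Hugging Face transformers library is sketched below. The paper fixes the encoder (bert-base-uncased) and the 0.9 threshold; the mean pooling and L2 normalization details are our assumptions.

```python
# Hedged sketch of the synonym filter in Section 3.3. Embeddings are
# normalized so the dot product equals cosine similarity (Equation 1).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states, then L2-normalize.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state.mean(dim=1).squeeze(0)
    return hidden / hidden.norm()

def filter_synonyms(entity: str, candidates: list[str], threshold: float = 0.9) -> list[str]:
    e_t = embed(entity)
    # Keep only LLM-generated synonyms whose cosine similarity >= threshold.
    return [s for s in candidates if float(embed(s) @ e_t) >= threshold]

# e.g. filter_synonyms("restitution", ["compensation", "recompense", "renewal"])
```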
4 EXPERIMENTS
4.1 Experimental Setup
Datasets We employ three datasets for our evaluation: FB15K237, WN18RR, and UMLS [3]. The statistical details of the datasets utilized in this study are presented in Table 1. Bordes et al. [4] introduced the WN18 and FB15K datasets. Subsequent research [10, 32] uncovered the issue of test set leakage in these two datasets. To resolve this concern, inverse relations were removed, resulting in the WN18RR and FB15k-237 datasets. The WN18RR dataset encompasses approximately 41k synonym sets and 11 relations derived from WordNet [24], whereas the FB15k-237 dataset comprises roughly 15k entities and 237 relations obtained from Freebase. UMLS [10] is a medical semantic network comprising semantic types and semantic relations. We utilize the textual descriptions supplied by KG-BERT for the WN18RR, FB15k-237, and UMLS datasets.

Implementation Details In this paper's experimental setup, we employ the text-based model SimKGC and the similarity calculation model bert-base-uncased. The selected LLMs are Qwen-7B-chat, LLaMA-2-7B-chat, LLaMA-2-13B-chat, GPT3.5-turbo, and GPT4. For model inference, Qwen-7B-chat and LLaMA-2-7B-chat are executed on an RTX3090-24G GPU, LLaMA-2-13B-chat utilizes an A100-80G GPU, and GPT3.5-turbo and GPT4 are accessed via API calls. For training and testing within the SimKGC framework, we employ an A100-80G GPU to maintain consistency with the original parameters and SimKGC configuration.

Evaluation Metrics Building upon prior research, we implemented the CP-KGC method and incorporated text-based strategies for extended predictions with augmented data. We assessed our model using four standardized evaluation metrics: Mean Reciprocal Rank (MRR) and Hits@k for k in {1, 3, 10}. MRR, specifically tailored for KGC tasks, gauges the model's performance by computing the reciprocal of the rank of the first accurate answer for each query and averaging these reciprocals across all queries. Hits@k evaluates whether the correct answer falls within the top k predictions made by the model: if the correct answer ranks among the top k predictions, Hits@k is 1; otherwise, it is 0. Both MRR and Hits@k are calculated within a filtering framework, where scores associated with all known true triples from the training, validation, and test sets are disregarded. These metrics are averaged across two dimensions: head entity prediction and tail entity prediction.
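To make the filtered-ranking metrics concrete, here is a minimal sketch that assumes the filtered rank of the correct entity has already been computed for each query (head and tail predictions pooled together); the function names are ours.

```python
# Standard KGC metrics over a list of filtered ranks (1 = best).
from typing import Iterable

def mrr(ranks: Iterable[int]) -> float:
    ranks = list(ranks)
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks: Iterable[int], k: int) -> float:
    ranks = list(ranks)
    return sum(1 for r in ranks if r <= k) / len(ranks)

if __name__ == "__main__":
    ranks = [1, 3, 2, 15, 1]  # example filtered ranks
    print(mrr(ranks), hits_at_k(ranks, 10))
```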
4.2 Main Result
In prevalent KGC benchmark tests, CP-KGC consistently surpasses both embedding-based and text-based models on WN18RR, registering a peak improvement of 1.16 percentage points (from 58.7% to 59.86%) in H@1. Although it outperformed SimKGC on the FB15k-237 dataset, the optimal performance was not reached: among text-based methods, CSProm-KG exhibited superior performance in Hits@1, surpassing embedding-based techniques. Notably, CP-KGC's performance gains are achieved solely through data modifications, underscoring its efficacy as a data augmentation technique.

LLM Selection We propose CP-KGC to obtain stronger data, aiming to combine large models and KGC tasks from a data-driven perspective and further enhance the performance of text-based KGC models. The SimKGC model is the pioneering text-based approach surpassing traditional embedding-based models. We selected SimKGC as the foundational model, LLaMA2-7B-chat for WN18RR, and Qwen-7B-chat for FB15K237 (due to its lower resource consumption compared to LLaMA-2-13B-chat).
Table 2: Main results for WN18RR and FB15k-237 datasets. The inference models utilized for CP-KGC with Stronger data on
WN18RR and FB15k-237 were LLaMA2-7B-chat and Qwen-7B-chat, respectively.
WN18RR FB15K237
Method MRR H@1 H@3 H@10 MRR H@1 H@3 H@10
embedding-based methods
TransE[4] 24.3 4.3 44.1 53.2 27.9 19.8 37.6 44.1
DistMult[42] 44.4 41.2 47.0 50.4 28.1 19.9 30.1 44.6
ComplEx[33] 44.9 40.9 46.9 53.0 27.8 19.4 29.7 45.0
ConvE[10] 45.6 41.9 47.0 53.1 31.2 22.5 34.1 49.7
RotatE[31] 47.6 42.8 49.2 57.1 33.8 24.1 37.5 53.3
TuckER[2] 47.0 44.3 48.2 52.6 35.8 26.6 39.4 54.4
CompGCN[34] 47.9 44.3 49.4 54.6 35.5 26.4 39.0 53.5
text-based methods
KG-BERT[43] 21.6 4.1 30.2 52.4 - - - 42.0
MTL-KGC[17] 33.1 20.3 38.3 59.7 26.7 17.2 29.8 45.8
StAR[37] 40.1 24.3 49.1 70.9 29.6 20.5 32.2 48.2
MLMLM[8] 50.2 43.9 54.2 61.1 - - - -
GenKGC[41] - 28.7 40.3 53.5 - 19.2 35.5 43.9
KGT5[27] 50.8 48.7 - 54.4 27.6 21.0 - 41.4
KG-S2S[6] 57.4 53.1 59.5 66.1 33.6 25.7 37.3 49.8
CSProm-KG[7] 57.5 52.2 59.6 67.8 35.8 26.9 39.3 53.8
SimKGC[38] 66.6 58.7 71.7 80.0 33.6 24.9 36.2 51.1
CP-KGC 67.3 59.9 72.1 80.4 33.8 25.1 36.5 51.6
Table 3: Main results for WN18RR and FB15k-237 datasets with different LLMs.
WN18RR FB15K237
Method MRR H@1 H@3 H@10 MRR H@1 H@3 H@10
SimKGC 66.6 58.7 71.7 80.0 33.6 24.9 36.2 51.1
CP-KGC-Qwen-7B-chat 66.81 59.27 71.28 80.46 33.77 25.05 36.54 51.57
CP-KGC-LLaMA2-7B-chat 67.3 59.86 72.05 80.36 33.62 24.96 36.31 51.31
CP-KGC-LLaMA-2-13B-chat 66.99 59.38 71.69 80.46 33.79 25.08 36.47 51.6
CP-KGC-GPT3.5-turbo 67.29 60.03 71.56 80.28 33.46 24.73 36.13 51.09
CP-KGC-GPT4 67.1 59.66 71.84 80.03 33.56 24.65 36.5 51.25
Results on FB15K237 CP-KGC demonstrates better performance than SimKGC (e.g., Hits@10: 51.1% to 51.6%) across multiple metrics but still falls short of embedding-based KGC methods. This is partly attributable to the higher density of the FB15k-237 dataset and its smaller entity count compared to the other datasets; a significant number of links in FB15k-237 are unpredictable, which contributes to suboptimal model performance.

Table 3 illustrates the varied effects of the text generated by different LLMs on the model's performance. CP-KGC-LLaMA-2-13B-chat emerges as the top performer, registering an improvement of 0.5 percentage points (from 51.1% to 51.6% on H@10). CP-KGC-Qwen-7B-chat also delivers commendable results, with only a marginal difference from the former. Notably, Qwen-7B-chat operates with only 7B parameters, offering benefits in terms of computational efficiency and reduced inference time.

Results on WN18RR On the WN18RR dataset, H@1 improves from 58.7% to 60.03% with GPT3.5-turbo. Overall, the results on this dataset outperform those of embedding-based methods. It is important to highlight that when incorporating synonyms as a textual supplement in this paper, a similarity threshold of 0.9 is employed. When the model is tested with a similarity threshold lower than 0.9, the results generally exhibit lower performance compared to the original ones.

Relative to FB15K237, the enhancement of CP-KGC on WN18RR is more pronounced. All models surpass the performance of SimKGC. Incorporating pertinent synonyms in text descriptions positively influences the models.

Low Resource Availability The text generation capabilities of LLMs vary with the size of their parameters [16]. In this study, we utilized open-source deployable LLMs, namely Qwen-7B-chat and LLaMA-2-7B-chat, each possessing 7 billion parameters, and the larger LLaMA-2-13B-chat with 13 billion parameters. Notably, the Qwen-7B-chat we deployed is a quantized version of the model, making it suitable for lower-resource usage scenarios. Additionally, we utilized the GPT3.5-turbo and GPT4 models, which are not open-source but were accessed through a paid API for comparative analysis. This decision is geared toward investigating the collaborative potential of LLMs and KGC in resource-constrained settings, thus opening avenues for novel approaches to integrating LLMs into the KG domain.

As indicated in Table 3, on FB15K237, CP-KGC-LLaMA-2-13B-chat exhibited the highest performance, with Qwen-7B-chat closely trailing. Both models surpassed the original metrics across a range of indicators, while CP-KGC-GPT3.5-turbo experienced a decrease in performance across all metrics. On WN18RR, CP-KGC-LLaMA2-7B-chat achieved the highest results, and the remaining models also showed varying degrees of improvement. The outcomes of this experiment demonstrate that increased model parameters do not always correlate with improved performance.

We are not the first to integrate LLMs into KG tasks, but our approach to combining these domains involves a distinct perspective, particularly optimizing resource utilization within the constraints of limited computational capacity.

5 ANALYSIS
We perform a comprehensive set of analyses to delve deeper into CP-KGC and explore the intricacies of integrating KGC with LLMs.

5.1 What Makes Stronger Data Better?
Compared to established text-based KGC methods, CP-KGC distinguishes itself in two fundamental ways. Firstly, in our pursuit of excellence in text-based KGC, we prioritize acquiring robust data. Secondly, we delve into integrating LLMs and KGC tasks, particularly under the constraints posed by limited computational resources. As shown in Table 3, relying on models with massive parameter sizes (GPT3.5-turbo and GPT4) is unnecessary to achieve a successful fusion of these two domains. The prompts outlined in this paper represent neither the only nor the optimal options, and exploring stronger data can also be extended to other KGC models for further investigation.

KG-BERT's application of text-based methods to KGC tasks showcases a visionary approach. That work suggests that substituting entities with text can enhance performance on the MR and Hits@10 metrics. Nevertheless, experimental results clearly indicate that optimal performance on the UMLS dataset is attained when exclusively employing triples, while the addition of extensive text results in a deterioration of model performance. We therefore investigate CP-KGC on the UMLS dataset.

Table 4: Main results for the UMLS dataset.

UMLS
Method MRR H@1 H@3 H@10
KG-BERT (a)-triples 85.1 76.7 93.0 98.8
KG-BERT (a)-longtext 64.8 53.9 71.4 83.9
CP-KGC-longtext 77.5 64.8 87.4 97.7
CP-KGC-Qwen-7B-chat 79.8 67.6 89.7 98.0

Table 4 uses the following definitions: KG-BERT (a)-triples is trained with triples for link prediction; KG-BERT (a)-longtext uses extended text data; CP-KGC-longtext denotes experiments where Qwen-7B-chat generates lengthy text; and CP-KGC-Qwen-7B-chat represents an effort to generate concise yet semantically rich text. KG-BERT (a)-triples exhibits superior performance, and a significant decline is noticeable when long texts are used. However, in the context of extended texts, CP-KGC-longtext demonstrates markedly improved performance. This can be attributed to several factors. Firstly, the UMLS dataset is relatively small, comprising only 135 entities and 46 relationship types. Furthermore, the entity descriptions in the UMLS dataset, as provided in KG-BERT, are of subpar quality, inadequately conveying the relationships and meanings associated with the entities. Despite its shorter text descriptions, the superior outcomes obtained with CP-KGC-Qwen-7B-chat reinforce this argument.

5.2 Fine-grained Analysis
Our proposed constrained prompts were tailored to generate high-quality textual descriptions for entities, considering their specific contexts.

FB15K237 Due to the diverse nature of the datasets in terms of types and domains, a uniform prompt optimization approach was not viable. Instead, we adopted contextual summarization for entities within the FB15K237 dataset. This choice was influenced by the initially lengthy and redundant entity descriptions. Through summarization, we obtained concise yet informative texts that accurately represent the entities. For example, the entity "Spider-Man 3," a movie title, initially had a description spanning 278 tokens. Following CP-KGC processing, it was condensed to a 29-token text: "Spider-Man 3 is a 2007 American superhero film directed by Sam Raimi and distributed by Columbia Pictures, which stars Tobey Maguire, Kirsten Dunst, James Franco, and Thomas Haden Church." This brief yet comprehensive description effectively identifies the entity.

It is crucial to acknowledge that "Spider-Man 3" can denote not only a movie but also a video game. Using a direct LLM approach for entity text generation could therefore lead to inaccuracies. CP-KGC resolves this concern by producing precise and contextually fitting textual descriptions for entities.

WN18RR is a frequently employed dataset in KG research, representing a variation of the WordNet dataset. Entities in WN18RR are linked through diverse relationships, such as hypernyms, hyponyms, and synonyms, defined within WordNet. CP-KGC's prompt approach for WN18RR infers synonyms for entities from their original textual descriptions. Synonyms generated by LLMs were carefully filtered to ensure the quality of the generated data.

The effectiveness of context constraints in CP-KGC on the WN18RR dataset is evident in its handling of polysemy, where one word has multiple meanings. The WN18RR dataset initially comprises 40943 unique entity identifiers; after eliminating duplicates, 32547 distinct entities remain, and each identifier corresponds to a distinct meaning of the same entity. Directly generating synonyms using LLMs might not produce the desired outcomes and could even compromise text quality. Utilizing context constraints, CP-KGC accurately generates entity-specific synonyms. We will elaborate on this topic in the next subsection.
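Pulling the two dataset-specific templates together, a small dispatcher like the one below selects the constrained prompt per dataset. The format-string framing and the helper itself are our illustrative additions; the wording of each template comes verbatim from Sections 3.2 and 3.3.

```python
# The two constrained prompts, keyed by dataset.
PROMPTS = {
    "FB15K237": 'Please summarize the following text in one sentence: "{description}"',
    "WN18RR": 'Give synonyms for "{entity}" based on the content of "{description}"',
}

def constrained_prompt(dataset: str, entity: str, description: str) -> str:
    # str.format ignores unused keyword arguments, so one call fits both templates.
    return PROMPTS[dataset].format(entity=entity, description=description)

print(constrained_prompt("WN18RR", "restitution",
                         "a sum of money paid as compensation for loss or injury"))
```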
6 CONCLUSION
In this paper, we introduce CP-KGC, a straightforward approach based on constrained prompts. We recognize that the central challenge lies in constraining LLMs to produce meaningful text. CP-KGC leverages advanced LLMs and employs entities and their textual descriptions as contextual constraints. Experimental results on the WN18RR and FB15K237 datasets demonstrate CP-KGC's ability to enhance the performance of state-of-the-art models.

Figure 3: MRR results of LLMs with different parameter sizes on the WN18RR dataset, including results for predicting head entities and tail entities.

CP-KGC not only demonstrates superior performance within the SimKGC framework but also extends its applicability to other text-based methodologies. The progression of LLMs paves the way for CP-KGC's continued evolution. Additionally, CP-KGC holds the potential to augment data quality in real-world applications. As we move forward, our aim is to optimize open-source LLMs with specialized domain data and deepen our exploration into the confluence of KGs and LLMs in scholarly research.
7 LIMITATIONS
CP-KGC effectively incorporates LLMs into the KGC task, yielding notable enhancements in both performance and efficiency. Nonetheless, relative to other text-based models, this integration demands greater computational resources, given that the inference duration for the large language models significantly exceeds the KGC model's training time. In zero-shot inference, certain LLMs exhibit challenges such as knowledge paucity, repetitive answer generation, and linguistic ambiguity. These issues are potentially attributable to the reduction in model parameters, which subsequently impacts performance. It is worth noting that our proposed constrained prompts are neither exclusive nor optimal. For instance, on the WN18RR dataset our optimization focused solely on synonyms, while on the FB15K237 dataset we merely capitalized on the generalizability of LLMs without fully harnessing their potential. Future work aims to address these limitations in CP-KGC and delve deeper into the synergy between LLMs and KGC.

Figure 4: MRR results of LLMs with different parameter sizes on the FB15K237 dataset.
Dissecting the above sections reveals two tasks: a summarization task and a question-answering synonym task, both of which demand substantial model parameters and outstanding question-answering capabilities. Consequently, in practical inference, the differences among these models are marginal, with smaller-parameter models demonstrating superior performance.

In cases where performance disparities are minimal, the focus shifts to the scalability and computational resource consumption of the method. Currently, GPT3.5 and GPT4 are not open-source, necessitating inference through purchased APIs. Conversely, the LLaMA2 series and Qwen-7B models are open-source and deployable, and models with lower parameters consume fewer computational resources. This disparity holds significant implications for ...

REFERENCES
[1] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen Technical Report. arXiv preprint arXiv:2309.16609 (2023).
[2] Ivana Balažević, Carl Allen, and Timothy M Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590 (2019).
[3] Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 32, suppl_1 (2004), D267–D270.
[4] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems 26 (2013).
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[6] Chen Chen, Yufei Wang, Bing Li, and Kwok-Yan Lam. 2022. Knowledge is flat: A seq2seq generative framework for various knowledge graph completion. In Proceedings of the 29th International Conference on Computational Linguistics. 4005–4017.
[7] Chen Chen, Yufei Wang, Aixin Sun, Bing Li, and Kwok-Yan Lam. 2023. Dipping PLMs sauce: Bridging structure and text for effective knowledge graph completion via conditional soft prompting. In Findings of the Association for Computational Linguistics: ACL 2023. 11489–11503.
[8] Louis Clouatre, Philippe Trempe, Amal Zouaq, and Sarath Chandar. 2020. MLMLM: Link prediction with mean likelihood masked language model. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. 4321–4331.
[9] Daniel Daza, Michael Cochez, and Paul Groth. 2021. Inductive entity representations from text via link prediction. In Proceedings of the Web Conference 2021. 798–808.
[10] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2d knowledge graph embeddings. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[12] Jianfeng Gao and Chin-Yew Lin. 2004. Introduction to the special issue on statistical language modeling. 87–93 pages.
[13] Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y Chang. 2018. Improving sequential recommendation with knowledge-enhanced memory networks. In The 41st international ACM SIGIR conference on research & development in information retrieval. 505–514.
[14] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
[15] Pengcheng Jiang, Shivam Agarwal, Bowen Jin, Xuan Wang, Jimeng Sun, and Jiawei Han. 2023. Text-Augmented Open Knowledge Graph Completion via Pre-Trained Language Models. 11161–11180 pages.
[16] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[17] Bosung Kim, Taesuk Hong, Youngjoong Ko, and Jungyun Seo. 2020. Multi-task learning for knowledge graph completion with pre-trained language models. In Proceedings of the 28th International Conference on Computational Linguistics. 1737–1743.
[18] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021. 3045–3059.
[19] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020. 7871–7880.
[20] Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. 4528–4597.
[21] Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. 2021. Generated knowledge prompting for commonsense reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers. 3154–3169.
[22] Xin Lv, Yankai Lin, Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Peng Li, and Jie Zhou. 2022. Do pre-trained models benefit knowledge graph completion? A reliable evaluation and a reasonable approach. Association for Computational Linguistics.
[23] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems 26 (2013).
[24] George A Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
[25] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2463–2473.
[26] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[27] Apoorv Saxena, Adrian Kochsiek, and Rainer Gemulla. 2022. Sequence-to-sequence knowledge graph completion and question answering. arXiv preprint arXiv:2203.10321 (2022).
[28] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4222–4235.
[29] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2022. Large language models encode clinical knowledge.
[30] Haitian Sun, Tania Bedrax-Weiss, and William W Cohen. 2019. PullNet: Open domain question answering with iterative retrieval on knowledge bases and text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2380–2390.
[31] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In 7th International Conference on Learning Representations, ICLR 2019.
[32] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 conference on empirical methods in natural language processing. 1499–1509.
[33] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International conference on machine learning. PMLR, 2071–2080.
[34] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. 2019. Composition-based multi-relational graph convolutional networks. arXiv preprint arXiv:1911.03082 (2019).
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[36] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. 2153–2162.
[37] Bo Wang, Tao Shen, Guodong Long, Tianyi Zhou, Ying Wang, and Yi Chang. 2021. Structure-augmented text representation learning for efficient knowledge graph completion. In Proceedings of the Web Conference 2021. 1737–1748.
[38] Liang Wang, Wei Zhao, Zhuoyu Wei, and Jingming Liu. 2022. SimKGC: Simple contrastive knowledge graph completion with pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022, Vol. 1: Long Papers. 4281–4294.
[39] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI conference on artificial intelligence, Vol. 28.
[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
[41] Xin Xie, Ningyu Zhang, Zhoubo Li, Shumin Deng, Hui Chen, Feiyu Xiong, Mosha Chen, and Huajun Chen. 2022. From discrimination to generation: Knowledge graph completion with generative transformer. In Companion Proceedings of the Web Conference 2022. 162–165.
[42] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. In 3rd International Conference on Learning Representations, ICLR 2015.
[43] Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193 (2019).
[44] Donghan Yu, Chenguang Zhu, Yiming Yang, and Michael Zeng. 2022. JAKET: Joint pre-training of knowledge graph and language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 11630–11638.
[45] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. In NeurIPS 2022 Foundation Models for Decision Making Workshop.

A INFERENCE AND TRAINING
Inference Constructing constrained prompts using entities and their inherent descriptions, coupled with reasoning through LLMs, constitutes the most resource-intensive phase of our experiment.
Table 7: Forward and backward prediction results on FB15K237 and WN18RR with different LLMs.

FB15K237 WN18RR
Method forward/backward MRR H@1 H@3 H@10 MRR H@1 H@3 H@10
Qwen-7B-chat forward 42.66 33.77 46.05 60.75 71.42 63.56 76.42 85.39
Qwen-7B-chat backward 24.87 16.34 27.02 42.38 62.19 54.98 66.15 75.53
LLaMA2-7B-chat forward 42.4 33.57 45.69 60.44 72.11 64.42 77.31 85.58
LLaMA2-7B-chat backward 24.84 16.36 26.94 42.19 62.52 55.3 66.78 75.14
LLaMA2-13B-chat forward 42.58 33.71 45.73 60.69 71.29 63.15 76.61 85.64
LLaMA2-13B-chat backward 25.0 16.43 27.2 42.5 62.69 55.62 66.78 75.27
GPT3.5-turbo forward 42.11 33.16 45.48 60.22 72.13 64.65 76.61 85.67
GPT3.5-turbo backward 24.8 16.3 26.78 41.96 62.45 55.42 66.5 74.89
GPT4 forward 42.57 33.6 45.95 60.76 71.66 63.69 77.22 85.19
GPT4 backward 24.55 15.7 27.05 41.74 62.54 55.62 66.46 74.86
As model parameters increase, both time and computational resources escalate correspondingly. For the locally deployable Qwen-7B model, we employ AutoGPTQ-based quantization, resulting in an Int4 quantized model. This adaptation ensures minimal compromise on evaluation performance while benefiting from reduced storage demands and enhanced inference speed. Through this quantization process, the GPU memory footprint diminishes from an initial 17GB to 8GB. Notably, the model's inference speed correlates with the volume of content it generates; precise and apt instructions can expedite the generation process, thereby optimizing efficiency.
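For reference, a hedged sketch of loading such an Int4 checkpoint with Hugging Face transformers follows. The model id "Qwen/Qwen-7B-Chat-Int4" and the custom chat helper come from the public Qwen release and may differ across versions, so treat the exact calls as assumptions rather than the authors' script.

```python
# Loading a quantized Qwen-7B-chat checkpoint for low-resource inference.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat-Int4"  # AutoGPTQ Int4 quantization
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

prompt = 'Please summarize the following text in one sentence: "..."'
# Qwen exposes a custom chat() helper via its remote code.
response, _history = model.chat(tokenizer, prompt, history=None)
print(response)
```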
Training During training, we employed the SimKGC model with the additive-margin InfoNCE loss:

$$\mathcal{L} = -\log \frac{e^{(\phi(h,r,t)-\gamma)/\tau}}{e^{(\phi(h,r,t)-\gamma)/\tau} + \sum_{i=1}^{|\mathcal{N}|} e^{\phi(h,r,t_i')/\tau}} \quad (3)$$

The InfoNCE loss learns meaningful feature representations by comparing the similarity between positive and negative samples. The score function $\phi(h, r, t)$ for a candidate triple is a cosine similarity analogous to Equation (1), taking values in $[-1, 1]$. In Equation (3), $\tau$ is a temperature parameter that controls the sharpness of the distribution; a smaller $\tau$ focuses the loss more on hard negatives but also risks overfitting to label noise. The additive margin $\gamma > 0$ encourages increasing the score of correct triples.
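A minimal PyTorch rendering of Equation (3) is sketched below; the shapes, names, and the placeholder values of the margin and temperature are our assumptions.

```python
# Additive-margin InfoNCE loss for one query: the positive logit gets the
# margin subtracted, negatives do not; the denominator includes the positive
# term, exactly as in Equation (3).
import torch

def infonce_with_margin(pos_score: torch.Tensor,
                        neg_scores: torch.Tensor,
                        margin: float = 0.02,
                        tau: float = 0.05) -> torch.Tensor:
    logits = torch.cat([(pos_score - margin).unsqueeze(0), neg_scores]) / tau
    # -log softmax of the positive entry.
    return -torch.log_softmax(logits, dim=0)[0]

if __name__ == "__main__":
    loss = infonce_with_margin(torch.tensor(0.8), torch.tensor([0.3, 0.1, -0.2]))
    print(loss.item())
```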
B MORE ANALYSIS RESULTS
In Table 7, we showcase the forward and backward metrics of various LLMs. Forward metrics pertain to predicting tail entities in the forward direction, whereas backward metrics concern predicting head entities in the reverse direction. For both the WN18RR and FB15K237 datasets, forward metrics consistently outperform backward metrics, with a notable difference between the two.
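As a small illustration of the two evaluation directions, the sketch below forms one forward and one backward query from a triple. The names and the "inverse" relation convention are illustrative assumptions (SimKGC models the backward case with an inverse relation).

```python
# Forward query: predict the tail given (h, r). Backward query: predict the
# head given (t, inverse r). The example triple is invented.
from typing import NamedTuple

class Triple(NamedTuple):
    head: str
    relation: str
    tail: str

def queries(t: Triple) -> dict[str, tuple[str, str]]:
    return {
        "forward": (t.head, t.relation),
        "backward": (t.tail, f"inverse {t.relation}"),
    }

print(queries(Triple("restitution", "hypernym", "restoration")))
```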