Can’t say cant? Measuring and Reasoning of Dark Jargons in Large Language Models
Abstract
Ensuring the resilience of Large Language Models (LLMs) against malicious exploitation is paramount, with recent focus on mitigating offensive responses. Yet, the understanding of cant or dark jargon remains unexplored. This paper introduces a domain-specific Cant dataset and CantCounter evaluation framework, employing Fine-Tuning, Co-Tuning, Data-Diffusion, and Data-Analysis stages. Experiments reveal LLMs, including ChatGPT, are susceptible to cant bypassing filters, with varying recognition accuracy influenced by question types, setups, and prompt clues. Updated models exhibit higher acceptance rates for cant queries. Moreover, LLM reactions differ across domains, e.g., reluctance to engage in racism versus LGBT topics. These findings underscore LLMs’ understanding of cant and reflect training data characteristics and vendor approaches to sensitive topics. Additionally, we assess LLMs’ ability to demonstrate reasoning capabilities. Access to our datasets and code is available at https://github.com/cistineup/CantCounter.
Keywords Large language model, Jargon, Cant language detection, Evaluation system, Slang, Reasoning
1 Introduction
Large Language Models (LLMs), exemplified by ChatGPT[1], redefine information acquisition, communication, and problem-solving[2]. These models are trained on extensive datasets or fine-tuned from pre-existing models, necessitating vast amounts of data. However, LLMs also pose security and ethical concerns as attackers can exploit their generative capabilities for malicious purposes [3]. Such misuse encompasses disinformation dissemination [4], AI-driven crime [5], privacy breaches [6], and social engineering [7]. Despite efforts by regulators like OpenAI to implement content filters [8], there remains a risk of attackers disguising malicious content using “cant” or “dark jargon” - concealed language elements requiring deeper comprehension [9]. LLMs excel in understanding and generating natural language responses, fostering user trust. While research evaluates their efficacy in providing accurate responses [10], little attention has been paid to LLMs’ interaction with cant in specific domains. Prior studies often lack depth in understanding the intricacies of cant [11], especially its varied representations in domains like politics and drugs. In this paper, we investigate LLMs’ ability to recognize and reason about cant, particularly in domains prone to offensive content like politics and drugs. Despite progress in filtering harmful content, attackers can still exploit cant to evade detection. Understanding LLMs’ response to cant in specific domains is essential for addressing emerging security challenges. Additionally, we assess LLMs’ ability to demonstrate reasoning capabilities.
Research Questions. To address the above issues, in this paper, we evaluate the reasoning abilities of current LLMs involving cant or dark jargon from the following four perspectives:
-
1.
RQ1: Do different types of questions help LLM understand the cant?
-
2.
RQ2: Do different question setups and prompt clues help LLM understand cant?
-
3.
RQ3: Do different LLMs have the same understanding of the same cant?
-
4.
RQ4: How well does LLM understand cant in different domains?
CantCounter: Addressing past shortcomings[11], CantCounter is a system crafted to evaluate LLM’s grasp of cant within specific domains. We compile Cant and Scene datasets from various sources to form adversarial texts. These datasets fine-tune the GPT-2 model and generate Scene fragments for assessing LLM comprehension. Co-Tuning methods align the Cant dataset and Scene fragments, while Data-Diffusion techniques augment and refine adversarial text. Employing Type, Sample learning, and Clue approaches enrich our experiments. Finally, Data-Analysis methods systematically evaluate 1.67 million data points. CantCounter is locally deployable and adaptable to any open-world dialogue system. Its replication has both advantages and drawbacks, aiding attackers in bypassing LLM classifiers while facilitating safety filter development. We define “entities” as distinct objects or concepts and “scenes” as related events in specific environments.
Ethical Considerations: CantCounter draws from public datasets such as Reddit [12] and 4chan [13], avoiding direct user interaction. However, its misuse poses risks, despite its benefits in addressing LLM’s challenges. Despite these potential risks, we believe that the benefits of CantCounter far outweigh the risks. LLM has become a hot topic [14], and we need to fully recognize the potential problems of LLM and promote its safer development and application. We caution that this paper may contain sensitive content, including drug and violence-related examples, which could cause discomfort. Comprehensive data is available upon request. We have only open sourced part of the dataset.
Contributions. This paper introduces three key contributions:
-
1.
We present the Cant and Scene datasets, addressing data scarcity in domains like drugs, weapons, and racism, laying groundwork for future large language model assessment.
-
2.
CantCounter, our framework, assesses large language models’ understanding of domain-specific cants through four stages: Fine-Tuning for scene fragment generation, Co-Tuning for cross-matching, Data-Diffusion for text expansion, and Data-Analysis for simplifying complex calculations.
-
3.
Our evaluation of CantCounter reveals its efficacy in bypassing security filters of mainstream dialogue LLMs, providing insights into LLM reasoning within specific domains and guiding future research.
2 Background
2.1 Large Language Model Security Issues
ChatGPT, developed by OpenAI in November 2022 [1], has undergone upgrades and fine-tuning [15] to prevent harmful content generation. However, users can still provoke negative responses by using specific prompts [16]. Researchers are investigating security risks, including the generation of toxic outputs from benign inputs [17]. Recent studies have shown that attackers can bypass detection by encrypting inputs with methods like Caesar ciphers and exploiting language nuances [18]. This paper proposes a Q&A query approach to evaluate LLMs’ reasoning abilities in handling such content.
2.2 Cant
Cant, a specialized language used by social groups for secrecy [19], varies in names like argot [20], slang [21], and secret language across history. While LLMs excel in traditional cant analysis, understanding criminal cant poses challenges. Criminal groups use innocuous terms to hide illegal activities, necessitating mastery for law enforcement [22]. Our study explores cant in politics, drugs, racism, weapons, and LGBT issues. These cants share ambiguity, indirect messaging, and potential for social harm. Political cant conveys biases, drug cant evades regulation, racism cant reinforces biases, weapons cant enables illegal dealings, and LGBT cant discriminates. Mastering these cants is vital for addressing societal and security concerns.
2.3 Question Answering (Q&A) Task
Dialogue systems fall into task-oriented and non-task-oriented categories. Task-oriented systems serve specific purposes like reservations, while non-task-oriented systems engage in free conversation. Examples include ChatGPT, Bard, ERNIE, and Claude, offering services in entertainment, social interaction, and information retrieval [23].Question-answering (Q&A) tasks in NLP evaluate language processing capabilities [24], including reading comprehension and logical reasoning. Q&A formats include abstractive, Yes/No, and Multiple-Choice, each requiring specific evaluation metrics [25]. We employ Zero-shot/One-shot learning for testing.
3 CantCounter
3.1 High-level Idea
We observe that the responses generated by LLMs vary with different cants, allowing adversaries to bypass filters or security restrictions. Thus, understanding how LLMs react to different cants is very important. However, exhaustively trying different cants queries with different scenes across numerous domains to find those capable of bypassing LLM restrictions and generating harmful outputs would be time-consuming and impractical. Therefore, we investigate whether adversaries can independently combine different cants and scenes to generate context that is reasonable and coherent, bypassing LLM filters or restrictions. To this end, we introduce CantCounter, the first evaluation (attack) framework targeting open-world dialogue systems (LLM).
3.2 Threat Model
We adopt a threat model similar to “Why so toxic” [17], targeting deployed dialogue LLMs like ChatGPT. Firstly, the adversary requires scene data different from the target LLM’s training data. Secondly, they interact with the LLM, combining cants and scenarios to extract detectable cants. Finally, they access the victim LLM via CantCounter in a black-box manner, querying it through an API-like interface.
3.3 Dataset
In our study, we extensively gathered cant related to five domains: politics, drugs, racism, weapons, and LGBT. The cant, comprising common and less common usages, holds practical meanings in real life. This Cant dataset forms a robust basis for evaluating the veracity and reliability of LLMs across specific domains. These five areas were chosen to address pressing societal issues impacting fundamental values such as social justice and human rights. Exploration of politics, drugs, racism, weapons, and homosexuality enables LLMs to tackle real-world challenges effectively. While other domains like hacking and fraud are significant, we focused on these due to data availability and processing feasibility, leaving room for future research on sensitive topics.
In constructing the Cant dataset (Figure 2 \scriptsize{2}⃝), we crawled or manually screened multiple sources, including government agency websites [26], online forums like Reddit [12], 4chan [13], and X [27], publicly available datasets from Kaggle [28] and Hugging Face [29], dark web, and public compilations of cant. Multi-source data encompasses various text types closely related to specific domains. CantCounter utilizes information networks [30] to address redundancy challenges between cants, capturing their interdependency.
The Cant dataset covers five domains, totaling 1,778 cants across 187 entities. We randomly selected 53 entities, totaling 692 cants, ensuring even representation across domains and prevalence in the open world. Selected entities and cants were cross-validated with authoritative sources [31, 32, 33, 34, 35] to ensure wide presence and reflection in publicly accessible information sources. Criteria like content relevance and topic specificity guided information selection and filtering, aiming for transparency and consistency. The resulting high-quality data forms the Scene dataset, laying the groundwork for subsequent simulation scene generation models.
During information selection and filtering (Figure 2 \scriptsize{1}⃝), explicit criteria were used to judge relevance and adherence to study definitions. Decisions were reached through participatory discussion to mitigate subjectivity and ensure alignment with research objectives. This rigorous process yields a refined dataset for accurate and relevant analysis.
3.4 Pipeline
The CantCounter pipeline (Figure 2) consists of four stages: Fine-Tuning, Co-Tuning, Data-Diffusion, and Data-Analysis, as detailed below.
Cant is prevalent in the open world, so we aggregate raw text data from various sources to construct Cant and Scene datasets (Section 3.3). Although Cant and Scene datasets provide specific entities and scenes, they may not align well with the domain’s requirements. Therefore, in Stage \scriptsize{3}⃝, we fine-tune GPT-2 using the Scene dataset to build five scene generation models for large-scale scenes, tailored to our specific domains. However, the fine-tuned scenes may not match the entities in the Cant dataset. In Stage \scriptsize{4}⃝, we address this issue by using entities from the Cant dataset to constrain the output of the generated model, ensuring scenes closely relate to the cant entities. Next, we conduct semi-automatic screening of the generated simulation scenes to form a set of Scene fragments. While these fragments contain entities, linking them with specific questions requires a method we have not yet discovered. Hence, in Steps \scriptsize{5}⃝-\scriptsize{6}⃝, we devise the Co-Tuning stage, where Scene fragments cross-match with cants from the Cant dataset to form Fragments. To enable multi-task comparison, we construct detection tests through different combinations of specific domains, question types, learning methods, and prompt clue methods in Stage \scriptsize{7}⃝. This completes and diffuses Fragments to form Q&A-Query datasets.
Finally, in Stages \scriptsize{8}⃝-\scriptsize{9}⃝, Q&A-Queries are sent to the target model API for completion, and a segmented data statistics algorithm is applied to obtain and analyze test results, conducting analyses in the Data-Analysis stage.
3.5 Stage 1: Fine-Tuning
During the fine-tuning stage, we use the Scene dataset to guide GPT-2 in generating tailored scenarios for specific domains. Despite more advanced models like GPT-3.5 and GPT-4 being available, we opt for GPT-2 due to its open-source nature, facilitating better control over training details. The fine-tuning code is publicly accessible for replication. The fine-tuning process is outlined in Algorithm 1.
The Transformer model [36] forms the basis for GPT-2, featuring encoders and decoders with identical modules. GPT-2 employs a partially masked self-attention mechanism and positional coding to understand sequence relationships. It has been successfully applied in various tasks like AI detection and text summarization. Overall, GPT-2’s fine-tuning with the Scene dataset enables the generation of Question-Answer patterns tailored to specific domains, aiding in simulated scene generation tasks.
3.6 Stage 2: Co-Tuning
To solve the problem of many intersecting data processes in CantCounter, we use the Cant dataset and Scene fragments to collaborate and design a Co-Tuning method. Co-Tuning realizes the generation and collaboration of cross-matching and solves the problem of detection data insufficiency. The Cant dataset provides detailed entity information for the generated model. The entities could constrain the generative model and make the Scene Fragments more consistent and coherent in the need for a specific domain during the Co-Tuning stage. In the end, we also manually review the results to ensure the relevance of cants to scenes and the distinctiveness of all scenes corresponding to the same cant.
In this paper, we design formulas in the Co-Tuning to mathematically represent this part of the stage. The generation model is specified as , and it includes five fine-tuned models, which are denoted as , , , , and .
As shown in Figure 3, entity represents the -th entity () in the Cant dataset, and cant represents the -th cant of (). For example, in the case of the politics domain, there are 10 entities used in our experiments, each entity has twenty cants, is taken as . The entity can constrain the fine-tuned model ’s output, and the result of the constraint is the Scene fragment; this part corresponds to Eq. (1). The Scene is . The Scene represents the -th scene fragment (, ) that the -th entity enters into the output of the fine-tuning model ().
(1) |
Eq. (2) denotes the cross-match of Cant and Scene fragment and was saved in .
(2) |
There are orange boxes in the Scene fragment. These orange boxes represent the -generated text containing the Cant dataset’s entities. The function of Eq. 2 is to replace the entities in the Scene fragments with cant in the Cant dataset. As shown in Figure 3, for example, from Scene fragment to Fragment 1. We replace entities in Scene with the cant (), forming Fragment 1. By analogy, we built Fragments in the Co-Tuning stage.
In the Co-Tuning stage, we can obtain scene fragments related to entities in specific domains that have a high degree of context consistency and express various characteristics of the entities in different contexts. At the same time, our fine-tuned model is flexible enough to introduce multiple entities during the generation process and allow scene fragments to describe the relationships among multiple entities. This stage generates diverse scene fragments. While the scene fragments are generated through a generative process, the Scene dataset we provide undergoes manual review to mitigate errors in both the generated content and the language utilized within the experimental environment.
3.7 Stage 3: Data-Diffusion
At this stage, Fragments from the Co-Tuning stage are transformed into Q&A-Queries to enhance interaction with LLM and diversify evaluation. We employ three diffusion methods: two sample learning techniques, three question types, and four prompt clue methods. Each Fragment generates 24 Q&A-Queries. First, we introduce sample learning techniques for zero-shot and one-shot learning transformations of Fragments. Second, we categorize Fragments into Abstractive, Yes/No, and Multiple-choice question types. Finally, prompts are classified into None-tip, Tip-1, Tip-2, and All-tip categories, considering information retrieval difficulty and situational prompting.
The introduction of Data-Diffusion in extended Fragments has significantly increased Q&A queries, providing diverse test cases for evaluating the generation model’s performance comprehensively. This approach promises to establish a diverse database for future research and applications.
3.8 Stage 4: Data-Analysis
As shown \scriptsize{8}⃝ and \scriptsize{9}⃝ in Figure 2, \scriptsize{8}⃝ means sending the data expanded by Data-Diffusion to ChatGPT and other target models. \scriptsize{9}⃝ shows data analysis of the output results of LLMs such as ChatGPT. After completing the Data-Diffusion, we submit the generated Q&A-Queries to the LLM API interface to obtain a large number of data results. These data results are complex and diverse, including the interplay of relationships. Therefore, we devise a data analysis algorithm to yield both numerical and analysis outcomes.
After the Co-Tuning and Data-Diffusion stages, the test data generated by CantCounter is very complex. Therefore, in the Data-Analysis stage, we implement Algorithm 2 to conduct data statistics from various angles. During analysis, when the entity is modified in the Co-Tuning stage (see Figure 3), Algorithm 2 will be called accordingly. We analyze the results based on different tasks. We learn and analyze data features from Question Type Method (See 4.2 QTM) and Sample Learning Method (See 4.3 SLM) based on different question types and samples learning to get ; we analyze the data based on different prompt clues from Prompt Clue Method (See 4.4 PCM) to get . In Algorithm 2, we set the matching conditions, calculate the number of fragments, and obtain and accuracy . At the same time, we set eleven intervals: 0, 1-10, 11-20, …, 91-101 to distinguish different feedbacks and obtain .
As shown in the Algorithm 2, we put Zero-shot learning, One-shot learning, and three tasks together as a loop. We define that in the Abstractive task, the output is in the Zero-shot learning input; the output is in the One-shot learning input. In the Yes/NO task, the output is expressed as in the Zero-shot learning input; the output is expressed as in the One-shot learning input. In the Multiple-choice task, the output is represented as in the Zero-shot learning input; the output is expressed as in the One-shot learning input. The above content has been integrated into our code to form semi-automation.
4 Experimental Design and Results
To explore our research questions, we conducted experiments in CantCounter, outlined sequentially in this section. We examined various question types in RQ1 (Section 4.1), different question setups in RQ2 (Section 4.2), and diverse prompt clues in RQ2 (Section 4.3). Focusing primarily on ChatGPT-3.5 (version gpt-3.5-turbo-0613) due to its convenience and wide usage, similar experiments were also conducted with other language models. All experiments were performed on a server equipped with an RTX 3090 Ti GPU. In this section, we analyze using cant and scene to bypass the LLM filter in the CantCounter framework quantitatively. We conduct open-world query experiments across five domains: politics, drugs, racism, weapons, and LGBT. Initially setting to 101, we match 692 cants to 53 entities, resulting in 69,892 scenes. These undergo Data-Diffusion, expanding to 1,677,408 scenes. This study enables a comprehensive analysis of corpus performance and language changes within specific domains.
4.1 Question Type Method (QTM)
In the Q&A task, we conduct three types of tasks:
-
•
Abstractive Task: Models generate responses freely, without relying on specific information extraction.
-
•
Yes/No Task: Models provide binary responses, “True” or “False,” based solely on the presented question and existing knowledge.
-
•
Multiple-choice Task: Models select the correct answer from a set of options, demonstrating comprehension of semantics and accurate identification.
Table 1 shows that Multiple-choice tasks achieve the highest accuracy (45.38%), while Yes/No tasks have the lowest (22.91%). The discovery that ChatGPT performs well in multiple-choice questions is intriguing. In this task, there are five options (A) to (E), with (A) to (D) relevant to a specific domain, and (E) set as “I don’t know.” “Other” signifies an answer unrelated to these options, with (A) as the correct choice. Figure 5 displays the box plot analysis results. Analyzing the Multiple-choice task results, we find key factors for its success. Firstly, it offers a set of answers with one correct option and distractors, aiding comprehension. Secondly, its structured format simplifies the process of eliminating incorrect options, improving accuracy. Lastly, the inclusion of an “I don’t know” option enhances accuracy in uncertain situations.
We also explore the low accuracy in the Yes/No task. Comparing ChatGPT-3.5’s “False” answers with Multiple-choice task data, we find they often include option (E) and incorrect choices from the Multiple-choice task due to the clarity of options. Additionally, differences in response styles and keyword detection criteria impact ChatGPT-3.5’s performance across Abstractive and Yes/No tasks, where Yes/No tasks restrict responses to “True” or “False.” Overall, our analysis highlights how different Q&A types affect ChatGPT-3.5’s accuracy in specific domains, with Multiple-choice tasks showing higher performance. Further research is needed to improve ChatGPT-3.5’s accuracy and adaptability in these domains.
4.2 Sample Learning Method (SLM)
In our experiments, we explore two sample setups: Zero-shot and One-shot learning.
-
•
Zero-shot learning. No examples are provided in the prompt, which only includes instructions and questions.
-
•
One-shot learning. The prompt includes an example relevant to the discussion, consisting of a sample message and user information.
Zero-shot learning involves a single user message, while One-shot learning processes a sample message and a user message. These methods help understand LLM’s performance in different sample learning approaches and reveal its inference capabilities in information-poor settings. Further investigation uncovers learning patterns and effects of the model in specific domains, with default hyper-parameter settings used to avoid extensive tuning.
In this section, we explore how Zero-shot and One-shot learning methods affect LLM accuracy in recognizing cant scenes for RQ2. Traditionally, One-shot learning often outperforms Zero-shot learning due to more available data [37]. However, our cross-domain analysis, depicted in Figure 6 and reflected in Table 1 (red section), reveals a trend favoring Zero-shot learning overall. We find this trend varies by domain.
In the politics domain, One-shot learning performs better due to ample data and contextual understanding. Conversely, in the LGBT domain, Zero-shot learning outperforms One-shot learning due to limited publicly available examples. One-shot learning aids ChatGPT-3.5 in better contextual comprehension of sensitive topics, but it may also introduce biases, leading to lower overall accuracy in specific domains. Similar analyses across other domains yield consistent results.
4.3 Prompt Clue Method (PCM)
In this part of the study, the purpose of CantCounter is to explore the impact of different clues on LLM recognition and reasoning abilities. To this end, we provide four different clues to experiment with:
-
•
None-tip. Keeps the same as the original prompt and does not add any additional clues.
-
•
Tip 1. Add relevant tip for “None-tip”. For example, when describing Trump’s cant, we can add the clue “Politician” in the political domain to make the prompt more directional.
-
•
Tip 2. Add another relevant tip for “None-tip”. For example, when describing Trump’s cant, add the “United States” prompt in the domain of politics to enrich the prompt content.
-
•
All-tip. Add both Tip 1 and Tip 2 on the basis of “None-tip”; for example, when describing Trump’s cant, add both “politician” and “American” in the political domain to make the prompt more appropriate.
By observing the effects of these different clues on LLMs, CantCounter can assess the fluctuating changes they induce in recognition and reasoning abilities. This study will help further understand the influence of cues on LLM and provide directions for improving its application and performance.
To answer RQ2, Table 1 displays ChatGPT-3.5’s accuracy across five domains using different prompt clues. Generally, more clue-related information improves recognition accuracy, as seen in the political domain where All-tip prompts perform significantly better. However, increasing clues doesn’t always lead to higher accuracy, possibly due to information redundancy or LLM filter triggering. Too many clues may reduce accuracy, as seen in the LGBT domain where Tip 1 prompts were less accurate than none-tip prompts.
Our analysis stresses the importance of a balanced clue selection approach to maximize external information usage without compromising accuracy. Thus, choosing appropriate clues in moderate quantities is key to enhancing ChatGPT-3.5’s domain-specific performance.
QTM | SLM | PCM | |||||||
---|---|---|---|---|---|---|---|---|---|
Domain | A | Y/N | Mc | Zs | Os | NT | T1 | T2 | AllT |
Politics | 26.81 | 22.55 | 50.64 | 42.85 | 57.15 | 19.01 | 24.75 | 25.19 | 31.05 |
Drugs | 21.16 | 22.41 | 56.43 | 55.41 | 44.59 | 17.32 | 27.43 | 25.47 | 29.78 |
Racism | 29.05 | 27.60 | 43.35 | 41.39 | 58.61 | 11.22 | 19.63 | 37.50 | 31.66 |
Weapons | 50.89 | 16.20 | 32.91 | 54.96 | 45.04 | 18.73 | 28.11 | 25.27 | 27.90 |
LGBT | 34.41 | 25.75 | 39.84 | 59.78 | 40.22 | 22.58 | 22.10 | 28.53 | 26.79 |
Total | 31.71 | 22.91 | 45.38 | 52.13 | 47.87 | 19.03 | 24.61 | 27.24 | 29.11 |
Zero-shot learning | One-shot learning | |||||
---|---|---|---|---|---|---|
Acc | Rej | Don’t know | Acc | Rej | Don’t know | |
ChatGPT-3.5[1] | 47.61 | 4.66 | 39.91 | 45.52 | 1.63 | 46.45 |
GPT-4[38] | 27.27 | 0.00 | 70.45 | 50.00 | 0.00 | 34.09 |
Bard[39] | 47.73 | 4.55 | 13.64 | 65.91 | 15.91 | 6.82 |
New Bing[40] | 50.00 | 11.36 | 34.09 | 50.00 | 36.36 | 2.27 |
SparkDesk[41] | 29.55 | 45.45 | 9.09 | 20.45 | 68.18 | 2.27 |
4.4 Comparison with other LLMs
In our study, we examine several LLMs alongside ChatGPT-3.5 to address RQ3, including GPT-4[1], New Bing [40], Bard [39], Claude [42], ERNIE [43], and SparkDesk [41]. While ERNIE is optimized for Chinese content, translating cant prompts may compromise their subtlety and effectiveness. Moreover, ERNIE’s frequent account suspensions hindered extensive trials [44]. Claude’s sensitive content handling also led to account suspensions [42]. Thus, we focus on comparing and validating four other LLMs: GPT-4, Bard, New Bing, and SparkDesk. Table 2 presents ratios of correct answers, refused answers, and “I don’t know” responses. Interestingly, GPT-4 consistently responds in all situations, avoiding refusal to answer. This contrasts with other models that often refuse to respond due to content filtering. GPT-4’s tendency to use “I don’t know” may stem from our controlled comparisons in the QTM and PCM methods, particularly in Multiple-choice scenarios. Conversely, other LLMs tend to refuse to answer, likely due to content categorization by filters and classifiers. SparkDesk exhibits the highest refusal rate, possibly due to overly strict filters. Furthermore, One-shot learning models are more prone to refusal to answer, as they rely on context understanding, potentially triggering filters. These findings offer insights into the performance of these LLMs across different learning tasks, informing future research directions.
4.5 Takeaways
We observe varying accuracy across different Q&A-Query types (RQ1), with Multiple-choice tasks being most accurate and Yes/No tasks the least. In sensitive domains, Zero-shot learning performs better than One-shot learning (RQ2). Increasing prompt clues improves cant identification accuracy (RQ2). More recent LLM models consistently avoid refusing to answer (RQ3), but they are more likely to refuse answering questions related to racism compared to LGBT (RQ4).
5 Conclusion
This paper presents the first comprehensive evaluation of LLM’s reasoning capability using cants or dark jargons. We created two domain-specific datasets: Cant and Scene datasets, and developed an evaluation framework to assess LLM’s reasoning abilities through cant comprehension. We proposed a four-stage strategy - Fine-Tuning, Co-Tuning, Data-Diffusion, and Data-Analysis - to address cross-matching and complex data calculation problems. Our experiments reveal varying comprehension levels of LLM under different question types (Abstractive, Yes/No, Multiple-choice), sample learning methods (Zero-shot/One-shot learning), and prompt clues (None-tip, Tip1, Tip2, All-tip). Additionally, across different domains (Politics, Drugs, Racism, Weapons, LGBT), different LLMs (GPT-3.5, GPT-4, New Bing, Bard, SparkDesk) demonstrate varying refusal rates to answer questions. Our findings provide insights for the security research community into LLM’s reasoning capabilities regarding “cant”, emphasizing the importance of implementing effective safety filters and measures for screening potentially hazardous LLM-generated content.
References
- [1] OpenAI. https://openai.com/chatgpt.
- [2] Partha Pratim Ray. Chatgpt: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 2023.
- [3] Sayak Saha Roy, Krishna Vamsi Naragam, and Shirin Nilizadeh. Generating phishing attacks using chatgpt. arXiv preprint arXiv:2305.05133, 2023.
- [4] Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, et al. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 214–229, 2022.
- [5] Daniel Birks and Joseph Clare. Linking artificial intelligence facilitated academic misconduct to existing prevention frameworks. International Journal for Educational Integrity, 19(1):20, 2023.
- [6] Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023.
- [7] Maanak Gupta, CharanKumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access, 2023.
- [8] OpenAI platform. https://platform.openai.com/docs/guides/moderation/overview.
- [9] Kan Yuan, Haoran Lu, Xiaojing Liao, and XiaoFeng Wang. Reading thieves’ cant: automatically identifying and understanding dark jargons from cybercrime marketplaces. In 27th USENIX Security Symposium (USENIX Security 18), pages 1027–1041, 2018.
- [10] Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen, and Guilin Qi. Can chatgpt replace traditional kbqa models? an in-depth analysis of the question answering performance of the gpt llm family. In International Semantic Web Conference, pages 348–367. Springer, 2023.
- [11] David Rozado. The political biases of chatgpt. Social Sciences, 12(3):148, 2023.
- [12] Reddit. https://www.reddit.com/.
- [13] 4chan community. https://www.4chan.org/.
- [14] Shahab Saquib Sohail, Faiza Farhat, Yassine Himeur, Mohammad Nadeem, Dag Øivind Madsen, Yashbir Singh, Shadi Atalla, and Wathiq Mansoor. Decoding chatgpt: A taxonomy of existing research, current challenges, and possible future directions. Journal of King Saud University-Computer and Information Sciences, page 101675, 2023.
- [15] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [16] Jiongxiao Wang, Zichen Liu, Keun Hee Park, Muhao Chen, and Chaowei Xiao. Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950, 2023.
- [17] Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, and Yang Zhang. Why so toxic? measuring and triggering toxic behavior in open-domain chatbots. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 2659–2673, 2022.
- [18] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
- [19] Zhang Li Feng Xiaojin. Development trend and identification path of drug-related cryptic language under the background of ”internet plus”. Journal of Political Science and Law, 38(107-118), 2021.
- [20] Marc Sourdot. Argot, jargon, jargot. Langue française, (90):13–27, 1991.
- [21] Liang Wu, Fred Morstatter, and Huan Liu. Slangsd: building, expanding and using a sentiment dictionary of slang words for short-text sentiment classification. Language Resources and Evaluation, 52:839–852, 2018.
- [22] Qu Yanbin. Grammar summary of chinese folk secret language (lingo) (part 1). Cultural Journal, (26-33), 2014.
- [23] Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li. Building task-oriented dialogue systems for online shopping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
- [24] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021.
- [25] Anna Rogers, Matt Gardner, and Isabelle Augenstein. Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. ACM Computing Surveys, 55(10):1–45, 2023.
- [26] X Corp. https://drugabuse.com/addiction/list-street-names-drugs/.
- [27] X. https://twitter.com/.
- [28] Kaggle. https://www.kaggle.com.
- [29] Hugging Face. https://huggingface.co/.
- [30] Jun Zhao, Qiben Yan, Xudong Liu, Bo Li, and Guangsheng Zuo. Cyber threat intelligence modeling based on heterogeneous graph convolutional network. In 23rd international symposium on research in attacks, intrusions and defenses (RAID 2020), pages 241–256, 2020.
- [31] EverybodyWiki Bios & Wiki. https://en.everybodywiki.com/List˙of˙nicknames˙of˙Donald˙Trump.
- [32] Defining Wellness. https://definingwellness.com/resources/drug-slang-word-glossary/.
- [33] A Gun Lingo Glossary for Those Unfamiliar With Firearms. https://lifehacker.com/a-gun-lingo-glossary-for-those-unfamiliar-with-firearms-1825427596.
- [34] The Racial Slur Database. http://www.rsdb.org/races.
- [35] Wikipedia. https://en.wikipedia.org/wiki/LGBT˙slang.
- [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [37] Li Zhong and Zilong Wang. A study on robustness and reliability of large language model code generation. arXiv preprint arXiv:2308.10335, 2023.
- [38] GPT-4. https://openai.com/research/gpt-4.
- [39] Bard-Google. https://bard.google.com/.
- [40] NewBing. https://www.bing.com/new.
- [41] SparkDesk Xunfei-Xinghuo. https://xinghuo.xfyun.cn/.
- [42] Google. https://claude.ai/.
- [43] ERNIE. https://yiyan.baidu.com/welcome.
- [44] ERNIE Protection Rule. https://wanhua.baidu.com/talk/protectionrule.