Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making.

Oluyemi Enoch Amujo Rochester Institute of Technology, NY.
Shanchieh Jay Yang Rochester Institute of Technology, NY.
Abstract

Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries, which may be useful for benchmarking prior to fine-tuning for domain-specific downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across diverse domains, including cybersecurity, medicine, and finance, compared to common knowledge queries. This study utilizes a comprehensive methodology to assess foundational models, which includes problem formulation, data analysis, and the development of ThroughCut, a novel outlier detection technique that automatically identifies response throughput outliers based on their conciseness. This methodological rigor enhances the credibility of the presented evaluation frameworks. This study focused on assessing inference time, response length, throughput, quality, and resource utilization and investigated the correlations between these factors. The results indicate that model size and types of prompts used for inference significantly influenced response length and quality. In addition, common prompts, which include various types of queries, generate diverse and inconsistent responses at irregular intervals. In contrast, domain-specific prompts consistently generate concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to enhance the reliability of benchmarking procedures in multidomain AI research.

Index Terms:
LLM, inference, domain-specific, benchmarking, outlier

I Introduction

The increasing efficacy of large language models (LLMs) for dynamic natural language processing (NLP) tasks has attracted significant attention across various disciplines. Notably, LLMs often undergo an intricate development process encompassing training, optimization, fine-tuning, and, in some cases, incorporation of sophisticated techniques like reinforcement learning with human feedback (RLHF), to achieve advanced conversational capabilities. Extensive studies [1], [2] have unveiled the transformative potential of LLMs across various scientific and technological domains. Moreover, their potential to positively impact diverse fields, including software engineering, via process and outcome optimization has been recognized [3]. Further, their impact on the evolution of artificial intelligence and the burgeoning field of digital learning has been well-documented [4][5]. In addition, LLMs are increasingly being acknowledged for their transformative potential in the cybersecurity field, particularly in the area of threat detection [6].

The functions executed by large language models (LLMs) can be fundamentally classified into two categories: input classification and output generation. Tasks falling under the former include text classification [7], named entity recognition (NER) [8], sentiment analysis [9], and question answering [10], [11], among others. Conversely, output generation tasks include multimodal output generation [12], [13], language understanding and translation [14], [15], text summarization [16], and dialog systems [17][18], among others. Notably, output generation tasks distinguish LLMs from traditional machine learning and deep learning methodologies because their objective resembles human-like natural language processing.

A significant milestone in artificial intelligence (AI) is the development of large language foundation models (LLFMs), which are LLMs pre-trained on diverse datasets across various domains. LLFMs represent substantial progress in AI, offering enhanced performance, broader applicability, and efficiency gains [19]. LLFMs serve as fundamental pillars for numerous natural language processing (NLP) tasks and applications, and they excel at tasks such as text generation, summarization, and translation with exceptional accuracy and fluency. A plethora of LLFMs, varying in size and architecture, have been recently released, including models like Alpaca [20] [21], BERT, GPT, DALL-E, LLAMA, BLOOM, and Gemma. These models have been developed for diverse applications, spanning multimodal capabilities, vision-and-language tasks, visual-audio processing, as well as code and image generation, among others [22][23][24]. Because they are trained on large datasets, these models demonstrate remarkable adaptability and require relatively little data to be fine-tuned for specific tasks within distinct domains. Thus, they serve as foundational frameworks for many AI applications.

Focusing on the cybersecurity domain, previous studies have explored the potential of fine-tuning LLFMs for various tasks, such as cyber threat intelligence (CTI) and automation using natural language text [25] and the identification of intricate patterns for automating software vulnerability detection [26], among many others. However, these studies have revealed a notable gap: the LLFM’s baseline cybersecurity knowledge is not evaluated prior to fine-tuning. This gap is exacerbated by the absence of comprehensive pre-fine-tuning assessments, which leads to persistent inaccuracies in benchmarking. In addition, existing evaluation benchmarks in the realm of LLMs and NLP, such as ROUGE [27], Super-NaturalInstructions [28], and MMMLU [29], inadequately cover essential cybersecurity material, which hinders accurate evaluations. Consequently, the misguided adoption of fine-tuning methodologies has resulted in flawed benchmarking outcomes (Author, Year). These issues highlight the urgent need for a more rigorous and comprehensive evaluation framework to rectify these shortcomings and improve the reliability of benchmarking procedures in cybersecurity-AI research.

Therefore, the goal of this study is to evaluate the foundational understanding of special domains, such as cybersecurity, finance, and health/medicine, within large foundational models, which facilitates the development of a fine-tuning framework tailored to cybersecurity tasks. Our motivation stems from the intuition that a large language model (LLM) reacts differently to prompts from different domains. For example, it responds differently to a common query like “What is the meaning of a good life?” compared to a cybersecurity query such as “What is the attacker trying to achieve when running a DLL remotely on the server?”. Leveraging this insight, we address questions regarding the assessment of foundational model comprehension in various domains, such as cybersecurity, finance, and medicine. Furthermore, this study is significant because it establishes a framework for developers and researchers to assess the need for fine-tuning.

Our work and findings. Large language model (LLM) inference requires significant resources, including time, CPU, and memory. Logically, a foundational model trained on a diverse dataset is inclined to retain various forms of knowledge. In our investigation, we observed that their performance in terms of both output and resource usage varies depending on whether they are prompted with a common or domain-specific query.

Summary of our key findings:

  1. 7B models consume more GPU memory than 2B models.
  2. Overall, common prompts tend to produce responses with greater diversity in length and longer inference times.
  3. Across all categories, the 2B model tends to have higher throughput than its 7B counterpart.
  4. There is a strong correlation between inference time and response length compared to the other parameters.
  5. When using semantic textual similarity (STS) with ChatGPT responses as a reference, the 7B model exhibits superior performance compared to 2B.
  6. 7B model with a response length limit of 50 yields responses with higher ROUGE-L scores in all domains compared to any other parameter.

Our contributions are as follows:

  1. We delve into foundation models of diverse sizes, specifically Gemma-2B and Gemma-7B, within both the domain-specific (cybersecurity, health/medical, finance) and common prompt and response generation control settings.
  2. Our analysis compares resource utilization and the quality and length of responses generated by the models.
  3. We introduce a framework to facilitate informed decision-making when fine-tuning large language models (LLMs) in the cybersecurity, medical, and finance domains.
  4. We propose a novel outlier detection technique, termed ThroughCut, which automatically identifies response throughput outliers by assessing their conciseness.

II Literature Review

In this section, we delve into several concepts crucial to this study, such as the large language foundation model (LLFM), LLM inference, and LLM evaluation metrics.

II-A Large Language Foundation Model (LLFM)

For the sake of precision, the term “foundation models” was used within the machine learning paradigm before the emergence of large language models (LLMs), denoting a broader category of AI models that served as a benchmark for user applications [30]. Moreover, an LLFM, alternatively denoted as a Pre-trained Language Model (PLM) [31][32], undergoes training on a comprehensive and diverse dataset to function as a versatile substrate for various applications. After this phase, the model can be fine-tuned on a smaller dataset to perform specific tasks [33]. It is important to acknowledge that the capacity and diversity of the foundation model are contingent on the size of the training dataset [34]. Therefore, while all LLMs can be categorized as foundation models, not all foundation models are large.

One common factor among all large language foundation models (LLFMs) is their development by companies with substantial resources and workforces. These entities include OpenAI, Google Research, MetaAI, and others. For example, GPT-1 was trained using 4.5 GB of text over 30 days on 8 P600 GPUs, equivalent to 1 petaFLOP/s-day, and was publicly released in 2018 [35]. In 2023, GPT-4 underwent training involving both text prediction and reinforcement learning with human feedback (RLHF); the specifics of the data volume and training duration are undisclosed, yet estimated to range from 2.1 to 25 FLOP. In addition, more than 50 experts were engaged solely for adversarial testing, alongside undisclosed others contributing to various facets of the system [36][37]. Llama 3, as described in its model card, was trained using two custom-built 24K GPU clusters, consuming 7.7 million GPU hours and processing over 15 trillion tokens. This dataset is seven times larger than the training dataset for Llama 2 [38][39].

Fine-tuning is essential when adapting a large language model (LLM) to downstream tasks. Several categories of fine-tuning techniques are worth mentioning. First, fine-tuning the pre-trained parameters can be performed in either a full [40] or partial [41] manner, aiming to update the pre-trained parameters to suit a new task. Although this approach has demonstrated remarkable performance, particularly in domain-specific tasks, it is computationally expensive. Second, parameter-efficient fine-tuning (PEFT) involves adding a small number of trainable parameters for fine-tuning. PEFT trains only a small percentage of parameters relative to the pre-trained model, referred to as low-rank, to adapt to a downstream task and incorporates them into the pre-trained model [42, 43, 44]. While this strategy balances performance and resource efficiency better than full fine-tuning, it increases model size. Finally, prompt-based fine-tuning [45, 46] constructs prompts in a more insightful manner to optimize the model’s performance without altering its parameters. In addition, advanced prompt tuning techniques, such as retrieval-augmented generation (RAG), have been introduced and demonstrated to effectively mitigate LLM hallucinations [47]. However, a drawback of prompt tuning is that it requires users to have more experience in creating prompts or crafting RAG pipelines that align with their objectives.
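To make the low-rank idea concrete, the following is a minimal NumPy sketch of a LoRA-style adapter, assuming that the low-rank PEFT scheme referred to above follows the common pattern of a frozen weight plus a trainable product of two small matrices. The layer sizes, the rank `r`, and the function name `lora_forward` are hypothetical and chosen only for illustration; this is not the configuration used in this study.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Apply a frozen weight W plus a trainable low-rank update B @ A."""
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4               # hypothetical layer sizes and rank
W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable low-rank factor
B = np.zeros((d_out, r))                  # trainable, zero-initialised so the update starts at 0

x = rng.normal(size=(2, d_in))            # a batch of two hypothetical inputs
y = lora_forward(x, W, A, B)

full_params = W.size                      # parameters updated by full fine-tuning
lora_params = A.size + B.size             # parameters updated by the adapter
trainable_fraction = lora_params / full_params
W_merged = W + B @ A                      # the update can be folded into W after training
```

With these sizes the adapter trains 768 parameters against 8192 in the full weight matrix. Kept separate, the adapter adds those parameters to the model; folded into `W`, the weight matrix keeps its original shape.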

In general, the perspective on LLM fine-tuning may vary depending on the researcher’s objectives. A large organization with abundant computing resources may prioritize high-accuracy downstream tasks or specific tasks. Conversely, for a small organization, institution, or individual researcher with limited resources, the objectives may include reducing fine-tuning computational overhead while enhancing overall performance.

II-B LLM Text Generation and Inference

Large language models (LLMs) excel at comprehending human language and extracting insights from corpora of training data. Recent advances in this domain have reached a level of sophistication where distinguishing between machine-generated text and human-authored text has become increasingly challenging, despite numerous investigations [48] [49]. The text generation task is formally delineated as stated in [50].

$$y = f_{M}(x, P) \tag{1}$$

Here, the text generation model $f_{M}$ produces the output text $y$ given the input data $x$ that satisfies some special set of properties $P$. The property may be that the input is text, image, tabular data, a knowledge base, etc.

During inference, the text generation model $M$, typically the decoder, produces an output sequence of tokens $y_i$ conditionally on some information $x$, referred to as the prompt, where each $y_i$ represents a token (a word or a subword). Formally, given prompt tokens $x_i \in X$, where $i = 1, 2, \ldots, n$, to a model $M$, the objective is to predict the tokens $y_j \in Y$, where $j = 1, 2, \ldots, k$. The conditional probability denotes this as follows:

$$P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid x, y_1) \cdots P(y_k \mid x, y_1, \ldots, y_{k-1}) \tag{2}$$

In this context, the model operates in an autoregressive manner [51], generating each $y_i$ sequentially and appending it to the input sequence to predict the subsequent token. Consequently, the data structure housing the computational weight mirrors that of a lower triangular matrix [52][53]. Various strategies have been proposed to optimize the value of $k$, resulting in a lengthy and coherent passage [54]. However, it is essential to maintain a balance between response length and throughput, especially concerning resource-constrained devices.
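The factorization in Eq. 2 and the autoregressive loop can be sketched in a few lines of Python. The vocabulary and the `next_token_probs` rule below are hypothetical stand-ins for a real decoder, introduced only to show how each token is conditioned on the prompt and on the tokens generated so far.

```python
import math

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["<eos>", "a", "b"]

def next_token_probs(prompt, generated):
    """Toy stand-in for P(y_i | x, y_1..y_{i-1}) over VOCAB."""
    # Toy rule: the chance of stopping grows with the number of generated tokens.
    p_eos = min(0.2 + 0.2 * len(generated), 1.0)
    rest = (1.0 - p_eos) / 2
    return {"<eos>": p_eos, "a": rest, "b": rest}

def sequence_log_prob(prompt, tokens):
    """Chain rule of Eq. 2: log P(y|x) is the sum of the conditional log-probabilities."""
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = next_token_probs(prompt, tokens[:i])
        total += math.log(probs[tok])
    return total

def greedy_decode(prompt, max_new_tokens=10):
    """Generate one token at a time, feeding each token back as context."""
    generated = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(prompt, generated)
        tok = max(probs, key=probs.get)   # pick the most likely next token
        if tok == "<eos>":
            break
        generated.append(tok)
    return generated
```

The `max_new_tokens` argument plays the role of an upper limit on $k$: a larger limit allows longer responses at the cost of more sequential decoding steps.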

II-C Google Gemma Architecture

In this subsection, our focus is on Gemma, which serves as a case study for large language foundation models (LLFMs). Google DeepMind released the model in two variations, one with 2 billion parameters and another with 7 billion parameters, built from the same research and technology as the Gemini models [55]. Although the specific architecture remains undisclosed in the documentation [56], essential components are illustrated in the architecture depicted in Figure 1.

Figure 1: A Gemma-2B architecture showing the salient components

The model under consideration constitutes an advanced transformer-decoder [57] and employs sequence-to-sequence learning techniques  [58]. These models have been extensively examined in previous studies, and their autoregressive nature is primarily focused on output generation rather than classification tasks. Essential parameters pertinent to the model are delineated in Figure 2.

Figure 2: Salient parameters of Gemma model. Source: [56]

Additionally, an essential aspect worth noting regarding Gemma is the Rotary Position Embedding (RoPE) [59]. RoPE assigns a position to each token in the input sequence, ensuring accurate positioning in the output sequence. This method considers valuable properties such as sequence length flexibility and decaying inter-token dependency with increasing relative distances. Previous models like BERT [60], GPT [35], and ELECTRA [61] implemented absolute position dependency. In contrast, models like XLNet [62], DeBERTa [63], and Music Transformer [64] utilized relative position dependency. RoPE integrates both techniques by encoding the former with a rotation matrix and explicitly incorporating the latter in the self-attention formulation.
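The relative-position property of RoPE can be checked numerically. The sketch below rotates consecutive feature pairs by position-dependent angles and shows that the dot product of a rotated query and key depends only on the offset between their positions. The interleaved pair layout, the `base` value of 10000, and the vector size are assumptions made for illustration and are not taken from the Gemma implementation.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to position."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # 2-D rotation, first coordinate
    out[..., 1::2] = x_even * sin + x_odd * cos  # 2-D rotation, second coordinate
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)   # hypothetical query and key vectors

score_near = rope(q, 5) @ rope(k, 2)            # positions 5 and 2, offset 3
score_far = rope(q, 105) @ rope(k, 102)         # positions 105 and 102, offset 3
```

Both scores agree up to floating-point error because only the offset of 3 enters the product, even though each vector was encoded with its absolute position.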

III Methodology

In this section, we initially delineate the problem and present our hypotheses regarding its nature. Subsequently, we expound upon the methodology employed for the proposed frameworks and align them with the study objectives. The framework is delineated in two iterations: the conceptual and implementation frameworks. In addition, we investigate the constituent elements of these frameworks and preprocess the datasets preceding inference.

III-A Problem Formulation

Our hypothesis proposes that when an expert addresses a query in a particular domain, they expend a level of cognitive effort that may diverge from that required for a common question. By analogy, this indicates a correlation between the model’s domain expertise and inference overhead, manifested in the form of the time and computational resources consumed during the process. Formally, this relationship can be expressed as a function of five variables:

$$O = f(t, g, x, y, q_{y}) \tag{3}$$

where $x$ represents the prompt length, $y$ denotes the response length, $O$ signifies the inference overhead, $t$ denotes the inference time in seconds, $g$ stands for the maximum GPU usage, and $q_{y}$ indicates the quality of the response.

III-B Proposed Framework

A framework to assess a large foundational model’s comprehension of different domains is presented in Fig. 3. This is a three-dimensional representation of the problem domain, model size, and response control. In terms of the problem domain, we examine cybersecurity, medical, finance, and common questions. The model size includes Gemma-2B and Gemma-7B for tasks related to text generation inference. The response output is either controlled (limited to approximately 50 words) or unrestricted.

Figure 3: A framework for a large foundational model assessment about a domain understanding

Figure 4: A framework for the implementation of a large foundational model assessment about a domain understanding

We present the implementation framework in Fig. 4. Alongside the predictive models Gemma-2B and Gemma-7B, ChatGPT serves as a referential model against which we evaluate the quality of the predictive models.

To ensure precision, the experiment unfolded in sixteen distinct phases (4 x 4), each corresponding to an output line (blue and red which comprises F, M, and C) from the models as delineated in the framework. Pairing the output from each model, we configured each to produce responses containing 50 or unlimited words. Subsequently, we compared the predictive model output from each configuration with that of the referential model.

Throughout each inference phase, we recorded the inference time, response word length, maximum GPU consumption, and prompt word length. In addition, we computed the inference throughput and latency and assessed response quality using the ROUGE-L and semantic textual similarity (STS) metrics.
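As an illustration of how the recorded quantities could be turned into throughput and latency, the sketch below assumes that throughput is measured in response words per second and latency in seconds per response word. These definitions, the `InferenceRecord` class, and the sample values are assumptions for illustration; the formulas are not spelled out in this section.

```python
import statistics
from dataclasses import dataclass

@dataclass
class InferenceRecord:
    """One inference run; field names are hypothetical."""
    prompt_words: int
    response_words: int
    inference_time_s: float
    max_gpu_mem_mb: float

    def __post_init__(self):
        if self.inference_time_s <= 0:
            raise ValueError("inference time must be positive")

    @property
    def throughput(self):
        """Assumed definition: response words generated per second."""
        return self.response_words / self.inference_time_s

    @property
    def latency(self):
        """Assumed definition: seconds spent per response word."""
        if self.response_words == 0:
            return float("inf")
        return self.inference_time_s / self.response_words

def mean_std(values):
    """Return (mean, sample standard deviation), as reported in a mu ± sigma column."""
    return statistics.fmean(values), statistics.stdev(values)

# Hypothetical measurements, not results from this study.
records = [
    InferenceRecord(12, 36, 3.0, 2766.0),
    InferenceRecord(10, 20, 2.0, 2766.0),
]
throughput_mean, throughput_std = mean_std([r.throughput for r in records])
```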

III-C Data Analysis

In this study, we determine the statistical significance of the inference parameters and investigate the implications of the observed correlations between two variables. The correlation coefficient (typically Pearson’s $r$) quantifies the linear relationship between two variables $X$ and $Y$ as follows:

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^{2} \sum (Y_i - \bar{Y})^{2}}} \tag{4}$$

A value of $r$ equal to 1 indicates perfect positive correlation, whereas a value of $-1$ indicates perfect negative correlation. A value of 0 implies no linear correlation.
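Eq. 4 translates directly into code. The following sketch computes Pearson’s $r$ with the standard library only; the function name `pearson_r` and the sample data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (Eq. 4)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical inference times (s) and response lengths (words).
times = [1.0, 2.0, 3.0, 4.0]
lengths = [12, 25, 35, 50]
r_time_length = pearson_r(times, lengths)
```

A value of `r_time_length` close to 1 would indicate that longer responses take proportionally longer to generate.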

III-D Formulation of Outlier Technique

Given a set of data points that represent the correlation between two variables, we define upper and lower boundaries whose slopes, $m_{\text{max}}$ and $m_{\text{min}}$, can be determined. The margin between these bounds represents the concentration area of the data points, with the upper and lower slope margins taken at a specific value derived from Eqs. 5-11. Outliers are identified as data points falling below the lower boundary.

$$m_{\text{central}} = \frac{y_{2} - y_{1}}{x_{2} - x_{1}} \tag{5}$$
$$\theta_{\text{central}} = \arctan(m_{\text{central}}) \tag{6}$$
$$\theta_{\text{step}} = \operatorname{rad}\big((\mu_{x} + 1.96\,\sigma_{x})\,\lambda\big) \tag{7}$$
$$\theta_{\text{max}} = \theta_{\text{central}} + \theta_{\text{step}} \tag{8}$$
$$\theta_{\text{min}} = \theta_{\text{central}} - \theta_{\text{step}} \tag{9}$$
$$m_{\text{max}} = \tan(\theta_{\text{max}}) \tag{10}$$
$$m_{\text{min}} = \tan(\theta_{\text{min}}) \tag{11}$$

First, a straight line is plotted from (0, 0). The slope ($m_{\text{central}}$) and the angle in radians ($\theta_{\text{central}}$) are calculated using Eqs. 5 and 6, respectively. The subsequent angle, $\theta_{\text{step}}$, in radians, is computed in Eq. 7 using the 95% confidence interval between the max-line and the central line and then to the min-line, where $\lambda$ is the tuning parameter for angle adjustment, and $\mu_{x}$ and $\sigma_{x}$ are the mean and standard deviation of the interval, respectively. Furthermore, $\theta_{\text{max}}$ (Eq. 8) and $\theta_{\text{min}}$ (Eq. 9) are determined using $\theta_{\text{central}}$ and $\theta_{\text{step}}$, which are then used to compute $m_{\text{max}}$ (Eq. 10) and $m_{\text{min}}$ (Eq. 11), respectively.
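The steps in Eqs. 5-11 can be sketched as follows. Several details are not fixed by the text, so this sketch makes loudly labeled assumptions: the central line ends at the point (max x, max y) unless `end_point` is given, $\mu_{x}$ and $\sigma_{x}$ are taken as the mean and sample standard deviation of the x values, and `rad` is read as a degrees-to-radians conversion. The function names and the sample data are hypothetical, and this should be read as one possible interpretation rather than the reference implementation of ThroughCut.

```python
import math
import statistics

def throughcut_bounds(xs, ys, lam, end_point=None):
    """Return (m_min, m_central, m_max) following Eqs. 5-11."""
    x1, y1 = 0.0, 0.0                                   # central line starts at the origin
    # ASSUMPTION: the line ends at (max x, max y) unless an end point is supplied.
    x2, y2 = end_point if end_point is not None else (max(xs), max(ys))
    m_central = (y2 - y1) / (x2 - x1)                   # Eq. 5
    theta_central = math.atan(m_central)                # Eq. 6
    # ASSUMPTION: mu_x and sigma_x are statistics of the x values.
    mu_x = statistics.fmean(xs)
    sigma_x = statistics.stdev(xs)
    theta_step = math.radians((mu_x + 1.96 * sigma_x) * lam)  # Eq. 7
    theta_max = theta_central + theta_step              # Eq. 8
    theta_min = theta_central - theta_step              # Eq. 9
    if not (0 < theta_min and theta_max < math.pi / 2):
        raise ValueError("lam is too large: bounds leave the first quadrant")
    return math.tan(theta_min), m_central, math.tan(theta_max)  # Eqs. 11, 5, 10

def throughcut_outliers(xs, ys, lam, end_point=None):
    """Indices of points that fall below the lower boundary y = m_min * x."""
    m_min, _, _ = throughcut_bounds(xs, ys, lam, end_point)
    return [i for i, (x, y) in enumerate(zip(xs, ys)) if y < m_min * x]

# Hypothetical data: the fourth point lies far below the general trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 1, 10]
outlier_indices = throughcut_outliers(xs, ys, lam=1.0)
```

The tuning parameter `lam` widens or narrows the band around the central line: a smaller value gives a tighter band and therefore flags more points below the lower boundary.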

Refer to caption
(a) 2B/Common/502B/Common/\approx 502 italic_B / italic_C italic_o italic_m italic_m italic_o italic_n / ≈ 50;
R=0.9761𝑅0.9761R=0.9761italic_R = 0.9761; No.ofOutlier=8formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟8No.ofOutlier=8italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 8
Refer to caption
(b) 7B/Common/507B/Common/\approx 507 italic_B / italic_C italic_o italic_m italic_m italic_o italic_n / ≈ 50;
R=0.9585𝑅0.9585R=0.9585italic_R = 0.9585; No.ofOutlier=11formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟11No.ofOutlier=11italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 11
Refer to caption
(c) 2B/Common/2𝐵𝐶𝑜𝑚𝑚𝑜𝑛2B/Common/\infty2 italic_B / italic_C italic_o italic_m italic_m italic_o italic_n / ∞;
R=0.9813𝑅0.9813R=0.9813italic_R = 0.9813; No.ofOutlier=28formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟28No.ofOutlier=28italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 28
Refer to caption
(d) 7B/Common/7𝐵𝐶𝑜𝑚𝑚𝑜𝑛7B/Common/\infty7 italic_B / italic_C italic_o italic_m italic_m italic_o italic_n / ∞;
R=0.9786𝑅0.9786R=0.9786italic_R = 0.9786; No.ofOutlier=38formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟38No.ofOutlier=38italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 38
Refer to caption
(e) 2B/Finance/502B/Finance/\approx 502 italic_B / italic_F italic_i italic_n italic_a italic_n italic_c italic_e / ≈ 50;
R=0.9587𝑅0.9587R=0.9587italic_R = 0.9587; No.ofOutlier=1formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟1No.ofOutlier=1italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 1
Refer to caption
(f) 7B/Finance/507B/Finance/\approx 507 italic_B / italic_F italic_i italic_n italic_a italic_n italic_c italic_e / ≈ 50;
R=0.9587𝑅0.9587R=0.9587italic_R = 0.9587; No.ofOutlier=3formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟3No.ofOutlier=3italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 3
Refer to caption
(g) 2B/Finance/2𝐵𝐹𝑖𝑛𝑎𝑛𝑐𝑒2B/Finance/\infty2 italic_B / italic_F italic_i italic_n italic_a italic_n italic_c italic_e / ∞;
R=0.9841𝑅0.9841R=0.9841italic_R = 0.9841; No.ofOutlier=7formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟7No.ofOutlier=7italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 7
Refer to caption
(h) 7B/Finance/7𝐵𝐹𝑖𝑛𝑎𝑛𝑐𝑒7B/Finance/\infty7 italic_B / italic_F italic_i italic_n italic_a italic_n italic_c italic_e / ∞;
R=0.9817𝑅0.9817R=0.9817italic_R = 0.9817; No.ofOutlier=12formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟12No.ofOutlier=12italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 12
Refer to caption
(i) 2B/Medicalhealth/502B/Medical-health/\approx 502 italic_B / italic_M italic_e italic_d italic_i italic_c italic_a italic_l - italic_h italic_e italic_a italic_l italic_t italic_h / ≈ 50
R=0.9654𝑅0.9654R=0.9654italic_R = 0.9654; No.ofOutlier=4formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟4No.ofOutlier=4italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 4
Refer to caption
(j) 7B/Medicalhealth/507B/Medical-health/\approx 507 italic_B / italic_M italic_e italic_d italic_i italic_c italic_a italic_l - italic_h italic_e italic_a italic_l italic_t italic_h / ≈ 50;
R=0.9718𝑅0.9718R=0.9718italic_R = 0.9718; No.ofOutlier=5formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟5No.ofOutlier=5italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 5
Refer to caption
(k) 2B/Medicalhealth/2𝐵𝑀𝑒𝑑𝑖𝑐𝑎𝑙𝑒𝑎𝑙𝑡2B/Medical-health/\infty2 italic_B / italic_M italic_e italic_d italic_i italic_c italic_a italic_l - italic_h italic_e italic_a italic_l italic_t italic_h / ∞;
R=0.9931𝑅0.9931R=0.9931italic_R = 0.9931; No.ofOutlier=10formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟10No.ofOutlier=10italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 10
Refer to caption
(l) 7B/Medicalhealth/7𝐵𝑀𝑒𝑑𝑖𝑐𝑎𝑙𝑒𝑎𝑙𝑡7B/Medical-health/\infty7 italic_B / italic_M italic_e italic_d italic_i italic_c italic_a italic_l - italic_h italic_e italic_a italic_l italic_t italic_h / ∞;
R=0.9937𝑅0.9937R=0.9937italic_R = 0.9937; No.ofOutlier=4formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟4No.ofOutlier=4italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 4
Refer to caption
(m) 2B/Cybersecury/502B/Cybersecury/\approx 502 italic_B / italic_C italic_y italic_b italic_e italic_r italic_s italic_e italic_c italic_u italic_r italic_y / ≈ 50;
R=0.8568𝑅0.8568R=0.8568italic_R = 0.8568; No.ofOutlier=4formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟4No.ofOutlier=4italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 4
Refer to caption
(n) 7B/Cybersecury/507B/Cybersecury/\approx 507 italic_B / italic_C italic_y italic_b italic_e italic_r italic_s italic_e italic_c italic_u italic_r italic_y / ≈ 50;
R=0.9368𝑅0.9368R=0.9368italic_R = 0.9368; No.ofOutlier=4formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟4No.ofOutlier=4italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 4
Refer to caption
(o) 2B/Common/2𝐵𝐶𝑜𝑚𝑚𝑜𝑛2B/Common/\infty2 italic_B / italic_C italic_o italic_m italic_m italic_o italic_n / ∞;
R=0.8568𝑅0.8568R=0.8568italic_R = 0.8568; No.ofOutlier=2formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟2No.ofOutlier=2italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 2
Refer to caption
(p) 7B/Common/7𝐵𝐶𝑜𝑚𝑚𝑜𝑛7B/Common/\infty7 italic_B / italic_C italic_o italic_m italic_m italic_o italic_n / ∞;
R=0.9368𝑅0.9368R=0.9368italic_R = 0.9368; No.ofOutlier=0formulae-sequence𝑁𝑜𝑜𝑓𝑂𝑢𝑡𝑙𝑖𝑒𝑟0No.ofOutlier=0italic_N italic_o . italic_o italic_f italic_O italic_u italic_t italic_l italic_i italic_e italic_r = 0
Figure 5: Inference time (s) versus response word length, showing the correlation coefficient ($R$), central line, upper and lower bounds, and outliers. Common prompts had the highest number of outliers in all cases compared to the domain-specific prompts.
TABLE I: Outcomes of inference and ablation considering model size, domain, and response restriction.

| Model Size | Domain | Response Restriction | Throughput ($\mu \pm \sigma$) | Latency ($\mu \pm \sigma$) | GPU Mem (MB) ($\mu \pm \sigma$) | Response Length ($\mu \pm \sigma$) | Quality: ROUGE-L ($\mu \pm \sigma$) | Quality: STS ($\mu \pm \sigma$) |
|---|---|---|---|---|---|---|---|---|
| 2B | Common | $\approx$50 | 12.0 ± 2.88 | 0.09 ± 0.08 | 2766 (4.43%) | 35.31 ± 40.68 | 0.25 ± 0.22 | 0.74 ± 0.38 |
| 2B | Cybersecurity | $\approx$50 | 11.26 ± 2.76 | 0.09 ± 0.02 | 3172 (5.08%) | 25.72 ± 11.47 | 0.29 ± 0.17 | 0.74 ± 0.25 |
| 2B | Medical | $\approx$50 | 12.08 ± 2.63 | 0.08 ± 0.02 | 2766 (4.43%) | 28.94 ± 27.36 | 0.26 ± 0.2 | 0.72 ± 0.4 |
| 2B | Finance | $\approx$50 | 12.19 ± 2.77 | 0.08 ± 0.02 | 2770 (4.43%) | 34.69 ± 36.16 | 0.29 ± 0.2 | 0.78 ± 0.27 |
| 7B | Common | $\approx$50 | 7.58 ± 1.75 | 0.13 ± 0.04 | 8760 (14.0%) | 35.07 ± 33.51 | 0.30 ± 0.25 | 0.78 ± 0.36 |
| 7B | Cybersecurity | $\approx$50 | 6.29 ± 0.95 | 0.16 ± 0.03 | 8906 (14.26%) | 35.71 ± 17.13 | 0.34 ± 0.23 | 0.68 ± 0.26 |
| 7B | Medical | $\approx$50 | 7.73 ± 1.22 | 0.13 ± 0.02 | 7121 (14.01%) | 37.71 ± 26.29 | 0.32 ± 0.2 | 0.82 ± 0.25 |
| 7B | Finance | $\approx$50 | 7.98 ± 1.65 | 0.13 ± 0.03 | 8750 (14.0%) | 38.41 ± 31.55 | 0.33 ± 0.21 | 0.83 ± 0.21 |
| 2B | Common | $\infty$ | 11.79 ± 2.8 | 0.09 ± 0.08 | 2792 (4.47%) | 207.75 ± 251.57 | 0.19 ± 0.19 | 0.69 ± 0.49 |
| 2B | Cybersecurity | $\infty$ | 11.91 ± 2.44 | 0.13 ± 0.03 | 3564 (5.71%) | 268.27 ± 91.39 | 0.21 ± 0.12 | 0.72 ± 0.26 |
| 2B | Medical | $\infty$ | 12.37 ± 2.28 | 0.08 ± 0.02 | 2792 (4.47%) | 128.23 ± 220.98 | 0.21 ± 0.19 | 0.71 ± 0.39 |
| 2B | Finance | $\infty$ | 12.33 ± 2.15 | 0.08 ± 0.02 | 2790 (4.47%) | 248.56 ± 248.27 | 0.21 ± 0.13 | 0.78 ± 0.22 |
| 7B | Common | $\infty$ | 7.86 ± 1.83 | 0.13 ± 0.04 | 8800 (14.09%) | 215.5 ± 247.28 | 0.21 ± 0.22 | 0.72 ± 0.5 |
| 7B | Cybersecurity | $\infty$ | 7.61 ± 1.56 | 0.09 ± 0.02 | 8764 (14.03%) | 206.79 ± 186.24 | 0.21 ± 0.12 | 0.72 ± 0.26 |
| 7B | Medical | $\infty$ | 7.9 ± 1.11 | 0.13 ± 0.02 | 71214 (14.0%) | 176.46 ± 222.19 | 0.21 ± 0.19 | 0.8 ± 0.24 |
| 7B | Finance | $\infty$ | 8.16 ± 1.4 | 0.12 ± 0.03 | 8750 (14.0%) | 265.9 ± 223.71 | 0.23 ± 0.14 | 0.81 ± 0.2 |

III-E Dataset

We examined two categories of datasets, Common and domain-specific, each comprising 2019 instances. The Common dataset was retrieved from GLUE (General Language Understanding Evaluation) [65]. To ensure fairness, we excluded instances containing toxic content, since the model could refuse to answer them, which would distort the response length. Furthermore, we ensured that the prompts consisted of commonplace questions that anyone could answer without particular expertise.

The domain-specific datasets cover three domains: cybersecurity, finance, and medicine. The cybersecurity dataset was obtained from [66]. It consists primarily of attack procedures originally curated from MITRE ATT&CK [67], a globally accessible knowledge base of adversary tactics and techniques derived from real-world observations. Given an attack procedure as a prompt, we expect the model to predict what the attacker aims to achieve. The finance-oriented dataset was acquired from [68]; it combines Stanford’s Alpaca and FiQA datasets, which have been used to train and fine-tune diverse models tailored for financial applications. Finally, the medical-oriented dataset was sourced from [69], where the original author used it to train an AI medical chatbot.

TABLE II: Outlier analysis (Figure 5). The values highlighted in red indicate that the outliers are below the overall values specified in Table I, and the values in blue indicate that they exceed the standard values.

| Model Size | Domain | Response Restriction | Inf_time-response_len Correlation ($R$) | Max Slope ($m_{\text{max}}$) | Central Slope ($m_{\text{central}}$) | Min Slope ($m_{\text{min}}$) | Max angle in rad ($\theta_{\text{max}}$) | Central angle in rad ($\theta_{\text{central}}$) | Min angle in rad ($\theta_{\text{min}}$) | No. of Outliers | Inf_time (s) ($\mu \pm \sigma$) | resp_word_len ($\mu \pm \sigma$) | prompt_word_len ($\mu \pm \sigma$) | Latency ($\mu \pm \sigma$) | Throughput ($\mu \pm \sigma$) | ROUGE-L ($\mu \pm \sigma$) | STS ($\mu \pm \sigma$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2B | Common | $\approx$50 | 0.9761 | 15.14 | 11.66 | 5.99 | 1.50 | 1.49 | 1.41 | 8 | 2.3±0.9 | 12.5±7.3 | 17.1±2.2 | 0.4±0.5 | 5.2±1.9 | 0.3±0.2 | 0.7±0.2 |
| 2B | Cybersecurity | $\approx$50 | 0.8568 | 13.56 | 10.97 | 8.22 | 1.51 | 1.48 | 1.45 | 4 | 3.46±0.29 | 24.50±2.65 | 38.75±4.86 | 0.14±0.00 | 7.08±0.21 | 0.26±0.07 | 0.73±0.12 |
| 2B | Medical | $\approx$50 | 0.9654 | 16.23 | 11.54 | 7.65 | 1.51 | 1.48 | 1.44 | 4 | 3.06±1.69 | 22.00±12.36 | 18.75±1.71 | 0.14±0.00 | 7.17±0.14 | 0.25±0.12 | 0.84±0.04 |
| 2B | Finance | $\approx$50 | 0.9674 | 15.1 | 11.87 | 5.91 | 1.50 | 1.49 | 1.4 | 1 | 1.99±0.0 | 8±0.0 | 31±0.0 | 0.25±0.0 | 4.03±0.0 | 0.06±0.0 | 0.52±0.0 |
| 7B | Common | $\approx$50 | 0.9585 | 9.61 | 7.65 | 3.88 | 1.47 | 1.44 | 1.32 | 11 | 5.97±3.41 | 21.09±12.32 | 17.55±1.86 | 0.29±0.02 | 3.51±0.23 | 0.06±0.05 | 0.08±0.11 |
| 7B | Cybersecurity | $\approx$50 | 0.9368 | 8.56 | 6.28 | 4.26 | 1.45 | 1.41 | 1.34 | 4 | 9.02±6.00 | 35.50±22.41 | 40.75±6.18 | 0.25±0.02 | 3.99±0.31 | 0.19±0.12 | 0.59±0.17 |
| 7B | Medical | $\approx$50 | 0.9718 | 9.66 | 7.83 | 4.38 | 1.47 | 1.44 | 1.35 | 5 | 3.83±1.73 | 16.80±6.72 | 20.40±1.82 | 0.22±0.01 | 4.46±0.25 | 0.24±0.05 | 0.83±0.05 |
| 7B | Finance | $\approx$50 | 0.9587 | 10.18 | 8.08 | 4.07 | 1.47 | 1.45 | 1.33 | 3 | 9.61±3.79 | 33.00±11.43 | 20.67±3.86 | 0.29±0.02 | 3.51±0.27 | 0.36±0.07 | 0.86±0.03 |
| 2B | Common | $\infty$ | 0.9813 | 14.91 | 11.49 | 7.97 | 1.5 | 1.48 | 1.45 | 28 | 13.09±10.34 | 85.89±72.66 | 11.71±4.85 | 0.23±0.32 | 6.15±1.69 | 0.27±0.16 | 0.73±0.19 |
| 2B | Cybersecurity | $\infty$ | 0.8568 | 14.22 | 11.39 | 8.22 | 1.5 | 1.48 | 1.45 | 2 | 65.79±57.86 | 225.00±32.53 | 32.00±4.24 | 0.28±0.22 | 5.22±4.10 | 0.04±0.05 | 0.04±0.01 |
| 2B | Medical | $\infty$ | 0.9931 | 14.65 | 11.55 | 8.62 | 1.5 | 1.48 | 1.46 | 10 | 17.13±10.97 | 142.30±92.38 | 13.50±4.35 | 0.12±0.01 | 8.22±0.41 | 0.24±0.06 | 0.78±0.12 |
| 2B | Finance | $\infty$ | 0.9841 | 15.1 | 12.8 | 8.15 | 1.5 | 1.49 | 1.45 | 7 | 17.01±14.83 | 123.43±109.33 | 15.29±7.02 | 0.14±0.01 | 7.19±0.62 | 0.15±0.04 | 0.61±0.14 |
| 7B | Common | $\infty$ | 0.9786 | 10.18 | 7.81 | 5.38 | 1.47 | 1.44 | 1.39 | 38 | 13.60±14.72 | 64.08±76.32 | 11.42±3.84 | 0.23±0.07 | 4.52±0.89 | 0.31±0.19 | 0.76±0.17 |
| 7B | Cybersecurity | $\infty$ | 0.9368 | 9.17 | 7.44 | 5.45 | 1.46 | 1.44 | 1.39 | 0 | - | - | - | - | - | - | - |
| 7B | Medical | $\infty$ | 0.9937 | 10.32 | 7.82 | 5.62 | 1.47 | 1.44 | 1.39 | 4 | 19.19±18.19 | 103.75±100.34 | 13.00±3.16 | 0.19±0.01 | 5.22±0.26 | 0.25±0.05 | 0.72±0.12 |
| 7B | Finance | $\infty$ | 0.9817 | 10.21 | 8.2 | 5.55 | 1.47 | 1.45 | 1.39 | 12 | 16.68±15.63 | 81.00±77.58 | 14.50±5.20 | 0.21±0.03 | 4.87±0.63 | 0.05±0.04 | 0.06±0.05 |

IV Result Discussion

IV-A Analysis of Response

The study results are presented in Table I. Common prompts tend to generate responses with greater diversity in length and significantly longer inference times. The 7B models consume more GPU memory than their 2B counterparts, as expected given the difference in parameter count. In every configuration, the mean throughput of the 2B model is roughly 50% or more higher than that of the corresponding 7B model, indicating that the 7B models incur more computation and memory overhead than the 2B models.
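
The throughput comparison can be checked directly from the mean values in Table I. The following is a minimal sketch, not part of the original study: the dictionary names, the `relative_gain` helper, and the `"~50"`/`"inf"` restriction labels are hypothetical, and only the numbers are taken from the table.

```python
# Mean throughput values transcribed from Table I, keyed by
# (domain, response restriction). "~50" and "inf" are hypothetical labels
# for the restricted and unrestricted settings.
THROUGHPUT_2B = {
    ("Common", "~50"): 12.0, ("Cybersecurity", "~50"): 11.26,
    ("Medical", "~50"): 12.08, ("Finance", "~50"): 12.19,
    ("Common", "inf"): 11.79, ("Cybersecurity", "inf"): 11.91,
    ("Medical", "inf"): 12.37, ("Finance", "inf"): 12.33,
}
THROUGHPUT_7B = {
    ("Common", "~50"): 7.58, ("Cybersecurity", "~50"): 6.29,
    ("Medical", "~50"): 7.73, ("Finance", "~50"): 7.98,
    ("Common", "inf"): 7.86, ("Cybersecurity", "inf"): 7.61,
    ("Medical", "inf"): 7.9, ("Finance", "inf"): 8.16,
}


def relative_gain(small: float, large: float) -> float:
    """Percentage by which the first throughput exceeds the second."""
    return (small / large - 1.0) * 100.0


# Relative throughput advantage of the 2B model in each configuration.
gains = {
    key: relative_gain(THROUGHPUT_2B[key], THROUGHPUT_7B[key])
    for key in THROUGHPUT_2B
}

for (domain, restriction), gain in sorted(gains.items(), key=lambda kv: kv[1]):
    print(f"{domain:14s} {restriction:>4s}  2B throughput is {gain:5.1f}% higher")
```

Because these are ratios of means, they say nothing about per-prompt variance; the standard deviations in Table I would be needed for a significance test.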

Regarding quality, we measured semantic textual similarity (STS) against ChatGPT responses used as the reference. The five highest STS scores belong to $7B/Finance/\approx 50$, $7B/Medical/\approx 50$, $7B/Finance/\infty$, $7B/Medical/\infty$, and $7B/Common/\approx 50$. Conversely, $2B/Cybersecurity/\infty$, $7B/Cybersecurity/\infty$, and $7B/Cybersecurity/\approx 50$ are among the lowest. Higher STS scores indicate greater semantic similarity between the generated text and the reference text, and hence a higher degree of agreement or relevance. The high-scoring configurations, such as $7B/Common/\approx 50$, therefore convey meanings closer to those of the ChatGPT reference. In addition, restricted responses showed better ROUGE-L values for both common and domain-specific prompts, which implies that they share more wording with the reference responses.
It is important to note that the STS and ROUGE-L scores in this study may not reflect the true quality of the responses, because the generated and reference responses often differ in length by a wide margin, and such a length mismatch can bias both metrics.
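
To illustrate why a length mismatch matters for ROUGE-L, the sketch below computes an LCS-based precision, recall, and F1 on whitespace tokens. It is an illustrative simplification under stated assumptions (lowercasing, whitespace tokenization, balanced F1), not the scoring pipeline used in this study; the function names and example sentences are hypothetical.

```python
from __future__ import annotations


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) based on the LCS of whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the long candidate contains the whole reference,
# yet its precision (and therefore F1) drops because of the extra words.
reference = "the cat is on the mat"
similar_length = rouge_l("the cat sat on the mat", reference)
much_longer = rouge_l(
    "the cat is on the mat and it seems quite happy there", reference
)
print(similar_length)
print(much_longer)
```

In this toy case the longer candidate reproduces the reference verbatim but scores a lower F1 than the shorter, slightly different candidate, which is the kind of length effect the caveat above describes.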

IV-B Analysis of Correlation and Outliers

Our results demonstrate a strong correlation between inference time and response length (Figure 5a-p). Table II provides further insight into this relationship. The 2B and 7B models with unrestricted responses to Medical, Finance, and Common prompts exhibit the highest time-response correlations, whereas cybersecurity prompts exhibit the lowest time-response correlation across all scenarios. The variability around the correlation depicted in Fig. 5 can be computed using the outlier technique discussed in Section III-C. A longer response in a shorter time indicates high throughput, and the primary objective is to identify outliers along the x-axis, i.e., inference time. For the response-restricted cases, $\lambda$ is set to $0.005$ for $\theta_{\text{max}}$ and $0.5$ for $\theta_{\text{min}}$, whereas for the unrestricted cases, $\lambda$ is set to $0.0005$ for $\theta_{\text{max}}$ and $0.05$ for $\theta_{\text{min}}$. This results in a variability ratio of $10:1$ between restricted and unrestricted response lengths.
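
As a rough illustration of the quantities in Table II, the sketch below computes a Pearson correlation coefficient on synthetic data and converts slopes to angles. The angles reported in Table II appear to be consistent with $\theta = \arctan(m)$, which is the only relationship assumed here; the full outlier procedure of Section III-C, including the role of $\lambda$, is not reproduced. The function names and the synthetic measurements are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def slope_to_angle(slope):
    """Angle in radians of a line with the given slope (assumed theta = atan(m))."""
    return math.atan(slope)


# Slopes of the 2B/Common/restricted row, transcribed from Table II.
slopes = {"max": 15.14, "central": 11.66, "min": 5.99}
angles = {name: round(slope_to_angle(m), 2) for name, m in slopes.items()}
print(angles)

# Synthetic (made-up) measurements: inference time in seconds and
# response length in words, used only to exercise pearson_r.
inf_time = [1.0, 2.0, 3.0, 4.0, 5.0]
resp_len = [12, 23, 35, 47, 58]
print(round(pearson_r(inf_time, resp_len), 4))
```

For steep slopes such as these, the angles cluster just below $\pi/2$, so small differences in angle correspond to large differences in slope.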

Common prompts exhibited the highest number of outliers across all cases compared to domain-specific prompts. This finding further confirms the inconsistency in response length associated with common prompts relative to their domain-specific counterparts. Within the domain-specific category, $7B/Finance/\infty$ manifests the highest number of outliers, which is attributed to its inclusion of content from related fields such as insurance, accounting, and taxation. In contrast, cybersecurity and medical prompts consistently yielded concise responses in all cases. The analysis of the outliers indicates that the values are nearly uniformly below the overall mean across all parameters except latency.

V Conclusion

This study investigates the inference behavior of foundation models of different sizes under common and domain-specific prompts drawn from the cybersecurity, medical, and finance domains, with response lengths both restricted and unrestricted. We present a framework to assess large foundational models in terms of domain understanding and outlier formulation. The results indicate that model size and the types of prompts used for inference significantly influence response length and quality, as larger datasets for training provide more information across various domains. In addition, common prompts, which include different types of queries, generate diverse responses and may result in inconsistent response lengths when the same prompt is used multiple times or in different ways. In contrast, domain-specific prompts consistently generate concise responses. Therefore, we recommend eliminating irrelevant domains from the language model's information prior to fine-tuning for domain-specific tasks. For example, when sufficient data are available to fine-tune the 2-billion-parameter model for a cybersecurity downstream task, we advocate a full fine-tuning approach rather than a parameter-efficient technique. This method involves the elimination of irrelevant domains, allowing the target domain to become predominant. Such an approach preserves the model’s size while facilitating the generation of concise and consistent responses in a minimal amount of time.

Although this study set out to examine resource utilization, response length, and response quality, we did not observe statistically significant differences in resource utilization or response quality among the cases. Consequently, future research should investigate response quality comprehensively across various domains and determine whether response length correlates with quality. In addition, resource usage must be assessed in a manner that does not affect inference time.

References

  • [1] M. A. K. Raiaan, M. S. H. Mukta, K. Fatema, N. M. Fahad, S. Sakib, M. M. J. Mim, J. Ahmad, M. E. Ali, and S. Azam, “A review on large language models: Architectures, applications, taxonomies, open issues and challenges,” IEEE Access, 2024.
  • [2] L. Fan, L. Li, Z. Ma, S. Lee, H. Yu, and L. Hemphill, “A bibliometric review of large language models research from 2017 to 2023,” arXiv preprint arXiv:2304.02020, 2023.
  • [3] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engineering: A systematic literature review,” arXiv preprint arXiv:2308.10620, 2023.
  • [4] S. Eger, C. Leiter, J. Belouadi, R. Zhang, A. Kostikova, D. Larionov, Y. Chen, and V. Fresen, “NLLG quarterly arxiv report 06/23: What are the most influential current AI papers?” CoRR, vol. abs/2308.04889, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2308.04889
  • [5] D. A. Schmidt-Crawford, D. L. Lindstrom, and A. D. Thompson, “Publishing trends in JDLTE: A five-year perspective,” Journal of Digital Learning in Teacher Education, vol. 38, no. 3, pp. 102–104, 2022. [Online]. Available: https://doi.org/10.1080/21532974.2022.2107321
  • [6] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, p. 102419, 2020.
  • [7] S. Biswas, “The function of chat gpt in social media: According to chat gpt,” Available at SSRN 4405389, 2023.
  • [8] X. Dai, S. Karimi, B. Hachey, and C. Paris, “Using similarity measures to select pretraining data for ner,” in Proceedings of NAACL-HLT, 2019, pp. 1460–1470.
  • [9] M. S. U. Miah, M. M. Kabir, T. B. Sarwar, M. Safran, S. Alfarhood, and M. Mridha, “A multimodal approach to cross-lingual sentiment analysis with ensemble of transformer and llm,” Scientific Reports, vol. 14, p. 9603, 2024.
  • [10] Y. Tan, D. Min, Y. Li, W. Li, N. Hu, Y. Chen, and G. Qi, “Can chatgpt replace traditional kbqa models? an in-depth analysis of the question answering performance of the gpt llm family,” in International Semantic Web Conference.   Springer, 2023, pp. 348–367.
  • [11] M. A. Arefeen, B. Debnath, and S. Chakradhar, “Leancontext: Cost-efficient domain-specific question answering using llms,” Natural Language Processing Journal, vol. 7, p. 100065, 2024.
  • [12] L. Qu, S. Wu, H. Fei, L. Nie, and T.-S. Chua, “Layoutllm-t2i: Eliciting layout guidance from llm for text-to-image generation,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 643–654.
  • [13] A. Acharya, B. Singh, and N. Onoe, “Llm based generation of item-description for recommendation system,” in Proceedings of the 17th ACM Conference on Recommender Systems, 2023, pp. 1204–1207.
  • [14] D. Nam, A. Macvean, V. Hellendoorn, B. Vasilescu, and B. Myers, “Using an llm to help with code understanding,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024, pp. 1–13.
  • [15] J. Li, H. Zhou, S. Huang, S. Cheng, and J. Chen, “Eliciting the translation ability of large language models via multilingual finetuning with translation instructions,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 576–592, 2024.
  • [16] G. Keswani, W. Bisen, H. Padwad, Y. Wankhedkar, S. Pandey, and A. Soni, “Abstractive long text summarization using large language models,” International Journal of Intelligent Systems and Applications in Engineering, vol. 12, no. 12s, pp. 160–168, 2024.
  • [17] R. Goel, C. Hidey, H. R. Mohammad, P. R. Muddireddy, and F. Liu, “Llm-based task-oriented dialog system with few-shot retrieval augmentation,” Technical Disclosure Commons, 2023.
  • [18] Z. Hu, Y. Feng, A. T. Luu, B. Hooi, and A. Lipani, “Unlocking the potential of user feedback: Leveraging large language model as user simulators to enhance dialogue system,” in Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023, pp. 3953–3957.
  • [19] J. Jia, H. Liu, and N. Z. Gong, “10 security and privacy problems in large foundation models,” in Unknown, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:259129905
  • [20] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” 2023. [Online]. Available: https://github.com/tatsu-lab/stanford_alpaca
  • [21] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpaca: A strong, replicable instruction-following model,” Stanford Center for Research on Foundation Models, 2023. [Online]. Available: https://crfm.stanford.edu/2023/03/13/alpaca.html
  • [22] A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, “Flava: A foundational language and vision alignment model,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE Computer Society, 2022, pp. 15 617–15 629.
  • [23] J. B. Li, J. S. Michaels, L. Yao, L. Yu, Z. Wood-Doughty, and F. Metze, “Audio-journey: Efficient visual+ llm-aided audio encodec diffusion,” in Workshop on Efficient Systems for Foundation Models@ ICML2023, 2023.
  • [24] C. Li, Z. Gan, Z. Yang, J. Yang, L. Li, L. Wang, J. Gao et al., “Multimodal foundation models: From specialists to general-purpose assistants,” Foundations and Trends® in Computer Graphics and Vision, vol. 16, no. 1-2, pp. 1–214, 2024.
  • [25] E. Aghaei, X. Niu, W. Shadid, and E. Al-Shaer, “Securebert: A domain-specific language model for cybersecurity,” in Security and Privacy in Communication Networks, F. Li, K. Liang, Z. Lin, and S. K. Katsikas, Eds.   Cham: Springer Nature Switzerland, 2023, pp. 39–56.
  • [26] M. A. Ferrag, A. A. Battah, N. Tihanyi, M. Debbah, T. Lestable, and L. C. Cordeiro, “Securefalcon: The next cyber reasoning system for cyber security,” ArXiv, vol. abs/2307.06616, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259847202
  • [27] C. Lin, “Recall-oriented understudy for gisting evaluation (rouge),” Retrieved August, vol. 20, p. 2005, 2005.
  • [28] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap et al., “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” in EMNLP, 2022.
  • [29] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in International Conference on Learning Representations, 2020.
  • [30] D. Saxena, N. Sharma, D. Kim, R. Dwivedula, J. Chen, C. Yang, S. Ravula, Z. Hu, A. Akella, S. Angel, I. D. Joydeep Biswas, Swarat Chaudhuri, A. Dimakis, D. Kim, C. Rossbach, and G. Wang, “On a foundation model for operating systems,” in Machine Learning for Systems Workshop at 37th NeurIPS Conference, 2023, New Orleans, LA, USA., 2023.
  • [31] C.-H. Chiang, Y.-S. Chuang, and H.-y. Lee, “Recent advances in pre-trained language models: Why do they work and how do they work,” in Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts, M. A. Alonso and Z. Wei, Eds.   Taipei: Association for Computational Linguistics, Nov. 2022, pp. 8–15. [Online]. Available: https://aclanthology.org/2022.aacl-tutorials.2
  • [32] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Pre-trained language models for text generation: A survey,” ACM Computing Surveys, vol. 56, no. 9, pp. 1–39, 2024.
  • [33] Q. Lu, L. Zhu, X. Xu, Z. Xing, and J. Whittle, “A framework for designing foundation model based systems,” arXiv e-prints, pp. arXiv–2305, 2023.
  • [34] Y. Liu, Q. Lu, L. Zhu, and H.-Y. Paik, “Decentralised governance for foundation model based ai systems: Exploring the role of blockchain in responsible ai,” IEEE Software, pp. 1–6, 2024.
  • [35] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” OpenAI, 2018.
  • [36] OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [37] K. S. Kalyan, “A survey of gpt-3 family large language models including chatgpt and gpt-4,” Natural Language Processing Journal, vol. 6, p. 100048, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2949719123000456
  • [38] AI@Meta, “Llama 3 model card,” 2024. [Online]. Available: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
  • [39] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models (2023),” arXiv preprint arXiv:2302.13971, 2023.
  • [40] K. Lv, Y. Yang, T. Liu, Q. Gao, Q. Guo, and X. Qiu, “Full parameter fine-tuning for large language models with limited resources,” arXiv e-prints, pp. arXiv–2306, 2023.
  • [41] B. Liao, Y. Meng, and C. Monz, “Parameter-efficient fine-tuning without introducing new latency,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds.   Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 4242–4260. [Online]. Available: https://aclanthology.org/2023.acl-long.233
  • [42] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2021.
  • [43] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [44] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia, “LongloRA: Efficient fine-tuning of long-context large language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=6PmJoRfdaK
  • [45] A. Gao, “Prompt engineering for large language models,” Available at SSRN 4504303, 2023.
  • [46] X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).   Association for Computational Linguistics, 2022.
  • [47] J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large language models in retrieval-augmented generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 754–17 762.
  • [48] Y. Mo, H. Qin, Y. Dong, Z. Zhu, and Z. Li, “Large language model (llm) ai text generation detection based on transformer deep learning algorithm,” International Journal of Engineering and Management Research, vol. 14, no. 2, p. 154–159, Apr. 2024. [Online]. Available: https://ijemr.vandanapublications.com/index.php/ijemr/article/view/1565
  • [49] R. Tang, Y.-N. Chuang, and X. Hu, “The science of detecting llm-generated text,” Commun. ACM, vol. 67, no. 4, p. 50–59, mar 2024. [Online]. Available: https://doi.org/10.1145/3624725
  • [50] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Pre-trained language models for text generation: A survey,” ACM Comput. Surv., vol. 56, no. 9, apr 2024. [Online]. Available: https://doi.org/10.1145/3649449
  • [51] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada., 2019.
  • [52] L. Li, “Triangular matrix algebras: Recollements, torsion theories, and derived equivalences,” arXiv: Representation Theory, 2013. [Online]. Available: https://api.semanticscholar.org/CorpusID:118650914
  • [53] Y. Li, Y. Tian, and X. Du, “Triangularization of matrices and polynomial maps,” Canadian Mathematical Bulletin, vol. 63, pp. 94 – 105, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:182718155
  • [54] B. Tan, Z. Yang, M. Al-Shedivat, E. Xing, and Z. Hu, “Progressive generation of long text with pretrained language models,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4313–4324.
  • [55] S. Pichai and D. Hassabis, “Introducing gemini: our largest and most capable ai model,” Google DeepMind and Alphabet, p. 154–159, Dec. 2023. [Online]. Available: https://blog.google/technology/ai/google-gemini-ai/#sundar-note
  • [56] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on gemini research and technology,” arXiv e-prints, pp. arXiv–2403, 2024.
  • [57] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [58] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
  • [59] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.
  • [60] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds.   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019. [Online]. Available: https://aclanthology.org/N19-1423
  • [61] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=r1xMH1BtvB
  • [62] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf
  • [63] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=XPZIaotutsD
  • [64] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music transformer,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=rJe4ShAcF7
  • [65] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” in International Conference on Learning Representations, 2018.
  • [66] R. Fayyazi, R. Taghdimi, and S. J. Yang, “Advancing ttp analysis: Harnessing the power of encoder-only and decoder-only language models with retrieval augmented generation,” arXiv preprint arXiv:2401.00280, 2023.
  • [67] MITRE, “MITRE ATT&CK,” https://attack.mitre.org/, 2023, accessed: 2024-06-07.
  • [68] Gaurang Bharti, “finance-alpaca (revision 51d16b6),” 2024. [Online]. Available: https://huggingface.co/datasets/gbharti/finance-alpaca
  • [69] R. M. Vsevolodovna, “Ai medical dataset,” 2023. [Online]. Available: https://github.com/ruslanmv/ai-medical-chatbot