Responsible Design and Use of Large Language Models

Responsible design
and use of large

language models
ChatGPT and Bard have taken the world To begin, businesses must be aware of the
by storm! In many circles, even in casual various ways LLMs can be employed in
conversations, discussions about the context-specific applications. Three different
huge positive influences of Large mechanisms enable businesses to leverage
Language Models (LLMs) and the this technology:
sobering risks they pose to our Using existing LLM Models ‘as-is’ via APIs
socio-economic fabric are happening in
the same breath. Some of the concerns Technology companies have developed
revolving around the use of LLMs like cutting-edge LLMs, allowing end users, such
spread of misinformation and production as enterprises, consumers, and enthusiasts,
of unfair, biased, or harmful content are to access these models through APIs. This
legitimate to a greater or lesser extent enables them to explore the possibilities
depending upon the context. However, offered by these technologies. For example,
the anxiety around the negative impact OpenAI allows anyone with online access to
that such advancing technology may register and experience ChatGPT-3.5. OpenAI
have on the human-race is amplified by achieved an impressive end-user base of 100
the AI apocalypse warnings from million within the first two months of
prominent figures in the world of AI. ChatGPT’s launch through this mechanism [3].
Even the developers of such
technologies are swiftly addressing the Building LLMs
public concerns about the potential risks
associated with the use of the LLMs. For Enterprises may find the use of the existing
instance, OpenAI — the developers of LLMs ‘as-is’ inadequate for scenario-specific
ChatGPT, employs a [Content] applications due to performance and
‘Moderation API’ to assess if the output compliance concerns (e.g., data-security).
of ChatGPT contains sexual, violent, In such cases, enterprises can develop their
hateful or harmful content in accordance own context-specific LLMs in two ways:
with their organization’s content policy[1].
Businesses committed to Responsible AI
a. Finetune existing LLMs:
principles need to follow suit. To fully
Many developers of generic LLMs
realise the potential of this technology
allow fine-tuning of their LLMs
responsibly in specific use-cases, they
(training the front/adaptive layers
must recognise the different ways LLMs
of the LLM using context-specific
can be used in applications and
data) to cater to the performance
establish an appropriate governance
and compliance needs of
layer to evaluate and monitor the context
customer-enterprises.
of application from an ethical
HuggingFace, for example, offers
perspective throughout its lifecycle.
various models that enterprises
can customize for specific
How to use LLMs, the different use-cases [4].
factors to consider
LLMs, like any other AI models, are
essentially socio-technical systems — b. Building from scratch:
an inextricable bundle of code, data, For deep domain specific
subjective parameters and people [2]. use-cases, enterprise may
Evaluating such systems from an ethical consider creating field-specific
perspective requires examining not only LLMs from scratch. It is essential
the characteristics of technology but to note that LLMs are built from
also the context of its use. For scratch following the architecture
businesses considering utilisation of the outlined in the ground-breaking
LLM-technology for a use-case, it is paper “Attention is all you need,”
crucial to analyse both the technology as published in 2017[5].
well as the scenario-specific angles.
2
Each approach has strengths and limitations, as listed in Table 1, and its suitability depends on
the context of the use-case for which LLMs are being considered.
Table 1: Strengths and limitations of the different approaches of

using LLMs
Approach for
Strengths Limitations
using LLM
Using existing • Easy access to collective knowledge • Potential for biases and
LLMs across domains ethical concerns in training
‘as-is’ as APIs • Straight forward consumption of raw data
output by simple API calls from the • Lack of control over training
existing LLM models data and architecture
• Proven performance on various for the enterprise
language tasks • May not be suitable for
scenario specific tasks
Finetune • Most effective solution for an • Chances of bias

existing LLMs enterprise as it allows appropriate incorporation from the
(Retrain the retraining of adaptative layers for frozen layers of LLM
adaptive layers) use in scenario-specific • Possible data security and
applications privacy threat
• Ready-to-use framework available
in market for enablement of
finetuned LLM in an enterprise
Building LLMs • Maximum control over training • Requirement of huge

from Scratch data and architecture resources, time, and
• More flexibility for customization expertise for model training
• No chances of any bias • Requirement of

incorporation humongous volume of
appropriate data to train
model to achieve better
results
There are advantages and limitations should evaluate the approaches from a
associated with each of the approaches, as feasibility-of-implementation angle as well.
listed in Table 1, and the suitability of an For instance, businesses with low-to-medium
approach depends on the context of the AI maturity levels should avoid building an
use-case for which LLMs are being LLM from scratch, as it is time and
considered. To assess the suitability from resource-intensive, requiring advanced
both performance and ethical AI data-science (NLP) skill sets and extensive
perspectives, it is essential to consider how data and computational power. Table 2
the three approaches can help improve presents an overview of LLMs from these
application performance and ensure different dimensions.
responsible LLM design and use. Businesses
3
Table 2: Evaluating llms from different decision-making dimensionstable
Existing LLM Finetune Existing

Broad Building LLM
Factors (Use ‘As-Is’ LLMs (Retrain
Dimensions From Scratch
Via APIs) Adaptive Layers)
Data No model-building Enterprises would need Building a robust and

requirement data required as use-case specific data performant LLM from
s for building the developers of to finetune existing scratch requires huge
an LLM these existing LLMs. Volume of data volume of data.
LLMs have already required for finetuning
trained the model is less than the volume
with vast amount of data required to build
of data from a LLMs from scratch, in
wide variety of general.
text corpora.
Architecture Architectural Architecture considered Enterprises must

considerations by the LLM-provider decide on the choice of
not required in cannot be changed. architecture as it has a
this context. Enterprises need to significant impact on
consider the adaptive the model's perfor-
layers required for mance.
Feasibility of
finetuning.
implementatio
n of performant
LLMs for
use-cases
Hyperparame- Pre-built LLMs Finetuning LLMs A proper strategy

ter-tuning provide no scope provides on option to around the selection
of hyperparameter tune hyperparameter of hyperparameters
tuning from the adaptive and its optimization
layers (specific to based on the training
scenarios) as data are required.
hyperparameters of
frozen layers of the
architecture are
inaccessible to
customization.
Time and Time and resource As it needs some A colossal

resources consideration retraining hence time-consuming and
required only to consumption of highly skilled
put in place time and resources data-science
mechanisms to are required.
resource-intensive
feed inputs However, there are
process
(prompts) into an lot of available
existing LLM and frameworks in as it requires
consume its market which developing and
output easily enables this testing of complex
capability. advanced architecture.
Existing LLM Finetune Existing Building LLM
Broad
Factors (Use ‘As-Is’ LLMs (Retrain From Scratch
Dimensions
Via APIs) Adaptive Layers)
Bias and Prebuilt LLMs can Retraining the The enterprise has
Fairness perpetuate and adaptive layers may ultimate control in
amplify harmful not eradicate all the terms of bias and
biases present in application-specific fairness as it owns the
the training data. unwanted biases that data and the
There is a chance may have creeped governance oversight
that these models into it from the frozen for the choice of
may not generate section of the model. architecture and the
equitable process of
outcomes in development and
specific deployment.
applications.
Privacy & Uploading data, Uploading data into The enterprise has
Security in the form of finetuned LLMs may complete control to
inputs/prompts, also pose a data privacy mitigate data
to an existing and security threat. privacy and security
LLM may pose a Some of the leading risks.
data privacy LLM-providers are
and security coming up with
threat. architectural designs to
address these
concerns.
Feasibility of
Governance Adherence to Adherence to The enterprise can
use from a
and organisational organisational level incorporate all the
Responsible AI
regulation level governance governance and necessary governance
perspective
and regulation regulation may be frameworks and
may prove to be limited in this regulatory norms to
difficult in this context. make the developed
context. LLM compliant.
Auditing and Prebuilt LLMs can LLMs built with Built from scratch
testing be audited and transfer learning LLMs should
tested in the form should be tested from require thorough
of ‘Black Box both development development
Testing’ to identify and usage testing as well as
potential issues perspectives. usage-based
pertaining to its use assessment.
in a use-case.
Document There are extensive Finetuning LLMs Enterprise needs to

ation documentations and providers offer an produce all the
user guides from a extensive documentation documentation
usage perspective. on the usage and on the pertaining to the
However, modifications of LLMs that they built
LLM-providers may adaptive layers. However, from scratch along
not reveal the LLM-providers may not with user support.
details of the data reveal the details of the
used to build the data used to build the
LLM or the specifics LLM or the specifics of
of the architecture. the architecture.
Assessing the impact of LLMs in use-cases
Businesses, with an understanding of the decision-making dimensions of the three approaches
to use LLMs, should then evaluate the potential impact of an appropriate scenario-specific LLM
application on its end-users, society, and environment and the organization itself (as depicted in
Figure 1).
RISKS TO THE END-USERS

• Data privacy and security#
• Inaccurate output
• Harmful response
• Taxic response
RISKS TO THE
RISKS TO THE SOCIETY ORGANIZATION
• Human Right Violation • Regulatory compliance
• Biasness towards certain • Reputation cost
groups of consumers • Business transparency risk
RISKS TO THE ENVIRONMENT

• Negative impact on
environment
Figure 1: Risk of LLMs upon its end-users, society, environment, and organization
To gauge the impact of a scenario specific preferable to maintain feasibility and

LLM system on its end-users, businesses environmental commitments [7] [8]. However,
must first determine the system’s reach or implementing guardrails such as
coverage. Let’s take the example of a deflection-logic and prompt & response
multi-national electro-mechanical filtering for responsible use is vital. It is
component manufacturer for automotive essential for the business to conduct a
applications. They plan to deploy an thorough evaluation of the LLM system
LLM-powered chatbot for field-engineers including data security, privacy, contextual
across all operating geographies. correctness and toxicity along with proper
Field-engineers, with domain knowledge, can documentation.
likely distinguish sensible responses from
hallucinations (non-sensical outputs) [6]). Businesses should be mindful of risks posed
However, the company must be cautious, as a by use-case specific LLMs, such as personal
potentially harmful response slipping past data leakage and unfair outcomes, which
human filters could lead to adverse can translate to societal risks and regulatory
outcomes. Additionally, consistency of non-compliance. Lack of transparency in
correct responses in different languages LLM operations when using 'as-is' or
must be ensured. finetuned models adds to the risks. Table 3
presents mechanisms to assess LLM
For engineering-help queries, the company systems comprehensively, including their
can opt for finetuned LLMs or build their own ability to match human-level
use-case specific LLMs. Finetuned LLMs are common-sense knowledge.
6
Table 3: Evaluating LLMs from impact perspective
LLM Toxicity [7] Common sense
Accuracy Robustness Fairness
Usage Knowledge [8]
As-is’ via Reference Reference Equalised Odds,

APIs from Holistic from HELM* Individual/Group
HellaSwag*
Evaluation of framework Fairness
– evaluate
Language
physical,
Models
Hugging grounded, and
(HELM)*
Face based temporal common
framework [9]
Toxicity sense
Score [7] for
detecting WinoGrande*
Finetuned Perplexity, Adversarial Demographic hate speech, - examines
LLMs Entropy Accuracy, Parity, Equalized Test Cases physical and social
Distribu- Odds, Treatment designed common sense.
tional Shift Equality, with LLM
Individual BLEU
/Group Fairness Social IQA*
framework
- evaluates social
[10]
to identify
common sense
the
Built from Perplexity, Adversarial Demographic differences
scratch Entropy, Accuracy, Parity, Equalized w.r.t. to a PIQA*
BPC Distribution Odds, Treatment reference - covers the
al Shift, Equality, physical aspect of
Diversity, Individual/Group common sense
Bias Fairness
Mitigation
* Small descriptions on HELM, HellaSwag, WinoGrande, Social IQA, PIQA, etc. are provided in the notes
Businesses considering the direct found in Ethical AI: Looking beyond accuracy to
integration of LLM systems into their realize business value [11].
product or service offerings should also # There are different ways of implementing
assess the risk of these offerings from LLMs responsibly from a data security and
a usage perspective, aligning with the privacy point-of-view using federated learning,
forthcoming AI regulations in various differential privacy, etc.; these methods are
geographies (for example, EU’s AI Act). elaborated in[6]
More information about this can be
Conclusion perpetuation. Mitigating risks and ensuring

The advent of advanced Large Language equitable outcomes through protective
Models (LLMs) brings both excitement and measures becomes paramount.
trepidation. Businesses must tread The fast-shifting LLM landscape calls for
carefully in this landscape, establishing human-in-the-loop applications, where
safeguards to harness the potential of LLMs humans retain ultimate responsibility and
with responsibility and ethics. Incorporating decision-making, with LLMs serving as
control elements into LLM approaches is support tools. This approach provides greater
essential to align with business objectives control, accountability, and risk reduction.
and ethical standards. Scrutiny of By treading thoughtfully, businesses can
scenario-based LLM applications is embrace the possibilities of LLMs while
imperative to identify potential adverse mitigating their associated challenges and
effects such as biases and misinformation ensuring responsible utilization.
7
Notes
HELM Framework
Holistic Evaluation of Language Models (HELM) is an assessment approach that

provides a comprehensive analysis of the performance, limitations, and
capabilities of language models. It considers various factors, including linguistic
quality, factual accuracy, bias detection, ethical considerations, and robustness,
to provide a holistic understanding of the model's strengths and weaknesses.
HELM aims to ensure a well-rounded evaluation that considers the broader
implications and challenges associated with deploying language models in
real-world applications [12].
HelaSwag
HellaSwag is an evaluation benchmark designed to assess the physical,

grounded, and temporal common sense understanding of Large Language
Models (LLMs). It focuses on evaluating the models' ability to reason about
real-world situations, events, and context, going beyond syntactic and semantic
understanding. HellaSwag aims to measure the LLM's capability to generate
plausible and contextually appropriate responses, considering nuanced aspects
of human-like reasoning, thereby providing insights into the model's
comprehension of common-sense knowledge in a broader context [13].
WinoGrande
WinoGrande is an evaluation dataset specifically designed to examine the
physical and social common sense understanding of language models.
It consists of a set of multiple-choice questions that require reasoning about
real-world scenarios, incorporating both physical and social contexts.
WinoGrande focuses on challenging the models' ability to comprehend nuanced
aspects of common-sense knowledge, such as causality, intention, and social
dynamics. By assessing the performance of language models on WinoGrande,
researchers gain insights into their capabilities and limitations in understanding
and reasoning about common sense in a variety of contexts [14].
Social IQA
Social IQA is an evaluation benchmark that measures the social common sense
understanding of language models, assessing their ability to comprehend and
reason about social interactions, emotions, intentions, and cultural context [15].
PIQA
PIQA (Physical Interaction: Question Answering) is an evaluation benchmark

that focuses on assessing the physical common sense understanding of
language models by challenging them with questions related to physical
interactions and dynamics in the real world [16].
8
References
T. Markov, C. Zhang, S. Agarwal, T. Eloundou, T. Lee, S. Adler, A. Jiang and L. Weng, "New
1 and improved content moderation tooling," 10 August 2022. [Online]. Available:
https://openai.com/blog/new-and-improved-content-moderation-tooling.
I. Bartoletti, "Another warning about the AI apocalypse? I don’t buy it," 3 May 2023. [Online].
Available:
2
https://www.theguardian.com/commentisfree/2023/may/03/ai-chatgpt-bard-artificial-int
elligence-apocalypse-global-rules.
K. Hu, "ChatGPT sets record for fastest-growing user base - analyst note," 2 February
2023. [Online]. Available:
3
https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-an
alyst-note-2023-02-01/.
"Fine-tune a pretrained model - Transformers," Hugging Face, [Online]. Available:

4
https://huggingface.co/docs/transformers/training.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I.

5 Polosukhin, "Attention Is All You Need," in 31st Conference on Neural Information
Processing Systems, Long Beach, CA, USA, 2017.
Mugunthan, V.; Maximizing the ROI of Large Language Models for the large enterprise.,
29 March, 2023. [Online]
6
https://www.dynamofl.com/blogs/maximizing-the-roi-of-large-language-models-for-th
e-large-enterprise
Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto and P. Fung,
7 "Survey of Hallucination in Natural Language Generation," ACM Computing Surveys, vol.
55, no. 12, p. 1–38, 2023.
"Toxicity," Hugging Face, [Online]. Available:

8
https://huggingface.co/spaces/evaluate-measurement/toxicity.
X. L. Li, A. Kuncoro, J. Hoffmann, C. de Masson d'Autume, P. Blunsom and A. Nematzadeh, "A

Systematic Investigation of Commonsense Knowledge in Large Language Models," in
9
Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing, Abu Dhabi, United Arab Emirates, 2022.
10 Liang et. al., "Holistic Evaluation of Language Models," arXiv, 2022.
K. Doshi, "Foundations of NLP Explained - Bleu Score and WER Metrics," Medium, 9 May
2021. [Online]. Available:
11
https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metr
ics-1a5ba06d812b.
B. K. Mitra and M. Smith, "Ethical AI: Looking beyond accuracy to realize business value,"
Wipro Limited, March 2022. [Online]. Available:
12
https://www.wipro.com/blogs/bhargav-kumar-mitra/ethical-ai-looking-beyond-accuracy-t
o-realize-business-value/.
9
Bommasani, Rishi, P. Liang and T. Lee, "HELM," Center for Research on Foundation
13
Models, 19 March 2023. [Online]. Available: https://crfm.stanford.edu/helm/latest/.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi and Y. Choi, "HellaSwag: Can a Machine

14
Really Finish Your Sentence?," arxiv, 2019.
K. Sakaguchi, R. Le Bras, C. Bhagavatula and Y. Choi, "WinoGrande: An Adversarial

15
Winograd Schema Challenge at Scale," arXiv, 2019.
M. Sap, H. Rashkin, D. Chen, R. Le Bras and Y. Choi, "SocialIQA: Commonsense Reasoning

16
about Social Interactions," Allen Institute for Artificial Intelligence, Seattle, 2019.
Y. Bisk, R. Zellers, R. Le Bras, J. Gao and Y. Choi, "PIQA: Reasoning about Physical
17 Commonsense in Natural Language," Proceedings of 34th AAAI Conference on Artificial
Intelligence, vol. 34, pp. 7432-7439, 202 0.
About the Authors
SOUMYA TALUKDER
is currently working as a Consultant at Wipro Limited. He has
9+ years of experience as a data scientist, where he has worked
majorly on Retail and Telecom domains. He has good experience
in Statistical, Machine Learning and AI model development.
DIPOJJWAL GHOSH
is currently a Principal Consultant at Wipro Limited, India.
He received his M. Tech. in Quality, Reliability and Operations
Research from Indian Statistical Institute, Kolkata. He has 16+ years
of research and analytical experience in various domains including
retail, manufacturing and energy & utilities. Dipojjwal has published
multiple research and popular technology articles up to date.
SILADITYA SEN
is a Data Scientist at Wipro Limited. He has received his
M. Sc. in Statistics from Presidency University, Kolkata.
He has close to 8 years of experience in the field of data science
in Retail, Telecom and Utility domains. He is quite proficient in
building classical statistical, Machine Learning and AI models.
BHARGAV MITRA
is a Data Science Expert and MLOps Consultant with an
entrepreneurial mindset. He is working with Wipro as the AI &
Automation Practice Partner for Europe and leading the practice’s
global initiatives on Responsible AI. Bhargav has over 18 years of
‘hands-on’ experience in scoping, designing, implementing, and
delivering business-intelligence driven Machine/Deep Learning. He
holds a DPhil in Computer Vision from the University of Sussex and
an MBA from Warwick Business School.
ANINDITO DE
is CTO of the AI Practice at Wipro Limited. His primary
responsibilities are building capabilities across different areas of
AI and ML and bringing to life AI driven intelligent solutions for
customers. With over two decades of experience, he has been a
part of many large technology implementations across sectors
and authored multiple technology publications and patents.
11
Ambitions Realized.
Wipro Limited Wipro Limited (NYSE: WIT, BSE: future-ready, sustainable

Doddakannelli 507685, NSE: WIPRO) is a leading businesses. With 250,000 employees
Sarjapur Road technology services and consulting and business partners across more
Bengaluru – 560 035 company focused on building than 60 countries, we deliver on the
India innovative solutions that address promise of helping our clients,
clients’ most complex digital colleagues, and communities thrive in
Tel: +91 (80) 2844 0011 transformation needs. Leveraging our an ever-changing world.
Fax: +91 (80) 2844 0256 holistic portfolio of capabilities in
wipro.com consulting, design, engineering, and For more information,
operations, we help clients realize please write to us at [email protected]
their boldest ambitions and build
IND/CMOAXIS/MAR2023-MAR2024

Responsible Design and Use of Large Language Models

Uploaded by

Copyright:

Available Formats

Responsible Design and Use of Large Language Models

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Responsible Design and Use of Large Language Models

Uploaded by

Copyright:

Available Formats

Responsible design

and use of large

Table 1: Strengths and limitations of the different approaches of

Finetune • Most effective solution for an • Chances of bias

Building LLMs • Maximum control over training • Requirement of huge

• No chances of any bias • Requirement of

Existing LLM Finetune Existing

Data No model-building Enterprises would need Building a robust and

Architecture Architectural Architecture considered Enterprises must

Hyperparame- Pre-built LLMs Finetuning LLMs A proper strategy

Time and Time and resource As it needs some A colossal

Document There are extensive Finetuning LLMs Enterprise needs to

RISKS TO THE END-USERS

RISKS TO THE ENVIRONMENT

To gauge the impact of a scenario specific preferable to maintain feasibility and

As-is’ via Reference Reference Equalised Odds,

Conclusion perpetuation. Mitigating risks and ensuring

Holistic Evaluation of Language Models (HELM) is an assessment approach that

HellaSwag is an evaluation benchmark designed to assess the physical,

PIQA (Physical Interaction: Question Answering) is an evaluation benchmark

"Fine-tune a pretrained model - Transformers," Hugging Face, [Online]. Available:

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I.

"Toxicity," Hugging Face, [Online]. Available:

X. L. Li, A. Kuncoro, J. Hoffmann, C. de Masson d'Autume, P. Blunsom and A. Nematzadeh, "A

10 Liang et. al., "Holistic Evaluation of Language Models," arXiv, 2022.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi and Y. Choi, "HellaSwag: Can a Machine

K. Sakaguchi, R. Le Bras, C. Bhagavatula and Y. Choi, "WinoGrande: An Adversarial

M. Sap, H. Rashkin, D. Chen, R. Le Bras and Y. Choi, "SocialIQA: Commonsense Reasoning

Wipro Limited Wipro Limited (NYSE: WIT, BSE: future-ready, sustainable

You might also like