Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models

Peiyi Zhang1   Yazhou Zhang1,2,*   Bo Wang1,*   Lu Rong1   Jing Qin2
1 Tianjin University
2 The Hong Kong Polytechnic University
* Corresponding authors.
Abstract

With the recent evolution of large language models (LLMs), concerns about aligning such models with human values have grown. Previous research has primarily focused on assessing LLMs’ performance in terms of the Helpful, Honest, Harmless (3H) basic principles, while often overlooking their alignment with educational values in the Chinese context. To fill this gap, we present Edu-Values, the first Chinese education values evaluation benchmark designed to measure LLMs’ alignment ability across seven dimensions: professional ideology, cultural literacy, educational knowledge and skills, education laws and regulations, teachers’ professional ethics, basic competencies, and subject knowledge. We meticulously design and compile 1,418 questions, including multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture. We conduct both human evaluation and automatic evaluation over 11 state-of-the-art (SoTA) LLMs, and highlight three main findings: (1) due to differences in educational culture, Chinese LLMs significantly outperform English LLMs, with Qwen 2 ranking first with a score of 81.37; (2) LLMs perform well in subject knowledge and teaching skills but struggle with teachers’ professional ethics and basic competencies; (3) LLMs excel at multiple-choice questions but perform poorly on subjective analysis and multi-modal tasks. This demonstrates the effectiveness and potential of the proposed benchmark. Our dataset is available at https://github.com/zhangpeii/Edu-Values.git.



1 Introduction

LLMs have delivered unprecedented performance across a wide spectrum of natural language processing (NLP) tasks. Through instruction fine-tuning and in-context learning, such models have acquired remarkable capabilities in language comprehension, generation, and complex reasoning. Their influence has extended beyond NLP, catalyzing significant advances in various vertical domains, e.g., healthcare Singhal et al. (2023); Bolton et al. (2024), law Cui et al. (2024); Zhou et al. (2024), finance Wang et al.; Zhang and Yang (2023), and especially education Dan et al. (2023).

The integration of domain-specific knowledge into LLMs has produced models with encyclopedic expertise that excel in homework assistance, problem-solving, and personalized learning, thus reshaping the educational landscape. Despite these benefits, there are growing concerns about the potential risks, such as ideological infiltration that could shape students’ values, hostile attacks that threaten the security of education systems, privacy breaches that compromise user data, and educational bias or discrimination. If left unaddressed, these issues may overshadow the transformative potential of LLMs in education. Therefore, a comprehensive evaluation of their role in educational contexts is essential.

Dataset Size Safety? Fairness? Legality? Resp? Multimodal?
Safety-prompts(Sun et al., 2023) 100k
COLD(Deng et al., 2022) 37k
CValues(Xu et al., 2023) 2,100
FLAMES(Huang et al., 2024) 2,251
Edu-Values(Ours) 1,418
Table 1: A brief comparison between existing datasets and our Edu-Values. Resp = responsibility.

As a result, new and challenging benchmarks have been introduced to assess the educational performance of LLMs, such as HELM (Liang et al., 2022), GAOKAO (Zhang et al., 2023a), MMLU (Hendrycks et al.), and M3KE (Liu et al., 2023). However, these benchmarks mainly assess the models’ general abilities and subject knowledge, without addressing their alignment with human educational values. In contrast, another branch of research specifically targets the evaluation of models’ adherence to human moral values, encapsulated in the Helpful, Honest, and Harmless (3H) principles. For example, the Bias Benchmark for QA (BBQ) (Parrish et al., 2022) highlights social bias against protected classes of people along nine social dimensions relevant to the English-speaking environment in the United States. Xu et al. (2023) proposed CValues, a human values assessment benchmark with adversarial and induced prompts, which uses two incremental assessment criteria, safety and responsibility, to assess the alignment of LLMs with human values. FLAMES (Huang et al., 2024) covers both the common principle of harmlessness and five dimensions consistent with human values: fairness, legality, data protection, morality, and safety.

While these benchmarks contribute significantly to ensuring ethical AI behavior, they fail to consider the deeper principles of educational theory, educational needs and educational ethics, limiting their ability to assess LLMs’ effectiveness in promoting educational fairness, respecting cultural diversity, and adhering to legal standards. Therefore, there is an urgent need for more comprehensive frameworks that not only evaluate LLMs’ subject knowledge but also their capacity to enhance educational experiences and uphold fundamental educational values.

To fill this gap, we propose Edu-Values, which is, to the best of our knowledge, the first Chinese-language benchmark for assessing the alignment of LLMs with educational values. The benchmark consists of 1,418 questions, including multiple-choice questions, multimodal question answering, subjective analysis questions, adversarial prompts, and traditional Chinese culture questions, and aims to measure the educational values of LLMs from multiple dimensions. Table 1 compares Edu-Values with existing benchmarks.

Our main contributions are as follows:

  • We propose Edu-Values, the first Chinese educational values assessment benchmark, which addresses the lack of value-alignment evaluation in the education field. It contains 1,418 questions, each tailored to examine a specific value dimension. The value dimensions examined include professional ideology, education laws and regulations, teachers’ professional ethics, cultural literacy, educational knowledge and skills, basic competencies and subject knowledge.

  • We not only test 10 LLMs that support Chinese input, but also construct an assessment method that uses LLMs as scorers combined with manual correction. Combining the two approaches keeps the assessment process efficient and stable, and reduces both the possible systematic bias of LLMs when scoring different answers and the human bias and subjectivity of manual scoring, making the scoring results more objective and fair.

2 The Edu-Values Benchmark

In this section, we first describe the definition of educational values and the design objectives of the Edu-Values Benchmark. Then, the components of the Edu-Values benchmark are described, as well as the process of data collection and test set construction. Finally, we introduce the scoring methodology, which combines automatic and manual scoring.

2.1 Definition of Educational Values

A study of mainstream educational values reveals that they can be broadly divided into two mutually reinforcing levels: micro and macro. At the micro level, they are goal-oriented and closely linked to the elements within the education system, emphasising their organic integration and mutual reinforcement. At the macro level, educational values focus on the harmonious coexistence and resonance of values between education and the wider social system, reflecting a kind of cross-sectoral coordination and cooperation.

As a unified whole, the micro and macro levels of educational values are mutually constraining and interpenetrating. As the inner core of education, the micro level plays a decisive role in the nature, purpose and implementation of education, while the macro level represents the external environment and social influence of education, and is the broad stage for the display of educational values in social practice, which is also of vital importance. The two together form the rich connotation of educational values that guide the comprehensive development of education.

Edu-Values is intended to help researchers and developers assess the compatibility of LLMs with the current educational values of human society, and thus help promote the effective alignment of LLMs and human values in the field of education.

2.2 Composition of Edu-Values

Based on the understanding of mainstream educational values and combining the existing Chinese educational values assessment content, we propose Edu-Values. As shown in Figure 1, our benchmark consists of seven dimensions:

  • Professional Ideology Aims to ensure that LLMs hold a correct view of education, students and teachers; understand the national requirements for implementing quality education and for the professional development of teachers; and treat the holistic development of students as the basis of their educational and teaching activities.

  • Education Laws and Regulations Designed to assess LLMs’ knowledge of the country’s key education laws and regulations, as well as their familiarity with the rights and responsibilities of teachers and the legal rights of students. The aim is to prevent LLMs from encouraging or inducing users to commit acts contrary to national education laws and regulations, and to ensure that the legitimate rights of teachers and students are not violated.

    Figure 1: Distribution of Edu-Values questions on seven dimensions: professional philosophy, educational laws and regulations, teacher ethics, cultural literacy, basic competencies, educational knowledge and skills, and subject knowledge.

  • Teachers’ Professional Ethics Intended to ensure that LLMs act in accordance with the code of ethics of the teaching profession, respecting the law and socially accepted norms of behaviour, and to assess the ability of LLMs to apply the code of conduct in their educational activities to manage their relationships with, among others, their students, their parents, their colleagues and their educational administrators.

  • Cultural Literacy Focuses on the performance of LLMs in scientific, literary, historical and artistic literacy, and requires LLMs to have a foundation of general scientific knowledge, an accumulation of literary knowledge, an understanding of cultural literacy and a good appreciation of art.

  • Basic Competencies Covering reading comprehension, logical reasoning, information processing and pedagogical writing skills.

  • Educational Knowledge and Skills Emphasising the mastery of basic educational theories, skills in student guidance and classroom management, the integration of subject knowledge, and the comprehensive application of these skills in the design, implementation and evaluation of teaching and learning.

  • Subject Knowledge Specifically examining LLMs’ expertise in subject areas such as language, mathematics, chemistry, music and art, as well as their performance in key aspects of instructional design, implementation and evaluation.

2.3 Dataset Construction

We constructed the dataset based on the above definition and categorisation of educational values. As shown in Table 2, we collected a total of 1,418 questions covering different levels of education, such as kindergarten, primary, secondary and university, and spanning different question types.

Question Type Numbers
Multiple-choice 1085
Multimodal 100
Subjective analysis 113
Adversarial 100
Chinese culture 20
Overall 1418
Table 2: The five types of question contained in the Edu-Values.

Multiple-choice, multimodal and adversarial questions are worth 1 point each, while subjective analysis questions and traditional Chinese culture questions are worth 5 points each, giving a total of 1,950 points across all questions. Some of our questions are taken from past papers of the Chinese Teacher Qualification Examination, and the rest were written manually based on the Chinese education system and culture.
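The total can be checked directly from the counts in Table 2 and the point values stated above. The short sketch below does this arithmetic; the dictionary keys and the `max_points` helper are names chosen here for illustration and do not come from the benchmark's code.

```python
# Question counts from Table 2 and per-question points from the text.
QUESTION_COUNTS = {
    "multiple_choice": 1085,
    "multimodal": 100,
    "subjective_analysis": 113,
    "adversarial": 100,
    "chinese_culture": 20,
}
POINTS_PER_QUESTION = {
    "multiple_choice": 1,
    "multimodal": 1,
    "subjective_analysis": 5,
    "adversarial": 1,
    "chinese_culture": 5,
}


def max_points(exclude=()):
    """Maximum attainable points, optionally leaving out some question types."""
    return sum(
        QUESTION_COUNTS[qtype] * POINTS_PER_QUESTION[qtype]
        for qtype in QUESTION_COUNTS
        if qtype not in exclude
    )


total_questions = sum(QUESTION_COUNTS.values())
print(total_questions)                      # 1418
print(max_points())                         # 1950
print(max_points(exclude=("multimodal",)))  # 1850
```

The last value, 1,850, is the maximum available to models that cannot take the multimodal questions.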

  • multiple-choice questions We collected multiple-choice questions from the 2016 to 2024 Chinese teacher qualification exams at kindergarten, primary and secondary school levels, as well as from the Chinese university teacher qualification exams, for the following subjects: comprehensive quality, childcare and education knowledge and ability, education and teaching knowledge and ability, education knowledge and ability, and subject knowledge and teaching ability.

  • multimodal questions Comprehensive questions combining both image and text modalities were used to examine LLMs’ ability to acquire and process different forms of information. The data came from the Teacher Qualification Examination, which not only tested LLMs’ mastery of professional knowledge, but also incorporated the examination of their thinking literacy, logical reasoning and information integration skills.

  • adversarial questions These questions use traps, inducements, obscurity, camouflage, deception and similar techniques to elicit incorrect outputs from the model. They cover the Professional Ethics of Primary and Secondary School Teachers, the Measures for the Punishment of Regulatory Violations in State Education Examinations, the Law on the Protection of Minors and other laws.

  • subjective analysis questions The questions are drawn from the last five years of the Chinese Teacher Qualification Examination and include 51 short-answer questions, 16 essay questions, 17 true-or-false analysis questions, and 29 material analysis questions.

  • Chinese traditional culture questions These questions mainly examine the differences between China and other countries in terms of educational goals, methods, systems, culture, concepts and evaluation. The reference answers to these questions were written manually.

Model Overall Multiple-choice Adversarial Subjective analysis Chinese culture
Qwen-2-72B 81.37 90.14 80.00 65.62 76.60
ERNIE-4 80.72 89.22 79.00 65.63 75.60
Baichuan-4 78.74 88.85 64.00 61.98 78.40
ERNIE-3.5 78.49 86.64 79.00 63.33 75.20
Baichuan-3-Turbo 78.13 90.05 68.00 57.45 75.80
ChatGLM-4 75.84 81.84 86.00 62.08 78.20
GPT-4 72.39 78.43 64.00 61.95 74.20
Claude-3 72.04 72.81 80.00 67.42 81.80
Llama-3-70B 71.91 74.75 72.00 65.24 78.80
GPT-3.5 63.25 62.21 62.00 63.58 74.00
Table 3: Scores of LLMs on different types of questions. Models are ranked in descending order based on total score.
Model Overall Multimodal
Qwen-2-72B 85.10 69.00
ERNIE-4 83.54 52.00
Baichuan-4 82.19 64.00
ERNIE-3.5 81.95 64.00
ChatGLM-4 78.32 46.00
Claude-3 75.82 70.00
GPT-4 75.47 57.00
Table 4: Scores on multimodal questions for LLMs that support multimodal inputs, and total scores for all questions after including multimodal questions.

2.4 Evaluation Method

We adopted a differentiated scoring strategy based on the characteristics of the different question types. For multiple-choice and multimodal questions, we used an automated approach. We first fed the questions and their options to the LLMs, then collected the answers and automatically compared them with the pre-defined standard answers to measure the performance of each model on such questions.
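A minimal sketch of this kind of automated comparison is shown below. It assumes four options labelled A to D and a simple rule that takes the first standalone capital letter in the response as the model's choice; both the rule and the function names are assumptions made here for illustration, not the extraction procedure used in the paper.

```python
import re

# Match a capital A-D that is not part of a longer ASCII word.
# Lookarounds are used instead of \b so that "答案是B。" also matches.
_CHOICE_PATTERN = re.compile(r"(?<![A-Za-z])([A-D])(?![A-Za-z])")


def extract_choice(response):
    """Return the first standalone option letter in the response, or None."""
    match = _CHOICE_PATTERN.search(response)
    return match.group(1) if match else None


def multiple_choice_accuracy(responses, gold_answers):
    """Percentage of responses whose extracted letter equals the standard answer."""
    if len(responses) != len(gold_answers):
        raise ValueError("responses and gold_answers must have the same length")
    if not gold_answers:
        raise ValueError("at least one question is required")
    correct = sum(
        extract_choice(response) == gold
        for response, gold in zip(responses, gold_answers)
    )
    return 100.0 * correct / len(gold_answers)


responses = ["答案是B。", "C", "I am not sure."]
gold_answers = ["B", "C", "A"]
accuracy = multiple_choice_accuracy(responses, gold_answers)
print(round(accuracy, 2))  # 66.67
```

A first-letter heuristic like this can misfire on responses that begin with the article "A", so in practice the prompt would normally ask the model to reply with the option letter only.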

For the other types, including subjective analysis questions, adversarial questions, and questions related to traditional Chinese culture, we adopt a more refined evaluation approach: we use advanced LLMs as scoring models and combine them with manual correction. First, we input the questions into the evaluated LLMs and obtain their responses. Second, we develop scoring criteria for the questions based on the dimensions and question types examined. Finally, we design scoring prompts based on these criteria to guide the scoring model to score the responses accurately against the standard answers, supplemented by manual correction and auditing to ensure the fairness and accuracy of the scores.
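The three steps above can be sketched as follows. The prompt wording, the `Score: <number>` reply format, and the rule that a human score overrides the scoring model are all assumptions made for this illustration; the paper does not give its actual prompts or correction procedure, and no real model API is called here.

```python
import re

# Hypothetical prompt wording; the actual scoring prompts are not given in the text.
PROMPT_TEMPLATE = (
    "You are grading an answer to a teacher qualification exam question.\n"
    "Question: {question}\n"
    "Standard answer: {reference}\n"
    "Scoring criteria: {criteria}\n"
    "Model response: {response}\n"
    "Give a score from 0 to {max_score} in the form 'Score: <number>'."
)


def build_scoring_prompt(question, reference, response, criteria, max_score=5):
    """Fill the template with one question, its standard answer and a response."""
    return PROMPT_TEMPLATE.format(
        question=question,
        reference=reference,
        criteria=criteria,
        response=response,
        max_score=max_score,
    )


def parse_score(judge_output, max_score=5):
    """Read 'Score: x' from the scoring model's reply and cap it at max_score."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_output)
    if match is None:
        return None  # no usable score: leave the item for manual scoring
    return min(float(match.group(1)), float(max_score))


def final_score(judge_score, human_score=None):
    """Assumed correction rule: a human-assigned score replaces the model's score."""
    return judge_score if human_score is None else human_score


prompt = build_scoring_prompt("Q", "A", "R", "C", max_score=5)
print(parse_score("Score: 4.5"))  # 4.5
print(final_score(4.0, 3.0))      # 3.0
```

Passing `max_score=1` would cover the adversarial questions, which are worth 1 point each.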

3 Results and Analysis

In this section, we first evaluate the performance of ten state-of-the-art LLMs on Edu-Values using the aforementioned evaluation strategy. Then, the experimental results are analysed in detail from the perspectives of both test types and examined dimensions.

3.1 Experimental Settings

We evaluated a range of LLMs that support Chinese on the Edu-Values benchmark. The evaluated models include Baichuan-3-Turbo (Baichuan, 2024), Baichuan-4 (Baichuan, 2024), ChatGLM4 (Du et al., 2022), Claude 3 (Bai et al., 2022), ERNIE-3.5-8K (Sun et al., 2021), ERNIE-4.0-8K (Sun et al., 2021), GPT 3.5 (OpenAI, 2023), GPT 4 (OpenAI, 2023), Llama3-70b (Dubey et al., 2024), and Qwen-2-72b (Yang et al., 2024).

3.2 Results on Different Types

Model Overall Professional Ideology Education Laws and Regulations Cultural Literacy Basic Competencies Educational Knowledge and Skills Teachers’ Professional Ethics Subject Knowledge
Qwen-2-72B 81.37 77.01 86.73 87.10 78.38 79.41 71.95 91.48
ERNIE-4 80.72 77.01 85.07 82.58 77.43 78.93 73.77 91.48
Baichuan-4 78.74 75.17 88.29 85.48 71.81 77.45 68.57 86.67
ERNIE-3.5 78.49 76.32 81.85 82.47 74.19 77.33 70.22 87.78
Baichuan-3-Turbo 78.13 75.10 86.44 83.33 71.62 75.48 67.62 90.00
ChatGLM-4 75.84 72.41 77.17 77.96 70.29 75.60 73.16 83.7
GPT-4 72.39 69.81 75.80 75.81 68.57 71.46 69.52 77.04
Claude-3 72.04 74.64 70.63 68.39 73.62 71.85 75.06 69.63
Llama-3-70B 71.91 71.72 73.85 73.33 72.57 71.25 68.83 72.96
GPT-3.5 63.25 67.59 62.05 61.29 63.9 65.22 63.12 57.41
Table 5: Scores of all LLMs on each dimension when multimodal questions are not included.

Table 3 shows the results of the 10 LLMs evaluated on 4 question types: multiple-choice, subjective analysis, adversarial and traditional Chinese culture questions with the results of each subcomponent. Table 4 shows the evaluation results of the 7 LLMs supporting multimodal input on multimodal questions. We have linearly normalised all the scores to make the results easier to understand. From these results, we derive the following observations:

  • Overall, the best performer when multimodal questions were excluded was Qwen2-72b, with a score of 81.37. ERNIE-4 came second, with relatively even performance across question types. GPT3.5 was the worst performer, with an overall score of 63.25, well below the scores of the other LLMs. When multimodal questions were included, the best performer was again Qwen2-72b with a score of 80.74, while the worst performer was GPT4 with a score of 71.60.

  • The performance of the different LLMs was uneven across question types. The best performer on multiple-choice questions was Qwen2-72b, while Claude3 was ahead of the other LLMs on subjective analysis and multimodal questions. ChatGLM4 performed best on the difficult adversarial questions but worst on multimodal questions. Baichuan3-Turbo was second only to Qwen2-72b on the multiple-choice questions, but worst on the subjective analysis questions.
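The relation between the per-type scores and the Overall column can be illustrated with a short calculation. The sketch below assumes that the overall score is the total of points earned divided by the total of points available, rescaled to 100; this weighting is an assumption rather than a formula stated in the text, although it does reproduce the Overall values reported for Qwen-2-72B in Tables 3 and 6.

```python
# Maximum points per question type: count x points per question (Section 2.3).
MAX_POINTS = {
    "multiple_choice": 1085,     # 1085 questions x 1 point
    "adversarial": 100,          # 100 questions x 1 point
    "subjective_analysis": 565,  # 113 questions x 5 points
    "chinese_culture": 100,      # 20 questions x 5 points
    "multimodal": 100,           # 100 questions x 1 point
}


def overall_score(type_scores):
    """Point-weighted overall score on a 0-100 scale.

    type_scores maps a question type to its normalised score (0-100).
    Only the types present in type_scores contribute to the total.
    """
    earned = sum(score / 100 * MAX_POINTS[qtype] for qtype, score in type_scores.items())
    available = sum(MAX_POINTS[qtype] for qtype in type_scores)
    return 100 * earned / available


# Per-type scores of Qwen-2-72B from Table 3.
qwen = {
    "multiple_choice": 90.14,
    "adversarial": 80.00,
    "subjective_analysis": 65.62,
    "chinese_culture": 76.60,
}
print(round(overall_score(qwen), 2))  # 81.37

# Adding the multimodal score from Table 4.
qwen_with_multimodal = dict(qwen, multimodal=69.00)
print(round(overall_score(qwen_with_multimodal), 2))  # 80.74
```

Because multiple-choice questions account for 1,085 of the 1,850 non-multimodal points, the overall ranking is driven largely by multiple-choice accuracy.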

Figure 2: (a) Radar chart of the overall score distribution of all LLMs on the seven dimensions (excluding multimodal questions). (b) Radar chart of the overall score distribution on the seven dimensions for LLMs supporting multimodal inputs (including multimodal questions).

3.3 Results on Different Dimensions

Figure 2(a) shows the results of the 10 assessed LLMs on all questions except the multimodal ones across the seven dimensions of professional ideology, education laws and regulations, teachers’ professional ethics, cultural literacy, basic competencies, educational knowledge and skills, and subject knowledge. Figure 2(b) shows the results of the seven LLMs supporting multimodal inputs on the seven dimensions for all questions, including the multimodal ones. As before, we linearly normalised all scores. The specific scores for each dimension are shown in Tables 5 and 6. We make the following observations:

Model Overall Professional Ideology Education Laws and Regulations Cultural Literacy Basic Competencies Educational Knowledge and Skills Teachers’ Professional Ethics Subject Knowledge
Qwen-2-72B 80.74 77.78 86.73 84.51 74.94 79.41 71.95 94.64
ERNIE-4 79.25 77.41 85.07 75.93 72.26 78.93 73.77 94.64
Baichuan-4 77.98 76.00 88.29 81.86 68.89 77.45 68.57 89.66
ERNIE-3.5 77.74 76.37 81.85 78.05 72.72 77.33 70.22 90.80
ChatGLM-4 74.31 72.22 77.17 71.24 65.75 75.6 73.16 86.59
Claude-3 71.93 75.48 70.63 67.79 72.64 71.85 75.06 72.03
GPT-4 71.60 70.07 75.80 72.57 65.52 71.46 69.52 79.69
Table 6: The scores of LLMs supporting multimodal inputs on each dimension when multimodal questions were included.
Figure 3: Distribution of results for each LLM versus Qwen2-72b ((a) excluding multimodal questions, (b) including multimodal questions).

  • Overall, Qwen2-72b performed best both with and without multimodal questions, ranking first or tied for first on five dimensions: professional ideology, cultural literacy, educational knowledge and skills, basic competencies and subject knowledge. In addition, Baichuan4 was the best on the dimension of education laws and regulations, while Claude3 was the best on the dimension of teachers’ professional ethics.

  • On the subject knowledge dimension, LLMs demonstrated a high level of competence, as they were able to respond accurately to questions that focused on objective subject knowledge, which usually had clear standard answers. However, the performance of LLMs on the dimension of professional ethics of teachers leaves much to be desired. This may be because professional ethics questions often involve complex moral judgements, situational understanding and ethical reasoning, and these dimensions cannot always be defined by simple standard answers. Therefore, although LLMs perform well in the area of subject knowledge, their competencies need to be further developed and refined when it comes to deeper humanistic concerns and ethical judgements.

4 LLMs Versus

We compared Qwen-2-72b, the best overall performer, against every other model and counted the win, loss and tie rates for each model, as shown in Figure 3. For each question, we calculated the score difference between the other LLM and Qwen-2-72b. For multiple-choice, multimodal, and adversarial questions, a difference of 1 counts as a win for the model, 0 as a tie, and -1 as a loss. For other questions, the model wins when the difference is greater than 0.5 points, ties when the difference is between -0.5 and +0.5, and loses when the difference is less than -0.5. We can make the following observations and analyses:

  • Overall performance of LLMs When multimodal questions were not included, all LLMs won between 4% and 7% of their matches against Qwen-2-72b, with a tie rate of over 65% and a loss rate between 7% and 31%. When multimodal questions were included, all LLMs had win rates against Qwen-2-72b between 6% and 9%, tie rates over 70%, and loss rates between 11% and 23%. Overall, all other LLMs had a low win rate of less than 10% against Qwen-2-72b and a tie rate of more than 65%, suggesting that all LLMs were at a similar level on the basic questions.

  • When multimodal questions are excluded, the three highest win rates belong to ERNIE-4, Claude3 and ChatGLM4, and the three highest loss rates to GPT3.5, Claude3 and Llama3-70b. When multimodal questions are included, the three highest win rates belong to ERNIE-4, ChatGLM4 and Claude3, in that order, and the three highest loss rates to Claude3, GPT4 and ChatGLM4, in that order. This shows that ERNIE-4 is relatively stable in head-to-head comparison with Qwen-2-72b, while Claude3 and ChatGLM4 are relatively less stable.
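The win, tie and loss rule described at the start of this section can be written down compactly. The sketch below follows the stated thresholds; the function names and the record layout are hypothetical and chosen only for this example.

```python
from collections import Counter

# Question types scored 0 or 1, where the difference can only be -1, 0 or 1.
BINARY_TYPES = {"multiple_choice", "multimodal", "adversarial"}


def outcome(question_type, model_score, baseline_score):
    """Classify one question as 'win', 'tie' or 'loss' for the model."""
    diff = model_score - baseline_score
    # Binary questions: any non-zero difference decides the result.
    # Other questions: the difference must exceed 0.5 points in either direction.
    threshold = 0.0 if question_type in BINARY_TYPES else 0.5
    if diff > threshold:
        return "win"
    if diff < -threshold:
        return "loss"
    return "tie"


def outcome_rates(records):
    """records: iterable of (question_type, model_score, baseline_score)."""
    counts = Counter(outcome(*record) for record in records)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("win", "tie", "loss")}


records = [
    ("multiple_choice", 1, 0),         # win
    ("adversarial", 1, 1),             # tie
    ("subjective_analysis", 3.0, 3.5), # tie: difference of -0.5 is within the band
    ("chinese_culture", 2.0, 4.0),     # loss
]
rates = outcome_rates(records)
print(rates)  # {'win': 0.25, 'tie': 0.5, 'loss': 0.25}
```

Treating the boundaries at exactly -0.5 and +0.5 as ties follows the wording of the rule, which requires a difference strictly greater than 0.5 for a win and strictly less than -0.5 for a loss.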

5 Conclusion

In this paper, we present Edu-Values, the first Chinese-language benchmark test to assess the alignment of LLMs with human values in education. We test ten state-of-the-art LLMs through an automated assessment combined with manual evaluation. The results show that most LLMs perform moderately well overall, especially in the dimension of subject knowledge, while there is still some room for improvement in the dimension of teacher ethics. We hope that Edu-Values can be used to highlight the potential risks of LLMs in education and to promote the alignment of LLMs with human educational values.

References

  • Askell et al. (2021) Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
  • Baichuan (2024) Baichuan. 2024. https://www.baichuan-ai.com/.
  • Bolton et al. (2024) Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, et al. 2024. Biomedlm: A 2.7 b parameter language model trained on biomedical text. arXiv preprint arXiv:2403.18421.
  • Cui et al. (2024) Jiaxi Cui, Munan Ning, Zongjian Li, Bohua Chen, Yang Yan, Hao Li, Bin Ling, Yonghong Tian, and Li Yuan. 2024. Chatlaw: A multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of-experts large language model. arXiv preprint arXiv:2306.16092.
  • Dan et al. (2023) Yuhao Dan, Zhikai Lei, Yiyang Gu, Yong Li, Jianghao Yin, Jiaju Lin, Linhao Ye, Zhiyan Tie, Yougen Zhou, Yilei Wang, et al. 2023. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv preprint arXiv:2308.02773.
  • Deng et al. (2022) Jiawen Deng, Jingyan Zhou, Hao Sun, Chujie Zheng, Fei Mi, Helen Meng, and Minlie Huang. 2022. Cold: A benchmark for chinese offensive language detection. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11580–11599.
  • Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
  • Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  • Hendrycks et al. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations.
  • Hosseini et al. (2017) Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving google’s perspective api built for detecting toxic comments. arXiv preprint arXiv:1702.08138.
  • Huang et al. (2024) Kexin Huang, Xiangyang Liu, Qianyu Guo, Tianxiang Sun, Jiawei Sun, Yaru Wang, Zeyang Zhou, Yixu Wang, Yan Teng, Xipeng Qiu, et al. 2024. Flames: Benchmarking value alignment of llms in chinese. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4551–4591.
  • Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
  • Liu et al. (2023) Chuang Liu, Renren Jin, Yuqi Ren, Linhao Yu, Tianyu Dong, Xiaohan Peng, Shuting Zhang, Jianxiang Peng, Peiyi Zhang, Qingqing Lyu, et al. 2023. M3ke: A massive multi-level multi-subject knowledge evaluation benchmark for chinese large language models. arXiv preprint arXiv:2305.10263.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Parrish et al. (2022) Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. 2022. Bbq: A hand-built bias benchmark for question answering. Findings of the Association for Computational Linguistics: ACL 2022.
  • Singhal et al. (2023) Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature, 620(7972):172–180.
  • Sun et al. (2023) Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. 2023. Safety assessment of chinese large language models. arXiv preprint arXiv:2304.10436.
  • Sun et al. (2021) Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, et al. 2021. Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137.
  • Wang et al. Neng Wang, Hongyang Yang, and Christina Wang. Fingpt: Instruction tuning benchmark for open-source large language models in financial datasets. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
  • Xu et al. (2023) Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, et al. 2023. Cvalues: Measuring the values of chinese large language models from safety to responsibility. arXiv preprint arXiv:2307.09705.
  • Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
  • Zhang et al. (2023a) Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 2023a. Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474.
  • Zhang and Yang (2023) Xuanyu Zhang and Qing Yang. 2023. Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters. In Proceedings of the 32nd ACM international conference on information and knowledge management, pages 4435–4439.
  • Zhang et al. (2023b) Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, and Minlie Huang. 2023b. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045.
  • Zhou et al. (2024) Zhi Zhou, Jiang-Xin Shi, Peng-Xiao Song, Xiao-Wen Yang, Yi-Xuan Jin, Lan-Zhe Guo, and Yu-Feng Li. 2024. Lawgpt: A chinese legal knowledge-enhanced large language model. arXiv preprint arXiv:2406.04614.