Syaamantak Das
Centre for Educational Technology, Indian Institute of Technology Kharagpur, India
ORCID: 0000-0001-9896-3312
Shyamal Kumar Das Mandal
Centre for Educational Technology, Indian Institute of Technology Kharagpur, India
ORCID: 0000-0002-4088-3173
Anupam Basu
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India
National Institute of Technology Durgapur, India
ORCID: 0000-0002-1960-9225
Abstract
Cognitive learning complexity identification of assessment questions is an essential task in the domain of
education, as it helps both the teacher and the learner to discover the thinking process required to answer
a given question. Bloom's Taxonomy cognitive levels are considered a benchmark standard for the
classification of cognitive thinking (learning complexity) in an educational environment. However, it has been
observed that some of the action verbs of Bloom's Taxonomy overlap across multiple levels of the
hierarchy, causing ambiguity about the real sense of cognition required. This paper describes two
methodologies to automatically identify the cognitive learning complexity of given questions. The first
methodology uses labelled Latent Dirichlet Allocation (LDA) as a machine learning approach. The second
methodology uses the BERT framework for multi-class text classification as a deep learning approach. The
experiments were performed on an ensemble of 3000+ educational questions, drawn from
previously published datasets along with the TREC question corpus and the AI2 Biology How/Why question
corpus. The labelled LDA reached an accuracy of 83%, while the BERT-based approach reached 89%
accuracy. An analysis of both results is presented, evaluating the significant factors responsible for
determining cognitive knowledge.
Keywords: multi-class text classification, labelled LDA, pre-trained BERT (Bidirectional Encoder
Representations from Transformers), question classification
INTRODUCTION
In the field of education, it is essential to construct a cognitively well-balanced question paper. One of the
standard approaches to this task is the use of Bloom's Taxonomy (Bloom et al., 1956), created by
Benjamin Bloom in the 1950s. Bloom's Taxonomy classifies educational objectives and learning outcomes
into multiple cognitive levels based on the complexity of the thinking behaviour required for
successful completion of learning. The six levels, ordered by the prior knowledge and skills they build on, are
(i) knowledge / remembering, (ii) comprehension / understanding, (iii) applying, (iv) analyzing, (v)
evaluating / synthesis, and (vi) creating. Each of these levels consists of several action verbs that depict the
thinking required, e.g., Define, Analyze, Compare. However, it was observed in the work of Stanny (2016)
that these words (Bloom's Taxonomy action verbs, BTAVs) often co-occur in multiple levels, causing
ambiguity about the true sense of cognition. To overcome this problem, this paper exploits the
effectiveness of multi-class text classification algorithms, using both machine learning and deep learning-based
approaches. These methodologies are used to classify a given question into its most appropriate Bloom's
Taxonomy cognitive level.
Assessment questions consist of text sentences that are not cognitively structured. They also vary across multiple
types: (i) a question can be based on WH words such as What, Why, or How, e.g., What factors
could lead to the rise of a new species?; (ii) a question can consist of only Bloom's Taxonomy action verbs,
e.g., Explain some of the important ideas of the above section in your own words.; (iii) a question can contain
both WH words and cognitive level action verbs, e.g., Why do you think average income is an important criterion
for development? Explain.; (iv) a question may contain neither cognitive level action verbs nor WH words,
e.g., Will universal basic income be beneficial for the society?
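As an illustration of these four surface forms, the following is a minimal sketch of how a question might be bucketed by surface pattern; the WH-word and action-verb lists are small hypothetical samples, not an authoritative BTAV list.

```python
# Illustrative sketch only: the word lists below are small hypothetical
# samples, not an authoritative Bloom's Taxonomy action verb (BTAV) list.
WH_WORDS = {"what", "why", "how", "which", "who", "when", "where"}
BTAV_SAMPLE = {"define", "explain", "compare", "analyze", "design", "evaluate"}

def surface_type(question: str) -> str:
    """Classify a question into one of the four surface forms (i)-(iv)."""
    tokens = {t.strip("?.,!").lower() for t in question.split()}
    has_wh = bool(tokens & WH_WORDS)
    has_btav = bool(tokens & BTAV_SAMPLE)
    if has_wh and has_btav:
        return "(iii) WH word + action verb"
    if has_wh:
        return "(i) WH word only"
    if has_btav:
        return "(ii) action verb only"
    return "(iv) neither"

print(surface_type("Why do you think average income is an important "
                   "criterion for development? Explain."))
# -> (iii) WH word + action verb
```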
The task of classifying and analysing assessment questions falls into the category of unstructured short
text classification. One of the assumptions for classifying the cognitive level of a given question is that it
should belong to only one particular class. Thus, with multiple cognitive levels, it becomes a problem of multi-
class classification. Considering each Bloom's Taxonomy cognitive level as a topic, the action verbs can be
considered as terms/words belonging to that topic, which makes it a topic modelling task. Latent Dirichlet
Allocation (LDA) is one of the standard algorithms for unstructured short text topic
modelling (Massey, 2011; Uys, Du Preez, & Uys, 2008). Unlike Latent Semantic Indexing, which uses singular
value decomposition and the bag-of-words representation of text documents, LDA represents texts as
random mixtures over latent topics, where each topic is characterized by a distribution of words in the corpus
(Blei, Ng, & Jordan, 2003). As the cognitive levels of the training data are known in advance, the
computational methodology uses supervised learning.
The dataset used for this work is an ensemble of educational assessment questions obtained from
four different existing works: (i) the Microsoft Search Lab NCERT dataset (Agrawal, Gollapudi, Kannan, &
Kenthapadi, 2014), (ii) the paper of Yahya, Toukal, and Osman (2012), (iii) the paper of Jain, Beniwal, Ghosh,
Grover, and Tyagi (2019), and (iv) the TREC question classification dataset (Li & Roth, 2002).
Apart from these, two other datasets, the AI2 WHY question dataset and the AI2 HOW question
dataset (Jansen, Surdeanu, & Clark, 2014), were used for testing the Why and How questions. For
computation, the Amazon AWS Comprehend service (Bhatia, Celikkaya, Khalilia, & Senthivel,
2019; Zarei & Nik-Bakht, 2019) was used to run the labelled LDA methodology. For deep learning,
the BERT framework from Google (Devlin, Chang, Lee, & Toutanova, 2018) was used in the Google Colab
environment with a GPU.
The paper describes the methodology for identifying the cognitive learning complexity of assessment
questions using multi-class text classification, along with a brief overview of Bloom's Taxonomy and a review
of the related work. This is followed by data preparation and experimental setup. Finally, the results and
analysis are presented, followed by future work and the conclusion.
LITERATURE REVIEW
Literature on Unstructured Short Text Classification using Labelled LDA and BERT
Rationale for algorithm selection: Word co-occurrence models (Bicalho, Pita, Pedrosa, Lacerda, & Pappa,
2017), topic modelling (Zhang & Zhong, 2016), and word embedding clustering (Wang et al., 2016) are all
examples of standard short text analysis methods. However, these models are useful only when there is a
sufficiently large training set. Transfer learning (Pan & Yang, 2009) was developed as an alternative method
to reduce the need for training data. Transfer learning can be an effective method for short-text classification
and requires little domain-specific training data (Long, Chen, Zhu, & Zhang, 2012; Phan, Nguyen, & Horiguchi,
2008); however, one of its major drawbacks is that it requires building a new model for every new
classification task.
Two algorithms were chosen for this research: labelled LDA for the machine learning approach
and a pre-trained BERT model for the deep learning approach.
LDA: The traditional LDA model (Blei et al., 2003) draws, for each document d, a multinomial mixture
distribution θ^(d) over all K topics from a Dirichlet prior α. In Labelled LDA (Ramage, Hall,
Nallapati, & Manning, 2009), used in this research, the topics (Bloom's Taxonomy cognitive levels) are already
known. Therefore, θ^(d) is restricted to be defined only over the topics that match the document's labels Λ^(d).
Since the word-topic assignments z_i (Table 1) are drawn from this distribution, this restriction ensures that
all topic assignments are limited to the document's labels. This essentially makes the algorithm learn a
bag-of-words model for each label, but with a shared Dirichlet prior η. As each document here has only a
single label, its topic assignment is limited to the corresponding topic, and all its words are generated from
the same multinomial distribution, because Λ^(d) ensures that only one entry of θ^(d) is nonzero. β denotes
the topic multinomials. The model of labelled LDA is shown in Figure 1.
Figure 1. Labelled LDA model using only one label per document
Note: Based on the original Labelled LDA model by Ramage et al. (2009). β_k is the multinomial distribution over the
vocabulary for topic k, drawn from the Dirichlet prior η; N is the length of document d; w represents the list of word
indices, and z represents the word-topic assignments.
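Since each question in this work carries exactly one label, the training described above effectively reduces to estimating one smoothed bag-of-words multinomial per cognitive level, with the prior η shared across labels. The following is a minimal sketch of that reduction under a toy corpus; it is an illustration of the model, not the Amazon Comprehend implementation used later.

```python
from collections import Counter, defaultdict
import math

# Toy single-label corpus (question, Bloom level); illustrative data only,
# not the paper's dataset.
corpus = [
    ("define the term photosynthesis", "Knowledge"),
    ("explain the water cycle in your own words", "Comprehension"),
    ("compare mitosis and meiosis", "Analysis"),
]

eta = 0.1  # shared symmetric Dirichlet (smoothing) prior over words
vocab = {w for q, _ in corpus for w in q.split()}
counts = defaultdict(Counter)
for q, label in corpus:
    counts[label].update(q.split())

def log_beta(label, word):
    """Smoothed per-label word probability: the topic multinomial beta."""
    c = counts[label]
    return math.log((c[word] + eta) / (sum(c.values()) + eta * len(vocab)))

def classify(question):
    """Pick the label whose word multinomial best explains the question."""
    words = [w for w in question.lower().split() if w in vocab]
    return max(counts, key=lambda lab: sum(log_beta(lab, w) for w in words))

print(classify("Define the term mitosis"))  # -> Knowledge
```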
BERT: BERT stands for Bidirectional Encoder Representations from Transformers. The algorithm pre-trains
deep bidirectional representations from unlabelled text by jointly conditioning on both the left and
right contexts. Therefore, the pre-trained BERT model can be fine-tuned with just one additional output layer
for use in a wide range of NLP tasks. With the continuous growth of unlabelled text data, pre-trained
language models like BERT can give better results (Howard & Ruder, 2018; Radford et al., 2019).
Even for tasks such as short text classification, which are difficult to model statistically due to the scarcity of
features and training data, a pre-trained language model can be useful (Luo & Wang, 2019). The method takes
advantage of general language understanding to comprehend contextually relevant new words, without
requiring additional domain data, which matters where data volume is limited.
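To make the "one additional output layer" idea concrete, the following is a minimal sketch using the Hugging Face transformers library (an assumed implementation; the paper does not name the library used), attaching a randomly initialized six-way classification head to the pre-trained uncased BERT base model:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Sketch only: Hugging Face transformers is an assumption; the paper does
# not specify which BERT implementation was used in Google Colab.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained BERT body plus one new output layer, with one class per
# Bloom's Taxonomy cognitive level.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6
)

question = "What factors could lead to the rise of a new species?"
inputs = tokenizer(question, return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits       # shape (1, 6): raw class scores
probs = torch.softmax(logits, dim=-1)     # probabilities over the 6 levels
print(probs)
```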
Training Data: For this research, the following approach was taken to identify the cognitive level of
an assessment question. First, a set of 434 questions from the NCERT dataset (Agrawal et al., 2014) was
manually annotated by five subject experts, with a Fleiss' kappa of 0.65 as the inter-annotator agreement,
indicating substantial agreement (Landis & Koch, 1977). Apart from the NCERT dataset, questions from
Yahya et al. (2012) and Jain et al. (2019) were also used, as those questions were likewise manually tagged by
human experts, making them gold standard data. Furthermore, the DESCRIPTION (DESC:) class of the TREC
question dataset (Li & Roth, 2002) was used. It was observed that the subclasses of the DESCRIPTION class,
namely definition, description, manner, and reason, can be mapped respectively to the Remember/Knowledge,
Understanding/Comprehension, Applying/Application, and Analysing/Analysis levels of Bloom's Taxonomy.
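This subclass-to-level correspondence can be written as a simple lookup; the fine-grained label strings below follow Li and Roth's (2002) TREC notation, and the level names mirror the wording above.

```python
# Mapping of TREC DESC subclasses to Bloom's Taxonomy cognitive levels,
# as described above; the exact label strings are illustrative.
TREC_DESC_TO_BLOOM = {
    "DESC:def":    "Remember / Knowledge",
    "DESC:desc":   "Understanding / Comprehension",
    "DESC:manner": "Applying / Application",
    "DESC:reason": "Analysing / Analysis",
}
```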
Computation Environment: The first experiment used the labelled LDA methodology through the
Amazon AWS Comprehend service. By default, Amazon Comprehend uses standard LDA for text
analysis, an unsupervised learning approach; for this research, a custom classifier
was used instead, trained on labelled data, making it a supervised learning approach. For the second
experiment, a BERT-based multi-class classification model was developed using the Google Colab GPU
environment. The BERT base model (uncased) was used, which has 12 layers, a hidden size of 768, 12
attention heads, and 110M parameters. The questions were tokenized, and the softmax function was used.
While computing probabilities for the cross-entropy loss, a score called a logit, a raw unscaled value
associated with a class, is used; in neural network architectures, a logit is the output of a dense
(fully connected) layer. Five epochs were used, along with a learning rate of 3e-5 and a maximum sequence
length of 128, which covered the length of 99% of the questions. For both experiments, ~90% of the data
was used for training.
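The logit-to-probability step described above can be illustrated with a short PyTorch fragment; only the hyperparameters (five epochs, learning rate 3e-5, maximum sequence length 128) come from the paper, while the tensor values are invented for illustration.

```python
import torch
import torch.nn.functional as F

# Hyperparameters reported in the paper; everything else is illustrative.
EPOCHS, LEARNING_RATE, MAX_SEQ_LEN = 5, 3e-5, 128

# Made-up logits: raw, unscaled outputs of the final dense layer for one
# question, one score per cognitive level.
logits = torch.tensor([[1.2, -0.3, 0.8, 2.5, 0.1, -1.0]])
target = torch.tensor([3])               # true class index (e.g., Analysis)

probs = F.softmax(logits, dim=-1)        # logits -> class probabilities
loss = F.cross_entropy(logits, target)   # softmax + negative log-likelihood
print(probs, loss.item())
```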
Testing data¹: The TREC testing dataset was used for testing. As observed in the manual annotation
of the NCERT dataset, questions with the WH words Why and How showed an almost even distribution
over all Bloom's Taxonomy cognitive levels, as shown in Figure 3. Thus, the AI2 Why and How
question datasets were additionally used for testing. An overview of the datasets is shown in Table 2, and
the cognitive level distribution of the training dataset is shown in Table 3.
¹ Note: Since no previous paper provided a BTAV list, experiments with BTAVs could not be performed, as the set of
BTAVs may vary.
Labelled LDA
For overall questions, a set of 2967 training instances (ensuring all classes were covered) was used to train
the customized classifier service of Amazon Comprehend, and 329 instances were used for testing. The
evaluation metrics are as follows: accuracy 0.8389, precision 0.8245, recall 0.8268, F1 score 0.8255. The
confusion matrix is shown in Figure 4. The accuracy obtained (0.83) exceeds the result reported in previous
literature (Jain et al., 2019).

For WH questions, a set of 1417 WH questions was used for training, and 157 questions were used for testing.
The evaluation metrics are as follows: accuracy 0.7898, precision 0.6188, recall 0.5875, F1 score 0.5982.
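As a reproducibility note, metrics of this kind can be computed from predicted and true labels with scikit-learn; the label arrays below are placeholders (the paper's raw predictions are not available), and the paper does not state which averaging scheme Amazon Comprehend applies, so macro averaging is assumed here.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Placeholder label arrays over the six cognitive levels (0-5);
# the paper's actual predictions are not available.
y_true = [0, 1, 2, 3, 1, 0, 4, 5, 2, 3]
y_pred = [0, 1, 2, 3, 0, 0, 4, 5, 2, 1]

acc = accuracy_score(y_true, y_pred)
# Macro averaging is one common multi-class choice; an assumption here.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   average="macro")
print(acc, prec, rec, f1)
print(confusion_matrix(y_true, y_pred))
```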
BERT
For the deep learning model, the training loss at the end of five epochs was 0.113230795. The confusion matrix
of the overall test data is shown in Figure 6. The accuracy obtained was 0.8967, a significant improvement
over the LDA methodology. Furthermore, the AI2 How and Why datasets (Figure 8a and Figure 8b) were
tested to see whether the prediction patterns match. The accuracy obtained was 89.67% overall and 88.68%
for WH questions.
Figure 8. (a) WHY dataset confusion matrix (b) HOW dataset confusion matrix
B. Bloom's Taxonomy action verbs: From the results, it was observed that a set of Bloom's Taxonomy action
verbs is truly ambiguous in nature, since it is challenging to identify the required cognitive level unless the
context is known. Examples of such action verbs are Choose, Describe, Design, Explain, Show, and Use; these
words are distributed across multiple cognitive levels.
C. WH questions: It can be concluded from these observations that the use of multiple WH words in a
question stem leads to a higher cognitive requirement, particularly if Why is used. The word Why, when
occurring alone, has its highest frequency at the Comprehension level. However, when it occurs with
another WH word (e.g., What or Which, each individually more frequent at lower cognitive levels), that word
acts as context, and Why becomes an identifier of a higher cognitive level (e.g., Analysis). The ambiguity
of both How and Why can thus be resolved based on context.
D. Other questions: For questions containing neither action verbs nor WH words, the cognitive level
distribution was observed to be almost uniform.
E. Better performance of deep learning models: BERT performs better because it uses bidirectional training
of the Transformer, an attention-based model, for language modelling. Most machine learning models train
on the text input sequentially, while the Transformer encoder reads the entire sequence of words
at once. This characteristic allows the model to learn the context of a word based on all of its surrounding
words. Classification is done by adding a classification layer on top of the Transformer output.
F. Limitations: First, the accuracy of the proposed methodology cannot be improved without a large amount
of training data. Second, each training question needs to be carefully annotated so that the training data is
correct. Third, being essentially dependent on training data, the algorithms have limitations on open-ended
philosophical questions (e.g., Who is God?).
The proposed methodology can thus assist teachers in setting up the learning materials of the curriculum and
the assessment questions for evaluation. The present limitation of the proposed methodology is the need for a
massive amount of training data. While getting academic questions from a textbook is not a concern, getting
them annotated is a time-consuming and challenging task, because annotations will vary from expert to
expert based on prior knowledge of the domain. Future work should consider images (graphs, photos,
equations) as part of the question when identifying its cognitive level. Another objective should be building
a corpus of academic questions across multiple subjects, annotated by domain experts, as a standard dataset
for future related work.
REFERENCES
Agrawal, R., Gollapudi, S., Kannan, A., & Kenthapadi, K. (2014). Study navigator: An algorithmically generated
aid for learning from electronic textbooks. Journal of Educational Data Mining, 6(1), 53-75.
Andre, T. (1979). Does answering higher-level questions while reading facilitate productive learning? Review
of Educational Research, 49(2), 280-318. https://doi.org/10.3102/00346543049002280
Bhatia, P., Celikkaya, B., Khalilia, M., & Senthivel, S. (2019). Comprehend medical: A named entity recognition
and relationship extraction web service. arXiv preprint arXiv:1910.07419.
https://doi.org/10.1109/ICMLA.2019.00297
Bicalho, P., Pita, M., Pedrosa, G., Lacerda, A., & Pappa, G. L. (2017). A general framework to expand short text
for topic modeling. Information Sciences, 393, 66-81. https://doi.org/10.1016/j.ins.2017.02.007
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research,
3(Jan), 993-1022.
Bloom, B. S., et al. (1956). Taxonomy of educational objectives. Vol. 1: Cognitive domain. New York: McKay,
20-24.
Dalton, J., & Smith, D. (1989). Extending children’s special abilities: strategies for primary classrooms. Office
of Schools Administration, Ministry of Education, Victoria.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers
for language understanding. arXiv preprint arXiv:1810.04805. https://doi.org/10.18653/v1/N19-1423
Hamilton, R. (1992). Application adjunct post-questions and conceptual problem solving. Contemporary
Educational Psychology, 17(1), 89-97. https://doi.org/10.1016/0361-476X(92)90050-9
Hamilton, R. J. (1985). A framework for the evaluation of the effectiveness of adjunct questions and
objectives. Review of Educational Research, 55(1), 47-85.
https://doi.org/10.3102/00346543055001047
Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint
arXiv:1801.06146. https://doi.org/10.18653/v1/P18-1031
Jain, M., Beniwal, R., Ghosh, A., Grover, T., & Tyagi, U. (2019). Classifying question papers with Bloom's
taxonomy using machine learning techniques. In International conference on advances in computing
and data sciences (pp. 399-408). https://doi.org/10.1007/978-981-13-9942-8_38
Jansen, P., Surdeanu, M., & Clark, P. (2014). Discourse complements lexical semantics for non-factoid answer
reranking. In Proceedings of the 52nd annual meeting of the association for computational linguistics
(volume 1: Long papers) (pp. 977-986). https://doi.org/10.3115/v1/P14-1092
Jones, K. O., Harland, J., Reid, J. M., & Bartlett, R. (2009). Relationship between examination questions and
Bloom's taxonomy. In 2009 39th IEEE Frontiers in Education Conference (pp. 1-6).
https://doi.org/10.1109/FIE.2009.5350598
Krathwohl, D. R. (2002). A revision of Bloom's taxonomy: An overview. Theory Into Practice, 41(4), 212-218.
https://doi.org/10.1207/s15430421tip4104_2
Krathwohl, D. R., & Anderson, L. W. (2010). Merlin C. Wittrock and the revision of Bloom's taxonomy.
Educational Psychologist, 45(1), 64-65. https://doi.org/10.1080/00461520903433562
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics,
159-174. https://doi.org/10.2307/2529310
Lee, Y.-J., Kim, M., Jin, Q., Yoon, H.-G., & Matsubara, K. (2017). Revised Bloom's taxonomy: The Swiss army
knife in curriculum research. In East-Asian primary science curricula (pp. 11-16). Springer.
https://doi.org/10.1007/978-981-10-2690-4
Li, X., & Roth, D. (2002). Learning question classifiers. In Proceedings of the 19th international conference on
computational linguistics-volume 1 (pp. 1-7). https://doi.org/10.3115/1072228.1072378
Long, G., Chen, L., Zhu, X., & Zhang, C. (2012). TCSST: Transfer classification of short & sparse text using external
data. In Proceedings of the 21st ACM international conference on information and knowledge
management (pp. 764-772). https://doi.org/10.1145/2396761.2396859
Luo, L., & Wang, Y. (2019). EmotionX-HSU: Adopting pre-trained BERT for emotion classification. arXiv preprint
arXiv:1907.09669.
Massey, L. (2011). Autonomous and adaptive identification of topics in unstructured text. In International
conference on knowledge-based and intelligent information and engineering systems (pp. 1-10).
https://doi.org/10.1007/978-3-642-23863-5_1
Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data
Engineering, 22(10), 1345-1359. https://doi.org/10.1109/TKDE.2009.191
Peverly, S. T., & Wood, R. (2001). The effects of adjunct questions and feedback on improving the reading
comprehension skills of learning-disabled adolescents. Contemporary Educational Psychology, 26(1),
25-43. https://doi.org/10.1006/ceps.1999.1025
Phan, X.-H., Nguyen, L.-M., & Horiguchi, S. (2008). Learning to classify short and sparse text & web with
hidden topics from large-scale data collections. In Proceedings of the 17th international conference on
world wide web (pp. 91-100). https://doi.org/10.1145/1367497.1367510
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised
multitask learners. OpenAI Blog, 1(8).
Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit
attribution in multi-labeled corpora. In Proceedings of the 2009 conference on empirical methods in
natural language processing (pp. 248-256). https://doi.org/10.3115/1699510.1699543
Redfield, D. L., & Rousseau, E. W. (1981). A meta-analysis of experimental research on teacher questioning
behavior. Review of Educational Research, 51(2), 237-245.
https://doi.org/10.3102/00346543051002237
Rothkopf, E. Z. (1970). The concept of mathemagenic activities. Review of Educational Research, 40(3), 325-
336. https://doi.org/10.3102/00346543040003325
Stanny, C. (2016). Reevaluating Bloom's taxonomy: What measurable verbs can and cannot say about student
learning. Education Sciences, 6(4), 37. https://doi.org/10.3390/educsci6040037
Swart, A. J., & Daneti, M. (2019). Analyzing learning outcomes for electronic fundamentals using Bloom's
taxonomy. In 2019 IEEE Global Engineering Education Conference (EDUCON) (pp. 39-44).
https://doi.org/10.1109/EDUCON.2019.8725137
Uys, J., Du Preez, N., & Uys, E. (2008). Leveraging unstructured information using topic modelling. In
PICMET '08 - 2008 Portland International Conference on Management of Engineering & Technology (pp.
955-961). https://doi.org/10.1109/PICMET.2008.4599703
Wang, P., Xu, B., Xu, J., Tian, G., Liu, C.-L., & Hao, H. (2016). Semantic expansion using word embedding
clustering and convolutional neural network for improving short text classification. Neurocomputing,
174, 806-814. https://doi.org/10.1016/j.neucom.2015.09.096
Yahya, A. A., Toukal, Z., & Osman, A. (2012). Bloom's taxonomy-based classification for item bank questions
using support vector machines. In Modern advances in intelligent systems and tools (pp. 135-140).
Springer. https://doi.org/10.1007/978-3-642-30732-4_17
Zarei, F., & Nik-Bakht, M. (2019). Automated detection of urban flooding from news. In Proceedings of the
36th international symposium on automation and robotics in construction (pp. 515-521).
https://doi.org/10.22260/ISARC2019/0069
Zhang, H., & Zhong, G. (2016). Improving short text classification by learning vector representations of both
words and hidden topics. Knowledge-Based Systems, 102, 76-86.
https://doi.org/10.1016/j.knosys.2016.03.027
Correspondence: Syaamantak Das, Centre for Educational Technology, Indian Institute of Technology
Kharagpur, India. E-mail: [email protected]