This study sheds light on the effects of COVID-19 in the particular field of Computational Linguistics and Natural Language Processing within Artificial Intelligence. We provide an inter-sectional study on gender, contribution, and experience that considers one school year (from August 2019 to August 2020) as a pandemic year. August is included twice for the purpose of an inter-annual comparison. While the trend in publications increased with the crisis, the results show that the ratio between female and male publications decreased. This only helps to reduce the importance of the female role in the scientific contributions of computational linguistics (it is now far below its peak of 0.24). The pandemic has a particularly negative effect on the production of female senior researchers in the first position of authors (maximum work), followed by the female junior researchers in the last position of authors (supervision or collaborative work).
Since the seminal work of Richard Montague in the 1970s, mathematical and logic tools have successfully been used to model several aspects of the meaning of natural language. However, visually impaired people continue to face serious difficulties in getting full access to this important instrument. Our paper aims to present a work in progress whose main goal is to provide blind students and researchers with an adequate method to deal with the different resources that are used in formal semantics. In particular, we intend to adapt the Portuguese Braille system in order to accommodate the most common symbols and formulas used in this kind of approach and to develop pedagogical procedures to facilitate its learnability. By making this formalization compatible with the Braille coding (either traditional and electronic), we hope to help blind people to learn and use this notation, essential to acquire a better understanding of a great number of semantic properties displayed by natural language.
We address the task of automatic hate speech detection for low-resource languages. Rather than collecting and annotating new hate speech data, we show how to use cross-lingual transfer learning to leverage already existing data from higher-resource languages. Using bilingual word embeddings based classifiers we achieve good performance on the target language by training only on the source dataset. Using our transferred system we bootstrap on unlabeled target language data, improving the performance of standard cross-lingual transfer approaches. We use English as a high resource language and German as the target language for which only a small amount of annotated corpora are available. Our results indicate that cross-lingual transfer learning together with our approach to leverage additional unlabeled data is an effective way of achieving good performance on low-resource target languages without the need for any target-language annotations.
Subtle and overt racism is still present both in physical and online communities today and has impacted many lives in different segments of the society. In this short piece of work, we present how we’re tackling this societal issue with Natural Language Processing. We are releasing BiasCorp, a dataset containing 139,090 comments and news segment from three specific sources - Fox News, BreitbartNews and YouTube. The first batch (45,000 manually annotated) is ready for publication. We are currently in the final phase of manually labeling the remaining dataset using Amazon Mechanical Turk. BERT has been used widely in several downstream tasks. In this work, we present hBERT, where we modify certain layers of the pretrained BERT model with the new Hopfield Layer. hBert generalizes well across different distributions with the added advantage of a reduced model complexity. We are also releasing a JavaScript library 3 and a Chrome Extension Application, to help developers make use of our trained model in web applications (say chat application) and for users to identify and report racially biased contents on the web respectively
Data in general encodes human biases by default; being aware of this is a good start, and the research around how to handle it is ongoing. The term ‘bias’ is extensively used in various contexts in NLP systems. In our research the focus is specific to biases such as gender, racism, religion, demographic and other intersectional views on biases that prevail in text processing systems responsible for systematically discriminating specific population, which is not ethical in NLP. These biases exacerbate the lack of equality, diversity and inclusion of specific population while utilizing the NLP applications. The tools and technology at the intermediate level utilize biased data, and transfer or amplify this bias to the downstream applications. However, it is not enough to be colourblind, gender-neutral alone when designing a unbiased technology – instead, we should take a conscious effort by designing a unified framework to measure and benchmark the bias. In this paper, we recommend six measures and one augment measure based on the observations of the bias in data, annotations, text representations and debiasing techniques.
This papers presents a platform for monitoring press narratives with respect to several social challenges, including gender equality, migrations and minority languages. As narratives are encoded in natural language, we have to use natural processing techniques to automate their analysis. Thus, crawled news are processed by means of several NLP modules, including named entity recognition, keyword extraction,document classification for social challenge detection, and sentiment analysis. A Flask powered interface provides data visualization for a user-based analysis of the data. This paper presents the architecture of the system and describes in detail its different components. Evaluation is provided for the modules related to extraction and classification of information regarding social challenges.
Automatic detection of critical plot information in reviews of media items poses unique challenges to both social computing and computational linguistics. In this paper we propose to cast the problem of discovering spoiler bias in online discourse as a text simplification task. We conjecture that for an item-user pair, the simpler the user review we learn from an item summary the higher its likelihood to present a spoiler. Our neural model incorporates the advanced transformer network to rank the severity of a spoiler in user tweets. We constructed a sustainable high-quality movie dataset scraped from unsolicited review tweets and paired with a title summary and meta-data extracted from a movie specific domain. To a large extent, our quantitative and qualitative results weigh in on the performance impact of named entity presence in plot summaries. Pretrained on a split-and-rephrase corpus with knowledge distilled from English Wikipedia and fine-tuned on our movie dataset, our neural model shows to outperform both a language modeler and monolingual translation baselines.
Hope is considered significant for the well-being, recuperation and restoration of human life by health professionals. Hope speech reflects the belief that one can discover pathways to their desired objectives and become roused to utilise those pathways. To encourage research in natural language processing towards positive reinforcement approach, we created a hope speech detection dataset. This paper reports on the shared task of hope speech detection for Tamil, English, and Malayalam languages. The shared task was conducted as a part of the EACL 2021 workshop on Language Technology for Equality, Diversity, and Inclusion (LT-EDI-2021). We summarize here the datasets for this challenge which are openly available at https://competitions.codalab.org/competitions/27653, and present an overview of the methods and the results of the competing systems. To the best of our knowledge, this is the first shared task to conduct hope speech detection.
In this paper we work with a hope speech detection corpora that includes English, Tamil, and Malayalam datasets. We present a two phase mechanism to detect hope speech. In the first phase we build a classifier to identify the language of the text. In the second phase, we build a classifier to detect hope speech, non hope speech, or not lang labels. Experimental results show that hope speech detection is challenging and there is scope for improvement.
Hope speech detection is a new task for finding and highlighting positive comments or supporting content from user-generated social media comments. For this task, we have used a Shared Task multilingual dataset on Hope Speech Detection for Equality, Diversity, and Inclusion (HopeEDI) for three languages English, code-switched Tamil and Malayalam. In this paper, we present deep learning techniques using context-aware string embeddings for word representations and Recurrent Neural Network (RNN) and pooled document embeddings for text representation. We have evaluated and compared the three models for each language with different approaches. Our proposed methodology works fine and achieved higher performance than baselines. The highest weighted average F-scores of 0.93, 0.58, and 0.84 are obtained on the task organisers’ final evaluation test set. The proposed models are outperforming the baselines by 3%, 2% and 11% in absolute terms for English, Tamil and Malayalam respectively.
Hope is an essential aspect of mental health stability and recovery in every individual in this fast-changing world. Any tools and methods developed for detection, analysis, and generation of hope speech will be beneficial. In this paper, we propose a model on hope-speech detection to automatically detect web content that may play a positive role in diffusing hostility on social media. We perform the experiments by taking advantage of pre-processing and transfer-learning models. We observed that the pre-trained multilingual-BERT model with convolution neural networks gave the best results. Our model ranked first, third, and fourth ranks on English, Malayalam-English, and Tamil-English code-mixed datasets.
In recent times, there exists an abundance of research to classify abusive and offensive texts focusing on negative comments but only minimal research using the positive reinforcement approach. The task was aimed at classifying texts into ‘Hope_speech’, ‘Non_hope_speech’, and ‘Not in language’. The datasets were provided by the LT-EDI organisers in English, Tamil, and Malayalam language with texts sourced from YouTube comments. We trained our data using transformer models, specifically mBERT for Tamil and Malayalam and BERT for English, and achieved weighted average F1-scores of 0.46, 0.81, 0.92 for Tamil, Malayalam, and English respectively.
In a world with serious challenges like climate change, religious and political conflicts, global pandemics, terrorism, and racial discrimination, an internet full of hate speech, abusive and offensive content is the last thing we desire for. In this paper, we work to identify and promote positive and supportive content on these platforms. We work with several transformer-based models to classify social media comments as hope speech or not hope speech in English, Malayalam, and Tamil languages. This paper portrays our work for the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI 2021- EACL 2021. The codes for our best submission can be viewed.
Language as a significant part of communication should be inclusive of equality and diversity. The internet user’s language has a huge influence on peer users all over the world. People express their views through language on virtual platforms like Facebook, Twitter, YouTube etc. People admire the success of others, pray for their well-being, and encourage on their failure. Such inspirational comments are hope speech comments. At the same time, a group of users promotes discrimination based on gender, racial, sexual orientation, persons with disability, and other minorities. The current paper aims to identify hope speech comments which are very important to move on in life. Various machine learning and deep learning based models (such as support vector machine, logistics regression, convolutional neural network, recurrent neural network) are employed to identify the hope speech in the given YouTube comments. The YouTube comments are available in English, Tamil and Malayalam languages and are part of the task “EACL-2021:Hope Speech Detection for Equality, Diversity and Inclusion”.
This paper presents the participation of the IRNLP_DAIICT team from Information Retrieval and Natural Language Processing lab at DA-IICT, India in LT-EDI@EACL2021 Hope Speech Detection task. The aim of this shared task is to identify hope speech from a code-mixed data-set of YouTube comments. The task is to classify comments into Hope Speech, Non Hope speech or Not in language, for three languages: English, Malayalam-English and Tamil-English. We use TF-IDF character n-grams and pretrained MuRIL embeddings for text representation and Logistic Regression and Linear SVM for classification. Our best approach achieved second, eighth and fifth rank with weighted F1 score of 0.92, 0.75 and 0.57 in English, Malayalam-English and Tamil-English on test dataset respectively
Due to the development of modern computer technology and the increase in the number of online media users, we can see all kinds of posts and comments everywhere on the internet. Hope speech can not only inspire the creators but also make other viewers pleasant. It is necessary to effectively and automatically detect hope speech. This paper describes the approach of our team in the task of hope speech detection. We use the attention mechanism to adjust the weight of all the output layers of XLM-RoBERTa to make full use of the information extracted from each layer, and use the weighted sum of all the output layers to complete the classification task. And we use the Stratified-K-Fold method to enhance the training data set. We achieve a weighted average F1-score of 0.59, 0.84, and 0.92 for Tamil, Malayalam, and English language, ranked 3rd, 2nd, and 2nd.
This article introduces the system description of TEAM_HUB team participating in LT-EDI 2021: Hope Speech Detection. This shared task is the first task related to the desired voice detection. The data set in the shared task consists of three different languages (English, Tamil, and Malayalam). The task type is text classification. Based on the analysis and understanding of the task description and data set, we designed a system based on a pre-trained language model to complete this shared task. In this system, we use methods and models that combine the XLM-RoBERTa pre-trained language model and the Tf-Idf algorithm. In the final result ranking announced by the task organizer, our system obtained F1 scores of 0.93, 0.84, 0.59 on the English dataset, Malayalam dataset, and Tamil dataset. Our submission results are ranked 1, 2, and 3 respectively.
This paper mainly introduces the relevant content of the task “Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI 2021-EACL 2021”. A total of three language datasets were provided, and we chose the English dataset to complete this task. The specific task objective is to classify the given speech into ‘Hope speech’, ‘Not Hope speech’, and ‘Not in intended language’. In terms of method, we use fine-tuned ALBERT and K fold cross-validation to accomplish this task. In the end, we achieved a good result in the rank list of the task result, and the final F1 score was 0.93, tying for first place. However, we will continue to try to improve methods to get better results in future work.
This paper describes approaches to identify Hope Speech in short, informal texts in English, Malayalam and Tamil using different machine learning techniques. We demonstrate that even very simple baseline algorithms perform reasonably well on this task if provided with enough training data. However, our best performing algorithm is a cross-lingual transfer learning approach in which we fine-tune XLM-RoBERTa.
In this paper, we describe our approach towards utilizing pre-trained models for the task of hope speech detection. We participated in Task 2: Hope Speech Detection for Equality, Diversity and Inclusion at LT-EDI-2021 @ EACL2021. The goal of this task is to predict the presence of hope speech, along with the presence of samples that do not belong to the same language in the dataset. We describe our approach to fine-tuning RoBERTa for Hope Speech detection in English and our approach to fine-tuning XLM-RoBERTa for Hope Speech detection in Tamil and Malayalam, two low resource Indic languages. We demonstrate the performance of our approach on classifying text into hope-speech, non-hope and not-language. Our approach ranked 1st in English (F1 = 0.93), 1st in Tamil (F1 = 0.61) and 3rd in Malayalam (F1 = 0.83).
The rapid rise of online social networks like YouTube, Facebook, Twitter allows people to express their views more widely online. However, at the same time, it can lead to an increase in conflict and hatred among consumers in the form of freedom of speech. Therefore, it is essential to take a positive strengthening method to research on encouraging, positive, helping, and supportive social media content. In this paper, we describe a Transformer-based BERT model for Hope speech detection for equality, diversity, and inclusion, submitted for LT-EDI-2021 Task 2. Our model achieves a weighted averaged f1-score of 0.93 on the test set.
Analysis and deciphering code-mixed data is imperative in academia and industry, in a multilingual country like India, in order to solve problems apropos Natural Language Processing. This paper proposes a bidirectional long short-term memory (BiLSTM) with the attention-based approach, in solving the hope speech detection problem. Using this approach an F1-score of 0.73 (9thrank) in the Malayalam-English data set was achieved from a total of 31 teams who participated in the competition.
This paper aims to describe the approach we used to detect hope speech in the HopeEDI dataset. We experimented with two approaches. In the first approach, we used contextual embeddings to train classifiers using logistic regression, random forest, SVM, and LSTM based models. The second approach involved using a majority voting ensemble of 11 models which were obtained by fine-tuning pre-trained transformer models (BERT, ALBERT, RoBERTa, IndicBERT) after adding an output layer. We found that the second approach was superior for English, Tamil and Malayalam. Our solution got a weighted F1 score of 0.93, 0.75 and 0.49 for English, Malayalam and Tamil respectively. Our solution ranked 1st in English, 8th in Malayalam and 11th in Tamil.
The proliferation of Hate Speech and misinformation in social media is fast becoming a menace to society. In compliment, the dissemination of hate-diffusing, promising and anti-oppressive messages become a unique alternative. Unfortunately, due to its complex nature as well as the relatively limited manifestation in comparison to hostile and neutral content, the identification of Hope Speech becomes a challenge. This work revolves around the detection of Hope Speech in Youtube comments, for the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion. We achieve an f-score of 0.93, ranking 1st on the leaderboard for English comments.
In recent years, several systems have been developed to regulate the spread of negativity and eliminate aggressive, offensive or abusive contents from the online platforms. Nevertheless, a limited number of researches carried out to identify positive, encouraging and supportive contents. In this work, our goal is to identify whether a social media post/comment contains hope speech or not. We propose three distinct models to identify hope speech in English, Tamil and Malayalam language to serve this purpose. To attain this goal, we employed various machine learning (SVM, LR, ensemble), deep learning (CNN+BiLSTM) and transformer (m-BERT, Indic-BERT, XLNet, XLM-R) based methods. Results indicate that XLM-R outdoes all other techniques by gaining a weighted f_1-score of 0.93, 0.60 and 0.85 respectively for English, Tamil and Malayalam language. Our team has achieved 1st, 2nd and 1st rank in these three tasks respectively.
In today’s society, the rapid development of communication technology allows us to communicate with people from different parts of the world. In the process of communication, each person treats others differently. Some people are used to using offensive and sarcastic language to express their views. These words cause pain to others and make people feel down. Some people are used to sharing happiness with others and encouraging others. Such people bring joy and hope to others through their words. On social media platforms, these two kinds of language are all over the place. If people want to make the online world a better place, they will have to deal with both. So identifying offensive language and hope language is an essential task. There have been many assignments about offensive language. Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI 2021-EACL 2021 uses another unique perspective – to identify the language of Hope to make contributions to society. The XLM-Roberta model is an excellent multilingual model. Our team used a fine-tuned XLM-Roberta model to accomplish this task.
This paper describes the models submitted by the team MUCS for “Hope Speech Detection for Equality, Diversity, and Inclusion-EACL 2021” shared task that aims at classifying a comment / post in English and code-mixed texts in two language pairs, namely, Tamil-English (Ta-En) and Malayalam-English (Ma-En) into one of the three predefined categories, namely, “Hope_speech”, “Non_hope_speech”, and “other_languages”. Three models namely, CoHope-ML, CoHope-NN, and CoHope-TL based on Ensemble of classifiers, Keras Neural Network (NN) and BiLSTM with Conv1d model respectively are proposed for the shared task. CoHope-ML, CoHope-NN models are trained on a feature set comprised of char sequences extracted from sentences combined with words for Ma-En and Ta-En code-mixed texts and a combination of word and char ngrams along with syntactic word ngrams for English text. CoHope-TL model consists of three major parts: training tokenizer, BERT Language Model (LM) training and then using pre-trained BERT LM as weights in BiLSTM-Conv1d model. Out of three proposed models, CoHope-ML model (best among our models) obtained 1st, 2nd, and 3rd ranks with weighted F1-scores of 0.85, 0.92, and 0.59 for Ma-En, English and Ta-En texts respectively.
We describe our system that ranked first in Hope Speech Detection (HSD) shared task and fourth in Offensive Language Identification (OLI) shared task, both in Tamil language. The goal of HSD and OLI is to identify if a code-mixed comment or post contains hope speech or offensive content respectively. We pre-train a transformer-based model RoBERTa using synthetically generated code-mixed data and use it in an ensemble along with their pre-trained ULMFiT model available from iNLTK.
With the internet becoming part and parcel of our lives, engagement in social media has increased a lot. Identifying and eliminating offensive content from social media has become of utmost priority to prevent any kind of violence. However, detecting encouraging, supportive and positive content is equally important to prevent misuse of censorship targeted to attack freedom of speech. This paper presents our system for the shared task Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI, EACL 2021. The data for this shared task is provided in English, Tamil, and Malayalam which was collected from YouTube comments. It is a multiclass classification problem where each data instance is categorized into one of the three classes: ‘Hope speech’, ‘Not hope speech’, and ‘Not in intended language’. We propose a system that employs multilingual transformer models to obtain the representation of text and classifies it into one of the three classes. We explored the use of multilingual models trained specifically for Indian languages along with generic multilingual models. Our system was ranked 2nd for English, 2nd for Malayalam, and 7th for the Tamil language in the final leader board published by organizers and obtained a weighted F1-score of 0.92, 0.84, 0.55 respectively on the hidden test dataset used for the competition. We have made our system publicly available at GitHub.
This paper describes the IIITK’s team submissions to the hope speech detection for equality, diversity and inclusion in Dravidian languages shared task organized by LT-EDI 2021 workshop@EACL 2021. Our best configurations for the shared tasks achieve weighted F1 scores of 0.60 for Tamil, 0.83 for Malayalam, and 0.93 for English. We have secured ranks of 4, 3, 2 in Tamil, Malayalam and English respectively.