SEMANTIFY: Unveiling Memes with Robust Interpretability beyond Input Attribution
Dibyanayan Bandyopadhyay, Asmit Ganguly, Baban Gain, Asif Ekbal
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 6189-6197.
https://doi.org/10.24963/ijcai.2024/684
Memes, initially created for humor and social commentary, have transformed into platforms for offensive online content. Detecting such content is crucial; however, existing deep learning-based meme offensiveness classifiers lack transparency, functioning as opaque black boxes. While Integrated Gradients and similar input-attribution interpretability methods exist, they often yield inadequate and irrelevant keywords. To bridge this gap, we introduce SEMANTIFY, a novel system featuring a theoretically grounded multi-step filtering process. SEMANTIFY extracts meaningful "tokens" from a predefined vocabulary, generating a pertinent and comprehensive set of interpretable keywords. These extracted keywords reveal the model's awareness of hidden meanings in memes, enhancing transparency. Evaluation of SEMANTIFY using interpretability metrics, including 'leakage-adjusted simulatability,' shows that it outperforms various baselines by up to 2.5 points. Human evaluation of the 'relatedness' and 'exhaustiveness' of extracted keywords further validates its effectiveness. Additionally, a qualitative analysis of extracted keywords serves as a case study, unveiling model error cases and their reasons. SEMANTIFY contributes to the advancement of more interpretable multimodal systems for meme offensiveness detection, fostering trust for real-world applications.
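To make the core idea concrete, the sketch below illustrates one way a vocabulary-based keyword extractor could work: candidate tokens from a predefined vocabulary are scored by how much they raise the classifier's offensiveness probability, filtered by a relevance threshold, and reduced to a top-k set. This is a minimal illustration only; the abstract does not specify SEMANTIFY's actual filtering steps, and `score_offensive`, the gain-based scoring rule, and all names here are hypothetical assumptions, not the authors' implementation.

```python
# Hypothetical sketch of vocabulary-filtered keyword extraction for a meme
# classifier. `score_offensive` is a toy stand-in for a trained multimodal
# model; the two-step filter (relevance threshold, then top-k) is an assumed
# simplification of a multi-step filtering pipeline.
import numpy as np

VOCAB = ["attack", "sunset", "mock", "puppy", "slur", "holiday"]  # toy vocabulary


def score_offensive(meme_features: np.ndarray, token: str) -> float:
    """Stand-in for P(offensive | meme, candidate keyword).

    A real system would query the meme classifier; here we use a
    deterministic pseudo-embedding per token so the sketch runs.
    """
    rng = np.random.default_rng(abs(hash(token)) % 2**32)
    token_vec = rng.random(meme_features.shape)       # placeholder token embedding
    logit = float(meme_features @ token_vec) - 4.0    # arbitrary centering
    return 1.0 / (1.0 + np.exp(-logit))               # sigmoid -> probability


def extract_keywords(meme_features: np.ndarray, vocab: list[str],
                     base_prob: float, k: int = 3,
                     min_gain: float = 0.01) -> list[str]:
    """Keep tokens whose presence most increases the model's
    offensiveness probability, then return the top-k."""
    gains = {t: score_offensive(meme_features, t) - base_prob for t in vocab}
    relevant = {t: g for t, g in gains.items() if g > min_gain}  # step 1: relevance filter
    return sorted(relevant, key=relevant.get, reverse=True)[:k]  # step 2: top-k selection


meme = np.random.default_rng(0).random(16)  # placeholder meme representation
print(extract_keywords(meme, VOCAB, base_prob=0.5))
```

Unlike input attribution, which can only highlight tokens already present in the meme, drawing candidates from an external vocabulary lets the explanation surface implied meanings the meme never states explicitly.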
Keywords:
Natural Language Processing: NLP: Interpretability and analysis of models for NLP
AI Ethics, Trust, Fairness: ETF: Societal impact of AI
AI Ethics, Trust, Fairness: ETF: Trustworthy AI