A Bipartite Graph-Based Ranking Approach to Query Subtopics Diversification Focused on Word Embedding Features

Md Zia ULLAH
Masaki AONO

Publication
IEICE TRANSACTIONS on Information and Systems, Vol.E99-D, No.12, pp.3090-3100
Publication Date: 2016/12/01
Publicized: 2016/09/05
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2016EDP7190
Type of Manuscript: PAPER
Category: Data Engineering, Web Information Systems
Keyword: subtopic mining, query intent, diversification, word embedding, bipartite graph

Summary: 
Web search queries are often vague or ambiguous, or carry multiple intents, and different users may have different search intents when issuing the same query. Understanding these intents by mining the subtopics underlying a query has gained much interest in recent years. Query suggestions provided by search engines capture some intents of the original query; however, suggested queries are often noisy and contain groups of alternative queries with similar meanings. Identifying the subtopics that cover the possible intents behind a query is therefore a formidable task. Moreover, because both the query and its subtopics are short, it is challenging to estimate the similarity between a pair of short texts and rank them accordingly. In this paper, we propose a method for mining and ranking subtopics that introduces multiple semantic and content-aware features, a bipartite graph-based ranking (BGR) method, and a similarity function for short texts. Given a query, we aggregate the suggested queries from search engines as candidate subtopics and estimate their relevance to the query from word embedding and content-aware features by modeling a bipartite graph. To estimate the similarity between two short texts, we propose a Jensen-Shannon divergence-based similarity function over the probability distributions of the terms in the top documents retrieved from a search engine. A diversified ranked list of subtopics covering the possible intents of a query is assembled by balancing relevance and novelty. We evaluated our method on the NTCIR-10 INTENT-2 and NTCIR-12 IMINE-2 subtopic mining test collections. Our proposed method outperforms the baselines, known related methods, and the official participants of the INTENT-2 and IMINE-2 competitions.
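
The short-text similarity idea described in the summary can be illustrated with a minimal sketch. The summary does not give the exact formulation, so everything below is an assumption made for illustration: whitespace tokenization, unsmoothed maximum-likelihood term distributions, base-2 logarithms (which bound the divergence to [0, 1]), and a similarity defined as one minus the divergence. The function names `term_distribution`, `js_divergence`, and `js_similarity` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def term_distribution(docs):
    """Maximum-likelihood term distribution over a list of document strings.

    Assumption: lowercase whitespace tokenization with no smoothing.
    """
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence between two term distributions (dicts).

    With base-2 logarithms the result lies in [0, 1].
    """
    div = 0.0
    for term in set(p) | set(q):
        pt = p.get(term, 0.0)
        qt = q.get(term, 0.0)
        mt = 0.5 * (pt + qt)  # mixture distribution M = (P + Q) / 2
        if pt > 0:
            div += 0.5 * pt * math.log2(pt / mt)
        if qt > 0:
            div += 0.5 * qt * math.log2(qt / mt)
    return div


def js_similarity(docs_a, docs_b):
    """Similarity of two short texts, each represented by its top retrieved
    documents. Assumption: similarity = 1 - JS divergence."""
    return 1.0 - js_divergence(term_distribution(docs_a),
                               term_distribution(docs_b))


# Hypothetical retrieved documents for two candidate subtopics.
docs_apple_fruit = ["apple fruit nutrition", "apple orchard fruit"]
docs_apple_phone = ["apple iphone price", "apple phone store"]
print(js_similarity(docs_apple_fruit, docs_apple_phone))
```

In this sketch, two texts whose retrieved documents share no terms score 0, and two texts with identical term distributions score 1. How the paper selects the retrieved documents or weights the terms is not stated in the summary and is not modeled here.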