Abstract
Recent IR extensions to XML query languages, such as XPath 1.0 Full-Text or the NEXI query language of the INEX benchmark series, reflect the emerging interest in IR-style ranked retrieval over semistructured data. TopX is a top-k retrieval engine for text and semistructured data. It terminates query execution as soon as it can safely determine the k top-ranked result elements according to a monotonic score aggregation function with respect to a multidimensional query. It efficiently supports vague search on both content- and structure-oriented query conditions, which enables dynamic query relaxation with controllable influence on the result ranking. This paper makes four main contributions: (1) fully implemented models and algorithms for ranked XML retrieval with XPath Full-Text functionality, (2) efficient and effective top-k query processing for semistructured data, (3) support for integrating thesauri and ontologies with statistically quantified relationships among concepts, leveraged for word-sense disambiguation and query expansion, and (4) a comprehensive description of the TopX system, with performance experiments on large-scale corpora such as TREC Terabyte and INEX Wikipedia.
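The early-termination idea in the abstract can be illustrated with a small sketch. The code below is not the TopX algorithm itself; it is a minimal in-memory version of the classic threshold algorithm (TA) of Fagin, Lotem and Naor, shown only to make "stop as soon as the top k are safely known" concrete. The function name `threshold_topk`, the dictionary-based score lists, and the use of summation as the monotonic aggregation function are all assumptions made for this illustration.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_topk(
    lists: List[Dict[str, float]], k: int
) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k items with aggregated scores, rows read per list).

    `lists` holds one hypothetical score list per query condition,
    mapping an item id to its local score. Aggregation is a plain sum,
    which is monotonic, as the stopping test requires.
    """
    if k <= 0:
        return [], 0
    # Sorted access: every list is scanned in descending local-score order.
    sorted_lists = [
        sorted(l.items(), key=lambda kv: (-kv[1], kv[0])) for l in lists
    ]
    max_depth = max((len(s) for s in sorted_lists), default=0)
    seen = set()
    heap: List[Tuple[float, str]] = []  # min-heap of the current top-k
    depth = 0
    while depth < max_depth:
        threshold = 0.0  # best score any still-unseen item could reach
        for s in sorted_lists:
            if depth >= len(s):
                continue  # an exhausted list contributes 0
            item, local = s[depth]
            threshold += local
            if item in seen:
                continue
            seen.add(item)
            # Random access: fetch the item's score from every list.
            total = sum(l.get(item, 0.0) for l in lists)
            if len(heap) < k:
                heapq.heappush(heap, (total, item))
            elif total > heap[0][0]:
                heapq.heapreplace(heap, (total, item))
        depth += 1
        # Safe to stop: no unseen item can beat the current k-th score.
        if len(heap) == k and heap[0][0] >= threshold:
            break
    ranked = sorted(
        ((item, score) for score, item in heap),
        key=lambda r: (-r[1], r[0]),
    )
    return ranked, depth
```

Because the aggregation is monotonic, the sum of the scores at the current scan position bounds the score of every item not yet seen, so the scan can end before the lists are exhausted. The returned depth makes that visible: for short lists whose best items agree, it is typically smaller than the list length.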
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Theobald, M., Bast, H., Majumdar, D. et al. TopX: efficient and versatile top-k query processing for semistructured data. The VLDB Journal 17, 81–115 (2008). https://doi.org/10.1007/s00778-007-0072-z
DOI: https://doi.org/10.1007/s00778-007-0072-z