TopX: efficient and versatile top-k query processing for semistructured data

Theobald, Martin; Bast, Holger; Majumdar, Debapriyo; Schenkel, Ralf; Weikum, Gerhard

doi:10.1007/s00778-007-0072-z

Special Issue Paper
Open access
Published: 12 October 2007

The VLDB Journal, Volume 17, pages 81–115 (2008)

Abstract

Recent IR extensions to XML query languages, such as XPath 1.0 Full-Text or the NEXI query language of the INEX benchmark series, reflect the emerging interest in IR-style ranked retrieval over semistructured data. TopX is a top-k retrieval engine for text and semistructured data. It terminates query execution as soon as it can safely determine the k top-ranked result elements according to a monotonic score aggregation function with respect to a multidimensional query. It efficiently supports vague search on both content- and structure-oriented query conditions for dynamic query relaxation with controllable influence on the result ranking. The paper makes four main contributions: (1) fully implemented models and algorithms for ranked XML retrieval with XPath Full-Text functionality, (2) efficient and effective top-k query processing for semistructured data, (3) support for integrating thesauri and ontologies with statistically quantified relationships among concepts, leveraged for word-sense disambiguation and query expansion, and (4) a comprehensive description of the TopX system, with performance experiments on large-scale corpora such as TREC Terabyte and INEX Wikipedia.
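The early-termination idea in the abstract can be illustrated with a generic threshold-style top-k procedure. The sketch below is not the TopX implementation: it assumes non-negative scores, summation as the monotonic aggregation function, one in-memory score list per query dimension, and cheap random access to every list. The function name `threshold_top_k` and its parameters are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_top_k(
    lists: List[Dict[str, float]], k: int
) -> Tuple[List[Tuple[str, float]], int]:
    """Return the k items with the highest summed score, plus the scan depth used.

    Assumptions (not from the paper): scores are non-negative, an item
    missing from a list scores 0 there, and aggregation is a plain sum.
    """
    if k <= 0:
        return [], 0

    # Sorted access: every list is scanned in descending score order.
    sorted_lists = [
        sorted(l.items(), key=lambda kv: (-kv[1], kv[0])) for l in lists
    ]
    max_depth = max((len(s) for s in sorted_lists), default=0)

    seen: Dict[str, float] = {}
    top: List[Tuple[float, str]] = []  # min-heap holding the current best k
    depth = 0

    while depth < max_depth:
        # Upper bound on the total score of any item not seen so far.
        threshold = 0.0
        for s in sorted_lists:
            if depth >= len(s):
                continue  # exhausted list: unseen items score 0 here
            item, score = s[depth]
            threshold += score
            if item not in seen:
                # Random access: look up the item's score in every list.
                total = sum(l.get(item, 0.0) for l in lists)
                seen[item] = total
                if len(top) < k:
                    heapq.heappush(top, (total, item))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, item))
        depth += 1
        # Safe to stop: no unseen item can beat the current k-th best.
        if len(top) == k and top[0][0] >= threshold:
            break

    ranked = sorted(top, key=lambda t: (-t[0], t[1]))
    return [(item, score) for score, item in ranked], depth
```

Because the aggregation is monotonic, the sum of the scores at the current scan positions bounds every unseen item from above, so the scan can stop before the lists are exhausted without changing the result.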

Author information

Authors and Affiliations

Max-Planck Institute for Informatics, Saarbruecken, Germany
Martin Theobald, Holger Bast, Debapriyo Majumdar, Ralf Schenkel & Gerhard Weikum

Corresponding author

Correspondence to Ralf Schenkel.

Rights and permissions

Open Access: This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Theobald, M., Bast, H., Majumdar, D. et al. TopX: efficient and versatile top-k query processing for semistructured data. The VLDB Journal 17, 81–115 (2008). https://doi.org/10.1007/s00778-007-0072-z

Received: 14 September 2006
Revised: 09 April 2007
Accepted: 11 June 2007
Published: 12 October 2007
Issue Date: January 2008
DOI: https://doi.org/10.1007/s00778-007-0072-z
