Abstract
Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially in distributed settings, where the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. They address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments on three different real-life datasets, using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Neumann, T., Bender, M., Michel, S. et al. Distributed top-k aggregation queries at large. Distrib Parallel Databases 26, 3–27 (2009). https://doi.org/10.1007/s10619-009-7041-z
DOI: https://doi.org/10.1007/s10619-009-7041-z