计算机科学 ›› 2018, Vol. 45 ›› Issue (6A): 406-410.
张素芳1,翟俊海2,王婷婷2,郝璞2,王聪2,赵春玲2
ZHANG Su-fang1,ZHAI Jun-hai2,WANG Ting-ting2,HAO Pu2,WANG Cong2,ZHAO Chun-ling2
摘要: K-近邻(K-Nearest Neighbors,K-NN)是一种懒惰学习算法,用K-NN对数据分类时,不需要训练分类模型。K-NN算法的优点是思想简单、易于实现;缺点是计算量大,原因是在对测试样例进行分类时,其需要计算测试样例与训练集中每一个训练样例之间的距离。压缩近邻算法(Condensed Nearest Neighbors,CNN)可以克服K-NN算法的不足。但是,在面对大数据集时,由于自身的迭代计算特性,CNN的运算效率会变得非常低。针对这一问题,提出一种名为Spark CNN的压缩近邻算法。在大数据环境下,与基于MapReduce的CNN算法相比,Spark CNN的效率大幅提高,在5个大数据集上的实验证明了这一结论。
中图分类号:
[1]COVER T,HART P.Nearest neighbor pattern classification [J]. IEEE Transactions on Information Theory,1967,13(1):21-27.<br /> [2]HART P.The condensed nearest neighbor rule[J].IEEE Transaction on Information Theory,1968,14(5):15-516.<br /> [3]ZHAI J H,LI T,WANG X Z.A cross-selection instance algorithm [J].Journal of Intelligent & Fuzzy Systems,2016,30 (2):717-728.<br /> [4]SONG Y S,LIANG J Y,LU J,et al.An effcient instance selection algorithm for k nearest neighbor regression[J].Neurocomputing,2017,251:26-34.<br /> [5]ONAN A.A fuzzy-rough nearest neighbor classifier combined with consistency-based subset evaluation and instance selection for automated diagnosis of breast cancer[J].Expert Systems with Applications,2015,42(20):6844-6852.<br /> [6]ALVAR A G,JOSE-FRANCISCO D P,RODRíGUEZ J J,et al.Instance selection of linear complexity for big data[J].Know-ledge-Based Systems,2016,107(C):83-95.<br /> [7]HOU G,CUI R,PAN Z,et al.Tree-based compact hashing for approximate nearest neighbor search[J].Neurocomputing,2015,166(C):271-281.<br /> [8]WAN J,TANG S,ZHANG D D,et al.HDIdx:High-dimensional indexing for efficient approximate nearest neighbor search [J].Neurocomputing,2017,237:401-404.<br /> [9]文庆福,王建民,朱晗,等.面向近似近邻查询的分布式哈希学习方法[J].计算机学报,2017,40(1):192-206.<br /> [10]刘义,景宁,陈荦,等.MapReduce框架下基于R-树的k-近邻连接算法[J].软件学报,2013,24(8):1836-1851.<br /> [11]MUJA M,LOWE D G.Scalable nearest neighbor algorithms for high dimensional data[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2014,36(11):2227-2240.<br /> [12]MAILLO J,RAM REZ S,TRIGUERO I,et al.kNN-IS:An Itera- tive Spark-based design of the k-nearest neighbors classifier for big data [J].Knowledge-Based Systems,2017,117:3-15.<br /> [13]ZHAI J H,WANG X Z,PANG X H.Voting-based instance selection from large data sets with mapreduce and random weight networks[J].Information Sciences,2016,367:1066-1077.<br /> [14]SONG G,ROCHAS J,BEZE L E,et al.K nearest neighbour joins for big data on mapreduce:a theoretical and experimentalanalysis[J].IEEE Transactions on Knowledge & Data Engineering,2016,28(9):2376-2392.<br /> [15]刘军,林文辉,方澄.Spark大数据处理-原理、算法与实例[M].北京:清华大学出版社,2016.<br /> [16]翟俊海,郝璞,王婷婷,张明阳.MapReduce并行化压缩近邻算法[J].小型微型计算机系统,2017(12):2678-2682. |
[1] | 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇. 基于大数据的进化网络影响力分析研究综述 Survey of Influence Analysis of Evolutionary Network Based on Big Data 计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240 |
[2] | 陈晶, 吴玲玲. 多源异构环境下的车联网大数据混合属性特征检测方法 Mixed Attribute Feature Detection Method of Internet of Vehicles Big Datain Multi-source Heterogeneous Environment 计算机科学, 2022, 49(8): 108-112. https://doi.org/10.11896/jsjkx.220300273 |
[3] | 王美珊, 姚兰, 高福祥, 徐军灿. 面向医疗集值数据的差分隐私保护技术研究 Study on Differential Privacy Protection for Medical Set-Valued Data 计算机科学, 2022, 49(4): 362-368. https://doi.org/10.11896/jsjkx.210300032 |
[4] | 孙轩, 王焕骁. 政务大数据安全防护能力建设:基于技术和管理视角的探讨 Capability Building for Government Big Data Safety Protection:Discussions from Technologicaland Management Perspectives 计算机科学, 2022, 49(4): 67-73. https://doi.org/10.11896/jsjkx.211000010 |
[5] | 王俊, 王修来, 庞威, 赵鸿飞. 面向科技前瞻预测的大数据治理研究 Research on Big Data Governance for Science and Technology Forecast 计算机科学, 2021, 48(9): 36-42. https://doi.org/10.11896/jsjkx.210500207 |
[6] | 余乐章, 夏天宇, 荆一楠, 何震瀛, 王晓阳. 面向大数据分析的智能交互向导系统 Smart Interactive Guide System for Big Data Analytics 计算机科学, 2021, 48(9): 110-117. https://doi.org/10.11896/jsjkx.200900083 |
[7] | 王立梅, 朱旭光, 汪德嘉, 张勇, 邢春晓. 基于深度学习的民事案件判决结果分类方法研究 Study on Judicial Data Classification Method Based on Natural Language Processing Technologies 计算机科学, 2021, 48(8): 80-85. https://doi.org/10.11896/jsjkx.210300130 |
[8] | 王雪岑, 张昱, 刘迎婕, 于戈. 基于表示学习的在线学习交互质量评价方法 Evaluation of Quality of Interaction in Online Learning Based on Representation Learning 计算机科学, 2021, 48(2): 207-211. https://doi.org/10.11896/jsjkx.201000042 |
[9] | 滕建, 滕飞, 李天瑞. 基于3D卷积和LSTM编码解码的出行需求预测 Travel Demand Forecasting Based on 3D Convolution and LSTM Encoder-Decoder 计算机科学, 2021, 48(12): 195-203. https://doi.org/10.11896/jsjkx.210400022 |
[10] | 张育龙, 王强, 陈明康, 孙静涛. 图像去雨算法在云物联网应用中的研究综述 Survey of Intelligent Rain Removal Algorithms for Cloud-IoT Systems 计算机科学, 2021, 48(12): 231-242. https://doi.org/10.11896/jsjkx.201000055 |
[11] | 曹萌, 于洋, 梁英, 史红周. 基于区块链的大数据交易关键技术与发展趋势 Key Technologies and Development Trends of Big Data Trade Based on Blockchain 计算机科学, 2021, 48(11A): 184-190. https://doi.org/10.11896/jsjkx.210100163 |
[12] | 刘亚臣, 黄雪莹. 卫星监测时空大数据蠕变特征提取及预警算法 Research on Creep Feature Extraction and Early Warning Algorithm Based on Satellite MonitoringSpatial-Temporal Big Data 计算机科学, 2021, 48(11A): 258-264. https://doi.org/10.11896/jsjkx.201000071 |
[13] | 张光君, 张翔. 应用“大数据+区块链”优化立法评估制度的机理与路径 Mechanism and Path of Optimizing Institution of Legislative Evaluation by Applying “Big Data+Blockchain” 计算机科学, 2021, 48(10): 324-333. https://doi.org/10.11896/jsjkx.201200105 |
[14] | 叶雅珍, 刘国华, 朱扬勇. 数据产品流通的两阶段授权模式 Two-step Authorization Pattern of Data Product Circulation 计算机科学, 2021, 48(1): 119-124. https://doi.org/10.11896/jsjkx.191100217 |
[15] | 赵会群, 吴凯锋. 一种大数据估价算法 Big Data Valuation Algorithm 计算机科学, 2020, 47(9): 110-116. https://doi.org/10.11896/jsjkx.191000156 |
|