Computer Science, 2019, Vol. 46, Issue (2): 30-34. doi: 10.11896/j.issn.1002-137X.2019.02.005
樊哲宁, 杨秋辉, 翟宇鹏, 万莹, 王帅
FAN Zhe-ning, YANG Qiu-hui, ZHAI Yu-peng, WAN Ying, WANG Shuai
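Abstract: With the rise of data analysis research, data preprocessing has received growing attention from researchers, and the problem of imputing missing data has become increasingly important. Building on the ROUSTIDA data completion algorithm, and targeting the characteristics of repeated data that has key attributes, this paper proposes an improved ROUSTIDA algorithm, Key&Rpt_RS. Key&Rpt_RS inherits the advantages of ROUSTIDA while also taking the repetitive nature of the target data into account and analyzing the influence of key attributes on the imputation result, which yields more accurate and effective imputation.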
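For orientation only, the kind of rough-set-style completion that ROUSTIDA-like methods build on can be sketched as follows. This is a simplified illustration under stated assumptions, not the Key&Rpt_RS algorithm and not the authors' code. A missing value is filled from records that never contradict the incomplete record on any attribute both have observed, and only when those records agree on a single value. The names `compatible` and `fill_missing` are hypothetical, and the handling of key attributes and repeated data described in the abstract is not modeled.

```python
from typing import List, Optional

Record = List[Optional[str]]  # None marks a missing value


def compatible(a: Record, b: Record) -> bool:
    """True if a and b never disagree on an attribute both have observed."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))


def fill_missing(table: List[Record]) -> List[Record]:
    """Fill gaps from compatible records, repeating until nothing changes."""
    rows = [list(r) for r in table]
    changed = True
    while changed:
        changed = False
        # Each pass reads from a snapshot so fills depend only on the previous pass.
        snapshot = [list(r) for r in rows]
        for i, row in enumerate(snapshot):
            peers = [p for j, p in enumerate(snapshot)
                     if j != i and compatible(row, p)]
            for k, value in enumerate(row):
                if value is not None:
                    continue
                candidates = {p[k] for p in peers if p[k] is not None}
                # Fill only when all compatible records agree on one value.
                if len(candidates) == 1:
                    rows[i][k] = candidates.pop()
                    changed = True
    return rows


if __name__ == "__main__":
    table = [
        ["a", "x", None],
        ["a", "x", "1"],
        ["b", None, "2"],
        ["b", "y", "2"],
        ["c", None, None],
    ]
    for row in fill_missing(table):
        print(row)
```

In this sketch a gap stays empty when no compatible record exists or when compatible records disagree. That conservative choice is one reason refinements such as weighting particular attributes are studied.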