Computer Science (计算机科学), 2019, Vol. 46, Issue (2): 30-34. doi: 10.11896/j.issn.1002-137X.2019.02.005

• Big Data & Data Science •


Improved ROUSTIDA Algorithm for Missing Data Imputation with Key Attribute in Repetitive Data

FAN Zhe-ning, YANG Qiu-hui, ZHAI Yu-peng, WAN Ying, WANG Shuai

  1. College of Computer Science (Software Engineering), Sichuan University, Chengdu 610065, China
  • Received: 2017-12-05 Online: 2019-02-25 Published: 2019-02-25
  • Corresponding author: YANG Qiu-hui (born 1970), female, associate professor, CCF member. Main research interests: software testing, empirical software engineering, and database systems and their applications.
  • About the authors: FAN Zhe-ning (born 1994), female, M.S., main research interests: software analysis and testing; ZHAI Yu-peng (born 1992), male, M.S., main research interest: automated software testing; WAN Ying (born 1993), female, M.S., main research interests: software analysis and testing; WANG Shuai (born 1992), male, M.S., main research interest: data mining.

Abstract: With the rise of data analysis, data preprocessing has drawn growing attention from researchers, and the problem of imputing missing data has become increasingly important. Building on the ROUSTIDA data completion algorithm and targeting the characteristics of repetitive data with key attributes, this paper proposes an improved ROUSTIDA algorithm, the Key&Rpt_RS algorithm. Key&Rpt_RS inherits the advantages of ROUSTIDA, takes the repetitiveness of the target data into account, and analyzes the influence of key attributes on the imputation effect, yielding more accurate and effective imputation results. Experiments on alarm data from a communication network show that Key&Rpt_RS outperforms the traditional ROUSTIDA algorithm in imputing missing data.

Key words: Data pre-processing, Missing data imputation, Repeated data, ROUSTIDA algorithm
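
For readers unfamiliar with the baseline, the following is a minimal Python sketch of the rough-set completion idea that ROUSTIDA-style methods are generally described as using: a missing value is filled only from records that do not conflict with the incomplete record on any attribute both of them know, and only when those records agree on the value. It is a simplified illustration and not the Key&Rpt_RS algorithm of this paper. The function names `compatible` and `impute` and the sample alarm records are hypothetical and do not come from the paper.

```python
from typing import List, Optional

Row = List[Optional[str]]  # None marks a missing attribute value


def compatible(a: Row, b: Row) -> bool:
    """True if a and b never disagree on an attribute that both of them know."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))


def impute(rows: List[Row]) -> List[Row]:
    """Fill missing values from compatible rows, repeating until nothing changes."""
    data = [list(r) for r in rows]
    changed = True
    while changed:
        changed = False
        snapshot = [list(r) for r in data]  # each pass reads the previous table
        for i, row in enumerate(snapshot):
            neighbours = [r for j, r in enumerate(snapshot)
                          if j != i and compatible(row, r)]
            for k, value in enumerate(row):
                if value is not None:
                    continue
                candidates = {r[k] for r in neighbours if r[k] is not None}
                if len(candidates) == 1:  # unambiguous, so safe to fill
                    data[i][k] = candidates.pop()
                    changed = True
    return data


# Hypothetical alarm records: (device, alarm type, severity)
alarms = [
    ["A1", "link_down", None],
    ["A1", "link_down", "major"],
    ["B2", None, "minor"],
    ["B2", "cpu_high", "minor"],
    ["C3", None, None],
]
filled = impute(alarms)
```

In this sketch the first and third records are completed from their matching neighbours, while the `C3` record stays incomplete because no other record is compatible with it. A value also stays missing when compatible records disagree, which is the conservative behaviour that motivates refinements such as the one proposed in the paper.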


  • CLC number: TP391