Authors: Sloan, Stephen | Lafler, Kirk Paul
Article Type: Research Article
Abstract:
Data comes in all forms, shapes, sizes and complexities. SAS® users know all too well that data stored in files and data sets can be, and often is, problematic and plagued with a variety of issues. Although today’s statistical software programs are extremely powerful, they are typically not designed to overcome poor-quality data. This paper describes and recommends a comprehensive data preparation and fuzzy matching process that enables improved statistical modeling. Statistical techniques are also available for comparing the results of the process. Most statistical software users are aware that two or more data files can be joined, or combined, without a problem when the data files have identifiers with unique and reliable values. However, many files do not have unique identifiers, or “keys”, and need to be joined using character values, like names or e-mail addresses. To add to the difficulty and confusion, these identifiers might be spelled differently, or use different abbreviation or capitalization protocols. This paper describes a versatile 6-step approach to handling data preparation and fuzzy matching issues for improved statistical modeling. The steps are: identifying and understanding potential matching scenarios; exploring data values and data types; data cleaning and validation; data transformation; traditional merge and join techniques; and an assortment of techniques to successfully merge, join and match less-than-perfect, or “messy”, data, including phonetic matching with special-purpose character-handling functions like the SOUNDEX algorithm, and fuzzy matching with the SPEDIS, COMPLEV, and COMPGED functions. Although the programming techniques described in this paper are illustrated using SAS code, many, if not most, of the techniques can be applied to any software platform that supports character-handling capabilities.
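To illustrate that the approach is portable beyond SAS, the following is a minimal Python sketch of the cleaning step followed by an exact match and an edit-distance fallback. It is not the authors' code: the function names `normalize`, `levenshtein`, and `fuzzy_match`, the sample names, and the `max_distance` threshold of 2 are all hypothetical choices made for illustration. The distance shown is the plain Levenshtein edit distance, the measure the COMPLEV function is based on; it makes no attempt to reproduce the scoring of SPEDIS or COMPGED.

```python
def normalize(name: str) -> str:
    """Cleaning step: trim, collapse internal whitespace, and lowercase."""
    return " ".join(name.split()).lower()


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(left, right, max_distance=2):
    """Pair each left value with a right value: exact match on the cleaned
    value first, otherwise the closest value within max_distance edits."""
    cleaned = {normalize(r): r for r in right}
    matches = {}
    for value in left:
        key = normalize(value)
        if key in cleaned:  # traditional (exact) join on the cleaned value
            matches[value] = cleaned[key]
            continue
        best = min(cleaned, key=lambda k: levenshtein(key, k), default=None)
        if best is not None and levenshtein(key, best) <= max_distance:
            matches[value] = cleaned[best]
    return matches


# Hypothetical sample data: one misspelling, one formatting difference, one non-match.
result = fuzzy_match(["Jon Smith", "  MARY  jones", "Zed"],
                     ["John Smith", "Mary Jones"])
```

In this sketch the cleaned exact match is tried first because it is cheap and unambiguous; the edit-distance comparison is reserved for values that cleaning alone cannot reconcile, and unmatched values are simply left out of the result for manual review.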
Keywords: SAS, fuzzy matching, character-handling functions, phonetic matching, SOUNDEX, SPEDIS, edit distance, Levenshtein, COMPLEV, COMPGED
DOI: 10.3233/MAS-180447
Citation: Model Assisted Statistics and Applications, vol. 13, no. 4, pp. 367-375, 2018