
(1)

Data Mining

Piotr Paszek

Data Preprocessing

(2)

Data preprocessing

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources.

Low-quality data will lead to low-quality mining results.

How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results?

How can the data be preprocessed so as to improve the efficiency of the mining process?

(3)

Major tasks (steps) in data preprocessing (KDD)

1 Data cleaning

to remove noise and inconsistent data.

2 Data integration,

where multiple data sources may be combined.

3 Data selection,

where data relevant to the analysis task are retrieved from the database.

4 Data transformation,

where data are transformed and consolidated into forms appropriate for mining, e.g., by performing summary or aggregation operations.

(4)

Data cleaning

Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., Occupation = “ ” (missing data)

noisy: containing noise, errors, or outliers, e.g., Salary = −10 (an error)

inconsistent: containing discrepancies in codes or names, e.g., Age = “42”, Birthday = “03/07/2010”; was rating 1, 2, 3, now rating A, B, C; discrepancy between duplicate records

intentional (e.g., disguised missing data): Jan. 1 as everyone’s birthday?

(5)

Incomplete (missing) Data

Data is not always available

E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to:

equipment malfunction

inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data may not have been considered important at the time of entry

history or changes of the data not registered

Missing data may need to be inferred.

(6)

Missing Data

Ignore the tuple:

This is usually done when the class label is missing (assuming the mining task involves classification).

This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.
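As a minimal illustration (not from the original slides), a pandas sketch of this option with a hypothetical customer table: only the tuples whose class label is missing are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data; 'risk' is the class label.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 45000],
    "age":    [34, 29, np.nan, 41],
    "risk":   ["low", "high", None, "low"],
})

# Ignore (drop) only the tuples whose class label is missing.
df_clean = df.dropna(subset=["risk"])
print(df_clean)
```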

(7)

Missing Data

Fill in the missing value manually:

In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

Use a global constant to fill in the missing value:

Replace all missing attribute values by the same constant such as a label like “Unknown” or −∞.

If missing values are replaced by, say, “Unknown”, then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common – that of “Unknown”.

Hence, although this method is simple, it is not foolproof.
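A one-line pandas sketch of this method (the 'occupation' column is a hypothetical example): every missing value receives the same constant label.

```python
import pandas as pd

# Hypothetical attribute with missing values.
df = pd.DataFrame({"occupation": ["engineer", None, "teacher", None]})

# Replace every missing value with the same global constant.
df["occupation"] = df["occupation"].fillna("Unknown")
print(df)
```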

(8)

Missing Data

Use a measure of central tendency for the attribute to fill in the missing value:

For normal (symmetric) data distributions the mean can be used, while for skewed data distributions the median should be employed.

Use the attribute mean or median for all samples belonging to the same class as the given tuple:

For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple.

If the data distribution for a given class is skewed, the median value is a better choice.
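A short pandas sketch of both variants, using hypothetical 'income' and 'credit_risk' columns: fill with the overall median, or with the median of the tuples in the same class.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing for some customers.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, np.nan, 31000, np.nan, 48000],
})

# Fill with the overall median (robust for skewed distributions) ...
df["income_global"] = df["income"].fillna(df["income"].median())

# ... or with the median of customers in the same credit-risk class.
df["income_by_class"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("median")
)
print(df)
```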

(9)

Missing Data

Use the most probable value to fill in the missing value:

This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.

For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
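One way to sketch this in Python (scikit-learn, hypothetical attribute names): fit a small decision tree on the tuples where income is known and let it predict the missing values.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical customer data; 'income' has missing values.
df = pd.DataFrame({
    "age":            [25, 37, 45, 52, 29, 41],
    "years_employed": [2, 10, 18, 25, 4, 15],
    "income":         [30000, 52000, np.nan, 75000, np.nan, 60000],
})

known = df["income"].notna()
features = ["age", "years_employed"]

# Learn income from the other attributes on the complete tuples ...
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(df.loc[known, features], df.loc[known, "income"])

# ... and fill the gaps with the predicted (most probable) values.
df.loc[~known, "income"] = tree.predict(df.loc[~known, features])
print(df)
```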

(10)

Data cleaning

Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to:

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming convention

Other data problems which require data cleaning:

duplicate records

incomplete data

inconsistent data

(11)

Data cleaning

How to Handle Noisy Data?

Binning

first sort data and partition into (equal-frequency) bins; then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. (a small sketch follows this list)

Regression

smooth by fitting the data into regression functions

Clustering

detect and remove outliers

Combined computer and human inspection

detect suspicious values and check by human (e.g., deal with possible outliers)
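A minimal pandas sketch of smoothing by bin means (the price values are a toy example): the data are partitioned into equal-frequency bins and each value is replaced by the mean of its bin.

```python
import pandas as pd

# Toy attribute to be smoothed.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency partitioning into 3 bins ...
bins = pd.qcut(prices, q=3, labels=False)

# ... then smooth every value by the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```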

(12)

Data integration

Data integration:

Combines data from multiple sources into a coherent store

Entity identification problem:

Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Detecting and resolving data value conflicts

For the same real-world entity, attribute values from different sources are different

Possible reasons: different representations, different scales, e.g., metric versus British units

(13)

Data reduction

Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

Why data reduction?

A database/data warehouse may store terabytes of data.

Complex data analysis may take a very long time to run on the complete data set.

(14)

Data reduction

Dimensionality reduction, e.g., remove unimportant attributes:

Wavelet transforms

Principal Components Analysis (PCA)

Feature subset selection, feature creation

Numerosity reduction (some simply call it: Data Reduction):

Regression and Log-Linear Models

Histograms, clustering, sampling

Data cube aggregation

Data compression

(15)

Data transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

(16)

Data transformation methods

Smoothing: remove noise from data

Attribute/feature construction: new attributes constructed from the given ones

Aggregation: summarization, data cube construction

Normalization: scaled to fall within a smaller, specified range

Discretization

(17)

Normalization

The measurement unit used can affect the data analysis.

For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.

In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or “weight.”

To help avoid dependence on the choice of measurement units, the data should be normalized.

This involves transforming the data to fall within a smaller or common range such as [−1, 1] or [0, 1].

(18)

Normalization

Min-max normalization

Linear transformation of the original data to a new interval by the formula

V′ = (V − min) / (max − min) · (max_n − min_n) + min_n

where:

V – old value, V′ – new value,

[min, max] – old interval, [min_n, max_n] – new interval.
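A small NumPy sketch of the formula above (function name and sample values are illustrative only):

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(v), max(v)] onto [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# e.g., 73600 maps to about 0.716 in [0, 1]
print(min_max_normalize([12000, 73600, 98000]))
```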

(19)

Normalization

Z-score normalization

Data transformation according to the formula

V′ = (V − x̄) / σ

where:

V – old value, V′ – new value,

x̄ – mean value, σ – standard deviation.
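The corresponding NumPy sketch (illustrative function name; the population standard deviation is used):

```python
import numpy as np

def z_score_normalize(v):
    """Center on the mean and scale by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

print(z_score_normalize([12000, 73600, 98000]))
```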

(20)

Normalization

Decimal scaling

Data transformation according to the formula

V′ = V / 10^j

where:

j is the smallest integer such that max{|V′|} < 1,

V – old value, V′ – new value.
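And a sketch of decimal scaling (illustrative function name): j is chosen as the smallest power of 10 that pushes all absolute values below 1.

```python
import numpy as np

def decimal_scaling(v):
    """Divide by 10**j, with j the smallest integer such that max|v'| < 1."""
    v = np.asarray(v, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10 ** j

print(decimal_scaling([-986, 917]))   # j = 3  ->  [-0.986, 0.917]
```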

(21)

Discretization

Three types of attributes

Nominal

– values from an unordered set, e.g., colour, profession

Ordinal

– values from an ordered set, e.g., military or academic rank

Numeric

– numbers, e.g., integer or real values

(22)

Discretization of continuous attributes (features)

In statistics and machine learning, discretization refers to the process of partitioning continuous attributes (features) into intervals.

Interval labels can then be used to replace actual data values.

Discretization reduces data size.

Discretization prepares data for further analysis, e.g., classification.

Various discretization algorithms:

Supervised versus unsupervised

Split (top-down) versus merge (bottom-up)

(23)

Discretization

Data Discretization Methods

Binning (unsupervised, top-down split)

Histogram analysis (unsupervised, top-down split)

Clustering analysis (unsupervised, top-down split or bottom-up merge)

Decision-tree analysis (supervised, top-down split)

Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)

Maximal Discernibility Discretization (supervised, top-down split)

(24)

Discretization: Binning

Equal-width partitioning (Equal Interval Width)

Divides the range into n intervals of equal size (uniform grid).

If min and max are the lowest and highest values of the attribute, the width of the intervals will be: w = (max − min) / n

The most straightforward method, but outliers may dominate the presentation.

Equal-frequency partitioning (Equal Frequency per Interval, Maximum Entropy Discretization)

Divides the range into n intervals, each containing approximately the same number of samples

Good data scaling
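Both partitioning schemes can be sketched with pandas (the age values are a toy example): pd.cut gives equal-width intervals, pd.qcut equal-frequency intervals.

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 52, 70])

# Equal-width partitioning: 4 intervals of identical width.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency partitioning: 4 intervals with roughly the same count.
equal_freq = pd.qcut(ages, q=4)

print(pd.DataFrame({"age": ages,
                    "equal_width": equal_width,
                    "equal_freq": equal_freq}))
```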

(25)

Discretization

Discretization by Classification and Correlation Analysis

Classification (e.g., decision tree analysis)

Supervised: given class labels

Using entropy to determine split point (discretization point)

Top-down, recursive split

Correlation analysis (e.g., χ²-based discretization)

Supervised: use class information

Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, that is, low χ² values) to merge

Merge performed recursively, until a predefined stopping condition

(26)

Discretization

Entropy function

Ent(U) = − Σ_{i=1}^{k} p_i · log₂ p_i

where:

k – number of intervals

p_i – the fraction of the attribute’s values that fall into the i-th interval
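A direct NumPy sketch of the entropy function above, computed from the empirical proportions p_i:

```python
import numpy as np

def entropy(values):
    """Ent(U) = -sum_i p_i * log2(p_i), with p_i the fraction of values
    falling into the i-th group (here: each distinct value)."""
    _, counts = np.unique(np.asarray(values), return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# 6 x "yes", 3 x "no"  ->  about 0.918 bits
print(entropy(["yes", "yes", "no", "no", "no", "yes", "yes", "yes", "yes"]))
```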

(27)

Discretization

Maximal Discernibility Heuristic (MD)

Supervised: given class labels

Using a greedy algorithm to determine a cut (discretization point)

Top-down, recursive split

Discretization algorithm (heuristic) using Rough Sets Theory (RST), based on Johnson’s greedy strategy:

In each step of the algorithm, a cut discerning a maximal number of object pairs is found

This step is repeated as long as the set of object pairs still to be discerned is non-empty
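A small illustrative sketch of this greedy strategy for a single numeric attribute (not the authors' implementation): candidate cuts are midpoints between consecutive distinct values, and in each step the cut discerning the most not-yet-discerned pairs of objects from different classes is chosen.

```python
from itertools import combinations

def md_discretize(values, labels):
    """Greedy MD heuristic (illustrative sketch): repeatedly pick the cut
    that discerns a maximal number of not-yet-discerned object pairs
    belonging to different decision classes."""
    # Candidate cuts: midpoints between consecutive distinct values.
    distinct = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    # Object pairs that must be discerned (different class labels).
    pairs = {(i, j) for i, j in combinations(range(len(values)), 2)
             if labels[i] != labels[j]}

    def discerned_by(cut):
        # Pairs separated by this cut: one value on each side.
        return {(i, j) for (i, j) in pairs
                if (values[i] < cut) != (values[j] < cut)}

    chosen = []
    while pairs and cuts:
        best = max(cuts, key=lambda c: len(discerned_by(c)))
        separated = discerned_by(best)
        if not separated:          # no remaining cut helps any further
            break
        chosen.append(best)
        pairs -= separated
        cuts.remove(best)
    return sorted(chosen)

# Toy attribute with decision classes A/B.
print(md_discretize([1.0, 1.2, 1.5, 1.7, 2.0, 2.3],
                    ["A", "A", "B", "B", "A", "B"]))
# -> [1.35, 1.85, 2.15]
```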
