Data – what are they?

MODA

TYPES OF DATA (1)

• Generally we distinguish:

– Quantitative data

– Qualitative data

• Bivalued (two-valued) data: often very useful

• Remember: null values are not applicable

• Missing data are usually not acceptable

Types of Attribute Values: Levels of Measurement

Types of Attribute Values: Discrete and Continuous Attributes

Missing Values

Handling Missing Values by Eliminating Data Objects

Handling Missing Values by Eliminating Attributes

Handling Missing Values by Estimating Missing Values
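The three strategies named in the slide titles (eliminating data objects, eliminating attributes, estimating missing values) can be sketched as below. This is only an illustration: the records, the attribute names, the 50% threshold and the choice of the mean as the estimate are all assumptions made for the example, not part of the original slides.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "income": 40000, "fax": None},
    {"age": None, "income": 52000, "fax": None},
    {"age": 31, "income": 61000, "fax": None},
    {"age": 46, "income": None, "fax": "555-0100"},
]

def drop_objects(rows):
    """Eliminate data objects: keep only rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_attributes(rows, max_missing_fraction=0.5):
    """Eliminate attributes whose share of missing values exceeds the threshold."""
    keep = [
        name for name in rows[0]
        if sum(r[name] is None for r in rows) / len(rows) <= max_missing_fraction
    ]
    return [{name: r[name] for name in keep} for r in rows]

def impute_mean(rows, attribute):
    """Estimate missing values of a numeric attribute by the mean of the observed values."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]

print(drop_objects(records))              # every row has a gap, so nothing survives
print(drop_attributes(records))           # "fax" is missing in 3 of 4 rows and is dropped
print(impute_mean(records, "age")[1])     # the missing age is replaced by the mean age
```

In this toy data set every object has at least one gap, so eliminating objects discards everything; that is the usual argument for the other two strategies.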

Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:

– It divides the range (values of a given attribute) into N intervals of equal size: uniform grid

– If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.

– The most straightforward

– But outliers may dominate presentation

– Skewed data is not handled well.
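A minimal sketch of equal-width partitioning with W = (B - A)/N, applied to the price values used in the smoothing example of this section. The function name and the rule of putting the maximum value into the last interval are assumptions of this sketch.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything goes into the first bin
        else:
            # The maximum would land at index n_bins, so clamp it into the last bin.
            index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width)  # 10.0
print(bins)   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
```

The bins hold 3, 3 and 6 values: equal width does not give equal counts, which is why skewed data is handled poorly.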

Simple Discretization Methods: Binning

• Equal-depth (frequency) partitioning:

– It divides the range (values of a given attribute) into N intervals, each containing approximately the same number of samples (elements)

– Good data scaling

– Managing categorical attributes can be tricky.
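Equal-depth partitioning can be sketched in the same way. The function name is hypothetical, and the rule for spreading a remainder over the first bins is one possible choice when the number of values is not divisible by N.

```python
def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins holding (approximately) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    start = 0
    for i in range(n_bins):
        # Assumption of this sketch: any remainder is spread over the first bins.
        size = n // n_bins + (1 if i < n % n_bins else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_depth_bins(prices, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```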

Binning Methods for Data Smoothing (book example)

• Sorted data (attribute values) for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

• Partition into (equal-depth) bins:

– Bin 1: 4, 8, 9, 15

– Bin 2: 21, 21, 24, 25

– Bin 3: 26, 28, 29, 34

• Smoothing by bin means (rounded to the nearest dollar):

– Bin 1: 9, 9, 9, 9

– Bin 2: 23, 23, 23, 23

– Bin 3: 29, 29, 29, 29

• Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):

– Bin 1: 4, 4, 4, 15

– Bin 2: 21, 21, 25, 25

– Bin 3: 26, 26, 26, 34

• Idea: replace the values in a bin by a representative value (smoothing values)
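The two smoothing rules of the example can be checked with a short sketch. The function names are hypothetical; rounding the bin mean to the nearest integer and sending ties to the lower boundary are assumptions chosen to match the numbers in the example.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: a value exactly halfway goes to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```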

Iris Sample Data Set

• Many of the exploratory data techniques are illustrated with the Iris Plant data set.

– Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html

– From the statistician R. A. Fisher

– Three flower types (classes):

• Setosa

• Virginica

• Versicolour

– Four (non-class) attributes

• Sepal width and length

• Petal width and length

Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.

Summary Statistics

• Summary statistics are numbers that summarize properties of the data

– Summarized properties include frequency, location and spread

• Examples: location (mean), spread (standard deviation)

– Most summary statistics can be calculated in a single pass through the data
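As an illustration of the single-pass claim, the sketch below computes the mean and the sample standard deviation in one loop using Welford's update. The slides do not name a particular algorithm, so this choice is an assumption; the result is compared against Python's `statistics` module on the price values from the binning example.

```python
import math
import statistics

def running_mean_std(values):
    """Mean and sample standard deviation in a single pass (Welford's update)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # running sum of squared deviations from the mean
    variance = m2 / (count - 1) if count > 1 else 0.0
    return mean, math.sqrt(variance)

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
mean, std = running_mean_std(prices)
print(math.isclose(mean, statistics.mean(prices)))   # True
print(math.isclose(std, statistics.stdev(prices)))   # True
```

Statistics such as the median or other percentiles need the values in sorted order, so they do not fit this one-pass pattern directly.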

Frequency and Mode

• The frequency of an attribute value is the percentage of times the value occurs in the data set

– For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.

• The mode of an attribute is the most frequent attribute value

• The notions of frequency and mode are typically used with categorical data
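Frequency and mode of a categorical attribute can be sketched with `collections.Counter`. The list of class labels is a small hypothetical sample, not the real Iris data.

```python
from collections import Counter

def frequencies(values):
    """Percentage of occurrences of each attribute value."""
    counts = Counter(values)
    return {value: 100.0 * count / len(values) for value, count in counts.items()}

def mode(values):
    """Most frequent attribute value (the first one encountered if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical sample of class labels.
labels = ["setosa", "virginica", "setosa", "versicolour", "setosa", "virginica"]
print(frequencies(labels)["setosa"])  # 50.0
print(mode(labels))                   # setosa
```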

Percentiles

• For continuous data, the notion of a percentile is more useful.

Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile x_p is a value of x such that p% of the observed values of x are less than x_p.

• For instance, the 50th percentile is the value x_50% such that 50% of all values of x are less than x_50%.
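Several conventions exist for computing a percentile from a finite sample. The sketch below uses the nearest-rank convention, which always returns an observed value; this choice, and the function name, are assumptions of the example rather than something the slides prescribe.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must lie between 0 and 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(percentile(prices, 50))  # 21
print(percentile(prices, 25))  # 9
print(percentile(prices, 75))  # 26
```

Interpolating conventions (as used by many statistics packages) can return values between observations, so results may differ slightly from this sketch.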

Measures of Location: Mean and Median

• The mean is the most common measure of the location of a set of points.

• However, the mean is very sensitive to outliers.

• Thus, the median or a trimmed mean is also commonly used.
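The sensitivity of the mean to a single outlier can be seen in a small sketch. The values are hypothetical, and `trimmed_mean` is a helper written for this example that drops the same fraction of values from each end.

```python
from statistics import mean, median

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end of the sorted data."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    return mean(ordered[k:len(ordered) - k])

# Hypothetical values with one extreme outlier.
values = [30, 32, 35, 38, 40, 41, 45, 48, 50, 1000]
print(mean(values))               # 135.9  (pulled far up by the outlier)
print(median(values))             # 40.5
print(trimmed_mean(values, 0.1))  # 41.125 (smallest and largest value removed)
```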

Measures of Spread: Range and Variance

• Range is the difference between the max and min

• The variance or standard deviation is the most common measure of the spread of a set of points.

• However, this is also sensitive to outliers, so that other measures are often used.
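A sketch comparing range, variance and standard deviation with one more robust alternative. The slides do not say which other measures they mean; the median absolute deviation used here is an assumption chosen for illustration, and the data are hypothetical.

```python
from statistics import median, stdev, variance

def value_range(values):
    """Difference between the maximum and the minimum."""
    return max(values) - min(values)

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (a robust spread measure)."""
    center = median(values)
    return median([abs(v - center) for v in values])

# Hypothetical values with one extreme outlier.
values = [30, 32, 35, 38, 40, 41, 45, 48, 50, 1000]
print(value_range(values))                # 970
print(variance(values), stdev(values))    # both dominated by the outlier
print(median_absolute_deviation(values))  # 6.5
```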

Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.

– Humans have a well-developed ability to analyze large amounts of information that is presented visually

– Can detect general patterns and trends

– Can detect outliers and unusual patterns

Arrangement

• Is the placement of visual elements within a display

• Can make a large difference in how easy it is to understand the data

• Example:

Selection

• Is the elimination or the de-emphasis of certain objects and attributes

• Selection may involve choosing a subset of attributes

– Dimensionality reduction is often used to reduce the number of dimensions to two or three

– Alternatively, pairs of attributes can be considered

• Selection may also involve choosing a subset of objects

– A region of the screen can only show so many points

– Can sample, but want to preserve points in sparse areas

Visualization Techniques: Histograms

• Histogram

– Usually shows the distribution of values of a single variable

– Divide the values into bins and show a bar plot of the number of objects in each bin.

– The height of each bar indicates the number of objects

– Shape of histogram depends on the number of bins

• Example: Petal Width (10 and 20 bins, respectively)
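The dependence of the histogram shape on the number of bins can be sketched without a plotting library. The petal-width measurements are not reproduced in these notes, so the sketch reuses the price values from the binning example; the function name is hypothetical.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value is counted in the last bin.
        index = min(int((v - low) / width), n_bins - 1) if width > 0 else 0
        counts[index] += 1
    return counts

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
for n_bins in (3, 6):
    counts = histogram(prices, n_bins)
    print(n_bins, counts)
    for count in counts:
        print("  " + "#" * count)  # one text bar per bin
# 3 bins give [3, 3, 6]; 6 bins give [2, 1, 1, 2, 4, 2]
```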

Two-Dimensional Histograms

• Show the joint distribution of the values of two attributes

• Example: petal width and petal length

– What does this tell us?

Visualization Techniques: Box Plots

• Box Plots

– Invented by J. Tukey

– Another way of displaying the distribution of data

– Following figure shows the basic part of a box plot

[Figure: basic parts of a box plot, with labels for an outlier and the 10th, 25th, 50th, 75th and 90th percentiles]
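The numbers behind a box plot can be computed with a small sketch. It assumes the box spans the 25th to 75th percentile, the whiskers end at the 10th and 90th percentiles, and anything beyond the whiskers is drawn as an outlier; other box plot conventions exist. Percentiles use the nearest-rank rule, and the price values from the binning example serve as data.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of the observed values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def box_plot_summary(values):
    """Percentile marks of a box plot and the points lying beyond the whiskers."""
    marks = {p: percentile(values, p) for p in (10, 25, 50, 75, 90)}
    # Assumption: whiskers end at the 10th and 90th percentiles.
    outliers = [v for v in values if v < marks[10] or v > marks[90]]
    return marks, outliers

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
marks, outliers = box_plot_summary(prices)
print(marks)     # {10: 8, 25: 9, 50: 21, 75: 26, 90: 29}
print(outliers)  # [4, 34]
```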

Example of Box Plots

• Box plots can be used to compare attributes

• The four data sets that make up Anscombe's quartet are similar in many respects. For all four:

– mean of the x values = 9.0

– mean of the y values = 7.5

– equation of the least-squares regression line: y = 3 + 0.5x

– sum of squared deviations of x about its mean = 110.0

– regression sum of squares (variance accounted for by x) = 27.5

– residual sum of squares (about the regression line) = 13.75

– correlation coefficient = 0.82

– coefficient of determination = 0.67
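The shared statistics can be recomputed with a short sketch. The data values are not given in these notes; they are the values commonly tabulated from Anscombe (1973) and should be checked against the paper before being relied on. The function and variable names are hypothetical.

```python
import math

# Anscombe's quartet as commonly tabulated (verify against the original paper).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Means, least-squares line, sums of squares and correlation for one data set."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    regression_ss = slope * sxy          # part of syy explained by the line
    residual_ss = syy - regression_ss    # part left around the line
    r = sxy / math.sqrt(sxx * syy)
    return {
        "mean_x": mean_x, "mean_y": mean_y, "slope": slope, "intercept": intercept,
        "sxx": sxx, "regression_ss": regression_ss, "residual_ss": residual_ss,
        "r": r, "r_squared": r * r,
    }

summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

Up to rounding, all four rows agree with the values listed above, even though the scatter plots of the four sets look nothing alike.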

• However, when the data are plotted, the differences among the data sets are revealed.

Sources:

Edward R. Tufte, The Visual Display of Quantitative Information (Cheshire, Connecticut: Graphics Press, 1983), pp. 14-15.

F.J. Anscombe, "Graphs in Statistical Analysis," American Statistician, vol. 27 (Feb 1973), pp. 17-21.