Agnieszka Nowak - Brzezińska


Data mining methods

• association discovery

• sequential pattern discovery

• classification

• cluster analysis (clustering)

• time series

• change and deviation detection

Other data mining (DM) analysis methods

Clustering is the partition of a set of objects into subsets such that the similarity of objects within the same subset is as high as possible and the similarity of objects from different subsets is as low as possible.

What is clustering?

An object is assigned to the cluster whose centroid is nearest in terms of Euclidean distance.

Clustering (cluster analysis)

Unsupervised learning

• a training set is given in which the objects are not labeled

• the goal is to discover unknown classifications and similarities between objects

Cluster analysis

• distance measures,

• similarity measures.

x₄ = (0, 0, 0, 0, 1, 0, 0, 3), x₂₂ = (0, 0, 0, 0, 1, 1, 0, 3)

$$d(x_4, x_{22}) = \sqrt{(0-0)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2 + (1-1)^2 + (0-1)^2 + (0-0)^2 + (3-3)^2} = \sqrt{1} = 1$$

$$p(x_4, x_{22}) = \frac{0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 1 + 0 \cdot 0 + 3 \cdot 3}{\sqrt{0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 0^2 + 3^2} \cdot \sqrt{0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 1^2 + 0^2 + 3^2}} = \frac{9 + 1}{\sqrt{10} \cdot \sqrt{11}} = \frac{10}{10.49} \approx 0.95$$

How do we measure similarity?
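The distance d and the similarity p for the vectors x₄ and x₂₂ can be checked with a few lines of R. This is a minimal sketch: the helper names `euclidean` and `cosine_similarity` are hypothetical, and reading p as the cosine similarity is an assumption based on the form of the formula.

```r
# The two example vectors
x4  <- c(0, 0, 0, 0, 1, 0, 0, 3)
x22 <- c(0, 0, 0, 0, 1, 1, 0, 3)

# Euclidean distance: square root of the summed squared differences
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Cosine similarity (assumed form of p): dot product divided by
# the product of the two vector lengths
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

d <- euclidean(x4, x22)
p <- cosine_similarity(x4, x22)

# Manhattan distance for comparison: sum of absolute differences
m <- sum(abs(x4 - x22))

print(c(distance = d, similarity = round(p, 2), manhattan = m))
```

The two vectors differ in a single coordinate, so both distances are small and the similarity is close to 1.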

Cluster analysis: an example

Cluster analysis is the process of partitioning data into subsets called classes (clusters) according to a specified criterion. Classification, in this sense, is the partition of a set of objects into clusters.

A training set is given in which the objects are not labeled; the goal is to discover unknown classifications and similarities between objects.

The resulting groups (clusters) are sets of objects from the studied series that are more similar to one another (within a group) than to the remaining objects (between groups).

Classification...

Dissimilarity (distinguishability of objects)

It can be expressed in two ways:

• as a difference (that is, a distance d), or conversely

• as a similarity p.

Prerequisites:

• establishing a standard description of the studied object: a set of diagnostic features that describe the variability of the studied objects well,

• defining how objects are to be compared.

Spherical distance

Manhattan distance

Types of algorithms:

• graphical, for example Czekanowski diagrams,

• hierarchical (agglomerative, divisive),

• k-optimization (non-hierarchical): the series is split into k sets of objects, where each object may belong to only one of the sets and the number k is usually given by the researcher.

Fig. Example of a dendrogram (leaves: o₄, o₂, o₃, o₅, o₆, o₇, o₁, o₈; root: {o₁, o₂, o₃, o₄, o₅, o₆, o₇, o₈})

After normalization…

Single linkage, complete linkage, average linkage

| | k-means | k-medoid | AHC |
|---|---|---|---|
| Computational complexity | O(tkn) | O(k(n-k)²) | At least O(n²) |
| Disadvantages | Sensitive to outliers; requires the number of clusters k | Requires the number of clusters k | Requires a stopping condition, e.g. a maximum cost-effectiveness coefficient |
| Advantages | Simple structure, relatively low computational complexity | More effective than k-means because it searches for representative objects and is less sensitive to outliers | Does not require the number of groups; more effective and popular |

HIERARCHICAL CLUSTERING

- uses a distance matrix

- builds a tree of objects (a dendrogram)

- does not require the number of clusters to be given in advance, but does require a stopping condition for the algorithm

Disadvantages of AHC:

- high computational complexity: at least O(n²)

Agglomerative clustering versus other methods
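The properties listed above (distance matrix, dendrogram, stopping condition) can be sketched with base R's `hclust()`. The toy points and the object names o1 to o6 are hypothetical, and cutting the tree at k = 2 is only one possible stopping condition.

```r
# Hypothetical toy data: two well-separated groups of 2-D points
pts <- rbind(c(1, 1), c(1.5, 1.2), c(1.2, 0.8),
             c(8, 8), c(8.3, 7.9), c(7.8, 8.4))
rownames(pts) <- paste0("o", 1:6)

d <- dist(pts)   # distance matrix (Euclidean by default)

# Build one tree per linkage criterion
methods <- c(single = "single", complete = "complete", average = "average")
trees <- lapply(methods, function(m) hclust(d, method = m))

# Stopping condition used here: cut each tree into 2 clusters
groups <- lapply(trees, cutree, k = 2)
print(groups$average)

# plot(trees$average)  # uncomment to draw the dendrogram
```

On data this clearly separated all three linkage criteria give the same two groups; they tend to differ when clusters are elongated or overlap.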

An object is assigned to the cluster whose centroid is nearest in terms of Euclidean distance.

k-optimization algorithms...

Goal of the classification:

1. minimize the variability within clusters,
2. maximize the variability between clusters.

Procedure:

1. The researcher specifies k initial clusters.
2. Objects are assigned to their nearest clusters.

Objects are then moved between clusters iteratively so as to obtain the best partition into groups.

Stopping condition of the optimization process:

• a preset maximum number of iterations (steps) is exceeded, or

• the class structure stabilizes.

The criterion for assessing the quality of a partition of a series of objects into groups is the so-called partition function, which usually takes the form of the sum of the Euclidean distances of the objects from the centroids of their groups:

$$TD(C) = \sum_{p \in C} dist(p, m_c)$$

where dist(p, m_c) is the Euclidean distance of a given point p from the center m_c of its group C.
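The procedure, both stopping conditions and the TD criterion can be put together in a short base-R sketch. The function names `total_distance` and `simple_kmeans` are hypothetical, the first k observations are assumed as initial centers, and TD is summed over all clusters. In practice the built-in `kmeans()` would be used instead.

```r
# TD: sum of Euclidean distances of objects from their cluster centers
total_distance <- function(x, centers, cluster) {
  sum(sqrt(rowSums((x - centers[cluster, , drop = FALSE])^2)))
}

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[seq_len(k), , drop = FALSE]   # assumption: first k observations
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {          # stop 1: iteration limit reached
    # squared distance of every object to every center (n x k matrix)
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                 numeric(nrow(x)))
    new_cluster <- max.col(-d2, ties.method = "first")  # nearest center
    if (all(new_cluster == cluster)) break   # stop 2: structure is stable
    cluster <- new_cluster
    for (j in seq_len(k)) {                  # recompute the centroids
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers,
       td = total_distance(x, centers, cluster))
}

# Hypothetical toy data with two obvious groups
x <- rbind(c(1, 1), c(9, 9), c(1, 2), c(2, 1), c(8, 9), c(9, 8))
res <- simple_kmeans(x, k = 2)
print(res$cluster)
print(res$td)
```

The loop ends as soon as an iteration leaves every assignment unchanged, which is the "class structure stabilizes" condition; `max_iter` covers the iteration limit.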

The initial cluster centers can be chosen in the following ways:

• random assignment of elements to the k declared clusters,

• maximizing the distance between clusters,

• observations at a constant interval,

• the first k observations.

The number k: how should it be determined?
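The four options for the initial centers could look as follows in R. This is an illustrative sketch only: the built-in `iris` data stand in for real data, and "maximizing the distance between clusters" is interpreted here as a simple farthest-first heuristic, which is an assumption rather than the definition used by any particular tool.

```r
set.seed(1)                      # only to make the sketch reproducible
x <- as.matrix(iris[, 1:4])      # built-in data, used purely for illustration
k <- 3
n <- nrow(x)

# 1. Random assignment of objects to k clusters; centers are the group means
random_groups <- sample(rep_len(seq_len(k), n))
centers_random <- apply(x, 2, function(col) tapply(col, random_groups, mean))

# 2. Maximizing the distance between centers (farthest-first heuristic)
dm <- as.matrix(dist(x))
idx <- 1
while (length(idx) < k) {
  nearest <- apply(dm[, idx, drop = FALSE], 1, min)  # distance to closest chosen center
  idx <- unname(c(idx, which.max(nearest)))
}
centers_far <- x[idx, ]

# 3. Observations at a constant interval
centers_interval <- x[round(seq(1, n, length.out = k)), ]

# 4. The first k observations
centers_first <- x[seq_len(k), ]

# Any of these matrices can be passed to kmeans() as starting centers
fit <- kmeans(x, centers = centers_far)
print(fit$size)
```

Because the result of k-means depends on the starting centers, comparing the fits obtained from several of these options is a cheap sanity check.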

Example (K = 2):

1. Randomly choose K objects as the initial cluster centers.
2. Assign each object to the nearest group.
3. Compute the new cluster centers.
4. Next iteration: reassign the objects and recompute the cluster centers.

Evaluation of the k-means method:

Advantages:

• relatively low computational complexity,

• a simple idea.

Disadvantages:

• noise in the data and outliers can distort the centroids,

• the initial choice of centers affects the results.

What is the problem with the k-means method?

• The algorithm is too sensitive to so-called outliers.

• The k-medoids method: instead of building centroids (means of the objects in a cluster), it builds medoids, that is, those objects from the set of n objects that are most central in a given cluster: their total distance to all other objects in the cluster is the smallest.

• PAM (Partitioning Around Medoids): a clustering algorithm based on k representatives.
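The difference between a centroid and a medoid is easy to see on a small cluster with one outlier. The data below are hypothetical; the medoid is taken as the object with the smallest sum of distances to all other objects in the cluster.

```r
# Hypothetical cluster: four close points and one outlier (last row)
cl <- rbind(c(1, 1), c(2, 1), c(1, 2), c(2, 2), c(20, 20))

# Centroid: mean of the coordinates
centroid <- colMeans(cl)

# Medoid: the object with the smallest total distance to all others
dm <- as.matrix(dist(cl))                 # pairwise Euclidean distances
medoid_index <- which.min(rowSums(dm))
medoid <- cl[medoid_index, ]

print(centroid)   # pulled towards the outlier
print(medoid)     # an actual object from the dense part of the cluster
```

The centroid lands at a position where no object lies, while the medoid stays inside the dense group, which is why k-medoids is less affected by outliers.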

Algorithm (5 steps):

1. Select k representative objects (medoids).
2. Assign each of the remaining (non-medoid) objects to the most similar cluster and compute TD_current.
3. For each pair (medoid M, non-medoid N) compute the value TD_N↔M.
4. Select the non-medoid N for which TD_N↔M is minimal.
   • If TD_N↔M is smaller than TD_current:
     • swap N with M,
     • set TD_current := TD_N↔M,
     • go back to step 2.
5. Stop.
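The swap loop described in these steps can be sketched in base R as follows. The names `td` and `simple_pam` are hypothetical, the initial medoids are simply the first k objects (an assumption), and every candidate swap is evaluated by recomputing TD from scratch, which is simple but slow. The `pam()` function from the `cluster` package is the practical choice.

```r
# TD for a set of medoids: every object contributes its distance
# to the nearest medoid (dm is a full n x n distance matrix)
td <- function(dm, medoids) sum(apply(dm[, medoids, drop = FALSE], 1, min))

simple_pam <- function(dm, k) {
  n <- nrow(dm)
  medoids <- seq_len(k)                # step 1: arbitrary initial medoids (assumption)
  td_current <- td(dm, medoids)        # step 2
  repeat {
    best <- list(td = Inf)
    for (i in seq_len(k)) {            # step 3: every (medoid, non-medoid) pair
      for (cand in setdiff(seq_len(n), medoids)) {
        trial <- medoids
        trial[i] <- cand
        td_trial <- td(dm, trial)
        if (td_trial < best$td) best <- list(td = td_trial, medoids = trial)
      }
    }
    if (best$td >= td_current) break   # step 5: no swap improves TD
    medoids <- best$medoids            # step 4: perform the best swap
    td_current <- best$td
  }
  cluster <- apply(dm[, medoids, drop = FALSE], 1, which.min)
  list(medoids = medoids, cluster = cluster, td = td_current)
}

# Hypothetical toy data: two groups plus a mild outlier
x <- rbind(c(1, 1), c(2, 1), c(1, 2), c(9, 9), c(8, 9), c(9, 8), c(12, 12))
res <- simple_pam(as.matrix(dist(x)), k = 2)
print(res$medoids)
print(res$td)
```

Each pass performs only the single best swap, so TD decreases strictly and the loop must terminate.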

A typical run of PAM (the k-medoids method), K = 2:

1. Arbitrarily choose K objects as the initial medoids.
2. Assign each of the remaining objects to the nearest medoid (total cost = 20).
3. Randomly select one non-medoid object, O_random.
4. Compute the cost of the swap (total cost = 26).
5. Swap the current medoid with O_random if this improves the quality of the groups.
6. Repeat as long as there are any changes.

Evaluation of the k-medoids method

Advantages:

• handles outliers (distant, isolated objects) well,

• the initial choice does not affect the results,

• robust to noise in the data.

Disadvantages:

• does not cope well with large data sets,

• execution is costly for large numbers of objects n and clusters k.

Comparison of the two methods:

• The k-medoids algorithm is more robust to noise and outliers.

• The k-means algorithm is cheaper (more efficient) in terms of processing time.

• k-means is too sensitive to outliers, which can distort the results.

• Therefore, instead of the mean value, the most central object is taken as the reference point (the medoid).


• Rattle

• R

• TraceIs

• MS Excel

• Step 1: Loading the data

K=3

K=2

K=3

plot.Mclust(x, data, dimens = c(1, 2), scale = FALSE, ...)

Parameters:

• x: the result.

• data: the input data.

• dimens: the dimensions of the data space.

• scale: TRUE or FALSE. The default, FALSE, means that the selected dimensions are not to be shown on the same scale.

Value:

• The available options include the BIC value for the chosen number of clusters. If the data have more than 2 dimensions, the mixtures are shown as pairs of coordinates, for all combinations.

hc(modelName, data, ...)

Agglomerative hierarchical clustering based on maximum likelihood criteria for MVN mixture models parameterized by eigenvalue decomposition.

• modelName: the type of model:

"E": equal variance (one-dimensional)

"V": spherical, variable variance (one-dimensional)

"EII": spherical, equal volume

"VII": spherical, unequal volume

"EEE": ellipsoidal, equal volume, shape, and orientation

"VVV": ellipsoidal, varying volume, shape, and orientation

• data: the data (must be quantitative)

EMclust(data, G, emModelNames, hcPairs, subset, eps, tol, itmax, equalPro, warnSingular, ...)

BIC for EM initialized by hierarchical clustering for parameterized Gaussian mixture models.

• G: the numbers of mixture components for which BIC is computed.

• emModelNames:

"E" for spherical, equal variance (one-dimensional)

"V" for spherical, variable variance (one-dimensional)

"EII": spherical, equal volume

"VII": spherical, unequal volume

"EEI": diagonal, equal volume, equal shape

"VEI": diagonal, varying volume, equal shape

"EVI": diagonal, equal volume, varying shape

"VVI": diagonal, varying volume, varying shape

"EEE": ellipsoidal, equal volume, shape, and orientation

"EEV": ellipsoidal, equal volume and equal shape

"VEV": ellipsoidal, equal shape

"VVV": ellipsoidal, varying volume, shape, and orientation

• hcPairs: A matrix of merge pairs for hierarchical clustering such as produced by function hc.

• subset: A logical or numeric vector specifying the indices of a subset of the data to be used in the initial hierarchical clustering phase.

• eps: A scalar tolerance for deciding when to terminate computations due to computational singularity in covariances. Smaller values of eps allow computations to proceed nearer to singularity. The default is .Mclust$eps.

• tol: A scalar tolerance for relative convergence of the loglikelihood. The default is .Mclust$tol.

• itmax: An integer limit on the number of EM iterations. The default is .Mclust$itmax.

• equalPro: Logical variable indicating whether or not the mixing proportions are equal in the model. The default is .Mclust$equalPro.

• warnSingular: A logical value indicating whether or not a warning should be issued whenever a singularity is encountered. The default is warnSingular=FALSE.

• ...: Provided to allow lists with elements other than the arguments to be passed in indirect or list calls with do.call.

agnes(x, diss = inherits(x, "dist"), metric = "euclidean", stand = FALSE, method = "average", par.method, keep.diss = n < 100, keep.data = !diss)

• x: data matrix or data frame, or dissimilarity matrix, depending on the value of the diss argument.

• diss: logical flag: if TRUE (default for dist or dissimilarity objects), then x is assumed to be a dissimilarity matrix. If FALSE, then x is treated as a matrix of observations by variables.

• metric: character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are "euclidean" and "manhattan". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. If x is already a dissimilarity matrix, then this argument will be ignored.

• stand: logical flag: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If x is already a dissimilarity matrix, then this argument will be ignored.

• method: character string defining the clustering method. The six methods implemented are "average" ([unweighted pair-]group average method, UPGMA), "single" (single linkage), "complete" (complete linkage), "ward" (Ward's method), "weighted" (weighted average linkage) and its generalization "flexible" which uses (a constant version of) the Lance-Williams formula and the par.method argument. Default is "average".

clara(x, k, metric = "euclidean", stand = FALSE, samples = 5, sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE, keep.data = medoids.x, rngR = FALSE)

• stand: logical, indicating if the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.

• samples: integer, number of samples to be drawn from the dataset.

• sampsize: integer, number of observations in each sample. sampsize should be higher than the number of clusters (k) and at most the number of observations (n = nrow(x)).

• trace: integer indicating a trace level for diagnostic output during the algorithm.

• medoids.x: logical indicating if the medoids should be returned, identically to some rows of the input data x. If FALSE, keep.data must be false as well, and the medoid indices, i.e., row numbers of the medoids will still be returned (i.med component), and the algorithm saves space by needing one copy less of x.

• keep.data: logical indicating if the (scaled if stand is true) data should be kept in the result. Setting this to FALSE saves memory (and hence time), but disables clusplot()ing of the result. Use medoids.x = FALSE to save even more memory.

• rngR: logical indicating if R's random number generator should be used instead of the primitive clara()-builtin one. If true, this also means that each call to clara() returns a different result, though only slightly different in good situations.

diana(x, diss = inherits(x, "dist"), metric = "euclidean", stand = FALSE, keep.diss = n < 100, keep.data = !diss)

• diss: logical flag: if TRUE (default for dist or dissimilarity objects), then x will be considered as a dissimilarity matrix. If FALSE, then x will be considered as a matrix of observations by variables.

• stand: logical; if true, the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If x is already a dissimilarity matrix, then this argument will be ignored.

• keep.diss, keep.data: logicals indicating if the dissimilarities and/or input data x should be kept in the result. Setting these to FALSE can give much smaller results and hence even save memory allocation time.

fanny(x, k, diss = inherits(x, "dist"), memb.exp = 2, metric = c("euclidean", "manhattan", "SqEuclidean"), stand = FALSE, iniMem.p = NULL, cluster.only = FALSE, keep.diss = !diss && !cluster.only && n < 100, keep.data = !diss && !cluster.only, maxit = 500, tol = 1e-15, trace.lev = 0)

• x: data matrix or data frame, or dissimilarity matrix, depending on the value of the diss argument.

• In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed.

• In case of a dissimilarity matrix, x is typically the output of daisy or dist. Also a vector of length n*(n-1)/2 is allowed (where n is the number of observations), and will be interpreted in the same way as the output of the above-mentioned functions. Missing values (NAs) are not allowed.

• k: integer giving the desired number of clusters. It is required that 0 < k < n/2 where n is the number of observations.

• diss: logical flag: if TRUE (default for dist or dissimilarity objects), then x is assumed to be a dissimilarity matrix. If FALSE, then x is treated as a matrix of observations by variables.

• memb.exp: number r strictly larger than 1 specifying the membership exponent used in the fit criterion; see the "Details" below. Default: 2 which used to be hardwired inside FANNY.

• metric: character string specifying the metric to be used for calculating dissimilarities between observations. Options are "euclidean" (default), "manhattan", and "SqEuclidean". Euclidean distances are root sum-of-squares of differences, manhattan distances are the sum of absolute differences, and "SqEuclidean", the squared euclidean distances, are sum-of-squares of differences. Using this last option is equivalent (but somewhat slower) to computing so called "fuzzy C-means".

• iniMem.p: numeric n * k matrix or NULL (by default); can be used to specify a starting membership matrix.

• maxit, tol: maximal number of iterations and default tolerance for convergence (relative convergence of the fit criterion) for the FANNY algorithm.

pam(x, k, diss = inherits(x, "dist"), metric = "euclidean", medoids = NULL, stand = FALSE, cluster.only = FALSE, do.swap = TRUE, keep.diss = !diss && !cluster.only && n < 100, keep.data = !diss && !cluster.only, trace.lev = 0)

• do.swap: logical indicating if the swap phase should happen. The default, TRUE, corresponds to the original algorithm. On the other hand, the swap phase is much more computer intensive than the build one for large n, so can be skipped by do.swap = FALSE.

• keep.diss, keep.data: logicals indicating if the dissimilarities and/or input data x should be kept in the result. Setting these to FALSE can give much smaller results and hence even save memory allocation time.

• trace.lev: integer specifying a trace level for printing diagnostics during the build and swap phase of the algorithm. Default 0 does not print anything; higher values print increasingly more.
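A short usage sketch for two of the `cluster` functions documented above, `pam()` and `agnes()`. The built-in `iris` data and the choice of k = 3 are assumptions made only for illustration; standardizing with `scale()` is likewise just one option (the functions' own `stand` argument uses the mean absolute deviation instead).

```r
library(cluster)   # recommended package shipped with standard R distributions

x <- scale(iris[, 1:4])            # built-in data, standardized for illustration

# k-medoids (PAM) with k = 3 clusters
pam_fit <- pam(x, k = 3, metric = "euclidean")
print(pam_fit$id.med)              # row numbers of the medoids
print(table(pam_fit$clustering))   # cluster sizes

# Agglomerative hierarchical clustering with average linkage (UPGMA)
ag_fit <- agnes(x, metric = "euclidean", method = "average")
ag_groups <- cutree(as.hclust(ag_fit), k = 3)
print(table(ag_groups))
```

Converting the `agnes` result with `as.hclust()` makes it usable with base-R tools such as `cutree()` and `plot()`.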

kmeans

# Determine number of clusters
# (mydata is assumed to be a numeric data frame or matrix that is already loaded)
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution

# get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean)

# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)


The function cluster.stats() in the fpc package provides a mechanism for comparing the similarity of two cluster solutions using a variety of validation criteria (Hubert's gamma coefficient, the Dunn index and the corrected Rand index).

# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)

where d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer vectors containing classification results from two different clusterings of the same data.

> fit <- kmeans(dane,3)
> fit2 <- kmeans(dane,4)
> cluster.stats(d, fit$cluster, fit2$cluster)
$n
[1] 150

$cluster.number
[1] 3

$cluster.size
[1] 62 50 38

$diameter
[1] 2.677686 2.428992 2.418677

$average.distance
[1] 1.033869 0.698122 1.022906

$median.distance
[1] 0.9746794 0.6164414 0.9273618

$separation
[1] 0.2645751 1.6401219 0.2645751

$average.toother
[1] 2.797715 4.060413 3.343079

$separation.matrix
          [,1]     [,2]      [,3]
[1,] 0.0000000 1.640122 0.2645751
[2,] 1.6401219 0.000000 3.6891733
[3,] 0.2645751 3.689173 0.0000000

$average.between
[1] 3.384621

$average.within
[1] 0.9241552

$n.between
[1] 7356

$n.within
[1] 3819

$within.cluster.ss
[1] 78.94084

$clus.avg.silwidths
        1         2         3
0.4171823 0.7976299 0.4511051

$avg.silwidth
[1] 0.5525919

$g2
NULL

$g3
NULL

$pearsongamma
[1] 0.7144752

$dunn
[1] 0.0988074

$entropy
[1] 1.079224

$wb.ratio
[1] 0.2730454

$ch
[1] 560.3999

$corrected.rand
[1] 0.7150246

$vi
[1] 0.5077792


TP (true positive): the number of observations classified as positive (1) that are actually positive (1)

TN (true negative): the number of observations classified as negative (0) that are actually negative (0)

FP (false positive): the number of observations incorrectly classified as positive (1) that are actually negative (0)

FN (false negative): the number of observations incorrectly classified as negative (0) that are actually positive (1)

Sensitivity (true positive rate; classification accuracy on the positive class): TP / (TP + FN)

Specificity (true negative rate): TN / (TN + FP)
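The four counts and the two rates can be computed directly from vectors of actual and predicted labels. The label vectors below are hypothetical; sensitivity is taken as TP / (TP + FN), specificity as TN / (TN + FP), and overall accuracy as the share of correctly classified observations.

```r
# Hypothetical labels: 1 = positive, 0 = negative
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

# Confusion-matrix counts
TP <- sum(predicted == 1 & actual == 1)
TN <- sum(predicted == 0 & actual == 0)
FP <- sum(predicted == 1 & actual == 0)
FN <- sum(predicted == 0 & actual == 1)

sensitivity <- TP / (TP + FN)                    # true positive rate
specificity <- TN / (TN + FP)                    # true negative rate
accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # share of correct decisions

print(c(TP = TP, TN = TN, FP = FP, FN = FN))
print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```

The same counts can be read off `table(predicted, actual)`, which is convenient when there are more than two classes.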

Model C is the most accurate