Learning Systems, lab 2

Introduction to RapidMiner, part 2. Michał Bereta

www.michalbereta.pl

1. Using ROC curves to compare classifier models

Import the data file „pima-indians-diabetes.csv” (a database devoted to diagnosing diabetes). This is a two-class classification problem.

When importing the data, make sure that the first row is not used as attribute names (select „-”):

Also make sure that all attributes are of type „real” …

We will use the „Compare ROCs” operator:

This is a nested process, and we can add subprocesses to it, e.g. the training of a naive Bayes classifier and of a decision tree:

What are the parameters of „Compare ROCs”?

From the documentation:

„The comparison is based on the average values of a k-fold cross validation. Please study the documentation of the X-Validation operator for more information about cross validation. Alternatively, this operator can use an internal split into a test and a training set from the given data set in this case the operator behaves like the Split Validation operator. Please note that any former predicted label of the given ExampleSet will be removed during the application of this operator.”
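The k-fold scheme mentioned in the quote can be sketched in a few lines of Python. This is only an illustration, not RapidMiner's implementation: the function name `k_fold_indices` is hypothetical, and the folds are assigned by simple interleaving, whereas a real tool would typically shuffle or stratify the examples first.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_ids, test_ids) pairs for a k-fold cross validation.

    Assumption: folds are built by interleaving indices (0, k, 2k, ...),
    without shuffling or stratification.
    """
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    folds = [list(range(start, n_examples, k)) for start in range(k)]
    for i, test_ids in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_ids = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_ids, test_ids


for train_ids, test_ids in k_fold_indices(10, 5):
    print("train:", train_ids, "test:", test_ids)
```

Each example is used for testing exactly once, and the performance values from the k runs are then averaged.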

What do the optimistic, pessimistic and neutral approaches mean? From the documentation:

„ROC curves are calculated by first ordering the classified examples by confidence. Afterwards all the examples are taken into account with decreasing confidence to plot the false positive rate on the x-axis and the true positive rate on the y-axis. With optimistic, neutral and pessimistic there are three possibilities to calculate ROC curves. If there is more than one example for a confidence with optimistic ROC calculation the correct classified examples are taken into account before looking at the false classification. With pessimistic calculation it is the other way round: wrong classifications are taken into account before looking at correct classifications. Neutral calculation is a mix of both calculation methods described above. Here correct and false classifications are taken into account alternately. If there are no examples with equal confidence or all examples with equal confidence are assigned to the same class the optimistic, neutral and pessimistic ROC curves will be the same.”
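The effect of the three tie-handling modes can be illustrated with a small Python sketch. It is a simplified reading of the quoted description, not RapidMiner's code: the names `roc_points` and `area_under` are hypothetical, each example is assumed to be a pair (confidence for the positive class, whether the example is truly positive), and "correct" is taken to mean a truly positive example.

```python
def roc_points(examples, mode="neutral"):
    """Return the ROC curve as a list of (false positive rate, true positive rate).

    examples: list of (confidence, is_positive) pairs.
    mode decides the order of examples that share the same confidence:
    "optimistic" takes positives first, "pessimistic" takes negatives first,
    "neutral" alternates between the two.
    """
    positives = sum(1 for _, is_pos in examples if is_pos)
    negatives = len(examples) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    # Group the true labels by confidence value.
    groups = {}
    for conf, is_pos in examples:
        groups.setdefault(conf, []).append(bool(is_pos))

    ordered = []
    for conf in sorted(groups, reverse=True):  # decreasing confidence
        n_pos = sum(groups[conf])
        pos = [True] * n_pos
        neg = [False] * (len(groups[conf]) - n_pos)
        if mode == "optimistic":
            ordered += pos + neg
        elif mode == "pessimistic":
            ordered += neg + pos
        elif mode == "neutral":
            while pos or neg:
                if pos:
                    ordered.append(pos.pop())
                if neg:
                    ordered.append(neg.pop())
        else:
            raise ValueError("unknown mode: " + mode)

    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_pos in ordered:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under(points):
    """Area under a curve given as (x, y) points, using the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Two examples share the confidence 0.5 but belong to different classes.
data = [(0.9, True), (0.5, True), (0.5, False), (0.1, False)]
for mode in ("optimistic", "neutral", "pessimistic"):
    curve = roc_points(data, mode)
    print(mode, curve, area_under(curve))
```

With tied confidences and mixed classes, the optimistic curve lies on or above the pessimistic one, so the area under it is at least as large. When no confidences are tied, all three modes give the same curve.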

Note that ROC analysis (in its original form) is possible only for two-class problems. An attempt to use the „Iris” dataset, which has three classes, will fail:

To check the AUC (Area Under Curve) value, you can use the already familiar „Performance” operator. From the documentation:

We obtain:

Task:

2. Using the bootstrap method to estimate model error

RapidMiner provides the „Sample (Bootstrapping)” operator, which can generate a pseudo-sample from the available set of examples.

For example:

we get:

Note that the pseudo-sample can be larger than the original set of available examples (duplicates are included, since sampling is done with replacement).

We are, however, more interested in the „Bootstrapping Validation” operator. From the documentation:
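Sampling with replacement is easy to reproduce outside RapidMiner. The sketch below uses only the Python standard library; the function name `bootstrap_sample` and its `sample_ratio` and `seed` parameters are hypothetical and merely mimic the idea of the operator, not its actual interface.

```python
import random


def bootstrap_sample(examples, sample_ratio=1.0, seed=None):
    """Draw a pseudo-sample with replacement.

    The result has round(len(examples) * sample_ratio) elements, so a ratio
    above 1 gives a sample larger than the original set.
    """
    rng = random.Random(seed)
    size = round(len(examples) * sample_ratio)
    # random.choices samples with replacement: every draw sees all examples.
    return rng.choices(examples, k=size)


original = list(range(10))
sample = bootstrap_sample(original, sample_ratio=1.5, seed=1)
print(len(original), len(sample))
print(sorted(sample))
```

Because the sample here has 15 elements drawn from only 10 distinct examples, at least some examples must appear more than once.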

„This operator performs validation after bootstrapping a sampling of training data set in order to estimate the statistical performance of a learning operator (usually on unseen data sets). It is mainly used to estimate how accurately a model (learnt by a particular learning operator) will perform in practice.”

From the documentation:

„Bootstrapping sampling is sampling with replacement. In sampling with replacement, at every step all examples have equal probability of being selected. Once an example has been selected for the sample, it remains candidate for selection and it can be selected again in any other coming steps. Thus a sample with replacement can have the same example multiple number of times. More importantly, a sample with replacement can be used to generate a sample that is greater in size than the original ExampleSet.”

Note what the „sample ratio” parameter means: „sample ratio

This parameter specifies the relative size of the training set. In other validation schemes this parameter should be between 1 and 0, where 1 means that the entire ExampleSet will be used as training set. In this operator its value can be greater than 1 because bootstrapping sampling can generate an ExampleSet with a number of examples greater than the original ExampleSet. All examples that are not selected for the training set are automatically selected for the test set.”
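The split described in the quote can be sketched as follows. This is an illustration under assumptions, not the operator's implementation: `bootstrap_split` is a hypothetical name, and examples are identified by integer IDs 0..n-1.

```python
import random


def bootstrap_split(n_examples, sample_ratio=1.0, seed=None):
    """Return (train_ids, test_ids) for one bootstrap validation iteration.

    train_ids are drawn with replacement and may contain duplicates;
    test_ids are all IDs that were never drawn, so they contain none.
    """
    rng = random.Random(seed)
    train_size = round(n_examples * sample_ratio)
    train_ids = [rng.randrange(n_examples) for _ in range(train_size)]
    selected = set(train_ids)
    test_ids = [i for i in range(n_examples) if i not in selected]
    return train_ids, test_ids


train_ids, test_ids = bootstrap_split(20, sample_ratio=1.0, seed=0)
print("train:", sorted(train_ids))
print("test: ", test_ids)
```

With a sample ratio of 1, each example is left out of the training sample with probability (1 - 1/n)^n, which approaches 1/e (about 37%) for large n, so roughly a third of the examples usually end up in the test set. A larger sample ratio leaves fewer examples for testing.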

The generated ID lets you verify that examples in the pseudo-sample are indeed repeated (but not in the test part of the available data). To see this, set „break point before” at the training stage:

and at the testing stage:

Test data (no repetitions):

Result:

The values above are interpreted similarly to, e.g., the „accuracy” obtained from cross-validation. Question: which gives a better estimate of the expected prediction error, cross-validation or the bootstrap method?

Task:

1. For the „Pima” data, prepare a split into training and test data (as was done on the mid-term test) in a proportion of, e.g., 60% / 40%.

2. Choose a specific classifier model, e.g. a decision tree.

3. Using the training data, estimate the expected error level on the test data with „Split Validation”, cross-validation, and the bootstrap method.

4. Train the classifier on the entire training set, generate predictions for the test data, and check the actual error level. Which error estimation method came closest?