INTRODUCTION TO DATA SCIENCE
WFAiS UJ, Informatyka Stosowana I stopień studiów
10/11, 17/11, 24/11/2020
This lecture is based on a course by E. Fox and C. Guestrin, University of Washington.
What is a classification?
Overview of the content
Linear classifier
An intelligent restaurant review system
Reviews
Classifying sentiment of review
Classifier
A (linear) classifier
Scoring a sentence
Score(xi) = 1.2 + 1.7 - 2.1 = 0.8 > 0  =>  y = +1 (positive review)
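The scoring step above can be sketched in a few lines of Python. This is only an illustration: the words and their pairing with the coefficients 1.2, 1.7 and -2.1 are assumptions chosen so that the sum reproduces the arithmetic on the slide, and the review sentence is hypothetical.

```python
# Hypothetical word coefficients; only the three numbers come from the slide.
coefficients = {"awesome": 1.2, "great": 1.7, "awful": -2.1}

def score(sentence, coefficients):
    """Sum the coefficient of every known word occurrence in the sentence."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return sum(coefficients.get(word, 0.0) for word in words)

def predict(sentence, coefficients):
    """Return +1 when the score is positive, -1 otherwise."""
    return 1 if score(sentence, coefficients) > 0 else -1

# Hypothetical review containing each of the three words once.
review = "The sushi was great, the service awesome, but the decor was awful."
print(round(score(review, coefficients), 2))
print(predict(review, coefficients))
```

Words that have no coefficient contribute 0 to the score, so they do not influence the prediction.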
Simple linear classifier
Training a classifier = Learning the coefficients
We will discuss later how to learn a classifier from data.
Decision boundary example
Decision boundary
Flow chart:
Coefficients of classifier
General notation
Simple hyperplane
D-dimensional hyperplane
Flow chart:
Linear classifier
Class probability
How confident is your prediction?
Basics of probabilities
Interpreting probabilities as degrees of belief
Conditional probability
Interpreting conditional probabilities
How confident is your prediction?
Learn conditional probabilities from data
Predicting class probabilities
Flow chart:
Thus far we focused on decision boundaries
How to relate
Interpreting Score(xi)
Why not just use regression to build classifier?
Link function
Flow chart:
Logistic regression classifier: linear score with logistic link function
Simplest link function: sign(z)
Logistic function (sigmoid, the inverse of the logit)
[Figure: logistic function; marked values: 0.0, 0.12, 0.5, 0.88, 1.0]
Logistic regression model
Understanding the logistic regression model
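Score(xi) = -2  ->  probability 0.12
Score(xi) =  0  ->  probability 0.5
Score(xi) =  2  ->  probability 0.88
Score(xi) =  4  ->  probability 0.98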
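The listed values can be checked with a short Python sketch of the logistic function, sigmoid(z) = 1 / (1 + exp(-z)). The function name `sigmoid` is our own choice for this illustration.

```python
import math

def sigmoid(score):
    """Logistic link: maps a score in (-inf, +inf) to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Scores taken from the slide; the rounded outputs match the listed values.
for s in (-2, 0, 2, 4):
    print(s, round(sigmoid(s), 2))
```

A score of 0 lies exactly on the decision boundary, which is why it maps to 0.5.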
Logistic regression
Score(xi) < 0
Score(xi) >0
Effect of coefficients
Flow chart:
Learning logistic regression model
Categorical inputs
Encoding categories as numeric features
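One common way to encode a categorical input as numeric features is one-hot encoding: one 0/1 feature per category. The sketch below assumes this encoding; the category list is hypothetical and is not taken from the lecture.

```python
def one_hot(value, categories):
    """Return one 0/1 feature per category; exactly one entry is 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Hypothetical categorical input.
categories = ["Argentina", "Brazil", "Zimbabwe"]
print(one_hot("Brazil", categories))
```

Each resulting feature gets its own coefficient in the linear score, so the classifier can learn a separate effect for every category.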
Multiclass classification
Multiclass classification
1 versus all
1 versus all
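A minimal sketch of the 1-versus-all prediction step, assuming one logistic regression model has already been trained per class (that class against all others). The class names and coefficient values below are hypothetical.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def predict_one_vs_all(features, weights_per_class):
    """Evaluate each per-class model and return the most probable class."""
    probabilities = {}
    for label, weights in weights_per_class.items():
        class_score = sum(w * h for w, h in zip(weights, features))
        probabilities[label] = sigmoid(class_score)
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities

# Hypothetical coefficients: [intercept, w1, w2] for each class-vs-rest model.
weights_per_class = {
    "A": [0.0, 2.0, -1.0],
    "B": [0.0, -1.0, 2.0],
    "C": [-1.0, -1.0, -1.0],
}
features = [1.0, 3.0, 1.0]  # the leading 1.0 is the constant feature
label, probabilities = predict_one_vs_all(features, weights_per_class)
print(label)
```

Note that the per-class probabilities come from separate models, so they need not sum to 1.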
Summary: Logistic regression classifier
What you can do now…
Linear classifier
Parameters learning
Learn a probabilistic classification model
A (linear) classifier
Logistic regression
Flow chart:
Learning problem
Finding best coefficients
Quality metric: probability of data
Maximizing likelihood (probability of data)
Maximum likelihood estimation (MLE)
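The quality metric can be illustrated with a small sketch that computes the data likelihood as the product of the probabilities the model assigns to the observed labels. The toy data set and the two candidate coefficient vectors are assumptions made for this example.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def prob_of_label(features, label, weights):
    """P(y = label | x, w) for a label in {+1, -1}."""
    p_positive = sigmoid(sum(w * h for w, h in zip(weights, features)))
    return p_positive if label == 1 else 1.0 - p_positive

def likelihood(data, weights):
    """Product over all data points of the probability of the observed label."""
    result = 1.0
    for features, label in data:
        result *= prob_of_label(features, label, weights)
    return result

# Hypothetical data: features are [constant, x], labels are +1 or -1.
data = [([1.0, 2.0], 1), ([1.0, -1.0], -1), ([1.0, 0.5], 1)]
print(likelihood(data, [0.0, 0.0]))  # every point gets probability 0.5
print(likelihood(data, [0.0, 1.0]))  # a better fit gives a higher likelihood
```

MLE picks the coefficients for which this number is largest.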
Learn logistic regression model with MLE
Flow chart:
Find "best" classifier
Find best classifier
Maximizing likelihood
Gradient ascent
Finding the max via hill climbing
Gradient ascent
Convergence criteria
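A minimal sketch of hill climbing in one dimension, assuming the common convergence criterion "stop when the magnitude of the derivative falls below a tolerance". The objective f(w) = -(w - 3)^2, the step size and the tolerance are hypothetical choices for illustration.

```python
def gradient_ascent(derivative, start, step_size=0.1, tolerance=1e-6,
                    max_iterations=10_000):
    """Move uphill along the derivative until it is (almost) zero."""
    w = start
    for _ in range(max_iterations):
        slope = derivative(w)
        if abs(slope) < tolerance:  # convergence criterion
            break
        w += step_size * slope
    return w

# f(w) = -(w - 3)^2 has its maximum at w = 3; its derivative is -2 (w - 3).
w_max = gradient_ascent(lambda w: -2.0 * (w - 3.0), start=0.0)
print(w_max)
```

In more than one dimension the same loop applies, with the derivative replaced by the gradient vector and the absolute value by its norm.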
Gradient ascent
Gradient ascent
Gradient ascent
The log trick, often used in ML…
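One practical reason for the log trick can be shown numerically: a product of many small probabilities underflows in floating point, while the sum of their logarithms stays well behaved. The list of 200 probabilities equal to 0.01 is a made-up example.

```python
import math

# Hypothetical per-example probabilities.
probabilities = [0.01] * 200

product = 1.0
for p in probabilities:
    product *= p  # 0.01**200 = 1e-400 is below the smallest positive float

log_sum = sum(math.log(p) for p in probabilities)

print(product)  # underflows to 0.0
print(log_sum)  # a finite negative number
```

Because the logarithm is monotonically increasing, the coefficients that maximize the log-likelihood also maximize the likelihood.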
Derivative for logistic regression
See the slides at the end of this lecture if you are interested in how it is derived.
Derivative for logistic regression
Derivative for logistic regression
Gradient ascent for logistic regression
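A sketch of gradient ascent for logistic regression in plain Python, using the standard log-likelihood derivative: for coefficient j, the sum over data points of h_j(xi) * (1[yi = +1] - P(y = +1 | xi, w)). The toy data set, the step size and the fixed number of iterations are assumptions made for this example.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def log_likelihood_gradient(data, weights):
    """Partial derivatives of the log-likelihood w.r.t. each coefficient."""
    gradient = [0.0] * len(weights)
    for features, label in data:
        p_positive = sigmoid(sum(w * h for w, h in zip(weights, features)))
        indicator = 1.0 if label == 1 else 0.0
        for j, h in enumerate(features):
            gradient[j] += h * (indicator - p_positive)
    return gradient

def fit(data, n_features, step_size=0.1, n_iterations=200):
    """Start from all-zero coefficients and repeatedly step uphill."""
    weights = [0.0] * n_features
    for _ in range(n_iterations):
        gradient = log_likelihood_gradient(data, weights)
        weights = [w + step_size * g for w, g in zip(weights, gradient)]
    return weights

# Hypothetical, not linearly separable data: features are [constant, x].
data = [([1.0, -2.0], -1), ([1.0, -1.0], -1), ([1.0, 0.5], -1),
        ([1.0, -0.5], 1), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
weights = fit(data, n_features=2)
print(weights)
```

The learned coefficient of x is positive here, so larger x values receive a higher predicted probability of the +1 class.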
Choosing the step size
Choosing the step size
Choosing the step size
Choosing the step size
Choosing the step size
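The effect of the step size can be illustrated on a hypothetical one-dimensional objective f(w) = -(w - 3)^2, whose maximum is at w = 3. The three step sizes below are arbitrary choices that show slow progress, fast convergence and divergence.

```python
def gradient_ascent(step_size, n_steps=20, start=0.0):
    """Run a fixed number of ascent steps on f(w) = -(w - 3)^2."""
    w = start
    for _ in range(n_steps):
        w += step_size * (-2.0 * (w - 3.0))  # derivative of f at w
    return w

for step_size in (0.01, 0.4, 1.1):
    print(step_size, gradient_ascent(step_size))
```

With step size 0.01 the iterate is still far from 3 after 20 steps, with 0.4 it is essentially at 3, and with 1.1 each step overshoots by more than the previous one, so the iterates move away from the maximum.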
Flow chart: final look at it
What you can do now
Linear classifier
Overfitting & regularization
Training a classifier = Learning the coefficients
Classification error & accuracy
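Classification error is the fraction of mistakes and accuracy is the fraction of correct predictions, so the two sum to 1. A minimal sketch with made-up labels:

```python
def classification_error(true_labels, predicted_labels):
    """Fraction of data points whose prediction differs from the true label."""
    mistakes = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return mistakes / len(true_labels)

def accuracy(true_labels, predicted_labels):
    """Fraction of correct predictions."""
    return 1.0 - classification_error(true_labels, predicted_labels)

# Hypothetical labels: two of the five predictions are wrong.
true_labels = [1, -1, 1, 1, -1]
predicted_labels = [1, 1, 1, -1, -1]
print(classification_error(true_labels, predicted_labels))
print(accuracy(true_labels, predicted_labels))
```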
Overfitting in classification
Decision boundary example
Overfitting in classification
Learned decision boundary
Overfitting in classification
Quadratic features (in 2d)
Overfitting in classification
Degree 6 features (in 2d)
Overfitting in classification
Degree 20 features (in 2d)
Overfitting in classification
Overfitting in logistic regression
Remember this probability interpretation
Effect of coefficients on logistic regression model
As the coefficients grow, the model becomes overconfident in its predictions
Learned probabilities
Quadratic features: learned probabilities
Overfitting → overconfident predictions
Quality metric → penalizing large coefficients
Desired total cost format
Maximum likelihood estimation (MLE)
Measure of fit = Data likelihood
!!!
Measure of magnitude of logistic regression coefficients
Consider specific total cost
Consider resulting objectives
Consider resulting objectives
Bias-variance tradeoff
Visualizing effect of regularisation
Visualizing effect of regularisation
Effect of regularisation
Visualizing effect of regularisation
Flow chart:
Let's now discuss finding the best L2-regularized linear classifier with gradient ascent.
Gradient ascent
Gradient of L2 regularized log-likelihood
Gradient of L2 regularized log-likelihood
Gradient of L2 regularized log-likelihood
Gradient ascent with L2 regularization
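A sketch of gradient ascent on the L2-regularized objective, assuming the total quality is the log-likelihood minus lambda times the sum of squared coefficients, so that each partial derivative gains the extra term -2 * lambda * w_j. For simplicity the intercept is regularized too, which is an assumption of this sketch; the toy data, step size and penalty values are hypothetical.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def fit(data, n_features, l2_penalty, step_size=0.1, n_iterations=500):
    """Gradient ascent on log-likelihood minus l2_penalty * sum(w_j ** 2)."""
    weights = [0.0] * n_features
    for _ in range(n_iterations):
        gradient = [0.0] * n_features
        for features, label in data:
            p_positive = sigmoid(sum(w * h for w, h in zip(weights, features)))
            indicator = 1.0 if label == 1 else 0.0
            for j, h in enumerate(features):
                gradient[j] += h * (indicator - p_positive)
        # The regularization term contributes -2 * l2_penalty * w_j.
        weights = [w + step_size * (g - 2.0 * l2_penalty * w)
                   for w, g in zip(weights, gradient)]
    return weights

# Hypothetical data: features are [constant, x], labels are +1 or -1.
data = [([1.0, -2.0], -1), ([1.0, -1.0], -1), ([1.0, 0.5], -1),
        ([1.0, -0.5], 1), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
plain = fit(data, n_features=2, l2_penalty=0.0)
regularized = fit(data, n_features=2, l2_penalty=1.0)
print(plain, regularized)
```

The regularized coefficients have a smaller magnitude, which corresponds to less extreme predicted probabilities.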
Logistic regression with L1 regularization
Sparse logistic regression
L1 regularised logistic regression
L1 regularised logistic regression
What you can do now…
Decision trees
What makes a loan risky?
Credit history explained
Income
Loan terms
Personal information
Intelligent application
Classifier: review type
Classifier: decision trees
Scoring a loan application
Scoring a loan application
Scoring a loan application
Decision tree model
Flow chart:
Learn decision tree from data
Learn decision tree from data
Quality metric: Classification error
Find the tree with lowest classification error
How do we find the best tree?
Simple (greedy) algorithm finds good tree
Greedy algorithm
Greedy algorithm
Greedy algorithm
Greedy algorithm
Greedy algorithm
Greedy decision tree learning
Feature split learning
Feature split learning
Compact notation
Decision stump: single level tree
Making predictions with a decision stump
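A minimal sketch of a decision stump, assuming each branch predicts the majority class of the training points that fall into it. The loan records, the feature name `credit` and the labels `safe` / `risky` are hypothetical.

```python
from collections import Counter, defaultdict

def learn_stump(rows, feature):
    """Map each value of the feature to the majority label in that branch."""
    labels_by_value = defaultdict(list)
    for features, label in rows:
        labels_by_value[features[feature]].append(label)
    return {value: Counter(labels).most_common(1)[0][0]
            for value, labels in labels_by_value.items()}

def predict_with_stump(stump, features, feature):
    """Follow the branch that matches the input's feature value."""
    return stump[features[feature]]

# Hypothetical training data: (features, label) pairs.
rows = [
    ({"credit": "excellent"}, "safe"), ({"credit": "excellent"}, "safe"),
    ({"credit": "fair"}, "safe"), ({"credit": "fair"}, "risky"),
    ({"credit": "fair"}, "safe"),
    ({"credit": "poor"}, "risky"), ({"credit": "poor"}, "risky"),
]
stump = learn_stump(rows, "credit")
print(predict_with_stump(stump, {"credit": "poor"}, "credit"))
```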
How do we select the best feature to split on?
How do we measure effectiveness of a split?
Calculating classification error
Classification error
Classification error
Choice 1 vs Choice 2
Feature split selection algorithm
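A sketch of split selection, assuming the rule "try a split on every feature, compute the classification error of the resulting stump with majority-class predictions, and keep the feature with the lowest error". The loan records and feature names are hypothetical.

```python
from collections import Counter

def split_error(rows, feature):
    """Classification error when each branch predicts its majority label."""
    mistakes = 0
    for value in {features[feature] for features, _ in rows}:
        labels = [label for features, label in rows
                  if features[feature] == value]
        majority_count = Counter(labels).most_common(1)[0][1]
        mistakes += len(labels) - majority_count
    return mistakes / len(rows)

def best_feature(rows, features):
    """Return the feature whose split gives the lowest classification error."""
    return min(features, key=lambda feature: split_error(rows, feature))

# Hypothetical training data: (features, label) pairs.
rows = [
    ({"credit": "excellent", "term": "3 yrs"}, "safe"),
    ({"credit": "excellent", "term": "5 yrs"}, "safe"),
    ({"credit": "fair", "term": "3 yrs"}, "safe"),
    ({"credit": "fair", "term": "5 yrs"}, "risky"),
    ({"credit": "poor", "term": "3 yrs"}, "risky"),
    ({"credit": "poor", "term": "5 yrs"}, "risky"),
    ({"credit": "fair", "term": "3 yrs"}, "safe"),
]
print(split_error(rows, "credit"), split_error(rows, "term"))
print(best_feature(rows, ["credit", "term"]))
```

On this data the split on `credit` makes 1 mistake out of 7 and the split on `term` makes 2, so `credit` is selected.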
Greedy decision tree learning algorithm
Recursive stump learning
Recursive stump learning
Simple greedy decision tree learning
Recursive algorithm
Stopping condition 1
Stopping condition 2
Greedy decision tree algorithm
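The recursion can be sketched as follows. The extracted text does not say what the two stopping conditions are, so this sketch assumes the usual pair: stop when all labels in a node agree, and stop when no features are left to split on. The data, feature names and dictionary-based tree layout are likewise hypothetical.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def split_error(rows, feature):
    """Classification error when each branch predicts its majority label."""
    mistakes = 0
    for value in {f[feature] for f, _ in rows}:
        labels = [label for f, label in rows if f[feature] == value]
        mistakes += len(labels) - Counter(labels).most_common(1)[0][1]
    return mistakes / len(rows)

def build_tree(rows, features):
    labels = [label for _, label in rows]
    # Assumed stopping condition 1: every label in this node agrees.
    # Assumed stopping condition 2: there are no features left to split on.
    if len(set(labels)) == 1 or not features:
        return {"leaf": majority(labels)}
    best = min(features, key=lambda f: split_error(rows, f))
    remaining = [f for f in features if f != best]
    children = {}
    for value in sorted({f[best] for f, _ in rows}):
        subset = [(f, label) for f, label in rows if f[best] == value]
        children[value] = build_tree(subset, remaining)  # recurse per branch
    return {"feature": best, "children": children, "default": majority(labels)}

def predict(tree, features):
    """Walk from the root to a leaf; fall back to the node majority if needed."""
    while "leaf" not in tree:
        child = tree["children"].get(features[tree["feature"]])
        if child is None:
            return tree["default"]
        tree = child
    return tree["leaf"]

# Hypothetical training data: (features, label) pairs.
rows = [
    ({"credit": "excellent", "term": "3 yrs"}, "safe"),
    ({"credit": "excellent", "term": "5 yrs"}, "safe"),
    ({"credit": "fair", "term": "3 yrs"}, "safe"),
    ({"credit": "fair", "term": "5 yrs"}, "risky"),
    ({"credit": "poor", "term": "3 yrs"}, "risky"),
    ({"credit": "poor", "term": "5 yrs"}, "risky"),
    ({"credit": "fair", "term": "3 yrs"}, "safe"),
]
tree = build_tree(rows, ["credit", "term"])
print(predict(tree, {"credit": "fair", "term": "5 yrs"}))
```

The algorithm is greedy because each split is chosen by looking only at the current node, without revisiting earlier choices.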
Predictions with decision trees
Predictions with decision trees
Predictions with decision tree
Multiclass prediction
Multiclass decision stump
Predicting probabilities with decision trees
How to use real-valued inputs
How to use real-valued inputs
Visualizing the threshold split
Visualizing the threshold split
Visualizing the threshold split
Visualizing the threshold split
Finding the best threshold split
Finding the best threshold split
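A sketch of a threshold search for one real-valued feature, assuming the common approach of testing only the midpoints between consecutive sorted values and keeping the threshold with the lowest classification error. The income values (in thousands) and labels are hypothetical.

```python
from collections import Counter

def side_mistakes(labels):
    """Mistakes made on one side when it predicts its majority label."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_threshold(values, labels):
    """Return (threshold, error) for the split value < threshold vs >= threshold."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for threshold in candidates:
        left = [l for v, l in zip(values, labels) if v < threshold]
        right = [l for v, l in zip(values, labels) if v >= threshold]
        error = (side_mistakes(left) + side_mistakes(right)) / len(labels)
        if best is None or error < best[1]:  # ties keep the first candidate
            best = (threshold, error)
    return best

# Hypothetical data: income in thousands and the observed loan outcome.
incomes = [20, 30, 45, 60, 80, 105]
outcomes = ["risky", "risky", "safe", "risky", "safe", "safe"]
threshold, error = best_threshold(incomes, outcomes)
print(threshold, error)
```

Only midpoints need to be tested because every threshold between two consecutive sorted values separates the training points in the same way.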
Decision trees vs logistic regression
Decision trees vs logistic regression
Decision trees vs logistic regression
Decision tree vs logistic regression
Decision tree vs logistic regression
Decision tree vs logistic regression
What you can do now
Overfitting in decision trees
Overfitting in decision tree
Overfitting in decision tree
Overfitting in decision tree
Overfitting in decision tree
Overfitting in decision tree
Overfitting in decision tree
Simplest tree is better
Simplest tree is better
Simplest tree is better
Simplest tree is better
Early stopping for learning decision trees
Early stopping condition 1
Early stopping condition 2