INTRODUCTION TO DATA SCIENCE
1
Regression for predictions
2
Simple regression
Multiple regression
Assessing performance
Ridge regression
Feature selection and lasso regression
Nearest neighbor and kernel regression
What is regression?
3
Case study
4
Data
5
Input vs output
• y is the quantity of interest
• assume y can be predicted from x
Model: assume functional relationship
6
„Essentially, all models are wrong, but some are useful.”
George Box, 1987.
Task 1:
7
Which model to fit?
Task 2:
8
For a given model f(x), estimate the function from data
How it works: baseline flow chart
9
10
SIMPLE LINEAR REGRESSION
Simple linear regression model
11
The cost of using a given line
12
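To make the „cost of using a given line” concrete: the cost is the residual sum of squares (RSS) of that line on the training data. A minimal NumPy sketch; the function name and the example numbers are invented for illustration:

import numpy as np

def rss_of_line(w0, w1, x, y):
    # residual sum of squares of the candidate line y_hat = w0 + w1 * x
    y_hat = w0 + w1 * x
    return np.sum((y - y_hat) ** 2)

# hypothetical data: square feet vs. sale price
x = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([250e3, 330e3, 410e3, 490e3])
print(rss_of_line(50e3, 150.0, x, y))  # cost of one candidate line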
Find the „best” line
13
Predicting size of house you can afford
14
Estimated parameters
Interpreting the coefficients
15
Interpreting the coefficients
16
The magnitude of the fitted parameters depends on the units of both the features and the observations.
ML algorithm: minimising the cost
17
Convex/concave function
18
Finding max/min analytically
19
Finding the max via hill climbing
20
The sign of the derivative tells me what to do: move left, move right, or stay where I am.
Finding the min via hill descent
21
Choosing the step size (step-size schedule)
22
Fixed vs. varying step size
A fixed step size works well for strongly convex functions
Convergence criteria
23
Moving to multiple dimensions
24
Gradient example
25
Contour plots
26
Gradient descent
27
Compute the gradient
28
Approach 1: set gradient to 0
29
This method is called the „closed-form solution”.
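A sketch of this closed-form solution for simple linear regression: setting both partial derivatives of the RSS to zero and solving gives the usual slope and intercept formulas (shown in NumPy; the function name is illustrative):

import numpy as np

def fit_simple_regression(x, y):
    # closed form: w1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), w0 = y_bar - w1 * x_bar
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1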
Approach 2: gradient descent
30
Approach 2: gradient descent
31
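A minimal gradient-descent sketch for the same simple regression problem; the step size, tolerance, and iteration cap are placeholders you would tune in practice:

import numpy as np

def fit_simple_regression_gd(x, y, step=1e-7, tol=1e-3, max_iter=100_000):
    w0, w1 = 0.0, 0.0                     # initialise the solution somewhere
    for _ in range(max_iter):
        err = y - (w0 + w1 * x)
        g0 = -2.0 * np.sum(err)           # dRSS/dw0
        g1 = -2.0 * np.sum(err * x)       # dRSS/dw1
        w0 -= step * g0
        w1 -= step * g1
        if np.hypot(g0, g1) < tol:        # stop when the gradient magnitude is small
            break
    return w0, w1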
Comparing the approaches
32
Symmetric cost function
33
Assumes the error of overestimating the sales price is the same as the error of underestimating it.
Asymmetric cost functions
34
We can weight positive and negative errors differently in the RSS calculation.
What you can do now
35
36
MULTIPLE REGRESSION
Multiple regression
37
Polynomial regression
38
Other functional forms of one input
39
Trends in time series
This trend can be modeled with a polynomial function.
Other functional forms of one input
40
Seasonality
Example of detrending
41
Example of detrending
42
Other examples of seasonality
43
Generic basis expansion
44
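A sketch of a generic basis expansion for one input, covering the polynomial trend and sinusoidal seasonality discussed above; the degree, period, and time index are illustrative choices:

import numpy as np

def polynomial_basis(t, degree):
    # h_j(t) = t**j for j = 0..degree; j = 0 gives the intercept column
    return np.column_stack([t ** j for j in range(degree + 1)])

def seasonal_basis(t, period=12.0):
    # sin/cos features capture seasonality with a known period (e.g. monthly data)
    return np.column_stack([np.sin(2 * np.pi * t / period),
                            np.cos(2 * np.pi * t / period)])

t = np.arange(36, dtype=float)                                # hypothetical time index
H = np.hstack([polynomial_basis(t, 2), seasonal_basis(t)])    # trend + seasonality features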
More realistic flow chart
45
Incorporating multiple inputs
46
Only one bathroom, not the same as my 3 bathrooms.
Incorporating multiple inputs
47
Many possible inputs
Reading your brain
48
Whole collection of inputs
General notation
49
Simple hyperplane
50
Noise term
More generally: D-dimensional curve
51
Interpreting coefficients
52
Interpreting coefficients
53
Interpreting coefficients
54
For fixed # sq.ft.!
But…
increasing #bathrooms for a fixed #sq.ft. will make your bedrooms smaller and smaller.
Think about the interpretation.
Interpreting coefficients
55
Can’t hold other features fixed?
Then … can’t interpret the coefficients.
Interpreting coefficients
56
But…
increasing #bedrooms for a fixed #sq.ft. will make your bedrooms smaller and smaller.
You can end up with a negative coefficient. That might not be the case if you removed #sq.ft. from the model.
Think about the interpretation.
Fitting in D dimensions
57
Now look at this block
Rewriting in vector notation
58
$y_i = \sum_{j} w_j\, h_j(\mathbf{x}_i) + \varepsilon_i = \mathbf{w}^{\top} h(\mathbf{x}_i) + \varepsilon_i$
Rewriting in matrix notation
59
Here is our
ML algorithm
Fitting in D dimensions
60
Now look at this block
Cost function in D dimensions
61
RSS in vector notation
Cost function in D dimensions
62
RSS in matrix notation
Regression model for D dimensions
63
RSS in matrix notation
Regression model for D dimensions
64
Gradient of RSS
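For reference, the RSS and its gradient in matrix notation, with H the feature matrix, w the coefficient vector, and y the vector of observations (the standard least-squares expressions these slides refer to):

$\mathrm{RSS}(\mathbf{w}) = (\mathbf{y}-\mathbf{H}\mathbf{w})^{\top}(\mathbf{y}-\mathbf{H}\mathbf{w}), \qquad \nabla \mathrm{RSS}(\mathbf{w}) = -2\,\mathbf{H}^{\top}(\mathbf{y}-\mathbf{H}\mathbf{w})$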
Regression model for D dimensions
65
Approach 1: set gradient to zero
Closed form solution
Closed-form solution
66
This matrix might not be invertible.
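A NumPy sketch of the closed-form solution ŵ = (HᵀH)⁻¹Hᵀy; using a least-squares solver instead of an explicit inverse is more stable and still behaves sensibly when HᵀH is nearly singular:

import numpy as np

def fit_closed_form(H, y):
    # solves the least-squares problem min_w ||y - Hw||^2 without forming the inverse explicitly
    w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w_hat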
Regression model for D dimensions
67
Approach 2: gradient descent
We initialise our solution somewhere
and then …
Gradient descent
68
Regression model for D dimensions
69
Interpreting elementwise
Summary of gradient descent
70
Extremely useful algorithm in several applications
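A sketch of gradient descent in matrix form; step size, tolerance, and iteration cap are illustrative:

import numpy as np

def fit_gradient_descent(H, y, step=1e-6, tol=1e-3, max_iter=100_000):
    w = np.zeros(H.shape[1])              # initialise the solution somewhere
    for _ in range(max_iter):
        grad = -2.0 * H.T @ (y - H @ w)   # gradient of RSS in matrix notation
        w = w - step * grad
        if np.linalg.norm(grad) < tol:    # converged: gradient magnitude below tolerance
            break
    return w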
What you can do now
71
72
ASSESSING PERFORMANCE
Assessing performance
73
Assessing performance
74
Measuring loss
75
Symmetric loss
functions
Assessing the loss
76
Use training data
Compute training error
77
Training error
78
The convention is to take the average here.
Training error
79
More intuitive is to take the RMSE, which has the same units as y.
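A sketch of the training error computed both ways, as the total RSS and as the more interpretable RMSE:

import numpy as np

def training_error(H_train, y_train, w_hat):
    residuals = y_train - H_train @ w_hat
    rss = np.sum(residuals ** 2)              # residual sum of squares on the training data
    rmse = np.sqrt(rss / len(y_train))        # root mean squared error, same units as y
    return rss, rmse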
Training error vs. model complexity
80
Decreases as you increase the model complexity. It is very intuitive why this is the case.
Is training error a good measure?
81
Is there something particularly wrong about having x_t square feet?
Generalisation (true) error
82
Distribution over houses
83
Popularity of a given
#sq.ft.
Generalisation error definition
84
Generalisation error (weighted with popularity) vs model complexity
85
Generalisation error vs model complexity
86
However, in contrast to the training error, in practice we cannot really compute the true generalisation error: we don’t have data on all possible houses in the area.
Forming a test set
87
We want to approximate the generalisation error.
Test set: a proxy for „everything you might see”
Compute test error
88
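A sketch of forming a random test split and computing the test error; the split fraction and seed are arbitrary choices:

import numpy as np

def train_test_split(H, y, test_fraction=0.2, seed=0):
    # hold out a random subset of houses as a proxy for "everything you might see"
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_fraction * len(y))
    return H[idx[n_test:]], y[idx[n_test:]], H[idx[:n_test]], y[idx[:n_test]]

# test error: mean squared error of the fitted model on the held-out houses
# test_mse = np.mean((y_test - H_test @ w_hat) ** 2)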
Training, true and test error vs. model complexity. Notion of overfitting.
89
Test error: a noisy version due to limited statistics.
Training/test splits
90
Three sources of errors
91
Data are inherently noisy
92
There is some true relationship between sq.ft. and the value of the house, specific to the given house.
We cannot reduce this error by choosing a better model.
Bias contribution
93
This contribution we can control.
Bias contribution
94
Bias contribution
95
Variance contribution
96
Variance contribution
97
Variance of high complexity models
98
For each training fit, remove a few random houses.
Bias of high complexity models
99
For each training fit, remove a few random houses.
Bias-variance tradeoff
100
MSE = mean squared error
Machine learning is all about this tradeoff.
But…
Errors vs amount of data
101
The regression/ML workflow
102
Hypothetical implementation
103
Hypothetical implementation
104
Hypothetical implementation
105
Practical implementation
106
Practical implementation
107
Typical splits
108
What you can do now
109
110
RIDGE REGRESSION
Flexibility of high-order polynomials
111
Overfitting with many features
112
How does # of observations influence overfitting?
113
How does # of inputs influence overfitting?
114
How does # of inputs influence overfitting?
115
Let’s improve the quality metric block
116
Desired total cost format
117
Measure of fit to training data
118
Measure of magnitude of regression coefficients
119
But … the coefficients
are very large
Consider specific total cost
120
Consider resulting objectives
121
Ridge regression: bias-variance tradeoff
122
Ridge regression: coefficients path
123
features scaled to unit norm
sweet spot
Flow chart
124
Ridge regression: cost in matrix notation
125
Gradient of ridge regression cost
126
Ridge regression: closed-form solution
127
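A sketch of the ridge closed-form solution ŵ = (HᵀH + λI)⁻¹Hᵀy; for simplicity this version penalises every coefficient, including the intercept (the later slides discuss how to handle the intercept differently):

import numpy as np

def fit_ridge(H, y, lam):
    # adding lam * I makes H^T H + lam * I invertible for any lam > 0
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)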
Ridge regression: gradient descent
128
Summary of ridge regression algorithm
129
How to choose λ
130
How to choose λ
131
How to choose λ
132
How to choose λ
133
How to choose λ
134
What value of K?
135
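A sketch of choosing λ by K-fold cross-validation, reusing the fit_ridge sketch above; the candidate grid, K, and seed are up to you:

import numpy as np

def choose_lambda_cv(H, y, lambdas, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    avg_errors = []
    for lam in lambdas:
        errs = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = fit_ridge(H[train], y[train], lam)                # fit on the other k-1 folds
            errs.append(np.mean((y[val] - H[val] @ w) ** 2))      # validation error on the held-out fold
        avg_errors.append(np.mean(errs))
    return lambdas[int(np.argmin(avg_errors))]                    # λ with the lowest average error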
How to handle the intercept
136
Recall multiple regression model
Do we penalize intercept?
137
Do we penalize intercept?
138
Option 1: don’t penalize intercept
Option 2: Center data first
What you can do now
139
140
FEATURE SELECTION
&
LASSO REGRESSION
Why feature selection?
141
Sparsity
142
Sparsity
143
Find best model of size: 0
144
Find best model of size: 1
145
Find best model of size: 2
146
Note: not necessarily nested!
Find best model of size: N
147
Which model complexity to choose?
Certainly not the one with the smallest training error!
Choosing model complexity
148
Complexity of „all subsets”
149
Greedy algorithm
150
Visualizing greedy algorithm
151
Visualizing greedy algorithm
152
Visualizing greedy algorithm
153
Notice… it is suboptimal.
Adding the next best feature; the fit is nested now.
Visualizing greedy algorithm
154
When do we stop?
155
Complexity of forward stepwise
156
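A sketch of the greedy forward-stepwise procedure: at each step, add the single feature that most reduces the training RSS (how many features to keep is then chosen on validation data, not by training error):

import numpy as np

def forward_stepwise(H, y, max_features):
    selected, remaining = [], list(range(H.shape[1]))
    while remaining and len(selected) < max_features:
        def rss_with(j):
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(H[:, cols], y, rcond=None)
            return np.sum((y - H[:, cols] @ w) ** 2)
        best = min(remaining, key=rss_with)   # feature giving the biggest drop in training RSS
        selected.append(best)
        remaining.remove(best)
    return selected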
Other greedy algorithms
157
Using regularisation for feature selection
158
Thresholding ridge coefficients?
159
Thresholding ridge coefficients?
160
Thresholding ridge coefficients?
161
Thresholding ridge coefficients?
162
Thresholding ridge coefficients?
163
Try this cost instead of ridge …
164
Lasso regression
165
Coefficient path: ridge
166
Coefficient path: lasso
167
Visualising ridge cost in 2D
168
Visualising ridge cost in 2D
169
Visualising ridge cost in 2D
170
Visualising lasso cost in 2D
171
Visualising lasso cost in 2D
172
Visualising lasso cost in 2D
173
We are getting a sparse solution: w_0 = 0.
How do we optimise the objective?
174
Optimise for lasso objective
175
Coordinate descent
176
Comments on coordinate descent
177
Normalizing features
178
Optimising least squares objective
179
One coordinate at a time
Optimising least squares objective
180
One coordinate at a time
Coordinate descent for least squares regression
181
How to assess convergence
182
Soft thresholding
183
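A sketch of the soft-thresholding update and the full lasso coordinate-descent loop, assuming features normalised to unit norm (as in the normalisation slide above); the λ/2 constants follow one common scaling of the lasso objective:

import numpy as np

def soft_threshold(rho, lam):
    if rho < -lam / 2.0:
        return rho + lam / 2.0
    if rho > lam / 2.0:
        return rho - lam / 2.0
    return 0.0                                          # coefficients in between are set exactly to zero

def lasso_coordinate_descent(H, y, lam, tol=1e-4):
    w = np.zeros(H.shape[1])
    while True:
        max_step = 0.0
        for j in range(H.shape[1]):
            old = w[j]
            partial_residual = y - H @ w + H[:, j] * w[j]   # prediction residual without feature j
            rho = H[:, j] @ partial_residual
            w[j] = soft_threshold(rho, lam)
            max_step = max(max_step, abs(w[j] - old))
        if max_step < tol:                              # no coordinate moved much in a full pass
            break
    return w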
Convergence criteria
184
Other lasso solvers
185
How do we choose λ
186
How do we choose λ
187
How do we choose λ
188
Impact of feature selection and lasso
189
What you can do now
190
191
NONPARAMETRIC
REGRESSION
Fit globally vs. fit locally
192
Parametric models
Below: f(x) is not really a polynomial function (constant, linear, and quadratic fits shown).
What alternative do we have?
193
Nearest Neighbor & Kernel Regression (nonparametric approach)
194
Simple implementation; flexibility increases as we have more data.
Fit locally to each data point
195
What people do naturally…
196
1-NN regression more formally
197
Transition point
Visualizing 1-NN in multiple dimensions
198
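A minimal sketch of 1-NN regression: predict the target of the single closest training point under Euclidean distance (other distance metrics are discussed next):

import numpy as np

def one_nn_predict(X_train, y_train, x_query):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every training point
    return y_train[np.argmin(dists)]                    # value of the nearest neighbour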
Distance metrics: Notion of „closest”
199
Weighting housing inputs
200
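A sketch of a weighted (scaled) Euclidean distance, where each housing input gets its own importance weight; the weights themselves are a modelling choice:

import numpy as np

def weighted_distance(x_a, x_b, weights):
    # e.g. weight #sq.ft. more heavily than #bathrooms when deciding which houses are "close"
    return np.sqrt(np.sum(weights * (x_a - x_b) ** 2))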