(1) INTRODUCTION TO DATA SCIENCE

WFAiS UJ, Informatyka Stosowana, first-cycle degree programme

12.11, 19.11 2019

This lecture is based on the course by E. Fox and C. Guestrin, University of Washington.

(2–4) What is retrieval?

(5) Retrieval applications

(6) What is clustering?

(7–8) Clustering applications

(9) Impact of retrieval & clustering

(10) Overview of the extended content

(11) Retrieval as k-nearest neighbor search

(12–15) 1-NN search for retrieval

(16–17) 1-NN algorithm

(18–19) k-NN algorithm
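The slides list 1-NN and k-NN search only by title, so here is a minimal, hedged brute-force sketch of the idea (not the lecture's own code); the array names, the Euclidean distance, and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_search(query, corpus, k=5):
    """Brute-force k-NN: return indices and distances of the k closest corpus vectors.

    query: (d,) vector, corpus: (N, d) matrix. Euclidean distance is assumed here;
    any metric from the "Distance metrics" slides could be substituted. k=1 gives 1-NN.
    """
    dists = np.linalg.norm(corpus - query, axis=1)   # distance from the query to every document
    order = np.argsort(dists)
    return order[:k], dists[order[:k]]

# toy usage on random vectors
corpus = np.random.rand(1000, 20)
query = np.random.rand(20)
idx, d = knn_search(query, corpus, k=3)
print(idx, d)
```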

(20) Critical elements of NN search

(21–24) Document representation
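The document-representation slides in the source course describe bag-of-words counts and TF-IDF weighting; the sketch below is an assumed minimal re-implementation of that idea. The whitespace tokenisation and the exact TF-IDF variant are illustrative choices, not necessarily the lecture's.

```python
import math
from collections import Counter

def word_counts(doc):
    """Bag-of-words representation: raw term counts for one document."""
    return Counter(doc.lower().split())

def tf_idf(docs):
    """TF-IDF weights: term frequency scaled by log(N / number of docs containing the term)."""
    counts = [word_counts(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())                      # document frequency of each word
    return [{w: tf * math.log(n_docs / df[w]) for w, tf in c.items()} for c in counts]

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(tf_idf(docs)[0])                           # common words like "the" get weight 0
```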

(25–42) Distance metrics
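The distance-metric slides survive here only as titles. As a hedged sketch, two metrics commonly used in this course family are (scaled) Euclidean distance and cosine distance; the per-feature weight vector `a` is an illustrative assumption.

```python
import numpy as np

def scaled_euclidean(x, y, a=None):
    """Scaled Euclidean distance: sqrt(sum_i a_i (x_i - y_i)^2); a = 1 gives plain Euclidean."""
    a = np.ones_like(x) if a is None else a
    return np.sqrt(np.sum(a * (x - y) ** 2))

def cosine_distance(x, y):
    """1 minus cosine similarity; insensitive to vector length, handy for TF-IDF documents."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
print(scaled_euclidean(x, y), cosine_distance(x, y))
```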

(43) Combining distance metrics

(44) Scaling up k-NN search by storing data in a KD-tree

(45) Complexity of brute-force search

(46–53) KD-trees

(54–65) Nearest neighbor with KD-trees
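The KD-tree and nearest-neighbor slides are titles only, so here is a hedged, compact sketch of the standard textbook version: a KD-tree built by median splits with alternating dimensions, plus a pruned 1-NN query. The leaf size, the split rule, and the pruning test are the usual defaults and may differ in detail from the lecture's figures.

```python
import numpy as np

class KDNode:
    """KD-tree node: alternate the split dimension with depth, split at the median value."""
    def __init__(self, points, depth=0, leaf_size=3):
        self.dim = depth % points.shape[1]
        if len(points) <= leaf_size:
            self.points, self.left, self.right = points, None, None   # leaf stores its points
            return
        points = points[np.argsort(points[:, self.dim])]
        mid = len(points) // 2
        self.split = points[mid, self.dim]
        self.points = None
        self.left = KDNode(points[:mid], depth + 1, leaf_size)
        self.right = KDNode(points[mid:], depth + 1, leaf_size)

def nn_search(node, query, best=(None, np.inf)):
    """1-NN query: descend to the query's side first, then visit the other side
    only if the splitting plane is closer than the best distance found so far."""
    if node.points is not None:                          # leaf: check every stored point
        for p in node.points:
            d = np.linalg.norm(p - query)
            if d < best[1]:
                best = (p, d)
        return best
    near, far = (node.left, node.right) if query[node.dim] < node.split else (node.right, node.left)
    best = nn_search(near, query, best)
    if abs(query[node.dim] - node.split) < best[1]:      # the far side may still hold a closer point
        best = nn_search(far, query, best)
    return best

data = np.random.rand(500, 2)
tree = KDNode(data)
point, dist = nn_search(tree, np.array([0.5, 0.5]))
```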

(66–67) Complexity for N queries

(68) k-NN with KD-trees

(69) Approximate k-NN with KD-trees

(70) Closing remarks on KD-trees

(71) KD-trees in high dimensions

(72) Moving away from exact NN search

(73) Locality Sensitive Hashing (LSH) as an alternative to KD-trees

(74–83) Locality sensitive hashing

(84–89) LSH: improving efficiency

(90) LSH recap

(91–92) LSH: moving to higher dimensions d

(93) What you can do now …
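The LSH slides in the source course use random hyperplanes ("random lines") for cosine-style similarity: each data point is hashed to a bin given by the signs of its dot products with a few random directions. The sketch below is an assumed minimal version; the number of hyperplanes, the bin encoding, and the single-table layout are illustrative choices.

```python
import numpy as np
from collections import defaultdict

def lsh_bins(data, n_planes=8, seed=0):
    """Hash each vector to a bin index given by the signs of dot products with random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(data.shape[1], n_planes))
    bits = (data @ planes >= 0).astype(int)            # one sign bit per hyperplane
    keys = bits @ (1 << np.arange(n_planes))           # pack the bits into an integer bin index
    table = defaultdict(list)
    for i, key in enumerate(keys):
        table[key].append(i)
    return table, planes

def lsh_query(query, table, planes):
    """Candidate neighbours: points falling into the query's bin (search nearby bins for more recall)."""
    bits = (query @ planes >= 0).astype(int)
    key = int(bits @ (1 << np.arange(planes.shape[1])))
    return table.get(key, [])

data = np.random.randn(1000, 20)
table, planes = lsh_bins(data)
candidates = lsh_query(data[0], table, planes)         # refine these with an exact distance check
```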

(94) Clustering: an unsupervised learning task

(95–97) Motivation
I don't just like sport!

(98–99) Clustering: a supervised learning
Example of supervised learning.

(100) Clustering: an unsupervised learning
An unsupervised learning task.

(101) What defines a cluster?

(102) Hope for unsupervised learning

(103) Other (challenging!) clusters to discover
Analysed by your eyes.

(104) Other (challenging!) clusters to discover
Analysed by clustering algorithms.

(105) k-means clustering algorithm

(106–110) k-means clustering algorithm

(111–112) k-means as a coordinate descent algorithm

(113) Convergence of k-means
Because we can cast k-means as a coordinate descent algorithm, we know that it converges to a local optimum.

(114–116) Convergence of k-means to a local mode
Crosses: initialised centers. On slide 116 the assignment of points to groups has changed, showing that k-means is very sensitive to the initialised centers.
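The k-means slides describe the usual alternation between assigning points and recomputing centres; here is a hedged, minimal implementation of that loop (Lloyd's algorithm), not the lecture's own code. Random initial centres are used deliberately, since their influence is exactly what the slides above and the k-means++ slides below discuss.

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Plain k-means: alternate cluster assignment and centre updates until the centres stop moving."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]   # random initial centres
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centre moves to the mean of its assigned points
        new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

data = np.vstack([np.random.randn(100, 2) + off for off in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans(data, k=3)
```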

(117) Smart initialisation: k-means++ overview

(118–121) k-means++ visualised

(122) Smart initialisation: k-means++ overview

(123) Assessing quality of the clustering

(124) k-means objective

(125) Cluster heterogeneity

(126) What happens to heterogeneity as k increases?

(127) How to choose k?

(128) What you can do now …
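k-means++ chooses each new centre with probability proportional to the squared distance from the centres already chosen, which spreads the initial centres out. The sketch below is a hedged minimal version of that initialisation; the random seed and toy data are illustrative.

```python
import numpy as np

def kmeans_pp_init(data, k, seed=0):
    """k-means++ initialisation: first centre uniform at random, each next centre
    sampled with probability proportional to the squared distance to its nearest
    already-chosen centre."""
    rng = np.random.default_rng(seed)
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.array(centers)

data = np.random.rand(500, 2)
init_centers = kmeans_pp_init(data, k=4)   # feed these into k-means instead of random centres
```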

(129) Probabilistic approach: mixture model

(130–133) Why a probabilistic approach?

(134) Mixture models

(135–139) Application: clustering images
Single RGB vector per image.

(140) Application: clustering images
We see that they are grouping, but it is not easy to distinguish between the groups.

(141) Application: clustering images
In this dimension the groups are separable!

(142–143) Model for a given image type

(144–147) Application: clustering images

(148–154) Mixture of Gaussians
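As a hedged illustration of the model named on these slides: a mixture of Gaussians has density p(x) = sum_k pi_k N(x | mu_k, Sigma_k), a weighted sum of Gaussian components. The parameters and the one-dimensional "image intensity" example below are made up for illustration, not taken from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, weights, means, covs):
    """Mixture-of-Gaussians density: sum of component Gaussians weighted by pi_k."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# two illustrative 1-D components (weights pi_k must sum to 1)
weights = [0.6, 0.4]
means   = [np.array([0.2]), np.array([0.8])]
covs    = [np.array([[0.01]]), np.array([[0.02]])]
print(mixture_pdf(np.array([0.5]), weights, means, covs))
```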

(155–161) Application: clustering documents

(162) Inferring soft assignments with expectation maximization (EM)

(163) Inferring cluster labels

(170) Part 1: Summary

(172) Then split into separate tables and consider them independently.

(176) Part 2a: Summary

(184) Part 2b: Summary

(185–197) Expectation maximization (EM)
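The EM slides are not recoverable beyond their titles, so as a hedged sketch here is a minimal EM loop for a Gaussian mixture: the E-step computes soft responsibilities, the M-step re-estimates the weights, means, and covariances from them. The initialisation, the fixed iteration count, and the small covariance regulariser are illustrative assumptions, not the lecture's choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(data, k, n_iter=50, seed=0):
    """EM for a Gaussian mixture: soft assignments instead of k-means' hard ones."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = np.full(k, 1.0 / k)
    means = data[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibility r[i, j] = P(cluster j | x_i) under the current parameters
        r = np.column_stack([w * multivariate_normal.pdf(data, mean=m, cov=c)
                             for w, m, c in zip(weights, means, covs)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ data) / nk[:, None]
        covs = np.array([((r[:, j, None] * (data - means[j])).T @ (data - means[j])) / nk[j]
                         + 1e-6 * np.eye(d) for j in range(k)])
    return weights, means, covs, r

data = np.vstack([np.random.randn(150, 2), np.random.randn(150, 2) + 4])
weights, means, covs, resp = em_gmm(data, k=2)
```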

(198) What you can do now …

(199) Hierarchical clustering

(200) Why hierarchical clustering?
