INTRODUCTION TO DATA SCIENCE

WFAiS UJ, Applied Computer Science, first-cycle (bachelor's) studies

1/12, 8/12, 15/12, 22/12/2020

This lecture is based on a course by E. Fox and C. Guestrin, University of Washington.

What is retrieval?

Retrieval applications

What is clustering?

Clustering applications

Impact of retrieval & clustering

Overview of content

Retrieval as k-nearest neighbor search

1-NN search for retrieval

1-NN algorithm

k-NN algorithm
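A minimal sketch of brute-force k-NN search, assuming each document is already represented as an equal-length numeric vector and that Euclidean distance is the chosen metric. The names `euclidean` and `knn_search` are illustrative and do not come from the lecture.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_search(query, corpus, k, distance=euclidean):
    """Return the k closest corpus items as (distance, index) pairs, nearest first.

    Every corpus item is scored against the query, so one query costs
    N distance computations for a corpus of N items.
    """
    scored = ((distance(query, doc), i) for i, doc in enumerate(corpus))
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    corpus = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
    print(knn_search((0.0, 0.0), corpus, k=2))
```

Setting `k=1` gives the 1-NN case. Any other distance function with the same two-argument signature can be passed in through `distance`.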

Critical elements of NN search

Document representation

Distance metrics

Combining distance metrics

Scaling up k-NN search by storing data in a KD-tree

Complexity of brute-force search

KD-trees

Nearest neighbor with KD-trees
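A sketch of 1-NN search with a KD-tree, under assumptions not stated on the slides: points are tuples of floats, the tree splits on the median while cycling through the axes, and squared Euclidean distance is used. The names `build_kdtree` and `nearest_neighbor` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Recursively split the points on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest_neighbor(node, query, best=None):
    """Return (squared distance, point) of the nearest stored point, or None."""
    if node is None:
        return best
    d = sq_dist(query, node["point"])
    if best is None or d < best[0]:
        best = (d, node["point"])
    # Descend first into the side of the split that contains the query.
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest_neighbor(near, query, best)
    # Visit the other side only if the splitting plane is closer than the
    # best distance found so far; otherwise that whole subtree is pruned.
    if diff ** 2 < best[0]:
        best = nearest_neighbor(far, query, best)
    return best


if __name__ == "__main__":
    tree = build_kdtree([(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0)])
    print(nearest_neighbor(tree, (9.0, 2.0)))
```

The pruning test is what saves work compared with brute force: a subtree is skipped whenever no point in it can beat the current best distance.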

Complexity for N queries

k-NN with KD-trees

Approximate k-NN with KD-trees

Closing remarks on KD-trees

KD-tree in high dimensions

Moving away from exact NN search

Locality Sensitive Hashing (LSH) as an alternative to KD-trees

Locality sensitive hashing
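One common way to realise locality sensitive hashing is to bin vectors by the sign of their score against a few random hyperplanes through the origin. The sketch below assumes that variant; the slides may use a different construction, and all function names here are illustrative.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, rng):
    """Draw n_planes random normal vectors, one per hyperplane through the origin."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def bin_index(vec, planes):
    """One bit per hyperplane: 1 if the vector scores >= 0 against it, else 0."""
    return tuple(int(sum(v * w for v, w in zip(vec, plane)) >= 0) for plane in planes)


def build_table(data, planes):
    """Hash table mapping each bin index to the list of data indices in that bin."""
    table = defaultdict(list)
    for i, vec in enumerate(data):
        table[bin_index(vec, planes)].append(i)
    return table


def query_candidates(vec, table, planes):
    """Indices sharing the query's bin; only these are searched for neighbours."""
    return table.get(bin_index(vec, planes), [])


if __name__ == "__main__":
    rng = random.Random(0)
    planes = make_hyperplanes(3, 2, rng)
    data = [[1.0, 0.9], [0.9, 1.0], [-1.0, -1.1]]
    table = build_table(data, planes)
    print(query_candidates([1.0, 1.0], table, planes))
```

The search is approximate: the true nearest neighbour can fall in an adjacent bin, so practical variants also inspect nearby bins or use several tables.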

LSH: improving efficiency

LSH recap

LSH: moving to higher dimensions d

What you can do now …

Clustering: an unsupervised learning task

Motivation

I don't just like sport!

Clustering: a supervised learning

Example of supervised learning

Clustering: an unsupervised learning

An unsupervised learning task

What defines a cluster?

Hope for unsupervised learning

Other (challenging!) clusters to discover

Analysed by your eyes

Analysed by clustering algorithms

k-means clustering algorithm

k-means as a coordinate descent algorithm

Convergence of k-means

Because k-means can be cast as a coordinate descent algorithm, we know that it converges to a local optimum.
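The coordinate descent view can be sketched as two alternating steps, each of which can only lower (or keep) the sum of squared distances to the assigned centers. This is a minimal illustration assuming points are tuples of numbers and initial centers are supplied by the caller; the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Step 1: with centers fixed, give each point the index of its closest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]


def update(points, labels, centers):
    """Step 2: with assignments fixed, move each center to the mean of its points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its previous center
            continue
        new_centers.append(tuple(sum(p[d] for p in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def heterogeneity(points, labels, centers):
    """Objective: sum of squared distances from points to their assigned centers."""
    return sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))


def kmeans(points, centers, max_iter=100):
    """Alternate the two steps until assignments stop changing."""
    history = []
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        history.append(heterogeneity(points, labels, centers))
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels, history
```

The values in `history` never increase from one iteration to the next, which is why the procedure converges; the optimum reached still depends on the starting centers.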

Convergence of k-means to local mode

Crosses: initialised centers

Group assignments have changed

k-means is very sensitive to the initialised centers

Smart initialisation: k-means++ overview
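A sketch of k-means++ seeding as it is usually described: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to the squared distance to the nearest center chosen so far. It assumes the data contain at least k distinct points; `kmeans_pp_init` is an illustrative name.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from the data, favouring points far from existing centers."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]  # first center: uniform over the data
    while len(centers) < k:
        # Squared distance from every point to its nearest already-chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # Sample the next center with probability proportional to that distance,
        # so points that coincide with a chosen center have zero probability.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.2)]
    print(kmeans_pp_init(data, 2, random.Random(0)))
```

The seeding costs extra passes over the data compared with purely random initialisation, but it tends to spread the starting centers out and so reduces the chance of a poor local optimum.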

k-means++ visualised

Assessing quality of the clustering

k-means objective

Cluster heterogeneity

What happens to heterogeneity as k increases?

How to choose k?

MapReduce

Counting words on a single processor

Naive parallel word counting

Counting words & merging tables

MapReduce abstraction
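The word-counting example can be written against a map step and a reduce step. The sketch below simulates both phases in a single process, purely to show the shape of the abstraction; it is not a distributed implementation, and the names `map_fn`, `reduce_fn` and `map_reduce` are hypothetical.

```python
from collections import defaultdict


def map_fn(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key into a single total."""
    return word, sum(counts)


def map_reduce(documents, mapper, reducer):
    """Run map on each document, group the emitted values by key, then reduce each key."""
    groups = defaultdict(list)
    for doc in documents:                  # map phase: documents are independent
        for key, value in mapper(doc):
            groups[key].append(value)      # shuffle: collect values under their key
    return dict(reducer(key, values) for key, values in groups.items())  # reduce phase


if __name__ == "__main__":
    print(map_reduce(["the cat sat", "the dog sat down"], map_fn, reduce_fn))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across machines without changing `map_fn` or `reduce_fn`.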

MapReduce – Execution overview

Improving performance

Scaling up k-means via MapReduce

Parallel k-means via MapReduce

What you can do now …

Probabilistic approach: mixture model

Why a probabilistic approach?

Mixture models

Application: clustering images

Single RGB vector per image

We see that they are grouping, but it is not easy to distinguish between the groups.

In this dimension the groups are separable!

Model for a given image type

Mixture of Gaussians

Application: clustering documents

Inferring soft assignments with expectation maximization (EM)
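A minimal sketch of EM for a mixture of Gaussians, restricted to one-dimensional data to keep it short. The E-step computes soft assignments (responsibilities) and the M-step re-estimates weights, means and variances from them. The function names and the variance floor of 1e-6 are assumptions made for this illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def e_step(xs, weights, means, variances):
    """Soft assignments: resp[i][j] is the probability that point i came from cluster j."""
    resp = []
    for x in xs:
        scores = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(scores)
        resp.append([s / total for s in scores])
    return resp


def m_step(xs, resp):
    """Re-estimate cluster weights, means and variances from the soft assignments."""
    weights, means, variances = [], [], []
    for j in range(len(resp[0])):
        nj = sum(r[j] for r in resp)  # effective number of points in cluster j
        mean = sum(r[j] * x for r, x in zip(resp, xs)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, xs)) / nj
        weights.append(nj / len(xs))
        means.append(mean)
        variances.append(max(var, 1e-6))  # assumed floor to stop a variance collapsing to 0
    return weights, means, variances


def em(xs, weights, means, variances, n_iter=50):
    """Alternate E- and M-steps for a fixed number of iterations."""
    for _ in range(n_iter):
        resp = e_step(xs, weights, means, variances)
        weights, means, variances = m_step(xs, resp)
    return weights, means, variances, e_step(xs, weights, means, variances)


if __name__ == "__main__":
    data = [0.0, 0.2, -0.1, 5.0, 5.1, 4.9]
    print(em(data, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])[1])
```

Unlike k-means, each point keeps a probability for every cluster, so uncertainty about cluster membership remains visible in the result.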

Inferring cluster labels

Part 1: Summary

Then split into separate tables and consider them independently.

Part 2a: Summary

Part 2b: Summary

Expectation maximization (EM)
