• Nie Znaleziono Wyników

Infobright’s Approach for Approximate Querying

N/A
N/A
Protected

Academic year: 2021

Share "Infobright’s Approach for Approximate Querying"

Copied!
24
0
0

Pełen tekst

(1)

Infobright’s Approach for

Approximate Querying

Rick Glick – Infobright Inc.

Dominik Ślęzak – University of Warsaw & Infobright Inc.

(2)

Agenda

Historical background: Infobright’s RDBMS software

so far (2005-2015)

Toward approximate query database engines: A new

aspect of scalability

New Infobright’s Approximate Query (IAQ) database

engine (2016-…)

IAQ Details: Capturing Knowledge (DATA-IN)

IAQ Details: Generating Insight (DATA-OUT)

Challenges and Conclusions

(3)

Historical background:

Infobright’s RDBMS software so far

(2005-2015)

(4)

Outlook Temp. Humid. Wind Sport? 1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cold Normal Weak Yes

6 Rain Cold Normal Strong No

7 Overcast Cold Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cold Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Outlook Temp. Humid. Wind Sport?

DATA LOAD

2

16

Outlook Temp. Humid. Wind Sport? row pack 1 rough value rough value rough value rough value rough value

ROUGH VALUE CALCULATION QUERY example related to filtering r[1] r[2] r[3] ROUGH VALUE USAGE r[216] ORIGINAL DATA DATA PACKS compressed collections of attribute values

row pack 2 rough value rough value rough value rough value rough value row pack 3 rough value rough value rough value rough value rough value

ROUGH ATTRIBUTES

GRANULATED TABLE

a collection of rough values for each of rough attributes

is stored as a separate knowledge node

identification of row packs and

rows which satisfy query conditions rp[1] rp[2] rp[3]

(5)

SELECT MAX(A) FROM T WHERE B > 15;

E

E

I X ≥ 22 I X ≥ 22

B > 15

B > 15, A

18

B > 15, A

X

Data Table T

I: Irrelevant Blocks

(Negative Region)

S: Suspect Blocks

(Boundary Region)

R: Relevant Blocks

(Positive Region)

E: Exact Computation

(necessary, if the final

query result cannot be

obtained only from the

statistical snapshots)

(6)

Toward approximate query database engines:

A new aspect of scalability

(7)

How Accurate Calculations do we Need?

Exact

Accuracy AccuracyEventual ApproximateOnly

Network Intrusion Detection Demand Forecasting Trade Settlement Audience Profiling Trading Algorithms Development Operational Reporting Technical Troubleshooting and Remediation Financial Risk, Scoring and Analysis Logistics and Demand Forecasting Agricultural Yield Management Cyber Security Test Scoring Price Sensitivity Analysis and Optimization Pharmaceutical Effectiveness

Tools for Machine Learning and

(8)

New Infobright’s Approximate Query (IAQ)

database engine (2016-…)

(9)

Infobright’s Approximate Query (IAQ)

New Database Engine for Approximate SQL

 Knowledge capture upon

ingestion

 Data passes through IAQ

statistical layer, creating knowledge by forming correlation models

 Resulting queries are

100-1000 times faster than traditional databases

(10)

Outlook Temp. Humid. Wind Sport? 1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cold Normal Weak Yes

6 Rain Cold Normal Strong No

7 Overcast Cold Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cold Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Outlook Temp. Humid. Wind Sport?

DATA LOAD

2

16

Outlook Temp. Humid. Wind Sport? row pack 1 rough value rough value rough value rough value rough value

ROUGH VALUE CALCULATION QUERY example related to filtering r[1] r[2] r[3] ROUGH VALUE USAGE r[216] ORIGINAL DATA DATA PACKS compressed collections of attribute values

row pack 2 rough value rough value rough value rough value rough value row pack 3 rough value rough value rough value rough value rough value

ROUGH ATTRIBUTES

GRANULATED TABLE

a collection of rough values for each of rough attributes

is stored as a separate knowledge node

identification of row packs and

rows which satisfy query conditions rp[1] rp[2] rp[3]

(11)

IAQ Query Execution Process

Ingested knowledge

Materialize

Output ≅ X

Perform operations on statistical data Transformed

knowledge

(12)
(13)

 Each data pack contains values

and counts which can be represented in a histogram

 Knowledge can be broken down

into:

– Ranges – Gaps

– Specials (i.e. outliers)

 Number of histogram bars, gaps,

and specials are minimized in order to maximize performance (i.e. less data footprint, faster query execution) 0 100 200 300 350 Range = 200-350 100 200 300 400 350 195 450 Special value = 300 Gap = 40-60

(14)

Single-Column Statistical Models

 During data ingestion we need to decide: – What are the ranges of histogram bars

– Which values should be stored as special ones – Which gaps are important enough to remember

 Decisions are taken automatically by IAQ algorithms, which maximize

quality of statistical model with respect to original data, under specified budget (footprint) constraints

 Decisions are taken independently for each data chunk being

(15)

Example: Single Dimensions

Probabilities: P(A=𝑅𝑅𝐴𝐴1) = 100250

=

1025

=

25 P(A=𝑅𝑅𝐴𝐴2) = 30 250

=

3 25 P(A=𝑅𝑅𝐴𝐴3) = 120 250

=

12 25 𝑅𝑅𝐴𝐴1 𝑅𝑅𝐴𝐴2 𝑅𝑅𝐴𝐴3 100 30 120 𝑅𝑅𝐵𝐵1 𝑅𝑅𝐵𝐵2 𝑅𝑅𝐵𝐵3 50 150 Probabilities: P(B=𝑅𝑅𝐵𝐵1) = 50 250

=

5 25 P(B=𝑅𝑅𝐵𝐵2) = 50 250

=

5 25 P(B=𝑅𝑅𝐵𝐵3) = 150 250

=

15 25

(16)

What is missing in

single-dimensional layer?

If 80 rows out of 250 detected during data ingest have values on A within 𝑅𝑅𝐴𝐴1 and values on B within 𝑅𝑅𝐵𝐵3, then:

𝑃𝑃

𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅

(A=𝑅𝑅

𝐴𝐴1

and B=𝑅𝑅

𝐵𝐵3

) =

25080

=

258

Contrast with:

(17)

𝑃𝑃

𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅

(A=𝑅𝑅

𝐴𝐴1

and B=𝑅𝑅

𝐵𝐵3

) =

25080

=

258

Contrast with:

𝑃𝑃

𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝑅𝑅𝐸𝐸𝑅𝑅𝐸𝐸

(A=𝑅𝑅

𝐴𝐴1

and B=𝑅𝑅

𝐵𝐵3

) =

25

X

35

=

256

Rank as:

Rank(𝑅𝑅

𝐴𝐴1

, B=𝑅𝑅

𝐵𝐵3

) =

8/256/25

-1 =

13

If 80 rows out of 250 detected during data ingest have values on A within 𝑅𝑅𝐴𝐴1 and values on B within 𝑅𝑅𝐵𝐵3, then:

What is missing in

(18)

Multi-column Statistical Models

𝑅𝑅

𝐴𝐴1

𝑅𝑅

𝐴𝐴2

𝑅𝑅

𝐴𝐴3

𝑅𝑅

𝐵𝐵1

𝑅𝑅

𝐵𝐵2

𝑅𝑅

𝐵𝐵3

𝑃𝑃

𝑅𝑅𝐸𝐸𝐴𝐴𝑅𝑅

𝑃𝑃

𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐴𝐴𝐸𝐸𝐸𝐸𝐸𝐸

 Store only a limited

number of ratios to maximize

performance

 Sort descending with

Rank=[Ratio – 1]

 Store top-N ratios or

pairs

8

6

<1

<1

<1

<1

<1

<1

<1

<1

(19)
(20)

IAQ Query Execution Process

Ingested knowledge

Materialize

Output ≅ X

Perform operations on statistical data Transformed

knowledge

(21)

Transformation in Action

Example:

SELECT SUM(A) FROM T WHERE B > 7;

𝑅𝑅

𝐴𝐴1

𝑅𝑅

𝐴𝐴2

𝑅𝑅

𝐴𝐴3

𝑅𝑅

𝐵𝐵1

𝑅𝑅

𝐵𝐵2

𝑅𝑅

𝐵𝐵3

8

6

<1

<1

<1

<1

<1

<1

<1

<1

Estimate number of rows

satisfying B > 7 as number of rows in 𝑅𝑅𝐵𝐵3 and - as a proportional estimation -1 3 * number of rows in 𝑅𝑅𝐵𝐵2 Thus: Count(*) = 150 + 1 3 * 50 = 67

(22)

Transformation in Action

E.g.:

SELECT SUM(A) FROM T WHERE B > 7;

𝑅𝑅

𝐴𝐴1

𝑅𝑅

𝐴𝐴2

𝑅𝑅

𝐴𝐴3

𝑅𝑅

𝐵𝐵1

𝑅𝑅

𝐵𝐵2

𝑅𝑅

𝐵𝐵3

8

6

<1

<1

<1

<1

<1

<1

<1

<1

Estimate number of rows

satisfying B > 7 as number of rows in 𝑅𝑅𝐵𝐵3 and - as a proportional estimation -1 3 * number of rows in 𝑅𝑅𝐵𝐵2 Thus: Count(*) = 150 + 1 3 * 50 = 67 B > 7

(23)

Transformation in Action (Cont’d)

𝑅𝑅

𝐴𝐴1

𝑅𝑅

𝐴𝐴2

𝑅𝑅

𝐴𝐴3

𝑅𝑅

𝐴𝐴1

𝑅𝑅

𝐴𝐴2

𝑅𝑅

𝐴𝐴3 >1025 <253 <1225 In probabilities after transformation

How will probabilities on A change subject to B > 7?

P (A = 𝑅𝑅𝐴𝐴1) will now be higher than because of positive correlation with 1025 𝑅𝑅𝐵𝐵3 This will improve the quality of estimating sum(A) where B > 7

10 25 3 25 12 25

(24)

Challenges and Conclusions

We keep searching for:

Better functions for comparison between results of

exact and approximate queries (better reflecting users’

perception while making business decisions based on

query results)

Better functions for histogram quality (better correlated

with the above comparison functions), so it is possible to

predict query accuracy based on information about

histograms’ quality

Better heuristic algorithms for searching for good

histograms (by means of the above histogram quality

functions), so the whole process of data ingestion

becomes as fast as possible

In the meantime, the Infobright’s Engineering Team keeps

working on design and implementation of better and faster

histogram-based approximate query execution mechanisms

Cytaty

Powiązane dokumenty

Zmiany umowy wymagać będą zachowania formy pisemnego aneksu podpisanego przez obie strony, pod rygorem nieważności, i dopuszczalne będą w warunkach określonych w

Dodatkowo, celem pozyskania koniecznych danych do raportowania wysokości pomocy publicznej w zakresie wysokości zwolnień od podatku akcyzowego wyrobów gazowych

Dlugoletnia przynaleznosc do partii orz petnione przez niego eksponowane i decyzyjne funkcje w aparacie partyjnym powoduje, Ze w ocenie Oddzialu Instytutu Pami?ci Narodowej -

Już Księżyc siwiutki cylindrem się kłania, już Droga, ta Mleczna, porywa nas w dal3. Tam gwiazdy się śmieją,

Rozporządzenie Ministra Edukacji i Nauki zmieniające rozporządzenie w sprawie podstaw programowych kształcenia w zawodach szkolnictwa branżowego oraz dodatkowych

Oblicz pochodną odwzorowania odwrotnego g −1 w punkcie (4, −1) na dwa sposoby: jako pochodną wyznaczonego odwzorowania odwrotnego oraz z twierdzenia o funkcji odwrotnej..

Rozporządzenie Komisji (UE) 2015/830 z dnia 28 maja 2015r zmieniające rozporządzenie (WE) nr 1907/2006 Parlamentu Europejskiego i Rady w sprawie rejestracji, oceny,

Alkohol Izopropylowy Ostra toksyczność - doustnie: LD50&gt;2000 mg/kg (dla 100% izopropanolu) Ostra toksyczność - skóra: LD50&gt;2000 mg/kg (dla 100% izopropanolu) Ostra toksyczność