Infobright’s Approach for Approximate Querying

(1)

Infobright’s Approach for

Approximate Querying

Rick Glick – Infobright Inc.

Dominik Ślęzak – University of Warsaw & Infobright Inc.

(2)

Agenda



Historical background: Infobright’s RDBMS software

so far (2005-2015)



Toward approximate query database engines: A new

aspect of scalability



New Infobright’s Approximate Query (IAQ) database

engine (2016-…)



IAQ Details: Capturing Knowledge (DATA-IN)



IAQ Details: Generating Insight (DATA-OUT)



Challenges and Conclusions

(3)

Historical background:

Infobright’s RDBMS software so far

(2005-2015)

(4)

Outlook Temp. Humid. Wind Sport? 1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cold Normal Weak Yes

6 Rain Cold Normal Strong No

7 Overcast Cold Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cold Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Outlook Temp. Humid. Wind Sport?

DATA LOAD

2

16

Outlook Temp. Humid. Wind Sport? row pack 1 rough value rough value rough value rough value rough value

ROUGH VALUE CALCULATION QUERY example related to filtering r[1] r[2] r[3] ROUGH VALUE USAGE r[216_] ORIGINAL DATA DATA PACKS compressed collections of attribute values

row pack 2 rough value rough value rough value rough value rough value row pack 3 rough value rough value rough value rough value rough value

ROUGH ATTRIBUTES

GRANULATED TABLE

a collection of rough values for each of rough attributes

is stored as a separate knowledge node

identification of row packs and

rows which satisfy query conditions rp[1] rp[2] rp[3]

(5)

SELECT MAX(A) FROM T WHERE B > 15;

E

I ⇔ X ≥ 22 I ⇔ X ≥ 22

B > 15

B > 15, A

≥

18 B > 15, A

≥

X

Data Table T



I: Irrelevant Blocks

(Negative Region)



S: Suspect Blocks

(Boundary Region)



R: Relevant Blocks

(Positive Region)



E: Exact Computation

(necessary, if the final

query result cannot be

obtained only from the

statistical snapshots)

(6)

Toward approximate query database engines:

A new aspect of scalability

(7)

How Accurate Calculations do we Need?

Exact

Accuracy AccuracyEventual ApproximateOnly

Network Intrusion Detection Demand Forecasting Trade Settlement Audience Profiling Trading Algorithms Development Operational Reporting Technical Troubleshooting and Remediation Financial Risk, Scoring and Analysis Logistics and Demand Forecasting Agricultural Yield Management Cyber Security Test Scoring Price Sensitivity Analysis and Optimization Pharmaceutical Effectiveness

Tools for Machine Learning and

(8)

New Infobright’s Approximate Query (IAQ)

database engine (2016-…)

(9)

Infobright’s Approximate Query (IAQ)

New Database Engine for Approximate SQL

 Knowledge capture upon

ingestion

 Data passes through IAQ

statistical layer, creating knowledge by forming correlation models

 Resulting queries are

100-1000 times faster than traditional databases

(10)

Outlook Temp. Humid. Wind Sport? 1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cold Normal Weak Yes

6 Rain Cold Normal Strong No

7 Overcast Cold Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cold Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Outlook Temp. Humid. Wind Sport?

DATA LOAD

2

16

Outlook Temp. Humid. Wind Sport? row pack 1 rough value rough value rough value rough value rough value

ROUGH VALUE CALCULATION QUERY example related to filtering r[1] r[2] r[3] ROUGH VALUE USAGE r[216_] ORIGINAL DATA DATA PACKS compressed collections of attribute values

row pack 2 rough value rough value rough value rough value rough value row pack 3 rough value rough value rough value rough value rough value

ROUGH ATTRIBUTES

GRANULATED TABLE

a collection of rough values for each of rough attributes

is stored as a separate knowledge node

identification of row packs and

rows which satisfy query conditions rp[1] rp[2] rp[3]

(11)

IAQ Query Execution Process

Ingested knowledge

Materialize

Output ≅ X

Perform operations on statistical data Transformed

knowledge

(12)

(13)

 Each data pack contains values

and counts which can be represented in a histogram

 Knowledge can be broken down

into:

– Ranges – Gaps

– Specials (i.e. outliers)

 Number of histogram bars, gaps,

and specials are minimized in order to maximize performance (i.e. less data footprint, faster query execution) 0 100 200 300 350 Range = 200-350 100 200 300 400 350 195 450 Special value = 300 Gap = 40-60

(14)

Single-Column Statistical Models

 During data ingestion we need to decide: – What are the ranges of histogram bars

– Which values should be stored as special ones – Which gaps are important enough to remember

 Decisions are taken automatically by IAQ algorithms, which maximize

quality of statistical model with respect to original data, under specified budget (footprint) constraints

 Decisions are taken independently for each data chunk being

(15)

Example: Single Dimensions

Probabilities: P(A=𝑅𝑅_𝐴𝐴1) = 100₂₅₀

=

10₂₅

=

2₅ P(A=𝑅𝑅_𝐴𝐴2_{) =} 30 250

=

3 25 P(A=𝑅𝑅_𝐴𝐴3_{) =}120 250

=

12 25 𝑅𝑅_𝐴𝐴1 𝑅𝑅_𝐴𝐴2 𝑅𝑅_𝐴𝐴3 100 30 120 𝑅𝑅_𝐵𝐵1 𝑅𝑅_𝐵𝐵2 𝑅𝑅_𝐵𝐵3 50 150 Probabilities: P(B=𝑅𝑅_𝐵𝐵1_{) =} 50 250

=

5 25 P(B=𝑅𝑅_𝐵𝐵2_{) =} 50 250

=

5 25 P(B=𝑅𝑅_𝐵𝐵3_{) =}150 250

=

15 25

(16)

What is missing in

single-dimensional layer?

If 80 rows out of 250 detected during data ingest have values on A within 𝑅𝑅_𝐴𝐴1 and values on B within 𝑅𝑅_𝐵𝐵3, then:

𝑃𝑃

_{𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅}

(A=𝑅𝑅

_𝐴𝐴1

and B=𝑅𝑅

_𝐵𝐵3

) =

₂₅₀80

=

₂₅8

Contrast with:

(17)

𝑃𝑃

_{𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅}

(A=𝑅𝑅

_𝐴𝐴1

and B=𝑅𝑅

_𝐵𝐵3

) =

₂₅₀80

=

₂₅8

Contrast with:

𝑃𝑃

_{𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝑅𝑅𝐸𝐸𝑅𝑅𝐸𝐸}

(A=𝑅𝑅

_𝐴𝐴1

and B=𝑅𝑅

_𝐵𝐵3

) =

2₅

X

3₅

=

₂₅6

Rank as:

_{Rank(𝑅𝑅}

𝐴𝐴1

, B=𝑅𝑅

𝐵𝐵3

) =

8/25_6/25

-1 =

1₃

If 80 rows out of 250 detected during data ingest have values on A within 𝑅𝑅_𝐴𝐴1 and values on B within 𝑅𝑅_𝐵𝐵3, then:

What is missing in

(18)

Multi-column Statistical Models

𝑅𝑅

𝐴𝐴1

𝑅𝑅

𝐴𝐴2

𝑅𝑅

𝐴𝐴3

𝑅𝑅

𝐵𝐵1

𝑅𝑅

𝐵𝐵2

𝑅𝑅

_𝐵𝐵3

𝑃𝑃

_{𝑅𝑅𝐸𝐸𝐴𝐴𝑅𝑅}

𝑃𝑃

_{𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐴𝐴𝐸𝐸𝐸𝐸𝐸𝐸}

 Store only a limited

number of ratios to maximize

performance

 Sort descending with

Rank=[Ratio – 1]

 Store top-N ratios or

pairs

8

6 <1

<1

(19)

(20)

IAQ Query Execution Process

Ingested knowledge

Materialize

Output ≅ X

Perform operations on statistical data Transformed

knowledge

(21)

Transformation in Action

Example:

SELECT SUM(A) FROM T WHERE B > 7;

𝑅𝑅

_𝐴𝐴1

𝑅𝑅

_𝐴𝐴2

𝑅𝑅

_𝐴𝐴3

𝑅𝑅

𝐵𝐵1

𝑅𝑅

𝐵𝐵2

𝑅𝑅

_𝐵𝐵3

8

6 <1

<1

Estimate number of rows

satisfying B > 7 as number of rows in 𝑅𝑅_𝐵𝐵3 and - as a proportional estimation -1 3 * number of rows in 𝑅𝑅𝐵𝐵2 Thus: Count(*) = 150 + 1 3 * 50 = 67

(22)

Transformation in Action

E.g.:

SELECT SUM(A) FROM T WHERE B > 7;

𝑅𝑅

_𝐴𝐴1

𝑅𝑅

_𝐴𝐴2

𝑅𝑅

_𝐴𝐴3

𝑅𝑅

𝐵𝐵1

𝑅𝑅

𝐵𝐵2

𝑅𝑅

_𝐵𝐵3

8

6 <1

<1

Estimate number of rows

satisfying B > 7 as number of rows in 𝑅𝑅_𝐵𝐵3 and - as a proportional estimation -1 3 * number of rows in 𝑅𝑅𝐵𝐵2 Thus: Count(*) = 150 + 1 3 * 50 = 67 B > 7

(23)

Transformation in Action (Cont’d)

𝑅𝑅

𝐴𝐴1

𝑅𝑅

_𝐴𝐴2

𝑅𝑅

_𝐴𝐴3

𝑅𝑅

𝐴𝐴1

𝑅𝑅

𝐴𝐴2

𝑅𝑅

_𝐴𝐴3 >10₂₅ <₂₅3 <12₂₅ In probabilities after transformation

How will probabilities on A change subject to B > 7?

P (A = 𝑅𝑅_𝐴𝐴1) will now be higher than because of positive correlation with 10₂₅ 𝑅𝑅_𝐵𝐵3 This will improve the quality of estimating sum(A) where B > 7

10 25 3 25 12 25

(24)