Infobright’s Approach for
Approximate Querying
Rick Glick – Infobright Inc.
Dominik Ślęzak – University of Warsaw & Infobright Inc.
Agenda
Historical background: Infobright’s RDBMS software
so far (2005-2015)
Toward approximate query database engines: A new
aspect of scalability
New Infobright’s Approximate Query (IAQ) database
engine (2016-…)
IAQ Details: Capturing Knowledge (DATA-IN)
IAQ Details: Generating Insight (DATA-OUT)
Challenges and Conclusions
Historical background:
Infobright’s RDBMS software so far
(2005-2015)
Outlook Temp. Humid. Wind Sport? 1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cold Normal Weak Yes
6 Rain Cold Normal Strong No
7 Overcast Cold Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cold Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Outlook Temp. Humid. Wind Sport?
DATA LOAD
2
16Outlook Temp. Humid. Wind Sport? row pack 1 rough value rough value rough value rough value rough value
ROUGH VALUE CALCULATION QUERY example related to filtering r[1] r[2] r[3] ROUGH VALUE USAGE r[216] ORIGINAL DATA DATA PACKS compressed collections of attribute values
row pack 2 rough value rough value rough value rough value rough value row pack 3 rough value rough value rough value rough value rough value
ROUGH ATTRIBUTES
GRANULATED TABLE
a collection of rough values for each of rough attributes
is stored as a separate knowledge node
identification of row packs and
rows which satisfy query conditions rp[1] rp[2] rp[3]
SELECT MAX(A) FROM T WHERE B > 15;
E
E
I ⇔ X ≥ 22 I ⇔ X ≥ 22B > 15
B > 15, A
≥
18
B > 15, A
≥
X
Data Table T
I: Irrelevant Blocks
(Negative Region)
S: Suspect Blocks
(Boundary Region)
R: Relevant Blocks
(Positive Region)
E: Exact Computation
(necessary, if the final
query result cannot be
obtained only from the
statistical snapshots)
Toward approximate query database engines:
A new aspect of scalability
How Accurate Calculations do we Need?
Exact
Accuracy AccuracyEventual ApproximateOnly
Network Intrusion Detection Demand Forecasting Trade Settlement Audience Profiling Trading Algorithms Development Operational Reporting Technical Troubleshooting and Remediation Financial Risk, Scoring and Analysis Logistics and Demand Forecasting Agricultural Yield Management Cyber Security Test Scoring Price Sensitivity Analysis and Optimization Pharmaceutical Effectiveness
Tools for Machine Learning and
New Infobright’s Approximate Query (IAQ)
database engine (2016-…)
Infobright’s Approximate Query (IAQ)
New Database Engine for Approximate SQL
Knowledge capture uponingestion
Data passes through IAQ
statistical layer, creating knowledge by forming correlation models
Resulting queries are
100-1000 times faster than traditional databases
Outlook Temp. Humid. Wind Sport? 1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cold Normal Weak Yes
6 Rain Cold Normal Strong No
7 Overcast Cold Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cold Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Outlook Temp. Humid. Wind Sport?
DATA LOAD
2
16Outlook Temp. Humid. Wind Sport? row pack 1 rough value rough value rough value rough value rough value
ROUGH VALUE CALCULATION QUERY example related to filtering r[1] r[2] r[3] ROUGH VALUE USAGE r[216] ORIGINAL DATA DATA PACKS compressed collections of attribute values
row pack 2 rough value rough value rough value rough value rough value row pack 3 rough value rough value rough value rough value rough value
ROUGH ATTRIBUTES
GRANULATED TABLE
a collection of rough values for each of rough attributes
is stored as a separate knowledge node
identification of row packs and
rows which satisfy query conditions rp[1] rp[2] rp[3]
IAQ Query Execution Process
Ingested knowledge
Materialize
Output ≅ X
Perform operations on statistical data Transformed
knowledge
Each data pack contains values
and counts which can be represented in a histogram
Knowledge can be broken down
into:
– Ranges – Gaps
– Specials (i.e. outliers)
Number of histogram bars, gaps,
and specials are minimized in order to maximize performance (i.e. less data footprint, faster query execution) 0 100 200 300 350 Range = 200-350 100 200 300 400 350 195 450 Special value = 300 Gap = 40-60
Single-Column Statistical Models
During data ingestion we need to decide: – What are the ranges of histogram bars
– Which values should be stored as special ones – Which gaps are important enough to remember
Decisions are taken automatically by IAQ algorithms, which maximize
quality of statistical model with respect to original data, under specified budget (footprint) constraints
Decisions are taken independently for each data chunk being
Example: Single Dimensions
Probabilities: P(A=𝑅𝑅𝐴𝐴1) = 100250=
1025=
25 P(A=𝑅𝑅𝐴𝐴2) = 30 250=
3 25 P(A=𝑅𝑅𝐴𝐴3) = 120 250=
12 25 𝑅𝑅𝐴𝐴1 𝑅𝑅𝐴𝐴2 𝑅𝑅𝐴𝐴3 100 30 120 𝑅𝑅𝐵𝐵1 𝑅𝑅𝐵𝐵2 𝑅𝑅𝐵𝐵3 50 150 Probabilities: P(B=𝑅𝑅𝐵𝐵1) = 50 250=
5 25 P(B=𝑅𝑅𝐵𝐵2) = 50 250=
5 25 P(B=𝑅𝑅𝐵𝐵3) = 150 250=
15 25What is missing in
single-dimensional layer?
If 80 rows out of 250 detected during data ingest have values on A within 𝑅𝑅𝐴𝐴1 and values on B within 𝑅𝑅𝐵𝐵3, then:
𝑃𝑃
𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅(A=𝑅𝑅
𝐴𝐴1and B=𝑅𝑅
𝐵𝐵3) =
25080=
258Contrast with:
𝑃𝑃
𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅(A=𝑅𝑅
𝐴𝐴1and B=𝑅𝑅
𝐵𝐵3) =
25080=
258Contrast with:
𝑃𝑃
𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝑅𝑅𝐸𝐸𝑅𝑅𝐸𝐸(A=𝑅𝑅
𝐴𝐴1and B=𝑅𝑅
𝐵𝐵3) =
25X
35=
256Rank as:
Rank(𝑅𝑅
𝐴𝐴1
, B=𝑅𝑅
𝐵𝐵3) =
8/256/25-1 =
13If 80 rows out of 250 detected during data ingest have values on A within 𝑅𝑅𝐴𝐴1 and values on B within 𝑅𝑅𝐵𝐵3, then:
What is missing in
Multi-column Statistical Models
𝑅𝑅
𝐴𝐴1𝑅𝑅
𝐴𝐴2𝑅𝑅
𝐴𝐴3𝑅𝑅
𝐵𝐵1𝑅𝑅
𝐵𝐵2𝑅𝑅
𝐵𝐵3𝑃𝑃
𝑅𝑅𝐸𝐸𝐴𝐴𝑅𝑅
𝑃𝑃
𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐸𝐴𝐴𝐸𝐸𝐸𝐸𝐸𝐸
Store only a limited
number of ratios to maximize
performance
Sort descending with
Rank=[Ratio – 1]
Store top-N ratios or
pairs
8
6
<1
<1
<1
<1
<1
<1
<1
<1
IAQ Query Execution Process
Ingested knowledge
Materialize
Output ≅ X
Perform operations on statistical data Transformed
knowledge
Transformation in Action
Example:
SELECT SUM(A) FROM T WHERE B > 7;
𝑅𝑅
𝐴𝐴1𝑅𝑅
𝐴𝐴2𝑅𝑅
𝐴𝐴3𝑅𝑅
𝐵𝐵1𝑅𝑅
𝐵𝐵2𝑅𝑅
𝐵𝐵38
6
<1
<1
<1
<1
<1
<1
<1
<1
Estimate number of rowssatisfying B > 7 as number of rows in 𝑅𝑅𝐵𝐵3 and - as a proportional estimation -1 3 * number of rows in 𝑅𝑅𝐵𝐵2 Thus: Count(*) = 150 + 1 3 * 50 = 67
Transformation in Action
E.g.:
SELECT SUM(A) FROM T WHERE B > 7;
𝑅𝑅
𝐴𝐴1𝑅𝑅
𝐴𝐴2𝑅𝑅
𝐴𝐴3𝑅𝑅
𝐵𝐵1𝑅𝑅
𝐵𝐵2𝑅𝑅
𝐵𝐵38
6
<1
<1
<1
<1
<1
<1
<1
<1
Estimate number of rowssatisfying B > 7 as number of rows in 𝑅𝑅𝐵𝐵3 and - as a proportional estimation -1 3 * number of rows in 𝑅𝑅𝐵𝐵2 Thus: Count(*) = 150 + 1 3 * 50 = 67 B > 7
Transformation in Action (Cont’d)
𝑅𝑅
𝐴𝐴1𝑅𝑅
𝐴𝐴2𝑅𝑅
𝐴𝐴3𝑅𝑅
𝐴𝐴1𝑅𝑅
𝐴𝐴2𝑅𝑅
𝐴𝐴3 >1025 <253 <1225 In probabilities after transformationHow will probabilities on A change subject to B > 7?
P (A = 𝑅𝑅𝐴𝐴1) will now be higher than because of positive correlation with 1025 𝑅𝑅𝐵𝐵3 This will improve the quality of estimating sum(A) where B > 7
10 25 3 25 12 25