Adaptive query optimization in PostgreSQL
2o1dJtD
2o1dJtD
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>Adaptive</strong> <strong>query</strong> <strong>optimization</strong><br />
<strong>in</strong> <strong>PostgreSQL</strong><br />
Oleg Ivanov<br />
Postgres Professional<br />
postgrespro.ru
What is <strong>query</strong> <strong>optimization</strong>?<br />
How does <strong>PostgreSQL</strong> optimize queries?<br />
What is adaptive <strong>query</strong> <strong>optimization</strong>?<br />
Mach<strong>in</strong>e learn<strong>in</strong>g and kNN method.<br />
How to use mach<strong>in</strong>e learn<strong>in</strong>g for adaptive <strong>query</strong> <strong>optimization</strong>?<br />
How much can it improve <strong>PostgreSQL</strong> performance?<br />
Implementaion details: AQO.<br />
2
What is <strong>query</strong> <strong>optimization</strong>?<br />
SELECT *<br />
FROM users AS u1, messages AS m, users AS u2<br />
WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />
NestedLoopJo<strong>in</strong><br />
MergeJo<strong>in</strong><br />
SeqScan<br />
IndexScan<br />
IndexScan<br />
users u1<br />
messages m<br />
users u2<br />
3
What is <strong>query</strong> <strong>optimization</strong>?<br />
SELECT *<br />
FROM users AS u1, messages AS m, users AS u2<br />
WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />
NestedLoopJo<strong>in</strong><br />
NestedLoopJo<strong>in</strong><br />
IndexScan<br />
SeqScan<br />
IndexScan<br />
users u1<br />
messages m<br />
users u2<br />
4
What is <strong>query</strong> <strong>optimization</strong>?<br />
SELECT *<br />
FROM users AS u1, messages AS m, users AS u2<br />
WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />
HashJo<strong>in</strong><br />
HashJo<strong>in</strong><br />
SeqScan<br />
SeqScan<br />
SeqScan<br />
users u1<br />
messages m<br />
users u2<br />
5
What is <strong>query</strong> <strong>optimization</strong>?<br />
EXPLAIN SELECT *<br />
FROM users AS u1, messages AS m, users AS u2<br />
WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />
QUERY PLAN<br />
-----------------------------------------------------------------------------------<br />
Hash Jo<strong>in</strong> (cost=540.00..439429.44 rows=10003825 width=27)<br />
Hash Cond: (m.receiver_id = u2.id)<br />
-> Hash Jo<strong>in</strong> (cost=270.00..301606.84 rows=10003825 width=23)<br />
Hash Cond: (m.sender_id = u1.id)<br />
-> Seq Scan on messages m (cost=0.00..163784.25 rows=10003825 width=19)<br />
-> Hash (cost=145.00..145.00 rows=10000 width=4)<br />
-> Seq Scan on users u1 (cost=0.00..145.00 rows=10000 width=4)<br />
-> Hash (cost=145.00..145.00 rows=10000 width=4)<br />
-> Seq Scan on users u2 (cost=0.00..145.00 rows=10000 width=4)<br />
(9 rows)<br />
6
What is <strong>query</strong> <strong>optimization</strong>?<br />
Plan execution cost<br />
EXPLAIN SELECT *<br />
FROM users AS u1, messages AS m, users AS u2<br />
WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />
QUERY PLAN<br />
-----------------------------------------------------------------------------------<br />
Hash Jo<strong>in</strong> (cost=540.00..439429.44 rows=10003825 width=27)<br />
Hash Cond: (m.receiver_id = u2.id)<br />
-> Hash Jo<strong>in</strong> (cost=270.00..301606.84 rows=10003825 width=23)<br />
Hash Cond: (m.sender_id = u1.id)<br />
-> Seq Scan on messages m (cost=0.00..163784.25 rows=10003825 width=19)<br />
-> Hash (cost=145.00..145.00 rows=10000 width=4)<br />
-> Seq Scan on users u1 (cost=0.00..145.00 rows=10000 width=4)<br />
-> Hash (cost=145.00..145.00 rows=10000 width=4)<br />
-> Seq Scan on users u2 (cost=0.00..145.00 rows=10000 width=4)<br />
(9 rows)<br />
Plan node execution cost<br />
Plan node card<strong>in</strong>ality<br />
7
How does <strong>PostgreSQL</strong> optimize queries?<br />
Cost-based <strong>query</strong> <strong>optimization</strong><br />
System R (1974)<br />
Choose the cheapest plan<br />
among all the possible plans<br />
8
How does <strong>PostgreSQL</strong> optimize queries?<br />
Cost=n s c s +n r c r +n t c t +n i c i +n o c o<br />
c s seq_page_cost 1.0<br />
c r random_page_cost 4.0<br />
c t cpu_Tuple_cost 0.01<br />
c i cpu_Index_tuple_cost 0.005<br />
c o cpu_Operator_cost 0.0025<br />
9
How does <strong>PostgreSQL</strong> optimize queries?<br />
SELECT * FROM users<br />
WHERE age < 25;<br />
SeqScan<br />
Data<br />
Cost=n s c s +n o ⋅c o<br />
n s =N pages<br />
n o =N tuples<br />
IndexScan<br />
10
How does <strong>PostgreSQL</strong> optimize queries?<br />
SELECT * FROM users<br />
WHERE age < 25;<br />
SeqScan<br />
Data<br />
Cost=n s c s +n o ⋅c o<br />
n s =N pages<br />
n o =N tuples<br />
IndexScan<br />
Index<br />
Data<br />
Cost=n r ⋅c r<br />
n r =Card<strong>in</strong>ality<br />
11
How does <strong>PostgreSQL</strong> optimize queries?<br />
SELECT * FROM users<br />
WHERE age < 25;<br />
14000<br />
12000<br />
10000<br />
8000<br />
6000<br />
4000<br />
2000<br />
0<br />
0 – 4<br />
5 – 9<br />
15 - 19 25 - 29 35 - 39 45 - 49 55 - 59 65 - 69 75 - 79 85 - 89 95 – 99<br />
10 - 14 20 - 24 30 - 34 40 - 44 50 - 54 60 - 64 70 - 74 80 - 84 90 - 94 100 or more<br />
Card<strong>in</strong>ality<br />
Age<br />
12
How does <strong>PostgreSQL</strong> optimize queries?<br />
SELECT *<br />
FROM users AS u1, messages AS m, users AS u2<br />
WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />
HashJo<strong>in</strong><br />
HashJo<strong>in</strong><br />
SeqScan<br />
SeqScan<br />
SeqScan<br />
users u1<br />
messages m<br />
users u2<br />
13
How does <strong>PostgreSQL</strong> optimize queries?<br />
SELECT *<br />
FROM users AS u1, messages AS m, users AS u2<br />
WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />
10000000<br />
SeqScan<br />
HashJo<strong>in</strong><br />
10000000<br />
SeqScan<br />
HashJo<strong>in</strong><br />
10000 10000000<br />
10000<br />
SeqScan<br />
users u1<br />
messages m<br />
users u2<br />
14
How does <strong>PostgreSQL</strong> optimize queries?<br />
SELECT *<br />
FROM users AS u1, messages AS m, users AS u2<br />
WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />
10000000<br />
SeqScan<br />
HashJo<strong>in</strong><br />
301607<br />
10000000<br />
SeqScan<br />
HashJo<strong>in</strong><br />
439429<br />
10000 10000000<br />
10000<br />
SeqScan<br />
145 163784<br />
145<br />
users u1<br />
messages m<br />
users u2<br />
15
How does <strong>PostgreSQL</strong> optimize queries?<br />
Optimization method<br />
HashJo<strong>in</strong><br />
HashJo<strong>in</strong><br />
SeqScan SeqScan SeqScan<br />
users<br />
messages<br />
pictures<br />
Full search<br />
Cost = 439429<br />
MergeJo<strong>in</strong><br />
Plan's cost<br />
estimation<br />
SeqScan<br />
messages<br />
SeqScan<br />
pictures<br />
Cost = 304528<br />
16
How does <strong>PostgreSQL</strong> optimize queries?<br />
Optimization method<br />
HashJo<strong>in</strong><br />
HashJo<strong>in</strong><br />
SeqScan SeqScan SeqScan<br />
users<br />
messages<br />
pictures<br />
Full search<br />
Cost = 439429<br />
MergeJo<strong>in</strong><br />
Plan's cost<br />
estimation<br />
SeqScan<br />
messages<br />
SeqScan<br />
pictures<br />
Cost = 304528<br />
17
How does <strong>PostgreSQL</strong> optimize queries?<br />
Optimization method<br />
HashJo<strong>in</strong><br />
HashJo<strong>in</strong><br />
SeqScan SeqScan SeqScan<br />
users<br />
messages<br />
pictures<br />
Dynamic programm<strong>in</strong>g<br />
or<br />
Cost = 439429<br />
MergeJo<strong>in</strong><br />
Plan's cost<br />
estimation<br />
Genetic algorithm<br />
SeqScan<br />
messages<br />
SeqScan<br />
pictures<br />
Cost = 304528<br />
18
Dynamic programm<strong>in</strong>g over subsets<br />
● System R<br />
●<br />
Time complexity: 3n<br />
●<br />
Memory consumption: 2n<br />
● Always f<strong>in</strong>ds the cheapest plan<br />
19
Genetic algorithm<br />
● <strong>PostgreSQL</strong><br />
● Common and flexible method<br />
● Can be stopped on every iteration<br />
● No guarantees<br />
20
How does <strong>PostgreSQL</strong> optimize queries?<br />
Optimization method<br />
HashJo<strong>in</strong><br />
HashJo<strong>in</strong><br />
SeqScan SeqScan SeqScan<br />
users<br />
messages<br />
pictures<br />
Dynamic programm<strong>in</strong>g<br />
or<br />
Cost = 439429<br />
MergeJo<strong>in</strong><br />
Plan's cost<br />
estimation<br />
Genetic algorithm<br />
SeqScan<br />
messages<br />
SeqScan<br />
pictures<br />
Cost = 304528<br />
21
Query clauses<br />
Card<strong>in</strong>ality<br />
estimation<br />
Cost estimation<br />
Information about<br />
stored data<br />
<strong>PostgreSQL</strong> state<br />
22
Dataset:<br />
The TPC BenchmarkDS (TPC-DS)<br />
http://www.tpc.org/tpcds/<br />
23
Error: 300 times<br />
Error: 4 times<br />
Dataset:<br />
The TPC BenchmarkDS (TPC-DS)<br />
http://www.tpc.org/tpcds/<br />
24
Query clauses<br />
How good are <strong>query</strong> optimizers, really?<br />
V. Leis, A. Gubichev, A. Mirchev et al.<br />
Card<strong>in</strong>ality<br />
estimation<br />
Cost estimation<br />
Information about<br />
stored data<br />
<strong>PostgreSQL</strong> state<br />
25
SELECT * FROM users<br />
WHERE age < 25;<br />
14000<br />
12000<br />
10000<br />
8000<br />
6000<br />
4000<br />
2000<br />
0<br />
0 – 4<br />
5 – 9<br />
15 - 19 25 - 29 35 - 39 45 - 49 55 - 59 65 - 69 75 - 79 85 - 89 95 – 99<br />
10 - 14 20 - 24 30 - 34 40 - 44 50 - 54 60 - 64 70 - 74 80 - 84 90 - 94 100 or more<br />
Age<br />
Selectivity≃0.3<br />
Card<strong>in</strong>ality=N tuples ⋅Selectivity<br />
26
SELECT * FROM users<br />
WHERE age < 25 AND city = 'Moscow';<br />
Only selectivities of <strong>in</strong>dividual clauses<br />
(i.e. marg<strong>in</strong>al selectivities)<br />
are known<br />
Selectivity age<br />
= 1 3<br />
Selectivity city<br />
= 1 7<br />
Selectivity age ,city =?<br />
27
1/3<br />
age < 25<br />
city = 'Moscow'<br />
1/7 1/21<br />
28
SELECT * FROM users<br />
WHERE age < 25 AND city = 'Moscow';<br />
Only selectivities of <strong>in</strong>dividual clauses<br />
are known<br />
The clauses are considered to be <strong>in</strong>dependent:<br />
Selectivity age ,city<br />
=Selectivity age<br />
⋅Selectivity city<br />
With the exception of Selectivity 25
age < 25<br />
city = 'Moscow'<br />
1/7 1/7<br />
1/3<br />
30
age < 25<br />
1/3<br />
1/7<br />
city = 'Moscow'<br />
31
SELECT * FROM users<br />
WHERE age < 12 AND married = true;<br />
age < 12<br />
married = true<br />
32
SELECT * FROM users<br />
WHERE age < 12 AND married = false;<br />
age < 12<br />
married = false<br />
33
SELECT * FROM users<br />
WHERE age > 25 AND married = true<br />
AND position = 'CTO';<br />
married = true<br />
position = 'CTO'<br />
age > 25<br />
34
Multidimensional histograms<br />
35
What is adaptive <strong>query</strong> <strong>optimization</strong>?<br />
Query <strong>optimization</strong><br />
Query execution<br />
SQL <strong>query</strong><br />
Feedback<br />
Result<br />
A priori cost estimation<br />
Histograms<br />
Cost models<br />
Execution statistics<br />
36
What is adaptive <strong>query</strong> <strong>optimization</strong>?<br />
Query <strong>optimization</strong><br />
Query execution<br />
SQL <strong>query</strong><br />
Feedback<br />
Result<br />
A priori cost estimation<br />
Histograms<br />
Cost models<br />
Execution statistics<br />
37
Mach<strong>in</strong>e learn<strong>in</strong>g<br />
Objects<br />
Feature 1<br />
Features<br />
Feature 2<br />
Feature 3<br />
Hidden values<br />
Tra<strong>in</strong> set<br />
Test set<br />
38
K Nearest Neighbours method<br />
Age<br />
Salary<br />
25 47 55 32 22 45 28<br />
50 120 100 80 30 90 ?<br />
39
K Nearest Neighbours method<br />
Age<br />
Salary<br />
25 47 55 32 22 45 28<br />
50 120 100 80 30 90 ?<br />
40
Gradient approach to kNN<br />
2/3<br />
1/3<br />
1/3 5/3<br />
Age<br />
Salary<br />
27 47 28<br />
53 103 ?<br />
41
How to use mach<strong>in</strong>e learn<strong>in</strong>g<br />
for adaptive <strong>query</strong> <strong>optimization</strong>?<br />
42
The object is a node with its subtree<br />
u1.id = messages.sender_id<br />
NestedLoopJo<strong>in</strong><br />
MergeJo<strong>in</strong><br />
u2.id = messages.receiver_id<br />
SeqScan<br />
IndexScan<br />
IndexScan<br />
u2.married = true<br />
AND<br />
u2.age < 25<br />
users u1<br />
messages m<br />
users u2<br />
43
Histograms<br />
users.id = messages.receiver_id<br />
AND<br />
users.married = true<br />
AND<br />
users.age < 25<br />
Clauses list<br />
Node card<strong>in</strong>ality is<br />
105 tuples!<br />
<strong>PostgreSQL</strong> estimator<br />
44
Clause selectivities<br />
● 0.0001<br />
● 0.73<br />
● 0.23<br />
users.id = messages.receiver_id<br />
AND<br />
users.married = const<br />
AND<br />
users.age < const<br />
Clauses list<br />
Node card<strong>in</strong>ality is<br />
1017 tuples!<br />
Mach<strong>in</strong>e learn<strong>in</strong>g<br />
45
Mach<strong>in</strong>e learn<strong>in</strong>g problem statement<br />
Object is a plan node<br />
Features<br />
users.id = messages.receiver_id<br />
users.married = const<br />
users.age < const<br />
0.0001<br />
0.73<br />
0.23<br />
Hidden value Node card<strong>in</strong>ality<br />
?<br />
46
Workflow<br />
Query pars<strong>in</strong>g<br />
Query <strong>optimization</strong><br />
Query execution<br />
Query execution statistics<br />
Mach<strong>in</strong>e learn<strong>in</strong>g<br />
Card<strong>in</strong>ality estimation<br />
Learn<strong>in</strong>g<br />
Mach<strong>in</strong>e learn<strong>in</strong>g data<br />
47
Theoretical properties<br />
●<br />
●<br />
●<br />
Will it converge?<br />
Yes, <strong>in</strong> the f<strong>in</strong>ite number of steps<br />
How fast will it converge?<br />
Don't know (<strong>in</strong> practice <strong>in</strong> a few steps)<br />
What guarantees on obta<strong>in</strong>ed plans or<br />
regressor do we have?<br />
Predictions are correct for all executed paths<br />
With perfect cost model obta<strong>in</strong>ed plans are not worse<br />
48
How much can it improve <strong>PostgreSQL</strong> performance?<br />
Experimental evaluation<br />
49
Estimation error<br />
50
Estimation error<br />
51
Estimation error<br />
52
Performance improvement<br />
Orig<strong>in</strong>al<br />
<strong>Adaptive</strong><br />
TPC-H fast<br />
+1.3%<br />
TPC-H slow<br />
0 10 20 30 40 50 60<br />
53
Performance improvement<br />
Orig<strong>in</strong>al<br />
<strong>Adaptive</strong><br />
TPС-H fast<br />
TPC-H slow<br />
-4.4%<br />
0 5000 10000 15000 20000<br />
54
Performance improvement<br />
Orig<strong>in</strong>al<br />
<strong>Adaptive</strong><br />
TPC-DS very fast<br />
+12%<br />
TPC-DS fast<br />
TPC-DS normal<br />
TPC-DS slow<br />
TPC-DS very slow<br />
0 2 4 6 8 10 12 14 16 18<br />
55
Performance improvement<br />
Orig<strong>in</strong>al<br />
<strong>Adaptive</strong><br />
TPC-DS very fast<br />
TPC-DS fast<br />
+24%<br />
TPC-DS normal<br />
TPC-DS slow<br />
TPC-DS very slow<br />
0 20 40 60 80 100 120 140<br />
56
Performance improvement<br />
Orig<strong>in</strong>al<br />
<strong>Adaptive</strong><br />
TPC-DS very fast<br />
TPC-DS fast<br />
TPC-DS normal<br />
+41%<br />
TPC-DS slow<br />
TPC-DS very slow<br />
0 50 100 150 200 250 300<br />
57
Performance improvement<br />
Orig<strong>in</strong>al<br />
<strong>Adaptive</strong><br />
TPC-DS very fast<br />
TPC-DS fast<br />
TPC-DS normal<br />
TPC-DS slow<br />
+285%<br />
TPC-DS very slow<br />
0<br />
200<br />
400<br />
600<br />
1000 1400 1800<br />
800 1200 1600<br />
58
Performance improvement<br />
Orig<strong>in</strong>al<br />
<strong>Adaptive</strong><br />
TPC-DS very fast<br />
TPC-DS fast<br />
TPC-DS normal<br />
TPC-DS slow<br />
TPC-DS very slow<br />
0 5000 10000 15000 20000<br />
+115%<br />
59
Maximum acceleration<br />
60
Overheads<br />
Experimental evaluation<br />
Slowdown for genetic algorithm is not more than 2 seconds<br />
Slowdown for dynamic programm<strong>in</strong>g is not more than 30 ms<br />
61
Applicability<br />
Complex analytical queries<br />
with a repeat<strong>in</strong>g pattern.<br />
62
Learn<strong>in</strong>g progress<br />
1.4<br />
TPC-DS 18<br />
1.3<br />
1.2<br />
Execution time, s<br />
1.1<br />
1.0<br />
0.9<br />
0.8<br />
0.7<br />
0.6<br />
0 1 2 3 4 5 6 7 8 9<br />
# iter<br />
63
Learn<strong>in</strong>g progress<br />
10 4<br />
TPC-DS 69<br />
10 3<br />
Execution time, s<br />
10 2<br />
10 1<br />
10 0<br />
0 1 2 3 4 5 6 7 8 9<br />
# iter<br />
64
Learn<strong>in</strong>g progress<br />
1.3<br />
TPC-DS 26<br />
1.2<br />
Execution time, s<br />
1.1<br />
1.0<br />
0.9<br />
0.8<br />
0.7<br />
0 1 2 3 4 5 6 7 8 9<br />
# iter<br />
65
AQO: adaptive <strong>query</strong> <strong>optimization</strong><br />
Current code for vanilla <strong>PostgreSQL</strong> (extension + patch):<br />
https://github.com/tigvarts/aqo<br />
Available <strong>in</strong> Postgres Pro Enterprise<br />
66
AQO: adaptive <strong>query</strong> <strong>optimization</strong><br />
For some queries we don't need AQO.<br />
So we need a mechanism to determ<strong>in</strong>e whether the <strong>query</strong> needs AQO.<br />
67
AQO: adaptive <strong>query</strong> <strong>optimization</strong><br />
Query type is the set of queries, which differ only <strong>in</strong> their constants.<br />
Query type:<br />
SELECT * FROM users WHERE age > const AND city = const;<br />
Queries:<br />
SELECT * FROM users WHERE age > 18 AND city = 'Moscow';<br />
SELECT * FROM users WHERE age > 65 AND city = 'Kostroma';<br />
…<br />
68
AQO: adaptive <strong>query</strong> <strong>optimization</strong><br />
aqo.mode:<br />
●<br />
●<br />
●<br />
●<br />
Disabled for all <strong>query</strong> types<br />
Enabled for all <strong>query</strong> types<br />
Use manual sett<strong>in</strong>gs for known <strong>query</strong> types, ignore others<br />
Use manual sett<strong>in</strong>gs for known <strong>query</strong> types, tries to tune others automatically<br />
69
What is next?<br />
Query <strong>optimization</strong><br />
Query execution<br />
SQL <strong>query</strong><br />
Feedback<br />
Result<br />
A priori cost estimation<br />
Histograms<br />
Cost models<br />
Execution statistics<br />
70
Questions<br />
Contacts:<br />
● o.ivanov@postgrespro.ru<br />
● +7 (916) 377-55-63<br />
71
Postgres Professional<br />
http://postgrespro.ru/<br />
+7 495 150 06 91<br />
<strong>in</strong>fo@postgrespro.ru<br />
postgrespro.ru