05.04.2017 Views

Adaptive query optimization in PostgreSQL

2o1dJtD

2o1dJtD

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Adaptive</strong> <strong>query</strong> <strong>optimization</strong><br />

<strong>in</strong> <strong>PostgreSQL</strong><br />

Oleg Ivanov<br />

Postgres Professional<br />

postgrespro.ru


What is <strong>query</strong> <strong>optimization</strong>?<br />

How does <strong>PostgreSQL</strong> optimize queries?<br />

What is adaptive <strong>query</strong> <strong>optimization</strong>?<br />

Mach<strong>in</strong>e learn<strong>in</strong>g and kNN method.<br />

How to use mach<strong>in</strong>e learn<strong>in</strong>g for adaptive <strong>query</strong> <strong>optimization</strong>?<br />

How much can it improve <strong>PostgreSQL</strong> performance?<br />

Implementaion details: AQO.<br />

2


What is <strong>query</strong> <strong>optimization</strong>?<br />

SELECT *<br />

FROM users AS u1, messages AS m, users AS u2<br />

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />

NestedLoopJo<strong>in</strong><br />

MergeJo<strong>in</strong><br />

SeqScan<br />

IndexScan<br />

IndexScan<br />

users u1<br />

messages m<br />

users u2<br />

3


What is <strong>query</strong> <strong>optimization</strong>?<br />

SELECT *<br />

FROM users AS u1, messages AS m, users AS u2<br />

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />

NestedLoopJo<strong>in</strong><br />

NestedLoopJo<strong>in</strong><br />

IndexScan<br />

SeqScan<br />

IndexScan<br />

users u1<br />

messages m<br />

users u2<br />

4


What is <strong>query</strong> <strong>optimization</strong>?<br />

SELECT *<br />

FROM users AS u1, messages AS m, users AS u2<br />

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />

HashJo<strong>in</strong><br />

HashJo<strong>in</strong><br />

SeqScan<br />

SeqScan<br />

SeqScan<br />

users u1<br />

messages m<br />

users u2<br />

5


What is <strong>query</strong> <strong>optimization</strong>?<br />

EXPLAIN SELECT *<br />

FROM users AS u1, messages AS m, users AS u2<br />

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />

QUERY PLAN<br />

-----------------------------------------------------------------------------------<br />

Hash Jo<strong>in</strong> (cost=540.00..439429.44 rows=10003825 width=27)<br />

Hash Cond: (m.receiver_id = u2.id)<br />

-> Hash Jo<strong>in</strong> (cost=270.00..301606.84 rows=10003825 width=23)<br />

Hash Cond: (m.sender_id = u1.id)<br />

-> Seq Scan on messages m (cost=0.00..163784.25 rows=10003825 width=19)<br />

-> Hash (cost=145.00..145.00 rows=10000 width=4)<br />

-> Seq Scan on users u1 (cost=0.00..145.00 rows=10000 width=4)<br />

-> Hash (cost=145.00..145.00 rows=10000 width=4)<br />

-> Seq Scan on users u2 (cost=0.00..145.00 rows=10000 width=4)<br />

(9 rows)<br />

6


What is <strong>query</strong> <strong>optimization</strong>?<br />

Plan execution cost<br />

EXPLAIN SELECT *<br />

FROM users AS u1, messages AS m, users AS u2<br />

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />

QUERY PLAN<br />

-----------------------------------------------------------------------------------<br />

Hash Jo<strong>in</strong> (cost=540.00..439429.44 rows=10003825 width=27)<br />

Hash Cond: (m.receiver_id = u2.id)<br />

-> Hash Jo<strong>in</strong> (cost=270.00..301606.84 rows=10003825 width=23)<br />

Hash Cond: (m.sender_id = u1.id)<br />

-> Seq Scan on messages m (cost=0.00..163784.25 rows=10003825 width=19)<br />

-> Hash (cost=145.00..145.00 rows=10000 width=4)<br />

-> Seq Scan on users u1 (cost=0.00..145.00 rows=10000 width=4)<br />

-> Hash (cost=145.00..145.00 rows=10000 width=4)<br />

-> Seq Scan on users u2 (cost=0.00..145.00 rows=10000 width=4)<br />

(9 rows)<br />

Plan node execution cost<br />

Plan node card<strong>in</strong>ality<br />

7


How does <strong>PostgreSQL</strong> optimize queries?<br />

Cost-based <strong>query</strong> <strong>optimization</strong><br />

System R (1974)<br />

Choose the cheapest plan<br />

among all the possible plans<br />

8


How does <strong>PostgreSQL</strong> optimize queries?<br />

Cost=n s c s +n r c r +n t c t +n i c i +n o c o<br />

c s seq_page_cost 1.0<br />

c r random_page_cost 4.0<br />

c t cpu_Tuple_cost 0.01<br />

c i cpu_Index_tuple_cost 0.005<br />

c o cpu_Operator_cost 0.0025<br />

9


How does <strong>PostgreSQL</strong> optimize queries?<br />

SELECT * FROM users<br />

WHERE age < 25;<br />

SeqScan<br />

Data<br />

Cost=n s c s +n o ⋅c o<br />

n s =N pages<br />

n o =N tuples<br />

IndexScan<br />

10


How does <strong>PostgreSQL</strong> optimize queries?<br />

SELECT * FROM users<br />

WHERE age < 25;<br />

SeqScan<br />

Data<br />

Cost=n s c s +n o ⋅c o<br />

n s =N pages<br />

n o =N tuples<br />

IndexScan<br />

Index<br />

Data<br />

Cost=n r ⋅c r<br />

n r =Card<strong>in</strong>ality<br />

11


How does <strong>PostgreSQL</strong> optimize queries?<br />

SELECT * FROM users<br />

WHERE age < 25;<br />

14000<br />

12000<br />

10000<br />

8000<br />

6000<br />

4000<br />

2000<br />

0<br />

0 – 4<br />

5 – 9<br />

15 - 19 25 - 29 35 - 39 45 - 49 55 - 59 65 - 69 75 - 79 85 - 89 95 – 99<br />

10 - 14 20 - 24 30 - 34 40 - 44 50 - 54 60 - 64 70 - 74 80 - 84 90 - 94 100 or more<br />

Card<strong>in</strong>ality<br />

Age<br />

12


How does <strong>PostgreSQL</strong> optimize queries?<br />

SELECT *<br />

FROM users AS u1, messages AS m, users AS u2<br />

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />

HashJo<strong>in</strong><br />

HashJo<strong>in</strong><br />

SeqScan<br />

SeqScan<br />

SeqScan<br />

users u1<br />

messages m<br />

users u2<br />

13


How does <strong>PostgreSQL</strong> optimize queries?<br />

SELECT *<br />

FROM users AS u1, messages AS m, users AS u2<br />

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />

10000000<br />

SeqScan<br />

HashJo<strong>in</strong><br />

10000000<br />

SeqScan<br />

HashJo<strong>in</strong><br />

10000 10000000<br />

10000<br />

SeqScan<br />

users u1<br />

messages m<br />

users u2<br />

14


How does <strong>PostgreSQL</strong> optimize queries?<br />

SELECT *<br />

FROM users AS u1, messages AS m, users AS u2<br />

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;<br />

10000000<br />

SeqScan<br />

HashJo<strong>in</strong><br />

301607<br />

10000000<br />

SeqScan<br />

HashJo<strong>in</strong><br />

439429<br />

10000 10000000<br />

10000<br />

SeqScan<br />

145 163784<br />

145<br />

users u1<br />

messages m<br />

users u2<br />

15


How does <strong>PostgreSQL</strong> optimize queries?<br />

Optimization method<br />

HashJo<strong>in</strong><br />

HashJo<strong>in</strong><br />

SeqScan SeqScan SeqScan<br />

users<br />

messages<br />

pictures<br />

Full search<br />

Cost = 439429<br />

MergeJo<strong>in</strong><br />

Plan's cost<br />

estimation<br />

SeqScan<br />

messages<br />

SeqScan<br />

pictures<br />

Cost = 304528<br />

16


How does <strong>PostgreSQL</strong> optimize queries?<br />

Optimization method<br />

HashJo<strong>in</strong><br />

HashJo<strong>in</strong><br />

SeqScan SeqScan SeqScan<br />

users<br />

messages<br />

pictures<br />

Full search<br />

Cost = 439429<br />

MergeJo<strong>in</strong><br />

Plan's cost<br />

estimation<br />

SeqScan<br />

messages<br />

SeqScan<br />

pictures<br />

Cost = 304528<br />

17


How does <strong>PostgreSQL</strong> optimize queries?<br />

Optimization method<br />

HashJo<strong>in</strong><br />

HashJo<strong>in</strong><br />

SeqScan SeqScan SeqScan<br />

users<br />

messages<br />

pictures<br />

Dynamic programm<strong>in</strong>g<br />

or<br />

Cost = 439429<br />

MergeJo<strong>in</strong><br />

Plan's cost<br />

estimation<br />

Genetic algorithm<br />

SeqScan<br />

messages<br />

SeqScan<br />

pictures<br />

Cost = 304528<br />

18


Dynamic programm<strong>in</strong>g over subsets<br />

● System R<br />

●<br />

Time complexity: 3n<br />

●<br />

Memory consumption: 2n<br />

● Always f<strong>in</strong>ds the cheapest plan<br />

19


Genetic algorithm<br />

● <strong>PostgreSQL</strong><br />

● Common and flexible method<br />

● Can be stopped on every iteration<br />

● No guarantees<br />

20


How does <strong>PostgreSQL</strong> optimize queries?<br />

Optimization method<br />

HashJo<strong>in</strong><br />

HashJo<strong>in</strong><br />

SeqScan SeqScan SeqScan<br />

users<br />

messages<br />

pictures<br />

Dynamic programm<strong>in</strong>g<br />

or<br />

Cost = 439429<br />

MergeJo<strong>in</strong><br />

Plan's cost<br />

estimation<br />

Genetic algorithm<br />

SeqScan<br />

messages<br />

SeqScan<br />

pictures<br />

Cost = 304528<br />

21


Query clauses<br />

Card<strong>in</strong>ality<br />

estimation<br />

Cost estimation<br />

Information about<br />

stored data<br />

<strong>PostgreSQL</strong> state<br />

22


Dataset:<br />

The TPC BenchmarkDS (TPC-DS)<br />

http://www.tpc.org/tpcds/<br />

23


Error: 300 times<br />

Error: 4 times<br />

Dataset:<br />

The TPC BenchmarkDS (TPC-DS)<br />

http://www.tpc.org/tpcds/<br />

24


Query clauses<br />

How good are <strong>query</strong> optimizers, really?<br />

V. Leis, A. Gubichev, A. Mirchev et al.<br />

Card<strong>in</strong>ality<br />

estimation<br />

Cost estimation<br />

Information about<br />

stored data<br />

<strong>PostgreSQL</strong> state<br />

25


SELECT * FROM users<br />

WHERE age < 25;<br />

14000<br />

12000<br />

10000<br />

8000<br />

6000<br />

4000<br />

2000<br />

0<br />

0 – 4<br />

5 – 9<br />

15 - 19 25 - 29 35 - 39 45 - 49 55 - 59 65 - 69 75 - 79 85 - 89 95 – 99<br />

10 - 14 20 - 24 30 - 34 40 - 44 50 - 54 60 - 64 70 - 74 80 - 84 90 - 94 100 or more<br />

Age<br />

Selectivity≃0.3<br />

Card<strong>in</strong>ality=N tuples ⋅Selectivity<br />

26


SELECT * FROM users<br />

WHERE age < 25 AND city = 'Moscow';<br />

Only selectivities of <strong>in</strong>dividual clauses<br />

(i.e. marg<strong>in</strong>al selectivities)<br />

are known<br />

Selectivity age<br />

= 1 3<br />

Selectivity city<br />

= 1 7<br />

Selectivity age ,city =?<br />

27


1/3<br />

age < 25<br />

city = 'Moscow'<br />

1/7 1/21<br />

28


SELECT * FROM users<br />

WHERE age < 25 AND city = 'Moscow';<br />

Only selectivities of <strong>in</strong>dividual clauses<br />

are known<br />

The clauses are considered to be <strong>in</strong>dependent:<br />

Selectivity age ,city<br />

=Selectivity age<br />

⋅Selectivity city<br />

With the exception of Selectivity 25


age < 25<br />

city = 'Moscow'<br />

1/7 1/7<br />

1/3<br />

30


age < 25<br />

1/3<br />

1/7<br />

city = 'Moscow'<br />

31


SELECT * FROM users<br />

WHERE age < 12 AND married = true;<br />

age < 12<br />

married = true<br />

32


SELECT * FROM users<br />

WHERE age < 12 AND married = false;<br />

age < 12<br />

married = false<br />

33


SELECT * FROM users<br />

WHERE age > 25 AND married = true<br />

AND position = 'CTO';<br />

married = true<br />

position = 'CTO'<br />

age > 25<br />

34


Multidimensional histograms<br />

35


What is adaptive <strong>query</strong> <strong>optimization</strong>?<br />

Query <strong>optimization</strong><br />

Query execution<br />

SQL <strong>query</strong><br />

Feedback<br />

Result<br />

A priori cost estimation<br />

Histograms<br />

Cost models<br />

Execution statistics<br />

36


What is adaptive <strong>query</strong> <strong>optimization</strong>?<br />

Query <strong>optimization</strong><br />

Query execution<br />

SQL <strong>query</strong><br />

Feedback<br />

Result<br />

A priori cost estimation<br />

Histograms<br />

Cost models<br />

Execution statistics<br />

37


Mach<strong>in</strong>e learn<strong>in</strong>g<br />

Objects<br />

Feature 1<br />

Features<br />

Feature 2<br />

Feature 3<br />

Hidden values<br />

Tra<strong>in</strong> set<br />

Test set<br />

38


K Nearest Neighbours method<br />

Age<br />

Salary<br />

25 47 55 32 22 45 28<br />

50 120 100 80 30 90 ?<br />

39


K Nearest Neighbours method<br />

Age<br />

Salary<br />

25 47 55 32 22 45 28<br />

50 120 100 80 30 90 ?<br />

40


Gradient approach to kNN<br />

2/3<br />

1/3<br />

1/3 5/3<br />

Age<br />

Salary<br />

27 47 28<br />

53 103 ?<br />

41


How to use mach<strong>in</strong>e learn<strong>in</strong>g<br />

for adaptive <strong>query</strong> <strong>optimization</strong>?<br />

42


The object is a node with its subtree<br />

u1.id = messages.sender_id<br />

NestedLoopJo<strong>in</strong><br />

MergeJo<strong>in</strong><br />

u2.id = messages.receiver_id<br />

SeqScan<br />

IndexScan<br />

IndexScan<br />

u2.married = true<br />

AND<br />

u2.age < 25<br />

users u1<br />

messages m<br />

users u2<br />

43


Histograms<br />

users.id = messages.receiver_id<br />

AND<br />

users.married = true<br />

AND<br />

users.age < 25<br />

Clauses list<br />

Node card<strong>in</strong>ality is<br />

105 tuples!<br />

<strong>PostgreSQL</strong> estimator<br />

44


Clause selectivities<br />

● 0.0001<br />

● 0.73<br />

● 0.23<br />

users.id = messages.receiver_id<br />

AND<br />

users.married = const<br />

AND<br />

users.age < const<br />

Clauses list<br />

Node card<strong>in</strong>ality is<br />

1017 tuples!<br />

Mach<strong>in</strong>e learn<strong>in</strong>g<br />

45


Mach<strong>in</strong>e learn<strong>in</strong>g problem statement<br />

Object is a plan node<br />

Features<br />

users.id = messages.receiver_id<br />

users.married = const<br />

users.age < const<br />

0.0001<br />

0.73<br />

0.23<br />

Hidden value Node card<strong>in</strong>ality<br />

?<br />

46


Workflow<br />

Query pars<strong>in</strong>g<br />

Query <strong>optimization</strong><br />

Query execution<br />

Query execution statistics<br />

Mach<strong>in</strong>e learn<strong>in</strong>g<br />

Card<strong>in</strong>ality estimation<br />

Learn<strong>in</strong>g<br />

Mach<strong>in</strong>e learn<strong>in</strong>g data<br />

47


Theoretical properties<br />

●<br />

●<br />

●<br />

Will it converge?<br />

Yes, <strong>in</strong> the f<strong>in</strong>ite number of steps<br />

How fast will it converge?<br />

Don't know (<strong>in</strong> practice <strong>in</strong> a few steps)<br />

What guarantees on obta<strong>in</strong>ed plans or<br />

regressor do we have?<br />

Predictions are correct for all executed paths<br />

With perfect cost model obta<strong>in</strong>ed plans are not worse<br />

48


How much can it improve <strong>PostgreSQL</strong> performance?<br />

Experimental evaluation<br />

49


Estimation error<br />

50


Estimation error<br />

51


Estimation error<br />

52


Performance improvement<br />

Orig<strong>in</strong>al<br />

<strong>Adaptive</strong><br />

TPC-H fast<br />

+1.3%<br />

TPC-H slow<br />

0 10 20 30 40 50 60<br />

53


Performance improvement<br />

Orig<strong>in</strong>al<br />

<strong>Adaptive</strong><br />

TPС-H fast<br />

TPC-H slow<br />

-4.4%<br />

0 5000 10000 15000 20000<br />

54


Performance improvement<br />

Orig<strong>in</strong>al<br />

<strong>Adaptive</strong><br />

TPC-DS very fast<br />

+12%<br />

TPC-DS fast<br />

TPC-DS normal<br />

TPC-DS slow<br />

TPC-DS very slow<br />

0 2 4 6 8 10 12 14 16 18<br />

55


Performance improvement<br />

Orig<strong>in</strong>al<br />

<strong>Adaptive</strong><br />

TPC-DS very fast<br />

TPC-DS fast<br />

+24%<br />

TPC-DS normal<br />

TPC-DS slow<br />

TPC-DS very slow<br />

0 20 40 60 80 100 120 140<br />

56


Performance improvement<br />

Orig<strong>in</strong>al<br />

<strong>Adaptive</strong><br />

TPC-DS very fast<br />

TPC-DS fast<br />

TPC-DS normal<br />

+41%<br />

TPC-DS slow<br />

TPC-DS very slow<br />

0 50 100 150 200 250 300<br />

57


Performance improvement<br />

Orig<strong>in</strong>al<br />

<strong>Adaptive</strong><br />

TPC-DS very fast<br />

TPC-DS fast<br />

TPC-DS normal<br />

TPC-DS slow<br />

+285%<br />

TPC-DS very slow<br />

0<br />

200<br />

400<br />

600<br />

1000 1400 1800<br />

800 1200 1600<br />

58


Performance improvement<br />

Orig<strong>in</strong>al<br />

<strong>Adaptive</strong><br />

TPC-DS very fast<br />

TPC-DS fast<br />

TPC-DS normal<br />

TPC-DS slow<br />

TPC-DS very slow<br />

0 5000 10000 15000 20000<br />

+115%<br />

59


Maximum acceleration<br />

60


Overheads<br />

Experimental evaluation<br />

Slowdown for genetic algorithm is not more than 2 seconds<br />

Slowdown for dynamic programm<strong>in</strong>g is not more than 30 ms<br />

61


Applicability<br />

Complex analytical queries<br />

with a repeat<strong>in</strong>g pattern.<br />

62


Learn<strong>in</strong>g progress<br />

1.4<br />

TPC-DS 18<br />

1.3<br />

1.2<br />

Execution time, s<br />

1.1<br />

1.0<br />

0.9<br />

0.8<br />

0.7<br />

0.6<br />

0 1 2 3 4 5 6 7 8 9<br />

# iter<br />

63


Learn<strong>in</strong>g progress<br />

10 4<br />

TPC-DS 69<br />

10 3<br />

Execution time, s<br />

10 2<br />

10 1<br />

10 0<br />

0 1 2 3 4 5 6 7 8 9<br />

# iter<br />

64


Learn<strong>in</strong>g progress<br />

1.3<br />

TPC-DS 26<br />

1.2<br />

Execution time, s<br />

1.1<br />

1.0<br />

0.9<br />

0.8<br />

0.7<br />

0 1 2 3 4 5 6 7 8 9<br />

# iter<br />

65


AQO: adaptive <strong>query</strong> <strong>optimization</strong><br />

Current code for vanilla <strong>PostgreSQL</strong> (extension + patch):<br />

https://github.com/tigvarts/aqo<br />

Available <strong>in</strong> Postgres Pro Enterprise<br />

66


AQO: adaptive <strong>query</strong> <strong>optimization</strong><br />

For some queries we don't need AQO.<br />

So we need a mechanism to determ<strong>in</strong>e whether the <strong>query</strong> needs AQO.<br />

67


AQO: adaptive <strong>query</strong> <strong>optimization</strong><br />

Query type is the set of queries, which differ only <strong>in</strong> their constants.<br />

Query type:<br />

SELECT * FROM users WHERE age > const AND city = const;<br />

Queries:<br />

SELECT * FROM users WHERE age > 18 AND city = 'Moscow';<br />

SELECT * FROM users WHERE age > 65 AND city = 'Kostroma';<br />

…<br />

68


AQO: adaptive <strong>query</strong> <strong>optimization</strong><br />

aqo.mode:<br />

●<br />

●<br />

●<br />

●<br />

Disabled for all <strong>query</strong> types<br />

Enabled for all <strong>query</strong> types<br />

Use manual sett<strong>in</strong>gs for known <strong>query</strong> types, ignore others<br />

Use manual sett<strong>in</strong>gs for known <strong>query</strong> types, tries to tune others automatically<br />

69


What is next?<br />

Query <strong>optimization</strong><br />

Query execution<br />

SQL <strong>query</strong><br />

Feedback<br />

Result<br />

A priori cost estimation<br />

Histograms<br />

Cost models<br />

Execution statistics<br />

70


Questions<br />

Contacts:<br />

● o.ivanov@postgrespro.ru<br />

● +7 (916) 377-55-63<br />

71


Postgres Professional<br />

http://postgrespro.ru/<br />

+7 495 150 06 91<br />

<strong>in</strong>fo@postgrespro.ru<br />

postgrespro.ru

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!