Adaptive query optimization in PostgreSQL

Adaptive query optimization 

in PostgreSQL 

Oleg Ivanov 

Postgres Professional 

postgrespro.ru

What is query optimization? 

How does PostgreSQL optimize queries? 

What is adaptive query optimization? 

Machine learning and kNN method. 

How to use machine learning for adaptive query optimization? 

How much can it improve PostgreSQL performance? 

Implementaion details: AQO. 

2


SELECT * 

FROM users AS u1, messages AS m, users AS u2 

WHERE u1.id = m.sender_id AND m.receiver_id = u2.id; 

NestedLoopJoin 

MergeJoin 

SeqScan 

IndexScan 

IndexScan 

users u1 

messages m 

users u2 

3


SELECT * 





IndexScan 

SeqScan 

IndexScan 

users u1 

messages m 

users u2 

4


SELECT * 



HashJoin 


SeqScan 

SeqScan 

SeqScan 

users u1 

messages m 

users u2 

5


EXPLAIN SELECT * 



QUERY PLAN 

----------------------------------------------------------------------------------- 

Hash Join (cost=540.00..439429.44 rows=10003825 width=27) 

Hash Cond: (m.receiver_id = u2.id) 

-> Hash Join (cost=270.00..301606.84 rows=10003825 width=23) 

Hash Cond: (m.sender_id = u1.id) 

-> Seq Scan on messages m (cost=0.00..163784.25 rows=10003825 width=19) 

-> Hash (cost=145.00..145.00 rows=10000 width=4) 

-> Seq Scan on users u1 (cost=0.00..145.00 rows=10000 width=4) 



(9 rows) 

6


Plan execution cost 

EXPLAIN SELECT * 



QUERY PLAN 

----------------------------------------------------------------------------------- 

Hash Join (cost=540.00..439429.44 rows=10003825 width=27) 

Hash Cond: (m.receiver_id = u2.id) 

-> Hash Join (cost=270.00..301606.84 rows=10003825 width=23) 

Hash Cond: (m.sender_id = u1.id) 

-> Seq Scan on messages m (cost=0.00..163784.25 rows=10003825 width=19) 





(9 rows) 

Plan node execution cost 

Plan node cardinality 

7


Cost-based query optimization 

System R (1974) 

Choose the cheapest plan 

among all the possible plans 

8


Cost=n s c s +n r c r +n t c t +n i c i +n o c o 

c s seq_page_cost 1.0 

c r random_page_cost 4.0 

c t cpu_Tuple_cost 0.01 

c i cpu_Index_tuple_cost 0.005 

c o cpu_Operator_cost 0.0025 

9


SELECT * FROM users 

WHERE age < 25; 

SeqScan 

Data 

Cost=n s c s +n o ⋅c o 

n s =N pages 

n o =N tuples 

IndexScan 

10




SeqScan 

Data 

Cost=n s c s +n o ⋅c o 

n s =N pages 

n o =N tuples 

IndexScan 

Index 

Data 

Cost=n r ⋅c r 

n r =Cardinality 

11




14000 

12000 

10000 

8000 

6000 

4000 

2000 

0 

0 – 4 

5 – 9 

15 - 19 25 - 29 35 - 39 45 - 49 55 - 59 65 - 69 75 - 79 85 - 89 95 – 99 

10 - 14 20 - 24 30 - 34 40 - 44 50 - 54 60 - 64 70 - 74 80 - 84 90 - 94 100 or more 

Cardinality 

Age 

12


SELECT * 





SeqScan 

SeqScan 

SeqScan 

users u1 

messages m 

users u2 

13


SELECT * 



10000000 

SeqScan 


10000000 

SeqScan 


10000 10000000 

10000 

SeqScan 

users u1 

messages m 

users u2 

14


SELECT * 



10000000 

SeqScan 


301607 

10000000 

SeqScan 


439429 

10000 10000000 

10000 

SeqScan 

145 163784 

145 

users u1 

messages m 

users u2 

15


Optimization method 



SeqScan SeqScan SeqScan 

users 

messages 

pictures 

Full search 

Cost = 439429 


Plan's cost 

estimation 

SeqScan 

messages 

SeqScan 

pictures 

Cost = 304528 

16






users 

messages 

pictures 

Full search 

Cost = 439429 


Plan's cost 

estimation 

SeqScan 

messages 

SeqScan 

pictures 

Cost = 304528 

17






users 

messages 

pictures 

Dynamic programming 

or 

Cost = 439429 


Plan's cost 

estimation 

Genetic algorithm 

SeqScan 

messages 

SeqScan 

pictures 

Cost = 304528 

18

Dynamic programming over subsets 

● System R 

● 

Time complexity: 3n 

● 

Memory consumption: 2n 

● Always finds the cheapest plan 

19


● PostgreSQL 

● Common and flexible method 

● Can be stopped on every iteration 

● No guarantees 

20






users 

messages 

pictures 

Dynamic programming 

or 

Cost = 439429 


Plan's cost 

estimation 


SeqScan 

messages 

SeqScan 

pictures 

Cost = 304528 

21

Query clauses 


estimation 

Cost estimation 

Information about 

stored data 

PostgreSQL state 

22

Dataset: 

The TPC BenchmarkDS (TPC-DS) 

http://www.tpc.org/tpcds/ 

23

Error: 300 times 

Error: 4 times 

Dataset: 

The TPC BenchmarkDS (TPC-DS) 

http://www.tpc.org/tpcds/ 

24

Query clauses 

How good are query optimizers, really? 

V. Leis, A. Gubichev, A. Mirchev et al. 


estimation 

Cost estimation 

Information about 

stored data 

PostgreSQL state 

25



14000 

12000 

10000 

8000 

6000 

4000 

2000 

0 

0 – 4 

5 – 9 

15 - 19 25 - 29 35 - 39 45 - 49 55 - 59 65 - 69 75 - 79 85 - 89 95 – 99 

10 - 14 20 - 24 30 - 34 40 - 44 50 - 54 60 - 64 70 - 74 80 - 84 90 - 94 100 or more 

Age 

Selectivity≃0.3 

Cardinality=N tuples ⋅Selectivity 

26


WHERE age < 25 AND city = 'Moscow'; 

Only selectivities of individual clauses 

(i.e. marginal selectivities) 

are known 

Selectivity age 

= 1 3 

Selectivity city 

= 1 7 

Selectivity age ,city =? 

27

1/3 

age < 25 

city = 'Moscow' 

1/7 1/21 

28


WHERE age < 25 AND city = 'Moscow'; 

Only selectivities of individual clauses 

are known 

The clauses are considered to be independent: 

Selectivity age ,city 

=Selectivity age 

⋅Selectivity city 

With the exception of Selectivity 25

age < 25 


1/7 1/7 

1/3 

30

age < 25 

1/3 

1/7 


31


WHERE age < 12 AND married = true; 

age < 12 

married = true 

32


WHERE age < 12 AND married = false; 

age < 12 

married = false 

33


WHERE age > 25 AND married = true 

AND position = 'CTO'; 

married = true 

position = 'CTO' 

age > 25 

34

Multidimensional histograms 

35


Query optimization 

Query execution 

SQL query 

Feedback 

Result 

A priori cost estimation 

Histograms 

Cost models 

Execution statistics 

36





Feedback 

Result 


Histograms 

Cost models 


37

Machine learning 

Objects 

Feature 1 

Features 

Feature 2 

Feature 3 

Hidden values 

Train set 

Test set 

38

K Nearest Neighbours method 

Age 

Salary 

25 47 55 32 22 45 28 

50 120 100 80 30 90 ? 

39

K Nearest Neighbours method 

Age 

Salary 

25 47 55 32 22 45 28 

50 120 100 80 30 90 ? 

40

Gradient approach to kNN 

2/3 

1/3 

1/3 5/3 

Age 

Salary 

27 47 28 

53 103 ? 

41

How to use machine learning 

for adaptive query optimization? 

42

The object is a node with its subtree 

u1.id = messages.sender_id 



u2.id = messages.receiver_id 

SeqScan 

IndexScan 

IndexScan 

u2.married = true 

AND 

u2.age < 25 

users u1 

messages m 

users u2 

43

Histograms 

users.id = messages.receiver_id 

AND 

users.married = true 

AND 

users.age < 25 

Clauses list 

Node cardinality is 

105 tuples! 

PostgreSQL estimator 

44

Clause selectivities 

● 0.0001 

● 0.73 

● 0.23 


AND 

users.married = const 

AND 

users.age < const 

Clauses list 

Node cardinality is 

1017 tuples! 


45

Machine learning problem statement 

Object is a plan node 

Features 


users.married = const 

users.age < const 

0.0001 

0.73 

0.23 

Hidden value Node cardinality 

? 

46

Workflow 

Query parsing 



Query execution statistics 


Cardinality estimation 

Learning 

Machine learning data 

47

Theoretical properties 

● 

● 

● 

Will it converge? 

Yes, in the finite number of steps 

How fast will it converge? 

Don't know (in practice in a few steps) 

What guarantees on obtained plans or 

regressor do we have? 

Predictions are correct for all executed paths 

With perfect cost model obtained plans are not worse 

48

How much can it improve PostgreSQL performance? 

Experimental evaluation 

49

Estimation error 

50


51


52

Performance improvement 

Original 

Adaptive 

TPC-H fast 

+1.3% 

TPC-H slow 

0 10 20 30 40 50 60 

53




TPС-H fast 

TPC-H slow 

-4.4% 

0 5000 10000 15000 20000 

54




TPC-DS very fast 

+12% 

TPC-DS fast 

TPC-DS normal 

TPC-DS slow 

TPC-DS very slow 

0 2 4 6 8 10 12 14 16 18 

55





TPC-DS fast 

+24% 

TPC-DS normal 

TPC-DS slow 


0 20 40 60 80 100 120 140 

56





TPC-DS fast 

TPC-DS normal 

+41% 

TPC-DS slow 


0 50 100 150 200 250 300 

57





TPC-DS fast 

TPC-DS normal 

TPC-DS slow 

+285% 


0 

200 

400 

600 

1000 1400 1800 

800 1200 1600 

58





TPC-DS fast 

TPC-DS normal 

TPC-DS slow 


0 5000 10000 15000 20000 

+115% 

59

Maximum acceleration 

60

Overheads 

Experimental evaluation 

Slowdown for genetic algorithm is not more than 2 seconds 

Slowdown for dynamic programming is not more than 30 ms 

61

Applicability 

Complex analytical queries 

with a repeating pattern. 

62

Learning progress 

1.4 

TPC-DS 18 

1.3 

1.2 

Execution time, s 

1.1 

1.0 

0.9 

0.8 

0.7 

0.6 

0 1 2 3 4 5 6 7 8 9 

# iter 

63


10 4 

TPC-DS 69 

10 3 


10 2 

10 1 

10 0 

0 1 2 3 4 5 6 7 8 9 

# iter 

64


1.3 

TPC-DS 26 

1.2 


1.1 

1.0 

0.9 

0.8 

0.7 

0 1 2 3 4 5 6 7 8 9 

# iter 

65

AQO: adaptive query optimization 

Current code for vanilla PostgreSQL (extension + patch): 

https://github.com/tigvarts/aqo 

Available in Postgres Pro Enterprise 

66


For some queries we don't need AQO. 

So we need a mechanism to determine whether the query needs AQO. 

67


Query type is the set of queries, which differ only in their constants. 

Query type: 

SELECT * FROM users WHERE age > const AND city = const; 

Queries: 

SELECT * FROM users WHERE age > 18 AND city = 'Moscow'; 

SELECT * FROM users WHERE age > 65 AND city = 'Kostroma'; 

… 

68


aqo.mode: 

● 

● 

● 

● 

Disabled for all query types 

Enabled for all query types 

Use manual settings for known query types, ignore others 

Use manual settings for known query types, tries to tune others automatically 

69

What is next? 




Feedback 

Result 


Histograms 

Cost models 


70

Questions 

Contacts: 

● o.ivanov@postgrespro.ru 

● +7 (916) 377-55-63 

71

Postgres Professional 

http://postgrespro.ru/ 

+7 495 150 06 91 

info@postgrespro.ru 

postgrespro.ru

Adaptive query optimization in PostgreSQL

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?