The Optimizer is the "brain" brain of the database, interpreting SQL queries and choose the fastest execution method. In this tutorial I will use explain command to show how the optimizer interprets and execute query.
1. What decisions does the optimizer have to make
2. Which scan method ?
CREATE TABLE sample (letter, junk) AS
SELECT substring(relname, 1, 1), repeat('x', 250) FROM pg_class
ORDER BY random();
CREATE INDEX i_sample on sample (letter);
CREATE OR REPLACE FUNCTION lookup_letter(text) RETURNS SETOF text AS $$ BEGIN
RETURN QUERY EXECUTE '
EXPLAIN SELECT letter
FROM sample
WHERE letter = ''' || $1 || '''';
END
$$ LANGUAGE plpgsql;
WITH letters (letter, count) AS ( SELECT letter, COUNT(*)
FROM sample
GROUP BY 1 )
SELECT letter, count, (count * 100.0 / (SUM(count) OVER ()))::numeric(4,1) AS "distribution %" FROM letters
ORDER BY 2 DESC;
2.2 Sequential Scan
2.3 Index Scan
2.4 Bitmap Index Scan
2.5 Demo
WITH letter (letter, count) AS ( SELECT letter, COUNT(*)
FROM sample
GROUP BY 1 )
SELECT letter AS l, count, (count * 100.0 / (SUM(count) OVER ()))::numeric(4,1) AS "distribution %" , (SELECT *
FROM lookup_letter(letter) AS l2
LIMIT 1) AS lookup_letter FROM letter
ORDER BY 2 DESC;
How we force to use Index Scan only ???
SET enable_seqscan = false;
SET enable_bitmapscan = false;
Conclusion : Optimizer refer the lower cost
With Inner Sequential Scan
With Inner Index Scan
CREATE TABLE sample1 (id, junk) AS SELECT oid, repeat('x', 250)
FROM pg_proc
ORDER BY random();
CREATE TABLE sample2 (id, junk) AS SELECT oid, repeat('x', 250)
FROM pg_class
ORDER BY random();
3.1 Nested Loop
EXPLAIN SELECT sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) WHERE sample1.id = 33;
These tables have no indexes and no optimizer statistics
Pseudocode
for (i = 0; i < length(outer); i++)
for (j = 0; j < length(inner); j++)
if (outer[i] == inner[j]) output(outer[i], inner[j]);
3.2 Hash Join
Join the Two Tables with a Looser Restriction
EXPLAIN SELECT sample1.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) WHERE sample2.id > 32;
Pseudocode
for (j = 0; j < length(inner); j++)
hash_key = hash(inner[j]);
append(hash_store[hash_key], inner[j]);
for (i = 0; i < length(outer); i++)
hash_key = hash(outer[i]);
for (j = 0; j < length(hash_store[hash_key]); j++)
if (outer[i] == hash_store[hash_key][j])
output(outer[i], inner[j]);
3.3 Merge Join
EXPLAIN SELECT sample1.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id);
Pseudocode
sort(outer);
sort(inner);
i=0;
j=0;
save_j = 0;
while (i < length(outer))
if (outer[i] == inner[j])
output(outer[i], inner[j]);
if (outer[i] >= inner[j] && j < length(inner))
j++;
if (outer[i] > inner[j])
save_j = j;
else
i++;
j = save_j;
CREATE INDEX i_sample1 on sample1 (id);
CREATE INDEX i_sample2 on sample2 (id);
EXPLAIN SELECT sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) WHERE sample1.id = 33;
for (i = 0; i < length(outer); i++)
index_entry = get_first_match(outer[i])
while (index_entry)
output(outer[i], inner[index_entry]);
index_entry = get_next_match(index_entry);
Query Restrictions Affect Join Usage
EXPLAIN analyze SELECT sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) WHERE sample2.junk ~'^aaa';
All ’junk’ Columns Begin with ’xxx’
EXPLAIN SELECT sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) WHERE sample2.junk ~'^xxx';
Hash join was chosen because many more rows are expected. The smaller table, e.g., sample2, is always hashed.
Without LIMIT, Hash Is Used for this Unrestricted Join
EXPLAIN SELECT sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id)
ORDER BY 1;
LIMIT Can Affect Join Usage
EXPLAIN SELECT sample2.id, sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) ORDER BY 1
LIMIT 1;
Sort is unneeded since an index is being used on the outer side.
LIMIT 100 Switches to Merge Join
EXPLAIN SELECT sample2.id, sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) ORDER BY 1
LIMIT 100;
Merge join is normally used for large joins, but the indexes eliminate the need for sorting both sides band LIMIT reduces the number of index entries that need to be accessed.
LIMIT 1000 Switches Back to Hash Join
EXPLAIN SELECT sample2.id, sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) ORDER BY 1
LIMIT 1000;
VACUUM Causes Merge Join Again
VACUUM sample1, sample2;
EXPLAIN SELECT sample2.id, sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) ORDER BY 1
LIMIT 1000;
VACUUM reduces the cost of the index-only scan by making heap access less likely, so merge join again becomes the cheapest option. The inner index scan has also changed to
No LIMIT Was a Merge Join
EXPLAIN SELECT sample2.id, sample2.junk
FROM sample1 JOIN sample2 ON (sample1.id = sample2.id) ORDER BY 1;
Same Join, Different Plans
|Query Modifier| Plan| | ----------- | ----------- | |No LIMIT| Hash Join| |LIMIT 1| Nested loop join with two index scans| |LIMIT 10| Nested loop join with two index scans| |LIMIT 100| Merge join with two index scans| |LIMIT 1000| Hash join| |VACUUM, LIMIT 1000| Merge join with index scan and sort| |No LIMIT| Merge join with index scan and sort|
The last two are different from previous matching lines because of VACUUM.