PostgreSQL is a popular database used by many applications, and it has been in active development for 30 years. PostgreSQL (or Postgres) is a powerful database management system (DBMS) that lets you worry less about performance.
Although Postgres has a powerful optimizer that can fix many performance issues for you, sometimes you have to tweak your query yourself to improve its performance. You’ll find out why in the next section.
Many developers tend to write the query first and optimize it later. This is often bad practice: it makes the query more complex, harder to understand, and therefore more difficult to optimize. That’s why it’s better to write the query the right way from the start.
This article talks about specific techniques to improve query performance; however, the most important thing is to understand how a database engine processes the query, what a query planner is, and how it decides what execution plan to use.
In this tutorial, you will learn optimization techniques by imagining yourself as the database engine: think from its perspective, and then decide what your query should be optimized for.
It’s not an easy task to optimize a query, but if you understand the basics of query optimization, you can go far and make your queries performant.
Understanding the nature of SQL
Why would two queries that give the same result drastically differ in execution time? It’s a difficult question.
The source of the problem is that SQL is a declarative language. That means you describe what the result of the query should be and not how the query should be executed.
The reverse is true for imperative languages, where you describe how the result is obtained through a sequence of steps to be executed.
Under the hood, the database optimizer chooses the best way to execute the query. How is “best” determined? Many factors come into play, such as storage structures, indexes, and data statistics.
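As a minimal sketch of what “declarative” means in practice (using a hypothetical orders table):
SELECT customer_id, sum(total)
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;
-- The query states WHAT to compute. Whether the engine scans the whole
-- table, uses an index on order_date, or aggregates in parallel is
-- decided by the optimizer, not by the query text.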
Setting up a Postgres database
First, set up the environment used in this tutorial so that you can follow along with the performance issues that will be tackled.
I’m assuming that you have already installed Postgres on your machine. Begin by creating a new database.
$ sudo -u postgres psql -c "CREATE DATABASE testdb;"
Note: The shell commands in this tutorial were run on a Linux Mint system.
Now, the testdb database has been created.
Setting up Postgres tables
To be able to create tables, you need to be connected to the database through a SQL client or a command line tool like psql.
If you want to access the database from the command line, you can use the following command:
$ sudo -u postgres psql
and to connect to the database, you can use:
postgres=# \c testdb
testdb=#
The prompt postgres=# indicates that my machine is, by default, connected to the postgres database. Once that prompt changes to testdb=#, you are connected to the testdb database.
This example creates two tables: genders and employees. Feel free to follow your own naming conventions; I just prefer plural nouns for tables.
For the rest of this tutorial, I’ll use a SQL client like DBeaver, so don’t be surprised when you no longer see the prompt testdb=#.
Create the genders table:
CREATE TABLE genders (id int, name text);
INSERT INTO genders VALUES (1, 'male'), (2, 'female');
Confirm with SELECT *:
SELECT * FROM genders;
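which should return the two rows you just inserted:
 id |  name
----+--------
  1 | male
  2 | female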
Editor’s note: For simplicity of the examples, only two genders are listed, but, in practice, there will be more.
Create the employees table:
CREATE TABLE employees (id serial, gender_id int, hire_date date);
Now populate 5 million entries in the employees table:
INSERT INTO employees (gender_id) SELECT x % 2 + 1 FROM generate_series(1, 5000000) AS x;
UPDATE employees SET hire_date = '2020-04-01' WHERE id < 101;
UPDATE employees SET hire_date = '2021-04-01' WHERE id BETWEEN 101 AND 1000000;
UPDATE employees SET hire_date = '2022-04-01' WHERE id BETWEEN 1000001 AND 5000000;
Note that the gender_id column populated by the previous query contains 1’s and 2’s, evenly distributed across the 5,000,000 records. However, the hire_date column is distributed non-uniformly so that you can play with different distributions in this tutorial.
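If you want to double-check the distribution before moving on, a quick aggregate shows it (the counts follow from the id ranges above):
SELECT hire_date, count(*)
FROM employees
GROUP BY 1
ORDER BY 1;

 hire_date  |  count
------------+---------
 2020-04-01 |     100
 2021-04-01 |  999900
 2022-04-01 | 4000000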
To see the differences in performance, look at execution time. If you’re using psql, turn on execution time reporting with \timing. The DBeaver SQL client calculates the execution time for you.
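For example, in psql (the time shown is illustrative; yours will differ):
testdb=# \timing
Timing is on.
testdb=# SELECT count(*) FROM employees;
  count
---------
 5000000
(1 row)

Time: 351.402 ms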
This example joins the two tables to count how many employees are female and how many are male:
SELECT g.name, count(g.name)
FROM genders AS g, employees AS e
WHERE g.id = e.gender_id
GROUP BY 1;
This query takes around 10.7 seconds on my machine.
  name  |  count
--------+---------
 female | 2500000
 male   | 2500000
Tracking the execution time is the easiest way to diagnose a performance issue. Next, find out more ways to diagnose and fix performance issues.
Using EXPLAIN
One of the most important tools for debugging performance issues is the EXPLAIN command. It’s a great way to understand what Postgres is doing behind the scenes. The result is the execution plan for the query: a tree of nodes that Postgres uses to execute it.
Run this to see the plan for the previous query:
EXPLAIN SELECT g.name, count(g.name)
FROM genders AS g, employees AS e
WHERE g.id = e.gender_id
GROUP BY 1;
which outputs the following in my case:
Finalize GroupAggregate  (cost=757108.87..757113.87 rows=200 width=40)
  Group Key: g.name
  ->  Sort  (cost=757108.87..757109.87 rows=400 width=40)
        Sort Key: g.name
        ->  Gather  (cost=757049.58..757091.58 rows=400 width=40)
              Workers Planned: 2
              ->  Partial HashAggregate  (cost=756049.58..756051.58 rows=200 width=40)
                    Group Key: g.name
                    ->  Hash Join  (cost=38.58..675062.54 rows=16197408 width=32)
                          Hash Cond: (e.gender_id = g.id)
                          ->  Parallel Seq Scan on employees e  (cost=0.00..101737.73 rows=2550773 width=4)
                          ->  Hash  (cost=22.70..22.70 rows=1270 width=36)
                                ->  Seq Scan on genders g  (cost=0.00..22.70 rows=1270 width=36)
When you explore each node of the previous execution plan, you’ll find different data access algorithms, whether sorting, grouping, hash aggregation, or hash joining. These algorithms are chosen by the Postgres optimizer, which picks the best way to execute the query; the decision depends on the data and the query itself.
But how do you read such an execution plan? Look at the bottom of the plan. You’ll see a sequential scan is done on the genders table, which means the optimizer has chosen a simple sequential scan over all rows of genders.
Each node is followed by parentheses containing some information: the cost, rows, and width. The cost has two numbers: the estimated startup cost, which is the cost incurred before the node can emit its first row (for a sort node, for example, that includes doing the sorting), and the estimated total cost, which assumes all rows available to this node are retrieved.
The rows value defines the estimated number of rows output by this node.
The width indicates the estimated average width (in bytes) of the rows output by this plan node.
How fast the query executes depends on the selectivity ratio: the number of retained rows divided by the total number of rows in the stored table. You want a low selectivity ratio to keep the cost of read operations down. The worst-case scenario is a selectivity ratio of 1: the optimizer does a full scan of all the rows in the table, consecutively reading every row and checking the filter condition on each block.
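At the other extreme, the hire_date = '2020-04-01' filter from the data loaded earlier retains only 100 of 5,000,000 rows, a selectivity ratio of 0.00002. You can compute the ratio directly with an aggregate FILTER clause:
SELECT count(*) FILTER (WHERE hire_date = '2020-04-01')::numeric
       / count(*) AS selectivity_ratio
FROM employees;
-- Returns 0.00002: only 100 of the 5,000,000 rows are retained.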
Note: A block is the unit of storage that is used to transfer data from disk to memory. The block size is fixed, and it’s 8192 bytes in Postgres.
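You can confirm the block size of your Postgres build (8192 bytes is the default; custom builds may differ):
SHOW block_size;

 block_size
------------
 8192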
To understand how the optimizer makes different decisions depending on the selectivity ratio, filter a big chunk of the employees table by the following query:
EXPLAIN SELECT hire_date FROM employees WHERE hire_date = '2022-04-01';
The following execution plan is returned:
Seq Scan on employees  (cost=0.00..152753.19 rows=4924828 width=4)
  Filter: (hire_date = '2022-04-01'::date)
Note that the filter retains a big chunk: around 4 million rows out of the total 5 million, so the selectivity ratio is high. Since the optimizer is smart, it has chosen the sequential scan algorithm.
However, the optimizer can choose a different execution plan for the same query if you just change the filter condition in the WHERE clause:
EXPLAIN SELECT hire_date FROM employees WHERE hire_date = '2020-04-01';
which results in the following execution plan:
Gather  (cost=1000.00..109114.76 rows=1 width=4)
  Workers Planned: 2
  ->  Parallel Seq Scan on employees  (cost=0.00..108114.66 rows=1 width=4)
        Filter: (hire_date = '2020-04-01'::date)
This is because the selectivity ratio is low (100 rows out of 5 million rows returned), so the Postgres optimizer plans two workers to execute the query in parallel.
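The number of workers the planner can use is capped by the max_parallel_workers_per_gather setting, which defaults to 2 and can be changed per session:
SHOW max_parallel_workers_per_gather;  -- 2 by default
-- Setting it to 0 disables parallel plans for the current session:
SET max_parallel_workers_per_gather = 0;
-- Restore the default afterwards:
SET max_parallel_workers_per_gather = 2;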
Note that you can look at the detailed execution time of a query with the EXPLAIN ANALYZE command. Just write it before the SELECT statement.
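For example (keep in mind that unlike plain EXPLAIN, EXPLAIN ANALYZE actually executes the query):
EXPLAIN ANALYZE SELECT hire_date FROM employees WHERE hire_date = '2020-04-01';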
Using indexes
PostgreSQL (like any relational database) allows additional, redundant data structures that speed up table access beyond what a sequential scan offers.
These data structures are called indexes.
Indexes are described as “redundant” because they do not store any new information beyond what is already stored in the table.
What exactly do indexes do? They provide data access paths to the table. They allow you to access the rows without having to read the entire table.
If there is a filtering condition on the table, then the index allows you to read only the blocks that match the condition; these blocks are located through the pointers the index stores. The table’s underlying storage is a heap, which means rows are stored in an unordered fashion, while the index itself is typically a tree structure (a B-tree) that maps key values to row locations.
The cost of this index-based access becomes higher than a full scan if the selectivity ratio is high.
For small values of selectivity, the cost is proportional to the number of blocks that match the condition.
The next section demonstrates how creating an index will speed up queries in different scenarios.
Unique indexes
A unique index supports any primary key or a unique constraint on the table.
A table can have a single primary key (though it’s not a requirement) and multiple unique constraints. Note that the definition of the genders table didn’t specify id as the primary key.
You can alter the genders table with a new primary key:
ALTER TABLE genders ADD CONSTRAINT genders_pkey PRIMARY KEY (id);
If, by mistake, you try to add another primary key to the same table, as in the following query:
ALTER TABLE genders ADD CONSTRAINT genders_pkey2 PRIMARY KEY (name);
you will see the following error:
SQL Error [42P16]: ERROR: multiple primary keys for table "genders" are not allowed
What you can do instead is to create a unique constraint on the name column:
ALTER TABLE genders ADD CONSTRAINT genders_name UNIQUE (name);
For the employees table, add the primary key the same way:
ALTER TABLE employees ADD CONSTRAINT employees_pkey PRIMARY KEY (id);
Note that Postgres automatically creates a unique index to enforce every primary key or unique constraint. That means the id columns of both tables and the name column of genders are already indexed by the constraints above, so there is no need to create additional indexes on those columns by hand.
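You can verify which indexes now exist, including the ones Postgres created automatically for the constraints, through the pg_indexes system view:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'genders';
-- You should see genders_pkey and genders_name listed here.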
It is also possible to create a unique index without formally creating a unique constraint. Just add the keyword UNIQUE to the index statement:
CREATE UNIQUE INDEX genders_name_idx ON genders(name);
Note that if there are any duplicate values in the column, the index will not be created.
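A minimal sketch with a hypothetical scratch table shows that failure mode:
CREATE TABLE colors (id int, name text);
INSERT INTO colors VALUES (1, 'red'), (2, 'red');  -- duplicate name
CREATE UNIQUE INDEX colors_name_idx ON colors(name);
-- ERROR:  could not create unique index "colors_name_idx"
-- DETAIL:  Key (name)=(red) is duplicated.
DROP TABLE colors;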
How about foreign keys? Add one on the gender_id column of the employees table:
ALTER TABLE employees ADD CONSTRAINT employees_gender_id_fk FOREIGN KEY (gender_id) REFERENCES genders (id);
Unlike primary keys and unique constraints, a foreign key does not create an index automatically. If searches by the foreign key gender_id become slow, you need to explicitly create an index on the column:
CREATE INDEX employees_gender_id ON employees(gender_id);
Originally, the query took 10.7 seconds on my machine:
SELECT g.name, count(g.name)
FROM genders AS g, employees AS e
WHERE g.id = e.gender_id
GROUP BY 1;
Now, after adding the constraints and indexes, the same query executes in around 1.4 seconds on my machine. Here is the EXPLAIN ANALYZE output:
Finalize GroupAggregate  (cost=125538.79..125538.84 rows=2 width=40) (actual time=1349.107..1351.767 rows=2 loops=1)
  Group Key: g.name
  ->  Sort  (cost=125538.79..125538.80 rows=4 width=40) (actual time=1349.101..1351.760 rows=6 loops=1)
        Sort Key: g.name
        Sort Method: quicksort  Memory: 25kB
        ->  Gather  (cost=125538.33..125538.75 rows=4 width=40) (actual time=1349.080..1351.742 rows=6 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Partial HashAggregate  (cost=124538.33..124538.35 rows=2 width=40) (actual time=1343.562..1343.563 rows=2 loops=3)
                    Group Key: g.name
                    ->  Hash Join  (cost=1.04..114121.66 rows=2083333 width=32) (actual time=0.097..790.907 rows=1666667 loops=3)
                          Hash Cond: (e.gender_id = g.id)
                          ->  Parallel Seq Scan on employees e  (cost=0.00..97063.33 rows=2083333 width=4) (actual time=0.030..302.195 rows=1666667 loops=3)
                          ->  Hash  (cost=1.02..1.02 rows=2 width=36) (actual time=0.018..0.018 rows=2 loops=3)
                                Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                ->  Seq Scan on genders g  (cost=0.00..1.02 rows=2 width=36) (actual time=0.012..0.013 rows=2 loops=3)
Planning time: 0.323 ms
Execution time: 1351.849 ms
Note that although there’s now an index for the genders table, the optimizer still decides on a sequential scan, as this part of the plan indicates:
->  Seq Scan on genders g  (cost=0.00..1.02 rows=2 width=36) (actual time=0.012..0.013 rows=2 loops=3)
This happens because the genders table holds only two rows: reading it directly is cheaper than going through an index, and the optimizer is smart enough to choose the full scan. Also notice how much better the row estimates are now (rows=2 for genders instead of the earlier estimate of 1270), which lets the planner cost the join accurately.
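If you’re curious about the alternative the optimizer rejected, you can temporarily discourage sequential scans for the session (a diagnostic knob only, not a production setting):
SET enable_seqscan = off;
EXPLAIN SELECT * FROM genders WHERE id = 1;
-- The plan will typically switch to an index scan on genders_pkey,
-- with a higher estimated cost than the plain seq scan.
SET enable_seqscan = on;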
Partial indexes
One of the best features of PostgreSQL is partial indexes. Partial indexes are indexes that only cover a subset of the table.
For the comparatively rare hire date of April 1, 2020, create a partial index on the hire_date column and run the query again:
CREATE INDEX employees_hire_date_2020 ON employees(hire_date) WHERE hire_date = '2020-04-01';
SELECT hire_date FROM employees WHERE hire_date = '2020-04-01';
And here is the execution plan:
Index Only Scan using employees_hire_date_2020 on employees  (cost=0.14..4.16 rows=1 width=4) (actual time=0.022..0.031 rows=100 loops=1)
  Heap Fetches: 0
Planning time: 0.224 ms
Execution time: 0.045 ms
Note that before the index was created, the query took about 400ms to run on my machine.
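Keep in mind that a partial index can only serve queries whose WHERE clause matches the index predicate. Filtering on any other date falls back to the plans you saw earlier:
EXPLAIN SELECT hire_date FROM employees WHERE hire_date = '2022-04-01';
-- employees_hire_date_2020 only covers '2020-04-01', so the planner
-- cannot use it for this date and scans the table instead.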
Conclusion
You’ve seen how to optimize queries in PostgreSQL and fix the performance issues that you’ve been facing. This article covered how the optimizer works and how to create indexes, constraints, and partial indexes.