SQL Performance – 101
1.
Fetch only the required columns. Do not use
select *. Do not fetch the columns from the database and ignore them in the
code.
a.
This is to minimize the network traffic between
the database server and the server on which the data is being processed.
b.
It will reduce the memory utilization on the
database server and on the server where the results are being processed.
2.
Fetch only the required rows. Use the proper
“where” clause to limit the number of rows retrieved. Do not filter out records
in the code.
a.
This will reduce the disk I/O in the database
server.
b.
This will minimize the network traffic between
the database server and the server on which the data is being processed.
c.
This will reduce the memory utilization on the
database server and on the server where the results are being processed.
d.
Use the “LIMIT”/”TOP” clause as available in the
database to limit the number of rows retrieved. These can come very handy for
paginating data on the screen. This can come in handy when working with export
of large amounts of data too.
3.
Join only minimum required tables. As a thumb
rule do not join more than three tables.
a.
The joining of any two tables is typically a
Cartesian product of the two tables. This means that if we join two tables
which have 10000 records each at some stage the database will be dealing with
100,000,000 i.e. 100 million! records, and if we join three such tables we will
making the database handle 1,000,000,000,000 1 trillion records. To avoid
overloading the database server one should try and join minimum number of
tables.
b.
Note that the limit of three has nothing to do
with a limit of 1 trillion records. The latest techniques of fetching database
use hash joins which are much more efficient than Cartesian joins and one will
be better off if the database uses this instead. But one cannot tell the
database to use Hash Joins instead of Cartesian products. One needs to use the
explain plan to determine the joins that the database uses and plan according.
c.
Do not hesitate too much to denormalize the
tables if it is found that one is always ending up joining too many tables.
d.
Leverage the memory of the process which is
processing the records to cache certain master data, list of value types of
data so that one can avoid joins with the tables that contain such data. E.g. a
typical table is most application will be a StatusMaster. Now this table will
have a list of statuses and the description for these statuses. To show the
data to the user on the screen one may end up joining with this table all the
time. This can be avoided if one has kept a hash table of the statuses in
memory of the process. When the records are fetched on the status id is
obtained from the table, but one can look up the hash table to get the
description to be shown to the user.
4.
Do not use Distinct unless absolutely necessary.
a.
When the database finds a distinct clause it has
to fetch the required rows, it has to then sort it on the columns that have
been marked distinct and it has to remove the duplicates. If we are sure that
the values will be distinct then do not use the keyword distinct. If possible
try and change the application and if necessary the database to ensure that the
distinct keyword can be avoided. This is important in table which have large
amount of data.
5.
Do not use Order By unless absolutely necessary.
a.
If an order by clause is specified then the
database needs to fetch the records and then it needs to sort the data. This is
an extra overhead for the database. Do not use the sort unless absolutely
necessary.
6.
Do not use column functions with a column in the
where clause.
a.
If a column function is used along with a column
in the where clause then any index on the column cannot be leveraged and this
will make the query slower. E.g. where toupper(name) = ?. Programmatically
ensure that it is not necessary have a column function in the where clause. If
necessary add an additional column on which the search can be executed.
7.
Similarly do not convert data types from one to
another, especially in where clauses. If the column is an integer then use an
integer value to compare do not use a string value and compare, even though the
database may have the ability to do it. Data Type castings/conversions are
expensive.
8.
Do not use “in” with a sub-query that will
return a large number of values in the “in” clause. Instead use “exists” clause
if possible.
a.
If the in clause has a large list of values then
the database finds it difficult to compare the value from the row to the values
in the list. This slows down the query drastically. Instead use a join to fetch
the data in such scenarios.
9.
Do not use a query like select … from tableb b1
where columna = (select max(columna) from tableb b2 where b1.columnx =
b2.columnx).
a.
This kind of query is typically required to be
executed in tables where we need to pick up the latest record from the history
of records. Instead of this adopt one of the following strategies
i.
Maintain the history in a separate table so that
the main table has only the latest information and so the correlated subquery
is not required.
ii.
Maintain a flag in the table which will indicate
which is the latest record. When a new version of the record is being entered
then update the flag in the earlier record to indicate that it is not the
latest and in the new record mark this flag to indicate that this is the
latest. Include the flag column in the appropriate index.
10.
Query with integers and numbers wherever
possible. Avoid querying by Strings. It is easier to compare two numbers
compared to comparing two string.
11.
Avoid usage of “like”. Especially avoid usage of
wildcard on both sides of a string column. E.g. name like “%smith%”.
Some Design Considerations/Optimizations
12.
To generate sequences for a primary column or
for a unique column where the temporal sequence of these numbers do not matter
and one can tolerate gaps use the auto-increment feature that is available in
most databases. In case of Oracle use “sequences” to generate the values for
these columns.
a.
In Oracle create the sequence with the keywords
“cache” and “noorder”. Along with “cache” specify an integer number which
specifies the number of sequence ids that should be cached. This will be
important for tables where this number needs to be generated very frequently.
13.
Create indexes judiciously. If a table has too
many indexes then the inserts and updates into the tables will become slower as
it now needs to update the indexes too.
14.
When creating an index ensure that the
cardinality of the values in the index is not too small. “Cardinality” of a
column indicates the uniqueness of the values in the column. More unique the
data in the column higher the “Cardinality” of the column. Creating an index in
a column which has a very low “Cardinality” does not help as the database will
not be able to use this column effectively to fetch the rows. E.g. if there is
a column which contains only a “Y”/”N” value or has a list of status codes,
creating an index on such columns will not help query optimization.
15.
Use the clustered index option only if required.
An ideal clustered index should reflect the temporal order in which the records
are to be inserted into the table. If this is not the case then the inserts and
updates can become slow as the data in the table increases.
16.
Use the smallest “standard” datatype for the
columns. This is important because if one uses an unnecessarily large datatype
then one will end up using up unnecessary memory on the database server
everytime the records are processed. E.g. try and use INTEGER or a SMALLINT for
all columns which are Integers. Do not use a bigint because it is too big and
do not use a byte as it may not be supported on all the databases.
Summary
To get a good performance from the database we need to
follow the following norms:
1.
Keep the design simple
2.
Keep the queries simple and ensure that the
database does not have to do something that is not necessary
3.
Fetch only that data that is required
4.
Normalize but do not forget to denormalize where
required
No comments:
Post a Comment