Wednesday, February 12, 2014

How to make SQL Perform



SQL Performance – 101

1.       Fetch only the required columns. Do not use select *. Do not fetch the columns from the database and ignore them in the code.
a.       This is to minimize the network traffic between the database server and the server on which the data is being processed.
b.      It will reduce the memory utilization on the database server and on the server where the results are being processed.
2.       Fetch only the required rows. Use the proper “where” clause to limit the number of rows retrieved. Do not filter out records in the code.
a.       This will reduce the disk I/O in the database server.
b.      This will minimize the network traffic between the database server and the server on which the data is being processed.
c.       This will reduce the memory utilization on the database server and on the server where the results are being processed.
d.      Use the “LIMIT”/”TOP” clause as available in the database to limit the number of rows retrieved. These can come very handy for paginating data on the screen. This can come in handy when working with export of large amounts of data too.
3.       Join only minimum required tables. As a thumb rule do not join more than three tables.
a.       The joining of any two tables is typically a Cartesian product of the two tables. This means that if we join two tables which have 10000 records each at some stage the database will be dealing with 100,000,000 i.e. 100 million! records, and if we join three such tables we will making the database handle 1,000,000,000,000 1 trillion records. To avoid overloading the database server one should try and join minimum number of tables.
b.      Note that the limit of three has nothing to do with a limit of 1 trillion records. The latest techniques of fetching database use hash joins which are much more efficient than Cartesian joins and one will be better off if the database uses this instead. But one cannot tell the database to use Hash Joins instead of Cartesian products. One needs to use the explain plan to determine the joins that the database uses and plan according.
c.       Do not hesitate too much to denormalize the tables if it is found that one is always ending up joining too many tables.
d.      Leverage the memory of the process which is processing the records to cache certain master data, list of value types of data so that one can avoid joins with the tables that contain such data. E.g. a typical table is most application will be a StatusMaster. Now this table will have a list of statuses and the description for these statuses. To show the data to the user on the screen one may end up joining with this table all the time. This can be avoided if one has kept a hash table of the statuses in memory of the process. When the records are fetched on the status id is obtained from the table, but one can look up the hash table to get the description to be shown to the user.
4.       Do not use Distinct unless absolutely necessary.
a.       When the database finds a distinct clause it has to fetch the required rows, it has to then sort it on the columns that have been marked distinct and it has to remove the duplicates. If we are sure that the values will be distinct then do not use the keyword distinct. If possible try and change the application and if necessary the database to ensure that the distinct keyword can be avoided. This is important in table which have large amount of data.
5.       Do not use Order By unless absolutely necessary.
a.       If an order by clause is specified then the database needs to fetch the records and then it needs to sort the data. This is an extra overhead for the database. Do not use the sort unless absolutely necessary.
6.       Do not use column functions with a column in the where clause.
a.       If a column function is used along with a column in the where clause then any index on the column cannot be leveraged and this will make the query slower. E.g. where toupper(name) = ?. Programmatically ensure that it is not necessary have a column function in the where clause. If necessary add an additional column on which the search can be executed.
7.       Similarly do not convert data types from one to another, especially in where clauses. If the column is an integer then use an integer value to compare do not use a string value and compare, even though the database may have the ability to do it. Data Type castings/conversions are expensive.
8.       Do not use “in” with a sub-query that will return a large number of values in the “in” clause. Instead use “exists” clause if possible.
a.       If the in clause has a large list of values then the database finds it difficult to compare the value from the row to the values in the list. This slows down the query drastically. Instead use a join to fetch the data in such scenarios.
9.       Do not use a query like select … from tableb b1 where columna = (select max(columna) from tableb b2 where b1.columnx = b2.columnx).
a.       This kind of query is typically required to be executed in tables where we need to pick up the latest record from the history of records. Instead of this adopt one of the following strategies
                                                   i.      Maintain the history in a separate table so that the main table has only the latest information and so the correlated subquery is not required.
                                                 ii.      Maintain a flag in the table which will indicate which is the latest record. When a new version of the record is being entered then update the flag in the earlier record to indicate that it is not the latest and in the new record mark this flag to indicate that this is the latest. Include the flag column in the appropriate index.
10.   Query with integers and numbers wherever possible. Avoid querying by Strings. It is easier to compare two numbers compared to comparing two string.
11.   Avoid usage of “like”. Especially avoid usage of wildcard on both sides of a string column. E.g. name like “%smith%”.

Some Design Considerations/Optimizations

12.   To generate sequences for a primary column or for a unique column where the temporal sequence of these numbers do not matter and one can tolerate gaps use the auto-increment feature that is available in most databases. In case of Oracle use “sequences” to generate the values for these columns.
a.       In Oracle create the sequence with the keywords “cache” and “noorder”. Along with “cache” specify an integer number which specifies the number of sequence ids that should be cached. This will be important for tables where this number needs to be generated very frequently.
13.   Create indexes judiciously. If a table has too many indexes then the inserts and updates into the tables will become slower as it now needs to update the indexes too.
14.   When creating an index ensure that the cardinality of the values in the index is not too small. “Cardinality” of a column indicates the uniqueness of the values in the column. More unique the data in the column higher the “Cardinality” of the column. Creating an index in a column which has a very low “Cardinality” does not help as the database will not be able to use this column effectively to fetch the rows. E.g. if there is a column which contains only a “Y”/”N” value or has a list of status codes, creating an index on such columns will not help query optimization.
15.   Use the clustered index option only if required. An ideal clustered index should reflect the temporal order in which the records are to be inserted into the table. If this is not the case then the inserts and updates can become slow as the data in the table increases.
16.   Use the smallest “standard” datatype for the columns. This is important because if one uses an unnecessarily large datatype then one will end up using up unnecessary memory on the database server everytime the records are processed. E.g. try and use INTEGER or a SMALLINT for all columns which are Integers. Do not use a bigint because it is too big and do not use a byte as it may not be supported on all the databases.

Summary

To get a good performance from the database we need to follow the following norms:
1.       Keep the design simple
2.       Keep the queries simple and ensure that the database does not have to do something that is not necessary
3.       Fetch only that data that is required
4.       Normalize but do not forget to denormalize where required

No comments: