SQL – Always use aggregate functions if you can

One would wonder what the title means. Well it was a thunder bolt for me when I was trying to optimize some headless scripts. Well we at Saturn do heavily use headless scripts written in php. Some use mysql some use xml and some other use memcache, in fact pretty much all would use memcache. But that is not the situation now.

In an attempt to multi thread a cron job part of optimizations done an year back, instead of sequential processing, we had switched it to single row processing. Which required to have a method getNextRow which was passed a parameter, the primary key of the table. The cron starts with a value 0 (zero) and after each is processed, the value is supplied as the last processed one. The getNextRow had

select ID from [table] WHERE ID > [LAST_ID] ORDER BY ID ASC LIMIT 1;

Well the query is a very innocent looking and a trend to write that way. Why this tickled me is that this particular table though InnoDB had 3581916 rows. We were planning to run 32 threads – were already running 4 processing this but they were not upto the standards that is why the optimization had to happen. The first trial was to switch the code to

select min(ID) from [table] WHERE ID > [LAST_ID];

Pretty smooth eh! But see the profiles and you will know why I wrote this. I tried both with profiling on and mysql show profiles output is as below.

+----------+------------+-------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                         |
+----------+------------+-------------------------------------------------------------------------------+
|        1 | 0.05797450 | select ID from [table] WHERE ID > [LAST_ID] ORDER BY ID ASC LIMIT 1           |
|        2 | 0.00066200 | select MIN(ID) from [table] WHERE ID > [LAST_ID]                              |
+----------+------------+-------------------------------------------------------------------------------+

Wow! the difference is quite high.. and could not imagine how it could have skipped my mind as every clock cycle counts towards our cron completion. Well could not resist doing an explain on both the queries.

mysql> EXPLAIN select ID from [table] WHERE ID > [LAST_ID] ORDER BY ID ASC LIMIT 1;
+----+-------------+----------------------+-------+---------------+---------+---------+------+---------+--------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows    | Extra                    |
+----+-------------+----------------------+-------+---------------+---------+---------+------+---------+--------------------------+
|  1 | SIMPLE      | [table]              | range | PRIMARY       | PRIMARY | 8       | NULL | 1488844 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+---------+--------------------------+
1 row in set (0.01 sec)

mysql> EXPLAIN select MIN(ID) from [table] WHERE ID > [LAST_ID];
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
1 row in set (0.00 sec)

Ah! Now things are more clear.. on the mysql reference docs it says the explanation of “Select tables optimized away”

The query contained only aggregate functions (MIN(), MAX()) that were all resolved using an index, or COUNT(*) for MyISAM, and no GROUP BY clause. The optimizer determined that only one row should be returned.