This time the query selects all the female employees in a particular team who joined after a certain time. (Suppose the team is huge, like a product group.)

SELECT id FROM employees WHERE gender = 1 AND team = 1 AND join_time > '2008-10-10 00:00:00';

I no longer remember how long this query takes without any indexing. After I added the index (join_time, gender, team), the query did not improve. So I ran an EXPLAIN, which showed that the index (join_time, gender, team) was considered, but not used. So I had to use FORCE INDEX to tell the optimizer to use the index. Not that it led to a good result – around 3:45 minutes :( I later found out that the reason the optimizer decided not to use the index is that the percentage of records that are female employees in team #1 is too high. MySQL thinks it’s not worth loading the index and then the actual records, compared to just doing a full table scan.
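For reference, this is roughly what the check and the override look like. The index name is my own placeholder – substitute whatever the index is actually called:

```sql
-- EXPLAIN lists the index under "possible_keys", but "key" stays
-- NULL when the optimizer decides a full table scan is cheaper.
EXPLAIN SELECT id FROM employees
WHERE gender = 1 AND team = 1 AND join_time > '2008-10-10 00:00:00';

-- Override the optimizer's choice (assuming the index is named idx_jgt):
SELECT id FROM employees FORCE INDEX (idx_jgt)
WHERE gender = 1 AND team = 1 AND join_time > '2008-10-10 00:00:00';
```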

After the experience of finding when the last employee joined a particular team, I created the index (gender, team, join_time) – that is, all the exact-match fields first. I still have to use FORCE INDEX, but the query dropped to 1:30 minutes!
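Creating the reordered index looks something like this (the index name is hypothetical):

```sql
-- Equality columns (gender, team) come first so MySQL can seek
-- straight to the matching prefix, then scan the join_time range.
CREATE INDEX idx_gender_team_join ON employees (gender, team, join_time);
```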

Of course, 1:30 is still not desirable. The query is executed every time a detection algorithm runs, which takes around a minute. But the query returns 9 million rows – I’m not sure what else can be done to speed it up. According to someone on the MySQL forum, 1:30 minutes to retrieve 9 million records seems pretty good.

While I was searching for solutions before creating the proper index, I came across table partitioning. The basic idea is to split the table into separate storage areas so the DBMS knows where to search. There are four ways of partitioning, which I won’t go into detail about because you can find them here. So if I create range partitions on join_time, then when I do a range query, MySQL will only search the partitions that contain that range of time. Seems like a cool idea – I’ll try it when I get the chance.
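A sketch of what a RANGE-partitioned version of the table might look like (the column list is trimmed and the partition boundaries are made up):

```sql
-- Partition rows by the year of join_time; a query with
-- join_time > '2008-10-10' only needs to touch p2008 and pmax.
CREATE TABLE employees (
    id INT NOT NULL,
    gender TINYINT NOT NULL,
    team INT NOT NULL,
    join_time DATETIME NOT NULL
)
PARTITION BY RANGE (YEAR(join_time)) (
    PARTITION p2007 VALUES LESS THAN (2008),
    PARTITION p2008 VALUES LESS THAN (2009),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```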