
We narrowed the cause of our really, really slow app down to the method that extracts and returns the contents of files under a directory. After some printlns to see how long each step was taking, it turned out the culprit was Java I/O. The previous co-op used RandomAccessFile to read the file content. Why? I'm not sure. He was checking whether the file had the correct number of lines, then seeking back to the beginning of the file and reading the content. Without RandomAccessFile, the file pointer can only move forward, so it can't go back to the beginning once it reaches the end.

So I changed the program to use BufferedReader, checking the number of lines as the content is read. The result? 1 second to read ~70 files, instead of 40 seconds with RandomAccessFile.
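The replacement code isn't shown here, but a minimal sketch of the single-pass approach looks roughly like this (class and method names are my own, not from our codebase): read with BufferedReader and count lines as the content comes in, so there's no need to seek back to the start.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class FileContents {
    // Read the file once with a BufferedReader, collecting the content and
    // counting the lines in the same pass -- no backward seek required.
    static List<String> readAndCount(Path file, int expectedLines) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        // The old code verified the line count first, then re-read the file;
        // here the check falls out of the single pass for free.
        if (lines.size() != expectedLines) {
            throw new IOException("expected " + expectedLines
                    + " lines, found " + lines.size());
        }
        return lines;
    }
}
```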

While I respect students from other Engineering programs applying for software jobs, employers should be very careful in their selection. I still remember the mess I went through refactoring spaghetti code at the beginning of the term. In fact, this shouldn't be targeted at non-SE/CS applicants; it should be targeted at applicants who have no idea what they're doing. Crappy code is much more destructive than no code at all. At least with no code, the next person can start fresh.


This time the query selects all the female employees who joined a particular team after a certain time. (Suppose the team is huge, like a product group.)

SELECT id FROM employees WHERE gender = 1 AND team = 1 AND join_time > '2008-10-10 00:00:00';

I no longer remember how long this query takes without any indexing. After I added the index (join_time, gender, team), the query did not improve. So I ran an EXPLAIN, and it showed that the index (join_time, gender, team) was considered, but not used. I had to use FORCE INDEX to tell the optimizer to use it. Not that that led to a good result: around 3:45 minutes :( I later found out that the reason the optimizer decided not to use the index is that the percentage of records that are female and from team #1 is too high. MySQL figures it's not worth loading the index and then the actual records, compared to just doing a full table scan.
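For reference, here's roughly what that looked like; the index name is my own guess, since the real one isn't shown above:

```sql
-- Range column first -- this is the ordering that did NOT pay off.
CREATE INDEX idx_time_gender_team ON employees (join_time, gender, team);

-- FORCE INDEX overrides the optimizer's choice to do a full table scan.
SELECT id FROM employees FORCE INDEX (idx_time_gender_team)
WHERE gender = 1 AND team = 1 AND join_time > '2008-10-10 00:00:00';
```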

After the experience of finding when the last employee joined a particular team, I created the index (gender, team, join_time), with all the exact-match fields first. I still have to use FORCE INDEX, but the query dropped to 1:30 minutes!
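Again sketching with an assumed index name, the reordered version:

```sql
-- Equality columns (gender, team) first, range column (join_time) last:
-- the B-tree can narrow to one (gender, team) group, then scan the
-- contiguous join_time range within it.
CREATE INDEX idx_gender_team_time ON employees (gender, team, join_time);

SELECT id FROM employees FORCE INDEX (idx_gender_team_time)
WHERE gender = 1 AND team = 1 AND join_time > '2008-10-10 00:00:00';
```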

Of course, 1:30 is still not desirable. The query is executed every time a detection algorithm runs, which takes around a minute. But the query returns 9 million rows, so I'm not sure what else can be done to speed it up. According to someone on the MySQL forums, 1:30 minutes for retrieving 9 million records is pretty good.

While I was searching for solutions before creating the proper index, I came across table partitioning. The basic idea is to split the table in its storage area so the DBMS knows where to search. There are four ways of partitioning, which I won't go into in detail because you can find them here. So if I create range partitions on join_time, then when I do a range query, MySQL will only search the partitions that contain that range of time. Seems like a cool idea; I'll try it when I get the chance.
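From what I've read, a range-partitioned version of the table would be declared something like this. The partition names and boundaries are made up for illustration:

```sql
-- Hypothetical yearly range partitions on join_time. A query with
-- join_time > '2008-10-10' should only have to touch p2008 and pmax.
ALTER TABLE employees
PARTITION BY RANGE (TO_DAYS(join_time)) (
    PARTITION p2007 VALUES LESS THAN (TO_DAYS('2008-01-01')),
    PARTITION p2008 VALUES LESS THAN (TO_DAYS('2009-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```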

While I absolutely love the feeling of watching a query that used to take 3 minutes get reduced to 0.00 seconds, it really is a headache trying to figure out how indexes work in MySQL. Or any database, for that matter.

Say I have a table named employees. This table is huge, containing 50 million rows, and it keeps growing. I have a query that selects when the last employee was hired for a particular team. (This is not the actual situation; I'm using an analogy.) The query looks like:

SELECT MAX(join_time) FROM employees WHERE team_id = 1;

So the first thing I did was add an index (join_time, team_id). When I EXPLAIN the query above, it says the query optimizer will use the index (join_time, team_id), but the estimated number of rows to scan is still 50 million. And of course, the query still took 3 minutes to execute. So I was wondering why the query would still effectively do a full table scan. It didn't really make sense to me. Perhaps the DB has to make sure the maximum join_time in the index really has team_id = 1. For example, maybe team #1 stopped hiring a long time ago, so the maximum join_time in the index doesn't belong to team #1. That would result in more scanning. But in my case, the maximum join_time does belong to team #1; the DB should be smart enough to look at the first record in the index and stop there. … I'm sure I completely misunderstand how DB indexes work…
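For contrast, here's that first index as DDL (the name is mine). My best guess at why it doesn't help: the index entries are sorted by join_time first, so any one team's entries are scattered through the whole thing.

```sql
-- Ordered by join_time first, then team_id: team #1's entries are
-- interleaved with every other team's, so a filter on team_id alone
-- can't narrow the scan to one contiguous chunk of the index.
CREATE INDEX idx_time_team ON employees (join_time, team_id);
```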

So the next thing I tried was the index (team_id, join_time), and I ran EXPLAIN again.

And ta-da! The EXPLAIN says no indices are used, with “Select tables optimized away” in the Extra column. That means the optimizer decided it only needs to look at one record! The query took 0.00 seconds to execute, which is infinitely faster than before. Hm, can I put “optimized a critical query to run infinitely faster” on my resume…? ;)
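The index that worked, as DDL (again, the name is assumed):

```sql
-- team_id first: all of team #1's entries are contiguous and sorted by
-- join_time, so MAX(join_time) is simply the last entry of that group.
CREATE INDEX idx_team_time ON employees (team_id, join_time);

EXPLAIN SELECT MAX(join_time) FROM employees WHERE team_id = 1;
-- Extra: Select tables optimized away
```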

Nevertheless, that was a stroke of luck. I am still not sure why the first index didn’t work.

