Near its end, the CyberPenguin case study mentions the discovery of “some small accounting inaccuracies”. To be exact, users were occasionally being double-charged for sessions.
If you’ve done much software development, that statement alone should be enough to have you thinking “concurrency issue” or, more specifically, “race condition”. Given that the application involved both a web-based management interface and a once-a-minute scheduled task to handle accounting, the obvious point of conflict was for the problems to result when an operator submitted an update through the web interface while the accounting task was running.
So off I went to ensure that such a conflict wouldn’t cause issues, then sent the revised code off to the client. He reported back the next day that it hadn’t had any effect.
After looking harder at the code, I came up with a way that this conflict could conceivably have still occurred under some contrived circumstance, eliminated that possibility, and sent it off. Still no improvement.
We went through maybe half a dozen rounds of this before I ran out of ways that the web application and accounting task could conflict, no matter how far I stretched the bounds of probability, and finally took a closer look at the application’s logs. Once I started looking at them as a whole, instead of just records of single users, I quickly noticed that these mischarges came in clusters, affecting several users at once, not random individuals. These clusters also tended to hit right on the hour.
Checking in with the client, it turned out that the specific times when the problems occurred matched up with the times when the per-minute rates changed.
To keep the accounting simple, the application deals with rate changes by closing out each active user’s session, charging it at the old rate, and then opening a new session at the new rate. With larger numbers of users logged in, this process was taking more than a minute to complete. The conflict wasn’t between the web interface and the scheduled accounting task, it was between two (or more…) copies of the accounting task! This also explained why I was never able to reproduce the problem in my own development and testing environment, as I had never simulated enough concurrent users in my tests to produce the problem.
Once the correct culprit had been identified, it was a simple enough matter to resolve by adding a check to prevent multiple instances of the accounting task from running concurrently.
The moral of the story:
Even if the cause of an issue seems clearly obvious, don’t forget to fully examine the system to verify that you’re solving the right problem before you spend too much time solving the wrong one.