We were running long-term tests of a new version of our KeyChest (coming quite soon now) when everything seemed to break after 2 days of testing … at 6pm on Friday. I freaked out.
I spent Saturday trying to figure out what was going on, to no avail. We didn’t have enough logging to show what the problem could be, and even after I added more, I couldn’t put my finger on what was happening or why.
Only on Monday evening did I remember something and go to have a look at the AWS Console. Clearly, it’s pretty poor when you have to go to your virtual servers’ administration console to debug your application, but hey, we live in the real world.
I pulled charts of CPU usage. As expected, CPU use went down at the same time the application started to misbehave. Nothing unexpected there: one would think that some threads had died (for whatever reason) and the CPU got a breather. It got more suspicious when the flat line kept getting longer and longer.
Eventually I pulled the CPU credit usage and CPU credit balance charts and finally understood: the application (and its EC2 machine) had run out of CPU credits and Amazon’s throttling kicked in … and KeyChest’s behavior changed.
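To make the "ran out of credits" part concrete, here is a rough back-of-the-envelope sketch in Python. It relies on the definition that one CPU credit is one vCPU running at 100% for one minute. The function names and all the numbers in the example (starting balance, earn rate, utilization) are made up for illustration; they are not KeyChest's actual figures, so check the values for your own instance type.

```python
def credits_spent_per_hour(cpu_utilization, vcpus=1):
    """Credits burned per hour at a given average utilization (0.0 to 1.0).

    One CPU credit = one vCPU at 100% for one minute,
    so a fully busy vCPU burns 60 credits per hour.
    """
    return cpu_utilization * 60 * vcpus


def hours_until_depleted(balance, earned_per_hour, cpu_utilization, vcpus=1):
    """Estimate hours until the credit balance reaches zero.

    Returns None when the instance earns credits at least as fast
    as it spends them, i.e. the balance never runs out.
    """
    net_drain = credits_spent_per_hour(cpu_utilization, vcpus) - earned_per_hour
    if net_drain <= 0:
        return None
    return balance / net_drain


if __name__ == "__main__":
    # Hypothetical inputs: 144 credits in the bank, 6 credits earned
    # per hour, and a steady 15% average CPU load on one vCPU.
    hours = hours_until_depleted(balance=144, earned_per_hour=6,
                                 cpu_utilization=0.15)
    print(hours)  # about 48 hours with these made-up inputs
```

The point of the sketch is that a load only slightly above the baseline drains the balance slowly, so the throttling can show up days after the application starts.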
Soo, CPU credits are for me “the small print”, something I shouldn’t have to worry much about unless I’m really interested in it. I can understand that it’s a “feature” for applications that need short bursts of CPU usage.
What I’d argue, though, is that I’d like to be able to choose whether my application needs these bursts when I create a new EC2 instance, or at least to be able to disable CPU credits somewhere. (I assume that the “baseline performance” is guaranteed, although I’m not sure that’s true. At least it seems to be “continuously” available to your EC2 instances at these levels.)
KeyChest, like many other applications, is a continuously running application with a number of automated background jobs, and CPU credits don’t quite match my “expectations” of how EC2 servers should behave.
Running out of CPU credits has a significant impact on an application like this.
Just another confirmation that everything should go through a long-term test before being launched, as you never know what may happen in a week’s time.