May 6, 2012

Random timeouts between zones in Amazon EC2?


I am having issues with TCP connections between two instances in EC2. At first I thought the problem was in the tools I was using (JRuby on Rails stack + MongoDB), because I saw exceptions such as the following in my code:

A Mongo::OperationFailure occurred in foo#bar:Mongo::OperationFailure
.bundle/jruby/1.8/gems/mongo-1.6.2/lib/mongo/util/tcp_socket.rb:76:in `read'

Thinking this was a software issue, I didn’t come to ServerFault at first. After some research I suspected the IO classes in JRuby were broken, but that turned out to be wrong. I installed Ruby 1.9.3 and moved the entire stack over to it. Sure enough, after a while a similar exception cropped up:

A Errno::ETIMEDOUT occurred in anotherfoo#anotherbar:Connection timed out
mongo (1.6.2) lib/mongo/util/tcp_socket.rb:70:in `readpartial'

I am coming to ServerFault for help because I now believe this might be an inter-zone timeout issue in Amazon’s infrastructure. Can anyone verify that, or suggest how to debug this further? I’m running out of ideas. My app server is in us-east-1a. The MongoDB server is in us-east-1c. Could that be the reason? Why might I be getting these timeouts using a default Amazon Linux AMI (64-bit, XLARGE)?

Asked by imaginative


Newsflash: networking is unreliable. Whether it’s EC2 or local colo, sometimes your network won’t perform as you might like it to. If your code isn’t capable of handling that, you will have problems no matter where you’re hosting.

That being said, EC2 availability zones are geographically dispersed, so it’s unreasonable to expect that the networking will be as reliable as a LAN (or even within the same AZ). Moving things into the same AZ might improve your reliability, but not to the point that you can hope to get away with code that doesn’t take the occasional network hiccup into account. So, fix your code so that it catches appropriate exceptions and retries the failed operation.
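
As a rough illustration of the catch-and-retry advice, here is a minimal sketch in plain Ruby. The helper name with_retries and its options (:max_attempts, :base_delay, :retry_on) are made up for this example and are not part of the mongo gem or of Rails. It retries the block when one of the listed exceptions is raised, waits a little longer before each new attempt, and re-raises once the attempts are used up.

```ruby
# Hypothetical helper (not part of any library): run the block and retry it
# when a transient network error is raised.
#
# Options (all names are assumptions made for this sketch):
#   :max_attempts - total number of tries, including the first one
#   :base_delay   - seconds to wait before the second try; doubles each time
#   :retry_on     - exception classes that are considered transient
def with_retries(options = {})
  max_attempts = options.fetch(:max_attempts, 3)
  base_delay   = options.fetch(:base_delay, 0.5)
  retry_on     = options.fetch(:retry_on, [Errno::ETIMEDOUT, Errno::ECONNRESET])

  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retry_on
    # Out of attempts: let the caller see the original exception.
    raise if attempt >= max_attempts

    # Exponential backoff: base_delay, then 2x, then 4x, ...
    sleep(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

In the app from the question, the block would wrap the MongoDB call, and :retry_on could list the exceptions seen in the traces above (Errno::ETIMEDOUT, and Mongo::OperationFailure if that one also turns out to be transient in your case). Be careful when retrying writes: if the first attempt reached the server before the timeout, a blind retry can apply the same write twice.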

Answered by womble

