I have been stuck on this topic for a while now: how can I get more details on where the response time is being spent?
My problem is the extreme variance in response times. Sometimes the server takes 5 or 10 seconds or more to respond (especially for the first call). Firebug reports most of this time as “waiting”. When I check localhost/server-status (where the delay occurs as well), most slots are occupied, but half a second later they are free again. I can hardly imagine that there are enough load spikes to explain this behavior.
Another strange thing: according to server-status, requests for 100K JPG images sometimes take 1, 2, or even 10 seconds to complete (column Req). At the same time, PHP scripts that involve some CPU load are handled in 100 ms or less (although others also need 1 or 2 seconds). Requests for other (smaller) GIF or PNG images are even listed with a time of 0 ms.
This is where I am stuck: Is there any way to see what takes 10 seconds to send a simple JPG image?
Thanks for your good ideas!
System: This is an Apache 2 web server on Debian Linux (Squeeze) that mostly delivers PHP-generated pages and images. The server runs on a VPS at a professional German hosting provider. There is no memory swapping on the server (as far as I can see from the stats), and CPU load is not especially high (uptime reports a load average of around 3 that can rise to about 32 under extreme load; I think it is an 8-CPU system). Of course, I can never be sure what the other VPSs on the host are doing.
Special settings: Notably, the server sends all data via SSL. I also reduced the keep-alive timeout to 1 second, because users typically spend a long time on each page (30-60 seconds), and keeping these connections alive after the images are retrieved would quickly exhaust the server’s memory (that is, the 2 GB I may use on the VPS). Due to larger PHP scripts, a typical thread takes up 20 MB of RAM. Therefore there are only 50 server slots (MaxClients), of which 35 support keep-alive.
Material: I created a test page (https://www.soscisurvey.de/example/?debug&password=demo) that is monitored by site24x7.com (it usually responds in 1.4 seconds, but there are regular spikes of up to 20 or 30 seconds). To cross-check the results, I also ran it through Load Impact: http://loadimpact.com/load-test/www.soscisurvey.de-35648bef3b84d3269e1fc7cb11bf1721
Adding this as an answer rather than just a comment, since this is what it turned out to be.
The issue sounded like a disk latency problem. I suspected this for a few reasons:
- Response times varied wildly with no warning signs from the standard load indicators.
- The server is hosted on a VPS; these are frequently oversold and backed by NAS/SAN disks.
- Other attempts to squash the problem were fruitless.
As you are not in control of the hardware, you have limited ways to solve this problem. You can contact the provider and ask them to fix it, use a RAM-backed filesystem or in-memory cache (which you experimented with), or switch providers.
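If you want to check for this kind of latency from inside the VPS, one rough approach is to time small synchronous writes over a period and look at the spread. The following is only a minimal sketch in Python, not something from the original troubleshooting: the function names `sample_fsync_latency` and `summarize` are made up for illustration, and it assumes the directory you pass in lives on the disk you suspect (a `tmpfs` mount would never touch that disk). A large gap between the median and the maximum would be consistent with intermittent storage stalls, although it does not prove the cause.

```python
import os
import statistics
import tempfile
import time


def sample_fsync_latency(directory, samples=20, payload_size=4096, pause=0.0):
    """Time `samples` small write+fsync cycles on a temp file in `directory`.

    Returns a list of durations in seconds. The names and defaults here
    are illustrative assumptions, not part of any standard tool.
    """
    payload = b"x" * payload_size
    durations = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush the data to the storage device
            durations.append(time.perf_counter() - start)
            if pause:
                time.sleep(pause)  # spread the samples out over time
    finally:
        os.close(fd)
        os.remove(path)
    return durations


def summarize(durations):
    """Return the minimum, median, and maximum of the measured durations."""
    ordered = sorted(durations)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Assumption: the system temp directory is on the disk under suspicion.
    result = summarize(
        sample_fsync_latency(tempfile.gettempdir(), samples=20, pause=0.05)
    )
    print({name: f"{value * 1000:.1f} ms" for name, value in result.items()})
```

Running this for longer (more samples, a longer pause) while the slow responses occur makes it easier to see whether the write stalls line up with them.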