Load average: What is it, and what's the best load average for your Linux servers?
If you're using a Linux server, you're probably familiar with the terms load average and system load. Measuring the load average is critical to understanding how your servers are performing: if a server is overloaded, you need to kill or optimize the processes consuming excessive resources, or provision more resources to balance the workload.
But how do you determine if your server has sufficient load capacity, and when should you be worried? Let's dive in and find out.
What is a load average?
The load average is the average system load on a Linux server over a defined period of time. In other words, it represents the demand on the server: the sum of the threads that are running plus the threads waiting to run.
Typically, the top or the uptime command will provide the load average of your server with output that looks like:

10:14:32 up 12 days, 3:42, 2 users, load average: 0.50, 1.50, 3.00

These three numbers are the averages of the system load over the last one, five, and 15 minutes, respectively.
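If you would rather read these values programmatically than parse the output of uptime, Python's standard library exposes the same three numbers. A minimal sketch (os.getloadavg() is available on Linux and macOS):

```python
import os

# Returns the 1-, 5-, and 15-minute load averages as floats --
# the same values reported by uptime and top.
one, five, fifteen = os.getloadavg()
print(f"1 min: {one:.2f}, 5 min: {five:.2f}, 15 min: {fifteen:.2f}")
```

Under the hood, these values come from /proc/loadavg, so any language that can read that file can report them too.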
Before getting into how to measure the load average output and what each of these values mean, let's get into the simplest example: a server with a single core processor.
Breaking down the load
A server with a single core processor is like a single line of customers waiting to get their items billed in a grocery store. During peak hours, there is usually a long line and the waiting time for every individual is also high.
If you're the cashier and want to record the waiting time, one important metric would be the number of people waiting during a particular period of time. If there are no customers waiting, then the wait time is zero. On the other hand, if there is a long line of customers, then the wait time is high.
Applying that to the load average output (0.5, 1.5, 3.0) that we got above:
- 0.5 means minimal waiting time at the counter. Between 0.00 and 1.00, there is no need to worry. Your servers are safe!
- 1.5 means the queue is filling up. If the average gets any higher, things are going to start slowing down.
- 3.00 means there's a considerably long queue, and an extra resource/counter is required to clear it faster.
What you want is a queue/load average value between 0.00 and 1.00. So can we conclude that the ideal load average is 1.00, and anything above that is a call to action to troubleshoot? Well, although that's a safe bet, a more proactive approach is to leave some extra headroom to handle unexpected loads.
Multicores and multiprocessors to the rescue
Are a single quad core processor and a server with four single core processors the same? In terms of load capacity, yes. The main difference between multicore and multiprocessor is that the former refers to a single CPU with multiple cores, while the latter refers to multiple CPUs. To sum up: one quad core equals two dual cores, which equal four single cores.
The load average is relative to the number of cores available in the server and not how they are spread out over CPUs. This means the maximum utilization range is 0-1 for a single core, 0-2 for a dual core, 0-4 for a quad core, 0-8 for an octa-core, and so on.
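This normalization is easy to compute yourself. A small sketch using only standard library calls (dividing by os.cpu_count() gives a per-core figure you can compare against 1.00 regardless of how many cores the server has):

```python
import os

def load_per_core() -> float:
    """Return the 1-minute load average divided by the number of cores."""
    one_min, _, _ = os.getloadavg()
    return one_min / os.cpu_count()

# Below 1.0, the run queue fits within the available cores;
# values approaching or above 1.0 mean threads are waiting to run.
print(f"Per-core load: {load_per_core():.2f}")
```

For example, a raw load of 5.00 gives a per-core value of 1.25 on a quad core (overloaded) but only 0.625 on an octa-core (comfortable), matching the interpretation above.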
Referring to the cashier example again, a load of 1.00 would mean the capacity is just right on a single core processor; while on a dual core processor, a load of 1.50 would mean one line is filled up, and the other line is filling up. Similarly, a load of 5.00 on a quad core processor is something to worry about, while on an octa-core processor, 5.00 is only just filling up, and there is optimum space available.
Role of Site24x7: Monitoring load average
Adding resources for a higher load value might add to your infrastructure costs. It's ideal to manage the load efficiently and maintain an optimum level to avoid server performance degradation issues. Site24x7 Linux Monitoring monitors load averages among over 60 performance metrics and provides the 1, 5, and 15 minute average values in an intuitive and easy-to-understand graph.
Further, you can set thresholds and receive a notification when one is breached. But what if a breach happens in the middle of the night? Site24x7 has a solution for that, too: the monitoring tool provides a set of IT automations for automatic fault resolution.
For example, if the system load threshold is set at 2.90 for a dual core processor, you can upload a server script or add server commands to execute the corrective action automatically when the threshold is breached. This way, without any manual intervention, the issue can be resolved and the mean time to repair (MTTR) is vastly reduced.
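As an illustration of the kind of corrective script you might upload, here is a minimal sketch (the 2.90 threshold comes from the example above; the restart command and service name are placeholders you would replace with your own corrective action):

```python
import os
import subprocess

THRESHOLD = 2.90  # example threshold for a dual core server, as above

def check_and_remediate(threshold: float = THRESHOLD) -> bool:
    """Return True if the 5-minute load average breached the threshold."""
    _, five_min, _ = os.getloadavg()
    if five_min > threshold:
        # Placeholder corrective action -- substitute your own command
        # or script here (e.g., restarting the misbehaving service).
        subprocess.run(["systemctl", "restart", "my-app.service"], check=False)
        return True
    return False
```

The 5-minute average is used here rather than the 1-minute one so a momentary spike doesn't trigger the action; that choice is a judgment call, not a rule.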
Adding more cores might accelerate your server's performance, but it might also add to your infrastructure spending. Consistently monitoring the load average to manage your existing setup efficiently is an ideal alternative. Site24x7 Server Monitoring not only monitors the load average, but also provides complementary fault resolution tools to act before a high load average impacts server performance. Sign up for a 30-day free trial now!
Great example so far
An easy way to understand load average
To kill process! That is the best solution for high load averages! Great guide!
Thank you for the feedback.
I think that the part "To kill process! That is the best solution for high load averages! Great guide!" was ironic... Blindly killing the most resource-consuming process as soon as a server's load average gets too high is just crazy. I think this is one of the worst pages about load average on the internet.
I think that you are mis-reading the document.
They posed the question: What do you do when the system is overloaded?
The answer: You have to either kill processes or add more resources.
They didn't say that you must just kill high-load processes.
The whole intent of the article was to show you how, with proper monitoring, you can right-size your server infrastructure so that you DON'T have to kill processes, whilst also not wasting money on buying server resources that aren't needed.
I think that this page is a little misleading.
It starts by explaining that the output (load average 0.5 1.5 3.0) means the averages over 1, 5, and 15 minutes, and then suddenly it explains that a load average of 0.5 is fine, 1.5 is medium, and 3.0 is problematic. (That is, my system was OK in the last 1 minute, a bit overloaded in the last 5 minutes, and terribly overloaded in the last 15 minutes.) It even suggests that the 3 numbers always form an increasing series.
I agree that this is not the best explanation here.
Average load does NOT only measure CPU, but also other server resources that cause a task to wait.
So slow IO might mean the CPU task is waiting, but the CPU is not overloaded.
Also, an ave. load of say 2 might be a problem for a webserver but perfectly fine for a mail server.
The question of "what is a good ave. load value" is much more complex and is very much dependent on the environment, what the server is used for, how big the budget is, etc.
You can use this article as a rough guide, but a better approach is to measure your current load averages when you know your server is working fine, i.e., no user complaints, web pages load fast, apps run smoothly, etc. Then monitor for significant changes over time.
I always thought that it was some kind of average percentage, now I know :P Thanks ;)