CPU steal time, also known as stolen CPU, is the percentage of time a virtual CPU within a cloud server involuntarily waits on a physical CPU for its processing time.
In a cloud environment, a hypervisor acts as an interface between the physical server and its virtualized environment. This software layer is installed on the physical hardware and manages all tasks by allocating CPU time to processes such as virtual machines (VMs), networking operations, storage I/O requests, and more.
CPU steal time occurs when processes on a VM are ready to run, but the virtual CPU is still waiting for the hypervisor to allocate a physical CPU to it. This typically happens because the hypervisor is busy servicing another VM.
VMs share their resources—such as RAM, hard disk, CPU cycles, etc.—with other VMs on the same host in a virtual environment. If a VM is one of four identical VMs with the same load and size on a physical server, simple math suggests that each VM should be getting 25% of the CPU cycles.
In most cases, however, the VM will end up consuming more than the allocated cycles. In some rare cases, where it’s not getting the assigned CPU cycles, the steal time will increase.
It’s almost impossible to avoid some amount of steal time while running a system as a VM in a cloud environment. This is because the VM is sharing the physical server with other VMs.
The VM’s steal time can be seen by running the top command in a Linux terminal:
Fig. 1: Output of top command
The top command presents the system summary along with the list of the processes or threads currently being managed by the Linux kernel.
As highlighted in the above screenshot, steal time is labeled as %st. If the value of steal time is 20.0 st, that means 20% of the CPU time is spent waiting for a physical CPU to be allocated.
The maximum value st can have is 100.0. This is the worst-case scenario, where the virtual CPU does nothing but wait for the hypervisor to allocate a physical CPU. Luckily, this situation is very rare.
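On Linux, the same figure can be derived from the cumulative counters on the first line of /proc/stat, where steal is the eighth value after the cpu label. The following is a minimal sketch rather than a definitive implementation: the function names are illustrative, and the two sample lines are made-up values standing in for snapshots taken a few seconds apart.

```python
def parse_cpu_line(line):
    """Return (total_ticks, steal_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already included in user and nice,
    # so only the first eight counters are summed.
    total = sum(values[:8])
    steal = values[7] if len(values) > 7 else 0
    return total, steal


def steal_percent(before, after):
    """Percentage of CPU time stolen between two snapshots of the 'cpu' line."""
    total_before, steal_before = parse_cpu_line(before)
    total_after, steal_after = parse_cpu_line(after)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal_after - steal_before) / elapsed


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line from a live Linux system."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    # Hypothetical snapshots; on a real VM, call read_cpu_line() twice with a pause in between.
    before = "cpu  1000 0 500 8000 100 0 50 350 0 0"
    after = "cpu  1200 0 600 8500 100 0 50 550 0 0"
    print(f"{steal_percent(before, after):.1f}% steal")  # prints "20.0% steal"
```

Because the counters are cumulative since boot, a single reading says little on its own; the difference between two readings is what corresponds to the st value shown by top.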
There are two major causes of high CPU steal time:
When load-heavy processes are run on a VM, the CPU cycles allocated to it may not be enough to handle the workload.
Cloud server providers may oversubscribe the physical server with VMs, making it impossible for the physical CPUs to serve every VM's processes in time.
Unfortunately, it’s difficult to figure out which of these two cases a situation falls under just by looking at the steal time. There are other factors to consider. However, comparing identical VMs with similar workloads running on different hosts may make it possible to tell the two cases apart.
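One way to make that comparison concrete is sketched below. It rests on a heuristic that is an assumption here, not something steal time itself proves: if every identical VM shows high steal, the workload probably needs more CPU than the VMs are allocated, whereas if only some do, their hosts are probably oversubscribed. The function name, host names, and 10% threshold are illustrative.

```python
def likely_cause(steal_by_host, threshold=10.0):
    """Guess the cause of steal from average steal % of identical VMs, keyed by host.

    Heuristic only (an assumption, not a diagnosis):
      - high steal on every host  -> the workload exceeds the VMs' CPU allocation
      - high steal on some hosts  -> those hosts are likely oversubscribed
    """
    high = sorted(host for host, pct in steal_by_host.items() if pct > threshold)
    if not high:
        return "no significant steal"
    if len(high) == len(steal_by_host):
        return "workload likely exceeds the CPU allocated to the VMs"
    return "likely oversubscribed hosts: " + ", ".join(high)


if __name__ == "__main__":
    # Hypothetical averages for three identical VMs on three different hosts.
    print(likely_cause({"host-a": 2.1, "host-b": 18.4, "host-c": 3.0}))
```

The result is only a starting point for investigation, since other factors can still skew the readings on any single host.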
VMs running on a hypervisor with high steal time can cause significant problems, including:
It is wise to have zero tolerance towards high steal time on a server. Stolen CPU slows machines down, even causing them to stop altogether in extreme circumstances.
If the CPU steal time is lower than 10%, then there’s nothing to worry about, and the application should run smoothly. However, if steal time stays above 10% for around 20-30 minutes, the VM is probably running slower than expected. Steal time that remains high indicates CPU contention, which can reduce the performance of the application.
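The rule of thumb above can be turned into a simple check over periodic readings. This is a minimal sketch assuming one steal sample per minute and a 20-minute window; the function name and parameters are illustrative, and the threshold and window should be tuned to the workload.

```python
def sustained_high_steal(samples, threshold=10.0, window_minutes=20, interval_minutes=1):
    """Return True if every steal % sample in the most recent window exceeds the threshold.

    samples: steal percentages, oldest first, taken every `interval_minutes`.
    """
    needed = window_minutes // interval_minutes
    if needed <= 0 or len(samples) < needed:
        return False  # not enough history to judge
    return all(pct > threshold for pct in samples[-needed:])


if __name__ == "__main__":
    # Hypothetical readings: a short spike, then 20 minutes of sustained steal.
    spike = [2.0] * 25 + [35.0] * 3
    sustained = [2.0] * 5 + [14.0] * 20
    print(sustained_high_steal(spike))      # False: the spike is shorter than the window
    print(sustained_high_steal(sustained))  # True: above 10% for the whole window
```

Requiring the whole window to exceed the threshold avoids alerting on brief spikes, which are normal on shared hosts.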
As discussed above, in most cases CPU steal is caused by poor allocation and insufficient resources leading to an overloaded CPU. Now let’s look at some possible fixes:
When deploying an application to a virtualized computing environment, CPU steal time is a crucial metric to watch, as it can impact the application in multiple ways. By monitoring steal time and identifying its correct cause, you can take appropriate measures to reduce it.