Periodic health checks can help ensure that your Unix systems will be available when they’re needed most. In this post, we’ll look at some aspects of performance that should be included in your system check-ups and some handy commands that provide especially useful insights.

CPU load

Probably the most obvious health check for a Unix/Linux system is to look at its CPU load. This is the heartbeat of a Unix system. A healthy system will have CPU power to spare, and one of the best commands for a quick and easy view of how hard your CPU is working is the top command.

There are a number of measurements to focus on when you use the command. Although it provides a lot of information on your system’s performance, top manages to be surprisingly concise in how it displays the measurements that it reports. In particular, the load average measurements can give you a clear view of how busy the CPU is, though the numbers only cover the last 15 minutes’ worth of activity. Knowing how many processes, on average, are running or waiting for their turn on the processor tells you whether the system is working hard to keep up with demand, and how hard. A load average of 0.50 means that, averaged over the interval, half a process was running or waiting to run; on a single-CPU system, that is a processor that was busy about half the time.

The three figures provided show the load averages over the last one, five, and fifteen minutes, so you get some perspective and can also get a feel for whether the load is getting heavier or lighter. Once these numbers climb to 1.00 per CPU (especially the fifteen-minute average), a system is likely hurting. If the load keeps rising or persists for a considerably longer time, the system’s performance will be noticeably poor. But, again, we’re only looking at 15 minutes’ worth of data.

The top command also displays the number of tasks (196 in the listing below, only one of them running) and usage stats for both memory and swap space. On the system displayed below, swap is not being used at all.
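
One way to apply the load average rule of thumb is to compare a load figure with the number of CPUs on the system. The following is a minimal shell sketch, not something top provides: the function name load_status and the "ok"/"busy" labels are hypothetical, and the cutoff of 1.00 per CPU is an assumption that each CPU can absorb a load of 1.00 before processes start to queue.

```shell
#!/bin/sh
# Hypothetical helper: classify a load average relative to the CPU count.
# Usage: load_status LOAD NCPU   (prints "ok" or "busy")
load_status() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        # Assumption: a load at or above 1.00 per CPU means processes are queuing.
        if (load / ncpu >= 1.0) print "busy"; else print "ok"
    }'
}
```

On Linux, the fifteen-minute figure is the third field of /proc/loadavg and nproc prints the CPU count, so `load_status "$(awk '{print $3}' /proc/loadavg)" "$(nproc)"` would classify the current system.
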
Looking at the third line of the top output below, you’ll see that the CPU is idle more than 99% of the time. This system is obviously only lightly used. The memory and swap stats are shown on the fourth and fifth lines of top’s output. With no swap in use and significant free memory, this system is clearly having an easy day, or at least a very easy 15 minutes. If there were any processes dominating the CPU, we’d see them in the list of tasks shown after the five summary lines. By default, top ranks its process list in order of CPU usage (highest first).

    top - 20:47:17 up  4:25,  3 users,  load average: 0.54, 0.15, 0.05
    Tasks: 196 total,   1 running, 195 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  2017064 total,   662924 free,   448904 used,   905236 buff/cache
    KiB Swap:  3635904 total,  3635904 free,        0 used.  1091240 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     1775 shs       20   0  273460  79668  48676 S  0.3  3.9   0:49.03 compiz
     3811 shs       20   0    9944   3640   3092 R  0.3  0.2   0:00.52 top
        1 root      20   0   27360   6592   5132 S  0.0  0.3   0:02.48 systemd
        2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd
        4 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/0:+
        6 root      20   0       0      0      0 S  0.0  0.0   0:00.02 ksoftirqd/0
        7 root      20   0       0      0      0 S  0.0  0.0   0:00.49 rcu_sched
        8 root      20   0       0      0      0 S  0.0  0.0   0:00.00 rcu_bh
        9 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 migration/0
       10 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 lru-add-dr+
       11 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 watchdog/0
       12 root      20   0       0      0      0 S  0.0  0.0   0:00.00 cpuhp/0
       13 root      20   0       0      0      0 S  0.0  0.0   0:00.00 cpuhp/1
       14 root      rt   0       0      0      0 S  0.0  0.0   0:00.01 watchdog/1
       15 root      rt   0       0      0      0 S  0.0  0.0   0:00.12 migration/1
       16 root      20   0       0      0      0 S  0.0  0.0   0:00.03 ksoftirqd/1
       18 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/1:+

Using the sar command, you can get an idea of whether what you see in your top output has held true for a considerably longer period of time. In the example below, sar has been collecting data every ten minutes for almost an hour and a half.

    stinkbug# sar
    Linux 4.10.0-19-generic (stinkbug)   05/08/2017   _i686_   (2 CPU)

    19:32:20     LINUX RESTART   (2 CPU)

    07:35:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
    07:45:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.80
    07:55:01 PM     all      0.14      0.00      0.02      0.02      0.00     99.82
    08:05:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.81
    08:15:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.80
    08:25:01 PM     all      0.15      0.00      0.02      0.02      0.00     99.82
    08:35:01 PM     all      0.14      0.00      0.02      0.02      0.00     99.83
    08:45:01 PM     all      0.22      0.00      0.06      0.05      0.00     99.67
    08:55:01 PM     all      0.55      0.00      0.70      2.70      0.00     96.05
    Average:        all      0.21      0.00      0.11      0.36      0.00     99.33

In this example, it’s clear that this system is consistently only lightly used. One of the key benefits of sar is that it can collect information around the clock, so you can see how your system is performing even when you’re not available to look.

You can also use it to look at how the system is running right now. In the example below, we’re asking for three data samples, each 5 seconds apart.

    stinkbug# sar -u 5 3
    Linux 4.10.0-19-generic (stinkbug)   05/08/2017   _i686_   (2 CPU)

    09:04:09 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
    09:04:14 PM     all      0.20      0.00      0.20      0.00      0.00     99.60
    09:04:19 PM     all      0.10      0.00      0.20      0.00      0.00     99.70
    09:04:24 PM     all      0.20      0.00      0.10      0.00      0.00     99.70
    Average:        all      0.17      0.00      0.17      0.00      0.00     99.67

Both the top and sar commands shown above provide data on how the CPU on the system is spending its time. While idle 99% of the time or more, the CPU on this system is also spending a small amount of time running user processes (“%user” or “us”) and a small amount on system tasks (“%system” or “sy”). On a busy system, these numbers can help you determine why the system is so busy.

Memory Usage

To look just at memory and swap space, the free command is the most convenient one to use. It displays the same kind of data that top provides, but just the memory stats.

    stinkbug$ free
                 total       used       free     shared    buffers     cached
    Mem:       2074932    1837504     237428          0     523476     815368
    -/+ buffers/cache:     498660    1576272
    Swap:      4192956        112    4192844

If you run the free command with the -m option, the numbers will be expressed in megabytes, which is probably easier on the eyes!

    stinkbug$ free -m
                 total       used       free     shared    buffers     cached
    Mem:          2026       1794        231          0        511        796
    -/+ buffers/cache:        486       1539
    Swap:         4094          0       4094

The take-homes for this system are that swap space is barely being used and a good amount of memory is available: roughly three quarters of it once buffers and cache are discounted (the “-/+ buffers/cache” line).

Paging and swapping

When the memory on a system is in high demand, the system has to use paging and swapping, the mechanisms that move process data out of memory and off to the swap device and back when needed. This allows the system to behave as if it has more physical memory than it does, but comes at some cost in terms of performance. A system that is doing a lot of swapping will likely slow down considerably.

The vmstat command reports this activity. The columns to focus on are si (the amount of memory swapped in from disk per second) and so (the amount of memory swapped out to disk per second). The first line of vmstat output reports averages since the last boot; when you ask for repeated samples, the later lines cover each interval. These numbers are all 0 in the example below, but imagine them populated with numbers with two or three digits.

    stinkbug# vmstat
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 663184 132276 772964    0    0    15     2   16   53  0  0 99  0  0

    stinkbug# vmstat 5 3
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 662936 132284 772996    0    0    15     2   16   53  0  0 99  0  0
     0  0      0 662928 132284 772996    0    0     0     0   27   52  0  0 100  0  0
     0  0      0 662928 132284 772996    0    0     0     1   28   55  0  0 100  0  0

Disk IO

The iostat command (particularly iostat -x) is useful for observing device input/output loading.
Sometimes this information is used to justify changing the system configuration to better balance the load between devices. To make use of this information, you have to be able to translate the space-saving acronyms hovering over the device measurements, like rrqm/s and rkB/s.

rrqm/s, wrqm/s -- number of merged read and write requests queued per second
r/s, w/s -- number of read and write requests per second
rkB/s -- number of kilobytes read from the device per second
wkB/s -- number of kilobytes written to the device per second
avgrq-sz -- average request size (in sectors)
avgqu-sz -- average number of requests waiting in the device’s queue
await -- average time (milliseconds) for I/O requests to be served
r_await, w_await -- average time (milliseconds) for read and write requests to be served
svctm -- average number of milliseconds spent servicing requests
%util -- percentage of elapsed time during which I/O requests were issued to the device (values near 100% suggest the device is saturated)

Of these, avgqu-sz is one of the most important. A low value generally indicates that your system is not heavily loaded.

    stinkbug# iostat -x 5 3
    Linux 4.10.0-19-generic (stinkbug)   05/08/2017   _i686_   (2 CPU)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.24    0.01    0.09    0.33    0.00   99.33

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sda               0.67     0.17    1.75    0.11    29.38     3.46    35.45     0.02   10.78    8.94   41.38   2.93   0.54

Disk space

Disks can fill up fast depending on what’s happening on a system. Be aware of disks that might be getting close to filling up. I’ve often set up the systems I manage to send me warnings when used space reaches particular thresholds, such as 75%, 90%, and 98% full. In the example below, we see a couple of disks that are getting close.

    dragonfly# df
    Filesystem     1K-blocks     Used Available Use% Mounted on
    /dev/sda1       78361192 23185840  51130588  32% /
    /dev/sda2       24797380 22273432   1243972  95% /home
    /dev/sda3       29753588 25503792   2713984  91% /data
    /dev/sda4         295561    21531    258770   8% /boot
    tmpfs             257476        0    257476   0% /dev/shm

The hardware

Don’t depend on the command line to tell you everything you need to know to ensure that the systems you manage are in good shape. Check them from time to time in person. Look for warning lights and fans that might not be working as well as expected. Make sure that critical systems are plugged into UPS devices whenever possible.

Backups

Also remember that usable backups are an important part of system health. A system that cannot be fully resuscitated after a data disaster is not in good shape. Check your backups regularly to ensure that they are usable.

Wrap-up

Being proactive can help you ward off system problems long before they threaten operations. Periodic health checks can also help you to be familiar with how a system is generally performing, and this can help you recognize when a system is undergoing an unusual problem.
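
The disk-space thresholds mentioned in the Disk space section (75%, 90%, and 98% full) could be checked with a short script along these lines. This is a sketch only: the function name disk_warnings and the NOTICE/WARNING/CRITICAL labels are hypothetical, and it assumes POSIX-format df -P output with no spaces in mount point names.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` style output on stdin and print one line
# for each filesystem at or above the 75%, 90%, or 98% thresholds.
# Assumption: mount points contain no spaces, so the mount point is field 6.
disk_warnings() {
    awk 'NR > 1 {                      # skip the header line
        use = $5; sub(/%/, "", use); use += 0
        if      (use >= 98) level = "CRITICAL"
        else if (use >= 90) level = "WARNING"
        else if (use >= 75) level = "NOTICE"
        else next                      # below every threshold: stay quiet
        printf "%s: %s is %d%% full\n", level, $6, use
    }'
}
```

Live data can be piped in with `df -P | disk_warnings`. Fed the df listing shown in the Disk space section, it would report /home (95%) and /data (91%) at the WARNING level and stay silent about the other filesystems.
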