Book Excerpt: Troubleshooting Ubuntu Server
The Official Ubuntu Server Book will help you identify and resolve the open source server's network and hardware issues, including an unresponsive Linux host, memory issues, and network card errors.
In this chapter I'm going to discuss some aspects of my general philosophy on troubleshooting that could be applied to a wide range of problems. Then I will cover a few common problems that you might run into and introduce some tools and techniques to help solve them. By the end of the chapter you should have a head start the next time a problem turns up.
Good Communication Is Critical When Collaborating
If you are part of a team that is troubleshooting a problem, you absolutely must have good communication among team members. That could be as simple as yelling across cubicle walls, or it could mean setting up a chat room. A common problem when a team works an issue is multiple members testing the same hypothesis. With good communication each person can tackle a different hypothesis and report the results. These results can then lead to new hypotheses that can be divided among the team members. One final note: Favor communication methods that allow multiple people to communicate at the same time. This means that often chat rooms work much better than phones for problem solving, since over the phone everyone has to wait for a turn to speak; in a chat room multiple people can communicate at once.
Understand How Systems Work
The more deeply you understand how a system works, the faster you can rule out causes of problems. Over the years I've noticed that when a problem occurs, people first tend to blame the technology they understand the least. At one point in my career, every time a network problem occurred, everyone immediately blamed DNS, even when it appeared obvious (at least to me) that not only was DNS functioning correctly, it never had actually been the cause of any of the problems. One day we decided to hold a lecture to explain how DNS worked and traced an ordinary DNS request from the client to every DNS server and back. Afterward everyone who attended the class stopped jumping to DNS as the first cause of network problems. There are core technologies with which every sysadmin deals on a daily basis, such as TCP/IP networking, DNS, Linux processes, programming, and memory management; it is crucial that you learn about these in as much depth as possible if you want to find a solution to a problem quickly.
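If you want to trace a DNS request yourself the way we did in that lecture, the dig tool (provided on Ubuntu by the dnsutils package) can walk a query from the root name servers down to the authoritative answer. The host name here is just an example; substitute one of your own:

$ dig +trace www.ubuntu.com

Each block of the output shows which name server answered that stage of the lookup, so you can see exactly where a query would stall or fail.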
Document Your Problems And Solutions
Many organizations have as part of their standard practice a postmortem meeting after every production issue. A postmortem allows the team to document the troubleshooting steps they took to arrive at a root cause as well as what solution ultimately fixed the issue. Not only does this help make sure that there is no disagreement about what the root cause is, but when everyone is introduced to each troubleshooting step, it helps make all the team members better problem solvers going forward. When you document your problem-solving steps, you have a great guide you can go to the next time a similar problem crops up so it can be solved that much faster.
Use the Internet, But Carefully
The Internet is an incredibly valuable resource when you troubleshoot a problem, especially if you are able to articulate it in search terms. After all, you are rarely the only person to face a particular problem, and in many cases other people have already come up with the solution. Be careful with your Internet research, though. Often your results are only as good as your understanding of the problem. I've seen many people go off on completely wrong paths to solve a problem because of a potential solution they found on the Internet. After all, a search for "Ubuntu server not on network" will turn up all sorts of completely different problems irrelevant to your issue.
Resist Rebooting
OK, so those of us who have experience with Windows administration have learned over the years that when you have a weird problem, a reboot often fixes it. Resist this "technique" on your Ubuntu servers! I've had servers with uptimes measured in years because most problems found on a Linux machine can be solved without a reboot. The problem with rebooting a machine (besides ruining your uptime) is that if the problem does go away, you may never know what actually caused it. That means you can't solve it for good and will ultimately see the problem again. As attractive as rebooting might be, keep it as your last resort.
Localhost Troubleshooting
While I would say that a majority of problems you will find on a server have some basis in networking, there is still a class of issues that involves only the localhost. What makes this tricky is that some local and networking problems often create the same set of symptoms, and in fact local problems can create network problems and vice versa. In this section I will cover problems that occur specifically on a host and leave issues that impact the network to the next section.
Host Is Sluggish Or Unresponsive
Probably one of the most common problems you will find on a host is that it is sluggish or completely unresponsive. Often this can be caused by network issues, but here I will discuss some local troubleshooting tools you can use to tell the difference between a loaded network and a loaded machine.
When a machine is sluggish, it is often because you have consumed all of a particular resource on the system. The main resources are CPU, RAM, disk I/O, and network (which I will leave to the next section). Overuse of any of these resources can cause a system to bog down to the point that often the only recourse is your last resort: a reboot. If you can log in to the system, however, there are a number of tools you can use to identify the cause.
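Each of those resources has at least one quick command you can use to check it. Here is a rough first-pass sketch; note that iostat comes from the optional sysstat package, so you may need to install it before you can run it:

$ uptime          # CPU: the 1-, 5-, and 15-minute load averages
$ free -m         # RAM: memory and swap usage in megabytes
$ iostat -x 5     # disk: extended I/O statistics every 5 seconds

I'll cover load averages and top in the rest of this section; free and iostat follow the same pattern of telling you which resource is exhausted so you can then hunt for the process responsible.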
System Load
System load average is probably the fundamental metric you start from when troubleshooting a sluggish system. One of the first commands I run when I'm troubleshooting a slow system is uptime:
$ uptime
13:35:03 up 103 days, 8 min, 5 users, load average: 2.03, 20.17, 15.09
The three numbers after the load average, 2.03, 20.17, and 15.09, represent the 1-, 5-, and 15-minute load averages on the machine, respectively. A system load average is equal to the average number of processes in a runnable or uninterruptible state. Runnable processes are either currently using the CPU or waiting to do so, and uninterruptible processes are waiting for I/O. A single-CPU system with a load average of 1 means the single CPU is under constant load. If that single-CPU system has a load average of 4, there is four times as much load as the system can handle, so three out of four processes are waiting for resources. The load average reported on a system is not adjusted based on the number of CPUs you have, so if you have a two-CPU system with a load average of 1, one of your two CPUs is loaded at all times; in other words, you are 50% loaded. So a load of 1 on a single-CPU system is the same as a load of 4 on a four-CPU system in terms of the amount of available resources used.
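Because the load average is not adjusted for the number of CPUs, it helps to know how many CPUs a machine has before you decide whether its load is high. A quick way to count them (the result shown is just an example from a two-CPU host):

$ grep -c ^processor /proc/cpuinfo
2

Divide the load average by that number to get a rough per-CPU load; a load of 2 on this two-CPU machine means both CPUs are kept busy.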
The 1-, 5-, and 15-minute load averages describe the average amount of load over that respective period of time and are valuable when you try to determine the current state of a system. The 1-minute load average will give you a good sense of what is currently happening on a system, so in my previous example you can see that I most recently had a load of 2 over the last minute, but the load had spiked over the last 5 minutes to an average of 20. Over the last 15 minutes the load was an average of 15. This tells me that the machine had been under high load for at least 15 minutes, that the load increased around 5 minutes ago, and that it has since subsided. Let's compare this with a completely different load average:
$ uptime
05:11:52 up 20 days, 55 min, 2 users, load average: 17.29, 0.12, 0.01
In this case both the 5- and 15-minute load averages are low, but the 1-minute load average is high, so I know that this spike in load is relatively recent. Often in this circumstance I will run uptime multiple times in a row (or use a tool like top, which I will discuss in a moment) to see whether the load is continuing to climb or is on its way back down.
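Rather than rerunning uptime by hand, I often let watch rerun it for me every few seconds so I can see the trend; the same three numbers are also available directly from /proc/loadavg (the output below simply continues the example above, and the last two fields show running versus total processes and the most recent process ID):

$ watch -n 5 uptime
$ cat /proc/loadavg
17.29 0.12 0.01 1/152 14209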
What Is A High Load Average?
A fair question to ask is what load average you consider to be high. The short answer is "It depends on what is causing it." Since the load describes the average number of active processes that are using resources, a spike in load could mean a few things. What is important to determine is whether the load is CPU-bound (processes waiting on CPU resources), RAM-bound (specifically, high RAM usage that has moved into swap), or I/O-bound (processes fighting for disk or network I/O).
For instance, if you run an application that generates a high number of simultaneous threads at different points, and all of those threads are launched at once, you might see your load spike to 20, 40, or higher as they all compete for system resources. As they complete, the load might come right back down. In my experience systems seem to be more responsive when under CPU-bound load than when under I/O-bound load. I've seen systems with loads in the hundreds that were CPU-bound, and I could run diagnostic tools on those systems with pretty good response times. On the other hand, I've seen systems with relatively low I/O-bound loads on which just logging in took a minute, since the disk I/O was completely saturated. A system that runs out of RAM resources often appears to have I/O-bound load, since once the system starts using swap storage on the disk, it can consume disk resources and cause a downward spiral as processes slow to a halt.
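One quick way to rule RAM in or out as the culprit is to check whether the system has dipped into swap. The free command shows memory and swap usage; the numbers below are made up to illustrate a host that is deep in swap:

$ free -m
             total       used       free     shared    buffers     cached
Mem:          2007       1988         19          0          4         33
-/+ buffers/cache:       1950         57
Swap:         2047       1780        267

If the Swap "used" column is large and climbing while free memory hovers near zero, much of the load you are seeing is really RAM pressure turning into disk I/O.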
top
One of the first tools I turn to when I need to diagnose high load is top. I have discussed the basics of how to use the top command in Chapter 2, so here I will focus more on how to use its output to diagnose load. The basic steps are to examine the top output to identify what resources you are running out of (CPU, RAM, disk I/O). Once you have figured that out, you can try to identify what processes are consuming those resources the most.
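When a machine is too sluggish for an interactive top session, or you want to capture a snapshot for a postmortem, top can also run in batch mode and exit after one iteration; the file name here is just an example:

$ top -b -n 1 > top-output.txt

The -b flag tells top to run non-interactively and -n 1 tells it to print a single update and quit, which makes the output easy to save or paste into a chat room while you troubleshoot.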