
Linux.com

Feature

CLI Magic: Linux troubleshooting tools 101

By M. Shuaib Khan on February 19, 2007 (8:00:00 AM)


When something goes wrong with your Linux-based system, you can try to diagnose it yourself with the many troubleshooting tools bundled with the operating system. Knowing about these tools, and how to effectively use them, can help you overcome many of the common problems on your system. Here's a list of some of the weapons in your arsenal against Linux problems.

Strace

When an application you successfully compiled fails during run time, it usually gives you an error. On a lucky day, the error message might contain details of what went wrong, and give you clues about what to do to fix the problem. But this is not what usually happens. Often, error messages are obscure and of little help in figuring out what went wrong.

Strace can come in handy in such situations. This utility traces the system calls a program uses during its run time. A system call is a Linux kernel function that provides secure access to a system's resources, such as memory, disk, and network.

Strace is easy to use -- just pass the name of the executable you want to run as an argument to the strace application. As an example, check out what output you get when you trace the following simple "Hello, world!" program:

#include <stdio.h>

int main(void)
{
    printf("Hello, world!\n");
    return 0;
}

$gcc -o hello hello.c
$strace ./hello

execve("./hello", ["./hello"], [/* 94 vars */]) = 0
brk(0)                                  = 0x804b000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7eff000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/i686/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/i686/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/i686", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/i686/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/i686/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/i686", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=186839, ...}) = 0
mmap2(NULL, 186839, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7ed1000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 3

.
.
.

write(1, "Hello, world!\n", 14Hello, world!
)         = 14
exit_group(0)                           = ?
Process 6006 detached

In the above output, you can see that running even this simple program takes a good number of system calls: open, read, write, close, and so on. Notice the large number of unsuccessful attempts to open the libc.so.6 library. That's because the run time linker looks in several places to find the library. The only successful attempt is the one in /lib, in the line open("/lib/libc.so.6", O_RDONLY) = 3, where the open system call returns the file descriptor 3 instead of -1, which indicates that the file was opened successfully. If we could somehow make the loader look in /lib first, we could save a lot of unsuccessful calls for the library search. And of course we can, by bringing the string /lib to the beginning of the environment variable LD_LIBRARY_PATH, which the run time linker uses to search for the libraries required by the running program.

$export LD_LIBRARY_PATH=/lib
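Note that the command above replaces whatever LD_LIBRARY_PATH held before. If you would rather move /lib to the front while keeping the existing entries, something along these lines should work. It is only a sketch, and the helper name prepend_dir is made up for this example. The colon is added only when the old value is non-empty, because an empty entry in LD_LIBRARY_PATH is treated as the current directory.

```shell
# Hypothetical helper: print a colon-separated search path with a
# directory placed in front of the existing entries.
#   $1 = directory to put first
#   $2 = current value of the path (may be empty)
prepend_dir() {
    # ${2:+:$2} expands to ":<old value>" only when the old value is
    # non-empty, so no stray empty entry is created.
    printf '%s\n' "$1${2:+:$2}"
}

# Put /lib first and keep whatever was already set.
LD_LIBRARY_PATH=$(prepend_dir /lib "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH
```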

The output of strace can be quite unwieldy when it's dumped to the console. It is common to write it to a file instead with the command's -o option. Another common option is -p, which takes a process ID (PID) and attaches strace to a program that is already running so you can watch the system calls it makes. This is useful for long-running daemons that you cannot restart easily, or that need to be monitored only rarely.
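Once a trace has been saved with -o, ordinary text tools can pick out the interesting lines. The following is a rough sketch rather than a recipe: the helper name failed_calls, the file names, and the PID 1234 are placeholders, and the strace invocations are left as comments because they depend on the program you want to trace.

```shell
# Save a trace to a file instead of the console (run by hand):
#   strace -o hello.trace ./hello
# Attach to a process that is already running (1234 is a placeholder PID):
#   strace -o daemon.trace -p 1234

# Hypothetical helper: print only the calls in a saved trace that
# returned -1, that is, the ones that failed.
failed_calls() {
    grep -e ' = -1 ' -- "$1"
}
```

Running failed_calls hello.trace on the trace shown earlier would list all the unsuccessful attempts to open libc.so.6 and skip the calls that succeeded.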

A nice example of how useful strace can get comes from a user who had installed multimedia codecs, including libdvdcss, which allowed him to play encrypted DVDs. But when he tried to use his movie player to play DVDs, he got strange errors. On tracing the movie player with strace, he figured out that the run time linker was looking in the wrong places for the installed codecs. After searching for the required library and putting it in a directory where the linker could find it, he was able to run the movie player to play his DVDs.

ltrace

ltrace is a sister application of strace. It works just like strace, but instead of tracing the system calls executed during the run time of a program, it traces the dynamic library calls. If we ltrace the previous "Hello, world!" program, here is what we get as the output:

$ltrace ./hello
__libc_start_main(0x80483b4, 1, 0xbfacb0d4, 0x80483f0, 0x80483e0 <unfinished ...>
puts("\001"Hello, world!
)                                                                         = 14
+++ exited (status 0) +++

The output shows that the executable "hello" uses only one library function -- namely "puts" to put the string "Hello, world!\n" on the output console.

ltrace isn't as commonly used as strace. It is preferred when a detailed trace of a program is required, especially when we are interested in the dynamic library functions the program uses, such as malloc(), gethostbyname(), and setenv().

lsof

The lsof tool is used to list all the files open on a Linux system. Remember that in true Unix spirit, almost everything is a file. You access your hardware through files located in /dev, information about CPU, memory, and other devices is located in files under /proc, and network connections, a.k.a. sockets, are also represented as file descriptors.

lsof becomes really handy when you want to know what files a process has currently opened, or which processes are currently acting on a certain file:

$lsof
COMMAND    PID       USER   FD      TYPE     DEVICE     SIZE       NODE NAME
init         1       root  cwd       DIR        8,1     4096          2 /
init         1       root  rtd       DIR        8,1     4096          2 /
init         1       root  txt       REG        8,1   533224    1658100 /sbin/init
init         1       root   10u     FIFO       0,14                2941 /dev/initctl
migration    2       root  cwd       DIR        8,1     4096          2 /
migration    2       root  rtd       DIR        8,1     4096          2 /

lsof lists the running command, its process ID, the user to whom the process belongs, file descriptor of the opened file, type of the file opened, major and minor device numbers of the file, size of the file, node number of its inode, and the name of the file opened or the mount point of the device being acted on.

To list files opened by processes belonging to a particular user, use:

$lsof -u user

To see a list of files opened by a particular process, use:

$lsof -p pid

Sometimes, you are unable to unmount a particular device because the system reports it as busy, even though you think it is not used by any process. To see what process is still using it, use:

$lsof /dev/mount-point

This will give you the list of processes using the device. Kill them, and you are ready to unmount the device.
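The per-process view that lsof -p gives can also be approximated straight from /proc, which is handy on a machine where lsof is not installed. The sketch below assumes a Linux /proc file system. The function name list_open_files is made up for this example.

```shell
# Hypothetical helper: list the open file descriptors of a process by
# reading the symbolic links in /proc/<pid>/fd.
#   $1 = process ID
list_open_files() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # Skip the unexpanded pattern if the process has no readable fd directory.
        [ -e "$fd" ] || [ -L "$fd" ] || continue
        # Print "<descriptor number> -> <file, pipe, or socket it refers to>".
        printf '%s -> %s\n' "${fd##*/}" "$(readlink "$fd")"
    done
}

# Show what the current shell has open.
list_open_files "$$"
```

You can normally inspect only your own processes this way. Looking at processes owned by other users requires root.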

top

Top lists the processes that rank highest on a system at a given moment. The ranking criterion can be CPU consumption, memory consumption, and so on.

$top
top - 18:21:33 up  1:40,  4 users,  load average: 0.30, 0.21, 0.27
Tasks: 155 total,   2 running, 148 sleeping,   0 stopped,   5 zombie
Cpu(s):  6.9%us,  2.7%sy,  0.0%ni, 80.5%id,  9.6%wa,  0.1%hi,  0.1%si,  0.0%st
Mem:    506908k total,   492384k used,    14524k free,    12900k buffers
Swap:  1052248k total,    39836k used,  1012412k free,   144944k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      15   0   744  124   80 S    0  0.0   0:01.37 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
    5 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1

Top can be useful when you want to know what process is consuming how much of a system's resources. In particular, if a certain process is consuming too much memory, you can locate it through top and take appropriate measures to bring it down, if it's not critical.
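For a one-off snapshot that can be saved or piped into other commands, ps can produce a similar ranking. This is a minimal sketch that assumes the procps version of ps found on most Linux distributions, since it relies on its --sort option. The helper name top_mem is made up for this example.

```shell
# Hypothetical helper: show the N processes with the largest resident
# memory (RSS, in kilobytes), biggest first, below a header line.
#   $1 = number of processes to show (defaults to 5)
top_mem() {
    ps -eo pid,user,rss,comm --sort=-rss | head -n "$(( ${1:-5} + 1 ))"
}

top_mem 5
```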

Traceroute

Traceroute is a network troubleshooting tool. For a network packet to reach a remote computer from your machine, it has to go through different routers on the network. Sometimes, even though both the local and the remote machines are functioning properly and connected to the network, they can't communicate with each other because of a problem somewhere in between the two machines. To trace where the packet is dropped on the network, use traceroute:

$traceroute google.com
Hop	(ms)	(ms)	(ms)		IP Address	Host name
1	0	0	0		66.98.244.1	gphou-66-98-244-1.ev1servers.net
2	0	1	0		66.98.241.16	gphou-66-98-241-16.ev1servers.net
.
.
.
13	29	28	28		72.14.232.57	-
14	34	35	36		64.233.175.42	-
15	28	28	29		64.233.167.99	py-in-f99.google.com

The output shows that the packet took 15 hops to reach google.com. For each hop it lists three round-trip times in milliseconds, followed by the IP address and the host name (if available) of the machine that answered.

ping

Ping can help you figure out if a remote machine on the network is up and connected. Ping sends ICMP messages to the remote machine, and prints the details if it gets a reply. Sometimes system administrators disable ICMP replies on their machines, which means that a ping won't get a reply from that particular machine even if it is present on the network, so be sure that the remote machine you're interested in does reply to ICMP messages before assuming that it is down.

$ping google.com
PING google.com (72.14.207.99) 56(84) bytes of data.
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=1 ttl=238 time=265 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=2 ttl=238 time=269 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=3 ttl=238 time=272 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=4 ttl=238 time=263 ms

hexdump

The hexdump utility is useful for seeing the contents of a binary file in a human-readable format, which can be ASCII, hexadecimal, octal, or decimal. For example, to see what the contents of the executable /bin/ls look like in hex and ASCII, use:

$hexdump -C /bin/ls
00000000  7f 45 4c 46 01 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 03 00 01 00 00 00  80 9c 04 08 34 00 00 00  |............4...|
00000020  0c 5c 01 00 00 00 00 00  34 00 20 00 0a 00 28 00  |.\......4. ...(.|
00000030  1f 00 1e 00 06 00 00 00  34 00 00 00 34 80 04 08  |........4...4...|
00000040  34 80 04 08 40 01 00 00  40 01 00 00 05 00 00 00  |4...@...@.......|
.
.
.

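The first column is the offset into the file, the columns in the middle are the contents of the file in hex, and the text between the bars is the same bytes shown as ASCII characters, with a dot standing in for anything unprintable.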

Hexdump is useful for searching text strings within an executable file for which source code might not be available. It can help you locate specific error messages and where they occur in a file.
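One thing to watch for when searching hexdump -C output with grep is that a string can straddle two 16-byte rows and be missed. A rough way around that, sketched below with nothing but tr and grep, is to break the file into runs of printable characters first. The helper name find_text is made up for this example.

```shell
# Hypothetical helper: find a piece of text inside a binary file.
# tr turns every non-printable byte into a newline, so each run of
# printable characters ends up on a line of its own for grep to search.
#   $1 = file to search
#   $2 = fixed text to look for
find_text() {
    LC_ALL=C tr -c '[:print:]' '\n' < "$1" | grep -F -- "$2"
}

# Example (run by hand): look for messages mentioning "cannot" in /bin/ls.
#   find_text /bin/ls 'cannot'
```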

Conclusion

Troubleshooting Linux is an art, but these tools can help you master it. You can read more usage details about these tools on their respective man pages. Remember that knowing how to use a tool is not the same as knowing when to use it. As you encounter different problems and tackle them, you'll eventually learn the art of diagnosing trouble and fixing problems on your Linux system.


Comments

on CLI Magic: Linux troubleshooting tools 101

Note: Comments are owned by the poster. We are not responsible for their content.

htop

Posted by: Anonymous Coward on February 19, 2007 05:57 PM
Nice list.

You'd try htop instead top because is far more complete one.

#

dtrace?

Posted by: Anonymous Coward on February 19, 2007 07:17 PM
i really like solaris' dtrace, is there anything like that for linux / bsd?

dtrace really, really sucks when it comes to debugging.

#

Re:dtrace?

Posted by: Anonymous Coward on February 19, 2007 07:17 PM
i meant strace really sucks when it comes to debugging *sheepish grin*

#

Re:dtrace?

Posted by: Anonymous Coward on February 21, 2007 03:01 AM
Dtrace is reportedly being ported with some success to FreeBSD. Haven't heard of a Linux port, tho, which is a shame, as Dtrace sounds extremely useful.

#

strace

Posted by: Anonymous Coward on February 20, 2007 12:58 AM
Try "strace /bin/ls" in addition to hexdump.

#

Misuse of LD_LIBRARY_PATH

Posted by: Anonymous Coward on February 20, 2007 04:09 AM
Normally the dynamic linker will use /etc/ld.so.cache to find out where libraries are rather than trying each possible path in turn. By setting LD_LIBRARY_PATH you invalidate this cache and force it to search through that and the system path (defined in /etc/ld.so.conf). Instead of setting it on a routine basis you should add the extra library directories to /etc/ld.so.conf (or under /etc/ld.so.conf.d if your distribution supports that).

#

dazuko

Posted by: Anonymous Coward on February 20, 2007 07:32 AM
This is not a distro standard tool but an interesting kernel module to control file access in Linux and FreeBSD. Load the module, run the included example and you'll monitor in real time the access to the files (a functionality similar to the one of "filemon" for M$ systems).

Their site: http://www.dazuko.de/

#

fuser? ldd? gdb? lsof? netstat?

Posted by: Anonymous Coward on February 20, 2007 10:19 AM
Some additions that bear mentioning:

fuser /path/to/file:
list current open handles on a file

ldd /path/to/executable:
Show dynamic linker library resolution for executable. Incredibly useful when handling weird issues due to library path problems etc.

gdb:
duh. The GNU Debugger. Even on non-debug binaries it's incredibly useful in diagnosing crashes (except on Debian, who strip their binaries of all symbol information). Distros with -debuginfo packages make it even more handy.

/proc/$PID
The /proc file system has a lot of very useful info. You can `kill -STOP' a process (or not) and poke around in it's /proc/pid directory for things like open file handles, memory maps, etc.

lsof
List Open Files. Also useful for sockets and all sorts of other magic. Learn this command in detail, it's amazing, and like `top' has a huge amount of functionality.

ps
Almost too obvious to mention, but yet often overlooked in detail. Learn the details of ps, it can do some great stuff. I find the `wchan' field particularly useful as it shows WHERE in the kernel a process is blocked. Top can also be told to show this with an rc file edit.

`ip'
Your networking swiss army knife. `ip route show' is obviously useful, but ip does a lot more, and most of it is worth learning when facing network issues.

`netstat'
Display network status information. Impossibly useful, especially with the `-p' argument (requires root) to show which process is associated with a socket. That said, lsof provides essentially all netstat's functionality and much more besides.

To the person asking about dtrace: No, nothing like that in standard kernels. There are some 3rd party tracing patches of various quality and utility.

Personally I'm interested in finding a tool to list mandatory and advisory locks on a file/dir, so if anyone knows of one I'm all ears. I'm sure it's a standard tool, I just haven't found it.

#

lsof

Posted by: Anonymous Coward on February 20, 2007 10:20 AM
Whoops, lsof was already there. Never mind, it's well worth REALLY stressing how useful that tool is.

#

mtr & tcptraceroute

Posted by: Anonymous Coward on February 20, 2007 10:23 AM
Users of traceroute should look into `mtr' ( a faster, better, more useful traceroute ) and `tcptraceroute' (for tracing through networks that block traceroute's ICMP packets).

#

What a shitty list???

Posted by: Anonymous Coward on February 20, 2007 12:48 PM
I can't believe that you guys printed this! All the mentioned tools are common knowledge for years, I was rather expecting coverage of new tools, like some pointed out in the discussion!

Another lowering of the standards!

#

Re:What a shitty list???

Posted by: Anonymous Coward on February 20, 2007 02:13 PM
You're the shitty one. Not because you're a super duper hacker (who really doesn't know how to hack) doesn't mean the new users don't have the right to know this.

Just like you said, "Another lowering of the standards!". And you're the one who did it.

#

Re:What a shitty list???

Posted by: Anonymous Coward on February 20, 2007 04:28 PM
Uhm, does someone need to explain to you the meaning of "101"?

#

Re:What a shitty list???

Posted by: Anonymous Coward on February 21, 2007 04:11 AM
I don't speak english fluently. What does "101" mean here, please?

#

Re:What a shitty list???

Posted by: Anonymous Coward on February 21, 2007 04:16 AM
101 = "basics" in that context

#

Safe?

Posted by: Anonymous Coward on February 21, 2007 08:00 PM
is "strace" and "ltrace" safe to run on executables?
Will it execute the binary?

Sounds like two good tools that can come in handy for inspecting suspect files such as malware, like virus, worm, trojan, keylogger, botnet, etc.

#

Re:Safe?

Posted by: Anonymous Coward on February 22, 2007 02:58 AM
strace and ltrace aren't really something you want to run on suspect files such as malware/viruses/worms/etc because they only monitor what the binaries do, they don't actually prevent them from doing anything.

Now if you set up a machine thats disconnected from the rest of your network to try and see what they DO do (such as trying to find out what files they would modify to make sure the rest of your systems aren't infected), these would be good for that. Though they're really better for when programs are acting up (like in the past I have used strace on a program that was freezing up midstartup and was able to find that it froze after opening a certain file, IIRC I was able to delete that configuration file then it worked fine).

So to recap, strace/ltrace won't protect you at all, but you MAY get some idea of what the nasties are doing.

#

hexdump does not dump in "Hex and ASCII"

Posted by: Anonymous Coward on February 24, 2007 12:29 AM
ASCII is a code for representing text in bits. In the hexdump example, the ASCII is the hexadecimal stuff on the left. What's on the right is the text that those bits represent if you assume the bits to be ASCII.

Hexdump display is hex and text.

#

oh . . . the terror . . .

Posted by: Anonymous Coward on March 01, 2007 11:34 PM
Last monday I made a complete fool of myself by making tiny typing mistake as root. The consequence was that all the vhost configuration files vanished from the server (for about 50 sites with lots of aliases). Apache was still running though.

This meant that after some sweaty moments googling for a solution. I found one that explained how to restore the complete vhost configuration from the memory of the running apache process:
- ps ax --forest| grep apache2 # get the pid of the parent process
- gdb apache2 [found pid] # hook onto the running apache process
- gcore # dump a core file
- quit # quit gdb
- strings [the core file] > saved.txt

Saved.txt will contain the vhost configuration, mostly intact, some bits slightly garbled.

But boy was I happy!

The backup scripts have been fixed to include the vhost files. The author of the backup scripts has been dragged naked around the block behind a bike.

#

CLI Magic: Linux troubleshooting tools 101

Posted by: Anonymous [ip: 59.160.207.20] on September 28, 2007 07:21 AM
hi
when is installed linux fc4 the gui mode does not come what shall i do

#

CLI Magic: Linux troubleshooting tools 101

Posted by: Anonymous [ip: 210.211.211.120] on November 28, 2007 11:57 AM
hi

#




 