| 
									
										
										
										
											2006-03-22 00:37:42 +01:00
										 |  |  | Table of contents | 
					
						
							|  |  |  | ================= | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Last updated: 20 December 2005 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Contents | 
					
						
							|  |  |  | ======== | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | - Introduction | 
					
						
							|  |  |  | - Devices not appearing | 
					
						
							|  |  |  | - Finding patch that caused a bug | 
					
						
							|  |  |  | -- Finding using git-bisect | 
					
						
							|  |  |  | -- Finding it the old way | 
					
						
							|  |  |  | - Fixing the bug | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Introduction | 
					
						
							|  |  |  | ============ | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Always try the latest kernel from kernel.org and build from source. If you are | 
					
						
							|  |  |  | not confident in doing that please report the bug to your distribution vendor | 
					
						
							|  |  |  | instead of to a kernel developer. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Finding bugs is not always easy. Have a go though. If you can't find it don't | 
					
						
							|  |  |  | give up. Report as much as you have found to the relevant maintainer. See | 
					
						
							|  |  |  | MAINTAINERS for who that is for the subsystem you have worked on. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Before you submit a bug report read REPORTING-BUGS. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Devices not appearing | 
					
						
							|  |  |  | ===================== | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Often this is caused by udev. Check that first before blaming it on the | 
					
						
							|  |  |  | kernel. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Finding patch that caused a bug | 
					
						
							|  |  |  | =============================== | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Finding using git-bisect | 
					
						
							|  |  |  | ------------------------ | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Using the provided tools with git makes finding bugs easy provided the bug is | 
					
						
							|  |  |  | reproducible. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Steps to do it: | 
					
						
							|  |  |  | - start using git for the kernel source | 
					
						
							|  |  |  | - read the man page for git-bisect | 
					
						
							|  |  |  | - have fun | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Finding it the old way | 
					
						
							|  |  |  | ---------------------- | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | [Sat Mar  2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)] | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | This is how to track down a bug if you know nothing about kernel hacking.   | 
					
						
							|  |  |  | It's a brute force approach but it works pretty well. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | You need: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         . A reproducible bug - it has to happen predictably (sorry) | 
					
						
							|  |  |  |         . All the kernel tar files from a revision that worked to the | 
					
						
							|  |  |  |           revision that doesn't | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | You will then do: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         . Rebuild a revision that you believe works, install, and verify that. | 
					
						
							|  |  |  |         . Do a binary search over the kernels to figure out which one | 
					
						
							|  |  |  |           introduced the bug.  I.e., suppose 1.3.28 didn't have the bug, but  | 
					
						
							|  |  |  |           you know that 1.3.69 does.  Pick a kernel in the middle and build | 
					
						
							|  |  |  |           that, like 1.3.50.  Build & test; if it works, pick the mid point | 
					
						
							|  |  |  |           between .50 and .69, else the mid point between .28 and .50. | 
					
						
							|  |  |  |         . You'll narrow it down to the kernel that introduced the bug.  You | 
					
						
							|  |  |  |           can probably do better than this but it gets tricky.   | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         . Narrow it down to a subdirectory | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |           - Copy kernel that works into "test".  Let's say that 3.62 works, | 
					
						
							|  |  |  |             but 3.63 doesn't.  So you diff -r those two kernels and come | 
					
						
							|  |  |  |             up with a list of directories that changed.  For each of those | 
					
						
							|  |  |  |             directories: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |                 Copy the non-working directory next to the working directory | 
					
						
							|  |  |  |                 as "dir.63".   | 
					
						
							|  |  |  |                 One directory at time, try moving the working directory to | 
					
						
							|  |  |  |                 "dir.62" and mv dir.63 dir"time, try  | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |                         mv dir dir.62 | 
					
						
							|  |  |  |                         mv dir.63 dir | 
					
						
							|  |  |  |                         find dir -name '*.[oa]' -print | xargs rm -f | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |                 And then rebuild and retest.  Assuming that all related | 
					
						
							|  |  |  |                 changes were contained in the sub directory, this should  | 
					
						
							|  |  |  |                 isolate the change to a directory.   | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |                 Problems: changes in header files may have occurred; I've | 
					
						
							|  |  |  |                 found in my case that they were self explanatory - you may  | 
					
						
							|  |  |  |                 or may not want to give up when that happens. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         . Narrow it down to a file | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |           - You can apply the same technique to each file in the directory, | 
					
						
							|  |  |  |             hoping that the changes in that file are self contained.   | 
					
						
							|  |  |  |              | 
					
						
							|  |  |  |         . Narrow it down to a routine | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |           - You can take the old file and the new file and manually create | 
					
						
							|  |  |  |             a merged file that has | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |                 #ifdef VER62 | 
					
						
							|  |  |  |                 routine() | 
					
						
							|  |  |  |                 { | 
					
						
							|  |  |  |                         ... | 
					
						
							|  |  |  |                 } | 
					
						
							|  |  |  |                 #else | 
					
						
							|  |  |  |                 routine() | 
					
						
							|  |  |  |                 { | 
					
						
							|  |  |  |                         ... | 
					
						
							|  |  |  |                 } | 
					
						
							|  |  |  |                 #endif | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |             And then walk through that file, one routine at a time and | 
					
						
							|  |  |  |             prefix it with | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |                 #define VER62 | 
					
						
							|  |  |  |                 /* both routines here */ | 
					
						
							|  |  |  |                 #undef VER62 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |             Then recompile, retest, move the ifdefs until you find the one | 
					
						
							|  |  |  |             that makes the difference. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Finally, you take all the info that you have, kernel revisions, bug | 
					
						
							|  |  |  | description, the extent to which you have narrowed it down, and pass  | 
					
						
							|  |  |  | that off to whomever you believe is the maintainer of that section. | 
					
						
							|  |  |  | A post to linux.dev.kernel isn't such a bad idea if you've done some | 
					
						
							|  |  |  | work to narrow it down. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | If you get it down to a routine, you'll probably get a fix in 24 hours. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | My apologies to Linus and the other kernel hackers for describing this | 
					
						
							|  |  |  | brute force approach, it's hardly what a kernel hacker would do.  However, | 
					
						
							|  |  |  | it does work and it lets non-hackers help fix bugs.  And it is cool | 
					
						
							|  |  |  | because Linux snapshots will let you do this - something that you can't | 
					
						
							|  |  |  | do with vendor supplied releases. | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-22 00:37:42 +01:00
										 |  |  | Fixing the bug | 
					
						
							|  |  |  | ============== | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Nobody is going to tell you how to fix bugs. Seriously. You need to work it | 
					
						
							|  |  |  | out. But below are some hints on how to use the tools. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | To debug a kernel, use objdump and look for the hex offset from the crash | 
					
						
							|  |  |  | output to find the valid line of code/assembler. Without debug symbols, you | 
					
						
							|  |  |  | will see the assembler code for the routine shown, but if your kernel has | 
					
						
							|  |  |  | debug symbols the C code will also be available. (Debug symbols can be enabled | 
					
						
							|  |  |  | in the kernel hacking menu of the menu configuration.) For example: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     objdump -r -S -l --disassemble net/dccp/ipv4.o | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | NB.: you need to be at the top level of the kernel tree for this to pick up | 
					
						
							|  |  |  | your C files. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | If you don't have access to the code you can also debug on some crash dumps | 
					
						
							|  |  |  | e.g. crash dump output as shown by Dave Miller. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | >    EIP is at ip_queue_xmit+0x14/0x4c0 | 
					
						
							|  |  |  | >     ... | 
					
						
							|  |  |  | >    Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00 | 
					
						
							|  |  |  | >    00 00 55 57  56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08 | 
					
						
							|  |  |  | >    <8b> 83 3c 01 00 00 89 44  24 14 8b 45 28 85 c0 89 44 24 18 0f 85 | 
					
						
							|  |  |  | > | 
					
						
							|  |  |  | >    Put the bytes into a "foo.s" file like this: | 
					
						
							|  |  |  | > | 
					
						
							|  |  |  | >           .text | 
					
						
							|  |  |  | >           .globl foo | 
					
						
							|  |  |  | >    foo: | 
					
						
							|  |  |  | >           .byte  .... /* bytes from Code: part of OOPS dump */ | 
					
						
							|  |  |  | > | 
					
						
							|  |  |  | >    Compile it with "gcc -c -o foo.o foo.s" then look at the output of | 
					
						
							|  |  |  | >    "objdump --disassemble foo.o". | 
					
						
							|  |  |  | > | 
					
						
							|  |  |  | >    Output: | 
					
						
							|  |  |  | > | 
					
						
							|  |  |  | >    ip_queue_xmit: | 
					
						
							|  |  |  | >        push       %ebp | 
					
						
							|  |  |  | >        push       %edi | 
					
						
							|  |  |  | >        push       %esi | 
					
						
							|  |  |  | >        push       %ebx | 
					
						
							|  |  |  | >        sub        $0xbc, %esp | 
					
						
							|  |  |  | >        mov        0xd0(%esp), %ebp        ! %ebp = arg0 (skb) | 
					
						
							|  |  |  | >        mov        0x8(%ebp), %ebx         ! %ebx = skb->sk | 
					
						
							|  |  |  | >        mov        0x13c(%ebx), %eax       ! %eax = inet_sk(sk)->opt | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-06-01 00:46:50 -07:00
										 |  |  | In addition, you can use GDB to figure out the exact file and line | 
					
						
							|  |  |  | number of the OOPS from the vmlinux file. If you have | 
					
						
							|  |  |  | CONFIG_DEBUG_INFO enabled, you can simply copy the EIP value from the | 
					
						
							|  |  |  | OOPS: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |  EIP:    0060:[<c021e50e>]    Not tainted VLI | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | And use GDB to translate that to human-readable form: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |   gdb vmlinux | 
					
						
							|  |  |  |   (gdb) l *0xc021e50e | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | If you don't have CONFIG_DEBUG_INFO enabled, you use the function | 
					
						
							|  |  |  | offset from the OOPS: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |  EIP is at vt_ioctl+0xda8/0x1482 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | And recompile the kernel with CONFIG_DEBUG_INFO enabled: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |   make vmlinux | 
					
						
							|  |  |  |   gdb vmlinux | 
					
						
							|  |  |  |   (gdb) p vt_ioctl | 
					
						
							|  |  |  |   (gdb) l *(0x<address of vt_ioctl> + 0xda8) | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-22 00:37:42 +01:00
										 |  |  | Another very useful option of the Kernel Hacking section in menuconfig is | 
					
						
							|  |  |  | Debug memory allocations. This will help you see whether data has been | 
					
						
							|  |  |  | initialised and not set before use etc. To see the values that get assigned | 
					
						
							|  |  |  | with this look at mm/slab.c and search for POISON_INUSE. When using this an | 
					
						
							|  |  |  | Oops will often show the poisoned data instead of zero which is the default. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Once you have worked out a fix please submit it upstream. After all open | 
					
						
							|  |  |  | source is about sharing what you do and don't you want to be recognised for | 
					
						
							|  |  |  | your genius? | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Please do read Documentation/SubmittingPatches though to help your code get | 
					
						
							|  |  |  | accepted. |