Showing posts with label Linux Kernel. Show all posts

Thursday, February 12, 2009

Kernel Crash? Get the dying moans...

So you had a kernel crash while you were not around, and could not see on the console what the issue was. Or it happened on a server whose hardware watchdog reboots the machine once it hangs. Here is a way to recover the messages the kernel spat out before dying...

First of all, you need to have kernel crash dumps enabled... there is no reason not to. With that in place, the next time a crash happens the core is saved in a file called vmcore. Now fire up GDB against your vmlinux image and the vmcore file, and run the following commands

(gdb) set logging file ~/gdb.out
(gdb) set logging on
Copying output to ~/gdb.out.
(gdb) set print elements 41216
(gdb) set height 0
(gdb) p (char*)__log_buf

The file gdb.out will now contain the messages. Open it and replace all instances of '\n' with a real newline (using something like ":%s/\\n/\r/gc" in vim), and you will see something like this

<4>ide-floppy driver 0.99.newide
<6>usbcore: registered new driver hiddev
<6>usbcore: registered new driver usbhid
<6>drivers/usb/input/hid-core.c: v2.6:USB HID core driver
<6>PNP: PS/2 Controller [PNP0303:KBC0,PNP0f13:MSE0] at 0x60,0x64 irq 1,12
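If you prefer a one-shot command to the vim substitution, GNU sed can do the same conversion (the sample input below is made up; point it at your own gdb.out instead):

```shell
# GDB logs the buffer as one long line with literal two-character "\n"
# escapes; turn them into real newlines (the \n in the replacement text
# is a GNU sed extension).
printf '<6>usbcore: registered new driver usbhid\\n<6>PNP: PS/2 Controller\\n' > gdb.out
sed 's/\\n/\n/g' gdb.out
```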

Happy Debugging! (I KNOW it won't be HAPPY :P)

Monday, September 22, 2008

TCP Offload Engine support in Linux

A TCP Offload Engine (TOE) is customized hardware that handles TCP connections completely in the network card itself, instead of in the kernel. Lately, 10Gbps Ethernet cards have been becoming the industry standard in the high-end server market. A simple rule of thumb says TCP processing costs about 1 Hz of CPU for every 1 bit of TCP data handled per second. By that measure a 10GigE card requires around 10 GHz and can quickly eat the CPU like there is no tomorrow. Even with multiple CPUs and multiple cores per CPU, the impact is significant.

Added to this, IEEE standards are already being prepared for 40Gbps and 100Gbps Ethernet, so using a TOE starts to look inevitable. Many individual TCP functions are already done in hardware, like checksumming, LRO (Large Receive Offload) and LSO (Large Send Offload), but a TOE provides a complete end-to-end solution.

TCP Offload Engines were never a hit with the Linux networking community. The kernel maintainers, especially David Miller, have been against the idea of TOE for reasons they consider valid, such as reduced maintainability of the networking code. They also argue that TOE is only a stopgap until CPU speeds catch up with the load, citing past cases where TOE was implemented even for 100Mbps links. More details about their position can be found in this article - Linux and TCP Offload Engines. As a result, 10GigE vendors like Chelsio are forced to maintain their TOE patches to the Linux kernel out-of-tree, which makes the code inherently unstable.

Anyway, end users do not face any loss of functionality, since the vendor-provided patches to the Linux kernel can be used to build kernel modules that support TOE hardware on Linux machines. This is what I like about Open Source: you do not have to be bound by what others think. You leave that decision to time.

Wednesday, March 5, 2008

TCP Congestion

It surprises me to no end that TCP, the most widely used transport protocol, has remained practically unchanged since RFC 793 in 1981. From a time when connections ran at hundreds of bytes per second to the 10Gbps links of today, it is still used in a more or less unchanged form.

However this flexibility comes at a cost: you will not get the maximum throughput your link supports right away. The "right away" part is important. True, TCP can eventually take up the entire bandwidth, but there is an often overlooked issue: TCP's congestion control. It cuts a connection's rate sharply as soon as it starts seeing lost packets, and it does not recover nearly as quickly. This is what led Linux to adopt newer TCP congestion control algorithms in a pluggable fashion.

The default congestion control algorithm is CUBIC, a less aggressive version of BIC, but you have a plethora of algorithms to choose from. The current one can be checked by reading from, and changed by writing to, the /proc/sys/net/ipv4/tcp_congestion_control file.
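For example (paths are standard on any recent kernel; the write needs root):

```shell
# Which algorithm is in use right now?
cat /proc/sys/net/ipv4/tcp_congestion_control
# Which algorithms are currently available (compiled in or loaded)?
cat /proc/sys/net/ipv4/tcp_available_congestion_control
# Switch the system-wide default, as root:
#   echo reno > /proc/sys/net/ipv4/tcp_congestion_control
```

Individual applications can also pick an algorithm per socket via the TCP_CONGESTION socket option.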

This Linux Gazette article has a detailed description of the remaining options.

Thursday, February 28, 2008

New form of interrupts - Message Signaled Interrupts

The other day I was looking at the /proc/interrupts file, and I found lines similar to these

494: [ ff] 46 PCI-MSI-edge eth4
495: [ ff] 0 PCI-MSI-edge eth4
496: [ ff] 0 PCI-MSI-edge eth4

The first thing to notice is the presence of three interrupts for the eth4 device, which is an NVIDIA nForce card using the forcedeth driver. Also, the name "PCI-MSI-edge" is a type I had not heard of before. The other oddity was the interrupt number itself, which was above 255, what I thought was the maximum valid interrupt number.

After some googling, I found out that these are the new Message Signaled Interrupts (MSI), which are mandated by the PCIe standard. Instead of a single interrupt, where the driver has to query the device for the actual event that happened, the device can now raise a separate interrupt for each event: in the case above, one for rx, one for tx and one for link status. This simplifies and cleans up driver code. On the hardware side, it makes interrupts in-band and removes the need for a dedicated pin, reducing the chip's footprint.
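You can check for them on your own box; each MSI vector shows up as its own line:

```shell
# List the MSI vectors currently in use ("|| true" keeps the command
# from failing on machines with no MSI-capable devices at all).
grep 'MSI' /proc/interrupts || true
```

For devices that support it, lspci -v will also show an MSI capability entry.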

Take a look at the <linux kernel src>/Documentation/MSI-HOWTO.txt for more details.

Wednesday, February 27, 2008

Drivers & Hardware

My friend who works at NVIDIA always complains about how software drivers have to cover up all the mistakes in the hardware. I never faced an issue like that until today. There was this gigabit network card which gives a pathetic netperf throughput of 780Mbps. After checking a lot of stuff I chanced upon a discussion in netdev about the same driver, and how the driver works around a hardware bug by disabling a very important feature.

I removed the workaround from the kernel, recompiled and installed it on the test machine. I ran netperf and presto! It comes to 989Mbps, about as close as you can get to the line speed of the interface! As for the hardware bug, it does not get triggered in this particular setup, since it requires the PCI-X bus running at 100/133MHz, whereas the test machine's bus runs at only 66MHz. Problem solved.

Tuesday, February 26, 2008

Same Subnet Interfaces in Linux issue

I always thought that the Linux networking stack was a fully functional component, never lacking in features compared to other OSes. However, a few days ago I came across a missing piece of functionality: the interfaces-on-the-same-subnet issue.

You see, when you configure two interfaces on the same subnet and try to talk to other machines on that subnet, something unexpected happens. The packets can come in on both interfaces, but they leave on only one of them! This is because of the inherent design of the Linux forwarding code, which routes solely on the destination address, ignoring the source.

Of course it's not a dead end in any way. You can still configure a bond between the two interfaces and then assign both IPs to the bond virtual interface. But that is a lot of trouble, and not exactly a solution for someone seeking resource compartmentalization. Solaris provides a concept called interface grouping for this issue.
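For the record, a rough sketch of that bonding workaround (interface names, addresses and bonding mode are made up for illustration; these are configuration commands that need root and real hardware):

```shell
# Enslave both NICs to one bond device, then stack both IPs on it.
modprobe bonding mode=balance-rr miimon=100
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1
ifconfig bond0:1 192.168.1.11 netmask 255.255.255.0 up   # second IP as an alias
```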

Monday, February 25, 2008

Nice Try

Today I was trying to print some variables from the tg3 network driver in the running Linux kernel, using GDB on /proc/kcore. Being the lazy type, I tried to check kernel variable values through GDB instead of using the /proc or /sys interfaces.

After printing the variables, I started a test to see how they would change. The test ran to completion and, surprise surprise, the values never changed! I did not know what had happened and re-ran the test, but no luck. Then I looked through the code, and after making sure that the value absolutely had to change, I came to the conclusion that there was something wrong with GDB. I restarted it, and presto, the new values were visible.

After some googling it turned out that GDB caches the values it retrieves from the core file. The interesting part was that it does this even though we indicate to GDB that the target is live. Anyway, the solution, from LDD3, is to run "core-file /proc/kcore" every time we want an updated value.
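Assuming GDB was started as "gdb vmlinux /proc/kcore" (and that your vmlinux has symbols), the session looks something like this; jiffies is just a convenient always-changing symbol to demonstrate with:

```
(gdb) p jiffies                 # first read, value gets cached
(gdb) p jiffies                 # same value again, however long you wait
(gdb) core-file /proc/kcore     # re-open the "core" to drop the cache
(gdb) p jiffies                 # now a fresh value
```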