Thursday, January 12, 2017

Let’s Automate Kdump

Kdump is a kernel crash dumping mechanism. It is reliable because the crash dump is captured from the context of a freshly booted kernel rather than from the context of the crashed kernel. Kdump uses Kexec to boot into a second kernel whenever the system crashes. This second kernel, often called the capture kernel, boots with very little memory and captures the dump image.
Kexec is a fast-boot mechanism: it boots the secondary kernel directly from the running one, skipping BIOS initialization, so that the memory image of the previous kernel can be captured. Both “kdump” and “Kexec” have been available since RHEL 5.x. Kdump is supported on the i686, x86_64, ia64 and ppc64 platforms.

How to configure kdump?

The usual way is to configure the required parameters manually, as explained below:

Install the “kexec-tools” package to start the process.

For a system (RHEL5/6/7 variants) to capture a core dump (vmcore) successfully, the following conditions need to be met:

[].. Make sure enough free space is available under '/var/crash' (default dump location).

[].. The “crashkernel” parameter should be set to a proper memory size, and the value should be reflected in /proc/cmdline.

[].. The same “crashkernel” value should be set in '/boot/grub/grub.conf' or '/boot/grub2/grub.cfg' (RHEL7).

[].. The sysctl parameter “kernel.sysrq” should be set to 1.

[].. The sysctl parameter “kernel.unknown_nmi_panic” should be set to 1.

[].. Crash memory space should be reserved (check /proc/iomem).

[].. The crash kernel should be loaded (/sys/kernel/kexec_crash_loaded should read 1).
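The runtime side of the checklist above can be verified without changing anything on the system. What follows is only a minimal sketch, not the script discussed in this post: the function names are made up for illustration, and the assumption is that the reserved region shows up in /proc/iomem under the label "Crash kernel".

```shell
#!/bin/sh
# Read-only sketch of the kdump prerequisite checks.
# All function names are hypothetical; each takes the file to inspect
# as an optional argument so it can be tried on copies of the real files.

# Succeeds if a crashkernel= argument is on the kernel command line.
has_crashkernel() {
    grep -q 'crashkernel=' "${1:-/proc/cmdline}" 2>/dev/null
}

# Succeeds if memory is reserved for the capture kernel
# (assumed to be labelled "Crash kernel" in /proc/iomem).
has_crash_region() {
    grep -qi 'crash kernel' "${1:-/proc/iomem}" 2>/dev/null
}

# Succeeds if the capture kernel is loaded (the file contains 1).
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}

# Succeeds if the given /proc/sys file holds the value 1.
sysctl_is_one() {
    [ "$(cat "$1" 2>/dev/null)" = "1" ]
}

# report <label> <command> [args...]: run a check and print OK or FAIL.
report() {
    label=$1
    shift
    if "$@"; then
        echo "OK    $label"
    else
        echo "FAIL  $label"
    fi
}

report "crashkernel present in /proc/cmdline" has_crashkernel
report "crash memory reserved in /proc/iomem" has_crash_region
report "capture kernel loaded"                crash_kernel_loaded
report "kernel.sysrq = 1"                     sysctl_is_one /proc/sys/kernel/sysrq
report "kernel.unknown_nmi_panic = 1"         sysctl_is_one /proc/sys/kernel/unknown_nmi_panic
```

Nothing here writes to the system, so it is safe to run on a production box; a FAIL line only tells you which item of the checklist still needs attention.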

Another important point to keep in mind is the correct “crashkernel” value. According to Red Hat, the value to set depends on the total RAM and the RHEL version.

What is the necessity of kdump?

“Kdump” helps in situations where a memory dump needs to be analyzed after the system has crashed or become unresponsive. Analyzing the memory dump file (vmcore) helps in understanding the system state at that moment, so that an administrator or Linux expert can tune the system to avoid or overcome such incidents.

Let’s automate configuring kdump via a shell script...

I’ve written a simple shell script that sets up the “kdump” configuration properly on RHEL7.x/6.x/5.x variants.

These are the tasks performed by this script when run as root or as a sudo user:

[].. Install the kdump package (“kexec-tools”) if it is not installed.

[].. Verify that enough free space is available under the '/var' file system.

[].. Verify that the “crashkernel=xxx” parameter is set. If it is not, set the correct size as per Red Hat, according to the RHEL version and the total RAM available.

[].. Make sure “crashkernel” is configured in '/boot/grub/grub.conf' (RHEL6/5) or '/boot/grub2/grub.cfg' (RHEL7) as recommended.

[].. Check that “crashkernel” is also added to '/etc/default/grub' (in case of RHEL7).

[].. Verify that “kernel.sysrq” and “kernel.unknown_nmi_panic” are enabled, and set them to 1 if they are not.

[].. Set the dump path, which is /var/crash (default dump location), and the “core_collector” parameter as per Red Hat in the '/etc/kdump.conf' file. If these are already set, the script just verifies them.

[].. Make sure the required service (kdump) is set to come up on boot.

[].. Finally, if everything is set, ask the user whether to run a kdump test; if yes, start the kdump service (if not already started). If the “crashkernel” parameter is not found in '/proc/cmdline', prompt the user to reboot the system.

[].. If the user wishes to test kdump, flush dirty data from cache to disk, increase console logging verbosity and remount the file systems read-only, then prompt once more for confirmation before crashing the system.

- If the “crashkernel” parameter needs to be re-configured, a system reboot is required.
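To give an idea of what the '/etc/kdump.conf' step in the list above could look like, here is a rough sketch. It is not taken from the attached script: the helper name set_kdump_option is hypothetical, and the “core_collector” value shown is an assumption (the line commonly shipped as the RHEL7 default), so check it against Red Hat's recommendation for your release. The demo works on a temporary file, not on the real configuration.

```shell
#!/bin/sh
# Sketch: make sure a kdump.conf-style file carries exactly one
# uncommented "<key> <value>" line. set_kdump_option is a hypothetical name.

# set_kdump_option <file> <key> <value...>
# Replaces the first uncommented line starting with <key>, drops any
# further duplicates, and appends the line if the key was not present.
set_kdump_option() {
    file=$1
    key=$2
    shift 2
    value=$*
    _tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        $1 == k { if (!seen) { print k " " v; seen = 1 }; next }
        { print }
        END { if (!seen) print k " " v }
    ' "$file" > "$_tmp" && cat "$_tmp" > "$file"
    rc=$?
    rm -f "$_tmp"
    return $rc
}

# Demo on a scratch copy; the real target would be /etc/kdump.conf.
conf=$(mktemp)
printf '%s\n' '#path /var/crash' 'core_collector makedumpfile -c' > "$conf"

set_kdump_option "$conf" path /var/crash
# Assumed value: verify against Red Hat documentation for your release.
set_kdump_option "$conf" core_collector makedumpfile -l --message-level 1 -d 31

cat "$conf"
rm -f "$conf"
```

Writing the result back with cat, instead of moving the temporary file over the original, keeps the ownership and permissions of the existing file. Commented-out lines such as '#path /var/crash' are left untouched.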

So, this shell script uses only native Linux commands, which makes it suitable for RHEL7.x/6.x/5.x variants. It performs all the sanity checks needed for kdump to function properly and lets the user test kdump.

How does this work?

>> The following snap shows what the script does when run on a system which doesn’t have the 'kexec-tools' package installed:

After installing “kexec-tools” and configuring the required parameters, the script prompts for a restart of the system so that a valid kdump initrd image gets generated.

>> When run on a system where the kdump service is up and the “crashkernel” parameter is properly set, the script asks the user whether to test the crash dump, as shown below:

So, if the user hits “y” when prompted, the script validates
..... if “crash” is loaded,
....... if “crashkernel” is active in the kernel, and
........ if crash memory space is set. It then asks the user whether to continue or not, as shown above.

After receiving further confirmation from the user, the script performs the steps below before crashing the system and generating the vmcore dump file:

.... flushes dirty data out to disks
...... re-mounts all active file systems read-only (as a safety step)
......... enables debug mode for console messages
........... and then crashes the system so that the vmcore file gets recorded under the '/var/crash' path.
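For the curious, the sequence above maps onto a handful of standard commands. The sketch below is my own approximation of those steps and not a copy of the script: the function name is hypothetical, and it assumes the SysRq trigger file ('u' remounts read-only, 'c' crashes) and 'dmesg -n' for console verbosity. By default it only prints what it would do; it crashes the machine only when called with the literal argument "run" as root.

```shell
#!/bin/sh
# Sketch of the pre-crash steps. trigger_test_crash is a hypothetical name.
# WARNING: "trigger_test_crash run" really crashes the system.

# trigger_test_crash [run]
# With no argument (or anything other than "run") the commands are
# only printed, one per line, and nothing is executed.
trigger_test_crash() {
    if [ "${1:-}" = "run" ]; then
        runner=eval     # execute each step
    else
        runner=echo     # dry run: just show each step
    fi
    $runner 'sync'                          # flush dirty data to disk
    $runner 'echo u > /proc/sysrq-trigger'  # remount file systems read-only
    $runner 'dmesg -n 8'                    # most verbose console logging
    $runner 'echo c > /proc/sysrq-trigger'  # force the crash
}

# Dry run: prints the four commands without touching the system.
trigger_test_crash
```

Keeping the dry run as the default means an accidental invocation does no harm, which mirrors the extra confirmation prompt the script asks for before the crash.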

The system may not respond for some time; once it reboots, the crash file will be available for further analysis.

>> Any change to '/etc/kdump.conf' needs a reboot so that the initrd image file gets re-generated, and a message like the one below shows up during the reboot:

Any change to the “crashkernel” parameter needs a reboot, since '/proc/cmdline' only reflects the values the running kernel was booted with.
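One way to tell whether such a reboot is still pending is to compare the configured value with the one the running kernel was booted with. This is only an illustrative sketch with made-up function names; it assumes the configured value lives in '/etc/default/grub' (RHEL7) and takes the first “crashkernel=” argument it finds in each file.

```shell
#!/bin/sh
# Sketch: detect a pending reboot by comparing crashkernel values.
# crashkernel_value and reboot_needed are hypothetical helper names.

# Print the value of the first crashkernel= argument found in a file.
crashkernel_value() {
    [ -r "$1" ] || return 1
    # Split on spaces and double quotes so each argument sits on its own line.
    tr ' "' '\n\n' < "$1" | sed -n 's/^crashkernel=//p' | head -n 1
}

# reboot_needed <config file> <running cmdline file>
# Succeeds when the configured and the running values differ.
reboot_needed() {
    [ "$(crashkernel_value "$1")" != "$(crashkernel_value "$2")" ]
}

if reboot_needed /etc/default/grub /proc/cmdline; then
    echo "crashkernel differs from the running kernel: reboot required"
else
    echo "crashkernel matches the running kernel (or is unset in both)"
fi
```

On RHEL6/5 the first argument would be '/boot/grub/grub.conf' instead; since that file can hold several kernel entries, only the first one is compared here.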

>> So, irrespective of whether the kdump service is up or not, you can safely run this script; it validates the current setup and acts accordingly.

>> The core dump path is the default one, '/var/crash', with the recommended “core_collector” configuration. If there is a central server to collect such dumps, it has to be specified in the script as required.

>> If any configuration changes to the files (kdump.conf or grub.cfg/grub.conf) are required, the script takes a backup of the files (with a suffix in “-ddmmyyyy” format) before making such changes.
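A dated backup of that kind takes only a couple of lines. The helper below is a hypothetical sketch (the name backup_file is mine) that copies a file to a sibling with a “-ddmmyyyy” suffix and prints the backup path; the demo uses a scratch file so it can be tried without root.

```shell
#!/bin/sh
# Sketch: back up a file with a -ddmmyyyy suffix before editing it.
# backup_file is a hypothetical helper name.

# backup_file <file>
# Copies <file> to <file>-ddmmyyyy (preserving mode and timestamps)
# and prints the name of the backup.
backup_file() {
    [ -f "$1" ] || return 1
    backup="$1-$(date +%d%m%Y)"
    cp -p "$1" "$backup" && echo "$backup"
}

# Demo on a scratch file; the real targets would be kdump.conf or grub.cfg.
demo=$(mktemp)
echo 'path /var/crash' > "$demo"
echo "backup written to: $(backup_file "$demo")"
rm -f "$demo" "$demo"-*
```

Note that a second run on the same day overwrites that day's backup, which is worth keeping in mind if the script is run repeatedly while experimenting.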

>> Parameters which are not specified here are left at their defaults.

The script file is attached as a text file. You would need to remove the header lines (the first two lines, which are commented out), set the “execute” bit, and then run the script. Make sure line indents (spaces/tabs) are not altered; otherwise the script may not give proper results.

