[sos-devel] [RFC PATCH] Enable sos to capture kernel crash report/summary

Ankit Kumar ankit at linux.vnet.ibm.com
Tue Mar 28 09:50:21 UTC 2017


Service personnel need the vmcore (kernel dump) to analyze kernel-related issues.
A vmcore is typically large and takes time to transfer (its size varies with
system configuration; e.g. on a 1TB machine the usual vmcore size is ~1GB).
We also have to ask the customer to transfer both the vmcore and the sosreport output.

Most kernel issues can be debugged by looking at some of the important
information available in the vmcore (call trace, kernel IRQ stack info, process
status, memory usage, etc.). In our experience ~60% of kernel issues can be resolved
with this information.

We use the crash command to extract this information from the vmcore. This
patch enables the sos package to capture the output of various crash commands
for the last system crash. The output is captured in a file, which is then
included in the sosreport.
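
As a rough illustration of the capture step, the sketch below builds the kind
of input file that is handed to "crash -s -i <file>", following the same
header / command / cat sequence the patch writes. The helper name
build_crash_script and the /tmp paths are placeholders for this example only,
and it is written for Python 3 rather than the Python 2 of the patch.

```python
def build_crash_script(commands, scratch_file, output_file):
    """Return the text of an input file for `crash -s -i <file>`."""
    lines = []
    for cmd, header in commands:
        # '!' runs a shell command from inside crash: start a fresh scratch
        # file that begins with the section header.
        lines.append("!echo '{0}:' > {1}".format(header, scratch_file))
        # Redirect the crash command's own output into the scratch file.
        lines.append("{0} >> {1}".format(cmd, scratch_file))
        lines.append("!echo >> {0}".format(scratch_file))
        # Append the finished section to the report.
        lines.append("!cat {0} >> {1}".format(scratch_file, output_file))
    lines.append("quit")
    return "\n".join(lines) + "\n"


# Placeholder paths, chosen only for this example.
script = build_crash_script(
    [("sys", "SYSTEM INFORMATION"), ("bt -a", "STACK TRACES OF ACTIVE TASKS")],
    "/tmp/scratch.log",
    "/tmp/report.txt",
)
print(script)
```

Keeping the script generation separate from running crash makes it possible to
inspect the generated commands without a vmcore at hand.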

It locates the vmlinux (debug) and vmcore (dump) files for the last system
crash under the standard paths (/usr/lib/debug and /var/crash).
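
A minimal sketch of the vmlinux lookup, assuming the debug kernel sits
somewhere under a root directory and that its path contains the kernel
release string. The function name find_debug_vmlinux is hypothetical, and the
demo runs against a throwaway directory rather than the real /usr/lib/debug.

```python
import os
import tempfile


def find_debug_vmlinux(root, kernel_release):
    """Return the first vmlinu* file under root whose path names kernel_release."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            full = os.path.join(dirpath, name)
            # Match on the whole path so both layouts shown in the sample
            # logs work: boot/vmlinux-<release> and modules/<release>/vmlinux.
            if name.startswith("vmlinu") and kernel_release in full:
                return full
    return None


# Self-contained demo on a temporary tree.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "boot"))
ubuntu_style = os.path.join(demo_root, "boot", "vmlinux-4.4.0-66-generic")
open(ubuntu_style, "w").close()

found = find_debug_vmlinux(demo_root, "4.4.0-66-generic")
missing = find_debug_vmlinux(demo_root, "3.10.0-327.el7.ppc64")
print(found, missing)
```

Returning None instead of an error code lets the caller decide how to report
a missing debug kernel.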

We check for various error conditions and log an error message to stderr as well
as to the output file.
The error conditions are:
 - the crash command is not installed
 - the debug vmlinux or the vmcore (dump) is not found for the kernel that last crashed
 - file open/write errors
 - the time of the last crash cannot be retrieved
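
The first two checks could be gathered up front, before crash is invoked.
The sketch below is one possible shape, not the code in this patch: the name
preflight_errors and the message texts are made up for illustration, and it
relies on shutil.which, which needs Python 3.3 or later.

```python
import os
import shutil


def preflight_errors(vmlinux_path, vmcore_path, which=shutil.which):
    """Collect human-readable problems that would stop crash from running."""
    errors = []
    if which("crash") is None:
        errors.append("crash command not found; install the crash utility")
    if not vmlinux_path or not os.path.isfile(vmlinux_path):
        errors.append("debug vmlinux not found: %r" % (vmlinux_path,))
    if not vmcore_path or not os.path.isfile(vmcore_path):
        errors.append("vmcore not found: %r" % (vmcore_path,))
    return errors


# Demo with a stubbed lookup so the result does not depend on the host.
problems = preflight_errors("", "", which=lambda name: None)
for problem in problems:
    print(problem)
```

Collecting all problems in one list means a single run can report every
missing prerequisite instead of stopping at the first one.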

This patch has been tested for the various error conditions on Red Hat and Ubuntu distros.

Here are a few sample logs:

If unable to find vmlinux:
Failed in retrieving debug linux:[ 4.4.0-66-generic ] inside path: [ /usr/lib/debug ]
Please install debug linux and rerun sosreport

Sample output file with partial log:
Ubuntu:
Complete Crash command: crash -si crash_cmd_file /usr/lib/debug/boot/vmlinux-4.4.0-66-generic /var/crash/201703130718/dump.201703130718
RedHat:
Complete Crash command: crash -si crash_cmd_file /usr/lib/debug/usr/lib/modules/3.10.0-327.el7.ppc64/vmlinux /var/crash/127.0.0.1-2017-03-13-11:34:13/vmcore

SYSTEM INFORMATION:

      KERNEL: /usr/lib/debug/boot/vmlinux-4.4.0-66-generic
    DUMPFILE: /var/crash/201703130718/dump.201703130718  [PARTIAL DUMP]
        CPUS: 1
        DATE: Mon Mar 13 07:17:54 2017
      UPTIME: 00:19:13
LOAD AVERAGE: 0.24, 0.27, 0.11
...

@Bryn,
 - This patch needs some cleanup and testing in various environments.
   I want to check whether my approach is fine. If you are fine
   with the approach, I will fine-tune the patch and resend it.
 - Presently we log error messages to stderr. Are you fine with this?

Signed-off-by: Ankit Kumar <ankit at linux.vnet.ibm.com>
---
 sos/plugins/crash_report.py | 262 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 262 insertions(+)
 create mode 100644 sos/plugins/crash_report.py

diff --git a/sos/plugins/crash_report.py b/sos/plugins/crash_report.py
new file mode 100644
index 0000000..59a7c0e
--- /dev/null
+++ b/sos/plugins/crash_report.py
@@ -0,0 +1,262 @@
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+import os
+import sys
+import time
+from sos.plugins import Plugin, RedHatPlugin, DebianPlugin, UbuntuPlugin
+from subprocess import call
+from datetime import datetime
+
+class Crash_log(Plugin):
+    """Crash kernel report
+    """
+    plugin_name = "crash_report"
+    profiles = ('system', 'debug')
+
+    # crash result output file
+    CRASH_OUTPUT_FILE='/var/log/os_crash_report.txt'
+    # file to be passed to the crash command (contains all commands)
+    CRASH_CMD_FILE='/tmp/crash_cmd_file'
+    # temp file
+    INT_DATA_LOG='/tmp/temp_file'
+    # last cmd output file
+    LAST_CMD_OP_FILE = '/tmp/last_cmd_log'
+    DBG_VMLINUX_PATH='/usr/lib/debug/'
+    VMCORE_STND_PATH='/var/crash'
+
+    DISTRO_NAME = ''
+    CRASHED_KERNEL_NAME=''
+    VMLINUX_PATH=''
+    ERROR_MSG = ''
+    VMCORE_PATH=''
+
+    # used to find dump for crashed kernel.
+    # return
+    #     True  : In case of success.
+    #     False : In case of failure.
+    #
+    def is_dump_for_crashing_kernel(self,  file_path, crashing_time):
+        file_creation_info = time.ctime(os.path.getctime(file_path))
+
+        # get the actual time string using split on complete line
+        # ctime returns string as 'Mon Mar 13 02:54:32 2017'
+        # we are interested only in Month date hrs:min field
+        vmcore_creation_time=file_creation_info.rsplit(':', 1)[0]
+        vmcore_creation_time=vmcore_creation_time.split(' ', 1)[1]
+        t1 = datetime.strptime(vmcore_creation_time, "%b %d %H:%M")
+        t2 = datetime.strptime(crashing_time, "%b %d %H:%M")
+        difference = t1 - t2
+        return difference.days == 0
+
+    # retrieve path of vmlinux/vmcore
+    def retrieve_path(self, start_path, file_type, file_opt, crashing_time):
+
+        for path,dirs,files in os.walk(start_path):
+            for temp_path in files:
+                temp_path = os.path.join(path,temp_path)
+
+                # dump file name varies on different distros:
+                # UBUNTU: dump.***
+                # RHEL  : vmcore
+                # below condition check distro type and search vmcore file accordingly.
+                if (file_type == "vmcore") and (self.DISTRO_NAME == 'UBUNTU'):
+                    if (temp_path.find("dump.") != -1):
+                        ret = self.is_dump_for_crashing_kernel(temp_path, crashing_time)
+                        if (ret == True):
+                            self.VMCORE_PATH = temp_path
+                            return 0
+                elif (file_type == "vmcore") and (self.DISTRO_NAME == 'REDHAT'):
+                    if (os.path.basename(temp_path) == "vmcore"):
+                        ret = self.is_dump_for_crashing_kernel(temp_path, crashing_time)
+                        if (ret == True):
+                            self.VMCORE_PATH = temp_path
+                            return 0
+                elif (file_type == "vmlinu"):
+                    if (temp_path.find("vmlinu") != -1):
+                        if (temp_path.find(file_opt) != -1):
+                            self.VMLINUX_PATH = temp_path
+                            return 0
+        return -1
+
+    def get_crash_time(self, line):
+        crashing_time = line.split()[4] + ' ' + line.split()[5] + ' ' + line.split()[6]
+        try:
+            t2 = datetime.strptime(crashing_time, "%b %d %H:%M")
+        except ValueError:
+            t2 = None
+            crashing_time = line.split()[3] + ' ' + line.split()[4] + ' ' + line.split()[5]
+            try:
+                t2 = datetime.strptime(crashing_time, "%b %d %H:%M")
+            except ValueError:
+                t2 = None
+                return ""
+        return crashing_time
+
+    def get_vmlinux_vmcore_path(self):
+        last_cmd_log_file = 'last' + ' ' + '>' +' ' + self.LAST_CMD_OP_FILE
+        os.system(last_cmd_log_file)
+        open_file = open(self.LAST_CMD_OP_FILE)
+        crash_found = 0
+
+        for line in open_file:
+            line = line.rstrip()
+            if (crash_found == 1):
+                reboot_str = line.split()[0] +' ' + line.split()[1] + ' ' + line.split()[2]
+                if (reboot_str == 'reboot system boot'):
+                    line_contain_os = line
+                    self.CRASHED_KERNEL_NAME = line.split()[3]
+                    break
+                else:
+                    continue
+            words = line.split()
+            for word in words:
+                if (word == "crash"):
+                    line_contain_crash = line
+                    crash_found = 1
+                    break
+        if (crash_found == 1):
+            crashing_time = self.get_crash_time(line_contain_crash)
+            if not crashing_time:
+                self.ERROR_MSG = '\n\n' + 'Failed in retrieving time for last crash:' + '[ ' + line_contain_crash + ' ]' + '\n\n'
+                return -1
+            ret = self.retrieve_path(self.DBG_VMLINUX_PATH, "vmlinu", self.CRASHED_KERNEL_NAME, crashing_time)
+            if (ret != 0):
+                self.ERROR_MSG = '\n\n' + 'Failed in retrieving debug linux:' + '[ ' + self.CRASHED_KERNEL_NAME + ' ]' + ' inside path: [ ' + self.DBG_VMLINUX_PATH + ' ]' + '\nPlease install debug linux and rerun sosreport\n\n'
+                return -1
+            ret = self.retrieve_path(self.VMCORE_STND_PATH, "vmcore", "", crashing_time)
+            if (ret != 0):
+                self.ERROR_MSG = '\n\n' + 'Failed in retrieving vmcore for kernel[ ' + self.CRASHED_KERNEL_NAME + ']' +' inside path: [ ' + self.VMCORE_STND_PATH + ' ]' + '\n\n'
+                return -1
+            return 0
+        else:
+            self.ERROR_MSG = '\n\n' + 'Failed in retrieving crash from last cmd output:' + '\n\n'
+        return -1
+
+    def dump_crash_command_errorlog_to_report_file(self, ret, msg):
+        try:
+            f= open(self.CRASH_OUTPUT_FILE,"w")
+        except:
+            print "\n\nFailed opening file to report crash_command_errorlog_to file:[%s]\n\n" % self.CRASH_OUTPUT_FILE
+            return -1
+        if (ret != 0):
+            try:
+                f.write("\n\nDidn't execute crash command due to error[%s]\n\n" % self.ERROR_MSG)
+            except:
+                print "Unable to write[failure] status log to crash_output_file\n"
+        else:
+            try:
+                f.write("Complete Crash command: %s\n\n" % msg)
+            except:
+                print "Unable to write[complete crash command] to crash_output_file\n"
+        f.close()
+        return 0
+
+    def postproc(self):
+        if os.path.exists(self.LAST_CMD_OP_FILE):
+            os.remove(self.LAST_CMD_OP_FILE)
+        if os.path.exists(self.INT_DATA_LOG):
+            os.remove(self.INT_DATA_LOG)
+        if os.path.exists(self.CRASH_CMD_FILE):
+            os.remove(self.CRASH_CMD_FILE)
+        if os.path.exists(self.CRASH_OUTPUT_FILE):
+            os.remove(self.CRASH_OUTPUT_FILE)
+
+    def get_crash_report(self):
+        cmd_list = [
+            'sys,"SYSTEM INFORMATION"',
+            'mach,"MACHINE SPECIFIC DATA"',
+            'bt,"STACK TRACE OF CURRENT CONTEXT"',
+            'bt -a,"STACK TRACES OF ACTIVE TASKS"',
+            'kmem -i,"MEMORY USAGE"',
+            'kmem -s,"KMALLOC SLAB DATA"',
+            'mod,"MODULES"',
+            'ps,"PROCESS STATUS"',
+            'log,"SYSTEM MESSAGE BUFFER"',
+            'files,"OPEN FILES OF CURRENT CONTEXT"',
+            'dev -p,"PCI DEVICE DATA"',
+            'runq,"RUN QUEUE TASK"',
+            'mach -o,"OPALMSG LOG"',
+            'irq -s,"DUMP KERNEL IRQ STATE"'
+        ]
+
+        try:
+            f2= open(self.CRASH_CMD_FILE,"w")
+        except:
+            self.ERROR_MSG = '\n\n' + 'Failed opening file:' + '[ ' + self.CRASH_CMD_FILE + ' ]' + '\n\n'
+            self.dump_crash_command_errorlog_to_report_file(-1, "")
+            print "\n\n%s\n\n" % self.ERROR_MSG
+            return
+        for element in cmd_list:
+            (cmd, header) = element.split(',', 1)
+            header=header.split('\n', 1)[0]
+            try:
+                f2.write("""
+                    !echo {HEADER}: >{DATA_LOG}
+                    !echo >> {DATA_LOG}
+                    {CMD} >> {DATA_LOG}
+                    !echo >> {DATA_LOG}
+                    !echo >> {DATA_LOG}
+                    !cat {DATA_LOG} >> {OUTPUT}""".format(
+                        DATA_LOG=self.INT_DATA_LOG,
+                        HEADER=header.split('\n', 1)[0],
+                        OUTPUT=self.CRASH_OUTPUT_FILE, CMD=cmd))
+            except:
+                print "Failed to write cmd[%s]\n" %cmd
+        try:
+            f2.write('\nquit')
+        except:
+            print "Failed to write cmd[quit]\n"
+
+        f2.close()
+        ret = self.get_vmlinux_vmcore_path()
+        if (ret != 0):
+            self.dump_crash_command_errorlog_to_report_file(ret, "")
+            print "\n\n%s\n\n" % self.ERROR_MSG
+            return
+        crash_full_command  = "crash" + " -si " + self.CRASH_CMD_FILE + " " + self.VMLINUX_PATH + " " + self.VMCORE_PATH
+        self.dump_crash_command_errorlog_to_report_file(ret, crash_full_command)
+        print ("\n\nCollecting crash report. This may take a while !!! ...\n\n")
+        try:
+            ret = call(["crash", "-si", self.CRASH_CMD_FILE, self.VMLINUX_PATH, self.VMCORE_PATH])
+        except OSError as e:
+            if e.errno == os.errno.ENOENT:
+                self.ERROR_MSG = '\n\n' + 'crash tool is not installed. Kindly install crash-utility, then rerun sosreport'
+            else:
+                self.ERROR_MSG = '\n\n' + 'Unable to run crash command due to exception occurred'
+
+            self.dump_crash_command_errorlog_to_report_file(-1, "")
+            print "%s\n\n" % self.ERROR_MSG
+        return
+
+class RedHatCrash_log(Crash_log, RedHatPlugin):
+
+    def setup(self):
+        Crash_log.DISTRO_NAME = 'REDHAT'
+        self.get_crash_report()
+        self.add_copy_spec([
+            Crash_log.CRASH_OUTPUT_FILE
+        ])
+
+
+class DebianCrash_log(Crash_log, DebianPlugin, UbuntuPlugin):
+
+    def setup(self):
+        Crash_log.DISTRO_NAME = 'UBUNTU'
+        self.get_crash_report()
+        self.add_copy_spec([
+            Crash_log.CRASH_OUTPUT_FILE
+        ])
+
+# vim: set et ts=4 sw=4 :
-- 
2.7.4



