Kernel audit output is inconsistent, hard to parse

Thu Jan 31 21:11:38 UTC 2008

At one time we talked about converting to a binary record format.
Maybe some structure would be helpful in defining and enforcing
rules for what information in the records can/should be and what it
all means.

-- ljk

John Dennis wrote:
> The format of audit messages from the kernel is a mess. The
> bottom line is one cannot parse the audit messages without special
> case knowledge of each audit message because the data formatting does
> not follow any regular rules.
> 
> I don't know how it got this way, but it really needs to be fixed.
> 
> The primary offense is string formatting, specifically the use or
> non-use of the functions audit_log_hex() audit_log_untrustedstring(),
> and audit_log_n_untrustedstring(); depending on circumstances.
> 
> The net result is a field value might be one of the following cases:
> 
> 1) a string without quotes (maybe a string, maybe an int, etc.)
> 
> 2) a string enclosed in quotes (implies a string with no escaped chars)
> 
> 3) a string which is represented as a sequence of hex values (not
>    enclosed in quotes, but how do you distinguish this from case 1?)
> 
> Given the name=value formatting it is absolutely impossible to
> correctly interpret the value component unless you know how
> audit_log_format was invoked to generate the name=value pair. This
> will be dependent on the kernel version, the field name, and the audit
> record type. To be specific, during parsing only case 2 is
> unambiguous. You cannot determine between case 1 and case
> 3. Heuristics based on each character being in the hexadecimal
> character set fail for a significant subset of data, thus you don't
> know if the value is a string encoded in hexadecimal which needs to be
> decoded or a string which happens to be composed of hexadecimal
> characters but is not encoded.
> 
> Thus we have the situation where to correctly parse the name=value
> pair one must know the audit record type, the field name, and the
> kernel version. That is just plain CRAZY and UNNECESSARY. Trying hide
> this logic in auparse is just a band-aid over the problem compounded
> by the fact auparse does not always get it right either. This is in
> conjunction with the fact auparse has no way to know the kernel
> version of the audit data it is attempting to parse (nor should it
> even have tables based on kernel version).
> 
> The answer is to make the output parsable without special case
> knowledge. It would appear many of these problems were introduced with
> the functions audit_log_hex() audit_log_untrustedstring(), and
> audit_log_n_untrustedstring() which attempt to correct for a double
> quote, white space, or non-printable character in the output
> string. However these are not used uniformly nor do they follow any
> common approach for string representations in user land (why not?).
> 
> All field values without exception need to be enclosed in quotes to
> delimit the value. Special characters inside the quotes need to
> be escaped, following some standard convention. Please, lets not
> invent a new encoding, this problem has already been solved
> elsewhere many times before!
> 
> Also note the function audit_log_n_untrustedstring() in audit.c has a
> bug and ignores the len parameter (it iterates till it finds a NULL
> terminator even though it's supposed to stop after n chars).
> 
> Suggested Fix:
> --------------
> 
> Most of these problems can easily be fixed if there is exactly one
> central place to format an audit field value. The function
> audit_log_vformat() could very easily ensure consistent formatting via
> % format specifiers in the format string, e.g.:
> 
>     audit_log_format("n=%d path=%s", n, path);
> 
> Building audit output piecewise would be deprecated, e.g. these types
> of sequences would be eliminated.
> 
>     audit_log_format(ab, " n=%d", n);
>     audit_log_format(ab, " name=");
>     audit_log_foo();
> 
> and replaced with:
> 
>     audit_log_format(ab, " n=%d name=%s", foo_to_string(foo));
> 
> Whenever audit_log_vformat() encounters a % format specifier it
> formats to a string, then it converts the string to an escaped quoted
> string, and then inserts the escaped quoted string into the buffer
> (e.g. n="123" name="foo bar\n" )
> 
> This way the formatting is consistent, easy to apply, and is never
> special cased by the caller.
> 
> There are no performance penalties of any note, calling a routine to
> escape only needs to be done when the format specifier is %s. Currently
> this is already done for a subset of output strings, so all we're doing
> is removing the responsibility for escaping from the caller and doing
> it consistently instead of in a subset of cases.
> 
> I don't really care what the encoding is. I only care that it is an
> encoding with wide support. Backslash quoting is very popular,
> familiar and has many implementations. The MIME quoted-printable
> transfer encoding would be another option but might pose some problems
> with line endings. I think backslash quoting would be a good choice. I
> suspect everyone reading this message already knows exactly how to
> interpret a string with backslash escapes.
> 
> Auparse is not the answer:
> --------------------------
> 
> Auparse is not the answer to irregular kernel audit message
> formatting. First of all it forces auparse to have special case logic
> which is not 100% robust and is tied to the kernel source code
> version.
> 
> Second, in it's current implementation auparse confuses transfer
> decoding and substitution, two entirely different concepts needing to
> be applied in entirely different circumstances, but which have been
> conflated.
> 
> auparse_get_field_str() returns the field value in it's encoded form,
> this is almost never of value to the caller. The caller wants the
> field value to be unencoded so it can operate on it. If you want the
> field value to be unencoded you have to call
> auparse_interpret_field(). But auparse_interpret_field() performs two
> distinctly different operations, it both decodes AND performs
> contextual substitution. Contextual substitution only has meaning when
> applied on the same host and at approximately the same time as when
> the audit record was generated. Contextual substitution is mainly of
> value for human readable output, it is difficult to utilize with
> automated machine processing. At the moment it is not possible to get
> a decoded value from auparse without it also performing undesired
> substitution.
> 
> While we're at it:
> ------------------
> 
> If we do fix the format of audit messages we might as well fix some
> other inconsistencies at the same time.
> 
> 1) The initial part of AVC messages do not follow the standard
>    name=value formatting used everywhere else in audit.
> 
>    a) It includes the string "avc:" which is redundant with the audit
>    record type (e.g. type=AVC), the string "avc:" should be removed,
>    it serves no purpose and only makes parsing much harder because of
>    the inconsistency.
> 
>    b) denied|granted are bare words without a field name, it should be
>    seresult="denied", once again to avoid special case parsing.
> 
>    c) The list of operations are enclosed in curly braces {} without a
>    field name, this should be seperms=xxx, where xxx is a list. The
>    use of curly braces to encode a list in audit data is unique. We
>    should define how any audit message should encode a list of values
>    and use that consistently for all audit data. While one could
>    define a syntax such as "[value1, value2]" or some such, it might
>    be informative to look at how other transfer mechanisms such as
>    structured markup and ldap handle this case. They both utilize the
>    concept of multi-valued attributes. Thus there is no list
>    structure, but an attribute is allowed to repeat itself and in the
>    process implicitly creates a list of values for the attribute. Thus
>    {read write} might be represented as seperms="read"
>    seperms="write". This regularity makes parsing much easier, it
>    avoids special case syntax.
> 
> 2) (Note, this is not a kernel issue) The host data is currently
>    prepended to the audit record with the format host=xxx. Is this an
>    encoded string or not? It should be encoded and it should be
>    encoded in exactly the same format as the name/value pairs in the
>    audit records. The same holds true for the record type, it should
>    follow the same syntax as every other name/value pair.
> 
> 3) The string "audit(ssssss.mmmm:iiii):" is a critical delimiter, it
>    separates record properties (e.g. host, type, timestamp) from
>    record data, which must be a sequence of name="value" pairs. But
>    the time stamp should really follow the name/value pair encoding
>    used elsewhere.
> 
> Desired syntax:
> ---------------
> 
> Records consist of a sequence of name="value" pairs.
> 
> Ordering of name/value pairs is significant for multi-valued
> attributes (i.e. where name appears more than once), insignificant
> otherwise.
> 
> The value MUST be enclosed in double quotes with interior characters
> properly escaped.
> 
> White space between name, '=', and "value" is insignificant and
> ignored.
> 
> The audit record is partitioned into two parts
> 
>   a) record properties (i.e. host, record type, timestamp)
> 
>   b) record data
> 
> The partition of properties and data occurs at a colon delimiter,
> i.e. properties : data
> 
> The current formatting of the record timestamp
> (e.g. audit(ssss.mmm:iii) is inconsistent with
> all other name/value pairs. It should be "seconds="sss"
> milliseconds="mmm" serial="iii", this allows parsing to be regular and
> consistent.
> 
> Thus an audit record with consistent syntax would look like this,
> where brackets [] indicate optional components:
> 
> [host=""] type="" seconds="" milliseconds="" serial="" : name="" [name=""]
> 
> What has to change and what's optional:
> ---------------------------------------
> 
> The formatting of name/value pairs in the kernel must be fixed, it is
> simply impossible to correctly parse in it's current state.
> 
> The rest of the suggested changes are syntactic sugar which would make
> parsing easier because of regular syntax, but they are not
> critical. We could retain the existing formats if backwards
> compatibility is felt to trump syntactic cleanliness and ease in
> parsing. It's a judgment call over when and how to introduce change
> and the anticipated impact.
> 
>