Kernel audit output is inconsistent, hard to parse
Linda Knippers
linda.knippers at hp.com
Thu Jan 31 21:11:38 UTC 2008
At one time we talked about converting to a binary record format.
Maybe some structure would be helpful in defining and enforcing
rules for what information in the records can/should be and what it
all means.
-- ljk
John Dennis wrote:
> The format of audit messages from the kernel is a mess. The
> bottom line is one cannot parse the audit messages without special
> case knowledge of each audit message because the data formatting does
> not follow any regular rules.
>
> I don't know how it got this way, but it really needs to be fixed.
>
> The primary offense is string formatting, specifically the use or
> non-use of the functions audit_log_hex() audit_log_untrustedstring(),
> and audit_log_n_untrustedstring(); depending on circumstances.
>
> The net result is a field value might be one of the following cases:
>
> 1) a string without quotes (maybe a string, maybe an int, etc.)
>
> 2) a string enclosed in quotes (implies a string with no escaped chars)
>
> 3) a string which is represented as a sequence of hex values (not
> enclosed in quotes, but how do you distinguish this from case 1?)
>
> Given the name=value formatting it is absolutely impossible to
> correctly interpret the value component unless you know how
> audit_log_format was invoked to generate the name=value pair. This
> will be dependent on the kernel version, the field name, and the audit
> record type. To be specific, during parsing only case 2 is
> unambiguous. You cannot determine between case 1 and case
> 3. Heuristics based on each character being in the hexadecimal
> character set fail for a significant subset of data, thus you don't
> know if the value is a string encoded in hexadecimal which needs to be
> decoded or a string which happens to be composed of hexadecimal
> characters but is not encoded.
>
> Thus we have the situation where to correctly parse the name=value
> pair one must know the audit record type, the field name, and the
> kernel version. That is just plain CRAZY and UNNECESSARY. Trying hide
> this logic in auparse is just a band-aid over the problem compounded
> by the fact auparse does not always get it right either. This is in
> conjunction with the fact auparse has no way to know the kernel
> version of the audit data it is attempting to parse (nor should it
> even have tables based on kernel version).
>
> The answer is to make the output parsable without special case
> knowledge. It would appear many of these problems were introduced with
> the functions audit_log_hex() audit_log_untrustedstring(), and
> audit_log_n_untrustedstring() which attempt to correct for a double
> quote, white space, or non-printable character in the output
> string. However these are not used uniformly nor do they follow any
> common approach for string representations in user land (why not?).
>
> All field values without exception need to be enclosed in quotes to
> delimit the value. Special characters inside the quotes need to
> be escaped, following some standard convention. Please, lets not
> invent a new encoding, this problem has already been solved
> elsewhere many times before!
>
> Also note the function audit_log_n_untrustedstring() in audit.c has a
> bug and ignores the len parameter (it iterates till it finds a NULL
> terminator even though it's supposed to stop after n chars).
>
> Suggested Fix:
> --------------
>
> Most of these problems can easily be fixed if there is exactly one
> central place to format an audit field value. The function
> audit_log_vformat() could very easily ensure consistent formatting via
> % format specifiers in the format string, e.g.:
>
> audit_log_format("n=%d path=%s", n, path);
>
> Building audit output piecewise would be deprecated, e.g. these types
> of sequences would be eliminated.
>
> audit_log_format(ab, " n=%d", n);
> audit_log_format(ab, " name=");
> audit_log_foo();
>
> and replaced with:
>
> audit_log_format(ab, " n=%d name=%s", foo_to_string(foo));
>
> Whenever audit_log_vformat() encounters a % format specifier it
> formats to a string, then it converts the string to an escaped quoted
> string, and then inserts the escaped quoted string into the buffer
> (e.g. n="123" name="foo bar\n" )
>
> This way the formatting is consistent, easy to apply, and is never
> special cased by the caller.
>
> There are no performance penalties of any note, calling a routine to
> escape only needs to be done when the format specifier is %s. Currently
> this is already done for a subset of output strings, so all we're doing
> is removing the responsibility for escaping from the caller and doing
> it consistently instead of in a subset of cases.
>
> I don't really care what the encoding is. I only care that it is an
> encoding with wide support. Backslash quoting is very popular,
> familiar and has many implementations. The MIME quoted-printable
> transfer encoding would be another option but might pose some problems
> with line endings. I think backslash quoting would be a good choice. I
> suspect everyone reading this message already knows exactly how to
> interpret a string with backslash escapes.
>
> Auparse is not the answer:
> --------------------------
>
> Auparse is not the answer to irregular kernel audit message
> formatting. First of all it forces auparse to have special case logic
> which is not 100% robust and is tied to the kernel source code
> version.
>
> Second, in it's current implementation auparse confuses transfer
> decoding and substitution, two entirely different concepts needing to
> be applied in entirely different circumstances, but which have been
> conflated.
>
> auparse_get_field_str() returns the field value in it's encoded form,
> this is almost never of value to the caller. The caller wants the
> field value to be unencoded so it can operate on it. If you want the
> field value to be unencoded you have to call
> auparse_interpret_field(). But auparse_interpret_field() performs two
> distinctly different operations, it both decodes AND performs
> contextual substitution. Contextual substitution only has meaning when
> applied on the same host and at approximately the same time as when
> the audit record was generated. Contextual substitution is mainly of
> value for human readable output, it is difficult to utilize with
> automated machine processing. At the moment it is not possible to get
> a decoded value from auparse without it also performing undesired
> substitution.
>
> While we're at it:
> ------------------
>
> If we do fix the format of audit messages we might as well fix some
> other inconsistencies at the same time.
>
> 1) The initial part of AVC messages do not follow the standard
> name=value formatting used everywhere else in audit.
>
> a) It includes the string "avc:" which is redundant with the audit
> record type (e.g. type=AVC), the string "avc:" should be removed,
> it serves no purpose and only makes parsing much harder because of
> the inconsistency.
>
> b) denied|granted are bare words without a field name, it should be
> seresult="denied", once again to avoid special case parsing.
>
> c) The list of operations are enclosed in curly braces {} without a
> field name, this should be seperms=xxx, where xxx is a list. The
> use of curly braces to encode a list in audit data is unique. We
> should define how any audit message should encode a list of values
> and use that consistently for all audit data. While one could
> define a syntax such as "[value1, value2]" or some such, it might
> be informative to look at how other transfer mechanisms such as
> structured markup and ldap handle this case. They both utilize the
> concept of multi-valued attributes. Thus there is no list
> structure, but an attribute is allowed to repeat itself and in the
> process implicitly creates a list of values for the attribute. Thus
> {read write} might be represented as seperms="read"
> seperms="write". This regularity makes parsing much easier, it
> avoids special case syntax.
>
> 2) (Note, this is not a kernel issue) The host data is currently
> prepended to the audit record with the format host=xxx. Is this an
> encoded string or not? It should be encoded and it should be
> encoded in exactly the same format as the name/value pairs in the
> audit records. The same holds true for the record type, it should
> follow the same syntax as every other name/value pair.
>
> 3) The string "audit(ssssss.mmmm:iiii):" is a critical delimiter, it
> separates record properties (e.g. host, type, timestamp) from
> record data, which must be a sequence of name="value" pairs. But
> the time stamp should really follow the name/value pair encoding
> used elsewhere.
>
> Desired syntax:
> ---------------
>
> Records consist of a sequence of name="value" pairs.
>
> Ordering of name/value pairs is significant for multi-valued
> attributes (i.e. where name appears more than once), insignificant
> otherwise.
>
> The value MUST be enclosed in double quotes with interior characters
> properly escaped.
>
> White space between name, '=', and "value" is insignificant and
> ignored.
>
> The audit record is partitioned into two parts
>
> a) record properties (i.e. host, record type, timestamp)
>
> b) record data
>
> The partition of properties and data occurs at a colon delimiter,
> i.e. properties : data
>
> The current formatting of the record timestamp
> (e.g. audit(ssss.mmm:iii) is inconsistent with
> all other name/value pairs. It should be "seconds="sss"
> milliseconds="mmm" serial="iii", this allows parsing to be regular and
> consistent.
>
> Thus an audit record with consistent syntax would look like this,
> where brackets [] indicate optional components:
>
> [host=""] type="" seconds="" milliseconds="" serial="" : name="" [name=""]
>
> What has to change and what's optional:
> ---------------------------------------
>
> The formatting of name/value pairs in the kernel must be fixed, it is
> simply impossible to correctly parse in it's current state.
>
> The rest of the suggested changes are syntactic sugar which would make
> parsing easier because of regular syntax, but they are not
> critical. We could retain the existing formats if backwards
> compatibility is felt to trump syntactic cleanliness and ease in
> parsing. It's a judgment call over when and how to introduce change
> and the anticipated impact.
>
>
More information about the Linux-audit
mailing list