[augeas-devel] Weird print result with Perl bindings on amd64

Thu Jan 15 20:19:50 UTC 2009

David Lutterkort <lutter at redhat.com> wrote:
> On Thu, 2009-01-15 at 16:46 +0100, Jim Meyering wrote:
>> David Lutterkort <lutter at redhat.com> wrote:
>> > On Wed, 2009-01-14 at 19:04 +0100, Dominique Dumont wrote:
>> >> David Lutterkort <lutter at redhat.com> writes:
>> >> >> Actually, after much tinkering with sshd lens, I have the gut feeling
>> >> >> that the problem is in the key_re lens. Looks like the '-' operator
>> >> >> between the 2 regex is not working properly.
>> >> >
>> >> > At first, that was my suspicion, too, but the regular expressions that
>> >> > are used for matching are identical, and I can see in the debugger that
>> >> > the regex matcher produces different results.
>> >>
>> >> I'm not sure that I follow you.
>> >>
>> >> IMHO, the suspect lens is
>> >>
>> >>    let key_re = /[A-Za-z0-9]+/
>> >>          - /MACs|Match|AcceptEnv|Subsystem|(Allow|Deny)(Groups|Users)/
>> >>
>> >> The regex before and after the '-' are not identical ?? [ puzzled ]
>> >>
>> >> So, what do you mean by "the regular expressions that are used for
>> >> matching are identical" ?
>> >
>> > Oh .. what I meant was: I checked with gdb what is happening behind the
>> > scenes when the sshd lens is run on your example sshd_config, both
>> > running it with augtool and with your Perl example.
>> >
>> > In both cases, the regexp that is fed to re_match[1] is exactly the
>> > same, but the results of matching are different.
>> >
>> >> On my side, I've tinkered a lot with the regex on the right side and never
>> >> managed to have an effect. Even
>> >>
>> >>   let key_re = /[A-Za-z0-9]+/ - "Match"
>> >>
>> >> does not work. Hence the suspicion regarding the '-'
>> >
>> > It does not appear that the '-' is the problem. When you compute the
>> > regexp for the above, you get
>> >
>> >         /Match[0-9A-Za-z][0-9A-Za-z]*|Matc([0-9A-Za-gi-z][0-9A-Za-z]*|())|Mat([0-9A-Zabd-z][0-9A-Za-z]*|())|Ma([0-9A-Za-su-z][0-9A-Za-z]*|())|(M[0-9A-Zb-z]|[0-9A-LN-Za-z][0-9A-Za-z])[0-9A-Za-z]*|M|[0-9A-LN-Za-z]/
>> >
>> > which is correct.
>>
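
As a reference point for what the subtraction is meant to accept, here is a
minimal sketch that tests membership in /[A-Za-z0-9]+/ - "Match" directly,
without any regex engine or locale lookup. The function names are
hypothetical and are not part of Augeas; the sketch assumes an ASCII
execution character set.

```c
#include <assert.h>
#include <string.h>

/* 1 if c is in [A-Za-z0-9], judged by ASCII value only (no locale lookup). */
static int is_ascii_alnum(char c)
{
    return (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9');
}

/* Hypothetical helper: 1 if s is in /[A-Za-z0-9]+/ - "Match", else 0.
   The whole string must match, as Augeas regexps are implicitly anchored. */
int in_key_re_minus_match(const char *s)
{
    assert(s != NULL);
    if (*s == '\0')
        return 0;                      /* '+' needs at least one character */
    for (const char *p = s; *p != '\0'; p++)
        if (!is_ascii_alnum(*p))
            return 0;
    return strcmp(s, "Match") != 0;    /* subtract the single word "Match" */
}
```

A generated regexp for the difference should agree with this function on
every input: it rejects "Match" but accepts "Matc", "Matchx" and "M".
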
>> Actually, that's the problem.
>> Using such ranges is portable only in the C locale.
>> Sometimes [A-Z] contains 51, sometimes a slightly different set of 51.
>> Sometimes the expected 52.  That's why [[:upper:]] came about.
>
> The problem is that Augeas should always be operating in the C locale,

i.e., Augeas wishes it could assume it is operating in the C locale ;-)
In other words, you'd like locale-agnostic/ignoring regexp code.

> no matter what the user has in their environment - the regexps are read
> from files that should mean exactly the same in any locale.
>
> AFAICT, there's no clean way for libaugeas to switch to C locale upon
> entry to one of its functions, and switch back to the user's locale on
> return, since setlocale changes the locale for the entire process, not
> just individual threads.

Right.  Library code must not modify global (per-process,
thread-spanning) state.
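
For comparison, here is a rough sketch of what a thread-local switch could
look like on platforms that provide the POSIX.1-2008 calls newlocale() and
uselocale(). That availability is an assumption, as is the premise that the
regex implementation in use consults the calling thread's locale. The helper
name is hypothetical and not part of libaugeas.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all of text matches pattern (POSIX ERE)
   while the calling thread uses the C locale, 0 if not, -1 on error. */
int match_in_c_locale(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t) 0);
    if (c_loc == (locale_t) 0)
        return -1;
    locale_t saved = uselocale(c_loc);   /* changes this thread only */

    int result = -1;
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) == 0) {
        regmatch_t m;
        int rc = regexec(&re, text, 1, &m, 0);
        if (rc == 0)                     /* require a whole-string match */
            result = (m.rm_so == 0 && (size_t) m.rm_eo == strlen(text));
        else if (rc == REG_NOMATCH)
            result = 0;
        regfree(&re);
    }

    uselocale(saved);                    /* restore the previous thread locale */
    freelocale(c_loc);
    return result;
}
```

With this, match_in_c_locale("[A-Z]+", "uvw") is expected to return 0
regardless of the LC_* settings in the environment, because the process-wide
locale set by setlocale() is never touched.
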

>> That's one ugly regexp.  Glad it's generated.
>> But it's too bad you have to deal with it at all (ie when debugging).
>> Is it too late to consider using more powerful regexps?
>> IMHO, the spec imposed by using POSIX extended regexps is
>> seriously limiting and has been passé for years.
>
> Choosing the limited syntax of POSIX ERE was quite deliberate, since
> Augeas needs to convert regexps to finite automata for the typechecker.
> Some of the extensions, especially in Perl regexps, take them out of the
> realm of regular languages, most notably back references (which are also
> in POSIX, but not supported by Augeas) and recursive matches.
>
>> I.e., it's hard to write readable regexps when you're
>> restricted to POSIX EREs, compared to those of Perl/Ruby and even Emacs.
>> Adding usable (short) class name abbreviations \d, \w, \s, \S, etc. alone
>> makes a huge difference in practice.  Not to mention things like the
>> non-greedy (shy) .*? modifier, and...
>
> Some of those abbreviations would indeed be handy, but the Augeas
> language makes it possible to use these on a language level, i.e.
> instead of
>
>         let re = /[A-Z]*|([a-z]+[0-9]*)/
>
> you could write
>
>         let upper = /[A-Z]/
>         let lower = /[a-z]/
>         let digit = /[0-9]/
>         let re = upper* | lower+ . digit*
>
> but either way, [A-Z] has to be interpreted in the C locale, not the
> user's current locale.
>
>> Back to your example,
>> i.e., with perl, /whatever(?!Match)/ would match any occurrence
>> of "whatever" that is not followed by "Match".
>>
>> From "man perlre"
>>
>>                  "(?!pattern)"
>>                      A zero-width negative look-ahead assertion.  For example
>>                      "/foo(?!bar)/" matches any occurrence of "foo" that isn't
>>                      followed by "bar".
>
> This is one of the extensions that doesn't map very well to regular
> languages or finite automata. For Augeas, it's also not needed: since
> the regular expressions in Augeas must always match an entire string,
> i.e. they are implicitly embedded in a ^..$, there's no point for these
> assertions - you'd need to match something like 'foo' followed by
> something that is not bar (where the definition of 'something' depends
> on what you are using the regexp for)
>
>> > Under
>> > a C locale, this does in fact not match 'Match', but under many other
>> > locales, e.g. en_US or de_DE or en_US.utf8, it does.
>>
>> In en_US, the expansion of [A-Z] might include [AbBcCdD...zZ],
>> so that range doesn't do what you want.
>> It's for this reason that you see spelled-out ranges, e.g.,
>>
>>   [abcdefghijklmnopqrstuvwxyz]
>>
>> in applications (and all libraries!) that can't force the locale to C.
>> Applying that kludge would render your already ugly example totally
>> incomprehensible and unmaintainable.
>
> Yeah, ugly, but seeing how I have no way to switch temporarily to the C
> locale, I'll have to resort to that to make sure libaugeas always
> behaves as if it were using the C locale.
>
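
The spelling-out workaround can be mechanized, so that nobody has to type
the lists by hand. A minimal sketch, assuming an ASCII execution character
set; spell_out_range is a hypothetical name, not an existing libaugeas
function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes every character from lo to hi (inclusive)
   into out and NUL-terminates it. Returns the number of characters written,
   or 0 if the range is empty or does not fit in outsz bytes.
   Assumes ASCII, where letters and digits are contiguous. */
size_t spell_out_range(char lo, char hi, char *out, size_t outsz)
{
    assert(out != NULL);
    if (outsz > 0)
        out[0] = '\0';
    if (lo > hi)
        return 0;
    size_t n = (size_t) (hi - lo) + 1;
    if (n + 1 > outsz)
        return 0;
    for (size_t i = 0; i < n; i++)
        out[i] = (char) (lo + (int) i);
    out[n] = '\0';
    return n;
}
```

Calling spell_out_range('a', 'z', buf, sizeof buf) fills buf with the 26
lowercase letters, which can then be placed between '[' and ']' in place of
the a-z range before the pattern reaches the regex compiler.
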
>> ...
>> >> > Have you tried to reproduce this on 32bit? And with what LC_*/LANG
>> >> > vars?
>> >>
>> >> Yes. 32 bits has *always* worked whatever LC_*/LANG I set (by default,
>> >> LANG is en_US with utf8). I can provide a more detailed report if you
>> >> want.
>> >
>> > I can reproduce these problems with augparse/augtool if I stick a
>> > 'setlocale(LC_ALL, "")' into their main, thus making them obey the LC_*
>> > env vars - why that would be architecture specific though is beyond me.
>>
>> [going from memory...]
>> It's because gnulib detects a particular bug in glibc's 32-bit
>> regexp support and then uses the replacement.  But the replacement
>> doesn't have glibc's locale support.
>
> It's actually the other way around: on 64 bit systems with 32 bit ints,
> gnulib's regex is used; from my testing, it seems the regex

Read the code, which dredged up some old memories ;-)
From m4/regex.m4 (or configure --help):

    [AS_HELP_STRING([--without-included-regex],
		    [don't compile regex; this is the default on 32-bit
		     systems with recent-enough versions of the GNU C
		     Library (use with caution on other systems).
		     On systems with 64-bit ptrdiff_t and 32-bit int,
		     --with-included-regex is the default, in case
		     regex functions operate on very long strings (>2GB)])])

> implementation in glibc does _not_ match 'uvw' with '[A-Z]+' in en_US,
> whereas gnulib's does.

How did you test?  Here's what I did:
I had a coreutils build dir handy so used its regex.h and
already-built .a file:

/*
  $ gcc -g -I$HOME/w/cu/lib -I. -W -Wall rege.c \
	     $HOME/w/cu/lib/libcoreutils.a
  $ ./a.out
  [Exit 3]
*/
# define _GNU_SOURCE 1
#include <locale.h>
#include <regex.h>
#include <string.h>

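int
main (void)
{
  /* Pin the C locale, where [A-Z] is exactly the 26 ASCII capitals.  */
  if (!setlocale (LC_ALL, "C"))
    return 1;

  re_set_syntax (RE_SYNTAX_POSIX_MINIMAL_EXTENDED & ~(RE_DOT_NEWLINE));
  static struct re_pattern_buffer regex;
  memset (&regex, 0, sizeof regex);
  const char *pat = "[A-Z]+";
  /* re_compile_pattern returns NULL on success, else an error message.  */
  const char *s = re_compile_pattern (pat, strlen (pat), &regex);
  if (s)
    return 2;

  /* re_match returns the number of bytes matched at offset 0, or -1 if
     there is no match; anything short of all 3 bytes of "uvw" exits 3.  */
  struct re_registers regs;
  regoff_t o = re_match (&regex, "uvw", 3, 0, &regs);
  if (o != 3)
    return 3;

  return 0;
}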