[augeas-devel] Weird print result with Perl bindings on amd64

Thu Jan 15 19:02:10 UTC 2009

On Thu, 2009-01-15 at 16:46 +0100, Jim Meyering wrote:
> David Lutterkort <lutter at redhat.com> wrote:
> > On Wed, 2009-01-14 at 19:04 +0100, Dominique Dumont wrote:
> >> David Lutterkort <lutter at redhat.com> writes:
> >> >> Actually, after much tinkering with sshd lens, I have the gut feeling
> >> >> that the problem is in the key_re lens. Looks like the '-' operator
> >> >> between the 2 regex is not working properly.
> >> >
> >> > At first, that was my suspicion, too, but the regular expressions that
> >> > are used for matching are identical, and I can see in the debugger that
> >> > the regex matcher produces different results.
> >>
> >> I'm not sure that I follow you.
> >>
> >> IMHO, the suspect lens is
> >>
> >>    let key_re = /[A-Za-z0-9]+/
> >>          - /MACs|Match|AcceptEnv|Subsystem|(Allow|Deny)(Groups|Users)/
> >>
> >> The regex before and after the '-' are not identical ?? [ puzzled ]
> >>
> >> So, what do you mean by "the regular expressions that are used for
> >> matching are identical" ?
> >
> > Oh .. what I meant was: I checked with gdb what is happening behind the
> > scenes when the sshd lens is run on your example sshd_config, both
> > running it with augtool and with your Perl example.
> >
> > In both cases, the regexp that is fed to re_match[1] is exactly the
> > same, but the results of matching are different.
> >
> >> On my side, I've tinkered a lot with the regex on the right side and
> >> never managed to have an effect. Even
> >>
> >>   let key_re = /[A-Za-z0-9]+/ - "Match"
> >>
> >> does not work. Hence the suspicion regarding the '-'
> >
> > It does not appear that the '-' is the problem. When you compute the
> > regexp for the above, you get
> >
> >         /Match[0-9A-Za-z][0-9A-Za-z]*|Matc([0-9A-Za-gi-z][0-9A-Za-z]*|())|Mat([0-9A-Zabd-z][0-9A-Za-z]*|())|Ma([0-9A-Za-su-z][0-9A-Za-z]*|())|(M[0-9A-Zb-z]|[0-9A-LN-Za-z][0-9A-Za-z])[0-9A-Za-z]*|M|[0-9A-LN-Za-z]/
> >
> > which is correct.
> 
> Actually, that's the problem.
> Using such ranges is portable only in the C locale.
> Sometimes [A-Z] contains 51, sometimes a slightly different set of 51.
> Sometimes the expected 52.  That's why [[:upper:]] came about.

The problem is that Augeas should always be operating in the C locale,
no matter what the user has in their environment - the regexps are read
from files that should mean exactly the same in any locale.

AFAICT, there's no clean way for libaugeas to switch to C locale upon
entry to one of its functions, and switch back to the user's locale on
return, since setlocale changes the locale for the entire process, not
just individual threads.
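
To make the limitation above concrete, here is a minimal sketch of the
"switch to C, match, switch back" approach using POSIX regcomp/regexec.
The helper name match_in_c_locale is hypothetical and this is not code
from libaugeas; it only illustrates why the approach is unattractive for
a library: between the two setlocale calls, every other thread in the
process also sees the C locale.

```c
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of libaugeas.
 * Returns 1 if text matches pattern, 0 if it does not, -1 on error.
 * The pattern is a POSIX ERE; the caller supplies ^...$ anchors.
 * NOT thread-safe: setlocale() changes the locale of the whole process. */
int match_in_c_locale(const char *pattern, const char *text)
{
    char *saved = NULL;
    const char *cur = setlocale(LC_ALL, NULL);   /* query only */

    /* Copy the name: the returned string may be overwritten by the
     * next setlocale() call. */
    if (cur != NULL) {
        saved = malloc(strlen(cur) + 1);
        if (saved == NULL)
            return -1;
        strcpy(saved, cur);
    }
    setlocale(LC_ALL, "C");

    regex_t re;
    int result = -1;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) == 0) {
        result = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
        regfree(&re);
    }

    /* Restore whatever locale the application had selected. */
    if (saved != NULL) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
    return result;
}
```

With this sketch, match_in_c_locale("^[A-Z]+$", "uvw") reports no match
regardless of the LC_* settings the application selected, at the cost of
briefly changing the locale for the entire process.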

> That's one ugly regexp.  Glad it's generated.
> But it's too bad you have to deal with it at all (ie when debugging).
> Is it too late to consider using more powerful regexps?
> IMHO, the spec imposed by using POSIX extended regexps is
> seriously limiting and has been passé for years.

Choosing the limited syntax of POSIX ERE was quite deliberate, since
Augeas needs to convert regexps to finite automata for the typechecker.
Some of the extensions, especially in Perl regexps, take them out of the
realm of regular languages, most notably back references (which are also
in POSIX, but not supported by Augeas) and recursive matches.

> I.e., it's hard to write readable regexps when you're
> restricted to POSIX EREs, compared to those of Perl/Ruby and even Emacs.
> Adding usable (short) class name abbreviations \d, \w, \s, \S, etc. alone
> makes a huge difference in practice.  Not to mention things like the
> non-greedy (shy) .*? modifier, and...

Some of those abbreviations would indeed be handy, but the Augeas
language makes it possible to use these on a language level, i.e.
instead of

        let re = /[A-Z]*|([a-z]+[0-9]*)/

you could write

        let upper = /[A-Z]/
        let lower = /[a-z]/
        let digit = /[0-9]/
        let re = upper* | lower+ . digit*

but either way, [A-Z] has to be interpreted in the C locale, not the
user's current locale.

> Back to your example,
> i.e., with perl, /whatever(?!Match)/ would match any occurrence
> of "whatever" that is not followed by "Match".
> 
> From "man perlre"
> 
>                  "(?!pattern)"
>                      A zero-width negative look-ahead assertion.  For example
>                      "/foo(?!bar)/" matches any occurrence of "foo" that isn't
>                      followed by "bar".

This is one of the extensions that doesn't map very well to regular
languages or finite automata. For Augeas, it's also not needed: since
the regular expressions in Augeas must always match an entire string,
i.e. they are implicitly embedded in a ^..$, there's no point in these
assertions - you'd need to match something like 'foo' followed by
something that is not 'bar' (where the definition of 'something' depends
on what you are using the regexp for).
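
As a rough illustration of whole-string matching, the sketch below
emulates it with POSIX regcomp by wrapping the pattern in ^( )$. The
helper name full_match is hypothetical, and this is only an emulation
for the sake of the example, not a description of how Augeas implements
its matching.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the WHOLE of text matches the
 * POSIX ERE pattern, 0 if it does not, -1 on error.
 * Emulates the implicit ^..$ by wrapping the pattern in ^( )$. */
int full_match(const char *pattern, const char *text)
{
    /* "^(" + pattern + ")$" + terminating NUL */
    size_t len = strlen(pattern) + 5;
    char *anchored = malloc(len);
    if (anchored == NULL)
        return -1;
    snprintf(anchored, len, "^(%s)$", pattern);

    regex_t re;
    int result = -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) == 0) {
        result = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
        regfree(&re);
    }
    free(anchored);
    return result;
}
```

Here full_match("foo", "foobar") fails because the regexp has to account
for the trailing "bar" as well, so whatever may follow 'foo' has to be
spelled out in the regexp itself, e.g. full_match("foo[a-z]*", "foobar").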

> > Under
> > a C locale, this does in fact not match 'Match', but under many other
> > locales, e.g. en_US or de_DE or en_US.utf8, it does.
> 
> In en_US, the expansion of [A-Z] might include [AbBcCdD...zZ],
> so that range doesn't do what you want.
> It's for this reason that you see spelled-out ranges, e.g.,
> 
>   [abcdefghijklmnopqrstuvwxyz]
> 
> in applications (and all libraries!) that can't force the locale to C.
> Applying that kludge would render your already ugly example totally
> incomprehensible and unmaintainable.

Yeah, ugly, but seeing how I have no way to switch temporarily to the C
locale, I'll have to resort to that to make sure libaugeas always
behaves as if it were using the C locale.
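
A minimal sketch of that workaround, assuming an ASCII-compatible
character set in which digits and letters are contiguous: a range such
as a-z is rewritten as a bracket expression that lists every character,
so the matcher never consults the locale's collation order. The name
spell_out_range is hypothetical; this is not the code that went into
libaugeas.

```c
#include <stddef.h>

/* Both endpoints must lie in the same ASCII class (digits, upper case
 * or lower case). Relies on lo <= hi having been checked already.
 * Deliberately avoids isalnum() and friends, which are themselves
 * locale-dependent. */
static int same_ascii_class(unsigned char lo, unsigned char hi)
{
    return (lo >= '0' && hi <= '9')
        || (lo >= 'A' && hi <= 'Z')
        || (lo >= 'a' && hi <= 'z');
}

/* Hypothetical helper: writes "[" + every character from lo to hi
 * (inclusive) + "]" into out.
 * Returns the length written (excluding the NUL), or -1 if the range
 * is not a plain digit/letter range or the buffer is too small. */
int spell_out_range(char lo, char hi, char *out, size_t outsz)
{
    unsigned char l = (unsigned char)lo;
    unsigned char h = (unsigned char)hi;

    if (l > h || !same_ascii_class(l, h))
        return -1;

    /* the characters, plus '[' and ']' and the terminating NUL */
    size_t need = (size_t)(h - l + 1) + 3;
    if (outsz < need)
        return -1;

    size_t n = 0;
    out[n++] = '[';
    for (unsigned int c = l; c <= h; c++)
        out[n++] = (char)c;
    out[n++] = ']';
    out[n] = '\0';
    return (int)n;
}
```

For example, spell_out_range('a', 'z', buf, sizeof buf) fills buf with
the spelled-out lower-case class quoted above. Ranges that cross
classes, such as A-z, are rejected rather than guessed at.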

> ...
> >> > Have you tried to reproduce this on 32bit ? And with what LC_*/LANG
> >> > vars ?
> >>
> >> Yes. 32 bits has *always* worked whatever LC_*/LANG I set (by default,
> >> LANG is en_US with utf8). I can provide a more detailed report if you
> >> want.
> >
> > I can reproduce these problems with augparse/augtool if I stick a
> > 'setlocale(LC_ALL, "")' into their main, thus making them obey the LC_*
> > env vars - why that would be architecture specific though is beyond me.
> 
> [going from memory...]
> It's because gnulib detects a particular bug in glibc's 32-bit
> regexp support and then uses the replacement.  But the replacement
> doesn't have glibc's locale support.

It's actually the other way around: on 64 bit systems with 32 bit ints,
gnulib's regex is used; from my testing, it seems the regex
implementation in glibc does _not_ match 'uvw' with '[A-Z]+' in en_US,
whereas gnulib's does.

David