[augeas-devel] Weird print result with Perl bindings on amd64

Thu Jan 15 15:46:16 UTC 2009

David Lutterkort <lutter at redhat.com> wrote:
> On Wed, 2009-01-14 at 19:04 +0100, Dominique Dumont wrote:
>> David Lutterkort <lutter at redhat.com> writes:
>> >> Actually, after much tinkering with sshd lens, I have the gut feeling
>> >> that the problem is in the key_re lens. Looks like the '-' operator
>> >> between the 2 regex is not working properly.
>> >
>> > At first, that was my suspicion, too, but the regular expressions that
>> > are used for matching are identical, and I can see in the debugger that
>> > the regex matcher produces different results.
>>
>> I'm not sure that I follow you.
>>
>> IMHO, the suspect lens is
>>
>>    let key_re = /[A-Za-z0-9]+/
>>          - /MACs|Match|AcceptEnv|Subsystem|(Allow|Deny)(Groups|Users)/
>>
>> The regex before and after the '-' are not identical ?? [ puzzled ]
>>
>> So, what do you mean by "the regular expressions that are used for
>> matching are identical" ?
>
> Oh .. what I meant was: I checked with gdb what is happening behind the
> scenes when the sshd lens is run on your example sshd_config, both
> running it with augtool and with your Perl example.
>
> In both cases, the regexp that is fed to re_match[1] is exactly the
> same, but the results of matching are different.
>
>> On my side, I've tinkered a lot with the regex on the right side and never
>> managed to have an effect. Even
>>
>>   let key_re = /[A-Za-z0-9]+/ - "Match"
>>
>> does not work. Hence the suspicion regarding the '-'
>
> It does not appear that the '-' is the problem. When you compute the
> regexp for the above, you get
>
>         /Match[0-9A-Za-z][0-9A-Za-z]*|Matc([0-9A-Za-gi-z][0-9A-Za-z]*|())|Mat([0-9A-Zabd-z][0-9A-Za-z]*|())|Ma([0-9A-Za-su-z][0-9A-Za-z]*|())|(M[0-9A-Zb-z]|[0-9A-LN-Za-z][0-9A-Za-z])[0-9A-Za-z]*|M|[0-9A-LN-Za-z]/
>
> which is correct.

Actually, that's the problem.
Such ranges are portable only in the C locale.
In other locales [A-Z] may match 51 characters, or a slightly different
set of 51, rather than the expected 26.  That's why [[:upper:]] came about.
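
As a sanity check, here is a small sketch in Python (my choice of
language for illustration, nothing Augeas itself uses).  Python's re
module interprets ranges by code point, which is roughly what the C
locale gives you, so it shows what the quoted expression is meant to do
when [A-Z] really means the 26 ASCII capitals.  The name accepted() is
made up for this example.

```python
import re

# The expression quoted above, copied verbatim and split at each
# top-level '|' for readability.
COMPUTED = re.compile(
    r"Match[0-9A-Za-z][0-9A-Za-z]*"
    r"|Matc([0-9A-Za-gi-z][0-9A-Za-z]*|())"
    r"|Mat([0-9A-Zabd-z][0-9A-Za-z]*|())"
    r"|Ma([0-9A-Za-su-z][0-9A-Za-z]*|())"
    r"|(M[0-9A-Zb-z]|[0-9A-LN-Za-z][0-9A-Za-z])[0-9A-Za-z]*"
    r"|M"
    r"|[0-9A-LN-Za-z]"
)


def accepted(word):
    """True if the whole word is matched by the computed expression."""
    return COMPUTED.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ("Match", "Matc", "Matches", "Port", "MACs"):
        print(word, accepted(word))
```

With code-point ranges, "Match" is the only word in that list that is
rejected.  Once a locale makes a range such as [b-z] or [a-s] also cover
other letters, that guarantee is gone.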

That's one ugly regexp.  Glad it's generated.
But it's too bad you have to deal with it at all (ie when debugging).
Is it too late to consider using more powerful regexps?
IMHO, the spec imposed by using POSIX extended regexps is
seriously limiting and has been passé for years.
I.e., it's hard to write readable regexps when you're
restricted to POSIX EREs, compared to those of Perl/Ruby and even Emacs.
Adding usable (short) class name abbreviations \d, \w, \s, \S, etc. alone
makes a huge difference in practice.  Not to mention things like the
non-greedy (shy) .*? modifier, and...

Back to your example: with Perl, for instance, /whatever(?!Match)/ would
match any occurrence of "whatever" that is not followed by "Match".

From "man perlre"

                 "(?!pattern)"
                     A zero-width negative look-ahead assertion.  For example
                     "/foo(?!bar)/" matches any occurrence of "foo" that isn't
                     followed by "bar".
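
To make that concrete for the sshd key case, here is a rough sketch in
Python, whose re module accepts the same (?!...) syntax as Perl.  This
only illustrates the idea; it is not something the Augeas lens language
accepts, and the names EXCLUDED, KEY_RE and is_plain_key are invented
for the example.

```python
import re

# Keywords that the sshd lens handles separately (taken from key_re above).
EXCLUDED = r"MACs|Match|AcceptEnv|Subsystem|(?:Allow|Deny)(?:Groups|Users)"

# "Any alphanumeric word, unless the whole word is one of the keywords."
# The look-ahead is anchored with \Z so that e.g. "Matches" is still allowed.
KEY_RE = re.compile(r"(?!(?:" + EXCLUDED + r")\Z)[A-Za-z0-9]+")


def is_plain_key(word):
    """True if word is an ordinary key, i.e. not one of the excluded ones."""
    return KEY_RE.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ("Port", "Match", "Matches", "AllowUsers"):
        print(word, is_plain_key(word))
```

The excluded keywords stay readable as a plain alternation instead of
being folded into one large generated expression.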

> Under
> a C locale, this does in fact not match 'Match', but under many other
> locales, e.g. en_US or de_DE or en_US.utf8, it does.

In en_US, the expansion of [A-Z] might include [AbBcCdD...zZ],
so that range doesn't do what you want.
It's for this reason that you see spelled-out ranges, e.g.,

  [abcdefghijklmnopqrstuvwxyz]

in applications (and all libraries!) that can't force the locale to C.
Applying that kludge would render your already ugly example totally
incomprehensible and unmaintainable.
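
If one did go the spelled-out route, the classes would at least be
generated rather than typed.  A minimal sketch in Python, assuming only
ASCII letters and digits are wanted (explicit_class is a made-up helper
name):

```python
import re
import string


def explicit_class(chars):
    """Build a bracket expression that lists every character explicitly,
    so no range has to be interpreted by the locale."""
    return "[" + "".join(re.escape(c) for c in chars) + "]"


UPPER = explicit_class(string.ascii_uppercase)
ALNUM = explicit_class(
    string.digits + string.ascii_uppercase + string.ascii_lowercase
)

if __name__ == "__main__":
    print(UPPER)
    print(ALNUM)
```

UPPER comes out as [ABCDEFGHIJKLMNOPQRSTUVWXYZ].  Every range in the
generated expression above would have to be expanded the same way,
which is what makes the result so unwieldy.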

...
>> > What have you tried to reproduce this on 32bit ? And with what LC_*/LANG
>> > vars ?
>>
>> Yes. 32 bits has *always* worked whatever LC_*/LANG I set (by default,
>> LANG is en_US with utf8). I can provide a more detailed report if you
>> want.
>
> I can reproduce these problems with augparse/augtool if I stick a
> 'setlocale(LC_ALL, "")' into their main, thus making them obey the LC_*
> env vars - why that would be architecture specific though is beyond me.

[going from memory...]
It's because gnulib detects a particular bug in glibc's 32-bit
regexp support and then uses the replacement.  But the replacement
doesn't have glibc's locale support.
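
On the setlocale point: a program that has honored the LC_* variables
can still pin collation to "C" around the calls that interpret ranges
and restore it afterwards.  A hedged sketch of that pattern in Python
follows; c_locale is a hypothetical helper, and Python's own re module
does not consult the locale for ranges, so this only matters when
calling into a C library that does.

```python
import locale
from contextlib import contextmanager


@contextmanager
def c_locale(category=locale.LC_COLLATE):
    """Temporarily switch one locale category to "C", then restore it."""
    saved = locale.setlocale(category)   # one argument: query only
    locale.setlocale(category, "C")
    try:
        yield
    finally:
        locale.setlocale(category, saved)


if __name__ == "__main__":
    with c_locale():
        # Code that relies on C-locale range semantics would run here.
        print(locale.setlocale(locale.LC_COLLATE))
```

Note that setlocale is process-wide and not thread-safe, so this is only
reasonable in a single-threaded tool.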