[augeas-devel] Weird print result with Perl bindings on amd64
Jim Meyering
jim at meyering.net
Thu Jan 15 20:19:50 UTC 2009
David Lutterkort <lutter at redhat.com> wrote:
> On Thu, 2009-01-15 at 16:46 +0100, Jim Meyering wrote:
>> David Lutterkort <lutter at redhat.com> wrote:
>> > On Wed, 2009-01-14 at 19:04 +0100, Dominique Dumont wrote:
>> >> David Lutterkort <lutter at redhat.com> writes:
>> >> >> Actually, after much tinkering with sshd lens, I have the gut feeling
>> >> >> that the problem is in the key_re lens. Looks like the '-' operator
>> >> >> between the 2 regex is not working properly.
>> >> >
>> >> > At first, that was my suspicion, too, but the regular expressions that
>> >> > are used for matching are identical, and I can see in the debugger that
>> >> > the regex matcher produces different results.
>> >>
>> >> I'm not sure that I follow you.
>> >>
>> >> IMHO, the suspect lens is
>> >>
>> >> let key_re = /[A-Za-z0-9]+/
>> >> - /MACs|Match|AcceptEnv|Subsystem|(Allow|Deny)(Groups|Users)/
>> >>
>> >> The regex before and after the '-' are not identical ?? [ puzzled ]
>> >>
>> >> So, what do you mean by "the regular expressions that are used for
>> >> matching are identical" ?
>> >
>> > Oh .. what I meant was: I checked with gdb what is happening behind the
>> > scenes when the sshd lens is run on your example sshd_config, both
>> > running it with augtool and with your Perl example.
>> >
>> > In both cases, the regexp that is fed to re_match[1] is exactly the
>> > same, but the results of matching are different.
>> >
>> >> On my side, I've tinkered a lot with the regex on the right side and
>> >> never managed to have any effect. Even
>> >>
>> >> let key_re = /[A-Za-z0-9]+/ - "Match"
>> >>
>> >> does not work. Hence the suspicion regarding the '-'
>> >
>> > It does not appear that the '-' is the problem. When you compute the
>> > regexp for the above, you get
>> >
>> > /Match[0-9A-Za-z][0-9A-Za-z]*|Matc([0-9A-Za-gi-z][0-9A-Za-z]*|())|Mat([0-9A-Zabd-z][0-9A-Za-z]*|())|Ma([0-9A-Za-su-z][0-9A-Za-z]*|())|(M[0-9A-Zb-z]|[0-9A-LN-Za-z][0-9A-Za-z])[0-9A-Za-z]*|M|[0-9A-LN-Za-z]/
>> >
>> > which is correct.
>>
>> Actually, that's the problem.
>> Using such ranges is portable only in the C locale.
>> Sometimes [A-Z] contains 51, sometimes a slightly different set of 51.
>> Sometimes the expected 52. That's why [[:upper:]] came about.
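The locale dependence of a range such as [A-Z] can be probed with the POSIX regcomp/regexec interface. The following is only a minimal sketch: matches_in_locale is a hypothetical helper, not part of Augeas or gnulib, and it changes the process-wide locale temporarily, so it is suitable for a test program but not for library code.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if PAT (a POSIX ERE) matches somewhere
   in S while the process locale is LOC, 0 if it does not, and -1 if
   the locale cannot be set or the pattern does not compile.  */
int
matches_in_locale (const char *loc, const char *pat, const char *s)
{
  assert (loc && pat && s);

  /* Copy the current locale name: the string returned by setlocale
     may be overwritten by later calls.  */
  const char *cur = setlocale (LC_ALL, NULL);
  if (!cur)
    return -1;
  size_t len = strlen (cur) + 1;
  char *saved = malloc (len);
  if (!saved)
    return -1;
  memcpy (saved, cur, len);

  int result = -1;
  if (setlocale (LC_ALL, loc))
    {
      regex_t re;
      if (regcomp (&re, pat, REG_EXTENDED | REG_NOSUB) == 0)
        {
          result = regexec (&re, s, 0, NULL, 0) == 0;
          regfree (&re);
        }
    }

  /* Restore whatever locale was active before the call.  */
  setlocale (LC_ALL, saved);
  free (saved);
  return result;
}
```

In the C locale, matches_in_locale ("C", "^[A-Z]+$", "uvw") reports no match. What it reports for a locale such as en_US depends on the regex implementation and on which locales are installed, which is exactly the portability problem described above.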
>
> The problem is that Augeas should always be operating in the C locale,

i.e., Augeas wishes it could assume it is operating in the C locale ;-)
In other words, you'd like locale-agnostic/ignoring regexp code.

> no matter what the user has in their environment - the regexps are read
> from files that should mean exactly the same in any locale.
>
> AFAICT, there's no clean way for libaugeas to switch to C locale upon
> entry to one of its functions, and switch back to the user's locale on
> return, since setlocale changes the locale for the entire process, not
> just individual threads.

Right. Library code must not modify global (per-process,
thread-spanning) state.

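For comparison, POSIX.1-2008 specifies newlocale() and uselocale(), which change the locale of the calling thread only. The sketch below is an assumption-laden illustration, not something proposed in this thread: match_in_c_locale is a hypothetical helper, and whether a given platform provides these calls, and whether its regex implementation consults the per-thread locale, would have to be verified on each target.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <regex.h>

/* Hypothetical helper: match S against the POSIX ERE PAT while the
   calling thread (and only that thread) uses the "C" locale.
   Returns 1 on match, 0 on no match, -1 on error.  */
int
match_in_c_locale (const char *pat, const char *s)
{
  assert (pat && s);

  locale_t c_loc = newlocale (LC_ALL_MASK, "C", (locale_t) 0);
  if (c_loc == (locale_t) 0)
    return -1;

  /* uselocale returns the locale previously in effect for this thread.  */
  locale_t old = uselocale (c_loc);

  int result = -1;
  regex_t re;
  if (regcomp (&re, pat, REG_EXTENDED | REG_NOSUB) == 0)
    {
      result = regexec (&re, s, 0, NULL, 0) == 0;
      regfree (&re);
    }

  /* Put the thread back as it was and release the temporary locale.  */
  uselocale (old);
  freelocale (c_loc);
  return result;
}
```

Other threads keep the user's locale throughout, so no process-wide state is touched.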
>> That's one ugly regexp. Glad it's generated.
>> But it's too bad you have to deal with it at all (ie when debugging).
>> Is it too late to consider using more powerful regexps?
>> IMHO, the spec imposed by using POSIX extended regexps is
>> seriously limiting and has been passé for years.
>
> Choosing the limited syntax of POSIX ERE was quite deliberate, since
> Augeas needs to convert regexps to finite automata for the typechecker.
> Some of the extensions, especially in Perl regexps, take them out of the
> realm of regular languages, most notably back references (which are also
> in POSIX, but not supported by Augeas) and recursive matches.
>
>> I.e., it's hard to write readable regexps when you're
>> restricted to POSIX EREs, compared to those of Perl/Ruby and even Emacs.
>> Adding usable (short) class name abbreviations \d, \w, \s, \S, etc. alone
>> makes a huge difference in practice. Not to mention things like the
>> non-greedy (shy) .*? modifier, and...
>
> Some of those abbreviations would indeed be handy, but the Augeas
> language makes it possible to use these on a language level, i.e.
> instead of
>
> let re = /[A-Z]*|([a-z]+[0-9]*)/
>
> you could write
>
> let upper = /[A-Z]/
> let lower = /[a-z]/
> let digit = /[0-9]/
> let re = upper* | lower+ . digit*
>
> but either way, [A-Z] has to be interpreted in the C locale, not the
> user's current locale.
>
>> Back to your example,
>> i.e., with perl, /whatever(?!Match)/ would match any occurrence
>> of "whatever" that is not followed by "Match".
>>
>> From "man perlre"
>>
>> "(?!pattern)"
>> A zero-width negative look-ahead assertion. For example
>> "/foo(?!bar)/" matches any occurrence of "foo" that isn't
>> followed by "bar".
>
> This is one of the extensions that doesn't map very well to regular
> languages or finite automata. For Augeas, it's also not needed: since
> the regular expressions in Augeas must always match an entire string,
> i.e. they are implicitly embedded in a ^..$, there's no point for these
> assertions - you'd need to match something like 'foo' followed by
> something that is not bar (where the definition of 'something' depends
> on what you are using the regexp for)
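The whole-string matching described above can be imitated with plain POSIX regcomp by wrapping a pattern in ^(...)$. This is a minimal sketch for illustration only: full_match is a hypothetical helper, and Augeas itself does not necessarily implement the anchoring this way.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if the POSIX ERE PAT matches the
   whole of S, 0 if it does not, and -1 if the pattern is too long
   for the buffer or fails to compile.  */
int
full_match (const char *pat, const char *s)
{
  assert (pat && s);

  /* Embed the pattern in ^(...)$ so that a match must cover all of S.  */
  char anchored[512];
  int n = snprintf (anchored, sizeof anchored, "^(%s)$", pat);
  if (n < 0 || (size_t) n >= sizeof anchored)
    return -1;

  regex_t re;
  if (regcomp (&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  int ok = regexec (&re, s, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

With this helper, "foo" matches the string "foo" but not "foobar". To accept 'foo' followed by something, the 'something' has to be spelled out in the pattern, e.g. "foo[a-z]*".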
>
>> > Under
>> > a C locale, this does in fact not match 'Match', but under many other
>> > locales, e.g. en_US or de_DE or en_US.utf8, it does.
>>
>> In en_US, the expansion of [A-Z] might include [AbBcCdD...zZ],
>> so that range doesn't do what you want.
>> It's for this reason that you see spelled-out ranges, e.g.,
>>
>> [abcdefghijklmnopqrstuvwxyz]
>>
>> in applications (and all libraries!) that can't force the locale to C.
>> Applying that kludge would render your already ugly example totally
>> incomprehensible and unmaintainable.
>
> Yeah, ugly, but seeing how I have no way to switch temporarily to the C
> locale, I'll have to resort to that to make sure libaugeas always
> behaves as if it were using the C locale.
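Spelling out a range by hand is mechanical, so it could be generated. The following is a minimal sketch under stated assumptions: expand_ascii_range is a hypothetical name (not an Augeas function), the execution character set is assumed to be ASCII, and only ranges of letters and digits are considered, since characters such as ']', '^' and '-' need special placement inside a bracket expression.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write every character from LO to HI inclusive
   into OUT as a NUL-terminated string, e.g. 'a'..'e' gives "abcde".
   Returns the number of characters written, or 0 if OUT is too small.
   Assumes an ASCII execution character set.  */
size_t
expand_ascii_range (char lo, char hi, char *out, size_t outsz)
{
  assert (out != NULL);
  assert ((unsigned char) lo <= (unsigned char) hi);
  assert ((unsigned char) hi < 128);

  size_t n = (size_t) ((unsigned char) hi - (unsigned char) lo) + 1;
  if (outsz < n + 1)
    return 0;

  for (size_t i = 0; i < n; i++)
    out[i] = (char) ((unsigned char) lo + i);
  out[n] = '\0';
  return n;
}
```

The resulting string can be placed between '[' and ']' in the generated regexp, so that the set of characters no longer depends on the collation order of the user's locale.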
>
>> ...
>> >> > What have you tried to reproduce this on 32bit ? And with what LC_*/LANG
>> >> > vars ?
>> >>
>> >> Yes. 32 bits has *always* worked whatever LC_*/LANG I set (by default,
>> >> LANG is en_US with utf8). I can provide a more detailed report if you
>> >> want.
>> >
>> > I can reproduce these problems with augparse/augtool if I stick a
>> > 'setlocale(LC_ALL, "")' into their main, thus making them obey the LC_*
>> > env vars - why that would be architecture specific though is beyond me.
>>
>> [going from memory...]
>> It's because gnulib detects a particular bug in glibc's 32-bit
>> regexp support and then uses the replacement. But the replacement
>> doesn't have glibc's locale support.
>
> It's actually the other way around: on 64 bit systems with 32 bit ints,
> gnulib's regex is used; from my testing, it seems the regex

Read the code, which dredged up some old memories ;-)
From m4/regex.m4 (or configure --help):
[AS_HELP_STRING([--without-included-regex],
[don't compile regex; this is the default on 32-bit
systems with recent-enough versions of the GNU C
Library (use with caution on other systems).
On systems with 64-bit ptrdiff_t and 32-bit int,
--with-included-regex is the default, in case
regex functions operate on very long strings (>2GB)])])
> implementation in glibc does _not_ match 'uvw' with '[A-Z]+' in en_US,
> whereas gnulib's does.

How did you test? Here's what I did:
I had a coreutils build dir handy, so I used its regex.h and
already-built .a file:

/*
$ gcc -g -I$HOME/w/cu/lib -I. -W -Wall rege.c \
$HOME/w/cu/lib/libcoreutils.a
$ ./a.out
[Exit 3]
*/
#define _GNU_SOURCE 1
#include <locale.h>
#include <regex.h>
#include <string.h>

int
main (void)
{
  if (!setlocale (LC_ALL, "C"))
    return 1;
  re_set_syntax (RE_SYNTAX_POSIX_MINIMAL_EXTENDED & ~(RE_DOT_NEWLINE));
  static struct re_pattern_buffer regex;
  memset (&regex, 0, sizeof regex);
  const char *pat = "[A-Z]+";
  const char *s = re_compile_pattern (pat, strlen (pat), &regex);
  if (s)
    return 2;
  struct re_registers regs;
  regoff_t o = re_match (&regex, "uvw", 3, 0, &regs);
  if (o != 3)
    return 3;
  return 0;
}