[augeas-devel] Weird print result with Perl bindings on amd64

David Lutterkort lutter at redhat.com
Thu Jan 15 23:07:31 UTC 2009


On Thu, 2009-01-15 at 21:19 +0100, Jim Meyering wrote:
> David Lutterkort <lutter at redhat.com> wrote:
> > On Thu, 2009-01-15 at 16:46 +0100, Jim Meyering wrote:
> >> 
> >> Actually, that's the problem.
> >> Using such ranges is portable only in the C locale.
> >> Sometimes [A-Z] contains 51, sometimes a slightly different set of 51.
> >> Sometimes the expected 52.  That's why [[:upper:]] came about.
> >
> > The problem is that Augeas should always be operating in the C locale,
> 
> i.e., Augeas wishes it could assume it is operating in the C locale ;-)
> In other words, you'd like locale-agnostic/ignoring regexp code.

To be pedantic, the problem is that the files Augeas handles (both lens
definitions and config files) are written in some fixed locale, and that
locale does not and must not change, no matter what locale the user has
set in their environment.

But yes, I'd want a way to explicitly pass the locale to the regexp
compiler/matcher, something like a re_compile_pattern_l (or the
uselocale call that Dan mentioned).

> > implementation in glibc does _not_ match 'uvw' with '[A-Z]+' in en_US,
> > whereas gnulib's does.
> 
> How did you test?  Here's what I did:

With a very similar test program (patch against Augeas' HEAD attached),
built once --with-internal-regex (to get gnulib's regex) and once
--without-internal-regex (to get glibc's regex). With gnulib's regex,
the behavior matches what we've discussed so far, i.e. 'def' matches
'[A-Z]+' in en_US, but not in the C locale.

With glibc's regex I get

        LD_DEBUG=bindings LC_ALL=en_US ./src/reloc '[A-Z]+' def 2>&1 | egrep 're_(match|syntax|compile)'
              9037:	binding file /lib64/libc.so.6 [0] to ./src/reloc [0]: normal symbol `re_syntax_options' [GLIBC_2.2.5]
              9037:	binding file ./src/reloc [0] to /lib64/libc.so.6 [0]: normal symbol `re_syntax_options' [GLIBC_2.2.5]
              9037:	binding file ./src/reloc [0] to /lib64/libc.so.6 [0]: normal symbol `re_compile_pattern' [GLIBC_2.2.5]
              9037:	binding file ./src/reloc [0] to /lib64/libc.so.6 [0]: normal symbol `re_match' [GLIBC_2.2.5]
        re_match: -1
        
i.e., with the glibc-2.9-3.x86_64 on my F10 machine, 'def' does not
match '[A-Z]+' in the en_US locale.

Playing some more with this, egrep agrees with glibc:

        >echo -e 'DEF\nDef\ndef' | LC_ALL=en_US egrep '^[A-Z]+$'
        DEF

Is there a good way to find out what a character range like 'A-Z'
includes in a given locale, besides trawling through locale definition
files?

> I had a coreutils build dir handy so used its regex.h and
> already-built .a file:
> 
> /*
>   $ gcc -g -I$HOME/w/cu/lib -I. -W -Wall rege.c \
> 	     $HOME/w/cu/lib/libcoreutils.a
>   $ ./a.out
>   [Exit 3]
> */
> # define _GNU_SOURCE 1
> #include <locale.h>
> #include <regex.h>
> #include <string.h>
> 
> int
> main (void)
> {
>   if (!setlocale (LC_ALL, "C"))
>     return 1;

Yes, no issue with the C locale ;) The differences only happen under
locales like en_US.

David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Test-program-to-play-with-locale-dependency-of-re_ma.patch
Type: text/x-patch
Size: 4203 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/augeas-devel/attachments/20090115/3937364c/attachment.bin>

