[Bug 166478] glibc or perl incorrect locale LC_CTYPE data
bugzilla at redhat.com
Wed Nov 2 16:45:39 UTC 2005
Please do not reply directly to this email. All additional
comments should be made in the comments box of this bug report.
Summary: glibc or perl incorrect locale LC_CTYPE data
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166478
jvdias at redhat.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |CLOSED
Resolution| |NOTABUG
------- Additional Comments From jvdias at redhat.com 2005-11-02 11:45 EST -------
Sorry, I submitted my previous comment before finishing it, and
then my machine rebooted (that's another story).
As I was saying in comment #2:
This version of your program shows the issue:
---
#!/usr/bin/perl -w -C
use strict;
use utf8;
use locale;
use Encode qw(decode);
my $str = decode('utf-8', "\xc3\x81\xc4\x8c");
# U+00C1 "A with acute", U+010C "C with caron" (encoded in UTF-8)
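print 'Is UTF-8:', utf8::is_utf8($str),
      ' is word:', $str =~ /^\w+$/,
      # POSIX class check; this is the "is posix word" field in the output below
      ' is posix word: ', $str =~ /^[[:word:]]+$/,
      ' is UTF-8 word: ', $str =~ /^\p{IsWord}+$/,
      ' str:', $str, "\n";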
---
With the "en_US.UTF-8" locale in effect (the default on Red Hat systems),
this prints:
$ ./test.pl
Is UTF-8:1 is word: is posix word: 1 is UTF-8 word: 1 str:ÁČ
The point is that \p{IsWord} and [[:word:]] match UTF-8 word characters,
while \w and \W do not.
As the perlre man-page states:
"
    The following equivalences to Unicode \p{} constructs and equivalent
    backslash character classes (if available), will hold:

        [:...:]     \p{...}     backslash
        ...
        word        IsWord
        ...
"
i.e. the [:word:] / \p{IsWord} classes are NOT equivalent to \w.
As I said, I don't particularly agree with the way the upstream perl
developers have done this, but this is intended behaviour.
RE: your comment #3:
> Your statement that "\w matches any ASCII word char" is not true.
> See perlre(1):
> [...] If "use locale" is in effect, the list of alphabetic characters
> generated by "\w" is taken from the current locale.
Yes, that's alphabetic characters, not Unicode sequences.
To match Unicode sequences in the word class, you must use \p{IsWord} or
[:word:].
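
For illustration only, here is a minimal sketch of that advice. It leaves
"use locale" out on purpose, so it does not reproduce the \w behaviour
discussed above; it only shows the two classes matching a decoded string.
The helper names is_unicode_word and is_posix_word are made up for this
example and do not come from the original test program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# U+00C1 and U+010C as raw UTF-8 bytes (4 bytes in total).
my $bytes     = "\xc3\x81\xc4\x8c";
my $byte_len  = length($bytes);

# Decode the bytes into a Perl character string (2 characters).
my $chars     = decode('UTF-8', $bytes);
my $char_len  = length($chars);

# Hypothetical helper: true if every character has the Unicode Word property.
sub is_unicode_word {
    my ($s) = @_;
    return $s =~ /^\p{IsWord}+$/ ? 1 : 0;
}

# Hypothetical helper: the same check using the POSIX-style class.
sub is_posix_word {
    my ($s) = @_;
    return $s =~ /^[[:word:]]+$/ ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)');
printf "bytes=%d chars=%d\n", $byte_len, $char_len;
printf "\\p{IsWord}: %d  [[:word:]]: %d  str: %s\n",
    is_unicode_word($chars), is_posix_word($chars), $chars;
```

Both helpers work on the decoded character string, not on the raw bytes,
which is why the decode() step comes first.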
> So the question is why Perl (or libc) in FreeBSD does consider U+00C1 to be a
> character under the UTF-8 locale, while the same perl with glibc on Linux
> doesn't.
Possibly because the default locale for Red Hat systems is UTF-8 enabled?
--
Configure bugmail: https://bugzilla.redhat.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.