[Fedora-packaging] file-not-utf8 complaints
Toshio Kuratomi
a.badger at gmail.com
Sat May 31 23:09:25 UTC 2008
Patrice Dumas wrote:
> On Fri, May 30, 2008 at 06:56:33PM -0700, Toshio Kuratomi wrote:
>> Reencoding the xml files that specify an encoding isn't strictly
>> necessary. We should probably ask upstream whether they are amenable to
>
> I think that reencoding files that carry over the encoding information
> (info, texinfo, tex and xml for example) is wrong. It is better to let
> upstream do whatever they want. Same for examples of code, better leave
> the encoding preferred by upstream.
>
> For NEWS/Changelog, other text files in %doc and also man pages that are
> not installed in a non utf8 locale, I agree that converting to UTF-8 is
> better.
>
I'm in almost complete agreement with you. The one extra piece that I
think should be considered is how the text is normally viewed/edited.
For instance, if a program has a plain text data file and the program
expects the data to be encoded in utf-16, that file should stay utf-16.
Since the end user never views the file and the program has an
expectation of what's in it, this should be perfectly acceptable.

However, the flipside of this is a program with an xml config file
that the user is expected to edit manually in a text editor. If the
program will adapt to multiple encodings (for instance, by using libxml2
to parse the file[1]_), having the file exist in utf-8 is much better
than having it exist in SOME_EXOTIC_ENCODING. In this case the program
doesn't care whether the config file is in utf-8 or SHIFT-JIS, but
the user who opens the file in a text editor will be presented with
garbage if the text does not match the system default encoding. Yes,
the user can manually change the encoding that is displayed and saved in
some editors but:
1) Not every editor supports this.
2) The user has to learn how to select the encoding in their editor
for reading, editing, and saving. Some editors will display garbage
unless you set the correct encoding on startup; others can change while
running. Some convert on open with a best guess at what the bytes mean,
but you have to specify which encoding to save the result in, otherwise
you get the default (utf-8 or dependent on your locale settings).
3) If the user wants to use characters that are not present in the
encoding the file is written in (for instance, the file is encoded in
KOI8-R but the user wants to use kanji), they'll have to convert the
file to a unicode family of encodings and edit the header that tells the
character set to use before making their changes.
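
A rough sketch of what point 3 amounts to, in Python. It assumes an
ASCII-compatible source encoding so that the XML declaration can be read
before decoding; the function name `reencode_xml_to_utf8` and the sample
document are made up for illustration and not taken from any package.

```python
import re

# Finds the encoding name in an XML declaration at the very start of the file.
# Assumption: the source encoding is ASCII-compatible, so the declaration
# itself can be matched on the raw bytes.
DECLARED_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def reencode_xml_to_utf8(data: bytes) -> bytes:
    """Return data re-encoded as UTF-8 with the declaration updated to match."""
    match = DECLARED_ENCODING.match(data)
    if match is None:
        # No declared encoding found: leave the bytes untouched.
        return data
    declared = match.group(1).decode("ascii")
    # Raises LookupError if Python has no codec under the declared name.
    text = data.decode(declared)
    # The declaration is pure ASCII at the start of the file, so byte offsets
    # and character offsets coincide for the matched name.
    start, end = match.span(1)
    return (text[:start] + "UTF-8" + text[end:]).encode("utf-8")


sample = (
    '<?xml version="1.0" encoding="KOI8-R"?>\n'
    "<greeting>Привет</greeting>\n"
).encode("koi8-r")
converted = reencode_xml_to_utf8(sample)
```

Once the file is UTF-8 the user can add kanji next to the Cyrillic text;
trying to save kanji back as KOI8-R would fail because that encoding has
no such characters.
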

So really, the dividing line should be whether the user is intended to
edit/view the file directly instead of through a program that can change
the encoding appropriately, rather than whether the format specifies
the encoding.
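
To make that dividing line concrete, here is a small Python sketch of the
two views of the same bytes. It uses the standard library's ElementTree
rather than libxml2, and latin-1 rather than SHIFT-JIS, purely to keep
the example short; the sample config and function names are made up.

```python
import xml.etree.ElementTree as ET

# A hypothetical config file that upstream ships in latin-1.
config_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<owner>José</owner>\n"
).encode("iso-8859-1")


def program_view(data: bytes) -> str:
    """What an encoding-aware XML parser hands to the program."""
    return ET.fromstring(data).text


def editor_view(data: bytes, locale_encoding: str = "utf-8") -> str:
    """What an editor shows if it simply assumes the locale's encoding."""
    # errors="replace" stands in for the garbage an editor would display.
    return data.decode(locale_encoding, errors="replace")
```

The parser honours the declared encoding and returns the name intact,
while the naive UTF-8 view contains a replacement character where the
accented letter should be.
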
.. _[1]: http://xmlsoft.org/encoding.html#Default

Whether this is something we should do in our packages even if upstream
doesn't accept the changes involves other factors. In the case of
documentation files that declare no encoding we should convert whether
or not upstream agrees. In the case of documentation that does specify
the encoding I lean towards converting [2]_. In the case of a file that
is used by a program we should definitely have a conversation with
upstream about it, although we could convert locally with upstream's
blessing (i.e., upstream says: "I'm going to continue writing my xml
config file in latin-1. If you want to convert it to utf-8 for your
users that's fine -- I'm going to continue to use a library for xml
parsing that understands encodings.")

.. _[2]: Note that this is only for documentation which is supposed
to be viewed directly. xhtml, for instance, is normally going to be
viewed in a browser so this would not apply.
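
For the plain documentation case (files that declare no encoding), the
conversion step might look like the following Python sketch. The source
encoding cannot be detected reliably, so `assumed_encoding` is a guess
the packager has to supply after looking at the file; latin-1 is only a
placeholder here, and the function name is hypothetical.

```python
def doc_to_utf8(data: bytes, assumed_encoding: str = "iso-8859-1") -> bytes:
    """Return data as UTF-8, converting only if it is not valid UTF-8 already."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: reinterpret using the packager's assumed encoding.
        return data.decode(assumed_encoding).encode("utf-8")
    # Already valid UTF-8 (plain ASCII included): nothing to do.
    return data
```

Checking for valid UTF-8 first keeps the step safe to run again once
upstream switches to UTF-8 on their own.
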
-Toshio