Character encoding

Björn Persson bjorn at xn--rombobjrn-67a.se
Sat Sep 6 23:56:07 UTC 2008


Adil Drissi wrote:
> I want to know what is the encoding type of a file. So I run this command:
> "file --mime index.php". The output is : index.php: text/html
>
> But this does not give any character encoding type.
>
> I would like to convert this file to UTF-8 but the command convmv cannot be
> run without specifying the type of the file with -f option I think.

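There is no general way to find out the character encoding of an arbitrary
piece of data. Some encodings are fairly easy to recognize (UTF-8, for
example, has a strict byte structure that other text rarely matches by
accident), but the numerous eight-bit encodings are difficult to tell apart,
because almost any byte sequence is valid in all of them. The character
encoding must always be specified somewhere if it isn't implicitly known.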
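
As a rough illustration of why guessing is unreliable, here is a minimal
Python sketch that tests which encodings can decode a byte string without
error. The function name candidate_encodings and the list of candidates are
my own choices for the example, not part of any tool. A successful decode only
means the bytes are valid in that encoding, not that it is the right one.

```python
def candidate_encodings(data, candidates=("ascii", "utf-8", "iso-8859-1", "cp1252")):
    """Return the candidates under which data decodes without error."""
    usable = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are impossible in this encoding
        usable.append(name)
    return usable


# "Bj\u00f6rn" is "Bjorn" with o-diaeresis, stored here as ISO 8859-1 bytes.
sample = "Bj\u00f6rn".encode("iso-8859-1")
print(candidate_encodings(sample))
```

For this sample, ASCII and UTF-8 are ruled out, but both eight-bit candidates
remain, and nothing in the bytes says which of them was intended.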

In some file systems it's possible to specify the character encoding of a file 
as an attribute, but I've never seen it used. HTML can contain a meta tag 
that specifies the encoding, like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the HTML file is served by an HTTP server, then the server can specify the 
encoding in the Content-Type header, and there are rules that define what the 
encoding is if the server doesn't specify it.

You could open the file in a browser that lets you choose the encoding, and 
try an encoding that you think it may be. Then proofread the text. If all the 
characters are right, then you guessed right, or close enough to work for 
that particular file. If not, try the next encoding.
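
The same proofreading can be done outside a browser. This sketch, which
assumes Python 3 and uses a hypothetical helper named preview, decodes the
same bytes under each encoding you want to try, so that you can compare the
results by eye.

```python
def preview(data, encodings):
    """Map each encoding name to the text obtained by decoding data with it."""
    # errors="replace" puts U+FFFD where a byte cannot be decoded instead of
    # raising, so every guess yields something to read.
    return {name: data.decode(name, errors="replace") for name in encodings}


# UTF-8 bytes for "Bj\u00f6rn", decoded under a right and a wrong guess.
raw = "Bj\u00f6rn".encode("utf-8")
results = preview(raw, ["utf-8", "iso-8859-1"])
```

Here the UTF-8 entry reads "Björn" while the ISO 8859-1 entry reads "BjÃ¶rn",
which is the kind of damage you are looking for when proofreading.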

> o is there a way to convert this file to UTF-8

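Once you know the current encoding, transcoding won't be a big problem: a tool
such as iconv converts the contents of a file from one encoding to another.
(convmv converts file names, not file contents.) If the encoding is specified
in the file, such as in a meta tag, then you'll have to change that
declaration too.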
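
As a sketch of both steps in Python, assuming the source encoding is already
known: the function name transcode_html is made up for this example, and the
regular expression is a deliberately crude way to update the first charset
declaration it finds, not a real HTML rewriter.

```python
import re


def transcode_html(data, source_encoding, target_encoding="utf-8"):
    """Decode data from source_encoding and re-encode it as target_encoding."""
    # Raises UnicodeDecodeError if the bytes are impossible in source_encoding.
    text = data.decode(source_encoding)
    # Point the first charset declaration (if any) at the new encoding.
    text = re.sub(
        r"(?i)(charset\s*=\s*[\"']?)[\w.:-]+",
        lambda m: m.group(1) + target_encoding,
        text,
        count=1,
    )
    return text.encode(target_encoding)


old = (
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
    "<p>Bj\u00f6rn</p>"
).encode("iso-8859-1")
new = transcode_html(old, "iso-8859-1")
```

Decoding with the wrong source encoding will often succeed anyway and produce
garbage, so check the result before overwriting the original file.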

> or better how to set the default character encoding to utf-8?

Default in what context? The locale settings in the environment (the LANG and
LC_* variables) include a character encoding. Many programs assume that text
files and filenames are encoded in that encoding, but some programs think
they're smarter and assume something else. (Relying on environment variables
will of course fail if different users use different locales and access the
same files.)
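
To see what one program derives from those settings, this short Python sketch
reports the encodings the interpreter picked up from the environment. The
function name environment_encodings is my own; other programs may well make
different assumptions from the same locale.

```python
import locale
import sys


def environment_encodings():
    """Report the encodings Python derives from the current environment."""
    return {
        # Encoding implied by the user's locale settings.
        "locale": locale.getpreferredencoding(False),
        # Encoding Python uses to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of standard output; may be None if stdout is not a text stream.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, encoding in environment_encodings().items():
    print(name, encoding)
```

Running it under two different locale settings shows how the assumed encoding
changes while the bytes in the files stay the same.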

Björn Persson

