how to convert text file with unknown 16 bit encoding to 8 bit ascii
Björn Persson
bjorn at xn--rombobjrn-67a.se
Thu Aug 14 01:41:21 UTC 2008
Lancashire, Pete wrote:
> how do I convert a file (or output to stdout) with an unknown 16-bit
> encoding into plain ol' ASCII aka 8 BIT ?
>
> Example of files contents
>
> 0 255 254
> 2 60 0
> 4 72 0
> 6 84 0
> 8 77 0
> 10 76 0
> 12 62 0
>
> or ..
>
> 0000000 377 376 < \0 H \0 T \0 M \0 L \0 > \0 \n \0
> 0000020 \0 \0 < \0 B \0 O \0 D \0 Y \0 > \0
> 0000040 \n
This looks like either UCS-2 or UTF-16. Fortunately you don't have to figure
out which of those it is, because any UCS-2 text is encoded identically in
UTF-16, so you can just say that it is UTF-16.
Conversely, UCS-2 can represent every character that ASCII can represent.
If the text is in UTF-16 and contains anything that can't be treated as
UCS-2 (that is, surrogate pairs), then it can't be converted to ASCII
anyway, so when converting to ASCII you can just as well treat it as UCS-2.
The first two bytes are a byte order mark that shows that the encoding is
little-endian. It's good that the byte order mark is there, but it must be
removed in order to convert to ASCII. (ASCII doesn't need byte order marks
anyway.)
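To illustrate how the byte order mark can be read, here is a minimal Python sketch. The function name split_bom is made up for this example, and the sample bytes are the first 16 bytes of the dump quoted above. It only looks for the two UTF-16 marks and assumes the data is not UTF-32.

```python
import codecs

# First 16 bytes of the dump in the question: 377 376 < \0 H \0 T \0 ...
sample = b"\xff\xfe<\x00H\x00T\x00M\x00L\x00>\x00\n\x00"

def split_bom(data: bytes):
    """Hypothetical helper: return (codec name, data without the mark)."""
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE, octal 377 376
        return "utf-16-le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF, octal 376 377
        return "utf-16-be", data[2:]
    # No mark: the byte order has to be known some other way.
    return None, data

codec, payload = split_bom(sample)
print(codec)                        # utf-16-le
print(repr(payload.decode(codec)))  # '<HTML>\n'
```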
If it's guaranteed that the text will always be representable in ASCII
(7-bit), then "iconv --from-code=UTF-16 --to-code=ASCII" should do the
conversion. Iconv seems to strip away the byte order mark automatically from
UTF-16 but not from UCS-2.
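If iconv is not at hand, roughly the same conversion can be sketched in Python. This is not what iconv does internally, only an equivalent under the assumption that the input starts with a byte order mark; the function name utf16_to_ascii is invented for the example. Python's "utf-16" codec consumes the mark and uses it to pick the byte order, and the strict ASCII encoder refuses anything outside 7-bit ASCII.

```python
def utf16_to_ascii(data: bytes) -> bytes:
    """Hypothetical helper: decode UTF-16 (with BOM) and re-encode as ASCII."""
    text = data.decode("utf-16")   # the byte order mark is consumed here
    return text.encode("ascii")    # raises UnicodeEncodeError on non-ASCII

# Bytes taken from the dump in the question.
sample = b"\xff\xfe<\x00H\x00T\x00M\x00L\x00>\x00\n\x00"
print(utf16_to_ascii(sample))      # b'<HTML>\n'
```

A file would be handled the same way by reading it in binary mode and passing the bytes to the function.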
If any non-ASCII characters may occur, then you probably want to convert to
UTF-8 instead. UTF-8 can represent all Unicode characters. If you know
exactly which characters can occur, then you may be able to find a suitable
8-bit encoding (preferably one from the ISO 8859 family). Either way, make
sure that the receiving program knows which encoding it is. Otherwise the
text will probably get garbled.
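As a rough illustration of the difference between the two choices, the following Python sketch converts the same UTF-16 input to UTF-8 and to ISO 8859-1. The function name convert_utf16 and the sample text are made up for the example; the codec names are the ones Python's standard library uses.

```python
import codecs

def convert_utf16(data: bytes, target: str = "utf-8") -> bytes:
    """Hypothetical helper: decode UTF-16 (with BOM) and encode as target."""
    return data.decode("utf-16").encode(target)

# Sample input: byte order mark followed by little-endian UTF-16 text.
data = codecs.BOM_UTF16_LE + "Björn".encode("utf-16-le")

print(convert_utf16(data, "utf-8"))       # b'Bj\xc3\xb6rn'
print(convert_utf16(data, "iso-8859-1"))  # b'Bj\xf6rn'
```

The two results differ in the byte(s) used for "ö", which is why the receiving program has to be told which encoding it is getting. Encoding to ISO 8859-1 raises UnicodeEncodeError for characters that encoding lacks, such as the euro sign, whereas UTF-8 accepts every Unicode character.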
Björn Persson