how to convert text file with unknown 16 bit encoding to 8 bit ascii

Björn Persson bjorn at xn--rombobjrn-67a.se
Thu Aug 14 01:41:21 UTC 2008


Lancashire, Pete wrote:
> how do I convert a file (or output to stdout) with an unknown 16 bit encoding
> into plain ol' ASCII aka 8 BIT ?
>
> Example of files contents
>
>  0 255 254
>  2  60   0
>  4  72   0
>  6  84   0
>  8  77   0
> 10  76   0
> 12  62   0
>
> or ..
>
> 0000000 377 376   <  \0   H  \0   T  \0   M  \0   L  \0   >  \0  \n  \0
> 0000020      \0      \0   <  \0   B  \0   O  \0   D  \0   Y  \0   >  \0
> 0000040  \n

This looks like either UCS-2 or UTF-16. Fortunately you don't have to figure 
out which of those it is, because any UCS-2 text is encoded identically in 
UTF-16, so you can just say that it is UTF-16.

On the other hand, UCS-2 can represent all characters that ASCII can 
represent. If the text is in UTF-16 and contains anything that can't be 
treated as UCS-2, then it can't be converted to ASCII, so when converting to 
ASCII you can just as well treat it as UCS-2.
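To illustrate the point about UCS-2 and UTF-16, here is a small Python sketch (Python is my choice for illustration, and the helper name utf16_code_units is made up). It lists the 16-bit code units of a string. An ASCII character is a single unit below 128, while a character outside the range UCS-2 covers needs two units (a surrogate pair) and so can never end up as ASCII.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text as a list of integers."""
    raw = text.encode("utf-16-le")  # explicit byte order, so no BOM is added
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def fits_in_ascii(text):
    """True if every UTF-16 code unit is a 7-bit ASCII value."""
    return all(unit < 0x80 for unit in utf16_code_units(text))


print(utf16_code_units("A"))           # [65]: one unit, same in UCS-2 and UTF-16
print(utf16_code_units("\U0001D11E"))  # [55348, 56606]: a surrogate pair, UTF-16 only
print(fits_in_ascii("<HTML>"))         # True
print(fits_in_ascii("\U0001D11E"))     # False
```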

The first two bytes (255 254 in decimal, 377 376 in octal, FF FE in hex) are 
a byte order mark that shows that the encoding is little-endian. It's good 
that the byte order mark is there, but it must be removed in order to convert 
to ASCII. (ASCII doesn't need byte order marks anyway.)
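As a rough sketch of how the byte order mark can be checked, the following Python snippet looks at the first two bytes. The function name detect_utf16_byte_order is hypothetical, and the sample bytes are the ones from the decimal dump in the question.

```python
import codecs


def detect_utf16_byte_order(data):
    """Return 'little', 'big' or 'unknown' depending on the leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    return "unknown"


# Offsets 0 to 13 of the dump in the question.
sample = bytes([255, 254, 60, 0, 72, 0, 84, 0, 77, 0, 76, 0, 62, 0])

print(detect_utf16_byte_order(sample))  # little
print(sample[2:].decode("utf-16-le"))   # <HTML>
```

Without a byte order mark there is nothing to detect, so the sketch reports "unknown" rather than guessing.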

If it's guaranteed that the text will always be representable in ASCII 
(7-bit), then "iconv --from-code=UTF-16 --to-code=ASCII" should do the 
conversion. Iconv seems to strip away the byte order mark automatically from 
UTF-16 but not from UCS-2.
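For comparison, the same conversion can be sketched in Python. This is not what iconv does internally, only an analogous example; the function name utf16_to_ascii is made up. Python's "utf-16" codec reads the byte order mark and drops it, while the explicit "utf-16-le" codec leaves it in the text as U+FEFF.

```python
def utf16_to_ascii(data):
    """Decode UTF-16 bytes (with BOM) and re-encode them as ASCII.

    Raises UnicodeEncodeError if the text contains a non-ASCII character.
    """
    text = data.decode("utf-16")  # uses the BOM for byte order and removes it
    return text.encode("ascii")


# Offsets 0 to 13 of the dump in the question.
sample = bytes([255, 254, 60, 0, 72, 0, 84, 0, 77, 0, 76, 0, 62, 0])

print(utf16_to_ascii(sample))             # b'<HTML>'
print(repr(sample.decode("utf-16-le")))   # '\ufeff<HTML>' (BOM still there)

try:
    utf16_to_ascii("Bj\u00f6rn".encode("utf-16"))
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err.reason)
```

Failing loudly on a non-ASCII character is deliberate here, since silently dropping or replacing characters would hide the fact that the input was not plain ASCII after all.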

If any non-ASCII characters may occur, then you probably want to convert to 
UTF-8 instead. UTF-8 can represent all Unicode characters. If you know 
exactly which characters can occur, then you may be able to find a suitable 
8-bit encoding (preferably one from the ISO 8859 family). Either way, make 
sure that the receiving program knows which encoding it is. Otherwise the 
text will probably get garbled.
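A short Python sketch of the two alternatives and of the garbling, assuming the input is UTF-16 with a byte order mark. The function name recode_utf16 is hypothetical and my own name is only used as test data.

```python
def recode_utf16(data, target):
    """Decode UTF-16 bytes (with BOM) and re-encode them in the target encoding."""
    return data.decode("utf-16").encode(target)


source = "Bj\u00f6rn".encode("utf-16")      # UTF-16 with BOM, as in the question

utf8 = recode_utf16(source, "utf-8")        # two bytes for the o-umlaut
latin1 = recode_utf16(source, "iso-8859-1") # one byte for the o-umlaut

print(utf8)    # b'Bj\xc3\xb6rn'
print(latin1)  # b'Bj\xf6rn'

# What a receiver sees if it assumes ISO 8859-1 but actually gets UTF-8:
garbled = utf8.decode("iso-8859-1")
print(garbled == "Bj\u00c3\u00b6rn")  # True: the o-umlaut turned into two characters
```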

Björn Persson