Change encoding to UTF-8 [DENIED]

Post by **Elberet** » Tue Sep 30, 2003 5:38 pm

exp wrote:The big problem here is that UTF8 would have to be supported by every anidb related software.

Nope. UTF-8 doesn't need special support. If you treat a UTF-8 encoded text as a string of single-byte chars and display it that way, you get good ol' ASCII text with funny characters here and there where UTF-8 encoded ideograms or unicode characters are meant to appear - the string can still be treated as an ordinary array of chars, tho. Without UTF-8 support in the application, the user would be unable to search the AniDB for titles that contain unicode characters, but that's about it.

An application that does support UTF-8 encoded text would parse the text into a string of double-byte unicode characters and display them as such.

Using UTF-8 doesn't hurt compatibility or features. If an application, be it the CGI, Chii or some client program, doesn't support the special encoding, anime titles using kanji will look weird, but the application is still fully usable.
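
A rough illustration of the point above, as a minimal Python sketch. The sample title and the variable names are made up for the example, and Latin-1 stands in for "treat every byte as one char"; a real legacy application may use a different single-byte codepage.

```python
# Hypothetical title mixing katakana and plain ASCII.
title = "テスト 123"

# What would be stored: the UTF-8 byte sequence.
raw = title.encode("utf-8")

# A UTF-8 unaware application reads one char per byte (Latin-1 is used here
# as a stand-in for any single-byte view of the data).
naive_view = raw.decode("latin-1")

# A UTF-8 aware application decodes the same bytes back to the original text.
aware_view = raw.decode("utf-8")

# The ASCII part survives untouched in both views; only the non-ASCII
# characters turn into "funny characters" in the naive view.
print(repr(naive_view))
print(repr(aware_view))
```

Every byte of a non-ASCII character in UTF-8 has its high bit set, so such bytes can never be mistaken for ASCII. That also means a strictly 7-bit application will still see values above 127.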

Post by **exp** » Tue Sep 30, 2003 6:07 pm

Elberet wrote:Using UTF-8 doesn't hurt compatibility or features. If an application, be it the CGI, Chii or some client program, doesn't support the special encoding, anime titles using kanji will look weird, but the application is still fully usable.

how i wish that were true, but unfortunately it isn't.
some of the applications are not safe for some of the upper ascii chars (>127), it seems. this leads to crashes and weird errors on some unicode chars.
the main problem is the different handling of non-plain-ascii chars by the perl and java postgres drivers. The java side works with utf8 if the db is set to utf8 but breaks if the db is set to SQL_ASCII; the perl side works if the db is set to SQL_ASCII but breaks on a utf8 db.

BYe!
EXP

Post by **Elberet** » Tue Sep 30, 2003 7:30 pm

Then use UTF-7 instead. The database layer will see nothing but good ol' 7-bit ASCII text, so all applications, regardless of whether they're Java, C++ MFC, C# .NET or Perl continue to work. The result would be a little ugly, tho, unless the individual applications know how to handle the UTF-7 encoded unicode. For example:

www.faqs.org/rfcs/rfc2152.html wrote:The Unicode sequence representing the Han characters for the Japanese word "nihongo" (hexadecimal 65E5,672C,8A9E) may be encoded as follows:
+ZeVnLIqe-

Edit: Whoops, I missed something very important here: in UTF-7, '+' turns into an escape character, which means that the simple solution of telling web browsers to use UTF-7 character coding doesn't work. (For example, "20/20+1" would be displayed as "20/201" since the + supposedly starts a UTF-7 encoded passage.)
Instead, the CGI would have to convert the UTF-7 read from Postgres into UTF-8 or some other unicode character coding that doesn't use ordinary ASCII chars as escape characters...
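
To make the '+' issue concrete, here is a small sketch using Python's built-in "utf-7" codec. The helper name `utf7_roundtrip` is invented for the example; the RFC sample bytes are the ones quoted above.

```python
def utf7_roundtrip(text):
    """Encode text as UTF-7 and decode it again (hypothetical helper)."""
    encoded = text.encode("utf-7")
    return encoded, encoded.decode("utf-7")

# The RFC example: "nihongo" written with three Han characters.
nihongo = "\u65e5\u672c\u8a9e"
decoded_rfc_sample = b"+ZeVnLIqe-".decode("utf-7")

# A literal '+' must be written as "+-" so it is not taken as a shift character.
decoded_plus = b"20/20+-1".decode("utf-7")

# Whatever the encoder emits for a string containing '+', it stays 7-bit
# and decodes back to the original text.
encoded, restored = utf7_roundtrip("20/20+1")
print(decoded_rfc_sample, decoded_plus, encoded, restored)
```

So text that goes through a proper UTF-7 encoder is safe; the trouble only starts when existing plain-ASCII data containing '+' is suddenly reinterpreted as UTF-7.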

Post by **zaufany** » Tue Sep 30, 2003 8:09 pm

The only standard that is supported by every application is ASCII. For example: I can't use ² in filenames.
My OS is Windows XP PL.

Post by **rowaasr13** » Tue Sep 30, 2003 8:33 pm

exp wrote:the main problem is the different handling of non-plain-ascii chars by the perl and java postgres drivers. The java side works with utf8 if the db is set to utf8 but breaks if the db is set to SQL_ASCII; the perl side works if the db is set to SQL_ASCII but breaks on a utf8 db

Well then, just make another field for non-ASCII title data, as was suggested on the first page of this thread, and store the UTF-8 as binary data. This may make implementing search a bit harder - but you can think about that later, while allowing such data to be added for now.

Post by **rowaasr13** » Tue Sep 30, 2003 8:37 pm

zaufany wrote:The only standard that is supported by every application is ASCII.

And that should be changed. I'm sick and tired of patching even the open source programs I use, just to find that my changes broke some fundamental ASCII-only assumption somewhere deep in the code.

zaufany wrote:For example: I can't use ² in filenames.
My OS is Windows XP PL.

Are you really sure? Did you try to type ² in a file name in any MS program? Just editing the name of a file on the desktop, for example? I just tried it on 1. WinXP Home Russian, 2. another WinXP Home Russian set to Japanese locale and 3. WinXP Pro En, with no problem on any of those.

Post by **rowaasr13** » Tue Sep 30, 2003 8:49 pm

Elberet wrote:Edit: Whoops, I missed something very important here: in UTF-7, '+' turns into an escape character, which means that the simple solution of telling web browsers to use UTF-7 character coding doesn't work. (For example, "20/20+1" would be displayed as "20/201" since the + supposedly starts a UTF-7 encoded passage.)

Hmm, converting every + in the html template to +- isn't too hard...

Alternatively, all UTF7 data can be displayed in <inline>, but that's a really weird idea. Forget I said that.

Post by **zaufany** » Tue Sep 30, 2003 9:20 pm

I "can" name a file using ² in my Windows XP, but:
1: I can't make an ed2k link with ² in filename.
2: My eMule doesn't see files with ² in names.

Post by **exp** » Tue Sep 30, 2003 10:21 pm

Well,

it is possible to add non-ascii characters, you'll just have to use their respective html encoding. which also means that it won't be possible to search for them.

BYe!
EXP

Post by **rowaasr13** » Tue Sep 30, 2003 10:42 pm

exp wrote:it is possible to add non-ascii characters, you'll just have to use their respective html encoding. which also means that it won't be possible to search for them.

Sounds reasonable, considering that you can convert those to anything you want later. However, you should document it somewhere (the alias adding page?), so everyone uses one method.
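
As a hedged illustration of "convert those to anything you want later": numeric HTML character references can be turned back into real unicode text mechanically. A minimal Python sketch using the standard `html` module; the stored value is a made-up example, not actual AniDB data.

```python
import html

# Hypothetical stored title, with non-ASCII characters written as
# decimal numeric character references.
stored = "123 &#12486;&#12473;&#12488; 456"

# html.unescape resolves numeric (and named) references to characters.
restored = html.unescape(stored)

# From here the text can be re-encoded into whatever the database ends up using.
as_utf8 = restored.encode("utf-8")
print(restored, as_utf8)
```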

Post by **exp** » Wed Oct 01, 2003 8:24 am

hm,

is there a good list/documentation on how to do that?
or better yet a program which takes unicode chars as input (from keyboard/clipboard) and returns their html encoded form?

BYe!
EXP

Post by **Elberet** » Wed Oct 01, 2003 12:21 pm

You find the unicode character's id number. If you have a UTF-8 or UTF-7 encoded string, you first have to convert that to a binary form (i.e. UCS-2) to obtain the unicode char's value. Once you have the id number, the HTML entity is &#xN; where N is the unicode character's id number in hex, or &#N; for decimal.

Getting the number from UCS-2 is simple arithmetic. In C, take the first (high) byte, cast to int, leftshift by 8, get the second byte, cast to int and add both. In Perl, you'd use the ord() function to get the char's byte value.
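
The steps above can be sketched in Python roughly as follows. The function names `to_html_entities` and `ucs2_be_value` are invented for this example, and the sketch leaves ASCII characters (including a literal &) untouched, which may or may not be what the CGI wants.

```python
def ucs2_be_value(high, low):
    """Combine two big-endian UCS-2 bytes into the character's id number."""
    return (high << 8) + low

def to_html_entities(text, hex_form=False):
    """Replace every non-ASCII character with &#N; or &#xN; (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)  # the unicode character's id number
        if cp < 128:
            out.append(ch)  # plain ASCII stays as it is
        elif hex_form:
            out.append(f"&#x{cp:X};")
        else:
            out.append(f"&#{cp};")
    return "".join(out)

# Starting from UTF-8 bytes: decode first, then convert.
raw = "123 テスト 456".encode("utf-8")
print(to_html_entities(raw.decode("utf-8")))
print(to_html_entities("日本語", hex_form=True))
print(hex(ucs2_be_value(0x65, 0xE5)))
```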

Post by **rowaasr13** » Wed Oct 01, 2003 8:56 pm

exp wrote:or better yet a program which takes unicode chars as input (from keyboard/clipboard) and returns their html encoded form?

A simple Perl script can take care of that. Just put it somewhere on the server and link to it from the tools section or from every page where one might need such a conversion. Here - I just completed it - try http://www.lglobus.ru/~rover/test/utf8tohtml.pl

Just to test QUERY_STRING dequoting and all conversions, I gave it 123 テスト 456 === 789 &&& ABC +++ DEF and received back the string that you can see in this post's source, acting as the text for this URL.

Script source: http://www.lglobus.ru/~rover/test/utf8tohtml.rar
Feel free to remove any lines with $DEBUG.

Post by **rowaasr13** » Mon Oct 20, 2003 1:24 am

There's one more problem, exp. I tried adding an alias for Labyrinth of Flames in &#number; form. Since it is longer than usual, the normal creq complains "name too long", so I had to go to the DB change request board. Can you do something about this? Extend the allowed name length or treat &#number; as one char?
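
One way the "treat &#number; as one char" idea might look, as a Python sketch only. The regex and the name `display_length` are assumptions for illustration; nothing here reflects how the actual creq length check is implemented.

```python
import re

# Matches decimal (&#12486;) and hex (&#x30C6;) numeric character references.
ENTITY = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")

def display_length(name):
    """Length of a name when each numeric reference counts as one char."""
    # Collapse every reference to a single placeholder, then measure.
    return len(ENTITY.sub("x", name))

alias = "&#12486;&#12473;&#12488;"
print(len(alias), display_length(alias))
```

A limit checked with `display_length` would accept the same number of visible characters regardless of how many digits each reference needs.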

Post by **exp** » Sun Mar 14, 2004 9:20 pm

nice,

I will use that script.
I've already modified and stored a version on the anidb server, however i'll need to wait for gmni to install the unicode perl module.
so right now i am linking to the url above, if you do not wish that please drop me a line.

THX!

BYe!
EXP