Change encoding to UTF-8 [DENIED]

rowaasr13 · Post by **rowaasr13** » Sat Sep 27, 2003 6:08 am

It would be good if the encoding were changed to UTF-8. That would allow adding Japanese names as synonyms, as well as other official synonyms in other languages, like "ΑΠΗΕΝΤΟ ΣΟΜΑ" for Argento Soma or "Пламенный лабиринт" for Labyrinth Of Flames. (BTW, the last one already has that synonym, but because of the encoding mess you can only see it after manually passing it through several recodes.)

Skywalka · Post by **Skywalka** » Sat Sep 27, 2003 6:17 am

I guess as long as fewer than roughly 1% of the users who enter filenames understand Kanji at all, this request is kind of moot. With so much information still missing for files (hashes, aspect ratios, groups, sources for the RAWs, etc.), I don't think there should be an option to add something that, once entered, can't be checked by more than one other person from that 1%.

It would be like those people running around with Kanji on their T-shirts or tattoos who were told "This means this and that" when it actually means something totally different. In the end we could end up with insults and other similar stuff in the database and nobody would notice.

Not that the Japanese don't run around with silly stuff in roman letters on their shirts, but I guess that is why a Japanese AniDB ... forget that, anime titles are often silly enough that they wouldn't even notice, I guess ^_^

Elberet · Post by **Elberet** » Sat Sep 27, 2003 7:52 am

While the number of users who'll find Kanji titles interesting is certainly quite small, this feature is IMO too easy not to implement.

Either in the HTML template:

Code:

<head>
    ...
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>

Or in Perl:

Code:

print "Content-type: text/html; charset=UTF-8\n\n";

The database itself doesn't have to worry about UTF-8 encoded titles, since it can store and return the individual octets as opaque 8-bit text. However, adding the charset directive only solves the display issue; I don't know if it's equally simple to e.g. search the database for UTF-8 encoded titles. (Shouldn't be a problem tho, if browsers submit form values as UTF-8.)
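A minimal sketch, in Python rather than the site's Perl, of why the declared charset matters for display: the same stored octets read back correctly as UTF-8 and come out scrambled under another charset. Using windows-1251 as the "wrong" charset is only an assumption for illustration.

```python
# The title from the first post, as a Unicode string.
title = "Пламенный лабиринт"

# What a byte-oriented database column would hold if the title is stored as UTF-8.
stored = title.encode("utf-8")

# Each Cyrillic letter becomes two octets, both >= 0x80 (so not ASCII);
# only the space stays a single ASCII octet.
non_ascii_octets = [b for b in stored if b >= 0x80]

# Decoding with the charset the page declares round-trips cleanly...
as_utf8 = stored.decode("utf-8")

# ...while reading the same octets as a single-byte charset scrambles them.
as_cp1251 = stored.decode("cp1251", errors="replace")

print(len(title), len(stored), len(non_ascii_octets))
print(as_utf8)
print(as_cp1251)
```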

rowaasr13 · Post by **rowaasr13** » Sat Sep 27, 2003 8:00 am

Elberet wrote:I don't know if it's equally simple to e.g. search the database for UTF-8 encoded titles. (Shouldn't be a problem tho, if browsers submit form values as UTF-8.)

IE submits everything in the currently selected encoding. Just look at my first post: it was typed after selecting UTF-8 in IE. If I hadn't done that, it would have been badly scrambled.
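A rough illustration of that behaviour, assuming the form value is percent-encoded in whatever encoding the browser currently has selected. The two encodings compared here (UTF-8 and windows-1251) are picked as examples; the server side is hypothetical and simply always decodes as UTF-8.

```python
from urllib.parse import quote_plus, unquote_plus

title = "Пламенный лабиринт"

# The form value on the wire when the selected page encoding is UTF-8...
sent_utf8 = quote_plus(title, encoding="utf-8")
# ...and when the selected encoding is windows-1251 instead.
sent_cp1251 = quote_plus(title, encoding="cp1251")

# A server that assumes UTF-8 recovers the title only in the first case.
ok = unquote_plus(sent_utf8, encoding="utf-8")
scrambled = unquote_plus(sent_cp1251, encoding="utf-8", errors="replace")

print(sent_utf8)
print(sent_cp1251)
print(ok == title, scrambled == title)
```

So a search form only works reliably if the page encoding and the decoding on the server agree.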

wahaha · Post by **wahaha** » Sat Sep 27, 2003 9:55 am

rowaasr13 wrote:like "ΑΠΗΕΝΤΟ ΣΟΜΑ" for Argento Soma or "Пламенный лабиринт" for Labyrinth Of Flames.

(Should be Unicode entities, thus encoding-independent.)
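A small sketch of what "Unicode entities" could look like in practice, assuming HTML numeric character references are meant. It uses only Python's standard library; how the site would actually generate them is not stated in the thread.

```python
import html

title = "Пламенный лабиринт"

# Replace every non-ASCII character with a numeric character reference (&#NNNN;).
entities = title.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# The result is pure ASCII, so it survives any ASCII-compatible page encoding.
is_ascii = entities.isascii()

# A browser (html.unescape stands in for it here) turns the references back into characters.
restored = html.unescape(entities)

print(entities)
print(is_ascii, restored == title)
```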

Interesting idea though

I think that info would fit best into a special field, like "Original title", instead of being "just another synonym".

For the search issue (and Chii's "!stitle" as well), it might be helpful to explicitly include a "romanized" version of the original title.

Post by **exp** » Sat Sep 27, 2003 11:02 am

*DENIED*

As we're having serious problems with encodings on the AniDB client side, AniDB will even switch to _plain_ ASCII (meaning chars 1-127) soon.
All existing entries will be converted; unknown chars will be replaced with "_". Some special handling is done for common non-ASCII chars like äöüßéáèà...
So better not start adding any non-ASCII titles.

BYe!
EXP
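The conversion described in the post above could be sketched roughly like this. The transliteration table is hypothetical (the thread only lists the characters, not what they map to), and the function name is made up for illustration.

```python
import unicodedata

# Hypothetical mapping for the "common" characters; the real table is not given.
SPECIAL = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "é": "e", "á": "a", "è": "e", "à": "a",
}

def to_plain_ascii(title: str) -> str:
    """Map known characters, keep chars 1-127, replace everything else with '_'."""
    # Normalize first so that letter + combining accent matches the table keys.
    title = unicodedata.normalize("NFC", title)
    out = []
    for ch in title:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
        elif 1 <= ord(ch) <= 127:
            out.append(ch)
        else:
            out.append("_")
    return "".join(out)

print(to_plain_ascii("Größe"))
print(to_plain_ascii("Пламенный лабиринт"))
```

Note that the conversion is lossy: a Cyrillic or Kanji title collapses into a run of underscores and cannot be recovered afterwards.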

Post by **PetriW** » Sat Sep 27, 2003 1:35 pm

utf-8 is good!

rowaasr13 · Post by **rowaasr13** » Sat Sep 27, 2003 5:07 pm

What kind of problems do you have exactly? Considering that UTF-8 fully preserves ASCII (chars <128), there shouldn't be anything serious. Most OSes I know have either built-in or widely available libraries for UTF-8 based output, so that shouldn't be an issue either. UTF-8 fits well in URLs for GET requests too, and has absolutely no restrictions for POST (as I remember, the client will use HTTP-based requests, right?).

In case you really want all chars <128, you can use UTF-7; it will work just as well, but will take more space. It won't affect ASCII-only names at all (well, almost: not many anime have + in their title).
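For comparison, a quick sketch using Python's built-in `utf-7` codec (nothing AniDB-specific is assumed here). It shows that UTF-7 output stays below 128, costs more space for non-ASCII text, and has to escape a literal `+`.

```python
title = "Пламенный лабиринт"

utf8 = title.encode("utf-8")
utf7 = title.encode("utf-7")

# Every UTF-7 octet is below 128; UTF-8 uses octets >= 128 for non-ASCII text.
utf7_is_7bit = all(b < 128 for b in utf7)
utf8_is_7bit = all(b < 128 for b in utf8)

# ASCII-only names pass through unchanged...
plain = "Argento Soma".encode("utf-7")
# ...except that a literal '+' has to be escaped.
plus = "A+B".encode("utf-7")

print(len(utf8), len(utf7))
print(utf7_is_7bit, utf8_is_7bit)
print(plain, plus)
```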

Elberet · Post by **Elberet** » Sat Sep 27, 2003 8:21 pm

I think the problem here is not the encoding but using double-byte chars within the programs.

But don't .NET as well as Java support double-byte char values natively? And I'm pretty sure that I've seen a Delphi program or two that did save text files as double-byte streams... So it should be doable, ne?

kidan · Post by **kidan** » Sun Sep 28, 2003 7:50 pm

In .NET, strings are always stored as Unicode (UTF-16), thus 2 bytes per char. There shouldn't be a problem, as even the char type is a 2-byte Unicode character. You might only run into problems if you start handling strings as byte arrays (which is a really stupid idea, as the string class provides all the services an array does).
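The byte-array pitfall can be shown in a few lines. This is a Python sketch rather than .NET code, using UTF-16 little-endian as a stand-in for the in-memory layout; the cut-off position is arbitrary.

```python
title = "Пламенный лабиринт"

# UTF-16: every character in this title takes exactly 2 bytes
# (characters outside the Basic Multilingual Plane would take 4).
utf16 = title.encode("utf-16-le")

# Slicing the string keeps characters whole.
first_word = title[:9]

# Slicing the raw bytes at an odd offset cuts a character in half.
broken = utf16[:17]
try:
    broken.decode("utf-16-le")
    cut_ok = True
except UnicodeDecodeError:
    cut_ok = False

print(len(title), len(utf16))
print(first_word, cut_ok)
```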

Post by **exp** » Sun Sep 28, 2003 11:10 pm

Well,

It's all nice and good, however it's just a pain to work with.
We've lost way too much time trying to get UTF-8 support working across the entire data path in AniDB (db<->cgi<->bot<->client<->misc).
So we've decided to drop non-ASCII support entirely for now.
As plain ASCII is valid UTF-8, this will always allow us to step up to complete UTF-8 support once everyone can work with it.

BYe!
EXP

Post by **PetriW** » Tue Sep 30, 2003 10:45 am

Note it's not the clients that have a problem, it's the protocol. As I understand it, it SHOULD work, it just doesn't.

(typical ne?)

kidan · Post by **kidan** » Tue Sep 30, 2003 5:12 pm

Aren't you using SOAP for the protocol-stuff? SOAP should be able to handle multibyte-charsets.

Post by **exp** » Tue Sep 30, 2003 5:20 pm

The big problem here is that UTF-8 would have to be supported by every piece of AniDB-related software. That would cover multiple programming languages, multiple db systems, multiple operating systems, ...
It's just hell :P
So at least for now we won't do it }:o)

BYe!
EXP

Post by **exp** » Tue Sep 30, 2003 5:33 pm

kidan wrote:Aren't you using SOAP for the protocol-stuff? SOAP should be able to handle multibyte-charsets.

No, we're using a good old hand-made plaintext protocol.
But the encoding isn't really a protocol problem; we could just make sure all data is UTF-8 encoded before it's passed to the protocol level.

BYe!
EXP
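The idea of encoding everything before it reaches the protocol level might look something like the sketch below. The line format (fields separated by `|`, one message per line) and both function names are invented for illustration; the actual AniDB protocol is not described in the thread.

```python
from typing import List

def encode_message(fields: List[str]) -> bytes:
    """Join text fields and encode once, at the protocol boundary."""
    for field in fields:
        if "|" in field or "\n" in field:
            raise ValueError("field contains a reserved character")
    return ("|".join(fields) + "\n").encode("utf-8")

def decode_message(line: bytes) -> List[str]:
    """Decode once on receipt; code above this layer only ever sees text."""
    return line.decode("utf-8").rstrip("\n").split("|")

wire = encode_message(["ANIME", "Argento Soma", "Пламенный лабиринт"])
fields = decode_message(wire)

print(wire)
print(fields)
```

Since the separators are plain ASCII and UTF-8 never uses ASCII octets inside a multi-byte character, splitting on `|` or newline stays safe even for non-ASCII titles.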
