[CGI] AniDB UTFtoHTML Converter BUG [FIXED?]
I was trying to add the kanji title for Tenjou Tenge - http://anidb.ath.cx/perl-bin/animedb.pl ... e&aid=1540 - and discovered this bug
Steps:
1) Go here: http://anidb.ath.cx/cgi-bin/utf8tohtml.pl
2) Enter "天上天下" in Title to convert
3) Press Convert
Results:
The second kanji is not converted
Oh.... and apparently phpBB converts it just fine, as I've discovered by posting this. It should be "& #22825;& #19978;& #22825;& #19979;" (without the spaces of course; I've added them so that it doesn't show up as kanji), but the UTFtoHTML converter apparently chokes on that second one.
well,
it's not my script and i don't see how this could be fixed.
here is the source code if anyone has an idea, plz feel free to post it here )
Code:
#! /usr/bin/perl -w
use strict;
use Unicode::String qw(utf8 utf16);
print "Content-type: text/html; charset=utf-8\n\n";
print "<html>
<head>
<title>UTF-8 to HTML-escape converter</title>
<meta http-equiv=Content-Type content=\"text/html; charset=utf-8\">
</head>
<body>
AniDB UTFtoHTML Converter:<br>
<hr>
Title to convert:<br>
<form action=\"utf8tohtml.pl\" method=\"GET\">
<input type=\"text\" name=\"u\" size=\"100\"><input type=\"submit\" value=\"Convert\">
</form>";
if ($ENV{"QUERY_STRING"} =~ /^u=(.+)$/)
{
    my $utf8 = $1;
    if ($utf8=~/[&=]/)
    {
        print "Form error or wrong attempt to enter data into URL manually."
    }
    else
    {
        # Dequote
        $utf8 =~ s/\+/ /g;
        $utf8 =~ s/%(..)/chr(hex($1))/eg;
        print "String recieved:<br>$utf8<br><br>";
        my $utf16 = utf8($utf8)->utf16;
        my ($escaped, $escapedd);
        while($utf16 =~ /(.)(.)/g)
        {
            my $char = ord($1) * 256 + ord($2);
            $escaped .= ( $char<128 ? chr($char) : "&#" . $char . ";");
            $escapedd .= ( $char<128 ? chr($char) : "&#" . $char . ";");
        }
        print "HTML-escaped string test:<br>$escapedd<br><br>
HTML-escaped string: (copy this into the AniDB title field)<br>
<input type=\"text\" size=\"100\" value=\"$escaped\">
<br><br><br>
";
    }
}
print "
<p align=\"right\"><small>(C) by rowaasr13</small></p>
</body>
</html>
";
EXP
exp wrote: well, it's not my script and i don't see how this could be fixed.
I know what causes this particular problem: the jou (上, U+4E0A) of tenjou contains a 0A octet in its UTF-16 representation, and 0A is a newline, which "." in a Perl regular expression does not match by default. (Though I'm tempted to say that the real reason is over-reliance on Unicode libraries for such simple tasks as converting UTF-8 to UTF-16 and on Perl's regular expressions in place of normal numeric processing.)
exp wrote:
Code:
while($utf16 =~ /(.)(.)/g)
should be:
Code:
while($utf16 =~ /(.)(.)/gs)
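To see the effect of the /s modifier in isolation, here is a minimal sketch that runs both loops over the same bytes. It assumes the UTF-16 string is big-endian without a byte-order mark, and the bytes are typed in by hand rather than produced by Unicode::String; code_units is a hypothetical helper name, not part of the original script.

```perl
use strict;
use warnings;

# UTF-16BE bytes for the four kanji: U+5929 U+4E0A U+5929 U+4E0B.
# The second character ends in the octet 0x0A, which is "\n".
my $utf16 = "\x59\x29\x4E\x0A\x59\x29\x4E\x0B";

# Hypothetical helper: split the byte string into 16-bit code units
# using the same two-dot pattern as the script, with or without /s.
sub code_units {
    my ($bytes, $dot_matches_newline) = @_;
    my $re = $dot_matches_newline ? qr/(.)(.)/s : qr/(.)(.)/;
    my @units;
    while ($bytes =~ /$re/g) {
        push @units, ord($1) * 256 + ord($2);
    }
    return @units;
}

my @without_s = code_units($utf16, 0);
my @with_s    = code_units($utf16, 1);

print "without /s: @without_s\n";   # 22825 22825 19979
print "with /s:    @with_s\n";      # 22825 19978 22825 19979
```

Without /s the regex cannot pair 0x4E with the newline octet, so it skips ahead and the second kanji silently disappears, which matches the behaviour reported at the top of the thread.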
rowaasr13 wrote: Why bother writing something that is already done every time you need some conversion, especially if the module that does the conversion is a common one and present almost everywhere?
Because there's the possibility of misunderstanding the interface, and UTF-8 is actually simpler than I'm sure the documentation for Unicode::String is. (Besides, you only write the code once; you reuse it after that.)
pelican wrote: Because there's the possibility of misunderstanding the interface, and utf-8 is actually simpler than I'm sure the documentation for Unicode::String is.
Nah, method names are quite intuitive.
pelican wrote: (Besides, you only write the code once; you reuse it after that.)
It is written already inside that module, so why reimplement? I'm 99% sure that I'd write exactly the same code (excluding whitespace), based on the example from www.unicode.org, that is already in this module.
pelican wrote: Because there's the possibility of misunderstanding the interface, and utf-8 is actually simpler than I'm sure the documentation for Unicode::String is.
Anonymous wrote: Nah, method names are quite intuitive.
Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
A small wheel doesn't mean there'll be no weird bugs; heck, the above code is an even smaller wheel.
pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
PetriW wrote: A small wheel doesn't mean there'll be no weird bugs, heck the above code is an even smaller wheel.
Even with UTF-8 decoding rolled in, it could be smaller. And size limits complexity, which has a tendency to hide bugs.
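For readers wondering what "UTF-8 decoding rolled in" might look like, here is a rough sketch of converting the octets straight to numeric character references with plain arithmetic and no UTF-16 step. It is only an illustration, not code anyone in the thread posted: utf8_to_ncr is a made-up name, and the sketch does not reject overlong encodings or surrogate code points, which a careful decoder should.

```perl
use strict;
use warnings;

# Hypothetical helper: decode UTF-8 octets by hand and emit ASCII as-is
# and everything else as a decimal numeric character reference.
sub utf8_to_ncr {
    my ($octets) = @_;
    my @b   = unpack 'C*', $octets;   # octet values 0..255
    my $out = '';
    my $i   = 0;
    while ($i < @b) {
        my $lead = $b[$i++];
        my ($cp, $extra);
        # The lead byte tells us how many continuation bytes follow.
        if    ($lead < 0x80)           { ($cp, $extra) = ($lead,        0) }
        elsif (($lead & 0xE0) == 0xC0) { ($cp, $extra) = ($lead & 0x1F, 1) }
        elsif (($lead & 0xF0) == 0xE0) { ($cp, $extra) = ($lead & 0x0F, 2) }
        elsif (($lead & 0xF8) == 0xF0) { ($cp, $extra) = ($lead & 0x07, 3) }
        else { die "invalid UTF-8 lead byte\n" }
        for (1 .. $extra) {
            # Each continuation byte must look like 10xxxxxx.
            die "truncated or invalid UTF-8 sequence\n"
                unless $i < @b && ($b[$i] & 0xC0) == 0x80;
            $cp = ($cp << 6) | ($b[$i++] & 0x3F);
        }
        $out .= $cp < 128 ? chr($cp) : "&#" . $cp . ";";
    }
    return $out;
}

# UTF-8 octets for the title from the bug report.
print utf8_to_ncr("\xE5\xA4\xA9\xE4\xB8\x8A\xE5\xA4\xA9\xE4\xB8\x8B"), "\n";
```

Because the input is treated as a list of numbers, a 0x0A octet is nothing special here. The trade-off is the one argued about in the thread: the validation that the sketch skips is what a tested module would otherwise handle.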
pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
It is not a module error; it was my error: I forgot that UTF-16 can contain \r and \n. Of course, if I had converted directly from UTF-8 to numbers, there would be no such problem at all, but I could have missed some other errors, considering that parsing UTF-16 is way simpler code than parsing UTF-8.
pelican wrote: Even with UTF-8 decoding rolled in, it could be smaller. And size limits complexity, which has a tendency to hide bugs.
That's exactly why I prefer NOT to write something that is already done in a well-debugged module. With the current script there are just a few places where errors could happen; if I rolled my own UTF-8 parser in, I would have to check it every time too.
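As a middle ground between the two positions, a module can do the decoding while the script still works on numbers instead of pairs of bytes. The sketch below assumes the core Encode module is available, which is an assumption on my part since the thread only mentions Unicode::String; escape_with_encode is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: let Encode decode the UTF-8 octets, then escape
# every non-ASCII character by its code point.
sub escape_with_encode {
    my ($octets) = @_;
    # With a CHECK argument decode() may modify its source, so work on a copy.
    # FB_CROAK makes it die on malformed input instead of guessing.
    my $copy = $octets;
    my $text = decode('UTF-8', $copy, Encode::FB_CROAK);
    my $out  = '';
    for my $ch (split //, $text) {
        my $cp = ord $ch;
        $out .= $cp < 128 ? $ch : '&#' . $cp . ';';
    }
    return $out;
}

# UTF-8 octets for the title from the bug report.
print escape_with_encode("\xE5\xA4\xA9\xE4\xB8\x8A\xE5\xA4\xA9\xE4\xB8\x8B"), "\n";
```

Since the loop iterates over decoded characters, there is no byte-pair regex left for a newline octet to trip up.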