[CGI] AniDB UTFtoHTML Converter BUG [FIXED?]
Moderator: AniDB
-
analogued2
I was trying to add the kanji title for Tenjou Tenge - http://anidb.ath.cx/perl-bin/animedb.pl ... e&aid=1540 - and discovered this bug
Steps:
1) Go here: http://anidb.ath.cx/cgi-bin/utf8tohtml.pl
2) Enter "天上天下" in Title to convert
3) Press Convert
Results:
The second kanji is not converted
Oh... and apparently phpBB converts it just fine, as I've discovered by posting this. It should be "& #22825;& #19978;& #22825;& #19979;" (without the spaces, of course; I've added them so that it doesn't show up as kanji), but the UTFtoHTML converter apparently chokes on that second one.
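For reference, the expected output can be reproduced with a few lines of Perl. This is only a minimal sketch, not the converter's code: `to_entities` is a hypothetical helper name, and the title is written as code points so the encoding of the source file does not matter.

```perl
use strict;
use warnings;

# The four kanji of the title (天上天下), given as code points.
my $title = "\x{5929}\x{4E0A}\x{5929}\x{4E0B}";

# Hypothetical helper: ASCII passes through unchanged, everything else
# becomes a decimal numeric character reference.
sub to_entities {
    my ($text) = @_;
    return join '', map { ord($_) < 128 ? $_ : '&#' . ord($_) . ';' } split //, $text;
}

print to_entities($title), "\n";   # &#22825;&#19978;&#22825;&#19979;
```

The second reference, 19978, is the one the converter fails to produce.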
well,
it's not my script and i don't see how this could be fixed.
here is the source code; if anyone has an idea, please feel free to post it here.
Code: Select all
#! /usr/bin/perl -w
use strict;
use Unicode::String qw(utf8 utf16);
print "Content-type: text/html; charset=utf-8\n\n";
print "<html>
<head>
<title>UTF-8 to HTML-escape converter</title>
<meta http-equiv=Content-Type content=\"text/html; charset=utf-8\">
</head>
<body>
AniDB UTFtoHTML Converter:<br>
<hr>
Title to convert:<br>
<form action=\"utf8tohtml.pl\" method=\"GET\">
<input type=\"text\" name=\"u\" size=\"100\"><input type=\"submit\" value=\"Convert\">
</form>";
if ($ENV{"QUERY_STRING"} =~ /^u=(.+)$/)
{
    my $utf8 = $1;
    if ($utf8=~/[&=]/)
    {
        print "Form error or wrong attempt to enter data into URL manually."
    }
    else
    {
        # Dequote
        $utf8 =~ s/\+/ /g;
        $utf8 =~ s/%(..)/chr(hex($1))/eg;
        print "String recieved:<br>$utf8<br><br>";
        my $utf16 = utf8($utf8)->utf16;
        my ($escaped, $escapedd);
        while($utf16 =~ /(.)(.)/g)
        {
            my $char = ord($1) * 256 + ord($2);
            $escaped .= ( $char<128 ? chr($char) : "&#" . $char . ";");
            $escapedd .= ( $char<128 ? chr($char) : "&#" . $char . ";");
        }
        print "HTML-escaped string test:<br>$escapedd<br><br>
HTML-escaped string: (copy this into the AniDB title field)<br>
<input type=\"text\" size=\"100\" value=\"$escaped\">
<br><br><br>
";
    }
}
print "
<p align=\"right\"><small>(C) by rowaasr13</small></p>
</body>
</html>
";
EXP
exp wrote: well, it's not my script and i don't see how this could be fixed.
I know what causes this particular problem: the jou of tenjou contains a 0A octet in its UTF-16 representation, and 0A is a line feed, which "." does not match by default. (Though I'm tempted to say that the real reason is over-reliance on Unicode libraries for such simple tasks as converting UTF-8 to UTF-16, and on Perl's regular expressions in place of normal numeric processing.)
exp wrote: Code: Select all
while($utf16 =~ /(.)(.)/g)
should be:
Code: Select all
while($utf16 =~ /(.)(.)/gs)
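The difference between the two loops can be checked in isolation. The following is a rough sketch, with assumptions: it uses the core Encode module's UTF-16BE encoding as a stand-in for what Unicode::String produces in the script, and `escape_pairs` is a hypothetical helper that copies the loop body.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Big-endian UTF-16 bytes of the title: 59 29 4E 0A 59 29 4E 0B.
# The 0A in the second pair is a line feed.
my $utf16 = encode('UTF-16BE', "\x{5929}\x{4E0A}\x{5929}\x{4E0B}");

# Hypothetical helper: walk the byte string two bytes at a time with
# the given regex, as the script's while loop does.
sub escape_pairs {
    my ($bytes, $re) = @_;
    my $out = '';
    while ($bytes =~ /$re/g) {
        my $char = ord($1) * 256 + ord($2);
        $out .= $char < 128 ? chr($char) : "&#$char;";
    }
    return $out;
}

print escape_pairs($utf16, qr/(.)(.)/),  "\n";  # &#22825;&#22825;&#19979;
print escape_pairs($utf16, qr/(.)(.)/s), "\n";  # &#22825;&#19978;&#22825;&#19979;
```

Without /s the dot refuses to match the 0A byte, so the regex skips ahead to the next pair and the second kanji is lost; with /s all four pairs are matched.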
rowaasr13 wrote: Why bother writing something that is already done, every time you need some conversion, especially if the module that does the conversion is a common one and present almost everywhere?
Because there's the possibility of misunderstanding the interface, and UTF-8 is actually simpler than I'm sure the documentation for Unicode::String is. (Besides, you only write the code once; you reuse it after that.)
-
Guest
pelican wrote: Because there's the possibility of misunderstanding the interface, and utf-8 is actually simpler than I'm sure the documentation for Unicode::String is.
Nah, method names are quite intuitive.
pelican wrote: (Besides, you only write the code once; you reuse it after that.)
It is written already inside that module, so why reimplement? I'm 99% sure that I'd write exactly the same code (excluding whitespace), based on the example from www.unicode.org, that is already in this module.
pelican wrote: Because there's the possibility of misunderstanding the interface, and utf-8 is actually simpler than I'm sure the documentation for Unicode::String is.
Anonymous wrote: Nah, method names are quite intuitive.
Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
A small wheel doesn't mean there'll be no weird bugs; heck, the above code is an even smaller wheel.
pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
PetriW wrote: A small wheel doesn't mean there'll be no weird bugs, heck the above code is an even smaller wheel.
Even with UTF-8 decoding rolled in, it could be smaller. And size limits complexity, which has a tendency to hide bugs.
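To give an idea of how small "UTF-8 decoding rolled in" might be, here is a rough sketch of a purely numeric decoder. It is not code from this thread: `utf8_to_entities` is a hypothetical name, and the sketch assumes well-formed input (it does not reject overlong forms, stray continuation bytes, surrogates, or truncated sequences).

```perl
use strict;
use warnings;

# Hypothetical helper: decode UTF-8 bytes with plain arithmetic and
# emit decimal entities for everything outside ASCII.
sub utf8_to_entities {
    my ($bytes) = @_;
    my @b   = unpack 'C*', $bytes;   # one integer per byte
    my $out = '';
    while (@b) {
        my $lead = shift @b;
        # The lead byte gives the payload bits and the number of
        # continuation bytes that follow.
        my ($cp, $extra) = $lead < 0x80 ? ($lead, 0)
                         : $lead < 0xE0 ? ($lead & 0x1F, 1)
                         : $lead < 0xF0 ? ($lead & 0x0F, 2)
                         :                ($lead & 0x07, 3);
        for (1 .. $extra) {
            # Each continuation byte contributes its low six bits.
            $cp = ($cp << 6) | (shift(@b) & 0x3F);
        }
        $out .= $cp < 128 ? chr($cp) : "&#$cp;";
    }
    return $out;
}

# UTF-8 bytes of 天上天下
print utf8_to_entities("\xE5\xA4\xA9\xE4\xB8\x8A\xE5\xA4\xA9\xE4\xB8\x8B"), "\n";
# &#22825;&#19978;&#22825;&#19979;
```

Because no regex ever sees the bytes, a 0A octet cannot cause trouble here, but the missing validation is the kind of thing a tested module handles for you.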
pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
It is not a module error; it was my error: I forgot that UTF-16 can contain \r and \n bytes. Of course, if I converted directly from UTF-8 to numbers, there would be no such problem at all, but I could have introduced some other errors, considering that parsing UTF-16 is waaay simpler code than parsing UTF-8.
pelican wrote: Even with UTF-8 decoding rolled in, it could be smaller. And size limits complexity, which has a tendency to hide bugs.
That's exactly why I prefer NOT to write something that is already done in a well-debugged module. With the current script there are very few places where errors could happen; if I rolled my own UTF-8 parser in, I would have to check it every time too.
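A middle ground between the two positions might be to keep a library for the decoding but read the UTF-16 numerically instead of with a regex. The sketch below is an assumption-laden illustration, not the script's code: it uses the core Encode module rather than Unicode::String, and `utf8_to_entities_via_utf16` is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: let the library do UTF-8 -> UTF-16BE, then read
# the result as 16-bit big-endian integers with unpack.
# Like the original script, this emits one entity per 16-bit unit, so
# characters outside the BMP would come out as two surrogate entities.
sub utf8_to_entities_via_utf16 {
    my ($utf8_bytes) = @_;
    my $utf16 = encode('UTF-16BE', decode('UTF-8', $utf8_bytes));
    return join '', map { $_ < 128 ? chr($_) : "&#$_;" } unpack 'n*', $utf16;
}

# UTF-8 bytes of 天上天下
print utf8_to_entities_via_utf16("\xE5\xA4\xA9\xE4\xB8\x8A\xE5\xA4\xA9\xE4\xB8\x8B"), "\n";
# &#22825;&#19978;&#22825;&#19979;
```

The 'n*' template treats every byte pair as a number, so \r and \n bytes inside the UTF-16 data are nothing special.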