[CGI] AniDB UTFtoHTML Converter BUG [FIXED?]

Post by **analogued2** » Thu Mar 25, 2004 11:06 am

I was trying to add the kanji title for Tenjou Tenge - http://anidb.ath.cx/perl-bin/animedb.pl ... e&aid=1540 - and discovered this bug.

Steps:
1) Go here: http://anidb.ath.cx/cgi-bin/utf8tohtml.pl
2) Enter "天上天下" in Title to convert
3) Press Convert

Results:
The second kanji is not converted

Oh... and apparently phpBB converts it just fine, as I've discovered by posting this. It should be "& #22825;& #19978;& #22825;& #19979;" (without the spaces of course; I've added them so that it doesn't show up as kanji), but the UTFtoHTML converter apparently chokes on that second one.

Post by **exp** » Sun Mar 28, 2004 1:05 pm

well,

it's not my script and i don't see how this could be fixed.

here is the source code if anyone has an idea, plz feel free to post it here

Code:

#! /usr/bin/perl -w

use strict;
use Unicode::String qw(utf8 utf16);

print "Content-type: text/html; charset=utf-8\n\n";

print "<html>
    <head>
	<title>UTF-8 to HTML-escape converter</title>
	<meta http-equiv=Content-Type content=\"text/html; charset=utf-8\">
    </head>
<body>
AniDB UTFtoHTML Converter:<br>
<hr>
Title to convert:<br>
<form action=\"utf8tohtml.pl\" method=\"GET\">
<input type=\"text\" name=\"u\" size=\"100\"><input type=\"submit\" value=\"Convert\">
</form>";

if ($ENV{"QUERY_STRING"} =~ /^u=(.+)$/)
{
    my $utf8 = $1;
    if ($utf8=~/[&=]/)
    {
	print "Form error or wrong attempt to enter data into URL manually."
    }
    else
    {
	# Dequote  
	$utf8 =~ s/\+/ /g;
	$utf8 =~ s/%(..)/chr(hex($1))/eg;
 
	print "String recieved:<br>$utf8<br><br>";

	my $utf16 = utf8($utf8)->utf16;

	my ($escaped, $escapedd);

	while($utf16 =~ /(.)(.)/g)
	{
	    my $char = ord($1) * 256 + ord($2);

	    $escaped .= ( $char<128 ? chr($char) : "&#" . $char . ";");
	    $escapedd .= ( $char<128 ? chr($char) : "&#" . $char . ";");
	}

	print "HTML-escaped string test:<br>$escapedd<br><br>
HTML-escaped string: (copy this into the AniDB title field)<br>
<input type=\"text\" size=\"100\" value=\"$escaped\">
<br><br><br>
	";
    }
}

print "
<p align=\"right\"><small>(C) by rowaasr13</small></p>
</body>
</html>
";

BYe!
EXP

Post by **pelican** » Mon Aug 23, 2004 9:53 am

exp wrote: well, it's not my script and i don't see how this could be fixed.

I know what causes this particular problem: the jou of tenjou (上, U+4E0A) contains a 0A octet in its UTF-16 representation (4E 0A). 0A is a newline, which "." in a Perl regular expression does not match unless the /s modifier is used. (Though I'm tempted to say that the real reason is over-reliance on Unicode libraries for such simple tasks as converting UTF-8 to UTF-16 and on Perl's regular expressions in place of normal numeric processing.)

exp wrote:
Code:
        while($utf16 =~ /(.)(.)/g)

should be:

Code:

        while($utf16 =~ /(.)(.)/gs)
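The effect of the missing modifier can be reproduced in isolation. The following is a minimal sketch, not part of the original script: the helper name pairs and the hard-coded test bytes are assumptions made for illustration. It feeds the UTF-16BE bytes of the first two kanji through the same pair-matching loop, once without /s and once with it.

```perl
use strict;
use warnings;

# UTF-16BE bytes for U+5929 U+4E0A; the second character ends in octet 0x0A ("\n").
my $utf16 = "\x59\x29\x4E\x0A";

# Hypothetical helper: collect 16-bit values using the loop from the script,
# with or without the /s modifier.
sub pairs {
    my ($bytes, $dot_matches_newline) = @_;
    my @chars;
    if ($dot_matches_newline) {
        push @chars, ord($1) * 256 + ord($2) while $bytes =~ /(.)(.)/gs;
    } else {
        push @chars, ord($1) * 256 + ord($2) while $bytes =~ /(.)(.)/g;
    }
    return @chars;
}

my @without_s = pairs($utf16, 0);  # the pair 4E 0A never matches
my @with_s    = pairs($utf16, 1);  # both pairs match

print "without /s: @without_s\n";  # without /s: 22825
print "with /s:    @with_s\n";     # with /s:    22825 19978
```

Without /s the pair 4E 0A is skipped silently rather than raising an error, which is why the converter produced output with one character missing instead of failing outright.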

Post by **rowaasr13** » Mon Aug 23, 2004 5:55 pm

Why bother rewriting something that already exists every time you need a conversion, especially when the module that does the conversion is a common one and present almost everywhere?

Anyway, that is, of course, the correct solution. The /s modifier should have been there from the beginning.

Post by **pelican** » Mon Aug 23, 2004 8:04 pm

rowaasr13 wrote: Why bother rewriting something that already exists every time you need a conversion, especially when the module that does the conversion is a common one and present almost everywhere?

Because there's the possibility of misunderstanding the interface, and UTF-8 itself is, I'm sure, simpler than the documentation for Unicode::String. (Besides, you only write the code once; you reuse it after that.)

Post by **Guest** » Mon Aug 23, 2004 8:39 pm

pelican wrote: Because there's the possibility of misunderstanding the interface, and UTF-8 itself is, I'm sure, simpler than the documentation for Unicode::String.

Nah, method names are quite intuitive.

pelican wrote: (Besides, you only write the code once; you reuse it after that.)

It is already written inside that module, so why reimplement it? I'm 99% sure that I'd write exactly the same code (whitespace aside), based on the example from www.unicode.org, that is already in this module.

Post by **pelican** » Mon Aug 23, 2004 8:55 pm

Anonymous wrote:
pelican wrote: Because there's the possibility of misunderstanding the interface, and UTF-8 itself is, I'm sure, simpler than the documentation for Unicode::String.
Nah, method names are quite intuitive.

Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.

Post by **PetriW** » Mon Aug 23, 2004 9:49 pm

pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.

A small wheel doesn't mean there'll be no weird bugs. Heck, the above code is an even smaller wheel.

Post by **pelican** » Mon Aug 23, 2004 10:55 pm

PetriW wrote:
pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.
A small wheel doesn't mean there'll be no weird bugs. Heck, the above code is an even smaller wheel.

Even with UTF-8 decoding rolled in, it could be smaller. And small size limits complexity, and complexity has a tendency to hide bugs.
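As an illustration of what "UTF-8 decoding rolled in" might look like, here is a minimal sketch that goes straight from UTF-8 octets to numeric entities with plain arithmetic. It is not code from the thread: the function name utf8_to_entities is hypothetical, and the sketch does not reject overlong encodings or encoded surrogates, which a careful decoder would.

```perl
use strict;
use warnings;

# Hypothetical helper: decode UTF-8 octets directly into numeric HTML entities.
# ASCII is passed through unchanged, as in the original script.
sub utf8_to_entities {
    my ($octets) = @_;
    my @b   = unpack 'C*', $octets;   # work on plain byte values
    my $out = '';
    while (@b) {
        my $lead = shift @b;
        my ($cp, $need);
        # The lead byte gives the payload bits and the number of continuation bytes.
        if    ($lead < 0x80)           { ($cp, $need) = ($lead, 0) }
        elsif (($lead & 0xE0) == 0xC0) { ($cp, $need) = ($lead & 0x1F, 1) }
        elsif (($lead & 0xF0) == 0xE0) { ($cp, $need) = ($lead & 0x0F, 2) }
        elsif (($lead & 0xF8) == 0xF0) { ($cp, $need) = ($lead & 0x07, 3) }
        else  { die sprintf "invalid lead byte 0x%02X\n", $lead }
        for (1 .. $need) {
            my $cont = shift @b;
            die "truncated or invalid sequence\n"
                unless defined $cont && ($cont & 0xC0) == 0x80;
            $cp = ($cp << 6) | ($cont & 0x3F);   # append 6 payload bits
        }
        $out .= $cp < 128 ? chr($cp) : "&#$cp;";
    }
    return $out;
}

# UTF-8 octets of the title from the bug report.
print utf8_to_entities("\xE5\xA4\xA9\xE4\xB8\x8A\xE5\xA4\xA9\xE4\xB8\x8B"), "\n";
```

Because the bytes are handled as numbers and never pass through a regular expression, an 0A octet has no special meaning here.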

Post by **rowaasr13** » Tue Aug 24, 2004 10:22 am

pelican wrote: Method names are one thing; what they do, precisely, is another. I wouldn't normally advocate reinventing the wheel, but in this case it's a very small wheel and the code would likely have worked with all input on the first attempt.

It is not a module error. It was my error: I forgot that UTF-16 can contain \r and \n octets. Of course, if I had converted directly from UTF-8 to numbers, there would be no such problem at all, but I could have missed some other errors, considering that parsing UTF-16 is way simpler than parsing UTF-8.

pelican wrote: Even with UTF-8 decoding rolled in, it could be smaller. And small size limits complexity, and complexity has a tendency to hide bugs.

That's exactly why I prefer NOT to write something that is already done in a well-debugged module. With the current script there are only a few places where errors could happen; if I rolled in my own UTF-8 parser, I would have to check it every time too.
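The two positions are not mutually exclusive. The sketch below leaves the decoding to a library but does the rest numerically, with no regex over raw UTF-16 bytes. It assumes the core Encode module (shipped with Perl 5.8 and later) in place of Unicode::String, and the function name entities_via_encode is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: let Encode do the UTF-8 decoding, then emit
# numeric entities from the code point of each decoded character.
sub entities_via_encode {
    my ($octets) = @_;
    # FB_CROAK makes malformed input die instead of being silently replaced.
    my $chars = decode('UTF-8', $octets, Encode::FB_CROAK);
    return join '',
        map { my $cp = ord; $cp < 128 ? chr($cp) : "&#$cp;" }
        split //, $chars;
}

# UTF-8 octets of the title from the bug report.
print entities_via_encode("\xE5\xA4\xA9\xE4\xB8\x8A\xE5\xA4\xA9\xE4\xB8\x8B"), "\n";
```

Since ord works on whole characters, there is no byte-pair step in which a newline octet could be dropped.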

Post by **exp** » Fri Aug 27, 2004 8:57 am

might be fixed now.

BYe!
EXP