no results while searching anime by russian titles

Please report any sort of feature requests or bugs on the tracker instead of the forum! http://tracker.anidb.info

Moderator: AniDB

Locked
Great Vovs
Posts: 14
Joined: Fri Nov 18, 2005 12:24 pm
Location: Moscow, Russia

Post by Great Vovs »

When I enter Russian characters in the 'search' field, I get no results, even if I copy and paste (Ctrl+C, Ctrl+V) the title from the anime page.
For example, Last Exile: I can find it using the Japanese (ラストエグザイル) and even the Arabic (لمنفى الأخير) titles, but I can't find it by searching for the Russian title (Изгнанник).
http://tracker.anidb.info/view.php?id=665
Rar
AniDB Staff
Posts: 1471
Joined: Fri Mar 12, 2004 2:41 pm
Location: UK

Post by Rar »

Confirmed:
http://anidb.info/perl-bin/animedb.pl?s ... rch=%D0%B6
Greek also has the same problem:
http://anidb.info/perl-bin/animedb.pl?s ... rch=%CF%83
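
For reference, the search terms in the two URLs above are the percent-encoded UTF-8 bytes of a single Cyrillic and a single Greek letter. A minimal Python 3 sketch to check that, using only the standard library (nothing here is AniDB code):

```python
from urllib.parse import quote, unquote

# The two search terms copied from the URLs above.
encoded_terms = ("%D0%B6", "%CF%83")

# unquote() decodes percent-escapes as UTF-8 by default.
decoded = [unquote(term) for term in encoded_terms]

for term, char in zip(encoded_terms, decoded):
    print(term, "->", char, "U+%04X" % ord(char))
```

So the failing test cases are ж (U+0436) and σ (U+03C3), each two bytes long in UTF-8.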

My suspicion (based on knowledge of how crappy non-ASCII string handling is in Perl) was that at some point it's treating the input as Latin-1 and lowercasing it*, but a quick glance over adbs_animelist.pm and adbs_all_misc.pm shows nothing obviously doing that, unless PostgreSQL is fucking up ILIKE.

Rar

*Reasoning:

Code:

>>> for c in (u"し", u"σ", u"ж"):
...   utf8 = c.encode('utf8')
...   lcbork = utf8.decode('latin1').lower()
...   tripdone = lcbork.encode('latin1').decode('utf8','replace')
...   print " ".join(repr(s) for s in (c, utf8, lcbork, tripdone))
...
u'\u3057' '\xe3\x81\x97' u'\xe3\x81\x97' u'\u3057'
u'\u03c3' '\xcf\x83' u'\xef\x83' u'\ufffd'
u'\u0436' '\xd0\xb6' u'\xf0\xb6' u'\ufffd'

Post by Rar »

Right, we've done a bit more testing, and as it turns out it works* on dev, and:
<EXP[BUSY]> well, main difference between dev and main atm should be locale
<EXP[BUSY]> C on main, en_GB.UTF-8 on dev
Changing the locale should fix this, but it won't be done immediately as it means quite a bit of downtime.

Rar

*Where "works" = doesn't fail as much.
These two queries (uppercase Ж and lowercase ж) return different results:
show=animelist&adb.search=%D0%96
show=animelist&adb.search=%D0%B6
So it's clearly still only as Unicode-aware as the usual Unicode 'support' of just treating UTF-8 as funny ASCII and being careful not to fuck with the top bit.
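
To illustrate what "treating UTF-8 as funny ASCII" means for a case-insensitive search, here is a rough Python 3 model. It is only a sketch of the behaviour described above, not AniDB's code and not how PostgreSQL implements ILIKE; the function names are made up for the example.

```python
def ascii_only_lower(data: bytes) -> bytes:
    """Lowercase A-Z only; every byte >= 0x80 is left untouched."""
    return data.lower()  # bytes.lower() only maps ASCII letters


def bytewise_icontains(needle: str, haystack: str) -> bool:
    """Hypothetical model of a case-insensitive match that only knows ASCII."""
    n = ascii_only_lower(needle.encode("utf-8"))
    h = ascii_only_lower(haystack.encode("utf-8"))
    return n in h


def unicode_icontains(needle: str, haystack: str) -> bool:
    """Case-insensitive match that folds case on Unicode code points."""
    return needle.casefold() in haystack.casefold()


title = "Изгнанник"

print(bytewise_icontains("last exile", "Last Exile"))  # True: ASCII folds fine
print(bytewise_icontains("Изгнанник", title))          # True: exact bytes match
print(bytewise_icontains("изгнанник", title))          # False: И (D0 98) is not и (D0 B8)
print(unicode_icontains("изгнанник", title))           # True
```

In this model nothing gets corrupted, but Ж and ж are simply different byte strings, which is consistent with the two queries above returning different results.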