fahrenheit wrote:You know, the reason there is a UDP API is to prevent stuff like what you are suggesting.
Yeah, you could curl the entire site for the data you want, but you just need to think a bit to figure out why that isn't the best approach: the amount of unnecessary data you are requesting is just too big. For example, if you need some data about an anime, you have to fetch all the HTML, then parse that HTML to extract the data you need, and only then do you have it. Not practical.
On a side note, you can get banned if you try to make more than X requests against the HTTP server, so instead of doing this by brute force it's much nicer to do it the gentle way, like webaom does, through the UDP API.
Also, if you want something that isn't provided by the UDP API, it's usually faster to ask epox, and he will do it in way less time than it takes for something to happen on the HTTP server (days vs. months/years).
cURL or wget the entire site? Oh, God no. That would be evil. I was thinking more along the lines of certain useful pages for which there is no analog in the UDP API (and for which there never will be, due to server load) -- pages like http://anidb.info/perl-bin/animedb.pl?s ... t&uid=xxxx , for example.
I mentioned gestalts -- this is the best way to do things. AOM does this already, with the kowai data dumps; however, AOM also has a leg up, since there is a way to incrementally retrieve massive amounts of data through the TCP API. (Also, kowai, for the purposes of discussion, will be off-limits to UDP API devs.) So, absent a gestalt, the only way to proceed with a UDP version of AOM is to be massively evil to the server, which is not cool.
Fortunately, we're not completely without tools. AniDB is, like any database, massively static: almost all changes are additions of data, with next to no removals. Thus, incremental retrieval of data from the main database will be fine, as long as we cache all of it and never throw the cache away wholesale. Cache invalidation should be incremental too, because the alternative is to invalidate an entire table at once, and that's not good when your table is several thousand UDP requests' worth of data!
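To make that concrete, here's a rough sketch of what such an additive, never-flushed-wholesale cache might look like. (`fetch_anime` is a hypothetical stand-in for whatever actually issues the UDP ANIME request; the JSON-file persistence is just an assumption for illustration.)

```python
# Sketch of an additive cache for mostly-static AniDB data.
# Each record costs one UDP request the first time, and zero afterwards.

import json
import os

CACHE_FILE = "anime_cache.json"

def load_cache():
    """Load the persistent cache, or start empty if none exists yet."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)

def get_anime(cache, aid, fetch_anime):
    """Return data for aid, hitting the server only on a cache miss."""
    key = str(aid)
    if key not in cache:
        cache[key] = fetch_anime(aid)  # one request, then cached forever
        save_cache(cache)
    return cache[key]
```

If a single record ever does go stale, you invalidate just that one key -- never the whole table.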
The mylist, however, is sliiightly different. First, it's small: my mylist is about 2% of AniDB and spans three pages in a web browser. This is a cURLable amount. All of the data in a mylist is self-contained -- you do not need to issue an ANIME or EPISODE command for each item in the list. (You might end up doing that anyway, if the user requests it, but try to stay away from that kind of thinking.) More importantly, the mylist can change quite easily. It should be very simple to invalidate the cached mylist and load another gestalt, or to use a new gestalt to update the cached mylist (that's what gestalts are designed for, after all...)
(A mylist gestalt could also be obtained with the "mylist export" feature of the database, which can create a CSV. Whether or not this option is more taxing to the server [it probably is, XD] doesn't matter; it's just important that we support that option, since it is a valid snapshot of the mylist.)
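A sketch of what importing such a snapshot and merging it into the cached mylist could look like -- note the column names (`lid`, `aid`, `state`) are assumptions for illustration; the real export template defines its own layout:

```python
# Sketch: treat a mylist CSV export as a gestalt and merge it into the cache.
# Column names here are hypothetical, not AniDB's actual export format.

import csv
import io

def import_mylist_csv(text):
    """Parse a mylist CSV dump into {lid: row-dict}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["lid"]: row for row in reader}

def merge_gestalt(cached, fresh):
    """Update the cached mylist in place from a newer snapshot."""
    cached.update(fresh)        # additions and edits
    for lid in set(cached) - set(fresh):
        del cached[lid]         # gone from the snapshot, gone from the cache
    return cached
```

The same `merge_gestalt` works whether the fresh snapshot came from the CSV export or a scraped HTTP dump, which is the whole point of treating both as valid gestalts.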
You are right about unnecessary data. Parsing HTML is a chore that takes a day to write and two seconds to execute -- slow, arduous work. Unfortunately, there's no other option, since the only good, valid gestalts that could power an AOM replacement are the kowai data dumps, which are not accessible. (Well, not strictly inaccessible, but the time it would take to reverse-engineer the TCP API and the code that controls the dumps, and then reimplement it all, is much greater than the time it would take to write the gestalt importing and caching code. I should know; I've already written code to import a mylist from an HTTP or CSV dump!) The other option is to ask for an implementation in the UDP API, but that will not happen, since the UDP API was not designed to handle large amounts of data. *cough*1400bytemtu*cough* On top of that, implementing retrieval of gestalts (a TCP thing) in the UDP API defeats the purpose of both APIs.
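For the record, the "day to write" part looks something like this: a table-scraping skeleton built on nothing but the standard library. The markup in the usage below is hypothetical -- a real parser would have to be tuned to AniDB's actual mylist HTML, which is exactly why it's arduous.

```python
# Sketch of scraping mylist rows out of an HTML table.
# Real AniDB pages are messier than this; the structure here is assumed.

from html.parser import HTMLParser

class MylistRowParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # list of rows, each a list of cell strings
        self._cell = None    # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
```

Compare that to a UDP MYLIST reply, which hands you the fields already delimited -- no parser, no breakage when the site layout changes.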