This post is mostly meant for epox or any other administrator.
I think this topic has been long dead, but since I picked up Python recently and was amazed at how easy it is to develop small apps using such scripting/JIT-compiled languages, I thought I'd revisit it... but there is a slight change in focus, after trying out the UDP API (which, frankly speaking, I find less informative than the actual HTML, probably due to the limit on the size of data being exchanged).
1) Instead of requesting a special HTTP API, which could be costly on the server side as it is something totally different from the existing HTML pages in terms of both serving and dev/maintenance, I am now looking at a less robust but potentially more server-friendly approach: parsing the present HTML pages instead. Now, I know that normally it is not good to tax normal HTML servers this way, but I have a few assumptions to state, which will hopefully be clarified by epox or other people who know the situation.
a) HTML can get cached. Unlike UDP, where every single UDP client bugs the server directly, normal (largely static) HTML gets cached at various levels: local cache, ISP cache, ISP exchange cache, secondary proxy cache etc. This means a request probably has a higher chance of NOT reaching the server at all (web experts, correct me if I'm wrong on this). Also, from studying the usage pattern (my own, that is), the frequency of use isn't really more than looking up anidb manually.
b) HTML can get compressed easily (natively supported by many browsers). UDP, in contrast, is already fragile (error-prone) without compression; it would need a compression scheme with very high redundancy to be robust, and that would make it too hard to decode, since most UDP processors deal with simple text.
c) HTML is close to XML (or is XML itself if it is XHTML). This means that as long as certain identifying rules are fixed, e.g. certain data is always identified by a certain parent, a certain class attribute, certain keywords etc., the rest of the HTML can be rearranged without affecting the extraction (it could be costly, but the cost is on the client side). XPath will be a great tool in this case. UDP replies, on the other hand, are (and must be) fixed, as fields are referenced by position using some schema which is NOT sent with the data (the client has to know it beforehand). Any change to the reply formats will most certainly kill all the older clients that assume a fixed way of parsing the data.
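To make point c) a bit more concrete, here is a minimal sketch using only Python's standard library. The two XHTML fragments, the class names ("title", "year", "episodes") and the pipe-separated reply are all made up for illustration; they are NOT anidb's real markup or its real UDP reply format. The idea is only that lookup by class attribute survives a page redesign, while lookup by position does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragments (invented class names, not anidb's markup).
PAGE_V1 = """<div>
  <span class="title">Example Anime</span>
  <span class="year">2006</span>
  <span class="episodes">26</span>
</div>"""

# Same data after an imagined redesign: reordered, nested, extra elements.
PAGE_V2 = """<div>
  <p>Some new banner text</p>
  <table><tr>
    <td class="episodes">26</td>
    <td class="year">2006</td>
  </tr></table>
  <h1 class="title">Example Anime</h1>
</div>"""

FIELDS = ("title", "year", "episodes")


def extract_by_class(xhtml):
    """Find each field by its class attribute, wherever it sits in the tree."""
    root = ET.fromstring(xhtml)
    info = {}
    for field in FIELDS:
        # ElementTree supports this limited XPath subset.
        node = root.find(".//*[@class='%s']" % field)
        if node is not None:
            info[field] = (node.text or "").strip()
    return info


def extract_by_position(reply):
    """Positional parsing: the client must know the field order beforehand."""
    return dict(zip(FIELDS, reply.split("|")))


# Both page layouts yield the same dictionary.
same = extract_by_class(PAGE_V1) == extract_by_class(PAGE_V2)

# A reordered positional reply silently puts values in the wrong fields.
broken = extract_by_position("26|2006|Example Anime")
```

Note that ElementTree only parses well-formed XML, so this works for XHTML but real-world tag-soup HTML would need a more forgiving parser.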
ah... I think I got ahead of myself by not explaining what I am trying to do...
(inserted out of order)
0)
Before I revisited this, I was actually writing a simple Python script that checks several URLs for any new links (ed2k links to anime). (Nope, not from anidb. Otherwise I wouldn't have to crack my head over things.) It does this by loading a simple list of URLs to check; when executed, it goes through them one by one, downloads the source for each URL and parses it locally for ed2k links. It then checks against an oldlinks.txt stored locally and removes any links that are already in there, then saves the new links to a simple text file. I run this every other day to check for new episodes of some series; I just have to look at newlinks.txt to see if there is anything. Just for info, after I reject or start downloading certain links, I cut and paste them manually into oldlinks.txt so that they won't appear in newlinks.txt next time. This is still primitive and there is lots of room to improve, but the simplified way of storing stuff could potentially allow a nicer and more featureful GUI to be built over it in the future. The main reason why I use simple text files is that copying and pasting the raw links to and from the clipboard allows them to be pasted directly into the eMule client (whereas if I had used a binary database or XML, I couldn't do it without processing it).
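For anybody curious, the flow described above can be sketched roughly like this. It is a simplified illustration, not my actual script: the file name urls.txt, the function names and the regular expression are assumptions, and the regex simply assumes a link starts with "ed2k://|file|" and runs until whitespace, a quote or an angle bracket.

```python
import re
import urllib.request

# Assumed link shape: ed2k://|file|<name>|<size>|<hash>|/ ...
ED2K_RE = re.compile(r"""ed2k://\|file\|[^\s"'<>]+""")


def find_ed2k_links(page_source):
    """Return ed2k links in order of first appearance, without duplicates."""
    seen = []
    for link in ED2K_RE.findall(page_source):
        if link not in seen:
            seen.append(link)
    return seen


def filter_new(found, old_links):
    """Drop every link that is already listed in old_links."""
    old = set(old_links)
    return [link for link in found if link not in old]


def read_lines(path):
    """Read non-empty lines from a text file; a missing file counts as empty."""
    try:
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return []


def check_urls(url_file="urls.txt", old_file="oldlinks.txt",
               new_file="newlinks.txt"):
    """Download each listed URL and write links not yet seen to new_file."""
    old = read_lines(old_file)
    new = []
    for url in read_lines(url_file):
        with urllib.request.urlopen(url) as resp:
            source = resp.read().decode("utf-8", errors="replace")
        # Compare against old links and links already found in this run.
        new.extend(filter_new(find_ed2k_links(source), old + new))
    with open(new_file, "w", encoding="utf-8") as f:
        f.write("".join(link + "\n" for link in new))
    return new
```

Calling check_urls() once does a full run. Since everything is one link per line in plain text, the contents of newlinks.txt can still be pasted straight into eMule.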
Now comes the problem. Given just a URL, there is no easy way to tell what anime it is referring to (the URLs are named forum-thread style, so they are mostly numbers). So when I add a new URL, I want to tag it with extra info such as what anime it is, some "vital statistics" and a link to some "authority" for more info. The thing is that anidb is my number one (in fact... only one) source of anime information, so what I have had to do is go to anidb manually every time I add a new URL to my program, then manually and laboriously copy over the info piece by piece.
Normally this doesn't occur too frequently, so when I have time it is OK (but still a boring job). But as fans know, anime comes out quarterly... meaning that at the start of a quarter there will potentially be tons and tons (I scream with an ironic mix of joy and frustration that only a fan will know!!!) of new anime. It also means that setting the thing up takes quite a bit of time and effort.
What I was hoping to do is to have an easier way to get some anime info from the "authority" (in this case anidb) and compose some sort of (probably HTML) description which can be used to identify the URLs/links I deal with, carries basic info, and has hyperlinks back to the detailed info at the "authority", so that when I click on it, I get back to the anime's page in the browser.
The long-term and too-ambitious goal is to have ONE abstract anime schema that rules them all (abstract because it is not "owned" by any single anime database source). It's like what COLLADA is among 3D formats. Basically, it is a universal anime schema that can be partially filled by data from various sources (since a single data source may not be able to fill in all the data). But frankly speaking, since I am happy with anidb's schema, it will probably turn out to be close to, if not exactly, what anidb is using... so it may turn out to be a "universal adaptor" for anidb's database instead.
After anidb, I could maybe work on variants of my scripts that work with other anime sites that return data in a compatible form. (Really, I was hoping that after I release the scripts that define the run-time representations of the schema, somebody else can do it, since I really need just anidb :p But come to think of it, it could be interesting if I could extract image links from sources and compose an image gallery to be displayed with the anime info.....) On top of these there would be a higher-level script that takes in all this data and plants it into a single data template, with links back to the original sources, potentially doing data-tidying as well (e.g. different sites may use different codes/strings for similar things, which get mapped to a single abstract data field... sort of like what Unicode is to all the other locale-specific code pages). It could also compile interesting data, e.g. show what rating each site gives to a single anime. It can even do conflict detection, e.g. when some sites disagree on, say, the number of episodes. Maybe that can motivate interested users to find out the truth (from official websites; we can't update databases from each other since they are all secondary info) and inform the respective sites of the necessary updates.
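A rough sketch of what that higher-level script might look like, just to show the shape of the idea. The site names ("site_a", "site_b"), their field names and the abstract field names are all hypothetical; no real site's data format is implied.

```python
# Hypothetical mapping from each site's own field names to abstract ones.
FIELD_MAP = {
    "site_a": {"name": "title", "eps": "episodes", "score": "rating"},
    "site_b": {"title": "title", "episode_count": "episodes", "rating": "rating"},
}

# Fields where sites are expected to agree; ratings naturally differ per site.
FACTUAL_FIELDS = {"title", "episodes"}


def normalise(site, record):
    """Rename one site's fields to the abstract field names, dropping unknowns."""
    mapping = FIELD_MAP[site]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def merge(records):
    """Combine per-site records into {field: {site: value}} plus conflicts."""
    merged = {}
    for site, raw in records.items():
        for field, value in normalise(site, raw).items():
            merged.setdefault(field, {})[site] = value
    conflicts = {
        field: values
        for field, values in merged.items()
        if field in FACTUAL_FIELDS and len(set(values.values())) > 1
    }
    return merged, conflicts


records = {
    "site_a": {"name": "Example Anime", "eps": 26, "score": 8.1},
    "site_b": {"title": "Example Anime", "episode_count": 25, "rating": 7.5},
}
merged, conflicts = merge(records)
# merged["rating"] keeps one rating per site, for side-by-side display.
# conflicts lists only the factual fields the sites disagree on.
```

Keeping every value tagged with its source site means nothing gets overwritten, which matters because all of these databases are secondary info and none can be treated as the correct one.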
Another not-so-high-priority and perhaps controversial feature is to integrate with the ed2k link harvester that I am using, so that it becomes possible to, say, load the links from some forum pages or some release sites, compare them with anidb's links and maybe, through some API, add the new links to anidb??? Unless some maniac does this in a massive, uncontrolled fashion, the way I see it is that since I am using the links myself, I can sort of validate them (catching false/wrong links). This new feature would just allow me an easy way to add to anidb's data in a by-the-way fashion.
(end of background story)
2) Now, coming back to the point. What I want to find out now is how acceptable this is. Frankly speaking, I am doing this only partly because it is helpful to me as an anidb member; in itself, I love to script and "improve" the functionality of the things that I like to use (e.g. anidb). I'm doing this entirely in Python, using some common extensions, all of which can be downloaded for free from various sources. That all my stuff is scripts also means that it is available as source and can be freely modified by anybody who can get their hands on it. What I was hoping to do is to create some standard scripts that act as mini libraries. E.g. currently I am working on one that extracts information such as main/official/synonym/short titles, ratings, producers, year (the standard stuff we find on the main anime info page) and packs it into an easy-to-consume Python map-like object (dictionary), which can then be further processed by other tools or saved to disk in some format. I'm willing to release all my work to fellow anidb members once it is ready (I could just paste the whole script file in a forum post so that anybody can cut and paste it into their own text file and run it locally), but I will only continue development if an admin says OK. I use anidb a lot. I like anidb a lot. And I really appreciate what epox and team are doing. If you guys say no (maybe for technical/resource reasons), I'll stop. Period. Of course, on the other hand, if you guys are supportive and can help answer my queries and give suggestions from your expertise, I'll greatly appreciate it, and should the scripts finally become public, I think so will the rest of the anidb community.
regards
sphere