Anybody interested in developing an HTTP API?

Want to help out? Need help accessing the AniDB API? This is the place to ask questions.

Moderator: AniDB

sphere
Posts: 19
Joined: Thu Nov 16, 2006 9:33 am

Anybody interested in developing an HTTP API?

Post by sphere »

Does AniDB have any plans for an HTTP-based API, namely through HTTP GET and POST? Despite the work being done on the UDP API, I feel that an HTTP API would be a lot more flexible in the long run.

For those who may not be familiar with the idea: basically, the client would make an HTTP request, maybe like

http://anidb.info/perl-bin/anime.php?id=979

anime.php is a PHP script that generates the JSON reply, while the query string indicates which anime we are requesting.

A sample JSON reply might be:

Code:

{
	  "infotype": "anime_info"
	, "animeid": 979
	, "coverimage": {"type": "url", "url": "http://www.sonymusic.co.jp/Animation/hagaren/"}
	, "title": {
		  "full": "Hagane no Renkinjutsushi"
		, "kana": "鋼の錬金術師"
		, "english": "Fullmetal Alchemist"
		, "synonym": ["Full Metal Alchemist", "Metalinis Alchemikas"]
		, "short": ["FMA", "Fullmetal", "HagaRen", "HagaRen TV", "HnR", "fma tv", "fmatv"]
	}
	, "genre": [[2, "Adventure"], [4, "Drama"], [14, "Magic"], [30, "Shounen"]]
	, "type": "TV Series"
	, "episodes": 51
	, "year": "04.10.2003-02.10.2004"
	, "companies": [
		[292, "Arakawa Hiroshi"], [47, "Aniplex"], [39, "BONES"], [282, "MBS"]
	]
	, "url": "http://www.sonymusic.co.jp/Animation/hagaren/"
	, "relations": {
		"sequel": [
			[2359, "Gekijouban Hagane no Renkinjutsushi: Shambala o Yuku Mono"]
		]
	}
}
I'm using JSON here because it is much leaner and easier to read than XML, which greatly reduces the amount of data being transferred and parsed. Of course, there would be other scripts for querying other things like companies, individuals, genres, etc. Having such a framework would also make AniDB AJAX-ready, which would release the full power of the browser and allow the web pages to rival any standalone app.
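Just to make the idea concrete, a client consuming such an endpoint could be as small as the sketch below (Python; the anime.php URL is the hypothetical one from above, not a real AniDB endpoint, and the field names are the ones from my sample reply):

Code:

# Minimal sketch of a client for the proposed HTTP+JSON lookup.
# The endpoint is hypothetical; this only illustrates the suggested scheme.
import json
import urllib.request

def fetch_anime(anime_id):
    url = "http://anidb.info/perl-bin/anime.php?id=%d" % anime_id
    with urllib.request.urlopen(url) as response:
        return json.load(response)  # parse the JSON reply

info = fetch_anime(979)
print(info["title"]["full"], "-", info["episodes"], "episodes")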

If there isn't one at the moment, I wonder if anybody is interested and if the AniDB admins are supportive. I am interested in such a project, particularly in the design of the data exchange and services, but I am not a PHP/Perl programmer (I work in C++ and JavaScript mostly). However, what I am looking at right now should be very straightforward, as it mirrors what can already be accessed over HTTP manually. In fact, as a hack, we could even parse the current HTML pages directly (very messy, and it will break if the web format changes).

This is just the first step I'm looking at. The next step would be coming up with a generic, server-independent anime information data structure, together with a proper abstraction that allows anime data of different structures (e.g. from other anime sites) to be transformed into a common format usable by web or standalone clients. The advantage is that clients would see only ONE generic format, while the actual transforms would be provided by a development team dedicated to each site/data source. That would certainly make future maintenance easier, imo.

After that is done, what I hope to achieve is something like animelamp (an anime collection manager): a tool that doubles as an anime collection organizer, can query online databases to fill in info for a new anime, and provides links to the online files/CDs/DVDs while indicating those that are offline. Specialized filters could be created to synchronize the applicable information between the client and the AniDB lists.

Well, those are just some of the possibilities ;)

Looking forward to replies from the admin as well as interested parties.
exp
Site Admin
Posts: 2438
Joined: Tue Oct 01, 2002 9:42 pm
Location: Nowhere

Post by exp »

well,

this is not an easy topic.
Just as a sidenote before we begin: we're already working on an XML-based AJAX version of the anime page, and prototypes are already working. In the long run some of the data-intensive pages of AniDB will be converted to AJAX pages (first: anime page & mylist page).

The notion of an HTTP-based API has been discussed, and in a way we already have something like it. But it's not providing the data you want.
(And some of the first versions of the TCP API were even completely built on top of HTTP!)

The key issue here is server load; we've consciously decided to go with a UDP-based, non-XML approach to maximize the number of concurrent clients the system can handle.
Compared to the UDP API, an HTTP API as you suggest would add quite a bit of overhead, even without XML.

The question here is: what do we gain? Sure, maybe implementation on the client side would be slightly easier with an HTTP & XML based approach, but we're a non-profit project and server performance is our key resource. As such, anything which brings a penalty in terms of server load needs a good justification.

We're quite willing to extend the functionality of the UDP API to suit the needs of our users, but an HTTP-based API doesn't really seem worth the additional overhead it would entail.

I won't say never, but right now, I don't think we really need an HTTP API.

BYe!
EXP
sphere
Posts: 19
Joined: Thu Nov 16, 2006 9:33 am

Post by sphere »

I see :) I think I can understand some of the difficulties you guys face. I was actually thinking that it would be a simpler subset of the webpages as they are now: instead of returning fully formatted web pages, it would just return the data. I don't know about the exact implementation, but since anime data is quite standard, could it be cached remotely too? E.g. if person A on some ISP requests an anime's data via HTTP, would it be possible for person B requesting the same data to get it from the ISP cache instead of the server? (Maybe it's just wishful thinking on my side :) )

But the AJAX sounds like good news, though how you can implement it without HTTP is beyond my understanding. If you DO use HTTP, then isn't the cost essentially at least (if not more than) the cost of an HTTP API? Nevertheless, I'm very interested in the AJAX part. Do keep us informed!

And I will also be glad to offer any help if you guys need it (I've used AniDB quite a lot, so I thought it's only proper I give something back). I mostly concentrate on client-side development, namely C++ and JavaScript web apps (basically using the JavaScript engine in browsers as a platform for developing full apps). I'm also a Greasemonkey (take back the WEB!) junkie :) On the not-so-front end, I have working knowledge of DBMS and data design/optimization, and have also done some system/framework architecture work. Currently, I work in a small development team providing 3D visualization solutions in architectural/managerial building-related domains.
sphere
Posts: 19
Joined: Thu Nov 16, 2006 9:33 am

Post by sphere »

Btw, regarding the client registration part:

Suppose I am interested in creating a secondary abstraction library (maybe something like "AniDB access for C++") which hides all the connection details (so that the client frontend won't actually know whether it is on HTTP/UDP or whether the library just randomly generates the data :lol: ) and allows data to be retrieved in a friendlier format, without bogging the high-level app down with things like parsing the UDP replies or dealing with traffic control. How would this fit into the registration part? Or would every client that uses this secondary API have to register itself?
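To make the idea a bit more concrete, here is a rough sketch of the kind of abstraction layer I mean (in Python for brevity, even though I'd probably do it in C++; the class and method names are made up for illustration):

Code:

# Rough sketch of a transport-hiding abstraction layer. Names are invented;
# a real library would wire the UDP backend to the actual AniDB UDP API.
from abc import ABC, abstractmethod

class AniDBBackend(ABC):
    """Hides the transport; the frontend never sees HTTP/UDP details."""
    @abstractmethod
    def get_anime(self, anime_id):
        """Return anime info as a plain dict."""

class UdpBackend(AniDBBackend):
    def get_anime(self, anime_id):
        # would issue the relevant UDP API commands and parse the replies here
        raise NotImplementedError

class DummyBackend(AniDBBackend):
    def get_anime(self, anime_id):
        # "randomly generated" data, handy for testing the frontend offline
        return {"animeid": anime_id, "title": {"full": "placeholder"}}

def make_client(offline=False):
    return DummyBackend() if offline else UdpBackend()

The frontend would only ever call make_client() and get_anime(), so it never knows (or cares) what transport sits underneath.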

thanks in advance.
epoximator
AniDB Staff
Posts: 379
Joined: Sun Nov 07, 2004 11:05 am

Post by epoximator »

I have nothing against adding JSON-formatted replies to some of the UDP API commands, so feel free to define it in the wiki: http://wiki.anidb.info/w/UDP_API_DEV

As long as the library is used unmodified, I would say only one registration is OK (although several registrations wouldn't hurt either). But this isn't really something to worry about at this point.
Rar
AniDB Staff
Posts: 1471
Joined: Fri Mar 12, 2004 2:41 pm
Location: UK

Post by Rar »

What exp said, but with added disdain for your choice of topic, programming language, serialisation format, and aspirations.
Five reasons to ignore the made-up acronym of the week: XPath, transforms, a well-defined encoding, no eval temptations, and gzip laughs at the bloat complaint.

Rar
exp
Site Admin
Posts: 2438
Joined: Tue Oct 01, 2002 9:42 pm
Location: Nowhere

Post by exp »

sphere wrote:But the AJAX sounds like good news, though how you can implement it without HTTP is beyond my understanding. If you DO use HTTP, then isn't the cost essentially at least (if not more than) the cost of an HTTP API? Nevertheless, I'm very interested in the AJAX part. Do keep us informed!
Well, it would obviously be based on an HTTP XML API; however, that API will deliver only the types of data required for the AJAX scripts. The API is not meant to be used by any 3rd party programs/scripts.

The key issue here is access frequency and request count. A human in front of a PC will typically request _MUCH_ less information than automated scripts hacked together for some purpose or another.
I.e. if someone were to write something like AoM based on an HTTP API, it would kill the server if more than a handful of people used it.

We just want to get the most out of the available resources, and atm that means it's a better choice to put some more burden on the client programmers in the form of a not-so-easy-to-access UDP API if that means we can spare ourselves all the HTTP/TCP overhead.

BYe!
EXP
PetriW
AniDB Staff
Posts: 1522
Joined: Sat May 24, 2003 2:34 pm

Post by PetriW »

sphere wrote:But the AJAX sounds like good news, though how you can implement it without HTTP is beyond my understanding. If you DO use HTTP, then isn't the cost essentially at least (if not more than) the cost of an HTTP API? Nevertheless, I'm very interested in the AJAX part. Do keep us informed!
Actually, AJAX has the potential to allow for much less server-side work. Caching + less HTML to generate = better performance.
sphere
Posts: 19
Joined: Thu Nov 16, 2006 9:33 am

Post by sphere »

Rar wrote:What exp said, but with added disdain for your choice of topic, programming language, serialisation format, and aspirations.
Five reasons to ignore the made-up acronym of the week: XPath, transforms, a well-defined encoding, no eval temptations, and gzip laughs at the bloat complaint.

Rar
Just to clarify things: I AM an XML supporter (over JSON). I mentioned JSON as an alternative to the raw delimited data that the UDP API seems to use now, which is sure to wreak havoc even if there is just a small change (e.g. a change in field order or fields being added/removed). I was just afraid that mentioning XML (with its significant bloat, since the UDP API works in TEXT form) would sound too intimidating :)

One thing you missed out: XML scales better than JSON (e.g. when parsing), which can be significant when you are talking about large data sizes.
sphere
Posts: 19
Joined: Thu Nov 16, 2006 9:33 am

Another take at it

Post by sphere »

This post is mostly meant for epox or any other administrator.

I think this thread has been long dead, but since I picked up Python recently and was amazed at how easy it is to develop small apps with such scripting/JIT-compiled languages, I thought I'd revisit the topic... but with a slight change in focus, after trying out the UDP API (frankly speaking, I find it less informative than the actual HTML, probably due to the limit on how much data can be exchanged).

1) Instead of requesting a special HTTP API, which would cost the server side something since it is totally different from the existing HTML pages in terms of both serving and dev/maintenance, I am now looking at a less robust but potentially more server-friendly approach: parsing the present HTML pages instead. Now, I know that it is normally not good to tax normal HTML servers this way, but I have a few assumptions to state, which hopefully can be clarified by epox or other people who know the situation.

a) HTML can get cached. Unlike UDP, where every single client hits the server directly, normal (largely static) HTML gets cached at various levels: local cache, ISP cache, ISP exchange cache, secondary proxy cache, etc. That means it probably has a higher chance of NOT requesting any data from the server at all (web experts, correct me if I'm wrong on this). Also, from studying the usage pattern (my own, that is), the frequency of use isn't really any higher than looking up AniDB manually.

b) HTML can be compressed easily (natively supported by many browsers), again unlike UDP, which is already fragile (error-prone) without compression and would require a compression scheme with very high redundancy to make it robust, which in turn makes it too hard to decode, since most UDP processors deal with simple text.

c) HTML is close to XML (or is XML itself, if it is XHTML). That means that as long as certain identifying rules stay fixed, e.g. certain data is always identified by a certain parent, class attribute, or keyword, the rest of the HTML can be rearranged without affecting the extraction (it could be costly, but it's on the client side). XPath would be a great tool in this case (a small sketch of what I mean follows below). On the other hand, UDP replies are (and must be) fixed, as fields are referenced by position using a schema which is NOT sent with the data (the client has to know it beforehand). Any change to the reply formats will almost certainly kill all the older clients that assumed a fixed way of parsing the data.
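A minimal sketch of the class-attribute-based extraction I mean, using the lxml package (the HTML snippet and class names here are invented for illustration; AniDB's actual markup would differ):

Code:

# Extracting data by class attribute with XPath; the surrounding layout can
# change without breaking this, as long as the class names stay stable.
from lxml import html

page = html.fromstring("""
<table>
  <tr><td class="title">Hagane no Renkinjutsushi</td></tr>
  <tr><td class="eps">51</td></tr>
</table>
""")

title = page.xpath('//td[@class="title"]/text()')[0]
episodes = int(page.xpath('//td[@class="eps"]/text()')[0])
print(title, episodes)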

ah... I think I got ahead of myself by not explaining what I am trying to do...

(inserted out of order)

0)

Before I revisited this, I was actually writing a simple Python script that checks several URLs for any new links (ed2k links to anime). (Nope, not from AniDB, otherwise I wouldn't have to crack my head over this.) It works by loading a simple list of URLs to check; when executed, it goes through them one by one, downloads the source for each URL, and parses it locally for ed2k links. It then checks another file, oldlinks.txt, stored locally, to drop any links that are already in it, and saves the remaining new links to a simple text file. I run this every other day to check for new episodes of some series; I just have to look at newlinks.txt to see if there is anything. Just for info: after I reject or start downloading certain links, I cut and paste them manually into oldlinks.txt so that they won't appear in newlinks.txt next time.

This is still primitive with lots of room to improve, but the simplified way of storing things could allow a nicer and more featureful GUI to be built over it in the future. The main reason I use simple text files is that copying and pasting the raw links to and from the clipboard lets them be pasted directly into the eMule client (whereas if I had used a binary database or XML, I couldn't do that without processing it). (A rough sketch of the script's core loop is below.)
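Code:

# Rough sketch of the link harvester's core loop. The file names follow the
# description above; the regex is only an approximation of an ed2k link.
import re
import urllib.request

ED2K_RE = re.compile(r"ed2k://\|file\|[^\s<>]+\|/")

def load_lines(path):
    try:
        with open(path, encoding="utf-8") as f:
            return set(line.strip() for line in f if line.strip())
    except FileNotFoundError:
        return set()

urls = load_lines("urls.txt")        # pages to scan
old = load_lines("oldlinks.txt")     # links already seen/handled

new = set()
for url in urls:
    page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    new.update(link for link in ED2K_RE.findall(page) if link not in old)

with open("newlinks.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(new)))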

Now comes the problem: given just a URL, there is no easy way to tell what anime it refers to (forum-thread-like URL naming, so it's mostly numbers). So when I add a new URL, I want to tag it with extra info such as which anime it is, some "vital statistics", and a link to some "authority" for more info. The thing is that AniDB is my number one (in fact, only) source of anime information, so what I had to do was go to AniDB manually every time I added a new URL to my program, then manually and laboriously copy over the info one field at a time.

Normally this doesn't happen too frequently, so when I have time it is OK (but still a boring job). But as fans know, anime comes out quarterly... meaning that at the start of a quarter there will potentially be tons and tons of new anime (I scream with an ironic mix of joy and frustration that only a fan will know!!!). But it also means that setting everything up takes quite a bit of time and effort.

What I was hoping to do is to have an easier way to get some anime info from the "authority" (in this case AniDB) and compose some sort of (probably HTML) description, which would be used to identify the URLs/links I deal with, carry the basic info, and have hyperlinks back to the detailed info at the "authority", so that when I click on it I get back to the anime's page in the browser.

The long-term and too-ambitious goal is to have ONE abstract anime schema that rules them all (abstract because it is not "owned" by any single anime database source). It's like what COLLADA is among 3D formats: basically a universal anime schema that can be partially filled with data from various sources (since a single data source may not be able to fill in everything). Frankly speaking, since I am happy with AniDB's schema, it will probably turn out to be close to, if not exactly, what AniDB is using... so it may end up being a "universal adaptor" for AniDB's database instead :)

After AniDB, I could maybe work on variants of my scripts for other anime sites that return data in a compatible form (though really, I was hoping that after I release the scripts that define the run-time representations of the schema, somebody else could do that, since I only really need AniDB :p But come to think of it, it could be interesting to extract image links from sources and compose an image gallery to be displayed with the anime info...). There could then be a higher-level script that takes in all this data and plants it into a single data template, with links back to the original sources, plus potentially doing data tidying (e.g. different sites may use different codes/strings for similar things, which get mapped to a single abstract data field... sort of like what Unicode is to all the locale-specific code pages). It could also compile interesting data, e.g. show what rating each site gives a single anime, and even do conflict detection, e.g. when sites disagree on the number of episodes etc.... maybe that could motivate interested users to find out the truth (from official websites; we can't update the databases from each other since they are all secondary info) and inform the respective sites of the necessary updates.

Another, lower-priority and perhaps controversial, feature is to integrate with the ed2k link harvester I was using, such that it becomes possible to, say, load the links from some forum pages or release sites, compare them with AniDB's links, and maybe, through some API, add the new links to AniDB??? Unless some maniac does this in a massive, uncontrolled fashion, the way I see it is that since I am using the links myself, I can sort of validate them (false links/wrong links). This feature would just give me an easy way to add to AniDB's data in a by-the-way fashion.

(end of background story)

2) Now coming back to the point: what I want to find out is how acceptable this is. Frankly speaking, I am doing this only partly because it is helpful to me as an AniDB member; in itself, I love to script and "improve" the functionality of the things I like to use (e.g. AniDB). I'm doing this entirely in Python, using some common extensions, all of which can be downloaded for free from various sources. Everything being scripts also means it is all available as source and can be freely modified by anybody who gets their hands on it.

What I am hoping to do is create some standard scripts that act as mini libraries. E.g. currently I am working on one that extracts information such as the main/official/synonym/short titles, ratings, producers, and year (the standard stuff we find on the main anime info page) and packs it into an easy-to-consume Python map-like object (a dictionary), which can then be further processed by other tools or saved to disk in some format (a small sketch of the shape I have in mind is below).

I'm willing to release all my work to fellow AniDB members once it is ready (I could just paste the whole script file in a forum post so that anybody can cut and paste it into their own text file and run it locally), but I will only continue development if an admin says it's OK. I use AniDB a lot. I like AniDB a lot. And I really appreciate what epox and the team are doing. If you guys say no (maybe due to technical/resource reasons), I'll stop, period. Of course, if on the other hand you are supportive and can help answer my queries and give suggestions from your expertise, I'll greatly appreciate it, and should the scripts eventually become public, I think so will the rest of the AniDB community :)
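To make the "map-like object" part concrete, this is roughly the shape I have in mind; the field names mirror the JSON example earlier in the thread and are my own choice, not any official AniDB format:

Code:

# Illustrative only: the dict layout and file name are my own invention.
import json

anime = {
    "animeid": 979,
    "title": {
        "full": "Hagane no Renkinjutsushi",
        "english": "Fullmetal Alchemist",
        "short": ["FMA", "HagaRen"],
    },
    "type": "TV Series",
    "episodes": 51,
    "year": "04.10.2003-02.10.2004",
}

# Saved to disk so other tools (or a future GUI) can pick it up later.
with open("anime_979.json", "w", encoding="utf-8") as f:
    json.dump(anime, f, ensure_ascii=False, indent=2)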

regards
sphere
epoximator
AniDB Staff
Posts: 379
Joined: Sun Nov 07, 2004 11:05 am

Post by epoximator »

You write way too much; no one is going to read all that. Highlight your points at least.

Some general points:
1. You can parse the site all you want, but you might get automatically banned. Furthermore, your client will never be sanctioned by us.
2. The technical aspects of UDP are not interesting...

I think you've got this all wrong. If you really are interested in our content, you should concentrate on our TCP API.

see
http://wiki.anidb.info/w/Future_of_AniDB#TCP_API
exp
Site Admin
Posts: 2438
Joined: Tue Oct 01, 2002 9:42 pm
Location: Nowhere

Post by exp »

Let me pick up some of your points.

1) HTTP Caching
As all AniDB pages are customised (if you're logged in), none of them will be cached by any HTTP proxy. All requests end up at the main webserver. The webserver itself does some internal caching of parts of the data in order to offload the dbserver, but there is little caching going on which would reduce the load on the webserver itself, and no caching which would reduce bandwidth usage.
Sure, some caching would be possible if you're accessing AniDB as a guest. But then you don't have access to a lot of the data, especially the ed2k data, which seems to be very important for you.
(Furthermore, our concrete setup probably prevents caching by HTTP caches even if you're accessing AniDB as a guest.)

2) Screen scraping in general.
This is always a bad idea. If you really have to do it, it only shows that one of the other APIs has shortcomings which should be addressed.
We do not support any kind of automatic processing of AniDB's webpages. There are a couple of "anti-leeching" measures in effect, and you could easily end up getting yourself banned if you're not careful.
Some of the critical issues:
a) By parsing HTML pages you generally request much more data than you actually need, which increases server load compared to a mechanism which requests exactly the parts you want.
b) There is no good way to detect automated parsing and to disable old broken scripts centrally (which is possible for UDP/TCP clients). This means that once a script has spread, it is next to impossible to ensure that everyone updates to the latest version once a critical bug is fixed.
c) The HTML pages are under constant change. By relying on parsing them, your client will break regularly and unpredictably whenever we make layout changes, which happens quite often and unannounced.

3) Your concrete problem
If I've understood you correctly, what you basically want to do is take an ed2k link and identify the file it represents, then get enough data about that file to be able to give it a sensible filename.
This is _exactly_ what the UDP API is for. I really don't see why you'd need anything else for this.

4) Python library
Take a look at the UDP Client page on the wiki. IIRC some people have already written Python scripts. You might be able to extend or reuse some of the existing code.

BYe!
EXP
sphere
Posts: 19
Joined: Thu Nov 16, 2006 9:33 am

Post by sphere »

Thanks for the reply and clarification. I hadn't seen the TCP API before, and it could prove to be what I need; I will take a look. The whole reason for returning to HTTP was the info on the HTML pages that is not available over UDP (yet). But if HTTP caching cannot be exploited, then I agree that this is not the way.

I'll probably give up given all the negative feedback, but just to clarify a few things to convince myself completely:

1) On banning of clients (for aggressive access)
I thought banning was based on frequency of access?? The reason bots generally get banned is not that they are bots, but usually the heavy traffic they introduce, yes/no? The way I use this Python script is to first download the page (ONCE) and save it to the local filesystem (I don't want to download it every time I need it, especially while I was making lots of mistakes trying to parse it). If my access rate is the same as, if not lower than, a typical surfer's (I always look in the local file cache for the specified files first, as sketched below), does that still warrant a ban? I won't continue to develop it as a public kind of API, but I was hoping it could at least be used as a tool to help me automate some tasks.
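For reference, the "local cache first" part looks roughly like this (a sketch; the cache directory name is arbitrary):

Code:

# Check a local page cache before touching the server at all.
import os
import urllib.request

CACHE_DIR = "page_cache"

def fetch_cached(url):
    os.makedirs(CACHE_DIR, exist_ok=True)
    # crude cache key: replace unsafe characters in the URL
    path = os.path.join(CACHE_DIR, "".join(c if c.isalnum() else "_" for c in url))
    if os.path.exists(path):                      # served locally, no server hit
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    with open(path, "w", encoding="utf-8") as f:  # save for next time
        f.write(text)
    return text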

2) Concerning the ed2k links, I think exp misunderstood me. The links I get are usually not on AniDB, so there is no easy way to check for md4 hash matches to identify which ed2k link belongs to which anime; I usually do the association manually. I only want to do a one-time info lookup for each new series (which happens once per new anime series... meaning the maximum number of times I'll use it is capped by the total number of new anime coming out, assuming I can even watch them all :) )

3) All in all, I think I'll try looking at the TCP API that epox mentioned to see if it covers everything that is currently available over HTTP. Or if the UDP API is ever developed to the point that it can present everything that HTTP can (but selectively, chosen by the user), I may want to go back to it too :)

thanks and regards
epoximator
AniDB Staff
Posts: 379
Joined: Sun Nov 07, 2004 11:05 am

Post by epoximator »

1) If it's not aggressive, then it won't be banned, no. It's still nothing we want to support, though.

2a) md4? I think you mean the ed2k hash. So, what's the problem? You check with FILE whether the file exists or not; if not, then manually associate it in your local db or whatever. The anime page doesn't help you more than FILE... (and it doesn't offer all files at once anyway)

2b) Once per anime? Are you not aware that the content changes quite often? New files are registered all the time.

3) No, sorry, you can't look at it; it's not public.

I don't think I understand exactly what you want, though. Please be more specific.

And if you want improvements to the current web interface, why not just request them? (http://tracker.anidb.net)
sphere
Posts: 19
Joined: Thu Nov 16, 2006 9:33 am

Post by sphere »

epoximator wrote:1) If it's not aggressive, then it won't be banned, no. It's still nothing we want to support, though.

2a) md4? I think you mean the ed2k hash. So, what's the problem? You check with FILE whether the file exists or not; if not, then manually associate it in your local db or whatever. The anime page doesn't help you more than FILE... (and it doesn't offer all files at once anyway)

2b) Once per anime? Are you not aware that the content changes quite often? New files are registered all the time.

3) No, sorry, you can't look at it; it's not public.

I don't think I understand exactly what you want, though. Please be more specific.

And if you want improvements to the current web interface, why not just request them? (http://tracker.anidb.net)
Hi, thanks for your time, and allow me to explain further just to clarify things. (Don't worry, I'm giving up already.)

1) The whole idea behind HTTP was that I thought it could help through caching of mostly static files plus compression, with no need for any developer to build another HTTP API or the like (having HTTP queries that return XML replies would be nice though... lol... just dreaming here). This was mainly due to some info being missing from the UDP interface (according to what I've read up, but I could be wrong; e.g. image URL or synopsis?), and the fact that everything HTTP currently has on one page (thus one exchange), UDP needs a whole lot of requests for. But once it is clear that HTTP is not helped by off-server caches, I've mostly given up on this fragile (as mentioned, the website formatting may change too drastically) API idea.

2a) The ed2k links I get are not from AniDB, and quite possibly a lot of them are not in AniDB at all. Thus, I don't think I can, nor had I thought of, using the ed2k link to automatically find out which anime it belongs to and request the info. It's more like I do a word-in-title search on AniDB to find the page (or at least the aid), then manually copy over the important anime fields, which are largely the static ones like title, episodes, producers and so on. Content updates regarding new files (ed2k links) being added are not really in my scope atm (simply because I get my sources from elsewhere), so there is really no need to refresh just to get new files (for that I would prefer something small like the UDP API, which would work much better).

Previously, I was thinking of extending a script to automatically upload those links I got which are not in AniDB (but from later reading, it seems that only advanced users can do that?? So nothing lost here).

@exp, from the previous post concerning Python:
I've searched around and looked at a few Python implementations of the UDP API; unfortunately, most are inactive. I tried to take one and develop it further (pyanidb, but it seems incomplete):

It has the basic UDP send and receive, but only a too-general execute, which is troublesome to use. I've made some slight improvements, e.g. automatically adding the session tag when it is needed (instead of having to specify it manually for every single request), plus wrapping higher-level functions around the basic one. I was also working on automatic flood control, such that the UDP connection will automatically wait (if blocking) until it is OK to send the new request (or return failure if non-blocking). Something along the lines of the sketch below.
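Roughly what I mean by the flood control and automatic session tagging (a bare-bones sketch; the key=value command style follows the general format described in the UDP API docs, but the delay value and other details here are my own assumptions and would need checking against the spec):

Code:

# Bare-bones flood-controlled UDP sender with automatic session tagging.
import socket
import time

class AniDBConnection:
    def __init__(self, host, port, min_delay=2.0):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.settimeout(20)
        self.min_delay = min_delay   # seconds between packets (flood control)
        self.last_sent = 0.0
        self.session = None          # set after a successful login

    def execute(self, command, params):
        if self.session:             # add the session tag automatically
            params = dict(params, s=self.session)
        wait = self.min_delay - (time.time() - self.last_sent)
        if wait > 0:
            time.sleep(wait)         # blocking flood control
        payload = command + " " + "&".join("%s=%s" % kv for kv in params.items())
        self.sock.sendto(payload.encode("utf-8"), self.addr)
        self.last_sent = time.time()
        return self.sock.recv(4096).decode("utf-8")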

I was just trying things out at that point, so I hadn't tried to get in touch with the original authors (if they are still working on it?). It was when I compared it with HTTP that I realised some data doesn't seem to be available over UDP (going by the API wiki), which is why I revisited the HTTP idea. But now that that idea is toast, I may revisit the UDP API until perhaps the TCP API is available.

Btw, I'm not really working on an end-user client, but rather an intermediate client-side API to facilitate development of other clients :) So the local client only has to deal with the local API and not the actual network stuff, and when the AniDB API gets upgraded, the client developer just has to upgrade the client-side API without changing the rest of the app. Or so I wish :)


@all AniDB staff:
Anyway, despite being a little disappointed at how things turned out, I must still say again that you guys have done a great job (I don't think you will ever get thanked enough for the work you put in). And thanks for taking the time to reply to people like me, who seem to exist only to take valuable time off your hands that could have been spent making AniDB better... LOL
Locked