Future of Anime Hint - Anime Referral

Forum for discussing AniDB rules & standards. No small talk!

Moderator: AniDB

egg
Posts: 769
Joined: Tue Nov 11, 2003 7:17 am

Post by egg »

nstgc wrote:I do have to admit that the thought of a map of the anime database sounds pretty cool. How is it currently being done? I know you don't give out server information (not that I could read it if you did), but what about just the math. Thats what I want anyway. I really don't want to put out effort just to find that none of my seggestions are taken.

I glanced at one of the publications and it seemed more like marcketing and how this type of thing can help companies. This doesn't help at all. Which ones did you find useful?
Well actual structure is somewhat irrelevant, I generally have to manipulate things into special tables anyway to facilitate the comparisons/computations. Basically any of the data in the system can be used for comparison.

For the A-to-A system I am working on, the first step is to create a table that lists all of the animes with similar votes and the types of votes. Each record has aid1, aid2, vote1, vote2, type1, and type2. There are 18,000,000 records (based on votes from August 2005) and it takes 30 minutes to run on my machine. Then it goes through this list one by one adjusts the votes with weights (permanent votes have a higher weight than temporaries) and calculates sums on things like vote1^2, vote2^2, vote1*vote2. When it gets to the end of a set of anime it calculates the score (cos of the angle between the vectors / ratio of the vector lengths). This takes about 4.5 hours to run this portion on my machine.

For the U-to-U system I do similar things with creating temporary tables to assemble the data (for complex operations this is actually faster in many cases), but the amount of data to go through is significantly less.

As far as the publications, the earlier ones (in general) talk more about the recommendation system. Like I said I hadn't gone through all of them either and many of them don't really have much to do with what we are talking about. The Amazon article was more useful (the link is in the first post). I found this set of articles because I was looking up a reference from the Amazon article, and then I saw the whole list of articles...

As far as implementing your idea, I cannot say one way or another at this point. In order to do it, I would have to be convinced that the outcome is worth the effort. The U-to-U systems that I have worked on appear to have hit a limitation and something different is needed. You have a suggestion of something that is different, but so far I am not sure if it will lead to good results or not, so I am hesitant to put much work into it. Also, I started on this A-to-A initiative, so assuming exp is willing to do something like that, then I will probably try this first (I know that is what I told you last time too, sorry). I think your idea is worth discussion, but atm I don't have a plan to implement it, on the other hand I have not rejected it either. I don't want to discourage you, but I don't want you to have false expectations either.
exp
Site Admin
Posts: 2438
Joined: Tue Oct 01, 2002 9:42 pm
Location: Nowhere

Post by exp »

about the A2A part, as long as it returns usefull results, I think it's a good idea to add that as an additional feature.
we'll of course have to see how to handle all the computation needed, but as long as we can calculate most of it offline, that should be solvable.

BYe!
EXP

PS: btw. currently we have 3463 animes, 116097 users and 411703 anime votes (+73781 temp votes) in the db
nstgc
Posts: 48
Joined: Sun May 22, 2005 6:45 am
Location: Island Closest to Hell

Post by nstgc »

To egg,

The reason I'm so insistant on my way is that currently I see it as the best way and I want the hint system to work properly. From a mathematical stand point, I know my way is better then the corrilation coeffeciences. I can't figure out which is better, your A to A or my CRS+weights, with the explination you just gave. What I have gathered from your last post is as walls
  • User votes are used to compare the similarity between animes
  • A covariance is involved "sums on things like vote1^2, vote2^2, vote1*vote2"
  • Things are being explained as vectors
  • These vecotrs are being dot producted or your taking their ratio "cos of the angle between the vectors / ratio of the vector lengths"
If it weren't for that last one, I would think you know what you're doing and only you knows. However, the cos of the angle is not the ratio between the lenghts. Trust me, as a double major math/physics I deal with vectors every day. The cosine of the angle is the ratio between the dot product of the vectors and the product of the lengths.

Since you seem as capable of making an equation for me to look at as I am of making code for you to look at I'll just ask specific quetions on parts. (actualy, you're a little more capable of writing a formula since I can't code at all)
  • what are the components of the vectors
  • how exactly are you taking taking these ratios or finding the cosine
  • you are taking a covariance, what part does it play, mathematicly
  • what are types "aid1, aid2, vote1, vote2, type1, and type2"
Those are the only questions I can make given the limited information you gave me. I gave you all the information I know about my system so that you could see that it was better then the current systems used here. I think asking for parts of your new system so that I can figure out if your new system is better or not is minor. I understand you aren't as good at math as I am, so I'm not expecting anything too eleborate. Also, basic information, like that which you prodived in the first post post, is worthless. This is why I ended up posting the link to my pdf file. My laymen text explination was insifecinet.

I don't understand the problem with mine. It certianly is a step up from the current systems so why didn't you use that as a starting point instead of trying to start something new again yourself? It is mathematicly sound and implimentable. Jundging from what Exp said, there should be more then enough information for mine to work. I still haven't looked into how much is needed, but I also just woke up (I slept for 12 hours strait).

Also, my comment about heterogeneous versus homogeneous was not mentioned in your reply.
egg
Posts: 769
Joined: Tue Nov 11, 2003 7:17 am

Post by egg »

I don't have time atm for a complete reply, but I didn't want to leave things hanging too long... I only have time to code and/or participate in these conversations depending on work and family... I am trying to be as open as possible. As you well know it is a lot of work to explain something in detail to the nth degree, what I did yesterday was just to give you an idea of what is happening in the time I had. Most people do not care about the technical details, and from them all I was expecting were things like, "yeah, that's good information," or, "No way, that's junk." So far it has been the second and based on that it has allowed me to tweak how things work. From people like you, who have a technical interest, I am interested in what you have to say about the stuff, and I will try to provide information to accomodate that. In your particular case you had a proposal, and I am interested in hearing more about it, but that does not necessarily mean that in the end that I will agree with you that it should be implemented. I would like to understand more about what you have, but understand that it will take time.

I have been very careful in choosing my words for clarity, and you have pointed out a number of cases where you did not understand what I said. Unfortunately I have the same issue with your writing, sometimes I do not understand everything you say (especially since I am usually half asleep before I have time to sit down and go through things).

As long as you are willing to participate I appreciate your input. If you feel that I am misleading you or are intentionally trying to keep you in the dark, that is not the case. Also I have my own constraints and frequently there will be a while with nothing happening and sometimes brief periods of a lot of activity.

As far as puting in more details on the A2A system, most likely I will not be able to post anything coherent until Saturday.
nstgc
Posts: 48
Joined: Sun May 22, 2005 6:45 am
Location: Island Closest to Hell

Post by nstgc »

Ok, I didn't really realize that you only expected imput in this thread.

Clarification on that post that I made this time 23 hours ago:
My CRS is not an equal U2U and A2A. It is a U2U, that happens to sum over the anime aspect. I still believe that the CRS combined with another weighting system is best.

Since the CRS is extremely simple it took less then five minutes to create the reverse. However, just like Relitivity (which I have a test on today at 11 est), its simple in its form, but can be somewhat complex in its function. I will need several hours interpreting the reverse. As it is, I doubt it will be useful without modification, which would take some more time.

[edit] Don't worry about providing too much explination. As I said in another post, I don't understand words very easily. If you could just put down the mathematical statements whith what they do that will be enough. By "what they do" I mean make sure I can fit one statement together with another statement. By statement I mean something like a+b=c. I may have crappy language skill, but I am very good understanding math.

Also, I do, as you said, understand how much trouble it is to write a detailed explination of something like this. I also understand that time is an issue to those who have lives (like you). I was mostly upset because I thought that you expected everyone to give imput that requires deep understanding without providing any way of getting that "deep understanding". I am relieved that this is not the case. I also get this warm fuzzy feeling thinking I'm special...or the gas heater could be spewing carbon monoxide again.
egg
Posts: 769
Joined: Tue Nov 11, 2003 7:17 am

Post by egg »

This is meant to be a somewhat technical document, which is really a narrative of how I ended up where I am today with the technical details filled in. I tried to just document the equations, but those needed so much explanation of why they were that way, that I decided a narrative was needed.

Well when I first started thinking about implementing an Anime-To-Anime (A2A) system, what inspired me to do this was a document about how the recommendation system was implemented at Amazon. In the document it said:
It can measure the similarity of two customers, A and B, in various ways; a common method is to measure the cosine of the angle between the two vectors:

similarity(A,B) = cos(A,B) = A(dot)B/(|A|*|B|)
Other documents also mentioned that frequently the cosine or Pearson’s was used in these cases. I had already implemented Pearson’s on the User-To-User (U2U) anime hint, and knew that it had limitations that would be a disadvantage, and the cosine was an elegant solution (frequently fairly simple, elegant solutions turn out to be the correct one).

The first thing was to determine the vectors, which was relatively straightforward. If you can think of the votes being in a spreadsheet:

Code: Select all

Title  User1 User2 User3 User4 User5
------ ----- ----- ----- ----- -----
Amine1    10     5     8     3     9
Anime2    10     7     6     7     9
Anime3     9     2     8     5     7
Each row can be treated as a vector in n-Dimensional space (in this case 5 dimensions).

It turns out that the cosine can be calculated rather efficiently.

A(dot)B (the dot product of the vectors A and B) is the sum of the products, so for Anime1 and Anime3 it is: 10*10 + 5*7 + 8*6 + 3*7 + 9*9 = 285

|A| (the norm of the vector A) is the square root of the sum of the squares, so for Anime1 it is: sqrt(10*10 + 5*5 + 8*8 + 3*3 + 9*9) = 16.7032931

|B| (the norm of the vector B) is the square root of the sum of the squares, so for Anime2 it is: sqrt(10*10 + 7*7 + 6*6 + 7*7 + 9*9) = 17.7482393

So, the cos(A,B) = 285/(16.7032931*17.7482393) = 0.961363175

This maps rather nicely into code, what I did was to make a table with all of the combinations on where a user had voted on two different anime, so I had aid1, aid2, vote1, and vote2 (for a particular user, which user it was does not matter, so that information was not kept). There were 12 million records in this table (it only included permanent votes). This table is sorted by aid1 and aid2, so when I go through read through it line by line I was able to go anime by anime. So what I did was I kept a running total of the products (for the dot product) and the squares (for the norms) and when I got to the end of a set of anime I calculated the square roots and the cosine value. I also dropped any combinations of anime that did not have at least 10 votes in common (if there are not that many votes, the data is more unreliable).

From this information I could create a list of anime where people had voted similarly (based on the score). Although this worked, I found that everything had a score of about .9xxxxxx. I realized that is because in any given dimension, there was always a positive movement (1 to 10), so the resulting vectors were actually fairly similar. So I decided to subtract 5 from the votes BEFORE I calculated the product and squares, so the votes are essentially from -4 to 5 now instead of 1 to 10.

Another problem I realized at this time is that the cosine only determined if the vectors were pointing in the same direction, not if the votes they represented similar scores. If you think in 2 dimensions, a vector starting at 0,0 and going through 1,1 is going the same direction as a vector starting at 0,0 and going through 10,10. In our case though the votes 1 and 10 are the extremes, so what I decided to do was to divide the resulting cosine score by the ratio of the lengths (norm) of the vectors.

At first I was going to divide by the |A| and multiply by |B|. The thinking was that if people voted higher for B, then it would have a higher norm and the value would get bumped up. But after running a few scenarios it appeared to generally list the top anime in the system (since those are the ones with the highest votes). So what I decided was the goal was to find the anime that were the CLOSEST to having the same votes as this one. So what I decided to do was to multiply by the lower value of |A| and |B| and divide by the higher value. The end result is if |A| and |B| are the same, then the value stays the same, otherwise, if they are different, then the score is penalized, the larger the difference (regardless of which one is greater), the larger the penalty.

This also turned out to be handy for my equations because that meant:
if(|A| > |B|) then:
score = (A(dot)B/|A|*|B|)*(|B|/|A|) = (A(dot)B*|B|)/(|A|*|B|*|A|) = A(dot)B/(|A|*|A|)

Similarly if(|A| < |B|) then:
Score = A(dot)B/(|B|*|B|)

This meant that I do not have to calculate the expensive (in CPU time) square roots (since I am essentially squaring the square roots).

So that left me with the following:
  • Subtract 5 from the votes.
  • Keep a running tally of the product and squares
  • If there are less than 10 values for these two anime, throw out the score.
  • When I get to the end of the sequence calculate a score by:
    if |A| > |B| then score = ab_product_sum/a_squared_sum else score = ab_product_sum/b_squared_sum
  • Multiply the score by 1000 and take the integer value
These were the first results that I made available on the forum (look at the post for FMA above).

Things have progressed since then, but I don’t have any more time right now. Also I want to see if this information is useful before I take the time to do more.
nstgc
Posts: 48
Joined: Sun May 22, 2005 6:45 am
Location: Island Closest to Hell

Post by nstgc »

Thank you very much, I wasn't expecting anything at all untill Saterday. I have questions and comments now though.
egg wrote:At first I was going to divide by the |A| and multiply by |B|. The thinking was that if people voted higher for B, then it would have a higher norm and the value would get bumped up. But after running a few scenarios it appeared to generally list the top anime in the system (since those are the ones with the highest votes). So what I decided was the goal was to find the anime that were the CLOSEST to having the same votes as this one. So what I decided to do was to multiply by the lower value of |A| and |B| and divide by the higher value. The end result is if |A| and |B| are the same, then the value stays the same, otherwise, if they are different, then the score is penalized, the larger the difference (regardless of which one is greater), the larger the penalty.
wakarimasen...(I don't get it). I don't fallow. I'll need it in terms of variables and operators. Of cource I don't expect it now.
egg wrote:This also turned out to be handy for my equations because that meant:
if(|A| > |B|) then:
score = (A(dot)B/|A|*|B|)*(|B|/|A|) = (A(dot)B*|B|)/(|A|*|B|*|A|) = A(dot)B/(|A|*|A|)

Similarly if(|A| < |B|) then:
Score = A(dot)B/(|B|*|B|)
While a bit different, that reminds me the projection of vecotor A onto B. The only difference is that you aren't multiplying by the vector B.
egg wrote:if |A| > |B| then score = ab_product_sum/a_squared_sum else score = ab_product_sum/b_squared_sum
I need some time to translate this one. I assume its just everything mentioned before it but put together?
egg wrote:Things have progressed since then, but I don’t have any more time right now. Also I want to see if this information is useful before I take the time to do more.
One thing you'll never need to worry about is not knowing if I find something useful. I do understand that hanging feeling, especialy when it involes something you did. I often request replies when PMing or emailing people. Then compound that with wondering if your wasting your valueable and fleeting time; its enough to drive someone mad.

Earlier, after hear that you used vectors, I decided to extrapolate on that. I came up with an N-field of either users or animes (I had aid in my head but it could be anything) as well. Regretable, to my embaressment, I didn't come to any major comclusions. After about 45 minutes, I just an episode of Andrameda (I downded it for a freind and so I figure I may as well watch it). I have a couple of problems, but many points of praise.

Problems:
  • You should subtract by 5.5
  • I don't like how your using an if/then logic for "whose bigger;" it leads to discontinuity and "favortism"
  • You're trying to get rid of the sign, we all know my thoughts on that
The Good:
  • At its heart it is simple and elequant
  • It should work
  • It produces a sign
  • I like the use of vectors
While I like my CRS for becuase its a rating system instead of a "similarity" system, this seems to have more potential. Using vectors is a convient way to display this data. It makes me wonder if I could use matrisies as well. I'll try this first, then the matrix idea later.

Will the end product have a core which handles things and then have a bunch of weights and stuff help? (ignoring my typing and speak, I'm on ambien)

[edit] I don't know at what point in that message I took the ambien (a sleep aid), but everything seems pretty coherant.
sothis
Posts: 28
Joined: Tue Jul 22, 2003 4:15 pm

Post by sothis »

was just having a chat with exp, ive been gone for quite awhile but now am back, woo!

anyways, i do appreciate the desire to not "compete" with my system i have set up... at the moment i'm redesigning and recoding the entire site, including the anirec.

but in any event, exp and i did briefly discuss the algorithm for anidb... i'd be open to the idea of somehow incorporating the data i have with anidb's anime hint, for anidb, at least. in general i'd like to strengthen the link between our sites since they have always been sister sites of sorts. i think we each compliment each other well. anidb for its complciated and accurate algorithm, and anime planet for its 100% user driven recommendations. i think the two give users a very good selection of data to choose from for finding new series :) and i hope with the redesign of anime planet, we can somehow make that bond stronger.
nstgc
Posts: 48
Joined: Sun May 22, 2005 6:45 am
Location: Island Closest to Hell

Post by nstgc »

While I can't speak for egg, and thus I won't, I will comment that the use of user driven data would add another layer to the "human factor". Currently we are already dealing with this enough as it is. Once a proper system is created, there should be no reason to such data anyway.

I will be the first person to say that egg's cesine system is flawed. You could litteraly say "I've done the math" and there are two parts that concern me. I've looked at only one so far, but I will look at teh other later. I'm also waiting for him to get me a better guide to the system and answer the questions that preceeded your post.

now to egg,

I noticed some problems Thursday and was waiting for you to post a reply. I hate double posting.

I looked at the if/then statement. There are no points of discontinuity in the function itself, but in the dervitive there is. This is becuase at secant theta |B| (I can't remember, it was several days ago) the eqaution changes. Thats where it flips. So instead of being cos(q)r its cos(q)r^-1. If you have a graphing calcultor or program punch those two in. They are nothing alike. I was thinking about a hyperbolic function that I use in cases like these (I've run into this situation before actualy) but it won't work in this case. It has a singularity at zero which is no good. I'll work on a replacement in calc tomarrow. While I advise the replacement, I can't definitively show that it caues error, but I'm sure its not good.

The second is that the lengths are actualy the quadradic mean times sqrt(N). This would cuase popular series to score better then those which aren't so popular. I'm also confident that it exaserbates the skew, but I'm not entirely sure. This aspect will be easy to prove and if I'm right, a major revision will be required unless you want to just turn a blind eye to glearing error.
nstgc
Posts: 48
Joined: Sun May 22, 2005 6:45 am
Location: Island Closest to Hell

Post by nstgc »

I hate double posting, but I will in this case. I want this topic to be flagged as new message.

Please read the fallowing pdf file.

[url]http://nst_gc.tripod.com/cosine_hints.pdf[/url]
Rar
AniDB Staff
Posts: 1471
Joined: Fri Mar 12, 2004 2:41 pm
Location: UK
Contact:

Post by Rar »

Putting anything in a pdf is a sure fire way of making me not read it... though I admit, I'd probably give out pdfs rather than write mathml...

As I mentioned this to egg on irc, I'll log it here as well. I don't really have much to suggest, anime hint-wise, as default settings have given me nothing for about a year, and even tweaking them gives me at best 5 or so people, half of whom are mates...

Rar
nstgc
Posts: 48
Joined: Sun May 22, 2005 6:45 am
Location: Island Closest to Hell

Post by nstgc »

Well, I have an html version, but it looks horrible. I have the original openoffice writer document if you would perfer that, but I doubt it. The technical points are more for egg so unless you really care and request the html version I'll just leave it as is.
W1N9Zr0
Posts: 6
Joined: Sat Apr 30, 2005 4:26 am

Post by W1N9Zr0 »

There was a great anime suggestions site a couple years back, don't remember the name or the url, it was closed down.

Every anime was rated by users along 6 axis (action, romance, comedy, sci-fi, a couple more which I can't quite remember) + an overall "I liked it this much" rating. Then the script would calculate the weighted center of your votes, and find anime closest to it. The matches were quite good.

Maybe the categories could be thrown in to improve animehint matches?
nstgc
Posts: 48
Joined: Sun May 22, 2005 6:45 am
Location: Island Closest to Hell

Post by nstgc »

While I very much agree with that suggestion, I think it has already been thrown out. From my stand point it would work buetifuly, but in practice even I, who is a theoretical thinger, can see that implementing something like that would take a long ass time.
Rar
AniDB Staff
Posts: 1471
Joined: Fri Mar 12, 2004 2:41 pm
Location: UK
Contact:

Post by Rar »

Well, more to the point, when categories are shiny, integrating them to the hint (rather than the current genre hack) should actually be quite painless... If you want sci-fi-high results, you could get 'em. I don't think they should be the *basis* for rec suitability though.

Rar
Locked