For a while now, I've spent a (very) small amount of my time helping to make teams for hat tournaments in Israel. Most recently it was for the latest Ligat Cova (Israel's monthly hat tournament, put on by an awesome bunch of volunteers) and the annual Succot hat tournament, but I've done it for several Ligot Cova in the past.
The first couple of times I did it (I think the first was 2009's Holylanders Hat), I (or we - in that case, Tomer and I) did it manually: take all the players we have, derive from their self-ratings and what we know about them a rating we're more likely to trust, and start throwing people into teams. Several hours later (no exaggeration, as anyone else who's done it before can attest), we were finished with what was largely a do-it-by-feel process. Being somewhat of a nerd, I'm not a huge fan of making shit up as I go - I'd like to work in a somewhat organized fashion, with a process that's at least somewhat mathematically justified.
This is going to be a two-part post - in this first part, I'll tackle the issue of rating players - and why the ratings we receive from players are, in my mind, essentially useless. In the next part, I'll write a bit about the work I (with help from several guys in the community) have been doing to make this process smarter, easier, and less time-consuming.
Claim:
The data we receive from the players about themselves (we ask them to rate their experience, disc skills, and speed) correlate only weakly with our own ratings of them, making their self-ratings effectively useless for building balanced teams. Also, I have graphs on my side, so I must be right.
Data:
I worked with the data from the latest Ligat Cova - 51 players who signed up. A few more signed up but obviously disregarded the ranking area in the signup sheet (veteran players ranking themselves as 1's (out of 5) across the board), so I left them out. More data is available - the Succot tournament, and many past Ligot Cova - but I'm not sure it's necessary.
Method:
I plotted our rating scores (the tournament organizers' ratings - the closest thing we have to a true rating, since it's all subjective in the end) against the players' self-rated numbers. The first attempt was plotting against the sum of all three self-rated parameters (again, we're talking about experience, disc skills, and speed), and this came out:
Even without the numbers, it's easy to see the plot is all over the place; the numbers only support the claim. If you've never taken statistics before, try the Wikipedia article on the coefficient of determination (R^2). Math majors, feel free to correct me: in simple terms, R^2 tells us how much of the variation in the Y (dependent) variable can be explained by the X (independent) variable. In our case, R^2 = 0.2366 means that only 23.66% of the variation in the Y variable (our ratings) can be explained by variation in the X variable (the players' self-ratings). It also means our correlation coefficient, R, is the square root of that: R = 0.4864. If you want to show a strong correlation, you're usually looking for 0.9 or more. Essentially, there is no strong correlation between the players' ratings of themselves and our ratings of them. To put it another way:
If you've never studied statistics and this sounds like Greek (and if you have, and want to explain it better, I'd be happy): R^2 helps us understand how much of the change in the variable we'd like to predict (in this case, our rating) can be explained through the variable we'd like to do the predicting (in this case, the players' self-rating). The value we got says that only 23.66% of the change in our rating can be explained by the players' rating. It also gives us the correlation, which is the square root of that number - R = 0.4864. When trying to prove something statistically, you usually look for a correlation of 0.9 or above. There is only a very weak correlation between the players' self-ratings and our ratings of them.
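For the curious, here's a minimal sketch of how an R^2 like this can be computed. The arrays below are made-up placeholders, not the actual Ligat Cova data:

```python
import numpy as np
from scipy import stats

# Made-up placeholder data (NOT the real signup numbers): x is each
# player's self-rating sum (experience + disc + speed, so 3-15), and
# y is our organizers' rating (roughly 1-9).
x = np.array([9, 12, 7, 14, 10, 6, 13, 8, 11, 5])
y = np.array([4, 6, 3, 7, 6, 2, 5, 5, 4, 3])

# Fit a straight line y ~ slope*x + intercept; linregress also returns
# Pearson's correlation coefficient r directly.
slope, intercept, r, p_value, std_err = stats.linregress(x, y)
print("R   = %.4f" % r)
print("R^2 = %.4f" % r ** 2)  # share of variance in y explained by x
```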
I then tried another method. In our ratings, we classify the players into Cutters (whose speed is more important), Handlers (whose disc skills are more critical), and Girls (of whom there are few, so we don't split them internally). With that in mind, I tried to create a different predicting statistic (a quick code sketch follows the list):
- For cutters, the sum of their experience and speed
- For handlers, the sum of their experience and disc skills
- For girls, 2/3rds of the sum of all three (to keep the number on the same 2-10 scale)
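Here's what that composite looks like in code - a minimal sketch, with the role names and the function itself hypothetical (each input rating is on the 1-5 scale from the form):

```python
def composite_score(role, experience, disc, speed):
    """Collapse the three self-ratings into one 2-10 number by role."""
    if role == "cutter":
        return experience + speed          # speed matters most for cutters
    if role == "handler":
        return experience + disc           # throws matter most for handlers
    if role == "girl":
        # sum of all three is 3-15; scale by 2/3 to land on the same 2-10 range
        return 2.0 * (experience + disc + speed) / 3.0
    raise ValueError(f"unknown role: {role}")

print(composite_score("cutter", experience=4, disc=2, speed=5))   # 9
print(composite_score("handler", experience=3, disc=4, speed=2))  # 7
print(composite_score("girl", experience=3, disc=3, speed=3))     # 6.0
```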
The results, unfortunately, were not better:
In fact, the correlation is even weaker: R^2 = 0.2141, which gives R = 0.4627.
Conclusion:
If you got here, congratulations, and thank you :) I have three ideas as to why this happens, but I haven't thought much about trying to prove or disprove them:
- People don't really care. For most players, this isn't their first or second time filling out the form. They probably think we don't even look at what they fill in (which I just claimed we shouldn't, so maybe they're right :)). As a result, they don't bother to fill it out seriously, and make up numbers that sound about right. That's even after removing the people who left everything at 1's, who obviously weren't trying. Frankly, my dear, they don't give a damn.
- Most players' standards are different from ours. Two quick examples, one in each direction. Someone who only plays with a few of his buddies every couple of weeks decides to join us. Compared to them, he's the fastest, and he thinks his throws are hot shit. He can even throw a flick, once out of every two or three tries. He would probably overrate himself. Conversely, some guy who was around the middle of his D1 college team doesn't think of himself as a star, but would probably be one of the best players at the tournament. He would probably underrate himself.
- Our rankings aren't as good as we think they are. Quite possible - but they're the best we have.
How do we fix it? I have some ideas, but this is already long enough. I'll throw them in with the second part of this post.
If you've read this far (and even if you just skipped to the conclusion) - I'd love your thoughts. Anything wrong with my math? What do you think we can and should (if at all) do?
-Guy


Hey Guy, how are you?
I have read the whole post and I love it. Although I don't fully agree with the details of some of your claims and conclusions, I still think this brings up a great conversation which is hard to have orally (sitting on the fence after LC) but could be really productive on this platform.
First and foremost - I'm not sure that our rating is so much better than the self-rating. I think so for two reasons.
Firstly, I haven't seen an empirical observation that teams based on our rating actually come out more even than teams based on self-rating.
Secondly, I suspect the error bars we can set on both ratings (self and ours) are about as wide as they get.
Second, and this is a more general (yet technical) point about data analysis - you should be really careful when comparing them. I'm not 100% sure how you moved from the complicated form we have on the registration (with all the different qualities) to a single 1-9 scale. If you don't find the correct transformation (and this may require some elaborate thinking), drawing quantitative conclusions from the resulting fits might be pointless.
P.S
I have already told you this, Guy, but now with that leg brace, you don't only look like but simply are EXACTLY Forrest Gump.
Matan
As for why I don't exactly trust the self-ranking, let's play a quick guessing game over g-chat. I'll give you their rankings, and you tell me who you prefer. We'll see if you build the best team...
I'm not sure either. I definitely haven't proven it empirically. I do believe that after watching them play (or in some of your cases, playing with them), you can tell a lot more about how they compare to other players you know than they can.
I'm not sure what you're trying to say about the error bars.
If I wasn't clear about it: the Y axis is always our ranking - that's what I'm trying to predict, which is why it always comes out around 1-9. The X axis changes:
* In the first graph, the X axis (explanatory) variable is the SUM of their rankings: for each player, experience + disc + speed.
* In the second graph, I tried something slightly more elaborate. For people we defined as handlers, I took experience + disc. For cutters, experience + speed. And for girls, since we don't separate (but I want to stay on the same scale) I took 2/3rds of the sum of all three.
You might be able to find a transformation that better explains it. I'm still not sure it will beat our rankings.
One thing I didn't mention - probably the biggest factor in gauging a player's level is where they played before, which we get, but barely. Maybe we should try harder to get more of that info?
I hope this makes some more sense. What do you think now?
-Forrest (err, Guy)
I will just say that I enjoyed reading this, and that I definitely agree with Matan - You really do look like Forrest Gump.
We can further discuss the error bars and the "correct" transformation (and what the heck that means). But I don't want this to turn into a pointless academic discussion. Instead I'll refer to that last point of yours - namely, if indeed we want to create a reliable self-rating form, what should we ask the players?
Either way, I do suggest that we try several tournaments with our rating instead and see how that works out.
P.S
This very last hat had a lot of artificial design in the groups, using criteria well beyond those applicable to a web form. Yet these criteria proved (during the games, and also in retrospect) to be absolutely essential to the success of LC29 (and to me winning it :)). This makes this whole objective teaming algorithm approach to making the ideal teams seem somewhat premature (yet more comfortable than the manual process).
Interesting - good use of your time underground...
I suggest you improve the reliability of the self-assessment in the form by emphasizing its importance and by providing verbal descriptions of the grades, e.g. experience 1: under 3 months; experience 5: played for the national team; speed 3: can sprint the length of the field twice; disc 5: can throw inside-out & outside-in with both hands (or whatever).
From now on, you need to generate your grades of the players without looking at their self grades, or you'll be biased toward improving the correlation...
Ron
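A sketch of Ron's rubric idea as data - the entries are his own examples, the structure and helper function are hypothetical, and the remaining grade descriptions would need to be filled in by the organizers:

```python
# Anchor each numeric grade to a concrete verbal description, so players
# rate themselves against the same yardstick. Only Ron's examples appear;
# the rest of the scale is left for the organizers to define.
RUBRIC = {
    "experience": {1: "under 3 months", 5: "played for the national team"},
    "speed": {3: "can sprint the length of the field twice"},
    "disc": {5: "can throw inside-out & outside-in with both hands"},
}

def describe(category, grade):
    """Return the verbal anchor for a grade, if one is defined."""
    return RUBRIC.get(category, {}).get(grade, "(no description yet)")

print(describe("experience", 5))  # played for the national team
print(describe("speed", 1))       # (no description yet)
```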
Matan - I touched on both subjects in the second part of this post, which I just finished writing. I'm also not sure that what you're talking about was a lot of artificial design - two or three swaps made it happen.
Ron - interesting idea. Touched on it as well.
Check it out: http://kneeandahalf.blogspot.com/2011/12/ultimate-hat-tournaments-making-better.html