It's probably about time I write something about the physiotherapy I'm doing. After all, it's part of the namesake of this blog, and probably the one thing that will remain constant in my life over the next six months or so.
I'll start by debunking some myths (or thoughts, or opinions) about physiotherapy. It's not fun. The way I'm working, I do the same set of exercises two or three times a day, for anywhere between three and four hours in total. It's not easy - in fact, it's a swift kick in the ass, showing me how much weaker I've gotten over the past few months. Physiotherapy also doesn't make me feel better immediately afterwards - it's not a massage or a warm-up. Just like you're tired after going on a run, or how your arms might be weak after a workout, my legs are usually sore and in pain after a physio session.
During the first two weeks after surgery, I barely did shit. My physio involved around two hours a day in a CPM (Continuous Passive Motion) machine, a device resembling a medieval torture contraption, designed to move my leg back and forth to make sure nothing sticks and to work passively on range of motion. I also did a few minutes a day of straightening exercises, trying to get closer to being able to straighten my leg completely. Then came the two-week review with the orthopedic surgeon, and the green light to start real physiotherapy.
An aside about my therapist - he's absolutely awesome. Chen deserves the credit for first suggesting I work with him, and my surgeon also recommended him. He's very serious, and not afraid to push and challenge me, which is what I feel I need. I think the most important part of picking a therapist is finding someone you're comfortable working with, and trust to help you make the quickest and safest recovery. For my money, no one is better than Yuval (I won't pimp him here, but if by any chance you're reading this and need physiotherapy, get in touch).
And so, after the review I started meeting with Yuval, and he started opening a can of whoopass on me. We started doing range of motion exercises (both extension and flexion), strength exercises (small squats, lunges, single legged squats, etc.), core work, balance work, and some things that can only be described as pure torture (hip raises). He also started me on bike work very early, which shows me just how out of shape I really am, but you gotta start somewhere...
I'm not really sure why I wrote this - maybe so you know what I've been doing at home all day, or what I'll be doing when I disappear in the middle of the day twice a day. Or maybe it's just for me - physio is going to remain a constant in my life for the next six months or so, so it deserves at least one blog post. And look at the time - time to go do my evening session...
Wednesday, December 14, 2011
Thursday, December 8, 2011
Ultimate - Hat Tournaments - Making better teams, faster
Previously:
In the first part of this post, I looked at how we rate ourselves as Ultimate players. I showed that the way players rate themselves correlates weakly with how an observer rates them, too weakly to be used to achieve accurate ratings.
The post prompted some debate on my Facebook page - the most interesting was Eric's take. In a nutshell, he suggests ditching the numeric system and switching to an archetype-based one - but you should go read the whole thing. I'll wait...
It will almost certainly be easier for most players to rate themselves this way. Another idea, posted in a comment by my father, is to give examples for the numeric ratings. For example, for the experience rating:
- I've been playing for a few months
- I've been playing for over half a year, and play occasionally with friends.
- I play on a club here in Israel, and practice at least once most weeks.
- I played on the national team / top level club in Israel, or played for my college in the states.
- I played club or top level college Ultimate in the states.
That would force people to rate themselves on our scale, not theirs. I picked the easiest example - but you could definitely write similar scales for disc skills, athleticism, and speed as well. UI-wise, it should be radio buttons rather than a text box, but that's a different story...
I love Eric's idea - it's very cool and intuitive. However, there is a big advantage to forced quantification of the data - it allows me to automate a big part of the team generation process.
What I did:
I started by going through the players' ratings, and distilling them into a position and a rating. The position is a handler, a cutter, or a girl (with teams having only two or three girls each, it's not worth splitting them further). The rating is a number between 1 and 9 - where, honestly, a 1 is subtraction by addition, and a 9 is the highest level found in Israel. [I'm not actually sure this is the best method - see note in the end]. We recently also started saving the ratings we end up giving people - so they can be tuned after tournaments, and to help us 'remember' players we've already met.
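As a quick illustration of what a distilled player looks like (the field and variable names here are mine, not necessarily what my actual script uses):

```python
from dataclasses import dataclass

# Illustrative record for a distilled player rating; the field names are
# my own, not necessarily what the actual script uses.
@dataclass
class Player:
    name: str
    position: str  # "handler", "cutter", or "girl"
    rating: int    # 1 (subtraction by addition) .. 9 (top level in Israel)
    age: int
    height: int    # in cm

# A hypothetical roster fragment
roster = [
    Player("Alice", "girl", 6, 27, 168),
    Player("Bob", "handler", 8, 31, 180),
    Player("Carol", "cutter", 4, 22, 175),
]
```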
Using these ratings, I can now start making teams. I have two obvious goals when making teams - split the handlers, cutters, and girls roughly equally between the teams, and keep the total ratings around equal as well. A few other goals include keeping the average ages of teams close (so I don't throw a team of kids against a team of oldies), ditto with average heights, and making sure each team has a "leader" - or, more generally, worrying about chemistry.
My team generation algorithm is pretty simple:
- Go through each pool of players (handlers, cutters, girls):
- Within each pool, select a random player from among those with the highest remaining rating (the randomness is to create some variation in the generated sets of teams)
- Assign this player to the team with the lowest total rating so far. If there is more than one option, again, choose randomly.
I generate a large number of such suggestions, and then give them each a score. My (admittedly very simplistic) scoring algorithm is based on variances:
Suggestion score = 2 x [variance of team total ratings] + [variance of teams' average ages] + [variance of teams' average heights]
The lower the score, the better. I wanted the ratings to weigh most heavily, while still considering the heights and ages. Obviously, this is also inexact (just like people fill in their ratings inaccurately, they may do the same with their heights, and to a lesser extent, their ages).
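In code, the scoring is just a few variances (population variance here; the tuple shape is illustrative, not the actual script's):

```python
def variance(xs):
    """Population variance of a sequence of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def suggestion_score(teams):
    """Score a set of teams as described above; lower is better.
    Each team is a list of (rating, age, height_cm) tuples."""
    totals = [sum(r for r, _, _ in team) for team in teams]
    avg_ages = [sum(a for _, a, _ in team) / len(team) for team in teams]
    avg_heights = [sum(h for _, _, h in team) / len(team) for team in teams]
    return 2 * variance(totals) + variance(avg_ages) + variance(avg_heights)
```

Identical teams score exactly 0, and any imbalance in total ratings counts double compared to the same imbalance in ages or heights.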
Okay, what now:
I take the suggestion of teams with the lowest score, and start working from there. This is the awesome part - instead of spending an hour or two (and sometimes double that) assigning players to teams, I can (almost) immediately jump to fine-tuning. This includes several steps:
- Making the teams have equal (or approximately equal) numbers of players. (If you feel like it, see if you can figure out why it wouldn't always happen with the algorithm I use)
- Verifying that the teams with fewer or weaker handlers also get cutters who feel more comfortable behind the disc, or girls who handle. [This might also be solved by the other rating scheme I have in mind - again, see note at the end]. This is crucial - with our generally low level of disc skill (at least, compared to the US), a team with noticeably weaker handling is at a significant disadvantage.
- Chemistry considerations. This is by far the hardest, most voodoo-like part of the process. I think I got it mostly right in the last ligat cova, but there are so many unknowns that you never really know. I try to make sure each team has at least one experienced player who doesn't mind leading and being vocal about it, and that each team has a mixture of younger and older talented players.
Tweaking with those considerations usually takes around half an hour, which means the whole team generation process takes around an hour. Last, but certainly not least, I make sure another set of eyes (or two) reviews the teams, to make sure I didn't do anything overly silly (such as forgetting a player, thinking a girl is a guy, or absolutely mis-rating a player - purely hypothetical examples, of course...).
I took a process which used to take three to four hours of total pain in the ass, and turned it into a relatively smooth process which takes an hour or less to complete. I can live with that :)
I wrote the team generation script in Python, and tend to tweak it before each tournament. Just for kicks, I think I might write a quick GUI to make the fine-tuning easier (and because I feel like learning to work with PyQt).
I'd love to hear your thoughts on this - what might I be doing wrong? How can we do it better? What would you change?
-Guy
Notes:
Another rating system: I messed around with keeping our ratings on another scale. Instead of keeping a single 1-10 rating, keep two. Rate each player as both a handler and a cutter. That would better demonstrate the cutters who have the disc skills and decision making to handle if need be, as well as differentiate better between the handlers who are old and slow (*cough* Mickey *cough*), and those who can also cut if it serves the team better.
If we do that, then we no longer need to split the guys into handlers and cutters. It would require modifying the team generation algorithm some - I haven't done it yet, but it certainly doesn't seem like a very tough problem. Of course, manual tweaking would still be necessary - but maybe less so than before. What do you think? Might it work better?
Monday, December 5, 2011
Ultimate - Hat Tournaments - Why do we suck at rating ourselves?
Background:
For a while now, I've spent some (very small amount of) my time helping to make teams for hat tournaments in Israel. Lately it was for the latest Ligat Cova (Israel's monthly hat tournament, put on by an awesome bunch of volunteers) and the annual Succot hat tournament, but I've done it for several ligot cova in the past.
The first couple of times I did it (I think the first was 2009's Holylanders Hat), I (or we - in that case, Tomer and I) did it manually: take all the players we have, derive from their self-ratings and what we know about them a rating we're more likely to trust, and start throwing people into teams. Several hours later (without any exaggeration, as anyone else who's done it before can attest) we were finished, with what was largely a 'do-it-by-feel' process. Being somewhat of a nerd, I'm not a huge fan of making shit up as I go - I'd like to work in a somewhat organized fashion, with a process that's at least somewhat mathematically justified.
This is going to be a two-part post - in this first part, I'll tackle the issue of rating players - and why the ratings we receive from players are, in my mind, essentially useless. In the next part, I'll write a bit about the work I (with help from several guys in the community) have been doing to make this process smarter, easier, and less time-consuming.
Claim:
The data we receive from the players about themselves (we ask them to rate their experience, disc skills, and speed) correlate very weakly with our ratings of them, making their ratings effectively useless at helping us build balanced teams. Also, I have graphs on my side, so I must be right.
Data:
I worked with the data from the latest ligat cova - 51 players who signed up. A few more signed up but obviously disregarded the rating section of the signup sheet (veteran players rating themselves 1 out of 5 across the board). More data is available - from the Succot tournament and many past ligot cova - but I'm not sure it's necessary.
Method:
I plotted our ratings (the tournament organizers' - the closest thing we have to a true rating, since it's all subjective in the end) against the players' self-rated numbers. The first shot I took was plotting against the sum of all three self-rated parameters (again, we're talking about experience, disc skills, and speed), and this came out:
Even without the numbers, it's easy to see the plot is all over the place. The numbers only help support the claim. If you've never taken statistics before, try the Wikipedia article on the coefficient of determination (R^2). Math majors, feel free to correct me: in simple terms, R^2 tells us how much of the variation in the Y (dependent) variable can be explained by the X (independent) variable. In our case, R^2 = 0.2366 means that only 23.66% of the variation in the Y variable (our ratings) can be explained by variation in the X variable (the players' self-ratings). It also means our correlation coefficient, R, is the square root of that: R = 0.4864. If you want to show correlation, you're usually looking for 0.9 or more. Essentially, there is no strong correlation between the players' ratings of themselves and our ratings of them.
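For anyone who wants to check the math, R and R^2 take only a few lines of Python (the data below is a toy example, not the actual tournament data):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy example: a perfectly linear relationship gives R = 1, so R^2 = 1.
r = pearson_r([3, 6, 9, 12], [1, 2, 3, 4])
r_squared = r * r
```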
I then tried another method. We classify the players (in our ratings) to Cutters (whose speed is more important), Handlers (whose disc skills are more critical), and Girls (whom there are few, so we don't split internally). With that in mind, I tried to create a different predicting statistic:
- For cutters, the sum of their experience and speed
- For handlers, the sum of their experience and disc skills
- For girls, 2/3rds of the sum of all three (to keep the number on the same 2-10 scale)
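The alternative predictor above, sketched in code (inputs are the 1-5 self-ratings; the function name is mine):

```python
def predicted_rating(position, experience, disc_skills, speed):
    """Position-aware predictor from the list above."""
    if position == "cutter":
        return experience + speed
    if position == "handler":
        return experience + disc_skills
    # girls: rescale the three-way sum (3-15) to the same 2-10 range
    return 2 * (experience + disc_skills + speed) / 3
```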
The results, unfortunately, were not better:
In fact, the statistics are even weaker: R^2 = 0.2141, which gives R = 0.4627.
Conclusion:
If you got here, congratulations, and thank you :) I have three ideas as to why this happens, but I haven't thought much about trying to prove or disprove them:
- People don't really care. For most players, this isn't their first or second time filling out the form. They probably think we don't even look at what they fill in (which I just claimed we shouldn't, so maybe they're right :)). As a result, they don't bother to fill it in seriously, and make up numbers that sound about right. That's even after removing the people who left everything at 1, who obviously weren't trying. Frankly, my dear, they don't give a damn.
- Most players' standards are different from ours. Two quick examples, one in each direction. Someone who only plays with a few of his buddies every couple of weeks decides to join us. Compared to them, he's the fastest, and thinks his throws are hot shit. He can even throw a flick, once out of every two or three times. He would probably overrate himself. Conversely, some guy who was around the middle of his D1 college team doesn't think of himself as a star, but would probably be one of the best players at the tournament. He would probably underrate himself.
- Our rankings aren't as good as we think they are. Quite possible - but they're the best we have.
How do we fix it? I have some ideas, but this is already long enough. I'll throw them in with the second part of this post.
If you've read this far (and even if you just skipped to the conclusion) - I'd love your thoughts. Anything wrong with my math? What do you think we can and should (if at all) do?
-Guy