Monday 26 August 2013

Around the world in 80 puzzles - Scoring

Following up on the introduction of "Around the world in 80 puzzles" and the discussion of its format, this is an update on the scoring system that will apply to the puzzle sets of this part of the Championship. Please read carefully.


Scoring

The reason we need a separate scoring system for “80 puzzles” is that it has to compensate for the fact that different competitors are not solving the same set of puzzles. Although we have taken significant measures to ensure that the sets comply with certain standards and are at a similar level of difficulty, it is impossible to guarantee that there will be no imbalance between them, so the scoring system needs to account for this. The process is as follows.

Puzzle sets will carry raw puzzle scores (“raw scores”) as if they were ordinary WPC rounds; these will be established based on the results of the test solving performed by the core team. All sets will have an identical maximum score. Since every competitor solves exactly three sets, everyone will have three raw scores. The raw scores will be converted to standard Championship scores as follows:

  • The 5th-placed official competitor’s raw score will be converted to 500 points
  • The median of all the official raw scores will be converted to 250 points
  • A zero raw score will be converted to 0 points
  • Raw scores other than these three markers, both official and unofficial, will be converted to points proportionally, i.e.
    • Raw scores below the median will be converted using linear scaling between the zero and median marks
    • Raw scores above the median will be converted using linear scaling between the median and 5th-place marks


This is actually much easier than it sounds ☺


Let’s see an example: a 60-minute round with a total of 100 points available and 25 competitors, all of them assumed to be official. The table shows ranking, name, raw score and converted points:

Rk Name    Raw   Conv
=====================
 1 Alice   98     663
 2 Bob     96     638
 3 Cecil   91     575
 4 Daniel  87     525
 5 Edward  85     500
 6 Frank   81     475
12 Luke    47     263
13 Median  45     250
14 Norah   43     239
24 Yves     2      11
25 Zero     0       0

(Note that it is not necessary for anyone to actually score zero points – it is of course not desirable either ☺)

As seen in the table, the 5th-placed solver received 500 points, the one with the median score (called Median, incidentally the right place in the alphabet) received 250 points, while unlucky Zero (incidentally, again) got, well, 0 points. All the other solvers’ scores were converted into points in proportion to how well they did compared to those three highlighted solvers (more precisely, scaled linearly between the two markers nearest to their own score).
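For readers who prefer code, here is a minimal Python sketch of this conversion, using the markers from the example table (5th-place raw score 85, median raw score 45). Raw scores above the 5th-place mark are simply extrapolated along the median-to-5th line here, which is an assumption of this sketch rather than a definitive specification of the actual conversion.

```python
def convert(raw, fifth_raw, median_raw):
    """Convert a raw score to Championship points using the zero/median/5th-place markers."""
    if raw <= median_raw:
        # At or below the median: linear scaling between (0, 0) and (median, 250).
        return 250.0 * raw / median_raw
    # Above the median: linear scaling between (median, 250) and (5th place, 500).
    # Scores above the 5th-place raw score are extrapolated along the same line
    # in this sketch, which may not match the exact treatment of that region.
    return 250.0 + 250.0 * (raw - median_raw) / (fifth_raw - median_raw)

# A few rows of the example table, with 5th-place raw = 85 and median raw = 45:
for name, raw in [("Edward", 85), ("Frank", 81), ("Luke", 47), ("Norah", 43), ("Yves", 2)]:
    print(f"{name}: {convert(raw, fifth_raw=85, median_raw=45):.1f}")
# Edward: 500.0, Frank: 475.0, Luke: 262.5, Norah: 238.9, Yves: 11.1
# (matching the table's 500, 475, 263, 239 and 11 after rounding)
```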

Rationale

Although we have taken as much care as possible to ensure that the difficulty of the sets is at least similar, data from only three test solvers is no guarantee that they actually are. Since different competitors will attempt different sets, any difference in actual difficulty needs to be compensated for, and that is exactly what the outlined calculation does.

No matter how hard each of the puzzle sets turns out to be, the above compensation scheme will ensure that every solver’s performance is rewarded in light of how well they did compared to the rest of the solvers who solved the same round (in theory, the fact that not all rounds are solved by the same set of solvers can make some difference here, but with a high number of solvers, the variance in the eventual 5th score and median is negligibly small). If a puzzle set turns out to be harder than the others, the raw scores of the field will probably be lower overall, meaning that the final scores will be adjusted upwards a little more than for the other sets. Conversely, if a puzzle set turns out to be easier than the others, the raw scores are likely to end up higher and will be adjusted upwards a little less.

The choice of the markers (i.e. 5th place and median) may seem rather arbitrary. Indeed, the number of markers to use and where they are located in the distribution is a matter of choice. We have examined a few options here, including already existing score conversion algorithms, and found that using a single marker (i.e. making all scores proportional to just one median or just one average) may introduce some unwanted distortions, while using two markers works reasonably well across the distribution.

Indeed, looking around popular online puzzling sites, the most notable examples are LMI and Croco-Puzzle, both of which use ratings to rank puzzlers (LMI is based on complete puzzle sets, while Croco goes by single-puzzle times), and both of which use two markers to scale the scores of the field – in their case the top competitor and the one with the median raw score. Our method is fairly similar to theirs; the only difference is that we believe the top performance of any field can vary widely in how extraordinary it is, while a score slightly below the top is generally more stable, hence our choice to use the 5th raw score instead of the 1st. The reasoning behind this choice is illustrated further below.

(The fact that we looked at LMI and Croco ratings, among others, makes sense especially because these sites introduced ratings precisely to be able to measure and rate solvers’ performance across multiple tests, even if different solvers choose to solve different sets. While the choice of puzzle sets is of course a little more controlled in the case of “80 puzzles” than in the free world of online competitions, using the same principle to bring solvers’ results into a comparable shape is certainly justified.)

The choice of 250 and 500 may seem arbitrary, but there is a thought process behind them, both in terms of the magnitude and the ratio of these numbers. All the individual rounds of the Championship outside the scope of “Around the world in 80 puzzles” will be scored on the philosophy that round lengths and total scores come down to approximately a “winner to achieve 10 points per minute” scheme, although some rounds may be designed so that it is more realistic for top puzzlers to finish them and obtain bonus points than in others. Therefore, if we want to keep the “80 puzzles” rounds’ scoring in line with that overall philosophy, we need to try to make round winners score “around 600”. Easier said than done – granted, we could just award them 600 points with a different score conversion, but later in this write-up we make clear why we chose to pin the 5th-placed solvers’ scores rather than those of the round winners. Given some of the test data detailed below, winners are expected to score around 600 under this conversion scheme, maybe slightly above if they beat the field comprehensively, so awarding 500 points to the 5th-placed solver seems a sensible approach (besides, it sounds simple and 500 is a nicely rounded guideline number).

On the other hand, a large set of test data (including but not limited to the WPC rounds analysed below) indicates that the median score of a solver field typically falls within 45-50% of the score of the 5th-placed solver; while this ratio is more typically around 45% for some of the online tests we have looked at, for WPC rounds it tends to be closer to 50%, as is almost exactly the case in some of the test data below. A possible explanation is that solvers who attend a WPC represent a slightly more skilled field than those who attend online tests (even though the latter often feature many of the top guns, too!). Therefore, using 50% as the value the median raw score converts to seems justified, hence the 250. Additionally, while the mapping of the lowest scores could also be handled in various ways, we have gone for simplicity here and simply say that a raw score of zero maps onto zero points.
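As a back-of-envelope illustration of the arithmetic above (the round length and median ratio below are assumed typical values, not figures taken from any particular round):

```python
# Rough arithmetic behind the 500/250 choice; the inputs are illustrative assumptions.
round_minutes = 60                                 # a typical individual round length
winner_target = 10 * round_minutes                 # "winner ~10 points per minute" -> about 600
fifth_place_points = 500                           # pinned a notch below the expected winner score
median_share_of_fifth = 0.50                       # WPC fields: median raw score ~50% of 5th place
median_points = median_share_of_fifth * fifth_place_points
print(winner_target, fifth_place_points, median_points)   # 600 500 250.0
```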

Finally, the reason for using only official competitors’ data to determine the markers for score conversion is obviously that none of the unofficial scores should have any impact on any of the official ones.


To provide some further analysis and some visualised results on the impact of the score conversion, the following paragraphs turn a little technical; feel free to skip them if you are not interested in the dirty details!

I have looked into some past data to see if applying the method above would uncover anything odd. While I have certainly not collected and processed enough data to claim that we conducted a large amount of testing, this set of data was intentionally chosen to at least resemble the “80 puzzles” framework, to give us some confidence. In addition, other tests were done by one of the reviewers using other sets of data that do not fit within the boundaries of this page but are available for further discussion.

Description of test data used:
  • Taken from the results of WPC 2011 
  • Includes the full results of four long individual rounds of practically assorted puzzles:
    • Round 2 – “Assorted”
    • Round 5 – “Evergreens”
    • Round 12 – “Hungaricum”
    • Round 13 – “Innovatives”
  • These rounds featured a range of design philosophies (classic-ish, innovative, and of course the infamous Evergreens with all those think-outside-the-box puzzles)
  • These rounds lasted between 50 and 70 minutes (averaging to 60)
  • Since WPC 2011 took place in Hungary, these rounds were scored and timed by the exact same core team in 2011 that coordinated the puzzle creation process with the authors of “80 puzzles” this year
For all these reasons, this particular set of data seems a justified choice, as “80 puzzles” will be very similar in all these aspects.

The figure below provides a nice visualisation of all this data at a glance. One curve corresponds to one of the rounds as shown. The horizontal axis captures the ranking of competitors within each of the rounds, while the vertical axis shows the scores they achieved.

[Figure: Score/rank distribution of four selected long rounds of WPC 2011. Note a fairly similar shape across all of them.]

It’s hard to say what is a “good” distribution for any of the rounds, but looking at this figure, it is probably fair to say that the four rounds’ distributions are reasonably similar. (Of course, this doesn’t mean that everyone actually achieved similar scores in those rounds, quite the contrary – but when you look at all the competitors as a group, their overall performance distribution seems to be reasonably consistent.)

The good thing that comes out of this figure is that there seems to be a decent level of consistency in “what percentage of the maximum score you need to achieve if you want to finish in position N”. Of course, winning scores are always difficult to predict, but as you go down the list it becomes much more controlled: the lines are fairly close to each other.

It is important to point out that this type of consistency is not just a property of the puzzler population; it is also a feature of the scoring/timing system, in terms of the number of puzzles packed into a round, the mix of difficulty levels, and the scoring and timing of the round. Let me show you how it can look much worse – this time we are using all but a few individual rounds from WPC 2011, and for visualisation purposes the data has been mapped onto a scale of 1000, since the round sizes were largely different (you’ll notice someone must have got a small bonus in one of the rounds, and Sprint was excluded precisely because it featured many bonuses, which are not relevant for us). Otherwise the chart works just like the previous one:

[Figure: Score/rank distribution of (almost) all individual rounds of WPC 2011. Some of them are notably different this time.]

Here, the Screentest round, whose percentage scores were significantly higher than those of all the other rounds, is probably not much of a concern, given that it was a very different round in nature from all the others. However, the other culprit, the low-scoring Borderless round, is more of an issue: apparently, half the field solved only one puzzle or none at all, there were groups of tens of people finishing on identical scores (meaning that this round contributed nothing to separating those people based on their performance or to establishing their rankings), and the overall percentage is far lower than in all the other rounds. It is probably fair to say that, with hindsight and looking at this data and the context it provides, the Borderless round was not appropriately composed, scored and timed for a Championship (other sources of feedback indicate content issues as well).

(It would be interesting to look into similar data from other WPCs. I’m pretty sure we would get significantly different results for some of the years.)

Let us get back to the scores of the four rounds of WPC 2011 and pretend that they are actually raw scores from an imaginary “80 puzzles” contest from somewhere in the past. We notice that although the curves are, as discussed, fairly parallel and similar in shape, they are not actually very close to each other. If you consider that in an “80 puzzles” framework not everybody would have solved every round, then someone doing Evergreens (the highest scoring of the rounds) but skipping Innovative (the lowest scoring one) would probably have ended up with higher raw scores than someone with similar skills who skipped the sets the other way round, which is why score conversion is required in the first place.

Therefore, let us apply the scoring method of this year’s “80 puzzles” to those four rounds to compensate for this apparent difference in difficulty. Figures are rounded to integers (for scores) and to three decimals (for multipliers) for display purposes here (but not for the actual calculations).

[Figure: WPC 2011 rounds normalised using the method defined for "80 puzzles". They line up so well, except for the head of the crazy green curve...]


Some backing data, all given in the order these rounds took place in the actual event, i.e. Assorted, Evergreens, Hungaricum, Innovative.

Definition                  Ast Evg Hun Inn
================================================
Round winner (raw) scores:  630 645 870 520
5th place (raw) scores:     545 550 590 475
Median scores:              270 240 295 220


(Note that in this data set there was no distinction between official and unofficial scores. These numbers might differ slightly, but not radically, if unofficial scores were excluded from the calculation of the markers.)
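As a quick check, the per-round scaling slopes implied by these markers, along with the median-to-5th-place ratios mentioned in the rationale above, can be reproduced from this table (a small Python sketch; only the published marker values are used):

```python
# 5th-place and median raw scores per round, copied from the table above.
rounds = {
    "Assorted":   (545, 270),
    "Evergreens": (550, 240),
    "Hungaricum": (590, 295),
    "Innovative": (475, 220),
}

for name, (fifth, median) in rounds.items():
    below_slope = 250 / median            # points awarded per raw point below the median
    above_slope = 250 / (fifth - median)  # points per raw point between the median and 5th place
    ratio = median / fifth                # median raw score as a share of the 5th-place raw score
    print(f"{name:10s}  below={below_slope:.3f}  above={above_slope:.3f}  median/5th={ratio:.0%}")

# Hungaricum's median sits at exactly 50% of its 5th-place raw score, and the other
# rounds fall roughly in the mid-40s to 50% range, in line with the rationale above.
```

Notice that Innovative, the round with the lowest raw markers, gets the largest scaling factors, which is precisely the upward compensation described earlier.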

It looks pretty clear that the four rounds’ scores have now been normalised into a fairly similar distribution, regardless of their comparative difficulty, which has been our main objective.

An interesting thing to note is the very high score obtained in the Hungaricum round: the converted top score is 975! This is a result of the fact that the gap between the top-ranked solver and the 5th-ranked solver was much larger in this round than in any of the others. You could argue that this 1st-ranked solver is thus rewarded better for their performance than the winners of the other rounds, but then again, they did achieve an outstanding result in actual solving, as evidenced even by the raw scores, and therefore they deserve to be properly rewarded by whatever score conversion is used.

In fact, this set of data offers an excellent visual justification for not choosing to mark the curves using the 1st-ranked solver in any of the rounds: had we chosen to do so, the normalised figures would look like this (this is a different method from the one we will be using for “80 puzzles”):

[Figure: The same normalisation concept when marked to the top competitor instead of the 5th-placed one. Note how one rocket score in the green round impacts its 5th-30th places.]


What this demonstrates is that the top-ranked solver of the Hungaricum round is now no longer rewarded better than the winners of the other rounds. Instead, however, all the non-top solvers within the top 30 are now significantly under-rated compared to similarly ranked solvers in all the other rounds.
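To make this concrete with the Hungaricum markers from the backing table (raw scores: winner 870, 5th place 590, median 295), here is a small numeric illustration. The post does not specify what value the winner’s raw score would be pinned to in this alternative scheme, so 600 points is used below purely as a hypothetical choice; the qualitative effect is similar for any reasonable value.

```python
# Hungaricum raw markers from the backing table above.
median_raw, fifth_raw, winner_raw = 295, 590, 870

def scale_above_median(raw, anchor_raw, anchor_points):
    """Linear scaling between (median_raw, 250) and an upper anchor (anchor_raw, anchor_points)."""
    return 250 + (raw - median_raw) * (anchor_points - 250) / (anchor_raw - median_raw)

# "80 puzzles" scheme: the 5th-place raw score is pinned to 500 points.
print(scale_above_median(fifth_raw, fifth_raw, 500))   # 500.0 for the 5th-placed solver

# Alternative: pin the winner's raw score instead (600 points assumed, hypothetically).
print(scale_above_median(fifth_raw, winner_raw, 600))  # ~429.6 for the same 5th-placed solver
```

One outlying winner score stretches the whole upper half of the scale and pulls everyone between the median and the top downwards, which is exactly the compression visible in the figure above.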

In the context of “80 puzzles”, it is clearly more desirable to keep the field together even if the occasional stellar performance of a round winner may seem very highly rewarded, rather than keep the winners together and create a situation where tens of people may feel not sufficiently rewarded.

(While this choice makes sense for our purpose of ensuring the balance of the “80 puzzles” sets, it is important to note that the rating systems of LMI and Croco are well-established and robust systems that address slightly different situations; therefore our analysis should not in any way be seen as an attempt to assess or question their processes – in fact, we are grateful to them for having pioneered such rating systems, and of course for all the work they put into running their sites for all of us in the first place!)

Conclusion

It is probably apparent from the description above (even from the amount of scrolling required) that finding the right system for ranking puzzlers over a diverse set of puzzles is a complex problem with many possible considerations regarding assumptions, methodology, data and parameters, and many subtle choices that influence how well any ranking system will do under competition circumstances. It is worth pointing out that analysing such systems is very easy a posteriori but nearly impossible a priori, which is why it matters to see how well other similar examples have done and to make the most of those working examples. There are no claims that this scoring system is perfect (it is impossible to make such a claim), but hopefully it is by now clear to the reader that this issue was not taken lightly: we have put a huge amount of due diligence into investigating solutions and alternatives, and into communicating why certain choices were made the way they were.

Therefore we are confident that with this scoring system and the format rules communicated earlier, "80 puzzles" will be seen as a balanced, fair and integral part of the Championship and will be successful as such!

Notes

This document was sent for review to a couple of people a few days before its public appearance. Feedback, suggestions and remarks were received and, where applicable, incorporated.

The reviewers were, in no particular order:

  • Members of the core team (useful comments from Pal Madarassy)
  • The lead authors (tips from Thomas Snyder)
  • The WPF Board
  • Special thanks to Tom Collyer for an in-depth review on methodology and the insightful and inspiring bits of feedback
The next communication about "Around the world in 80 puzzles" will be the release of the instruction booklets of its puzzle sets, prior to the release of the booklet for the whole event. This is designed to allow teams to familiarise themselves with the sets and prepare their line-ups for "who skips what" decisions.
