Why is Elo Inflating so much?

876584635678890 · November 4, 2018, 12:48pm

About when I started saving the leaderboard elo scores, there should have been at least one matchmaking change:

Now, look at the general elo development of the top algos since then:

As you can see, I very professionally marked the upper and lower boundaries of the top 10, and they appear to be increasing steadily (somewhat linearly).

Looking at the development of the amount of matches played, we can see that there was no change in matchmaking speed (completely linear):

This should therefore most likely not be a cause because the highest elo was stable at around 2000 for a long time.

You need to know that the elo required to be in the top 10 was even lower before the database was in place.

Almost two weeks ago, I plotted the elo distribution at that time, and below you can see that data compared to the current distribution (it is based on each user’s highest elo on the leaderboard):

(plot available here)

This shows that the elo did not only inflate in the top 10, but it did actually inflate overall.
There are almost no algos left below 500 elo, so what applies to the leaderboard also applies to all other algos in this case.

Now, I am wondering about two situations:

How were the matchmaking changes able to increase all elo (are they the cause of this phenomenon)?
If so, what changes were applied for this to happen?

kkroep · November 4, 2018, 1:02pm

Playing a match under the 400 ELO gap means that top algos can only lose ELO. What happens now is that the very best algos almost never lose, which was also the case previously, but now every win actually boosts the ELO of the best algos. Because algos are becoming very adaptive, they are covered against (almost every) static strategy. This means that only dynamic algorithms are a threat, and they are difficult to get right. Not only that, but dynamic algos have a high ELO, so losing a match is not that bad.

Actually my KeKroepes_2.0 algo has a mistake in the anti ping rush defense (accidentally commented out), so about half of all its losses are due to that bug. The other losses are mostly against genuinely impressive algos (with high ELOs).

So a combination of the new match making, and the maturing of top algos is affecting this I think. I feel like the top dogs are stepping up their game now that the top algos are stress tested against other top algos.

n-sanders · November 4, 2018, 1:20pm

Another thing to keep in mind is that this only charts a user’s best algo. So it makes sense that basically everyone’s best algo is getting better. There’s no real way to track “feeder” algos that come in at 1500 and lose enough to never become that user’s best algo.

I know I’ve also created algos to try wacky new ideas that sometimes don’t pan out

Edit: also old, strong algos can hang out for a while and provide elo growth. NMFF was a previous #1 that’s still got 2k elo that people are getting better at beating and it’s feeding higher elo.

n-sanders · November 4, 2018, 1:30pm

It may be misleading to say it as an algo is getting “better”, because algos don’t “improve”. What I mean is the users are creating better algos and the top 10 is being filled with “younger” algos that have higher potential elo ceilings (such as @kkroep fixing his ping defense).

Ryan_Draves · November 4, 2018, 3:01pm

Better algos really shouldn’t account for the massive inflation though, which has taken place in a matter of days (and directly coincides with a matchmaking fix). To me it seems much more likely that, since every match played has a positive effect on the winner and negative effect on the loser, a couple of things might be happening. Keep in mind the previous ceiling, 1900-2000, which is considerably close to the 400 elo gap of where new algos start from (1500).

Before, without rematches, algos at the top had very little to gain. Since they’ve already played most any other algo, and rematches were not held, almost every match was above the 400 elo threshold, so any losses to these new, lower elo bots simply negated what little gains in elo could be made. This placed the ceiling around 1950.
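The asymmetry described above can be made concrete with the standard Elo update. This is a minimal sketch: the K-factor of 32 is an assumption (the thread does not state the value Terminal uses), and `expected_score` and `rating_change` are hypothetical helper names.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


# Assumed K-factor; the value Terminal actually uses is not given in the thread.
K = 32


def rating_change(rating: float, opponent: float, won: bool, k: float = K) -> float:
    """Points gained (positive) or lost (negative) by `rating` after one match."""
    score = 1.0 if won else 0.0
    return k * (score - expected_score(rating, opponent))


if __name__ == "__main__":
    top, newcomer = 1950, 1500  # a ceiling-level algo against a fresh upload
    gain = rating_change(top, newcomer, won=True)
    loss = rating_change(top, newcomer, won=False)
    print(f"gain per win:  {gain:+.2f}")
    print(f"loss per loss: {loss:+.2f}")
    print(f"wins needed to offset one loss: {-loss / gain:.1f}")
```

Under these assumptions, a 1950-rated algo gains only a couple of points per win over a fresh 1500 algo and loses close to 30 on an upset, so it needs roughly 13 wins to pay for one loss.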

Now, these bots are inflating above the gap of having nothing to gain from playing new bots. By the time the new bots are in range to play the top dogs, the top dogs have something to gain from winning. It’s apparent that they have too much to gain from winning, however, and that it’s not scaling off as their elo increases.

876584635678890 · November 4, 2018, 3:04pm

@n-sanders How come that elo steadily increases just now?
Truth_of_Cthaeh was top 1 for weeks and mostly between 1900 and 2000, just towards the end surpassed 2000, but I would count that towards the inflation.

Think of the following scenario: Elo inflation would never stop for old algos because of what @Ryan_Draves explained. There should not be an easy way to get any new algo into the top spots then.

There might not even be a need to change the system then. We could just adjust the starting elo, which leads me to the question whether 1700 would be a better starting elo right now.

kkroep · November 4, 2018, 3:16pm

Because of the math behind how ELO systems work, that would just result in everybody’s score being 200 pnts higher. That is identical to punishing every single algo in the system by 200 pnts as a one time event and then continuing as normal. ELO performance is measured against a baseline that is set by the ELO rating of fresh algorithms. The problem you correctly raised is that the ELO of new entries doesn’t change quickly enough; a better solution is to have new entries scale faster than established algos. To my understanding the team is currently pondering this issue. An example would be that new algos have their ELO reward or punishment be 3X higher for the first 10 matches, or something similar.
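The "3X for the first 10 matches" idea could look roughly like the sketch below. Everything here is an assumption for illustration: the base K-factor of 32, the constant names, and the helpers `k_factor` and `update` are hypothetical and not taken from Terminal's implementation.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


BASE_K = 32                 # assumed K-factor for established algos
PROVISIONAL_MATCHES = 10    # "first 10 matches" from the suggestion above
PROVISIONAL_MULTIPLIER = 3  # "3X higher" from the suggestion above


def k_factor(matches_played: int) -> float:
    """Larger K while an algo is still in its provisional period."""
    if matches_played < PROVISIONAL_MATCHES:
        return BASE_K * PROVISIONAL_MULTIPLIER
    return BASE_K


def update(rating: float, opponent: float, won: bool, matches_played: int) -> float:
    """New rating of one algo after a match, using its own K-factor."""
    score = 1.0 if won else 0.0
    return rating + k_factor(matches_played) * (score - expected_score(rating, opponent))


if __name__ == "__main__":
    # A fresh algo beating an equally rated opponent moves 3x as far.
    print(update(1500, 1500, won=True, matches_played=0))
    print(update(1500, 1500, won=True, matches_played=10))
```

One design consequence worth noting: if only the newcomer's change is scaled while the opponent keeps the base K, a match no longer moves the same number of points on both sides.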

Ryan_Draves · November 4, 2018, 3:22pm

That still doesn’t solve top algos inflating rapidly. Having new algos gain more early on is nice, but to fix the inflation, the top algos need to have less gains. Either by scaling off elo gains and holding elo losses constant, or by scaling both off. Personally I think elo gains should scale off in higher elos but losses stay constant, because that way better algos can reach the top and kick the old ones out.

kkroep · November 4, 2018, 3:22pm

The question is actually whether the ELO will keep rising like this, or will eventually settle somewhere. I think the new system will allow for a lot higher ELO, but it would still plateau. The question is where. Mathematically speaking there is no way that the ELO ratings can keep rising to infinity or something, because all the points a player gains, are taken away from another player. There is only a limited pool of ELO points to grab from.
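The "limited pool" argument can be checked with a small simulation. This is a sketch under stated assumptions: a closed pool (no algos entering or leaving), a single shared K-factor of 32, and a hypothetical hidden `skill` value per player that decides match outcomes. None of these come from Terminal itself.

```python
import random


def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def play(ratings: list, a: int, b: int, a_wins: bool, k: float = 32) -> None:
    """Apply one match result; whatever `a` gains, `b` loses."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta


def simulate(n_players: int = 50, n_matches: int = 5000, seed: int = 0) -> list:
    rng = random.Random(seed)
    ratings = [1500.0] * n_players
    # Hypothetical "true strength" per player, used only to decide outcomes.
    skill = [rng.gauss(1500, 300) for _ in range(n_players)]
    for _ in range(n_matches):
        a, b = rng.sample(range(n_players), 2)
        a_wins = rng.random() < expected_score(skill[a], skill[b])
        play(ratings, a, b, a_wins)
    return ratings


if __name__ == "__main__":
    final = simulate()
    print(f"mean rating: {sum(final) / len(final):.6f}")
    print(f"top rating:  {max(final):.0f}")
```

In this closed setting the mean stays at the starting value while the ratings spread out. In a live ladder, algos that enter at 1500 and are later removed at a different rating do change the total, so the pool is only approximately fixed.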

[edit] if you look at chess, the plateau is now somewhere around 2800, and AI players somewhere above 3000 because they don’t ever lose.

The changes to matchmaking simply show how superior the top algos are to their peers, something that wasn’t as visible previously when not matched against high ELO algos. Previously when every algo was matched against everybody, it was more important to not lose to any low ranked algo (because of the spectacular penalty), than to compete with the high level ones. This is now different.

[edit] it is very important that the developers don’t sit this one out till the competition ends however. Otherwise one cannot feasibly submit a new algorithm in the last three weeks because it cannot rise to the top fast enough!

Ryan_Draves · November 4, 2018, 3:27pm

Yes, there is a cap, but it’s not simply because these new algos are “x amount better.” The cap is when the inflation spreads out the top algos so much they have no matches left to play (400 elo higher than every other algo they haven’t played before). This would just slow the inflation, not stop it. Realistically at some point it would take so long for a new algo to gain enough elo to play a match against the top algo, thereby creating a “ceiling.” It’d be hard to calculate where this cap is, but you could assume it’s proportional to the amount of new, good algos being uploaded and climbing to high enough ranks to boost the top dogs.

Again, the cap is not representative of how good new algos are. It’s just a consequence of the flawed matchmaking changes.

Edit: No theoretical cap. Just a realistic limit as to how high algos can climb in 2 months. The inflation should scale off after reaching some ridiculous number, like 3000-4000 (arbitrary guess).

876584635678890 · November 4, 2018, 3:30pm

Adding to this: The Elo system does not have a theoretical ceiling.

kkroep · November 4, 2018, 3:35pm

Ryan_Draves:

Yes, there is a cap, but it’s not simply because these new algos are “x amount better.” The cap is when the inflation spreads out the top algos so much they have no matches left to play (400 elo higher than every other algo they haven’t played before). This would just slow the inflation, not stop it. Realistically at some point it would take so long for a new algo to gain enough elo to play a match against the top algo, thereby creating a “ceiling.” It’d be hard to calculate where this cap is, but you could assume it’s proportional to the amount of new, good algos being uploaded and climbing to high enough ranks to boost the top dogs.

Again, the cap is not representative of how good new algos are. It’s just a consequence of the flawed matchmaking changes.

This would be correct if there were one algo that beats everything, like the chess example I gave you. The rating of AlphaZero is 3750, while the highest rated human is Magnus Carlsen with 2882. The superiority shows. This is clearly not the case here. There are many competitors for the top 10 spots, as you can see by the leaderboard being quite volatile recently. This is not a game of “who rises faster”, but people are genuinely finding new strategies to beat up the other strategies. Notice that the old top dogs are basically all gone now. All are replaced by superior iterations.

I completely disagree with your statement that this is due to flawed matchmaking changes. I would actually turn it around: the previous matchmaking system was very flawed, and we got used to that, but this one is probably a lot more representative.

Your estimation of 3000-4000 is only possible if there is a very small number of algos that win against literally every other algo. I don’t expect that to happen. Actually the ELO system is very well researched, so these things can easily be checked in literature.

It has a practical ceiling, based on where it is applied and how

[edit] if you look at the application of 8
https://876584635678890.github.io/
you can see that it is not the algos that are rising, but better algos that are taking over. Take EMP^ERROR_v1.0 for example. It currently has a 2026 rating. So it stabilized under 2100 once better algos showed up. EMP^ERROR_v1.0 is already quite effective, but better algos should have a higher rating than that thing. Another example is Not_my_final_form.

and that is exactly what it was before, so nothing changed there. That algo peaked at 2069. I guess it dropped a bit

876584635678890 · November 4, 2018, 4:23pm

The amount of data is fairly small and there are a lot of factors that play into success of different algos of the same users, but I will try to analyze it a bit anyway:

(PythonSmasher3DA.I_ algos, two are colored bright greenish in the same way because they are the same, but uploaded twice)

The better version (in blue) overtook the older versions at some point (from 4PM to 4AM), then got overtaken by the “weaker” version again (3PM next day to 4AM) and now the blue version is ranked higher again.
I am not sure if we can conclude much from this, but I think we can conclude that it does not show that the elo just rose because a “better” algo overtook it (both are still rising).

(EMP^ERROR_v1.0 vs KeKroepes_2.0 (cyan))

Comparing just these two suggests that the newer version rose higher because it was better, since it at least looks like the old algo hit a ceiling. But that was only for a single day, which is too short to draw conclusions from.

(not_my_final_form vs MazeRunner)

I do not think that we should try to make any conclusions about this because we have no data about the blue graph after the green one took over. It might just be that not_my_final_form showed a similar trend.

If that meant “It is stuck at exactly (about) 2000 elo.”, one could make @kkroep’s argument.

(sawtooth (brown) vs sawtoothV2)

These two graphs have two important features: The brown one was higher in elo all the way in the time of the blue straight line from 8AM to 3PM because of the way the data is captured, i.e. it does not show that the blue elo was actually lower than the brown one because there was no data recorded for it during that time.
Same for the other way round at 3PM to 9PM near the end, where there was no data recorded for the brown graph.
From this we can see that both algos rose equally in elo because both graphs are lower in elo on the left than on the right side, which points to elo inflation.

As I said in the beginning, this is not enough data to draw conclusions, but the original image:

This should show this kind of development the best (because it includes most available data) and this points more towards elo inflation than it does towards there being stronger algos, which we cannot make out in my opinion at the moment.

kkroep · November 4, 2018, 4:45pm

Yeah, this is what I am trying to point out too. Some people are saying: “look at this thing rising it is going to keep going from now on, the matchmaking is flawed”. My guess is that the new matchmaking system will result in a different ELO ceiling, and that the current algos are now rising to the new ELO ceiling. Basically the ladder hasn’t had enough time to find a new stable point due to the recent matchmaking changes. The reason it is rising at the pace we see here, is because the top algos are limited by how many other algos share the same rank so they can get ELO points.

The examples I gave were to illustrate different ways of looking at the data, like counter examples. Point is, the behavior of ELO systems is very well researched. If you want to say that the ELO system is flawed, you need some serious mathematical guns and/or empirical evidence to back it up (more than is currently presented). And if you can, I recommend publishing a paper about it in a well respected venue. If that is the case PM me, I’ll be happy to help. Disclaimer: I recently had a discussion with the developers over ELO so I actually looked up the literature… Saturday night well spent

MilanLR · November 4, 2018, 5:49pm

We don’t know which algo is better, so we left them both in. Sometimes one loses and goes down a bit, but we haven’t reached a conclusion ourselves either.

Ryan_Draves · November 4, 2018, 6:05pm

If it was as well researched as you suggest, it wouldn’t need changes nor would it need such a patronizing defender.

These are ideas as to what’s causing the apparent inflation. It doesn’t take “mathematical guns” to make logical guesses, and these guesses don’t have to be right.

We’ll let the people running the show figure out what’s right and what’s wrong and adjust from there; this forum is for a discussion of ideas, not an argument over why “my idea is better than yours.”

kkroep · November 4, 2018, 6:13pm

You clearly misunderstand my point. I am not here to attack you personally, but rather to point out that the system is based on maths and is well researched, globally accepted and therefore not simply flawed, at least not in the way you are suggesting.

You are suggesting here that the matchmaking system is flawed because the ELO ceiling is different than before the matchmaking changes. This is a very bold statement to make. If you go online to look for info on ELO and matchmaking systems, there are many people much better at explaining this than I am, so I suggest you take a look if you are interested.

If you were posting a comment about how the earth is flat, you’d also get a reaction from me that you might not like…

I am a strong believer in math, scientific research and constructive arguments. I will defend those ideals yes. If you find that offensive then so be it.

Other than faster scaling at the start, I don’t believe it needs changes? The system works, and is applied in many, many applications

[edit] I read back the conversation, and I fail to see how you got so offended other than me disagreeing with you and offering counter arguments and examples. @Ryan_Draves can you point out to me which parts of this conversation triggered you so much to start name calling?

RegularRyan · November 4, 2018, 6:35pm

Hey everyone,

Last week we released a change that would improve matchmaking by causing algos’ Elos to be closer together on average, and never more than 400 points away. We have not adjusted how we are calculating scores, only how we are making matches. I have not investigated this issue in detail (just saw this thread now), but this behavior is reasonably in line with what I would expect.

Previously, algos would gain nearly no Elo when faced against another algo 400 points lower than it, and none at all at 500 points lower. This led to a dramatic slowdown in progress around 2000 Elo, as many of the matches being played were these pointless matches with no points actually being exchanged. This was most noticeable in that many top algos stayed on the board for weeks before falling, even when their creators introduced much better versions of those algos. Currently, I believe this is the reason algos had plateaued at 2000 points.

Elo is a measure of standard deviation of winrates. An algo with 2000 Elo should win 64% of the time against an algo of 1900 Elo, 76% of the time against an algo of 1800 Elo, 85% against 1700, and 91% of the time against a 1600 algo. We currently believe that the current state of the leaderboard does reflect this, but I will investigate on Monday.
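For reference, win rates like these follow from the standard Elo expected-score formula, which is easy to evaluate directly. The sketch below assumes Terminal uses the usual logistic curve with a 400-point scale; the function name `expected_score` is a hypothetical helper, not part of any Terminal API.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score (win probability when there are no draws)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


# Evaluate a 2000-rated algo against opponents 100 to 400 points lower.
for gap in (100, 200, 300, 400):
    win_rate = expected_score(2000, 2000 - gap)
    print(f"{gap} points lower: {win_rate:.0%}")
```

Under this formula the gaps of 100, 300 and 400 points give about 64%, 85% and 91%, and a 200-point gap gives about 76%.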

Note that the reason Elo is inflating at the bottom of the leaderboard is that there was an issue that was resolved pretty early on where more points were being lost than gained, leading to many algos dropping far lower than expected. These algos are now ‘self correcting’ back up to the appropriate level, and I expect the second lump in 8’s Terminal Leaderboard Elo Distribution chart to merge back into the first lump eventually.

There are some unique factors in our system, such as the frequency at which new players can enter, lose points, and then leave the system, and the fact that the ‘ceiling’ and ‘floor’ for algos is very far apart. What I mean by this is that for the Elo cap to reach, say, 3000, you need to have 2000 Elo players that nearly always beat average players, 2500 Elo players that almost always beat those, and 3000 Elo players that almost always beat those. I believe our leaderboard might reach the 3000 range, especially as top algos continue to improve.

We are working on systems that will help algos reach their correct Elo in a shorter amount of time. Next week’s changes should lead to more matches being played overall, and a higher % of total games being played by players over 1500.

There is a ton of discussion here, let me know if I missed any points or if anything else needs to be addressed.

876584635678890 · November 4, 2018, 8:27pm

@kkroep I have actually found some evidence that better algos have taken over in the data:

Since there was new data captured for Truth_of_Cthaeh, we can see that it has really remained at its spot and MazeRunner has risen nonetheless.

n-sanders · November 4, 2018, 8:28pm

Just to add a little more data:

NMFF is sitting at 2050 elo right now (after last two matches played being losses). So it’s seemed to stabilize in a 2k-2.1k band.

MazeRunner is a similar algo to NMFF but with 6 releases of development trying to address the weak spots of NMFF (with hopefully no regression). The intermediate versions were removed when MazeRunner went up, but NMFF got to stay since it was my top algo. So for a number of match-ups that NMFF loses, MazeRunner wins (and I’ve seen this behavior in the global competition). Therefore MazeRunner should have a higher elo ceiling than NMFF.

I don’t know what kind of ceiling MazeRunner will have, but it’s 17-2 over the last matches played (that’s all the history terminal makes available at the moment). Its oldest match in the history was on 2018-11-03T15:17:53.636Z. If 8’s chart is in zulu, it had 2120 elo when it started its 17-2 run and is currently at 2235.

Edit for more data: the elo range of its opponents in the past 19 matches is 1717 - 2043.