Why is Elo Inflating so much?

Because of the math behind how ELO systems work, that would just result in everybody’s score being 200 points higher :stuck_out_tongue:. It is identical to punishing every single algo in the system by 200 points as a one-time event and then continuing as normal. ELO performance is measured against a baseline set by the rating of fresh algorithms. A better way to address the problem you correctly raised is that new entries don’t have their ELO change quickly enough, and should scale faster than established algos. To my understanding the team is currently pondering this issue. An example would be making the ELO reward or punishment for new algos 3x higher for their first 10 matches, or something similar.
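
To make that mechanism concrete, here is a minimal sketch of a standard logistic Elo update with a hypothetical provisional boost of that kind; the K values and the 10-match window are illustrative guesses, not Terminal’s actual parameters:

```python
# Illustrative sketch only -- not Terminal's actual implementation or constants.
# Standard logistic Elo update with a hypothetical provisional boost:
# 3x K-factor for an algo's first 10 matches.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard 400-point logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def k_factor(matches_played: int, base_k: float = 30.0,
             boost: float = 3.0, provisional_matches: int = 10) -> float:
    """Hypothetical provisional boost for new algos."""
    return base_k * boost if matches_played < provisional_matches else base_k

def update_rating(rating_a: float, rating_b: float,
                  a_won: bool, a_matches_played: int) -> float:
    """New rating for A after one match against B."""
    score = 1.0 if a_won else 0.0
    return rating_a + k_factor(a_matches_played) * (score - expected_score(rating_a, rating_b))

# Note: if the two players use different K-factors, the exchange is no longer
# strictly zero-sum, which is one side effect such a boost would introduce.
```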

3 Likes

That still doesn’t solve top algos inflating rapidly. Having new algos gain more early on is nice, but to fix the inflation, the top algos need smaller gains: either by tapering off elo gains while holding elo losses constant, or by tapering both. Personally I think elo gains should taper off at higher elos while losses stay constant, because that way better algos can still reach the top and kick the old ones out (see the sketch below).
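
A minimal sketch of that asymmetric idea, with made-up constants; this is just one possible shape for such a taper, not anything Terminal actually does:

```python
# Purely hypothetical taper: gains shrink at high elo, losses keep the full K.
# All constants are invented for illustration.

def k_for_result(rating: float, won: bool, base_k: float = 30.0,
                 taper_start: float = 2000.0, taper_per_point: float = 0.002) -> float:
    """K used for this result: full K on losses, tapered K on wins above taper_start."""
    if not won or rating <= taper_start:
        return base_k
    # Shrink the K used for gains linearly above taper_start, floored at 20% of base.
    return max(0.2 * base_k, base_k * (1.0 - taper_per_point * (rating - taper_start)))
```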

1 Like

The question is actually whether the ELO will keep rising like this, or will eventually settle somewhere. I think the new system will allow for a lot higher ELO, but it would still plateau. The question is where. Mathematically speaking there is no way that the ELO ratings can keep rising to infinity, because all the points a player gains are taken away from another player. There is only a limited pool of ELO points to grab from.

[edit] if you look at chess, the plateau is now somewhere around 2800, and AI players somewhere above 3000 because they don’t ever lose.

The changes to matchmaking simply show how superior the top algos are to their peers, something that wasn’t as visible previously when they weren’t matched against high-ELO algos. Previously, when every algo was matched against everybody, it was more important not to lose to any low-ranked algo (because of the spectacular penalty) than to compete with the high-level ones. This is now different.

[edit] It is very important, however, that the developers don’t sit this one out until the competition ends. Otherwise one cannot feasibly submit a new algorithm in the last three weeks, because it cannot rise to the top fast enough!

1 Like

Yes, there is a cap, but it’s not simply because these new algos are “x amount better.” The cap comes when the inflation spreads out the top algos so much that they have no matches left to play (400 elo higher than every other algo they haven’t played before). This would just slow the inflation, not stop it. Realistically, at some point it would take so long for a new algo to gain enough elo to play a match against the top algo that this effectively creates a “ceiling.” It’d be hard to calculate where this cap is, but you could assume it’s proportional to the number of new, good algos being uploaded and climbing to high enough ranks to boost the top dogs.

Again, the cap is not representative of how good new algos are. It’s just a consequence of the flawed matchmaking changes.

Edit: No theoretical cap. Just a realistic limit as to how high algos can climb in 2 months. The inflation should taper off after reaching some ridiculous number, like 3000-4000 (arbitrary guess).

2 Likes

Adding to this: The Elo system does not have a theoretical ceiling.

1 Like

This would be correct if there were one algo that beats everything, like the chess example I gave you. The rating of AlphaZero is 3750, while the highest rated human is Magnus Carlsen at 2882. The superiority shows. That is clearly not the case here. There are many competitors for the top 10 spots, as you can see from the leaderboard being quite volatile recently. This is not a game of “who rises faster”; people are genuinely finding new strategies to beat up the other strategies. Notice that the old top dogs are basically all gone now, all replaced by superior iterations.

I completely disagree with your statement that this is due to flawed matchmaking changes. I would actually turn it around: the previous matchmaking system was very flawed, and we got used to that, but this one is probably a lot more representative.

Your estimation of 3000-4000 is only possible if there are very few algos that win against literally every other algo. I don’t expect that to happen. Actually the ELO system is very well researched, so these things can easily be checked in the literature.

It has a practical ceiling, based on where it is applied and how.

[edit] if you look at 8’s application
https://876584635678890.github.io/
you can see that it is not that the existing algos keep rising, but that better algos take over. Take EMP^ERROR_v1.0 for example. It currently has a 2026 rating, so it stabilized under 2100 once better algos showed up. EMP^ERROR_v1.0 is already quite effective, but better algos should have a higher rating than that thing. Another example is Not_my_final_form.

and that is exactly what it was before, so nothing changed there. That algo peaked at 2069. I guess it dropped a bit.

2 Likes

The amount of data is fairly small, and there are a lot of factors that play into the success of different algos from the same user, but I will try to analyze it a bit anyway:


(PythonSmasher3DA.I_ algos; two are colored the same bright green because they are the same algo, uploaded twice)

The better version (in blue) overtook the older versions at some point (from 4PM to 4AM), then got overtaken by the “weaker” version again (3PM the next day to 4AM), and now the blue version is ranked higher again.
I am not sure we can conclude much from this, but I think we can at least say it does not show the elo rising simply because a “better” algo overtook the older one (both are still rising).


(EMP^ERROR_v1.0 vs KeKroepes_2.0 (cyan))

Just comparing these two suggests the newer version rose higher because it was better, since the old algo at least looks like it hit a ceiling; but that plateau lasted only a single day, which is too short to draw conclusions from.


(not_my_final_form vs MazeRunner)

I do not think we should try to draw any conclusions from this, because we have no data for the blue graph after the green one took over. It might just be that not_my_final_form showed a similar trend.

If that meant “It is stuck at exactly (about) 2000 elo.”, one could make @kkroep’s argument.


(sawtooth (brown) vs sawtoothV2)

These two graphs have two important features. The brown one appears higher in elo for the whole stretch of the blue straight line from 8AM to 3PM only because of the way the data is captured, i.e. no data was recorded for the blue algo during that time, so the chart does not actually show that its elo was lower than the brown one’s.
The same applies the other way around from 3PM to 9PM near the end, where no data was recorded for the brown graph.
What we can see is that both algos rose by a similar amount, since both graphs are lower on the left than on the right, which points to elo inflation.

As I said in the beginning, this is not enough data to draw conclusions, but the original image:

This should show this kind of development best (because it includes the most available data), and it points more towards elo inflation than towards there being stronger algos, which in my opinion we cannot make out at the moment.

1 Like

Yeah, this is what I am trying to point out too. Some people are saying: “look at this thing rising, it is going to keep going from now on, the matchmaking is flawed”. My guess is that the new matchmaking system will result in a different ELO ceiling, and that the current algos are now rising towards that new ceiling. Basically the ladder hasn’t had enough time to find a new stable point after the recent matchmaking changes. The reason it is rising at the pace we see here is that the top algos are limited by how many other algos near their rank they can take ELO points from.

The examples I gave were to illustrate different ways of looking at the data, like counter-examples. Point is, the behavior of ELO systems is very well researched. If you want to say that the ELO system is flawed, you need some serious mathematical guns and/or empirical evidence to back it up (more than is currently presented). And if you can, I recommend publishing a paper about it in a well-respected venue. If that is the case, PM me, I’ll be happy to help :slight_smile:. Disclaimer: I recently had a discussion with the developers over ELO so I actually looked up the literature… Saturday night well spent :nerd_face: :sob:

1 Like

We don’t know which algo is better, so we leave them both in. Sometimes one loses and it goes down a bit, but we haven’t come to a conclusion ourselves.

If it were as well researched as you suggest, it wouldn’t need changes, nor would it need such a patronizing defender.

These are ideas as to what’s causing the apparent inflation. It doesn’t take “mathematical guns” to make logical guesses, and these guesses don’t have to be right.

We’ll let the people running the show figure out what’s right and what’s wrong and adjust from there; this forum is for a discussion of ideas, not an argument over why “my idea is better than yours.”

You clearly misunderstand my point. I am not here to attack you personally, but rather to point out that the system is based on maths and is well researched, globally accepted and therefore not simply flawed, at least not in the way you are suggesting.

You are suggesting here that the matchmaking system is flawed because the ELO ceiling is different than it was before the matchmaking changes. This is a very bold statement to make. If you go online to look for info on ELO and matchmaking systems, there are many people much better at explaining this than I am, so I suggest you take a look if you are interested.

If you posted a comment about how the earth is flat, you’d also get a reaction from me that you might not like…

I am a strong believer in math, scientific research and constructive arguments. I will defend those ideals yes. If you find that offensive then so be it.

Other than faster scaling at the start, I don’t believe it needs changes. The system works, and is applied in many, many applications.

[edit] I read back the conversation, and I fail to see how you got so offended, other than me disagreeing with you and offering counter-arguments and examples. @Ryan_Draves can you point out to me which parts of this conversation triggered you so much as to start name-calling?

Hey everyone,

Last week we released a change intended to improve matchmaking by making matched algos’ Elos closer together on average, and never more than 400 points apart. We have not adjusted how we are calculating scores, only how we are making matches. I have not investigated this issue in detail (I just saw this thread now), but this behavior is reasonably in line with what I would expect.

Previously, algos would gain nearly no Elo when faced against another algo 400 points lower, and none at all at 500 points lower. This led to a dramatic slowdown in progress around 2000 Elo, as many of the matches being played were these pointless ones with no points actually being exchanged. This was most noticeable in that many top algos stayed on the board for weeks before falling, even when their creators introduced much better versions of those algos. Currently, I believe this is the reason algos plateaued around 2000 points.

Elo is a measure of standard deviation of winrates. An algo with 2000 Elo should win 64% of the time against an algo of 1900 Elo, 74% of the time against an algo of 1800 Elo, 85% against 1700, and 91% of the time against a 1600 algo. We believe the current state of the leaderboard does reflect this, but I will investigate on Monday.
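
For reference, a sketch of the textbook 400-point logistic curve that standard Elo uses; this is an assumption about the exact curve in play here, and it gives roughly 76% rather than 74% at a 200-point gap:

```python
# Textbook logistic Elo expectation -- assumed curve, not necessarily
# Terminal's exact implementation.
def win_probability(rating_diff: float) -> float:
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

for gap in (100, 200, 300, 400, 500):
    print(f"{gap:>3}-point gap -> {win_probability(gap):.0%} expected winrate")
# 100 -> 64%, 200 -> 76%, 300 -> 85%, 400 -> 91%, 500 -> 95%
```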

Note that the reason Elo is inflating at the bottom of the leaderboard is that there was an issue that was resolved pretty early on where more points were being lost than gained, leading to many algos dropping far lower than expected. These algos are now ‘self correcting’ back up to the appropriate level, and I expect the second lump in 8’s Terminal Leaderboard Elo Distribution chart to merge back into the first lump eventually.

There are some unique factors in our system, such as the frequency at which new players can enter, lose points, and then leave the system, and the fact that the ‘ceiling’ and ‘floor’ for algos are very far apart. What I mean by this is that for the Elo cap to reach, say, 3000, you need 2000 Elo players that nearly always beat average players, 2500 Elo players that almost always beat those, and 3000 Elo players that almost always beat those. I believe our leaderboard might reach the 3000 range, especially as top algos continue to improve.

We are working on systems that will help you reach your correct Elo in a shorter amount of time. Next week’s changes should lead to more matches being played overall, and a higher % of total games being played by players over 1500.

There is a ton of discussion here; let me know if I missed any points or if anything else needs to be addressed.

3 Likes

@kkroep I have actually found some evidence that better algos have taken over in the data:

Since there was new data captured for Truth_of_Cthaeh, we can see that it has really remained at its spot and MazeRunner has risen nonetheless.

1 Like

Just to add a little more data:

NMFF is sitting at 2050 elo right now (its last two matches played were losses). So it seems to have stabilized in a 2k-2.1k band.

MazeRunner is a similar algo to NMFF, but with 6 releases of development trying to address the weak spots of NMFF (with hopefully no regressions :wink: ). The intermediate versions were removed when MazeRunner went up, but NMFF got to stay since it was my top algo. So for a number of match-ups that NMFF loses, MazeRunner wins (and I’ve seen this behavior in the global competition). Therefore MazeRunner should have a higher elo ceiling than NMFF.

I don’t know what kind of ceiling MazeRunner will have, but it’s 17-2 over the last 19 matches played (that’s all the history Terminal makes available at the moment). Its oldest match in that history was on 2018-11-03T15:17:53.636Z. If 8’s chart is in Zulu time, it had 2120 elo when it started its 17-2 run and is currently at 2235.

Edit for more data: the elo range of its opponents in the past 19 matches is 1717 - 2043.

So much for NMFF’s elo stabilizing … It grew to the point of overtaking MazeRunner after MR had a few losses in a row today (NMFF has an elo of 2222 at the time of posting).

I’m not sure what this tells us anymore …

I’m guessing you win virtually every match against mazes, and have a pretty consistent response to other algos, so that gives you the top spot. The question is, would NMFF have won against the algos MazeRunner lost to? If not, it would have been simply lucky/unlucky matchmaking, and shouldn’t be stable?

NMFF (most likely) would not have won against the algos that beat MR. There’s one match that’s in question, but it wouldn’t account for much margin of error here. So at some point NMFF will face (and lose to) those same algos.

Perhaps it’s incorrect to think there will be elo stability, because of the continuous influx of new algos. There are around 400-500 uploaded each day (based on 8’s charts), so depending on matchups against those new algos, things will always be changing. NMFF and MR can beat most people’s first maze implementations, but they lose to a certain approach in maze behavior (which accounts for all three of the recent MR-vs-maze losses), and I’m seeing that approach more often in the global competition. So it definitely seems people are getting better at optimizing the maze-vs-maze matchup.

This is similar to what I pointed out about sawtooth vs sawtoothV2.

Basically, both algos experienced a similar trend in gaining elo, which for now indicates elo inflation.

1 Like

I bet top elo will be >2706 at the end of the month (without matchmaking changes).

:face_with_hand_over_mouth:

1 Like

Was just trolling old posts and found this guy. Unfortunately I didn’t see this while you guys were discussing it before, but I just wanted to add my 2 cents to the conversation because I probably have the most theoretical and empirical exposure to our elo system. Want to dispel any unrest you guys are experiencing.

One thing I want to throw out there immediately is that neither the mean nor median elo is inflating; in fact it’s deflating.

I’ve been monitoring it for the past few months, and the system is behaving as mathematically intended. Since the last elo bug was patched some time in October, the total elo among deleted and active algos has stayed constant, which is expected from a zero-sum system. An observed elo pool can appear to shift when people delete algos. An elo pool can shift up when an influx of bad algos is uploaded and then deleted: they give elo to other algos and then go away. The opposite can also happen (as is the case in our system). I hope this makes sense to you.
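
A toy illustration of that elo-pool point, with entirely made-up numbers:

```python
# Toy illustration only -- invented ratings, not real data.
ratings = [1500, 1500, 1500, 1500]          # four fresh algos enter at the default

# Two of them lose repeatedly to the others, handing over 200 points each.
# Every point gained by one algo is lost by another, so the total is unchanged.
ratings = [1700, 1700, 1300, 1300]
assert sum(ratings) == 4 * 1500

# The two losers are deleted. The remaining pool's mean is now 1700, not 1500,
# even though no inflation happened -- the donated points simply left with the donors.
active = ratings[:2]
print(sum(active) / len(active))            # 1700.0
```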

When you consider only the 6 active algos for every user, mean elo has trended down from 1488 to 1129 since October. Do not be alarmed. As @kkroep said, this is ok for elo systems, and attempts to correct it by shifting the default elo would be fruitless. Some chess leagues like the USCF have also seen mean elo drop into the 1100s.

So as was mentioned above, the reason the diagrams show an upward trend is that on average people’s top algos are getting better. With access to all the data I can assure you that overall elo is not inflating.

Mathematically speaking, elo can go arbitrarily high, but it will be balanced (zero sum) on the other end. As Ryan said, elo is an attempt to fit winrates to a normal distribution. Let’s say right now an influx of average people joins the game. This doesn’t affect the skill of the top 10, but what it effectively does is lower the standard deviation of the distribution. If the standard deviation of the distribution is lower, the top 10 must be more standard deviations above the mean to represent the same absolute skill (this is not precisely right, but it’s close enough for this post). The easy answer is that as long as the top players keep getting better relative to the average player, the elo of the top players will continue to go up. This is happening really fast right now because Terminal is pretty young, while a game like chess has been around for many centuries.
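
A rough numeric sketch of that standard-deviation argument, with made-up skill numbers:

```python
# Made-up skill numbers, purely to illustrate the z-score effect described above.
import statistics

# A small population with one standout player.
skills = [1, 2, 3, 4, 10]
z_before = (10 - statistics.mean(skills)) / statistics.pstdev(skills)

# An influx of average players: the standout's absolute skill is unchanged,
# but the population's spread shrinks, so they now sit more standard deviations
# above the mean -- which on an Elo-like scale means a higher rating.
skills += [4] * 20
z_after = (10 - statistics.mean(skills)) / statistics.pstdev(skills)

print(round(z_before, 2), round(z_after, 2))  # 1.9 4.24
```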

Hope this helps.

7 Likes