How rankings work

Your rating, explained.

Every player has a skill distribution, not a single number. Here's what that means, why we picked TrueSkill over Elo, and the math we use under the hood.

TLDR

Two numbers per player: μ and σ. Mu is our best guess at your skill. Sigma is how unsure we are.
The rating you see is conservative. It's (μ − σ) × 48. New players start at (25 − 8.33) × 48 ≈ 800.
Each match is a Bayesian update. Winning shifts μ up, losing shifts it down, and σ shrinks either way — the system gets more confident.
Beating a stronger opponent moves your rating more, and the more uncertain your own rating is, the further it moves. Beating a known weaker player barely moves it.
Teams just sum. Team μ = sum of player μs, team σ² = sum of player σ²s.

See it in action

Two players. You are new, with a wide curve and low confidence. Sam has played a few games. Here is where both ratings start.


You: rating 800 ±400

Sam: rating 1200 ±144

You're new (μ=25, σ=8.3 — wide curve, the system has no idea how good you are). Sam has played a few games.

What to notice: your first win bumps the rating a lot because σ was huge. Subsequent wins move you less, because the system is now more confident. Sam barely moves because his σ is small (he was already well-known).

Why TrueSkill, not Elo?

Elo is great for chess (1v1, big sample sizes, slow rating drift). It's awkward for everything else.

Elo doesn't model uncertainty. A new player and a 200-game veteran with the same rating are treated identically. They shouldn't be.
Elo doesn't natively support teams. 2v2 and 5v5 require ad-hoc extensions, none of which agree with each other.
Elo's K-factor is a hack. You manually tune how much each game moves a rating. TrueSkill computes this for you, per player, per match.
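For contrast, here is a minimal Python sketch of the textbook Elo update. The function names are illustrative, and K = 32 is a commonly used value picked here only as an assumption; it is exactly the knob that has to be tuned by hand.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return (new_winner, new_loser). k is the hand-tuned K-factor (assumed value)."""
    gain = k * (1.0 - elo_expected(winner, loser))
    return winner + gain, loser - gain


# A brand-new player and a 200-game veteran who both sit at 1500 get the
# same adjustment, because the update only sees the two rating numbers.
new_winner, new_loser = elo_update(1500.0, 1500.0)
print(new_winner, new_loser)  # 1516.0 1484.0
```

Nothing in this update knows how many games either player has behind them, which is the gap σ fills in TrueSkill.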

TrueSkill was designed by Microsoft Research for Xbox Live matchmaking — a system that has to onboard new players quickly, handle teams, and converge fast for millions of users. We use it because the same constraints apply to a 12-person office league.

The two numbers

Each player has a (μ, σ) tuple. Together they describe a normal distribution over the player's "true" skill — what they would average over an infinite number of games against perfectly-matched opponents.

μ (mu) — the peak of the distribution. Our point estimate of the player's skill. Default: 25.0.
σ (sigma) — the spread. How much we'd bet against our own estimate. Default: 25/3 ≈ 8.333.

The number you see on a leaderboard is displayRating = (μ − σ) × 48. The − σ bias means we err on the side of "not yet proven." The × 48 scale just makes ratings look like familiar Elo numbers (new players ≈ 800, strong veterans ≈ 1500+).
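As a small Python sketch of that formula (the constant and function names are illustrative, not taken from our codebase):

```python
DEFAULT_MU = 25.0
DEFAULT_SIGMA = 25.0 / 3.0  # about 8.333
DISPLAY_SCALE = 48.0


def display_rating(mu: float, sigma: float) -> float:
    """Conservative leaderboard rating: one sigma below the mean, then scaled."""
    return (mu - sigma) * DISPLAY_SCALE


# A new player: (25 - 8.333...) * 48, which is 800 up to floating-point rounding.
new_player = display_rating(DEFAULT_MU, DEFAULT_SIGMA)

# Same mu, but a hypothetical veteran with sigma = 2: (25 - 2) * 48 = 1104.
veteran = display_rating(DEFAULT_MU, 2.0)

print(round(new_player), round(veteran))
```

Two players with the same μ can sit hundreds of points apart on the leaderboard purely because one of them is better known to the system.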

We also expose a stability percentage — how much variance we've eliminated since you started. Against an opponent of similar skill it lands roughly at:

~26% after 1 game
~59% after 5 games
~71% after 10 games
~80% after 20 games

The exact numbers depend on who you play: facing well-known opponents shrinks σ faster than facing other newcomers, because a low-σ opponent makes the result more informative. It's computed from σ², not σ, because Bayesian updates operate on variance: each match removes a fraction of the remaining uncertainty.
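One way to read "variance eliminated since you started" is 1 − σ² / σ₀², where σ₀ is the default starting σ. The Python sketch below assumes that reading; the function name and the clamp at zero are illustrative choices.

```python
DEFAULT_SIGMA = 25.0 / 3.0  # starting sigma for every player


def stability(sigma: float, start_sigma: float = DEFAULT_SIGMA) -> float:
    """Fraction of the starting variance that has been removed, in [0, 1].

    Assumed formula: 1 - sigma^2 / start_sigma^2, clamped at 0 so that a
    sigma slightly above the starting value does not go negative.
    """
    return max(0.0, 1.0 - (sigma * sigma) / (start_sigma * start_sigma))


fresh = stability(DEFAULT_SIGMA)       # nothing eliminated yet
halved = stability(DEFAULT_SIGMA / 2)  # sigma halved, variance quartered

print(f"{fresh:.0%} {halved:.0%}")
```

Halving σ already counts as 75% stability under this definition, because variance falls with the square of σ.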

How a match updates your rating

For a winner W and loser L, the update is:

// dynamics inflation (applied first, to both players)
σ²  ←  σ² + τ²

// total uncertainty in this match
c² = σ_W² + σ_L² + 2β²

// "exceeds-margin" V/W functions over the standard normal
t = (μ_W − μ_L) / c
v = φ(t) / Φ(t)        // pdf / cdf
w = v · (v + t)

// new ratings
μ_W' = μ_W + (σ_W² / c) · v
σ_W' = √( σ_W² · (1 − w · σ_W² / c²) )

μ_L' = μ_L − (σ_L² / c) · v
σ_L' = √( σ_L² · (1 − w · σ_L² / c²) )

Key intuitions:

β (beta) = 4.167 is per-match performance noise. Even good players sometimes lose. β controls how surprised the system gets when an underdog wins.
τ (tau) = 0.0833 is a tiny inflation applied before every match: τ² is added to each player's σ². It prevents σ from collapsing to zero, so ratings stay responsive over time even after many games.
The (σ² / c) factor means players with high σ get bigger updates. New players move fast; veterans move slowly.
We don't model draws — DRAW_PROBABILITY = 0. Most of our games don't have meaningful ties.
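The update above can be sketched in Python as follows. This is a direct transcription of the formulas under the stated constants, not our production code; the function names are illustrative, and Sam's (μ = 28, σ = 3) is an assumption inferred from his demo rating of 1200 ±144, reading the ± as σ × 48.

```python
import math

MU0 = 25.0
SIGMA0 = 25.0 / 3.0  # 8.333
BETA = 25.0 / 6.0    # 4.167, per-match performance noise
TAU = 25.0 / 300.0   # 0.0833, dynamics inflation


def pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rate_1v1(winner, loser):
    """winner and loser are (mu, sigma) tuples; returns the updated pair."""
    (mu_w, sigma_w), (mu_l, sigma_l) = winner, loser

    # Dynamics inflation is applied first, to both players.
    var_w = sigma_w ** 2 + TAU ** 2
    var_l = sigma_l ** 2 + TAU ** 2

    c2 = var_w + var_l + 2.0 * BETA ** 2
    c = math.sqrt(c2)

    t = (mu_w - mu_l) / c
    # Note: cdf(t) underflows to 0 for extremely negative t; realistic
    # rating gaps stay far away from that regime.
    v = pdf(t) / cdf(t)
    w = v * (v + t)

    new_winner = (mu_w + var_w / c * v, math.sqrt(var_w * (1.0 - w * var_w / c2)))
    new_loser = (mu_l - var_l / c * v, math.sqrt(var_l * (1.0 - w * var_l / c2)))
    return new_winner, new_loser


# The demo matchup: a brand-new You beats Sam.
you, sam = rate_1v1(winner=(MU0, SIGMA0), loser=(28.0, 3.0))
print(you, sam)
```

With these inputs You gains roughly 6 points of μ while Sam loses less than 1, and both σ values shrink. The only difference between the two players' steps is the σ² / c factor.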

Teams

For team matches we treat the team as a single distribution:

μ_team  = Σ μᵢ        (sum of individual mus)
σ²_team = Σ σᵢ²       (sum of individual variances)

This is the standard TrueSkill assumption that a team's performance is the sum of its members'. A team with one ringer and three rookies has high μ and high σ — strong on average, but unpredictable. A balanced team of mid-level players has moderate μ and lower σ.

After the match, each player's update is proportional to their share of the team's total variance. That means the high-σ rookies absorb more of the rating change than a well-established, low-σ ringer; how much of the team's μ a player contributed does not affect the size of their update.
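A minimal Python sketch of a two-team update follows. It assumes the standard two-team TrueSkill form, c² = Σ σᵢ² over everyone in the match + n·β² with n the total number of players (which reduces to the 1v1 formula when n = 2), and applies each player's own σᵢ² / c factor. The function names and the example roster are hypothetical.

```python
import math

BETA = 25.0 / 6.0   # per-match performance noise
TAU = 25.0 / 300.0  # dynamics inflation


def pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rate_teams(winners, losers):
    """winners and losers are lists of (mu, sigma); returns updated lists."""
    win_vars = [s * s + TAU * TAU for _, s in winners]
    lose_vars = [s * s + TAU * TAU for _, s in losers]

    # Team skill is the sum of member mus; team variance is the sum of variances.
    mu_win = sum(m for m, _ in winners)
    mu_lose = sum(m for m, _ in losers)

    n_players = len(winners) + len(losers)
    c2 = sum(win_vars) + sum(lose_vars) + n_players * BETA ** 2  # assumed form
    c = math.sqrt(c2)

    t = (mu_win - mu_lose) / c
    v = pdf(t) / cdf(t)
    w = v * (v + t)

    def update(players, variances, sign):
        # Each player's step scales with their own variance.
        return [
            (mu + sign * var / c * v, math.sqrt(var * (1.0 - w * var / c2)))
            for (mu, _), var in zip(players, variances)
        ]

    return update(winners, win_vars, +1.0), update(losers, lose_vars, -1.0)


# Hypothetical roster: one established ringer plus three rookies
# beat four mid-level regulars.
ringer_team = [(35.0, 2.0)] + [(25.0, 25.0 / 3.0)] * 3
regulars = [(27.5, 3.0)] * 4
new_winners, new_losers = rate_teams(ringer_team, regulars)

ringer_gain = new_winners[0][0] - 35.0
rookie_gain = new_winners[1][0] - 25.0
print(ringer_gain, rookie_gain)
```

In this sketch the rookies' μ moves far more than the ringer's, because the step depends on each player's own σ² rather than on their μ.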

What we don't capture

To be honest about the limitations:

Score margin. Winning 11-1 and winning 11-9 are equivalent. The model only sees who won.
Draws. The draw probability is set to zero, so tied matches need to be resubmitted with a clear winner or thrown out.
Time decay. If you stop playing for six months, your σ stays put. Some rating systems inflate σ over time; we don't.
Per-match streakiness. β is fixed, so the model assumes the same noise level for every player. In reality, some players are more consistent than others.

None of these are fatal for a friend-group league. They're worth knowing if you ever wonder why a particular result felt off.

Read the paper

TrueSkill™: A Bayesian Skill Rating System (PDF) — Herbrich, Minka, Graepel (NeurIPS 2006). The original Microsoft Research paper. Math-heavy but the introduction is approachable.