WN9 expected values method

This page describes the method used to generate per-tank expected values for WN9, and other parameters such as winrate and damage. See the WN9 description page for more information on the WN9 method, or the WN9 implementation page for details of how to implement WN9 metrics on a website or service.

Principle
Known and potential flaws
Performance metric interpretation
Tank balance interpretation
Data collection
Functional description
Example code and data

Principle

The goal of expected value methods is to generate tank performance statistics that are independent of population skill. The results can assist in understanding the relative performance of tanks, or help to determine a player's skill independent of the tanks they play.

When attempting to determine the relative capability of tanks, people usually start with averages. The biggest problem with averages is that different tanks have players with different skill, and the general solution is to consider each player's tanks relative to their overall performance.

For example, a primitive expected value method for winrate might subtract each player's overall winrate from the winrate of their tanks. Those offsets could then be averaged over the population to give a final +/- for each tank. The main flaw of this particular method is that the original player winrates are not tank-adjusted themselves. See "Metric Bias" below.
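As a rough illustration of that primitive method (not the method documented on this page), here is a minimal Python sketch. The input layout (one dict per account, mapping tank name to a battles/wins pair) and the unweighted average over players are assumptions made for the example.

```python
from collections import defaultdict

def primitive_winrate_offsets(accounts):
    """accounts: list of dicts mapping tank name -> (battles, wins).

    Returns tank name -> average of (tank winrate - player's overall winrate).
    """
    offset_sum = defaultdict(float)
    players = defaultdict(int)
    for tanks in accounts:
        total_battles = sum(b for b, _ in tanks.values())
        if total_battles == 0:
            continue
        overall = sum(w for _, w in tanks.values()) / total_battles
        for name, (battles, wins) in tanks.items():
            if battles == 0:
                continue
            # Offset of this tank from the player's own (unadjusted) winrate
            offset_sum[name] += wins / battles - overall
            players[name] += 1
    return {name: offset_sum[name] / players[name] for name in offset_sum}

# Made-up sample data
accounts = [
    {"IS": (100, 50), "IS-2": (100, 60)},
    {"IS": (200, 90), "IS-2": (200, 110)},
]
offsets = primitive_winrate_offsets(accounts)
```

Note that the "overall" winrate here is exactly the non-tank-adjusted baseline criticized above, which is why this approach is only a starting point.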

The method documented here attempts to improve on the following problems:

Recency bias

Players usually improve over time, and are much more likely to play some tanks later in their career. Some examples:

Higher tier tanks are much more likely to be played by better players.
Rare lines are more likely to be played later in a career. For example, players who played both the IS and IS-2 are much more likely to have played the IS first. This can vary by server.
Low tier premiums tend to have more experienced players than low tier non-premiums.

These effects make methods based on overall data near-useless. The WN8 method used a heuristic to guess whether tanks were played recently, but this had a limited effect on recency bias, and caused additional problems with heavily nerfed tanks. This method instead uses pure interval data.

Metric bias

Expected value methods work by comparing each player's general performance with their performance in individual tanks. This doesn't work well when the general performance metric is inadequate.

Some methods use winrate as a proxy for skill, which doesn't work because winrate is heavily dependent on average tier. For example, the IS-2 is much more likely than the IS to be played by players with a higher average tier, and so its relative performance will be overrated.

WN8 used the previous set of expected values to determine a player's performance. The WN9 method is similar, except that the starting point assumes that all tanks are equal, and the method is iterated (with improving expected values) until the results converge.

Stock bias

Some tanks are much less likely to be played stock than others. Premiums and tier 10s are obvious examples, but popular tanks (eg IS-3) are also much more likely to be played once elited than unpopular tanks (eg KV-4).

It's arguable whether this is a problem. For skill metrics, including stock bias can be "fairer" for most players, because they do play stock grinds. However, it also makes metrics much easier to pad by tank selection, and clouds attempts at tank comparison. This method removes stock bias by discarding battles played before a tank-specific quantity of XP is earned.

Crew skill bias

Some tanks are typically played with better crews than others. Low-tier tanks are frequently ground through without reaching 100%. Tier 10s are commonly played with very good crews, while popular elite tanks tend to be played with better crews than grind tanks.

This method doesn't correct for starting crew skill, mostly because there's no data to derive it from. However, it does adjust performances to a level of ~75k earned XP. This makes a huge difference to low-tier tanks, and a small difference to later tier grind vs elite tanks.

Variable skill scaling

In practice, tanks do not scale equally with skill. For example, good players tend to perform relatively well in tanks with high mobility, while bad players perform relatively well in slow tanks with strong hull armour. This means that methods such as WN8 that use a single expected value will never be accurate for players far from the target skill level. This was a known problem when WN8 was created, but the data available at the time wasn't sufficient to determine the skill scaling.

This method uses the available data to generate two values for each tank: A central expected value and a scale. Tanks with higher scale perform relatively well for good players. In practice, a straight line fit the results extremely well, so there was no reason to use anything more complex.

Metric drift

The WN8 method has a problem where the meaning of the expected values drifts over time. This method normalizes them so that the expected values always represent the performance of an average tier 10 player.

Known and potential flaws

Baseline crew skill assumption

As mentioned previously, the method doesn't correct for starting crew skill because there's no data to derive it. This is partly mitigated by adjusting to the point of 75k earned XP, but expected values should still be somewhat easier to achieve in low tier tanks given equal crew skills.

This may also have effects within tiers. Premium tanks and reward tanks won't necessarily be started with a similar crew to a standard line tank. Mid-tier "keeper" tanks may be played with restarted crews, or following tanks may be played with weaker than expected crews.

Universal skill assumption

This method assumes that "skill" is single-valued, and not heavily dependent on class or tank played. This assumption seems to hold together, although there's some evidence that SPG skill doesn't correlate as strongly with skill in other classes. A small adjustment has been applied accordingly.

A related problem is that most players of a tank may be "doing it wrong" at a range of skill levels. For example, players may make incorrect assumptions about the optimal playstyle or equipment for a tank, and that will drag down its expected values. This is particularly likely if a tank is best played without the elite equipment, or follows a tank with a very different playstyle. Note that the expected values are not strictly "wrong" here, but the skill->performance variance will be higher than usual.

Potential platoon bias

The method uses source data that's unfiltered for platooning. This is only a problem if high-skill players are more likely to platoon in some tanks and play solo in others.

If there is a bias, it would drag up the winrate scale value for tanks which are relatively popular in platoons (as opposed to solo), without any matching increase in the damage scale. Most of the winrate scales make sense, but possible outliers like the T-55A may be explained by players solo-grinding personal missions.

Performance metric interpretation

The method generates two values for parameters such as winrate, damage and frags, termed "exp" and "scale". Exp is the expected value for an average tier 10 player, while scale represents how easy it is to increase that parameter relative to the expected value. For example, tanks with a high damage scale increase their damage output faster in the hands of good players.

The exp and scale values can be used directly for tank-adjusted skill metrics:

skill = 1 + (param / exp - 1) / scale;

Where skill 1.0 = tier 10 average. Of course, this only adjusts for tank capabilities. Players can still increase their performance by playing at specific times of day or on easier servers. Players can also change their playstyle to increase one parameter without improving their results, which is what composite metrics such as WN9 are for.

The tank adjustment also assumes the following:

Tanks are always played elite.
Tanks are played with a typical starting crew and ~75k earned XP.
Tanks are played with a typical quantity of premium ammo & consumables.

There are no per-account adjustments for these parameters: In other words, free XP use, crew management and premium consumables are included in the definition of "skill".

Tank balance interpretation

Expected values can be used as an indicator of tank balance at different skill levels. However, expected winrate represents the practical ability of a tank to win games, and so it's strongly influenced by matchmaker rules:

It's generally much harder to win at higher tiers, because the players are better. Hence expected winrates can only be directly compared with other tanks from the same tier.
Scouts and arty are currently hard-matched, and so their winrates can only be compared against each other. Other stats such as damage can give you an indication of class-relative balance, however.
Matchmaking weight has a strong influence on winrate, because the matchmaker attempts to balance weights across teams. Newer mid-tier mediums and TDs are typically weighted as heavies, and this gives them winrates 2-3% lower than near-identical tanks with medium weight.
The matchmaker also attempts to match identical tanks across teams, so for example if there are three solo T-62As in a match, it's very unlikely that all three will be placed on the same team. This can have a biasing effect on popular tanks, as they're particularly likely to be matched against each other. A popular tank that's strong or has above-average players will be underrated by expected winrates.
Because tanks have higher influence when top tier, tanks that perform relatively well when top tier will generally get higher winrates than tanks that perform relatively well when bottom tier. This may not match expectations and user impressions. This is a general problem with all winrate-based metrics.
Scale values will also depend on matchmaker influence. For example, tier 10 tanks or tanks with preferential MM will have higher scale values. Scouts have lower influence, and so will have lower scale values.

Expected winrates alone will only tell you about a tank's capability for an average tier 10 player. For balance at other skill levels, you need to adjust using the scale, for example dmgexp*(1+0.3*dmgscale) for players around the unicum skill level.
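The adjustment above generalizes to any skill level by replacing 0.3 with (skill - 1). A minimal sketch, where the function name and the sample numbers are made up for illustration:

```python
def expected_at_skill(exp, scale, skill):
    """Expected parameter value for a player of the given skill.

    skill 1.0 is the tier 10 average; 1.3 reproduces the
    dmgexp*(1+0.3*dmgscale) example for unicum-level players.
    """
    return exp * (1.0 + (skill - 1.0) * scale)

# Hypothetical tank: 2000 expected damage, damage scale 1.1
unicum_damage = expected_at_skill(exp=2000.0, scale=1.1, skill=1.3)
average_damage = expected_at_skill(exp=2000.0, scale=1.1, skill=1.0)
```

Comparing two tanks at several skill values in this way shows where their relative balance crosses over, which a single expected value cannot show.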

Data collection

Data source

This method uses per-tank interval data, commonly the difference between two tanks/stats snapshots. Stock-grind filtering and crew skill adjustment work much better if multiple shorter diffs are used, so that each week's battles can be discarded or adjusted separately. I typically use 11 weekly snapshots to create 10 diffs. Note that your weekly diffs will need to store the starting XP as well as the XP diff.
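A minimal sketch of turning consecutive snapshots into weekly diffs that keep the starting XP. The snapshot layout and the field names ("battles", "xp", "damage") are placeholders for this example rather than the exact tanks/stats field names, and a tank missing from the earlier snapshot is simply assumed to start from zero.

```python
def weekly_diffs(snapshots):
    """snapshots: chronologically ordered list of dicts,
    tank_id -> {"battles": int, "xp": int, "damage": int} (cumulative).

    Returns one dict per interval: tank_id -> per-interval stats,
    including the XP at the start of the interval.
    """
    zero = {"battles": 0, "xp": 0, "damage": 0}
    diffs = []
    for before, after in zip(snapshots, snapshots[1:]):
        week = {}
        for tank_id, new in after.items():
            old = before.get(tank_id, zero)
            battles = new["battles"] - old["battles"]
            if battles <= 0:
                continue  # tank not played this week
            week[tank_id] = {
                "battles": battles,
                "damage": new["damage"] - old["damage"],
                "xp_start": old["xp"],
                "xp_diff": new["xp"] - old["xp"],
            }
        diffs.append(week)
    return diffs
```

With 11 snapshots this yields 10 diffs. Bad or inconsistent API data still needs to be filtered separately before the diffs are trusted.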

If using the tanks/stats API, data from the "random" section is preferred (requires &extra=random), although the "all" section also works for intervals because company battles are dead and clan battles are no longer added to it. Make sure that you filter down to necessary results with &fields.

Account choice

Pulling all accounts on a server is both unnecessary and wasteful, as the method is largely unaffected by sample bias and works best with high-activity accounts. I recommend only using accounts that average more than 5-10 battles per day. Around 300k high-activity accounts is a sufficient sample, and obtainable on EU or RU. Accounts that begin the interval with fewer than 2500 battles should be discarded, because they'll get different matchmaking in tiers 1-3, and their skill is likely to change rapidly.
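The account filter can be sketched as below. It assumes the activity threshold is measured over the sampled interval (the text doesn't say whether it's per interval or per career) and uses the lower bound of 5 battles per day; both constants and the function name are illustrative.

```python
MIN_START_BATTLES = 2500   # accounts below this get newbie matchmaking in tiers 1-3
MIN_BATTLES_PER_DAY = 5    # lower end of the suggested 5-10 range

def account_qualifies(start_battles, end_battles, interval_days):
    """Return True if the account is experienced and active enough to sample."""
    if start_battles < MIN_START_BATTLES:
        return False
    played = end_battles - start_battles
    return played / interval_days > MIN_BATTLES_PER_DAY
```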

I use data from EU, primarily. RU has more data, but also more botting. If some tanks are more frequently botted than others, it will distort the results. EU is also a good halfway point between the NA and RU tier & class distributions.

Sanity checks

When creating diffs, you'll need to consider various API glitches and "features":

When accounts are reset, tanks are not necessarily reset in the API until they're played.
The tanks/stats API occasionally returns bad or inconsistent data, such as cumulative parameters decreasing or win+loss+draw != battles. Current rate is typically 1 per 30,000 tanks/stats calls. If this happens to an account in a weekly diff, I discard it for the whole interval.
Occasionally, WG hide or reveal old tanks in the API. Hiding tanks breaks the inconsistent data check, and revealing tanks produces bad data (because they weren't actually played in the interval), so you need to detect both cases and ignore those tanks for the interval.
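The inconsistent-data check might look something like this sketch. The field names are placeholders for the example, and detecting hidden or revealed tanks is left out because it depends on how the snapshots are stored.

```python
CUMULATIVE_FIELDS = ("battles", "wins", "losses", "draws", "damage", "xp")

def snapshot_is_consistent(stats):
    """Check win+loss+draw == battles for one tank in one snapshot."""
    return stats["wins"] + stats["losses"] + stats["draws"] == stats["battles"]

def diff_is_valid(old, new):
    """Return False if either snapshot is inconsistent, or if any
    cumulative parameter decreased between the two snapshots."""
    if not (snapshot_is_consistent(old) and snapshot_is_consistent(new)):
        return False
    return all(new[field] >= old[field] for field in CUMULATIVE_FIELDS)
```

If any tank on an account fails the check in any weekly diff, the whole account can then be dropped for the interval, as described above.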

Functional description

This description assumes that you're using it to generate WN9 exp and scale values. For other parameters, see the notes under Adjustments for alternative parameters.

Step 1: Build the MM params

For this method, you'll need a list of tank IDs with the following parameters for each:

tier: Just the tank's tier.
mmrange: How many tiers above its own the tank sees in MM. Only used for WN9.
xpmin: Approx amount of XP required to make the tank usable. I only did this very roughly for WoT. You could do a better version automatically with the current API, although ideally you have some manual input to exclude unnecessary and expensive modules (eg. radio on M2 Medium).
xpscale: The approx performance difference given 3x the earned XP, ignoring stock grind. The derivation is too complex to detail here, but you can generally copy values from tanks with the same class/tier/premium status.
common: Marks whether a tank is used to set the definition of expected values. This value is currently only used for tier 10 tanks.

An example file containing these parameters is included below.

Step 2: Build interval data with grind filtering and crew skill adjustment

The midpoint of the earned XP is used to adjust each weekly diff. If it's below xpmin, that week's data for that tank is discarded because it's likely to be a stock grind. Otherwise, it's adjusted with a base-3 log and xpscale to be equivalent to the performance with 75k earned XP.
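The exact adjustment formula isn't given here, so the following is only one plausible reading: a sketch that assumes performance changes linearly by xpscale for every 3x change in earned XP, and that the returned factor multiplies that week's performance totals (not the battle count). The function name and the sample values are made up.

```python
import math

TARGET_XP = 75_000  # earned XP that all performances are adjusted to

def crew_adjustment_factor(xp_start, xp_diff, xpmin, xpscale):
    """Return a multiplier for one tank's weekly diff, or None to discard it.

    ASSUMPTION: linear-in-log3 scaling; the source only states that a
    base-3 log and xpscale are used.
    """
    midpoint = xp_start + xp_diff / 2
    if midpoint < xpmin:
        return None  # probably a stock grind, so discard this week
    # Each 3x shortfall in XP relative to the target adds xpscale
    return 1.0 + xpscale * math.log(TARGET_XP / midpoint, 3)
```

Under this reading, a week with a 25k XP midpoint and xpscale 0.06 is scaled up by 6%, a week at exactly 75k is unchanged, and weeks well beyond 75k are scaled down slightly.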

Note that this adjustment does not strictly put all tanks on a fair playing field, because it does not take the starting crew skill into account. Given equal crew skill, you can achieve slightly higher WN9 results in lower tier tanks. What it does do is adjust effectively between tanks that are often played elite, and tanks that are not.

Once the weekly diffs have been adjusted, they're added together by tank. WN9base values are then calculated for each tank with more than 5 battles, with the rest discarded. Finally, each account should have a list of played tanks each with a battle count and WN9.

The 5 battle limit is an arbitrary compromise. The WN9base calculation doesn't work well for very low battle counts, and tanks only played for a few battles are less likely to be played seriously.

Step 3: Regression mapping & normalization

The principle is to assume that players play all their tanks with the same "skill" during an interval. You can estimate that skill, and then for each tank, graph player skill vs tank performance. A straight line is fitted to each graph, representing the ability of players of different skills to perform in that tank. This data is then used to make a better estimate of skill, and the process is repeated until the results converge.

In testing, weighted least-squares regression was the most accurate method for the vast majority of tanks. The IS-3 had slightly lower error with least absolute deviation, but this was probably caused by old clan-inclusive data. Data points were weighted by the number of battles in that tank, which had far less error than other weighting strategies tested. A straight-line fit works extremely well in practice, and there's insufficient data for a higher-order fit. Early passes use a fit through 0,0 to make convergence more reliable.

After each regression pass, the results are normalized such that the average slope and intercept are 1 and 0 respectively for common tier 10 tanks. This sets the definition of the expected values. It's also valid to normalize the results only after the last pass, although this gives very slightly different results.
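A dependency-free sketch of a single pass: a battle-weighted least-squares line per tank, followed by a normalization that leaves the common tier 10 tanks with an average slope of 1 and an average intercept of 0 (so that performance equals skill for those tanks on average). The normalization shown, an affine remapping of the skill axis using unweighted means, is one way to achieve this and is an assumption, as are the function names.

```python
def weighted_line_fit(points, through_origin=False):
    """points: list of (skill, performance, battles). Returns (slope, intercept)."""
    if through_origin:
        # Early passes: force the line through 0,0
        sxy = sum(w * x * y for x, y, w in points)
        sxx = sum(w * x * x for x, y, w in points)
        return sxy / sxx, 0.0
    total = sum(w for _, _, w in points)
    mean_x = sum(w * x for x, _, w in points) / total
    mean_y = sum(w * y for _, y, w in points) / total
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in points)
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def normalize_fits(fits, common_t10):
    """fits: tank -> (slope, intercept). common_t10: tanks defining the scale.

    Remaps the skill axis so the average common tier 10 line becomes y = x.
    """
    mean_slope = sum(fits[t][0] for t in common_t10) / len(common_t10)
    mean_icpt = sum(fits[t][1] for t in common_t10) / len(common_t10)
    return {
        tank: (s / mean_slope, i - s * mean_icpt / mean_slope)
        for tank, (s, i) in fits.items()
    }
```

In a full run, the normalized lines would be used to re-estimate each player's skill from their tanks, and the fit repeated until the values stop changing.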

Step 4: Conversion from slope/intercept to exp/scale

To make it easier to compare the results directly, they're converted from the slope/intercept to exp/scale format with the following formulas:

exp = intercept + t10avg * slope
scale = slope * t10avg / exp

t10avg for WN9base is 1.0, by definition. If generating expected values for other parameters, you use the tier 10 average of that parameter instead.

"exp" is the value that an average tier 10 player would achieve if they played that tank, while "scale" indicates how well the tank scales with skill. A tank with 1.1 scale would have a 10% faster performance improvement with increasing skill than a tank with 1.0 scale. Calculating "skill" from these values looks like this:

skill = (tankstat - intercept) / (slope * t10avg)
hence skill = 1 + (tankstat / exp - 1) / scale
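The conversion and both skill formulas can be checked against each other with a short sketch. The function names and sample numbers are made up for the example.

```python
def to_exp_scale(slope, intercept, t10avg=1.0):
    """Convert a fitted line to the exp/scale format."""
    exp = intercept + t10avg * slope
    scale = slope * t10avg / exp
    return exp, scale

def skill_from_line(tankstat, slope, intercept, t10avg=1.0):
    return (tankstat - intercept) / (slope * t10avg)

def skill_from_exp(tankstat, exp, scale):
    return 1.0 + (tankstat / exp - 1.0) / scale

# Hypothetical fitted line for one tank (WN9base, so t10avg = 1.0)
exp, scale = to_exp_scale(slope=1.2, intercept=-0.1)
skill_a = skill_from_line(1.4, slope=1.2, intercept=-0.1)
skill_b = skill_from_exp(1.4, exp, scale)
```

Both routes give the same skill for the same tank result, which is a handy regression check when changing either representation.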

Step 5: Newbie MM, SPG and scout adjustments

At least on EU and RU, players with more than 2500 battles are preferentially matched against each other in tiers 1-3. This makes their performance far worse, as they no longer have as many seals to club. As this doesn't operate well at off-peak times (or on NA, probably), expected values are scaled to fit a merged population, using data from before newbie MM was introduced. This prevents low-tier padding in cases where the newbie MM is not operating.

Artillery skill scaling is slightly underrated by the method because artillery skill doesn't correlate that strongly with skill in other classes: Players who are good at other tanks are not necessarily good at artillery. Skill-scaling is increased by 5% to compensate. This effect probably won't change over time.

Since the "personal missions" began, artillery are also running well below their historical performance. This is probably because a large proportion of arty players are only playing them for the missions, and many of the missions require poor play. Expected values are adjusted up by 10% to compensate. This will need to be re-checked over time.

Scout tanks (light tanks with extended MM) are played significantly better by players who play >10% of their battles in them, but "occasional" scout players account for a large chunk of the data. The expected values are adjusted by +3% and the scaling by +5% to match the performance of more experienced scout players.
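The artillery and scout adjustments can be applied as simple multipliers, as in this sketch. It assumes the percentages are multiplicative and uses the WN9 values quoted above; the newbie MM scaling is omitted because its factors aren't listed here, and the function name and flags are illustrative.

```python
ARTY_EXP_FACTOR = 1.10     # personal-mission compensation
ARTY_SCALE_FACTOR = 1.05   # weaker correlation with skill in other classes
SCOUT_EXP_FACTOR = 1.03    # match experienced scout players
SCOUT_SCALE_FACTOR = 1.05

def apply_class_adjustments(exp, scale, is_arty=False, is_scout=False):
    """Return (exp, scale) after the WN9 artillery/scout adjustments."""
    if is_arty:
        exp *= ARTY_EXP_FACTOR
        scale *= ARTY_SCALE_FACTOR
    if is_scout:
        exp *= SCOUT_EXP_FACTOR
        scale *= SCOUT_SCALE_FACTOR
    return exp, scale
```

These factors apply to WN9 and similar zero-based parameters only. Winrate needs much smaller adjustments.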

Step 6: Replace tanks with low battle counts

To improve the quality of the data, you can merge results from earlier expected value runs for tanks with low battle counts. I currently use the following weighting formula:

oldweight = MIN(oldbattles*2, 40000)
mergedexp = (oldexp*oldweight + newexp*newbattles) / (oldweight + newbattles)
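The same weighting as a small Python function, with made-up example numbers. The same merge would presumably be applied to scale as well as exp, though only exp is shown in the formula.

```python
def merge_expected(oldexp, oldbattles, newexp, newbattles):
    """Blend an earlier expected value with a new one.

    The old run counts double per battle, capped at a weight of 40000,
    so well-sampled new data always dominates.
    """
    oldweight = min(oldbattles * 2, 40000)
    return (oldexp * oldweight + newexp * newbattles) / (oldweight + newbattles)

# Old run: 30k battles (weight capped at 40000); new run: 10k battles
merged = merge_expected(oldexp=1.0, oldbattles=30000, newexp=1.2, newbattles=10000)
```

With these numbers the old value still carries 80% of the weight. With 200k new battles it would carry under 17%.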

Some tanks will still not have enough data for a reliable straight-line fit. ~10k battles is a reasonable cut-off, but you can also identify them by unusual scale values for their class & tier. Tanks with a similar playstyle should generally have a similar scale.

In some cases you can take these tanks from servers where they're more popular, while in others you can copy the result from a near-identical tank. Otherwise you may need to select a similar tank, and/or adjust the values manually. Some current examples:

A-32	copy from A-20
KV-220 test	copy from KV-220
Pz V/IV	adjust from RU
Pz V/IV alfa	copy from Pz V/IV
Awfulpanther	old values
T-44-122	guessed from RU median
ISU-130	Sturer Emil values

An example Excel sheet for handling these operations is provided below.

Adjustments for alternative parameters

The same basic method can be used to generate other expected values, such as winrate, damage and spots. The following adjustments may be required:

A real tier 10 average should be substituted for the t10avg param. You can get most of these from the WN9 tier average table.
Winrate needs adjustments because it's not zero-based. Wherever winrate or expected winrate is directly scaled, it should be scaled much less than WN9 (or damage). Suitable values are noted in the code for the crew skill adjustment, arty adjustment and newbie MM adjustment.
The full regression method works poorly for parameters that are not strongly skill-dependent, such as spots and defence. For these parameters, I just use a fit through zero and the weighted mean, as with the first two passes of WN9.
Winrate sometimes needs additional zero-fit passes to converge correctly.

Example code and data

Example code & data for generating WN9 expected values. Notes are included for modifications to generate other expected values.