sudo make

Evaluating scoring models from a business perspective

Scoring model

Many businesses use binary classification in their operations, typically to differentiate "good events" from "bad events". Generally these algorithms first assign a number between 0 and 1 (a score) to an event, and then decide how to interpret this score according to the business logic.

Collectively, we can call all models that fall into this "good" vs. "bad" category scoring models. Depending on the number of hyperparameters, at deployment time we may have to choose from hundreds or even thousands of candidate models. And in most cases there's no single metric that would reliably tell us which one is better.

But wait, what about AUC?

AUC, while a good all-around metric for evaluating models in theory, can be deceptive when it comes to production. Let's say you have two models, A and B, with AUC(B) > AUC(A):


The ROC curves intersect at ~0.1. Now, the AUC for Model B is actually higher than for Model A, but if the events that get scores >= 0.1 are already classified as positives, this improvement doesn't convert into a more useful model. In fact, it makes things worse where it matters, with scores < 0.1!
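To make the point concrete, here is a toy sketch (made-up labels and scores, not the data behind the curves above) in which Model B has the higher AUC, yet Model A is better on the few top-scored items a business would actually act on. The helper names `auc` and `precision_at_top` are illustrative, and AUC is computed directly from its pairwise-ranking definition.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_top(labels, scores, k):
    """Share of true positives among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Toy validation set: 4 positives, 4 negatives (invented for illustration).
labels = [1, 1, 0, 0, 0, 0, 1, 1]
# Model A nails the top of the ranking but buries two positives at the bottom.
scores_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
# Model B ranks better overall but puts a negative in first place.
scores_b = [0.8, 0.7, 0.9, 0.4, 0.3, 0.2, 0.6, 0.5]

for name, scores in (("A", scores_a), ("B", scores_b)):
    print(name, "AUC:", auc(labels, scores),
          "precision in top 2:", precision_at_top(labels, scores, 2))
```

If only the two highest-scored events are acted upon, Model A is the one you want, even though its AUC is lower.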


One way to approach this is to keep track of the threshold for previous models and compare AUC below that point. Most of the time it will get the job done - and you will have your automated model selection. However, many data scientists find it really painful to even mention things like AUC in business meetings because of how non-intuitive they are. Here's a simpler way.

First, make an estimate. How many "positives" are you expecting to get? For most settings this is very reasonable, because while this ratio fluctuates slightly, you won't be too far off if you take an average value. Let's say you have N% of positives on average.

Score the entire validation set and take the top N% point, completely ignoring the AUC. The score at that point is your new threshold, and you can immediately calculate true positive rate, false positive rate, F1 and any other threshold-dependent metric you want. What's important here is that your metrics achieve the following:

  1. They evaluate how well your model corresponds to reality
  2. They give you an idea of how moving the threshold up or down would affect the quality in terms of business logic errors - any misclassification based on our "rating" directly corresponds to added/lost value.
  3. If you are building this model as a pilot project, by limiting the rating to a few top-scored items you get a good estimate of added value.
  4. All metrics instantly become interpretable.
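A minimal sketch of the thresholding step described above, using only the Python standard library. The function names and the toy labels and scores are assumptions made for illustration; `fraction` plays the role of N% expressed as a number between 0 and 1.

```python
def top_fraction_threshold(scores, fraction):
    """Score of the last item that still falls inside the top `fraction`."""
    k = max(1, round(len(scores) * fraction))
    return sorted(scores, reverse=True)[k - 1]


def threshold_metrics(labels, scores, threshold):
    """Confusion counts and a few derived metrics at a given threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold  # ties at the threshold count as positive
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Toy validation set with 3 positives out of 10, so we expect about 30%.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.92, 0.85, 0.70, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]

threshold = top_fraction_threshold(scores, 0.3)
metrics = threshold_metrics(labels, scores, threshold)
print("threshold:", threshold)
print(metrics)
```

Each count reads directly in business terms: here the top 30% contains two real positives and one false alarm, and one real positive was missed. Note that tied scores at the threshold can flag slightly more than N% of the items.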

Unintended consequences

One of the coolest features of this evaluation method is that it doesn't crash and burn when you add new data over time. Depending on your pipeline, new data can introduce small changes and gently nudge a model toward new trends, or it can change everything at once, doubling the size of your training set and introducing new patterns into your model.

A good example of such behavior is adding a history of orders from a big client into a churn prediction model. Their own patterns will be very strong in your dataset and while you can benefit greatly from bringing together data from multiple sources, they will change the distribution of data significantly.

Any of this would break threshold-dependent metrics and make you spend a lot of time trying to figure out what happened, calculating:

  1. New thresholds
  2. New baseline AUC
  3. New baseline TPR, F1, etc.

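But if the score of sample X was lower than the score of sample Y before adding new data, it is most likely to stay lower afterwards. The ranking survives even when the absolute scores shift, so a threshold defined as the top N% keeps its meaning.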
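Whether the ordering really survives a retrain is something you can check on your own data. The sketch below is one possible way to do it, with invented scores and hypothetical helper names: it measures how many pairs of samples keep their relative order, and how much the top-k set overlaps, between an old and a new version of a model.

```python
from itertools import combinations


def pairwise_order_agreement(old_scores, new_scores):
    """Fraction of sample pairs whose relative order is the same in both runs."""
    agree = total = 0
    for i, j in combinations(range(len(old_scores)), 2):
        old_diff = old_scores[i] - old_scores[j]
        new_diff = new_scores[i] - new_scores[j]
        if old_diff == 0 or new_diff == 0:
            continue  # skip ties, they carry no ordering information
        total += 1
        agree += (old_diff > 0) == (new_diff > 0)
    return agree / total if total else 1.0


def top_k_overlap(old_scores, new_scores, k):
    """Share of the old top-k items that are still in the new top-k."""
    def top(scores):
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return set(order[:k])
    return len(top(old_scores) & top(new_scores)) / k


# Invented scores for the same five samples before and after adding new data.
old = [0.9, 0.7, 0.5, 0.3, 0.1]
new = [0.6, 0.5, 0.2, 0.25, 0.05]  # everything shifted down, one pair swapped

print("pairwise agreement:", pairwise_order_agreement(old, new))
print("top-2 overlap:", top_k_overlap(old, new, 2))
# A fixed threshold of 0.5 flags a different number of items in each run.
print("flagged at 0.5:", sum(s >= 0.5 for s in old), "vs", sum(s >= 0.5 for s in new))
```

In this toy case the fixed threshold flags three items before and two after, while the top-2 selection is unchanged.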

Model selection

All this brings us to a better model selection algorithm. In the simplest form:

  1. Estimate the expected ratio of positive samples - N%
  2. Train multiple models with different hyperparameters
  3. Compute the scores on a validation set for each model
  4. Choose the negative/positive threshold using top N% of the produced scores
  5. Order the models by any metric of your choice using obtained thresholds

This method, being more robust and interpretable, can be very beneficial for tuning models in a startup, when all you (and your first customers) care about is expressed not in probabilities, but in revenue/conversion rates.

About Roman Trusov