Part 2 in a series: Model Performance Measurement

This article is my second in a series about targets in modelling. This time, we will discuss model performance targets for different machine learning problem setups and why they are important to businesses that rely on data-driven techniques to make decisions.

In the previous entry, we used the example of a spam filter to illustrate a conceptual model design. In that example, one of the steps involved a spam prediction model using a combination of rules and machine learning techniques built on historic data collected over a period of time. The output from this process, a signal determining whether or not the e-mail was indeed spam, is used to drive the filtering decision. This is a simple example, but it demonstrates an important point: when we rely on the output of models to make decisions, it is crucial to achieve a good level of confidence in that output. We can measure this confidence, or reliability, by setting targets for model performance.

The most common approach to validating the performance of a machine learning model is to measure the deviation between the output of the model and real-world feedback (commonly referred to as the “ground truth”).

There are three steps to consider when setting such performance targets:

  1. Defining the model outcome  
  2. Choosing appropriate metrics that will fairly measure deviation 
  3. Defining model success criteria as the maximum deviation to tolerate 

Together, the model outcome and the ground truth (when available) determine what the performance of a machine learning model is measured against.

We are not restricted to using a single metric at a time; in fact, a combination of metrics can sometimes ensure a more balanced model. What matters is that the performance metric chosen is highly representative of the problem being solved, fits the nature of the dataset and is interpretable by the model’s stakeholders.

Speaking of fitting the metric to the problem, the most common types of machine learning problems faced are:

  • Regression problems: predicting a numeric value, such as house prices
  • Classification problems: predicting a category or outcome, such as whether an e-mail is spam
  • Clustering problems: grouping similar entities together, such as grouping television shows by genre

The most straightforward performance measures for regression problems are mean squared error (MSE) and a related measure, R-squared (the coefficient of determination). MSE averages the squared distances between a set of predicted and actual outcomes, indicating how well a given model fits the data. R-squared expresses the proportion of variance in the data that the model explains.
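As a rough illustration of the two regression measures, here is a minimal sketch in plain Python. The house prices are made-up numbers and the function names are my own, chosen only for this example.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Proportion of the variance in `actual` explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total
    return 1 - ss_res / ss_tot


# Hypothetical house prices (in thousands), purely for illustration.
actual_prices = [200, 250, 300, 350]
predicted_prices = [210, 240, 310, 340]

mse = mean_squared_error(actual_prices, predicted_prices)
r2 = r_squared(actual_prices, predicted_prices)
print(f"MSE: {mse:.1f}, R-squared: {r2:.3f}")
```

A lower MSE means predictions sit closer to the actual values, while an R-squared closer to 1 means more of the variation in prices is explained by the model.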

Classification problems concern themselves with the number of times the model guessed the outcome correctly. More formally, this is accuracy: the percentage of data that the model has correctly classified out of the total population. Variations on this approach can tailor error measurement to specific class outcomes, which is helpful in cases where datasets are imbalanced (e.g. a highly unbalanced ratio of spam to non-spam e-mail). Supporting metrics such as F1 scores and ROC curves are also commonly used in these cases.
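To see why the share of correctly classified data (accuracy) can mislead on imbalanced data, consider this small sketch. The 5-in-100 spam ratio and the "always predict not spam" baseline are assumptions made up for the example.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical inbox: 1 = spam, 0 = not spam; only 5 of 100 e-mails are spam.
labels = [1] * 5 + [0] * 95

# A useless "model" that never flags anything as spam.
always_not_spam = [0] * 100

baseline_accuracy = accuracy(labels, always_not_spam)
spam_caught = sum(1 for a, p in zip(labels, always_not_spam) if a == 1 and p == 1)

print(f"Accuracy: {baseline_accuracy:.0%}, spam caught: {spam_caught} of 5")
```

The baseline looks strong on accuracy even though it never catches a single spam e-mail, which is why class-specific measures are worth adding on imbalanced datasets.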

Let’s use the spam filter example to demonstrate the different choices we can make with metrics beyond the accuracy score. If our objective is to detect spam, we would want the model to be broad enough to avoid missing potential spam but precise enough to ensure that the e-mails it flags are truly spam. Balancing these two characteristics, known as recall and precision respectively, can be complex without the right metrics. In this case, a more specialised metric is needed: the F1 score, which is the harmonic mean of precision and recall. For more nuanced use cases, it might be preferable to favour precision over recall, or vice versa (e.g. reducing the risk of filtering out actual e-mails at the cost of letting some spam through). In these cases, a weighted variant of the F1 score, called F-beta, can be used to allocate more weight to either precision or recall.
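The sketch below shows how precision, recall and the F-beta score relate, using the standard F-beta formula. The labels and predictions are invented for illustration, and the helper names are hypothetical.

```python
def precision_recall(actual, predicted, positive=1):
    """Precision and recall for the `positive` class (here: spam)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical results: 1 = spam, 0 = not spam.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
f1 = f_beta(precision, recall, beta=1.0)
f_half = f_beta(precision, recall, beta=0.5)  # leans towards precision
f2 = f_beta(precision, recall, beta=2.0)      # leans towards recall

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F1={f1:.3f} F0.5={f_half:.3f} F2={f2:.3f}")
```

Because this made-up model is more precise than it is thorough, the precision-leaning F0.5 score comes out higher than F1, and the recall-leaning F2 score comes out lower.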

Measuring error in clustering problems is complicated by the fact that clustering is an unsupervised type of machine learning problem, i.e. one that does not need real-world feedback to operate. Model performance can instead be evaluated by looking at how cohesive the clusters are (tight grouping inside clusters) and how well-separated the clusters are from each other. Common metrics are the Gap Statistic and the Silhouette Coefficient, which also help us to determine how well-formed the clusters are and what the optimal number of clusters for a dataset should be.
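As an illustration of cohesion and separation, here is a minimal sketch of the Silhouette Coefficient in plain Python, using Euclidean distance. The points and cluster assignments are invented, and scoring a point as 0 when it has no cluster-mates (or when there is only one cluster) is a simplifying assumption of this sketch.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            scores.append(0.0)  # assumption: no cluster-mates or no other cluster
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Two visually obvious groups of hypothetical 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

good_score = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the groups
poor_score = silhouette_score(points, [0, 1, 0, 1, 0, 1])  # mixes the groups

print(f"good clustering: {good_score:.2f}, poor clustering: {poor_score:.2f}")
```

Scores range from -1 to 1. Repeating the calculation for different numbers of clusters and comparing the scores is one common way to pick a cluster count.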

Finally, any machine learning work done in practice should be subject to reasonable business acceptance rules, such as benchmarks or success criteria for the metric. This target should be set after a proof of concept has provided a reasonable level of confidence in the modelling approach. Good success criteria should ensure that business objectives are achieved, but they should also be feasible for the model to reach.
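One simple way to make success criteria explicit is to write them down as thresholds and check measured metrics against them. The sketch below assumes hypothetical metric names and threshold values. In practice these would come out of the proof of concept and the business discussion.

```python
def failed_criteria(metrics, criteria):
    """Return the names of the success criteria that the metrics do not meet.

    `criteria` maps a metric name to ("min", threshold) or ("max", threshold).
    """
    failures = []
    for name, (kind, threshold) in criteria.items():
        value = metrics[name]
        passed = value >= threshold if kind == "min" else value <= threshold
        if not passed:
            failures.append(name)
    return failures


# Hypothetical targets for the spam filter example.
criteria = {
    "f1": ("min", 0.80),
    "precision": ("min", 0.95),
    "false_positive_rate": ("max", 0.02),
}

# Hypothetical measurements from a validation run.
metrics = {"f1": 0.82, "precision": 0.91, "false_positive_rate": 0.01}

failures = failed_criteria(metrics, criteria)
print("Model accepted" if not failures else f"Criteria not met: {failures}")
```

Keeping the criteria in one place like this makes it easier for stakeholders to see what the model was held to and to adjust the targets if they turn out to be infeasible.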

Model performance targets and success criteria do not always need to be rooted in accuracy. For instance, in business, time and money saved from using a machine learning model are often regarded as equally important. In fact, a business might be comfortable sacrificing accuracy in favour of significant cost savings. Unbiasedness, or model fairness, is also becoming increasingly important as scrutiny of machine learning models increases in use cases that might lead to discrimination or other adverse outcomes.

This concludes this blog post. In our next and final post of the series, we will discuss choosing the right dataset.