Part 3 in a series: Choosing the Right Dataset
This is the last blog post of this series. We’ve already talked about conceptual model targets and model performance targets; now it is time to discuss the importance of data in building and evaluating models. More specifically, we will cover three things: data quality, splitting data for evaluation, and sampling. Before we jump in, let me remind you that in the context of today’s post, a model refers to a decision-generating process that applies logical or statistical techniques to transform the data it is given into a meaningful output.
I’ll start with the obvious: good data quality is the foundation for producing accurate (and useful) findings from modelling. The characteristics of good quality data are accuracy, sufficiency, relevance and timeliness. To bring these to life, let’s consider the problem of predicting credit card fraud.
To build a basic model, you’d need card transaction and fraud records, which should be objectively accurate given the importance of the operational systems from which they’re sourced. That said, when sourcing this data, we might find that there is a low number of fraud cases, which can make it difficult for the model to generalise – a challenge with sufficiency. In situations like these, the common recourse is to increase the time frame of the data collected or to “oversample” fraud cases to better balance the dataset (there’s a short sketch of this below). Sufficiency challenges are not limited to the number of records sampled: data sparsity is another problem. In our fraud scenario, this could take the form of not having much information about cardholders, which would limit the quality of the predictive features available to the model.

Relevance cannot be judged through common sense and domain knowledge alone: we cannot really understand the effect of a specific slice of data on our target outcome without exploratory data analysis and modelling techniques such as regularisation. Last but not least, timeliness is the question of whether the frequency at which the data is sourced will enable us to run an effective model. If we’re looking to detect credit card fraud from transactions as they happen, we need to be sure that our model will be fed a steady stream of transactions rather than an end-of-month batch.
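To make the oversampling idea concrete, here is a minimal sketch in Python using pandas. The `transactions` DataFrame and its `is_fraud` column are hypothetical stand-ins for real card transaction records, and the approach shown is simple random oversampling with replacement; dedicated libraries such as imbalanced-learn offer more sophisticated resampling strategies.

```python
import pandas as pd

def oversample_fraud(transactions: pd.DataFrame) -> pd.DataFrame:
    """Randomly duplicate fraud rows until the two classes are balanced."""
    fraud = transactions[transactions["is_fraud"]]
    non_fraud = transactions[~transactions["is_fraud"]]

    # Resample the rare fraud cases with replacement to match the majority class.
    fraud_upsampled = fraud.sample(n=len(non_fraud), replace=True, random_state=42)

    # Recombine and shuffle so the duplicated fraud rows aren't clustered together.
    return (
        pd.concat([non_fraud, fraud_upsampled])
        .sample(frac=1, random_state=42)
        .reset_index(drop=True)
    )

# Toy usage – the columns and values are illustrative only.
toy = pd.DataFrame({
    "amount": [12.50, 3.00, 250.00, 9.99, 48.00, 7500.00],
    "is_fraud": [False, False, False, False, False, True],
})
print(oversample_fraud(toy)["is_fraud"].value_counts())
```

One caveat worth flagging: oversampling should only ever be applied to the data a model is trained on, never to the data used to evaluate it, otherwise the evaluation will see copies of cases the model has already learned – which leads neatly into the next topic.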
Now that we know what good looks like for data, let’s next dive into how to use it.
Data Splitting and Cross-Validation are two techniques used in supervised machine learning to help with model evaluation. Let’s pick up where we left off and see how they can help us build a credit card fraud model.
We’d normally train our model on the data we’ve collected. Bingo! However, we wouldn’t then be able to properly test our model – if we test it with the data we used to build it, how can we be sure whether the model learned to predict fraud or whether it blindly memorised the cases we trained it on? This is where Data Splitting comes in – we split our data sample into two parts. We use 70%-80% of the data to build and train our model and reserve the remaining 20%-30% to test it with. It is important to ensure that both the training and testing samples are representative of the hypothesis we’re testing for; in other words, we cannot train our model on all the non-fraud cases and save the fraud cases for testing.

Now, in the real world, we will usually hit some form of sufficiency problem with the data. That’s where Cross-Validation comes into play. Cross-Validation is brilliant in cases where you’re short on data. It is a technique that splits a dataset into a small number of equal-sized cuts, called “folds”. It then holds one fold back for testing and trains the model on the remainder. Not very different from the Data Splitting process, right? Well, that’s where things get interesting. Once Cross-Validation builds the model, it switches the folds and runs the process again. This repeats until every fold has been used as a testing set, and the average model performance across the different training runs is then taken. While it does mean that the model training process will be run a number of times, it is a great way to make a little data go very far.
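As a concrete illustration, here is a minimal sketch of both techniques using scikit-learn. The dataset is synthetic (generated with make_classification to mimic a heavily imbalanced fraud problem), the logistic regression model is just a placeholder, and the fold count, split ratio and scoring metric are illustrative choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for card transaction data: roughly 1% "fraud" cases.
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)

# Data Splitting: hold back 20% for testing, stratified so the rare fraud
# cases appear in both the training and the testing samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)

# Cross-Validation: five stratified folds – train on four, evaluate on the
# held-out fold, rotate until every fold has been the test set, then average.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=folds, scoring="recall")
print(f"Mean recall across folds: {scores.mean():.3f}")

# Final check on the untouched test split.
model.fit(X_train, y_train)
print(f"Held-out test accuracy: {model.score(X_test, y_test):.3f}")
```

Note that stratify=y is doing the “representative samples” work described above, and the recall scores from the five folds will typically vary a little – that spread is itself useful information about how stable the model is when fraud data is scarce.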
There are non-technical aspects to bear in mind as well. Not all data sits in easy-to-access systems and not all data is labelled. This introduces new challenges to collecting data for model development – there is a business cost to acquiring some feeds of data, and there is a much larger business cost to having someone label any unlabelled observations. Being smart about the size of the sample needed is crucial, as you must be able to show that a model can offer promising results while minimising the cost required to build it. It is good practice to set intermediate performance targets to achieve with cost-efficient data sources before really scaling up. More often than not, machine learning projects fail because they require too much effort or investment for the results they promise.
If there’s a lesson to take home from this, it is that business objectives and limits must be clearly defined alongside the modelling targets before rushing into the actual modelling exercise. Know what data you can get and what data you think you need, work out whether there are gaps that will affect the approach you take, and keep an eye on costs while you’re at it.