Financial services institutions and, more generally, firms operating in regulated industries are expected to comply with laws and directives aimed at ensuring that their customers are properly informed and treated fairly when sales or other types of communications take place. This could mean, for example, that a mortgage buyer needs to understand the difference between a fixed and a variable rate before they commit to the loan, or that a conversation about credit card collections should proceed differently if a customer is identified as vulnerable.

To monitor and remediate risks arising from agents' handling of customer communications, quality assurance (QA) teams listen to phone calls and review their compliance with the relevant QA frameworks, which are typically associated with the specific product or the reason for the customer's call. Call review is a laborious task that requires significant time investment: mortgage sales calls, for instance, are more than an hour long and can contain more than 50 compliance test-points to check. With limited resources, QA teams typically review less than 1% of calls, leaving potential risks unchecked.

TrueVoice can help automate, enhance and guide QA reviews by processing 100% of calls and directing a reviewer's attention towards the most important segments of each call, improving both the efficiency and the effectiveness of the review process. In this blog, I will explain how we build, evaluate and use our compliance models to achieve this goal.

Enhancing the QA review process

Performance measures

Compliance checklists typically consist of a set of topics, questions and mandatory statements that need to be covered in conversations with customers to ensure that they are properly informed and treated fairly. From a data science point of view, addressing the QA review means building a set of classifiers that detect the occurrence and position of segments of interest in a phone call. To evaluate the performance of each classifier, we employ the following standard machine learning concepts:

Recall: the proportion of all positive examples that have been correctly classified. For instance, if we are trying to identify an operator asking about a customer’s financial dependents, a positive example is a segment of the call when that conversation takes place, and the recall will be the share of positive examples our models are able to identify.

Precision: the proportion of identified segments that are actually positive examples. Continuing with the example above, if the model identifies 10 segments as containing conversation about financial dependents, precision is the proportion of those 10 in which that conversation actually took place.

There is typically a trade-off between precision and recall. When casting a wide net by making a model very generic, we can achieve high recall at the expense of low precision. Conversely, if we build a very specific model, we may achieve high precision at the expense of low recall.
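The two definitions above can be sketched in a few lines of code. This is a minimal illustration using invented segment IDs, not TrueVoice's evaluation code: the model flags 10 segments, 7 of which genuinely discuss financial dependents, out of 8 such segments in the call.

```python
def precision_recall(predicted_ids, actual_ids):
    """Compute precision and recall given sets of segment IDs."""
    predicted, actual = set(predicted_ids), set(actual_ids)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Illustrative data: 10 flagged segments, 8 true positive segments,
# 7 of the flagged segments are correct.
predicted = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
actual = {1, 2, 3, 4, 5, 6, 7, 11}
p, r = precision_recall(predicted, actual)
print(p, r)  # 0.7 precision, 0.875 recall
```
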

For each model in the scope of a QA framework, we can plot precision vs recall, and obtain a graph similar to the one shown below.

From machine learning to business objectives

Not all test-points are created equal. Some are defined in such a way that their occurrence alone is enough to determine compliance: for instance, playing back a mandatory statement or asking for a form of identification usually falls into this category. On the other hand, assessing compliance for other test-points involves some kind of professional judgement. For example, are income and expenditure disclosures consistent with a loan applicant's employment and family status?

Moreover, different test-points may carry different weight in determining the overall compliance of a conversation and, therefore, may command a different risk appetite and different target levels of precision or recall. For instance, if determining income and expenditure is crucial to a call's compliance, we may want to achieve a high level of recall for this specific test-point, letting reviewers validate segments of a call using their own professional judgement while ensuring that little evidence is missed.

Our model outputs can be used in different ways depending on a specific QA framework to achieve several business objectives. In the simplest case, models can output compliant vs non-compliant predictions for a specific test-point, enabling the automation of aspects of a QA review. This can be carried out for test-points that are identified sufficiently well, and whose contribution to a call’s compliance does not involve professional judgement. When content needs to be reviewed by a QA professional, or for test-points that are harder to identify, areas of a call likely to contain segments of interest can be highlighted to improve a reviewer’s efficiency. Finally, thematic reviews can be enabled by triaging calls according to different outcomes, such as evidence of confusion or customer vulnerability.

Modelling strategies

Identifying content of interest in a phone call is central to our approach for enhancing the QA review process, and different modelling strategies can be employed for this purpose. After a call recording is transcribed, the resulting text can be analysed in three main ways:

Keyword search: this is the simplest methodology, aimed at identifying one or more words along with their positions in the transcription of a call. This method typically does not achieve good performance except in limited cases where the words are very specific to the content of interest and are correctly transcribed from good-quality audio.
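A keyword search over a transcript can be sketched as follows. The transcript and keyword list are illustrative, and the function is a generic example rather than TrueVoice's implementation:

```python
import re

def keyword_positions(transcript, keywords):
    """Return (keyword, character offset) pairs, ordered by position."""
    hits = []
    for kw in keywords:
        # Case-insensitive whole-word (or whole-phrase) matching.
        for m in re.finditer(r"\b" + re.escape(kw) + r"\b", transcript, re.IGNORECASE):
            hits.append((kw, m.start()))
    return sorted(hits, key=lambda h: h[1])

transcript = "Before we proceed, do you have any financial dependents?"
print(keyword_positions(transcript, ["financial dependents", "variable rate"]))
```

Note that a single mis-transcribed word (e.g. "dependence" for "dependents") would cause this method to miss the segment entirely, which is the weakness discussed above.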

Search query: a more sophisticated method for finding content of interest in free-form text, it uses boolean and proximity operators to combine keywords into complex rule-based models. It generally offers an improvement over simple keyword search, but it only achieves high levels of precision and recall for easy-to-identify segments, such as mandatory statements, which are always expressed in the same way.
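To make the idea of boolean and proximity operators concrete, here is a toy sketch of a proximity query: two terms must occur within a window of N words of each other. The operators, window size and example query are assumptions for illustration, not a real TrueVoice query language:

```python
def within_proximity(transcript, term_a, term_b, window=10):
    """True if term_a and term_b occur within `window` words of each other."""
    words = transcript.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

def rate_type_query(transcript):
    # Roughly: ("rate" NEAR/10 "fixed") OR ("rate" NEAR/10 "variable")
    return (within_proximity(transcript, "rate", "fixed")
            or within_proximity(transcript, "rate", "variable"))

print(rate_type_query("your rate will stay fixed for five years"))  # True
```
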

Machine learning: based on a supervised learning methodology, it combines keywords, search queries, natural language processing techniques such as latent semantic analysis, and non-verbal indicators to achieve the best possible performance. Unlike rule-based methods such as keywords and search queries, it outputs a probability score rather than a hard prediction. By setting a threshold (e.g. predicting all segments with a score above 0.7 as containing content of interest), we can tune the trade-off between precision and recall independently for each model and implement business constraints. For instance, if we want to achieve a high level of recall, we can choose a relatively low threshold, which will yield more positive predictions.
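The effect of the threshold on the precision/recall trade-off can be demonstrated with a small sketch. The scores and labels below are invented; in practice they would come from a trained model and a labelled evaluation set:

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.20]   # model probability scores
labels = [True, True, False, True, False, False]  # ground-truth labels

# A high threshold favours precision; a lower one favours recall.
for t in (0.9, 0.7, 0.5):
    print(t, precision_recall_at(scores, labels, t))
```

Sweeping the threshold over all values traces out the precision-recall curve described in the next paragraph.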

The graph below shows how we typically improve the performance of a model by starting with a set of keywords, building search queries and eventually moving to a machine learning approach. The curve intersecting the “machine learning” marker represents different levels of precision and recall that can be achieved by tuning the model’s threshold.

The impact of transcription quality on modelling outcomes

Customer calls happen in a wide variety of environments, from quiet offices over crystal-clear audio connections to busy streets where the sound of traffic and an unreliable connection make the conversation challenging. TrueVoice employs a best-in-class speech-to-text engine but, despite continuous improvements in this technology, transcription will always be less than perfect. However, by harnessing a machine learning methodology, we can often overcome poor transcription quality through two main mechanisms:

Mis-transcription correction: terms that are often mis-transcribed in the same way will be identified by a machine learning algorithm if they are associated with a specific outcome.

Statistical scores: every feature of a model (that is, every word, search query and more complex indicator) contributes to a score according to its association with the content of interest. A machine learning model typically takes into account thousands of words and tens of word combinations to build evidence for the identification of a test-point. Therefore, if one or even a relatively large set of words is mis-transcribed, the model will generally still be able to gather enough evidence from the remaining words that have been either correctly transcribed or mis-transcribed in a consistent way.
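The evidence-accumulation idea can be illustrated with a toy score: each recognised feature adds a weight towards the content-of-interest score, so losing a few words to mis-transcription still leaves enough evidence. The features, weights and decision threshold below are invented for illustration and bear no relation to a real TrueVoice model:

```python
# Hypothetical feature weights for an "income and expenditure" test-point.
FEATURE_WEIGHTS = {
    "income": 0.30, "expenditure": 0.30, "salary": 0.20,
    "outgoings": 0.20, "monthly": 0.10, "payments": 0.10,
}

def evidence_score(transcript):
    """Sum the weights of every feature word present in the transcript."""
    words = set(transcript.lower().split())
    return sum(w for feat, w in FEATURE_WEIGHTS.items() if feat in words)

clean = "what is your monthly income and what are your outgoings"
garbled = "what is your monthly in come and what are your outgoings"  # "income" lost

# The garbled transcript loses the weight of "income" but, with an assumed
# decision threshold of 0.25, both versions are still flagged.
print(evidence_score(clean), evidence_score(garbled))
```
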

Quantifying benefits

The automation, efficiency and thematic review objectives described above lead to quantifiable benefits from employing TrueVoice to assist the operations of a QA team. In previous engagements, we were able to automate the review of about half of a mortgage sales compliance checklist and improve review efficiency by between 54% and 72%, meaning that a QA team was able to review the same number of calls in less than half the time.

Moreover, our thematic review approach was used to efficiently identify calls displaying customer confusion or vulnerability with an 8x and 2x uplift respectively compared to a random sample. Overall, this means that the same QA team working for the same number of hours could review twice as many calls and up to 16 times as many calls belonging to a specific thematic review objective, significantly reducing risk for the organisation.

If you would like to know more about how TrueVoice can help your QA team, please head over to our product page.