What are the main reasons customers contact our call centres? What makes them happy about our products and what is driving complaints? 

These are some of the questions our clients are tackling with the help of TrueVoice, Deloitte’s speech analytics solution, to improve customer experience and operational efficiency in their contact centres.

Machine learning plays a pivotal role in making this possible, by modelling business outcomes such as call resolution, reason for contact and compliance risk for businesses that operate in heavily regulated areas such as financial services. Working on speech data in a variety of industries with clients at different levels of technological maturity presents specific challenges that I will outline in this post, while lifting the lid on what constitutes TrueVoice’s machine learning engine, and relating the data science to the business objectives our clients are pursuing.

Modelling principles: multi-modal indicators and explainable predictions 

It’s not just what you say, but how you say it that matters. Evolution has given us a subtle, kaleidoscopic variety of ways to express emotions, opinions, concerns and requests. In the age of social media and recommendation engines, companies are striving to build personal relationships with their customers but, while people are naturally attuned to the complexity of human communications, understanding patterns of behaviour at scale is impossible without the use of the right technology. Whether we are modelling emotions, complaints or agents pressuring their clients into buying a product, we generally choose to employ multi-modal indicators derived from the audio signal (how you say it) and its speech to text transcription (what you say). These two streams are processed in different ways to derive complementary features for our machine learning models, including:

Audio-based indicators

  • Pitch, intensity: perceptually-weighted features that are especially useful to capture peaks in the emotional nature of the conversation.
  • Pauses, speech rate: useful to indicate hesitation or interruptions in the conversation due to, for instance, a customer being put on hold.
  • Audio-based emotion detection: outputs of a neural network model trained on acted emotional utterances; these are useful for capturing non-verbal emotional indicators.
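As an illustration of the pause-based indicators above, a simple pause ratio can be computed by thresholding frame-level energy. This is a minimal sketch on a synthetic signal; the frame length, silence threshold and function name are hypothetical choices for illustration, not TrueVoice's actual implementation:

```python
import numpy as np

def pause_ratio(signal, sample_rate=8000, frame_ms=25, threshold=0.02):
    """Fraction of frames whose RMS energy falls below a silence threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return float((rms < threshold).mean())

# Synthetic example: one second of speech-like noise followed by one second of silence
rng = np.random.default_rng(0)
audio = np.concatenate([0.5 * rng.standard_normal(8000), np.zeros(8000)])
print(round(pause_ratio(audio), 2))  # 0.5: half of the signal is silent
```

A real pipeline would of course adapt the threshold to the recording conditions; the point is that even a crude energy-based feature can flag hold time or hesitation.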

Text-based indicators

  • Search queries: outputs from a rule-based model which includes boolean operators (e.g. search for occurrences of “customer” and “service”) and proximity operators (e.g. a search for “amazon” within two words of “services” would hit all such instances). These queries are useful for zooming in on specific instances of known ways of expressing an outcome.
  • Latent semantic analysis: a technique to summarise unstructured text into broad topics using text vectorisation and dimensionality reduction (truncated singular value decomposition). Useful to express the context in which an outcome tends to occur: for instance, complaints for a retail business may occur in the context of discussions around returns and refunds.
  • Deep recurrent networks: used in TrueVoice primarily for emotion detection, deep neural networks based on recurrent architectures such as LSTMs are able to model a word in its context and may therefore be used in place of latent semantic analysis. However, deep neural networks are typically trained on clean, written text and they don’t always generalise well when applied to datasets resulting from transcribing noisy recordings or speakers with heavy accents.
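The latent semantic analysis step described above can be sketched with scikit-learn, combining TF-IDF vectorisation with truncated singular value decomposition. The transcripts and the number of topics here are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical transcript snippets from a contact centre
transcripts = [
    "i want a refund for my order",
    "can i return this item for a refund",
    "thank you for your patience good afternoon",
    "thanks ok yes that is all bye",
]

# Vectorise the text, then reduce the sparse term space to 2 latent topics
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(transcripts)
svd = TruncatedSVD(n_components=2, random_state=0)
topics = svd.fit_transform(X)
print(topics.shape)  # (4, 2): one 2-dimensional topic vector per transcript
```

Each transcript is summarised as a small dense vector, so downstream models can pick up on context (e.g. refunds-and-returns talk) rather than individual words.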

Call meta-data

  • Call position and duration: some outcomes tend to occur at the beginning or at the end of a call, or tend to be more prevalent in longer or shorter calls.
  • Operational meta-data: some outcomes tend to happen on inbound calls vs outbound calls, or they may tend to happen more if handled by specific teams within the call centre.
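Categorical meta-data such as call direction or handling team would typically be one-hot encoded before being combined with the other features. A minimal sketch, with hypothetical field values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical operational meta-data: call direction and handling team
meta = np.array([
    ["inbound", "retentions"],
    ["outbound", "sales"],
    ["inbound", "sales"],
])

enc = OneHotEncoder()
features = enc.fit_transform(meta).toarray()
print(features.shape)  # (3, 4): two directions and two teams
```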

Having extracted fairly complex features, some of which are themselves the outputs of machine learning models, our estimators tend to be relatively simple and interpretable models such as logistic regression or support vector machines. This gives us the advantage of quickly diagnosing why a model is making certain decisions and why a model does or does not generalise across different datasets.
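The appeal of an interpretable estimator can be seen by fitting a logistic regression on a toy set of pre-computed features and reading off its coefficients; the feature names and values here are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-computed multi-modal features for a handful of calls:
# [complaint_query_hits, audio_happiness, pitch_variation]
X = np.array([
    [3, 0.1, 0.9],
    [0, 0.8, 0.2],
    [2, 0.2, 0.7],
    [0, 0.9, 0.1],
    [4, 0.1, 0.8],
    [1, 0.7, 0.3],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = complaint

model = LogisticRegression().fit(X, y)

# Each coefficient shows how a feature pushes a prediction towards (+)
# or away from (-) a complaint, mirroring a variable-contribution chart
for name, coef in zip(["complaint_query", "audio_happiness", "pitch"], model.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```

On this toy data the complaint search-query feature gets a positive weight and audio-based happiness a negative one, which is exactly the kind of sanity check a linear model makes cheap to run.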



[Figure: variable contributions to the complaint model. Among the features shown are a positively contributing latent semantic analysis topic (“Good | want | need | thank | afternoon | patience”) and a negatively contributing one (“Thank | ok | yes | bye | right”).]
Partial example extracted from the complaint model explanations. Complaint and anger search queries feature as the top contributors to a complaint prediction, while audio-based happiness is the top contributor against a complaint prediction. The two latent semantic analysis features exemplify the impact of different contexts on the prediction: the positive one refers to a situation where an agent greets a customer after they have been put on hold (e.g. “good afternoon, thank you for your patience”), while the negative one refers to a customer thanking an operator and closing the conversation. Pitch is also positively associated with a complaint prediction, which is consistent with a raised level of emotional arousal.

Outcomes, technological maturity and modelling choices 

With TrueVoice, we endeavour to serve clients that are at different stages of technological maturity, with some of them just starting to leverage voice data and others who are well on their way towards building a complex omni-channel view of their customers. Moreover, some outcomes are inherently easier to model than others. For this reason, each of the machine learning models we build can be seen in light of the following three criteria:

  • Outcome complexity: some contact centre conversations need to include mandatory disclosures aimed at satisfying regulatory or quality assurance requirements. These are an example of outcomes that tend to be relatively easy to model provided the audio quality is adequate, and do not typically rely on multi-modal indicators. Other modelling outcomes such as emotions or complaints are much more subtle and require all the tools at our disposal to be properly addressed.
  • Model footprint: some models, such as specific disclosures, may only be applicable to individual TrueVoice clients; others, such as reasons for call, are generic across a whole industry; while others yet, such as emotions or call resolution, are generic enough to be applicable across different industries.
  • Model maturity: as a result of outcome complexity and model footprint our models typically follow a maturity path consisting of:
    • Initial rule-based models: generated using search queries in collaboration with our industry experts, they constitute baselines for models that require less complexity or that are less generalisable across clients and industries.
    • Initial machine learning-based models: usually provided as an output following a proof of value phase, they are created by tagging and modelling specific outcomes and may use a variety of indicators including audio and text-based features.
    • Machine learning feedback loop: once our models are in production, they enter their most mature phase where their performance is periodically monitored based on new tags, and where models are continuously improved according to a framework comprising five levels of model refinement from simple re-training to complete re-engineering. 

Machine learning vs business performance 

Our training protocols follow typical machine learning best practices such as cross-validation, hyper-parameter tuning, model diagnostics and performance evaluation reports. This is crucial to ensure that we build the best possible models from a technical point of view. However, at TrueVoice we strive to go one step further and work with our clients to understand how using our models will impact their business operations and help them move forward in their journey towards building data-centric organisations.

In the context of binary classification models, which feature in the majority of our use-cases, this often means carefully handling the balance between precision and recall to address the following scenarios.

Risk-based sampling

Many of our clients are interested in reviewing phone conversations and identifying outcomes of interest, either to better understand their customers or to ensure they comply with regulations. They typically review less than 1% of their calls and select them at random, meaning that rare events will almost never appear in front of a reviewer. For instance, a bank may be interested in reviewing calls to spot irregularities in the way it sells mortgages: if those irregularities only appear in a small proportion of situations, the bank may still receive hefty fines, but its reviewers will be unable to identify the issue and remediate it. TrueVoice instead selects a risk-based sample of calls to review, those most likely to contain irregularities according to our machine learning models, ranking model decisions from the most confident to the least confident and hence prioritising precision over recall.
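Risk-based sampling of this kind can be sketched as ranking calls by model score and reviewing only the top of the list. On synthetic data with a rare outcome (the data, model and sample sizes below are all illustrative), the hit rate among the top-ranked calls comfortably exceeds the base rate a random reviewer would see:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic review queue: 1,000 calls with a rare outcome of interest
n = 1000
risk_signal = rng.normal(size=n)
labels = (risk_signal + rng.normal(scale=0.5, size=n)) > 1.6

# Score every call, then rank decisions from most to least confident
model = LogisticRegression().fit(risk_signal.reshape(-1, 1), labels)
scores = model.predict_proba(risk_signal.reshape(-1, 1))[:, 1]
top10 = np.argsort(scores)[::-1][:10]

print(f"base rate: {labels.mean():.2f}, top-10 hit rate: {labels[top10].mean():.2f}")
```

Reviewing the ten highest-scoring calls surfaces far more of the rare outcome than ten random picks would, which is the whole point of prioritising precision in this scenario.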

An example of risk-based sampling on a small subset of 70 calls selected at random. The baseline incidence of the outcome of interest is around 50% as highlighted by the median risk-based sampling when rank=70 (that is, referring to the whole random sample). On the other hand, selecting the top 10 calls from this sample according to our model’s scores results in 90% median incidence of the outcome of interest. Violin plots display distributions across different cross-validation folds.


Another common scenario consists of providing an unbiased estimate of the occurrence of a certain event. For example, a retailer offering products from different brands may be interested in the percentage of complaints each brand receives from its customers, or the percentage of calls that result in the customer’s issue being resolved. In this situation, our models identifying outcomes such as call resolution or complaints cannot favour precision over recall, because this would under-estimate the overall occurrence of the outcome by selecting only the cases where the model provides a very confident prediction. They cannot favour recall over precision either, or the opposite effect would take place and the overall occurrence would be over-estimated. Instead, we set each model’s threshold at a value that minimises the difference between precision and recall to ensure an unbiased estimate of the outcome. Moreover, we find that when models with a large footprint are applied across different customers or industries, thresholds need to be re-tuned to maintain this balance of precision vs recall.
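Threshold tuning of this kind can be sketched by scanning the precision-recall curve for the point where the two metrics are closest; with precision roughly equal to recall, false positives and false negatives roughly cancel out, so the predicted incidence tracks the true rate. The scores and labels below are synthetic:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic validation set: ~30% positive labels, with separable scores
rng = np.random.default_rng(1)
labels = rng.random(500) < 0.3
scores = np.where(labels, rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500))

precision, recall, thresholds = precision_recall_curve(labels, scores)

# Pick the threshold where precision and recall are closest
best = np.argmin(np.abs(precision[:-1] - recall[:-1]))
threshold = thresholds[best]

estimated = (scores >= threshold).mean()
print(f"true rate: {labels.mean():.2f}, estimated rate: {estimated:.2f}")
```

Shifting the threshold in either direction away from this balance point would bias the estimated rate up or down, which is exactly the effect described above.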

An example of threshold tuning. The current threshold results in poor recall and an underestimation of the overall occurrence of this outcome. Moving the threshold to circa -0.65 balances the trade-off between precision and recall.


Voice data represents a rich and relatively untapped source of information that can help organisations gain precious insights into their customers and operations. From a data science standpoint, processing audio and building multi-modal models on noisy datasets presents great challenges and an opportunity to push the boundaries of what technology can do when married with deep business expertise, which is what we strive towards in our daily work at TrueVoice.

If you are interested in knowing how we can help your business, please head over to our AWS quick start page to begin your journey. Alternatively, you can read more about TrueVoice here.