Key takeaways

  • Internal teams, BPO service providers, and AI tools (callbots, chatbots): most organizations manage these three entity types with quality methods that are incompatible with each other
  • BPOs are measured on operational SLAs (answer rate, AHT, abandonment rate) — but no qualitative KPI is included in the contract. And the BPO self-evaluates: a structural conflict of interest
  • AI tools (callbot, chatbot) receive no conversational monitoring — the containment rate is measured, not the quality of the interaction
  • Automated Quality Monitoring enables unified benchmarking with the same evaluation grid across all three entity types: 8 dimensions, comparable scoring
  • Typical findings: internal teams 68/100, BPO 52/100, callbots 41/100 in conversational quality — on the same customer scenario
  • ROI: +15 to 30% quality improvement for BPO, -40 to 60% chatbot escalations, EUR 800K to EUR 4M in annual savings depending on volumes

Why has quality benchmarking become essential?

Customer relations is no longer the responsibility of a single team. In large organizations, 40 to 60% of interactions are handled by external service providers (BPO) or by artificial intelligence tools (callbots, chatbots). The internal contact center now represents only a fraction of total volume.

This reality creates a strategic problem: three entities handle your customers, but none is evaluated against the same criteria. The internal team undergoes occasional call monitoring. The BPO sends a monthly report based on its own indicators. The chatbot displays a containment rate. Comparison is impossible.

Three entities, three quality realities

Criterion | Internal team | BPO / Service provider | AI tools (callbot + chatbot)
Typical profile | Integrated contact center, 300 agents, 25 supervisors | Multi-site BPO, 500 agents (Paris, Casablanca, Bucharest) | Telephone callbot + web/app chatbot
Volume / month | 120,000 calls | 200,000 calls (for one client company) | 30,000 callbot calls + 50,000 chatbot conversations
Current QM | Call monitoring 3-5%, Excel scorecards | 1-2% audited by the BPO itself | No conversational QM
KPIs tracked | CSAT, AHT, FCR, quality score | Answer rate >90%, AHT <6 min, abandonment <5% | Containment rate, transfer rate, post-bot CSAT
Challenges | Subjectivity, low coverage, 25%/year turnover | Conflict of interest (self-evaluation), 40%/year turnover, multi-client | Hallucinations, poorly managed escalations, no emotion detection

The conclusion is clear: you cannot improve what you cannot compare. Without a common framework, each entity optimizes its own indicators — and your customers suffer an inconsistent experience from one channel to another. To understand the limitations of purely operational KPIs, read our article on customer relationship KPIs.

The BPO self-evaluation trap. When your service provider is the only one auditing its own conversations, it has a structural incentive to present favorable results. Monthly reports show green indicators — but your customers perceive a quality gap. Independent benchmarking through AI eliminates this bias.

How are internal teams, BPOs, and AI tools evaluated today?

Internal teams — evaluation in progress but incomplete

Most internal contact centers have implemented a quality monitoring framework: evaluation grids, call monitoring by supervisors, coaching sessions. But coverage remains low: 3 to 5% of calls are actually evaluated. Grids are often Excel-based, evaluations are subjective (one supervisor scores differently from another), and feedback arrives with a delay — sometimes several weeks after the call.

BPO service providers — operational SLAs, not qualitative ones

The contract with a BPO defines SLAs (Service Level Agreements) centered on operations: answer rate, average handling time, abandonment rate. These indicators measure efficiency, not quality. A BPO agent can meet the AHT SLA of <6 minutes while being rushed, impolite, or imprecise.

Quality monitoring? It is performed by the BPO itself, on 1 to 2% of calls, using its own grids. The client company receives a monthly report — but has no direct visibility into what its customers actually experience.

AI tools — the quality monitoring blind spot

Callbots and chatbots handle tens of thousands of interactions per month. Monitoring boils down to a few metrics:

  • Containment rate: 62% for the callbot, 70% for the chatbot
  • Transfer rate: percentage of escalations to a human agent
  • Post-bot CSAT: 3.1/5 for the callbot, 3.4/5 for the chatbot

But nobody analyzes the conversational quality of these interactions. Did the callbot understand the request? Did the chatbot provide accurate information or hallucinate? Was the escalation to an agent handled without the customer having to repeat everything? These questions remain unanswered.

The blind spots table

Evaluation method | What it captures | What it misses
Call monitoring (internal) | Point-in-time quality, targeted coaching | 95% of calls escape review, subjectivity
Contractual SLAs (BPO) | Operational efficiency | Conversational quality, empathy, actual resolution
Containment rate (AI) | Volume handled without escalation | Resolution quality, hallucinations, customer frustration
Automated QM (100%) | All dimensions, across 100% of interactions | —

Key terms

  • BPO (Business Process Outsourcing): outsourcing of customer operations to a specialized service provider, often multi-site and multi-client
  • SLA (Service Level Agreement): contractual service level commitments — typically operational KPIs (answer rate, AHT)
  • Containment rate: percentage of interactions fully handled by AI without transfer to a human agent
  • Deflection rate: percentage of interactions redirected from human channels to automated channels
  • Callbot: AI-powered voice agent, capable of handling telephone calls autonomously
  • Chatbot: AI-powered text agent, handling written conversations (chat, messaging)
  • Benchmark radar: multidimensional comparison framework evaluating entities against the same criteria
  • QOS (Quality of Service): overall quality level as perceived by customers across all channels

For a comprehensive overview of AI Quality Monitoring benefits, read our dedicated article.

What does conversational analysis reveal about the actual quality of each entity?

The most revealing test involves submitting the same scenario to all three entity types and comparing the results. Here is what the analysis of thousands of interactions shows for a common case: a customer calling to dispute an amount on their invoice.

Internal agent — Quality score: 72/100

What the agent should do: identify themselves, greet by name, acknowledge the issue, check the file, explain the discrepancy, propose a concrete solution, confirm the resolution, ask if there are other questions.

What the internal agent actually does: "Hello Mr. Dupont, I'm Marie from customer service. I've reviewed your file. I understand your concern about this amount. Let me check your invoice... Indeed, I see a discrepancy of 47 euros compared to your usual plan. This corresponds to an option activated on January 15th. Would you like me to deactivate it and process a refund?"

AI analysis: correct identification, empathy present, proactive research, clear explanation of the discrepancy, concrete resolution proposal. Areas for improvement: no summary at the end of the call, no satisfaction inquiry.

BPO agent — Quality score: 48/100

What the agent should do: identify themselves, greet by name, acknowledge the issue, check the file, explain the discrepancy, propose a concrete solution, confirm the resolution, ask if there are other questions.

What the BPO agent actually does: "Yeah hello, what's this about? ... OK your invoice. Give me your customer number. ... Right so it's because you have an extra option. You keeping it or not? ... OK I'll deactivate it. It'll be reflected on the next invoice. Anything else?"

AI analysis: no agent identification, no personalized greeting, no empathy, minimal explanation without detailing the amount, no resolution confirmation. The issue is technically resolved, but the customer experience is poor.

Callbot — Quality score: 35/100

What the callbot should do: identify the reason for the call, access the file, explain the discrepancy, propose a corrective action or transfer smoothly to an agent with context.

What the callbot actually does: "Welcome to customer service. What is the reason for your call? ... I understand, you have a question about your invoice. Your last invoice amounts to 67 euros 50. For any claim regarding an amount, please log in to your customer portal or call back to be connected to an advisor. Can I help you with anything else?"

AI analysis: the callbot identified the reason for the call but did not access the file, did not understand it was a dispute (not just a question), proposed no resolution, and redirected the customer to another channel — generating a frustrating escalation where the customer will have to re-explain everything.

Same scenario, three radically different experiences. The internal agent resolves with empathy (72/100). The BPO resolves without relational quality (48/100). The callbot fails to resolve and generates frustration (35/100). These patterns are systematic and detectable by AI at scale: conversational analysis automatically identifies gaps in discourse, empathy, resolution, and escalation between your entities. To discover the full range of analysis features, read the 12 features that make the difference. And for criteria that require human validation, read our article on hybrid analysis.

How to build a unified benchmarking framework?

The key to multi-entity benchmarking lies in a common evaluation grid applicable to both human agents and AI tools. Raisetalk offers an 8-dimension radar.

The 8 quality benchmark dimensions

Dimension | Definition | How it is measured
Discourse compliance | Presence of mandatory elements (script, legal notices) | Automatic detection of expected elements in the transcript
Empathy and active listening | Quality of emotional engagement with the customer | Sentiment analysis, detection of rephrasing and acknowledgment
Effective resolution | Did the customer actually get what they needed? | Analysis of the reason vs. the conversation outcome
Clarity and explanation | Was the information communicated in an understandable way? | Lexical complexity, presence of explanations, absence of unexplained jargon
Escalation management | How are complex cases transferred? | Analysis of contextual continuity during transfer
Resolution time | Operational efficiency | Total duration, talk/silence ratio, responsiveness
Emotional satisfaction | Customer sentiment at the end of the interaction | Sentiment analysis on the last quarter of the conversation
Regulatory compliance | Adherence to sector-specific legal obligations | Compliance scoring (same methodology as article 17)
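Mechanically, the overall score is just a weighted average of the eight dimension scores. A minimal Python sketch, using hypothetical weights (the actual weightings are yours to define, per your strategy):

```python
# Illustrative sketch: weighted overall score from the 8 radar dimensions.
# The weights below are hypothetical -- adapt them to your own priorities.
WEIGHTS = {
    "discourse_compliance":   0.10,
    "empathy":                0.15,
    "effective_resolution":   0.20,
    "clarity":                0.10,
    "escalation":             0.15,
    "resolution_time":        0.10,
    "emotional_satisfaction": 0.10,
    "regulatory_compliance":  0.10,
}

def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0-100 scale)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

# Callbot scores taken from the benchmark radar table in this article
callbot = {
    "discourse_compliance": 82, "empathy": 22, "effective_resolution": 45,
    "clarity": 65, "escalation": 35, "resolution_time": 92,
    "emotional_satisfaction": 30, "regulatory_compliance": 88,
}
print(f"callbot weighted score: {weighted_score(callbot):.1f}/100")
```

With these illustrative weights the callbot lands in the mid-50s, close to the 57/100 shown in the radar below; the exact figure depends entirely on the weightings you choose.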

The benchmark radar: visualizing the gaps

Dimension | Internal team | BPO | Callbot | Chatbot
Discourse compliance | 74/100 | 68/100 | 82/100 | 85/100
Empathy and active listening | 71/100 | 55/100 | 22/100 | 18/100
Effective resolution | 78/100 | 61/100 | 45/100 | 52/100
Clarity and explanation | 69/100 | 58/100 | 65/100 | 72/100
Escalation management | 72/100 | 48/100 | 35/100 | 40/100
Resolution time | 62/100 | 70/100 | 92/100 | 95/100
Emotional satisfaction | 68/100 | 50/100 | 30/100 | 28/100
Regulatory compliance | 65/100 | 60/100 | 88/100 | 90/100
Weighted overall score | 70/100 | 59/100 | 57/100 | 60/100

This radar reveals a counter-intuitive finding: AI tools outperform human agents on discourse compliance and resolution time (they follow the script to the letter and respond instantly), but collapse on empathy, escalation management, and emotional satisfaction. The BPO falls in the middle on most dimensions — but lags significantly behind the internal team on empathy and escalation.

From operational SLA to qualitative SLA for BPOs

Automated benchmarking makes a paradigm shift possible in the relationship with your service providers: moving from operational SLA to qualitative SLA.

Traditional SLA (operational) | Qualitative SLA (proposed)
Answer rate > 90% | Average quality score > 65/100
AHT < 6 min | Effective resolution > 75%
Abandonment rate < 5% | BPO CSAT ≥ 85% of internal CSAT
— | Compliance rate > 90%
— | Empathy score > 50/100
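Once qualitative SLAs are contractual, checking a monthly report against them is straightforward. A minimal sketch, assuming an illustrative report structure (field names are hypothetical):

```python
# Sketch of a qualitative SLA check for a BPO, using the thresholds from the
# table above. The report field names are illustrative, not a real schema.
QUALITATIVE_SLA = {
    "avg_quality_score":    lambda v: v > 65,     # average quality > 65/100
    "effective_resolution": lambda v: v > 0.75,   # effective resolution > 75%
    "csat_vs_internal":     lambda v: v >= 0.85,  # BPO CSAT >= 85% of internal
    "compliance_rate":      lambda v: v > 0.90,   # compliance rate > 90%
    "empathy_score":        lambda v: v > 50,     # empathy score > 50/100
}

def check_sla(monthly_report: dict) -> dict[str, bool]:
    """Return pass/fail for each qualitative SLA clause."""
    return {kpi: rule(monthly_report[kpi]) for kpi, rule in QUALITATIVE_SLA.items()}

report = {"avg_quality_score": 67, "effective_resolution": 0.71,
          "csat_vs_internal": 0.88, "compliance_rate": 0.93, "empathy_score": 52}
breaches = [kpi for kpi, ok in check_sla(report).items() if not ok]
print(breaches)  # here, only effective resolution falls below its threshold
```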

QM maturity matrix for AI tools

Level | Description | KPIs tracked
Level 0 — Invisible | No qualitative monitoring | Containment rate only
Level 1 — Operational | Logs and volume metrics | Transfer rate, session duration, post-bot CSAT
Level 2 — Qualitative | Conversational analysis of logs/transcripts | Effective resolution, clarity, escalation management
Level 3 — Benchmark | Same criteria as human agents | 8 radar dimensions, benchmark vs. internal agents

Each entity has its strengths and weaknesses — and that is normal. The goal of the radar is not to rank entities, but to identify for each one the priority improvement levers. Train your BPO agents on empathy. Improve your callbot's escalation handling. And adapt the radar weightings to your strategy: if regulatory compliance is critical (banking, insurance), it will carry more weight. To align your scorecard with a recognized quality framework, read our article on ISO 18295 certification.

Which specific KPIs should be tracked for each entity type?

Internal team KPIs: beyond AHT

KPI | Measure | Target
Overall quality score | Average of the 8-dimension radar | > 70/100
Per-agent progression | Quality score evolution over 3 months | +5 pts / quarter
Coaching impact | Score before/after coaching session | +8 pts minimum
Non-compliance rate | % of calls below threshold | < 10%
Conversational CSAT | Satisfaction inferred from the conversation (not a survey) | > 75/100

BPO service provider KPIs: from operational SLA to qualitative SLA

KPI | Measure | Target
Quality gap vs. internal | BPO score − Internal score (on same dimensions) | < 10 points
Contractual quality score | Average score on the radar | > 65/100
Avoidable escalations | % of escalations due to lack of competence (not complexity) | < 12%
Contractual compliance | Adherence to defined qualitative SLAs | > 90%
Inter-site consistency | Standard deviation of quality score across BPO sites | < 8 points
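The inter-site consistency KPI is a plain standard deviation. A small sketch, with illustrative site names and scores (here computed as a population standard deviation over the sites being compared):

```python
import statistics

# Inter-site consistency: standard deviation of the monthly quality score
# across BPO sites (target < 8 points in the table above).
# Site names and scores are illustrative.
site_scores = {"Paris": 66, "Casablanca": 58, "Bucharest": 61}

spread = statistics.pstdev(site_scores.values())
print(f"inter-site spread: {spread:.1f} pts -> {'OK' if spread < 8 else 'breach'}")
```

A spread above the threshold signals that your customers get a different level of service depending on which site answers, even if the average score looks acceptable.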

AI tool KPIs: measuring what a chatbot cannot do

KPI | Measure | Target
Effective resolution | % of interactions where the customer received a complete answer | > 65%
Escalation quality | Is context transmitted? Does the customer have to repeat? | > 80% contextualized transfers
Hallucination rate | % of responses containing incorrect information | < 3%
Post-bot vs. post-human CSAT | Satisfaction gap between AI interaction and human interaction | < 15% gap
Empathy score | AI's ability to rephrase, acknowledge, adapt tone | > 35/100

The containment rate trap. A callbot with a 70% containment rate may appear performant. But if 30% of those "contained" interactions result in a frustrated customer who hangs up without being helped, the reality is very different. The containment rate measures what the AI retains — not what it resolves. Only conversational analysis can measure effective resolution.
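The gap between the two metrics is easy to see on toy data. In this sketch the "resolved" flag is hand-labeled for illustration; in practice it would come from conversational analysis of the transcript, not from bot logs:

```python
# The containment-rate trap: containment counts every interaction the bot
# keeps, effective resolution only those where the customer was actually
# helped. Labels below are illustrative.
interactions = [
    {"contained": True,  "resolved": True},   # handled and answered
    {"contained": True,  "resolved": False},  # customer hung up, unhelped
    {"contained": True,  "resolved": True},   # handled and answered
    {"contained": False, "resolved": True},   # escalated, resolved by an agent
    {"contained": True,  "resolved": False},  # "contained" but frustrated
]

n = len(interactions)
containment = sum(i["contained"] for i in interactions) / n
effective   = sum(i["contained"] and i["resolved"] for i in interactions) / n
print(f"containment {containment:.0%} vs effective resolution {effective:.0%}")
```

On this sample the bot reports 80% containment while only 40% of interactions were actually resolved by it, which is exactly the gap the containment dashboard hides.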

To learn more about the historical evolution of quality monitoring toward AI, read our article on the QM revolution through AI.

What ROI can you expect from automated quality benchmarking?

The impact depends on the size of your operations and the maturity of your quality framework. Here are three simulations based on the entity profiles presented at the beginning of this article.

Simulation 1 — Internal team (300 agents, 120,000 calls/month)

Metric | Before | After 12 months | Impact
Interactions audited | 3% (3,600/month) | 100% (120,000/month) | x33 coverage
Average quality score | 65/100 | 78/100 | +13 points
Supervisor time spent on listening | 70% of time | 20% (focus on coaching) | -50 pts → more coaching
CSAT | 72% | 81% | +9 points
Complaints / year | 4,200 | 2,500 | -40%
Complaint savings / year | — | — | EUR 510K / year

Simulation 2 — BPO (500 agents, 3 sites, 200,000 calls/month)

Metric | Before | After 12 months | Impact
Interactions audited | 1% (by the BPO) | 100% (by the client company) | Quality sovereignty
Average quality score | 52/100 | 67/100 | +15 points
Quality gap vs. internal | -18 points | -11 points | -39% gap
Quality SLA penalties | 0 (no qualitative SLA) | Activated | Contractual leverage
Avoidable escalations | 22% of escalations | 12% | -45%
Savings / year | — | — | EUR 1.8M / year

Simulation 3 — AI tools (callbot + chatbot, 80,000 interactions/month)

Metric | Before | After 12 months | Impact
Interactions analyzed | 0% (logs only) | 100% | Full visibility
Callbot escalation rate | 38% | 22% | -16 points
Post-callbot CSAT | 3.1/5 | 3.8/5 | +22%
Detected hallucination rate | Unknown | 4.2% → corrected to 1.8% | Measurable reliability
Chatbot effective resolution | 48% | 68% | +20 points
Savings vs. human agents / year | — | — | EUR 1.6M / year

Summary view

Entity | Quality before → after | Main gain | Direct savings / year
Internal (300 agents) | 65 → 78/100 | -40% complaints | EUR 510K
BPO (500 agents, 3 sites) | 52 → 67/100 | -39% gap vs. internal | EUR 1.8M
AI (80K interactions/month) | N/A → measurable | -16 pts callbot escalation | EUR 1.6M
Total | | | EUR 3.9M / year
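As a sanity check on the arithmetic, the EUR 3.9M total is simply the sum of the three per-entity estimates, rounded to one decimal:

```python
# Reconstruction of the summary total: the headline EUR 3.9M figure is the
# (rounded) sum of the three simulation estimates from this article.
savings_eur = {
    "internal (300 agents)":        510_000,
    "bpo (500 agents, 3 sites)":  1_800_000,
    "ai (80K interactions/month)": 1_600_000,
}

total = sum(savings_eur.values())  # 3,910,000
print(f"total annual savings: EUR {total / 1e6:.1f}M")
```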

The finding is striking: the largest savings potential lies with the BPO — where quality is the least monitored and volumes are the highest.

These figures are simulations based on average assumptions. Actual ROI depends on your volumes, complaint costs, and quality maturity. Raisetalk offers a free trial workspace to evaluate results on your own data: try for free.

What best practices ensure sustainable benchmarking?

1. Unify the evaluation grid before comparing

Benchmarking starts with a common framework. Define your 8 dimensions, their weightings, and your thresholds — then apply them to all entities. Without a unified grid, comparison is an illusion.

2. Demand transparency from your BPOs

Include qualitative SLAs in your contracts. Require direct access to recordings — or better yet, connect your BPO's audio streams directly to your analytics platform. The quality audit must be independent from the audited service provider.

3. Evaluate your AI tools with the same rigor as your human agents

A callbot handles 30,000 interactions per month. It deserves the same level of monitoring as a human agent — not just a containment rate dashboard. Apply the same 8 radar dimensions and compare scores.

4. Use the benchmark as a lever for improvement, not punishment

The benchmark radar is not a punitive ranking. It is a management tool that identifies priority improvement levers for each entity. The BPO lacks empathy? Train its agents using the highest-rated verbatims from your internal team. The chatbot fails at escalation? Rework the prompt and the context transfer.

5. Review weightings quarterly

Your strategy evolves, and so should your quality criteria. If you strengthen your "premium customer relationship" positioning, increase the weight of empathy and emotional satisfaction. If regulatory compliance becomes critical, adjust accordingly.

Benchmarking creates a virtuous cycle. When the BPO knows that every call is evaluated against the same criteria as the internal team, quality improves organically. When AI teams see that their callbot is compared to human agents, they invest in conversational quality — not just containment. And to automate real-time alerts on critical gaps, read our article on smart notifications.

How to get started?

1. Map your entities and their volumes

Identify all actors handling your customer interactions: internal teams, BPO (how many sites, how many agents), callbots, chatbots, IVR. For each entity, note monthly volumes and current QM methods.

2. Define your unified benchmark grid

Choose your 8 dimensions, their weightings, and your thresholds. Involve quality, customer relations, and digital leadership. The grid must be acceptable to all parties for the benchmark to have value.

3. Connect your conversations to Raisetalk

Integration is done via API or SFTP upload for each source: internal center recordings, BPO audio streams, chatbot conversation logs, callbot transcripts. To choose the right transcription model, read our STT model comparison.
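To make the integration step concrete, here is a purely illustrative sketch of assembling a call-upload payload. The field names and structure are hypothetical, not Raisetalk's actual API contract; refer to the platform's API documentation for the real schema:

```python
import base64
import json

# Hypothetical payload assembly for pushing one recorded call to an
# analytics API. Every field name here is an illustrative assumption.
def build_payload(source: str, call_id: str, audio: bytes, language: str) -> str:
    """Serialize one call (audio + metadata) as a JSON payload."""
    return json.dumps({
        "source": source,        # e.g. "internal", "bpo_site_1", "callbot"
        "call_id": call_id,
        "language": language,
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    })

payload = build_payload("bpo_site_1", "call-0001", b"\x00\x01", "fr")
print(json.loads(payload)["source"])
```

The important design point is in the metadata: tagging every conversation with its source entity is what later makes the per-entity benchmark radar possible.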

4. Launch an initial 3-month benchmark

Analyze 3 months of history across all entities. This initial benchmark establishes the baseline: where does each entity stand on each dimension? What are the most significant gaps? What are the quick wins?

5. Activate continuous monitoring and alerts

Move from one-time benchmarking to continuous monitoring: real-time scoring, alerts on critical gaps, comparative dashboards. This is the improvement loop that transforms diagnosis into results.

Ready to benchmark the quality of all your entities?


Quality benchmarking between internal teams, service providers, and AI tools is not a luxury — it is a necessity for any organization that outsources or automates part of its customer interactions. Without a common framework, you are flying blind: your internal KPIs look good, your BPO reports green, your chatbot has an acceptable containment rate — but your customers experience inconsistencies from one channel to another. Automated Quality Monitoring creates this unified vision: same grid, same scoring, same standards for all. The potential EUR 3.9M in savings is just the tip of the iceberg — the real gain is a quality of service that is controlled, measurable, and comparable across your entire customer ecosystem.