Enterprise AI Analysis: Person Name Detection on Cantonese-English Code-Mixed Call Center Transcripts

This analysis examines a pioneering study on robust person name detection in the unique linguistic environment of Hong Kong's call centers, highlighting advanced AI techniques for data privacy and compliance in FinTech.

Executive Impact & Strategic Value

Implementing advanced Named Entity Recognition (NER) for Cantonese-English code-mixed transcripts is crucial for financial institutions to enhance data privacy, ensure regulatory compliance, and streamline operations. This technology allows for accurate PII masking, reducing legal risks and improving customer trust in a multilingual setting.

0.989 Overall F1 Score (Cantonese-BERT)
0.95 Inter-Annotator Agreement F1
104,360 Total Utterances Analyzed
6,241 Utterances with Names

Deep Analysis & Enterprise Applications

The sections below explore the specific findings from the research, framed as enterprise-focused modules.

Enterprise Process Flow: PII Redaction

Record & Store Conversations
ASR Transcription
Identify PII (NER)
Mask PII
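The four stages above can be sketched as a simple pipeline. The ASR stub, the regex-based name detector, and the `[NAME]` placeholder are illustrative assumptions, not the study's implementation:

```python
# Minimal sketch of the four-stage PII redaction flow.
# asr_transcribe is a stub standing in for a real speech-recognition system.

import re

def asr_transcribe(audio_segment: bytes) -> str:
    """Stand-in for an ASR system; real pipelines call a speech model here."""
    return "唔該 Mr. Wong 幫我查一查"

def detect_person_names(utterance: str) -> list[tuple[int, int]]:
    """Toy NER stand-in: flag 'Mr./Ms./Mrs. <Surname>' patterns as person names."""
    return [m.span() for m in re.finditer(r"\b(Mr|Ms|Mrs)\.\s+[A-Z][a-z]+", utterance)]

def mask_pii(utterance: str, spans: list[tuple[int, int]]) -> str:
    """Replace each detected span with a [NAME] placeholder, right to left."""
    for start, end in sorted(spans, reverse=True):
        utterance = utterance[:start] + "[NAME]" + utterance[end:]
    return utterance

def redact(audio_segment: bytes) -> str:
    text = asr_transcribe(audio_segment)   # 1. record -> 2. transcribe
    spans = detect_person_names(text)      # 3. identify PII (NER)
    return mask_pii(text, spans)           # 4. mask PII

print(redact(b""))  # 唔該 [NAME] 幫我查一查
```

In production, the regex stand-in is replaced by the NER model discussed below; the masking step is the same.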

Challenges in Cantonese Code-Mixed NER

Challenge: High-Resource Languages (e.g., English) vs. Cantonese Code-Mixed Context
Data Availability
  • High-resource: Abundant public datasets
  • Code-mixed: Significantly under-represented datasets
Linguistic Diversity
  • High-resource: Relatively uniform naming conventions
  • Code-mixed: Mix of Hanzi, Romanized Cantonese, English names
Code-Mixing
  • High-resource: Less frequent or domain-specific
  • Code-mixed: Common in informal conversations
ASR Noise
  • High-resource: Capitalization & punctuation aid NER
  • Code-mixed: Missing capitalization, grammatical errors, false starts
0.95 F1 score for Inter-Annotator Agreement (IAA)

This high score demonstrates robust consistency in manual annotation, a critical foundation for training high-performing NER models in complex code-mixed environments.
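For context, a minimal sketch of span-level F1 between two annotators, the kind of exact-match metric typically behind an IAA figure like this (the tuple-based span representation is an assumption):

```python
# Span-level F1 between two annotators. Spans are (utterance_id, start, end)
# tuples; only exact matches count, which is the strictest scoring regime.

def span_f1(annotator_a: set, annotator_b: set) -> float:
    """Treat A as reference and B as prediction; exact-span-match F1."""
    if not annotator_a and not annotator_b:
        return 1.0
    tp = len(annotator_a & annotator_b)
    precision = tp / len(annotator_b) if annotator_b else 0.0
    recall = tp / len(annotator_a) if annotator_a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

a = {(1, 0, 2), (2, 5, 8), (3, 1, 3)}
b = {(1, 0, 2), (2, 5, 8), (4, 0, 2)}
print(round(span_f1(a, b), 3))  # 0.667
```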

104,360 Cleaned Utterances in the Dataset

After deduplication, the dataset comprised 104,360 unique Cantonese-English code-mixed utterances, providing a rich foundation for model training. Of these, 6,241 contained at least one person name.

Case Study: Tackling Data Imbalance with Negative Sampling

Challenge: The vast majority of call center data does not contain person names, leading to a significant data imbalance. A model trained on the full dataset tended to be overly cautious, resulting in low recall and missed entities. Conversely, training only on name-containing examples led to excessive false positives, especially with common Chinese characters like “張” (Cheung), which also functions as a measure word in ordinary text.

Solution: Targeted down-sampling was applied, selectively including ambiguous negative samples that could confuse the model. This ensured a balanced representation, controlling the number of negative examples to match positive ones and maintaining an equal distribution of Chinese and English negative examples to prevent bias.

Impact: This strategic approach enabled the model to learn the patterns of rare positive entities more effectively, significantly improving recall without sacrificing precision, as validated by experimental results.
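A minimal sketch of the down-sampling strategy, under assumed data structures (`text`/`has_name`/`script` fields) and a toy ambiguity heuristic; the study's exact selection criteria are not reproduced:

```python
# Targeted negative down-sampling: match negatives to positives in count,
# prefer ambiguous negatives, and split the negative budget evenly between
# Chinese-script and English-script utterances.

import random

# Surname characters that also occur as ordinary words (toy list).
AMBIGUOUS_SURNAME_CHARS = set("張王何")

def is_ambiguous(utt):
    return any(ch in AMBIGUOUS_SURNAME_CHARS for ch in utt["text"])

def downsample(utterances, seed=0):
    positives = [u for u in utterances if u["has_name"]]
    negatives = [u for u in utterances if not u["has_name"]]
    budget_per_script = len(positives) // 2  # equal Chinese/English negative split
    chosen = []
    for script in ("han", "eng"):
        pool = [u for u in negatives if u["script"] == script]
        # Ambiguous negatives first, so hard non-name contexts survive sampling.
        pool.sort(key=is_ambiguous, reverse=True)
        chosen += pool[:budget_per_script]
    dataset = positives + chosen
    random.Random(seed).shuffle(dataset)
    return dataset

data = (
    [{"text": f"名{i}", "has_name": True, "script": "han"} for i in range(4)]
    + [{"text": "一張票", "has_name": False, "script": "han"}]
    + [{"text": f"天氣{i}", "has_name": False, "script": "han"} for i in range(5)]
    + [{"text": f"hello {i}", "has_name": False, "script": "eng"} for i in range(6)]
)
sample = downsample(data)
print(len(sample))  # 4 positives + 2 han negatives + 2 eng negatives = 8
```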

Annotation Guidelines for Person Names

Titles without Names Excluded
Stutter Instances Included
Interrupted Fragments Included
Location Masking (Pre-processing)

These guidelines ensure consistency and accuracy, capturing the nuances of spoken Cantonese while minimizing false positives from non-name entities.
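As a hypothetical illustration, the four guidelines might map onto BIO-style token labels as follows (token splits and labels are invented, not drawn from the corpus):

```python
# Invented BIO-labeled examples, one per guideline above.

examples = [
    # Titles without names are excluded: a bare "小姐" (Miss) gets no label.
    (["唔該", "小姐"], ["O", "O"]),
    # Stutter instances are included: the repeated fragment is also tagged.
    (["Ch-", "Chan"], ["B-PER", "B-PER"]),
    # Interrupted fragments are included: a cut-off name still receives a tag.
    (["ask", "for", "Wo-"], ["O", "O", "B-PER"]),
    # Location masking happens in pre-processing, before NER sees the text.
    (["我", "喺", "[LOC]"], ["O", "O", "O"]),
]

# Sanity check: every example carries exactly one label per token.
print(all(len(toks) == len(labs) for toks, labs in examples))  # True
```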

Fine-tuned Transformers vs. Prompted LLMs

Feature: Fine-tuned Transformers (e.g., Cantonese-BERT) vs. Prompted LLMs (e.g., Gemma-2-9b)
Training Data
  • Fine-tuned: Task-specific, pre-trained on Cantonese code-mixed data
  • Prompted LLM: General knowledge, limited Cantonese exposure
Performance on HAN (Chinese Characters)
  • Fine-tuned: Robust, F1 up to 0.988
  • Prompted LLM: Poor, F1 around 0.876, prone to mislabeling common characters
Performance on ENG + ROM (Latin Alphabet)
  • Fine-tuned: Excellent, F1 up to 0.990
  • Prompted LLM: Good, F1 above 0.960
Computational Cost
  • Fine-tuned: More efficient at inference
  • Prompted LLM: Higher resource requirements, even for smaller LLMs
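Whichever model produces the labels, the per-token output must be decoded into name spans. A minimal decoder for an assumed BIO tagset (`B-PER`/`I-PER`/`O`; the paper's exact label scheme is not specified here):

```python
# Collapse B-PER / I-PER runs into surface-form name strings.

def bio_to_spans(tokens: list[str], labels: list[str]) -> list[str]:
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B-PER":
            if current:                # close a previous entity
                spans.append("".join(current))
            current = [tok]            # start a new entity
        elif lab == "I-PER" and current:
            current.append(tok)        # continue the open entity
        else:
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans

print(bio_to_spans(list("我搵黃生呀"), ["O", "O", "B-PER", "I-PER", "O"]))
# ['黃生']
```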

Case Study: The Advantage of Cantonese-BERT-Base

Innovation: The study significantly improved upon a generic Chinese BERT model by expanding its vocabulary with Cantonese-specific characters and continuing pre-training on over 10 million Cantonese sentences from online forums, where code-mixing is prevalent. This continued pre-training covered topics such as current affairs, gossip, sports, finance, and entertainment.

Impact: This specialized Cantonese-BERT-Base demonstrated superior performance across all name types, especially in handling code-mixed text. Its pre-training on authentic Cantonese data, including frequent integration of English words, equipped it with the linguistic nuances necessary for accurate inference, outperforming its Chinese-BERT-Base counterpart.
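The vocabulary-expansion step can be sketched with a toy vocabulary; the real procedure would extend the BERT WordPiece vocabulary and resize the embedding matrix before continued pre-training:

```python
# Toy vocabulary expansion: Cantonese-specific characters absent from a
# generic Chinese vocabulary are appended with fresh ids. The dict here is
# a stand-in for the real BERT vocab file.

def expand_vocab(vocab: dict[str, int], corpus: list[str]) -> dict[str, int]:
    """Append every unseen character in the corpus to the vocabulary."""
    vocab = dict(vocab)
    for sentence in corpus:
        for ch in sentence:
            if ch not in vocab:
                # Each new entry corresponds to a new row in the embedding matrix.
                vocab[ch] = len(vocab)
    return vocab

base = {"[UNK]": 0, "我": 1, "你": 2}
cantonese = ["佢哋喺邊度", "唔該晒"]   # Cantonese-specific characters
expanded = expand_vocab(base, cantonese)
print(len(expanded))  # 11: 3 base entries + 8 new Cantonese characters
```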

27.6-Point Recall Improvement for ENG + ROM Names with Down-sampling

Targeted negative down-sampling for Cantonese-BERT-Base dramatically improved recall for English and Romanized Cantonese names from 0.712 to 0.988, underscoring its necessity in imbalanced datasets.

Key Performance Comparison: Best Models (F1-Scores)

Metric: Cantonese-BERT-Base (Down-sampling) vs. Gemma-2-9b (Prompted LLM)
Overall F1
  • Cantonese-BERT-Base: 0.989
  • Gemma-2-9b: 0.951
HAN (Chinese Characters) F1
  • Cantonese-BERT-Base: 0.988
  • Gemma-2-9b: 0.876
ENG + ROM (Latin Alphabet) F1
  • Cantonese-BERT-Base: 0.990
  • Gemma-2-9b: 0.983

Case Study: Cantonese Linguistic Misinterpretations

Problem: The model frequently misidentifies Cantonese discourse particles and honorifics as parts of personal names. Common misclassifications include “誒” (a hesitation particle), “喇” (an assertion particle), and “阿” (a colloquial prefix) being absorbed into predicted surnames or full names. For example, “誒小姐” (uh, Miss) and “阿先生” (Ah, Sir) are incorrectly tagged as names, even though “誒” and “阿” are not name components and “小姐”/“先生” are standalone polite forms of address.

Example: In "咁唔該呀黃生先" (First of all, thank you, Mr. Wong), "黃生先" was incorrectly tagged as a name. Only "黃生" (Mr. Wong) is the name, while "先" is a particle. This error likely arises from the surface-form overlap with the polite form "黃先生" where "先生" is an honorific.

Lesson: This highlights a need for deeper linguistic understanding beyond surface patterns, particularly in spontaneous speech where context is crucial for disambiguation.
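One hedged mitigation is a post-processing pass that trims particles from predicted spans. The particle and title lists below are illustrative, drawn only from the error cases above; note that “阿” is deliberately not stripped, since “阿” plus a given name is a legitimate colloquial name form and needs context-aware handling:

```python
# Trim particles the model absorbs into name spans, then drop spans that
# reduce to a bare title (titles without names are excluded by guideline).

HESITATION_PREFIXES = ("誒",)
FINAL_PARTICLES = ("先", "喇", "呀")
BARE_TITLES = {"小姐", "先生", "太太"}

def clean_name_span(span: str) -> str:
    while span and span.startswith(HESITATION_PREFIXES):
        span = span[1:]
    while span and span.endswith(FINAL_PARTICLES):
        span = span[:-1]
    # A span that is only a polite title is not a person name.
    return "" if span in BARE_TITLES else span

print(clean_name_span("黃生先"))  # 黃生 (sentence-final 先 removed)
print(clean_name_span("誒小姐"))  # '' (hesitation stripped, bare title dropped)
```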

Case Study: Confusion with Location-Derived Proper Nouns

Problem: The model occasionally mis-classifies geographical names as personal names, especially those not covered by the rule-based location masker. Culturally significant sites like ancestral halls or private roads, if absent from government datasets, can bypass pre-processing.

Example 1: "黃維則" (Wong Wai Tsak) was incorrectly tagged as a name in "黃維則堂" (Wong Wai Tsak Tong). In reality, Wong Wai Tsak Tong is an ancestral hall, and Wong Wai Tsak is not a real person.

Example 2: "張保仔" (Cheung Po Tsai) was incorrectly tagged as a name in "張保仔路" (Cheung Po Tsai Road), a street named after a historical pirate. Although Cheung Po Tsai was a real person, masking the street name would degrade transcript readability, so such cases are counted as false positives.

Lesson: Augmenting the location gazetteer with data from mapping platforms like OpenStreetMap is crucial to minimize these false positives, maintaining privacy without compromising transcript utility.
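The gazetteer masking can be sketched as a longest-match replacement pass; the two entries here are the examples above, and a real gazetteer would combine government datasets with exports from mapping platforms such as OpenStreetMap:

```python
# Gazetteer-based location masking applied before NER, so location-derived
# proper nouns never reach the name detector.

GAZETTEER = {"黃維則堂", "張保仔路"}  # illustrative entries only

def mask_locations(utterance: str) -> str:
    # Longest entries first, so "黃維則堂" is masked whole rather than
    # leaving a partial "黃維則" for the NER model to mis-tag.
    for place in sorted(GAZETTEER, key=len, reverse=True):
        utterance = utterance.replace(place, "[LOC]")
    return utterance

print(mask_locations("我喺張保仔路附近"))  # 我喺[LOC]附近
```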

Estimate Your AI Impact

Quantify the potential savings and efficiency gains for your enterprise by adopting advanced AI solutions like the one analyzed.


Your AI Implementation Roadmap

A phased approach to integrate advanced name detection and PII masking into your enterprise workflows.

Phase 1: Discovery & Strategy (2-4 Weeks)

Comprehensive assessment of existing data privacy protocols and current transcription workflows. Define scope, identify key data sources, and establish success metrics. Strategy workshop with key stakeholders.

Phase 2: Data Preparation & Model Customization (4-8 Weeks)

Collection and annotation of enterprise-specific Cantonese-English code-mixed call center data. Fine-tuning of Cantonese-BERT-Base model with targeted negative sampling and location masking. Integration with proprietary ASR systems.

Phase 3: Pilot Deployment & Validation (3-6 Weeks)

Deploy the custom NER model in a controlled pilot environment. Conduct rigorous A/B testing against baseline methods, evaluate performance on accuracy, recall, and processing speed. Gather user feedback for iterative improvements.

Phase 4: Full-Scale Integration & Monitoring (Ongoing)

Seamless integration into production systems, including real-time PII masking. Continuous monitoring of model performance, data drift, and regulatory changes. Ongoing model retraining and updates to maintain peak accuracy and compliance.

Ready to Enhance Your Data Privacy?

Secure your enterprise communications and ensure compliance with cutting-edge AI. Book a free consultation with our experts to design a tailored solution.
