Enterprise AI Analysis: Large Language Model Judged Self-Training for Named Entity Recognition

Research & Analysis

Large Language Model Judged Self-Training for Named Entity Recognition

Authors: Shisong Chen, Jiaan Wang, Chengyi Yang, Yanghua Xiao, Zhixu Li, Xin Lin

Published: 21 February 2026 at WSDM '26, Boise, USA

This paper introduces a novel LLM-judged self-training method for Named Entity Recognition (NER). It addresses the challenge of confirmation bias in self-training by leveraging LLMs to select high-quality pseudo-labels, significantly outperforming state-of-the-art approaches in few-shot settings.

Executive Impact

The research presents a significant advancement in Natural Language Processing, offering a robust and efficient method for Named Entity Recognition (NER) even with limited labeled data. This breakthrough has direct implications for enterprise AI applications.


Deep Analysis & Enterprise Applications


Introduction & Background

Named Entity Recognition (NER) is a core NLP task, but supervised methods are limited by scarce labeled data. Self-training leverages unlabeled data but faces confirmation bias from incorrect pseudo-labels. This paper introduces a novel approach using Large Language Models (LLMs) to judge and filter pseudo-labels, addressing these limitations. It leverages LLMs' rich knowledge and few-shot learning capabilities to improve the quality of self-training.

LLM-Judged Self-Training Methodology

The proposed method integrates LLMs as a judge model to evaluate pseudo-label quality. It features a comprehensive prompt design that incorporates task-specific rules mined directly by the LLM from labeled data, mitigating the impact of inappropriate demonstrations. To counter LLM hallucinations, a collaborative pseudo-label selection method is employed, combining classifier confidence with LLM judgments, guided by calibration-guided probability smoothing. This ensures high-quality pseudo-label selection.
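The collaborative selection described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation: the smoothing formula, the fusion weight `alpha`, the threshold, and all function names are illustrative assumptions.

```python
def smooth_probability(p, temperature=2.0):
    """Temper a possibly over-confident probability toward 0.5
    (one simple form of calibration-guided smoothing; illustrative)."""
    num = p ** (1.0 / temperature)
    den = num + (1.0 - p) ** (1.0 / temperature)
    return num / den

def collaborative_score(classifier_conf, llm_judgment, alpha=0.5):
    """Fuse the smoothed classifier confidence with the LLM judge's
    verdict (0.0 = reject, 1.0 = accept)."""
    return alpha * smooth_probability(classifier_conf) + (1.0 - alpha) * llm_judgment

def select_pseudo_labels(candidates, threshold=0.7):
    """Keep pseudo-labeled examples whose fused score clears the threshold."""
    return [c for c in candidates
            if collaborative_score(c["conf"], c["llm_ok"]) >= threshold]
```

Because the fused score requires both a confident classifier and a positive LLM verdict to clear the threshold, a hallucinated LLM acceptance alone cannot admit a low-confidence pseudo-label, and vice versa.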

Experimental Results & Analysis

Extensive experiments on CoNLL03, MIT Restaurant, MIT Movie, and SNIPS datasets demonstrate that the LLM-judged self-training method significantly outperforms state-of-the-art approaches across all few-shot settings. The method shows greater advantage with less labeled data. Ablation studies confirm the effectiveness of task rules, few-shot demonstrations, and collaborative pseudo-label selection. An analysis of LLM judgment capability highlights the importance of strong self-knowledge for reliable confidence outputs.

Efficiency & Cost

The efficiency and cost of LLM judging are within acceptable limits. Using GPT-3.5 as the judge model on a single Nvidia 3090 GPU, the total training cost remains below $5 and the entire process takes less than 10 hours. Critically, the LLM is exclusively used during the training phase to select high-quality pseudo-labels. For inference, only the smaller, fine-tuned classifier model is deployed, ensuring no additional inference costs and maintaining real-time performance. This makes LLM-judged self-training a practical solution for enterprise AI.

Key result: +5.33 F1 over the previous state of the art (CoNLL03, 5-shot)

LLM-Judged Self-Training Process

1. Train the classifier on labeled data
2. Assign pseudo-labels to unlabeled data
3. LLM judges and selects high-quality pseudo-labels
4. Retrain the classifier on the combined data
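The four-step process above can be sketched as a loop. The callables `fit`, `pseudo_label`, and `llm_judge` are hypothetical stand-ins for the paper's components (the fine-tuned classifier, its decoder, and the GPT-3.5 judge), supplied here as parameters to keep the sketch self-contained.

```python
def self_train(fit, pseudo_label, llm_judge, labeled, unlabeled, rounds=3):
    """Sketch of LLM-judged self-training; names are illustrative."""
    model = fit(labeled)                                    # 1. train on labeled data
    for _ in range(rounds):
        candidates = pseudo_label(model, unlabeled)         # 2. assign pseudo-labels
        selected = [ex for ex in candidates if llm_judge(ex)]  # 3. LLM filters
        model = fit(labeled + selected)                     # 4. retrain on the union
    return model
```

Each round re-fits from the union of gold and selected pseudo-labeled data, so a bad pseudo-label rejected in a later round does not persist in the model, which is how the filtering step counters confirmation bias.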
| Feature | Traditional Self-Training | LLM-Judged Self-Training (Our Method) |
| --- | --- | --- |
| Judge Model Training | Relies on limited labeled data | Leverages the LLM's vast pre-trained knowledge |
| Judgment Reliability | Prone to unreliability due to data scarcity and confirmation bias | Enhanced by comprehensive prompts, mined rules, and collaborative selection |
| Data Efficiency | Performance often limited by few labeled examples | Effective in low-resource and few-shot settings |
| Flexibility | Requires a specific judge-model architecture per task | Adaptable across NLP tasks (text classification, slot tagging) |



Your AI Implementation Roadmap

A typical phased approach to integrating LLM-judged self-training into your enterprise NER workflows, building capabilities step-by-step.

Phase 1: Initial Model Training

Train a BERT-base classifier on your existing, limited labeled NER data to establish baseline performance and prepare for pseudo-label generation.

Phase 2: Pseudo-Label Generation

Utilize the initially trained classifier to generate pseudo-labels for a large volume of unlabeled data. This step aims to expand the training pool before LLM-guided refinement.
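One way to sketch this step: decode each token's most probable tag and attach a sentence-level confidence for the later selection phase. Taking the minimum token probability as the sentence confidence is an illustrative choice, not necessarily the paper's scoring rule.

```python
def pseudo_label_sentence(token_probs):
    """Greedy per-token decoding of a pseudo-label sequence, with the
    minimum token probability as a sentence-level confidence."""
    labels, confs = [], []
    for dist in token_probs:             # dist maps tag -> probability
        best = max(dist, key=dist.get)   # most probable tag for this token
        labels.append(best)
        confs.append(dist[best])
    return labels, min(confs)
```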

Phase 3: LLM-Guided Data Selection

Employ a Large Language Model (e.g., GPT-3.5) as a judge to evaluate and filter the pseudo-labels. This involves a custom prompt incorporating task rules and a collaborative selection strategy (confidence fusion and calibration smoothing) to ensure high-quality data.
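A judge prompt of the kind described might be assembled as below. The wording and layout are illustrative assumptions, not the paper's actual template; only the ingredients (mined task rules, few-shot demonstrations, the candidate to verify) come from the method description.

```python
def build_judge_prompt(task_rules, demonstrations, sentence, pseudo_entities):
    """Assemble a judge prompt from mined task rules, a few labeled
    demonstrations, and the pseudo-labeled candidate to be verified."""
    lines = ["You are judging pseudo-labels for named entity recognition.",
             "Task rules:"]
    lines += [f"- {rule}" for rule in task_rules]
    lines.append("Examples:")
    for demo_sentence, demo_entities, verdict in demonstrations:
        lines.append(f"Sentence: {demo_sentence}")
        lines.append(f"Entities: {demo_entities} -> {verdict}")
    lines.append(f"Sentence: {sentence}")
    lines.append(f"Entities: {pseudo_entities} -> ")
    return "\n".join(lines)
```

The returned string would be sent to the judge LLM, whose verdict is then fused with the classifier's confidence in the collaborative selection step.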

Phase 4: Iterative Refinement

Combine the original labeled data with the LLM-selected high-quality pseudo-labeled data. Retrain the classifier using this expanded dataset to iteratively improve NER performance and reduce confirmation bias.

Phase 5: Deployment & Monitoring

Deploy the enhanced, small NER classifier for real-time inference, leveraging the benefits of LLM-guided self-training without incurring ongoing LLM inference costs. Continuously monitor performance and retrain as new data becomes available.

Ready to Transform Your Enterprise with AI?

Leverage cutting-edge research to build intelligent systems that drive efficiency and innovation. Our experts are ready to guide you.
