Enterprise AI Analysis

Liar, Liar, LLM on Fire: Investigating Deception in Turkish Text Generation

This study explores lie detection in LLM-generated Turkish text. It creates a dataset (TQuADFake) of correct and intentionally incorrect sentences generated from question-answer pairs, investigates which linguistic features distinguish generated truths from lies, and evaluates an SVM, pre-trained language models (PLMs), and LLMs on the detection task. Findings show that linguistic features are useful but insufficient on their own, and that supervised PLMs outperform LLMs. The research highlights how hard effective lie detection remains and provides a foundation for misinformation detection in Turkish.

Executive Impact: Key Metrics

This research offers critical insights into the capabilities and limitations of LLMs in generating truthful information, with significant implications for content authenticity and misinformation detection in AI-driven applications.

9,954 LLM-Generated Sentences

0.46 Human Evaluator Agreement (Kappa)

78.11% Best Model Accuracy (BERT-base)

~10,000 TQuADFake Dataset Size
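The agreement figure above is a kappa statistic. The page does not say which variant was used, so the following is only a minimal sketch of Cohen's kappa for two annotators; the labels are hypothetical and exist purely to illustrate the arithmetic. On the commonly cited Landis and Koch scale, a value of 0.46 falls in the "moderate agreement" band (0.41 to 0.60).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical judgements (1 = judged correct, 0 = judged incorrect).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 1, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it can be much lower than the plain percentage of matching labels.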

Deep Analysis & Enterprise Applications

Dataset Creation

The study introduces TQuADFake, a new dataset for lie detection in Turkish, built upon the TQuAD dataset for Islamic Science History. It comprises 9,954 sentences generated by GPT-4 and Claude 3 Sonnet, categorized as correct or intentionally incorrect ("lies") based on question-answer pairs. Human and LLM evaluators assessed dataset quality, with GPT-4 demonstrating superior performance in generating fluent and coherent sentences, while Claude struggled with consistency. The dataset aims to provide a robust resource for misinformation detection research in Turkish.
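As a rough illustration of the dataset layout described above, a record might look like the sketch below. The class name, field names, and the one-correct-plus-one-incorrect pairing are assumptions made for this example; the page does not document the actual TQuADFake schema, and the sample strings are placeholders rather than dataset content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeExample:
    """One generated sentence with its label (hypothetical schema)."""
    question: str
    answer: str
    sentence: str
    generator: str    # e.g. "gpt-4" or "claude-3-sonnet"
    is_correct: bool  # True for a truthful sentence, False for a deliberate lie


def make_pair(question, answer, correct_sentence, incorrect_sentence, generator):
    """Turn one QA pair into a correct and an incorrect labelled example."""
    return [
        FakeExample(question, answer, correct_sentence, generator, True),
        FakeExample(question, answer, incorrect_sentence, generator, False),
    ]


# Placeholder content, not taken from the dataset.
pair = make_pair(
    question="<question text>",
    answer="<gold answer span>",
    correct_sentence="<sentence stating the gold answer>",
    incorrect_sentence="<sentence stating a wrong answer>",
    generator="gpt-4",
)
```

Keeping the question and gold answer on every record makes it possible to compare a correct sentence with its incorrect counterpart later on.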

Linguistic Features

The research analyzes several linguistic features to distinguish correct from incorrect LLM-generated sentences. Sentence length: correct sentences are generally longer and more detailed. PoS tagging: GPT-4 keeps grammatical structures consistent across both classes, while Claude's PoS patterns vary between correct and incorrect outputs, reflecting greater sensitivity to linguistic accuracy. Punctuation: usage is higher in correct sentences, indicating better structure. Maximum parse depth: correct sentences tend to have more complex structures. Semantic similarity: slightly higher for correct sentences, which align better with the original intent. These features, though individually insufficient, provide valuable cues for lie detection.
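Some of these features can be sketched in a few lines. The example below is an illustration only and not the study's pipeline: it assumes "max depth" means the depth of a dependency tree, and it takes head indices in the CoNLL-U convention (1-based, 0 for the root), which is the format dependency parsers such as Stanza expose. The function names and the sample sentence are made up for this sketch, and semantic similarity is left out because it needs an embedding model.

```python
import string


def max_depth(heads):
    """Depth of the deepest token; heads[i] is the 1-based head of token i+1, 0 = root."""
    def depth(token_index):
        steps, current = 0, token_index
        while current != 0:
            current = heads[current - 1]  # climb one edge towards the root
            steps += 1
            if steps > len(heads):
                raise ValueError("cycle in head indices")
        return steps  # the root token itself has depth 1

    return max((depth(i) for i in range(1, len(heads) + 1)), default=0)


def surface_features(sentence):
    """Whitespace token count and ASCII punctuation count for one sentence."""
    tokens = sentence.split()
    n_punct = sum(ch in string.punctuation for ch in sentence)
    return {"n_tokens": len(tokens), "n_punct": n_punct}


# Made-up example: token 2 is the root, token 4 hangs off token 3.
example_depth = max_depth([2, 0, 2, 3])
example_surface = surface_features("Ali, okula gitti.")
```

Note that `string.punctuation` covers ASCII punctuation only; a production feature extractor would more likely count tokens the parser tags as punctuation.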

Lie Detection Models

Various models were evaluated for lie detection on the TQuADFake dataset: SVM with linguistic features, BERT-base, and LLMs (Llama 3.1, Claude 3 Sonnet, GPT-4) in zero-shot, few-shot, and supervised fine-tuning (SFT) settings. The BERT-base model achieved the highest accuracy (78.11%), outperforming all LLMs, even after SFT (Llama 3.1 SFT: 72.11%). SVM also surpassed LLMs. This indicates that simpler models focusing on fine-grained linguistic and stylistic cues are currently more effective for this task than larger, general-purpose LLMs, which struggle to consistently capture these nuances. The task remains an unsolved problem, highlighting the need for specialized approaches.
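The difference between the zero-shot and few-shot settings comes down to what the prompt contains. The sketch below shows one way such prompts could be assembled and the reply mapped back to a label; the instruction wording, the `CORRECT`/`INCORRECT` answer format, and both function names are assumptions for illustration, not the prompts used in the study.

```python
def build_prompt(sentence, examples=()):
    """Zero-shot when `examples` is empty, few-shot when labelled examples are given."""
    lines = [
        "Decide whether the Turkish sentence is factually correct.",
        "Answer with exactly one word: CORRECT or INCORRECT.",
        "",
    ]
    for text, label in examples:  # labelled demonstrations come first
        lines += [f"Sentence: {text}", f"Answer: {label}", ""]
    lines += [f"Sentence: {sentence}", "Answer:"]
    return "\n".join(lines)


def parse_label(reply):
    """Map a model reply to 1 (correct), 0 (incorrect), or None if unparseable."""
    word = reply.strip().upper()
    if word.startswith("INCORRECT"):
        return 0
    if word.startswith("CORRECT"):
        return 1
    return None


zero_shot = build_prompt("<sentence to classify>")
few_shot = build_prompt(
    "<sentence to classify>",
    examples=[("<true sentence>", "CORRECT"), ("<false sentence>", "INCORRECT")],
)
```

Returning `None` for unparseable replies keeps malformed answers visible instead of silently counting them as one class.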

~0.90 Strongest Correlation (Pearson R) for Correctness between Human & LLM-Eval
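For reference, Pearson's r between two sets of scores can be computed as below. This is a generic sketch of the statistic, not the study's evaluation code, and the score lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical correctness scores from a human and an LLM evaluator.
human_scores = [1, 2, 3, 4, 5]
llm_scores = [2, 4, 6, 8, 10]
r = pearson_r(human_scores, llm_scores)
```

A value near 0.90 means the two evaluators rank items in a strongly similar, nearly linear way, even if their absolute scores differ.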

Case Study: GPT-4's Deceptive Fluency

Description: Human evaluators found GPT-4 to be a "better liar" than Claude, as its incorrect sentences were more difficult to detect. This module explores why GPT-4’s generated misinformation is particularly convincing.

Challenge: GPT-4 produces more fluent, coherent, and contextually relevant incorrect sentences. It often only changes the answer span, making the lie subtle and integrated into otherwise accurate statements. Claude, in contrast, frequently misapplies information or relies on hedging phrases (e.g., "according to what is said"), making its inaccuracies easier to spot.

Solution: The study highlights that GPT-4's ability to maintain high grammatical consistency and natural language flow, even when generating false information, complicates lie detection for both humans and automated systems. This challenges traditional linguistic feature-based detection, which often relies on structural anomalies in misleading text.

Outcome: This insight underscores the need for advanced, context-aware lie detection methods that can go beyond superficial linguistic cues to identify semantic inconsistencies and factual deviations, particularly for highly sophisticated LLMs like GPT-4.
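The observation that a lie often differs from the truthful sentence only in the answer span can be checked mechanically when both versions are available. The sketch below uses Python's standard `difflib` to list the token spans that differ; the function name and the English sentence pair are made up for illustration and are not dataset content.

```python
import difflib


def changed_spans(correct, incorrect):
    """Return (correct_span, incorrect_span) pairs where the two sentences differ."""
    a, b = correct.split(), incorrect.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep replace, insert, and delete operations
    ]


# Made-up pair in which only the answer span changes.
spans = changed_spans(
    "The bridge was completed in 1900 by the city",
    "The bridge was completed in 1950 by the city",
)
```

A single short changed span, as in this pair, is the subtle pattern attributed to GPT-4 above; many or long changed spans would point to a more heavily rewritten sentence.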

Enterprise Process Flow

Curate TQuADFake Dataset with QA Pairs → Generate Correct & Incorrect Sentences via LLMs → Evaluate Dataset Quality (Human & LLM) → Extract Linguistic Features (Stanza) → Train Lie Detection Models (SVM, BERT, LLMs) → Analyze Model Performance & Linguistic Impact

Model Performance Comparison on Lie Detection

Metric	BERT-base	Llama 3.1 8B SFT	GPT-4 Zero-shot
Accuracy	78.11%	72.11%	58.94%
Precision	78.21%	72.11%	68.59%
Recall	78.11%	72.11%	58.94%
Key Finding	Best overall performance; effective with fine-grained signals	Improved with SFT; still lags behind BERT-base	Lowest performance among LLMs; struggles with zero-shot context
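In the table, Recall equals Accuracy for every model. That pattern is what support-weighted averaging over classes produces, because weighted recall always reduces to accuracy in single-label classification. The page does not state which averaging was used, so this reading is an assumption. The sketch below computes the three metrics on hypothetical labels to show the identity.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision and recall."""
    n = len(y_true)
    support = Counter(y_true)  # number of gold items per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        class_precision = true_pos / predicted if predicted else 0.0
        class_recall = true_pos / count
        # Weight each class by its share of the gold labels.
        precision += class_precision * count / n
        recall += class_recall * count / n
    return accuracy, precision, recall


# Hypothetical gold labels and predictions (1 = correct sentence, 0 = lie).
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 1]
accuracy, precision, recall = weighted_scores(gold, pred)
```

Weighted precision has no such identity, which would explain why the Precision row can differ from the other two, as it does for BERT-base and GPT-4.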

Your AI Implementation Roadmap

Based on our analysis, here’s a potential phased approach to integrating advanced misinformation detection into your enterprise workflows.

Phase 01: Initial Data Strategy & Pilot

Establish clear objectives for misinformation detection, identify key content sources, and initiate a pilot project with the TQuADFake dataset or similar domain-specific data to train and validate initial models.

Phase 02: Linguistic Feature Integration

Integrate robust linguistic feature extraction pipelines (e.g., PoS tagging, dependency parsing) for Turkish text. Develop and fine-tune specialized models such as BERT-base for lie detection.

Phase 03: LLM Evaluation & Hybrid System Development

Implement a framework for continuous evaluation of LLM-generated content against truthfulness criteria. Explore hybrid systems combining specialized lie detection models with LLM capabilities for complex cases.

Phase 04: Deployment & Continuous Monitoring

Deploy the validated misinformation detection system within your content workflows. Establish continuous monitoring and feedback loops to adapt to evolving deception strategies and LLM updates, ensuring sustained accuracy and reliability.
