Education / Assessment AI
ChatGPT and Gemini participated in the Korean College Scholastic Ability Test - Earth Science I
This study analyzes the performance of state-of-the-art LLMs (GPT-4o, Gemini 2.5 Flash, Gemini 2.5 Pro) on the 2025 Korean College Scholastic Ability Test (CSAT) Earth Science I section. It identifies key cognitive limitations in multimodal scientific reasoning, including 'Perception Errors,' 'Calculation-Conceptualization Discrepancy,' and 'Process Hallucination.' The findings suggest how to design 'AI-resistant questions' by exploiting these vulnerabilities to distinguish human competency from AI-generated responses.
Executive Impact: Diagnosing AI's Cognitive Gaps
Understanding AI's fundamental reasoning flaws is crucial for robust assessment design and educational integration. Our analysis reveals key performance metrics and areas of vulnerability.
Deep Analysis & Enterprise Applications
The following modules distill the specific findings of the research into enterprise-focused analyses.
The Perception-Cognition Gap in LLMs
The study reveals a significant 'Perception-Cognition Gap': LLMs struggle to interpret the symbolic meaning of schematic diagrams even when the visual data itself is correctly recognized. This is not merely a visual error but a deeper failure to connect visual information with the underlying scientific concepts. Key sub-categories include Visual Data Misreading (9 cases, 25.00%) and Schematic Misinterpretation (6.5 cases, 18.06%).
Conceptual Application Challenges
LLMs demonstrate 'Calculation-Conceptualization Discrepancy', successfully performing calculations but failing to apply the underlying scientific concepts to the results. This indicates a superficial understanding rather than deep conceptual integration. Sub-categories are Concept Misapplication (4.5 cases, 12.50%) and Calculation-Concept Discrepancy (1 case, 2.78%).
Flawed Reasoning and Process Hallucination
A critical vulnerability identified is the LLMs' tendency to skip complex reasoning steps and generate plausible but unfounded conclusions, termed 'Process Hallucination.' They also exhibit 'Flawed Reasoning' by making logical leaps or setting false premises. Sub-categories: Flawed Reasoning (7 cases, 19.44%), Spatio-temporal Failure (2 cases, 5.56%), Factual Hallucination (2 cases, 5.56%), and Process Hallucination (4 cases, 11.11%).
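As a quick consistency check, the sub-category counts listed across the three modules above can be tallied against their reported percentages. The counts and category names are the study's own; the interpretation that fractional counts (6.5, 4.5) represent errors attributed to two categories at once is our assumption:

```python
# Error-category counts as reported in the study.
# Fractional counts presumably split one error across two categories (assumption).
error_counts = {
    "Visual Data Misreading": 9,
    "Schematic Misinterpretation": 6.5,
    "Concept Misapplication": 4.5,
    "Calculation-Concept Discrepancy": 1,
    "Flawed Reasoning": 7,
    "Spatio-temporal Failure": 2,
    "Factual Hallucination": 2,
    "Process Hallucination": 4,
}

total = sum(error_counts.values())  # 36.0 weighted cases in total
for name, count in error_counts.items():
    print(f"{name}: {count} cases, {count / total:.2%}")
```

Every reported percentage resolves to a share of 36 weighted cases (e.g. 6.5/36 = 18.06%), which confirms the figures are drawn from a single error pool rather than per-model tallies.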
Model Performance Comparison
| Model | Full-Page Input (Accuracy) | Optimized Input (Accuracy) |
|---|---|---|
| Gemini 2.5 Flash | 8% | 20% |
| GPT-4o | 14% | 22% |
| Gemini 2.5 Pro | 28% | 68% |
| Human Examinee (Top) | N/A | 95%+ |
AI-Resistant Question Design: Leveraging LLM Weaknesses
By exploiting the identified vulnerabilities, educators can design questions that effectively distinguish genuine human understanding from AI-generated responses. For instance, items that require interpreting atypical schematic diagrams (targeting the Perception-Cognition Gap), or multi-step problems in which procedural calculations must be connected to their deeper scientific meaning (targeting the Calculation-Conceptualization Discrepancy), can serve as powerful AI-resistant assessments. Questions that demand explicit verification of visual data before reasoning, countering 'Process Hallucination,' are similarly effective.
Your Strategic Implementation Roadmap
Based on the research, we've outlined a phased approach to leverage AI's strengths while mitigating its weaknesses within your organization.
Phase 1: Vulnerability Assessment & Gap Analysis
Identify specific AI cognitive gaps within your enterprise data and processes, leveraging insights from CSAT-like reasoning failures.
Phase 2: AI-Resistant Design Prototyping
Develop and prototype AI-resistant assessment strategies or data validation mechanisms tailored to your unique operational challenges.
Phase 3: Human-AI Collaboration Frameworks
Establish protocols for human oversight and verification, building on the observed limitations of AI in deep reasoning and hallucination.
Phase 4: Continuous Monitoring & Refinement
Implement systems for ongoing evaluation of AI performance and adaptation of strategies to maintain assessment fairness and data integrity.
Ready to Transform Your AI Strategy?
Book a personalized consultation to discuss how these insights can be applied to your enterprise, ensuring robust and fair AI integration.