arxiv:2512.18880

Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

Published on Dec 21

Abstract

Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.

AI-generated summary

Large Language Models struggle to accurately estimate human cognitive difficulty due to a misalignment with human perceptions and a lack of introspection regarding their own limitations.

Community

zhoutianyi

Paper submitter 2 days ago

Key Findings of our Human-LLM difficulty alignment study:

Systematic Misalignment: Unlike standard capability metrics, difficulty alignment does not reliably improve with scale. Larger models do not consistently predict difficulty better; instead, models form a cohesive Machine Consensus, aligning significantly more strongly with each other than with human reality.
Limits of Simulation: Neither extrinsic ensembling nor proficiency simulation serves as a reliable fix for the misalignment. Ensemble performance is strictly bounded by weaker models, while proficiency simulation proves highly inconsistent as models struggle to authentically mimic different proficiency levels.
The Curse of Knowledge: Our IRT-based analysis reveals a fundamental mechanistic divergence: the difficulty derived from models' actual correctness correlates even more weakly with human difficulty than their explicit perceptions do. Items that are difficult for humans are frequently trivial for models, and this capability exhibits significant inertia even under weak student prompts.
Metacognitive Blindness: We identify a critical lack of introspection. With AUROC scores hovering near random guessing, models fail to predict their own limitations. Their explicit difficulty estimates are effectively decoupled from their actual correctness, so the predictions lack an internal signal to ground them.
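
For readers who want to see how two of these quantities could be measured, here is a minimal sketch. It is not the paper's code, and several things in it are assumptions: it uses Spearman rank correlation as the alignment measure (the summary above does not say which correlation is used), it treats the model's predicted difficulty as the AUROC score with "model answered incorrectly" as the positive label, and every number in it is made up for illustration.

```python
from itertools import product


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def auroc(scores, labels):
    """Chance that a random positive item outscores a random negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p, q in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data (invented): per-item difficulty on a 0-1 scale for five items.
human_difficulty = [0.2, 0.5, 0.9, 0.4, 0.7]
model_a_estimate = [0.6, 0.3, 0.5, 0.8, 0.4]
model_b_estimate = [0.7, 0.2, 0.5, 0.9, 0.3]

# 1 means model A answered the item incorrectly, 0 means it answered correctly.
model_a_wrong = [0, 1, 0, 1, 0]

machine_consensus = spearman(model_a_estimate, model_b_estimate)
human_alignment = spearman(model_a_estimate, human_difficulty)
introspection = auroc(model_a_estimate, model_a_wrong)

print(f"model A vs model B: {machine_consensus:.2f}")
print(f"model A vs humans:  {human_alignment:.2f}")
print(f"AUROC of A's estimates against A's own errors: {introspection:.2f}")
```

In this toy setup the two models rank the items identically while disagreeing with the human ranking, which mirrors the Machine Consensus pattern. The AUROC lands at chance level, the situation described under Metacognitive Blindness. Real evaluations would use many more items and the paper's own difficulty labels.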