Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
Abstract
Top-tier MLLMs demonstrate limited capability in processing discrete symbols despite strong performance in complex reasoning, revealing a cognitive mismatch between visual perception and symbolic understanding.
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols, the fundamental building blocks of human cognition, remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters demand precise, fine-grained interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed at complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to genuinely perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.
Community
Interesting benchmark on symbolic understanding in multimodal LLMs. One striking result is that models can do better on reasoning than on basic symbol recognition, suggesting they often rely on language priors rather than true visual-symbol grounding. A timely and valuable evaluation resource.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models (2026)
- TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning (2026)
- Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies (2026)
- Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation (2026)
- VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? (2026)
- MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning (2026)
- MM-THEBench: Do Reasoning MLLMs Think Reasonably? (2026)