MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

Yuxuan Fan1, Gyusik Seo1, Jing Hao2, Jaemin Cho3,4, Mohit Bansal5, Jaehong Yoon1†
1NTU Singapore  ·  2The University of Hong Kong  ·  3Johns Hopkins University  ·  4AI2  ·  5UNC-Chapel Hill

Abstract

Audiovisual arts encompass diverse creative disciplines, including cinema, visual arts, stage performance, and game design, where artistic meaning arises from deliberate combinations of visual, auditory, and narrative elements. True artistic understanding extends beyond recognizing what is depicted to reasoning about why it is expressed through particular creative choices. Despite the strong progress of multimodal large language models (MLLMs), this critical aspect of artistic understanding remains underexplored, as existing benchmarks largely measure perceptual recognition while overlooking reasoning about creative intent. To address this gap, we introduce MuseBench, a comprehensive benchmark designed to evaluate MLLMs on nuanced artistic understanding. It comprises 4,016 questions spanning cinematic arts, static visual arts, stage performing arts, and game arts, distilled from over 10K candidate video essays that pair professional commentary with visual demonstration. Comprehensive zero-shot evaluation of 28 state-of-the-art MLLMs reveals that even the best-performing model achieves only 48.29% accuracy, substantially below human expert performance of 87.18%.

Benchmark Overview

[Figure: MuseBench Overview]
4,016 Questions
4 Categories
11 Sub-domains
28 MLLMs Evaluated

Construction Pipeline

MuseBench is constructed through three phases: (I) video collection and preprocessing with audio transcription, (II) question and answer annotation with adversarial distractor generation, and (III) human-in-the-loop iterative quality review.

[Figure: MuseBench Construction Pipeline]

Example Questions

Sample questions from each of the four art categories, illustrating single-select and multi-select formats with varying option counts.

[Figure: MuseBench Example Questions]

Results

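Zero-shot evaluation of 28 MLLMs. ACC = Accuracy (%), CAA = Chance-Adjusted Accuracy (%), EM = Exact Match (%). The Cin., SVA, SPA, and Game columns report per-category CAA (%) on single-select questions for cinematic arts, static visual arts, stage performing arts, and game arts, respectively.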

Model                 ACC    CAA     EM   Cin.    SVA    SPA   Game
Human Expert        87.18  90.98  78.00  98.74  90.13  89.42  86.15

Proprietary MLLMs
Claude-4.6-Opus     48.29  55.13  28.91  63.26  58.51  62.65  34.07
Qwen-3.5-Plus       47.27  58.52  23.21  68.88  64.36  60.69  38.80
Doubao-Seed         46.11  55.00  24.22  62.10  65.63  56.84  32.86
GPT-5.4             44.58  50.28  25.50  56.50  54.24  56.43  32.00
Gemini-3.1-Pro      36.89  43.77  14.88  43.16  42.70  49.50  38.72
Grok-4.1            20.54  13.71   8.00  14.19  19.70  15.90   3.20
Kimi-K2.5           19.91  18.33   2.07  23.06  24.05  23.62   0.35
GLM-4.5v            17.13   5.43   8.61  16.17  -2.97  13.60  -4.34

Open Source General-Purpose MLLMs
Qwen3.5-397B        44.76  53.42  22.71  62.65  57.45  56.60  35.79
InternVL3-78B       37.81  47.03  13.53  47.98  48.79  57.25  31.59
InternVL3-8B        33.07  29.49  20.30  43.41  36.23  30.48   6.66
Qwen2.5-Omni        32.70  30.71  18.18  36.47  40.76  34.43   8.31
MiniCPM-o           31.34  27.05  18.90  35.72  32.92  31.49   6.14
Gemma-4-E4B         27.61  28.67   9.06  39.40  32.55  31.44  10.30
LLaVA-OV-7B         20.41  21.24   0.50  22.01  25.77  25.09  10.24

Open Source Video-Specific MLLMs
VideoLLaMA3         27.18  26.82   9.90  34.37  24.55  33.76  14.02
Video-R1            26.73  28.41   7.21  30.50  28.70  38.49  13.87
VideoRFT            26.13  26.17   8.17  26.50  30.14  36.04   9.01
VideoChat-R1        26.08  26.49   7.77  35.87  29.05  33.04   6.46
Video-XL-2          24.17  29.91   0.11  28.63  29.29  40.42  19.17
LongVT              20.51  17.14   4.64  17.99  21.61  21.92   5.02
VideoLLaMA2         20.34  20.07   1.17  31.35  18.07  25.34   5.40
AKS                 19.31  18.99   0.00  17.61  21.46  28.01   6.34
Q-Frame             18.76   9.65   8.05  13.81  13.83   3.30   8.19
VideoChat2          17.78  15.27   0.34  17.20  14.35  18.67  10.45
Video-CCAM          17.53  15.10   0.00  23.38  18.31  16.92   1.05
LongVU              14.87   8.21   1.01  14.40   6.50   5.46   7.75
TimeChat            14.42   7.79   0.34  12.27   9.66   4.61   5.04

Random              13.55   0.02   6.04   0.04  -0.04   0.02   0.06
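
The page does not spell out how CAA is computed. The sketch below assumes the standard chance correction, (accuracy - chance) / (1 - chance), with a per-question chance of 1/k for a single-select question with k options. The function name, the averaging scheme, and the formula itself are assumptions for illustration, not the authors' confirmed definition. The assumption is at least consistent with the table: the Random baseline sits near 0% CAA, and models that do worse than chance receive negative scores.

```python
from typing import Sequence


def chance_adjusted_accuracy(correct: Sequence[bool], num_options: Sequence[int]) -> float:
    """Return chance-adjusted accuracy in percent for single-select questions.

    Assumed formula (not confirmed by the source):
        (acc - chance) / (1 - chance)
    where chance is the mean of 1/k over questions with k options each.
    """
    if len(correct) != len(num_options) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(k < 2 for k in num_options):
        raise ValueError("every question needs at least two options")
    acc = sum(correct) / len(correct)
    chance = sum(1.0 / k for k in num_options) / len(num_options)
    return 100.0 * (acc - chance) / (1.0 - chance)


# Toy usage: four 4-option questions, one answered correctly (exactly chance level).
toy_caa = chance_adjusted_accuracy([True, False, False, False], [4, 4, 4, 4])
```

Under this definition a model answering at chance level scores 0, a perfect model scores 100, and a model below chance scores below 0.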

Key Findings

F1: Audiovisual-arts reasoning remains far from saturated

The best proprietary model (Claude-4.6-Opus) reaches 48.29% accuracy, trailing human experts (87.18%) by nearly 39 percentage points. Open-source models lag further still, with the best of them (Qwen3.5-397B) at 44.76%, underscoring the difficulty of intent-level artistic reasoning for current MLLMs.

F2: Game arts are a shared weakness

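Across proprietary, open-source general-purpose, and video-specific models alike, game arts are the weakest category: they yield the lowest per-category CAA for 25 of the 28 models in the results table (the exceptions, Q-Frame, LongVU, and TimeChat, score lowest on stage performing arts). This suggests that game design reasoning (level design intent, mechanics-narrative interplay, aesthetic systems) is especially challenging for MLLMs.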
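
As a quick check of this finding against the results table, the sketch below picks the weakest category for a handful of models. The per-category CAA values are copied from the table; the dictionary layout and the helper name `weakest_category` are illustrative choices, not part of any released MuseBench tooling.

```python
from typing import Dict

# Per-category CAA (%) copied from the results table for a few models.
rows: Dict[str, Dict[str, float]] = {
    "Claude-4.6-Opus": {"Cin.": 63.26, "SVA": 58.51, "SPA": 62.65, "Game": 34.07},
    "Qwen3.5-397B": {"Cin.": 62.65, "SVA": 57.45, "SPA": 56.60, "Game": 35.79},
    "VideoLLaMA3": {"Cin.": 34.37, "SVA": 24.55, "SPA": 33.76, "Game": 14.02},
    "Q-Frame": {"Cin.": 13.81, "SVA": 13.83, "SPA": 3.30, "Game": 8.19},
}


def weakest_category(scores: Dict[str, float]) -> str:
    """Return the category name with the lowest score."""
    return min(scores, key=scores.get)


weakest = {model: weakest_category(scores) for model, scores in rows.items()}
for model, category in weakest.items():
    print(f"{model}: weakest on {category}")
```

For the rows listed, Game is the weakest category for every model except Q-Frame, whose lowest score is on SPA.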

[Figure: Performance Summary]

F3: Key frames provide limited gains

Video-specific models that rely on key-frame extraction achieve accuracies between 14.42% and 20.51%, only modestly above the random baseline of 13.55%. This shows that sparse frame sampling is insufficient to capture the temporal and contextual cues essential for artistic understanding.

F4: Models select the most salient correct option but miss the rest

High CAA scores paired with low EM scores (for example, Qwen-3.5-Plus reaches 58.52% CAA but only 23.21% EM) reveal that models tend to identify the single most obvious correct answer while failing to recognize additional valid options, indicating shallow pattern matching rather than genuine comprehension.
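
The gap described here can be illustrated on multi-select questions by comparing exact match against a looser "any hit" rate. This is a minimal sketch: it assumes EM means the predicted option set equals the gold set exactly, and `any_hit` is a hypothetical diagnostic introduced only for this example; neither is confirmed as the benchmark's scoring code.

```python
from typing import Dict, List, Set


def score_multi_select(preds: List[Set[str]], golds: List[Set[str]]) -> Dict[str, float]:
    """Return exact-match and any-hit rates (in percent) for multi-select questions."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and of equal length")
    # Exact match: the predicted set must equal the gold set.
    exact = sum(p == g for p, g in zip(preds, golds))
    # Any hit: at least one predicted option is among the gold options.
    any_hit = sum(bool(p & g) for p, g in zip(preds, golds))
    n = len(golds)
    return {"exact_match": 100.0 * exact / n, "any_hit": 100.0 * any_hit / n}


# Toy data: the first prediction finds one correct option but misses the other.
golds = [{"A", "C"}, {"B", "C"}, {"A"}]
preds = [{"A"}, {"B", "C"}, {"D"}]
scores = score_multi_select(preds, golds)
```

In the toy data, two of three predictions contain a correct option but only one matches the gold set exactly, which is the pattern F4 describes.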

F5: Audio input brings no consistent gain

Models with audio input (V+A+T) do not consistently outperform vision-text-only (V+T) counterparts, suggesting that current audio encoding and cross-modal fusion strategies fail to effectively leverage auditory information for artistic reasoning.

F6: Open-source MLLMs exhibit pronounced first-position bias

Open-source models disproportionately favor answer options in the first position, regardless of correctness. This positional bias inflates accuracy on standard orderings and highlights a systematic vulnerability in multiple-choice reasoning.
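
One simple way to quantify such a bias is to compare how often a model picks the first option with how often the first option is actually correct. The page does not describe the authors' measurement protocol, so the function below and its name `first_position_excess` are assumptions for illustration, restricted to single-select questions with zero-based option indices.

```python
from typing import Sequence


def first_position_excess(pred_idx: Sequence[int], gold_idx: Sequence[int]) -> float:
    """Share of predictions at position 0 minus share of gold answers at position 0.

    A value near 0 suggests no first-position preference; a clearly positive
    value means the model picks the first option more often than it is correct.
    """
    if len(pred_idx) != len(gold_idx) or not pred_idx:
        raise ValueError("inputs must be non-empty and of equal length")
    pred_rate = sum(i == 0 for i in pred_idx) / len(pred_idx)
    gold_rate = sum(i == 0 for i in gold_idx) / len(gold_idx)
    return pred_rate - gold_rate


# Toy data: the model picks option 0 three times, but it is correct only once.
excess = first_position_excess([0, 0, 0, 1], [0, 1, 2, 3])
```

A complementary check, also an assumption rather than the documented protocol, is to re-run the evaluation with shuffled option orders and see whether accuracy drops.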

[Figure: Position Bias Analysis]

BibTeX

If you find MuseBench useful for your research, please consider citing:

@article{fan2026musebench,
  title   = {MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs},
  author  = {Fan, Yuxuan and Seo, Gyusik and Hao, Jing and Cho, Jaemin and Bansal, Mohit and Yoon, Jaehong},
  year    = {2026},
  journal = {arXiv preprint arXiv:2606.30026},
  url     = {https://arxiv.org/abs/2606.30026},
}