We present PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combined with a fully rule-based method for symbolic formula equivalence matching that we developed, this ensures consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework aligns more closely with human experts' scoring than existing baselines. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
Step-level Accuracy, Final-Answer Accuracy and response time for evaluated models.
A data example with the proposed DAG structure.
• DAG Representation of Solutions
Encode each solution as a Directed Acyclic Graph (DAG), where nodes are formulas and edges capture prerequisite relations. This provides a minimal, complete, and interpretable structure of the reasoning process.
• Ancestor Closure Scoring
A solution's score credits each matched formula together with all of its prerequisites (its ancestor closure). This ensures fair credit propagation along causal chains, avoiding both over- and under-crediting; a minimal sketch follows this list.
• Optimality Guarantee
We prove the optimality of the DAG representation and the corresponding scoring policy, ensuring no information loss and no redundant complexity.
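As a concrete illustration of the scoring rule above, here is a minimal sketch of Ancestor Closure Scoring, assuming the reference solution is stored as a networkx DiGraph whose edges point from each prerequisite formula to the formulas that depend on it. The node names and point values are illustrative placeholders, not the dataset's actual schema.

```python
# Minimal sketch of Ancestor Closure Scoring (illustrative, not the released implementation).
import networkx as nx

def ancestor_closure_score(dag: nx.DiGraph, matched: set[str]) -> float:
    """Credit every matched formula plus all of its prerequisites (its ancestors in the DAG)."""
    credited = set(matched)
    for node in matched:
        credited |= nx.ancestors(dag, node)  # propagate credit backward along causal chains
    return sum(dag.nodes[n].get("points", 0) for n in credited)

# Toy example: the final conservation equation depends on two earlier definitions.
dag = nx.DiGraph()
dag.add_nodes_from([("KE", {"points": 20}), ("PE", {"points": 20}),
                    ("conservation", {"points": 60})])
dag.add_edges_from([("KE", "conservation"), ("PE", "conservation")])

# Matching only the final formula still credits its prerequisites: 20 + 20 + 60 = 100.
print(ancestor_closure_score(dag, {"conservation"}))
```

Because credit flows only backward through prerequisites, a matched intermediate formula never earns points for later steps that depend on it, which is what prevents over-crediting.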
[Stage 1] Constant Substitution. We substitute certain variables with their expressions. Variables, constants, and units are normalized into a predefined form for consistency.
[Stage 2] Solution Set Equivalence Check. For two equations with $N$ variables, one variable is randomly chosen as the target, the remaining $N-1$ are assigned random values, and the target is then solved for to compare whether the two solution sets are equivalent. This process is repeated for multiple iterations. Solution set equivalence serves as a proxy for equation equivalence.
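A minimal sketch of the Stage 2 check is shown below, assuming formulas have already been parsed into sympy equalities. The symbol declarations, number of trials, and the direct comparison of symbolic solution sets are simplifying assumptions for illustration; the actual implementation may handle edge cases (degenerate substitutions, root simplification) differently.

```python
# Illustrative sketch of solution-set equivalence checking (not the framework's exact code).
import random
import sympy as sp

def solution_sets_equivalent(eq1: sp.Eq, eq2: sp.Eq, trials: int = 5) -> bool:
    """Randomly fix N-1 variables, solve both equations for the remaining one,
    and compare the resulting solution sets; repeat for several random trials."""
    symbols = sorted(eq1.free_symbols | eq2.free_symbols, key=str)
    for _ in range(trials):
        target = random.choice(symbols)
        values = {s: sp.Rational(random.randint(1, 9)) for s in symbols if s != target}
        sols1 = set(sp.solve(eq1.subs(values), target))
        sols2 = set(sp.solve(eq2.subs(values), target))
        if sols1 != sols2:
            return False
    return True

# Rearranged forms of the same relation are recognized as equivalent.
E, m, c = sp.symbols('E m c', positive=True)
print(solution_sets_equivalent(sp.Eq(E, m * c**2), sp.Eq(E / c**2, m)))  # True
```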
An example of our scoring pipeline. A) Formula matching aligns student and reference formulas. B) Back-propagation grading highlights correctly credited formulas along the dependency DAG. C) The final score is computed as the sum of credited points, yielding 90/100 in this case.
Formula Extraction and Normalization.
• Given a student's solution, all mathematical expressions are first extracted and rewritten into our dataset's standardized canonical form, discarding invalid expressions such as syntactically malformed formulas or irrelevant numerical fragments.
Formula Matching.
• Each standardized student formula is compared against the reference DAG of the solution according to Section~\ref{sec:matching}, which outputs a set of matched formulas in the DAG.
Scoring.
• Finally, we score the student solution using the DAG and the set of matched formulas, following the Ancestor Closure Scoring Policy in Section~\ref{sec:ancestor}; a skeleton of the full flow is sketched below.
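Putting the three steps together, the hedged skeleton below shows how extraction, matching, and Ancestor Closure Scoring compose into a single grading call. Both helpers, `extract_formulas` and `formulas_match`, are simplified stand-ins for the rule-based normalization and equivalence matching described above, not the released implementation.

```python
# Hedged end-to-end skeleton of the grading flow (simplified stand-ins, not the released code).
import sympy as sp
import networkx as nx

def extract_formulas(solution_text: str) -> list[sp.Eq]:
    """Step 1: extract 'lhs = rhs' lines and parse them, discarding malformed expressions."""
    formulas = []
    for line in solution_text.splitlines():
        if line.count("=") != 1:
            continue
        lhs, rhs = line.split("=")
        try:
            eq = sp.Eq(sp.sympify(lhs), sp.sympify(rhs))
        except (sp.SympifyError, SyntaxError, TypeError):
            continue  # discard syntactically malformed formulas
        if isinstance(eq, sp.Eq):  # skip degenerate equalities that simplify away
            formulas.append(eq)
    return formulas

def formulas_match(a: sp.Eq, b: sp.Eq) -> bool:
    """Step 2 stand-in: treat equations as matching if their residuals agree up to sign."""
    ra, rb = a.lhs - a.rhs, b.lhs - b.rhs
    return sp.simplify(ra - rb) == 0 or sp.simplify(ra + rb) == 0

def grade(solution_text: str, dag: nx.DiGraph, reference: dict[str, sp.Eq]) -> float:
    """Steps 1-3: extract, match against the reference DAG, then apply ancestor closure."""
    student = extract_formulas(solution_text)
    matched = {name for name, ref in reference.items()
               if any(formulas_match(f, ref) for f in student)}
    credited = set(matched)
    for node in matched:
        credited |= nx.ancestors(dag, node)
    return sum(dag.nodes[n].get("points", 0) for n in credited)
```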
Overview of the Three-Step Rewriting Pipeline.
Three-Step Rewriting Pipeline. To guarantee both internal consistency and external evaluability, every sample in the dataset is processed through a structured three-step rewriting pipeline. Each step focuses on eliminating ambiguity and enforcing standardization while preserving the fidelity of the original content.
Verification and Quality Control. At each stage, an LLM-based module verifies formatting, clarity, and dependency rules; failures trigger corrective feedback and regeneration.
Fine-Grained Enhancements. Beyond the main pipeline, we applied several refinements: enforcing significant-figure rules, explicitly defining all constants and variables, and unifying answer formatting.
The dataset is available for download on Hugging Face Datasets.
Difficulty Annotation.
Each problem is assigned a composite difficulty label that integrates LLM-based ratings of conceptual depth and computational burden with an entropy-based measure of DAG complexity. The three components are combined into a unified score, which is mapped to one of three levels (Easy, Medium, or Hard), capturing both the content difficulty and the reasoning complexity of the solution.
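For illustration only, the sketch below shows one way the three components could be combined. The entropy measure used here (Shannon entropy of the DAG's out-degree distribution), the equal weighting, and the Easy/Medium/Hard thresholds are assumptions of this sketch, not the benchmark's published definition.

```python
# Illustrative difficulty scoring; the entropy definition, weights, and thresholds
# are assumptions of this sketch, not the benchmark's actual formula.
import math
from collections import Counter
import networkx as nx

def dag_entropy(dag: nx.DiGraph) -> float:
    """Shannon entropy of the out-degree distribution, one possible DAG-complexity proxy."""
    degrees = [d for _, d in dag.out_degree()]
    counts = Counter(degrees)
    total = len(degrees)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def difficulty_label(concept_rating: float, compute_rating: float, dag: nx.DiGraph) -> str:
    """Combine LLM ratings (assumed normalized to [0, 1]) with the DAG-complexity proxy."""
    complexity = min(dag_entropy(dag) / 3.0, 1.0)  # crude normalization, assumed here
    score = (concept_rating + compute_rating + complexity) / 3.0
    if score < 0.4:
        return "Easy"
    if score < 0.7:
        return "Medium"
    return "Hard"
```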
Physics Domain Categorization.
Each problem is categorized into one of seven key physics domains:
(1) Mechanics,
(2) Electromagnetism,
(3) Optics,
(4) Atomic, Nuclear, and Particle Physics,
(5) Thermodynamics and Statistical Physics,
(6) Quantum Mechanics,
(7) Solid State Physics and Miscellaneous Topics.
Table 1. Step-level Accuracy and Final-Answer Accuracy across difficulty levels (Easy, Medium, Hard, and Avg.) for evaluated models.
Table 1 reports step-level and final-answer accuracy across difficulty levels. As problem difficulty rises, performance declines and response time increases, reflecting LLMs' sensitivity to longer reasoning chains, more demanding modeling, and higher computational effort. Final-answer and step-level evaluations diverge sharply with problem difficulty: final-answer accuracy drops by over 40% from Easy to Medium and falls below 10% on Hard problems, while step-level scoring reveals that models still earn partial credit by applying key principles or deriving valid intermediate equations before failing at later stages.
These results demonstrate that final-answer scoring alone severely underestimates reasoning ability, whereas step-level evaluation provides a more faithful measure of process competence under complex tasks. Moreover, step-level signals open promising avenues for training and data curation: If evaluation relies solely on final answers, rewards on difficult problems become extremely sparse. Instead, step-level scoring provides rich intermediate reward signals, offering valuable guidance for reinforcement learning and a principled basis for constructing higher-quality training data.
Step-level and final-answer accuracy across Physics Domain Categories and Difficulty Levels.
We analyze LLM performance across physics domains and difficulty levels, as shown in the figure. Models exhibit varying accuracy across domains, with the highest performance observed in Thermodynamics and Statistical Physics and the lowest in Quantum Mechanics. Step-level evaluation further exposes weaknesses in reasoning coherence, and accuracy consistently drops from Easy to Hard problems across all domains.
The effect of multimodal input varies across model families. In general, adding images provides stronger gains at the step level than at the final-answer level, highlighting its role in supporting intermediate reasoning. However, for smaller or weaker models, multimodal input can even be detrimental, as diagrams in physics problems often serve a presentational rather than informational role, with the critical content already conveyed in text.
As shown in Figure 4, reasoning-oriented models exhibit consistently higher accuracy than chat-oriented models, but this improvement comes with substantially longer response times.
We further evaluate GPT-5 and GPT-5-mini under three reasoning-effort modes (low, medium, and high). Results indicate a consistent improvement in accuracy with increasing reasoning effort. However, for GPT-5, the average latency of the medium mode is 83.38% higher than that of the low mode, and the high mode is 268.18% higher. GPT-5-mini shows the same pattern. These results confirm that deeper reasoning consistently improves accuracy while incurring substantial increases in computational cost. Notably, although o4-mini has been reported to be a strong reasoning model, its performance here is relatively poor; one possible explanation is that, as a distilled model, it suffers from limited generalization and thus struggles with complex reasoning tasks beyond its training distribution.
Figure 5: Distribution of primary error types across models.
We perform error analysis on the first incorrect step detected in each solution as shown in Figure 5, using a unified taxonomy that integrates process-level physics reasoning errors with formula-level derivation errors.
The classification covers seven categories:
(1) Diagram Analysis Error (DAE),
(2) Physics Theorem Application Error (PTAE),
(3) Modeling and Process Understanding Error (MPUE),
(4) Condition or Assumption Error (CAE),
(5) Variable Relationship Error (VRE),
(6) Derivation and Computation Error (DCE), and
(7) Unit Dimension Error (UDE).
The dominant error types across models are Condition/Assumption Errors (CAE), which arise when models set up inconsistent or incorrect physical assumptions; Derivation & Computation Errors (DCE), which occur when models make mistakes in algebraic manipulation or calculation; and Modeling & Process Understanding Errors (MPUE), which reflect failures in mapping the problem into the correct physical model or reasoning process. This indicates that LLMs often fail both in establishing consistent physical conditions and in executing algebraic reasoning.
We randomly sampled 70 problems (10 from each domain) along with their corresponding DeepSeek-V3 (text-only) solutions. Each problem–solution pair was independently evaluated by two human experts to reduce variance; the annotators included an IPhO gold medalist and a top-tier physics PhD. In cases where the two experts' scores differed substantially, a third annotator was invited to adjudicate and determine the final score.
We quantified the agreement between framework-generated scores and human annotations using Kendall's $\tau_b$ correlation coefficient, along with statistical significance testing via both asymptotic and permutation-based $p$-values. Higher $\tau_b$ values indicate stronger concordance, with significance levels verifying the robustness of the observed correlations.
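As a concrete sketch of this agreement computation, the snippet below uses scipy to obtain Kendall's $\tau_b$ with an asymptotic $p$-value and a permutation-based $p$-value; the score lists are illustrative placeholders, not the actual annotation data.

```python
# Illustrative agreement computation (placeholder scores, not the real annotations).
from scipy.stats import kendalltau, permutation_test

framework_scores = [90, 40, 75, 10, 60]  # scores from the automatic evaluator
human_scores     = [85, 35, 80, 15, 55]  # adjudicated expert scores

# Kendall's tau-b (the default variant) with its asymptotic p-value.
tau_b, p_asymptotic = kendalltau(framework_scores, human_scores)

# Permutation-based p-value: shuffle the pairing and recompute tau-b each time.
def statistic(x):
    return kendalltau(x, human_scores)[0]

perm = permutation_test((framework_scores,), statistic,
                        permutation_type='pairings', n_resamples=10_000)
print(f"tau_b={tau_b:.3f}  p_asymptotic={p_asymptotic:.4f}  p_permutation={perm.pvalue:.4f}")
```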
Table 2: Comparison of annotation alignment.
Table 2 demonstrates the clear superiority of PRISM-DAG, which achieves the highest $\tau_b$ and lowest $p$-values. LLM-as-Judge is purely outcome-based, assigning only binary 0/1 scores, while PSAS-S, though process-based, evaluates steps independently without modeling causal dependencies. Both baselines are LLM-based, whereas our non-LLM PRISM-DAG explicitly accounts for causality across steps, leading to stronger alignment with human judgments. We also analyzed failure cases from our evaluator and the two baselines to understand their strengths and limitations.
@article{zhao2025prismphysics,
  title={PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning},
  author={Wanjia Zhao and Qinwei Ma and Jingzhe Shi and Shirley Wu and Jiaqi Han and Yijia Xiao and Si-Yuan Chen and Xiao Luo and Ludwig Schmidt and James Zou},
  journal={arXiv preprint arXiv:2510.03185},
  year={2025}
}