
PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning

1Stanford University, 2Tsinghua University,
3UCLA, 4Harvard University, 5University of Wisconsin-Madison

Introduction

We present PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combined with a fully rule-based method for symbolic formula equivalence matching that we developed, the framework ensures consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework aligns more closely with human experts' scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for subsequent training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.


Step-level accuracy, final-answer accuracy, and response time for evaluated models.

PRISM-DAG

Overview


A data example with the proposed DAG structure.

DAG Representation of Solutions Encode each solution as a Directed Acyclic Graph (DAG), where nodes are formulas and edges capture prerequisite relations. This provides a minimal, complete, and interpretable structure of the reasoning process.
Ancestor Closure Scoring Score = matched formulas + all their prerequisites. This ensures fair credit propagation along causal chains, avoiding both over- and under-crediting (a minimal sketch follows this list).
Optimality Guarantee We prove the optimality of the DAG representation and the corresponding scoring policy, ensuring no information loss and no redundant complexity.
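
To make the scoring rule concrete, below is a minimal Python sketch of ancestor closure scoring over a dependency DAG, assuming each formula node carries a point value; the function and variable names are illustrative, not taken from the released codebase.

    from collections import defaultdict

    def ancestor_closure_score(edges, points, matched):
        """Credit every matched formula plus all of its prerequisites (ancestors).

        edges:   list of (prerequisite, formula) pairs defining the solution DAG
        points:  dict mapping each formula node to its point value
        matched: set of formula nodes matched in the student's solution
        """
        parents = defaultdict(set)
        for prereq, node in edges:
            parents[node].add(prereq)

        credited, stack = set(), list(matched)
        while stack:  # walk upward through prerequisite chains
            node = stack.pop()
            if node in credited:
                continue
            credited.add(node)
            stack.extend(parents[node])
        return sum(points[n] for n in credited)

    # Toy example: f3 depends on f1 and f2, so matching f3 also credits f1 and f2.
    edges = [("f1", "f3"), ("f2", "f3")]
    points = {"f1": 30, "f2": 30, "f3": 40}
    print(ancestor_closure_score(edges, points, {"f3"}))  # prints 100

Because credit propagates only along prerequisite edges, a matched formula never rewards unrelated branches of the DAG.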

Evaluation Framework

Rule-based Physics Formula Equivalence Matching

[Stage 1] Constant Substitution. Certain variables, such as physical constants, are substituted with their defining expressions, and variables, constants, and units are normalized into a predefined form for consistency.
[Stage 2] Solution Set Equivalence Check. For two equations with $N$ variables, one variable is randomly chosen as the target, the remaining $N-1$ are assigned random values, and the target is solved for in both equations to check whether the resulting solution sets are equivalent. This process is repeated for multiple iterations, with solution set equivalence serving as a proxy for equation equivalence (a minimal sketch follows).
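
The Stage 2 check can be sketched with SymPy as follows; the number of trials, the numeric tolerance, and the sampling range are illustrative assumptions rather than the exact parameters used in PRISM-Physics.

    import random
    import sympy as sp

    def solution_sets_equivalent(eq1, eq2, variables, trials=20, tol=1e-8):
        """Randomized proxy for equation equivalence (Stage 2 sketch)."""
        for _ in range(trials):
            target = random.choice(variables)
            # Assign random nonzero values to the remaining N-1 variables.
            subs = {v: sp.Integer(random.randint(2, 9)) for v in variables if v != target}
            s1 = sp.solveset(eq1.subs(subs), target)
            s2 = sp.solveset(eq2.subs(subs), target)
            if not (isinstance(s1, sp.FiniteSet) and isinstance(s2, sp.FiniteSet)):
                continue  # skip trials without closed-form finite solution sets
            vals1 = sorted((complex(v) for v in s1), key=lambda z: (z.real, z.imag))
            vals2 = sorted((complex(v) for v in s2), key=lambda z: (z.real, z.imag))
            if len(vals1) != len(vals2) or any(abs(a - b) > tol for a, b in zip(vals1, vals2)):
                return False
        return True

    # Toy example: the same kinematics relation written in two equivalent forms.
    u, v, a, t = sp.symbols("u v a t")
    eq_a = sp.Eq(v, u + a * t)
    eq_b = sp.Eq(a, (v - u) / t)
    print(solution_sets_equivalent(eq_a, eq_b, [u, v, a, t]))  # prints True

Trials whose solution sets are not closed-form finite sets are skipped, so the check reports inequivalence only when two finite solution sets demonstrably disagree.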


An example of our scoring pipeline. A) Formula matching aligns student and reference formulas. B) Back-propagation grading highlights correctly credited formulas along the dependency DAG. C) The final score is computed as the sum of credited points, yielding 90/100 in this case.

Scoring Pipeline

Formula Extraction and Normalization.
• Given a student's solution, all mathematical expressions are first extracted and rewritten into our dataset's standardized canonical form, discarding invalid expressions such as syntactically malformed formulas or irrelevant numerical fragments.
Formula Matching.
• Each standardized student formula is compared against the reference DAG of the solution using the rule-based equivalence matching described above, which outputs the set of matched formulas in the DAG.
Scoring.
• Finally, we score the student solution by applying the Ancestor Closure Scoring Policy to the DAG and the set of matched formulas (a minimal end-to-end sketch follows).
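
Putting the stages together, a minimal glue sketch is shown below. It reuses ancestor_closure_score and solution_sets_equivalent from the earlier sketches and takes already-extracted, normalized student equations as input; all names here are illustrative placeholders rather than the project's actual interfaces.

    def match_formulas(student_eqs, reference_formulas, variables):
        """Return ids of reference DAG formulas matched by at least one student equation."""
        return {
            node_id
            for node_id, ref_eq in reference_formulas.items()
            if any(solution_sets_equivalent(stu_eq, ref_eq, variables) for stu_eq in student_eqs)
        }

    def score_solution(student_eqs, reference_formulas, variables, dag_edges, points):
        """Normalized student formulas -> rule-based matching -> ancestor closure score."""
        matched = match_formulas(student_eqs, reference_formulas, variables)
        return ancestor_closure_score(dag_edges, points, matched)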

PRISM-Physics Dataset

Data Collection and Preprocessing


Overview of the Three-Step Rewriting Pipeline.

Three-Step Rewriting Pipeline. To guarantee both internal consistency and external evaluability, every sample in the dataset is processed through a structured three-stage rewriting pipeline. Each stage focuses on eliminating ambiguity and enforcing standardization, while preserving the fidelity of the original content.
Verification and Quality Control. At each stage, an LLM-based module verifies formatting, clarity, and dependency rules; failures trigger corrective feedback and regeneration (sketched after this list).
Fine-Grained Enhancements. Beyond the main pipeline, we applied several refinements: enforcing significant-figure rules, explicitly defining all constants and variables, and unifying answer formatting.
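
One way to picture the verify-and-regenerate loop is the sketch below, where rewrite_stage and verify_stage stand in for the LLM-backed rewriting and verification modules; both names and the retry limit are illustrative assumptions, not the project's actual interfaces.

    def run_rewriting_pipeline(sample, stages, rewrite_stage, verify_stage, max_retries=3):
        """Illustrative rewrite-then-verify loop: failed checks trigger regeneration.

        rewrite_stage(sample, stage, feedback) -> rewritten sample
        verify_stage(sample, stage) -> (passed, feedback)
        """
        for stage in stages:
            feedback = None
            for _ in range(max_retries):
                sample = rewrite_stage(sample, stage, feedback)
                passed, feedback = verify_stage(sample, stage)
                if passed:
                    break  # move on to the next stage
            else:
                raise ValueError(f"Sample failed verification at stage {stage!r}")
        return sample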

You can download the dataset from Hugging Face Datasets.

Dataset Statistics

Difficulty Annotation.
Each problem is assigned a composite difficulty label that integrates LLM-based ratings of conceptual depth and computational burden with an entropy-based DAG complexity measure. The three components are combined into a unified score, which is mapped to Easy, Medium, or Hard, capturing both the content difficulty and the reasoning complexity of the solution (one possible form of the entropy measure is sketched after this list).
Physics Domain Categorization.
Each problem is categorized into one of seven key physics domains:
(1) Mechanics, (2) Electromagnetism, (3) Optics, (4) Atomic, Nuclear, and Particle Physics,
(5) Thermodynamics and Statistical Physics, (6) Quantum Mechanics, (7) Solid State Physics and Miscellaneous Topics.
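
The entropy-based DAG complexity component admits several reasonable definitions; the sketch below uses the Shannon entropy of the out-degree distribution purely as an illustration, and the measure actually used in PRISM-Physics may be defined differently.

    import math
    from collections import Counter

    def dag_degree_entropy(edges):
        """Shannon entropy of a solution DAG's out-degree distribution (illustrative only)."""
        out_degree = Counter(src for src, _ in edges)
        nodes = {n for edge in edges for n in edge}
        counts = Counter(out_degree.get(n, 0) for n in nodes)
        total = len(nodes)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Example: a small chain-and-merge DAG; more heterogeneous structures yield higher entropy.
    print(round(dag_degree_entropy([("f1", "f3"), ("f2", "f3"), ("f3", "f4")]), 3))  # ~0.811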

Experiment Results

Main Results

Step-level vs. Answer-level Evaluation.

Table 1 reports step-level and final-answer accuracy across difficulty levels. As problem difficulty rises, performance declines and response time increases, reflecting LLMs' sensitivity to longer reasoning chains, more demanding modeling, and higher computational effort. Final-answer and step-level evaluations diverge sharply with problem difficulty: final-answer accuracy drops by over 40% from easy to medium problems and falls below 10% on hard problems, while step-level scoring reveals that models still earn partial credit by applying key principles or deriving valid intermediate equations before failing at later stages.
These results demonstrate that final-answer scoring alone severely underestimates reasoning ability, whereas step-level evaluation provides a more faithful measure of process competence on complex tasks. Moreover, step-level signals open promising avenues for training and data curation: if evaluation relies solely on final answers, rewards on difficult problems become extremely sparse, while step-level scoring provides rich intermediate reward signals, offering valuable guidance for reinforcement learning and a principled basis for constructing higher-quality training data.

We analyze LLM performance across physics domains and difficulty levels, as shown in the figure. Models exhibit varying accuracy across domains, with the highest performance observed in Thermodynamics and Statistical Physics and the lowest in Quantum Mechanics. Step-level evaluation further exposes weaknesses in reasoning coherence, and accuracy consistently drops from Easy to Hard problems across all domains.

Modality and Reasoning-Level Comparisons

Text Models vs. Multimodal Models

The effect of multimodal input varies across model families. In general, adding images provides stronger gains at the step level than at the final-answer level, highlighting its role in supporting intermediate reasoning. However, for smaller or weaker models, multimodal input can even be detrimental, as diagrams in physics problems often serve a presentational rather than informational role, with the critical content already conveyed in text.

Error Analysis

We perform error analysis on the first incorrect step detected in each solution as shown in Figure 5, using a unified taxonomy that integrates process-level physics reasoning errors with formula-level derivation errors. The classification covers seven categories: (1) Diagram Analysis Error (DAE), (2) Physics Theorem Application Error (PTAE), (3) Modeling and Process Understanding Error (MPUE), (4) Condition or Assumption Error (CAE), (5) Variable Relationship Error (VRE), (6) Derivation and Computation Error (DCE), and (7) Unit Dimension Error (UDE).
The dominant error types across models are Condition/Assumption Errors (CAE), which arise when models set up inconsistent or incorrect physical assumptions; Derivation & Computation Errors (DCE), which occur when models make mistakes in algebraic manipulation or calculation; and Modeling & Process Understanding Errors (MPUE), which reflect failures in mapping the problem into the correct physical model or reasoning process. This indicates that LLMs often fail both in establishing consistent physical conditions and in executing algebraic reasoning.

Evaluation Framework Analysis

Annotation Setup.

We randomly sampled 70 problems (10 from each domain) along with their corresponding DeepSeek-V3 (text-only) solutions. Each problem–solution pair was independently evaluated by two human experts to reduce variance, with annotators including an IPhO Gold Medalist and a Physics PhD from a top-tier program. In cases where the two experts' scores differed substantially, a third annotator was invited to adjudicate and determine the final score.

BibTeX


    @article{zhao2025prismphysics,
      title={PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning},
      author={Wanjia Zhao and Qinwei Ma and Jingzhe Shi and Shirley Wu and Jiaqi Han and Yijia Xiao and Si-Yuan Chen and Xiao Luo and Ludwig Schmidt and James Zou},
      journal={arXiv preprint arXiv:2510.03185},
      year={2025}
    }