Hi-Phy: A Benchmark for Hierarchical Physical Reasoning

by Cheng Xue, et al.

Reasoning about the behaviour of physical objects is a key capability of agents operating in physical worlds. Humans are highly proficient at physical reasoning, while it remains a major challenge for AI. To facilitate research addressing this problem, several benchmarks have been proposed recently. However, these benchmarks do not enable us to measure an agent's granular physical reasoning capabilities when solving a complex reasoning task. In this paper, we propose a new benchmark for physical reasoning that allows us to test individual physical reasoning capabilities. Inspired by how humans acquire these capabilities, we propose a general hierarchy of physical reasoning capabilities of increasing complexity. Our benchmark tests capabilities according to this hierarchy through generated physical reasoning tasks in the video game Angry Birds. This benchmark enables us to conduct a comprehensive agent evaluation by measuring an agent's granular physical reasoning capabilities. We conduct an evaluation with human players, learning agents, and heuristic agents and determine their capabilities. Our evaluation shows that learning agents, despite good local generalization ability, still struggle to learn the underlying physical reasoning capabilities and perform worse than current state-of-the-art heuristic agents and humans. We believe that this benchmark will encourage researchers to develop intelligent agents with advanced, human-like physical reasoning capabilities. URL: https://github.com/Cheng-Xue/Hi-Phy


Phy-Q: A Benchmark for Physical Reasoning

Humans are well-versed in reasoning about the behaviors of physical obje...

NovPhy: A Testbed for Physical Reasoning in Open-world Environments

Due to the emergence of AI systems that interact with the physical envir...

Forward Prediction for Physical Reasoning

Physical reasoning requires forward prediction: the ability to forecast ...

ChemAlgebra: Algebraic Reasoning on Chemical Reactions

While showing impressive performance on various kinds of learning tasks,...

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

With the continuous evolution and refinement of LLMs, they are endowed w...

PHYRE: A New Benchmark for Physical Reasoning

Understanding and reasoning about physics is an important ability of int...

Physics-Based Task Generation Through Causal Sequence of Physical Interactions

Performing tasks in a physical environment is a crucial yet challenging ...