PIEFACE

Personalized Interactive Environment For Automata Combination Exploration — a human-feedback and evaluation platform for reasoning agents. Annotate traces, train reward models from preferences, and switch between agents mid-session.

What It Does

PIEFACE is a controllable, verifiable sandbox for training and evaluating reasoning agents. It pairs a ground-truth verifier with a human-in-the-loop annotation interface, so every agent action can be checked, scored, and fed back into a reward model — live.

While the current demo uses a symbolic reasoning domain from theoretical CS, the platform generalizes to any task with discrete, maskable actions and a programmatic verifier.

Core Capabilities

Annotation Pipeline

Step through agent traces interactively. Accept, deny, or override each action with a single click. Feedback is logged per step for a fine-grained reward signal.
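As a rough sketch of what a per-step feedback record might look like (the names `StepFeedback`, `Verdict`, and `log_feedback` are illustrative, not PIEFACE's actual API), each annotation can be stored as one JSON line:

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional

class Verdict(Enum):
    ACCEPT = "accept"
    DENY = "deny"
    OVERRIDE = "override"

@dataclass
class StepFeedback:
    trace_id: str
    step: int
    proposed_action: str
    verdict: Verdict
    override_action: Optional[str] = None  # set only when verdict is OVERRIDE

def log_feedback(fb: StepFeedback, path: str) -> None:
    """Append one JSON line per annotated step."""
    record = asdict(fb)
    record["verdict"] = fb.verdict.value  # serialize the enum as its string value
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Append-only JSON lines keep per-step annotations cheap to write during a live session and easy to stream into training later.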

Reward Model Training

Human preferences are stored and used to train a personalized reward model. Track RLHF vs baseline success rates in real time via the built-in metrics panel.
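One common way to train a reward model from pairwise preferences is the Bradley-Terry loss, minimizing `-log sigma(r(chosen) - r(rejected))`. The sketch below uses a linear reward over feature vectors with plain gradient descent; this is an illustration of the technique, not necessarily PIEFACE's exact model or optimizer:

```python
import numpy as np

def preference_loss_and_grad(w, chosen, rejected):
    """Bradley-Terry loss for a linear reward r(x) = w.x over preference pairs."""
    diff = chosen @ w - rejected @ w           # (N,) reward margins
    p = 1.0 / (1.0 + np.exp(-diff))            # prob. the chosen step wins
    loss = -np.mean(np.log(p + 1e-12))
    # d/dw of -log sigma(diff) is -(1 - p) * (chosen - rejected)
    grad = -((1.0 - p)[:, None] * (chosen - rejected)).mean(axis=0)
    return loss, grad

def train_reward_model(chosen, rejected, lr=0.5, steps=200):
    """Fit reward weights so accepted actions score above rejected ones."""
    w = np.zeros(chosen.shape[1])
    for _ in range(steps):
        _, grad = preference_loss_and_grad(w, chosen, rejected)
        w -= lr * grad
    return w
```

Each accept/deny/override decision from the annotation pipeline yields one (chosen, rejected) pair, so the model can be refit incrementally as feedback accumulates.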

Trace Replay

Replay any saved trace step-by-step, inspect intermediate states, and compare how different policies handle the same scenario.
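Replay of a saved trace reduces to re-applying the logged actions through a deterministic transition function; every intermediate state is then recoverable for inspection. A minimal sketch (function names are ours, not the platform's):

```python
from typing import Callable, Iterable, List

def replay(initial_state, actions: Iterable, step_fn: Callable) -> List:
    """Re-derive every intermediate state by re-applying logged actions.

    step_fn(state, action) -> next_state must be deterministic so the
    replayed states match the original run exactly.
    """
    states = [initial_state]
    for action in actions:
        states.append(step_fn(states[-1], action))
    return states
```

Because only the action log is stored, two policies can be compared on the same scenario by replaying each policy's actions from the same initial state.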

Multi-Agent Switching

Swap between trained agents or take over manually at any point mid-trace. Run head-to-head comparisons without restarting the environment.

How It Works

  1. Configure an environment — select a task instance (e.g., a source and target gadget type).
  2. Run or replay — let an agent solve the task, replay an existing trace, or drive manually.
  3. Annotate — accept or deny each proposed action. Your preferences are recorded per step.
  4. Train — the reward model updates from your feedback. Compare RLHF-tuned policy performance against the baseline.
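The four steps above can be sketched as a single annotate-and-train loop. All of the names here (`env`, `policy`, `annotate`, `update_reward_model`) are illustrative placeholders for whatever the platform actually wires together:

```python
def run_session(env, policy, annotate, update_reward_model):
    """One configure/run/annotate/train loop (illustrative shape, not a real API).

    annotate(state, action) returns the action the human accepts: the
    proposal itself, or an override. Each disagreement becomes a
    preference pair for the reward model.
    """
    prefs = []
    state = env.reset()
    while not env.done(state):
        proposed = policy(state)
        final = annotate(state, proposed)
        if final != proposed:  # deny/override => preference signal
            prefs.append((state, final, proposed))  # (context, chosen, rejected)
        state = env.step(state, final)
    update_reward_model(prefs)
    return state, prefs
```

Swapping `policy` mid-loop (to another agent, or to a human-driven one) is what enables multi-agent switching and head-to-head comparison without restarting the environment.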

Background: Gadget Reductions (current demo domain)

The demo environment is built around gadget reductions from computational complexity theory. Gadgets are modular components used in hardness reductions, playing a role analogous to logic gates: they encode constraints inside puzzles such as Sokoban or PushPush.

The agent's task is to construct a simulation of one gadget type from instances of another, establishing a hardness reduction between them. See Demaine, Hendrickson & Lynch (2020) for the theory.
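In this framework a gadget is a finite object: a set of states, a set of locations, and a traversal relation mapping (state, entry location) to (exit location, new state). A minimal encoding, using our own illustrative numbering for a 2-toggle (two antiparallel tunnels whose passable directions both flip on every traversal):

```python
# (state, entry) -> (exit, new_state); missing keys mean the move is blocked.
# Illustrative 2-toggle: in state 1 both tunnels pass "in" -> "out",
# in state 2 both pass "out" -> "in", and any traversal flips the state.
TWO_TOGGLE = {
    (1, "A_in"): ("A_out", 2),
    (1, "B_in"): ("B_out", 2),
    (2, "A_out"): ("A_in", 1),
    (2, "B_out"): ("B_in", 1),
}

def traverse(gadget, state, entry):
    """Return (exit, new_state) if the traversal is legal, else None."""
    return gadget.get((state, entry))
```

Simulating one gadget type with another then amounts to wiring instances of the simulating gadget so that the composite's legal traversals match the target's traversal relation, which is exactly what the verifier can check programmatically.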

References

Demaine, E. D., Hendrickson, D. H., & Lynch, J. (2020). Toward a General Complexity Theory of Motion Planning: Characterizing Which Gadgets Make Games Hard. ITCS 2020.

Contact

Built at MIT CSAIL. Questions or collaboration inquiries: LinkedIn · zacburton [at] alum [dot] mit [dot] edu