Robotics Evals — Push code → evaluate → understand failures → improve

quarq eval run — policy_v2.pt

$ quarq eval run --policy policy_v2.pt --n 1000

Running 1000 simulations across 8 workers...

✓ 848 passed

✗ 152 failed — categorized automatically

Failure breakdown:

42 arm slip on grasp → wrist_torque < threshold

67 motor lag timeout → latency spike @ t=1.2s

43 camera glare / sensor loss → depth_conf < 0.4

→ Dashboard: quarq.dev/runs/a3f7c2

The problem

The status quo is manual, slow, broken.

Robotics teams run thousands of simulations per day with almost no infrastructure to make sense of what’s happening.

PAIN 01

Death by video review

150 failures. 150 videos. Engineers spend hours replaying simulations just to figure out why a task failed.

PAIN 02

No observability stack

Software teams have Datadog. LLM teams have Langfuse. Robotics has yet to have its moment.

PAIN 03

Manual testing everywhere

Policy updates are validated by manual scripts. Regressions slip through unnoticed until something breaks.

The solution

Automated evals. Instant diagnostics.

Two tools that work together — local testing for fast iteration, cloud analytics for scale.

01 / SDK · Available

Local Testing Toolkit

Drop into your existing code. Define success criteria, configure failure scenarios, and run evals instantly on your own machine before you push anything.

→ Plug-and-play: works with your existing policy and simulator
→ Scenarios: glare, motor lag, slippery surfaces, sensor noise
→ Run locally in seconds, with no cloud round-trip
→ Structured output for downstream analytics
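
To make "define success criteria, configure failure scenarios, run locally" concrete, here is a minimal self-contained sketch. It does not use the actual SDK, whose API is not shown on this page: `Scenario`, `run_episode`, `succeeded`, `evaluate`, the toy rollout, and the thresholds are all hypothetical stand-ins for your own policy and simulator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical perturbation applied to one simulated episode."""
    name: str
    latency_s: float = 0.0    # extra motor lag, in seconds
    depth_noise: float = 0.0  # depth confidence lost, e.g. to glare


def run_episode(scenario: Scenario, rng: random.Random) -> dict:
    """Toy stand-in for a simulator rollout; returns structured metrics."""
    depth_conf = max(0.0, rng.uniform(0.6, 1.0) - scenario.depth_noise)
    duration_s = rng.uniform(0.5, 1.0) + scenario.latency_s
    return {"scenario": scenario.name, "depth_conf": depth_conf, "duration_s": duration_s}


def succeeded(metrics: dict) -> bool:
    """Illustrative success criterion: finish in time with usable depth data."""
    return metrics["duration_s"] <= 1.2 and metrics["depth_conf"] >= 0.4


def evaluate(scenarios: list[Scenario], n_per_scenario: int, seed: int = 0) -> list[dict]:
    """Run every scenario n times and tag each result with pass/fail."""
    rng = random.Random(seed)  # fixed seed keeps local runs reproducible
    results = []
    for scenario in scenarios:
        for _ in range(n_per_scenario):
            metrics = run_episode(scenario, rng)
            metrics["passed"] = succeeded(metrics)
            results.append(metrics)
    return results


if __name__ == "__main__":
    scenarios = [
        Scenario("nominal"),
        Scenario("motor_lag", latency_s=0.5),
        Scenario("glare", depth_noise=0.5),
    ]
    results = evaluate(scenarios, n_per_scenario=100)
    for s in scenarios:
        rows = [r for r in results if r["scenario"] == s.name]
        print(s.name, sum(r["passed"] for r in rows), "/", len(rows))
```

Each result is a plain dictionary, so the same records could be written out as JSON for downstream analytics.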

02 / CLOUD · Early access

Analytics Dashboard & CI/CD

Every GitHub push triggers automated evaluation. Instead of video files, you get categorized failure groups with one-click replay.

→ GitHub Actions integration: zero new infra
→ Auto-categorized failures by root cause
→ One-click playlist of exact failure moments
→ Track physical metrics and regressions over time
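
One way to picture regression tracking in CI is a small gate that compares a candidate run's pass rate against a baseline. The sketch below is an assumption about how such a check might work, not the product's implementation: `pass_rate`, `check_regression`, the `passed` field, and the 1% tolerance are hypothetical.

```python
def pass_rate(results: list[dict]) -> float:
    """Fraction of episodes marked as passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


def check_regression(
    baseline: list[dict], candidate: list[dict], tolerance: float = 0.01
) -> tuple[bool, str]:
    """Return (ok, message); ok is False when the pass rate drops by more than `tolerance`."""
    base, cand = pass_rate(baseline), pass_rate(candidate)
    ok = cand >= base - tolerance
    verdict = "OK" if ok else "REGRESSION"
    return ok, f"{verdict}: pass rate {base:.1%} -> {cand:.1%}"


if __name__ == "__main__":
    # Illustrative in-memory data; a real CI step would load results from the eval run.
    baseline = [{"passed": True}] * 848 + [{"passed": False}] * 152
    candidate = [{"passed": True}] * 820 + [{"passed": False}] * 180
    ok, message = check_regression(baseline, candidate)
    print(message)
    # A CI job could exit with a non-zero status when `ok` is False to fail the build.
```

Comparing against a stored baseline rather than a fixed target means the gate tightens as the policy improves.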

Compatible with

Isaac Sim · MuJoCo · Gazebo · GitHub Actions · Python SDK

How it works

From code change to root cause in minutes.

Connect your simulator

Install the SDK, point it to your policy and simulator, and define success criteria.

Run evaluations automatically

Every commit triggers large-scale evaluation through the GitHub Actions integration.

Understand failures instantly

Get grouped failure categories, replayable examples, and trend analysis instead of raw logs and videos.
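
As a rough illustration of grouping failures by root cause, the sketch below applies ordered threshold rules to per-episode metrics, using the signals from the sample run at the top of the page. It is a simplified assumption, not the product's categorizer: only the 0.4 depth-confidence cutoff appears on this page, while the field names, the torque and latency cutoffs, and the `categorize` and `failure_breakdown` helpers are hypothetical.

```python
from collections import Counter

# Ordered (label, predicate) rules; the first matching rule wins.
# The torque and latency cutoffs are made-up values for illustration.
RULES = [
    ("camera glare / sensor loss", lambda m: m["depth_conf"] < 0.4),
    ("motor lag timeout", lambda m: m["max_latency_s"] > 0.2),
    ("arm slip on grasp", lambda m: m["wrist_torque"] < 1.5),
]


def categorize(metrics: dict) -> str:
    """Return the label of the first rule that matches a failed episode."""
    for label, matches in RULES:
        if matches(metrics):
            return label
    return "uncategorized"


def failure_breakdown(failed_episodes: list[dict]) -> Counter:
    """Count failed episodes per category."""
    return Counter(categorize(m) for m in failed_episodes)


if __name__ == "__main__":
    failures = [
        {"depth_conf": 0.9, "max_latency_s": 0.05, "wrist_torque": 0.8},
        {"depth_conf": 0.3, "max_latency_s": 0.05, "wrist_torque": 2.0},
        {"depth_conf": 0.9, "max_latency_s": 0.60, "wrist_torque": 2.0},
        {"depth_conf": 0.9, "max_latency_s": 0.05, "wrist_torque": 2.0},
    ]
    for label, count in failure_breakdown(failures).most_common():
        print(count, label)
```

An explicit "uncategorized" bucket keeps episodes that match no rule visible rather than silently dropping them.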

Stop watching robot failures. Start fixing them.