
Systematic threat modeling for LLM-controlled robotics.

Adversaries can influence or control multiple components across the perception–planning–actuation stack. Below is a concise threat model that distinguishes the robotics setting from conventional chatbot jailbreaks and outlines attacker inputs, attacker capabilities, and defense surfaces.

Why robotics jailbreaks are different
  • Physical actions matter: A prompt that seems “safe” in chat can cause hazardous motion when executed by an embodied robot.
  • Multimodal inputs: Beyond text, robots process vision (images/video) and audio (voice→text). Attacks can exploit these additional channels.
  • Context dependence: Safety depends on state and surroundings. Accelerating is fine on an empty path but unsafe with an obstacle ahead (see the sketch after this list).
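
Context dependence in particular lends itself to a concrete check. Below is a minimal sketch of a state-dependent safety test, assuming a hypothetical WorldState with an obstacle-distance field and a simple stopping-distance rule; the names and constants are illustrative, not taken from any robot API.

```python
from dataclasses import dataclass

# Hypothetical world state; the fields are illustrative, not from a specific robot stack.
@dataclass
class WorldState:
    speed_mps: float        # current forward speed (m/s)
    obstacle_dist_m: float  # distance to the nearest obstacle along the path (m)

def is_accelerate_safe(state: WorldState, target_speed_mps: float) -> bool:
    """The same command ("accelerate") is safe or unsafe depending on context:
    require that the robot can stop before the nearest obstacle at the new speed."""
    max_decel = 2.0   # assumed braking capability (m/s^2)
    margin_m = 0.5    # clearance kept at all times (m)
    stopping_dist_m = target_speed_mps ** 2 / (2 * max_decel)
    return stopping_dist_m + margin_m < state.obstacle_dist_m

# Identical instruction, different verdicts depending on the world state.
print(is_accelerate_safe(WorldState(speed_mps=0.5, obstacle_dist_m=10.0), 2.0))  # True: open path
print(is_accelerate_safe(WorldState(speed_mps=0.5, obstacle_dist_m=0.8), 2.0))   # False: obstacle ahead
```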

Attacker inputs
  • World perturbations: Modify the environment or sensors (camera/LiDAR) to mislead perception.
  • Adversarial instructions: Craft text or voice prompts (voice→text) to elicit harmful plans (both channels appear in the sketch below).
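
A minimal sketch of the two attacker-controlled input channels feeding a planner, assuming a hypothetical interface; Channel, ThreatEntry, and plan are invented names, not part of any existing framework.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Channel(Enum):
    WORLD = auto()        # physical environment / sensors (camera, LiDAR)
    INSTRUCTION = auto()  # text or transcribed voice prompt

@dataclass
class ThreatEntry:
    channel: Channel
    example: str

# One illustrative entry per channel.
THREATS = [
    ThreatEntry(Channel.WORLD,
                "Adversarial patch placed in the scene so perception mislabels an obstacle"),
    ThreatEntry(Channel.INSTRUCTION,
                "Voice command crafted to elicit a plan that enters a restricted area"),
]

def plan(scene_caption: str, instruction: str) -> str:
    """Stand-in for the LLM/VLM planning call: both arguments are
    attacker-influenceable, one per channel above."""
    return f"plan for {instruction!r} given scene {scene_caption!r}"
```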

Attacker capabilities
  • Black-box: Issue voice/text commands and observe behavior; no access to internal model outputs.
  • Gray-box: Read LLM/VLM outputs (e.g., reasoning traces) but cannot modify the model.
  • White-box: Access to architecture and weights; can tailor gradient-based or prompt-internal attacks (see the sketch below).
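
These access levels nest, each one strictly extending the capabilities of the level below it. A minimal sketch of that hierarchy, using hypothetical interface names rather than any real attack tooling:

```python
from abc import ABC, abstractmethod

class BlackBoxAccess(ABC):
    """Issue commands and observe external behavior only."""
    @abstractmethod
    def send_command(self, text: str) -> None: ...
    @abstractmethod
    def observe_behavior(self) -> str: ...

class GrayBoxAccess(BlackBoxAccess):
    """Additionally read LLM/VLM outputs (e.g., reasoning traces); cannot modify the model."""
    @abstractmethod
    def read_model_output(self) -> str: ...

class WhiteBoxAccess(GrayBoxAccess):
    """Full access to architecture and weights, enabling gradient-based attacks."""
    @abstractmethod
    def get_weights(self) -> dict: ...
```

Modeling the levels as nested interfaces makes explicit that any attack feasible at a weaker access level remains feasible at every stronger one.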

Defense surfaces
  • Input: Sanitize and rewrite user instructions; filter or verify sensor inputs before planning.
  • Model: Use robust system prompts; fine-tune for safety and constraint adherence (e.g., RLHF, DPO).
  • Post-filtering: Validate and constrain plans/actions with safety checks before execution; provide enforceable guarantees (a minimal example follows this list).
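
As an illustration of the post-filtering surface, here is a minimal sketch of a plan validator that enforces two hard constraints (a speed cap and a set of forbidden zones) before execution; the Action fields and constraint values are invented for the example, not drawn from any particular system.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str           # e.g. "move_forward"
    speed_mps: float    # commanded speed (m/s)
    target_zone: str    # symbolic label of where the action ends up

# Illustrative constraints; a real system would derive these from a safety specification.
MAX_SPEED_MPS = 1.0
FORBIDDEN_ZONES = {"loading_dock", "stairwell"}

def post_filter(plan: list[Action]) -> list[Action]:
    """Validate and constrain a plan after the LLM produces it but before execution."""
    safe_plan: list[Action] = []
    for action in plan:
        if action.target_zone in FORBIDDEN_ZONES:
            break  # truncate the plan before it enters a forbidden zone
        # Clamp recoverable violations instead of rejecting the whole plan.
        safe_plan.append(Action(action.name, min(action.speed_mps, MAX_SPEED_MPS), action.target_zone))
    return safe_plan
```

Because the filter depends only on the constraint set and the proposed actions, whatever guarantee it enforces holds regardless of how the upstream model was prompted or perturbed.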