
Systematic threat modeling for LLM-controlled robotics.

Adversaries can influence or control multiple components across the perception–planning–actuation stack. Below is a concise threat model that distinguishes the robotics setting from conventional chatbot jailbreaks and outlines attacker inputs, attacker capabilities, and defense surfaces.

Why robotics jailbreaks are different
  • Physical actions matter: A prompt that seems “safe” in chat can cause hazardous motion when executed by an embodied robot.
  • Multimodal inputs: Beyond text, robots process vision (images/video) and audio (voice→text). Attacks can exploit these additional channels.
  • Context dependence: Safety depends on state and surroundings. Accelerating is fine on an empty path but unsafe with an obstacle ahead (see the sketch after this list).
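
Context dependence in particular lends itself to a concrete check. Below is a minimal sketch of a state-dependent safety test, assuming a hypothetical WorldState with an obstacle-distance field and a simple stopping-distance rule; the names and constants are illustrative, not taken from any robot API.

```python
from dataclasses import dataclass

# Hypothetical world state; the fields are illustrative, not from a specific robot stack.
@dataclass
class WorldState:
    speed_mps: float        # current forward speed (m/s)
    obstacle_dist_m: float  # distance to the nearest obstacle along the path (m)

def is_accelerate_safe(state: WorldState, target_speed_mps: float) -> bool:
    """The same command ("accelerate") is safe or unsafe depending on context:
    require that the robot can stop before the nearest obstacle at the new speed."""
    max_decel = 2.0   # assumed braking capability (m/s^2)
    margin_m = 0.5    # clearance kept at all times (m)
    stopping_dist_m = target_speed_mps ** 2 / (2 * max_decel)
    return stopping_dist_m + margin_m < state.obstacle_dist_m

# Identical instruction, different verdicts depending on the world state.
print(is_accelerate_safe(WorldState(speed_mps=0.5, obstacle_dist_m=10.0), 2.0))  # True: open path
print(is_accelerate_safe(WorldState(speed_mps=0.5, obstacle_dist_m=0.8), 2.0))   # False: obstacle ahead
```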

Attacker inputs
  • World perturbations: Modify the environment or sensors (camera/LiDAR) to mislead perception.
  • Adversarial instructions: Craft text or voice prompts (voice→text) to elicit harmful plans (both channels appear in the sketch below).
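
A minimal sketch of the two attacker-controlled input channels feeding a planner, assuming a hypothetical interface; Channel, ThreatEntry, and plan are invented names, not part of any existing framework.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Channel(Enum):
    WORLD = auto()        # physical environment / sensors (camera, LiDAR)
    INSTRUCTION = auto()  # text or transcribed voice prompt

@dataclass
class ThreatEntry:
    channel: Channel
    example: str

# One illustrative entry per channel.
THREATS = [
    ThreatEntry(Channel.WORLD,
                "Adversarial patch placed in the scene so perception mislabels an obstacle"),
    ThreatEntry(Channel.INSTRUCTION,
                "Voice command crafted to elicit a plan that enters a restricted area"),
]

def plan(scene_caption: str, instruction: str) -> str:
    """Stand-in for the LLM/VLM planning call: both arguments are
    attacker-influenceable, one per channel above."""
    return f"plan for {instruction!r} given scene {scene_caption!r}"
```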

Attacker capabilities
  • Black-box: Issue voice/text commands and observe behavior; no access to internal model outputs.
  • Gray-box: Read LLM/VLM outputs (e.g., reasoning traces) but cannot modify the model.
  • White-box: Access to architecture and weights; can tailor gradient-based or prompt-internal attacks (see the sketch below).
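
These access levels nest, each one strictly extending the capabilities of the level below it. A minimal sketch of that hierarchy, using hypothetical interface names rather than any real attack tooling:

```python
from abc import ABC, abstractmethod

class BlackBoxAccess(ABC):
    """Issue commands and observe external behavior only."""
    @abstractmethod
    def send_command(self, text: str) -> None: ...
    @abstractmethod
    def observe_behavior(self) -> str: ...

class GrayBoxAccess(BlackBoxAccess):
    """Additionally read LLM/VLM outputs (e.g., reasoning traces); cannot modify the model."""
    @abstractmethod
    def read_model_output(self) -> str: ...

class WhiteBoxAccess(GrayBoxAccess):
    """Full access to architecture and weights, enabling gradient-based attacks."""
    @abstractmethod
    def get_weights(self) -> dict: ...
```

Modeling the levels as nested interfaces makes explicit that any attack feasible at a weaker access level remains feasible at every stronger one.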

Defense surfaces
  • Input: Sanitize and rewrite user instructions; filter or verify sensor inputs before planning.
  • Model: Use robust system prompts; fine-tune for safety and constraint adherence (e.g., RLHF, DPO).
  • Post-filtering: Validate and constrain plans/actions with safety checks before execution; provide enforceable guarantees (a minimal example follows this list).
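
As an illustration of the post-filtering surface, here is a minimal sketch of a plan validator that enforces two hard constraints (a speed cap and a set of forbidden zones) before execution; the Action fields and constraint values are invented for the example, not drawn from any particular system.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str           # e.g. "move_forward"
    speed_mps: float    # commanded speed (m/s)
    target_zone: str    # symbolic label of where the action ends up

# Illustrative constraints; a real system would derive these from a safety specification.
MAX_SPEED_MPS = 1.0
FORBIDDEN_ZONES = {"loading_dock", "stairwell"}

def post_filter(plan: list[Action]) -> list[Action]:
    """Validate and constrain a plan after the LLM produces it but before execution."""
    safe_plan: list[Action] = []
    for action in plan:
        if action.target_zone in FORBIDDEN_ZONES:
            break  # truncate the plan before it enters a forbidden zone
        # Clamp recoverable violations instead of rejecting the whole plan.
        safe_plan.append(Action(action.name, min(action.speed_mps, MAX_SPEED_MPS), action.target_zone))
    return safe_plan
```

Because the filter depends only on the constraint set and the proposed actions, whatever guarantee it enforces holds regardless of how the upstream model was prompted or perturbed.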