Reflector: Internalizing Step-wise Reflection against Indirect Jailbreaks

Published in ICML 2026

Reflector is a framework that equips large language models with internalized step-wise reflection capabilities to defend against indirect jailbreak attacks. By embedding reflective reasoning into the model’s generation process, Reflector enables proactive detection and mitigation of adversarial prompts without relying on external safety filters.

Recommended citation: Ma, J., Zhang, J., Li, X., Zou, B., Lu, C., & Yang, C. (2026). "REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak." ICML 2026.


Li Xiangtian