Reflector: Internalizing Step-wise Reflection against Indirect Jailbreaks
Published in ICML 2026
Reflector is a framework that equips large language models with internalized step-wise reflection capabilities to defend against indirect jailbreak attacks. By embedding reflective reasoning into the model’s generation process, Reflector enables proactive detection and mitigation of adversarial prompts without relying on external safety filters.
Recommended citation: Ma, J., Zhang, J., Li, X., Zou, B., Lu, C., & Yang, C. (2026). "Reflector: Internalizing Step-wise Reflection against Indirect Jailbreaks." ICML 2026.
