Bypassing Gemma and Qwen safety with raw strings

(teendifferent.substack.com)

2 points | by teendifferent 2 hours ago

1 comments

  • teendifferent 2 hours ago
    OP here. I spent the weekend red-teaming small-scale open weights models (Qwen2.5-1.5B, Qwen3-1.7B, Gemma-3-1b-it, and SmolLM2-1.7B).

    I found a consistent vulnerability across all of them: Safety alignment relies almost entirely on the presence of the chat template.

    When I stripped the <|im_start|> / instruction tokens and passed raw strings:

    Gemma-3 refusal rates dropped from 100% → 60%.

    Qwen3 refusal rates dropped from 80% → 40%.

    SmolLM2 showed 0% refusal (pure obedience).

    Qualitative failures were stark: models that previously refused to generate explosives tutorials or explicit fiction immediately complied when the "Assistant" persona wasn't triggered by the template.

    It seems we are treating client-side string formatting as a load-bearing safety wall. Full logs, the apply_chat_template ablation code, and heatmaps are in the post.

    Read the full analysis: https://teendifferent.substack.com/p/apply_chat_template-is-...