SGLang CVE-2026-5760: A Malicious AI Model File Is Enough to Get RCE on Your Inference Server

Introduction

A critical unpatched vulnerability in SGLang — a popular high-performance AI serving framework — lets an attacker execute arbitrary code on the server simply by convincing it to load a malicious GGUF model file. Tracked as CVE-2026-5760 with a CVSS score of 9.8, the flaw is a server-side template injection (SSTI) bug in how SGLang renders chat templates embedded in model metadata. Any MLOps team downloading community models from Hugging Face or similar hubs is potentially exposed.

What Happened

SGLang is widely used to serve large language models at scale, competing with vLLM and Text Generation Inference for production inference workloads. The vulnerability sits in the getjinjaenv() function, which initializes a Jinja2 template environment using the default jinja2.Environment() — the unsandboxed variant. That distinction matters, because Jinja2's default environment will happily execute arbitrary Python expressions embedded in templates.

The attack chain is embarrassingly simple:

  1. An attacker crafts a GGUF model file whose tokenizer.chat_template metadata field contains a Jinja2 SSTI payload. The payload also includes a trigger phrase designed to reach the vulnerable code path — researchers demonstrated the attack using Qwen3 reranking hooks.
  2. A victim downloads the model from Hugging Face or another distribution platform and loads it into SGLang.
  3. The attacker sends a crafted request to SGLang's /v1/rerank endpoint.
  4. SGLang calls entrypoints/openai/serving_rerank.py, which renders the attacker-controlled chat_template through the unsandboxed Jinja2 environment. The embedded Python payload executes with the privileges of the inference process.

From there the attacker has full RCE on the inference host — which typically means access to GPU resources, model weights, API keys for orchestration layers, and often network reachability into private data stores used for RAG pipelines.
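The core of the bug can be reproduced in a few lines of standalone Jinja2, independent of SGLang. The payload below is a hypothetical stand-in for an attacker-controlled chat_template (the classic __class__/__mro__ SSTI gadget, not the actual payload from the advisory): the default Environment evaluates it, while the sandboxed variant refuses.

```python
from jinja2 import Environment
from jinja2.exceptions import SecurityError
from jinja2.sandbox import ImmutableSandboxedEnvironment

# Hypothetical stand-in for an attacker-controlled chat_template: the
# classic SSTI gadget that walks from a string literal to the interpreter's
# full class list via __class__ / __mro__.
payload = "{{ ''.__class__.__mro__[1].__subclasses__() | length }}"

# Default (unsandboxed) environment: the gadget evaluates successfully.
out = Environment().from_string(payload).render()
print("default env rendered:", out)  # a class count -- proof the gadget ran

# Sandboxed environment: the same dunder access raises SecurityError.
blocked = False
try:
    ImmutableSandboxedEnvironment().from_string(payload).render()
except SecurityError:
    blocked = True
print("sandbox blocked payload:", blocked)
```

A real payload would replace the harmless class count with an import of subprocess or os, but the mechanism is identical.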

As of April 20, no official patch from the SGLang maintainers has shipped. The CERT/CC advisory (VU#915947) recommends replacing jinja2.Environment() with jinja2.ImmutableSandboxedEnvironment as the architectural fix.

Why It Matters

Model files from public hubs are the npm packages of MLOps — downloaded by the million, rarely reviewed, and often run by whichever framework is convenient. CVE-2026-5760 turns every GGUF file into a potential RCE payload for SGLang deployments. It mirrors a broader pattern we are seeing across the AI stack: frameworks built for speed treating model artifacts as trusted inputs, when in reality they are attacker-controlled data.

If your team pulls models from Hugging Face, quantizes them locally into GGUF, or runs any community-contributed LLM through SGLang's rerank API, you have been rolling the dice.

Who Is Affected

  • Any organization running SGLang to serve LLMs in production or staging environments
  • MLOps pipelines that pull GGUF models from Hugging Face, TheBloke mirrors, or user-contributed model repos
  • Inference hosts with the /v1/rerank endpoint exposed to internal services or the internet
  • RAG and reranking workloads that route untrusted queries through SGLang

How to Protect Yourself

Until an official patch lands, apply the sandbox fix yourself. In your SGLang install, locate the getjinjaenv() function and replace the environment constructor:

# Before (vulnerable)
from jinja2 import Environment
env = Environment()

# After (mitigated)
from jinja2.sandbox import ImmutableSandboxedEnvironment
env = ImmutableSandboxedEnvironment()
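A reasonable worry about this mitigation is that it might break legitimate templates. In practice, ordinary chat templates only loop over messages and read plain attributes, which the sandbox permits. A quick sanity check, using an illustrative template in the usual shape (not one taken from a real model):

```python
from jinja2.sandbox import ImmutableSandboxedEnvironment

# Illustrative chat template in the typical shape (not from a real model)
tmpl = "{% for m in messages %}<|{{ m.role }}|>{{ m.content }}\n{% endfor %}"

env = ImmutableSandboxedEnvironment()
rendered = env.from_string(tmpl).render(
    messages=[{"role": "user", "content": "hello"}]
)
print(rendered)  # <|user|>hello
```

The sandbox only rejects introspection primitives (underscore-prefixed attributes, unsafe callables), so benign templates render unchanged.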

Restrict the rerank endpoint. If you do not actively need /v1/rerank, block it at the reverse proxy:

location /v1/rerank {
    # 'return' short-circuits request processing before any access checks,
    # so a separate 'deny all' directive would never be reached
    return 403;
}

In Kubernetes, the equivalent control is a NetworkPolicy that restricts ingress to the endpoint to trusted pods only.

Inspect model metadata before loading. The gguf Python library can extract metadata without executing templates:

from gguf import GGUFReader

reader = GGUFReader("model.gguf")
for field in reader.fields.values():
    if "chat_template" in field.name.lower():
        # A string-valued field stores its raw bytes in the final part
        template = field.parts[-1].tobytes().decode("utf-8", errors="replace")
        print(field.name, template)

Bear in mind that legitimate chat templates are themselves Jinja2, so the delimiters {{ and {% will appear in benign models. Reject any model whose tokenizer.chat_template field combines those delimiters with known SSTI gadget strings such as __class__, __mro__, __subclasses__, subprocess, or os.system.
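That screening rule can be sketched as a small helper. The gadget-string list below is illustrative, not exhaustive; treat it as a tripwire that catches known payloads, not a guarantee of safety.

```python
# Illustrative SSTI tripwires; a real screen should be broader and kept updated.
SUSPICIOUS = ("__class__", "__mro__", "__subclasses__", "subprocess", "os.system")

def looks_malicious(template: str) -> bool:
    """Flag templates mixing Jinja2 delimiters with known gadget strings."""
    has_jinja = "{{" in template or "{%" in template
    return has_jinja and any(tok in template for tok in SUSPICIOUS)

# A normal chat template passes; a gadget-bearing one is flagged.
print(looks_malicious("{% for m in messages %}{{ m.content }}{% endfor %}"))  # False
print(looks_malicious("{{ ''.__class__.__mro__[1].__subclasses__() }}"))      # True
```

Fail closed: a model that trips this check should be quarantined for manual review, not auto-rejected and forgotten, since string matching is easy to obfuscate around.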

Sandbox your inference workload. Run SGLang as a non-root user inside a container with no outbound internet access, a read-only filesystem, and no access to cloud metadata endpoints:

# Pod spec excerpt: drop privileges inside the container
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
---
# Separate NetworkPolicy resource: egress limited to internal addresses,
# which also blocks the 169.254.169.254 cloud metadata endpoint
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sglang-egress
spec:
  podSelector:
    matchLabels:
      app: sglang   # illustrative label; match your deployment
  policyTypes: ["Egress"]
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8

Monitor for suspicious behavior from your inference host — outbound DNS to unexpected domains, spawned shell processes, or unusual file reads are all signals that a malicious template rendered successfully.
