
Conversation

@dougyster

This is a fix of @denyszhak's PR adding the SGLang backend (see #477). Previously, the SGLang version was 0.4.0, and running a chat completion resulted in dependency errors. Below are the fixes implemented:

  1. Upgraded SGLang to 0.5.6
  2. Added the CUDA 13 toolkit (for B200 compatibility)
  3. Fixed colon sanitization for model names, required by SGLang 0.5.6 (a sketch follows this list)
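
For context, a minimal sketch of what the colon fix amounts to, assuming the sanitizer simply replaces colons with a safe character (the function name and replacement here are illustrative, not taken from the PR):

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeModelName rewrites tags such as "ai/smollm2-vllm:latest", since
// SGLang 0.5.6 rejects colons in served model names.
func sanitizeModelName(name string) string {
	return strings.ReplaceAll(name, ":", "-")
}

func main() {
	fmt.Println(sanitizeModelName("ai/smollm2-vllm:latest")) // ai/smollm2-vllm-latest
}
```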

The Docker image builds successfully on Linux hosts with both B200s and H200s, and I'm able to load a model and run a chat completion as follows:

In one terminal, run:

```shell
make docker-run-sglang-cuda
```

In another, load the model with:

```shell
curl -X POST http://localhost:12434/models/create \
  -H "Content-Type: application/json" \
  -d '{"from": "ai/smollm2-vllm"}'
```

Then make a chat completion API request:

```shell
curl -X POST http://localhost:8080/engines/sglang/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2-vllm",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 50
  }'
```
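
For anyone scripting against the endpoint, the same completion request issued from Go; a minimal sketch using only the standard library, with the URL and payload taken from the curl example above:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Same payload as the curl example above.
	body := []byte(`{
		"model": "ai/smollm2-vllm",
		"messages": [{"role": "user", "content": "What is 2+2?"}],
		"max_tokens": 50
	}`)
	resp, err := http.Post(
		"http://localhost:8080/engines/sglang/v1/chat/completions",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out)) // OpenAI-style chat completion JSON
}
```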

@gemini-code-assist

Summary of Changes

Hello @dougyster, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the SGLang backend into the model-runner, addressing critical dependency and compatibility issues encountered in a prior attempt. The changes involve updating the SGLang version, incorporating CUDA 13 for enhanced GPU support, and refining model name handling. This significantly expands the model-runner's capabilities, allowing it to leverage SGLang for efficient chat completions, particularly on modern NVIDIA GPUs like B200s and H200s, and provides new Docker targets for building and running SGLang-specific images.

Highlights

  • SGLang Backend Integration: Introduced the SGLang backend to the model-runner, enabling support for SGLang-based models.
  • Dependency Fixes and Upgrades: Upgraded SGLang to version 0.5.6 to resolve previous dependency errors and added CUDA 13 toolkit for B200 compatibility.
  • Model Name Sanitization: Implemented a fix for colon sanitization in model names, which is a requirement for SGLang version 0.5.6 and later.
  • Docker Build and Runtime Support: Ensured the Docker image builds successfully on Linux with B200s and H200s, and verified chat completion functionality with the new SGLang backend.
  • Backend Communication Protocol: Modified the backend interface and runner to support TCP communication for backends like SGLang, in addition to existing Unix socket support.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/promote-to-latest.yml
    • .github/workflows/release.yml

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the SGLang backend, including Dockerfile changes for building the SGLang environment, Makefile targets for new Docker images, and the necessary Go backend implementation. The changes are well-structured, using build tags for conditional compilation of vLLM to keep the SGLang image lean. A new UsesTCP method has been added to the Backend interface to support SGLang's TCP-only communication, with corresponding updates to the scheduler and other backends.
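
For reviewers, a compilable sketch of the interface change described above; only the UsesTCP method is named by this PR, everything else here is illustrative:

```go
package inference

import "fmt"

// Backend abstracts an inference engine; the rest of the interface is elided.
type Backend interface {
	// ... existing methods ...

	// UsesTCP reports whether the backend communicates over a TCP port
	// (as SGLang does) rather than a Unix domain socket.
	UsesTCP() bool
}

// listenAddress shows how a scheduler might branch on UsesTCP when launching
// a backend; this helper is illustrative, not code from the PR.
func listenAddress(b Backend, socketPath string, port int) string {
	if b.UsesTCP() {
		return fmt.Sprintf("127.0.0.1:%d", port)
	}
	return "unix://" + socketPath
}
```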

My review focuses on a few key areas:

  1. A critical issue in the SGLang memory estimation logic which could cause runtime failures.
  2. A potential incompatibility in the Dockerfile's CUDA setup.
  3. A minor optimization for the Docker image size.

Overall, this is a solid contribution that significantly extends the capabilities of the model runner. The code is clean and the approach for adding a new backend is robust.

Comment on lines +230 to +233

```go
return inference.RequiredMemory{
	RAM:  1,
	VRAM: 1,
}, nil
```

critical

The GetRequiredMemoryForModel function currently returns hardcoded placeholder values of 1 for both RAM and VRAM. This is dangerous as it will cause the scheduler to severely underestimate the memory required to run a model, likely leading to out-of-memory errors and scheduler failures.

If memory estimation is not yet implemented for SGLang, it would be safer to return an error, similar to how the MLX backend handles it. This would prevent the model from being scheduled until proper estimation is in place.

```go
// TODO: Implement accurate memory estimation for SGLang.
return inference.RequiredMemory{}, errors.New("memory estimation for sglang is not yet implemented")
```

Comment on lines +130 to +132

```dockerfile
&& wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb \
&& dpkg -i cuda-keyring_1.1-1_all.deb \
&& apt update && apt install -y cuda-toolkit-13-0 \
```

high

The Dockerfile is configured to use an ubuntu24.04 base image (as defined in the Makefile), but it's fetching the CUDA keyring from the ubuntu2204 repository. This mismatch could lead to package installation issues or incompatibilities. Please use the repository corresponding to the base image's OS version (ubuntu2404).

```dockerfile
    && wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb \
    && dpkg -i cuda-keyring_1.1-1_all.deb \
    && apt update && apt install -y cuda-toolkit-13-0 \
```

```dockerfile
USER root

# Install CUDA toolkit 13 for nvcc (needed for flashinfer JIT compilation)
RUN apt update && apt install -y \
```

medium

To keep the Docker image size minimal, it's a good practice to use the --no-install-recommends flag with apt install. This prevents the installation of optional packages that are not strictly necessary.

```dockerfile
RUN apt update && apt install -y --no-install-recommends \
```

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code. (link)
  • Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code. (link)

General comments:

  • The SGLang Docker integration appears broken: binaryPath() points to /opt/sglang-env/sglang, but the Dockerfile only creates a Python venv with sglang installed (so the executable will be under /opt/sglang-env/bin), and initFromDocker never populates pythonPath, meaning Run will likely error out in the container; consider aligning binaryPath with the venv layout (e.g., bin/sglang or bin/python3 -m sglang.launch_server) and ensuring the Docker install path is actually used.
  • The SGLang CUDA image build in release.yml targets both linux/amd64 and linux/arm64 while the Dockerfile installs cuda-toolkit-13-0 from the ubuntu2204/x86_64 repo, which will fail or be incorrect for arm64; consider gating the CUDA toolkit setup to amd64 only or using an appropriate arm64-compatible base/repo.
  • The new TCP backend port selection in runner.go uses a fixed tcpBackendBasePort (30000) with no configuration hooks, which may conflict with existing services or other instances; it might be safer to make this base port configurable (e.g., via env var or config) or at least document the assumption about available port range.
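
Expanding on the last point, a hedged sketch of making the base port configurable; the environment variable name and helper are assumptions for illustration, while the PR itself hardcodes 30000:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// basePort returns the TCP base port for backends, overridable via an
// environment variable (MODEL_RUNNER_TCP_BASE_PORT is a hypothetical name).
func basePort() int {
	if v := os.Getenv("MODEL_RUNNER_TCP_BASE_PORT"); v != "" {
		if p, err := strconv.Atoi(v); err == nil && p > 0 && p < 65536 {
			return p
		}
	}
	return 30000 // default used by the PR
}

func main() {
	fmt.Println("tcp backend base port:", basePort())
}
```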
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `.github/workflows/release.yml:174` </location>
<code_context>
+        with:
+          file: Dockerfile
+          target: final-sglang
+          platforms: linux/amd64, linux/arm64
+          build-args: |
+            "LLAMA_SERVER_VERSION=${{ inputs.llamaServerVersion }}"
</code_context>

<issue_to_address>
**issue (bug_risk):** Building the SGLang CUDA image for arm64 is likely incompatible with the x86_64-only CUDA repo used in the Dockerfile.

This stage installs CUDA 13 from the `ubuntu2204/x86_64` repo, but the workflow also builds `linux/arm64` images. On arm64, this will likely fail during `apt` setup or pull the wrong architecture. Either limit this image to `linux/amd64` or update the Dockerfile to handle arm64 explicitly (separate repo/logic or skipping the toolchain on non‑x86).
</issue_to_address>
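
One hedged way to implement the amd64-only gating suggested above, using BuildKit's automatic TARGETARCH build argument (a sketch, not code from this PR):

```dockerfile
# Gate the CUDA toolchain to amd64 so arm64 builds skip the x86_64-only repo.
ARG TARGETARCH
RUN if [ "$TARGETARCH" = "amd64" ]; then \
        wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb \
        && dpkg -i cuda-keyring_1.1-1_all.deb \
        && apt update && apt install -y --no-install-recommends cuda-toolkit-13-0 \
        && rm cuda-keyring_1.1-1_all.deb; \
    fi
```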

### Comment 2
<location> `Dockerfile:124-132` </location>
<code_context>
+
+USER root
+
+# Install CUDA toolkit 13 for nvcc (needed for flashinfer JIT compilation)
+RUN apt update && apt install -y \
+    python3 python3-venv python3-dev \
+    curl ca-certificates build-essential \
+    libnuma1 libnuma-dev numactl ninja-build \
+    wget gnupg \
+    && wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb \
+    && dpkg -i cuda-keyring_1.1-1_all.deb \
+    && apt update && apt install -y cuda-toolkit-13-0 \
+    && rm cuda-keyring_1.1-1_all.deb \
+    && rm -rf /var/lib/apt/lists/*
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Installing CUDA toolkit 13.0 on top of a CUDA 12.9 runtime image may introduce version skew and significantly increase image size.

This stage installs `cuda-toolkit-13-0` and points `PATH`/`LD_LIBRARY_PATH` to `/usr/local/cuda-13.0`, while the base image is `nvidia/cuda:12.9.0-runtime-ubuntu24.04`. That can create conflicting runtime libraries between the base image and the toolkit. If CUDA 13 is required by SGLang/flashinfer, consider switching to a CUDA 13.x runtime base image. If you only need `nvcc`, a better pattern is to use a matching `devel` image in a build stage and keep the final runtime on a single, consistent CUDA version to avoid mismatch and image bloat.

Suggested implementation:

```
# --- SGLang variant ---
FROM llamacpp AS sglang

ARG SGLANG_VERSION=0.5.6

USER root

# Install CUDA toolkit 12.9 for nvcc (matching the CUDA 12.9 runtime)
RUN apt update && apt install -y \
    python3 python3-venv python3-dev \
    curl ca-certificates build-essential \
    libnuma1 libnuma-dev numactl ninja-build \
    wget gnupg \
    && wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb \
    && dpkg -i cuda-keyring_1.1-1_all.deb \
    && apt update && apt install -y cuda-toolkit-12-9 \
    && rm cuda-keyring_1.1-1_all.deb \
    && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /opt/sglang-env && chown -R modelrunner:modelrunner /opt/sglang-env

USER modelrunner

# Set CUDA paths for nvcc (needed during flashinfer compilation)
ENV PATH=/usr/local/cuda-12.9/bin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/cuda-12.9/lib64:$LD_LIBRARY_PATH

# Install uv and SGLang as modelrunner user

```

If SGLang/flashinfer explicitly requires CUDA 13.x features, a more robust solution would be:
1. Introduce a dedicated build stage based on a `nvidia/cuda:13.x-devel-ubuntu24.04` image to compile flashinfer with `nvcc`.
2. Copy only the built artifacts into the final runtime image (which should stay on a single CUDA version, e.g., 12.9 or 13.x, but not both).
Adapting to that pattern will require changes elsewhere in the Dockerfile to add the build stage and adjust `COPY` instructions.
</issue_to_address>

### Comment 3
<location> `pkg/inference/backends/sglang/sglang.go:184-192` </location>
<code_context>
+	}
+
+	// Determine binary path - use Docker installation if available, otherwise use Python
+	binaryPath := s.binaryPath()
+	sandboxPath := sglangDir
+	if _, err := os.Stat(binaryPath); errors.Is(err, fs.ErrNotExist) {
+		// Use Python installation
+		if s.pythonPath == "" {
+			return fmt.Errorf("sglang: no docker binary at %s and no python runtime configured; did you forget to call Install?", binaryPath)
+		}
+		binaryPath = s.pythonPath
+		sandboxPath = ""
+	}
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** The SGLang runner mixes assumptions about a standalone binary with Python `-m` invocation, which may not be compatible if a real binary is ever present.

When `/opt/sglang-env/sglang` exists you treat it as the binary but still pass `-m sglang.launch_server` and other Python-specific flags. A real compiled `sglang` binary at that path likely wouldn’t accept those arguments. If the contract is “always run as a Python module”, consider removing the `binaryPath()` override and always using `pythonPath`. Otherwise, branch the argument construction so Python invocations (`python3 -m ...`) and a native `sglang` binary receive appropriate, different args.

Suggested implementation:

```golang
	// Determine binary path.
	// If the Docker/native sglang binary exists, treat it as a standalone executable.
	// Otherwise, fall back to invoking Python as a module runner.
	usePython := false

	binaryPath := s.binaryPath()
	sandboxPath := sglangDir
	if _, err := os.Stat(binaryPath); errors.Is(err, fs.ErrNotExist) {
		// Use Python installation
		if s.pythonPath == "" {
			return fmt.Errorf("sglang: no docker binary at %s and no python runtime configured; did you forget to call Install?", binaryPath)
		}
		binaryPath = s.pythonPath
		sandboxPath = ""
		usePython = true
	}

```

To fully implement the behavior your review comment describes, you will also need to:

1. **Branch command construction based on `usePython`:**
   Somewhere later in this method (or in a helper) you will have something like:
   ```go
   cmd := exec.Command(binaryPath, args...)
   ```
   or similar. Update it to:
   ```go
   var cmd *exec.Cmd
   if usePython {
       // Python module invocation: python -m sglang.launch_server ...
       pyArgs := append([]string{"-m", "sglang.launch_server"}, args...)
       cmd = exec.Command(binaryPath, pyArgs...)
   } else {
       // Native sglang binary invocation: sglang ...
       cmd = exec.Command(binaryPath, args...)
   }
   ```
   Ensure any existing `"-m", "sglang.launch_server"` arguments are removed from the generic `args` slice and only added in the `usePython` branch.

2. **Update any other places that assume `binaryPath` is always Python:**
   - If you have code that always sets `PYTHONPATH`, or otherwise assumes `binaryPath` is a Python interpreter, guard that logic with `if usePython { ... }`.
   - If `usePython` needs to be referenced outside this function, consider adding it as a field on the relevant struct instead of a local variable.

These changes will make the contract clear: a real `sglang` binary is treated as a standalone executable with native-style arguments, while Python-based installs are invoked via `python -m sglang.launch_server`.
</issue_to_address>

### Comment 4
<location> `pkg/inference/backends/sglang/sglang_config_test.go:39-48` </location>
<code_context>
+func TestGetArgs(t *testing.T) {
</code_context>

<issue_to_address>
**suggestion (testing):** Extend TestGetArgs to cover socket parsing errors and non-default host/port

It’d be valuable to extend the table in `TestGetArgs` with two extra cases:

1. An invalid `socket` (e.g. `"127.0.0.1"` or `"bad-socket"`) that triggers the `net.SplitHostPort(socket)` error path, with `expectError: true`.
2. A non-default host/port (e.g. `"0.0.0.0:34567"`) that asserts the resulting `--host`/`--port` flags match those values.

These would close the gaps around socket parsing and help prevent regressions in the new TCP-based scheduling behavior.

Suggested implementation:

```golang
func TestGetArgs(t *testing.T) {
	tests := []struct {
		name        string
		config      *inference.BackendConfiguration
		bundle      *mockModelBundle
		mode        inference.BackendMode
		expected    []string
		expectError bool
	}{
		{
			name: "empty safetensors path should error",
			// rest of this case as previously defined (ensuring it expects an error)
		},
		{
			name: "invalid socket returns error",
			config: &inference.BackendConfiguration{
				Socket: "bad-socket",
			},
			bundle:      &mockModelBundle{},
			mode:        inference.BackendModeSingle,
			expectError: true,
		},
		{
			name: "non-default host and port from socket",
			config: &inference.BackendConfiguration{
				Socket: "0.0.0.0:34567",
			},
			bundle: &mockModelBundle{},
			mode:   inference.BackendModeSingle,
			expected: []string{
				"--host", "0.0.0.0",
				"--port", "34567",
			},
			expectError: false,
		},

```

Because only a fragment of `TestGetArgs` is visible, you’ll need to align a few details with the existing code:

1. **BackendConfiguration field name**:  
   - Replace `Socket` with the actual field name used by `getArgs` for the scheduler’s TCP socket (e.g. `SchedulerSocket`, `ListenAddr`, etc.).
2. **BackendMode value**:  
   - Replace `inference.BackendModeSingle` with whichever `BackendMode` constant is appropriate in the existing tests (e.g. `inference.BackendModeServer`, `inference.BackendModeSingle`, etc.).
3. **First test case body**:  
   - Remove the placeholder comment in the `"empty safetensors path should error"` case and restore its original struct fields (or ensure it is constructed so that `getArgs` errors due to the empty safetensors path and sets `expectError: true`).
4. **Argument expectations order/shape**:  
   - Confirm that `getArgs` indeed produces `--host <host> --port <port>` in that order, and adjust the `expected` slice accordingly if the flag names or ordering differ.
5. **Imports**:  
   - If these tests introduce any new types or constants not already imported (unlikely given the snippet), ensure the corresponding imports are present at the top of the file.
</issue_to_address>

### Comment 5
<location> `pkg/inference/backends/sglang/sglang.go:138` </location>
<code_context>
	if err := exec.Command(pythonPath, "-c", "import sglang").Run(); err != nil {
</code_context>

<issue_to_address>
**security (go.lang.security.audit.dangerous-exec-command):** Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.

*Source: opengrep*
</issue_to_address>

### Comment 6
<location> `pkg/inference/backends/sglang/sglang.go:144` </location>
<code_context>
	output, err := exec.Command(pythonPath, "-c", "import sglang; print(sglang.__version__)").Output()
</code_context>

<issue_to_address>
**security (go.lang.security.audit.dangerous-exec-command):** Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.

*Source: opengrep*
</issue_to_address>
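
Both opengrep findings flag the dynamic pythonPath argument to exec.Command. One hedged way to address them, assuming pythonPath comes from configuration rather than request data (the helper below is illustrative, not from this PR):

```go
package sglang

import (
	"fmt"
	"os"
	"path/filepath"
)

// validatePythonPath checks that the interpreter path is an absolute path to
// an executable file before it ever reaches exec.Command.
func validatePythonPath(pythonPath string) error {
	if !filepath.IsAbs(pythonPath) {
		return fmt.Errorf("python path must be absolute: %q", pythonPath)
	}
	info, err := os.Stat(pythonPath)
	if err != nil {
		return fmt.Errorf("python path not found: %w", err)
	}
	if info.IsDir() || info.Mode()&0o111 == 0 {
		return fmt.Errorf("python path is not an executable file: %q", pythonPath)
	}
	return nil
}
```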

B200/H200 GPUs are x86_64 architecture only, and we install CUDA toolkit from the x86_64 repository. Building for arm64 would fail. ARM64 CUDA support can be added by the community if needed.
@dougyster dougyster force-pushed the dougyster/fix-sglang-aiohttp branch from 6bb4142 to dbfc078 Compare December 8, 2025 22:15
@denyszhak

denyszhak commented Dec 8, 2025

@dougyster I appreciate you working on this. I'm still on the PR, and I can include your commit in the original PR as-is, with you as the author, since I'm not going to merge a broken change anyway. I've also fixed the Sourcery issues and will push them soon.

@dougyster

I appreciate you working on this. I'm still on the PR, and I can include your commit in the original PR as-is, with you as the author, since I'm not going to merge a broken change anyway. I've also fixed the Sourcery issues and will push them soon.

Sounds good! Please see the key fixes I made in this PR if they're helpful to you: mainly upgrading SGLang to 0.5.6 with CUDA 13.0 for the x86_64 architecture. Thanks!

@denyszhak

@dougyster It really is, thank you, appreciate your help!

@dougyster

@dougyster It really is, thank you, appreciate your help!

Happy to help! I'll close this PR and you can cherry-pick the commit as needed. Let me know if you need help testing on Linux NVIDIA hardware.

@dougyster dougyster closed this Dec 8, 2025
@dougyster dougyster reopened this Dec 9, 2025
@dougyster dougyster closed this Dec 9, 2025
@dougyster dougyster reopened this Dec 12, 2025