[Bug] CUDA: no kernel image is available for execution on the device #1681

@StudenteChamp2

Description

Git commit

7f0e728, I think.

I get a CUDA error when running an img2img generation from the CLI. On the C++ side, my own code crashes at the generate_image call.

Operating System & Version

Windows 11

GGML backends

CUDA

Command-line arguments used

sd-cli.exe --backend cuda0 --diffusion-model flux-2-klein-4b-Q8_0.gguf --llm Qwen3-4B-UD-Q4_K_XL.gguf --vae flux2.full_encoder_small_decoder.safetensors --vae-conv-direct -p "turn the image into a high quality photograph" -r Interior_Input_1024.png -o output_photo.png --cfg-scale 1 --steps 4 --fa --offload-to-cpu

Steps to reproduce

Run the command above from the Windows command line.

What you expected to happen

An image output

What actually happened

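The process crashes while the VAE is running on the reference image: the CONV_2D op fails with "CUDA error: no kernel image is available for execution on the device".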

Logs / error messages / stack trace

The verbose log:

C:\Users\Owner\Downloads\CUDA_CRash>sd-cli.exe --backend cuda0 --diffusion-model flux-2-klein-4b-Q8_0.gguf --llm Qwen3-4B-UD-Q4_K_XL.gguf --vae flux2.full_encoder_small_decoder.safetensors --vae-conv-direct -p "turn the image into a high quality photograph" -r Interior_Input_1024.png -o output_photo.png --cfg-scale 1 --steps 4 --fa --offload-to-cpu --verbose
[DEBUG] main.cpp:598 - version: stable-diffusion.cpp version master-709-92a3b73-1-g7f0e728, commit 7f0e728
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16302 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 16302 MiB
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Intel(R) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
[DEBUG] main.cpp:599 - System Info:
SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | REPACK = 1 |
[DEBUG] main.cpp:600 - SDCliParams {
mode: img_gen,
output_path: "output_photo.png",
image_path: "",
metadata_format: "text",
verbose: true,
color: false,
canny_preprocess: false,
convert_name: false,
preview_method: none,
preview_interval: 1,
preview_path: "preview.png",
preview_fps: 16,
taesd_preview: false,
preview_noisy: false,
metadata_raw: false,
metadata_brief: false,
metadata_all: false
}
[DEBUG] main.cpp:601 - SDContextParams {
n_threads: 12,
model_path: "",
clip_l_path: "",
clip_g_path: "",
clip_vision_path: "",
t5xxl_path: "",
llm_path: "Qwen3-4B-UD-Q4_K_XL.gguf",
llm_vision_path: "",
diffusion_model_path: "flux-2-klein-4b-Q8_0.gguf",
high_noise_diffusion_model_path: "",
uncond_diffusion_model_path: "",
embeddings_connectors_path: "",
vae_path: "flux2.full_encoder_small_decoder.safetensors",
vae_format: "auto",
audio_vae_path: "",
taesd_path: "",
esrgan_path: "",
control_net_path: "",
embedding_dir: "",
embeddings: {
}
wtype: NONE,
tensor_type_rules: "",
lora_model_dir: ".",
hires_upscalers_dir: "",
photo_maker_path: "",
rng_type: cuda,
sampler_rng_type: NONE,
offload_params_to_cpu: true,
max_vram: "0",
stream_layers: false,
backend: "cuda0",
params_backend: "",
enable_mmap: false,
control_net_cpu: false,
clip_on_cpu: false,
vae_on_cpu: false,
flash_attn: true,
diffusion_flash_attn: false,
diffusion_conv_direct: false,
vae_conv_direct: true,
circular: false,
circular_x: false,
circular_y: false,
chroma_use_dit_mask: true,
qwen_image_zero_cond_t: false,
chroma_use_t5_mask: false,
chroma_t5_mask_pad: 1,
prediction: NONE,
lora_apply_mode: auto,
force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:602 - SDGenerationParams {
loras: "{
}",
high_noise_loras: "{
}",
prompt: "turn the image into a high quality photograph",
negative_prompt: "",
clip_skip: -1,
width: -1,
height: -1,
batch_count: 1,
init_image_path: "",
end_image_path: "",
mask_image_path: "",
control_image_path: "",
ref_image_paths: ["Interior_Input_1024.png"],
control_video_path: "",
auto_resize_ref_image: true,
increase_ref_index: false,
pm_id_images_dir: "",
pm_id_embed_path: "",
pm_style_strength: 20,
skip_layers: [7, 8, 9],
sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 4, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
high_noise_skip_layers: [7, 8, 9],
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
custom_sigmas: [],
cache_mode: "",
cache_option: "",
cache: disabled (threshold=inf, start=0.15, end=0.95),
moe_boundary: 0.875,
video_frames: 1,
fps: 16,
vace_strength: 1,
strength: 0.75,
control_strength: 0.9,
seed: 42,
upscale_repeats: 1,
upscale_tile_size: 128,
hires: { enabled: false, upscaler: "Latent", model_path: "", scale: 2, target_width: 0, target_height: 0, steps: 0, denoising_strength: 0.7, custom_sigmas: [], upscale_tile_size: 128 },
vae_tiling_params: { 0, 0, 0, 0, 0.5, 0, 0, "" },
}
[INFO ] common.cpp:1973 - set width x height to 2048 x 2048
[DEBUG] ggml_extend_backend.cpp:326 - Initializing backend: CUDA0
[DEBUG] ggml_extend_backend.cpp:326 - Initializing backend: CPU
[DEBUG] model_loader.cpp:227 - using 12 threads for model loading
[INFO ] stable-diffusion.cpp:397 - loading diffusion model from 'flux-2-klein-4b-Q8_0.gguf'
[INFO ] model_loader.cpp:235 - load flux-2-klein-4b-Q8_0.gguf using gguf format
[DEBUG] model_loader.cpp:284 - init from 'flux-2-klein-4b-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:459 - loading llm from 'Qwen3-4B-UD-Q4_K_XL.gguf'
[INFO ] model_loader.cpp:235 - load Qwen3-4B-UD-Q4_K_XL.gguf using gguf format
[DEBUG] model_loader.cpp:284 - init from 'Qwen3-4B-UD-Q4_K_XL.gguf'
[INFO ] stable-diffusion.cpp:473 - loading vae from 'flux2.full_encoder_small_decoder.safetensors'
[INFO ] model_loader.cpp:238 - load flux2.full_encoder_small_decoder.safetensors using safetensors format
[DEBUG] model_loader.cpp:312 - init from 'flux2.full_encoder_small_decoder.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:524 - Version: Flux.2 klein
[INFO ] stable-diffusion.cpp:551 - Weight type stat: f32: 453 | q8_0: 80 | q4_K: 154 | q5_K: 30 | q6_K: 49 | iq4_xs: 20 | bf16: 9
[INFO ] stable-diffusion.cpp:552 - Conditioner weight type stat: f32: 145 | q4_K: 154 | q5_K: 30 | q6_K: 49 | iq4_xs: 20
[INFO ] stable-diffusion.cpp:553 - Diffusion model weight type stat: f32: 60 | q8_0: 80 | bf16: 9
[INFO ] stable-diffusion.cpp:554 - VAE weight type stat: f32: 248
[DEBUG] stable-diffusion.cpp:556 - ggml tensor size = 400 bytes
[DEBUG] qwen2_tokenizer.cpp:14 - merges size 151387
[DEBUG] qwen2_tokenizer.cpp:39 - vocab size: 151674
[DEBUG] llm.hpp:226 - llm: num_layers = 36, vocab_size = 151936, hidden_size = 2560, intermediate_size = 9728
[DEBUG] flux.hpp:168 - flux: depth = 5, depth_single_blocks = 20, guidance_embed = false, context_in_dim = 7680, hidden_size = 3072, num_heads = 24
[INFO ] stable-diffusion.cpp:955 - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:527 - vae decoder: ch = 96
[INFO ] stable-diffusion.cpp:991 - Using Conv2d direct in the vae model
[INFO ] stable-diffusion.cpp:1059 - Using flash attention
[INFO ] stable-diffusion.cpp:1073 - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:1091 - validating model metadata
[DEBUG] stable-diffusion.cpp:1141 - model metadata validated; weights will be prepared lazily
[INFO ] stable-diffusion.cpp:1196 - total params memory size = 7822.64MB (VRAM 0.00MB, RAM 7822.64MB): text_encoders 3602.16MB(RAM), diffusion_model 4101.40MB(RAM), vae 119.08MB(RAM), controlnet 0.00MB(N/A), extensions 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1292 - running in Flux2 FLOW mode
[INFO ] stable-diffusion.cpp:4432 - generate_image 2048x2048
[DEBUG] denoiser.hpp:833 - Flux2FlowDenoiser: set shift to 3.230
[INFO ] denoiser.hpp:579 - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3456 - sampling using Euler method
[INFO ] stable-diffusion.cpp:3963 - EDIT mode
[DEBUG] stable-diffusion.cpp:3973 - auto resize ref images
[DEBUG] stable-diffusion.cpp:3990 - resize vae ref image 0 from 2048x2048 to 1024x1024
[DEBUG] model_loader.cpp:990 - loading 108/248 tensors from flux2.full_encoder_small_decoder.safetensors
|##################################################| 108/108 - 656.62MB/s
[INFO ] model_loader.cpp:1229 - loading tensors completed, taking 0.20s (read: 0.02s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] model_manager.cpp:221 - model manager prepared params backend buffer ( 65.72 MB, 108 tensors, RAM)
[DEBUG] model_manager.cpp:313 - model manager staged compute params ( 65.72 MB, 108 tensors) to CUDA0, taking 0.01s
[DEBUG] ggml_extend.hpp:1998 - vae compute buffer size: 1549.00 MB(VRAM)
[ERROR] ggml_extend.hpp:70 - ggml_cuda_compute_forward: CONV_2D failed
[ERROR] ggml_extend.hpp:70 - CUDA error: no kernel image is available for execution on the device
[ERROR] ggml_extend.hpp:70 - current device: 0, in function ggml_cuda_compute_forward at D:\Stable-diffusion\ggml\src\ggml-cuda\ggml-cuda.cu:3163
[ERROR] ggml_extend.hpp:70 - err
D:\Stable-diffusion\ggml\src\ggml-cuda\ggml-cuda.cu:103: CUDA error

Additional context / environment details

My CPU:
Intel(R) Core(TM) Ultra 9 285K (3.70 GHz)

My GPU:
NVIDIA GeForce RTX 5070 Ti (16 GB)

The same generation works fine with the Vulkan backend.
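
For context on the error in the log: "no kernel image is available for execution on the device" is the message the CUDA runtime reports when a binary contains no compiled code the GPU can run, which usually points at the architecture list used at build time rather than at the model or the command line. Whether that is the cause here is not confirmed. The log reports compute capability 12.0 for this GPU, so one thing worth checking is whether the CUDA build includes code for that architecture. Below is a minimal C++ sketch of the compatibility rule, as a simplified model only. The names `BuiltArch` and `has_kernel_image` are hypothetical and not part of ggml or stable-diffusion.cpp, and architecture-specific targets are ignored.

```cpp
#include <vector>

// Hypothetical description of one code-generation target baked into a binary.
// These values are illustrative assumptions, not read from any real build.
struct BuiltArch {
    int  major;     // compute capability major version of the target
    int  minor;     // compute capability minor version of the target
    bool has_sass;  // native machine code for this target (sm_XY)
    bool has_ptx;   // intermediate code the driver can JIT-compile (compute_XY)
};

// Simplified model of CUDA's general compatibility rules:
//  - native code runs on devices with the same major version and an
//    equal or higher minor version;
//  - PTX can be JIT-compiled for any device whose compute capability is
//    equal to or higher than the PTX target.
bool has_kernel_image(int dev_major, int dev_minor,
                      const std::vector<BuiltArch>& archs) {
    for (const BuiltArch& a : archs) {
        const bool sass_ok =
            a.has_sass && a.major == dev_major && a.minor <= dev_minor;
        const bool not_newer_than_device =
            a.major < dev_major ||
            (a.major == dev_major && a.minor <= dev_minor);
        const bool ptx_ok = a.has_ptx && not_newer_than_device;
        if (sass_ok || ptx_ok) {
            return true;
        }
    }
    return false;  // the runtime would report "no kernel image is available"
}
```

Under this model, a build that only carries native code for 8.6 and 8.9 has nothing to run on a 12.0 device, while a build that also targets 12.0, or ships PTX for an older target, does. That would be consistent with Vulkan working on the same machine, since the Vulkan backend does not depend on the CUDA architecture list.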
