Git commit
7f0e728 (matches the version line in the verbose log below)
I get a CUDA error when running img2img generation from the CLI. On the C++ side, my own code crashes at the generate_image call.
Operating System & Version
Windows 11
GGML backends
CUDA
Command-line arguments used
sd-cli.exe --backend cuda0 --diffusion-model flux-2-klein-4b-Q8_0.gguf --llm Qwen3-4B-UD-Q4_K_XL.gguf --vae flux2.full_encoder_small_decoder.safetensors --vae-conv-direct -p "turn the image into a high quality photograph" -r Interior_Input_1024.png -o output_photo.png --cfg-scale 1 --steps 4 --fa --offload-to-cpu
Steps to reproduce
Ran the command above from the Windows command prompt.
What you expected to happen
An image output
What actually happened
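It crashes during the VAE compute step, before sampling starts. The CUDA backend reports "CONV_2D failed" with "CUDA error: no kernel image is available for execution on the device".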
Logs / error messages / stack trace
The verbose log:
C:\Users\Owner\Downloads\CUDA_CRash>sd-cli.exe --backend cuda0 --diffusion-model flux-2-klein-4b-Q8_0.gguf --llm Qwen3-4B-UD-Q4_K_XL.gguf --vae flux2.full_encoder_small_decoder.safetensors --vae-conv-direct -p "turn the image into a high quality photograph" -r Interior_Input_1024.png -o output_photo.png --cfg-scale 1 --steps 4 --fa --offload-to-cpu --verbose
[DEBUG] main.cpp:598 - version: stable-diffusion.cpp version master-709-92a3b73-1-g7f0e728, commit 7f0e728
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16302 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 16302 MiB
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Intel(R) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
[DEBUG] main.cpp:599 - System Info:
SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | REPACK = 1 |
[DEBUG] main.cpp:600 - SDCliParams {
mode: img_gen,
output_path: "output_photo.png",
image_path: "",
metadata_format: "text",
verbose: true,
color: false,
canny_preprocess: false,
convert_name: false,
preview_method: none,
preview_interval: 1,
preview_path: "preview.png",
preview_fps: 16,
taesd_preview: false,
preview_noisy: false,
metadata_raw: false,
metadata_brief: false,
metadata_all: false
}
[DEBUG] main.cpp:601 - SDContextParams {
n_threads: 12,
model_path: "",
clip_l_path: "",
clip_g_path: "",
clip_vision_path: "",
t5xxl_path: "",
llm_path: "Qwen3-4B-UD-Q4_K_XL.gguf",
llm_vision_path: "",
diffusion_model_path: "flux-2-klein-4b-Q8_0.gguf",
high_noise_diffusion_model_path: "",
uncond_diffusion_model_path: "",
embeddings_connectors_path: "",
vae_path: "flux2.full_encoder_small_decoder.safetensors",
vae_format: "auto",
audio_vae_path: "",
taesd_path: "",
esrgan_path: "",
control_net_path: "",
embedding_dir: "",
embeddings: {
}
wtype: NONE,
tensor_type_rules: "",
lora_model_dir: ".",
hires_upscalers_dir: "",
photo_maker_path: "",
rng_type: cuda,
sampler_rng_type: NONE,
offload_params_to_cpu: true,
max_vram: "0",
stream_layers: false,
backend: "cuda0",
params_backend: "",
enable_mmap: false,
control_net_cpu: false,
clip_on_cpu: false,
vae_on_cpu: false,
flash_attn: true,
diffusion_flash_attn: false,
diffusion_conv_direct: false,
vae_conv_direct: true,
circular: false,
circular_x: false,
circular_y: false,
chroma_use_dit_mask: true,
qwen_image_zero_cond_t: false,
chroma_use_t5_mask: false,
chroma_t5_mask_pad: 1,
prediction: NONE,
lora_apply_mode: auto,
force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:602 - SDGenerationParams {
loras: "{
}",
high_noise_loras: "{
}",
prompt: "turn the image into a high quality photograph",
negative_prompt: "",
clip_skip: -1,
width: -1,
height: -1,
batch_count: 1,
init_image_path: "",
end_image_path: "",
mask_image_path: "",
control_image_path: "",
ref_image_paths: ["Interior_Input_1024.png"],
control_video_path: "",
auto_resize_ref_image: true,
increase_ref_index: false,
pm_id_images_dir: "",
pm_id_embed_path: "",
pm_style_strength: 20,
skip_layers: [7, 8, 9],
sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 4, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
high_noise_skip_layers: [7, 8, 9],
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
custom_sigmas: [],
cache_mode: "",
cache_option: "",
cache: disabled (threshold=inf, start=0.15, end=0.95),
moe_boundary: 0.875,
video_frames: 1,
fps: 16,
vace_strength: 1,
strength: 0.75,
control_strength: 0.9,
seed: 42,
upscale_repeats: 1,
upscale_tile_size: 128,
hires: { enabled: false, upscaler: "Latent", model_path: "", scale: 2, target_width: 0, target_height: 0, steps: 0, denoising_strength: 0.7, custom_sigmas: [], upscale_tile_size: 128 },
vae_tiling_params: { 0, 0, 0, 0, 0.5, 0, 0, "" },
}
[INFO ] common.cpp:1973 - set width x height to 2048 x 2048
[DEBUG] ggml_extend_backend.cpp:326 - Initializing backend: CUDA0
[DEBUG] ggml_extend_backend.cpp:326 - Initializing backend: CPU
[DEBUG] model_loader.cpp:227 - using 12 threads for model loading
[INFO ] stable-diffusion.cpp:397 - loading diffusion model from 'flux-2-klein-4b-Q8_0.gguf'
[INFO ] model_loader.cpp:235 - load flux-2-klein-4b-Q8_0.gguf using gguf format
[DEBUG] model_loader.cpp:284 - init from 'flux-2-klein-4b-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:459 - loading llm from 'Qwen3-4B-UD-Q4_K_XL.gguf'
[INFO ] model_loader.cpp:235 - load Qwen3-4B-UD-Q4_K_XL.gguf using gguf format
[DEBUG] model_loader.cpp:284 - init from 'Qwen3-4B-UD-Q4_K_XL.gguf'
[INFO ] stable-diffusion.cpp:473 - loading vae from 'flux2.full_encoder_small_decoder.safetensors'
[INFO ] model_loader.cpp:238 - load flux2.full_encoder_small_decoder.safetensors using safetensors format
[DEBUG] model_loader.cpp:312 - init from 'flux2.full_encoder_small_decoder.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:524 - Version: Flux.2 klein
[INFO ] stable-diffusion.cpp:551 - Weight type stat: f32: 453 | q8_0: 80 | q4_K: 154 | q5_K: 30 | q6_K: 49 | iq4_xs: 20 | bf16: 9
[INFO ] stable-diffusion.cpp:552 - Conditioner weight type stat: f32: 145 | q4_K: 154 | q5_K: 30 | q6_K: 49 | iq4_xs: 20
[INFO ] stable-diffusion.cpp:553 - Diffusion model weight type stat: f32: 60 | q8_0: 80 | bf16: 9
[INFO ] stable-diffusion.cpp:554 - VAE weight type stat: f32: 248
[DEBUG] stable-diffusion.cpp:556 - ggml tensor size = 400 bytes
[DEBUG] qwen2_tokenizer.cpp:14 - merges size 151387
[DEBUG] qwen2_tokenizer.cpp:39 - vocab size: 151674
[DEBUG] llm.hpp:226 - llm: num_layers = 36, vocab_size = 151936, hidden_size = 2560, intermediate_size = 9728
[DEBUG] flux.hpp:168 - flux: depth = 5, depth_single_blocks = 20, guidance_embed = false, context_in_dim = 7680, hidden_size = 3072, num_heads = 24
[INFO ] stable-diffusion.cpp:955 - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:527 - vae decoder: ch = 96
[INFO ] stable-diffusion.cpp:991 - Using Conv2d direct in the vae model
[INFO ] stable-diffusion.cpp:1059 - Using flash attention
[INFO ] stable-diffusion.cpp:1073 - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:1091 - validating model metadata
[DEBUG] stable-diffusion.cpp:1141 - model metadata validated; weights will be prepared lazily
[INFO ] stable-diffusion.cpp:1196 - total params memory size = 7822.64MB (VRAM 0.00MB, RAM 7822.64MB): text_encoders 3602.16MB(RAM), diffusion_model 4101.40MB(RAM), vae 119.08MB(RAM), controlnet 0.00MB(N/A), extensions 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1292 - running in Flux2 FLOW mode
[INFO ] stable-diffusion.cpp:4432 - generate_image 2048x2048
[DEBUG] denoiser.hpp:833 - Flux2FlowDenoiser: set shift to 3.230
[INFO ] denoiser.hpp:579 - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3456 - sampling using Euler method
[INFO ] stable-diffusion.cpp:3963 - EDIT mode
[DEBUG] stable-diffusion.cpp:3973 - auto resize ref images
[DEBUG] stable-diffusion.cpp:3990 - resize vae ref image 0 from 2048x2048 to 1024x1024
[DEBUG] model_loader.cpp:990 - loading 108/248 tensors from flux2.full_encoder_small_decoder.safetensors
|##################################################| 108/108 - 656.62MB/s
[INFO ] model_loader.cpp:1229 - loading tensors completed, taking 0.20s (read: 0.02s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[DEBUG] model_manager.cpp:221 - model manager prepared params backend buffer ( 65.72 MB, 108 tensors, RAM)
[DEBUG] model_manager.cpp:313 - model manager staged compute params ( 65.72 MB, 108 tensors) to CUDA0, taking 0.01s
[DEBUG] ggml_extend.hpp:1998 - vae compute buffer size: 1549.00 MB(VRAM)
[ERROR] ggml_extend.hpp:70 - ggml_cuda_compute_forward: CONV_2D failed
[ERROR] ggml_extend.hpp:70 - CUDA error: no kernel image is available for execution on the device
[ERROR] ggml_extend.hpp:70 - current device: 0, in function ggml_cuda_compute_forward at D:\Stable-diffusion\ggml\src\ggml-cuda\ggml-cuda.cu:3163
[ERROR] ggml_extend.hpp:70 - err
D:\Stable-diffusion\ggml\src\ggml-cuda\ggml-cuda.cu:103: CUDA error
Additional context / environment details
My CPU:
Intel(R) Core(TM) Ultra 9 285K (3.70 GHz)
My GPU:
NVIDIA GeForce RTX 5070 Ti (16 GB)
Execution with the Vulkan backend works well on the same machine.