Sglang configuration - chat returns NULLs

### Basic Information - Models Used

Minimax-M2

### Information about environment and deployment

Docker Lmsysorg/sglang from 11/22/2025
Air gapped
Host Ubuntu 24.04
2 or 3 Nvidia H200
Minimax-M2

### Description

Hello,
 
I hope you can provide guidance.  Last week I downloaded a new sglang and minimax-m2 model from HuggingFace using “git clone”.  I have tried a couple of things that might be interesting to you and some questions. 
Putting KV on a third card crashes sglang.  Not a huge deal. I just moved to 2 H200 cards.
Important - When I run the model with 2 cards I am not able to complete even the sample test from the model website.  The results are below.  It just gave a bunch of Unicode NULLs.
I have provided my Portainer Compose.yaml file which I tried making just like the sample
I have provided the startup logging
I have project the curl chat interface test
Do you happen to have a sample Yaml for using Continue.dev in vs code? This is the next configuration item after it runs.
I was getting error when trying to do CPU offloading configuration.  Maybe you can shed some light on that.
 
Please help as I am trying to get this running for offline use.  This is basically to be used for coding with fairly large repos. So, I am hoping it is as good or better than Qwen3-coder 480B as it is much smaller in footprint and runs, in theory, using sglang.
 
==========================
Compose.yaml
==========================
 
 
version: "3.9"
 
services:
  sglang-minimax-m2:
    image: lmsysorg/sglang:latest
    container_name: sglang-minimax-m2
    restart: unless-stopped
 
    shm_size: "32gb"
 
    ulimits:
      memlock: -1
      stack: 67108864
 
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
 
    environment:
      NVIDIA_VISIBLE_DEVICES: "0,1"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
 
      # Performance settings
      TOKENIZERS_PARALLELISM: "false"
      OMP_NUM_THREADS: "16"
 
    ports:
      - "30001:30000"
 
    volumes:
      - /opt/models/MiniMax-M2/1:/models/minimax-m2:ro
      - /opt/sglang_cache:/root/.cache
      - /opt/sglang_logs:/logs
 
    entrypoint:
      - bash
      - -c
      - |
          echo "==================================================="
          echo " Starting MiniMax-M2 (TP=2 on GPUs 0+1, 96K context)"
          echo "==================================================="
 
          python3 -m sglang.launch_server \
            --model-path /models/minimax-m2 \
            --host 0.0.0.0 \
            --port 30000 \
            --tp-size 2 \
            --trust-remote-code \
            --mem-fraction-static 0.85 \
            --reasoning-parser minimax-append-think \
            --tool-call-parser minimax-m2
#            --attention-backend triton \
#            --triton-attention-num-kv-splits 16 \
#            --dtype auto \
#            --context-length 196608 \
#            --max-running-requests 32 \
#            --max-total-tokens 180000 \
#            --stream-interval 1 \
 
 
===================================================
 
Starting MiniMax-M2 (TP=2 on GPUs 0+1)
 
===================================================
 
[2025-11-23 06:47:06] INFO model_config.py:884: Downcasting torch.float32 to torch.float16.
 
[2025-11-23 06:47:06] WARNING server_args.py:1197: Attention backend not explicitly specified. Use flashinfer backend by default.
 
[2025-11-23 06:47:06] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
 
[2025-11-23 06:47:06] server_args=ServerArgs(model_path='/models/minimax-m2', tokenizer_path='/models/minimax-m2', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.85, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=188384565, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='/models/minimax-m2', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='minimax-append-think', tool_call_parser='minimax-m2', tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
 
[2025-11-23 06:47:06] Downcasting torch.float32 to torch.float16.
 
[2025-11-23 06:47:07] Using default HuggingFace chat template with detected content format: openai
 
[2025-11-23 06:47:14] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
 
[2025-11-23 06:47:14] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
 
[2025-11-23 06:47:15 TP1] Downcasting torch.float32 to torch.float16.
 
[2025-11-23 06:47:15] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
 
[2025-11-23 06:47:15 TP0] Downcasting torch.float32 to torch.float16.
 
[2025-11-23 06:47:15 TP1] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:15 TP1] Init torch distributed begin.
[2025-11-23 06:47:15 TP0] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:15 TP0] Init torch distributed begin.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-23 06:47:16 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-23 06:47:17 TP1] Init torch distributed ends. mem usage=1.01 GB
[2025-11-23 06:47:17 TP0] Init torch distributed ends. mem usage=1.01 GB
[2025-11-23 06:47:21 TP0] Load weight begin. avail mem=133.01 GB
[2025-11-23 06:47:21 TP0] Detected fp8 checkpoint.
[2025-11-23 06:47:21 TP1] Load weight begin. avail mem=138.24 GB
Loading safetensors checkpoint shards:   0% Completed | 0/17 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   6% Completed | 1/17 [00:00<00:08,  1.85it/s]
Loading safetensors checkpoint shards:  12% Completed | 2/17 [00:01<00:08,  1.86it/s]
Loading safetensors checkpoint shards:  18% Completed | 3/17 [00:01<00:07,  1.93it/s]
Loading safetensors checkpoint shards:  24% Completed | 4/17 [00:02<00:07,  1.80it/s]
Loading safetensors checkpoint shards:  29% Completed | 5/17 [00:02<00:06,  1.85it/s]
Loading safetensors checkpoint shards:  35% Completed | 6/17 [00:03<00:05,  1.84it/s]
Loading safetensors checkpoint shards:  41% Completed | 7/17 [00:03<00:05,  1.97it/s]
Loading safetensors checkpoint shards:  47% Completed | 8/17 [00:04<00:04,  2.04it/s]
Loading safetensors checkpoint shards:  53% Completed | 9/17 [00:04<00:04,  1.84it/s]
Loading safetensors checkpoint shards:  59% Completed | 10/17 [00:05<00:03,  1.95it/s]
Loading safetensors checkpoint shards:  65% Completed | 11/17 [00:05<00:02,  2.18it/s]
Loading safetensors checkpoint shards:  71% Completed | 12/17 [00:05<00:02,  2.36it/s]
Loading safetensors checkpoint shards:  76% Completed | 13/17 [00:06<00:01,  2.38it/s]
Loading safetensors checkpoint shards:  82% Completed | 14/17 [00:06<00:01,  2.50it/s]
Loading safetensors checkpoint shards:  88% Completed | 15/17 [00:07<00:00,  2.61it/s]
Loading safetensors checkpoint shards:  94% Completed | 16/17 [00:07<00:00,  2.70it/s]
[2025-11-23 06:47:30 TP1] Load weight end. type=MiniMaxM2ForCausalLM, dtype=torch.float16, avail mem=30.87 GB, mem usage=107.38 GB.
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:08<00:00,  1.61it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:08<00:00,  1.98it/s]
[2025-11-23 06:47:30 TP0] Load weight end. type=MiniMaxM2ForCausalLM, dtype=torch.float16, avail mem=25.63 GB, mem usage=107.38 GB.
[2025-11-23 06:47:30 TP0] Using KV cache dtype: torch.float16
[2025-11-23 06:47:30 TP0] KV Cache is allocated. #tokens: 47940, K size: 2.83 GB, V size: 2.83 GB
[2025-11-23 06:47:30 TP0] Memory pool end. avail mem=18.31 GB
[2025-11-23 06:47:30 TP1] KV Cache is allocated. #tokens: 47940, K size: 2.83 GB, V size: 2.83 GB
[2025-11-23 06:47:30 TP1] Memory pool end. avail mem=23.55 GB
[2025-11-23 06:47:30 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=17.74 GB
[2025-11-23 06:47:30 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2025-11-23 06:47:30 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=22.98 GB
  0%|          | 0/36 [00:00<?, ?it/s]
Capturing batches (bs=256 avail_mem=17.31 GB):   0%|          | 0/36 [00:00<?, ?it/s][2025-11-23 06:47:34 TP1] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:34 TP0] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:39 TP1] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=3072,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:39 TP0] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=3072,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:42 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:42 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:43 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:43 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (bs=256 avail_mem=17.31 GB):   3%|▎         | 1/36 [00:15<08:59, 15.42s/it]
Capturing batches (bs=248 avail_mem=16.94 GB):   3%|▎         | 1/36 [00:15<08:59, 15.42s/it]
Capturing batches (bs=248 avail_mem=16.94 GB):   6%|▌         | 2/36 [00:21<05:34,  9.83s/it]
Capturing batches (bs=240 avail_mem=16.94 GB):   6%|▌         | 2/36 [00:21<05:34,  9.83s/it]
Capturing batches (bs=240 avail_mem=16.94 GB):   8%|▊         | 3/36 [00:21<03:05,  5.62s/it]
Capturing batches (bs=232 avail_mem=16.93 GB):   8%|▊         | 3/36 [00:21<03:05,  5.62s/it]
Capturing batches (bs=232 avail_mem=16.93 GB):  11%|█         | 4/36 [00:22<01:59,  3.74s/it]
Capturing batches (bs=224 avail_mem=16.92 GB):  11%|█         | 4/36 [00:22<01:59,  3.74s/it]
Capturing batches (bs=224 avail_mem=16.92 GB):  14%|█▍        | 5/36 [00:23<01:22,  2.66s/it]
Capturing batches (bs=216 avail_mem=16.91 GB):  14%|█▍        | 5/36 [00:23<01:22,  2.66s/it]
Capturing batches (bs=216 avail_mem=16.91 GB):  17%|█▋        | 6/36 [00:24<00:58,  1.97s/it]
Capturing batches (bs=208 avail_mem=16.89 GB):  17%|█▋        | 6/36 [00:24<00:58,  1.97s/it]
Capturing batches (bs=208 avail_mem=16.89 GB):  19%|█▉        | 7/36 [00:25<00:47,  1.64s/it]
Capturing batches (bs=200 avail_mem=16.88 GB):  19%|█▉        | 7/36 [00:25<00:47,  1.64s/it]
Capturing batches (bs=200 avail_mem=16.88 GB):  22%|██▏       | 8/36 [00:26<00:40,  1.43s/it]
Capturing batches (bs=192 avail_mem=16.85 GB):  22%|██▏       | 8/36 [00:26<00:40,  1.43s/it]
Capturing batches (bs=192 avail_mem=16.85 GB):  25%|██▌       | 9/36 [00:26<00:32,  1.21s/it]
Capturing batches (bs=184 avail_mem=16.84 GB):  25%|██▌       | 9/36 [00:26<00:32,  1.21s/it]
Capturing batches (bs=184 avail_mem=16.84 GB):  28%|██▊       | 10/36 [00:27<00:28,  1.09s/it]
Capturing batches (bs=176 avail_mem=16.82 GB):  28%|██▊       | 10/36 [00:27<00:28,  1.09s/it]
Capturing batches (bs=176 avail_mem=16.82 GB):  31%|███       | 11/36 [00:28<00:23,  1.06it/s]
Capturing batches (bs=168 avail_mem=16.81 GB):  31%|███       | 11/36 [00:28<00:23,  1.06it/s]
Capturing batches (bs=168 avail_mem=16.81 GB):  33%|███▎      | 12/36 [00:29<00:21,  1.12it/s]
Capturing batches (bs=160 avail_mem=16.78 GB):  33%|███▎      | 12/36 [00:29<00:21,  1.12it/s]
Capturing batches (bs=160 avail_mem=16.78 GB):  36%|███▌      | 13/36 [00:30<00:21,  1.08it/s]
Capturing batches (bs=152 avail_mem=16.77 GB):  36%|███▌      | 13/36 [00:30<00:21,  1.08it/s]
Capturing batches (bs=152 avail_mem=16.77 GB):  39%|███▉      | 14/36 [00:31<00:22,  1.00s/it]
Capturing batches (bs=144 avail_mem=16.75 GB):  39%|███▉      | 14/36 [00:31<00:22,  1.00s/it]
Capturing batches (bs=144 avail_mem=16.75 GB):  42%|████▏     | 15/36 [00:32<00:20,  1.05it/s]
Capturing batches (bs=136 avail_mem=16.74 GB):  42%|████▏     | 15/36 [00:32<00:20,  1.05it/s]
Capturing batches (bs=136 avail_mem=16.74 GB):  44%|████▍     | 16/36 [00:33<00:19,  1.00it/s]
Capturing batches (bs=128 avail_mem=16.71 GB):  44%|████▍     | 16/36 [00:33<00:19,  1.00it/s]
Capturing batches (bs=128 avail_mem=16.71 GB):  47%|████▋     | 17/36 [00:34<00:18,  1.04it/s]
Capturing batches (bs=120 avail_mem=16.70 GB):  47%|████▋     | 17/36 [00:34<00:18,  1.04it/s]
Capturing batches (bs=120 avail_mem=16.70 GB):  50%|█████     | 18/36 [00:34<00:16,  1.11it/s]
Capturing batches (bs=112 avail_mem=16.68 GB):  50%|█████     | 18/36 [00:34<00:16,  1.11it/s]
Capturing batches (bs=112 avail_mem=16.68 GB):  53%|█████▎    | 19/36 [00:35<00:13,  1.23it/s]
Capturing batches (bs=104 avail_mem=16.67 GB):  53%|█████▎    | 19/36 [00:35<00:13,  1.23it/s]
Capturing batches (bs=104 avail_mem=16.67 GB):  56%|█████▌    | 20/36 [00:36<00:12,  1.30it/s]
Capturing batches (bs=96 avail_mem=16.64 GB):  56%|█████▌    | 20/36 [00:36<00:12,  1.30it/s]
Capturing batches (bs=96 avail_mem=16.64 GB):  58%|█████▊    | 21/36 [00:36<00:11,  1.29it/s]
Capturing batches (bs=88 avail_mem=16.63 GB):  58%|█████▊    | 21/36 [00:36<00:11,  1.29it/s]
Capturing batches (bs=88 avail_mem=16.63 GB):  61%|██████    | 22/36 [00:37<00:10,  1.37it/s]
Capturing batches (bs=80 avail_mem=16.61 GB):  61%|██████    | 22/36 [00:37<00:10,  1.37it/s]
Capturing batches (bs=80 avail_mem=16.61 GB):  64%|██████▍   | 23/36 [00:38<00:09,  1.37it/s]
Capturing batches (bs=72 avail_mem=16.60 GB):  64%|██████▍   | 23/36 [00:38<00:09,  1.37it/s]
Capturing batches (bs=72 avail_mem=16.60 GB):  67%|██████▋   | 24/36 [00:39<00:09,  1.24it/s]
Capturing batches (bs=64 avail_mem=16.57 GB):  67%|██████▋   | 24/36 [00:39<00:09,  1.24it/s]
Capturing batches (bs=64 avail_mem=16.57 GB):  69%|██████▉   | 25/36 [00:40<00:09,  1.12it/s]
Capturing batches (bs=56 avail_mem=16.56 GB):  69%|██████▉   | 25/36 [00:40<00:09,  1.12it/s]
Capturing batches (bs=56 avail_mem=16.56 GB):  72%|███████▏  | 26/36 [00:40<00:07,  1.26it/s]
Capturing batches (bs=48 avail_mem=16.54 GB):  72%|███████▏  | 26/36 [00:40<00:07,  1.26it/s]
Capturing batches (bs=48 avail_mem=16.54 GB):  75%|███████▌  | 27/36 [00:41<00:06,  1.42it/s]
Capturing batches (bs=40 avail_mem=16.53 GB):  75%|███████▌  | 27/36 [00:41<00:06,  1.42it/s]
Capturing batches (bs=40 avail_mem=16.53 GB):  78%|███████▊  | 28/36 [00:41<00:05,  1.59it/s]
Capturing batches (bs=32 avail_mem=16.50 GB):  78%|███████▊  | 28/36 [00:41<00:05,  1.59it/s]
Capturing batches (bs=32 avail_mem=16.50 GB):  81%|████████  | 29/36 [00:44<00:08,  1.22s/it]
Capturing batches (bs=24 avail_mem=16.50 GB):  81%|████████  | 29/36 [00:44<00:08,  1.22s/it]
Capturing batches (bs=24 avail_mem=16.50 GB):  83%|████████▎ | 30/36 [00:46<00:09,  1.60s/it]
Capturing batches (bs=16 avail_mem=16.47 GB):  83%|████████▎ | 30/36 [00:46<00:09,  1.60s/it]
Capturing batches (bs=16 avail_mem=16.47 GB):  86%|████████▌ | 31/36 [00:47<00:06,  1.38s/it]
Capturing batches (bs=12 avail_mem=16.46 GB):  86%|████████▌ | 31/36 [00:47<00:06,  1.38s/it]
Capturing batches (bs=12 avail_mem=16.46 GB):  89%|████████▉ | 32/36 [00:48<00:04,  1.24s/it]
Capturing batches (bs=8 avail_mem=16.43 GB):  89%|████████▉ | 32/36 [00:48<00:04,  1.24s/it]
Capturing batches (bs=8 avail_mem=16.43 GB):  92%|█████████▏| 33/36 [00:49<00:03,  1.06s/it]
Capturing batches (bs=4 avail_mem=16.43 GB):  92%|█████████▏| 33/36 [00:49<00:03,  1.06s/it]
Capturing batches (bs=4 avail_mem=16.43 GB):  94%|█████████▍| 34/36 [00:50<00:01,  1.04it/s]
Capturing batches (bs=2 avail_mem=16.40 GB):  94%|█████████▍| 34/36 [00:50<00:01,  1.04it/s]
Capturing batches (bs=2 avail_mem=16.40 GB):  97%|█████████▋| 35/36 [00:53<00:01,  1.85s/it]
Capturing batches (bs=1 avail_mem=16.39 GB):  97%|█████████▋| 35/36 [00:53<00:01,  1.85s/it]
Capturing batches (bs=1 avail_mem=16.39 GB): 100%|██████████| 36/36 [01:04<00:00,  4.58s/it]
Capturing batches (bs=1 avail_mem=16.39 GB): 100%|██████████| 36/36 [01:04<00:00,  1.80s/it]
[2025-11-23 06:48:35 TP0] Registering 8716 cuda graph addresses
[2025-11-23 06:48:36 TP0] Capture cuda graph end. Time elapsed: 66.06 s. mem usage=1.38 GB. avail mem=16.37 GB.
[2025-11-23 06:48:36 TP1] Capture cuda graph end. Time elapsed: 65.87 s. mem usage=1.38 GB. avail mem=21.61 GB.
[2025-11-23 06:48:37 TP0] max_total_num_tokens=47940, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=196608, available_gpu_mem=16.37 GB
[2025-11-23 06:48:38] INFO:     Started server process [1]
[2025-11-23 06:48:38] INFO:     Waiting for application startup.
[2025-11-23 06:48:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}
[2025-11-23 06:48:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}
[2025-11-23 06:48:38] INFO:     Application startup complete.
[2025-11-23 06:48:38] INFO:     Uvicorn running on http://0.0.0.0:30000/ (Press CTRL+C to quit)
[2025-11-23 06:48:39] INFO:     127.0.0.1:47326 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-23 06:48:39 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-23 06:48:44] INFO:     127.0.0.1:47330 - "POST /generate HTTP/1.1" 200 OK
[2025-11-23 06:48:44] The server is fired up and ready to roll!
 
 
==========================
Test with curl
==========================
llm-vm:/opt/models/MiniMax-M2/1$ curl http://192.168.130.175:30001/v1/chat/completions     -H "Content-Type: application/json"     -d 'webseclocalllm@webseclocalllm-vm:/opt/models/MiniMax-M2/1$                                                         -H "Content-Type: application/json"
    "model": "/models/minimax-m2",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain Link-16 in simple terms."}
    ],
    "max_tokens": 256
  }'
{"id":"16af417af0c247d99b8c5caa91446ecc","object":"chat.completion","created":[1763879881](tel:1763879881),"model":"/models/minimax-m2","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":31,"total_tokens":287,"completion_tokens":256,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}webseclocalllm@webseclocalllm-vm:/opt/models/MiniMax-M2/1$ curl http://192.168.130.175:30001/get_model_info
{"model_path":"/models/minimax-m2","tokenizer_path":"/models/minimax-m2","is_generation":true,"preferred_sampling_params":null,"weight_version":"default","has_image_understanding":false,"has_audio_understanding":false}
 
 
 
{"model_path":"/models/minimax-m2","tokenizer_path":"/models/minimax-m2","is_generation":true,"preferred_sampling_params":null,"weight_version":"default","has_image_understanding":false,"has_audio_understanding":false}w


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Sglang configuration - chat returns NULLs #50

Basic Information - Models Used

Information about environment and deployment

Description

==========================
Compose.yaml

--attention-backend triton \

--triton-attention-num-kv-splits 16 \

--dtype auto \

--context-length 196608 \

--max-running-requests 32 \

--max-total-tokens 180000 \

--stream-interval 1 \

==========================
Test with curl

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Sglang configuration - chat returns NULLs #50

Description

Basic Information - Models Used

Information about environment and deployment

Description

========================== Compose.yaml

--attention-backend triton \

--triton-attention-num-kv-splits 16 \

--dtype auto \

--context-length 196608 \

--max-running-requests 32 \

--max-total-tokens 180000 \

--stream-interval 1 \

========================== Test with curl

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

==========================
Compose.yaml

==========================
Test with curl