Sglang configuration - chat returns NULLs #50

@websterdav


Basic Information - Models Used

MiniMax-M2

Information about environment and deployment

Docker image lmsysorg/sglang from 11/22/2025
Air-gapped
Host: Ubuntu 24.04
2 or 3 NVIDIA H200
MiniMax-M2

Description

Hello,

I hope you can provide guidance. Last week I downloaded a new sglang and the MiniMax-M2 model from Hugging Face using “git clone”. I have tried a couple of things that might interest you, and I have some questions.
- Putting the KV cache on a third card crashes sglang. Not a huge deal; I just moved to 2 H200 cards.
- Important: when I run the model on 2 cards, I am not able to complete even the sample test from the model website. The results are below; the reply is just a string of Unicode NULLs.
- I have provided my Portainer Compose.yaml file, which I tried to make just like the sample.
- I have provided the startup logging.
- I have provided the curl chat-interface test.
- Do you happen to have a sample YAML for using Continue.dev in VS Code? That is the next configuration item once this runs (a guess at what I am after is sketched after the next paragraph).
- I was getting an error when trying to configure CPU offloading. Maybe you can shed some light on that (see the question right after this list).
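On the CPU offloading point: is the supported way simply to give the launch command an offload budget? The flag name here is my guess, inferred from the cpu_offload_gb field visible in the server args logged below, and the value is arbitrary:

python3 -m sglang.launch_server \
  --model-path /models/minimax-m2 \
  --tp-size 2 \
  --trust-remote-code \
  --cpu-offload-gb 16 \
  ...(remaining flags as in the Compose file below)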

Please help, as I am trying to get this running for offline use. It will mainly be used for coding against fairly large repos, so I am hoping it is as good as or better than Qwen3-Coder 480B, since it has a much smaller footprint and, in theory, runs well under sglang.
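On the Continue.dev question, this is roughly what I am aiming for, pieced together from the Continue docs; please treat the file location, field names, and the openai provider choice as my assumptions rather than a verified config:

# ~/.continue/config.yaml (assumed location and schema)
name: Local MiniMax-M2
version: 0.0.1
models:
  - name: MiniMax-M2 (sglang)
    provider: openai                 # OpenAI-compatible endpoint served by sglang
    model: /models/minimax-m2        # must match the served model name
    apiBase: http://192.168.130.175:30001/v1
    apiKey: none
    roles:
      - chat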

==========================
Compose.yaml

version: "3.9"

services:
  sglang-minimax-m2:
    image: lmsysorg/sglang:latest
    container_name: sglang-minimax-m2
    restart: unless-stopped

    shm_size: "32gb"

    ulimits:
      memlock: -1
      stack: 67108864

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

    environment:
      NVIDIA_VISIBLE_DEVICES: "0,1"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"

      # Performance settings
      TOKENIZERS_PARALLELISM: "false"
      OMP_NUM_THREADS: "16"

    ports:
      - "30001:30000"

    volumes:
      - /opt/models/MiniMax-M2/1:/models/minimax-m2:ro
      - /opt/sglang_cache:/root/.cache
      - /opt/sglang_logs:/logs

    entrypoint:
      - bash
      - -c
      - |
        echo "==================================================="
        echo " Starting MiniMax-M2 (TP=2 on GPUs 0+1, 96K context)"
        echo "==================================================="

        python3 -m sglang.launch_server \
          --model-path /models/minimax-m2 \
          --host 0.0.0.0 \
          --port 30000 \
          --tp-size 2 \
          --trust-remote-code \
          --mem-fraction-static 0.85 \
          --reasoning-parser minimax-append-think \
          --tool-call-parser minimax-m2

--attention-backend triton \
--triton-attention-num-kv-splits 16 \
--dtype auto \
--context-length 196608 \
--max-running-requests 32 \
--max-total-tokens 180000 \
--stream-interval 1 \
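These trailing flags were not picked up by the command that actually ran: the server args in the log below show attention_backend='flashinfer', context_length=None, max_total_tokens=None, and triton_attention_num_kv_splits=8. If it helps, the effective settings can be double-checked against the running server (assuming this sglang build exposes the /get_server_info endpoint):

curl http://192.168.130.175:30001/get_server_info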

==========================
Startup logging

===================================================
 Starting MiniMax-M2 (TP=2 on GPUs 0+1)
===================================================

[2025-11-23 06:47:06] INFO model_config.py:884: Downcasting torch.float32 to torch.float16.

[2025-11-23 06:47:06] WARNING server_args.py:1197: Attention backend not explicitly specified. Use flashinfer backend by default.

[2025-11-23 06:47:06] INFO trace.py:69: opentelemetry package is not installed, tracing disabled

[2025-11-23 06:47:06] server_args=ServerArgs(model_path='/models/minimax-m2', tokenizer_path='/models/minimax-m2', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.85, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=188384565, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='/models/minimax-m2', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='minimax-append-think', tool_call_parser='minimax-m2', tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, 
speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, 
allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)

[2025-11-23 06:47:06] Downcasting torch.float32 to torch.float16.

[2025-11-23 06:47:07] Using default HuggingFace chat template with detected content format: openai

[2025-11-23 06:47:14] INFO trace.py:69: opentelemetry package is not installed, tracing disabled

[2025-11-23 06:47:14] INFO trace.py:69: opentelemetry package is not installed, tracing disabled

[2025-11-23 06:47:15 TP1] Downcasting torch.float32 to torch.float16.

[2025-11-23 06:47:15] INFO trace.py:69: opentelemetry package is not installed, tracing disabled

[2025-11-23 06:47:15 TP0] Downcasting torch.float32 to torch.float16.

[2025-11-23 06:47:15 TP1] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:15 TP1] Init torch distributed begin.
[2025-11-23 06:47:15 TP0] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:15 TP0] Init torch distributed begin.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-23 06:47:16 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-23 06:47:17 TP1] Init torch distributed ends. mem usage=1.01 GB
[2025-11-23 06:47:17 TP0] Init torch distributed ends. mem usage=1.01 GB
[2025-11-23 06:47:21 TP0] Load weight begin. avail mem=133.01 GB
[2025-11-23 06:47:21 TP0] Detected fp8 checkpoint.
[2025-11-23 06:47:21 TP1] Load weight begin. avail mem=138.24 GB
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/17 [00:00<00:08, 1.85it/s]
Loading safetensors checkpoint shards: 12% Completed | 2/17 [00:01<00:08, 1.86it/s]
Loading safetensors checkpoint shards: 18% Completed | 3/17 [00:01<00:07, 1.93it/s]
Loading safetensors checkpoint shards: 24% Completed | 4/17 [00:02<00:07, 1.80it/s]
Loading safetensors checkpoint shards: 29% Completed | 5/17 [00:02<00:06, 1.85it/s]
Loading safetensors checkpoint shards: 35% Completed | 6/17 [00:03<00:05, 1.84it/s]
Loading safetensors checkpoint shards: 41% Completed | 7/17 [00:03<00:05, 1.97it/s]
Loading safetensors checkpoint shards: 47% Completed | 8/17 [00:04<00:04, 2.04it/s]
Loading safetensors checkpoint shards: 53% Completed | 9/17 [00:04<00:04, 1.84it/s]
Loading safetensors checkpoint shards: 59% Completed | 10/17 [00:05<00:03, 1.95it/s]
Loading safetensors checkpoint shards: 65% Completed | 11/17 [00:05<00:02, 2.18it/s]
Loading safetensors checkpoint shards: 71% Completed | 12/17 [00:05<00:02, 2.36it/s]
Loading safetensors checkpoint shards: 76% Completed | 13/17 [00:06<00:01, 2.38it/s]
Loading safetensors checkpoint shards: 82% Completed | 14/17 [00:06<00:01, 2.50it/s]
Loading safetensors checkpoint shards: 88% Completed | 15/17 [00:07<00:00, 2.61it/s]
Loading safetensors checkpoint shards: 94% Completed | 16/17 [00:07<00:00, 2.70it/s]
[2025-11-23 06:47:30 TP1] Load weight end. type=MiniMaxM2ForCausalLM, dtype=torch.float16, avail mem=30.87 GB, mem usage=107.38 GB.
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:08<00:00, 1.61it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:08<00:00, 1.98it/s]
[2025-11-23 06:47:30 TP0] Load weight end. type=MiniMaxM2ForCausalLM, dtype=torch.float16, avail mem=25.63 GB, mem usage=107.38 GB.
[2025-11-23 06:47:30 TP0] Using KV cache dtype: torch.float16
[2025-11-23 06:47:30 TP0] KV Cache is allocated. #tokens: 47940, K size: 2.83 GB, V size: 2.83 GB
[2025-11-23 06:47:30 TP0] Memory pool end. avail mem=18.31 GB
[2025-11-23 06:47:30 TP1] KV Cache is allocated. #tokens: 47940, K size: 2.83 GB, V size: 2.83 GB
[2025-11-23 06:47:30 TP1] Memory pool end. avail mem=23.55 GB
[2025-11-23 06:47:30 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=17.74 GB
[2025-11-23 06:47:30 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2025-11-23 06:47:30 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=22.98 GB
0%| | 0/36 [00:00<?, ?it/s]
Capturing batches (bs=256 avail_mem=17.31 GB): 0%| | 0/36 [00:00<?, ?it/s][2025-11-23 06:47:34 TP1] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:34 TP0] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:39 TP1] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=3072,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:39 TP0] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=3072,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:42 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:42 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:43 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:43 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (bs=256 avail_mem=17.31 GB): 3%|▎ | 1/36 [00:15<08:59, 15.42s/it]
Capturing batches (bs=248 avail_mem=16.94 GB): 3%|▎ | 1/36 [00:15<08:59, 15.42s/it]
Capturing batches (bs=248 avail_mem=16.94 GB): 6%|▌ | 2/36 [00:21<05:34, 9.83s/it]
Capturing batches (bs=240 avail_mem=16.94 GB): 6%|▌ | 2/36 [00:21<05:34, 9.83s/it]
Capturing batches (bs=240 avail_mem=16.94 GB): 8%|▊ | 3/36 [00:21<03:05, 5.62s/it]
Capturing batches (bs=232 avail_mem=16.93 GB): 8%|▊ | 3/36 [00:21<03:05, 5.62s/it]
Capturing batches (bs=232 avail_mem=16.93 GB): 11%|█ | 4/36 [00:22<01:59, 3.74s/it]
Capturing batches (bs=224 avail_mem=16.92 GB): 11%|█ | 4/36 [00:22<01:59, 3.74s/it]
Capturing batches (bs=224 avail_mem=16.92 GB): 14%|█▍ | 5/36 [00:23<01:22, 2.66s/it]
Capturing batches (bs=216 avail_mem=16.91 GB): 14%|█▍ | 5/36 [00:23<01:22, 2.66s/it]
Capturing batches (bs=216 avail_mem=16.91 GB): 17%|█▋ | 6/36 [00:24<00:58, 1.97s/it]
Capturing batches (bs=208 avail_mem=16.89 GB): 17%|█▋ | 6/36 [00:24<00:58, 1.97s/it]
Capturing batches (bs=208 avail_mem=16.89 GB): 19%|█▉ | 7/36 [00:25<00:47, 1.64s/it]
Capturing batches (bs=200 avail_mem=16.88 GB): 19%|█▉ | 7/36 [00:25<00:47, 1.64s/it]
Capturing batches (bs=200 avail_mem=16.88 GB): 22%|██▏ | 8/36 [00:26<00:40, 1.43s/it]
Capturing batches (bs=192 avail_mem=16.85 GB): 22%|██▏ | 8/36 [00:26<00:40, 1.43s/it]
Capturing batches (bs=192 avail_mem=16.85 GB): 25%|██▌ | 9/36 [00:26<00:32, 1.21s/it]
Capturing batches (bs=184 avail_mem=16.84 GB): 25%|██▌ | 9/36 [00:26<00:32, 1.21s/it]
Capturing batches (bs=184 avail_mem=16.84 GB): 28%|██▊ | 10/36 [00:27<00:28, 1.09s/it]
Capturing batches (bs=176 avail_mem=16.82 GB): 28%|██▊ | 10/36 [00:27<00:28, 1.09s/it]
Capturing batches (bs=176 avail_mem=16.82 GB): 31%|███ | 11/36 [00:28<00:23, 1.06it/s]
Capturing batches (bs=168 avail_mem=16.81 GB): 31%|███ | 11/36 [00:28<00:23, 1.06it/s]
Capturing batches (bs=168 avail_mem=16.81 GB): 33%|███▎ | 12/36 [00:29<00:21, 1.12it/s]
Capturing batches (bs=160 avail_mem=16.78 GB): 33%|███▎ | 12/36 [00:29<00:21, 1.12it/s]
Capturing batches (bs=160 avail_mem=16.78 GB): 36%|███▌ | 13/36 [00:30<00:21, 1.08it/s]
Capturing batches (bs=152 avail_mem=16.77 GB): 36%|███▌ | 13/36 [00:30<00:21, 1.08it/s]
Capturing batches (bs=152 avail_mem=16.77 GB): 39%|███▉ | 14/36 [00:31<00:22, 1.00s/it]
Capturing batches (bs=144 avail_mem=16.75 GB): 39%|███▉ | 14/36 [00:31<00:22, 1.00s/it]
Capturing batches (bs=144 avail_mem=16.75 GB): 42%|████▏ | 15/36 [00:32<00:20, 1.05it/s]
Capturing batches (bs=136 avail_mem=16.74 GB): 42%|████▏ | 15/36 [00:32<00:20, 1.05it/s]
Capturing batches (bs=136 avail_mem=16.74 GB): 44%|████▍ | 16/36 [00:33<00:19, 1.00it/s]
Capturing batches (bs=128 avail_mem=16.71 GB): 44%|████▍ | 16/36 [00:33<00:19, 1.00it/s]
Capturing batches (bs=128 avail_mem=16.71 GB): 47%|████▋ | 17/36 [00:34<00:18, 1.04it/s]
Capturing batches (bs=120 avail_mem=16.70 GB): 47%|████▋ | 17/36 [00:34<00:18, 1.04it/s]
Capturing batches (bs=120 avail_mem=16.70 GB): 50%|█████ | 18/36 [00:34<00:16, 1.11it/s]
Capturing batches (bs=112 avail_mem=16.68 GB): 50%|█████ | 18/36 [00:34<00:16, 1.11it/s]
Capturing batches (bs=112 avail_mem=16.68 GB): 53%|█████▎ | 19/36 [00:35<00:13, 1.23it/s]
Capturing batches (bs=104 avail_mem=16.67 GB): 53%|█████▎ | 19/36 [00:35<00:13, 1.23it/s]
Capturing batches (bs=104 avail_mem=16.67 GB): 56%|█████▌ | 20/36 [00:36<00:12, 1.30it/s]
Capturing batches (bs=96 avail_mem=16.64 GB): 56%|█████▌ | 20/36 [00:36<00:12, 1.30it/s]
Capturing batches (bs=96 avail_mem=16.64 GB): 58%|█████▊ | 21/36 [00:36<00:11, 1.29it/s]
Capturing batches (bs=88 avail_mem=16.63 GB): 58%|█████▊ | 21/36 [00:36<00:11, 1.29it/s]
Capturing batches (bs=88 avail_mem=16.63 GB): 61%|██████ | 22/36 [00:37<00:10, 1.37it/s]
Capturing batches (bs=80 avail_mem=16.61 GB): 61%|██████ | 22/36 [00:37<00:10, 1.37it/s]
Capturing batches (bs=80 avail_mem=16.61 GB): 64%|██████▍ | 23/36 [00:38<00:09, 1.37it/s]
Capturing batches (bs=72 avail_mem=16.60 GB): 64%|██████▍ | 23/36 [00:38<00:09, 1.37it/s]
Capturing batches (bs=72 avail_mem=16.60 GB): 67%|██████▋ | 24/36 [00:39<00:09, 1.24it/s]
Capturing batches (bs=64 avail_mem=16.57 GB): 67%|██████▋ | 24/36 [00:39<00:09, 1.24it/s]
Capturing batches (bs=64 avail_mem=16.57 GB): 69%|██████▉ | 25/36 [00:40<00:09, 1.12it/s]
Capturing batches (bs=56 avail_mem=16.56 GB): 69%|██████▉ | 25/36 [00:40<00:09, 1.12it/s]
Capturing batches (bs=56 avail_mem=16.56 GB): 72%|███████▏ | 26/36 [00:40<00:07, 1.26it/s]
Capturing batches (bs=48 avail_mem=16.54 GB): 72%|███████▏ | 26/36 [00:40<00:07, 1.26it/s]
Capturing batches (bs=48 avail_mem=16.54 GB): 75%|███████▌ | 27/36 [00:41<00:06, 1.42it/s]
Capturing batches (bs=40 avail_mem=16.53 GB): 75%|███████▌ | 27/36 [00:41<00:06, 1.42it/s]
Capturing batches (bs=40 avail_mem=16.53 GB): 78%|███████▊ | 28/36 [00:41<00:05, 1.59it/s]
Capturing batches (bs=32 avail_mem=16.50 GB): 78%|███████▊ | 28/36 [00:41<00:05, 1.59it/s]
Capturing batches (bs=32 avail_mem=16.50 GB): 81%|████████ | 29/36 [00:44<00:08, 1.22s/it]
Capturing batches (bs=24 avail_mem=16.50 GB): 81%|████████ | 29/36 [00:44<00:08, 1.22s/it]
Capturing batches (bs=24 avail_mem=16.50 GB): 83%|████████▎ | 30/36 [00:46<00:09, 1.60s/it]
Capturing batches (bs=16 avail_mem=16.47 GB): 83%|████████▎ | 30/36 [00:46<00:09, 1.60s/it]
Capturing batches (bs=16 avail_mem=16.47 GB): 86%|████████▌ | 31/36 [00:47<00:06, 1.38s/it]
Capturing batches (bs=12 avail_mem=16.46 GB): 86%|████████▌ | 31/36 [00:47<00:06, 1.38s/it]
Capturing batches (bs=12 avail_mem=16.46 GB): 89%|████████▉ | 32/36 [00:48<00:04, 1.24s/it]
Capturing batches (bs=8 avail_mem=16.43 GB): 89%|████████▉ | 32/36 [00:48<00:04, 1.24s/it]
Capturing batches (bs=8 avail_mem=16.43 GB): 92%|█████████▏| 33/36 [00:49<00:03, 1.06s/it]
Capturing batches (bs=4 avail_mem=16.43 GB): 92%|█████████▏| 33/36 [00:49<00:03, 1.06s/it]
Capturing batches (bs=4 avail_mem=16.43 GB): 94%|█████████▍| 34/36 [00:50<00:01, 1.04it/s]
Capturing batches (bs=2 avail_mem=16.40 GB): 94%|█████████▍| 34/36 [00:50<00:01, 1.04it/s]
Capturing batches (bs=2 avail_mem=16.40 GB): 97%|█████████▋| 35/36 [00:53<00:01, 1.85s/it]
Capturing batches (bs=1 avail_mem=16.39 GB): 97%|█████████▋| 35/36 [00:53<00:01, 1.85s/it]
Capturing batches (bs=1 avail_mem=16.39 GB): 100%|██████████| 36/36 [01:04<00:00, 4.58s/it]
Capturing batches (bs=1 avail_mem=16.39 GB): 100%|██████████| 36/36 [01:04<00:00, 1.80s/it]
[2025-11-23 06:48:35 TP0] Registering 8716 cuda graph addresses
[2025-11-23 06:48:36 TP0] Capture cuda graph end. Time elapsed: 66.06 s. mem usage=1.38 GB. avail mem=16.37 GB.
[2025-11-23 06:48:36 TP1] Capture cuda graph end. Time elapsed: 65.87 s. mem usage=1.38 GB. avail mem=21.61 GB.
[2025-11-23 06:48:37 TP0] max_total_num_tokens=47940, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=196608, available_gpu_mem=16.37 GB
[2025-11-23 06:48:38] INFO: Started server process [1]
[2025-11-23 06:48:38] INFO: Waiting for application startup.
[2025-11-23 06:48:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}
[2025-11-23 06:48:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}
[2025-11-23 06:48:38] INFO: Application startup complete.
[2025-11-23 06:48:38] INFO: Uvicorn running on http://0.0.0.0:30000/ (Press CTRL+C to quit)
[2025-11-23 06:48:39] INFO: 127.0.0.1:47326 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-23 06:48:39 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-23 06:48:44] INFO: 127.0.0.1:47330 - "POST /generate HTTP/1.1" 200 OK
[2025-11-23 06:48:44] The server is fired up and ready to roll!

==========================
Test with curl

llm-vm:/opt/models/MiniMax-M2/1$ curl http://192.168.130.175:30001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/minimax-m2",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain Link-16 in simple terms."}
    ],
    "max_tokens": 256
  }'
{"id":"16af417af0c247d99b8c5caa91446ecc","object":"chat.completion","created":1763879881,"model":"/models/minimax-m2","choices":[{"index":0,"message":{"role":"assistant","content":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":31,"total_tokens":287,"completion_tokens":256,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}webseclocalllm@webseclocalllm-vm:/opt/models/MiniMax-M2/1$ curl http://192.168.130.175:30001/get_model_info
{"model_path":"/models/minimax-m2","tokenizer_path":"/models/minimax-m2","is_generation":true,"preferred_sampling_params":null,"weight_version":"default","has_image_understanding":false,"has_audio_understanding":false}

{"model_path":"/models/minimax-m2","tokenizer_path":"/models/minimax-m2","is_generation":true,"preferred_sampling_params":null,"weight_version":"default","has_image_understanding":false,"has_audio_understanding":false}w
