Description
Basic Information - Models Used
MiniMax-M2
Information about environment and deployment
Docker image lmsysorg/sglang:latest, pulled 11/22/2025
Air-gapped
Host: Ubuntu 24.04
2 or 3 NVIDIA H200
MiniMax-M2
Description
Hello,
I hope you can provide some guidance. Last week I pulled a new SGLang Docker image and downloaded the MiniMax-M2 model from Hugging Face using "git clone". I have tried a couple of things that might be interesting to you, and I have some questions.
Putting the KV cache on a third card crashes SGLang. Not a huge deal; I just moved to 2 H200 cards.
Important - when I run the model with 2 cards, I cannot complete even the sample test from the model website. The results are below; it just returns a string of Unicode NULLs.
I have provided my Portainer Compose.yaml file, which I tried to make just like the sample.
I have provided the startup logging.
I have provided the curl chat-interface test.
Do you happen to have a sample YAML for using Continue.dev in VS Code? This is the next configuration item once the server runs; a rough sketch of what I am planning to try is included right after this list.
I was getting an error when trying to configure CPU offloading; maybe you can shed some light on that. The flag I was attempting is sketched after the Compose.yaml below.
Please help, as I am trying to get this running for offline use. It will mainly be used for coding against fairly large repos, so I am hoping it is as good as or better than Qwen3-Coder 480B, since it has a much smaller footprint and, in theory, runs under SGLang.
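For reference, this is roughly the Continue.dev config.yaml I am planning to try, pointed at the OpenAI-compatible endpoint exposed on port 30001 above. The field names follow Continue's generic "openai" provider as I understand it, so please correct me if the schema is off:

name: Local MiniMax-M2
version: 0.0.1
models:
  - name: MiniMax-M2 (SGLang)
    provider: openai                          # generic OpenAI-compatible provider
    model: /models/minimax-m2                 # matches served_model_name in the startup logs
    apiBase: http://192.168.130.175:30001/v1  # host port mapped to the container's 30000
    apiKey: none                              # placeholder; no API key is configured on the server
    roles:
      - chat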
==========================
Compose.yaml
version: "3.9"
services:
sglang-minimax-m2:
image: lmsysorg/sglang:latest
container_name: sglang-minimax-m2
restart: unless-stopped
shm_size: "32gb"
ulimits:
memlock: -1
stack: 67108864
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2
capabilities: [gpu]
environment:
NVIDIA_VISIBLE_DEVICES: "0,1"
NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
# Performance settings
TOKENIZERS_PARALLELISM: "false"
OMP_NUM_THREADS: "16"
ports:
- "30001:30000"
volumes:
- /opt/models/MiniMax-M2/1:/models/minimax-m2:ro
- /opt/sglang_cache:/root/.cache
- /opt/sglang_logs:/logs
entrypoint:
- bash
- -c
- |
echo "==================================================="
echo " Starting MiniMax-M2 (TP=2 on GPUs 0+1, 96K context)"
echo "==================================================="
        python3 -m sglang.launch_server \
          --model-path /models/minimax-m2 \
          --host 0.0.0.0 \
          --port 30000 \
          --tp-size 2 \
          --trust-remote-code \
          --mem-fraction-static 0.85 \
          --reasoning-parser minimax-append-think \
          --tool-call-parser minimax-m2 \
          --attention-backend triton \
          --triton-attention-num-kv-splits 16 \
          --dtype auto \
          --context-length 196608 \
          --max-running-requests 32 \
          --max-total-tokens 180000 \
          --stream-interval 1
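
For the CPU offloading question above, this is the shape of what I was trying (entrypoint fragment only). I am assuming the CLI flag is --cpu-offload-gb, mirroring the cpu_offload_gb entry that shows up in the server_args dump below; the 16 GB value is just a placeholder:

# Fragment of the same service; only the entrypoint changes.
# Flag name assumed from cpu_offload_gb in server_args; value is illustrative.
entrypoint:
  - bash
  - -c
  - |
    python3 -m sglang.launch_server \
      --model-path /models/minimax-m2 \
      --host 0.0.0.0 \
      --port 30000 \
      --tp-size 2 \
      --trust-remote-code \
      --mem-fraction-static 0.85 \
      --cpu-offload-gb 16

==========================
Startup logging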
===================================================
Starting MiniMax-M2 (TP=2 on GPUs 0+1)
===================================================
[2025-11-23 06:47:06] INFO model_config.py:884: Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:06] WARNING server_args.py:1197: Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-11-23 06:47:06] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
[2025-11-23 06:47:06] server_args=ServerArgs(model_path='/models/minimax-m2', tokenizer_path='/models/minimax-m2', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.85, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=188384565, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='/models/minimax-m2', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='minimax-append-think', tool_call_parser='minimax-m2', tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, 
speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, 
allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-11-23 06:47:06] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:07] Using default HuggingFace chat template with detected content format: openai
[2025-11-23 06:47:14] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
[2025-11-23 06:47:14] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
[2025-11-23 06:47:15 TP1] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:15] INFO trace.py:69: opentelemetry package is not installed, tracing disabled
[2025-11-23 06:47:15 TP0] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:15 TP1] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:15 TP1] Init torch distributed begin.
[2025-11-23 06:47:15 TP0] Downcasting torch.float32 to torch.float16.
[2025-11-23 06:47:15 TP0] Init torch distributed begin.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-23 06:47:16 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-23 06:47:17 TP1] Init torch distributed ends. mem usage=1.01 GB
[2025-11-23 06:47:17 TP0] Init torch distributed ends. mem usage=1.01 GB
[2025-11-23 06:47:21 TP0] Load weight begin. avail mem=133.01 GB
[2025-11-23 06:47:21 TP0] Detected fp8 checkpoint.
[2025-11-23 06:47:21 TP1] Load weight begin. avail mem=138.24 GB
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/17 [00:00<00:08, 1.85it/s]
Loading safetensors checkpoint shards: 12% Completed | 2/17 [00:01<00:08, 1.86it/s]
Loading safetensors checkpoint shards: 18% Completed | 3/17 [00:01<00:07, 1.93it/s]
Loading safetensors checkpoint shards: 24% Completed | 4/17 [00:02<00:07, 1.80it/s]
Loading safetensors checkpoint shards: 29% Completed | 5/17 [00:02<00:06, 1.85it/s]
Loading safetensors checkpoint shards: 35% Completed | 6/17 [00:03<00:05, 1.84it/s]
Loading safetensors checkpoint shards: 41% Completed | 7/17 [00:03<00:05, 1.97it/s]
Loading safetensors checkpoint shards: 47% Completed | 8/17 [00:04<00:04, 2.04it/s]
Loading safetensors checkpoint shards: 53% Completed | 9/17 [00:04<00:04, 1.84it/s]
Loading safetensors checkpoint shards: 59% Completed | 10/17 [00:05<00:03, 1.95it/s]
Loading safetensors checkpoint shards: 65% Completed | 11/17 [00:05<00:02, 2.18it/s]
Loading safetensors checkpoint shards: 71% Completed | 12/17 [00:05<00:02, 2.36it/s]
Loading safetensors checkpoint shards: 76% Completed | 13/17 [00:06<00:01, 2.38it/s]
Loading safetensors checkpoint shards: 82% Completed | 14/17 [00:06<00:01, 2.50it/s]
Loading safetensors checkpoint shards: 88% Completed | 15/17 [00:07<00:00, 2.61it/s]
Loading safetensors checkpoint shards: 94% Completed | 16/17 [00:07<00:00, 2.70it/s]
[2025-11-23 06:47:30 TP1] Load weight end. type=MiniMaxM2ForCausalLM, dtype=torch.float16, avail mem=30.87 GB, mem usage=107.38 GB.
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:08<00:00, 1.61it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:08<00:00, 1.98it/s]
[2025-11-23 06:47:30 TP0] Load weight end. type=MiniMaxM2ForCausalLM, dtype=torch.float16, avail mem=25.63 GB, mem usage=107.38 GB.
[2025-11-23 06:47:30 TP0] Using KV cache dtype: torch.float16
[2025-11-23 06:47:30 TP0] KV Cache is allocated. #tokens: 47940, K size: 2.83 GB, V size: 2.83 GB
[2025-11-23 06:47:30 TP0] Memory pool end. avail mem=18.31 GB
[2025-11-23 06:47:30 TP1] KV Cache is allocated. #tokens: 47940, K size: 2.83 GB, V size: 2.83 GB
[2025-11-23 06:47:30 TP1] Memory pool end. avail mem=23.55 GB
[2025-11-23 06:47:30 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=17.74 GB
[2025-11-23 06:47:30 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2025-11-23 06:47:30 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=22.98 GB
0%| | 0/36 [00:00<?, ?it/s]
Capturing batches (bs=256 avail_mem=17.31 GB): 0%| | 0/36 [00:00<?, ?it/s][2025-11-23 06:47:34 TP1] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:34 TP0] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:39 TP1] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=3072,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:39 TP0] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=3072,K=3072,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-11-23 06:47:42 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:42 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:43 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-11-23 06:47:43 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=256,N=768,device_name=NVIDIA_H200_NVL,dtype=fp8_w8a8,block_shape=[128, 128]_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (bs=256 avail_mem=17.31 GB): 3%|▎ | 1/36 [00:15<08:59, 15.42s/it]
Capturing batches (bs=248 avail_mem=16.94 GB): 3%|▎ | 1/36 [00:15<08:59, 15.42s/it]
Capturing batches (bs=248 avail_mem=16.94 GB): 6%|▌ | 2/36 [00:21<05:34, 9.83s/it]
Capturing batches (bs=240 avail_mem=16.94 GB): 6%|▌ | 2/36 [00:21<05:34, 9.83s/it]
Capturing batches (bs=240 avail_mem=16.94 GB): 8%|▊ | 3/36 [00:21<03:05, 5.62s/it]
Capturing batches (bs=232 avail_mem=16.93 GB): 8%|▊ | 3/36 [00:21<03:05, 5.62s/it]
Capturing batches (bs=232 avail_mem=16.93 GB): 11%|█ | 4/36 [00:22<01:59, 3.74s/it]
Capturing batches (bs=224 avail_mem=16.92 GB): 11%|█ | 4/36 [00:22<01:59, 3.74s/it]
Capturing batches (bs=224 avail_mem=16.92 GB): 14%|█▍ | 5/36 [00:23<01:22, 2.66s/it]
Capturing batches (bs=216 avail_mem=16.91 GB): 14%|█▍ | 5/36 [00:23<01:22, 2.66s/it]
Capturing batches (bs=216 avail_mem=16.91 GB): 17%|█▋ | 6/36 [00:24<00:58, 1.97s/it]
Capturing batches (bs=208 avail_mem=16.89 GB): 17%|█▋ | 6/36 [00:24<00:58, 1.97s/it]
Capturing batches (bs=208 avail_mem=16.89 GB): 19%|█▉ | 7/36 [00:25<00:47, 1.64s/it]
Capturing batches (bs=200 avail_mem=16.88 GB): 19%|█▉ | 7/36 [00:25<00:47, 1.64s/it]
Capturing batches (bs=200 avail_mem=16.88 GB): 22%|██▏ | 8/36 [00:26<00:40, 1.43s/it]
Capturing batches (bs=192 avail_mem=16.85 GB): 22%|██▏ | 8/36 [00:26<00:40, 1.43s/it]
Capturing batches (bs=192 avail_mem=16.85 GB): 25%|██▌ | 9/36 [00:26<00:32, 1.21s/it]
Capturing batches (bs=184 avail_mem=16.84 GB): 25%|██▌ | 9/36 [00:26<00:32, 1.21s/it]
Capturing batches (bs=184 avail_mem=16.84 GB): 28%|██▊ | 10/36 [00:27<00:28, 1.09s/it]
Capturing batches (bs=176 avail_mem=16.82 GB): 28%|██▊ | 10/36 [00:27<00:28, 1.09s/it]
Capturing batches (bs=176 avail_mem=16.82 GB): 31%|███ | 11/36 [00:28<00:23, 1.06it/s]
Capturing batches (bs=168 avail_mem=16.81 GB): 31%|███ | 11/36 [00:28<00:23, 1.06it/s]
Capturing batches (bs=168 avail_mem=16.81 GB): 33%|███▎ | 12/36 [00:29<00:21, 1.12it/s]
Capturing batches (bs=160 avail_mem=16.78 GB): 33%|███▎ | 12/36 [00:29<00:21, 1.12it/s]
Capturing batches (bs=160 avail_mem=16.78 GB): 36%|███▌ | 13/36 [00:30<00:21, 1.08it/s]
Capturing batches (bs=152 avail_mem=16.77 GB): 36%|███▌ | 13/36 [00:30<00:21, 1.08it/s]
Capturing batches (bs=152 avail_mem=16.77 GB): 39%|███▉ | 14/36 [00:31<00:22, 1.00s/it]
Capturing batches (bs=144 avail_mem=16.75 GB): 39%|███▉ | 14/36 [00:31<00:22, 1.00s/it]
Capturing batches (bs=144 avail_mem=16.75 GB): 42%|████▏ | 15/36 [00:32<00:20, 1.05it/s]
Capturing batches (bs=136 avail_mem=16.74 GB): 42%|████▏ | 15/36 [00:32<00:20, 1.05it/s]
Capturing batches (bs=136 avail_mem=16.74 GB): 44%|████▍ | 16/36 [00:33<00:19, 1.00it/s]
Capturing batches (bs=128 avail_mem=16.71 GB): 44%|████▍ | 16/36 [00:33<00:19, 1.00it/s]
Capturing batches (bs=128 avail_mem=16.71 GB): 47%|████▋ | 17/36 [00:34<00:18, 1.04it/s]
Capturing batches (bs=120 avail_mem=16.70 GB): 47%|████▋ | 17/36 [00:34<00:18, 1.04it/s]
Capturing batches (bs=120 avail_mem=16.70 GB): 50%|█████ | 18/36 [00:34<00:16, 1.11it/s]
Capturing batches (bs=112 avail_mem=16.68 GB): 50%|█████ | 18/36 [00:34<00:16, 1.11it/s]
Capturing batches (bs=112 avail_mem=16.68 GB): 53%|█████▎ | 19/36 [00:35<00:13, 1.23it/s]
Capturing batches (bs=104 avail_mem=16.67 GB): 53%|█████▎ | 19/36 [00:35<00:13, 1.23it/s]
Capturing batches (bs=104 avail_mem=16.67 GB): 56%|█████▌ | 20/36 [00:36<00:12, 1.30it/s]
Capturing batches (bs=96 avail_mem=16.64 GB): 56%|█████▌ | 20/36 [00:36<00:12, 1.30it/s]
Capturing batches (bs=96 avail_mem=16.64 GB): 58%|█████▊ | 21/36 [00:36<00:11, 1.29it/s]
Capturing batches (bs=88 avail_mem=16.63 GB): 58%|█████▊ | 21/36 [00:36<00:11, 1.29it/s]
Capturing batches (bs=88 avail_mem=16.63 GB): 61%|██████ | 22/36 [00:37<00:10, 1.37it/s]
Capturing batches (bs=80 avail_mem=16.61 GB): 61%|██████ | 22/36 [00:37<00:10, 1.37it/s]
Capturing batches (bs=80 avail_mem=16.61 GB): 64%|██████▍ | 23/36 [00:38<00:09, 1.37it/s]
Capturing batches (bs=72 avail_mem=16.60 GB): 64%|██████▍ | 23/36 [00:38<00:09, 1.37it/s]
Capturing batches (bs=72 avail_mem=16.60 GB): 67%|██████▋ | 24/36 [00:39<00:09, 1.24it/s]
Capturing batches (bs=64 avail_mem=16.57 GB): 67%|██████▋ | 24/36 [00:39<00:09, 1.24it/s]
Capturing batches (bs=64 avail_mem=16.57 GB): 69%|██████▉ | 25/36 [00:40<00:09, 1.12it/s]
Capturing batches (bs=56 avail_mem=16.56 GB): 69%|██████▉ | 25/36 [00:40<00:09, 1.12it/s]
Capturing batches (bs=56 avail_mem=16.56 GB): 72%|███████▏ | 26/36 [00:40<00:07, 1.26it/s]
Capturing batches (bs=48 avail_mem=16.54 GB): 72%|███████▏ | 26/36 [00:40<00:07, 1.26it/s]
Capturing batches (bs=48 avail_mem=16.54 GB): 75%|███████▌ | 27/36 [00:41<00:06, 1.42it/s]
Capturing batches (bs=40 avail_mem=16.53 GB): 75%|███████▌ | 27/36 [00:41<00:06, 1.42it/s]
Capturing batches (bs=40 avail_mem=16.53 GB): 78%|███████▊ | 28/36 [00:41<00:05, 1.59it/s]
Capturing batches (bs=32 avail_mem=16.50 GB): 78%|███████▊ | 28/36 [00:41<00:05, 1.59it/s]
Capturing batches (bs=32 avail_mem=16.50 GB): 81%|████████ | 29/36 [00:44<00:08, 1.22s/it]
Capturing batches (bs=24 avail_mem=16.50 GB): 81%|████████ | 29/36 [00:44<00:08, 1.22s/it]
Capturing batches (bs=24 avail_mem=16.50 GB): 83%|████████▎ | 30/36 [00:46<00:09, 1.60s/it]
Capturing batches (bs=16 avail_mem=16.47 GB): 83%|████████▎ | 30/36 [00:46<00:09, 1.60s/it]
Capturing batches (bs=16 avail_mem=16.47 GB): 86%|████████▌ | 31/36 [00:47<00:06, 1.38s/it]
Capturing batches (bs=12 avail_mem=16.46 GB): 86%|████████▌ | 31/36 [00:47<00:06, 1.38s/it]
Capturing batches (bs=12 avail_mem=16.46 GB): 89%|████████▉ | 32/36 [00:48<00:04, 1.24s/it]
Capturing batches (bs=8 avail_mem=16.43 GB): 89%|████████▉ | 32/36 [00:48<00:04, 1.24s/it]
Capturing batches (bs=8 avail_mem=16.43 GB): 92%|█████████▏| 33/36 [00:49<00:03, 1.06s/it]
Capturing batches (bs=4 avail_mem=16.43 GB): 92%|█████████▏| 33/36 [00:49<00:03, 1.06s/it]
Capturing batches (bs=4 avail_mem=16.43 GB): 94%|█████████▍| 34/36 [00:50<00:01, 1.04it/s]
Capturing batches (bs=2 avail_mem=16.40 GB): 94%|█████████▍| 34/36 [00:50<00:01, 1.04it/s]
Capturing batches (bs=2 avail_mem=16.40 GB): 97%|█████████▋| 35/36 [00:53<00:01, 1.85s/it]
Capturing batches (bs=1 avail_mem=16.39 GB): 97%|█████████▋| 35/36 [00:53<00:01, 1.85s/it]
Capturing batches (bs=1 avail_mem=16.39 GB): 100%|██████████| 36/36 [01:04<00:00, 4.58s/it]
Capturing batches (bs=1 avail_mem=16.39 GB): 100%|██████████| 36/36 [01:04<00:00, 1.80s/it]
[2025-11-23 06:48:35 TP0] Registering 8716 cuda graph addresses
[2025-11-23 06:48:36 TP0] Capture cuda graph end. Time elapsed: 66.06 s. mem usage=1.38 GB. avail mem=16.37 GB.
[2025-11-23 06:48:36 TP1] Capture cuda graph end. Time elapsed: 65.87 s. mem usage=1.38 GB. avail mem=21.61 GB.
[2025-11-23 06:48:37 TP0] max_total_num_tokens=47940, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=196608, available_gpu_mem=16.37 GB
[2025-11-23 06:48:38] INFO: Started server process [1]
[2025-11-23 06:48:38] INFO: Waiting for application startup.
[2025-11-23 06:48:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}
[2025-11-23 06:48:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}
[2025-11-23 06:48:38] INFO: Application startup complete.
[2025-11-23 06:48:38] INFO: Uvicorn running on http://0.0.0.0:30000/ (Press CTRL+C to quit)
[2025-11-23 06:48:39] INFO: 127.0.0.1:47326 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-23 06:48:39 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-23 06:48:44] INFO: 127.0.0.1:47330 - "POST /generate HTTP/1.1" 200 OK
[2025-11-23 06:48:44] The server is fired up and ready to roll!
==========================
Test with curl
webseclocalllm@webseclocalllm-vm:/opt/models/MiniMax-M2/1$ curl http://192.168.130.175:30001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/minimax-m2",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain Link-16 in simple terms."}
    ],
    "max_tokens": 256
  }'
{"id":"16af417af0c247d99b8c5caa91446ecc","object":"chat.completion","created":1763879881,"model":"/models/minimax-m2","choices":[{"index":0,"message":{"role":"assistant","content":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":31,"total_tokens":287,"completion_tokens":256,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}webseclocalllm@webseclocalllm-vm:/opt/models/MiniMax-M2/1$ curl http://192.168.130.175:30001/get_model_info
{"model_path":"/models/minimax-m2","tokenizer_path":"/models/minimax-m2","is_generation":true,"preferred_sampling_params":null,"weight_version":"default","has_image_understanding":false,"has_audio_understanding":false}
{"model_path":"/models/minimax-m2","tokenizer_path":"/models/minimax-m2","is_generation":true,"preferred_sampling_params":null,"weight_version":"default","has_image_understanding":false,"has_audio_understanding":false}w