Merged
6 changes: 3 additions & 3 deletions docs/model_server_rest_api_chat.md
@@ -235,11 +235,12 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
|-------|----------|----------|----------|---------|-----|
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| min_p | ✅ | ❌ | ✅ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
| top_k | ✅ | ❌ | ✅ | int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
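
A minimal request exercising the new sampling parameters could look like the sketch below (host, port, and model name are placeholders for your deployment; the values are illustrative, not recommendations):

```bash
curl http://localhost/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about autumn."}],
    "temperature": 0.8,
    "top_k": 40,
    "min_p": 0.05,
    "seed": 42
  }'
```

Because `seed` is set explicitly, repeating this request should reproduce the same completion; omitting it falls back to a random, non-deterministic seed.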

#### Speculative decoding specific

@@ -275,7 +276,6 @@ If any of those parameters is not specified and request is made to Prompt Lookup
- functions

#### Unsupported params from vLLM:
- min_p
- use_beam_search (**In OpenVINO Model Server, simply increase the _best_of_ param to enable beam search**)
- early_stopping
- stop_token_ids
6 changes: 3 additions & 3 deletions docs/model_server_rest_api_completions.md
@@ -76,11 +76,12 @@ curl http://localhost/v3/completions \
|-------|----------|----------|----------|---------|-----|
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| min_p | ✅ | ❌ | ✅ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
| top_k | ✅ | ❌ | ✅ | int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
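
As a sketch of seed behavior on this endpoint (the model name is a placeholder and the output depends on the deployed model), sending the same request twice with a fixed `seed` should return identical samples, while dropping the `seed` field makes each run differ:

```bash
curl http://localhost/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "temperature": 0.9,
    "max_tokens": 20,
    "seed": 12345
  }'
```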

#### Speculative decoding specific

@@ -106,7 +107,6 @@ Note that below parameters are valid only for prompt lookup pipeline. Add `"prom


#### Unsupported params from vLLM:
- min_p
- use_beam_search (**In OpenVINO Model Server, simply increase the _best_of_ param to enable beam search**)
- early_stopping
- stop_token_ids
5 changes: 3 additions & 2 deletions docs/model_server_rest_api_responses.md
@@ -120,11 +120,12 @@ curl http://localhost/v3/responses \
|-------|----------|----------|---------|-----|
| temperature | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | ✅ | ❌ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
| min_p | ✅ | ❌ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
| top_k | ✅ | ❌ | int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
| repetition_penalty | ✅ | ❌ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| seed | ✅ | ❌ | integer (default: `0`) | Random seed to use for the generation. |
| seed | ✅ | ❌ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
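
A sketch for the Responses endpoint, disabling top-k truncation entirely with `-1` and relying on min-p filtering instead (the model name and input are placeholders; the request shape follows the example earlier in this file):

```bash
curl http://localhost/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "input": "Summarize min-p sampling in one sentence.",
    "temperature": 1.0,
    "top_k": -1,
    "min_p": 0.05
  }'
```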

#### Speculative decoding specific

41 changes: 34 additions & 7 deletions src/llm/apis/openai_api_handler.cpp
@@ -740,21 +740,48 @@ absl::Status OpenAIApiHandler::parseCommonPart(std::optional<uint32_t> maxTokens
return absl::InvalidArgumentError("top_p out of range(0.0, 1.0)");
}

// top_k: int; optional - defaults to 0
// Extension, unsupported by OpenAI API, however supported by vLLM and CB lib
// min_p: float; optional - defaults to 0 (disabled)
// Extension, unsupported by OpenAI API, however supported by vLLM and GenAI
it = doc.FindMember("min_p");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
if (!it->value.IsDouble() && !it->value.IsInt())
return absl::InvalidArgumentError("min_p is not a valid number");
const float minPValue = static_cast<float>(it->value.GetDouble());
if (minPValue < 0.0f || minPValue >= 1.0f)
return absl::InvalidArgumentError("min_p out of range [0.0, 1.0)");
request.minP = minPValue;
}

// top_k: int; optional - when multinomial sampling is active, defaults to 40 if not set. Pass -1 to consider all tokens.
// Extension, unsupported by OpenAI API, however supported by vLLM and GenAI
it = doc.FindMember("top_k");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
if (!it->value.IsInt())
return absl::InvalidArgumentError("top_k is not an integer");
request.topK = it->value.GetInt();
const int topKValue = it->value.GetInt();
if (topKValue < -1 || topKValue == 0)
return absl::InvalidArgumentError("top_k must be -1 (all tokens) or a positive integer");
request.topK = topKValue;
}

// seed: int; optional - defaults to 0 (not set)
// seed: uint32; optional - omit to use a random seed
it = doc.FindMember("seed");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
if (!it->value.IsUint())
return absl::InvalidArgumentError("seed is not an unsigned integer");
request.seed = it->value.GetUint();
if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
return absl::InvalidArgumentError("seed is not an integer");
if (it->value.IsUint64()) {
const uint64_t raw = it->value.GetUint64();
if (raw > std::numeric_limits<uint32_t>::max())
return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
request.seed = static_cast<uint32_t>(raw);
} else if (it->value.IsUint()) {
request.seed = it->value.GetUint();
} else {
const int64_t raw = it->value.GetInt64();
if (raw < 0 || raw > static_cast<int64_t>(std::numeric_limits<uint32_t>::max()))
return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
request.seed = static_cast<uint32_t>(raw);
}
}

// stop: string or array; optional - defaults to null (not set)
4 changes: 3 additions & 1 deletion src/llm/apis/openai_request.hpp
@@ -17,6 +17,7 @@
// Type that holds vector of pairs where first element is chat turn index and second is image tensor
// this way we store information about which image is associated with which chat turn
#pragma once
#include <cstdint>
#include <map>
#include <optional>
#include <string>
@@ -57,8 +58,9 @@ struct OpenAIRequest {
// Multinomial decoding specific
std::optional<float> temperature{std::nullopt};
std::optional<float> topP{std::nullopt};
std::optional<float> minP{std::nullopt};
std::optional<int> topK{std::nullopt};
std::optional<int> seed{std::nullopt};
std::optional<uint32_t> seed{std::nullopt};
std::optional<float> frequencyPenalty{std::nullopt};
std::optional<float> presencePenalty{std::nullopt};
std::optional<float> repetitionPenalty{std::nullopt};
4 changes: 4 additions & 0 deletions src/llm/apis/openai_responses.cpp
@@ -406,6 +406,10 @@ void OpenAIResponsesHandler::serializeCommonResponseParameters(Writer<StringBuff
writer.String("top_p");
writer.Double(static_cast<double>(request.topP.value()));
}
if (request.minP.has_value()) {
writer.String("min_p");
writer.Double(static_cast<double>(request.minP.value()));
Collaborator:

> @mzegla please keep in mind that we should always return min_p, even if the client did not specify it.
> @michalkulakowski here we add another param that doesn't follow the API correctly.

@michalkulakowski (Collaborator, May 8, 2026):

> It's a little bit different than top_p, for example, because I think min_p is not part of the OpenAI API. But I agree that it would be consistent to return the values of all the generation parameters that OVMS supports in the Responses API response.

}
writer.String("truncation");
writer.String("disabled");
// TODO: user not supported
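
For illustration only (numeric values invented), the parameter fragment serialized by the code above would then contain something like the following; note that `min_p` appears only when it was set on the request, which is what the review thread above discusses:

```json
"top_p": 0.9,
"min_p": 0.05,
"truncation": "disabled"
```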
25 changes: 24 additions & 1 deletion src/llm/io_processing/base_generation_config_builder.cpp
@@ -16,6 +16,7 @@

#include "../../logging.hpp"
#include <limits>
#include <random>
#include <string>
#include <openvino/genai/generation_config.hpp>
#include "base_generation_config_builder.hpp"
@@ -118,9 +119,11 @@ void BaseGenerationConfigBuilder::parseConfigFromRequest(const OpenAIRequest& re
if (request.temperature.has_value())
config.temperature = request.temperature.value();
if (request.topK.has_value())
config.top_k = request.topK.value();
config.top_k = (request.topK.value() == -1) ? std::numeric_limits<size_t>::max() : static_cast<size_t>(request.topK.value());
if (request.topP.has_value())
config.top_p = request.topP.value();
if (request.minP.has_value())
config.min_p = request.minP.value();
if (request.seed.has_value())
config.rng_seed = request.seed.value();
if (request.stop.has_value())
@@ -133,6 +136,26 @@
config.presence_penalty = request.presencePenalty.value();
config.do_sample = config.temperature > 0.0f && config.num_beams == 1;

// Apply multinomial sampling defaults when not explicitly set
if (config.do_sample) {
if (!request.topK.has_value() && config.top_k == std::numeric_limits<size_t>::max()) {
config.top_k = 40;
SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Defaulting top_k to 40 for multinomial sampling.");
}
// Use random seed for multinomial sampling to ensure non-deterministic behavior by default.
// Note: rng_seed from generation_config.json is not honoured — only an explicit per-request
// seed produces deterministic output.
// Use a thread_local mt19937 seeded once via std::random_device to avoid per-request overhead.
if (!request.seed.has_value()) {
static thread_local std::mt19937 rng{std::random_device{}()};
size_t seed = 0;
while (seed == 0)
seed = rng();
config.rng_seed = seed;
SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Randomizing rng_seed for multinomial sampling: {}.", config.rng_seed);
}
}

if (request.logprobschat || request.logprobs)
config.logprobs = 1;
// Assisted decoding specific
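
To summarize the precedence implemented in this hunk, here is a standalone sketch (illustrative code only, not part of OVMS; all names are invented): explicit request values win, and only when multinomial sampling is active does `top_k` default to 40 and the seed get randomized.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <random>

// Illustrative sketch only (not OVMS code; names invented). Explicit request
// values win; otherwise multinomial sampling gets top_k = 40 and a random seed.
// Note: the real builder also respects a top_k preset from generation_config.json.
struct ResolvedSampling {
    size_t topK;
    uint32_t rngSeed;
};

inline ResolvedSampling resolveMultinomialDefaults(std::optional<int> requestTopK,
                                                   std::optional<uint32_t> requestSeed) {
    ResolvedSampling out{40, 0};  // defaults when the request leaves both unset
    if (requestTopK.has_value())
        out.topK = (requestTopK.value() == -1) ? std::numeric_limits<size_t>::max()
                                               : static_cast<size_t>(requestTopK.value());
    if (requestSeed.has_value()) {
        out.rngSeed = requestSeed.value();  // explicit seed -> reproducible sampling
    } else {
        static thread_local std::mt19937 rng{std::random_device{}()};
        uint32_t seed = 0;
        while (seed == 0)  // skip 0, mirroring the loop in the builder above
            seed = static_cast<uint32_t>(rng());
        out.rngSeed = seed;  // random seed -> non-deterministic output by default
    }
    return out;
}
```

As in the builder above, this sketch covers only the `do_sample` path; beam search configurations are left untouched.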
Expand Down
Loading