Skip to content

Conversation

@Satbek
Copy link
Contributor

@Satbek Satbek commented Dec 30, 2025

In fast mode, every fiber was named "fast" so it could be killed when safe mode was enabled.

However, CRUD storage methods do not yield until a space operation is performed.
The rebalancing trigger is executed on insert/replace into the _bucket space; this operation does yield and switches the implementation of the bucket_ref / bucket_unref functions.

Therefore, once a fiber starts in fast mode, it remains in fast mode until the first box.space operation:

  • a write for memtx
  • a read/write for vinyl

As a result, there is no need to mark and kill iproto fibers that handle fast requests, because there is no situation in which a fiber remains in fast mode while rebalancing is in progress.

Yield checks were added to the tests (yield_checks) to ensure that no yields occur during these operations.

I didn't forget about

  • Tests
  • Changelog
  • Documentation

Closes #473

@Satbek Satbek force-pushed the rm_fiber_kill branch 4 times, most recently from 7a3db14 to 9a3bb6a Compare December 30, 2025 14:15
@Satbek Satbek requested review from a1div0 and vakhov December 30, 2025 14:40
Comment on lines 22 to 25
local function call_on_storage(run_as_user, func_name, ...)
return box.session.su(run_as_user, call_cache.func_name_to_func(func_name), ...)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Сorrect that fast/safe mode semantics are now fully encapsulated in bucket_ref_unref (and related sharding helpers), and that no other side effects at the call layer are required anymore? If so, this would be good to explicitly state somewhere (comment or docs), since this is a noticeable behavioral change compared to previous versions.

Copy link
Contributor Author

@Satbek Satbek Dec 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are internal implementation details, and users do not need to be aware of them.

The API with subscriptions to rebalancing events remains unchanged, so I do not see a reason to explicitly document that some internal code was removed.

The fibers kill removal is already reflected in the changelog.

Comment on lines 30 to 31
local info_prev = registry[info_curr.fid]
assert(info_curr.csw == info_prev.csw, "yield happened during fiber execution")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_no_yields() assumes the fiber was previously registered. For better, it might be worth adding an explicit assert(info_prev ~= nil, "fiber is not registered") before accessing info_prev.csw.

Comment on lines 35 to 36
local fid = fiber.self():id()
assert(registry[fid] ~= nil, "fiber is not registered")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

register_fiber() uses info.fid as the registry key, while unregister_fiber() uses fiber.self():id(). suggest using either fiber.self():id() everywhere, or fiber.self():info().fid everywhere, to avoid surprises.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are the same value. In unregister_fiber needed only id, so fiber.self():id() is used. In register_fiber, check_no_yields :info used, so no need to call fiber.self():id(), when fiber id is stored in info.

crud/delete.lua Outdated
Comment on lines 32 to 65
local finish = yield_checks.start()

opts = opts or {}

local space = box.space[space_name]
if space == nil then
return nil, DeleteError:new("Space %q doesn't exist", space_name)
return finish(nil, DeleteError:new("Space %q doesn't exist", space_name))
end

local _, err = sharding.check_sharding_hash(space_name,
opts.sharding_func_hash,
opts.sharding_key_hash,
opts.skip_sharding_hash_check)

if err ~= nil then
return nil, err
return finish(nil, err)
end

local ref_ok, bucket_ref_err, unref = bucket_ref_unref.bucket_refrw(opts.bucket_id)
if not ref_ok then
return nil, bucket_ref_err
return finish(nil, bucket_ref_err)
end

yield_checks.check_no_yields()
-- add_space_schema_hash is false because
-- reloading space format on router can't avoid delete error on storage
local result = schema.wrap_func_result(space, space.delete, {
add_space_schema_hash = false,
field_names = field_names,
noreturn = opts.noreturn,
fetch_latest_metadata = opts.fetch_latest_metadata,
}, space, key)

finish()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if an unexpected Lua error is thrown between yield_checks.start() and finish() (for example, inside schema.wrap_func_result()), the fiber would remain registered in yield_checks.registry. could this potentially affect subsequent tests and lead to confusing errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would affect tests that run on the same cluster. However, there are usually not many tests sharing a single cluster, so it shouldn’t cause major issues.

@Satbek Satbek force-pushed the rm_fiber_kill branch 4 times, most recently from e0c1ca8 to b3e1701 Compare January 5, 2026 19:55
In fast mode, every fiber was named “fast” so it could be
killed when safe mode was enabled. However, crud-storage
methods do not yield until a space operation is performed.

The rebalancing trigger is executed on insert/replace into
the _bucket space; this operation does yield and switches
the implementation of the bucket_ref/bucket_unref functions.

Therefore, once a fiber starts in fast mode, it remains
in fast mode until the first box.space operation: a write
for memtx or a read/write for vinyl.

As a result, there is no need to mark and kill iproto fibers
that handle fast requests, because there is no situation in
which a fiber remains in fast mode while rebalancing is in progress.

Yield checks were added to the tests (yield_checks) to
ensure that no yields occur during request until box operations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unsafe fiber termination by name causes killing unrelated fibers

3 participants