Make object expiry scalable #4349
base: master
Conversation
My first thought, and this is not a replacement for this PR, is that it was a mistake to have a central EXP, and that it really should be STV methods. I think all the places which call EXP() also know the stevedore, so that shouldn't be too much havoc, and it would give STVs the option of implementing EXP integrated with their storage layout policy.
I agree with the previous comment: multi-EXP would be helpful and would allow for specific optimizations. I see this as largely orthogonal to this PR, as we still need an EXP implementation for SML, to which this PR implements improvements, and other stevedore implementations will likely start by copying the default implementation.
Before this change, we would strictly work the expire thread's inbox one item at a time: lock, take it off the list, update the exp flags, unlock, work it. This requires one lock/unlock cycle per object core in the exp inbox.

We now create a batch array of items: under the lock, we do all the work as before, but remember up to 1024 objcore pointers and their original exp flags in an array, which we then work outside the lock, increasing overall efficiency through less lock contention and better cache locality.

Why an array of items, why not repurpose the inbox (exp_list)? In fact, we will do that in another commit, but at this point, the list is not the right data structure: the inbox semantics require us to (atomically, under a lock) update the exp_flags to signify an oc being removed or the commands being processed. Yet to process the oc later, we need the original flags. So we use an array of pointers to the oc together with the original flags. (One alternative would be a list head for every possible flag combination, but that is probably overly complex.)

This changes the order of events slightly: exp_mail_it() inserts OC_EF_REMOVE jobs at the head of the list, because they will immediately free resources and are thus more important than additions to the expire binheap; also, mailed object cores have a chance of getting removed before even being added while waiting in the inbox. With batching, we no longer follow this priority strictly: by taking the first (up to) 1024 items from the inbox before re-checking it, we might process inbox commands which we would not have processed before. This is not (yet) a change in behavior for the case when the expire thread is under pressure, because the inbox will eventually contain more than 1024 OC_EF_REMOVE items at the start, and hence the expire thread will only work these.

The choice of array size 1024 is arbitrary: it is deemed large enough for batching to have an effect, yet small enough to matter with respect to available stack space (on 64 bit, an array of this size needs 16KB).

Also, we now unconditionally call exp_expire(), because either we have processed the inbox and need a new tnext for potentially blocking in the next iteration, or we did not, waited for the next due time, and need to process the binheap anyway.
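As a rough, self-contained sketch of the batching pattern described above (not the actual cache_expire.c code; `BATCH_MAX`, `struct exp_item` and the `pop`/`work` callbacks are made up for illustration):

```c
#include <pthread.h>
#include <stddef.h>

#define BATCH_MAX 1024			/* arbitrary, see above */

struct objcore;				/* opaque stand-in */

struct exp_item {
	struct objcore	*oc;
	unsigned	flags;		/* exp_flags as seen under the lock */
};

/*
 * Drain up to BATCH_MAX inbox entries under the lock, remembering the
 * original flags, then process the whole batch outside the lock.
 */
static void
exp_inbox_batch(pthread_mutex_t *mtx,
    struct objcore *(*pop)(unsigned *flags),	/* one oc, NULL if empty */
    void (*work)(struct objcore *, unsigned))
{
	struct exp_item batch[BATCH_MAX];
	size_t i, n = 0;
	unsigned flags;
	struct objcore *oc;

	pthread_mutex_lock(mtx);
	while (n < BATCH_MAX && (oc = pop(&flags)) != NULL) {
		batch[n].oc = oc;
		batch[n].flags = flags;	/* original flags, needed later */
		n++;
	}
	pthread_mutex_unlock(mtx);

	/* one lock/unlock cycle for the whole batch; the actual work
	 * (binheap insert/remove, dereference) happens unlocked */
	for (i = 0; i < n; i++)
		work(batch[i].oc, batch[i].flags);
}
```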
This is in preparation for offloading exp_remove_oc(): the inbox list (exp_list linkage in struct oc) is free to use for OC_EF_REMOVE commands once we have processed the inbox, because the respective object core will be removed from the cache, and the exp_flags set to zero signal to exp_mail_it() to keep out. So we now use the list to collect ocs which need to be dereferenced.

Object cores with OC_EF_REMOVE and no OC_EF_INSERT set (that is, those on the binheap) will also continue to be collected on the batch array to have them deleted from the binheap.

We also add a bit of abstraction around the list head, which will be extended in follow-up commits. For now, it serves for tracking the number of elements on the list and asserting on it. So once the batch is processed, we end up with a neat list of objects to dereference.
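A toy version of the counted list head abstraction might look like the following (hypothetical names; the real code uses the VTAILQ macros and the oc's exp_list linkage):

```c
#include <assert.h>
#include <sys/queue.h>

struct objcore_sim {
	TAILQ_ENTRY(objcore_sim)	exp_list;	/* free once out of the inbox */
};

/* list head plus an element count we can assert on */
struct exp_deref_list {
	TAILQ_HEAD(, objcore_sim)	head;
	unsigned			n;
};

static void
deref_list_init(struct exp_deref_list *l)
{
	TAILQ_INIT(&l->head);
	l->n = 0;
}

static void
deref_list_add(struct exp_deref_list *l, struct objcore_sim *oc)
{
	TAILQ_INSERT_TAIL(&l->head, oc, exp_list);
	l->n++;
}

static struct objcore_sim *
deref_list_pop(struct exp_deref_list *l)
{
	struct objcore_sim *oc = TAILQ_FIRST(&l->head);

	if (oc == NULL) {
		assert(l->n == 0);
		return (NULL);
	}
	assert(l->n > 0);
	TAILQ_REMOVE(&l->head, oc, exp_list);
	l->n--;
	return (oc);
}
```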
Otherwise, for big batches (think: 100Ks of objects) the n_object counter can get out of sync at human scales.
Yes, I know this is against our cstyle, but this is very useful to avoid confusion with item->flags.
Now that we have a collection of objcores in a list, we can easily hand it off to another worker thread to run all the dereferences. This step is the first building block towards exp scalability beyond the capacity of a single thread.

We now hand the list of objects to be removed to a worker thread once an (arbitrary) threshold of 128 objects to be removed is reached. If the expire inbox keeps growing, we keep tasking more and more worker threads, which will apply back pressure by allocating more system resources to expiry, and ultimately by no threads being available for entry of new objects into the cache.
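A simplified, standalone sketch of the threshold handoff (deref_batch, batch_add and the raw pthread_create() call are illustrative only; the actual code hands the list to the Varnish worker pool, which is where the back pressure comes from):

```c
#include <pthread.h>
#include <stdlib.h>

#define EXP_DEREF_BATCH 128	/* arbitrary threshold, as above */

struct deref_batch {
	void	*ocs[EXP_DEREF_BATCH];	/* object cores to dereference */
	size_t	n;
};

/* run all the dereferences; stands in for exp_deref_work() */
static void *
deref_worker(void *priv)
{
	struct deref_batch *b = priv;
	size_t i;

	for (i = 0; i < b->n; i++)
		(void)b->ocs[i];	/* dereference the objcore here */
	free(b);
	return (NULL);
}

/*
 * Collect an oc; once the threshold is reached, hand the whole batch
 * to a worker thread and start over with a fresh one.  If no thread
 * can be created, the expiry thread works the batch itself.
 */
static void
batch_add(struct deref_batch **bp, void *oc)
{
	pthread_t thr;
	struct deref_batch *b = *bp;

	b->ocs[b->n++] = oc;
	if (b->n < EXP_DEREF_BATCH)
		return;
	if (pthread_create(&thr, NULL, deref_worker, b) == 0)
		pthread_detach(thr);
	else
		deref_worker(b);
	*bp = calloc(1, sizeof **bp);	/* XXX check for NULL in real code */
}
```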
This moves the HSH_kill() to after binheap removal, which makes no difference other than preparing for batching, where it will be run outside the critical section.
Similar to the changes for cache removal batching, we now gather all object cores to be expired in another head of the deref struct in one go within a single critical region (as before, exp_list is free when the respective object is not in the inbox) and run the derefs in the existing exp_deref_work() function, which optionally gets offloaded to a separate worker thread.

For efficiency, we also change the handling of expired object cores with OC_EF_POSTED: previously, they would just be ignored, only to be handled later by exp_inbox(). But now more than ever it does not make sense to keep them around, so we just remove them from the inbox and add them to our list to expire.

(Diff best viewed ignoring whitespace, -b option.)
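To illustrate the single critical region (the expire_ops callbacks are hypothetical stand-ins for the binheap and list operations, not a real Varnish API):

```c
#include <pthread.h>
#include <stddef.h>

struct objcore_s;	/* opaque stand-in for struct objcore */

/* hypothetical callbacks standing in for binheap and list operations */
struct expire_ops {
	struct objcore_s	*(*root)(void);		/* earliest entry or NULL */
	double			(*when)(const struct objcore_s *);
	void			(*delete_root)(void);
	void			(*collect)(struct objcore_s *);	/* deref list */
};

/*
 * Gather everything that is due within one critical region; the
 * dereferences then run outside the lock via exp_deref_work()-style
 * code, possibly on a worker thread.
 */
static double
expire_batch(pthread_mutex_t *mtx, const struct expire_ops *ops, double now)
{
	struct objcore_s *oc;
	double tnext = 0.;

	pthread_mutex_lock(mtx);
	while ((oc = ops->root()) != NULL) {
		tnext = ops->when(oc);
		if (tnext > now)
			break;			/* nothing else is due yet */
		ops->delete_root();		/* off the binheap ... */
		ops->collect(oc);		/* ... onto the deref list */
	}
	pthread_mutex_unlock(mtx);
	return (tnext);				/* next due time, or 0. */
}
```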
When exp_expire() finds an expired object which is also in the inbox, it would still count it as expired. We now avoid the double count by excluding objects with OC_EF_REMOVE set. We also reduce traffic on the global counter by using a local variable for the loop.
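The counter change amounts to accumulating locally and touching the shared statistic once per batch; a trivial sketch with made-up names:

```c
#include <stdatomic.h>
#include <stddef.h>

extern atomic_ulong n_expired_total;	/* stands in for the global VSC */

static void
count_expired(const unsigned *remove_flags, size_t n)
{
	unsigned long expired = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (remove_flags[i])	/* OC_EF_REMOVE set: not counted */
			continue;
		expired++;		/* local variable, no shared traffic */
	}
	if (expired > 0)
		atomic_fetch_add(&n_expired_total, expired);
}
```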
TL;DR: This patch series solves scalability issues of the single-threaded object expiry by offloading object core de-referencing to worker threads.
The test scenario
To test the SLASH/ storage engines, I regularly run "burn in" tests on two reasonably big machines (500 GB RAM, 48 cores, TBs of NVMe). The most important test tries to stress the most problematic cases all the time: objects are either very short-lived or very long-lived, have random size and random ttl, there are basically only cache misses, and "way too many" passes, bans and purges. It has evolved quite a bit and has been an invaluable tool to find all kinds of issues, mostly races, before users do.
The issue
The test has exposed an issue where the number of objects goes through the roof as soon as LRU nuking starts, ultimately leading to an out-of-memory situation. (There have also been other relevant issues, most prominently a memory leak, but the issue addressed here was nevertheless real and root-caused in Varnish-Cache. See here for more details.)
The problem is illustrated by this graph of VSCs gathered during a test run (NB: the y axis has log scale!):

Plotted here are various VSCs, either directly or as a derivative Requests Per Second rps(x). FELLOW.fellow.g_mem_obj is a counter which gets atomically updated by the storage engine, giving a reliable number of objects actually alive in the memory cache. The other counters are all Varnish-Cache standard.

We would expect MAIN.n_object to always match FELLOW.fellow.g_mem_obj more or less, but as we can see, n_object goes from roughly 2^20 (~1M) to roughly 2^24 (~16M) and there is no indication of it decreasing. It just stopped growing when the process died.

We can also see how the n_expired rate is pretty stable at first (and, due to the random but uniform nature of the test, we can assume that the number of objects expiring actually is stable in relation to the number of objects added), but some time after nuking kicks in (black n_lru_nuked graph), not only does the number of objects go through the roof, but the n_expired rate also almost comes to a complete stop (again, this is log scale; 2^5 is, as we all know, 32).

A suspicion
The immediate suspicion was that this issue could be related to the fact that the EXP thread is single threaded, but in retrospect, I would like to first give some ...
Background on the EXP thread
The exp thread uses a binheap (very interesting background reading for those who do not know it) to keep a logical list of object core references ordered by expiry time. Consequently, all cacheable objects need to go through the EXP thread by design: when they enter the cache, EXP needs to register them on the binheap, until they either get removed externally or expire. So the expire thread is a natural bottleneck. While we could support multiple expire threads, it is not the actual maintenance of the binheap which is expensive, but most of all object deletion as a consequence of expiry or removal.
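For readers unfamiliar with the data structure, a toy min-heap keyed on expiry time shows why the expiry thread only ever needs to look at the root; this is not the Varnish binheap API, which additionally keeps back-indices so arbitrary entries can be removed:

```c
#include <stdio.h>

#define HEAP_MAX 8

/* toy min-heap of expiry times; the real binheap stores objcore pointers */
static double heap[HEAP_MAX];
static int nheap;

static void
heap_insert(double when)
{
	int i = nheap++;

	/* sift up until the parent is no later than the new entry */
	while (i > 0 && heap[(i - 1) / 2] > when) {
		heap[i] = heap[(i - 1) / 2];
		i = (i - 1) / 2;
	}
	heap[i] = when;
}

static double
heap_pop(void)
{
	double root = heap[0], last = heap[--nheap];
	int i = 0, c;

	/* sift the last element down from the root */
	while ((c = 2 * i + 1) < nheap) {
		if (c + 1 < nheap && heap[c + 1] < heap[c])
			c++;
		if (last <= heap[c])
			break;
		heap[i] = heap[c];
		i = c;
	}
	heap[i] = last;
	return (root);
}

int
main(void)
{
	double now = 100.;

	heap_insert(107.);
	heap_insert(103.);
	heap_insert(105.);
	/* the root is always the earliest deadline among all entries */
	while (nheap > 0 && heap[0] <= now + 5.)
		printf("expire at t=%.0f\n", heap_pop());
	return (0);
}
```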
This PR
Here I suggest a number of changes to make a single EXP thread scale up significantly:
Restructure the code to batch operations on the inbox and expired objects
Rather than handling one inbox item / expired object at a time, we collect them in an array and in lists, and then work these outside the expire lock.
Make sure that both removal and expiry are always handled
Before, removal had precedence over expiry, which could lead to the object pileup described above.
Offload de-referencing of object cores
We offload the actual de-referencing (which, most of the time, implies object deletion) to worker threads, which implicitly provides backpressure by using up system resources and worker threads.
All details are documented in the individual commit messages.
Potential follow-up:
hcb_cleaner has the same problem in principle, but I am not sure yet how relevant it is; I will continue testing.