memutils: Replacement of libc string.h functions - currently only Dmemset() #2662
Conversation
wilzbach left a comment
A few initial comments. I think this initial PR will take quite a bit of hard work, as you need to get used to a few things, but the other ones should be a lot easier and more straightforward after that.
BTW, with these kinds of things I highly recommend checking the ASM in GDC, as the compiler might silently turn your code back into a call to the C memset function ;-) Seems to work fine in this case though: https://explore.dgnu.org/z/-uJfsJ I'll need to have a look at the GCC sources to see if we can somehow disable this pattern matching / rewriting for individual functions, just to be safe.
Add an -O3 there and you'll see it jumps to memset. :P
Just a general high-level question: how is this meant to scale to the plethora of already supported targets - ARM, MIPS, RISCV, PPC, SystemZ, HPPA, SPARC, and their 64-bit variants, to name a few? Have considerations been made for strict-alignment targets? Big-endian targets? My assumption would be that the performance would be a net loss for everyone, leaving some improvements wanting in the naive implementation - which could be done by handling small arrays with minimal branching, then setting values 32 or 64 bytes at a time depending on how big the array is.
@ibuclaw BTW,
@ibuclaw The project was initially focused on DMD and x86_64. For all the rest of the targets, the goal was to just have a naive working implementation. That changed in the last week, which meant throwing most of the (ASM) code away; I started again with intrinsics, optimizing for LDC etc. But now time is constrained. Seb and I will try to make the best of the time left, but the community can influence us, so I'm open to suggestions. I didn't really understand what you meant about the naive implementation. Setting 32 or 64 bytes at a time requires SIMD, which is done in the SIMD implementation, along with other optimizations (reaching alignment, and there was a switch fall-through for small sizes, but that had problems, so it forwards to naive; there are some other tweaks but those require ASM-level control). The optimizations possible without SIMD are not a lot. They're things like the GNU algorithm, which pretty much reaches 4/8-byte alignment. But those things are trivial for a compiler to do itself.
Setting then becomes (make sure pointer is aligned!) or
I'm sure the compiler could go a long way by itself, but sometimes you need to give it a gentle nudge to make it go one better. Anyway, I'll wait to see how this ends up, as things are still in flux. |
So, you mean 32/64 bits at a time, not bytes. Then yes, as I said, we're going back to the modified GNU algorithm: https://www.embedded.com/print/4024961 (that's for memcpy() but the "GNU algorithm" is thrown around when talking about the idea of "align to a 4/8-byte boundary and do appropriate moves"). Edit: Well, for clarification, it's not a switch fall-through. It seems worse than that, since it doesn't use a jump table and goes byte-by-byte etc., but the idea is the same.
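For reference, the "align to a word boundary, then store whole words" idea from the linked article can be sketched in C. This is a hedged illustration, not the PR's actual code: the name `naive_memset` and the 8-byte word size are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the modified GNU algorithm: byte stores until the
   pointer is word-aligned, then word-sized stores, then the tail. */
static void naive_memset(void *dst, unsigned char val, size_t n)
{
    unsigned char *p = dst;

    /* Head: byte stores until p is 8-byte aligned (or n runs out). */
    while (n > 0 && ((uintptr_t)p & 7) != 0) {
        *p++ = val;
        n--;
    }

    /* Body: broadcast val to all 8 bytes, store one word at a time. */
    uint64_t v = val * 0x0101010101010101ULL;
    while (n >= 8) {
        *(uint64_t *)p = v;
        p += 8;
        n -= 8;
    }

    /* Tail: remaining bytes. */
    while (n-- > 0)
        *p++ = val;
}
```

An aligned `uint64_t` store here is what a compiler would also emit for the plain byte loop at higher optimization levels, which is why the thread calls these non-SIMD tweaks trivial for the compiler to do itself.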
src/core/experimental/memutils.d
Outdated
* the size of the array element) to `val`.
* Otherwise, set T.sizeof bytes to `val` starting from the address of `dst`.
*/
void memset(T)(ref T dst, const ubyte val)
Is this undocumented (ddoc) on purpose?
You mean it should be:
/**
* ...
*/
?
Yes, and a Params section (https://dlang.org/spec/ddoc.html#standard_sections) is probably best too.
Hmm, thanks. I'm not very accustomed yet to the logistics of contributing to D.
Probably you can add `scope` to the reference? Explicit `@nogc nothrow` may be good too.
Yes, I'm not very familiar with `scope`, but `@nogc` and `nothrow` are good.
Thanks for your pull request and interest in making D better, @baziotis! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please see CONTRIBUTING.md for more information. If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references: Your PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally: If you don't have a local development environment set up, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + druntime#2662"
src/core/experimental/memutils.d
Outdated
* dst = Memory destination whose bytes are to be set to `val`.
*
* Returns:
*     Nothing.
You can just leave the Returns section out if it's void. 😉
Haha, I didn't like it either. But I wanted a way to note that this memset is different from the libc one in that it returns nothing. I'll add it as a N.B.
I think something went wrong with Buildkite that is not a problem with this PR.
see e.g. dlang/dmd#10244 (review)
Sorry, but I don't understand what I should do.
Nothing, it is unrelated to this PR.
Ah, ok, that's what I figured; I was just pointing to the problem. Thanks.
One thing to take into account is that the LLVM optimizer recognizes
Before using/enabling this on the optimizing compilers, please do extensive testing and assembly output checking. The current set of tests is way too small. After just a few seconds of thinking about it, I already see things untested: misaligned pointer,
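To make that ask concrete, here is a hedged sketch of exhaustive (offset, length) coverage against the libc reference, including misaligned pointers and zero-length calls. It is written in C for illustration; the names `check_against_libc` and `byte_memset` are hypothetical, and the PR's real tests would live in D `unittest` blocks.

```c
#include <assert.h>
#include <string.h>

/* Compare a memset-like function against libc memset for every small
   (alignment offset, length) pair. Sentinel bytes around the target
   region catch buffer overruns. Returns the number of failing cases. */
static int check_against_libc(void (*my_memset)(void *, unsigned char, size_t))
{
    enum { PAD = 64 };
    unsigned char got[PAD], want[PAD];
    int mismatches = 0;

    for (size_t off = 0; off < 16; off++) {
        for (size_t len = 0; off + len <= PAD; len++) {
            memset(got, 0xCC, PAD);    /* sentinel to catch overruns */
            memset(want, 0xCC, PAD);
            my_memset(got + off, 0xAB, len);
            memset(want + off, 0xAB, len);
            if (memcmp(got, want, PAD) != 0)
                mismatches++;
        }
    }
    return mismatches;
}

/* Trivial byte-loop implementation standing in for Dmemset here. */
static void byte_memset(void *d, unsigned char v, size_t n)
{
    unsigned char *p = d;
    while (n--)
        *p++ = v;
}
```

In the real test suite one would pass the function under review (e.g. `Dmemset`) instead of the stand-in, and expect zero mismatches.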
* (whose count is the length of the array times
* the size of the array element) to `val`.
* Otherwise, set T.sizeof bytes to `val` starting from the address of `dst`.
* N.B.: Contrary to the C Standard Library memset(), this function returns nothing.
Why deviate from the C stdlib memset on this point?
Because these utilities were not created with the idea of having the exact C interface. We decided to drop legacy C stuff that seems useless, like the return value and the fact that memset gets an int instead of a byte.
I'm not saying this is necessarily the best idea ever, because no matter how irrelevant the legacy stuff is, the thing is that people have been used to it for years.
That is true. It's supposed to be done on GDC as well. It's supposed that on GDC, you can set the

As far as I have tested, the functions are not turned into LLVM / GDC counterparts, except for the naive versions, which is ok. Other than that, I'm not sure what exactly you want tested.

Edit: Btw, as far as performance boost is concerned, things like assume_aligned would be great, but I don't know of a way to do it on either LDC or GDC.
The things you mention are tested, except for

Meaning, except for the basic unittests.
Using

If you haven't seen it yet, this thread is interesting: https://lists.llvm.org/pipermail/llvm-dev/2019-April/131973.html
I meant that the performance of this may heavily depend on how it is used in user code, plus separate compilation. Will the function be opaque to the optimizer or not? Think of user code like this:
Ah yes, I see now how the tests work. By the way, it would perhaps be clearer to have all functions take a
Ah, I see, ok. I mentioned

The way I see it, my

Although ideally, we should have some control besides flags for saying "we don't want this to be replaced" or something. Ok, this was somewhat long, I hope it was clear.
No, thanks! I'll check it.
Yes, this is an important question. It adds to the questions I wrote above and it's on point. Well, I don't know how one can tell the compiler that. As I said, I don't even know how to get something like assume_aligned, which is quite important.
Good point. I have tested by repeatedly inspecting in the debugger, but a more formal test should be added. Here's the
Well, it's just another
Hmm, you mean the
Ah I didn't mean to add an
Yes indeed.
Sorry, I did not get that one.
Yes,
{
    /* SIMD implementation
     */
    private void Dmemset(void *d, const uint val, size_t n) nothrow @nogc
I can't see a path for val == 0... It is usually possible to make zeromem substantially faster.
But again, none of this code should apply to LDC/GDC, those compilers have intrinsics for this stuff; this should be considered strictly for DMD.
For instance, on PowerPC, there's an instruction dcbz (data cache-block zero) which resets an entire 256byte cache line to zero in one instantaneous operation... I can't see your PPC branch that takes advantage of 256byte alignment and dcbz... but the backends all do.
Assuming this is targeting DMD, then your assumptions that the compiler won't do anything interesting and that the target is x64 are both reasonable; otherwise I can't get behind any of this code.
I can't see a path for val == 0... It is usually possible to make zeromem substantially faster.
No memset that I know of that targets x64 does anything different for val == 0.
But again, none of this code should apply to LDC/GDC, those compilers have intrinsics for this stuff; this should be considered strictly for DMD.
Check this: #2687 (comment)
PowerPC was not a target of this project.
Assuming this is targetting DMD, then your assumptions that the compiler won't do anything interesting, and that the target is x64 are both reasonable assumptions, otherwise I can't get behind any of this code.
The focus was initially DMD, then LDC and GDC by not using their intrinsics. It's again the same question - What happens when libc is not available and do we care?
PowerPC was not a target of this project.
Right, that's my point, and why I think it's not a good idea to re-implement these functions. There are too many versions of them to worry about, and maintaining great perf is a moving target as architectures like x86 and ARM evolve.
What happens when libc is not available and do we care?
libc is always available... what is a case when it's not?
I'd also like to know if the intrinsics definitely call out to libc, or if they call into the compiler's runtime lib. What is the symbol that the intrinsics call?
I'll have to look at this. Let me be clear again, I don't know for a fact that they call libc, it just seemed that way as they execute the same code. I'll definitely have to look at that if we move forward with this PR.
    return;
}
void *temp = d + n - 0x10;       // Used for the last 16 bytes
const uint v = val * 0x01010101; // Broadcast val to all 4 bytes
mul is almost never a good idea. I suspect the mul may be a bottleneck in the ~32-128 byte range. If you're gonna do it, why not extend it to 8 bytes and get more value from it?
Do you have a reference for this code? It doesn't look very optimal to me. If you're gonna use SSE, there are broadcast and permute functions, which don't introduce hazards as bad as mul...
Yes, it's inspired by Agner Fog. I didn't copy code from him but I have read his optimization manual. This mul trick is his. And no, a mul in x86 is not a problem nowadays.
Our code turned out to be similar e.g. his AVX version: https://github.com/tpn/agner/blob/master/asmlib/asmlibSrc/memset64.asm#L188
It's also inspired by GCC.
Edit: Oh, and there's no point in doing it in 8 bytes. It is immediately broadcast into an XMM register.
I'm hyper-aware of this mul trick; I've been doing this for decades, but it absolutely IS a problem 'nowadays'... I've written a lot of code using mul tricks like this, and been surprised when they become an unintuitive bottleneck.
It might not be a problem here, because I think it's usually the case that memset is not called in a hot-loop, but multiplies usually have a lot of limitations on the pipeline.
There's usually only one imul pipeline, and while the latency might not be so bad these days, the circuit is kinda recursive, so it can't accept a new instruction per cycle like add/sub circuits, so if there are multiple mul's contending for the limited imul resource, they create stalling hazards on each other.
imul is often used in address calculation, so there's a reasonable possibility of contention, but there's an if above this, so the pipeline might already have flushed...
I do a lot of work with colour precision scaling using this same trick; the imul is almost always the limiting factor in my loops.
Anyway, I only say this because you're shoving the value straight into SSE in the following line, where you can use much faster permute operations to broadcast the value very easily (like pshufb, or complements on other architectures).
I didn't know all that stuff about muls limiting the pipeline. I just knew that the cycles they consume are about the same as say add / sub. Is there somewhere I can read more about those?
But the probability of Agner not having thought of this is close to 0. :P
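For readers following along, the multiply trick under discussion replicates one byte into every byte of a word; an SSE version would instead broadcast with a shuffle such as pshufb. A minimal C illustration of the scalar part (the helper names are ours, not the PR's):

```c
#include <assert.h>
#include <stdint.h>

/* Replicate one byte into all 4 bytes of a 32-bit word (the
   `val * 0x01010101` line from the diff), and the 8-byte variant
   ibuclaw suggests extending it to. */
static uint32_t broadcast4(uint8_t val) { return val * 0x01010101u; }
static uint64_t broadcast8(uint8_t val) { return val * 0x0101010101010101ULL; }
```

The multiply works because each set bit pattern 0x01 in the constant copies `val` into one byte position with no carries between bytes (since `val <= 0xFF`).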
Closed due to not being useful - more info here.
As this is my first PR to D runtime, here are some notes:
- unittest: these tests can't really be put in `unittest`, unless I miss something.
- `Dmemset`: I think that a better choice would be to have such code in something like `experimental/simd.d`, but I was not sure if that's desired.