The REP MOVS series of instructions has an interesting history, due to the advantages and disadvantages of microcode and its shifting performance relative to manually written code with each CPU generation. It has long been great for large aligned copies, since the microcode can move whole cache lines at a time, but until recently it struggled with small copies. Apparently, one of the reasons is a lack of branch prediction in microcode:
Non-temporal stores are tricky performance-wise. They can be dramatically faster than normal stores (~3x), they may be faster on some CPU generations than others, they may be slower if subsequent code needs the destination in the CPU cache, and even for GPUs they may not be ideal if an iGPU shares part of the cache hierarchy with the CPU. But the worst issue is that occasionally a specific CPU will have some random pathological behavior with them. IIRC, masked non-temporal stores were horrifically slow on some AMD APUs, on the order of hundreds to thousands of cycles per instruction. I find it hard to recommend them much anymore.
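For concreteness, a non-temporal fill using SSE2 streaming stores looks roughly like this (`nt_fill` is a hypothetical helper; the destination of `_mm_stream_si128` must be 16-byte aligned):

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Sketch: fill a buffer with non-temporal (streaming) stores, which
 * write around the cache instead of pulling the destination into it.
 * dst must be 16-byte aligned for the streaming part. */
static void nt_fill(uint8_t *dst, uint8_t byte, size_t n) {
    __m128i v = _mm_set1_epi8((char)byte);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm_stream_si128((__m128i *)(dst + i), v);
    for (; i < n; i++)       /* tail bytes with ordinary stores */
        dst[i] = byte;
    _mm_sfence();  /* order the NT stores before later memory operations */
}
```

The `_mm_sfence()` matters: NT stores are weakly ordered, so without it later code may observe them out of order. And as noted above, if the next thing you do is read that buffer, you've thrown away the cache lines you just could have had.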
Not sure what Visual Studio has done over the years, but I remember decompiling Gearbox's utilities .dll in James Bond 007: Nightfire (2002), and it appeared to have a bunch of string manipulation functions written using these instructions.
I see this test/cmp all the time after the instruction and I don't understand it. pcmpestri already sets ZF if edx < 16 and SF if eax < 16; it is already giving you the necessary status. Also, testing subwords of the larger register is very slow and is a pipeline hazard.
You've got this monster of an instruction and then people place all this paranoid slowness around it. Am I reading the x86 manual wrong?
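To illustrate from C: the intrinsics expose those flag outputs directly, so the "did it match at all?" question can come straight from CF rather than from comparing the returned index against 16. A sketch (SSE4.2; `find_byte16` is a made-up helper, and `p` must have 16 readable bytes):

```c
#include <nmmintrin.h>  /* SSE4.2: PCMPESTRI family */

#define MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT)

/* Sketch: find byte c among the first len (<= 16) bytes at p.
 * _mm_cmpestrc returns CF straight from PCMPESTRI, so no extra
 * "cmp eax, 16" is needed to detect the no-match case. */
__attribute__((target("sse4.2")))
static int find_byte16(const char *p, int len, char c) {
    __m128i hay = _mm_loadu_si128((const __m128i *)p);
    __m128i set = _mm_set1_epi8(c);  /* only the first byte is used (length 1) */
    if (!_mm_cmpestrc(set, 1, hay, len, MODE))
        return -1;                               /* CF clear: no match */
    return _mm_cmpestri(set, 1, hay, len, MODE); /* index of first match */
}
```

In hand-written assembly the equivalent is simply branching on jc/jnc after the pcmpestri, which is the point being made: the flags are already there.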
I think people started doing that after one of the Intel SSE examples did it and everyone just copied it.
But on any modern CPU there should be essentially no penalty for doing that now. Testing the full register is basically free as long as you aren't doing a partial write followed by a full read (write AH, then read AX), and I don't think there's any case where this could stall on anything newer than a Core 2 era processor. But just replacing it with a "jnc" (or whatever condition you're actually testing for) would at least be fewer instructions. I'd love to see benchmarks, though, if someone has dug deeper into this than I have.
Unless instances are sparse, higher code density is of course always better, because of the instruction cache (and the microcode cache, if this doesn't get peephole-optimized away or something like that; I know nothing about the microcode cache).
But yeah, it may not make a real impact yet anyway.
I do wish Intel would make the other string instructions faster, just like they did with MOVS, because the alternatives are so insanely bloated.
it is never used with a prefix (the value would be overwritten for each repetition)
...which is still useful for extreme size-optimisation; I remember seeing "rep lodsb" in a demo, as a slower-but-tiny (2 bytes) way of [1] adding cx to si, [2] zeroing cx, [3] putting the byte at [cx + si - 1] into al, and [4] conditionally leaving al and si unchanged if cx is 0, all effectively as a single instruction. Not something any optimising compiler I know of would be able to do, but perhaps within the possibility of an LLM these days.
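The net effect described above can be modeled in plain C (a sketch, assuming DF=0 and 16-bit registers; the struct and names are made up):

```c
#include <stdint.h>

struct regs { uint16_t si, cx; uint8_t al; };

/* Sketch of what "rep lodsb" (with DF clear) does architecturally:
 *   while CX != 0: AL = [SI]; SI++; CX--.
 * Net effect: SI += CX, CX = 0, AL = byte at original SI + CX - 1,
 * and AL/SI are left untouched when CX starts at 0. */
static struct regs rep_lodsb(const uint8_t *mem, struct regs r) {
    while (r.cx != 0) {
        r.al = mem[r.si];
        r.si += 1;
        r.cx -= 1;
    }
    return r;
}
```

Two bytes of machine code (F3 AC) for all four effects at once, which is why it shows up in size-coded demos even though it's slow.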
https://stackoverflow.com/questions/33902068/what-setup-does...