Everything was slow because sorting was taking a lot of time. Sorting was slow because its comparator was taking ~6 read locks on every comparison, and was cloning large structures to avoid holding the lock for a long time. The first fix was to access just the information needed to avoid the clones, the second fix was to cache exactly the data needed for sorting after the underlying data was updated, and use that for the comparators without needing to take the underlying lock.
I'm looking forward to the next post about how cache consistency is tough.
> So, yes, it takes time to read from memory. That's why we try to avoid allocations as much as possible.
This whole thing is summed up by some pretty basic physics. What you actually want to minimize is the communication of information between physical cores. Nothing else really matters. Certainly not terminology or clever tricks that effectively try to cheat thermodynamics. The cost of communicating information is almost always much more substantial than the cost of computing over that same information. The ALU is not nearly as power hungry as infinity fabric.
Always wondered if it is possible/useful to model multi core CPU caches as CISC cores where the complex instructions are all the operations you can do with the L1 within the time frame of an L2 access
The useful distinction here is not just AoS vs SoA, it is moving expensive work off the hot path. The biggest win in the article seems to be caching the sort/filter inputs so lock-taking and cache misses happen on updates, not during every comparison. That is a very transferable lesson even if you never go full data-oriented design.
Everything was slow because sorting was taking a lot of time. Sorting was slow because its comparator was taking ~6 read locks on every comparison, and was cloning large structures to avoid holding the lock for a long time. The first fix was to access just the information needed to avoid the clones, the second fix was to cache exactly the data needed for sorting after the underlying data was updated, and use that for the comparators without needing to take the underlying lock.
I'm looking forward to the next post about how cache consistency is tough.
This whole thing is summed up by some pretty basic physics. What you actually want to minimize is the communication of information between physical cores. Nothing else really matters. Certainly not terminology or clever tricks that effectively try to cheat thermodynamics. The cost of communicating information is almost always much more substantial than the cost of computing over that same information. The ALU is not nearly as power hungry as infinity fabric.
It seems all of these Rust performance warriors have no understanding of computer architecture fundamentals.