Nice results! SIMD can be a pain, so it's good to know Zig makes it easy.
However, note that the plot under "Native SIMD Throughput Comparison" is extremely misleading: for bar heights to be proportional to the values they represent, the y-axis of a bar chart should start at zero. The way the data are presented makes it look like a 10-100x gain, rather than the actual 2x improvement.
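To put a number on the distortion, here is a rough sketch of how the drawn bar heights compare once the axis is truncated. The throughput values and the axis start are made up for illustration and are not taken from the article's plot; `apparent_ratio` is just a hypothetical helper name.

```python
def apparent_ratio(low, high, axis_start):
    """Ratio of the drawn bar heights when the y-axis starts at axis_start."""
    if axis_start >= low:
        raise ValueError("axis_start must be below the smaller value")
    return (high - axis_start) / (low - axis_start)


# Hypothetical throughputs with a true 2x gap (not the article's numbers).
baseline, simd = 10.0, 20.0

honest = apparent_ratio(baseline, simd, 0.0)     # axis starts at zero
truncated = apparent_ratio(baseline, simd, 9.0)  # axis starts just below the baseline

print(f"axis at 0: bars differ by {honest:.0f}x")
print(f"axis at 9: bars differ by {truncated:.0f}x")
```

The closer the axis start gets to the smaller value, the larger the visual ratio, even though the underlying data never changes.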
I was going to make the same comment. I saw the huge difference and went "wow", then read that it was a 2x improvement and had to check the axes properly, thinking "slightly less wow". It reminds me of that bar chart of women's average heights in different countries that starts at 5 feet https://preview.redd.it/dohqa8l94kb41.png?auto=webp&s=865180...
Is that solving the right problem? The algorithm can give reasonably accurate positions at arbitrary points in the future, but you don’t need to run it over and over if you need positions every second. You can generate keyframes and interpolate the positions between them, as the short-term orbital movements are rather trivial.
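A minimal sketch of the keyframe idea, under loud assumptions: `propagate` is a stand-in circular orbit rather than the real propagator, the radius, period and 60-second keyframe spacing are arbitrary, and cubic Hermite interpolation is my choice here (it uses the velocity that a propagator typically returns alongside position), not something the article prescribes.

```python
import math

RADIUS_KM = 6800.0   # assumed circular orbit radius
PERIOD_S = 5400.0    # assumed orbital period
OMEGA = 2.0 * math.pi / PERIOD_S


def propagate(t):
    """Stand-in for the expensive propagator: (position, velocity) at time t."""
    c, s = math.cos(OMEGA * t), math.sin(OMEGA * t)
    pos = (RADIUS_KM * c, RADIUS_KM * s)
    vel = (-RADIUS_KM * OMEGA * s, RADIUS_KM * OMEGA * c)
    return pos, vel


def hermite(p0, v0, p1, v1, h, u):
    """Cubic Hermite interpolation at fraction u in [0, 1] of an h-second interval."""
    h00 = 2 * u**3 - 3 * u**2 + 1
    h10 = u**3 - 2 * u**2 + u
    h01 = -2 * u**3 + 3 * u**2
    h11 = u**3 - u**2
    return tuple(
        h00 * a + h10 * h * va + h01 * b + h11 * h * vb
        for a, va, b, vb in zip(p0, v0, p1, v1)
    )


def positions_every_second(duration_s, keyframe_s):
    """One position per second in [0, duration_s), propagating only at keyframes."""
    n_intervals = -(-duration_s // keyframe_s)  # ceiling division
    keys = [propagate(i * keyframe_s) for i in range(n_intervals + 1)]
    out = []
    for t in range(duration_s):
        i, k = divmod(t, keyframe_s)
        (p0, v0), (p1, v1) = keys[i], keys[i + 1]
        out.append(hermite(p0, v0, p1, v1, keyframe_s, k / keyframe_s))
    return out


def max_error_km(duration_s, keyframe_s):
    """Largest distance between interpolated and directly propagated positions."""
    interp = positions_every_second(duration_s, keyframe_s)
    return max(math.dist(p, propagate(t)[0]) for t, p in enumerate(interp))


print(max_error_km(600, 60))
```

Here 600 output positions cost 11 propagator calls instead of 600. Whether the interpolation error is acceptable depends on the real orbit and the accuracy you need, so it is worth measuring against the actual propagator rather than this toy one.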
It is funny how we often assume we need a graphics card for these kinds of calculations when a standard processor is actually plenty fast. The specific changes to the memory layout seemed to make the biggest difference here by allowing the hardware to actually use its vector capabilities.
At risk of being called out for my ignorance (I am still new to GPU development and have only limited experience with CUDA), it seems to come down to how appropriate the execution model is to the work, e.g. SIMT vs. SIMD here.
These days a single machine with lots of RAM and cores will handle almost everything you throw at it, barring specific compute-intensive or memory-bound scenarios (current AI, gaming, etc.).
There's one example given where the result of either a simple or a complex calculation is picked depending on eccentricity, and the article mentions that it's faster to always calculate both and pick one with a mask.
If you calculate both, wouldn't it be even faster to just always do the complex calculation? (presumably that's more precise?)
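I don't know the article's actual reason, but for what it's worth, the more elaborate formula is not automatically the more precise one. Below is a toy sketch of the compute-both-then-select pattern using plain Python lists as stand-in SIMD lanes. The two formulas, the `THRESHOLD` value and all function names are hypothetical and unrelated to the article's eccentricity check; they only show a case where the direct formula loses precision for tiny inputs while a cheap series does not.

```python
import math

THRESHOLD = 1e-4  # hypothetical cutoff between the two formulas


def series_form(x):
    """Cheap truncated series for (1 - cos x) / x**2; accurate only for small |x|."""
    return 0.5 - x * x / 24.0


def closed_form(x):
    """Direct formula; suffers from cancellation when |x| is tiny (x must be nonzero)."""
    return (1.0 - math.cos(x)) / (x * x)


def select(mask, if_true, if_false):
    """Lane-wise pick, mimicking a SIMD select/blend instruction."""
    return [a if m else b for m, a, b in zip(mask, if_true, if_false)]


def evaluate(lanes):
    # Both formulas run on every lane; there is no per-lane branching.
    cheap = [series_form(x) for x in lanes]
    full = [closed_form(x) for x in lanes]
    mask = [abs(x) < THRESHOLD for x in lanes]
    return select(mask, cheap, full)


lanes = [1e-9, 1.0]
print("closed form only:", [closed_form(x) for x in lanes])
print("masked select:   ", evaluate(lanes))
```

In this toy case, always taking the closed form gives a badly wrong answer on the tiny input, so the select is there for correctness, not just speed. Whether something similar applies to the article's eccentricity case is a question for the author.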