Linux extreme performance H1 load generator

(gcannon.org)

21 points | by MDA2AV 2 days ago

4 comments

  • qcoudeyr 23 minutes ago
    From my benchmarks, I will keep using oha (https://github.com/hatoo/oha). Oha is more complete than gcannon and has a similar req/s rate while handling IPv6, HTTPS, etc...
    • G3nt0 10 minutes ago
      oha is one of the slowest load generators; you should look into h2load if you need h2/h3 support. I just tried oha and it pulls more CPU than the server I am testing, not to mention the h2 and h3 results are just nonsense
  • 0x000xca0xfe 1 hour ago
    Interesting, I made something similar years ago, when io_uring wasn't around yet; it is just a couple of threads blocking on sendfile: https://github.com/evelance/sockbiter

    Of course it needs to pre-generate the file, and you need enough RAM for both the running server and the cached file, but it needs almost zero CPU during the test run and can probably produce even more load than this io_uring tool.

    • MDA2AV 30 minutes ago
      Very cool!

      So I just tried your tool and it just hangs. I see you're sending Connection: close requests; is this configurable to keep-alive, or even better, to no header at all? In HTTP/1.1 it's better not to send keep-alive/close at all; never try to enforce it, as it is not mandatory.

      A lot of servers just ignore the close and don't close the connection (like the one I am using), so this could be the issue I am having.

  • Veserv 6 hours ago
    What is the point of making up claims of "extreme" performance without any accompanying benchmarks or comparisons?

    It really should be shameful to use unqualified adjectives in headline claims without also providing the supporting evidence.

    • MDA2AV 1 hour ago
      I agree, I'll try adding some. We use the tool on a benchmarking platform, so we run this thing hundreds of times daily and have done dozens of tests against pretty much every other load generator (that I know of). Numbers are also always tied to the hardware you run on, though, and benchmarks provided by the maintainer themselves are typically biased and won't match what you get.

      I personally never care about the benchmarks presented; it's much better to try it and see for myself. So I didn't think much about having a table with values there, but I can understand how it may help.

    • raks619 6 hours ago
      did you scroll down?
      • ziml77 6 hours ago
        I did, and I still didn't see any numbers. Just a bunch of AI-generated text about why it's supposedly fast. It even says it records numbers multiple times, so why aren't any presented?
  • bawolff 5 hours ago
    Really stupid question from someone who doesn't know much about io_uring: wouldn't doing all this I/O async make the latency measurements less accurate? How do you know when the I/O starts if you are submitting it async in batches of 2048?
    • tuetuopay 3 hours ago
      The main difference with io_uring is that you're not blocking the thread, just like O_NONBLOCK + epoll, but you don't have to rely on per-operation syscalls to do so: there's no expensive context switch to kernel mode. Using O_NONBLOCK + epoll is already async :)

      In fact, you don't know when a syscall actually starts executing even with regular blocking calls. The only thing you're sure of is that the kernel knows about the syscall you want; you have absolutely no indication of whether it has started to run or not.

      The real question is: are the classical measurements accurate? All we have is an upper bound on the time it took: I fired the write at t0 and finished reading the response at t1. This doesn't really change with io_uring. Batching mostly changes one thing: multiple measurements will share a t0, and possibly a t1 when multiple replies arrive at once.

      Is it important? Yes and no. What matters most in such benchmarks is that the added delay is consistent between measurements, and knowing at what load that consistency starts to break down. So it matters if you're chasing every µs in the stack, but not if your goal is lowering the p99 that shows up under heavy load. In that case, consistency between measurements is paramount in order to get histograms and such that make sense.

    • dijit 4 hours ago
      It's not a stupid question.

      Normally, when I have run latency measurements in the past, I have run them from the perspective of the caller, not the server.

      In most cases this is over the network, a named pipe or sock file.

      I guess it should be possible to run multiple runtimes inside a program that run independently.