At the point you've gotten to, syscall overhead is definitely going to be a big factor (even without Spectre mitigations enabled) -- I'd be very curious to see how far a similar io_uring benchmark would get you.
It supports IOPOLL (busy-polling for completions) and SQPOLL (kernel-side polling of the submission queue), so hopefully the fact that the application driving it is in another thread wouldn't slow it down too much... With multi-shot accept/recv you'd only need to tell it once to keep accepting connections on the listener fd, but I'm not sure whether you can chain recvs onto the child fd automatically from the kernel yet... We live in interesting times!
I would love to see an io_uring comparison as well; while it's a substantial amount of work to port an existing framework to io_uring, at the point where you're considering DPDK, io_uring seems relatively small by comparison.
I personally started a new framework and went with io_uring for simplicity; it already gives me most of what I need -- asynchronous I/O with no context switching.
DPDK is huge and inflexible, it does a lot of things which I'd rather be in control of myself and I think it's easier to just do my own userspace vfio.
Oh! That's not obvious at all from the man page (io_uring_enter.2):
If the io_uring instance was configured for polling, by specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(2), then min_complete has a slightly different meaning. Passing a value of 0 instructs the kernel to return any events which are already complete, without blocking. If min_complete is a non-zero value, the kernel will still return immediately if any completion events are available. If no event completions are available, then the call will poll either until one or more completions become available, or until the process has exceeded its scheduler time slice.
... Well, TIL -- thanks! and the NAPI patch you pointed at looks interesting too.
Yes, I would be very interested in that as well. I work on a DPDK-based app and have sometimes wondered how close to the same performance we could get by using as many kernel optimizations as possible (io_uring, XDP, I don't know what else). It would be an interesting option to give the app, since it would sometimes greatly simplify deployment -- dealing with DPDK drivers and passthrough in a container and cloud world isn't always the easiest thing.