By the way, PSI has the same shortcoming of not accounting the time of threads w...

zaarn · on April 19, 2022

Yes, but async I/O means the process can continue doing other, potentially useful, work while waiting for the async I/O to complete. So if a process has issued an async I/O, it's not blocked or dead, like a process that issued sync I/O.

And on the flip side, if you block in io_submit and friends, async I/O counts towards PSI and load average again. I.e., when you can't do useful work because the device queue is full, for example.

So it's not as much of a shortcoming as you think, because if the system is I/O exhausted, async I/O processes will quickly contribute to load as well once the I/O queues are nearly full or full.
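For anyone who wants to watch the PSI side of this while testing, here is a minimal Python sketch that parses the I/O pressure file. It assumes a kernel with PSI enabled and the usual `some`/`full` line layout in /proc/pressure/io; the function names `parse_psi` and `read_io_pressure` are made up for illustration.

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts."""
    result = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        result[kind] = {}
        for field in fields:
            key, value = field.split("=", 1)
            result[kind][key] = float(value)
    return result


def read_io_pressure(path="/proc/pressure/io"):
    """Return parsed I/O pressure, or None if PSI is unavailable here."""
    try:
        with open(path) as f:
            return parse_psi(f.read())
    except OSError:
        # File missing (non-Linux, old kernel) or PSI disabled at boot.
        return None


if __name__ == "__main__":
    print(read_io_pressure())
```

The `avg*` fields are percentages of wall time and `total` is cumulative stall time in microseconds, so sampling `total` twice and taking the difference gives the stall time for that interval.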

tanelpoder · on April 19, 2022

The reason why I'm aware of this is that Oracle databases use libaio for some reads and most of the writes on Linux, including transaction log writes. There are plenty of cases where you don't exhaust the block device I/O queues - io_submit() finishes quickly without blocking, but then the I/O issuing process wants to wait for the asynchronously submitted I/O completion before moving on. It will use io_getevents() with the "timeout > 0" argument, so it will sleep (in S state), waiting for I/O completion. If it had submitted a synchronous I/O, it would have increased Linux load, but if configured to use libaio, it won't.

So, the I/O component of Linux system load and PSI do not include the async I/O waiting threads that decide to synchronously wait for asynchronously submitted I/O requests :-)

I see MySQL is using libaio in some places too (the io_submit/io_getevents show up in syscalls), but I haven't looked deeper into whether the getevents are "willing to wait" or not. But any application using libaio, that at some point uses io_getevents() in "willing to wait" mode, would be affected by this discrepancy.

zaarn · on April 19, 2022

I still think that is a reasonable caveat; the process didn't get hung up waiting for IO because IO was exhausted when it issued the write, rather it got hung because it deliberately chose to wait for IO to be completed. That isn't a load factor, it was the application doing it. Ideally Oracle DB (and MySQL) should track this and offer a performance metric somewhere (its systemd service, for example) of how long it's waiting for AIO completion.

The crucial point there is that while the application is slow, the system remains as reactive as the PSI number indicates. Only the application doesn't.

tanelpoder · on April 19, 2022

Yep, I do agree that it's a reasonable caveat, and with your point that async I/O allows the app to move on and do other stuff (in whatever thread state) and come back later.

But I'm talking about the cases where the app is done with the other work and needs to ensure that the previous write is persisted (for example to a write ahead log) before moving on. It will have to deliberately wait for I/O completion now, and thus will run io_getevents() with the "willing to wait" mode enabled. I see this in Oracle database world all the time and MySQL world occasionally too:

- Not seeing a high number of threads in D state or PSI IO figure does not mean that there are no threads waiting for I/O

- Seeing a high number of threads in D state and high PSI IO figure means that there are threads waiting for I/O (but possibly even more, if you'd look into the ones in S state, but in io_getevents syscall, with WCHAN=read_events).
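A rough way to check both bullets on a live box is to count threads in D state next to threads sleeping in S state with WCHAN=read_events. This is only a sketch: it assumes the usual /proc/<pid>/task/<tid>/stat and wchan files, reading wchan for other users' processes may need root (otherwise it tends to show 0), and the helper names are made up for illustration.

```python
import glob


def parse_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so the state is the first field after the LAST ')'.
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def classify(state, wchan):
    """Bucket a thread as 'D', 'aio_wait' (S state in read_events), or None."""
    if state == "D":
        return "D"
    if state == "S" and wchan == "read_events":
        return "aio_wait"
    return None


def scan(proc="/proc"):
    """Count D-state threads and S-state threads waiting in read_events."""
    counts = {"D": 0, "aio_wait": 0}
    for task in glob.glob(f"{proc}/[0-9]*/task/[0-9]*"):
        try:
            with open(f"{task}/stat") as f:
                state = parse_state(f.read())
            with open(f"{task}/wchan") as f:
                wchan = f.read().strip()
        except (OSError, ValueError, IndexError):
            continue  # thread exited mid-scan, or the file was unreadable
        kind = classify(state, wchan)
        if kind:
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    print(scan())
```

A single scan is just a snapshot; sampling it in a loop and averaging gives something closer to a load-style figure.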

zaarn · on April 19, 2022

Applications should account for it and measure it themselves. When an application chooses to wait (to persist writes, for example), it's hard to argue the kernel should take that as load. A similar argument could otherwise be made if an app is waiting on a futex or other resource. I don't think an fsync counts either. The solution there, IMO, is to use IO fences. Tell the OS that async writes and reads may not reorder beyond point X in time for this thread (or all threads, optionally for all FDs or just a specific one).

Then your app, like Oracle, simply issues a fence and can be done with it. If the app crashes before the fence is persisted, then it must be able to resume work from a previous one (a simple example would be that between WAL checkpoints, a fence is issued). The app won't have to worry that only some specific writes completed vs some others not beyond what the fence permitted. Additionally a good mechanism might be a call to wait for a fence to be persisted to disk.

Simple example for the WAL use case:

  1. Write new transaction to WAL
  2. Issue an IO fence for the WAL file
  3. Write new data to database file
  4. Wait for Fence from 2
  5. Return success

This is roughly equivalent to the synchronous example:

  1. Write new transaction to WAL file
  2. Issue fsync
  3. Write new data to database file
  4. Return Success

In case writes to the database file are lost, you can recover from the WAL (as intended). Notably, Step 4 of the async example is a case where the thread is waiting, but it can do other useful work while that is happening. The same thread can offload the work and simply issue more IO in the meantime, and return success to the client once it sees the correct wait-for-fence returning in the async queue. And it won't have to wait for AIO event completion/checkpoints as it does currently, which makes the system load non-indicative of app load (though frankly, system load is never indicative of app load; apache2 doesn't increase load if it runs out of workers).
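For reference, the synchronous variant can be sketched in a few lines of Python. Only the fsync version is shown, since the fence call is a proposal and not an existing API. The function name `commit` and its parameters are hypothetical, and real WAL code would also handle short writes and record framing.

```python
import os


def commit(wal_path, db_path, record, offset, page):
    """Synchronous WAL commit: durable WAL write first, then the data file."""
    # 1. Write new transaction to WAL file (append-only).
    wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(wal_fd, record)
        # 2. Issue fsync: the thread blocks here until the WAL record is on disk.
        os.fsync(wal_fd)
    finally:
        os.close(wal_fd)

    # 3. Write new data to database file. No fsync here: if this write is
    #    lost in a crash, it can be replayed from the WAL.
    db_fd = os.open(db_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(db_fd, page, offset)
    finally:
        os.close(db_fd)

    # 4. Return success.
    return True
```

The blocking point is step 2; in the fence variant that wait would move to the end and could overlap with other work.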

tanelpoder · on April 19, 2022

You have a good point that the app should track the difference between I/O submission times and I/O completion times. Can't speak for MySQL, but Oracle nowadays has two "wait events" for the DB Writer flow (that syncs dirty buffers from its buffer cache to disk):

1. db file async I/O submit

2. db file parallel write

(the 2nd one has such a name for historical reasons, but it's the I/O reaping/completion check timing, not the submission).
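As an illustration of tracking the two timings separately inside an application, here is a small Python sketch. It is not how Oracle implements its wait events: a thread pool stands in for libaio, and the names `timed_write_batch`, `submit_s` and `reap_s` are invented for this example.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor, wait


def timed_write_batch(fd, chunks, pool):
    """Submit (offset, data) writes, then reap them, timing each phase."""
    t0 = time.perf_counter()
    # Submission phase: hand the writes off without waiting for them.
    futures = [pool.submit(os.pwrite, fd, data, off) for off, data in chunks]
    t1 = time.perf_counter()
    # Reaping phase: the "willing to wait" part, blocking until all complete.
    wait(futures)
    for fut in futures:
        fut.result()  # re-raise any I/O error from the worker
    t2 = time.perf_counter()
    return {"submit_s": t1 - t0, "reap_s": t2 - t1}


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f, ThreadPoolExecutor(4) as pool:
        chunks = [(i * 4096, b"x" * 4096) for i in range(64)]
        print(timed_write_batch(f.fileno(), chunks, pool))
```

Exporting the reap time as its own metric is what makes the S-state waits visible, since they show up in neither load average nor PSI.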