Description
Is your feature request related to a problem? Please describe.
The problem is trying to use modern SSDs to their maximum performance for random I/O (particularly random reads) on normal files (not raw block devices), across multiple cores/capabilities. To do this one needs two things: good async I/O APIs and the ability to open files in a mode that bypasses the page cache. Bypassing the page cache is needed to achieve the maximum IOPS, especially when submitting I/O operations from many OS threads at once (that is, from many RTS capabilities). Good async I/O APIs are out of scope for this feature request.
A similar problem is wanting to do lots of random I/O while optimising the memory of the host system by not polluting the page cache with disk pages that will only be used once (to make best use of the page cache for other files that are used). Again for this use case one wants to open a file in a mode that bypasses or suppresses the page cache.
Another similar problem is wanting to do disk I/O performance benchmarking, and one needs to work around the caching that the OS does: either by dropping caches before a run and avoiding re-reading the same page twice, or avoiding caching altogether.
Describe the solution you'd like
The solution is to allow opening a file in a mode that attempts to suppress or eliminate the use of disk/page caching for this use of this file. This is a feature that all widely used unix-like OSs support, but it is not standardised by POSIX:
- On Linux this is the `O_DIRECT` flag to `open(2)`.
- On FreeBSD this is the `O_DIRECT` flag to `open(2)`.
- On OSX this is done using `F_NOCACHE` to `fcntl(2)` (link here is to the iPhoneOS man page version because apple removed the online rendered version of the desktop man pages).
For platforms that do not support any of these methods, the fallback should simply be to do nothing. The semantics of continuing to do caching is contained within the semantics of no caching (but with different performance characteristics).
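As a rough illustration of the three cases above, here is a minimal C sketch of what the underlying portable open could look like. The name `open_nocache` is hypothetical and not part of any existing library. Retrying without `O_DIRECT` when the filesystem rejects it with `EINVAL` is an assumption made here, in the spirit of the "fall back to doing nothing" rule; the real implementation may well choose differently.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc only exposes O_DIRECT when this is defined */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path`, asking the OS to avoid the page cache
 * where a mechanism exists. Returns a file descriptor, or -1 on error. */
int open_nocache(const char *path, int flags, mode_t mode)
{
#if defined(O_DIRECT)
    /* Linux, FreeBSD: request direct I/O at open time. */
    int fd = open(path, flags | O_DIRECT, mode);
    if (fd == -1 && errno == EINVAL) {
        /* Assumption: a filesystem that rejects O_DIRECT is treated like an
         * unsupported platform, so retry with ordinary cached I/O. */
        fd = open(path, flags, mode);
    }
    return fd;
#elif defined(F_NOCACHE)
    /* OSX: open normally, then turn data caching off on the descriptor. */
    int fd = open(path, flags, mode);
    if (fd >= 0 && fcntl(fd, F_NOCACHE, 1) == -1) {
        close(fd);
        return -1;
    }
    return fd;
#else
    /* No known mechanism: behave exactly like open(2). */
    return open(path, flags, mode);
#endif
}
```

A caller would use it like `open(2)`, for example `open_nocache("data.bin", O_RDONLY, 0)`, and cannot tell from the return value alone whether caching was actually suppressed.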
Note also that, since we will document the semantics as trying to do less or no caching, we need not worry about the slight difference in page-cache behaviour between OSX on the one hand and FreeBSD and Linux on the other. (OSX will use cached pages for the file if they are already present, while Linux will ignore cached pages even if they are already present. This difference is only relevant for I/O benchmarks, and such programs need to be aware of a lot of platform-specific details already.)
The feature should be implemented as an extra boolean flag in the `OpenFileFlags` record. The name of this field should be descriptive since there is no POSIX name to follow (and different platforms call it different things, so e.g. `direct` would be inappropriate). Suggestions include `noCache :: Bool`, since that's simply descriptive (though it happens to be what OSX uses too).
Additionally (and this is a matter of API design tastes where reasonable people may differ) one may wish to provide some feature flag that one can test to see if support is present (since no exception will be thrown if it is not present).
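One possible shape for such a support test, sketched in C at the level where the platform check would naturally live. The enum and both function names are hypothetical. Note that this only reports which mechanism was available at compile time; it says nothing about whether a particular filesystem honours it at run time.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc only exposes O_DIRECT when this is defined */
#endif
#include <fcntl.h>

/* Hypothetical names, for illustration only. */
enum nocache_kind {
    NOCACHE_NONE,       /* no known mechanism: the flag would be a no-op */
    NOCACHE_O_DIRECT,   /* O_DIRECT flag to open(2): Linux, FreeBSD */
    NOCACHE_F_NOCACHE   /* F_NOCACHE command to fcntl(2): OSX */
};

/* Which cache-bypass mechanism, if any, was available at compile time. */
enum nocache_kind nocache_detect(void)
{
#if defined(O_DIRECT)
    return NOCACHE_O_DIRECT;
#elif defined(F_NOCACHE)
    return NOCACHE_F_NOCACHE;
#else
    return NOCACHE_NONE;
#endif
}

/* Non-zero if requesting "no cache" can have any effect on this platform. */
int nocache_supported(void)
{
    return nocache_detect() != NOCACHE_NONE;
}
```

On the Haskell side this could surface as a plain `Bool` constant, which keeps the check pure and cheap.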
The documentation for the feature should also clearly describe that when using this feature, some platforms impose additional constraints on the alignment of file reads/writes and the memory buffers used for reads/writes. Optionally it may also make sense to provide some constants to give the most portable values for disk and memory alignment, or an action to obtain these alignment hints. Feedback on this aspect of the API is welcome.
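To make the alignment constraint concrete, here is a small C sketch of the bookkeeping a caller typically ends up doing: widening a requested byte range to aligned boundaries and allocating an aligned buffer. The 4096-byte value is an assumption chosen for illustration, not a portable guarantee (the real requirement depends on the platform, filesystem and device), and all names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for posix_memalign */
#endif
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for file offsets, lengths and buffer addresses. */
#define IO_ALIGN 4096u

/* Round x down or up to a multiple of align (align must be a power of two). */
static uint64_t align_down(uint64_t x, uint64_t align) { return x & ~(align - 1); }
static uint64_t align_up(uint64_t x, uint64_t align)   { return (x + align - 1) & ~(align - 1); }

struct aligned_window {
    uint64_t offset;   /* aligned file offset to read from */
    uint64_t length;   /* aligned number of bytes to read */
};

/* Smallest aligned window that covers the bytes [offset, offset + length). */
struct aligned_window cover(uint64_t offset, uint64_t length)
{
    struct aligned_window w;
    w.offset = align_down(offset, IO_ALIGN);
    w.length = align_up(offset + length, IO_ALIGN) - w.offset;
    return w;
}

/* Allocate a buffer whose address is a multiple of IO_ALIGN.
 * Returns NULL on failure; release with free(). */
void *alloc_aligned(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, IO_ALIGN, size) != 0)
        return NULL;
    return p;
}
```

For example, a request for 100 bytes at offset 5000 becomes a read of 4096 bytes at offset 4096, and the caller then picks the wanted bytes out of the buffer starting at position 5000 - 4096 = 904.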
Describe alternatives you've considered
The alternative is an extension package, `unix-odirect` or something, with just the file open support and nothing else.
Additional context
My colleagues and I are happy to implement this feature, including docs, and to shepherd it through PR review.
Related older tickets: #48 and #6. But these propose just using and exposing the non-portable `O_DIRECT` rather than trying to provide portable support.
API breaking changes
It would be an extra member of the `OpenFileFlags` record, with a default (normal caching behaviour) in the `defaultFileFlags` value. So this should not break most existing library users, who create the `OpenFileFlags` record value by overriding `defaultFileFlags` rather than using the raw constructor.
Posix compliance
This feature is available in all major POSIX-compatible OSs (and even on Windows), but it is not standardised by POSIX.
Relevant excerpts from man pages (linked above):
- Linux `open` `O_DIRECT`:
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
- FreeBSD `open` `O_DIRECT`:
O_DIRECT may be used to minimize or eliminate the cache effects of reading and writing. The system will attempt to avoid caching the data you read or write. If it cannot avoid caching the data, it will minimize the impact the data has on the cache. Use of this flag can drastically reduce performance if not used with care.
- OSX `fcntl` `F_NOCACHE`:
Turns data caching off/on. A non-zero value in arg turns data caching off. A value of zero in arg turns data caching on.