We can precompute the key powers h = [H^N, H^(N-1), ..., H]
and then process N blocks at a time. On a 2020 M1, a stride of 8 runs at about 0.17 cycles per byte, whereas a stride of 1 runs at about 1.4 cycles per byte: an ~8x improvement.
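To make the idea concrete, here is a rough Python sketch of the strided evaluation next to the usual one-block-at-a-time loop. It is not my actual implementation: `gf_mul`, `powers`, `eval_serial` and `eval_strided` are names made up for this example, the multiplication is a slow bit-serial one, and it leaves out the x^-128 factor that POLYVAL's `dot` includes (RFC 8452), so it will not reproduce real POLYVAL outputs. It only shows that both loops compute the same polynomial.

```python
import secrets

# Reduction polynomial x^128 + x^127 + x^126 + x^121 + 1 (the POLYVAL field).
POLY = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1


def gf_mul(a: int, b: int) -> int:
    """Bit-serial multiply in GF(2^128). Illustration only: slow, and it
    omits the x^-128 factor used by POLYVAL's dot()."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> 128:  # degree reached 128, reduce
            a ^= POLY
    return r


def powers(h: int, n: int) -> list[int]:
    """Return [h^n, h^(n-1), ..., h]."""
    out = [h]
    for _ in range(n - 1):
        out.append(gf_mul(out[-1], h))
    return out[::-1]


def eval_serial(h: int, blocks: list[int]) -> int:
    """Stride 1: Horner's rule, one multiply per block."""
    s = 0
    for x in blocks:
        s = gf_mul(s ^ x, h)
    return s


def eval_strided(h: int, blocks: list[int], n: int = 8) -> int:
    """Stride n: fold n blocks per step using the precomputed powers."""
    hp = powers(h, n)
    s = 0
    full = len(blocks) - len(blocks) % n
    for i in range(0, full, n):
        # (s + X_0)*H^n + X_1*H^(n-1) + ... + X_(n-1)*H
        acc = gf_mul(s ^ blocks[i], hp[0])
        for j in range(1, n):
            acc ^= gf_mul(blocks[i + j], hp[j])
        s = acc
    # Leftover blocks fall back to the serial loop.
    for x in blocks[full:]:
        s = gf_mul(s ^ x, h)
    return s


if __name__ == "__main__":
    h = secrets.randbits(128)
    msg = [secrets.randbits(128) for _ in range(19)]
    assert eval_serial(h, msg) == eval_strided(h, msg, n=8)
```

The sketch reduces after every multiply, so it gains nothing by itself. In an optimized version the N products within a group are independent of each other, so they can be pipelined, and the unreduced products can typically be summed and reduced once per group rather than once per block.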
Note that this isn't always desirable. For example, HCTR2 computes POLYVAL over single blocks, and the overhead of constructing and cleaning up an N-wide POLYVAL hurts performance there. So we probably need to offer both a "wide" and a "lite" implementation.
I'm happy to donate my implementation (x86 and aarch64).