
Binary GCD #755

Open · wants to merge 73 commits into master
Conversation

@erik-3milabs (Contributor)

This PR introduces an implementation of Optimized Binary GCD. Ref: Pornin, Algorithm 2.

Upsides to this technique:

  • it is up to 27x faster than the gcd algorithm currently implemented in crypto_bigint (see below); strictly speaking, it just has a different complexity bound.
  • it does not need UNSAT_LIMBS.
  • it is actually constant-time (unlike the current implementation, which is sneakily variable-time in the maximum of the bit sizes of the two operands).
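
For context, the algorithm family being optimized is binary GCD. A minimal single-word sketch of the classic (variable-time) version is below; this is purely illustrative and is not the constant-time, multi-limb, word-batched Algorithm 2 that this PR implements:

```rust
// Classic binary GCD on u64: an illustrative, variable-time sketch.
// Pornin's Algorithm 2 batches many of these odd/even steps per iteration
// and runs them in constant time over multi-limb Uints.
fn binary_gcd(mut a: u64, mut b: u64) -> u64 {
    if a == 0 {
        return b;
    }
    if b == 0 {
        return a;
    }
    // gcd(a, b) = 2^k * gcd(a >> i, b >> j), with k = min(i, j).
    let shift = (a | b).trailing_zeros();
    a >>= a.trailing_zeros();
    loop {
        b >>= b.trailing_zeros(); // b is now odd
        if a > b {
            std::mem::swap(&mut a, &mut b);
        }
        b -= a; // odd - odd = even; the gcd is preserved
        if b == 0 {
            break;
        }
    }
    a << shift
}
```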

Benchmark results

Word = u64

| limbs | gcd (vt) | gcd (ct) | new_gcd (ct) |
|------:|---------:|---------:|-------------:|
| 2 | 10.687 µs | 20.619 µs | 3.6090 µs |
| 4 | 29.121 µs | 56.433 µs | 7.1124 µs |
| 8 | 99.819 µs | 195.02 µs | 16.184 µs |
| 16 | 359.39 µs | 710.26 µs | 44.294 µs |
| 32 | 1.6804 ms | 3.3097 ms | 136.49 µs |
| 64 | 6.9717 ms | 13.028 ms | 494.16 µs |
| 128 | 29.099 ms | 57.325 ms | 2.3335 ms |
| 256 | 143.22 ms | 244.89 ms | 8.7722 ms |

@erik-3milabs (Contributor, Author)

Update:

  • I refactored the code, improving readability
  • I introduced the "small" version of the algorithm that achieves slightly faster results for smaller instances (i.e., uints with fewer limbs) than the original "large" version.

Benchmarks below.
Note:

  • The original gcd algorithm is unexpectedly slow for 4 limbs. Any clue what is up with that?
  • Based on these results, I have bingcd select the "small" variant up to 7 limbs and the "large" beyond that. Anything else I can do to improve this?

| limbs | gcd | bingcd | factor | bingcd (small) | bingcd (large) |
|------:|----:|-------:|-------:|---------------:|---------------:|
| 1 | 2.6475 µs | 363.93 ns | 7.3x | 342.17 ns | 1.1385 µs |
| 2 | 6.9068 µs | 1.3533 µs | 5.1x | 1.2378 µs | 3.2917 µs |
| 3 | 12.677 µs | 2.7281 µs | 4.6x | 2.5826 µs | 5.5247 µs |
| 4 | 42.896 µs | 4.7267 µs | 9.1x | 4.5286 µs | 6.6132 µs |
| 5 | 27.987 µs | 7.1246 µs | 3.9x | 6.8353 µs | 10.100 µs |
| 6 | 38.466 µs | 10.136 µs | 3.8x | 9.9228 µs | 11.977 µs |
| 7 | 49.654 µs | 13.621 µs | 3.6x | 13.614 µs | 14.921 µs |
| 8 | 63.467 µs | 15.749 µs | 4.0x | 17.757 µs | 15.545 µs |
| 16 | 632.80 µs | 44.272 µs | 14.3x | 76.077 µs | 44.593 µs |
| 32 | 3.1699 ms | 137.51 µs | 23.1x | 447.11 µs | 137.29 µs |
| 64 | 12.827 ms | 489.57 µs | 26.2x | 1.7868 ms | 499.83 µs |
| 128 | 56.750 ms | 2.3381 ms | 24.3x | 8.7670 ms | 2.3186 ms |
| 256 | 243.97 ms | 8.7289 ms | 27.9x | 34.823 ms | 8.7274 ms |

@erik-3milabs (Contributor, Author) commented Feb 3, 2025

There is still significant room for improvement for the "large" variant. In each of its (2*Uint::BITS-1)/62 outer loop iterations, it multiplies a and b by an i64 matrix that shrinks their size by at least 62 bits. After iteration X, we thus know that either

  • a and b are at most $BITS - 62X$ bits long, or
  • the matrix is the unit matrix.

In the first case, we can thus speed up this multiplication by not considering some of the top limbs of a and b. Over the full algorithm, this would yield a ~1.8x reduction in the time spent on those multiplications.

Aside from these multiplications, the "large" variant is linear in Uint::BITS. Interpolating from the benchmarks, the fraction of time spent on these multiplications is roughly $L/(L+6.85)$ for $L$ limbs, which already equals ~70% for L=16 (U1024). The overall benefit would thus be ~30% for U1024, and increasingly more for larger limb counts.

This would require the development of a "bounded" multiplication feature. Would this project be interested in that?
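
To illustrate the idea (a hypothetical sketch, not crypto-bigint API; the name `bounded_mul_word` and the slice-based representation are mine): multiply a little-endian limb vector by a single word, but process only the first `bound` limbs, relying on the invariant that the higher limbs are already zero after enough shrinking iterations.

```rust
// Hypothetical "bounded" multiplication sketch: multiply a little-endian
// limb vector by one word, touching only the first `bound` limbs. The
// caller guarantees limbs[bound..] are all zero (as holds after X outer
// iterations, when a and b fit in BITS - 62X bits).
fn bounded_mul_word(limbs: &[u64], word: u64, bound: usize) -> Vec<u64> {
    let bound = bound.min(limbs.len());
    let mut out = vec![0u64; limbs.len() + 1];
    let mut carry: u128 = 0;
    for i in 0..bound {
        let t = (limbs[i] as u128) * (word as u128) + carry;
        out[i] = t as u64; // low 64 bits of the partial product
        carry = t >> 64;   // high 64 bits carry into the next limb
    }
    out[bound] = carry as u64; // limbs above `bound` stay zero
    out
}
```

The saving comes from iterating over `bound` limbs instead of all of them; a constant-time variant would fix `bound` per outer iteration (a public quantity), not per input value.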

Spoiler: the upcoming xgcd feature I'm developing would benefit from this roughly 3x as much.

@erik-3milabs (Contributor, Author)

@tarcieri @ycscaly, I think this PR has stabilized for review (truly now 🙈). Looking forward to your reviews!

@dolevmu left a comment

Looks great overall. I verified that the implementation is constant-time. After going over the code, I offered a few annotations and documentation suggestions to clarify it, and suggested some changes to boost performance. I think they are worth reading and considering, but I approve this.

Great work, thanks @erik-3milabs!

Comment on lines 30 to 31
// Todo: tweak this threshold
if LIMBS < 8 {

Good question, but it looks like you answered it yourself: the problem is that the answer may depend on the machine. Suppose I gave you an answer, would you know what to do with it in the code, e.g., if I said "this threshold is optimal for 64-bit architectures and that one for 32-bit"? If so, let's check on another machine.

b = Uint::select(&b, &b.wrapping_sub(&a), b_odd);
matrix.conditional_subtract_top_row_from_bottom(b_odd);

// Div b by two and double the top row of the matrix when a, b ≠ 0.

According to Algorithm 2, you always multiply by 2, but I suppose you know what you are doing and the paper is wrong on this edge case.


Also, it multiplies the second row and not the first one; in the paper, (f1, g1) is the pair multiplied by two.

@erik-3milabs (Contributor, Author)

  1. This is a modification I made to the algorithm. It prevents integer overflows in an upcoming PR. I've updated the docstring of this function to indicate this.
  2. Yes, but this code has a and b swapped, so I should be doubling the first row ;-)

Note: that doubling problem was fixed because I reverted the (a, b) swap.

@erik-3milabs (Contributor, Author)

Turns out, this is wrong: this leads to buggy behavior for specific pairs of numbers. I'm refactoring some code to bypass this issue.

@erik-3milabs (Contributor, Author)

Fixed in #761

@@ -25,6 +25,12 @@ impl<const LIMBS: usize> Uint<LIMBS> {
Uint { limbs }
}

/// Swap `a` and `b` if `c` is truthy, otherwise, do nothing.
#[inline]
pub(crate) const fn conditional_swap(a: &mut Self, b: &mut Self, c: ConstChoice) {

In light of my previous comment, it is worth considering whether we can efficiently select between two vectors or matrices, and similarly swap columns/rows efficiently (which you did implement).
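
For reference, a branch-free masked swap along these lines can be sketched on a single word (illustrative only; the multi-limb `Uint::conditional_swap` applies the same XOR trick limb by limb):

```rust
// Branch-free conditional swap: `mask` is all-ones to swap, all-zeros to
// keep. Single-word sketch of the trick behind a constant-time swap.
fn conditional_swap(a: &mut u64, b: &mut u64, mask: u64) {
    let t = mask & (*a ^ *b); // a ^ b if swapping, 0 otherwise
    *a ^= t;
    *b ^= t;
}
```

The same three-XOR pattern extends to whole rows or columns of a matrix by looping over their limbs, which is why row/column swaps can be done without data-dependent branches.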

@erik-3milabs (Contributor, Author)

@tarcieri do you have any ideas on improving the performance of this and other swapping operations?

Uint::conditional_swap(&mut a, &mut b, do_swap);

// subtract b from a when a is odd
a = a.wrapping_sub(&Uint::select(&Uint::ZERO, &b, a_odd));
@erik-3milabs (Contributor, Author) commented Feb 7, 2025

@tarcieri what do you think of this line? Previously, it was like this:

a = Uint::select(&a, &a.wrapping_sub(&b), a_odd);

The current code is 10-25% faster for Uints with few limbs (1, 2, 3, etc.).
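
On a single word, the two formulations compare as follows (illustrative sketch; `mask` stands in for the a_odd ConstChoice, and the function names are mine):

```rust
// Two equivalent conditional-subtract formulations, single-word sketch.
// `mask` is all-ones iff `a` is odd (stand-in for the a_odd ConstChoice).
fn step_old(a: u64, b: u64) -> u64 {
    let mask = 0u64.wrapping_sub(a & 1);
    // previous form: select between `a` and `a - b`
    a ^ (mask & (a ^ a.wrapping_sub(b)))
}

fn step_new(a: u64, b: u64) -> u64 {
    let mask = 0u64.wrapping_sub(a & 1);
    // current form: subtract select(ZERO, b); the select collapses to an AND
    a.wrapping_sub(mask & b)
}
```

Both return `a - b` (wrapping) when `a` is odd and `a` unchanged when it is even; the second form just gives the compiler less work per word.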

Member

I'm surprised there's that much of a difference. Are you sure it's always faster or is it faster depending on a?

@erik-3milabs (Contributor, Author) commented Feb 14, 2025

I'm pretty confident it is always faster.

The reason I think this is faster is that we are now selecting between a constant and a variable, instead of between two variables. Given that select is const and loves to be inlined, the compiler can optimize the select operation.

Recall, Uint::select calls Limb::select, which in turn calls

impl ConstChoice {
    /// Return `b` if `self` is truthy, otherwise return `a`.
    #[inline]
    pub(crate) const fn select_word(&self, a: Word, b: Word) -> Word {
        a ^ (self.0 & (a ^ b))
    }
}

When a is the constant ZERO, this can be optimized to:

        self.0 & b

saving two XOR operations, or two thirds of this operation.
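
The equivalence is easy to check directly (a quick sketch; `mask` must be all-zeros or all-ones, as with ConstChoice):

```rust
// select_word as in ConstChoice, plus the zero-specialized form.
fn select_word(mask: u64, a: u64, b: u64) -> u64 {
    a ^ (mask & (a ^ b)) // generic: 2 XORs + 1 AND
}

fn select_word_zero(mask: u64, b: u64) -> u64 {
    mask & b // a == 0 folds both XORs away: 1 AND
}
```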

Returning to the gcd subroutine, this select is in the hot loop of this algorithm. In total, the loop executes:

  • Uint::is_odd (1 op)
  • Uint::lt (2 ops/word),
  • ConstChoice::and (1 op),
  • Uint::wrapping_sub (4 ops/word),
  • Uint::select (3 ops/word -> 1 ops/word)
  • Uint::shr (3 ops/word)

So, there is a reduction from 12 to 10 ops/word, or a 17% improvement.

Member

> When a is the constant ZERO

Aah, ok

@erik-3milabs erik-3milabs mentioned this pull request Feb 14, 2025
@tarcieri (Member)

@erik-3milabs I just set master to v0.7.0-pre in #765

This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead.
