
Binary GCD #755

Open · wants to merge 73 commits into master
Conversation

@erik-3milabs (Contributor)

This PR introduces an implementation of Optimized Binary GCD. Ref: Pornin, Algorithm 2.

Upsides to this technique:

  • it is up to 27x faster than the gcd algorithm currently implemented in crypto_bigint (see below); strictly speaking, it just has a different complexity bound.
  • it does not need UNSAT_LIMBS.
  • it is actually constant-time (unlike the current implementation, which is sneakily variable-time in the maximum of the bit sizes of the two operands).
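
For context, the algorithm family being optimized is binary GCD. A minimal single-word sketch of the classic (variable-time) version is below; this is purely illustrative and is not the constant-time, multi-limb, word-batched Algorithm 2 that this PR implements:

```rust
// Classic binary GCD on u64: an illustrative, variable-time sketch.
// Pornin's Algorithm 2 batches many of these odd/even steps per iteration
// and runs them in constant time over multi-limb Uints.
fn binary_gcd(mut a: u64, mut b: u64) -> u64 {
    if a == 0 {
        return b;
    }
    if b == 0 {
        return a;
    }
    // gcd(a, b) = 2^k * gcd(a >> i, b >> j), with k = min(i, j).
    let shift = (a | b).trailing_zeros();
    a >>= a.trailing_zeros();
    loop {
        b >>= b.trailing_zeros(); // b is now odd
        if a > b {
            std::mem::swap(&mut a, &mut b);
        }
        b -= a; // odd - odd = even; the gcd is preserved
        if b == 0 {
            break;
        }
    }
    a << shift
}
```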

Benchmark results

Word = u64

| limbs | gcd (vt) | gcd (ct) | new_gcd (ct) |
|------:|---------:|---------:|-------------:|
| 2 | 10.687 µs | 20.619 µs | 3.6090 µs |
| 4 | 29.121 µs | 56.433 µs | 7.1124 µs |
| 8 | 99.819 µs | 195.02 µs | 16.184 µs |
| 16 | 359.39 µs | 710.26 µs | 44.294 µs |
| 32 | 1.6804 ms | 3.3097 ms | 136.49 µs |
| 64 | 6.9717 ms | 13.028 ms | 494.16 µs |
| 128 | 29.099 ms | 57.325 ms | 2.3335 ms |
| 256 | 143.22 ms | 244.89 ms | 8.7722 ms |

@erik-3milabs (Contributor, Author)

Update:

  • I refactored the code, improving readability
  • I introduced the "small" version of the algorithm that achieves slightly faster results for smaller instances (i.e., uints with fewer limbs) than the original "large" version.

Benchmarks below.
Note:

  • The original gcd algorithm is unexpectedly slow for 4 limbs. Any clue what is up with that?
  • Based on these results, I have bingcd select the "small" variant up to 7 limbs and the "large" beyond that. Anything else I can do to improve this?

| limbs | gcd | bingcd | factor | bingcd (small) | bingcd (large) |
|------:|----:|-------:|-------:|---------------:|---------------:|
| 1 | 2.6475 µs | 363.93 ns | 7.3x | 342.17 ns | 1.1385 µs |
| 2 | 6.9068 µs | 1.3533 µs | 5.1x | 1.2378 µs | 3.2917 µs |
| 3 | 12.677 µs | 2.7281 µs | 4.6x | 2.5826 µs | 5.5247 µs |
| 4 | 42.896 µs | 4.7267 µs | 9.1x | 4.5286 µs | 6.6132 µs |
| 5 | 27.987 µs | 7.1246 µs | 3.9x | 6.8353 µs | 10.100 µs |
| 6 | 38.466 µs | 10.136 µs | 3.8x | 9.9228 µs | 11.977 µs |
| 7 | 49.654 µs | 13.621 µs | 3.6x | 13.614 µs | 14.921 µs |
| 8 | 63.467 µs | 15.749 µs | 4.0x | 17.757 µs | 15.545 µs |
| 16 | 632.80 µs | 44.272 µs | 14.3x | 76.077 µs | 44.593 µs |
| 32 | 3.1699 ms | 137.51 µs | 23.1x | 447.11 µs | 137.29 µs |
| 64 | 12.827 ms | 489.57 µs | 26.2x | 1.7868 ms | 499.83 µs |
| 128 | 56.750 ms | 2.3381 ms | 24.3x | 8.7670 ms | 2.3186 ms |
| 256 | 243.97 ms | 8.7289 ms | 27.9x | 34.823 ms | 8.7274 ms |

@erik-3milabs (Contributor, Author) commented Feb 3, 2025

There is still significant room for improvement for the "large" variant. In each of its (2*Uint::BITS-1)/62 outer loop iterations, it multiplies a and b by an i64 matrix that shrinks their size by at least 62 bits. After iteration X, we thus know that either

  • a and b are at most $BITS - 62X$ bits long, or
  • the matrix is the unit matrix.

In the first case, we can thus speed up this multiplication by not considering some of the top limbs of a and b. Over the full algorithm, this would yield a ~1.8x reduction in the time spent on those multiplications.

Aside from these multiplications, the "large" variant is linear in Uint::BITS. Interpolating from the benchmarks, the fraction of time spent on these multiplications is roughly $L/(L+6.85)$ for $L$ limbs, which already equals ~70% for L=16 (U1024). The overall benefit would thus be ~30% for U1024, and increasingly more for larger limb counts.

This would require the development of a "bounded" multiplication feature. Would this project be interested in that?
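
To illustrate the idea (a hypothetical sketch, not crypto-bigint API; the name `bounded_mul_word` and the slice-based representation are mine): multiply a little-endian limb vector by a single word, but process only the first `bound` limbs, relying on the invariant that the higher limbs are already zero after enough shrinking iterations.

```rust
// Hypothetical "bounded" multiplication sketch: multiply a little-endian
// limb vector by one word, touching only the first `bound` limbs. The
// caller guarantees limbs[bound..] are all zero (as holds after X outer
// iterations, when a and b fit in BITS - 62X bits).
fn bounded_mul_word(limbs: &[u64], word: u64, bound: usize) -> Vec<u64> {
    let bound = bound.min(limbs.len());
    let mut out = vec![0u64; limbs.len() + 1];
    let mut carry: u128 = 0;
    for i in 0..bound {
        let t = (limbs[i] as u128) * (word as u128) + carry;
        out[i] = t as u64; // low 64 bits of the partial product
        carry = t >> 64;   // high 64 bits carry into the next limb
    }
    out[bound] = carry as u64; // limbs above `bound` stay zero
    out
}
```

The saving comes from iterating over `bound` limbs instead of all of them; a constant-time variant would fix `bound` per outer iteration (a public quantity), not per input value.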

Spoiler: the upcoming xgcd feature I'm developing would benefit from this roughly 3x as much.

@erik-3milabs (Contributor, Author)

@tarcieri @ycscaly, I think this PR has stabilized for review (truly now 🙈). Looking forward to your reviews!

@dolevmu left a comment

Looks great overall. I verified that the implementation is constant-time. After going over the code, I offered a few annotations and documentation suggestions to clarify it, and suggested some changes to boost performance. I think they are worth reading and considering, but I approve this.

Great work, thanks @erik-3milabs!

Comment on lines 30 to 31
// Todo: tweak this threshold
if LIMBS < 8 {

Good question, but it looks like you answered it yourself: the problem is that the answer may depend on the machine. Suppose I gave you an answer, would you know what to do with it in the code, e.g., if I said "this threshold is optimal for 64-bit architectures and that one for 32-bit"? If so, let's check on another machine.

b = Uint::select(&b, &b.wrapping_sub(&a), b_odd);
matrix.conditional_subtract_top_row_from_bottom(b_odd);

// Div b by two and double the top row of the matrix when a, b ≠ 0.

According to Algorithm 2, you always multiply by 2, but I suppose you know what you are doing and the paper is wrong on this edge case.


Also, it multiplies the second row and not the first one; in the paper, (f1, g1) is the pair multiplied by two.

@erik-3milabs (Contributor, Author)

  1. This is a modification I made to the algorithm. It prevents integer overflows in an upcoming PR. I've updated the docstring of this function to indicate this.
  2. Yes, but this code has a and b swapped, so I should be doubling the first row ;-)

Note: that doubling problem was fixed because I reverted the (a, b) swap.

@erik-3milabs (Contributor, Author)

Turns out, this is wrong: this leads to buggy behavior for specific pairs of numbers. I'm refactoring some code to bypass this issue.

@erik-3milabs (Contributor, Author)

Fixed in #761

@@ -25,6 +25,12 @@ impl<const LIMBS: usize> Uint<LIMBS> {
Uint { limbs }
}

/// Swap `a` and `b` if `c` is truthy, otherwise, do nothing.
#[inline]
pub(crate) const fn conditional_swap(a: &mut Self, b: &mut Self, c: ConstChoice) {

In light of my previous comment, it is worth considering whether we can efficiently select between two vectors or matrices, and similarly swap columns/rows efficiently (which you did implement).
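
For reference, a branch-free masked swap along these lines can be sketched on a single word (illustrative only; the multi-limb `Uint::conditional_swap` applies the same XOR trick limb by limb):

```rust
// Branch-free conditional swap: `mask` is all-ones to swap, all-zeros to
// keep. Single-word sketch of the trick behind a constant-time swap.
fn conditional_swap(a: &mut u64, b: &mut u64, mask: u64) {
    let t = mask & (*a ^ *b); // a ^ b if swapping, 0 otherwise
    *a ^= t;
    *b ^= t;
}
```

The same three-XOR pattern extends to whole rows or columns of a matrix by looping over their limbs, which is why row/column swaps can be done without data-dependent branches.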

@erik-3milabs (Contributor, Author)

@tarcieri do you have any ideas on improving the performance of this and other swapping operations?

Uint::conditional_swap(&mut a, &mut b, do_swap);

// subtract b from a when a is odd
a = a.wrapping_sub(&Uint::select(&Uint::ZERO, &b, a_odd));
@erik-3milabs (Contributor, Author) commented Feb 7, 2025

@tarcieri what do you think of this line? Previously, it was like this:

a = Uint::select(&a, &a.wrapping_sub(&b), a_odd);

The current code is 10-25% faster for Uints with few limbs (1, 2, 3, etc.).
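
On a single word, the two formulations compare as follows (illustrative sketch; `mask` stands in for the a_odd ConstChoice, and the function names are mine):

```rust
// Two equivalent conditional-subtract formulations, single-word sketch.
// `mask` is all-ones iff `a` is odd (stand-in for the a_odd ConstChoice).
fn step_old(a: u64, b: u64) -> u64 {
    let mask = 0u64.wrapping_sub(a & 1);
    // previous form: select between `a` and `a - b`
    a ^ (mask & (a ^ a.wrapping_sub(b)))
}

fn step_new(a: u64, b: u64) -> u64 {
    let mask = 0u64.wrapping_sub(a & 1);
    // current form: subtract select(ZERO, b); the select collapses to an AND
    a.wrapping_sub(mask & b)
}
```

Both return `a - b` (wrapping) when `a` is odd and `a` unchanged when it is even; the second form just gives the compiler less work per word.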

Member

I'm surprised there's that much of a difference. Are you sure it's always faster or is it faster depending on a?

@erik-3milabs (Contributor, Author) commented Feb 14, 2025

I'm pretty confident it is always faster.

The reason I think this is faster is that we are now selecting between a constant and a variable, instead of between two variables. Given that select is const and loves to be inlined, the compiler can optimize the select operation.

Recall, Uint::select calls Limb::select, which in turn calls

impl ConstChoice {
    /// Return `b` if `self` is truthy, otherwise return `a`.
    #[inline]
    pub(crate) const fn select_word(&self, a: Word, b: Word) -> Word {
        a ^ (self.0 & (a ^ b))
    }
}

When a is the constant ZERO, this can be optimized to:

        self.0 & b

saving two XOR operations, or two thirds of this operation.
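
The equivalence is easy to check directly (a quick sketch; `mask` must be all-zeros or all-ones, as with ConstChoice):

```rust
// select_word as in ConstChoice, plus the zero-specialized form.
fn select_word(mask: u64, a: u64, b: u64) -> u64 {
    a ^ (mask & (a ^ b)) // generic: 2 XORs + 1 AND
}

fn select_word_zero(mask: u64, b: u64) -> u64 {
    mask & b // a == 0 folds both XORs away: 1 AND
}
```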

Returning to the gcd subroutine, this select is in the hot loop of this algorithm. In total, the loop executes:

  • Uint::is_odd (1 op)
  • Uint::lt (2 ops/word),
  • ConstChoice::and (1 op),
  • Uint::wrapping_sub (4 ops/word),
  • Uint::select (3 ops/word -> 1 ops/word)
  • Uint::shr (3 ops/word)

So, there is a reduction from 12 to 10 ops/word, or a 17% improvement.

Member

> When a is the constant ZERO

Aah, ok

@erik-3milabs erik-3milabs mentioned this pull request Feb 14, 2025
@tarcieri (Member)

@erik-3milabs I just set master to v0.7.0-pre in #765

This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead.
