in other words, $x'(2^{E})^{-1} \bmod u_0$ is the desired solution, which is obtained by
Algorithm 12.
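The final correction can be illustrated with a minimal Python sketch. It assumes that the modulus (u0 in the text) is odd and removes the factor $2^E$ through $E$ modular halvings. The helper name `div_pow2_mod` is hypothetical, and Algorithm 12 may well proceed differently.

```python
def div_pow2_mod(x, e, n):
    """Return x * 2**(-e) mod n for an odd modulus n (hypothetical helper)."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    x %= n
    for _ in range(e):
        if x & 1:
            x += n   # n is odd, so x + n is even and still congruent to x
        x >>= 1      # exact halving; the result stays in [0, n)
    return x

print(div_pow2_mod(5, 3, 7))  # 5, since 2**3 = 8 is congruent to 1 mod 7
```

Each halving costs at most one addition and one shift, so no general modular inversion of $2^E$ is required.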

    c := (u * Modinv(v, 2^_2t)) mod 2^_2t;
    a := 2^_2t;
    b,d := copy2(0,1);
    while a ge Sqrt(2^_2t) do
        assert (A*e2 + B*b) eq a;
        assert (A*e1 + B*d) eq c;

        q := a div c;
        a := a - q*c;
        a, c := swapt(a,c);

        b := b - q*d;
        b, d := swapt(b,d);
    end while;
    return a,b,c,d;
    end function;

Code 3.7: Magma Code for ReducedRatMod
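The loop of Code 3.7 can be restated as a minimal Python sketch. It assumes odd inputs `u` and `v` and an even bit count `k` (playing the role of `_2t`). The congruences $av \equiv bu$ and $cv \equiv du \pmod{2^k}$ checked inside the loop follow from the update rule; they are not the assertions on `A`, `B`, `e1`, `e2` of the Magma listing. `pow(v, -1, m)` requires Python 3.8 or later.

```python
def reduced_rat_mod(u, v, k):
    """Sketch of the reduction loop; u and v must be odd, k even (assumptions)."""
    m = 1 << k
    c = (u * pow(v, -1, m)) % m       # c = u / v mod 2**k, odd because u and v are odd
    a, b, d = m, 0, 1
    bound = 1 << (k // 2)             # sqrt(2**k)
    while a >= bound:
        q = a // c
        a, c = c, a - q * c           # remainder step followed by the swap
        b, d = d, b - q * d           # same step on the cofactors
        # Congruences maintained by the update rule above.
        assert (a * v - b * u) % m == 0
        assert (c * v - d * u) % m == 0
    return a, b, c, d

print(reduced_rat_mod(23, 1, 4))      # (2, -2, 1, 7)
```

In the printed example, $2 \cdot 1 - (-2) \cdot 23 = 48$ and $1 \cdot 1 - 7 \cdot 23 = -160$ are both multiples of $2^4$, so both combinations lose their four low-order bits exactly.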

    accelModinv := function(u, v, base, s, t, W)
        xdd,xd,ud,vd := copy4(1,0,u,v);
        ydd,yd := copy2(0,1);
        E := 0;
        v,xd,yd,E := MakeOdd(v,xd,yd,E);
        while (v ne 0) do
            assert vd*xd + ud*yd eq u*2^E;
            assert vd*xdd + ud*ydd eq v*2^E;
            a,b,c,d := ReducedRatMod(u mod 2^(2*t), v mod 2^(2*t), 2*t);
            u, v := LinearTransform(u,v,a,b,c,d);
            xd, xdd := LinearTransform(xd,xdd,a,b,c,d);
            yd, ydd := LinearTransform(yd,ydd,a,b,c,d);
            u := RemoveDigits(u, 2*t);
            v := RemoveDigits(v, 2*t);
            u,xdd,ydd,E := MakeOdd(u,xdd,ydd,E);
            v,xd,yd,E := MakeOdd(v,xd,yd,E);
            u,xd,yd := MakePositive(u,xd,yd,u lt 0);
            v,xdd,ydd := MakePositive(v,xdd,ydd,v lt 0);
            u,v,xd,xdd,yd,ydd := Swap(u,v,xd,xdd,yd,ydd,v gt u);
            E +:= 2*t;
        end while;
        return Modinv2e(xd,ud,E);
    end function;

Code 3.8: Magma Code for k-ary Modular Inverse
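One pass through the loop body of Code 3.8 can be sketched in Python as follows. The sketch assumes that `LinearTransform` maps a pair $(p, q)$ to $(aq - bp,\ cq - dp)$, in line with the outputs $av - bu$ and $cv - du$ discussed in Chapter 4, and that `RemoveDigits` is an exact division by $2^{2t}$. The names `kary_step`, `xu` and `xv` are hypothetical, and the congruences modulo $u_0$ used here are a simplification of the exact identities asserted in the Magma code.

```python
def kary_step(u, v, xu, xv, e, coeffs, t):
    """One hypothetical k-ary step: linear transform, then drop 2*t low bits.

    coeffs = (a, b, c, d) must satisfy a*v = b*u and c*v = d*u modulo 2**(2*t).
    """
    a, b, c, d = coeffs
    k = 2 * t
    m = 1 << k
    new_u, new_v = a * v - b * u, c * v - d * u
    assert new_u % m == 0 and new_v % m == 0   # the low 2*t bits are zero
    new_xu, new_xv = a * xv - b * xu, c * xv - d * xu
    return new_u >> k, new_v >> k, new_xu, new_xv, e + k

# Inverting v0 = 7 modulo u0 = 23 with t = 2. The input state satisfies
# xu*v0 = u*2**e and xv*v0 = v*2**e (mod u0), and so does the output state.
u0, v0 = 23, 7
state = kary_step(23, 1, 0, -1, 4, (2, -2, 1, 7), t=2)
u, v, xu, xv, e = state
assert (xu * v0 - u * 2**e) % u0 == 0
assert (xv * v0 - v * 2**e) % u0 == 0
print(state)   # (3, -10, -2, -1, 8)
```

The negative, even value of `v` in the output is what the `MakeOdd` and `MakePositive` calls of Code 3.8 would clean up before the next pass.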

### CHAPTER 4

## SIMD IMPLEMENTATION

Intel’s AVX2 instruction set is currently the most accessible high-end processing platform, since it is available on Haswell and on later processor families, including popular ones such as Skylake and Kaby Lake. It is therefore reasonable to investigate the performance of Algorithm 6 on it. AVX2 provides 16 × 256-bit ymm registers. These registers hold four times as much data as the 16 × 64-bit integer registers. The inputs of Algorithm 6 therefore have the potential to be processed faster on AVX2. This chapter investigates this possibility.

AVX2 is especially valuable where time-consuming operations are concerned. AVX2 instructions process several numbers at a time rather than one by one, and thereby improve application performance. The numbers are packed into AVX2 vectors, each of which is 256 bits wide. AVX2 features are accessed through Intel intrinsics, which are declared in the immintrin.h header file.

In implementing Algorithm 6 on AVX2, the first question that arises is how to represent large integers. There are many possibilities at this stage. In our experience, the choice of representation makes a huge difference in overall performance. We summarize a few options below and explain which one we chose and why.

## 4.1 HIGH LEVEL REPRESENTATION OF DATA

One approach is to work over the four 64-bit lanes, with the lanes
dedicated to $v$, $u$, $x''$, and $x'$. Such an approach looks very simple, cf. Figure 4.1.

This approach leads to very poor utilization of the underlying hardware, since
$u$ and $v$ tend to decrease in size whereas $x'$ and $x''$ tend to increase. However,
when they are kept side by side in vector form, the implementer is forced to
allocate an equal amount of memory for all of them, and several digits are then
processed needlessly. Other problems do exist. For instance, one can easily compute bu,

in vector form.

Figure 4.1 4-way Representation, a first attempt
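The utilization problem can be quantified with a small Python model. It assumes 64-bit limbs and one operand per lane, so that every operand is padded to the limb count of the longest one. The function name and the sample bit lengths are illustrative and not taken from the text.

```python
def lane_utilization(bit_lengths, limb_bits=64):
    """Fraction of lanes holding real digits when operands sit side by side."""
    # Ceiling division gives the limb count of each operand (at least one limb).
    limbs = [max(1, -(-bits // limb_bits)) for bits in bit_lengths]
    rows = max(limbs)                  # every lane is padded to the longest operand
    return sum(limbs) / (len(limbs) * rows)

# Early in the computation u and v are long while the cofactors are short.
print(lane_utilization([1024, 1024, 64, 64]))   # 0.53125
```

With these sample sizes, almost half of the processed lanes would carry padding only.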

Another approach, which solves some of these problems, is to use separate vector
variables for $u$ & $v$ and for $x'$ & $x''$. In this version, the 64-bit lanes of one vector
contain repeated data in the form $u, u$ and $v, v$. Another variable contains
$x', x'$ and $x'', x''$. In this form, $u$ and $v$ share an equal number of digits from the start
to the end of the computation. The same applies to $x'$ and $x''$. This approach partially
solves the digit count problem of the first approach. However, permutations are
still not eliminated. For instance, Figure 4.2 depicts the linear transformation phase
in such a situation.

Figure 4.2 4-way Representation, a second attempt

The output $av - bu$ now needs to be copied over the first two lanes of $v$. The same
applies to $cv - du$, $ax' - bx''$, and $cx' - dx''$. The programmer should avoid such
permutations as much as possible in order to obtain high throughput.

A third approach is to place the limbs of each variable vertically. Figure 4.3 summarizes this situation. The main problem here is the maintenance of carries between limbs. For instance, carries from $a_2$ to $a_3$ would require a sizeable amount of extra code, which would not only cost time but also sacrifice code readability and ease of maintenance.

Figure 4.3 4-way Representation, a third attempt

| Vector | v[i][0] | v[i][1] | v[i][2] | v[i][3] |
|---|---|---|---|---|
| vec1 (v[0]) | $a_0$ | $a_3$ | $a_6$ | $a_9$ |
| vec2 (v[1]) | $a_1$ | $a_4$ | $a_7$ | $a_{10}$ |
| vec3 (v[2]) | $a_2$ | $a_5$ | $a_8$ | $a_{11}$ |
| ... | ... | ... | ... | ... |

Up to this point, every alternative seems to come with a huge disadvantage. Nevertheless, we were able to find the following fine-grained solution.

The representation that we use separates all variables into distinct vector arrays. It places the limbs of a variable first horizontally, in the 64-bit lanes of a vector, and then vertically, over the elements of the vector array. This approach is depicted in Figure 4.4.

Figure 4.4 4-way Representation, the selected approach

| Vector | v[i][0] | v[i][1] | v[i][2] | v[i][3] |
|---|---|---|---|---|
| vec1 (v[0]) | $a_0$ | $a_1$ | $a_2$ | $a_3$ |
| vec2 (v[1]) | $a_4$ | $a_5$ | $a_6$ | $a_7$ |
| vec3 (v[2]) | $a_8$ | $a_9$ | $a_{10}$ | $a_{11}$ |
| ... | ... | ... | ... | ... |
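The index arithmetic behind the selected layout can be sketched in plain Python. This is a model of the layout only, not AVX2 code. It assumes four 64-bit lanes per vector and limbs stored least significant first; the helper names are hypothetical.

```python
LANES = 4  # a 256-bit vector holds four 64-bit lanes

def pack_limbs(limbs):
    """Place limb i in row i // LANES, lane i % LANES, zero-padding the last row."""
    padded = list(limbs) + [0] * (-len(limbs) % LANES)
    return [padded[r:r + LANES] for r in range(0, len(padded), LANES)]

def unpack_limbs(rows):
    """Flatten the rows back into the limb sequence a0, a1, a2, ..."""
    return [limb for row in rows for limb in row]

def to_limbs(n, count):
    """Split a non-negative integer into `count` 64-bit limbs (assumed order: low first)."""
    return [(n >> (64 * i)) & (2**64 - 1) for i in range(count)]

rows = pack_limbs(to_limbs(2**300 + 5, 6))
print(len(rows), rows[0][0], rows[1][0])   # 2 5 17592186044416
```

Because consecutive limbs are also consecutive in memory under this layout, a variable that shrinks simply stops using its trailing rows.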

This final approach has its pros and cons. On the positive side, every variable is maintained separately, so limb access can be limited when it is not needed. In addition, no permutation is needed between lanes. Moreover, the code is considerably more readable than with the other alternatives. However, handling carries and right shifts seems problematic at first glance. We found a programmatic way of minimizing the speed penalties arising from this representation. Our solution is as follows. We concentrate on Figure 4.4 for simplicity. For instance, carries that need to be transferred from $a_3$ to $a_4$ can be handled by a slow permutation operation. However, we want to eliminate all such permutations. At this stage, one can define a vector pointer whose starting address is $a_1$. Then, the vector pointer acts as a 64-bit right-shifted array on

with screen outputs.