Description
Describe the bug
I tried the experimental branch of TornadoVM working with JDK 22, and noticed that the result of the generated OpenCL kernels for FPGAs are not result.
How To Reproduce
I have built TornadoVM as follows:
git checkout feat/jdk22_2nd_iteration
./bin/tornadovm-installer --jdk sapmachine-jdk-22 --backend opencl && source setvars.sh
tornado --version
returns
version=1.0.7-dev
branch=feat/jdk22_2nd_iteration
commit=dec5a87
Backends installed:
- opencl
and I have used the oneAPI base toolkit to see also the Intel FPGA in emulation mode:
tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector
Number of Tornado drivers: 1
Driver: OpenCL
Total number of OpenCL devices : 4
Tornado device=0:0 (DEFAULT)
OPENCL -- [NVIDIA CUDA] -- NVIDIA RTX A2000 8GB Laptop GPU
Global Memory Size: 7.8 GB
Local Memory Size: 48.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [1024]
Max WorkGroup Configuration: [1024, 1024, 64]
Device OpenCL C version: OpenCL C 1.2
Tornado device=0:1
OPENCL -- [Intel(R) OpenCL HD Graphics] -- Intel(R) Graphics [0x46a6]
Global Memory Size: 24.8 GB
Local Memory Size: 64.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [512]
Max WorkGroup Configuration: [512, 512, 512]
Device OpenCL C version: OpenCL C 1.2
Tornado device=0:2
OPENCL -- [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i9-12900H
Global Memory Size: 31.0 GB
Local Memory Size: 32.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [8192]
Max WorkGroup Configuration: [8192, 8192, 8192]
Device OpenCL C version: OpenCL C 3.0
Tornado device=0:3
OPENCL -- [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
Global Memory Size: 31.0 GB
Local Memory Size: 256.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [67108864]
Max WorkGroup Configuration: [67108864, 67108864, 67108864]
Device OpenCL C version: OpenCL C 1.2
Then, I ran the DFTMT.java
class from the dynamic examples (uk.ac.manchester.tornado.examples.dynamic
) on the FPGA emulation platform, and got the output as follows:
tornado --threadInfo --printKernel --jvm="-Ds0.t0.device=0:3" -m tornado.examples/uk.ac.manchester.tornado.examples.dynamic.DFTMT --params="1024 default 1"
WARNING: Using incubator modules: jdk.incubator.vector
Initialization time: 2021238628 ns
Version running: default !
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void computeDFT(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *inreal, __global uchar *inimag, __global uchar *outreal, __global uchar *outimag, __global uchar *inputSize)
{
int i_4, i_36, i_5, i_6, i_10, i_11, i_12, i_31, i_32, i_33;
ulong ul_13, ul_0, ul_15, ul_2, ul_34, ul_1, ul_3, ul_35;
float f_21, f_22, f_23, f_24, f_17, f_18, f_19, f_20, f_14, f_16, f_9, f_7, f_8, f_29, f_30, f_25, f_26, f_27, f_28;
// BLOCK 0
ul_0 = (ulong) inreal;
ul_1 = (ulong) inimag;
ul_2 = (ulong) outreal;
ul_3 = (ulong) outimag;
i_4 = get_global_size(0);
i_5 = get_global_id(0);
// BLOCK 1 MERGES [0 5 ]
i_6 = i_5;
// BLOCK 2
f_7 = (float) i_6;
// BLOCK 3 MERGES [2 4 ]
f_8 = 0.0F;
f_9 = 0.0F;
i_10 = 0;
#pragma unroll 4
for(;i_10 < 1024;)
{
// BLOCK 4
i_11 = i_10 + 6;
i_12 = i_11 << 2;
ul_13 = ul_0 + i_12;
f_14 = *((__global float *) ul_13);
ul_15 = ul_1 + i_12;
f_16 = *((__global float *) ul_15);
f_17 = *((__global float *) ul_13);
f_18 = *((__global float *) ul_15);
f_19 = (float) i_10;
f_20 = f_19 * 6.2831855F;
f_21 = f_20 * f_7;
f_22 = f_21 / 1024.0F;
f_23 = native_sin(f_22);
f_24 = native_cos(f_22);
f_25 = f_24 * f_18;
f_26 = fma(f_23, f_17, f_25);
f_27 = f_9 - f_26;
f_28 = f_16 * f_23;
f_29 = fma(f_24, f_14, f_28);
f_30 = f_8 + f_29;
i_31 = i_10 + 1;
f_8 = f_30;
f_9 = f_27;
i_10 = i_31;
} // B4
// BLOCK 5
i_32 = i_6 + 6;
i_33 = i_32 << 2;
ul_34 = ul_2 + i_33;
*((__global float *) ul_34) = f_8;
ul_35 = ul_3 + i_33;
*((__global float *) ul_35) = f_9;
i_36 = i_4 + i_6;
i_6 = i_36;
// BLOCK 6
return;
} // kernel
Task info: s0.t0
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 1
Global work offset: [0]
Global work size : [1024]
Local work size : [64, 1, 1]
Number of workgroups : [16]
Total time: 441707758 ns
-6.5101576 vs -1.5097469
Validation: FAIL
Expected behavior
I did a diff with the kernel for the same test as generated in the develop
branch. The first observation is that there is a casting on the values used to calculate the address (ul_14
) that is used for loading the data. See variable l_12
, in the following screenshot.
I guess, the expected behavior would be to avoid this casting.
Computing system setup (please complete the following information):
- OS:
Ubuntu 23.10
- OpenCL version:
OpenCL 1.2
- Driver:
2023.16.10.0.17_160000
- TornadoVM commit id:
dec5a87
from branchfeat/jdk22_2nd_iteration