Skip to content

FPGA result is not correct in JDK 22 branch, due to casting of the address type  #517

Open
@stratika

Description

Describe the bug

I tried the experimental branch of TornadoVM working with JDK 22, and noticed that the result of the generated OpenCL kernels for FPGAs are not result.

How To Reproduce

I have built TornadoVM as follows:

git checkout feat/jdk22_2nd_iteration
./bin/tornadovm-installer --jdk sapmachine-jdk-22 --backend opencl && source setvars.sh
tornado --version

returns

version=1.0.7-dev
branch=feat/jdk22_2nd_iteration
commit=dec5a87

Backends installed: 
	 - opencl

and I have used the oneAPI base toolkit to see also the Intel FPGA in emulation mode:

tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
	OPENCL --  [NVIDIA CUDA] -- NVIDIA RTX A2000 8GB Laptop GPU
		Global Memory Size: 7.8 GB
		Local Memory Size: 48.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [1024]
		Max WorkGroup Configuration: [1024, 1024, 64]
		Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
	OPENCL --  [Intel(R) OpenCL HD Graphics] -- Intel(R) Graphics [0x46a6]
		Global Memory Size: 24.8 GB
		Local Memory Size: 64.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [512]
		Max WorkGroup Configuration: [512, 512, 512]
		Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
	OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i9-12900H
		Global Memory Size: 31.0 GB
		Local Memory Size: 32.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [8192]
		Max WorkGroup Configuration: [8192, 8192, 8192]
		Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
	OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
		Global Memory Size: 31.0 GB
		Local Memory Size: 256.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [67108864]
		Max WorkGroup Configuration: [67108864, 67108864, 67108864]
		Device OpenCL C version: OpenCL C 1.2

Then, I ran the DFTMT.java class from the dynamic examples (uk.ac.manchester.tornado.examples.dynamic) on the FPGA emulation platform, and got the output as follows:

tornado --threadInfo --printKernel --jvm="-Ds0.t0.device=0:3" -m tornado.examples/uk.ac.manchester.tornado.examples.dynamic.DFTMT --params="1024 default 1"
WARNING: Using incubator modules: jdk.incubator.vector
Initialization time:  2021238628 ns

Version running: default ! 
#pragma OPENCL EXTENSION cl_khr_fp64 : enable  
#pragma OPENCL EXTENSION cl_khr_fp16 : enable  
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void computeDFT(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *inreal, __global uchar *inimag, __global uchar *outreal, __global uchar *outimag, __global uchar *inputSize)
{
  int i_4, i_36, i_5, i_6, i_10, i_11, i_12, i_31, i_32, i_33; 
  ulong ul_13, ul_0, ul_15, ul_2, ul_34, ul_1, ul_3, ul_35; 
  float f_21, f_22, f_23, f_24, f_17, f_18, f_19, f_20, f_14, f_16, f_9, f_7, f_8, f_29, f_30, f_25, f_26, f_27, f_28; 

  // BLOCK 0
  ul_0  =  (ulong) inreal;
  ul_1  =  (ulong) inimag;
  ul_2  =  (ulong) outreal;
  ul_3  =  (ulong) outimag;
  i_4  =  get_global_size(0);
  i_5  =  get_global_id(0);
  // BLOCK 1 MERGES [0 5 ]
  i_6  =  i_5;
  // BLOCK 2
  f_7  =  (float) i_6;
  // BLOCK 3 MERGES [2 4 ]
  f_8  =  0.0F;
  f_9  =  0.0F;
  i_10  =  0;
  #pragma unroll 4
  for(;i_10 < 1024;)
  {
    // BLOCK 4
    i_11  =  i_10 + 6;
    i_12  =  i_11 << 2;
    ul_13  =  ul_0 + i_12;
    f_14  =  *((__global float *) ul_13);
    ul_15  =  ul_1 + i_12;
    f_16  =  *((__global float *) ul_15);
    f_17  =  *((__global float *) ul_13);
    f_18  =  *((__global float *) ul_15);
    f_19  =  (float) i_10;
    f_20  =  f_19 * 6.2831855F;
    f_21  =  f_20 * f_7;
    f_22  =  f_21 / 1024.0F;
    f_23  =  native_sin(f_22);
    f_24  =  native_cos(f_22);
    f_25  =  f_24 * f_18;
    f_26  =  fma(f_23, f_17, f_25);
    f_27  =  f_9 - f_26;
    f_28  =  f_16 * f_23;
    f_29  =  fma(f_24, f_14, f_28);
    f_30  =  f_8 + f_29;
    i_31  =  i_10 + 1;
    f_8  =  f_30;
    f_9  =  f_27;
    i_10  =  i_31;
  }  // B4
  // BLOCK 5
  i_32  =  i_6 + 6;
  i_33  =  i_32 << 2;
  ul_34  =  ul_2 + i_33;
  *((__global float *) ul_34)  =  f_8;
  ul_35  =  ul_3 + i_33;
  *((__global float *) ul_35)  =  f_9;
  i_36  =  i_4 + i_6;
  i_6  =  i_36;
  // BLOCK 6
  return;
}  //  kernel

Task info: s0.t0
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 1
	Global work offset: [0]
	Global work size  : [1024]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [16]

Total time:  441707758 ns 

-6.5101576 vs -1.5097469

Validation:  FAIL 

Expected behavior

I did a diff with the kernel for the same test as generated in the develop branch. The first observation is that there is a casting on the values used to calculate the address (ul_14) that is used for loading the data. See variable l_12, in the following screenshot.

image

I guess, the expected behavior would be to avoid this casting.

Computing system setup (please complete the following information):

  • OS: Ubuntu 23.10
  • OpenCL version: OpenCL 1.2
  • Driver: 2023.16.10.0.17_160000
  • TornadoVM commit id: dec5a87 from branch feat/jdk22_2nd_iteration

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingjdk22

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions