If I use 'half' data type in OpenCL kernel instead of 'float', can we achieve 2x performance?
tylcgy
Join Date: 9 Feb 15
Posts: 1
Posted: Tue, 2015-03-31 19:57

Hi, 

I am checking whether the Adreno 430 can achieve 2x performance if I use fp16 instead of fp32 in an OpenCL app.

So I wrote two kernels: one uses float and the other uses half (I also tested half2 and half4 versions). The workgroup size is set to 128.

But unfortunately, I see no noticeable performance increase using fp16.
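(To give an idea of the setup, my kernels are roughly of this shape. This is a simplified sketch rather than the exact code, so the kernel names, iteration count, and constants are just illustrative.)

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// fp32 version: a chain of multiply-adds so the kernel stays ALU-bound
__kernel void MadFloat(__global const float* in, __global float* out)
{
        const int gid = get_global_id(0);
        float v = in[gid];
        for (int i = 0; i < 256; ++i)
                v = v * 1.0009765625f + 0.5f;   // constants exactly representable in fp16
        out[gid] = v;
}

// fp16 version: the same arithmetic on half4, so the compiler can pack
// half operations if the hardware supports double-rate fp16
__kernel void MadHalf4(__global const half* in, __global half* out)
{
        const int gid = get_global_id(0);
        half4 v = vload4(0, in + (gid << 2));
        for (int i = 0; i < 256; ++i)
                v = v * (half)1.0009765625f + (half)0.5f;
        vstore4(v, 0, out + (gid << 2));
}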

 

Using the Adreno Profiler for the float and half tests:

(1) The kernel analyzer does show that half ALUs are used.

(2) But the profiler shows the same ALU utilization.

 

I am surprised that there is no performance increase when switching from fp32 to fp16.

Originally, I expected nearly a 2x performance increase.

 

So, I want to check and confirm:

Is my expectation incorrect? Is fp16 throughput not 2x the fp32 throughput?

Or do I need any special settings to achieve a near-2x performance increase?

Ayo Moderator
Join Date: 23 Jan 15
Posts: 31
Posted: Tue, 2015-05-26 11:45

Hello. First of all, apologies for the late response.

Your expectation is correct, but the answer to your question "If I use 'half' data type in OpenCL kernel instead of 'float', can we achieve 2x performance?" is "not necessarily".

"Half float" (fp16) should run at double speed compared to "float" (fp32). However for OpenCL kernels there are usually higher impact bottlenecks before you get to the level of the fp16 computations.

For example, the overhead of submitting the workgroup might be much higher than the actual workload, in which case no benefit would be seen from optimizing the workload. Another example is that the workgroup size might not be optimal, resulting in lower "occupancy" of the shader cores; in that case, memory access latency defeats any benefit of fp16 throughput.

Another reason is that using "half float" could increase the register footprint, which forces a smaller workgroup size and lowers overall performance.
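(One way to check this is to query the compiled kernels. A rough sketch follows; "MyFloatKernel" and "MyHalfKernel" stand in for your kernel names, and program and device are assumed to already exist, with error handling omitted.)

cl_kernel kf = clCreateKernel(program, "MyFloatKernel", NULL);  /* fp32 kernel */
cl_kernel kh = clCreateKernel(program, "MyHalfKernel", NULL);   /* fp16 kernel */

size_t wg_f = 0, wg_h = 0;
cl_ulong priv_f = 0, priv_h = 0;

/* Maximum workgroup size the driver allows for each compiled kernel. */
clGetKernelWorkGroupInfo(kf, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(wg_f), &wg_f, NULL);
clGetKernelWorkGroupInfo(kh, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(wg_h), &wg_h, NULL);

/* Minimum private memory used per work-item by each kernel. */
clGetKernelWorkGroupInfo(kf, device, CL_KERNEL_PRIVATE_MEM_SIZE, sizeof(priv_f), &priv_f, NULL);
clGetKernelWorkGroupInfo(kh, device, CL_KERNEL_PRIVATE_MEM_SIZE, sizeof(priv_h), &priv_h, NULL);

printf("float kernel: max WG size %zu, private mem %llu bytes\n", wg_f, (unsigned long long)priv_f);
printf("half  kernel: max WG size %zu, private mem %llu bytes\n", wg_h, (unsigned long long)priv_h);

If the half kernel reports a smaller CL_KERNEL_WORK_GROUP_SIZE or a larger CL_KERNEL_PRIVATE_MEM_SIZE than the float kernel, this is likely what is happening.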

Finally, it is possible that the thermal mitigation and clock control mechanisms built into the device are interfering with your performance measurements. You should enable performance mode on the GPU when optimizing.

Therefore, to see the benefits of "half float", you need to rule out the points above first.

daleiwang
Join Date: 7 Aug 14
Posts: 3
Posted: Mon, 2015-08-03 09:34

Hi,

I ran the following comparison test to do a simple memory transfer with OpenCL:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// W is the row width in elements (3072 in these tests), defined outside the
// kernels, e.g. via a -D build option.

// Copies 4 floats (16 bytes) per work-item.
__kernel __attribute__(( work_group_size_hint (32, 32, 1) ))
void FullPrecision(__global float* in, __global float* out)
{
        const int gtidx = get_global_id(0);
        const int gtidy = get_global_id(1);
        float4 v = vload4(0, in+gtidy*W+(gtidx<<2));
        vstore4(v, 0, out+gtidy*W+(gtidx<<2));
}

// Copies 8 halfs (16 bytes) per work-item, converting half -> float on load
// and float -> half on store.
__kernel __attribute__(( work_group_size_hint (32, 32, 1) ))
void HalfPrecision(__global half* in, __global half* out)
{
        const int gtidx = get_global_id(0);
        const int gtidy = get_global_id(1);
        float8 v = vload_half8(0, in+gtidy*W+(gtidx<<3));
        vstore_half8(v, 0, out+gtidy*W+(gtidx<<3));
}

// Copies 8 shorts (16 bytes) per work-item with no format conversion.
__kernel __attribute__(( work_group_size_hint (32, 32, 1) ))
void ShortPrecision(__global short* in, __global short* out)
{
        const int gtidx = get_global_id(0);
        const int gtidy = get_global_id(1);
        short8 v = vload8(0, in+gtidy*W+(gtidx<<3));
        vstore8(v, 0, out+gtidy*W+(gtidx<<3));
}


The test platform is an Adreno 420.

I followed the instructions in section 3.3.5.3.2 of the document "Snapdragon(TM) OpenCL General Programming and Optimization", version 80-N8592-1 L, released on August 29, 2014, in order to enable the performance mode of the GPU.

To test the "FullPrecision" kernel, I launched 32 by 32 threads per workgroup, with a global grid of 12 by 96 workgroups, to transfer a buffer of 3072 by 3072 cl_floats. It takes 6.0ms to complete (measured using cl_event).

To test the "HalfPrecision" kernel, I launched 32 by 32 threads per workgroup, with a global grid of 24 by 96 workgroups,  to transfer a buffer of 3072 by 3072 cl_half. It takes 6.7ms to complete.

To test the "ShortPrecision" kernel, I launched 32 by 32 threads per workgroup, with a global grid of 12 by 96 workgroups, to transfer a buffer of 3072 by 3072 cl_short. It takes 3.3ms to complete.

These measurements make no sense to me, since:
1) The buffer size is large enough that the overhead time to start the kernel is not a significant factor
2) I was able to launch the same local workgroup size in all 3 tests, and I did NOT use the qcom-sched-rule=2 flag to compile the kernels
3) I would expect "ShortPrecision" and "HalfPrecision" to have the same speed, since they should both be transferring the same amount of data with identical layout in memory

Could you shed some light on this matter?

daleiwang
Join Date: 7 Aug 14
Posts: 3
Posted: Mon, 2015-08-03 09:36

My apologies, the 4th and 5th paragraphs below the code snippet in the previous post should read:

To test the "FullPrecision" kernel, I launched 32 by 32 threads per workgroup, with a global grid of 24 by 96 workgroups, to transfer a buffer of 3072 by 3072 cl_floats. It takes 6.0ms to complete (measured using cl_event).

To test the "HalfPrecision" kernel, I launched 32 by 32 threads per workgroup, with a global grid of 12 by 96 workgroups,  to transfer a buffer of 3072 by 3072 cl_half. It takes 6.7ms to complete.

daleiwang
Join Date: 7 Aug 14
Posts: 3
Posted: Tue, 2015-08-11 06:05

It seems the problem is with vload_halfn and/or vstore_halfn on the Adreno 420. I tried rewriting the 'HalfPrecision' kernel as follows (with the cl_khr_fp16 extension enabled):

// Copies 8 halfs (16 bytes) per work-item with no conversion; direct half8
// access requires cl_khr_fp16.
__kernel __attribute__(( work_group_size_hint (32, 32, 1) ))
void HalfPrecision(__global half* in, __global half* out)
{
        const int gtidx = get_global_id(0);
        const int gtidy = get_global_id(1);
        __global half8* in8 = (__global half8*) in;
        __global half8* out8 = (__global half8*) out;

        half8 v = in8[gtidy*(W>>3)+gtidx];
        out8[gtidy*(W>>3)+gtidx] = v;
}

Now the speed of this kernel is on par with that of 'ShortPrecision'.
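One caveat with this version: dereferencing half8 pointers directly requires the cl_khr_fp16 extension, whereas the vload_half/vstore_half built-ins work even without it. So if this code needs to run on devices other than Adreno, it is worth checking the extension string first. A minimal host-side sketch (device is the cl_device_id already in use; includes and error handling omitted):

char extensions[4096] = {0};
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
if (strstr(extensions, "cl_khr_fp16") == NULL)
{
        /* Fall back to the vload_half/vstore_half kernel on devices without fp16. */
        printf("cl_khr_fp16 not supported, using the conversion kernel\n");
}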

 
