Forums - Porting OpenCL from AMD to Adreno

michal.hradecky
Join Date: 26 Jan 17
Posts: 3
Posted: Thu, 2017-02-16 01:34

Hello,

Our company is porting our OpenCL image processing library (currently available for Windows, Linux, and macOS) to 64-bit ARM for one of our customers. Depending on the performance we achieve, they may use a solution based on Snapdragon kits. We use devkits with the Snapdragon 820 SoC (Adreno 530) and Qualcomm LLVM for Android as a cross-compiler.

1) On AMD cards our kernels perform well, but on the Adreno 530 (approx. 400 MFLOPS) we are getting roughly 20 to 50 times lower performance than on the AMD card (5 GFLOPS); we expected roughly 10 times lower. Do you have a guide for porting OpenCL kernels from AMD cards to Adreno, or at least a list of functions we should avoid to get better performance on ARM? I compiled the most time-consuming kernel in Kernel Analyzer (results attached), but I have no idea which instructions are the bottleneck.

In the "Qualcomm Adreno OpenCL Programming Tips" you said that a longer version of the programming guide is in progress. Is it possible to get it somewhere (even a work-in-progress version)?

2) OpenCL on Snapdragon lacks the SPIR extension. Is there a way to replace SPIR, i.e. to compile the kernels offline for Adreno GPUs? We don't want to ship the source code of our OpenCL kernels with our product.

Thanks

--------------------------

Stats from Kernel Analyzer:

  • - Instruction stats
  • - All Instructions:     5095,   10451 (rpt), ratio  2.05
  • - ALUs          :     2104,    2127 (rpt), ratio  1.01
  • - Half ALUs     :      152,     152 (rpt), ratio  1.00
  • - Total NOPs    :     1710,    6132 (rpt), ratio  3.59
  • - NOPs          :      917,    3933 (rpt), ratio  4.29
  • - Post-NOPs     :      793,    2199 (rpt), ratio  2.77
  • - MOVs          :      965,    1083 (rpt), ratio  1.12
  • - Loads/Stores  :      679,     902 (rpt), ratio  1.33
  • - ldp          :      144,     168 (rpt), ratio  1.17
  • - ldg          :       74,     106 (rpt), ratio  1.43
  • - ldl          :      124,     165 (rpt), ratio  1.33
  • - stp          :      237,     313 (rpt), ratio  1.32
  • - stg          :       30,      30 (rpt), ratio  1.00
  • - stl          :       69,     119 (rpt), ratio  1.72
  • - atomicga     :        1,       1 (rpt), ratio  1.00
  • - addrcalc     :        1,       0 (rpt), ratio  0.00
  • - EI Position   :       -1
  • - Flow Instrs   :      271
  • - BRAA Instrs :       24
  • - BRAO Instrs :        3
  • - Sync Instrs   :        7
  • - Barriers      :        7
  • - Short sync flags:      306
  • - Long sync flags :      207
  • - Instruction stats
  • - All Instructions:       56,     100 (rpt), ratio  1.79
  • - ALUs          :       35,      35 (rpt), ratio  1.00
  • - Total NOPs    :       18,      49 (rpt), ratio  2.72
  • - NOPs          :        5,      14 (rpt), ratio  2.80
  • - Post-NOPs     :       13,      35 (rpt), ratio  2.69
  • - MOVs          :       14,      14 (rpt), ratio  1.00
  • - Loads/Stores  :        1,       1 (rpt), ratio  1.00
  • - stg          :        1,       1 (rpt), ratio  1.00
  • - EI Position   :       -1
  • - Flow Instrs   :        1
  • - Long sync flags :        1
  • - Instruction stats
  • - All Instructions:     5151,   10551 (rpt), ratio  2.05
  • - ALUs          :     2139,    2162 (rpt), ratio  1.01
  • - Half ALUs     :      152,     152 (rpt), ratio  1.00
  • - Total NOPs    :     1728,    6181 (rpt), ratio  3.58
  • - NOPs          :      922,    3947 (rpt), ratio  4.28
  • - Post-NOPs     :      806,    2234 (rpt), ratio  2.77
  • - MOVs          :      979,    1097 (rpt), ratio  1.12
  • - Loads/Stores  :      680,     903 (rpt), ratio  1.33
  • - ldp          :      144,     168 (rpt), ratio  1.17
  • - ldg          :       74,     106 (rpt), ratio  1.43
  • - ldl          :      124,     165 (rpt), ratio  1.33
  • - stp          :      237,     313 (rpt), ratio  1.32
  • - stg          :       31,      31 (rpt), ratio  1.00
  • - stl          :       69,     119 (rpt), ratio  1.72
  • - atomicga     :        1,       1 (rpt), ratio  1.00
  • - addrcalc     :        1,       0 (rpt), ratio  0.00
  • - EI Position   :       -1
  • - Flow Instrs   :      272
  • - BRAA Instrs :       24
  • - BRAO Instrs :        3
  • - Sync Instrs   :        7
  • - Barriers      :        7
  • - Short sync flags:      306
  • - Long sync flags :      208
  • - Full Registers  :       28
  • - Half Registers  :        3
  • - uGPR Registers  :        4
  • - Unified Regs    :       30
  • - Scratch space   :       40
  • - Total footprint :      520
  • - Max num of waves:        4
  • - Maximal Waves   :        2 (A5x)
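
As a rough way to read the numbers above, here is a small Python sketch that computes what share of the repeat-expanded instruction count each category takes in the first stats block. It rests on assumptions: that the "(rpt)" column is the instruction count after repeat expansion, and that each such slot costs roughly the same issue time. Neither is confirmed by Kernel Analyzer documentation, and the helper names (`parse_stats`, `share_of_issue_slots`) are made up for this illustration.

```python
import re

# Top-level lines of the first "Instruction stats" block, copied from the
# output above. Columns: static count, count with repeats "(rpt)".
STATS = """
All Instructions:     5095,   10451 (rpt), ratio  2.05
ALUs          :     2104,    2127 (rpt), ratio  1.01
Total NOPs    :     1710,    6132 (rpt), ratio  3.59
MOVs          :      965,    1083 (rpt), ratio  1.12
Loads/Stores  :      679,     902 (rpt), ratio  1.33
"""

# name : static, repeated (rpt)
LINE = re.compile(r"^\s*(.+?)\s*:\s*(\d+),\s*(\d+) \(rpt\)")


def parse_stats(text):
    """Return {name: (static_count, repeated_count)} for each stats line."""
    stats = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            stats[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return stats


def share_of_issue_slots(stats, total_key="All Instructions"):
    """Fraction of the repeat-expanded total taken by each category.

    Assumption: the "(rpt)" count approximates issued slots.
    """
    total = stats[total_key][1]
    return {
        name: rpt / total
        for name, (_, rpt) in stats.items()
        if name != total_key
    }


stats = parse_stats(STATS)
shares = share_of_issue_slots(stats)

if __name__ == "__main__":
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:<14} {share:6.1%}")
```

Under those assumptions, Total NOPs (6132 of 10451) come to about 59% of the expanded instruction stream and ALU instructions to about 20%, which would point at instruction latency and scheduling stalls rather than at any particular arithmetic function.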
