I have an OpenCL kernel that is already fairly well optimized, but I am facing the following problem: as the complexity of the kernel or its local memory usage increases, the maximum work-group size decreases. I believe the kernel would perform better if I could increase the work-group size. The total local memory allocated is less than 10 KB, which is far below the 32 KB limit of the Adreno 530 hardware. With 10 KB of local memory I am able to launch only 64 threads per work-group (the maximum possible being 1024). Dumping the compiled OpenCL binary does not give readable code either.
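For context, here is a minimal sketch of how a launch can be sized against this per-kernel limit. It assumes the limit is the value that `clGetKernelWorkGroupInfo` reports for `CL_KERNEL_WORK_GROUP_SIZE`, and that the preferred multiple comes from `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`. The helper names `pick_local_size` and `round_up_global` are hypothetical, and the OpenCL queries themselves are left out so that the snippet compiles without an OpenCL SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: clamp the requested work-group size to the per-kernel
 * limit, then round down to a whole multiple of the preferred size.
 *   kernel_max : assumed to be the CL_KERNEL_WORK_GROUP_SIZE query result
 *   multiple   : assumed to be CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
 */
static size_t pick_local_size(size_t requested, size_t kernel_max, size_t multiple)
{
    assert(requested > 0 && kernel_max > 0 && multiple > 0);

    size_t local = requested < kernel_max ? requested : kernel_max;

    /* Only round down when at least one whole multiple fits. */
    if (local >= multiple)
        local -= local % multiple;

    return local;
}

/* Hypothetical helper: round the global size up to a whole number of
 * work-groups. OpenCL 1.x requires the global size to be divisible by the
 * local size, so the kernel has to bounds-check the padded work-items. */
static size_t round_up_global(size_t work_items, size_t local)
{
    assert(local > 0);
    return (work_items + local - 1) / local * local;
}
```

With the numbers above (1024 requested, a reported limit of 64), `pick_local_size` returns 64. It cannot raise the limit, which the implementation derives from the kernel's resource requirements such as register usage; it only keeps the launch within it.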
I am looking for directions for improving the performance further. For example, can I improve performance by somehow launching the kernel with a larger work-group size? Looking at a readable form of the generated binary code would probably help me. (How do I do this?)