I am currently porting a video processing algorithm, which we had previously implemented using DirectX Compute Shaders, to OpenCL on the Adreno 320 (Nexus 7). Right now I am battling an issue that has crippled my progress. When executing some of my OpenCL kernels I get error code -54, which indicates an invalid work group size. The problem, however, is that the work group size in those instances is actually perfectly fine (16x8=128) and is the same as for other kernels that work fine. The error in fact depends on the kernel code. If I comment out most of the kernel code the error goes away. As I start adding functionality back, at some point it errors out again with invalid work group size.
My initial hunch was that this error comes up when the threads have too many memory operations in flight. In one case I was able to make the error go away by inserting barriers in the code to ensure there are not too many outstanding memory transactions at any given time. In other cases even that didn't work. And in some cases even a fairly simple kernel triggers this error. I am really stuck on this, as I can't find any deterministic way to avoid it.
Are there any ideas / suggestions about this?
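For reference, -54 is the value of CL_INVALID_WORK_GROUP_SIZE in the standard Khronos cl.h header. Below is a minimal sketch of a lookup for that code and its neighbours, which can make enqueue failures easier to read in logs. The function name cl_enqueue_error_name is hypothetical, and only a handful of codes are covered.

```c
/* Map a few OpenCL status codes, mostly the ones an enqueue call can
 * return for launch-configuration problems, to their names in cl.h.
 * The helper name is hypothetical; it is not part of OpenCL itself. */
const char *cl_enqueue_error_name(int code)
{
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -52: return "CL_INVALID_KERNEL_ARGS";
    case -53: return "CL_INVALID_WORK_DIMENSION";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    case -55: return "CL_INVALID_WORK_ITEM_SIZE";
    case -56: return "CL_INVALID_GLOBAL_OFFSET";
    default:  return "unknown OpenCL status";
    }
}
```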
A follow-up to my previous post:
After playing around more, I found that the maximum supported group size seems to go down depending on the contents of the kernel. One of my kernels (which uses local shared memory as well as atomic operations) runs fine as long as I keep the group size at 32 or smaller. It could be 32x1, 8x4, 16x2 or 4x8, but as long as the total is within 32 it works. For another kernel, which uses multiple texture reads and shared local memory, the limit seems to be 64. And for my simplest kernels, which just read a single texture element and write it back to another texture, the limit seems to be 128. According to the device information calls in OpenCL, the device supports a group size of up to 256, but I always get an error if I try to go above 128, even with what is basically a simple pass-through kernel.
So in summary, the driver seems to enforce different work group size limits based on the kernel contents. I am not sure why that is or whether it will be fixed, but for now at least I can move forward, though with a performance hit.
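One way to cope with a per-kernel limit like this in host code is to shrink the preferred 2D group size until it fits. The sketch below assumes the limit is already known, either found by trial as described above or queried per kernel (the OpenCL API provides clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE for this; whether the Adreno driver reports the same limits observed here has not been verified). The helper name fit_local_size_2d is hypothetical.

```c
#include <stddef.h>

/* Shrink a preferred 2D local size in place until its total number of
 * work-items is at most max_items.
 * max_items is assumed to be the per-kernel work group size limit.
 * Each step halves the larger dimension (y on a tie), so a 16x8
 * preference becomes 8x8 for a limit of 64 and 8x4 for a limit of 32. */
void fit_local_size_2d(size_t max_items, size_t local[2])
{
    if (max_items == 0) {
        max_items = 1; /* treat a bogus limit as "one work-item" */
    }
    while (local[0] * local[1] > max_items) {
        if (local[1] >= local[0]) {
            local[1] /= 2;
        } else {
            local[0] /= 2;
        }
    }
}
```

With a preferred size of 16x8, limits of 128, 64 and 32 yield 16x8, 8x8 and 8x4 respectively. Note that in OpenCL 1.x the global size in each dimension must still be a multiple of the chosen local size.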