My neural network has many layers, each of which calls clEnqueueNDRangeKernel, and the GPU computation per kernel is light, so the time these calls take on the host side is much greater than the kernel execution time.
Sure, the obvious fix is to simplify the network. But I would like to know whether there are any tricks to reduce the number of API calls issued to the device, or to reduce the time each API call takes. In my project, the whole program takes about 60 ms to execute, but the GPU kernel execution time is less than 10 ms.
Here is the profiler graph:
https://drive.google.com/file/d/1ex0v4ut981i6lrNM22GHBrXnaAl_wQmV/view?u...