Hi,
I'm having difficulty understanding exactly what is limiting our performance.
I recently added a sorter to our graphics engine to reduce the number of state changes between draw calls. It resulted in a great improvement on some devices, such as older iOS devices (a 3x speedup), but on my Motorola Nexus 6 there is absolutely no change.
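To illustrate the kind of sorting meant here, a minimal sketch follows. It is not the actual engine code: `DrawCall`, `makeSortKey`, `sortByState` and `countStateChanges` are hypothetical names, and the choice of program, texture and vertex buffer as the sorted state (and their bit widths) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record. The fields stand in for whatever GL state
// the engine really binds before each draw call.
struct DrawCall {
    std::uint32_t program;
    std::uint32_t texture;
    std::uint32_t vertexBuffer;
};

// Pack the state into one 64-bit key, with the state assumed to be the most
// expensive to change in the highest bits.
// Assumption: program ids fit in 16 bits, the other ids in 24 bits.
inline std::uint64_t makeSortKey(const DrawCall& d) {
    assert(d.program < (1u << 16));
    assert(d.texture < (1u << 24));
    assert(d.vertexBuffer < (1u << 24));
    return (static_cast<std::uint64_t>(d.program) << 48) |
           (static_cast<std::uint64_t>(d.texture) << 24) |
           static_cast<std::uint64_t>(d.vertexBuffer);
}

// Sort so that draw calls sharing the same state end up adjacent.
inline void sortByState(std::vector<DrawCall>& calls) {
    std::sort(calls.begin(), calls.end(),
              [](const DrawCall& a, const DrawCall& b) {
                  return makeSortKey(a) < makeSortKey(b);
              });
}

// Count the binds a renderer would issue when walking the list in order
// and rebinding only the state that differs from the previous draw call.
inline std::size_t countStateChanges(const std::vector<DrawCall>& calls) {
    std::size_t changes = 0;
    for (std::size_t i = 0; i < calls.size(); ++i) {
        if (i == 0 || calls[i].program != calls[i - 1].program) ++changes;
        if (i == 0 || calls[i].texture != calls[i - 1].texture) ++changes;
        if (i == 0 || calls[i].vertexBuffer != calls[i - 1].vertexBuffer) ++changes;
    }
    return changes;
}
```

Note that sorting only reduces the number of state binds; the number of draw calls itself stays the same.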
I also tried using VAOs, but there was no gain there either.
Here is a capture of what I can profile:
I suspect that we are limited by CPU overhead, because our scene contains about 1000 draw calls. It's a CAD application, so, since the user can change the material of each part of an object, there is one draw call per sub-mesh.
Is it possible to generate a texture atlas from compressed textures (ETC1, PVR, ...)? Doing this could possibly help us reduce the number of draw calls.
There are also some procedural meshes that we could certainly merge into a few vertex buffers, but in that case we would have to allocate dynamic buffers and update them every frame.
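As a rough sketch of what such a merge could look like on the CPU side (the `Vertex`, `Mesh`, `Batch` and `appendMesh` names are hypothetical, and the position-only vertex layout is an assumption), each mesh is appended to a shared array and its indices are rebased:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal vertex: position only.
struct Vertex { float x, y, z; };

// One procedural mesh with its own local indices.
struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;
};

// CPU-side staging arrays for one merged draw call.
struct Batch {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Append a mesh to the batch, offsetting its indices by the number of
// vertices already present. Returns false (and leaves the batch untouched)
// if the result would no longer fit in 16-bit indices, which is the largest
// index type core OpenGL ES 2.0 guarantees.
inline bool appendMesh(Batch& batch, const Mesh& mesh) {
    const std::size_t base = batch.vertices.size();
    if (base + mesh.vertices.size() > 65536) return false;
    batch.vertices.insert(batch.vertices.end(),
                          mesh.vertices.begin(), mesh.vertices.end());
    for (std::uint16_t idx : mesh.indices) {
        batch.indices.push_back(static_cast<std::uint16_t>(base + idx));
    }
    return true;
}
```

The merged arrays would then be uploaded once per frame into a buffer created with a dynamic usage hint (for example with `glBufferSubData`), which is exactly the per-frame update cost I am unsure about.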