Hi all,
Apologies if this is the wrong place to post this. We noticed that in one of our models the bilinear resize operation occasionally takes an exceedingly long time to complete. It typically finishes in well under 0.1 ms, but sometimes takes more than 40 ms. These slow runs happen only about 0.01% of the time, but that is still frequent enough to have a substantial impact on our application.
We were able to fix the issue by modifying the `resize_bilinear_d32_execute` function. The original version of this function had the following lines:
```
for(int i = 0; i < n_threads; i++)
    nn_os_work_for_vector( nn, runstate.run_all_func, &runstate );
// copy the min and max through
tensor_copy( out_min_tensor, in_min_tensor );
tensor_copy( out_max_tensor, in_max_tensor );
nn_sem_wait_n_times( &runstate.done_sem, n_threads);
```
The fix was to move the two `tensor_copy` calls so that they run after the worker threads finish, i.e. after the `nn_sem_wait_n_times` call. The median execution time is slightly longer, but we no longer observe the excessively long execution times with this change. We also checked that the slow runs coincide with a much larger number of cache misses, so this looks like a cache thrashing issue.
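To make the reordering concrete, here is a minimal, self-contained sketch of the pattern using plain pthreads. It is not the actual library code: `run_op_sketch`, `struct job`, `worker`, and `N_THREADS` are hypothetical names, the worker body is a placeholder copy rather than a real resize, `pthread_create` stands in for `nn_os_work_for_vector`, and `pthread_join` stands in for `nn_sem_wait_n_times`. The only point is the ordering: the min/max copy happens after all workers are done.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define N_THREADS 4  /* hypothetical worker count */

/* Hypothetical per-worker job: one contiguous slice of the output. */
struct job {
    const float *in;
    float *out;
    int begin;
    int end;
};

/* Placeholder for the real resize kernel: just copies its slice. */
static void *worker(void *arg)
{
    struct job *j = arg;
    for (int i = j->begin; i < j->end; i++)
        j->out[i] = j->in[i];
    return NULL;
}

/* Returns 0 on success. in_range and out_range hold {min, max}. */
int run_op_sketch(const float *in, float *out, int n,
                  const float in_range[2], float out_range[2])
{
    pthread_t threads[N_THREADS];
    struct job jobs[N_THREADS];

    /* Dispatch the workers (stand-in for nn_os_work_for_vector). */
    for (int i = 0; i < N_THREADS; i++) {
        jobs[i].in = in;
        jobs[i].out = out;
        jobs[i].begin = (int)((long)n * i / N_THREADS);
        jobs[i].end = (int)((long)n * (i + 1) / N_THREADS);
        int rc = pthread_create(&threads[i], NULL, worker, &jobs[i]);
        assert(rc == 0);
        (void)rc;
    }

    /* Wait for every worker first (stand-in for nn_sem_wait_n_times). */
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(threads[i], NULL);

    /* Only now copy the min and max through, so the dispatching thread
     * does no memory traffic of its own while the workers are running. */
    memcpy(out_range, in_range, 2 * sizeof(float));
    return 0;
}
```

Whether the dispatching thread's copy really is what evicts the workers' cache lines in the original op is our assumption based on the cache-miss counts; the sketch only illustrates the ordering we ended up with.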
Is this a known issue? Are there other ops that have similar issues?