Hi, QML Team,
We are working on optimizing an LSTM implementation on Android devices.
We have some parallel operators written with ARM NEON/VFP intrinsics such as 'vmull_s8' and 'vaddq_f32' ...
Now we want to refactor them to use QML, on the premise that the QML operators perform better than our hand-written NEON versions.
Could you share more details about QML's parallel implementation,
and in particular whether QML runs faster than common SIMD instructions on the Snapdragon platform?