We are running qaic-exec in simulation mode.
When we run inference at FP32 precision, we get an inference time of 191 seconds per image.
Once we set the flags for INT8 quantization, the inference time rises to 26 minutes per image. We have tried both static and dynamic quantization and see similar timing for both methods.
We followed these steps:
1. Generate the profile for the model:
./qaic-exec -m=<ONNX model path> -input-list-file=<text file path> -dump-profile=<path to dump profile(.yaml file)>
2. Run inference using static quantization:
./qaic-exec -m=<ONNX model path> -input-list-file=<text file path> -load-profile=<path to load profile(.yaml file)> -write-output-dir=<path to store the outputs> -quantization-precision-bias=Int8 -quantization-precision=Int8 -quantization-schema=symmetric_with_uint8
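For illustration only, the two steps above might look as follows with the placeholders filled in. The model, input-list, profile, and output paths here are hypothetical examples, not paths from the original report; the flags are exactly those shown above.

```sh
# Step 1: profile the model (all paths below are hypothetical examples)
./qaic-exec -m=model/resnet50.onnx \
            -input-list-file=inputs/input_list.txt \
            -dump-profile=profiles/resnet50_profile.yaml

# Step 2: INT8 inference using the generated profile
./qaic-exec -m=model/resnet50.onnx \
            -input-list-file=inputs/input_list.txt \
            -load-profile=profiles/resnet50_profile.yaml \
            -write-output-dir=outputs/ \
            -quantization-precision-bias=Int8 \
            -quantization-precision=Int8 \
            -quantization-schema=symmetric_with_uint8
```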
We observed that the inference time for INT8 quantization is far higher than that for FP32 in simulator mode. Is this expected behaviour of qaic-exec in simulator mode?
If not, please let us know the suitable command-line arguments for execution in the simulator.
Dear customer,
Inference time is better if you input fixed-point tensors rather than float tensors, because float data needs extra time to be quantized to fixed point.
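To illustrate the extra work the reply refers to, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. This is a generic example of the technique, not qaic-exec's internal implementation; the function names and the sample tensor are made up for illustration. When inputs are already fixed-point, this conversion step is skipped.

```python
import numpy as np

def quantize_symmetric_int8(x):
    """Symmetric per-tensor quantization: map floats to int8 via one scale.

    This per-element rounding/clipping is the kind of extra pass that
    float inputs incur before quantized inference can run.
    """
    # Guard against an all-zero tensor, which would give a zero scale.
    scale = max(np.max(np.abs(x)) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return q.astype(np.float32) * scale

# Hypothetical input tensor standing in for preprocessed image data.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale = quantize_symmetric_int8(x)
x_hat = dequantize(q, scale)
```

Feeding the tool tensors that are already in the quantized integer domain avoids repeating this conversion on every inference.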
BR.
Wei