Forums - TfLite NNAPI slower than CPU

TfLite NNAPI slower than CPU
marko.milosevic
Join Date: 14 Jul 21
Posts: 5
Posted: Mon, 2021-09-20 01:15
Hello,
 
I have noticed that NNAPI inference is slower than CPU inference. I am using TFLite built from source (v2.7 as of writing).
 
I used the benchmark tool provided with the TensorFlow source code and got the following results:
 
Model ssd_mobilenet_v1_1_default_1.tflite: CPU = 10.5 ms, NNAPI = 19.8 ms
 
My own model 'o11.tflite': CPU = 584 ms, NNAPI = 7168 ms
 
I see similar results both in my own code and with the prebuilt linux_aarch64_benchmark_model. I have tried both quantized and float models and varied the NNAPI options, but nothing helps.
 
I tried using Snapdragon Profiler to check whether the DSP is being utilized, but it doesn't seem to support the RB5 (it works fine with my Android phone).
 
What could be the issue?
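In case it's relevant, here are the extra diagnostics I'm planning to try next. The accelerator name below is just one of those the benchmark tool lists on my board; `--nnapi_accelerator_name` and `--enable_op_profiling` are standard benchmark_model flags, though I'm not certain they behave identically on this non-Android build:

```shell
# Pin NNAPI to one specific accelerator instead of letting it choose
# (names taken from the "NNAPI accelerators available" line in the logs below):
./benchmark_model --graph=o11.tflite --use_nnapi=true \
    --nnapi_accelerator_name=libunifiedhal-driver.so0

# Per-op profiling, to see which ops run on the delegate and which
# fall back to the CPU:
./benchmark_model --graph=o11.tflite --use_nnapi=true \
    --enable_op_profiling=true
```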
 
$ cat /build.prop
 
ro.product.board=qrb5165
ro.debuggable=1
ro.hardware.camera=qcom
ro.vendor.audio.sdk.ssr=false
ro.vendor.audio.sdk.fluencetype=none
persist.vendor.audio.fluence.voicecall=false
persist.vendor.audio.fluence.voicerec=false
persist.vendor.audio.fluence.speaker=false
persist.vendor.audio.fluence.audiorec=false
vendor.audio.tunnel.encode = false
vendor.audio.offload.buffer.size.kb=64
audio.offload.video=true
vendor.voice.path.for.pcm.voip=true
vendor.audio.offload.gapless.enabled=true
vendor.voice.playback.conc.disabled=true
vendor.voice.record.conc.disabled=true
vendor.voice.voip.conc.disabled=true
vendor.rec.playback.conc.disabled=true
vendor.audio.dolby.ds2.enabled=true
ro.qc.sdk.fwk.mic_support=6
persist.vendor.audio.qas.enabled=true
vendor.egl.default.platform=wayland
persist.vendor.sensors.enable.tdkMezzCard=false
service.adb.root=1
ro.build.version.release=202108122054
ro.product.name=qrb5165-qti-distro-ubuntu-fullstack-debug
 
Full benchmark calls and results:
 
$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=/home/euroicc/ssd_mobilenet_v1_1_default_1.tflite --num_threads=8
STARTING!
Log parameter values verbosely: [0]
Num threads: [8]
Graph: [/home/euroicc/ssd_mobilenet_v1_1_default_1.tflite]
#threads used for CPU inference: [8]
Loaded model /home/euroicc/ssd_mobilenet_v1_1_default_1.tflite
The input model file size (MB): 4.18331
Initialized session in 4.335ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=40 first=83329 curr=11507 min=9123 max=83329 avg=12614.5 std=11405
 
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=95 first=9643 curr=11855 min=8904 max=15041 avg=10592.7 std=1375
 
Inference timings in us: Init: 4335, First inference: 83329, Warmup (avg): 12614.5, Inference (avg): 10592.7
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=5.40234 overall=7.64062
 
$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=/home/euroicc/ssd_mobilenet_v1_1_default_1.tflite --num_threads=8 --use_nnapi=true --nnapi_allow_fp16=true --nnapi_execution_preference=sustained_speed
STARTING!
Log parameter values verbosely: [0]
Num threads: [8]
Graph: [/home/euroicc/ssd_mobilenet_v1_1_default_1.tflite]
#threads used for CPU inference: [8]
Use NNAPI: [1]
NNAPI execution preference: [sustained_speed]
NNAPI accelerators available: [libunifiedhal-driver.so0,libunifiedhal-driver.so1,libunifiedhal-driver.so2,libunifiedhal-driver.so3,nnapi-reference]
Allow fp16 in NNAPI: [1]
Loaded model /home/euroicc/ssd_mobilenet_v1_1_default_1.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
NNAPI delegate created.
Explicitly applied NNAPI delegate, and the model graph will be partially executed by the delegate w/ 1 delegate kernels.
The input model file size (MB): 4.18331
Initialized session in 459.384ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=25 first=28360 curr=20294 min=13159 max=28360 avg=20631.2 std=2740
 
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=21577 curr=19661 min=18862 max=22730 avg=19886.2 std=1095
 
Inference timings in us: Init: 459384, First inference: 28360, Warmup (avg): 20631.2, Inference (avg): 19886.2
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=16.418 overall=16.418
 
$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=/home/euroicc/rb5/models/o11.tflite --num_threads=8 --use_nnapi=true --nnapi_allow_fp16=true --nnapi_execution_preference=sustained_speed
STARTING!
Log parameter values verbosely: [0]
Num threads: [8]
Graph: [/home/euroicc/rb5/models/o11.tflite]
#threads used for CPU inference: [8]
Use NNAPI: [1]
NNAPI execution preference: [sustained_speed]
NNAPI accelerators available: [libunifiedhal-driver.so0,libunifiedhal-driver.so1,libunifiedhal-driver.so2,libunifiedhal-driver.so3,nnapi-reference]
Allow fp16 in NNAPI: [1]
Loaded model /home/euroicc/rb5/models/o11.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
NNAPI delegate created.
Explicitly applied NNAPI delegate, and the model graph will be partially executed by the delegate w/ 97 delegate kernels.
The input model file size (MB): 12.6068
Initialized session in 11335ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=10080560
 
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=21 first=7161284 curr=7243335 min=6961929 max=7394518 avg=7.16484e+06 std=119603
 
Inference timings in us: Init: 11334965, First inference: 10080560, Warmup (avg): 1.00806e+07, Inference (avg): 7.16484e+06
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=461.387 overall=1654.13
 
$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=/home/euroicc/rb5/models/o11.tflite --num_threads=8
STARTING!
Log parameter values verbosely: [0]
Num threads: [8]
Graph: [/home/euroicc/rb5/models/o11.tflite]
#threads used for CPU inference: [8]
Loaded model /home/euroicc/rb5/models/o11.tflite
The input model file size (MB): 12.6068
Initialized session in 12.748ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=1313188
 
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=617809 curr=560921 min=533275 max=663008 avg=584858 std=27823
 
Inference timings in us: Init: 12748, First inference: 1313188, Warmup (avg): 1.31319e+06, Inference (avg): 584858
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=8.12109 overall=382.352
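To quantify the gap, a quick sketch computing the slowdown factors from the averaged inference times reported by benchmark_model above (values in microseconds, copied from the logs):

```python
# Averaged "Inference (avg)" timings from the benchmark runs above, in us.
timings_us = {
    "ssd_mobilenet_v1": {"cpu": 10592.7, "nnapi": 19886.2},
    "o11":              {"cpu": 584858.0, "nnapi": 7164840.0},
}

# Print how many times slower NNAPI is than CPU for each model.
for model, t in timings_us.items():
    slowdown = t["nnapi"] / t["cpu"]
    print(f"{model}: NNAPI is {slowdown:.1f}x slower than CPU")
```

So NNAPI is roughly 1.9x slower for the SSD model and over 12x slower for my own model.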
