Forums - TfLite NNAPI slower than CPU

TfLite NNAPI slower than CPU
marko.milosevic
Join Date: 14 Jul 21
Posts: 5
Posted: Mon, 2021-09-20 01:15
Hello,
 
I have noticed that NNAPI inference is slower than CPU inference. I am using TFLite built from source (v2.7 as of writing).
 
I used the benchmark tool provided with the TensorFlow source code and got the following results:
 
Model ssd_mobilenet_v1_1_default_1.tflite: CPU = 10.5 ms, NNAPI = 19.8 ms
 
My own model 'o11.tflite': CPU = 584 ms, NNAPI = 7168 ms
 
I see similar results both in my own code and with the prebuilt linux_aarch64_benchmark_model. I have tried both quantized and float models and varied the NNAPI options, but nothing helps.
 
I tried using Snapdragon Profiler to check whether the DSP is being utilized, but it doesn't seem to support the RB5 (it works fine with my Android phone).
 
What could be the issue?
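In case it's relevant, here are the extra diagnostics I'm planning to try next. The accelerator name below is just one of those the benchmark tool lists on my board; `--nnapi_accelerator_name` and `--enable_op_profiling` are standard benchmark_model flags, though I'm not certain they behave identically on this non-Android build:

```shell
# Pin NNAPI to one specific accelerator instead of letting it choose
# (names taken from the "NNAPI accelerators available" line in the logs below):
./benchmark_model --graph=o11.tflite --use_nnapi=true \
    --nnapi_accelerator_name=libunifiedhal-driver.so0

# Per-op profiling, to see which ops run on the delegate and which
# fall back to the CPU:
./benchmark_model --graph=o11.tflite --use_nnapi=true \
    --enable_op_profiling=true
```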
 
$ cat /build.prop
 
ro.product.board=qrb5165
ro.debuggable=1
ro.hardware.camera=qcom
ro.vendor.audio.sdk.ssr=false
ro.vendor.audio.sdk.fluencetype=none
persist.vendor.audio.fluence.voicecall=false
persist.vendor.audio.fluence.voicerec=false
persist.vendor.audio.fluence.speaker=false
persist.vendor.audio.fluence.audiorec=false
vendor.audio.tunnel.encode = false
vendor.audio.offload.buffer.size.kb=64
audio.offload.video=true
vendor.voice.path.for.pcm.voip=true
vendor.audio.offload.gapless.enabled=true
vendor.voice.playback.conc.disabled=true
vendor.voice.record.conc.disabled=true
vendor.voice.voip.conc.disabled=true
vendor.rec.playback.conc.disabled=true
vendor.audio.dolby.ds2.enabled=true
ro.qc.sdk.fwk.mic_support=6
persist.vendor.audio.qas.enabled=true
vendor.egl.default.platform=wayland
persist.vendor.sensors.enable.tdkMezzCard=false
service.adb.root=1
ro.build.version.release=202108122054
ro.product.name=qrb5165-qti-distro-ubuntu-fullstack-debug
 
Full benchmark calls and results:
 
$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=/home/euroicc/ssd_mobilenet_v1_1_default_1.tflite --num_threads=8
STARTING!
Log parameter values verbosely: [0]
Num threads: [8]
Graph: [/home/euroicc/ssd_mobilenet_v1_1_default_1.tflite]
#threads used for CPU inference: [8]
Loaded model /home/euroicc/ssd_mobilenet_v1_1_default_1.tflite
The input model file size (MB): 4.18331
Initialized session in 4.335ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=40 first=83329 curr=11507 min=9123 max=83329 avg=12614.5 std=11405
 
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=95 first=9643 curr=11855 min=8904 max=15041 avg=10592.7 std=1375
 
Inference timings in us: Init: 4335, First inference: 83329, Warmup (avg): 12614.5, Inference (avg): 10592.7
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=5.40234 overall=7.64062
 
$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=/home/euroicc/ssd_mobilenet_v1_1_default_1.tflite --num_threads=8 --use_nnapi=true --nnapi_allow_fp16=true --nnapi_execution_preference=sustained_speed
STARTING!
Log parameter values verbosely: [0]
Num threads: [8]
Graph: [/home/euroicc/ssd_mobilenet_v1_1_default_1.tflite]
#threads used for CPU inference: [8]
Use NNAPI: [1]
NNAPI execution preference: [sustained_speed]
NNAPI accelerators available: [libunifiedhal-driver.so0,libunifiedhal-driver.so1,libunifiedhal-driver.so2,libunifiedhal-driver.so3,nnapi-reference]
Allow fp16 in NNAPI: [1]
Loaded model /home/euroicc/ssd_mobilenet_v1_1_default_1.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
NNAPI delegate created.
Explicitly applied NNAPI delegate, and the model graph will be partially executed by the delegate w/ 1 delegate kernels.
The input model file size (MB): 4.18331
Initialized session in 459.384ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=25 first=28360 curr=20294 min=13159 max=28360 avg=20631.2 std=2740
 
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=21577 curr=19661 min=18862 max=22730 avg=19886.2 std=1095
 
Inference timings in us: Init: 459384, First inference: 28360, Warmup (avg): 20631.2, Inference (avg): 19886.2
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=16.418 overall=16.418
 
$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=/home/euroicc/rb5/models/o11.tflite --num_threads=8 --use_nnapi=true --nnapi_allow_fp16=true --nnapi_execution_preference=sustained_speed
STARTING!
Log parameter values verbosely: [0]
Num threads: [8]
Graph: [/home/euroicc/rb5/models/o11.tflite]
#threads used for CPU inference: [8]
Use NNAPI: [1]
NNAPI execution preference: [sustained_speed]
NNAPI accelerators available: [libunifiedhal-driver.so0,libunifiedhal-driver.so1,libunifiedhal-driver.so2,libunifiedhal-driver.so3,nnapi-reference]
Allow fp16 in NNAPI: [1]
Loaded model /home/euroicc/rb5/models/o11.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
NNAPI delegate created.
Explicitly applied NNAPI delegate, and the model graph will be partially executed by the delegate w/ 97 delegate kernels.
The input model file size (MB): 12.6068
Initialized session in 11335ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=10080560
 
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=21 first=7161284 curr=7243335 min=6961929 max=7394518 avg=7.16484e+06 std=119603
 
Inference timings in us: Init: 11334965, First inference: 10080560, Warmup (avg): 1.00806e+07, Inference (avg): 7.16484e+06
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=461.387 overall=1654.13
 
$ bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model --graph=/home/euroicc/rb5/models/o11.tflite --num_threads=8
STARTING!
Log parameter values verbosely: [0]
Num threads: [8]
Graph: [/home/euroicc/rb5/models/o11.tflite]
#threads used for CPU inference: [8]
Loaded model /home/euroicc/rb5/models/o11.tflite
The input model file size (MB): 12.6068
Initialized session in 12.748ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=1313188
 
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=617809 curr=560921 min=533275 max=663008 avg=584858 std=27823
 
Inference timings in us: Init: 12748, First inference: 1313188, Warmup (avg): 1.31319e+06, Inference (avg): 584858
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=8.12109 overall=382.352
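To quantify the gap, a quick sketch computing the slowdown factors from the averaged inference times reported by benchmark_model above (values in microseconds, copied from the logs):

```python
# Averaged "Inference (avg)" timings from the benchmark runs above, in us.
timings_us = {
    "ssd_mobilenet_v1": {"cpu": 10592.7, "nnapi": 19886.2},
    "o11":              {"cpu": 584858.0, "nnapi": 7164840.0},
}

# Print how many times slower NNAPI is than CPU for each model.
for model, t in timings_us.items():
    slowdown = t["nnapi"] / t["cpu"]
    print(f"{model}: NNAPI is {slowdown:.1f}x slower than CPU")
```

So NNAPI is roughly 1.9x slower for the SSD model and over 12x slower for my own model.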
