Hello,
I am attempting to profile the Llama 2 7B model using qnn-net-run. I want to verify operation on the CPU, GPU, and DSP individually. Through qnn-net-run, I was able to inspect the execution time of each layer (each operation in the ONNX graph). Below is a partial capture of the actual results.
Mobile Device
Galaxy S24, Snapdragon 8 Gen 3 for Galaxy
CPU
Log File Created: Tue May 14 18:39:45 2024
Time Scale: 1e-06
Epoch Timestamp: 1715679585048793
Steady Clock Timestamp: 2417645967274
Generated using:
qnn-profile-viewer v2.20.0.240223161333_83920
qnn-net-run v2.20.0.240223161333_83920
Backend v2.20.0.240223161333_83920
Execute Stats (Average):
Total Inference Time:
Graph 0 (llama_7B_layer_0): NetRun: 111511 us
Backend (GRAPH_EXECUTE): 111494 us
_dummy_input_ncf: 19 us
__input_layernorm_Pow: 8 us
__input_layernorm_ReduceMean: 64 us
__input_layernorm_Add: 12 us
__input_layernorm_Sqrt: 2 us
__input_layernorm_Div: 5 us
__input_layernorm_Mul: 15 us
__input_layernorm_Cast_1_output_0_ncf: 16 us
__input_layernorm_Mul_1: 5 us
__self_attn_v_proj_MatMul_pre_reshape: 2 us
__self_attn_k_proj_MatMul_pre_reshape: 2 us
__self_attn_q_proj_MatMul_pre_reshape: 2 us
__self_attn_q_proj_MatMul: 21327 us
__self_attn_q_proj_MatMul_post_reshape: 9 us
__self_attn_k_proj_MatMul: 9790 us
__self_attn_k_proj_MatMul_post_reshape: 9 us
__self_attn_v_proj_MatMul: 5452 us
__self_attn_v_proj_MatMul_post_reshape: 8 us
__self_attn_Transpose: 21 us
__self_attn_Transpose_1: 22 us
__self_attn_Transpose_2: 19 us
GPU
Total Inference Time:
Graph 0 (llama_7B_layer_0): NetRun: 205884 us
Backend (QnnGraph_execute): 205859 us
dummy_input_ncf: 42 us
_input_layernorm_Pow: 51 us
_input_layernorm_ReduceMean: 560 us
_input_layernorm_Add: 5 us
_input_layernorm_Sqrt: 2 us
_input_layernorm_Div: 1 us
_input_layernorm_Mul: 8 us
_input_layernorm_Cast_1_output_0_ncf: 90 us
_input_layernorm_Mul_1: 27 us
_self_attn_v_proj_MatMul_pre_reshape: 4 us
_self_attn_k_proj_MatMul_pre_reshape: 2 us
_self_attn_q_proj_MatMul_pre_reshape: 3 us
_self_attn_q_proj_MatMul: 9126 us
_self_attn_q_proj_MatMul_post_reshape: 7 us
_self_attn_k_proj_MatMul: 9529 us
_self_attn_k_proj_MatMul_post_reshape: 7 us
_self_attn_v_proj_MatMul: 10164 us
_self_attn_v_proj_MatMul_post_reshape: 8 us
_self_attn_Transpose: 9 us
_self_attn_Transpose_1: 7 us
_self_attn_Transpose_2: 9 us
DSP
Total Inference Time:
Graph 0 (llama_7B_layer_edited_0): NetRun: 33469 us
Backend (Number of HVX threads used): 4 count
Backend (RPC (execute) time): 33093 us
Backend (QNN accelerator (execute) time): 27313 us
Backend (Num times yield occured): 0 count
Backend (Time for initial VTCM acquire): 484 us
Backend (Time for HVX + HMX power on and acquire): 33292 us
Backend (Accelerator (execute) time (cycles)): 4146605 cycles
Input OpId_2 (cycles): 2701 cycles
dummy_input_ncf:OpId_17 (cycles): 30655 cycles
_input_layernorm_Pow:OpId_20 (cycles): 25352 cycles
_input_layernorm_ReduceMean:OpId_22 (cycles): 24761 cycles
_input_layernorm_Add:OpId_26 (cycles): 0 cycles
_input_layernorm_Sqrt:OpId_27 (cycles): 0 cycles
_input_layernorm_Div:OpId_29 (cycles): 68 cycles
_input_layernorm_Mul:OpId_30 (cycles): 176254 cycles
_input_layernorm_Cast_1_output_0_ncf:OpId_32 (cycles): 33012 cycles
_input_layernorm_Mul_1:OpId_35 (cycles): 36680 cycles
_self_attn_v_proj_MatMul_pre_reshape:OpId_36 (cycles): 0 cycles
_self_attn_k_proj_MatMul_pre_reshape:OpId_37 (cycles): 0 cycles
_self_attn_q_proj_MatMul_pre_reshape:OpId_38 (cycles): 0 cycles
_self_attn_q_proj_MatMul:OpId_41 (cycles): 0 cycles
_self_attn_q_proj_MatMul_post_reshape:OpId_43 (cycles): 0 cycles
_self_attn_k_proj_MatMul:OpId_46 (cycles): 0 cycles
_self_attn_k_proj_MatMul_post_reshape:OpId_48 (cycles): 0 cycles
_self_attn_v_proj_MatMul:OpId_51 (cycles): 0 cycles
_self_attn_v_proj_MatMul_post_reshape:OpId_53 (cycles): 0 cycles
_self_attn_Transpose:OpId_55 (cycles): 257195 cycles
_self_attn_Transpose_1:OpId_57 (cycles): 207143 cycles
_self_attn_Transpose_2:OpId_59 (cycles): 0 cycles
I noticed that certain operations on the DSP are recorded with 0 cycles (profiling_level: detailed). Operations like _self_attn_q_proj_MatMul involve matrix multiplication, which should consume cycles on the DSP.
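As a first triage step, the flattened profile output can be scanned mechanically for zero-cycle ops. A minimal sketch, assuming the per-op lines keep the `name:OpId_N (cycles): X cycles` shape shown above (the parsing helper is my own, not part of the QNN tooling):

```python
import re

# Matches per-op lines such as "_self_attn_q_proj_MatMul:OpId_41 (cycles): 0 cycles"
OP_LINE = re.compile(r"^(?P<name>\S+?):OpId_(?P<opid>\d+) \(cycles\): (?P<cycles>\d+) cycles$")

def zero_cycle_ops(log_text):
    """Return names of ops whose per-op cycle count is exactly 0."""
    zeros = []
    for line in log_text.splitlines():
        m = OP_LINE.match(line.strip())
        if m and int(m.group("cycles")) == 0:
            zeros.append(m.group("name"))
    return zeros

sample = """\
_input_layernorm_Mul:OpId_30 (cycles): 176254 cycles
_self_attn_q_proj_MatMul:OpId_41 (cycles): 0 cycles
_self_attn_Transpose:OpId_55 (cycles): 257195 cycles
"""
print(zero_cycle_ops(sample))  # -> ['_self_attn_q_proj_MatMul']
```

Running this over the full DSP log flags all of the MatMul and reshape ops at once, which makes the pattern (every projection MatMul at 0 cycles) easy to see.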
To analyze this further, I profiled with the backend profiling level (HTP linting) to extract a Chrome trace, and then inspected it in Perfetto.
The Chrome trace results are as follows, showing duration, resources allocated, etc., for each operation.
In this Chrome trace, the operations recorded as 0 cycles in the log appear as background operations, and no resource allocation can be confirmed for them. Operations with non-zero cycle counts, on the other hand, show duration, allocated resources, and related information in the foreground.
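To cross-check the trace programmatically rather than eyeballing Perfetto, the Chrome trace JSON can be filtered for zero-duration events. This is a sketch against the generic Chrome Trace Event format (complete events with `ph == "X"` and a `dur` field); the sample trace below is fabricated, and the exact fields emitted by the QNN Chrometrace reader may differ:

```python
import json

def zero_duration_events(trace):
    """From a Chrome trace dict, return names of complete ('X') events
    whose duration is 0 -- candidates for ops that never ran on the HTP."""
    return [ev.get("name") for ev in trace.get("traceEvents", [])
            if ev.get("ph") == "X" and ev.get("dur", 0) == 0]

# Minimal fabricated trace for illustration; field names follow the
# Chrome Trace Event format, not necessarily the actual qnn output.
sample = json.loads("""
{"traceEvents": [
  {"name": "_input_layernorm_Mul", "ph": "X", "ts": 10, "dur": 425, "pid": 1, "tid": 1},
  {"name": "_self_attn_q_proj_MatMul", "ph": "X", "ts": 500, "dur": 0, "pid": 1, "tid": 1}
]}
""")
print(zero_duration_events(sample))  # -> ['_self_attn_q_proj_MatMul']
```

If the zero-duration set here matches the zero-cycle set in the text log, the two profiling paths agree and the problem is upstream of the profiler.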
Do these results indicate that these operations cannot run on the DSP (HTP), hence the 0 cycles? If not, could it be an issue with the ONNX file, the conversion process, or the qnn-net-run configuration file not properly allocating work to the DSP?
Please let me know if you need any further information or clarification!
Below is the command I used for DSP execution.
0. Model Export (*.onnx) & Extraction of Sub-Layer ONNX Files If Needed (host)
Extracted the 0th layer of the Llama 2 7B model and saved it as an ONNX file.
1. Model conversion & quantization (host)
${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter --input_network /disk/models/outputlayer/llama_7B_layer_0.onnx --no_simplification --batch 1 --input_list /disk/models/outputlayer/input_list.txt --output_path /disk/models/outputlayer/llama_7B_layer_edited_0.cpp --act_bw 8 --weight_bw 8
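For context, the file passed to --input_list names the raw input tensor files, one inference per line. A hypothetical single-line layout for this one-input graph (the tensor name dummy_input and the path are placeholders, not taken from my actual setup):

```
dummy_input:=/disk/models/outputlayer/inputs/input_0.raw
```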
2. Model compile (host)
${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-model-lib-generator -c /disk/models/outputlayer/llama_7B_layer_edited_0.cpp -b /disk/models/outputlayer/llama_7B_layer_edited_0.bin -o /disk/models/outputlayer/model_libs/x86_64-linux-clang/libllama_7B_layer_0_jyh.so
3. Make backend configuration file (host, device(only different path))
Created a backend extensions configuration file for HTP usage. The on-device version differs only in the library and config-file paths; the rest of the content is identical to the config files below.
$ vi htp_backend_extensions_jyh.json
{
"backend_extensions": {
"shared_library_path": "/opt/qcom/aistack/qnn/2.20.0.240223/lib/x86_64-linux-clang/libQnnHtpNetRunExtensions.so",
"config_file_path": "/opt/qcom/aistack/qnn/2.20.0.240223/benchmarks/QNN/config/htp_config_part1_jyh.json"
}
}
$ vi htp_config_part1_jyh.json
{
"graphs": {
"graph_names": ["llama_7B_layer_edited_0"],
"fp16_relaxed_precision": 1,
"vtcm_mb": 8
},
"devices": [
{
"dsp_arch": "v73",
"profiling_level": "linting",
"cores": [
{
"core_id": 0,
"perf_profile": "high_performance",
"rpc_control_latency": 100,
"rpc_polling_time": 9999,
"hmx_timeout_us": 300000
}
]
}
],
"context": {
"enable_weight_sharing": false
}
}
4. Model offline compile for cache model (host)
Offline compilation for HTP usage.
${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-context-binary-generator --backend /opt/qcom/aistack/qnn/2.20.0.240223/lib/x86_64-linux-clang/libQnnHtp.so --model /disk/models/outputlayer/model_libs/x86_64-linux-clang/libllama_7B_layer_0_jyh.so/x86_64-linux-clang/libllama_7B_layer_edited_0.so --output_dir /opt/qcom/aistack/qnn/2.20.0.240223/result_tmp --binary_file llama_7B_layer_edited_0.bin --config_file /opt/qcom/aistack/qnn/2.20.0.240223/benchmarks/QNN/config/htp_backend_extensions_jyh.json
5. Execute qnn-net-run on Galaxy S24 Ultra (device)
Transferred the config files, model binary, libQnnHtp.so, etc., to the mobile device and executed the following command:
./qnn-net-run --backend libQnnHtp.so --input_list input_list.txt --retrieve_context llama_7B_layer_edited_0_test.bin --profiling_level backend --config_file htp_backend_extensions_jyh.json
After execution, the generated qnn-profiling-data_0.log was transferred back to the host.
6. Extract results to log & Chrome trace (host)
../../../bin/x86_64-linux-clang/qnn-profile-viewer --input_log ./qnn-profiling-data_0.log --reader /opt/qcom/aistack/qnn/2.20.0.240223/lib/x86_64-linux-clang/libQnnChrometraceProfilingReader.so --output ./chrometrace.json