qnn-net-run : LLaMA2 zero cycle result of dsp execution
yhjeon
Join Date: 29 Apr 24
Posts: 1
Posted: Wed, 2024-05-22 22:55

Hello,

I am attempting to profile the LLaMA 2 7B model using qnn-net-run. I want to verify execution on the CPU, GPU, and DSP individually.

Through qnn-net-run I could inspect the execution time of each layer (i.e., each operation in the ONNX graph). Below is a partial capture of the actual results.

Mobile Device

Galaxy S24, Snapdragon 8 Gen 3 for Galaxy

CPU

Log File Created: Tue May 14 18:39:45 2024
Time Scale: 1e-06
Epoch Timestamp: 1715679585048793
Steady Clock Timestamp: 2417645967274
Generated using:
qnn-profile-viewer v2.20.0.240223161333_83920
qnn-net-run v2.20.0.240223161333_83920
Backend v2.20.0.240223161333_83920

Execute Stats (Average):

Total Inference Time:

Graph 0 (llama_7B_layer_0):
    NetRun: 111511 us
    Backend (GRAPH_EXECUTE): 111494 us
        _dummy_input_ncf: 19 us
        __input_layernorm_Pow: 8 us
        __input_layernorm_ReduceMean: 64 us
        __input_layernorm_Add: 12 us
        __input_layernorm_Sqrt: 2 us
        __input_layernorm_Div: 5 us
        __input_layernorm_Mul: 15 us
        __input_layernorm_Cast_1_output_0_ncf: 16 us
        __input_layernorm_Mul_1: 5 us
        __self_attn_v_proj_MatMul_pre_reshape: 2 us
        __self_attn_k_proj_MatMul_pre_reshape: 2 us
        __self_attn_q_proj_MatMul_pre_reshape: 2 us
        __self_attn_q_proj_MatMul: 21327 us
        __self_attn_q_proj_MatMul_post_reshape: 9 us
        __self_attn_k_proj_MatMul: 9790 us
        __self_attn_k_proj_MatMul_post_reshape: 9 us
        __self_attn_v_proj_MatMul: 5452 us
        __self_attn_v_proj_MatMul_post_reshape: 8 us
        __self_attn_Transpose: 21 us
        __self_attn_Transpose_1: 22 us
        __self_attn_Transpose_2: 19 us

GPU

Total Inference Time:

Graph 0 (llama_7B_layer_0):
    NetRun: 205884 us
    Backend (QnnGraph_execute): 205859 us
        dummy_input_ncf: 42 us
        _input_layernorm_Pow: 51 us
        _input_layernorm_ReduceMean: 560 us
        _input_layernorm_Add: 5 us
        _input_layernorm_Sqrt: 2 us
        _input_layernorm_Div: 1 us
        _input_layernorm_Mul: 8 us
        _input_layernorm_Cast_1_output_0_ncf: 90 us
        _input_layernorm_Mul_1: 27 us
        _self_attn_v_proj_MatMul_pre_reshape: 4 us
        _self_attn_k_proj_MatMul_pre_reshape: 2 us
        _self_attn_q_proj_MatMul_pre_reshape: 3 us
        _self_attn_q_proj_MatMul: 9126 us
        _self_attn_q_proj_MatMul_post_reshape: 7 us
        _self_attn_k_proj_MatMul: 9529 us
        _self_attn_k_proj_MatMul_post_reshape: 7 us
        _self_attn_v_proj_MatMul: 10164 us
        _self_attn_v_proj_MatMul_post_reshape: 8 us
        _self_attn_Transpose: 9 us
        _self_attn_Transpose_1: 7 us
        _self_attn_Transpose_2: 9 us

DSP

Total Inference Time:

 

Graph 0 (llama_7B_layer_edited_0):
    NetRun: 33469 us
    Backend (Number of HVX threads used): 4 count
    Backend (RPC (execute) time): 33093 us
    Backend (QNN accelerator (execute) time): 27313 us
    Backend (Num times yield occured): 0 count
    Backend (Time for initial VTCM acquire): 484 us
    Backend (Time for HVX + HMX power on and acquire): 33292 us
    Backend (Accelerator (execute) time (cycles)): 4146605 cycles
        Input OpId_2 (cycles): 2701 cycles
        dummy_input_ncf:OpId_17 (cycles): 30655 cycles
        _input_layernorm_Pow:OpId_20 (cycles): 25352 cycles
        _input_layernorm_ReduceMean:OpId_22 (cycles): 24761 cycles
        _input_layernorm_Add:OpId_26 (cycles): 0 cycles
        _input_layernorm_Sqrt:OpId_27 (cycles): 0 cycles
        _input_layernorm_Div:OpId_29 (cycles): 68 cycles
        _input_layernorm_Mul:OpId_30 (cycles): 176254 cycles
        _input_layernorm_Cast_1_output_0_ncf:OpId_32 (cycles): 33012 cycles
        _input_layernorm_Mul_1:OpId_35 (cycles): 36680 cycles
        _self_attn_v_proj_MatMul_pre_reshape:OpId_36 (cycles): 0 cycles
        _self_attn_k_proj_MatMul_pre_reshape:OpId_37 (cycles): 0 cycles
        _self_attn_q_proj_MatMul_pre_reshape:OpId_38 (cycles): 0 cycles
        _self_attn_q_proj_MatMul:OpId_41 (cycles): 0 cycles
        _self_attn_q_proj_MatMul_post_reshape:OpId_43 (cycles): 0 cycles
        _self_attn_k_proj_MatMul:OpId_46 (cycles): 0 cycles
        _self_attn_k_proj_MatMul_post_reshape:OpId_48 (cycles): 0 cycles
        _self_attn_v_proj_MatMul:OpId_51 (cycles): 0 cycles
        _self_attn_v_proj_MatMul_post_reshape:OpId_53 (cycles): 0 cycles
        _self_attn_Transpose:OpId_55 (cycles): 257195 cycles
        _self_attn_Transpose_1:OpId_57 (cycles): 207143 cycles
        _self_attn_Transpose_2:OpId_59 (cycles): 0 cycles

I noticed that certain operations on the DSP are recorded with 0 cycles (profiling_level: detailed). Operations like _self_attn_q_proj_MatMul involve matrix multiplication, which should certainly consume cycles on the DSP.

 

To analyze this further, I ran profiling again with the backend's linting profiling level to extract a Chrome trace, and then inspected it in Perfetto.

In the Chrome trace, each operation is listed with its duration, allocated resources, and so on.

In this Chrome trace, operations recorded as 0 cycles in the log are treated as Background operations, and resource allocation cannot be confirmed.

On the other hand, operations with cycle numbers in the log provide information such as duration and allocated resources in the foreground.
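For reference, the entries in chrometrace.json follow the standard Chrome Trace Event Format, so a foreground operation shows up roughly like the sketch below. The field values here are made up purely for illustration (they are not copied from my trace), and the exact contents of "args" depend on what the QNN Chrometrace reader emits:

{
  "name": "_input_layernorm_Mul",
  "ph": "X",
  "ts": 2417645967274,
  "dur": 176,
  "pid": 0,
  "tid": 0,
  "args": {}
}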

I wonder whether these results indicate that these operations cannot run on the DSP (HTP), hence the 0-cycle readings. If not, could it be an issue with the ONNX file, the conversion process, or the qnn-net-run configuration file not properly assigning the work to the DSP?

Please let me know if you need any further information or clarification from my side!

Below are the commands I used for the DSP execution flow.

 

0. Model Export (*.onnx) & Extraction of Sub-Layer ONNX Files If Needed (host)

Extracted the 0th layer of the LLaMA 2 7B model and saved it as an ONNX file.

1. Model conversion & quantization (host)

${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter --input_network /disk/models/outputlayer/llama_7B_layer_0.onnx --no_simplification --batch 1 --input_list /disk/models/outputlayer/input_list.txt --output_path /disk/models/outputlayer/llama_7B_layer_edited_0.cpp --act_bw 8 --weight_bw 8
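For completeness, the input_list.txt referenced above follows the usual converter/net-run convention as I understand it: one line per inference, with each input given as input_name:=path pointing at a preprocessed raw tensor file. The tensor names and paths below are placeholders, not the actual ones from my model:

hidden_states:=/disk/models/outputlayer/inputs/hidden_states_0.raw attention_mask:=/disk/models/outputlayer/inputs/attention_mask_0.raw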

2. Model compile (host)

${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-model-lib-generator -c /disk/models/outputlayer/llama_7B_layer_edited_0.cpp -b /disk/models/outputlayer/llama_7B_layer_edited_0.bin -o /disk/models/outputlayer/model_libs/x86_64-linux-clang/libllama_7B_layer_0_jyh.so

3. Create backend configuration files (host and device; only the paths differ)

Created a backend extensions configuration file for HTP usage. The on-device version differs only in the library and config-file paths; the rest of the content is identical to the config shown here.

$ vi htp_backend_extensions_jyh.json

{
  "backend_extensions": {
    "shared_library_path": "/opt/qcom/aistack/qnn/2.20.0.240223/lib/x86_64-linux-clang/libQnnHtpNetRunExtensions.so",
    "config_file_path": "/opt/qcom/aistack/qnn/2.20.0.240223/benchmarks/QNN/config/htp_config_part1_jyh.json"
  }
}

$ vi htp_config_part1_jyh.json

{
  "graphs": {
    "graph_names": ["llama_7B_layer_edited_0"],
    "fp16_relaxed_precision": 1,
    "vtcm_mb": 8
  },
  "devices": [
    {
      "dsp_arch": "v73",
      "profiling_level": "linting",
      "cores": [
        {
          "core_id": 0,
          "perf_profile": "high_performance",
          "rpc_control_latency": 100,
          "rpc_polling_time": 9999,
          "hmx_timeout_us": 300000
        }
      ]
    }
  ],
  "context": {
    "enable_weight_sharing": false
  }
}
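One note on this config: the per-op cycle log quoted earlier in this post was collected with profiling_level set to "detailed" rather than "linting"; as far as I can tell, switching between the cycle log and the Chrome trace only requires changing that one field in the devices section, roughly:

  "devices": [
    {
      "dsp_arch": "v73",
      "profiling_level": "detailed",
      "cores": [ ... same as above ... ]
    }
  ]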

4. Offline model compile for the cached context binary (host)

Offline compilation for HTP usage.

${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-context-binary-generator --backend /opt/qcom/aistack/qnn/2.20.0.240223/lib/x86_64-linux-clang/libQnnHtp.so --model /disk/models/outputlayer/model_libs/x86_64-linux-clang/libllama_7B_layer_0_jyh.so/x86_64-linux-clang/libllama_7B_layer_edited_0.so --output_dir /opt/qcom/aistack/qnn/2.20.0.240223/result_tmp --binary_file llama_7B_layer_edited_0.bin --config_file /opt/qcom/aistack/qnn/2.20.0.240223/benchmarks/QNN/config/htp_backend_extensions_jyh.json
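As a sanity check that every op was actually compiled into the HTP context (rather than silently dropped or left to fall back), I believe the serialized context can be dumped to JSON with qnn-context-binary-utility, roughly as below; I am quoting the flag names from memory, so they may differ in 2.20:

${QNN_SDK_ROOT}/bin/x86_64-linux-clang/qnn-context-binary-utility --context_binary /opt/qcom/aistack/qnn/2.20.0.240223/result_tmp/llama_7B_layer_edited_0.bin --json_file ./llama_7B_layer_edited_0_context.json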

5. Execute qnn-net-run on Galaxy S24 Ultra (device)

Transferred the config files, model .bin, libQnnHtp.so, etc., to the mobile device and executed the following command:

./qnn-net-run --backend libQnnHtp.so --input_list input_list.txt --retrieve_context llama_7B_layer_edited_0_test.bin --profiling_level backend --config_file htp_backend_extensions_jyh.json

After executing the command, I transferred the generated qnn-profiling-data_0.log back to the host.
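For reference, the transfer in both directions was done with plain adb, roughly as below; /data/local/tmp/llama_prof is just an arbitrary working directory I chose on the device, and the exact file set and the location of the profiling log may differ depending on your setup and --output_dir:

DEVICE_DIR=/data/local/tmp/llama_prof
adb shell mkdir -p ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/bin/aarch64-android/qnn-net-run ${DEVICE_DIR}/
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}/
adb push llama_7B_layer_edited_0_test.bin input_list.txt ${DEVICE_DIR}/
adb push htp_backend_extensions_jyh.json htp_config_part1_jyh.json ${DEVICE_DIR}/
# ... plus the raw input tensors and the HTP stub/skel libraries
# run the qnn-net-run command above inside ${DEVICE_DIR}, then pull the log back:
adb pull ${DEVICE_DIR}/output/qnn-profiling-data_0.log .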

6. Extract results to log & Chrome trace (host)

../../../bin/x86_64-linux-clang/qnn-profile-viewer --input_log ./qnn-profiling-data_0.log --reader /opt/qcom/aistack/qnn/2.20.0.240223/lib/x86_64-linux-clang/libQnnChrometraceProfilingReader.so --output ./chrometrace.json
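The plain-text CPU/GPU/DSP summaries quoted at the top of this post came from the same tool: as far as I can tell, running qnn-profile-viewer on the log without specifying a --reader emits that textual report, e.g.:

../../../bin/x86_64-linux-clang/qnn-profile-viewer --input_log ./qnn-profiling-data_0.log > dsp_profile_summary.txt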
