Forums - Link to Eigen but no performance improved

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Thu, 2017-11-09 23:03

I linked QSML to Eigen through its BLAS interface. Eigen is used by ceres-solver to do some optimization for SLAM.

But after switching to QSML, the performance has not improved. Any suggestions?

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Fri, 2017-11-10 09:56

Hi shengxd1,

Could you provide more details about your runtime environment, e.g. which device and which versions of the OS, QSML and Eigen you are using?

 

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Tue, 2017-11-14 23:57

Snapdragon 820 CPU,

Android 7.0,

QSML is the latest version,

Eigen is the latest version, 3.3.4,

and ceres-solver is also the latest version.

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Wed, 2017-11-15 01:28

I also confirmed that Eigen links against QSML successfully, which means it uses QSML for some of its calculations.

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Wed, 2017-11-15 10:27

Hi shengxd1, 

Thank you for the reply. 

I will try to compare Eigen and QSML on my end and let you know. A couple more questions: are you using 64-bit builds? And sequential or parallel?

 

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Wed, 2017-11-15 17:51

Yes, I used 64-bit builds.

Sequential or parallel? I compile and run the program sequentially.

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Thu, 2017-11-16 09:39

Hi shengxd1, 

Thank you very much for the reply. 

I compared the latest versions of Eigen and QSML on Snapdragon 820, for the matrix multiply operation only. QSML seems to be about 2x faster than Eigen for a DGEMM operation on 1200x1200 matrices (this is with the sequential library).

I am not familiar with the ceres-solver library. Do you know which BLAS operations it uses and what the problem sizes are for these BLAS calls? If not, I can try to run ceres-solver myself. In that case, could you tell me how you are compiling, linking and running it (to reproduce and hopefully resolve the issue)?

Also, for the sequential/parallel question, I meant which QSML library you used to link with Eigen.

 

 

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Thu, 2017-11-16 17:50

I linked with libQSML-0.15.2.so. 

I will try libQSML-sequential-0.15.2.so.

I didn't know there were sequential and parallel versions of QSML.

So what is the difference?

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Thu, 2017-11-16 18:35

The sequential library does all the computation with only 1 thread/core, while the parallel version tries to utilize all the cores and distribute the work among them.

So, if your application is not parallel itself (meaning only one thread in your application), parallel QSML provides higher performance.
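
To make the distinction concrete, here is a minimal sketch of the general idea only. It is not QSML's implementation, which is not shown in this thread. The same naive multiply is run either on one thread or with its rows split across several threads. The names `multiply_rows`, `multiply_sequential` and `multiply_parallel` are hypothetical, and a real BLAS uses far more sophisticated blocking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Row-major n x n matrix stored in a flat vector (an assumption of this sketch).
using Matrix = std::vector<double>;

// Compute rows [row_begin, row_end) of c = a * b.
void multiply_rows(const Matrix& a, const Matrix& b, Matrix& c,
                   std::size_t n, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// "Sequential" flavor: the calling thread computes every row.
Matrix multiply_sequential(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    multiply_rows(a, b, c, n, 0, n);
    return c;
}

// "Parallel" flavor: rows are split into contiguous chunks, one per thread.
// Each thread writes a disjoint set of rows of c, so no locking is needed.
Matrix multiply_parallel(const Matrix& a, const Matrix& b, std::size_t n,
                         unsigned num_threads) {
    assert(a.size() == n * n && b.size() == n * n);
    assert(num_threads > 0);
    Matrix c(n * n, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no rows left for further threads
        workers.emplace_back([&a, &b, &c, n, begin, end] {
            multiply_rows(a, b, c, n, begin, end);
        });
    }
    for (auto& w : workers) w.join();
    return c;
}
```

Both functions return the same result. The parallel one can only be faster when there are idle cores to run the extra threads, which is why it helps most when the calling application is single-threaded.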

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Thu, 2017-11-16 21:10

I use ceres-solver inside my project, so it would be difficult for you to reproduce the problem.

I'll try to find out which operation takes the most time. Currently I only know that the linear solver takes a lot of time, and I assume it must use some matrix calculations.

Our problem size is about 300 variables.

ceres-solver is an open-source library from Google; it is easy to compile for ARM using the script included with it.

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Thu, 2017-11-16 21:39

Hi,

I tried a matrix multiply like this:

------------------------------------------------------------------------

TicToc t_s;
MatrixXd m = MatrixXd::Random(1200,1200);
MatrixXd n = MatrixXd::Random(1200,1200);
MatrixXd k = m*n;
std::cout << "time: " << t_s.toc() << ", " << k(0,0) << endl;

------------------------------------------------------------------------

But the running time with and without QSML seems almost the same.

In Android.mk, I use QSML like this:

------------------------------------------------------------------------

include $(CLEAR_VARS)
LOCAL_MODULE := qsml
LOCAL_SRC_FILES := Thirdparty/QSML/android/$(APP_ABI)/lp64/ndk-r11/lib/libQSML-sequential-0.15.2.so #libQSML-0.15.2.so
include $(PREBUILT_SHARED_LIBRARY)
...
...
LOCAL_SHARED_LIBRARIES := qsml
LOCAL_CFLAGS += -DEIGEN_USE_BLAS
LOCAL_CFLAGS += -fPIC -frtti -fexceptions -lz -Wno-long-long -O3
------------------------------------------------------------------------
 
and Application.mk is set up this way:
 
APP_ABI:= arm64-v8a #arm64-v8a #armeabi-v7a,armeabi,x86,arm64-v8a
APP_STL := gnustl_static #c++_shared# #gnustl_static
APP_CPPFLAGS:=-frtti -fexceptions -std=c++11 -fvisibility=hidden -mfpu=neon-vfpv4 -mfloat-abi=softfp
APP_PLATFORM := android-21

 

My NDK version is r15c.

So, is there any problem with how I use QSML?

 

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Fri, 2017-11-17 09:26

Hi shengxd1, 

In my experiment, I created two different programs: one calls QSML directly, the other calls Eigen (without QSML).

But I will try to do the experiment your way i.e. use QSML through Eigen and give you an update.

 

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Fri, 2017-11-17 10:18

Hi shengxd1, 

One thing to note: shouldn't you only time the multiply? i.e. put TicToc around only the "k = m*n;" statement, and not the matrix creation.

But I tried your exact code (except that I used std::chrono instead of TicToc), i.e. timing both matrix creation and the multiply, and I can still see about a 1.7x speedup when using QSML.

Can you provide some details on how you are running the executable?
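
The timing advice above can be sketched as a small helper that measures only the work passed to it. This is an illustration under assumptions, not code from either poster: `time_seconds` and `best_of` are hypothetical names, and repeating the run and keeping the fastest time is an added suggestion to reduce noise.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper: run `work` once and return the elapsed wall time in
// seconds. Only what happens inside `work` is timed.
template <typename Work>
double time_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: repeat the measurement and keep the fastest run,
// which reduces noise from the scheduler and from cold caches.
template <typename Work>
double best_of(int runs, Work&& work) {
    assert(runs > 0);
    double best = time_seconds(work);
    for (int i = 1; i < runs; ++i) {
        best = std::min(best, time_seconds(work));
    }
    return best;
}

// Usage with the Eigen code from this thread (m and n are created before
// the timed region, so matrix creation is not measured):
//   MatrixXd k;
//   double t = best_of(10, [&] { k = m * n; });
```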

 

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Sun, 2017-11-19 19:24

Hasan, thank you very much.

I tested my program again, using std::chrono for timing, and I now get about a 1.3x speedup with the sequential library.

With the parallel library, I get about a 3~4x speedup.

The difference from your speedup may be due to a different CPU. I re-confirmed that the CPU I use is a Snapdragon 821.

 

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Sun, 2017-11-19 20:33

One more thing: if I change the matrix size from 1200 to 300, there seems to be almost no speedup from using QSML.
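
For context on what changes between the two sizes, a dense n x n multiply does roughly 2n^3 floating-point operations, so going from 1200 to 300 cuts the arithmetic by a factor of 64 and any fixed overhead weighs more. The sketch below works that arithmetic out; `gemm_flops` and `gemm_gflops` are hypothetical helper names, and the 2n^3 count is the usual convention rather than an exact figure for any particular library.

```cpp
#include <cassert>
#include <cstdint>

// Approximate floating-point operations in a dense n x n by n x n multiply:
// n*n outputs, each needing about n multiplies and n adds (2*n^3 convention).
std::uint64_t gemm_flops(std::uint64_t n) {
    return 2 * n * n * n;
}

// Achieved rate in GFLOP/s for a multiply of size n that took `seconds`.
double gemm_gflops(std::uint64_t n, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(gemm_flops(n)) / seconds / 1e9;
}

// gemm_flops(1200) is 3,456,000,000 and gemm_flops(300) is 54,000,000,
// so the 300x300 case does 64 times less arithmetic.
```

Comparing GFLOP/s instead of raw seconds makes runs at different sizes easier to compare.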

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Mon, 2017-11-20 10:26

Hi shengxd1, 

You are welcome. For size 300, did you change the timing code to time only the multiplication? 

Try your code like this:

-------------------------------------------------------

MatrixXd m = MatrixXd::Random(300,300);
MatrixXd n = MatrixXd::Random(300,300);
MatrixXd k = MatrixXd::Random(300,300);
 
std::chrono::.....
k = m*n;
std::chrono::.....

-------------------------------------------------------

With code in this style, I still see about a 1.85x speedup for size 300 with the sequential library on Snapdragon 820.

Can you confirm the timing style?

 

EDIT: The reason for this is to compare only the actual operation. When the input is small (like 300), creating the matrices can be the dominating cost, and that cost shows no difference in performance between libraries.
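
The effect described in the EDIT can be estimated with an Amdahl-style calculation. The sketch below is illustrative only: `PhaseTimes`, `setup_fraction`, `overall_speedup` and `time_phases` are hypothetical names, and the numbers in the comments are made up for the example, not measurements from this thread.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical record of where the time went, in seconds.
struct PhaseTimes {
    double setup;      // e.g. creating and filling the matrices
    double operation;  // e.g. the multiply itself
};

// Fraction of the total time spent in setup, in [0, 1].
double setup_fraction(const PhaseTimes& t) {
    const double total = t.setup + t.operation;
    assert(total > 0.0);
    return t.setup / total;
}

// Amdahl-style estimate: if only the operation gets `op_speedup` times
// faster, the whole timed region gets this much faster.
// Made-up example: setup_frac = 0.8 and op_speedup = 2 gives about 1.11.
double overall_speedup(double setup_frac, double op_speedup) {
    assert(setup_frac >= 0.0 && setup_frac <= 1.0 && op_speedup > 0.0);
    return 1.0 / (setup_frac + (1.0 - setup_frac) / op_speedup);
}

// Time two callables separately so setup cost is not mixed into the operation.
template <typename Setup, typename Operation>
PhaseTimes time_phases(Setup&& setup, Operation&& operation) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    setup();
    const auto t1 = clock::now();
    operation();
    const auto t2 = clock::now();
    return {std::chrono::duration<double>(t1 - t0).count(),
            std::chrono::duration<double>(t2 - t1).count()};
}
```

If `setup_fraction` comes out large for the 300x300 case, a faster BLAS cannot show up in the total time, which is the reason to time the multiply on its own.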

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Mon, 2017-11-20 18:11

My code style is the same as you suggested.

I find that in armeabi-v7a mode there is a speedup, but in arm64-v8a there is no speedup.

Attached are my test code and build files.

test_qsml.cpp:

#include <iostream>
#include <chrono>

#include <eigen3/Eigen/Dense>
using namespace Eigen;
using namespace std;

int main()
{
    MatrixXd m = MatrixXd::Random(300,300);
    MatrixXd n = MatrixXd::Random(300,300);
    MatrixXd k;
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

    //for(int i=0; i<10;i++){
    k = m*n;
    //}
    std::chrono::steady_clock::time_point t2 = std::chrono::steady_clock::now();

    double timeSpent =std::chrono::duration_cast<std::chrono::duration<double>>(t2-t1).count();
    std::cout << "time used: " << timeSpent <<", " << k(0,0) << endl;

    return 0;
}

 

Android.mk:

LOCAL_PATH := $(call my-dir)

#-----------------------------------------

include $(CLEAR_VARS)
LOCAL_MODULE := qsml
LOCAL_SRC_FILES := QSML/android/$(APP_ABI)/lp64/ndk-r11/lib/libQSML-sequential-0.15.2.so #libQSML-0.15.2.so libQSML-sequential-0.15.2
include $(PREBUILT_SHARED_LIBRARY)
#---------------------------------------


#build a test executable
include $(CLEAR_VARS)


LOCAL_MODULE := test_qsml
LOCAL_SRC_FILES := $(LOCAL_PATH)/test_qsml.cpp


LOCAL_SHARED_LIBRARIES := qsml
LOCAL_CFLAGS += -DEIGEN_USE_BLAS


LOCAL_CFLAGS += -fPIC -frtti -fexceptions -lz -O3
LOCAL_LDLIBS += -lm -llog -lz
include $(BUILD_EXECUTABLE)

 

Application.mk:

APP_ABI:= arm64-v8a #arm64-v8a #armeabi-v7a,armeabi,x86,arm64-v8a
APP_STL := gnustl_static #c++_shared# #gnustl_static
APP_CPPFLAGS:=-frtti -fexceptions -std=c++11 -fvisibility=hidden -mfpu=neon-vfpv4 -mfloat-abi=softfp
APP_PLATFORM := android-21

 

 

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Mon, 2017-11-20 18:57

I got a speedup in armeabi-v7a mode, but in arm64-v8a mode there is no speedup.

The following is my code:

---------------------------------------

test_qsml.cpp:

#include <iostream>
#include <chrono>

#include <eigen3/Eigen/Dense>
using namespace Eigen;
using namespace std;

int main()
{

    MatrixXd m = MatrixXd::Random(300,300);    
    MatrixXd n = MatrixXd::Random(300,300);
    MatrixXd k;
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
     
    //for(int i=0; i<10;i++){
    k = m*n;
    //}
    std::chrono::steady_clock::time_point t2 = std::chrono::steady_clock::now();


    double timeSpent =std::chrono::duration_cast<std::chrono::duration<double>>(t2-t1).count();
    std::cout << "time used: " << timeSpent <<", " << k(0,0) << endl;

    return 0;
}

---------------------------------------

Android.mk:

LOCAL_PATH := $(call my-dir)

include $(CLEAR_VARS)
LOCAL_MODULE := qsml
LOCAL_SRC_FILES := QSML/android/$(APP_ABI)/lp64/ndk-r11/lib/libQSML-sequential-0.15.2.so #libQSML-0.15.2.so libQSML-sequential-0.15.2
include $(PREBUILT_SHARED_LIBRARY)

#build a test executable
include $(CLEAR_VARS)


LOCAL_MODULE := test_qsml
LOCAL_SRC_FILES := $(LOCAL_PATH)/test_qsml.cpp


LOCAL_SHARED_LIBRARIES := qsml
LOCAL_CFLAGS += -DEIGEN_USE_BLAS


LOCAL_CFLAGS += -fPIC -frtti -fexceptions -lz -O3
LOCAL_LDLIBS += -lm -llog -lz
include $(BUILD_EXECUTABLE)

---------------------------------------

Application.mk:

#APP_ABI := all
APP_ABI:= arm64-v8a #arm64-v8a #armeabi-v7a,armeabi,x86,arm64-v8a
APP_STL := gnustl_static #c++_shared# #gnustl_static  
APP_CPPFLAGS:=-frtti -fexceptions -std=c++11 -fvisibility=hidden -mfpu=neon-vfpv4 -mfloat-abi=softfp 
APP_PLATFORM := android-21
#APP_OPTIM := release

 

 

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Tue, 2017-11-21 11:17

Hi shengxd1, 

Even with your code, I can still see a speedup of about 20%.

A couple of things I should mention:

1. For problems of such small sizes, the execution times may look very close if you only compare the raw numbers. Instead, you can try to calculate the difference as a percentage.

2. If your device's CPU governor/power policy is set so that the high-performance cores are offline or running at the lowest frequency by default, and only become fully active when CPU utilization gets high, a problem of size 300 takes so little time that it may not get a chance to use the high-performance cores. In that case, both runs may be executing on the low-power cores, and optimization for low-power cores is not supported in the current version of QSML (it could be supported in future versions).
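
Point 1 can be sketched as two small helpers. Note that "a speedup of about 20%" can be read either as a 1.2x factor or as 20% less time, so both quantities are shown. The function names are hypothetical and the inputs are whatever times you measured.

```cpp
#include <cassert>

// Speedup factor: how many times faster the new run is (2.0 means "2x").
double speedup_factor(double baseline_seconds, double new_seconds) {
    assert(baseline_seconds > 0.0 && new_seconds > 0.0);
    return baseline_seconds / new_seconds;
}

// Percentage of the baseline time that was saved (50.0 means half the time).
double time_saved_percent(double baseline_seconds, double new_seconds) {
    assert(baseline_seconds > 0.0 && new_seconds > 0.0);
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds;
}
```

For very short runs, two times that look almost identical at low print precision can still differ by a noticeable factor, which is why the ratio is more informative than the raw numbers.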

 

Hope this was helpful.

 

shengxd1
Join Date: 23 Feb 17
Posts: 15
Posted: Wed, 2017-11-22 19:03

Hi, Hasan,

Thank you very much for your help:)

