Forums - ceres-solver:eigen:blas:qsml - performance

ternence
Join Date: 17 Feb 16
Posts: 3
Posted: Fri, 2018-05-18 01:28

Hi guys,

I use ceres-solver to do a simple curve fit.

The code:---------------------------------->

struct CURVE_FITTING_COST1
{
    CURVE_FITTING_COST1(double x,double y):_x(x),_y(y){}
    template <typename T>
    bool operator()(const T* const abc,T* residual)const
    {
        residual[0]=_y-ceres::exp(abc[0]*_x*_x+abc[1]*_x+abc[2]);
        return true;
    }
    const double _x,_y;
};

double a=3,b=2,c=1;
double w=1;
std::srand(std::time(nullptr));
int random_variable;
double abc[3]={0,0,0};   // initial guess for (a,b,c)
vector<double> x_data,y_data;
for(int i=0;i<100000;i++)
{
        double x=i/100000.0;
        random_variable=std::rand();   // draw fresh noise for every sample
        x_data.push_back(x);
        // model value plus uniform noise in [0,1]
        y_data.push_back(exp(a*x*x+b*x+c)+(double)random_variable/(double)RAND_MAX);
}

...
ceres::Problem problem;
for(int i=0;i<100000;i++)
{
        problem.AddResidualBlock(
                new ceres::AutoDiffCostFunction<CURVE_FITTING_COST1,1,3>(
                        new CURVE_FITTING_COST1(x_data[i],y_data[i])
                ),
                nullptr,
                abc
        );
 }

------------------------------------------<
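
For reference, the model that the functor above fits can be written out explicitly. This is a sketch derived from the posted code, not something stated in the thread; the symbols r_i, J_i and N are introduced here only for illustration.

```latex
r_i(a,b,c) = y_i - \exp\left(a x_i^2 + b x_i + c\right),
\qquad i = 0,\dots,N-1,\quad N = 100000

J_i = \frac{\partial r_i}{\partial (a,b,c)}
    = -\exp\left(a x_i^2 + b x_i + c\right)
      \begin{bmatrix} x_i^2 & x_i & 1 \end{bmatrix}
```

With only three parameters, the stacked Jacobian is N x 3 and the normal-equations matrix J^T J is 3 x 3, so the dense linear algebra per iteration is small. That may limit how much an optimized BLAS can help on this particular problem.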

I use QSML as the BLAS backend for Eigen and run it on a Samsung Galaxy S8, but the time does not improve.

    Time (in seconds):
    Preprocessor                         0.028604
    
      Residual only evaluation           0.359672 (27)
      Jacobian & residual evaluation     0.643167 (21)
      Linear solver                      0.128839 (27)
    Minimizer                            1.183266
    
    Postprocessor                        0.004095
    Total                                1.215966

The QSML library is linked successfully and runs on the CPU's main cores; the CPU level I set is 81~100.

I also ran this test on an iPhone 6s with the same code, and the time is better:

    Time (in seconds):
    Preprocessor                         0.0304

      Residual evaluation                0.1898
      Jacobian evaluation                0.2669
      Linear solver                      0.2065
    Minimizer                            0.7556

    Postprocessor                        0.0014
    Total                                0.7875

Does anybody have any ideas?

Thank you.

jam513
Join Date: 5 Nov 16
Posts: 11
Posted: Thu, 2018-05-24 18:25

Hi ternence,

Just a quick note. QSML stands for Qualcomm Snapdragon Math Library. It is generally only optimized for running on Qualcomm Snapdragon processors. I'm fairly certain neither the Samsung Galaxy S8 nor the iPhone 6s uses a Qualcomm Snapdragon processor.

-J

ternence
Join Date: 17 Feb 16
Posts: 3
Posted: Sun, 2018-05-27 19:44

~$ adb shell cat /proc/cpuinfo
* daemon not running. starting it now on port 5037 *
* daemon started successfully *
Processor    : AArch64 Processor rev 4 (aarch64)
processor    : 0
BogoMIPS    : 38.40
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer    : 0x51
CPU architecture: 8
CPU variant    : 0xa
CPU part    : 0x801
CPU revision    : 4

processor    : 1
BogoMIPS    : 38.40
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer    : 0x51
CPU architecture: 8
CPU variant    : 0xa
CPU part    : 0x801
CPU revision    : 4

processor    : 2
BogoMIPS    : 38.40
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer    : 0x51
CPU architecture: 8
CPU variant    : 0xa
CPU part    : 0x801
CPU revision    : 4

processor    : 3
BogoMIPS    : 38.40
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer    : 0x51
CPU architecture: 8
CPU variant    : 0xa
CPU part    : 0x801
CPU revision    : 4

processor    : 4
BogoMIPS    : 38.40
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer    : 0x51
CPU architecture: 8
CPU variant    : 0xa
CPU part    : 0x800
CPU revision    : 1

processor    : 5
BogoMIPS    : 38.40
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer    : 0x51
CPU architecture: 8
CPU variant    : 0xa
CPU part    : 0x800
CPU revision    : 1

processor    : 6
BogoMIPS    : 38.40
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer    : 0x51
CPU architecture: 8
CPU variant    : 0xa
CPU part    : 0x800
CPU revision    : 1

processor    : 7
BogoMIPS    : 38.40
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer    : 0x51
CPU architecture: 8
CPU variant    : 0xa
CPU part    : 0x800
CPU revision    : 1

Hardware    : Qualcomm Technologies, Inc MSM8998
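
In the dump above, every core reports `CPU implementer : 0x51`, and 0x51 is the implementer code that Arm assigns to Qualcomm. A minimal sketch of checking this programmatically is shown here; `all_cores_qualcomm` is a hypothetical helper name, and it works on text that has already been read from /proc/cpuinfo.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Returns true if the text contains at least one "CPU implementer" line
// and every such line reports 0x51 (Arm's implementer code for Qualcomm).
// Hypothetical helper for illustration; not part of QSML, Eigen, or Ceres.
bool all_cores_qualcomm(const std::string& cpuinfo_text) {
    std::istringstream in(cpuinfo_text);
    std::string line;
    bool seen = false;
    while (std::getline(in, line)) {
        // Only look at lines that start with the "CPU implementer" key.
        if (line.rfind("CPU implementer", 0) != 0) continue;
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) return false;
        unsigned long code = 0;
        try {
            // Base 0 lets std::stoul accept the "0x" prefix.
            code = std::stoul(line.substr(colon + 1), nullptr, 0);
        } catch (const std::exception&) {
            return false;  // value was not a number
        }
        if (code != 0x51) return false;
        seen = true;
    }
    return seen;
}
```

On a device, the input string could be filled by reading /proc/cpuinfo through a `std::ifstream`. The check only says who designed the cores; it does not say whether a given library is tuned for them.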

I'm sure the S8 is an MSM8998, and I tested ceres-solver built for arm64-v8a:

ceres -no OMP-eigen-no OMP:                  time is   71.989774s
ceres -no OMP-eigen-OMP:                     time is   67.033344s
ceres -no OMP-eigen-BLAS-QSML-sequential:    time is   76.045295s
ceres -no OMP-eigen-BLAS-QSML-parallel:      time is  570.746197s
ceres -OMP-eigen-no OMP:                     time is   61.268379s
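
To make the comparison easier to read, the timings can be normalized against the first configuration (no OpenMP anywhere). The sketch below does only that arithmetic; `relative_times` and `kPostedSeconds` are hypothetical names, and the numbers are copied from the list above in the same order.

```cpp
#include <vector>

// Divides every timing by the baseline timing.
// Hypothetical helper for illustration only.
std::vector<double> relative_times(const std::vector<double>& seconds,
                                   double baseline) {
    std::vector<double> out;
    out.reserve(seconds.size());
    for (double s : seconds) out.push_back(s / baseline);
    return out;
}

// Timings (seconds) from the post, in the order listed:
// no OMP, Eigen OMP, QSML sequential, QSML parallel, Ceres OMP.
const std::vector<double> kPostedSeconds = {
    71.989774, 67.033344, 76.045295, 570.746197, 61.268379};
```

Calling `relative_times(kPostedSeconds, kPostedSeconds[0])` puts the QSML parallel run at roughly 7.9 times the baseline, while the other configurations stay within about 15% of it.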

Why is the time so bad in the BLAS-QSML parallel mode?

And why is BLAS-QSML not any faster on arm64?

Thank you.

jam513
Join Date: 5 Nov 16
Posts: 11
Posted: Tue, 2018-05-29 09:53

A small correction.

You are correct that the Galaxy S8 has a Snapdragon processor. There are two versions of the Galaxy S8: one uses an Exynos 8895, and the other does in fact use a Qualcomm processor, the Snapdragon 835.

rakihasa
Join Date: 21 Sep 17
Posts: 27
Posted: Mon, 2018-10-29 18:32

Hi ternence, 

Apologies for the late reply.

You are correct. The device that you are using is indeed an MSM8998.

A new version of the library (now named QML, version 1.0.0) has been released. Please give it a try.

Based on your timings, it seems most of the input sizes are not big, since using a parallel version (Eigen or ceres-OMP) doesn't provide much benefit. The previous version (QSML 0.15.5) had a bug in selecting between the sequential and parallel implementations based on the input size. Because of the bug, it tried to use the parallel implementation even for tiny problem sizes, where it is not optimal.

The latest release not only fixes this bug but also provides better performance for tiny input sizes for various routines.
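
The general idea of choosing an implementation from the input size can be sketched as below. This is not QML's actual code: `Path`, `choose_path` and `kParallelThreshold` are hypothetical names, and the cutoff value is an arbitrary placeholder that a real library would tune per routine and per device.

```cpp
#include <cstddef>

// The two code paths a routine could take.
enum class Path { Sequential, Parallel };

// Hypothetical cutoff (number of elements); purely illustrative.
constexpr std::size_t kParallelThreshold = 1u << 16;

// Picks the parallel path only when the input is large enough for the
// work per thread to outweigh the cost of starting and syncing threads.
Path choose_path(std::size_t n,
                 std::size_t threshold = kParallelThreshold) {
    return n >= threshold ? Path::Parallel : Path::Sequential;
}
```

Under this sketch, a tiny input such as `choose_path(3)` takes the sequential path, and only inputs at or above the cutoff are sent to the parallel one.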

Please let me know if you see any other issues.


Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries (“Qualcomm”). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.