UCam: Tackling Compute-Intensive Photo Tasks with Parallel Computing

“Cooperating with Qualcomm on MARE gives us one important competitive edge: the ability to implement high-performance algorithms much faster. Our UCam photo app features high-performance parallel filtering we could not deliver without MARE.”

- Pengcheng Zou, SVP Strategic Product Division, Thundersoft

  • Simple approach to threading and parallel computing in one-fifth the code of the Pthread prototype
  • 60% faster image processing than the single-threaded version of the app
  • MARE library handles Android thread pool management and thread synchronization so the programmer does not have to

How many camera apps does it take to make a user happy? A lot, especially in Asia.

Thundersoft realized that users in China and Japan had to run one app for filters and sharing, another app for reading bar codes, another for beautification, and several more for functions like collage, picture-in-picture, animated GIF and face detection. They created UCam, a full-featured camera platform that lets developers plug camera-related features into a single app. UCam lets developers roll out new functionality without needing to build their app from scratch, and it gives users new features without the need to install and learn entirely new apps.

The free UCam app with over 20 different camera functions caught on very well in Google Play, netting thousands of five-star reviews.

Parallel computing on multicore CPUs
As UCam continued to grow in popularity and functionality, a few market realities came to bear on Thundersoft’s engineering effort:
  • Mobile screen size and camera resolution continued to increase.
  • Thundersoft began introducing more compute-intensive effects like morphing, 3D view and manga filtering.
  • The UCam UI showed only static previews of the image with each available effect. Engineers wanted to display real-time previews of the live, 20 fps image coming from the camera sensor.

In short, the evolving product demanded more power. Knowing that multicore mobile devices are more plentiful than servers or PCs, Thundersoft’s engineers realized it was time to modify UCam to use all available CPU cores. They decided to implement parallel computing for two particularly intensive functions: parallel filtering and animated GIFs.

They turned first to POSIX threads, or Pthreads, to run tasks in parallel across multiple CPU cores. They wrote a prototype with their own queue and thread pool to deal with the parallel filtering and the continuous image feed coming from the sensor.
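The source does not show the Pthread prototype, so the following is only a minimal sketch of what a hand-rolled queue and thread pool typically involves. ThreadPool, submit and countWithPool are hypothetical names chosen for illustration; only the pthread_* calls are real POSIX API. Even in this small form, the programmer owns the mutex, the condition variable, the shutdown flag and the joins.

```cpp
#include <pthread.h>

#include <atomic>
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical hand-rolled pool: fixed workers draining one shared task queue.
class ThreadPool {
public:
    explicit ThreadPool(int workerCount) : workers_(workerCount) {
        assert(workerCount > 0);
        pthread_mutex_init(&lock_, nullptr);
        pthread_cond_init(&hasWork_, nullptr);
        for (pthread_t& w : workers_) {
            pthread_create(&w, nullptr, &ThreadPool::workerMain, this);
        }
    }

    // Lets the workers finish the queued tasks, then joins them.
    ~ThreadPool() {
        pthread_mutex_lock(&lock_);
        stopping_ = true;
        pthread_cond_broadcast(&hasWork_);
        pthread_mutex_unlock(&lock_);
        for (pthread_t& w : workers_) pthread_join(w, nullptr);
        pthread_cond_destroy(&hasWork_);
        pthread_mutex_destroy(&lock_);
    }

    void submit(std::function<void()> task) {
        pthread_mutex_lock(&lock_);
        tasks_.push(std::move(task));
        pthread_cond_signal(&hasWork_);
        pthread_mutex_unlock(&lock_);
    }

private:
    static void* workerMain(void* self) {
        static_cast<ThreadPool*>(self)->run();
        return nullptr;
    }

    void run() {
        for (;;) {
            pthread_mutex_lock(&lock_);
            while (tasks_.empty() && !stopping_) {
                pthread_cond_wait(&hasWork_, &lock_);
            }
            if (tasks_.empty()) {  // stopping and nothing left to do
                pthread_mutex_unlock(&lock_);
                return;
            }
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop();
            pthread_mutex_unlock(&lock_);
            task();  // run the task outside the lock
        }
    }

    pthread_mutex_t lock_;
    pthread_cond_t hasWork_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<pthread_t> workers_;
};

// Submits `count` increment tasks to the pool and returns the final total.
int countWithPool(int workerCount, int count) {
    std::atomic<int> total{0};
    {
        ThreadPool pool(workerCount);
        for (int i = 0; i < count; ++i) {
            pool.submit([&total] { total.fetch_add(1); });
        }
    }  // the destructor waits for all queued tasks here
    return total.load();
}
```

Calling countWithPool(4, 1000) pushes 1000 small tasks through four workers. A real prototype would also need error handling on every pthread_* call and a policy for the continuous frame feed, which this sketch leaves out.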

“We tried Pthread for parallel computing,” says Pengcheng Zou, Thundersoft’s SVP of Strategic Products, “because many of our programmers had learned it in school. But it’s difficult for common programmers to master thread management (synchronization, dialog, etc.) in Pthread, so we looked for another way.”

Simplifying parallel computing with MARE – in 2 days
Through its long-time relationship with Qualcomm Research, Thundersoft learned of Qualcomm’s Multicore Asynchronous Runtime Environment (MARE). Thundersoft’s engineers found MARE appealing because it:
  • lives as a C++ library that they could use with the latest standard Android NDK.
  • includes a runtime and task scheduler, so they did not need to build their own.
  • allowed them to run tasks in parallel on multiple cores at the app level, without deep system access.

The programming model of MARE, designed to reduce the effort of implementing parallelism, also attracted the engineers. They modified UCam to use the basic thread management API, which lets MARE decide how best to parallelize and synchronize tasks. That means that the app does not need to know how many threads are in use, how many cores are available or which tasks are already running on which cores. With the dynamic library in MARE, the only API loaded into memory is the one required by the app, so there is no appreciable increase in footprint.

For the first parallel-computing version of UCam, Thundersoft engineers needed to modify only the code that launches each task as follows:

// Create a MARE task group named "doEffect".
mare::group_ptr g = mare::create_group("doEffect");
for (int i = 0; i < effectNumber; i++) {
    // Effect selected for iteration i.
    int effectType = effectPtr[i];

The resulting version uses less than one-fifth the code of the Pthread prototype to handle parallel filtering. Implementing MARE required no code changes in UCam’s algorithm and took one engineer about two days. Subsequent development on more complex effects like manga filtering took about one week.
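The listing above stops before the launch and wait steps, so the complete MARE code is not reproduced here. As a rough illustration of the same pattern (start one task per effect, then wait for the whole batch), here is a sketch in standard C++ that uses std::async as a stand-in. It is not the MARE API, and applyEffect and renderPreviews are hypothetical names. Unlike MARE, std::launch::async typically starts a thread per task rather than scheduling onto a managed pool.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical effect: offsets every pixel value by effectType.
std::vector<int> applyEffect(const std::vector<int>& pixels, int effectType) {
    std::vector<int> out(pixels);
    for (int& p : out) p += effectType;
    return out;
}

// Starts one asynchronous task per effect, then waits for all of them.
// Results come back in the same order as effectTypes.
std::vector<std::vector<int>> renderPreviews(const std::vector<int>& pixels,
                                             const std::vector<int>& effectTypes) {
    std::vector<std::future<std::vector<int>>> pending;
    pending.reserve(effectTypes.size());
    for (int effectType : effectTypes) {
        // Each task only reads the shared input, so no locking is needed.
        pending.push_back(std::async(std::launch::async, [&pixels, effectType] {
            return applyEffect(pixels, effectType);
        }));
    }

    std::vector<std::vector<int>> previews;
    previews.reserve(pending.size());
    for (auto& f : pending) {
        previews.push_back(f.get());  // blocks until that task has finished
    }
    return previews;
}
```

The calling code never counts cores or names threads; it only states which tasks belong together and where it needs their results. That is the division of labor the article attributes to MARE.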

60 percent performance boost
Thundersoft tested on a range of devices and processors from several manufacturers, including Snapdragon. “The more cores the app has available, the better the performance and the less time to process a frame from the camera,” Zou explains. “However, our biggest need for improved performance was among users on lower-end phones, so we tested most heavily there. With parallel computing from MARE, UCam completed its image processing tasks 60 percent faster than in the single-threaded version of the app. It performed 10 percent faster than our Pthread prototype because MARE manages the thread pool so well.”

UCam does not need to determine the availability of multiple cores on the current device – MARE handles that – so Thundersoft can still deliver UCam as a single .apk in Google Play, regardless of the device or processor. Since integrating parallel filters, UCam has attracted not only many end users but also ODM/OEM clients.

“We have always wanted to incorporate more of our own intellectual property and build our own products,” says Zou. “Cooperating with Qualcomm Research has given us the chance to see how our technology was being used. With the technology from Qualcomm Research we have been able to reach more customers and gain more confidence in the maturity of our products.”

Next steps
Zou emphasizes three types of tasks that are ripe for parallel computing:

  • Task parallelism – Many isolated or loosely coupled tasks that can run in parallel (e.g., browsing)
  • Loosely coupled computing – One computation task that can be split into several loosely coupled computational units run in parallel, usually with minimal conversion effort (parallel filtering in UCam)
  • Tightly coupled computing – A complex algorithm in a task requiring greater conversion effort (manga filter, real-time edge detection)
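To make the second category concrete, here is a minimal sketch of one filter pass split into independent row strips, each handled by its own thread. It uses std::thread rather than MARE, and the invert filter, invertRows and invertParallel are assumptions for illustration, not UCam code. The strips share no pixels, which is why the conversion effort is small.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Inverts rows [rowBegin, rowEnd) of a grayscale image stored row by row.
void invertRows(std::vector<std::uint8_t>& image, std::size_t width,
                std::size_t rowBegin, std::size_t rowEnd) {
    for (std::size_t i = rowBegin * width; i < rowEnd * width; ++i) {
        image[i] = static_cast<std::uint8_t>(255 - image[i]);
    }
}

// Splits the image into at most stripCount horizontal strips and inverts
// each strip on its own thread. Strips never overlap, so no locks are needed.
void invertParallel(std::vector<std::uint8_t>& image, std::size_t width,
                    std::size_t height, unsigned stripCount) {
    assert(image.size() == width * height);
    stripCount = std::max(1u, stripCount);
    const std::size_t rowsPerStrip = (height + stripCount - 1) / stripCount;

    std::vector<std::thread> workers;
    for (std::size_t row = 0; row < height; row += rowsPerStrip) {
        const std::size_t end = std::min(height, row + rowsPerStrip);
        workers.emplace_back(invertRows, std::ref(image), width, row, end);
    }
    for (std::thread& w : workers) w.join();  // wait for every strip
}
```

A tightly coupled algorithm, by contrast, has strips that depend on each other's intermediate results, so a simple split like this is not enough and the conversion effort grows.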

MARE is designed for parallelizing tasks in performance-oriented application categories like games, computer vision and multimedia processing, and in natural interfaces including gestures. Download the MARE SDK and see how easily you can program your own Android apps to take advantage of all of the cores in your users’ mobile devices.