In my previous blog post, Mobile AI Through Machine Learning Algorithms, I mentioned that a neural network is a popular artificial intelligence (AI) mechanism, and that advances in processing power mean neural networks can now be run on mobile platforms.

But with the popularity of neural networks and the plethora of libraries that do much of the work for you, how does a neural network actually work? Let’s find out through a simplified overview of the classic “hello world” neural network problem: processing images containing handwritten numbers and identifying those numbers solely from the pixel data.

## Hello Neural Network World

In this classic example, there are 10 images, each containing a handwritten glyph of a number ranging from zero through nine. Each image is 32x32 pixels in size, for a total of 1024 pixels. The pixels are grayscale values with 0 representing black, 255 representing white, and values in between representing various shades of gray.

## The Starting Point

A neural network loosely mimics neurons in the human brain, with interactions that “fire” as we think. In the digital world a neuron is just a node containing some value that represents an activation, which you can think of as the degree to which something is true. Each node has one or more connections to other nodes.

Neurons are usually organized into “layers”, with the first layer looking at the input image, the second layer looking at the output of the first layer, and so on up until the last layer, which produces the output of the neural network. Now here’s the mathematical trick that enables hardware acceleration of neural networks: the activations of a layer are really just a list of numbers, or a “vector”. When all neurons of one layer are connected to all neurons of another layer, the connections can be represented mathematically using a “matrix”. This is why hardware-accelerated vector and matrix processing units, such as the Qualcomm® Hexagon™ DSP, are essential for neural network performance.
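To make the vector and matrix idea concrete, here is a minimal sketch in plain Python. The function name `mat_vec` and the example numbers are made up for illustration; a real framework would hand this multiplication to optimized hardware rather than loop in Python.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of numbers)."""
    return [sum(w * a for w, a in zip(row, vector)) for row in matrix]

# Three activations in one layer, fully connected to two neurons in the next.
activations = [0.0, 0.5, 1.0]
weights = [
    [0.5, 0.25, 0.75],  # connections into next-layer neuron 0
    [1.0, 0.0, -1.0],   # connections into next-layer neuron 1
]

weighted_sums = mat_vec(weights, activations)
print(weighted_sums)  # [0.875, -1.0]
```

Each row of the matrix holds the connections into one neuron of the next layer, so one matrix-vector product computes the weighted sums for the whole layer at once.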

A neural network starts with two layers. The first layer is the input, which in this example consists of 1024 nodes, each containing a grayscale pixel value from the image to process. These values are represented as a vector where each is normalized from 0.0 to 1.0. The next layer is the output, which here has ten nodes that represent the digits zero through nine, each containing the probability between 0.0 and 1.0 that the given image represents the respective digit.
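As a rough sketch of how an image becomes the input layer, the snippet below flattens a 32x32 grid of 0 to 255 grayscale values into a 1024-element vector of values between 0.0 and 1.0. The function name `image_to_input_vector` and the sample image are hypothetical, chosen only for this illustration.

```python
def image_to_input_vector(image):
    """Flatten a 2D grid of 0-255 grayscale values into 0.0-1.0 floats."""
    return [pixel / 255.0 for row in image for pixel in row]

# A made-up all-black 32x32 image with one white pixel in the top-left corner.
image = [[0] * 32 for _ in range(32)]
image[0][0] = 255

input_vector = image_to_input_vector(image)
print(len(input_vector))  # 1024
print(input_vector[0])    # 1.0 (white)
print(input_vector[1])    # 0.0 (black)
```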

## Dividing and Conquering

In order to derive the output from the input, we need to break the problem down into subproblems. In this example, we can use the fact that glyphs can be broken down into shapes such as lines, circles, and other sub glyphs.

We can look for these shapes amongst the pixels in a given image and use the probabilities that they were found to determine the output (i.e. the probability of an image representing a number). But for even greater accuracy, we can break the problem down even further, by first checking the image for the sub strokes that make up these features, such as arcs, smaller line segments, etc.

In a neural network, we represent the probabilities from these subproblems using “hidden layers”.

In this example, the nodes of the first hidden layer will contain normalized values that represent the probabilities that a given sub stroke was found amongst the pixel activations in the input layer. The nodes in the second hidden layer will contain the probabilities that each shape feature was found amongst the sub strokes activated in the first hidden layer.

You can create any number of hidden layers of any size, based on how creative you want to get in breaking glyphs of digits down into sub elements. However, don’t go overboard because each additional node can increase the processing required.
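To get a feel for how layer sizes drive the amount of work, this sketch counts the weights and biases of a fully connected network. The two hidden layers of 16 nodes each are an assumption made for illustration; the example above does not fix their sizes.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    # One weight per connection between consecutive layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias per node in every layer except the input layer.
    biases = sum(layer_sizes[1:])
    return weights + biases

# 1024 input pixels, two hypothetical hidden layers of 16 nodes, 10 outputs.
layer_sizes = [1024, 16, 16, 10]
total = count_parameters(layer_sizes)
print(total)  # 16842
```

Most of the total comes from the connections between the 1024 input nodes and the first hidden layer, which is why widening that layer is the most expensive change.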

## Weighing Your Options

After establishing the layers, weights and biases are then assigned to the connections. These are the “knobs and dials” that will be tweaked when training the neural network.

A weight influences how much a given node activates a node in the next layer. For example, connections between certain pixel nodes and feature nodes are assigned higher weights to indicate that those pixels are more likely to be found in a pixel pattern for a given feature. A bias can also be added, which is a threshold that can help weed out activations that might give false positives.
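Here is a small sketch of how weights and a bias combine for a single node. It assumes a sigmoid function to squash the result into the 0.0 to 1.0 range, which is a common choice but not something this example prescribes; the weights, bias, and inputs are invented for illustration.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0.0 to 1.0."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_activation(inputs, weights, bias):
    """Weighted sum of the inputs, shifted by the bias, then squashed."""
    weighted_sum = sum(w * a for w, a in zip(weights, inputs))
    return sigmoid(weighted_sum + bias)

# Hypothetical feature node: two pixels matter, a third counts against it.
weights = [2.0, 2.0, -1.0]
bias = -3.0  # acts as a threshold: one lit pixel alone is not enough

strong = neuron_activation([1.0, 1.0, 0.0], weights, bias)  # about 0.73
weak = neuron_activation([1.0, 0.0, 0.0], weights, bias)    # about 0.27
print(strong, weak)
```

With both relevant pixels lit, the weighted sum clears the threshold set by the bias and the node activates strongly; with only one lit, it stays below 0.5.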

When first starting out, small random values are assigned to the weights and biases of all layers of the network, and at this stage the network cannot identify numbers correctly. In fact, it produces random results that are going to be wrong nine times out of ten. Then training is used to iteratively improve the weights and biases until the digits predicted by the network are as close as possible to the correct digits.
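A minimal sketch of this starting point is shown below. The range of plus or minus 0.1, the uniform distribution, the hidden layer sizes, and the function names are all assumptions for illustration; real frameworks offer several initialization schemes.

```python
import random

def init_layer(n_inputs, n_outputs, rng, scale=0.1):
    """Create small random weights (one row per output node) and biases."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    biases = [rng.uniform(-scale, scale) for _ in range(n_outputs)]
    return weights, biases

def init_network(layer_sizes, seed=0):
    """Build one (weights, biases) pair per pair of consecutive layers."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [init_layer(a, b, rng)
            for a, b in zip(layer_sizes, layer_sizes[1:])]

# 1024 inputs, two hypothetical hidden layers of 16 nodes, 10 outputs.
network = init_network([1024, 16, 16, 10])
print(len(network))  # 3 sets of connections between 4 layers
```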

## Training

Before the neural network can be used for inference, it must be trained so that it can successfully identify numbers from arbitrary 32x32 pixel images. During inference, the pixel activations in the first layer are multiplied by weights, the products are summed, and biases are factored in; the result is typically passed through an activation function that squashes it into a fixed range, giving the activations for the first hidden layer. This process is then repeated between subsequent layers. For this to produce correct answers, the weights and biases must first be tuned by comparing the neural network’s output to known good output. The difference between the output of the network and the known good output is called the “cost”.
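The sketch below runs an input through a tiny made-up network layer by layer and then measures the cost against a known good output. It assumes a sigmoid activation and a sum of squared differences as the cost, both common choices rather than requirements; the network values are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(network, inputs):
    """Compute the output activations by passing inputs through each layer."""
    activations = inputs
    for weights, biases in network:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

def cost(output, target):
    """Sum of squared differences between actual and desired output."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 2 outputs.
network = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
output = forward(network, [1.0, 0.0])
target = [1.0, 0.0]  # the known good output for this input
print(output, cost(output, target))
```

A cost of zero would mean the output matches the known good output exactly; untrained weights like these give a clearly non-zero cost.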

Training starts by determining the initial cost, which is likely to be high since we started with random values for weights and biases. Cost is determined using a “cost function”, which can be viewed as taking all of the weights and biases of the neural network as one giant vector and returning a single number that measures how wrong the network is on the training data. Training then works out which changes to those weights and biases will give the most rapid decrease in cost, a technique known as gradient descent.
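To illustrate the idea of following the most rapid decrease in cost, here is a sketch of gradient descent on a toy cost with just two parameters. The cost function, the learning rate, and the use of a numerical estimate of the gradient are assumptions made to keep the example short; a real network has thousands of parameters and computes the gradient analytically.

```python
def cost(params):
    # Hypothetical toy cost with its minimum at params == [3.0, -1.0].
    return (params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2

def numerical_gradient(f, params, eps=1e-6):
    """Estimate how fast the cost changes as each parameter is nudged."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def gradient_descent(f, params, learning_rate=0.1, steps=100):
    """Repeatedly step each parameter against its gradient."""
    for _ in range(steps):
        grad = numerical_gradient(f, params)
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params

final = gradient_descent(cost, [0.0, 0.0])
print(final)  # close to [3.0, -1.0]
```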

The changes to the weights and biases are computed using “backpropagation”, starting with the output layer and working backwards towards the input layer. The desired activations for each node in the output layer are analyzed, and weights and biases of the connections with the previous layer are adjusted. The principle is simple: connections that contributed positively to the result are strengthened, while connections that contributed to a wrong guess are weakened. This process works backwards recursively through each previous layer of the neural network, in an effort to reduce the errors for each of the 10 output nodes.
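A compact sketch of one backpropagation step follows. It assumes sigmoid activations and a cost of half the sum of squared differences, and the tiny network, learning rate, and function names are invented for illustration; it is one textbook formulation, not the only way to do it.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(network, inputs):
    """Return the activations of every layer, input layer included."""
    layers = [inputs]
    for weights, biases in network:
        prev = layers[-1]
        layers.append([sigmoid(sum(w * a for w, a in zip(row, prev)) + b)
                       for row, b in zip(weights, biases)])
    return layers

def cost(output, target):
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))

def backprop_step(network, inputs, target, learning_rate=0.5):
    """Adjust weights and biases in place for one training example."""
    layers = forward(network, inputs)
    # Error signal at the output layer for the cost defined above.
    deltas = [(o - t) * o * (1 - o) for o, t in zip(layers[-1], target)]
    for index in range(len(network) - 1, -1, -1):
        weights, biases = network[index]
        prev = layers[index]
        prev_deltas = []
        if index > 0:
            # Pass the error back through the old weights before updating.
            prev_deltas = [
                sum(weights[j][i] * deltas[j] for j in range(len(deltas)))
                * prev[i] * (1 - prev[i])
                for i in range(len(prev))
            ]
        for j, delta in enumerate(deltas):
            for i, a in enumerate(prev):
                weights[j][i] -= learning_rate * delta * a
            biases[j] -= learning_rate * delta
        deltas = prev_deltas

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 2 outputs.
network = [
    ([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]),
    ([[0.2, -0.1], [-0.3, 0.1]], [0.0, 0.0]),
]
inputs, target = [1.0, 0.0], [1.0, 0.0]
before = cost(forward(network, inputs)[-1], target)
for _ in range(200):
    backprop_step(network, inputs, target)
after = cost(forward(network, inputs)[-1], target)
print(before, after)  # the cost after training is lower than before
```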

This is repeated using all training data, or sometimes batches of training data, averaging out the costs amongst hundreds or thousands of training images. The goal is to minimize cost, which translates to having the best accuracy for identifying all 10 digits from arbitrary images during inference. During training the cost goes down while accuracy goes up.
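The batching and averaging can be sketched as follows. The per-image costs and the batch size of three are made-up numbers for illustration; real training sets and batches are far larger.

```python
def batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

def average_cost(costs):
    return sum(costs) / len(costs)

# Hypothetical costs measured for seven training images.
per_image_costs = [1.0, 0.5, 0.75, 0.25, 0.5, 0.75, 0.125]

batch_averages = [average_cost(b) for b in batches(per_image_costs, 3)]
print(batch_averages)  # [0.75, 0.5, 0.125]
```

Averaging over a batch smooths out the influence of any single image, so each adjustment to the weights and biases reflects several examples at once.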

Training ends either when the data samples run out, or when the accuracy has reached a level beyond which showing the network more data would bring little benefit; in other words, when there is not much more that the network can learn.
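One simple way to decide that accuracy has levelled off is sketched below. The `patience` and `min_improvement` parameters, their default values, and the accuracy history are assumptions for illustration, not a rule taken from any particular framework.

```python
def should_stop(accuracy_history, patience=3, min_improvement=0.001):
    """Stop when the last `patience` rounds brought no meaningful gain."""
    if len(accuracy_history) <= patience:
        return False  # not enough history to judge yet
    best_before = max(accuracy_history[:-patience])
    best_recent = max(accuracy_history[-patience:])
    return best_recent - best_before < min_improvement

# Hypothetical accuracy measured after each round of training.
history = [0.10, 0.55, 0.80, 0.90, 0.9002, 0.9003, 0.9001]

print(should_stop(history[:4]))  # False: accuracy is still climbing
print(should_stop(history))      # True: the last three rounds barely moved
```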

## Taking it to the Edge

Of course, there are a lot more implementation details and math to understand, but the example above should give you an idea as to what goes on inside of a neural network.

The good news is that there are a lot of libraries and frameworks that do much of the heavy lifting for you. TensorFlow, PyTorch and Caffe2 are great for building and training neural networks, while the ONNX exchange format can be used to save and load neural networks between frameworks. You can use the neural networks generated from these frameworks for inference on mobile platforms such as the Qualcomm® Snapdragon™ 845 mobile platform using the Qualcomm® Neural Processing SDK. This SDK also provides hardware acceleration on the Qualcomm® Kryo™ 385 CPU, Hexagon 685 DSP, and Qualcomm® Adreno™ 630 GPU found in the Snapdragon 845 mobile platform.

Qualcomm Hexagon, Qualcomm Snapdragon, Qualcomm Neural Processing SDK, Qualcomm Kryo and Qualcomm Adreno are products of Qualcomm Technologies, Inc. and/or its subsidiaries.