Can Machines “See” Using Convolutional Neural Networks?

http://www.altera.com/technology/system-design/articles/2015/convolutional-neural-networks.html

Convolutional What?

“The simplest way to look at CNNs,” explains i-Abra CTO Greg Compton, “is as a way to build a very powerful many-to-one filter.” For instance, you might design a CNN that would take in a single frame of high-definition video and put out a single bit expressing the presence or absence of a pedestrian in the frame.

This object-recognition job has been done in the past in a variety of ways. The most common and intuitive approach has been rule-based: search the image for pre-defined shapes, feed the locations of the shapes into a classifier, and give the classifier rules about what combinations of shapes might represent a human. In abstract tests such systems can perform adequately. But on real-world images with changes in position, orientation, lighting, and noise level, they are often inadequate.

At the other extreme lie neural networks. In the 1940s researchers began speculating on how the specialized neuron cells in animals might work. By the 1960s this had led to electronic, and eventually computer, models of neurons, and to their application in small networks to solve certain specific sorts of problems. These problems were often characterized by very low signal-to-noise levels, wide variations in the appearance of a sought-for pattern, or unclear definitions of the object to be detected.

By the 1990s researchers had established that neural networks—especially deep-learning neural networks—can be very successful with image recognition and object classification. But deep-learning networks comprise successive layers of artificial neurons, each neuron taking in a weighted sum of the outputs of all the neurons in the preceding layer. The results can be very good—but if the input is a high-definition, extended-dynamic-range camera pumping out 120 frames per second, the computing load is horrendous.

To solve the computing problem, CNNs combine these two ideas from opposite ends of the numerical-processing spectrum: they employ small convolution kernels to search across the input image for specific visual features, vastly reducing the connectivity of the network at the end where the data set is largest. And they employ a neural-network back-end to interpret the patterns of features that emerge from the convolutional layers. One more vital point: CNNs apply the key neural-network technique of back-propagated learning to train not only the neuron connections but also the convolution kernels. The result (Figure 1) is a network that mixes convolutional layers and subsampling on the front end with neural layers on the back end.

Figure 1. A common Convolutional Neural Network design comprises successive layers of convolution feature maps and subsampling functions, followed by layers of conventional neural network tissue.

What Goes on Inside

From here we should look more closely at the inside of the CNN. Everything starts with a two-dimensional frame of data arriving at the first layer of the network. Using the familiar technique of a small convolution kernel—say, 4 by 4 pixels—the first convolutional layer convolves the kernel with the 4-by-4 cell cornered at each pixel in the image, and replaces that pixel’s value with the result of the convolution. The result is, in effect, a slightly smaller frame in which the value of each pixel expresses how closely the cell cornered at that pixel resembles the feature encoded in the convolution kernel. That feature might be a line segment, an angle, a T, or something else; we will discuss where it comes from later.
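
To make that concrete, here is a minimal sketch of that first convolutional pass in Python with NumPy. The image size, the 4-by-4 vertical-edge kernel, and the variable names are all hypothetical, and real implementations add details (padding, strides, bias terms) that are omitted here.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel across a 2D image and return the feature map.

    Each output pixel is the element-wise product of the kernel with the
    cell of the image cornered at that pixel, summed (no padding, stride 1).
    """
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Hypothetical example: a tiny stand-in for a camera frame, and a 4x4 kernel
# that responds most strongly to a vertical bright-to-dark edge.
image = np.random.rand(64, 64)
kernel = np.tile([1.0, 1.0, -1.0, -1.0], (4, 1))
feature_map = conv2d(image, kernel)   # slightly smaller than the input
```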

Now each convolutional layer in the CNN will typically repeat this process a number of times, each time with a different kernel. Each pass will produce a different output, which CNN folks call a feature map. So the first layer in our CNN might contain eight feature maps, each one detailing the probable locations of its particular feature in the original image. If we are doing printed-character recognition, for instance, we might end up with one feature map each for vertical, horizontal, and diagonal line segments, and a few feature maps for different kinds of joins. Each of these maps will have nearly the same number of pixels as the original image. But the amount of processing needed to produce all of them is a tiny fraction of what a neural layer would have required, with every pixel in the image connected to every neuron.

A final, less-than-obvious function of the convolution layer is de-linearization. In many CNN designs, each layer applies a transcendental function—hyperbolic tangent is popular—that limits the range of the pixel values and ensures that the outputs are not simply a linear combination of the inputs. This is important later on, for training purposes.
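
Putting the last two steps together, a whole first convolutional layer is just the conv2d sketch above applied once per kernel, with the hyperbolic tangent squashing each result. The eight random kernels here are placeholders; in a real CNN their coefficients come from training, as discussed later.

```python
def conv_layer(image, kernels):
    """One convolutional layer: one feature map per kernel, then tanh.

    tanh limits every output pixel to (-1, 1) and keeps the layer from
    being a purely linear combination of its inputs.
    """
    return [np.tanh(conv2d(image, k)) for k in kernels]

# Eight hypothetical 4x4 kernels, so the first layer yields eight feature maps.
kernels = [np.random.randn(4, 4) for _ in range(8)]
feature_maps = conv_layer(image, kernels)
```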

A CNN often contains more than one convolutional layer. If it does, the deeper layers will also have multiple feature maps. But these feature maps will come from 3D, rather than 2D, convolutions. Each pixel in the feature map will be a weighted sum of a small 3D kernel times the pixels in the corresponding cells from each of the feature maps in the preceding layer (Figure 2). Intuitively—here is where the applied mathematicians in the audience will cringe—the feature maps in the subsequent layers are maps of the locations of combinations of features identified in the feature maps of preceding layers: say, a vertical line, a diagonal, and a join all in close proximity, in our character-recognition example.

Figure 2. Internal convolution layers convolve a 3D kernel with a 2D cell from each feature map in the previous layer.
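
A rough sketch of what Figure 2 describes, reusing conv2d from above: each kernel in a deeper layer has one 2D slice per incoming feature map, and each output pixel sums the contributions from all of those slices. The layer sizes are again hypothetical, and bias terms are omitted.

```python
def conv3d_layer(feature_maps, kernels3d):
    """Deeper convolutional layer: each 3D kernel spans every incoming map.

    Each kernel has shape (number of incoming maps, kh, kw); its output map
    is the sum of the 2D convolutions of every slice with its matching
    feature map, squashed through tanh.
    """
    outputs = []
    for kernel in kernels3d:
        acc = sum(conv2d(fmap, kslice)
                  for fmap, kslice in zip(feature_maps, kernel))
        outputs.append(np.tanh(acc))
    return outputs

# A hypothetical second layer: twelve kernels, each 8 maps deep and 4x4 wide.
kernels3d = [np.random.randn(8, 4, 4) for _ in range(12)]
deeper_maps = conv3d_layer(feature_maps, kernels3d)
```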

In principle, we could build up really complicated feature-identification networks just by stacking lots of convolutional layers one on top of the other. But in most applications that would be vast overkill. When you want to know the location of a combination of nearby features, you do not need to specify the location down to the high-definition pixel level. You can simply say that the particular combination of line segments, angles, and joins for this feature map was approximately there on the original image. In other words, at the end of each convolutional layer, you can subsample each of the feature maps by keeping only the maximum value of all the pixels in a small cell. That means that the amount of data that has to be convolved drops sharply at each layer, but without, experience has shown, reducing the accuracy of the results.
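
A minimal sketch of that subsampling step, assuming non-overlapping 2-by-2 cells and keeping only each cell’s maximum:

```python
def max_pool(feature_map, cell=2):
    """Subsample by keeping only the maximum value in each small cell."""
    h, w = feature_map.shape
    h, w = h - h % cell, w - w % cell                  # drop any ragged edge
    blocks = feature_map[:h, :w].reshape(h // cell, cell, w // cell, cell)
    return blocks.max(axis=(1, 3))                     # one value per cell

# Each map shrinks by half in each direction; the features it marks remain.
pooled_maps = [max_pool(fm) for fm in deeper_maps]
```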

Classification

After two or three of these convolution-plus-subsampling layer-pairs, we should have a large set of small feature maps. Each of these maps represents the possible presence and locations of instances of a potentially quite complex pattern in the original image: a string of characters or a pedestrian-shaped object. But we need more than just a map—we want to extract meaning. Does this hand-written letter mention George Washington, first US president? Is there a pedestrian staggering across the road ahead? We need yes or no.

This is not recognition; it is classification. To perform classification, CNNs apply fully-connected neural-network layers behind the convolutional layers. In the first neural layer, each neuron takes in the value of every pixel from every feature map in the final convolution/subsampling layer, multiplies each value by a pre-determined weight, and de-linearizes the sum. In effect, the output of each neuron now represents a judgment about the entire original image. Did someone sign this letter G. Washington? Is there, within the boundaries of the road, an object that might be a person?
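
In code, such a layer reduces to flattening every pixel of every incoming feature map into one long vector, multiplying by a weight matrix, and squashing the sums through tanh. The layer sizes and the random weights below are placeholders for trained values.

```python
def fully_connected(feature_maps, weights):
    """Fully-connected layer: every pixel of every map feeds every neuron.

    `weights` has one row per neuron and one column per input pixel, so the
    whole layer is a single matrix-vector multiply followed by tanh.
    """
    x = np.concatenate([fm.ravel() for fm in feature_maps])
    return np.tanh(weights @ x)

# Hypothetical classifier head: 64 neurons, then two output neurons
# (say, "pedestrian" and "no pedestrian"), with untrained random weights.
n_inputs = sum(fm.size for fm in pooled_maps)
hidden = fully_connected(pooled_maps, np.random.randn(64, n_inputs))
scores = np.tanh(np.random.randn(2, 64) @ hidden)
```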

As we ask for more and more abstract conclusions about the image, we add additional neural layers. Each of these layers is fully-connected as well, although some CNN architectures switch to a Gaussian connection pattern in the final layer. By now, there may be relatively few neurons in the layer, but they may represent very complex judgments: is this a letter President Washington wrote during his first term? Is that a drunk who could fall down before I get to the intersection?

If you’ve been keeping track, you can see that each new frame presented to the CNN triggers an avalanche of computation, and that much of this computation will probably need to be in floating-point format. In the input layer, each possible 4-by-4 cell in the image requires a small finite impulse response (FIR) filter to be applied to it: 16 multiplies, some additions, and an evaluation of tanh(x). That has to be done for each feature map in the input layer. In subsequent layers, potentially each possible cell (including all the overlapping ones) in each feature map from the preceding layer gets fed into the convolution engine for every feature map in the new layer. Subsampling saves this from becoming an enormous job, but it is still considerable. Yet it is tiny compared to the work necessary to evaluate the fully-connected layers at this stage.
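
A rough, purely illustrative back-of-envelope count makes the point for just the first convolutional layer, assuming a 1920-by-1080 frame at the 120 frames per second mentioned earlier, eight feature maps, and 4-by-4 kernels (all hypothetical figures):

```python
# Approximate multiply count for one hypothetical first convolutional layer.
pixels_per_frame = 1920 * 1080
multiplies_per_cell = 4 * 4            # one multiply per kernel coefficient
feature_maps_in_layer = 8
frames_per_second = 120

multiplies_per_second = (pixels_per_frame * multiplies_per_cell
                         * feature_maps_in_layer * frames_per_second)
print(f"{multiplies_per_second / 1e9:.0f} billion multiplies per second")  # ~32
```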

Training

We casually skipped over a very critical question: where do the coefficients in those convolution kernels—the numbers that define the pattern each kernel will identify—come from? Similarly, where do the weights on the neuron inputs come from? The performance of the CNN depends hugely on the choice of these values. Choose them badly, and you will have a very large room full of monkeys and typewriters, but no Shakespeare.

Experience with the mediocre performance of rule-based systems led researchers to try something radical and not at all intuitive. If it is so hard for application experts to choose appropriate kernels, can the iterative process of training used in pure neural networks do better? The answer, it turns out, is yes.

In principle, neural-network training is quite simple, and requires only four things. You need a set of training images that represent the situations you want the CNN to resolve. You need the ideal correct output for each of those images. You need an error function that quantifies the difference between the desired output and the actual output. And you need an initial state for each of the convolution-kernel coefficients and neuron input weights in the CNN. In principle, these are straightforward. In practice, each of these items presents certain challenges.

We can start with the training images. “The training images must represent the real world,” Compton cautions. If they are too simple, the CNN’s results may be too simplistic. If they are too complex, training may simply not work. If they leave out important cases—a different style of penmanship, or a man carrying a large box, say—the CNN’s behavior when it encounters these inputs will be unpredictable. Unfortunately, the only tools for choosing the size of the training set and selecting the images are experience and judgment. Similarly, but less problematically, you have to be sure what you want the CNN to conclude from each of the training images.

Next you need initial values. These have to be chosen in a way that doesn’t effectively shut off portions of the CNN; all zeroes or all ones would be a bad guess, for instance. Experience has shown that choosing values at random from a particular statistical distribution works well.
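
As one common recipe (an assumption here, not something the article prescribes), the initial weights can be drawn from a zero-mean distribution whose spread shrinks as a neuron’s number of inputs grows, so that no tanh neuron starts out saturated or silent:

```python
rng = np.random.default_rng(seed=42)

def init_weights(n_neurons, n_inputs):
    """Zero-mean random initial weights, scaled down as the fan-in grows.

    The 1/sqrt(n_inputs) scale is one common choice: it keeps the initial
    weighted sums small enough that tanh outputs are neither all near zero
    nor pinned at +/-1.
    """
    return rng.normal(0.0, 1.0 / np.sqrt(n_inputs),
                      size=(n_neurons, n_inputs))

w1 = init_weights(64, n_inputs)        # hypothetical fully-connected layer
```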

That brings us to the error function. If you evaluate the CNN for a given training image, and if you know the desired output, you can write the difference as an expression with a lot of sums of products and hyperbolic tangents in it: you have expressed the size of the error as a function of the input image pixels, the convolution coefficients, and the neuron weights.

Now comes the fascinating part. With this information, you can calculate all the partial derivatives of the error function with respect to the neuron weights in the last layer of the CNN. Those slopes will tell you how to adjust the weights for the next try. You can then propagate this process back to the preceding layer, and adjust those weights, and so on. The really innovative idea in CNNs is that once you reach the convolutional layers, you can just keep going—using the same process of calculating the partial derivative of the error function with respect to each kernel coefficient and adjusting that coefficient by a small amount along the slope. You don’t have to try to guess what the coefficients should be based on features you think the CNN should be looking for.

You keep repeating the training process until the error is acceptably small. Then you move on to the next image and repeat. There is no mathematical way to predict convergence or final accuracy. But experience has shown that the process usually converges, and the accuracy is usually acceptable.
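
As an illustration of the whole loop, here is a minimal sketch that trains only a single tanh layer with a squared-error function and gradient descent, reusing names from the earlier sketches. A real CNN back-propagates the same kind of partial derivatives through every neural layer and every convolution-kernel coefficient, not just the last layer.

```python
def train_last_layer(samples, weights, rate=0.01,
                     target_error=1e-3, max_iters=1000):
    """Gradient-descent training of one tanh layer (illustration only).

    `samples` is a list of (input_vector, desired_output) pairs.  Each pass
    computes the error, the partial derivative of the error with respect to
    every weight, and nudges each weight a little way down the slope.
    """
    for _ in range(max_iters):
        total_error = 0.0
        for x, target in samples:
            y = np.tanh(weights @ x)                 # forward pass
            err = y - target
            total_error += 0.5 * np.sum(err ** 2)    # squared-error function
            delta = err * (1.0 - y ** 2)             # derivative through tanh
            weights -= rate * np.outer(delta, x)     # adjust along the slope
        if total_error < target_error:
            break                                    # converged, we hope
    return weights

# Hypothetical toy run: eight random "images" that should all map to (1, -1).
samples = [(rng.normal(size=64), np.array([1.0, -1.0])) for _ in range(8)]
trained = train_last_layer(samples, init_weights(2, 64))
```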

This Just Doesn’t Feel Right

There are several things about this process that repel many engineers. First, it is utterly non-deterministic. Attempts to predict the response of the trained network to real-world inputs, its overall accuracy, its response in certain critical situations, or even the required training time have mostly failed. Much depends on the structure you choose for your CNN, your error function, and especially your choice of training images. Second, it can be difficult or impossible to get an intuitive grasp of what the feature maps and neurons in the trained CNN mean. They may not have an obvious meaning as individual data sets. They may just work.

In short, you have created a network that may or may not be appropriate, trained it with images you can only hope will be sufficiently representative, and you have no idea exactly why it works or when it might get a seriously wrong answer. Why on earth would a good engineer do this?

The reason researchers, and increasingly embedded-system developers, are going to all this trouble is simple: CNNs do work. Training results generally predict real-world results. And CNNs consistently out-score all other kinds of algorithms on standard batteries of object-recognition and classification problems. They are relatively insensitive to changes in object size, location, and lighting. And in work recently published by researchers at Google and Stanford University, clusters of different CNNs have been highly accurate not only in classifying the objects in images but also in writing captions describing what is going on in the scene. There is simply no other known algorithm today that can get there.

“This non-determinism can be unsettling, especially in safety-critical applications,” Compton agrees. “But in the real world, there are already lots of non-deterministic things in engineering. No system is ideally reliable. The goal is to achieve an acceptable failure rate. With CNNs, you do that by producing a large-enough test set to give you confidence in your error rate estimates.”

How Do I Do This?

Results or no results, by now this whole approach has to sound daunting. First, you select an architecture: how many convolutional and how many fully-connected layers, what de-linearization function, how many feature maps in each convolutional layer, and what size of kernels. But how?

“Researchers tend to re-use whatever network design they learned first,” Compton says. “There is a lot of hand-engineering here, exploring the solution space. There is only some recent work on automating the exploration. So people tend to stick with what has worked for them in the past.

“Today there is no way to predict the size of the network you will need—it could be three layers or 20. Nor can you predict the performance. You have to construct your CNN and train it.”

Implementation is another question. In academic research, execution time is not a big issue, so you can build your CNN as one big chunk of software. “Essentially, the convolution layers are FIR filters, and the fully-connected layers are matrix-matrix multiplies,” Compton explains. Training is another matter. It is usually done once, during the design process, and so the training time is not a factor in the latency of the CNN. It requires an initial evaluation of a large set of partial derivatives to get the slopes of the error function, and then some matrix multiplies on each iteration, repeated across all the training images. And it iterates until the network—you hope—converges to an acceptable error rate.
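
As a sketch of Compton’s point about the fully-connected layers, again reusing names from the earlier sketches: stack a batch of flattened inputs as the rows of a matrix and evaluating the whole layer for the whole batch collapses into one matrix-matrix multiply. The batch size here is hypothetical.

```python
# One fully-connected layer evaluated for a whole batch of inputs at once.
batch = rng.normal(size=(32, n_inputs))   # 32 hypothetical flattened inputs, one per row
outputs = np.tanh(batch @ w1.T)           # a single matrix-matrix multiply: shape (32, 64)
```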

But in an embedded system you do care about latency, and you don’t generally have a data center’s worth of servers to throw at the computations. Still, you have to compute all those feature maps and neuron values every time the input changes—unless you are very clever about detecting unchanged regions in the input image and dropping unnecessary computations as those static regions propagate through the CNN. So meeting real-time deadlines may require hardware help for the customary embedded processor: DSP chips, pipelined accelerators in FPGAs, or dedicated synthetic neuron chips.

Platforms Emerge

This design process still doesn’t lie within the comfort zones of most design teams. And Compton points out that there is little in the way of tutorial material available that isn’t biased toward a particular research program or toward today’s dominant application, Web search. So are CNNs destined to remain merely thesis topics? No, Compton insists. They are moving into the real world in two stages.

The first stage involves using existing experts. “If you have a problem that requires fuzzy pattern-matching when the position of the pattern varies, you should consider CNNs,” Compton maintains. Such problems could be in machine vision, or something less obvious, like voiceprint analysis or Internet content filtering. Once you have identified the problem, the next step is to constrain it—essentially, controlling the outside world and the imaging system to limit the difficulty of the identification and classification tasks—until you can find an expert who has already worked successfully on similar problems. The expert will bring proven experience with CNN architectural choices and training techniques.

But the industry needs a more efficient solution, Compton believes. So the trend, including work at i-Abra, is toward building a design framework for CNNs: architectural exploration and design tools, implementation platforms, and supporting services. The goal is to allow a design team without neural-network background to describe their problem to the platform, and have the platform play the role of expert: selecting an architecture, evaluating the training images, predicting performance, and assisting in implementation.

“CNNs are already opening new applications to engineering solution,” Compton says. “As our knowledge and our hardware platforms mature, they will open even more.”