Applet: Nora Willett
Text: Marc Levoy
Technical assistance: Andrew Adams
In the preceding applet we considered active and passive methods of autofocusing a camera, and we studied in detail the phase detection method commonly used in single-lens reflex (SLR) cameras. In this applet we consider contrast detection, another passive autofocus method. This method is commonly used in point-and-shoot cameras and in some cell phone cameras - those with movable lenses. Contrast detection is also used in SLR cameras when they are in live preview mode (called "Live View" on Canon cameras). In this mode the reflex mirror is flipped up, thereby disabling the autofocus module, so only the main sensor is available to help the camera focus, and contrast detection must be used.
The optical path for contrast detection is much simpler than for phase detection. There is no half-silvered mirror, no microlenses, and no chip crammed with one-dimensional CCDs. Instead, the image falling on the main sensor is captured and analyzed directly to determine whether it is well focused or not. As in phase detection, this analysis can be performed at one or more distinct positions in the camera's field of view at once, and the results can be combined to help the camera decide where to focus. We could call these positions autofocus points by analogy to the phase detection method, except that in contrast detection there is no limit on the number and placement of these points. In the simulated viewfinder image at the right side of the applet, we've highlighted one position near the center of the field of view. Let's consider the formation of the image at this position.
If you follow the red bundle of rays in the applet, you'll see that they start at an unseen object lying on the optical axis (the black horizontal line) to the left of the applet, pass through the main lens, and reconverge on the optical axis. With the applet in its reset position, these rays come to a focus before reaching the main sensor, then spread out again and strike the sensor. Although the simulated viewfinder shows a coin, let us assume for the purposes of the ray diagram that the unseen object is a single bright point on a black background. In this case the image captured by the sensor would be a broad spot that tapers to black at its boundaries. A 1D plot through this spot is shown to the right, where it looks like a low hump. Such a broad, low hump is said to have low contrast.
Use the slider to move the lens left and right. As you do so, the position where the red rays converge will also move. As their focus moves closer to the sensor, the breadth of the spot formed on the sensor decreases and its center becomes brighter, as shown in the 1D plot. When the rays' focus coincides with the sensor, the spot is tightest and its peak is highest. Unfortunately, we can't use the height of this peak to decide when the system is well focused, since the object could be any color - light or dark. Instead, we could examine the slope of the plot, i.e. the gradient within a small neighborhood of pixels, estimating this gradient by comparing the intensities of adjacent pixels. We would then declare the system well focused when this gradient exceeds a certain threshold. However, if the object naturally has slowly varying intensities, like skies or human skin, then its gradients will be modest even if it is in good focus.
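To make the threshold idea concrete, here is a minimal sketch in Python with NumPy (not taken from any camera's firmware): it estimates the gradient by differencing adjacent pixels within a small neighborhood and declares the region in focus when the steepest slope exceeds a chosen threshold, whose value here is arbitrary.

```python
import numpy as np

def gradient_exceeds_threshold(patch, threshold=0.1):
    """Crude focus test for a small neighborhood of pixels.
    `patch` is a 2D array of intensities (say, in [0, 1]);
    `threshold` is an illustrative, scene-dependent value."""
    patch = patch.astype(float)
    dx = np.diff(patch, axis=1)   # differences between horizontally adjacent pixels
    dy = np.diff(patch, axis=0)   # differences between vertically adjacent pixels
    steepest = max(np.abs(dx).max(), np.abs(dy).max())
    return steepest > threshold
```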
The most successful method is to move the lens back and forth until the intensity or its gradient reaches a maximum, meaning that moving the lens forward or backward would decrease the intensity in a pixel or the gradient in a small neighborhood. This "local maximum" is the position of best focus. For images of natural scenes (like the coin), rather than bright spots on a dark background, even more complicated methods can be employed, and it's beyond the scope of this applet to describe them all. Moreover, the algorithm used in any commercial camera is proprietary and secret, and therefore unknown to us.
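In practice this maximization is usually done by collapsing the neighborhood into a single number, often called a focus measure or contrast measure, that peaks when the image is sharpest. The sketch below shows one simple choice, the sum of squared gradients; it illustrates the idea and is not the measure any particular camera uses.

```python
import numpy as np

def focus_measure(patch):
    """Sum of squared finite differences over the neighborhood.
    This value grows as edges in `patch` become sharper, so the
    lens position that maximizes it is taken as best focus."""
    patch = patch.astype(float)
    dx = np.diff(patch, axis=1)
    dy = np.diff(patch, axis=0)
    return float((dx * dx).sum() + (dy * dy).sum())
```

Summing over the whole neighborhood makes the measure less sensitive to noise in any single pixel than the per-pixel tests described above.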
Regardless of the method employed, if you're trained in computer vision you'll recognize these methods as shape from focus algorithms. Compare this to phase detection, which, because it uses only two views at each autofocus point, is more like a shape from stereo algorithm.
This maximum-seeking method has one advantage and one big disadvantage. Its advantage is that it requires only local operations, meaning that it looks at only a few pixels in the vicinity of the desired evaluation position. Thus, it requires relatively little computation (hence battery power). Its disadvantage is that it requires capturing multiple images. From a single image it can't tell you whether the camera is well focused. It also can't tell you how far to move the lens to get it into focus. It can't even tell you which direction to move the lens! (Compare this to phase detection, in which a single observation suffices in principle to focus the lens accurately.) Thus, contrast detection systems must capture an image, estimate the amount of misfocus, move the lens by a small amount, capture another image, estimate misfocus again, and decide if things are getting better or worse. If they're getting better, then the lens is moved again by a small amount. If they're getting worse, then the lens is moved the other way. If they've been getting better for a while but then suddenly start getting worse, we've overshot the in-focus position, so the lens should be moved back the other way, but only a little.
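Written out, the hunting procedure might look roughly like the sketch below, where `capture_patch()` and `move_lens()` are hypothetical stand-ins for camera hardware calls, `measure` is a focus measure such as the sum-of-squared-gradients sketch above, and the step sizes and stopping rule are arbitrary illustrative choices.

```python
def contrast_detect_autofocus(capture_patch, move_lens, measure,
                              step=1.0, max_iters=50):
    """Iterative contrast-detection focusing ("hunting").
    capture_patch(): hypothetical hook returning the current image
        neighborhood as a 2D array
    move_lens(delta): hypothetical hook moving the lens by `delta`
        (the sign gives the direction)
    measure(patch): a focus measure that peaks at best focus"""
    best = measure(capture_patch())
    direction = +1                      # initial guess; may be wrong
    for _ in range(max_iters):
        move_lens(direction * step)
        current = measure(capture_patch())
        if current > best:              # getting better: keep going
            best = current
        else:                           # getting worse: wrong way or overshot,
            direction = -direction      # so reverse direction...
            step *= 0.5                 # ...and take smaller steps
            if step < 0.01:             # once steps are negligible,
                break                   # declare the lens in focus
```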
By this iterative "hunting" process we eventually find the in-focus position we are seeking. However, hunting takes time. That's why we complain about how long it takes to autofocus a point-and-shoot camera. That's also why high-end SLRs that shoot movies in Live View mode either don't offer autofocusing or do it poorly - because they are compelled to use contrast detection. By the way, professional moviemakers don't use autofocus cameras; they have a person called a focus puller who stands next to the camera and manually moves the focus ring by a precise, pre-measured amount when called for in the script.
If you're trained in computer vision, you might ask at this point why cameras don't use shape from defocus, in which images are captured at only two lens positions. By comparing their amounts of misfocus, one can in principle estimate how far to move the lens to bring the image into good focus, thereby avoiding hunting. However, this estimation requires assuming something about the frequency content of the scene (How sharp is the object being imaged?), and such assumptions are seldom trustworthy enough for everyday photography.
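As a rough illustration of the comparison step (not how any real camera implements it), the toy sketch below estimates the relative blur between two images taken at different lens positions by searching for the Gaussian blur that best maps the sharper one onto the blurrier one. Turning that number into a lens displacement would additionally require a calibrated lens model, and the whole scheme rests on the blur-model and scene assumptions just mentioned.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def relative_blur(sharper, blurrier, sigmas=np.linspace(0.0, 5.0, 51)):
    """Toy shape-from-defocus step: find the Gaussian sigma that,
    applied to `sharper`, best matches `blurrier`. Assumes the first
    image really is the sharper of the two and that defocus blur is
    Gaussian - neither is guaranteed for real scenes."""
    a = sharper.astype(float)
    b = blurrier.astype(float)
    errors = [np.mean((gaussian_filter(a, s) - b) ** 2) for s in sigmas]
    return float(sigmas[int(np.argmin(errors))])
```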
© 2011 Marc Levoy