Entropy-based Gaze Planning

Consider the problem of an active observer moving through a familiar environment with the task of identifying and localizing known objects. Because its resources are limited, it cannot spend much time in one place, so the computational overhead must be low. Furthermore, it needs to minimize the effort expended in gathering data, so it must be economical in its movement. These constraints typify many applications of active vision, particularly in the context of mobile robotics. At another level, this problem raises the challenge of how relatively simple percepts can be integrated into solid inferences about the visual world. Our approach to this problem is based on two observations: i) strong assertions can be made by accumulating evidence that might appear weak when considered instantaneously, and ii) knowing how to explore an environment (i.e. where to look) can be learned from local interactions with the objects that populate it.

In this work, we introduce the notion of entropy maps, and show how they can be used to guide an active observer along an optimal trajectory, by which the identity and pose of objects in the world can be inferred with confidence, while minimizing the amount of data that must be gathered. Specifically, we consider the case of active object recognition where entropy maps are used to encode prior knowledge about the discriminability of objects as a function of viewing position. These maps are computed using optical flow signatures as a case study, and we show how a gaze-planning strategy can be formulated by using entropy minimization as a basis for choosing a next best view. Experimental results show the strategy's effectiveness for active object recognition using a single monochrome television camera.

Context: The context of the problem is a mobile observer consisting of a monochrome television camera mounted on the end effector of a gantry robot (Figure 1). The camera is free to move about the workspace of the gantry, in which the different test objects are placed (i.e. the environment is stationary). As the camera moves relative to an object, an optical flow pattern is induced on the retina, which results in a discrete image sequence. The task that the system must perform is to generate an optimal trajectory (the shortest sequence) that will result in the correct identification of the object.

Figure 1: Gantry robot autonomously moving around object, trying to identify it.

Recognition Based on Optical Flow: What makes this task particularly challenging is that the problem is fundamentally ill-posed, particularly in its non-uniqueness. Furthermore, recognition based on optical flow images poses its own set of problems (see Recognizing Objects from Curvilinear Motion for details). For that reason, we have developed a Bayesian recognition strategy whose features are signatures extracted from the optical flow images computed as the camera moves in front of the object. Rather than make an assertion based on a single measurement, we use Bayesian techniques to accumulate evidence for the different assertions over a sequence of measurements. Our hypothesis is that the correct assertion will become apparent over time.
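The evidence accumulation described above can be sketched as a recursive Bayesian update. This is a minimal illustration, not the implementation used in this work: it assumes that successive measurements are conditionally independent given the object, and the function name `accumulate_evidence` and the likelihood values are hypothetical.

```python
def accumulate_evidence(prior, likelihood_sequence):
    """Recursively update a posterior over object hypotheses.

    prior: list holding P(Oi) for each hypothesis.
    likelihood_sequence: iterable of lists, each holding p(x_t | Oi) for one
    measurement x_t (hypothetical values supplied by the caller).
    Assumption: measurements are conditionally independent given the object.
    """
    posterior = list(prior)
    for likelihoods in likelihood_sequence:
        # Bayes' rule: multiply by the new likelihoods, then renormalize.
        unnormalized = [p * l for p, l in zip(posterior, likelihoods)]
        total = sum(unnormalized)
        if total == 0.0:
            raise ValueError("measurement has zero likelihood under every hypothesis")
        posterior = [u / total for u in unnormalized]
    return posterior


# Three weak measurements, each only slightly favouring object 0.
weak_measurements = [[0.4, 0.3, 0.3]] * 3
posterior = accumulate_evidence([1 / 3, 1 / 3, 1 / 3], weak_measurements)
```

In this toy case, no single measurement gives object 0 more than a 0.4 posterior, yet the accumulated posterior for object 0 exceeds 0.5 after three measurements.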

Entropy Maps: Off-line, an appearance-based training strategy associates an image measure, x, with some particular object Oi in a database of known objects. We define the posterior probability distribution function over the set of n object hypotheses, {Oi}, {i=1..n}, given the measurement vector x, as P(Oi|x). This denotes the probability that the unknown object is an instance of each of the objects in the database given the observed data. We now wish to obtain a metric that predicts the likelihood of ambiguous recognition results as a function of viewing position. By ambiguous, we mean that more than one object in the database is a highly likely candidate. A suitable metric is defined in terms of the Shannon entropy [Cover:91],

H(P(O_i|\{x\})) = \sum_{i=1}^{n} P(O_i|\{x\}) \log \frac{1}{P(O_i|\{x\})},

which is a measure of the ambiguity of the posterior distribution produced by a recognition experiment. Higher entropies reflect greater ambiguity.
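As a small numerical illustration of this metric, the sketch below evaluates the entropy of two hypothetical posteriors. The source does not state the base of the logarithm; base 2 is assumed here, and the name `shannon_entropy` is illustrative.

```python
import math


def shannon_entropy(posterior, base=2.0):
    """Return sum_i P_i * log(1 / P_i); terms with P_i == 0 contribute nothing."""
    return sum(p * math.log(1.0 / p, base) for p in posterior if p > 0.0)


# One dominant hypothesis versus three nearly equally likely hypotheses.
confident = shannon_entropy([0.98, 0.01, 0.01])
ambiguous = shannon_entropy([0.34, 0.33, 0.33])
```

The nearly uniform posterior yields the larger value, matching the reading of entropy as ambiguity.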

For the problem at hand, the entropy map is parameterized on a tessellated viewsphere with the object at origin. Off-line during training, image measures, {x}, are sampled at each coordinate of the viewsphere for each object in the database. A Bayesian learning strategy (see Recognizing Objects from Curvilinear Motion) is applied to the cumulative set of measures to derive a function for P(Oi|{x}). Then, using the entropy equation above, a map is constructed for each object in the database by evaluating H(P(Oi|{x})) at each coordinate of its respective viewsphere. Note that this is accomplished directly since the coordinates of each acquisition, x, are retained from the training procedure. Figure 2 depicts the steps involved in creating an entropy map.
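The map construction can be sketched as follows. This is an illustration under stated assumptions, not the training code used in this work: viewsphere coordinates are represented as hypothetical (azimuth, elevation) tuples, and `toy_recognize` is a stand-in for the learned function P(Oi|{x}).

```python
import math


def entropy(posterior):
    """Shannon entropy (natural log) of a discrete posterior."""
    return sum(p * math.log(1.0 / p) for p in posterior if p > 0.0)


def build_entropy_map(training_measures, recognize):
    """Evaluate the posterior entropy at every sampled view of one object.

    training_measures: dict mapping a viewsphere coordinate to the measure
    acquired there during training.
    recognize: callable returning the posterior over objects for a measure.
    """
    return {view: entropy(recognize(x)) for view, x in training_measures.items()}


def toy_recognize(scores):
    """Stand-in recognizer (assumption): normalizes raw scores to a posterior."""
    total = sum(scores)
    return [s / total for s in scores]


training_measures = {
    (0, 30): [9.0, 0.5, 0.5],   # a view where one object clearly dominates
    (90, 30): [1.0, 1.0, 1.0],  # a view where all objects look alike
}
entropy_map = build_entropy_map(training_measures, toy_recognize)
```

Because each measure is stored under the coordinate at which it was acquired, the map can be built directly, as noted above.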

Figure 2: On the left, we can see the tessellated viewsphere about the database object, created during training. At each segment, two motion sweeps lead to the computation of two optical flow images. Recognition is performed on each flow image, d, and the result is a discrete posterior distribution, P(O|d), depicting the confidence in each of the objects in the database: A, B, C. The entropy of this distribution, H(P(O|d)), is stored at each location.

The resulting entropy map can be very informative in the context of planning gaze for object recognition. It provides a quantitative prediction of the level of difficulty of recognizing each object in an on-line experiment. As with aspect graphs (e.g. [Eggert:92, Kriegman:89]), a set of characteristic views can be generated automatically off-line by using entropy maps to link location and discriminability. Figure 3 shows an example of an entropy map.

Figure 3: Here, we can see two views of the same entropy map built by gathering images about the object of interest, a toothpaste tube. Each tile represents a camera viewpoint of the object in the center. The tiles are colored from low entropy (blue) to high entropy (red). The system chose the viewpoint on the left as the most informative in terms of identifying the object, and the viewpoint on the right as the least informative. Note that this corresponds to our intuitive notion of what constitutes "good" and "bad" views of the object.

Using Entropy Maps to Plan Gaze: Two problems must be solved prior to planning gaze: 1) a particular map must be selected and 2) the pose of that map must be determined relative to the data acquired. As measurements are made on-line, the maximum a posteriori (MAP) solution corresponding to P(Oi|{x}) is used to determine the most likely object hypothesis for the measured data x. This estimate is subsequently used to select the entropy map to be used for planning the next best view. Of course the particular gaze planning strategy must be carefully structured to operate stably in these circumstances.
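The map selection step amounts to an argmax over the current posterior. A minimal sketch, with the hypothetical name `select_entropy_map` and placeholder map contents:

```python
def select_entropy_map(posterior, entropy_maps):
    """Return (index, map) for the maximum a posteriori object hypothesis.

    posterior: list holding P(Oi|{x}) for each object.
    entropy_maps: list with one entropy map per object, in the same order.
    """
    map_index = max(range(len(posterior)), key=lambda i: posterior[i])
    return map_index, entropy_maps[map_index]


# Placeholder maps keyed by a hypothetical view index.
entropy_maps = [{0: 0.2, 1: 0.9}, {0: 0.8, 1: 0.1}, {0: 0.5, 1: 0.5}]
index, chosen_map = select_entropy_map([0.2, 0.7, 0.1], entropy_maps)
```

If the posterior is still ambiguous, this selection may be wrong, which is why the planning strategy has to tolerate an incorrectly chosen map.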

Pose can be estimated at minimal expense by retaining the location information along with the image measures acquired during training. For example, appearance-based methods can be used to index these measures using the data acquired on-line [Nayar:96]. In fact, the implementation described in Recognizing Objects from Curvilinear Motion already uses appearance-based techniques in the process of determining the likelihoods for the different object hypotheses. As such, the computational overhead of determining pose is minimal. Once the camera pose is established in the coordinates of the training viewsphere, it is straightforward to determine the relative transform taking the camera to the desired position within this frame (Figure 4, Steps 2-3). By applying this same transform to the current camera frame, the camera is positioned accordingly, as shown in Figure 4, Step 4.
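The relative-transform step can be illustrated with rotation matrices. The sketch below is a simplification under stated assumptions: poses are reduced to 3x3 rotations about the object centre (translation along the viewsphere radius is ignored), the relative transform is expressed in the camera's own frame, and all function names and angles are hypothetical.

```python
import math


def rot_z(deg):
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(deg):
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def relative_transform(estimated_pose, target_pose):
    """Rotation, in the camera's frame, taking the estimated pose to the target.

    Both poses are given in training-viewsphere coordinates; the transpose of a
    rotation matrix is its inverse.
    """
    return matmul(transpose(estimated_pose), target_pose)


def apply_to_camera(camera_pose, relative):
    """Apply a camera-frame transform to the camera's pose in the workspace."""
    return matmul(camera_pose, relative)


# Poses in the training viewsphere frame (hypothetical angles).
estimated = matmul(rot_z(40), rot_x(60))
target = matmul(rot_z(100), rot_x(30))
# Unknown placement of the object in the gantry workspace (assumption).
offset = matmul(rot_x(10), rot_z(25))
camera_in_workspace = matmul(offset, estimated)

relative = relative_transform(estimated, target)
moved = apply_to_camera(camera_in_workspace, relative)
```

Because the transform is expressed relative to the camera, the object's placement in the workspace (`offset`) never has to be known: `moved` equals `offset` composed with `target`.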

The gaze planning strategy itself must be sufficiently robust to accommodate errors in pose determination and entropy map selection. Errors in the former are accommodated in part by smoothing the entropy map and a strategy that avoids placement in the vicinity of singularities and discontinuities. A partial solution to the selection problem is effected by choosing a next best view that minimizes the entropy on the most likely object hypothesis map, while simultaneously minimizing the entropy on any other likely candidates' maps. Over time the expectation is that confidence in an incorrectly chosen hypothesis will decrease as further evidence is uncovered.

Figure 4: (Step 1) Appearance-based methods are used to determine the likelihoods for the different object hypotheses using the data acquired on-line. Pose can be estimated at minimal expense by retaining the location information along with the image measures acquired during training. Once the camera pose is established in the coordinates of the training viewsphere, it is straightforward to determine the relative transform taking the camera to the desired position within this frame (Steps 2-3). By applying this same transform to the current camera frame, the camera is positioned accordingly (Step 4).

Experimental Results: Empirically, it was found that the proposed strategy performed better than a random navigation approach (moving to random locations) when started from highly ambiguous viewpoints. Figure 5 compares the navigation results of both strategies, starting at high-entropy locations, for two database objects. The proposed strategy converges more quickly than the random strategy in both cases. Notice that, in Figure 5(a), the random strategy caused the sensor to move to a "bad" local minimum (low entropy, wrong model) at iteration 2.


Figure 5: Here, entropy-based and random navigation strategies are compared by plotting entropy vs. time for two objects. Notice that the proposed strategy converges more quickly.

References: [Cover:91] T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley and Sons, 1991.
[Eggert:92] D. Eggert, K. Bowyer, C. Dyer, H. Christensen, and D. Goldgof, "The scale space aspect graph", In Proceedings, Conference on Computer Vision and Pattern Recognition, pages 335-340, Champaign, Il., June 15-18 1992. IEEE.
[Kriegman:89] D. Kriegman and J. Ponce, "Computing exact aspect graphs of curved objects: Solids of revolution", In Proceedings of IEEE Workshop on the Interpretation of 3-D Scenes, pages 116-122, Austin, Texas, November 27-29 1989. IEEE.
[Nayar:96] S.K. Nayar, H. Murase, and S.A. Nene, Parametric Appearance Representation in Early Visual Learning, chapter 6, Oxford University Press, February 1996.

A poster presentation on this topic was given at the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, Sept. 1999.

A presentation related to this topic was given at the Second IEEE Workshop on Perception for Mobile Agents, in association with the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Fort Collins, Colorado, June 1999.