## ABSTRACT

Computing motion on the basis of the time-varying image intensity is a difficult problem for both artificial and biological vision systems. We will show how one well-known gradient-based computer algorithm for estimating visual motion can be implemented within the primate’s visual system. This relaxation algorithm computes the optical flow field by minimizing a variational functional of a form commonly encountered in early vision, and is performed in two steps. In the first stage, local motion is computed, while in the second stage spatial integration occurs. Neurons in the second stage represent the optical flow field *via* a population-coding scheme, such that the vector sum of all neurons at each location codes for the direction and magnitude of the velocity at that location. The resulting network maps onto the magnocellular pathway of the primate visual system, in particular onto cells in the primary visual cortex (VI) as well as onto cells in the middle temporal area (MT). Our algorithm mimics a number of psychophysical phenomena and illusions (perception of coherent plaids, motion capture, motion coherence) as well as electrophysiological recordings. Thus, a single unifying principle ‘the final optical flow should be as smooth as possible’ (except at isolated motion discontinuities) explains a large number of phenomena and links single-cell behavior with perception and computational theory.

## Introduction

One prominent school of thought holds that information-processing systems, whether biological or man-made, should follow essentially similar computational strategies when solving complex perceptual problems, in spite of their vastly different hardware (Marr, 1982). However, it is not apparent how algorithms developed for machine vision or robotics can be mapped in a plausible manner onto nervous structures, given their known anatomical and physiological constraints. In this chapter, we show how one well-known computer algorithm for estimating visual motion can be implemented within the early visual system of primates.

The measurement of movement can be divided into multiple stages and may be performed in different ways in different biological systems. In the primate visual system, motion appears to be measured on the basis of two different systems, termed *short-range* and *long-range* processes (Braddick, 1974, 1980). The shortrange process analyzes continuous motion, or motion presented discretely but with small spatial and temporal displacement from one moment to the next (apparent motion; in the human fovea both presentations must be within 15 min of arc and with 60–100ms of each other). The long-range system processes larger spatial displacements and temporal intervals. A second, conceptually more important, distinction is that the short-range process uses the image intensity, or some filtered version of image intensity (e.g. filtered *via* a Laplacian-of-Gaussian or a difference-of-Gaussian operator), to compute motion, while the long-range process uses more high-level ‘token-like’ motion primitives, such as lines, corners, triangles etc. (Ullman, 1981). Among short-range motion processes, the two most popular classes of algorithms are the gradient method on the one hand (Limb & Murphy, 1975; Fennema & Thompson, 1979; Marr & Ullman, 1981; Hildreth, 1984; Yuille & Grzywacz, 1988) and the correlation, second-order or spatiotemporal energy methods on the other hand (Hassenstein & Reichardt, 1956; Poggio & Reichardt, 1973; van Santen & Sperling, 1984; Adelson & Bergen, 1985; Watson & Ahumada, 1985). Gradient methods exploit the relationship between the spatial and the temporal intensity gradient at a given point to estimate local motion, while the second class of algorithms multiplies a filtered version of the image intensity with a slightly delayed version of the filtered intensity from a neighboring point (a mathematical operation similar to correlation; hence their name (for a review, see Hildreth & Koch, 1987).

## Computational theory

The problem in computing the optical flow field consists of labeling every point in a visual image with a vector, indicating at what speed and in what direction this point moves (for reviews on motion see Ullman, 1981; Nakayama, 1985; Hom, 1986; Hildreth & Koch, 1987). One limiting factor in any system’s ability to accomplish this is the fact that the optical flow, computed from the changing image brightness, can differ from the underlying two-dimensional velocity field. This vector field, a purely geometrical concept, is obtained by projecting the threedimensional velocity field associated with moving objects onto the two-dimensional image plane. A perfectly featureless rotating sphere will not induce any optical flow field, even though the underlying velocity field differs from zero almost everywhere. Conversely, if the sphere does not rotate but a light source, such as the sun, moves across the scene the computed optical flow will be different from zero even though the velocity field is not (Horn, 1986). In general, if the objects in the scene are strongly textured, the optical flow field should be a good approximation to the underlying velocity field (Verri & Poggio, 1987).

The basic tenet underlying Horn & Schunck’s (1981) analysis of the problem of computing the optical flow field from the time-varying image intensity *I*(*x*,*y*,*r*) falling onto a retina or a phototransistor array is that the total derivative of the image intensity between two image frames separated by the interval d*t* is zero: d*I*(*x*,*y*,*t*)/d*t* = 0. In other words, the image intensity seen from the point-of-view of an observer located in the image plane and moving with the image, does not change. This conservation law is strictly only satisfied for translation of a rigid Lambertian body in planes parallel to the image plan (for a detailed error analysis see Kearney *et al*. 1987). This law will be violated to some extent for other types of movements, such as motion in depth or rotation around an axis. The question is to what extent this rule will be violated and whether the system built using this hypothesis will suffer from a severe ‘visual illusion’.

Using the chain rule of differentiation, d*I*/d*t* = 0 can be reformulated as *Ix**ẋ**+Iy**xẏ**+It=**Δ**I**·* V+It = 0, where *ẋ**=**dx/dt* and xẏ = d*y*/d*t* are the x and y components of velocity V, and *I*_{x}*= ∂I/∂x, I*_{y}*= ∂1/∂y* and *I*_{t} = *∂1/∂t* are the spatial and temporal image gradients which can be measured from the image (vectors are printed in boldface). Formulating the problem in this manner leads to a single equation in two unknowns *ẋ**xẏ*. Measuring at *n* different locations does not help in general, since we are then faced with *n* linear equations in *2n* unknowns. This type of problem is termed ill-posed (Hadamard, 1923). One way to make these problems well-behaved in a precise, mathematical sense, is to impose additional constraints in order to be able to compute unambiguously the optical flow field. The fact that we are unable to measure both components of the velocity vector is also known as the ‘aperture’ problem. Any system with a finite viewing aperture and the rule d*I*/d*t* = 0 can only measure the component of motion — *I*_{t}*/*|∇Z| along the spatial gradient ∇*I* = *(I*_{x},*I*_{y}*)*. The motion component perpendicular to the local gradient remains invisible. In addition to the aperture problem, the initial motion data is usually noisy and may be sparse. That is, at those locations where the local visual contrast is weak or zero, no initial optical flow data exist (the featureless rotating sphere would be perceived as stationary), thereby complicating the task of recovering the optical flow field in a robust manner.

*L:*The term in the first square bracket is nothing but the expansion of d

*I*/d

*t*(see above) and thus represents local motion, measured along the intensity gradient. In an ideal world free of noise, d

*I*/d

*t*should be zero; we here impose the condition that it should be as small as possible to account for unavoidable noise in the motion measurement stage. The terms in the second bracket represent a measure of the smoothness of the flow field, the parameter λ controlling the compromise between the smoothness of the desired solution and its closeness to the data. The contribution of this term to

*L*will be zero for a spatially constant flow field - induced by rigid motion in the plane - since all spatial derivatives will be zero. The smoothness constraint also stabilizes the solution against the unavoidable noise in the intensity measurements.

Since *L* is quadratic in *ẋ* and xẏ and therefore has a unique minimum, the final solution minimizing *L* will represent a trade-off between faithfulness in the data and smoothness, depending on a parameter λ. The Horn & Schunck (1981) algorithm derives motion at every point in the image by taking into account motion in the surrounding area. It can be shown that it finds the qualitatively correct optical flow field for real images (for a mathematical analysis in terms of the theory of dynamic systems see Verri & Poggio, 1987). Such as area-based optical flow method is in marked contrast to the edge-based algorithm of Hildreth (1984); she proposes to solve the aperture problem by computing the optical flow along edges (in her case zero-crossings of the filtered image) using a variational functional very similar to that of equation 1.

The use of general constraints (as compared to very specific constraints of the type ‘a red blob at desk-top height is a telephone’, popular in early computer vision algorithms) is very common to solve the ill-posed problems of early vision (Poggio *et al*. 1985). Thus, continuity and uniqueness are exploited in the Marr & Poggio (1977) cooperative stereo algorithm, smoothness is used in surface interpolation (Grimson, 1981) and rigidity is used for reconstructing a threedimensional figure from motion (structure-from-motion; Ullman, 1979).

Before we continue, it is important to emphasize that the optical flow is computed in two, conceptually separate, stages. In the first stage, an initial estimate of the local motion, based on spatial and temporal image intensities, is computed. Horn & Schunck’s method of doing this (using d*I*/d*t =* 0) belongs to a broad class of motion algorithms, collectively known as gradient algorithms (Limb & Murphy, 1975; Fennema & Thompson, 1979; Marr & Ullman, 1981; Hildreth, 1984; Yuille & Grzywacz, 1988). A new variant of the gradient method, using dλ *I*/d*t* = 0 to compute local motion, leads to uniqueness of the optical flow, since this constraint is equivalent to two linear independent (in general) equations in two unknowns (Uras *et al*. 1988). Thus, in this formulation, computing optical flow is not an ill-posed but an ill-conditioned problem. Alternatively, a correlation or second-order model could be used at this stage for estimating local motion (Hassenstein & Reichardt, 1956; Poggio & Reichardt, 1973; van Santen & Sperling, 1984; Adelson & Bergen, 1985; Watson & Ahumada, 1985; Reichardt *et al*. 1988). However, for both principal (e.g. non-uniqueness of initial motion estimate) and practical (e.g. robustness to noise) reasons, all these methods require a second, independent stage where smoothing occurs.

However, while the optical flow generally varies smoothly from location to location, it can change quite abruptly across discontinuities. Thus, the flow field associated with a flying bird varies smoothly across the animal but drops to zero ‘outside’ the bird (since the background is stationary). In these cases of motion discontinuities - usually encountered when objects move across each other - smoothing should be prevented (see below and Hutchinson *et al*. 1988).

The cost functional used to compute motion (equation 1) is a quadratic variational functional of a type common in early vision (Poggio *et al*. 1985), and can be solved using simple electrical networks (Poggio & Koch, 1985). The key idea is that the power dissipated in a linear electrical network is quadratic in the currents or voltages; thus, if the values of the resistances are chosen appropriately, the functional *L* to be minimized corresponds to power dissipation and the steadystate voltage distribution in the network corresponds to the minimum of *L* in equation 1. Data are introduced by injecting currents into the nodes of the network. Once the network settles into its steady state - dictated by Kirchhoffs & Ohm’s laws - the solution can simply be read off by measuring the voltages at every node. Efforts are now under way (see, in particular, Luo *et al*. 1988) to build such resistive networks for various early vision algorithms in the form of miniaturized circuits using analog, subthreshold CMOS VLSI technology of the type pioneered by Mead (1989).

## Implementation in a neuronal network

We will now describe a possible neuronal implementation of this computer vision algorithm. Specifically, we will show that a reformulated variational functional equivalent to equation 1 can be evaluated within the known anatomical and physiological constraints of the primate visual system and that this formalism can explain a number of psychophysical and physiological phenomena.

Neurons in the visual cortex of mammals represent the direction of motion in a very different manner from resistive networks, using many neurons per location such that each neuron codes for motion in one particular direction (Fig. 1). In this representation, the velocity vector V(i,j) [where (*i*,*j*) are the image plane coordinates of the center of the cell’s receptive field] is not coded explicitly but is computed across a population of *n* such cells, each of which codes for motion in a different direction (given by the unit vector Θ _{k}), such that:

*V*(

*i*,

*j*,

*k*) have spatially overlapping receptive fields but with different preferred direction of motion

*k*. This population-coding scheme implies, of course, that all neurons corresponding to location

*i*,

*j*represent a single, unique value of velocity, an assumption which breaks down during the perception of two stimuli moving over each other (see the section on motion transparency). This distributed and coarse population-coding scheme is similar to the coding believed to be used in the system controlling eye movements in the mammahan superior colliculus (Lee

*et al*. 1988). Detecting the most active neuron at each location (winner-take-all scheme), as in Bülthoff

*et al*. (1989), is not required. To mimic neuronal responses more accurately, the output of ah our model neurons is halfwave rectified; in other words,

*f*(

*x*) =

*x*if

*x*>0 and 0 if

*x*<0. Thus, when the inhibitory inputs exceed the excitatory ones, the neuron is silent. We then require at least

*n =*4 neurons to represent all possible directions of movement. Note that in this representation the individual components

*V(i*,

*j*,

*k)*are

*not*the projections of the velocity field V(i,j) onto the direction Θ

_{k}(except for

*n*= 4).

*I(i*,

*j)*is projected onto the image plane and relayed to the first cortical processing stage

*via*two sets of cells: and where

*G*is the two-dimensional Gaussian filter (with σ

^{2}= 4 pixels; Marr & Hildreth, 1980; Marr & Ullman, 1981). The ∇

^{2}G filter is very similar to the difference-of-Gaussian or Mexican hat-shaped receptive fields of retinal ganglion cells (Enroth-Cugell & Robson, 1966). This stage then models the filtering performed by retinal ganglion cells. 5 and

*T*cells, however, only represent a first- order approximation of the visual transformations occurring in the retina and the lateral geniculate nucleus, because retinal ganglion cells always show some transient behavior - different from equation 3 - and do not respond instantaneously, as would be expected from equation 4. However, little would be gained at this early stage in our understanding of cortical processing by using much more sophisticated cellular models (for such a detailed dynamic description of cat retinal X cells see Victor, 1987).

*n*ON – OFF orientation- and direction-selective cells

*U(i*,

*j*,

*k)*, each with preferred direction indicated by the unit vector Θ

_{k}(here the

*V*neurons and the

*U*neurons have the same number of directions and the same preferred directions for the sake of simplicity, even though it is not necessary): where

*∈*is a constant and ∇

_{k}the spatial derivative along the direction Θ

_{k}. This derivative is approximated by projecting the convolved image

*S(i*,

*j)*onto a ‘simple’-type cortical receptive field, consisting of a 1 by 7 pixel positive (ON) subfield next to a 1 by 7 pixel negative (OFF) subfield. Because of the Gaussian convolution in the

*S*cells, the resulting receptive field has an ON subfield of 3 by 9pixels next to an OFF subfield of the same size (Fig. 8A shows such a subfield). Such receptive fields are common in the primary visual cortex of cats and primates (Hubel & Wiesel, 1962). We assume that at each location

*n*such receptive fields, each with preferred axis given by Θ

_{k}(

*k*∈ {l…n}) exist. The cell

*U(i*,

*j*,

*k)*responds optimally if a bar or grating oriented at right angles to Θ

_{k}moves in direction Θ

_{k}.

Our definition of *U* differs from the standard gradient model *U = — T/∇*_{k}*S*, by including a gain control term, ∈, such that *U* does not diverge if the visual contrast of the stimulus decreases to zero; thus, U→ *— T∇*_{k}*S* as |∇_{k}*S*|→ 0. Under these conditions of small stimulus contrast, our model can be considered a second-order model, similar to the correlation or spatio-temporal energy models (Hassenstein & Reichardt, 1956; Poggio & Reichardt, 1973; Adelson & Bergen, 1985) and the output of the *U* cell is proportional to the product of a transient cell *(T)* and a sustained simple cell with an odd-symmetric receptive field (∇_{k}S); thus, the response of *U* is proportional to the magnitude of velocity. For large values of stimulus contrast, i.e. |∇_{k}S| > ∈, *U→ – T/∇*_{k}*S*. Thus, our model of local motion detection appears to contain aspects of both gradient and second-order methods, depending on the exact experimental conditions (for a further discussion of this issue, see Koch *et al*. 1989).

*V*. The state of these neurons - coding for the final (global) optical flow field - is evaluated by minimizing a reformulated version of the functional in equation 1. The first term expresses the fact that the final velocity field should be compatible with the initial data, i.e. with the local velocity component measured along the spatial gradient (‘velocity constraint line’)- In other words, the velocity at location should be compatible with the local motion term

*U:*where cos(

*k*’–

*k*) represents the cosine of the angle between Θ

_{k}

**‘**and Θ

_{k}, and

*E(i*,

*j*,

*k)*is the output of an orientation-selective neuron raised to the mth power. This term ensures that the local motion components

*U(i*,

*j*,

*k)*only have an influence when there is an appropriately oriented local pattern; in other words,

*E*

^{m}prevents velocity terms incompatible with the measured data from contributing significantly to

*L*

_{0}. Thus, we require that the neurons

*E(i*,

*j*,

*k)*do not respond significantly to directions differing from Θ

_{k}. If they do,

*L*

_{0}will increasingly contain contributions from other, undesirable, data terms. A large exponent

*m*is advantageous on computational grounds, since it will lead to a better selection of the velocity constraint line. For our model neurons (with a half-width tuning of approximately 60°),

*m = 2*gave satisfactory responses. Equation 7 directly corresponds to the first term in the variational functional of Horn & Schunck (1981), equation 1.

*ẋ*and

*xẏ*by their components in terms of

*V(i*,

*j*,

*k)*[for instance, the

*x*component of the vector V(i,j) is given by Σ

_{k}V(

*i*,

*j*,

*k*)cosΘ

_{k}]. This leads to: We are now searching for the neuronal activity level

*V(i*,

*j*,

*k)*that minimizes the functional

*L*

_{0}

*+ λ L*

_{1}. Similar to the original Horn & Schunck’s functional, equation 1, the reformulated variational functional is quadratic in

*V(i*,

*j*,

*k)*, so we can find this state by evolving

*V(i*,

*j*,

*k)*on the basis of the steepest descent rule: The contribution from the L

_{0}term to the right-hand side of this equation has the form: while the contribution from the

*L*

_{1}term has the form: The terms in equations 10 and 11 are all linear in either

*U*or

*V*. This enables us to view them as the linear synaptic contributions of the

*U*and

*V*neurons towards the activity of neuron

*V(i*,

*j*,

*k)*. The left-hand term of equation 9 can be interpreted as a capacitative term, governing the dynamics of our model neurons. In other words, in evaluating the new activity state of neuron

*V(i*,

*j*,

*k)*, we evaluate equations 10 and 11 by summing all the contributions from

*V*and

*U*of the same location

*i*,

*j*as well as neighbouring

*V*neurons and subsequently using a simple numerical integration routine to compute the new state at time

*t*+Δ t. The appropriate network carrying out these operations is shown schematically in Fig. 1A.

This neuronal implementation converges to the solution of the Horn & Schunck algorithm as long as the correct constraint line is chosen in equation 7, that is as long as the *E*^{m} term is selective enough to suppress velocity terms incompatible with the measured data. In the next two sections, we will illustrate the behavior of this algorithm by replicating a number of perceptual and electrophysiological experiments.

## Correspondence to cortical anatomy and physiology

The neuronal network we propose to compute optical flow (Fig. 1) maps directly onto the primate visual system. Two major visual pathways, the parvo- and the magnocellular, originate in the retina and are perpetuated into higher visual cortical areas. Magnocellular cells appear to be the ones specialized to process motion information (for reviews, see Livingstone & Hubel, 1988;,De Yoe & van Essen, 1988), since they respond faster and more transiently and are more sensitive to low-contrast stimuli than parvocellular cells. Parvocellular neurons, in contrast, are selective for form and color.

We do *not* identify our S and *T* channels with either the parvo- or the magnopathway since this is not crucial to our model. Furthermore, reversibly blocking either the magno- or the parvocellular input to cells in the primary cortex leads to a degradation but not to the abolition of orientation- and direction-selectivity (Malpelli *et al*. 1981). Different from our model, cortical cells therefore appear to compute the local estimate of motion in either of the two pathways. Our current model does require that one set of cells signals edge information while a second population is sensitive to temporal changes in intensity (motion or flicker). We approximate the spatial receptive field of our retinal neurons using the Laplacian- of-Gaussian operator and the temporal properties of our transient pathway by the first derivative. Thus, the response of our *U* neurons increases linearly with increasing velocity of the stimulus. This is, of course, an oversimplification and more realistic filter functions should be used (see above).

Both the parvo- and the magnocellular pathways project into layer 4C of the primary visual cortex. Here the two pathways diverge, magnocellular neurons projecting to layer 4B (Lund *et al*. 1976). Cells in this layer are orientation- as well as direction-selective (Dow, 1974). Layer 4B cells project heavily to a small but well-defined visual area in the superior temporal sulcus called the middle temporal area (MT; Allman & Kass, 1971; Baker *et al*. 1981; Maunsell & van Essen, 1983a). All cells in MT are direction-selective and tuned for the speed of the stimulus; the majority of cells are also orientation-selective. Moreover, irreversible chemical lesions in MT cause striking elevations in psychophysically measured motion thresholds, but have no effect on contrast thresholds (Newsome & Pare, 1988). These findings all support the thesis that area MT is at least partially responsible for mediating motion perception. We assume that the orientation- and direction-selective *E* and *U* cells corresponding to the first stage of our motion algorithms are located in layers 4B or 4C in the primary visual cortex or possibly in the input layers of area MT, while the *V* cells are located in the deeper layers of area MT. Inspection of the tuning curve of a *V* model cell in response to a moving bar reveals its similarity with the superimposed experimentally measured tuning curve of a typical MT cell of the owl monkey (Fig. 2).

The structure of our network is indicated schematically in Fig. 1A. The strengths of synapses between the *U* and the *V* neurons and among the *V* neurons are directly given by the appropriate coefficients in equations 10 and 11. Equation 10 contains the contribution from *U* and *E* neurons in the primary visual cortex as well as from MT neurons *V* at the same location *i*,*j* but with differently oriented receptive fields *k’*. No spatial convergence or divergence occurs between our *U* and *V* modules, although this could be included. The first part of equation 10 gives the synaptic strength of the *U* to *V* projection *[cos(k– k’)E*^{m}*(i*,*j*,*k’)U(i*,*j*,*k’)]:*. if the preferred direction of motion of the presynaptic input *U(i*,*j*,*k’)* differs by no more than ±90° from the preferred direction of the postsynaptic neuron V(*i*,*j*,*k*), the *U→ V* projection will depolarize the postsynaptic membrane. Otherwise, it will act in a hyperpolarizing manner, since the *cos(k– k’)* term will be negative. Notice that our theory predicts neurons from all cortical orientation columns *k’* (which could be located in either VI or in the superficial layers of MT) projecting onto the *V*cells, a proposal which could be addressed using anatomical labeling techniques.

The synaptic interaction contains a multiplicative nonlinearity (*U*· E^{m}). This veto term can be implemented using a number of different biophysical mechanisms, for instance ‘silent’ or ‘shunting’ inhibition (Koch *et al*. 1982). The smoothness term *L*_{1} results in synaptic connections among the *V* neurons, both among cells with overlapping receptive fields (same value of *i*,*j)* and among cells with adjacent receptive fields (e.g. *i*– l,*j*). The synaptic strength of these connections acts in either a de- or a hyperpolarizing manner, depending on the sign of cos(k– *k’)* as well as on their relative locations (see equation 11).

We will next discuss an elegant psychophysical experiment, strongly supporting a two-stage model of motion computation (Adelson & Movshon, 1982; Welch, 1989). Moreover, since MT cells in primates, but not cells in VI, appear to mimic the behavioral response of humans to the psychophysical stimulus, such experiments can be used as probes to dissect the different stages in the processing of perceptual information.

If two identical sine or square gratings are moved at an angle past each other, human observers perceive the resulting pattern as a coherent plaid, moving in a direction different from the motion of the two individual gratings. The direction of the resultant plaid pattern (‘pattern velocity’) is given by the ‘velocity space combination rule’ and can be computed from knowledge of the local ‘component velocities’ of the two gratings (Adelson & Movshon, 1982; Hildreth, 1984). One such experiment is illustrated in Fig. 3. A vertical square grating is moved horizontally at right angles over a second horizontal square grating of the same contrast and moving at the same speed vertically. The resulting plaid pattern is seen to move coherently to the lower right-hand corner (Adelson & Movshon, 1982), as does the output of our algorithm. Note that the smoothest optical flow field compatible with the two local motion components (one from each grating) is identical to the solution of the velocity space combination rule. In fact, for rigid planar motion, as occurs in these experiments, this rule as well as the smoothness constraint lead to identical solutions, even when the velocities of the gratings differ (illustrated in Fig. 4A,B). Notice that the velocity of the coherent pattern is *not* simply the vector sum of the component velocity (which would predict motion towards the lower right-hand corner in the case illustrated in Fig. 4A,B).

If the contrast of both gratings is different, the component velocities are weighted according to their relative contrast. As long as the contrasts of the two gratings differ by no more than approximately one order of magnitude, observers still report coherent motion, but with the final pattern velocity biased towards the direction of motion of the grating with the higher contrast (Stone *et al*. 1988). Since our model incoporates such a contrast-dependent weighting factor (in the form of equation 5), it qualitatively agrees with the psychophysical data (Fig. 4C,D).

Movshon *et al*. (1985) repeated Adelson & Movshon’s plaid experiments while recording from neurons in the striate and extrastriate macaque cortex (see also Albright, 1984). All neurons in VI and about 60 % of cells in MT only responded to the motion of the two individual gratings (component selectivity; Movshon *et al*. 1985), similar to our *U(i*,*j*,*k)* cell population, while about 30 % of all recorded MT cells responded to the motion of the coherently moving plaid pattern (pattern selectivity), mimicking human perception. As illustrated in Fig. 3, our *V* cells behave in this manner and can be identified with this subpopulation.

An interesting distinction arises between direction-selective cells in V1 and those in MT. While the optimal orientation in VI cells is always perpendicular to their optimal direction, this is only true for about 60% of MT cells (type I cells;

Albright, 1984; Rodman & Albright, 1989). 30% of MT cells respond strongly to flashed bars oriented parallel to their preferred direction of motion (type II cells). These cells also respond best to the pattern motion in the Movshon *et al*. (1985) slaid experiments. Based on this identification, our model predicts that type II cells should respond to an extended bar (or grating) moving parallel to its edge. Even though, in this case, no motion information is available if only the classical receptive field of the MT cell is considered, motion information from the trailing and leading edges will propagate along the entire bar. Thus, neurons whose receptive fields are located away from the edges will eventually (i.e. after several tens of milliseconds) signal motion in the correct direction, even though the direction of motion is parallel to the local orientation. This neurophysiological prediction is illustrated in Fig. 8A,B.

Cells in area MT respond well not only to motion of a bar or grating but also to a moving random dot pattern (Albright, 1984; Allman *et al*. 1985), a stimulus containing no edges or intensity discontinuities. Our algorithm responds well to random-dot motion, as long as the spatial displacement between two consecutive frames is not too large (Fig. 5).

The ‘smooth’ optical flow algorithms we are discussing only derive the exact velocity field if a rigid, Lambertian object moves parallel to the image plane. If an object rotates or moves in depth, the derived optical flow only approximates the underlying velocity field (Verri & Poggio, 1987). Is this constraint reflected in VI and MT cells? No cells selective for true motion in depth have been reported in primate VI or MT. Cells in MT do encode information about position in depth, i.e. whether an object is near or far, but not about motion in depth, i.e. whether an object is approaching or receding (Maunsell & van Essen, 1983b). The absence of cells responding to motion in depth in the primate (but not in the cat; see Cynader & Regan, 1982) supports the thesis that area MT is involved in extracting optical flow using a smoothness constraint, an approach which breaks down for threedimensional motion. Cells selective for expanding or contracting patterns, caused by motion in depth, or to rotations of patterns within the frontoparallel plane, were first reported by Saito *et al*. (1986) in a cortical area surrounding MT, termed the medial superior temporal area (MST). We illustrate the response of our network to a looming stimuli in Fig. 6. As emphasized previously, our algorithm computes the qualitatively correct flow field even in this case when the principal constraint underlying our analysis, d*I*/d*t*= 0, is violated. Since MST receives heavy fiber projections from MT (Maunsell & van Essen, 1983*c*), it is likely that motion in depth is extracted on the basis of the two-dimensional optical flow computed in the previous stage.

## Psychophysics

We now consider the response of the model to a number of stimuli which generate strong psychophysical percepts. We have already discussed the plaid experiments (previous section), in which our smoothness constraint leads to the correct, perceived interpretation of coherent motion.

In ‘motion capture’ (Ramachandran & Anstis, 1983*a*), the motion of randomly moving dots can be influenced by the motion of a superimposed low-spatial-frequency grating such that the dots move coherently with the larger contour, that is they are ‘captured’. As the spatial frequency of the grating increases, the capture effect becomes weaker (Ramachandran & Inada, 1985). As first demonstrated by Bülthoff *et al*. (1989), algorithms that exploit local uniformity or smoothness of the optical flow can explain, at least qualitatively, this optical illusion, since the smoothness constraint tends to average out the motion of the random dots in favor of the motion of the neighboring contours (see also Yuille & Grzywacz, 1988). The response of our network - slightly modified to be able to perceive the low frequency grating - is illustrated in Fig. 7C,D. However, in order to explain the non-intuitive finding that the capture effect becomes weaker for high-frequency gratings, a version of our algorithm which works at multiple spatial scales is required.

Yuille & Grzywacz (1988) have shown how the related phenomenon of ‘motion coherence’ (in which a cloud of ‘randomly’ moving dots is perceived to move in the direction defined by the mean of the motion distribution; Williams & Sekuler, 1984) can be accounted for using a specific smoothness constraint. Our algorithm also reproduces this visual illusion quite well (Fig. 7A,B). In fact, it is surprising how often the Gestalt psychologists use the words ‘smooth’ and ‘simple’ when describing the perceptual organization of objects (for instance in the formulation of the key law of *Prägnanz* ; Kofka, 1935; Köhler, 1969). Thus, one could argue that these psychologists intuitively captured some of the constraints used in today’s computer vision algorithms.

Smoothing, that is that the flow field at one location influences motion at a different location, will not occur instantaneously. The differential equation implemented by our network (equations 9–11) can be considered to be a spatial discretized version of a parabolic differential equation, a family of partial differential equations whose members include the diffusion and the heat equation. We thus expect the time it takes to travel a certain distance to be proportional to the square of this distance. There exists some psychophysical support for this notion. Neighboring flashed dots can impair the speed discrimination of a pair of briefly flashed dots in an apparent motion experiment (Bowne & McKee, 1989). This ‘motion interference’ is time-selective, such that the optimal time of occurrence for the stimuli to interfere with the task increases with increasing distance between the two.

Our algorithm is able to mimic another illusion of the Gestalt psychologists: y motion (Lindemann, 1922; Kofka, 1931). A figure which is exposed for a short time appears with a motion of expansion and disappears with a motion of contraction, independent of the sign of contrast. Our algorithm responds in a similar manner to a flashed disk (Wang *et al*. 1989). A similar phenomenon has previously been reported for both fly and man (Bülthoff & Götz, 1979). This illusion arises from the initial velocity measurement stage and does not rely on the smoothness constraint.

Our model so far does not take into account temporal integration of velocity information over more than two frames [all simulations were always carried out with only two frames: *I(x*,*y*,*t)* and *I(x*,*y*,*t* + Δr)]. This is an obvious oversimplification. From careful psychophysical measurements we know that optimal velocity discrimination requires about 80–100 ms (McKee & Welch, 1985). Furthermore, a number of experiments argue for a ‘temporal recruitment’ (P. J. Snowden & O. J. Braddick, personal communication) or ‘motion inertia’ (Ramachandran & Anstis, 1983*b*) effect, such that the previously perceived velocity or direction of velocity influences the currently perceived velocity. Such a phenomenon could be reproduced by including into the variational functional of equation 1 a term which smooths over time, such as dV/d*t*.

## Motion transparency

An interesting visual phenomenon is ‘motion transparency’, in which two objects appear to move past or over each other; i.e. at least one object appears to be transparent. For instance, if the two gratings in the Adelson & Movshon (1982) experiment (Fig. 3) differ by an order of magnitude in visual contrast, i.e. one grating having a strong and the other a weak contrast, or if the two gratings differ significantly in spatial frequency, they tend not to be perceived as moving coherently. Perceptually, observers report seeing two gratings sliding past or over each other. The significant fact is that in these cases, more than one unique velocity is associated with a location in visual space.

Welch & Bourne (1989) propose that motion transparency could be decided at the level of the striate cortex by neurons that compare the local contrast and temporal frequency content of the moving stimuli. If either of these two quantities differ substantially - probably caused by two distinct objects - a decision not to cohere would be made. We could then assume within our framework that this decision - occurring somewhere prior to our smoothing stage - prevents smoothing from occurring by blocking the appropriate connections among the *V* cells with spatially distinct receptive fields. This could be accomplished by setting the synaptic connection strength to zero either *via* conventional synaptic inhibition or *via* the release of a neurotransmitter or neuropeptide acting over relatively large cortical areas. The notion that motion transparency prevents smoothing among the *V* cells presupposes that the perceptual apparatus now has access to the individual motion components *V(i*,*j*,*k)*, instead of to the vector sum V(i,j) of equation 2; only this assumption can explain the perception of two or more velocity vectors at any one location. Simple electrophysiological experiments could provide proof for or against our conjecture. For instance, it would be very intriguing to know how the pattern-selective cells of Movshon *et al*. (1985) in area MT respond to the two moving gratings of Adelson & Movshon (1982; see Figs 3 and 4). We know that if the gratings cohere, the cells respond to the motion of the plaid. How would these cells respond, however, if the two gratings do not cohere and motion transparency is perceived by the human observer?

## Motion discontinuities

The major drawback of this and all other motion algorithms is the degree of smoothness required, smearing out any discontinuities in the flow field, such as those arising along occluding objects or along a figure-ground boundary. A powerful idea to deal with this problem was proposed by Geman & Geman (1984; see also Blake & Zisserman, 1987), who introduced the concept of binary line processes which explicitly code for the presence of discontinuities. We adopted the same approach for discontinuities in the optical flow by introducing binary horizontal (*l*^{h}) and vertical (*l*^{v}) fine processes representing discontinuities in the optical flow (as first proposed in Koch *et al*. 1986). If the spatial gradient of the optical flow between two neighboring points is larger than some threshold, the flow field is ‘broken’ and the appropriate motion discontinuity at that location is switched on (*l* = 1), and no smoothing is carried out. If little spatial variation exists, the discontinuity is switched off (*l* = 0). This approach can be justified rigorously using Bayesian estimation and Markov random fields (Geman & Geman, 1984). In our deterministic approximation to their stochastic search technique, a modified version of the variational functional in equation 1 must be minimized (Hutchinson *et al*. 1988). This functional is, different from before, nonquadratic or non-convex, that is it can have many local minima. Domainindependent constraints about motion discontinuities, such as that they occur in general along extended contours and that they usually coincide with intensity discontinuities (edges), are incorporated into this approach (Geman & Geman, 1984; Poggio *et al*. 1988). As before, some of these constraints may be violated under laboratory conditions (such as when a homogeneous black figure moves over an equally homogeneous black background and the motion discontinuities between the figure and the ground do not coincide with the edges, since there are no edges) and the algorithm computes an optical flow field different from the underlying two-dimensional velocity field (in this case, the computed optical flow field is zero everywhere). However, for most natural scenes, these motion discontinuities lead to a dramatically improved performance of the motion algorithm (see Hutchinson *et al*. 1988).

We have not yet implemented motion discontinuities into the neuronal model. It is known, however, that the visual system uses motion to segment different parts of the scene. Several authors have studied the conditions under which discontinuities (in either speed or direction) in motion fields can be detected (Baker & Braddick, 1982; van Doom & Koenderink, 1983; Hildreth, 1984). Van Doom & Koenderink (1983) concluded that perception of motion boundaries requires that the magnitude of the velocity difference be larger than some critical value, a finding in agreement with the notion of processes that explicitly code for motion boundaries. Recently, Nakayama & Silverman (1988) studied the spatial interaction of motion among moving and stationary waveforms. A number of their results could be re-interpreted in terms of our motion discontinuities.

What about the possible cellular correlate of line processes? Allman *et al*. (1985) first described cells in area MT in the owl monkey whose ‘true’ receptive field extended well beyond the classical receptive field, as mapped with bar or spot stimuli (see Tanaka *et al*. 1986, for such cells in macaque MT). About 40–50 % of all MT cells have an antagonistic direction-selective surround, such that the response of the cell to motion of a random dot display or an edge within the center of the receptive field can be modified by moving a stimulus within the surrounding region that is 50–100 times the area of the center. The response depends on the difference in speed and direction of motion between the center and the surround, and is maximal if the surround moves at the same speed as the stimulus in the center but in the opposite direction. In brief, these cells become activated if a motion discontinuity exists within their receptive field. In cats, similar cells appear at the level of areas 17 and 18 (Orban & Gulyás, 1988). These authors have speculated as to the existence of two separate cortical systems, one for detecting and computing continuous variables, such as depth or motion, and one for detecting and handling boundaries. Thus, tantalizing hints exist as to the possible neuronal basis of motion discontinuities.

## Conclusion

The principal contribution of this article is to show how a well-known algorithm for computing optical flow, based on minimizing a quadratic functional *via* a Relaxation scheme, can be mapped onto the visual system of primates. The underlying neuronal network uses a population-coding scheme and is very robust in the face of hardware errors such as missing connections (Fig. 8). While the details of our algorithm are bound to be incorrect, it does explain qualitatively a number of perceptual phenomena and illusions, as well as electrophysiological experiments, on the basis of a single unifying principle: the final optical flow should be as smooth as possible. We are much less satisfied with our formulation of the initial, local stage of motion computation, because the detailed properties of direction-selective cortical cells in cat and primates do not agree with those of our *U* cells. The challenge here is to bring the biophysics of such motion-detecting cell into agreement with the well-explored phenomenological theories of psychophysics and computational vision (Grzywacz & Koch, 1988; Suarez & Koch, 1989).

The performance of our motion algorithm implemented *via* resistive grids (Hutchinson *et al*. 1988) is substantially improved following the introduction of processes which explicitly label for the existence of motion discontinuities, across which no smoothing should occur. It would be surprising if the nervous system has not made use of such an idea.

## References

*J. opt. Soc. Am*. A

*Nature, Lond*

*J. Neurophysiol*

*(Aotus trivirgatus)*

*Brain Res*

*Perception*

*Vision Res*

*(Aotus trivirgatus):*a quantitative comparison of medial, dorsomedial, dorsolateral and middle temporal areas

*J. Neurophysiol*

*Visual Reconstruction*

*J. opt. Soc. Am*. A

*Nature, Lond*

*Nature, Lond*.

*Vision Res*

*Phil. Trans. R. Soc. Ser*. B

*Vision Res*

*Trends Neurosci*

*J. Neurophysiol*

*J. Physiol., Lond*

*Comput. Graph. Image Proc*

*IEEE Trans. Pattern anal. Machine Intell*

*From Images to Surfaces*

*Synapse*

*Lectures on the Cauchy Problem in linear Partial Differential Equations*

*Chlorophanus*

*Z. Naturforsch*

*A. Rev. Neurosci*

*Artif. Intell*

*J. Physiol., Lond*

*IEEE Computer*

*IEEE Trans. Pattern anal. Machine Intell*

*Proc. natn. Acad. Sci. U.S.A*

*Phil. Trans. R. Soc. Ser*. B

*Proceedings of the IEEE Workshop on Visual Motion*

*Handbuch der Normalen und Pathologischen Physiologie*

*The Task of Gestalt Psychology*

*Nature, Lond*

*Comput. Graph. Image Proc*

*Psych. Forsch*

*Science*

*J. comp. Neurol*

*Proceedings of the IEEE Conference on Neural Information Processing Systems*

*J. opt. Soc. Am*. A

*J. Neurophysiol*

*Proc. R. Soc. Ser*. B

*Science*

*Proc. R. Soc*. B

*J. Neurophysiol*

*J. Neurophysiol*

*J. Neurosci*

*Expl Brain Res*., Suppl. II,

*Pattern Recognition Mechanisms*

*Vision Res*

*Vision Res*

*J. Neurosci*

*Neural Computers*

*Science*

*Proc. R. Soc*. B

*Kybernetik*

*Nature, Lond*

*Vision Res*

*Vision Res*

*Spatial Vision*

*Naturwissenschaften*

*Expl Brain Res*

*J. Neurosci*

*Neurosci. Abstr*

*Neural Computation*

*J. Neurosci*

*The Interpretation of Visual Motion*

*IEEE Computer*

*Biol. Cybernetics*

*Vision Res*

*J. opt. Soc. Am*. A

*J. Physiol., Lond*

*Artif. Intell. Lab. Memo No*

*Neural Computation*

*J. opt. Soc. Am*. A

*Nature, Lond*

*Ass. Res. Vision Ophthalmol*

*Vision Res*

*Nature, Lond*