# 3.1: Natural Photo-sensory Systems

Biological sensory systems perform energy-efficient and computationally elegant algorithms to accomplish tasks like those required of certain engineering applications. Animals and some engineered systems have the capacity for limited movement within the natural environment in response to sensory stimuli. For example, consider a frontend seeker on a missile designed to autonomously seek and hit a specified target. The missile needs to be guided to a target seen by a seeker with background sensory noise; this requirement is like that of a dragonfly searching and acquiring smaller flying insects. Tasks common to both systems include navigating and guiding the system within the natural environment, detecting, identifying, and tracking objects identified as targets, efficiently guiding the system to the targets, and then intercepting these targets.

This part is about photo-sensory systems, or vision, which involves the conversion of photonic energy into electronic signals. These signals are subsequently processed to extract pertinent information. The primary emphasis will be on computational models of vision based on the primate vision system, since much study has been made in this area. We begin with some vision principles common across many species within the animal kingdom. Then the structure and function of natural vision systems are investigated, with emphasis on information processing first within invertebrates (specifically arthropods) and then within vertebrates (specifically primates). Engineering application examples that leverage natural vision concepts follow.

## 3.1 Natural Photo-sensory Systems

Sensory systems can be passive or active. Passive means the sensor observes natural stimuli that are available within the environment, while active means the sensor sends stimuli out and observes the response from the environment. Physical sensors in the animal kingdom include photo-sensory systems, such as passive vision systems processing photons; mechano-sensory systems, such as passive sonar (audition), active sonar (bats, dolphins, whales), passive compression (touch), and active compression (insect antennas); and chemo-sensory systems, such as gustation (taste) and olfaction (smell). This chapter will focus on passive photo-sensory vision systems.

### 3.1.1 Common principles among natural photo-sensory systems

A photon is the wave-particle unit of light with energy $E = h\nu$, where $h$ is Planck’s constant and $\nu$ is the electromagnetic frequency. The energy per time (or space) is modeled as a wavelet since it satisfies the general definition of having a beginning and ending and unique frequency content. Information contained in the frequency and flux of photons is photonic information, which gets converted into electronic information coded in the graded (or analog) neural ionic voltage potentials or in the frequencies of action potentials.
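As a quick worked example of the photon energy relation $E = h\nu$, the sketch below computes the energy of single photons at a few visible wavelengths, using $\nu = c/\lambda$. The wavelengths and the function name `photon_energy_joules` are illustrative assumptions, not values taken from the text.

```python
# Physical constants (exact SI values since the 2019 redefinition).
PLANCK_H = 6.62607015e-34         # Planck's constant, J*s
LIGHT_C = 299_792_458.0           # speed of light in vacuum, m/s
ELECTRON_VOLT = 1.602176634e-19   # joules per electron-volt


def photon_energy_joules(wavelength_nm: float) -> float:
    """Energy of one photon, E = h * nu, with nu = c / wavelength."""
    frequency_hz = LIGHT_C / (wavelength_nm * 1e-9)
    return PLANCK_H * frequency_hz


if __name__ == "__main__":
    # Illustrative wavelengths (assumed for this example only).
    for label, nm in [("blue", 450.0), ("green", 550.0), ("red", 650.0)]:
        energy = photon_energy_joules(nm)
        print(f"{label:5s} {nm:5.0f} nm  {energy:.3e} J  {energy / ELECTRON_VOLT:.2f} eV")
```

Shorter wavelengths correspond to higher frequencies and therefore to more energetic photons, which is why a blue photon carries more energy than a red one.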

Biological systems can be divided into vertebrates, such as mammals and reptiles, and invertebrates, such as insects. Animals collect and process information from the environment for the determination of subsequent action. The many varied species and associated sensory systems in existence reflect the wide range of environmental information available as well as the wide range of biological task objectives.

#### Commonality of Photo-reception and Chemo-reception

Photo-reception is made possible by the organic chemistry of photopigments, which initiate the visual process by capturing photons of light. Photopigments are composed of a form of Vitamin A called retinal and a large protein molecule called opsin. Opsins belong to a large family of proteins that includes olfactory (sense of smell) receptor proteins. Odorant and tastant molecules attach themselves to a special membrane receptor, causing a sequence of molecular reactions that eventually results in neuronal signaling. Photopigment molecules are like these chemo-sensory membrane receptors with retinal serving as the odorant or tastant already attached. The incoming photon of light gives the molecule enough energy to initiate a chain reaction like that in chemo-sensory reception when an odorant or tastant molecule comes in contact with the receptor. As a result, the photo-reception process is really a simplified form of the chemo-reception process. A photo-sensory (or visual) system begins by converting the photonic stimulus into a chemical stimulus (photopigments), and the remaining information processing of the visual system is that of a chemo-sensory system.

#### Curvature and Reflection

The two primary eye designs are the vesicular (containing a cavity) eye found in vertebrates and certain mollusks and the compound eye found in arthropods. Figure 3.1-1 shows the concave nature of the vesicular eye and the convex nature of the compound eye. Images in biological systems are formed on a curved sheet of photoreceptors, called the retina. In a similar way, cameras form images on a sheet of photographic film, where the film is flat instead of curved. The ancient sea-going mollusk Nautilus has the concave retina structure with a pinhole aperture, which creates an inverted image with no magnification. Most concave retinas (vertebrates, etc.) depend on the refraction of light through a larger aperture. The lens serves this purpose. A larger aperture is needed to allow more photonic flux to enter the reception area to ensure sufficient energy is available to stimulate photoreceptors, and refraction through the lens and eyeball fluid (vitreous humor) serves to compensate for the otherwise blurred view of the environment as the aperture is increased.

Physical properties of reflection are also used in the eye designs of scallops and certain fish and mammals. Some of the purposes of designs based on reflection are not known (scallops), but other vision system designs exploit reflection in nocturnal (night-time low-light level) conditions. For example, night-hunting by certain mammals is aided by the fact that a photon of light has roughly twice the chance of being captured by a photoreceptor, since light that is not absorbed on the first pass is reflected and passes through the same photoreceptor a second time. A special reflective tissue (tapetum lucidum) behind the retina gives this advantage in nocturnal conditions. This reflection can be observed when shining a light (flashlight, headlight) toward the animal while it is looking back.

The photoreceptors are typically long and cylindrical cells containing photopigments arranged in many flat disc-shaped layers. This design gives a small angular reception area, leading to sufficient spatial acuity, while providing many opportunities for the incoming photon to be captured by the photopigment.

#### Optical Imperfections

There are several imperfections that are dealt with in natural vision systems. Some of these include spherical aberration, chromatic aberration, and diffraction. Natural vision system parameters typically represent an optimal balance of the effects of these imperfections. Spherical aberration is caused by light coming into focus at a shorter distance when coming through the periphery of the lens than when coming through the center. Chromatic aberration is caused by the dependency of the index of refraction on wavelength: the shorter the wavelength, the greater the amount of refraction. This means that if the blue part of the image is in focus, then the red part of the image is slightly out of focus. The optical properties of the available biological material do not allow for perfect compensation of these effects. For example, correcting for spherical aberration requires a constant decrease in the cornea index of refraction with distance from the center. Since the molecular structure of the cornea is constant, this is not possible. The general shape of the primate eye, however, is slightly aspherical, which minimizes the effects of spherical aberration. As the primate eye changes shape with age, these aberrations are corrected by external lenses (eyeglasses).

The third imperfection is caused by diffraction. Diffraction is a wave-optics phenomenon resulting from the edge effects of the aperture. When combined with spherical and chromatic aberration, the result is a spatial frequency limit on the image that can be mapped onto the retina. This limit is typified by the smallest angular separation at which two separate point sources can be resolved, called angular acuity. Spatial acuity refers to the highest spatial frequency that can be processed by the vision system. The spacing between photoreceptors in highly evolved species is typically the distance represented by the angular acuity. Any further reduction in spacing is not practical as there would be no advantage concerning image information content.

Another consideration is contrast sensitivity, which is how sensitive two separate photoreceptors are to varying levels of photon flux intensity. In biological systems, the information forwarded is frequently a difference in contrast between two adjacent photoreceptors. If the photoreceptors are very close, then the difference will never be great enough to show a relative contrast since edges in the image are already blurred due to the aforementioned imperfections. The photoreceptor spacing in the retina is on the order of the Nyquist spatial sampling interval for frequencies limited by these imperfections. In the adult human retina, this turns out to be about 120 million photoreceptors: about 100 million rods, which are very sensitive and used in nocturnal conditions, and about 20 million cones, which come in three types and provide color information in daylight conditions.

#### Visual Information Pathways

Receptive fields for the various sensory systems are mapped to specific surface regions of neuronal tissue (such as retina, brain, and other neuronal surfaces). Due to the connectivity, several pathways are usually observed. For example, one photoreceptor may be represented in several neurons that are transmitting photonic information to the brain. One neuron may represent the contrast between that particular photoreceptor and the most adjacent ones. This would be an example of a parvocellular pathway neuron (parvo means small). Another neuron may represent the contrast between an average of that photoreceptor and the most adjacent ones, and an average of a larger region centering on that photoreceptor. This would be an example of a magnocellular pathway neuron (magno means large). As it turns out, the names come from the relative physical size of these neurons, and they happen to also correspond to the size of the receptive field they represent. Parvocellular and magnocellular pathways are common among many species, including humans (and other mammals) and certain arthropods.

#### Connectivity and Acuity

There is a balance between temporal acuity, which is the ability to detect slight changes in photonic flux in time, and spatial acuity, which is the ability to detect slight changes between two adjacent objects whose images are spatially separated on the retina. As receptors are more highly interconnected, there is better temporal acuity due to the better photon-integrating ability of the aggregate. Receptors that are not highly interconnected exhibit better spatial acuity.

To illustrate this concept, consider a steady photonic flux represented by 1 photon per 10 photoreceptors per unit of time. On average, each photoreceptor would receive 1 photon every 10 units of time. If this incoming photon rate changed to 2 photons per 10 photoreceptors, then the output of a single photoreceptor would have to be monitored for a duration of tens of units of time to detect an average increase in photon flux. If an aggregate of 100 photoreceptor cells were integrated, and if the photonic flux were uniformly distributed, then the total output would jump from 10 photons to 20 photons, which might be noticeable at the very next unit of time. The result is that the animal will be able to detect slight changes in photonic flux much better if the cells are highly connected, while the ability to distinguish between two adjacent small objects would deteriorate. Thus, a higher connectivity results in sharp temporal acuity at the cost of spatial acuity.
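The arithmetic in this illustration can be sketched in a few lines. The code below assumes Poisson photon arrivals and treats a change as detectable once the shift in the mean count reaches a chosen number of standard deviations of the baseline count. Both the Poisson model and the threshold of 3 are assumptions made for this example, as are the function names.

```python
import math


def expected_photons(rate_per_receptor: float, n_receptors: int, n_time_units: float) -> float:
    """Mean photon count collected by a pool of receptors over a time window."""
    return rate_per_receptor * n_receptors * n_time_units


def time_to_detect(base_rate: float, new_rate: float, n_receptors: int,
                   threshold: float = 3.0) -> int:
    """Smallest whole number of time units needed for a rate change to stand out.

    Assumes Poisson arrivals, so the baseline count has standard deviation
    sqrt(mean). The change counts as detectable once the shift in the mean is
    at least `threshold` standard deviations (a hypothetical criterion).
    """
    delta = abs(new_rate - base_rate)
    units = threshold ** 2 * base_rate / (delta ** 2 * n_receptors)
    return max(1, math.ceil(units))


if __name__ == "__main__":
    base, new = 0.1, 0.2   # photons per receptor per unit time (from the text)
    for pool in (1, 100):
        before = expected_photons(base, pool, 1)
        after = expected_photons(new, pool, 1)
        wait = time_to_detect(base, new, pool)
        print(f"pool of {pool:3d}: {before:.1f} -> {after:.1f} photons per unit time, "
              f"detectable after about {wait} unit(s)")
```

Under these assumptions the single photoreceptor needs tens of time units to register the change, while the pool of 100 registers it within one, at the price of no longer knowing which receptor in the pool was stimulated.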

#### Coarse Coding

Coarse coding is the transformation of raw data using a small number of broadly overlapping filters. These filters may exist in time, space, color, or other information domains. Biological sensory systems tend to use coarse coding to accomplish a high degree of acuity in sensory information domains. For example, in each of the visual system information domains (space, time, and color, or chromatic) we find filters that are typically few in number and relatively coarse (broad) in area covered (or bandwidth): There are essentially only four chromatic detector types, whose spectral absorption responses are shown in Figure 3.1-2, three temporal channels, and three spatial channels. Neurons in the retina receiving information from the photoreceptors are connected in such a way that we can observe these spatial, temporal, and chromatic information channels in the optic nerve.

Coarse coding can take on many different forms, and one coarsely coded feature space may be transformed into another. For example, within the color channels of the vision system we find a transformation from broad-band in each of the three colors at the sensory level to broad-band in color-opponent channels at the intermediate level. Other interesting examples of coarse coding include wind velocities and direction calculation by cricket tail sensors and object velocity calculations with bursting and resting discharge modes of neuronal aggregates in the cat superior colliculus.

The responses of vision system rods and cones must be broad in scope to cover their portion of the data space. For example, in daytime conditions only the three cone types have varying responses. At a minimum, each type must provide some response over one-third of the visible spectrum. In fact, each detector type responds to much more than one-third of the visible spectrum. Since a single response from a given detector can result from one of many combinations of color and intensity, the value by itself gives ambiguous local color and intensity information. If the response curve were very narrow band, then any response would be the result of a particular frequency, and the value of the response would reflect its intensity. However, many of these detectors would be required to achieve the wide range (millions) of colors we can perceive. It is not practical to have each of many narrow-band detectors at each spatial location. The natural design is optimized to allow for many colors to be detected at each location while minimizing the neuronal hardware (or “wet-ware”) requirements.
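A small numerical sketch can make the ambiguity argument concrete. It models three broad, overlapping detectors as Gaussian sensitivity curves and shows that one detector alone cannot separate wavelength from intensity, while the relative responses of all three can. The peak wavelengths, the common bandwidth, and the Gaussian shape are illustrative assumptions, not measured cone data.

```python
import math

# Hypothetical peak wavelengths (nm) and a common bandwidth for three broad,
# overlapping detectors; illustrative values only, not measured cone data.
PEAKS_NM = {"s": 440.0, "m": 540.0, "l": 570.0}
BANDWIDTH_NM = 50.0


def response(detector: str, wavelength_nm: float, intensity: float) -> float:
    """Response of one broad-band detector to monochromatic light."""
    offset = (wavelength_nm - PEAKS_NM[detector]) / BANDWIDTH_NM
    return intensity * math.exp(-0.5 * offset ** 2)


def normalized_code(wavelength_nm: float, intensity: float) -> tuple:
    """Relative responses of the three detectors (the values sum to 1)."""
    raw = [response(d, wavelength_nm, intensity) for d in ("s", "m", "l")]
    total = sum(raw)
    return tuple(r / total for r in raw)


if __name__ == "__main__":
    # Two different stimuli that the "m" detector alone cannot tell apart.
    bright_off_peak = response("m", 590.0, 1.0)
    dim_on_peak = response("m", 540.0, math.exp(-0.5))
    print("m detector alone:", bright_off_peak, dim_on_peak)

    # The pattern across all three detectors separates the two wavelengths
    # and does not depend on intensity.
    print("code at 590 nm:", normalized_code(590.0, 1.0))
    print("code at 540 nm:", normalized_code(540.0, math.exp(-0.5)))
```

The design choice mirrors the text: a few broad filters per location, with the comparison between them carrying the color information.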

### 3.1.2 Arthropod vision system concepts

Although there are millions of species within the animal kingdom, there are relatively few photo-receptor design concepts that have stood the test of time, such as the arthropod compound eye. There are some interesting similarities between the vision systems of insects and primates. For example, both map incoming light onto an array of photoreceptors located in a retina. Both exhibit distinct post-retina neuronal pathways for what appears to be spatial and temporal processing.

Of course, there are some key differences between insect and primate vision systems. Insects have non-movable, fixed-focus optics. They are not able to infer distances by using focus or altering gaze for object convergence. The eyes are much closer together, so that parallax cannot be used to infer distances either. The eyes are much smaller, and the coverage is in almost every direction, so that the overall spatial acuity is much worse than that of primates. As a result, navigation appears to be done more by relative image motion than by any form of object detection and recognition [Srini02].

#### Arthropod Compound Eye

The arthropod compound eye is a convex structure. The compound eye is a collection of individual ommatidia, which are complex light-detecting structures typically made up of a corneal lens, crystalline cone, and a group of photosensitive cells. Each ommatidium forms one piece of the input image so that the full image is formed by the integration of all ommatidia. There are three basic designs for integrating ommatidia into a composite image:

1. Apposition: Each ommatidium maps its signal onto a single photoreceptor.
2. Superposition: Several ommatidia contribute to the input signal for each photoreceptor.
3. Neural superposition: Not only are the photoreceptor inputs a superposition of several ommatidia, but neurons further in the processing chain also receive their inputs from several photoreceptor outputs.

Apposition eyes form relatively precise images of the environment. This design is common among diurnal (daytime) insects. Superposition eyes are common among nocturnal (night-time) and crepuscular (twilight) insects. In conditions of low light levels, the superposition design allows for greater sensitivity since light from several ommatidia is focused onto a single photoreceptor. The greater sensitivity of the superposition eye comes at a cost of spatial acuity since image detail is shared by neighboring pixels. This is an example of “higher connectivity results in sharp temporal acuity at the cost of spatial acuity” explained earlier. The neural superposition eye is found in the dipteran (two-winged) fly. This design allows for further processing to compensate for the loss of spatial acuity, resulting in both good spatial acuity and sensitivity.

The superposition eye has greater sensitivity to changes in photonic flux because of the higher degree of connectivity of the ommatidia to a single photoreceptor. In a similar way, the primate rod system is highly interconnected, which results in a high degree of temporal sensitivity. The primate photoreceptors are divided into rods and cones, named for the shape of the outer photopigment-containing segment. Certain cone cells are also highly interconnected, bringing better sensitivity to temporal changes.

#### Scanning Eyes

A few mollusks and arthropods have developed a scanning mechanism for creating a visual image of the external environment. A narrow strip of photoreceptors is moved back and forth to generate the complete image. Certain sea snails have retinas that are 3 to 6 photoreceptors wide and 400 photoreceptors long. The eye scans 90°, taking about a second to scan up, and about a fourth of a second to return down [Smith00].

Mantis shrimp contain 6 rows of enlarged ommatidia in the central region of the compound eye. The larger ommatidia contain color visual pigments that can be used to further investigate an object of interest by scanning with these central photoreceptors. This allows the shrimp to use any color information in the decision process [Smith00].

Certain jumping spiders contain retinas 5 to 7 photoreceptors wide and 50 photoreceptors long. The spider normally scans from side to side but can rotate the eye to further investigate a particular object of interest. The lateral (additional) eyes on this spider contain highly interconnected photoreceptors for detecting slight rapid movements. Once a movement is detected, the attention of the primary eye can be directed to the newly detected object. This process is analogous to primate vision, where the more peripheral cells are highly connected and the cells of the central area (the fovea, to be discussed later) are more densely packed and not so interconnected. A sharp movement in the periphery causes a primate to rotate the eyes to fixate on the source of the movement. Once fixated, the higher spatial acuity of the central area can be used to discern the spatial detail of the new object of interest [Smith00].

### 3.1.3 Primate vision systems

Early vision can be defined as the processes that recover the properties of object surfaces from 2D intensity arrays. Complete vision would be the process of using early vision information to make some decision. The focus in this section is on vertebrate vision information pathways that begin in the retina and terminate in cortical processing stages. Cortical comes from cortex, which is used to describe the part of the brain where sensory system information is processed. Vision is processed in the primary visual cortex, hearing is processed in the auditory cortex, and touch is processed in the somatosensory cortex. Many of these concepts are also common in insect vision.

Figure 3.1-3 shows the relevant parts of the primate eye. Photonic energy is first refracted by the cornea and further by the lens and the vitreous humor, which fills the optics chamber. The retina covers most of the inner portion of the eye and serves as the first vision processing stage. Approximately 120 million photoreceptors are encoded into about 1 million axons that make up the optic nerve.

Figure 3.1-4 shows the other basic components of the primate vision system. A projection of the 3D environment is mapped onto the 2D sheet of neuronal tissue called the retina. The primate retina is composed of several layers of neurons, including photoreceptor, horizontal, bipolar, amacrine, and ganglion cell layers, to be discussed in more detail later. The information is graded, which basically means analog to electrical engineers, until it reaches the axon (output) of the ganglion cell layer. There, the graded potential signaling is replaced by action potential signaling through the optic nerve. Upon reaching the optic chiasm, the right halves of both retinas (representing the left side of the visual field) are mapped to the right side of the brain, and the left halves of both retinas (right side of the visual field) to the left side of the brain.

The retina, lateral geniculate nucleus (LGN) and the brain are all composed of layers of neurons. Figure 3.1-4 highlights the LGN whose outer 4 layers are the termination of Parvocellular Pathway (PP) optic neurons and inner 2 layers the termination of Magnocellular Pathway (MP) optic neurons. Both PP and MP signals are opponent signals, meaning the signal levels correspond to the contrast between a central receptive field (RF) and a larger surrounding RF which would include responses from neurons not represented by the central RF. Parvo (small) and magno (large) were names given by anatomists who based the names on the size of the cell bodies. Conveniently, it was later learned that the PP corresponds to smaller RFs (central RF could be one cell) and MP to larger RFs (central RF would be a larger aggregate of cells). In both cases the surrounding RF would be larger than the central RF. There is duality in the center-surround contrast signals in that some represent the central signal minus the surround (“ON” signals) while others represent the surround signal minus the central (“OFF” signals).

The PP contains color information, as the cone response of a single central signal will have a different spectral response from the average response of the surrounding neurons. Some earlier researchers used r, g, b to designate the three cone receptors. But since the spectral absorption curves broadly overlap much of the visible spectrum (as shown in Figure 3.1-2), a better notation is l, m, s for long-, medium-, and short-wavelength cone types [DeV88]. We adopt that convention in this book.

#### Spatio-temporal Processing Planes

The retina can be considered a “part” of the brain, as suggested by the subtitle of John Dowling’s book The Retina: An Approachable Part of the Brain [Dowl87]. The retina is a multi-layered region of neuronal tissue lining the interior surface of the eye, as shown in Figure 3.1-3. In the early stages of primate central nervous system (CNS) embryonic development, a single neural tube develops two optic vesicles with optic cups that eventually develop into the retinas for each eye. The physiology (or functioning) of layers of neurons is similar, whether located peripherally in the retina (about 5 layers), in the LGN (about 6 layers), or in the visual cortex (about 10-12 layers). If we can better understand the spatial-temporal-chromatic signal processing that exists in the retina, we will better understand what is also going on in the LGN and the higher processing centers of the visual cortex.

The vision processing mechanics can best be visualized as a series of parallel-processing planes, each representing one of the neuronal layers in the retina or in the brain, as shown in Figure 3.1-5. Parallel incoming photons are received by the outer segments of the photoreceptors, resulting in signals that propagate to the visual cortex in the brain. Each plane of neuronal processing acts upon the image in a serial fashion. However, the processing mechanism cannot be described simply as image filters acting on each separate plane. As the energy is propagated through the neuronal layers, the ionic charge spreads laterally across each processing plane. As a result, the output of each processing plane is a combination of the current and historic inputs of the cells in the path as well as the historical input of the adjacent cells.

To adequately model spatial and temporal effects of the neuronal interconnections, each cell in each neuronal processing plane must consider mediation effects of neighboring cells as well as temporal effects of signal degradation in time. One way to model both effects is to apply a 2D spatial filter to each image plane and follow the filter with a leaky integrator, which allows for temporal ionic equilibrium effects.
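A minimal sketch of this modeling idea follows: a small 2D smoothing kernel stands in for the lateral spread of charge, and a first-order leaky integrator stands in for the temporal decay. The 3x3 kernel, the leak value of 0.8, the border handling, and all function names are assumptions chosen for illustration, not parameters from the text.

```python
def spatial_filter(image, kernel):
    """Apply a small 2D kernel to an image (list of rows), clamping at the borders.

    The kernel used below is symmetric, so correlation and convolution coincide.
    """
    rows, cols = len(image), len(image[0])
    k_half = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in range(-k_half, k_half + 1):
                for dc in range(-k_half, k_half + 1):
                    rr = min(max(r + dr, 0), rows - 1)
                    cc = min(max(c + dc, 0), cols - 1)
                    acc += kernel[dr + k_half][dc + k_half] * image[rr][cc]
            out[r][c] = acc
    return out


def leaky_integrate(state, filtered, leak=0.8):
    """One time step: keep a fraction `leak` of the old state, take the rest from the input."""
    return [[leak * s + (1.0 - leak) * f for s, f in zip(s_row, f_row)]
            for s_row, f_row in zip(state, filtered)]


def run_plane(frames, kernel, leak=0.8):
    """Pass a sequence of frames through one processing plane; return the final state."""
    state = [[0.0] * len(frames[0][0]) for _ in frames[0]]
    for frame in frames:
        state = leaky_integrate(state, spatial_filter(frame, kernel), leak)
    return state


# Hypothetical 3x3 smoothing kernel with unit sum (weights are exact binary fractions).
SMOOTH_3X3 = [[1 / 16, 2 / 16, 1 / 16],
              [2 / 16, 4 / 16, 2 / 16],
              [1 / 16, 2 / 16, 1 / 16]]

if __name__ == "__main__":
    flash = [[0.0] * 5 for _ in range(5)]
    flash[2][2] = 1.0   # a single bright point in an otherwise dark frame
    for row in run_plane([flash], SMOOTH_3X3):
        print(["%.3f" % v for v in row])
```

After one frame, the response to the point of light has already spread to neighboring cells and is attenuated by the integrator; with a sustained input the state approaches the filtered image, which reflects the dependence on both neighboring and historical inputs described above.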

#### Information Encoding

Natural vision systems extract space (spatial), time (temporal), and color (chromatic) information to make some decision. Information is often encoded for transmission, for example, from the retina to the LGN. Figure 3.1-6a shows the basic information blocks in the vision system. Figure 3.1-6b illustrates the overall numerical processing elements in each of the various vision processing stages. There is an approximately 100:1 compression of the retina photoreceptors to the optic nerve signals, but an expansion of 1:1000 optic nerve signals to visual cortex neurons. This expansion is known as optic radiation. Combining the compression and expansion, there is an overall expansion of about 1:10 retinal photoreceptors to visual cortex neurons. As is typical in biology, the compression and expansion are quite non-uniform, as there are about 2 optic nerve neurons per photoreceptor in the retina’s fovea (very central part of vision), but only 1 optic nerve neuron for about every 400 photoreceptors in the peripheral part of the retina. This imbalance is a consequence of the importance of information in the center-of-gaze.
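The ratios quoted here can be checked with a few lines of arithmetic. The sketch below uses the approximate counts given in this section (120 million photoreceptors, 1 million optic nerve axons, and a 1:1000 expansion to the visual cortex); the variable names are assumptions for illustration. It shows that the "100:1" and "1:10" figures are order-of-magnitude roundings.

```python
# Approximate counts quoted in this section.
PHOTORECEPTORS = 120e6        # photoreceptors per retina
OPTIC_NERVE_AXONS = 1e6       # ganglion-cell axons in the optic nerve
CORTEX_PER_AXON = 1000        # 1:1000 expansion from optic nerve to visual cortex

compression = PHOTORECEPTORS / OPTIC_NERVE_AXONS        # photoreceptors per axon
cortex_neurons = OPTIC_NERVE_AXONS * CORTEX_PER_AXON
overall_expansion = cortex_neurons / PHOTORECEPTORS     # cortex neurons per photoreceptor

# Non-uniform sampling: optic nerve neurons per photoreceptor.
FOVEA_AXONS_PER_RECEPTOR = 2.0
PERIPHERY_AXONS_PER_RECEPTOR = 1.0 / 400.0
fovea_advantage = FOVEA_AXONS_PER_RECEPTOR / PERIPHERY_AXONS_PER_RECEPTOR

print(f"compression       : about {compression:.0f}:1")
print(f"overall expansion : about 1:{overall_expansion:.1f}")
print(f"fovea vs periphery: about {fovea_advantage:.0f}x more axons per photoreceptor")
```

The last figure puts a number on the imbalance: per photoreceptor, the fovea is allotted several hundred times more optic nerve capacity than the periphery.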

Natural vision filtering begins with photonic refraction through the cornea and lens (Figure 3.1-3). Figure 3.1-7 depicts the various cell layers within the retina and a gross approximation of the mathematical function performed by each layer on the incoming imagery. The incoming light then passes through the vitreous humor and retinal cell tissue and is focused onto a photoreceptor mosaic surface. The flux within a photoreceptor’s receptive region of the retina is averaged to a single output at the triad synapse (at the root of the photoreceptor). As a result, the information can be visualized as a mosaic, where each piece represents a single photoreceptor’s output.

Photonic energy is converted to electronic charge in the photopigment discs of the photoreceptors (rods and cones). It is believed that the rate of information transfer is proportional to the logarithm of the incoming intensity. The photoreceptors, with the help of a layer of horizontal cells, spread the charge in space and time within a local neighborhood of other receptors. Such charge-spreading can be modeled by spatio-temporal Gaussian filters. Two separate variances (horizontal and vertical) are required for the spatial 2D filter and another for how the signal degrades in time.

The spread charge and original photoreceptor charge, both of which can be modeled as a Gaussian-filtered version of the incoming imagery, are both available at the root of the photoreceptor, at the triad synapse. The bipolar cells connect to triad synapses and presumably activate signals proportional to the difference between the photoreceptor input and the horizontal cell input. Therefore, the bipolar cell output represents the difference-of-Gaussian version of the original image.

Spatial edges are detected by two types of bipolar cells, on-bipolars and off-bipolars, which respond to light and darkness, respectively. The on-bipolar cells respond if the central receptive field exceeds the surrounding receptive field, while the off-bipolar cells respond if the surrounding receptive field exceeds the central receptive field. Temporal edges (rapid changes in photonic flux levels) are detected by on-off and off-on bipolar cells, which respond to quick decrements or increments in photonic flux, respectively. Corresponding ganglion cells (on, off, on-off, and off-on) propagate amacrine-cell-mediated responses to these bipolar cells.
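The four response types can be sketched as rectified differences. The code below is a simplified illustration that assumes half-wave rectification (each cell type carries only the positive part of its difference signal). That rectification model and the function names are assumptions, not a description of the actual cell physiology.

```python
def spatial_responses(center: float, surround: float) -> tuple:
    """Return (on, off) responses for one center-surround receptive field."""
    contrast = center - surround
    on = max(0.0, contrast)     # responds when the center exceeds the surround
    off = max(0.0, -contrast)   # responds when the surround exceeds the center
    return on, off


def temporal_responses(previous_flux: float, current_flux: float) -> tuple:
    """Return (on_off, off_on) responses to a change in flux between two instants."""
    change = current_flux - previous_flux
    on_off = max(0.0, -change)  # responds to a quick decrement (light going off)
    off_on = max(0.0, change)   # responds to a quick increment (light coming on)
    return on_off, off_on


if __name__ == "__main__":
    print("bright center :", spatial_responses(0.8, 0.5))
    print("dark center   :", spatial_responses(0.2, 0.5))
    print("uniform field :", spatial_responses(0.5, 0.5))
    print("flux drops    :", temporal_responses(1.0, 0.4))
    print("flux rises    :", temporal_responses(0.4, 1.0))
```

Splitting each signed difference into two non-negative channels is one way to see why paired on and off cells are useful when a signal, such as a firing rate, cannot go negative.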

The difference signal propagated by the bipolar cells is a consequence of the lateral inhibition caused by the connectivity of photoreceptors and horizontal cells. The horizontal cells connect horizontally to numerous photoreceptors at the triad synapse. Horizontal cells only have dendrites, which for other neurons would typically serve as input channels. The dendrites (inputs) for these cells pass ions in both directions, depending on how the ionic charge is distributed. The net effect is that adjacent photoreceptors have their information partially shared by this mediation activity of the horizontal cells.

Gap junctions between adjacent photoreceptors influence the photoreceptor charge. The response from a photoreceptor aggregate can be modeled as a spatial-temporal Gaussian with a small variance. The input from the neighboring aggregate of horizontal cells can be modeled with a similar Gaussian with a larger variance. The differencing function results in the difference-of-Gaussian (DoG) filter operation, which gives a center-surround antagonistic receptive field profile. DoG functions and the second derivative of a Gaussian, called the Laplacian-of-Gaussian (LoG), have been used to model the bipolar cell output.
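To make the center-surround behavior concrete, the sketch below builds a one-dimensional difference-of-Gaussian kernel from a narrow (center) and a wide (surround) Gaussian and applies it to a uniform field and to a step edge. It is reduced to one spatial dimension for brevity; the standard deviations of 1 and 3 samples, the kernel half-width, and the function names are assumptions for illustration.

```python
import math


def gaussian_kernel(sigma: float, half_width: int) -> list:
    """Sampled 1D Gaussian, normalized to unit sum."""
    weights = [math.exp(-0.5 * (x / sigma) ** 2)
               for x in range(-half_width, half_width + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def dog_kernel(sigma_center: float, sigma_surround: float, half_width: int) -> list:
    """Narrow (center) Gaussian minus wide (surround) Gaussian."""
    center = gaussian_kernel(sigma_center, half_width)
    surround = gaussian_kernel(sigma_surround, half_width)
    return [c - s for c, s in zip(center, surround)]


def filter_valid(signal: list, kernel: list) -> list:
    """Slide the kernel over the signal, keeping only fully overlapping positions.

    The kernels here are symmetric, so this equals convolution.
    """
    n = len(kernel)
    return [sum(k * signal[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal) - n + 1)]


if __name__ == "__main__":
    kernel = dog_kernel(1.0, 3.0, half_width=9)
    uniform = [5.0] * 40
    step = [0.0] * 30 + [1.0] * 30
    print("largest response to uniform field:",
          max(abs(v) for v in filter_valid(uniform, kernel)))
    print("responses around the step edge:",
          ["%.3f" % v for v in filter_valid(step, kernel)[17:25]])
```

Because both Gaussians have unit sum, the DoG kernel sums to zero: a uniform field produces essentially no output, while a step edge produces a negative lobe on its dark side and a positive lobe on its bright side.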

The analog charge information in the retina is funneled into information pathways as it is channeled from the mosaic plane to the optic nerve. These information channels originate in the retina and are maintained through the optic nerve and to portions of the brain. These include the rod channel, initiated by rod bipolars, the parvocellular pathway (PP) and the magnocellular pathway (MP), the latter two initiated by cone bipolars. Both the PP and the MP exhibit center-surround antagonistic receptive fields. PP cones are tightly connected, responding to small receptive fields, while the MP cones are more loosely connected (together with rod inputs), responding to large receptive fields.

The MP and PP perform separate spatial band-pass filtering, provide color and intensity information, and provide temporal response channels, as illustrated in Figure 3.1-8. A relatively high degree of acuity is achieved in each domain (space, time, and color, or chromatic) from these few filters. The MP is sensitive to low spatial frequencies and broad color intensities, which provide basic information of the objects in the image. The PP is known to be sensitive to higher spatial frequencies and chromatic differences, which add detail and resolution. In the color domain, the PP provides color opponency and thus spectral specificity, and the MP provides color non-opponency and thus overall intensity. In the time domain, the PP provides slowly varying dynamics, while the MP provides transient responses to image dynamics.

Retinal information is primarily in the form of graded potentials as it moves from the photoreceptor cell (PC) layer through the retina to the amacrine cell (AC) and ganglion cell (GC) layers. The GC output axons make up the optic nerve, transporting spikes to the LGN. The ganglion axonal signals begin the optic nerve transmission of color, time, and space information to the remaining neuronal organs in the vision pathway. It is typical that localized processing is graded, like an analog voltage level in an RLC circuit, but is pulsed via action potentials when travelling distances, such as from the retina to the LGN, and from there to the superior colliculus and to the visual cortex.

Figure 3.1-9 shows the signal and image processing functions at the various stages of the retina. Figure 3.1-10 shows greater detail of the lower left region of Figure 3.1-9. The spatio-temporal filtering characteristic is due to the connectivity of the first three layers of neurons: photoreceptors, horizontal cells, and bipolar cells.

#### Coarse-coding in the Signal Frequency Domain

We extend the use of coarse-coding to the signal frequency domain by considering Gaussian curves that simulate signal-processing filters. Gaussian-based filters were chosen due to the Gaussian nature of various stages of neuronal processing in vision systems as well as the ease of implementing Gaussian filters in electronic systems.

The Gaussian-based filters with different variances and their power spectra are shown in Figure 3.1-11. Gaussian curves G1 through G4 have increasing variances. Each curve is normalized so that the peak is at the same location. This way, the shape of the curve can be observed. In practical applications, the curves would be normalized for unity area so that filtering changes the signal without adding or taking away energy.

The spectrum of these Gaussian filters is Gaussian with decreasing variances. A curve with a small variance, such as G1, will pass low and medium frequency components and attenuate high ones, while one with a larger variance, such as G4, will only pass very low frequency components. Subtracting these filters gives us the Difference-of-Gaussian (DoG) filters shown. For the variances selected, DoG G1-G2 serves as a high-pass filter, while the others serve more as band-pass filters.

Keep in mind that frequency here implies signal frequency. The signal could contain variations in spatially distributed energy (spatial frequency), variations of intensity with time at a single location (temporal frequency), or variations in color with respect to either time or space (chromatic frequency).

Pairs of filters can be selected to decompose a signal into specified frequency components. For example, if it is desired to measure the strength of a signal at around 10% of the sampling frequency (horizontal axis in Figure 3.1-11), then the difference between Gaussians G3 and G4 would be used to filter the signal. Due to the linearity of the Fourier Transform, the spectral responses (middle plot in Figure 3.1-11) can be manipulated by addition or subtraction to get the desired spectral response of the filter (bottom plot). This simply translates to the same manipulation in the signal domain (top plot).
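The relationship between Gaussian width, spectral width, and the DoG response can be explored numerically. The following is a minimal sketch, assuming NumPy is available; the four widths in `sigmas`, the kernel length, and the function names are hypothetical placeholders and are not the values used to draw Figure 3.1-11.

```python
import numpy as np

def unit_area_gaussian(n_taps, sigma):
    """Sampled Gaussian kernel normalized so that its taps sum to 1."""
    t = np.arange(n_taps) - n_taps // 2
    g = np.exp(-t**2 / (2.0 * sigma**2))
    return g / g.sum()

def magnitude_spectrum(kernel, n_fft=256):
    """Magnitude of the zero-padded FFT, from DC to half the sampling frequency."""
    return np.abs(np.fft.rfft(kernel, n_fft))

n_fft = 256
sigmas = [1.0, 2.0, 4.0, 8.0]          # assumed widths standing in for G1..G4
kernels = [unit_area_gaussian(129, s) for s in sigmas]
spectra = [magnitude_spectrum(k, n_fft) for k in kernels]

# Difference-of-Gaussian built from the two widest kernels (G3 - G4).
dog_34 = kernels[2] - kernels[3]
dog_34_spectrum = magnitude_spectrum(dog_34, n_fft)

# Frequency axis as a fraction of the sampling frequency (0 to 0.5).
freqs = np.fft.rfftfreq(n_fft)
peak_freq = freqs[np.argmax(dog_34_spectrum)]
print(f"DoG G3-G4 response peaks near {peak_freq:.3f} of the sampling frequency")
```

Because both kernels have unit area, the DoG response at DC is zero, and the peak lies at a low but nonzero frequency, which is the band-pass behavior described above. Changing the assumed widths moves the peak.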

#### Photoreceptor Mosaic

These filtering concepts are readily extended to two dimensions for use with the planar processing behavior of vision system models. To fully appreciate the nature of the image filter, it is essential to understand that the pixels are not uniformly distributed in size or type. The input image comes from a photoreceptor mosaic composed of S, M, and L cones and rods.

Figure 3.1-12 shows a gross simplification of the photoreceptor mosaic. The central region is called the fovea and represents a circular projection of about a 1° conical view of the environment. In this region are only two photoreceptor types: M and L cells. Two cone types allow for color discrimination in the fovea, and the lack of rod cells allows for a high degree of spatial acuity. The rapid decline of spatial acuity with eccentricity, or the amount of separation from the center, can be clearly demonstrated by looking at a book on a bookshelf. Keeping the eyes fixed, it becomes difficult to read titles that are still relatively close to the fixation point.

The lack of rod cells in the fovea accounts for the disappearance of a faint star when we look directly at it. Rod cells are far more sensitive, so they respond in nighttime dim lighting conditions. However, if cones are not stimulated, there is no color discrimination since a strong signal at a frequency with weak response is the same as a weak signal at a frequency with strong response.

Figure 3.1-13 shows a representative mapping of fovea L and M cells into the parvo- (PP) and magnocellular (MP) pathways. The PP cells are physically smaller, but also carry information pertaining to smaller receptive fields. In the figure, the L and M ratios in the MP are kept nearly constant (2:1) so that the only response would be increased or decreased intensity (luminance). The PP surround cells, however, are skewed toward the cell not in the center. In other words, overall, there is a 2:1 ratio of L:M cells. The surround field in the upper left connection is 1:1, which favors the M cell contribution when the L cell is the center. The other example (upper right), the surround is purely L, which favors L over the 2:1 ratio when M is in the center. The surround, therefore, is at a slightly different cellular concentration that helps to favor local contrast between the two spectrally different cone types, allowing for a stronger acuity in the chromatic domain.

### 3.1.4 Color Vision Processing Models

There are several ways to designate the three cone types shown by their spectral responses in Figure 3.1-2. Some researchers use B, G, and R to represent blue, green, and red peaks in the photon absorption curves, although the peaks are not at those precise colors. Others prefer to use S, M, and L to denote the short wavelength, medium wavelength, and long wavelength responses, respectively. This latter designation is more appropriate and is used here; the notation in Boynton’s model has been changed accordingly to keep the three models presented in the next sections consistent. All three describe separate luminance and chromatic channels of information within color vision processing.

#### Guth Color Model [Guth91]

A model proposed by Guth included luminance and chromatic channels, as shown in Figure 3.1-14. The response of the luminance channel can be summarized as L+M, while the response of the chromatic channel can be described as L - S. A variation of this model mixes chromatic and luminance channels with automatic gain control in an artificial neural network trained by psychophysical data. The localized gain control simulates the spatial-temporal characteristics of the photoreceptor-horizontal cell network. There are numerous research efforts that have used various methods of emulating lateral inhibition for the spatial-temporal feature extraction inherent in the photoreceptor-horizontal cell network.

The first stage of the Guth model is the summation of simulated receptor noise sent to each cone followed by a steady-state self-adapting nonlinear gain control. The second stage is linear combinations of signals divided into two sets of three channels each. The third stage is a nonlinear compression of the second stage channels. One set includes two opponent channels and one non-opponent channel compressed to provide visual discriminations and apparent brightness. The other set includes three channels compressed to provide the appearances of light in terms of whiteness, redness or greenness, and blueness or yellowness [Guth91, Guth96].

This model has been criticized as being a poor emulation of retinal structure since no provision is made for cone proportions, the nature of anatomical connections, and the receptive field structure of ganglion and geniculate (LGN) neurons. Also, it appears to be an artificial neural network, with no physiological basis, which is trained to fit psychophysical data [DeV96]. Nevertheless, the division of color processing into luminance and color channels is an integral part of the model, and the point here is that several of these models include similar arrangements of cone types for these vision channels.

#### Boynton’s Color Model [Boyn60]

A classic model by Boynton also divides the color vision pathways into luminance and chromatic channels. The luminance channel in his model is described as L+M. The chromatic channels are described as L-M and (L+M) - S. He points out the similarity of this arrangement to numerous other models. The opponent chromatic channels are known from recordings at the horizontal cell layer. The horizontal cells connect to the photoreceptors and perform spatial and temporal photoreceptor signal mixing. The bipolar cells are thought to propagate difference signals in the opponent pathways [Boyn60].
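The three channel combinations quoted above can be written out directly. This is only a sketch of the stated sums and differences; the function name, the dictionary keys, and the sample cone responses are hypothetical, and no gain or nonlinearity from the original model is represented.

```python
def boynton_channels(L, M, S):
    """Luminance and chromatic channel values from cone responses L, M, S."""
    return {
        "luminance": L + M,            # L+M
        "chromatic_lm": L - M,         # L-M
        "chromatic_lms": (L + M) - S,  # (L+M) - S
    }

# Assumed cone responses, for illustration only.
channels = boynton_channels(L=0.1, M=0.3, S=0.6)
for name, value in channels.items():
    print(f"{name}: {value:+.2f}")
```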

#### DeValois’ Color Model [DeV88]

A later model proposed by DeValois (Figure 3.1-14) goes into more detail by taking the relative concentrations of cells into account. It is observed that the concentration of the various cone cells is a function of eccentricity, or the distance from the center. In the center, the foveola, there are only L and M cells in a respective ratio of about 2:1. S cones become more apparent in the parafovea and more peripheral regions of the retina. There is an overall presumed ratio of L:M:S cells of 10:5:1. The normalized response of a neighborhood with these concentrations gives:

DeV_LMS = 0.625L + 0.3125M + 0.0625S.

The variable DeV_LMS represents the response from a typical photoreceptor neighborhood with representative cell population densities. The DeValois color model consists of 4 center-antagonistic-surround channels, 3 representing PP channels and one representing an MP channel. Each of the 4 channels exists in two polarities for a total of 8 channels. The 6 chromatic channels model PP channel responses as

PPL = (+/-) (L - DeV_LMS)

PPM = (+/-) (M - DeV_LMS)

PPS = (+/-) (S - DeV_LMS)

while the luminance channels model the MP channel responses as

MP = (+/-) ((L + M) - DeV_LMS)

The general concept for the Guth and DeValois color vision models is illustrated in Figure 3.1-14.
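The DeValois channel equations can be collected into a few lines of code. The sketch below assumes scalar cone responses and uses hypothetical function and key names; it simply evaluates DeV_LMS and the four channels in both polarities as defined above.

```python
def dev_lms(L, M, S):
    """Neighborhood response for an L:M:S population ratio of 10:5:1."""
    return 0.625 * L + 0.3125 * M + 0.0625 * S

def devalois_channels(L, M, S):
    """Return the 8 channels: PPL, PPM, PPS, and MP, each in + and - polarity."""
    neighborhood = dev_lms(L, M, S)
    base = {
        "PPL": L - neighborhood,
        "PPM": M - neighborhood,
        "PPS": S - neighborhood,
        "MP": (L + M) - neighborhood,
    }
    channels = {}
    for name, value in base.items():
        channels["+" + name] = value
        channels["-" + name] = -value
    return channels

# A flat (equal) stimulus gives no chromatic opponent response.
flat = devalois_channels(1.0, 1.0, 1.0)
print(flat["+PPL"], flat["+PPM"], flat["+PPS"], flat["+MP"])
```

With equal cone responses the three PP channels vanish while the MP channel does not, which reflects the split between chromatic and luminance information.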

#### Generic Color-Opponent Model

The Boynton and DeValois models, along with models from Martinez-Uriegas [Mart94] and Chittka [Chittka96], are compared in Figure 3.1-15. All of these (as well as Guth) have some sort of L and M cell synergism for encoding luminance and cell antagonism for encoding color. (N and W in the Martinez-Uriegas model stand for narrow and wide receptive field areas; S in the other models stands for short-wavelength cones.) Based on these popular models, a simple color model could include a center receptive field contrasted with its local neighborhood. The center receptive field is modeled as a single picture element, or pixel. Ratios of the center pixel with the local neighborhood represent the color-opponent response. The models presented use differences, but ratios are used in this generic model. This is plausible since many neurons respond logarithmically to a stimulus, and ratios become differences after a logarithmic transformation. The actual responses of bipolar cells are presumed subtractive, but they can be considered divisive since the subtraction follows the logarithmic response of the photoreceptors.

The photoreceptor responses are believed to be logarithmic, while the bipolar cell responses are believed to be subtractive. Due to the logarithmic nature of the photoreceptor response, the bipolar difference signal really reflects a contrast ratio of the photoreceptor with the horizontal-cell-mediated signal (which is a localized spatial-temporal average signal). This is because a logarithmic transform reduces a division to a subtraction. For example, if an M detector responds with an output value of Mo and an L detector responds with an output value of Lo, then the logarithm of the ratio is the same as a subtraction of the individual logarithm-transformed cell responses. That is,

ln (Mo / Lo) = ln(Mo) – ln(Lo).
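A short numerical check makes the identity concrete. The detector outputs below are hypothetical values chosen only for illustration, and the function name is an assumption.

```python
import math

def log_opponent(center, surround):
    """Subtract log responses; equals the log of the center/surround ratio."""
    return math.log(center) - math.log(surround)

# Hypothetical detector outputs (any positive values would do).
m_out, l_out = 0.3, 0.1
difference_of_logs = log_opponent(m_out, l_out)
log_of_ratio = math.log(m_out / l_out)

# Scaling both outputs by a common factor leaves the result unchanged,
# so the difference signal depends only on the contrast ratio.
brighter = log_opponent(10 * m_out, 10 * l_out)

print(difference_of_logs, log_of_ratio, brighter)
```

The three printed values agree to floating-point precision, which shows why a subtractive stage that follows a logarithmic stage behaves like a ratio detector.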

### 3.1.5 Extracting color from parvocellular color-opponent pathway

Figure 3.1-13 shows on and off parvocellular pathways as a difference between a single photoreceptor cell in the center and a local neighborhood of a few adjacent photoreceptors. A representative photon absorption curve for each receptor (S, M, L, and rod) is shown in Figure 3.1-2. If the neighboring receptors are averaged together, the average response will generally differ from the center cell’s response because the neighborhood mixes cone types with different spectral sensitivities. To illustrate this concept, consider this example:

Example 3.1, Center-Surround Opponent Processing

Given the photoreceptor spectral response curves in Figure 3.1-2 and a unity-intensity monochromatic stimulus, determine the output of a center-surround antagonistic receptive field. Assume the surround input is made of a ratio of long-wavelength (L) to medium-wavelength (M) to short-wavelength (S) cones of L:M:S = 10:5:1. Assume the center field is only one cell (L, M, or S). Determine the output for a center cell of each cell type (S, M, and L) for a stimulus whose wavelength is

1. 450 nm

2. 500 nm

3. 550 nm

4. 600 nm

Solution:

Using Figure 3.1-2 we need to estimate the response of each stimulus that is expected from each of the three cell types. Looking at the normalized values at 450 nm the S-cone response is about 0.6, the M-cone about 0.3, and the L-cone about 0.1. The estimated measurements are shown in Figure 3.1-16. If the center cell is an S-cone cell the center value is 0.6. The surrounding neighborhood is calculated as a weighted average of the different responses. For L:M:S = 10:5:1 then the weighted average would be

surround_response = $$\frac{1}{16}(10(0.1)+5(0.3)+(0.6))=\frac{3.1}{16}=0.194$$

and the S-cell center-surround response would be

S cell: center_response – surround_response = 0.6 – 0.194 = 0.406

Similarly, at 450 nm,

M cell: center_response – surround_response = 0.3 – 0.194 = 0.106

L cell: center_response – surround_response = 0.1 – 0.194 = -0.094

Then the same can be done at 500, 550, and 600 nm. The following figure shows an estimated measured response for all three cell types at each of the 4 wavelengths:

Using the weighted average as before, the result for each of the three cell types for each of the four wavelengths are:

Center-surround opponent response by center cell type:

| Stimulus wavelength | S-cell | M-cell | L-cell |
|---|---|---|---|
| 450 nm | 0.41 | 0.11 | -0.09 |
| 500 nm | -0.53 | 0.22 | -0.06 |
| 550 nm | -0.89 | 0.07 | 0.06 |
| 600 nm | -0.61 | -0.31 | 0.21 |

Looking at the results of this example we see positive responses in the forward diagonal and negative responses away from it. This makes sense as the input wavelengths used for this example are incrementally increasing as are the peak response wavelengths going from S to M to L cell. When the input stimulus is near the peak response of the center cell then the weighted average of the local neighborhood is lower since it is influenced by cells not responding as strongly. Of course, this contrast is far more significant in the PP channel than the MP channel since the PP center field is typically a single cell instead of an aggregate of cells in a typical MP channel. The contrast caused by color is therefore much stronger in the PP channel than the MP channel, which is why color is attributed to the PP channel in Figure 3.1-8.

This example assumes an object emitting (or reflecting) energy at a single monochromatic frequency, but most natural objects emit a wide distribution of frequencies across the visible spectrum. Regardless of the chromatic frequency distribution the algorithm results in a single specific response for each input that the higher brain processing can use to perceive a specific color. The color difference of an object against its background is amplified by this contrast, which benefits a species dependent on color perception for survival.
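The 450 nm case of Example 3.1 can be reproduced with a few lines of code. This sketch uses only the estimated cone responses quoted in the solution (S = 0.6, M = 0.3, L = 0.1) and the 10:5:1 surround ratio; the function and variable names are hypothetical. The other wavelengths would need response values read from Figure 3.1-2, which are not repeated here.

```python
CONE_RATIO = {"L": 10, "M": 5, "S": 1}   # surround population ratio L:M:S

def surround_response(responses, ratio=CONE_RATIO):
    """Population-weighted average of the cone responses in the surround."""
    total = sum(ratio.values())
    return sum(ratio[cone] * responses[cone] for cone in ratio) / total

def center_surround(responses, ratio=CONE_RATIO):
    """Center minus surround, for a single-cell center of each cone type."""
    surround = surround_response(responses, ratio)
    return {cone: responses[cone] - surround for cone in responses}

# Estimated normalized responses at 450 nm, as read in the example.
responses_450 = {"S": 0.6, "M": 0.3, "L": 0.1}
outputs_450 = center_surround(responses_450)

for cone in ("S", "M", "L"):
    print(f"{cone} center: {outputs_450[cone]:+.2f}")
```

Rounded to two decimals, the outputs match the 450 nm entries of the results above (0.41, 0.11, and -0.09).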

### 3.1.6 Gaussian Filters

One of the original models for the outer plexiform layer (photoreceptor-horizontal-bipolar cell interconnection layer) is the Laplacian-of-Gaussian (LoG) filter. For a Gaussian function, G, defined in terms of a radius from the center, r, so that r² = x² + y² for Cartesian coordinates x and y, G is defined in terms of the variance, σ², as

$$G=e^{\frac{-\left(x^{2}+y^{2}\right)}{2 \pi \sigma^{2}}}=e^{\frac{-r^{2}}{2 \pi \sigma^{2}}}$$

Gaussian Filter

The LoG filter is defined as the second derivative of G:

$$\nabla^{2} G(r)=\frac{-1}{\pi \sigma^{2}}\left(1-\frac{r^{2}}{\pi \sigma^{2}}\right) e^{\frac{-r^{2}}{2 \pi \sigma^{2}}}$$

Laplacian-of-Gaussian (LoG) Filter

The Difference-of-Gaussian (DoG) for two Gaussians with variances σ₁² and σ₂² is

$$G_{1}-G_{2}=e^{\frac{-r^{2}}{2 \pi \sigma_{1}{ }^{2}}}-e^{\frac{-r^{2}}{2 \pi \sigma_{2}^{2}}}$$

Difference-of-Gaussian (DoG) Filter

Under certain conditions, the DoG filter can very closely match the LoG filter [Marr82]. The DoG filter allows more flexibility as two variances can be modified, thus there are two degrees of freedom. The LoG filter only uses one variance, thus only one degree of freedom.
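As a rough numerical check of how closely a DoG can follow the LoG, the sketch below samples both along a single axis using the definitions given above. Several choices here are assumptions made for illustration: the two Gaussians are normalized to unit area before subtraction, their widths have a ratio of 1.6 and are placed symmetrically (geometrically) about the LoG width, and similarity is measured by a normalized inner product. All function names are hypothetical.

```python
import numpy as np

def gaussian(r, sigma):
    """G as defined above, with 2*pi*sigma^2 in the exponent denominator."""
    return np.exp(-r**2 / (2.0 * np.pi * sigma**2))

def log_filter(r, sigma):
    """LoG as defined above (second derivative of G with respect to r)."""
    s2 = np.pi * sigma**2
    return (-1.0 / s2) * (1.0 - r**2 / s2) * gaussian(r, sigma)

def dog_filter(r, sigma_narrow, sigma_wide):
    """Difference of two unit-area Gaussians (narrow minus wide)."""
    g1 = gaussian(r, sigma_narrow)
    g2 = gaussian(r, sigma_wide)
    return g1 / g1.sum() - g2 / g2.sum()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

r = np.linspace(-10.0, 10.0, 2001)
sigma = 1.0
ratio = 1.6                                   # assumed width ratio
dog = dog_filter(r, sigma / np.sqrt(ratio), sigma * np.sqrt(ratio))

# Narrow-minus-wide has a positive center, so it is compared with -LoG.
similarity = cosine_similarity(dog, -log_filter(r, sigma))
print(f"shape similarity between DoG and -LoG: {similarity:.4f}")
```

A similarity close to 1 indicates that the two curves differ mainly by a scale factor. Varying `ratio` shows how the extra degree of freedom in the DoG changes the match.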

The spectrum of a gaussian is also a gaussian:

$$e^{-t^{2} / 2 \sigma^{2}} \Leftrightarrow \sigma \sqrt{2 \pi} e^{-\sigma^{2} \omega^{2} / 2}$$

Note that the variance, σ², is in the denominator of the exponent in the time domain and in the numerator of the exponent in the frequency domain. This is shown graphically in Figure 3.1-11, where the broad (large variance) Gaussians result in sharp spectral responses, passing only very low frequencies. The narrow (small variance) Gaussians pass more of the lower and middle frequencies. The limits are a zero-variance Gaussian, which, when normalized to unity area, becomes the impulse function, and an infinite-variance Gaussian, which becomes a constant. An impulse function passes all frequencies, and a constant only passes the DC component of the signal, which, in the frequency domain, is represented as an impulse at ω = 0 (repeated every 2π increment of ω due to the periodicity of the Fourier Transform):

$$\delta(t) \Leftrightarrow 1$$ Zero-variance gaussian limit

$$1 \Leftrightarrow 2 \pi \delta(\omega)$$ Infinite-variance gaussian limit

### 3.1.7 Wavelet Filter Banks and Vision Pathways

The two primary vision pathways are the magnocellular pathway (MP) and the parvocellular pathway (PP). Each neuronal response in the MP represents a local average over a large receptive field. Each neuronal response in the PP represents local detail in a smaller receptive field. Thus, the MP and PP decompose the natural input image into local average and local detail components, respectively.

Similarly, digital images can also be decomposed into a set of averages and another set of details using quadrature mirror filtering (QMF). This method of image analysis (breaking apart images into components) and synthesis (reconstructing images from the components) results in a series of averaging components and another series of detailing components [Strang96]. QMF is a special case of sub-band coding, where filtered components represent the lower and upper frequency halves of the original signal bandwidth. If the analyzing filter coefficients are symmetric, then the synthesizing components are mirrored with respect to the half-band value, thus the term quadrature mirror. The structure of such a wavelet analyzer and synthesizer is shown in Figure 3.1-17. The low pass filter (LPF) and high pass filter (HPF) are similar in functionality to the MP and PP in time, space, and color domains. A variety of applications have emerged from the QMF.

To illustrate QMF, the following example and exercise decompose a sequence into its averages (after LPF) and details (after HPF). The sequence is down-sampled after each pass through the LPF; all LPFs are the same and all HPFs are the same (technically, the reconstruction filters are adjoint filters, but are the same for real-valued coefficients).

Example 3.2, 1D QMF Analysis and Synthesis

1. Using the discrete Haar wavelets [0.5 0.5] and [0.5 -0.5] for LPF and HPF respectively, show how to decompose the following sequence into one average value and a set of detailed values.

2. Reconstruct the original sequence from the calculated components to verify correct decomposition.

3. Compare the energy of the original sequence with the energy of the components.

x[n] = {12 16 8 10 10 18 13 17}

Solution:

Figure 3.1-18 shows the QMF symmetry of the PSD for the given LPF and HPF.
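The mirror symmetry of the two filters can also be checked numerically. The following sketch, which assumes NumPy and uses an arbitrary FFT length of 64, computes the power response of each filter and compares them; the variable names are hypothetical.

```python
import numpy as np

lpf = np.array([0.5, 0.5])
hpf = np.array([0.5, -0.5])

n_fft = 64  # arbitrary resolution for sampling the frequency response
power_lpf = np.abs(np.fft.rfft(lpf, n_fft)) ** 2
power_hpf = np.abs(np.fft.rfft(hpf, n_fft)) ** 2

# Frequency axis as a fraction of the sampling frequency (0 to 0.5).
freqs = np.fft.rfftfreq(n_fft)
quarter = n_fft // 4            # index of one-fourth the sampling frequency

print("frequency at crossing index:", freqs[quarter])
print("LPF and HPF power there:", power_lpf[quarter], power_hpf[quarter])
print("power sum is flat:", np.allclose(power_lpf + power_hpf, 1.0))
print("HPF is the mirror of LPF:", np.allclose(power_hpf, power_lpf[::-1]))
```

For these two filters the power responses are mirror images about one-fourth the sampling frequency, where each equals one half, and together they sum to one at every frequency.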

Part a:

We now filter the input sequence with the LPF and HPF (and stop once we have the same number of values, thus discarding the last value). Using the graphical method of convolution, flipping the LPF (which is symmetrical) and passing under x[n], taking dot product, and shifting results in

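x[n]: 12 16 8 10 10 18 13 17

| Shift | Samples of x[n] under LPF[-n] = [0.5 0.5] | Dot product |
|---|---|---|
| 1 | 0 (padding), 12 | 6 |
| 2 | 12, 16 | 14 |
| 3 | 16, 8 | 12 |
| 4 | 8, 10 | 9 |
| 5 | 10, 10 | 10 |
| 6 | 10, 18 | 14 |
| 7 | 18, 13 | 15.5 |
| 8 | 13, 17 | 15 |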

First LPF result is {6 14 12 9 10 14 15.5 15}

Down-sampling LPF results gives {14 9 14 15}, which will be the input to the next LPF stage.

Similarly, using the graphical method of convolution, flipping the HPF and passing under x[n], taking dot product, and shifting results in

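x[n]: 12 16 8 10 10 18 13 17

| Shift | Samples of x[n] under HPF[-n] = [-0.5 0.5] | Dot product |
|---|---|---|
| 1 | 0 (padding), 12 | 6 |
| 2 | 12, 16 | 2 |
| 3 | 16, 8 | -4 |
| 4 | 8, 10 | 1 |
| 5 | 10, 10 | 0 |
| 6 | 10, 18 | 4 |
| 7 | 18, 13 | -2.5 |
| 8 | 13, 17 | 2 |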

First HPF result is {6 2 -4 1 0 4 -2.5 2}

Down-sampling HPF results gives {2 1 4 2}, which will be saved as detailed components.

To determine the results of the second stage we repeat the LPF and HPF on the down-sampled LPF results of the first stage:

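Down-sampled first-stage LPF results: 14 9 14 15

| Shift | Samples under LPF[-n] = [0.5 0.5] | Dot product |
|---|---|---|
| 1 | 0 (padding), 14 | 7 |
| 2 | 14, 9 | 11.5 |
| 3 | 9, 14 | 11.5 |
| 4 | 14, 15 | 14.5 |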

Second LPF result is {7 11.5 11.5 14.5}

Down-sampling gives {11.5 14.5}, which will be the input to the next LPF stage.

Down-sampled first-stage LPF results: 14 9 14 15

| Shift | Samples under HPF[-n] = [-0.5 0.5] | Dot product |
|---|---|---|
| 1 | 0 (padding), 14 | 7 |
| 2 | 14, 9 | -2.5 |
| 3 | 9, 14 | 2.5 |
| 4 | 14, 15 | 0.5 |

Second HPF result is {7 -2.5 2.5 0.5}

Down-sampling gives {-2.5 0.5}, which will be saved as detailed components

To determine the results of the third stage we repeat the LPF and HPF on the down-sampled LPF results of the second stage. Subsequent down-sampling results in one value which will be saved:

Down-sampled second-stage LPF results: 11.5 14.5

| Shift | Samples under LPF[-n] = [0.5 0.5] | Dot product |
|---|---|---|
| 1 | 0 (padding), 11.5 | 5.75 |
| 2 | 11.5, 14.5 | 13 |

Third LPF result is {5.75 13}

Down-sampling gives the value 13. This value represents the sequence average.

Down-sampled second-stage LPF results: 11.5 14.5

| Shift | Samples under HPF[-n] = [-0.5 0.5] | Dot product |
|---|---|---|
| 1 | 0 (padding), 11.5 | 5.75 |
| 2 | 11.5, 14.5 | 1.5 |

Third HPF result is {5.75 1.5}

Down-sampling gives the value 1.5, and the analysis is complete.

A summary of the filter outputs is listed here; in each result, the values kept after down-sampling are the even-positioned entries (second, fourth, and so on):

First LPF result: {6 14 12 9 10 14 15.5 15}

First HPF result: {6 2 -4 1 0 4 -2.5 2}

Second LPF result: {7 11.5 11.5 14.5}

Second HPF result: {7 -2.5 2.5 0.5}

Third LPF result: {5.75 13}

Third HPF result: {5.75 1.5}

The QMF components in x[n] are the down-sampled HPF results and the final average, which is the sequence {2 1 4 2 -2.5 0.5 1.5 13}, where the last value is the sequence average.
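The analysis steps of Part a can be sketched in code. The version below assumes NumPy and a sequence whose length is a power of two; it follows the procedure used in this example (convolve, keep as many outputs as inputs, then keep every second value), and the function names are hypothetical.

```python
import numpy as np

LPF = np.array([0.5, 0.5])
HPF = np.array([0.5, -0.5])

def filter_and_downsample(x, h):
    """Convolve, keep len(x) outputs, then keep every second value."""
    filtered = np.convolve(x, h)[:len(x)]
    return filtered[1::2]

def haar_analysis(x):
    """Return all detail values followed by the final average."""
    current = np.asarray(x, dtype=float)
    details = []
    while len(current) > 1:
        details.append(filter_and_downsample(current, HPF))
        current = filter_and_downsample(current, LPF)
    return np.concatenate(details + [current])

x = [12, 16, 8, 10, 10, 18, 13, 17]
components = haar_analysis(x)
print(components)
```

The printed components match the sequence found above, with the three groups of detail values followed by the sequence average.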

Part b:

For the purposes of this text, which is to illustrate reconstruction from the components, we will simply subtract the detail from the average and then add the detail to the average to show the original sequence can be reconstructed. The final detail, 1.5 will be subtracted from the final average, 13, to give 11.5, and then the same two values will be added to give 14.5:

Reconstructing second stage: { (13-1.5) (13+1.5)}

= {11.5 14.5}

Then the second-stage down-sampled detail, the sequence {-2.5 0.5} will be used to subtract and add to the second stage average values just determined above:

Reconstructing first stage: { (11.5-(-2.5)) (11.5+(-2.5)) (14.5-0.5) (14.5+0.5)}

= {14 9 14 15}

and the original sequence is determined by subtracting and then adding the down-sampled first-stage details to those values:

x[n] = {14-2 14+2 9-1 9+1 14-4 14+4 15-2 15+2}

= {12 16 8 10 10 18 13 17}
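The subtract-then-add reconstruction used in Part b can be written as a short loop. This sketch assumes the component ordering used in this example (first-stage details, then second-stage details, and so on, ending with the final average) and a length that is a power of two; the function name is hypothetical.

```python
def haar_synthesis(components):
    """Rebuild the sequence from [stage-1 details, stage-2 details, ..., average]."""
    values = list(components)
    averages = [values.pop()]        # start from the final average
    while values:
        n = len(averages)
        details = values[-n:]        # details belonging to the current stage
        del values[-n:]
        rebuilt = []
        for average, detail in zip(averages, details):
            rebuilt.extend([average - detail, average + detail])
        averages = rebuilt
    return averages

components = [2, 1, 4, 2, -2.5, 0.5, 1.5, 13]
reconstructed = haar_synthesis(components)
print(reconstructed)
```

Each pass doubles the number of values, so three passes recover the eight original samples from the single average and seven details.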

Part c:

One of the benefits of decomposition is the great reduction in signal energy. The total energy is the sum of the square of each of the components, which results in

Energy in x[n]: $$12^{2}+16^{2}+8^{2}+10^{2}+10^{2}+18^{2}+13^{2}+17^{2}=1446$$

Energy in QMF components of x[n]: $$2^{2}+1^{2}+4^{2}+2^{2}+(-2.5)^{2}+0.5^{2}+1.5^{2}+13^{2}=202.75 \approx 202.8$$

As sequences become larger and signals become multidimensional (such as images or image sequences) the comparison can be far more dramatic (orders of magnitude).
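The energy comparison in Part c is easy to confirm in code. The sketch below simply sums the squares of the two sequences from this example; the helper name is hypothetical.

```python
def energy(sequence):
    """Sum of the squares of the sequence values."""
    return sum(value * value for value in sequence)

x = [12, 16, 8, 10, 10, 18, 13, 17]
components = [2, 1, 4, 2, -2.5, 0.5, 1.5, 13]

energy_x = energy(x)                    # 1446
energy_components = energy(components)  # 202.75
print(energy_x, energy_components, energy_x / energy_components)
```

For this short sequence the components carry roughly one seventh of the energy of the original samples.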

Exercise 3.1, 1D QMF Analysis and Synthesis

Using the discrete Haar wavelets [0.5 0.5] and [0.5 -0.5] for LPF and HPF respectively, show how to decompose the following sequence into one average value and a set of detailed values.

x[n] = {2 22 4 12 0 16 0 4}

Answer: QMF Components of x[n]: {10 4 8 2 -2 -3 -2.5 7.5},
where the last value is the sequence average.

Vision pathways (MP and PP) and QMF filter banks both therefore break up the input image signal into high and low frequency components. The MP and PP are further augmented by the rod-system pathway. Rod cells are highly interconnected, and although the rods themselves are basically saturated in daylight conditions, the rod bipolar cells are mediated by neighboring cone cells. The overall effect is a spatial low-pass filter of the mosaic image.

A model of the low frequency rod system filter can be combined with a model of the PP to create a pair of filters whose spectral response crosses at one-fourth the sampling frequency, or half the Nyquist-limited frequency. A carefully chosen pair can give a striking resemblance to typical filter pairs chosen for QMF applications. A model of the MP can be substituted for the low frequency filter, but the spectral response will diminish with very low frequencies.

### 3.1.8 Coarse Coding and the Efficient Use of Basis Functions

Natural vision systems process information in space, time, and color domains. In each of these domains we find filters that are typically few and relatively coarse in bandwidth. There are essentially only four chromatic detector types, three temporal channels, and three spatial channels. The responses of these elements must be broad in scope to cover their portion of the data space. For example, in daytime conditions only three detector types have varying responses. As a minimum each type must cover one-third of the visible spectrum.

Coarse coding resembles the more common wavelet applications typified by complementary coarse low pass and high pass filters. QMF signal reconstruction capability is a practical demonstration of extracting specific spectral detail from only two broadband filters. An interesting corollary to this line of research is that the behavior of such synthetic applications may lead to a deeper understanding of natural information processing phenomena.

### 3.1.9 Nonorthogonality and Noncompleteness in Vision Processing

Sets of wavelets can be subdivided into orthogonal or nonorthogonal and complete or noncomplete categories. A set of functions is orthogonal if the inner product of any two different functions is zero, and complete if no nonzero function in the space is orthogonal to every function in the set. Orthogonality provides computational convenience for signal analysis and synthesis applications. Completeness ensures the existence of a series representation of each function within the given space. Orthogonality and completeness are desired properties for wavelet bases in compression applications.

However, biological systems are not concerned with information storage for perfect reconstruction. Any machine-vision application requiring some action to be taken based on an understanding of the image content will also fit this general description. In fact, many biological processes can be modeled by sets of functions that are nonorthogonal [Daug88]. The task is processing information to take some action, not processing information for later reconstruction. Using nonorthogonal filters leads to a redundancy of information to cover the span of information. The redundancy of vision filters is balanced by the need for efficiency, simplicity, and robustness. Information redundancy results in unnecessary hardware and interconnections, but often redundancy may be required to sufficiently span the information space inherent in the environment. The cost of supporting the redundancy may be less significant than the benefit of using simpler processing elements that degrade gracefully. Since there is a closeness between Gaussian-based filters and more mathematically elegant filters (such as Laplacian) there is good retention of pertinent information (though not perfect).
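The difference between an orthogonal pair and a nonorthogonal pair can be seen from their inner products. The sketch below, assuming NumPy, compares the two-tap filters [0.5 0.5] and [0.5 -0.5] with two overlapping Gaussian filters; the Gaussian widths and the function names are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def normalized_inner_product(a, b):
    """Inner product scaled by the vector norms (0 means orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gaussian(t, sigma):
    return np.exp(-t**2 / (2.0 * sigma**2))

# Orthogonal pair: the averaging and differencing filters.
haar_lpf = np.array([0.5, 0.5])
haar_hpf = np.array([0.5, -0.5])
haar_overlap = normalized_inner_product(haar_lpf, haar_hpf)

# Nonorthogonal pair: two Gaussians of different (assumed) widths.
t = np.arange(-20, 21)
gaussian_overlap = normalized_inner_product(gaussian(t, 2.0), gaussian(t, 4.0))

print("two-tap filter overlap:", haar_overlap)
print("Gaussian filter overlap:", round(gaussian_overlap, 3))
```

The two-tap filters have zero overlap, while the Gaussian filters overlap strongly, so their outputs carry redundant information about the input.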