Hardware Accelerated Rendering Of Antialiasing Using A Modified A-buffer Algorithm

Stephanie Winner*, Mike Kelley†, Brent Pease**, Bill Rivard*, and Alex Yen†
Apple Computer

ABSTRACT

This paper describes algorithms for accelerating antialiasing in 3D graphics through low-cost custom hardware. The rendering architecture employs a multiple-pass algorithm to perform front-to-back hidden surface removal and shading. Coverage mask evaluation is used to composite objects in 3D. The key advantage of this approach is that antialiasing requires no additional memory and decreases rendering performance by only 30-40% for typical images. The system is image-partition based and is scalable to satisfy a wide range of performance and cost constraints.

CR Categories and Subject Descriptors: I.3.1 [Computer Graphics]: Hardware Architecture - raster display devices; I.3.3 [Computer Graphics]: Picture/Image Generation - display algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism - visible surface algorithms

Additional Key Words and Phrases: scanline, antialiasing, transparency, texture mapping, plane equation evaluation, image partitioning

1 INTRODUCTION

This paper describes a low-cost hardware accelerator for rendering 3D graphics with antialiasing. It is based on a previous architecture described by Kelley [10]. The hardware implements an innovative algorithm based on the A-buffer [3] that combines high-performance front-to-back compositing of 3D objects with coverage mask evaluation. The hardware also performs triangle setup, depth sorting, texture mapping, transparency, shadows, and Constructive Solid Geometry (CSG) operations. Rasterization speed without antialiasing is 100M pixels/second, providing throughput of 2M texture-mapped triangles/second¹. The degradation in speed when antialiasing is enabled for a complex scene is 30%, resulting in 70M pixels/second. Several hardware algorithms have been developed which maintain either high quality or performance while reducing or eliminating the large memory requirement of supersampling [11,8]. An accumulation buffer requires only a fraction of the memory of supersampling, but requires several passes of the object data

* 3Dfx Interactive, San Jose, CA USA, [email protected], [email protected]
† Silicon Graphics Computer Systems, Mountain View, CA USA, [email protected], [email protected]
** Bungie West, San Jose, CA USA, [email protected]
¹ 50-pixel triangles, with tri-linearly interpolated mip-mapped textures.

(one pass per subpixel sample) through the hardware rendering pipeline. The resulting image is very high quality, but the performance degrades in proportion to the number of subpixel samples used by the filter function. An A-buffer implementation does not require several passes of the object data, but does require sorting objects by depth before compositing them. The amount of memory required to store the sorted layers is limited to the number of subpixel samples, but it is significant since the color, opacity and mask data are needed for each layer. The compositing operation uses a blending function which is based on three possible subpixel coverage components and is more computationally intensive than the accumulation buffer blending function. The difficulty of implementing the A-buffer algorithm in hardware is described by Molnar [12]. The A-buffer hardware implementation described in this paper maintains the high performance of the A-buffer using a limited amount of memory. Multiple passes of the object data are sometimes required to composite the data from front-to-back even when antialiasing is disabled. The number of passes required to rasterize a partition increases when antialiasing is used. However, only in the worst case is the number of passes equal to the number of subpixel samples (9, in our system). It is possible to enhance the algorithm as described in [2, 3] to correctly render intersecting objects. The current implementation does not include that enhancement. Furthermore, the algorithm correctly renders images of moderate complexity which have overlapping transparent objects without imposing any constraints on the order in which transparent objects are submitted.
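The accumulation-buffer approach described above can be modeled in a few lines; this is a minimal software sketch (not the hardware implementation) and it assumes a uniform box filter over the sample offsets:

```python
def accumulate_antialias(render_pass, offsets):
    """Accumulation-buffer antialiasing: re-render the scene once per
    subpixel offset and average the results. Uniform filter weights are
    assumed for simplicity; `render_pass(dx, dy)` returns a 2D list of
    intensities for the given subpixel offset."""
    acc = None
    for (dx, dy) in offsets:
        img = render_pass(dx, dy)            # one full pass per sample
        if acc is None:
            acc = [[0.0] * len(img[0]) for _ in img]
        for y, row in enumerate(img):
            for x, v in enumerate(row):
                acc[y][x] += v
    n = len(offsets)
    return [[v / n for v in row] for row in acc]
```

The cost is the point of the passage above: the loop runs once per subpixel sample, so an 8-sample kernel costs 8 full rendering passes regardless of scene content.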

2 SYSTEM OVERVIEW

The hardware accelerator is a single ASIC which performs the 3D rendering and triangle setup. It provides a low-cost solution for high-performance 3D acceleration in a personal computer. A second ASIC is used to interface to the system bus (PCI or AGP). The rasterizer uses a screen partitioning algorithm with a partition size of 16x32 pixels. Screen partitioning reduces the memory required for depth sorting and image compositing to a size which can be accommodated inexpensively on-chip. No off-chip memory is needed for the z buffer and dedicated image buffer. The high bandwidth, low latency path between the rasterizer and the on-chip buffers improves performance.

The system's design was guided by three principles. We strove to:

1. Balance the computation between the processor and hardware 3D accelerator;
2. Minimize processor interrupts and system bus bandwidth; and
3. Provide good performance with as little as 2 MB of dedicated memory, but have performance scale up in higher memory configurations.

The principles inspired the following features:







- The hardware accelerator implements triangle setup to reduce required system bandwidth and balance the computational load between the accelerator and the host processor(s). Multiple rendering ASICs can operate in parallel to match CPU performance.
- The hardware accelerator only interrupts the processor when it has finished processing a frame. This leaves the CPU free to perform geometry, clipping, and shading operations for the next frame while the ASIC is rasterizing the current frame.
- The partition size is 16x32 pixels so that a double-buffered z buffer and image buffer can be stored on-chip. This reduces cost and required memory bandwidth while improving performance. External memory is required for texture map storage, so texture map rendering performance scales with that memory's speed and bandwidth.

In addition to these three design principles, another goal was to provide hardware support for antialiased rendering. Two types of antialiasing quality were desired: a fast mode for interactive rendering, and a slower, high-quality mode for producing final images. For high-quality antialiasing, the ASIC uses a traditional accumulation buffer method to antialias each partition by rendering the partition at every subpixel offset and accumulating the results in an off-chip buffer. Because this algorithm is well known [8], this high-quality antialiasing mode is not discussed in this paper. The more challenging goal was to also provide high quality antialiasing for interactive rendering in less than double the time needed to render a non-antialiased image. We assumed that this type of antialiasing would only be used for playback or previewing, so it could only consume a small portion of the die area. Therefore the challenge in implementing antialiasing was how to antialias properly without maintaining the per-pixel coverage and opacity data for each of the layers individually. Our solution to this problem involves having the ASIC perform Z-ordered shading using a multiple-pass algorithm (see the Appendix for pseudo-code of the rendering algorithm). This permits an unlimited number of layers to be rendered for each pixel, as in the architecture presented by Mammen [11]. However, because Mammen's architecture performs antialiasing by integrating area samples in multiple passes to successively antialias the image, the number of passes is equal to the number of subpixel positions in the filter kernel. For example, rendering an antialiased image using a typical filter kernel of 8 samples would require 8 times as long as rendering it without antialiasing. Obviously this is too high a performance penalty for use in interactive rendering.

With our modified A-buffer algorithm, the number of passes required to antialias an image is a function of image complexity (opacity and subpixel coverage) in each partition, not the number of subpixel samples. The worst case arises when there are at least 8 layers which have 8 different coverage masks which each cover only one subpixel. This rarely, if ever, occurs in practice. In fact, we have found that an average of only 1.4 passes is required when rendering with a 16x32 partition and an 8-bit mask. A discussion of the details of the system architecture follows the discussion of the antialiasing algorithm implementation.

2.1 Front-to-Back Antialiasing

The antialiasing algorithm is distributed among three of the major functional blocks of the ASIC (see Figure 1): the Plane Equation Setup, Hidden Surface Removal, and Composite blocks.

Figure 1. Rasterization ASIC pipeline (Display List Traversal → Plane Equation Setup → Scan Conversion → Hidden Surface Removal, CSG and Shadow → Shading and Texture Mapping with Texture Cache → Composite → Scanout I/O; on-chip 32x16-pixel depth and image RAMs, with a Resubmit path from Composite back to Display List Traversal).

The Plane Equation Setup calculates plane equation parameters for each triangle and stores them for later evaluation in the relevant processing blocks. The Scan Conversion generates the subpixel coverage masks for each pixel fragment and outputs them to the rendering pipeline. During Hidden Surface Removal, fragments of tessellated objects are flagged for specific blend operations during shading. The Composite block shades pixels by merging the coverage masks and alpha values.

2.2 Coverage Mask Generation

We use a staggered subpixel mask, as shown in Figure 2. Each pixel is divided into 16 subpixels, but only half of the samples are used. The mask is stored as an 8-bit value using the bit assignments shown in Figure 2.

Figure 2. Staggered sub-pixel mask (sample positions numbered 0-7 over a 4x4 subpixel grid).

This staggered mask is similar to the mask used in the triangle processor [5]. It uses only half the memory a grid-aligned 4x4 mask requires but offers nearly the same quality of antialiasing. Better antialiasing quality can be achieved by increasing the subpixel samples to 64 and using a 32-bit mask. To support that, the on-chip image buffer would require nearly 60% more capacity.

The mask generation is performed by treating each scanline as 4 subscanlines and computing 4 coverage segments using the scan conversion parameters. The triangle edge intersection with the scanline is calculated first. The edge intersection solves the following linear equations:

Xbegin = Slopebegin * (CurrentY - Y0) + X0
Xend = Slopeend * (CurrentY - Y1) + X1

where Y0, Y1, X0 and X1 are the end points of the edges. Then the begin and end values for the 4 subscanlines are calculated and each pixel is clipped against those segments. The associated coverage mask bit is asserted if the subpixel is not clipped by the segment. Figure 3 shows an example where each color represents the 16 subpixel samples. The column on the left represents the [begin, end] values of each segment. A subpixel's coverage mask bit is asserted when it is greater than or equal to the begin value and less than the end value. The coverage mask for each pixel is shown in the bottom of the figure.

Figure 3. Coverage Mask Generation (subscanline segments [2,10], [1,10], [1,10], [0,10]; resulting per-pixel masks 0xE8, 0xFF, 0x55).

A single set of linear equation evaluators can achieve one pixel per clock output even for small triangles such as 2x2. In hardware, solving a linear equation is more efficient in terms of speed and area than using lookup tables as was done by Schilling [16]. As in Schilling's design [14] and the Reality Engine [2], we exploited the fact that mask generation is closely related to scan conversion and reused much of the circuitry between those functions.
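The subscanline clipping just described can be sketched as follows. The checkerboard bit assignment used here is an assumption for illustration (the paper's Figure 2 numbering may differ), so individual hex values need not match the figure exactly:

```python
def coverage_mask(px, segments):
    """Compute an 8-bit staggered coverage mask for pixel column `px`.

    `segments` holds 4 (begin, end) spans, one per subscanline, in
    subpixel units (4 subpixel columns per pixel). A sample is covered
    when begin <= column < end. The stagger assumed here samples
    subscanline r at columns where (c + r) is even, with bits numbered
    row-major over the sampled positions (a hypothetical layout)."""
    mask = 0
    bit = 0
    for r, (begin, end) in enumerate(segments):
        for c in range(4):                    # subpixel columns in pixel
            col = 4 * px + c
            if (c + r) % 2 == 0:              # staggered sample position
                if begin <= col < end:
                    mask |= 1 << bit
                bit += 1
    return mask
```

With the figure's segments [2,10], [1,10], [1,10], [0,10], the fully covered middle pixel yields 0xFF and the right-edge pixel yields 0x55, which happens to agree with Figure 3; the left pixel's value depends on the exact bit assignment.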

2.3 Fragment Merging

The Hidden Surface Removal block includes an on-chip buffer for two layers of pixel depth data (depth value and shadow and CSG state information). When objects are rendered in a single pass, only one layer is used. When multiple layers are needed,

one layer contains the depth of the data composited during previous passes and the second layer contains the front-most depth of the data which has yet to be composited. Pixels which are completely covered by opaque objects are resolved in a single pass. When a pixel contains portions of two or more triangles, it is desirable to merge the pixel fragments so that the pixel can be fully composited in one pass. We considered and explored several methods, but did not find a satisfactory solution which permits processing of pixel fragments in a single pass. We considered using object tags that could be compared during sorting[3], but rejected that approach because of its limitations and the burden it places on software. Object tags require extra memory in the rasterizer as they must be stored along with the depth data for each pixel. The number of unique tags is thus limited by hardware memory. Software must assign a unique tag to each object and must determine how to best reuse tags when the number of objects exceeds the number of tags the hardware supports. Another method which has been used for combining pixel fragments is to identify ones with similar depths and combine them if their colors are similar [17]. In our architecture the colors are not available during hidden surface removal, so the method of combining pixel fragments can only use the depth data. It is difficult to determine when two depths are similar and should be considered equal. Some software renderers use the minimum of the depth gradient to determine a tolerance within which objects are considered to have equal depth. Since the depth gradients in x and y are readily available, namely the a and b plane equation parameters (see Section 3.2), this seems to be a perfect option. Unfortunately, in practice it is possible to create scenes in which small triangles, approaching a pixel in size, have large gradients which can not be properly sorted. 
The gradients output by the Plane Equation Setup are only accurate if the pixel is completely covered, so they are not representative of the actual gradient of a pixel fragment. Computing the true gradient of a pixel fragment is more difficult, so we tried using a fixed tolerance. However, even when a fixed tolerance is used, pixel data is incorrectly discarded during the multiple-pass front-to-back depth sorting operation. We decided that rather than implement a solution which causes serious artifacts, we would prefer a robust solution and compromise on performance. The solution we chose is to consider two depths equal only when they match precisely.
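The trade-off can be seen in a toy comparison function; the hardware's choice corresponds to a tolerance of exactly zero, and this is a hedged sketch only:

```python
def depths_equal(z1, z2, tolerance=0.0):
    """Fragment-merge depth comparison. The hardware uses exact equality
    (tolerance = 0.0). A nonzero tolerance looks attractive, but the
    relation it defines is not transitive, so chains of nearly-equal
    fragments can be merged and composited in the wrong order."""
    return abs(z1 - z2) <= tolerance
```

For example, with a tolerance of 0.015, depths 0.50 and 0.51 compare equal, as do 0.51 and 0.52, yet 0.50 and 0.52 do not; this non-transitivity is one way tolerance-based merging can discard contributing data.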

2.4 Shading For Antialiasing

Shading is implemented in the Composite block, which, like the Hidden Surface Removal block, has a two-layer buffer for storing pixel colors and masks. Since the buffers occupy on-chip memory it is necessary to minimize the state information stored in them. Consequently, the pixel color, alpha, and mask for the previously composited data is stored as a single 40-bit value. In addition, a single bit controls the blending functions for color, alpha, and mask. The details of how the Composite block functions as part of the pipeline are described later in this paper.

Consider the example of 3 layers of data as shown in Figure 4. Object A's data completely covers the pixel, having a mask of 0xFF and an alpha of 0.5. At the end of the first pass the first layer contains A's color, alpha, and mask.

Figure 4. 3-layer composite example (in order of increasing depth: A with mask 0xFF, alpha 0.5; B with mask 0xE8, alpha 1.0; C with mask 0xE8, alpha 1.0).

B's data is opaque, but does not completely cover the pixel, having a mask of 0xE8. First, B's color and alpha are scaled by its mask coverage, 0.5. The result is blended with the data from the previous pass using an AoverB operation [14]:

I = I_Front + (1 - alpha_Front) * I_Back

where I_Back is the color component intensity of the back object (B), I_Front is the color component intensity of the front object (A) pre-multiplied by alpha_Front (A's alpha), and (1 - alpha_Front) is the transmission coefficient of the front object.

Blending the mask is a more complex operation. In the A-buffer algorithm the new mask is the bitwise OR of the two masks. In our algorithm, the coverage and transmission coefficients of the two objects are compared and the masks are either bitwise ORed or one mask is selected, depending on the results of the comparison. When the front-most layer's mask completely covers the new mask and the new data is more opaque than the front-most layer's, the new mask replaces the previous mask. The opaqueMask flag is asserted for that pixel and is stored in the RAM. Object C is composited during the third pass. It is opaque and has a mask of 0xE8. When the opaqueMask flag is asserted, C's mask is clipped by the previously composited data's mask, resulting in a clipped mask of 0x0. Then C's color and alpha are scaled by the clipped mask coverage and blended behind the front-most layer using the AoverB function.

Figure 5 shows an image generated with the opaqueMask flag. The scene contains a mostly-transparent layer covering an opaque black triangle which is closer than an identical opaque white triangle. The background is an opaque red layer. Figure 5b was generated with the opaqueMask feature disabled. Notice the fringing artifacts that result when the mask of the white triangle is not clipped by the black triangle.

Figure 5. Overlapping triangles (a) using the opaqueMask flag and (b) without using the opaqueMask flag.

Implementing the A-buffer algorithm in hardware requires saving the mask for each layer. Instead of combining the masks, each layer's mask and alpha are saved to properly shade subsequent layers. Unfortunately, the number of layers is unbounded, as is the memory required to store them, so this is not an option. Using the opaqueMask flag to control the blending of the colors and masks allowed us to conserve memory and produce high-quality images with some transparency.

Figure 6 shows a more common type of scene rendered with antialiasing enabled. The artifacts which appear where the blue cone overlaps the red cone are a result of the loss of per-layer mask and alpha data when the mask coverage is combined with the alpha as each layer is composited from front-to-back. As mentioned in [2], it is not possible to correctly antialias the intersection of the cones using an alpha antialiasing algorithm.

Figure 6. Antialiasing artifacts.

Figure 7 shows a more complex scene rendered with and without antialiasing.
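The per-pixel blend described in this section can be modeled in software as follows. This is a simplified sketch: colors are premultiplied scalars, the rule that asserts the opaqueMask flag is omitted (the flag is taken as an input), and an 8-sample coverage fraction stands in for the hardware's mask arithmetic:

```python
def composite_behind(front, back, opaque_mask_flag):
    """Blend a new (back) fragment behind previously composited (front)
    data, front to back. Each of `front` and `back` is a
    (premultiplied_color, alpha, 8-bit coverage mask) tuple."""
    f_color, f_alpha, f_mask = front
    b_color, b_alpha, b_mask = back
    if opaque_mask_flag:
        b_mask &= ~f_mask & 0xFF        # clip by the composited mask
    cov = bin(b_mask).count("1") / 8.0  # fraction of subpixels covered
    b_color *= cov                      # scale color and alpha by coverage
    b_alpha *= cov
    # AoverB: I = I_Front + (1 - alpha_Front) * I_Back
    out_color = f_color + (1.0 - f_alpha) * b_color
    out_alpha = f_alpha + (1.0 - f_alpha) * b_alpha
    out_mask = f_mask | b_mask
    return out_color, out_alpha, out_mask
```

Note that the coverage scale for a mask of 0xE8 (4 of 8 bits set) is 0.5, matching the scale factor applied to object B in the text's example, and that clipping C's mask of 0xE8 against a composited mask of 0xE8 leaves 0x0, so C contributes nothing.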


Figure 7. Not antialiased (a) and antialiased (b).

2.5 Antialiased Intersections

It is possible to antialias the intersections by calculating or reconstructing subpixel depth values for each layer. Methods for doing this are described by [2] and [3]. The method described in the A-buffer algorithm works for the intersection of 2 objects, but breaks down when more than two objects intersect unless the Zmin and Zmax values are saved for each layer. The method for antialiasing intersections used in the Reality Engine uses the x and y slope values and a single depth sample per layer to reconstruct subpixel depth values. This is more accurate than the A-buffer method, but is computationally intensive and requires storing subpixel depth values in the z buffer. Neither of these solutions was feasible to implement in an architecture with limited memory, particularly since antialiasing CSG and shadows requires maintaining subpixel CSG and shadow data in the depth buffer. A single CSG and shadow sample requires 12 bits of data, so 8 sub-samples would add 84 bits of data for each of the two layers, for a total of 168 bits for each pixel. In order to accommodate that much data on-chip we would have to reduce the partition size, which would decrease the performance of non-antialiased scenes.

If Perfection Is Required

There are two methods which can be used to render a high-quality antialiased image. Supersampling can be achieved by rendering each partition at higher resolution and filtering it as a post-processing operation (using a separate image processing ASIC). An alternative is to use an accumulation buffer method to antialias the scene by rendering it several times using different subpixel offsets. The result of each rendering pass can be accumulated in an off-chip buffer for each partition.

2.6 Hidden Surface Removal

With this architecture, unlimited visible layers can be rasterized using multiple passes, as in the algorithms described by [11] and [10]. Mammen's algorithm requires that all of the opaque objects be rasterized before transparent objects are rendered. As with the architecture described by Kelley, this architecture does not require transparent objects to be rasterized separately from opaque objects. This is particularly important since a texture mapped object's alpha values cannot be determined until they are retrieved from memory. Unlike the architecture described by Kelley, objects in a partition are not depth sorted before shading. Kelley's architecture stored four layers of depth and required multiple passes to sort additional layers. Consequently, the number of layers that can occupy the same depth is limited to 3 or 4. This architecture can render an unlimited number of layers at any depth.

First Pass Sorting

Scenes that contain only opaque data and no shadows or CSG can be resolved in a single pass through the pipeline. Otherwise multiple passes are required to resolve the final color for each pixel in the scene. The operations that occur in the second and subsequent passes through the rendering pipeline differ from those that occur in the first.

During the first pass each input pixel depth is compared with the depth of the front-most object received so far for the pass (see the Appendix for pseudo-code). The depth is a 25-bit floating point value with a 19-bit mantissa and a 6-bit exponent. Objects are not sorted before compositing begins. Instead, any object which passes the depth sort test is passed down the pipeline immediately. When the compositing engine receives the coverage mask, opacity, and color data for a pixel, it stores the data in the image buffer. Any data already in the buffer is overwritten (discarded), since it would fall behind the new data. Before being discarded, though, it is examined to determine whether it would have contributed to the final pixel color. If so, a flag is set for that pixel which is used at the end of the pass to initiate another rendering pass of the same partition. After the last object enters the pipeline, the Display List Traversal sends a synchronization token before moving to the next partition. The Composite block interrupts the Display List Traversal when it receives that synchronization token (labeled Resubmit in Figure 1), and the Display List Traversal determines whether another pass is required before moving to the next partition. The latency incurred by waiting for the interrupt can be minimized with a predictive algorithm that begins re-fetching the data for another pass or pre-fetching data for the next partition.

Unlimited Equal Depth Layers

It is possible to composite an unlimited number of layers which have equal depth in a single pass. Equal layers are identified during the depth sort and are composited using additive blending after the new layer is clipped by the coverage mask of the existing layer. In the current implementation, equality occurs when the depth values match precisely. Performance would be improved if a robust method for combining objects which have nearly equal depths could be used (refer to the previous section on fragment merging).

Subsequent Pass Sorting

The sort and composite operations are modified when multiple passes are required. The data in the depth and image buffers is retained at the end of each pass. Second depth and image buffers are used for storing the input pixel data. During the depth sort, each object is compared with the final depth of the previous pass and the front-most depth of the current pass. If the object's depth falls between the two, or if it matches the front-most depth of the current pass, it is passed to the composite block. The composite block blends the colors of any objects which are equal in depth using an AoverB blend. The masks are combined and the results are stored in the second buffer. As in the case of the first pass, writing the second buffer sometimes causes data which was composited earlier during the same pass to be overwritten. It is necessary to determine if that discarded data would have contributed to the final pixel color. Again, a flag is set for that pixel which is used at the end of the pass to initiate another rendering pass of the same partition. When the input object's depth is equal to that of the previously composited data, its coverage mask is clipped by the previous pass's coverage mask and blended (AoverB) with the data in the composite buffer. This architecture requires all data for a partition to be submitted during each pass. In the case of equal depths it is also important that the data arrive in the same order, since the first object at a particular depth is considered to be in front of any objects at the same depth which are received later (first-come-first-rendered).
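The multiple-pass resolution described in this section can be modeled for a single pixel. This is a deliberately narrow sketch under stated assumptions: opaque, fully covering fragments only (masks, transparency, shadows and CSG omitted), with exactly one new front-most layer resolved per pass:

```python
def render_partition(fragments, max_passes=16):
    """Multi-pass front-to-back resolution for one pixel. `fragments`
    is the submission-order list of (depth, color) pairs; every pass
    re-submits all of them, and each pass resolves the front-most
    fragment strictly behind the depth finalized by the previous pass.
    No pre-sorting of the fragments is required."""
    resolved = []           # colors composited, front to back
    prev_z = float("-inf")  # final depth of the previous pass
    for _ in range(max_passes):
        front = None        # front-most fragment behind prev_z this pass
        for z, color in fragments:
            if z > prev_z and (front is None or z < front[0]):
                front = (z, color)
        if front is None:
            break           # nothing left behind prev_z: done
        resolved.append(front[1])
        prev_z = front[0]
    return resolved
```

In the real hardware a pass covers a whole partition, many pixels finish in the first pass, and the contributed-data flag decides whether a resubmit is needed at all, which is why the measured average is only 1.4 passes.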

In order to reduce the size of the display list, triangles can share vertices. Through the use of the QuickDraw™ 3D and QuickDraw™ 3D RAVE TriMesh data structures, vertex sharing is easily achieved even after an object has been clipped and projected onto the screen. In the best case, vertex sharing permits each new vertex to define two new triangles. In that case the number of triangles is double the number of vertices.
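The best case above can be illustrated with a grid-style shared-vertex index list (an illustrative structure, not the actual QuickDraw 3D TriMesh layout): each interior vertex step closes two triangles, so the triangle count approaches twice the vertex count for large meshes:

```python
def grid_trimesh(cols, rows):
    """Indexed triangle list for a (cols x rows) vertex grid, row-major.
    Each grid cell contributes two triangles sharing its four corner
    vertices, so a 3x3 grid of 9 vertices yields 8 triangles."""
    tris = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c
            tris.append((i, i + 1, i + cols))             # upper-left tri
            tris.append((i + 1, i + cols + 1, i + cols))  # lower-right tri
    return tris
```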

2.7 Image Partitioning

Several screen-partition-based rasterizers have been proposed or built [13,6,15,18]. A motivating factor in using image partitioning is that the depth and image buffers can be stored on-chip. This improves performance and reduces the pin count of the ASIC (assuming the buffers would have dedicated ports for performance reasons). As in the case of other partition-based renderers [10,17], performance is also improved since multiple passes are used to resolve only the portions of the final image which contain the greatest depth complexity.

The primary disadvantage of partition-based renderers is the inherent latency resulting from the need to construct a bucket-sorted display list. Another disadvantage of partition-based rasterizers is that data which appears in multiple partitions must be transferred from system memory multiple times or cached locally. Some partition-based designs [5,10] used a one-dimensional partition. In the best case an object had to be transferred once for every scanline it touched. It is important to exploit the inherent two-dimensional image coherence to reduce the system memory bandwidth required to transfer the object data.

3 SYSTEM ARCHITECTURE

The rendering tasks are divided between the host CPU and the 3D accelerator to balance the overall system. The host CPU performs the transformation, clipping and shading functions. The algorithms which perform these functions are described in detail in [1,4,9]. The host also generates the display list, using a linked list structure to link the object lists for each partition. The hardware accelerator performs the rasterization by following the linked list. It DMAs the object data from system memory and only interrupts the CPU at the end of the list/frame. The ASIC rasterizer is a pipelined design, shown in Figure 1. It reads triangle data and outputs texture-mapped, shaded pixels. The depth and image buffers are on-chip to minimize latency and maximize bandwidth. The ASIC is clocked at 100 MHz.

3.1 Display List Traversal

This module reads the triangle vertex data by following a linked list which was constructed by software during the partition sort.

Two Dimensional Bucket Sorting

After the triangles are projected into screen space they are partitioned into each 16x32 partition that intersects the triangle. Since determining which partitions a triangle belongs in can be computationally expensive, two different algorithms are used depending on the size of the triangle. Triangles that are approximately the size of a single partition or smaller are unlikely to span more than 1 or 2 partitions and are hence easier to sort. Small triangles are included in the bucket for each partition which is overlapped by the triangle's bounding box. Triangles that are much larger than a partition are more difficult, since their orientation will affect which partitions they intersect. It is important to avoid adding triangles to partitions that the triangle does not intersect; this can waste memory, bandwidth and performance. The edge slopes of large triangles are used to compute which partitions they cover.
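The small-triangle fast path can be sketched as a bounding-box bin walk over the 16x32 partition grid (the large-triangle edge-slope path is not shown):

```python
def bucket_small_triangle(bbox, part_w=16, part_h=32):
    """Bucket a small triangle by its screen-space bounding box: the
    triangle is added to every partition its box overlaps. `bbox` is
    (xmin, ymin, xmax, ymax) in pixels; partitions are identified by
    (column, row) in the 16x32-pixel partition grid."""
    xmin, ymin, xmax, ymax = bbox
    buckets = []
    for py in range(int(ymin) // part_h, int(ymax) // part_h + 1):
        for px in range(int(xmin) // part_w, int(xmax) // part_w + 1):
            buckets.append((px, py))
    return buckets
```

For a large, thin diagonal triangle this over-reports partitions badly, which is exactly why the text uses edge slopes for triangles much larger than a partition.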

3.2 Plane Equation Setup

The scan conversion of the triangles is performed using plane equation evaluation as in the PixelPlanes design [7]. A plane equation describes the relationship between a plane in screen space and any three points in the plane. Algebraically, a plane can be described using this equation:

z = a * x + b * y + c

The equation can be evaluated for any point (x, y) in screen space. The linear relationship between the three points (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) and the above plane equation is:

[z0]   [x0 y0 1] [a]
[z1] = [x1 y1 1] [b]
[z2]   [x2 y2 1] [c]

The plane equation setup module takes the vertex data and generates the coefficients, a, b, and c, for each parameter's plane equation. The parameters include the color, alpha, depth, and texture map coordinates. The plane equation setup eliminates the standard two passes of linear interpolation (lirp) in both x and y directions and is more accurate. The coefficient calculation is implemented in a systolic array, so the internal bandwidth is greater than a two pass lirp implementation [10]. Once the coefficients are calculated, they are passed down the pipeline and stored in the module where they will be used to evaluate the associated parameters; for example, the depth coefficients are stored in the Hidden Surface Removal module. The coefficients can be passed down a separate pipeline at a rate of one coefficient per clock cycle and double-buffered in the evaluation module, thus minimizing the overhead and bandwidth

associated with plane equation parameter passing. The use of plane equation evaluation for a given shading parameter at a specific pipeline stage is more efficient than passing down the evaluated parameters for each pixel during every clock cycle.
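The setup computation amounts to solving the 3x3 system above once per interpolated parameter. A direct Cramer's-rule sketch of that solve (the hardware uses a systolic array, not this scalar form):

```python
def plane_coeffs(v0, v1, v2):
    """Solve for plane-equation coefficients (a, b, c) such that
    z = a*x + b*y + c passes through three vertices (x, y, z),
    by Cramer's rule on the 3x3 system [x y 1][a b c]^T = z."""
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = v0, v1, v2
    det = x0 * (y1 - y2) - y0 * (x1 - x2) + (x1 * y2 - x2 * y1)
    a = (z0 * (y1 - y2) - y0 * (z1 - z2) + (z1 * y2 - z2 * y1)) / det
    b = (x0 * (z1 - z2) - z0 * (x1 - x2) + (x1 * z2 - x2 * z1)) / det
    c = (x0 * (y1 * z2 - y2 * z1) - y0 * (x1 * z2 - x2 * z1)
         + z0 * (x1 * y2 - x2 * y1)) / det
    return a, b, c
```

The same solve is repeated for each parameter (depth, each color channel, alpha, texture coordinates), substituting that parameter's per-vertex values for z; a degenerate (zero-area) triangle makes det zero and has no plane.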

3.3 Hidden Surface Removal

This module depth sorts a pixel per clock cycle. There is enough storage for two layers of depth, shadow, and CSG data. During the first pass of a partition only one layer is used.

3.4 Texture Map Lookup The texture mapping module implements traditional mipmapped bilinear and trilinear texturing [19]. A target square or non-square texture map can be up to 2048 texels on a side and either 16 or 32 bits per texel. Each texture mapped pixel produced by this system results from applying a filter function to either four (bilinear) or eight (trilinear) texels. The filter function is a linear interpolation between texel samples, weighted by the fractional portion of the horizontal (u) and vertical (v) lookup indices. Our goal was to perform trilinear texture mapping at an average run rate of one pixel per clock cycle. The difficulty in achieving this performance goal arises from the fact that the interpolation operation needs an input feed rate of four or eight 32-bit texel colors per clock cycle. Building a texture memory subsystem that could sustain this bandwidth (1600 MBytes per second sustained random accesses for bilinear mapping, 3200 MBps for trilinear) is not feasible in a personal computer today. Instead, we chose to build an on-chip texel cache to provide most of the necessary bandwidth. The texel cache harnesses two different types of pixel to pixel temporal locality from the bilinear texel access patterns. We refer to these types as "most recent four" and "line to line". Both patterns of temporal locality arise from our use of mipmapping, which limits the pixel to pixel sampling stride through texture space. In our texel cache system, we use two different but coupled cache modules. One cache module captures most recent four reuse, and the other captures line to line reuse. Two such cache pairs provide the eight necessary texel samples for trilinear texture mapping. The most recent four access pattern is illustrated in Figure 8. 
In this figure, a pixel associated with a particular draw object required the center 2x2 block of texels (shown in blue) for bilinear interpolation; the next pixel associated with that object will need a similar 2x2 block to be sampled (shown as a bolded square outline). But, because the sampling stride is constrained, we are sure to re-use at least one of the four texels from the previous pixel. On average we can expect to hit about two texels per pixel when caching the most recent four texels. The structure needed to implement this cache is a 4-entry, 4-port fully associative cache with an always-replace write policy for all four entries.

Figure 8. Most recent four texel reuse.

Line to line temporal locality is illustrated in Figure 9. The figure shows the correspondence between screen space and texture space for a triangle's bilinear sampling trace. Each consecutive pixel on a scan line has an increasing X value (i.e., line A starts at X=1 and ends at X=8), and six scan lines (A through F) compose the example triangle. Bilinear sample points occur at the intersections of the A through F arrows and the X=1 through X=8 lines. The texels sampled for line A are shaded in the texture space portion of the figure; those texels sampled by line A at X=4 and again by line B at X=4 are shaded in red. Note that there is no simple spatial locality for a texture cache to utilize, short of the cache itself embodying a texture lookup mechanism. However, because our rasterization is horizontally bounded (by the partition size), there is temporal locality that can be captured by a relatively small associative-type cache.

A 4-port fully associative cache with an LRU (least recently used) replacement policy could capture both line-to-line and most-recent-four texel reuse, but such a structure would be unnecessarily large. The line-to-line texel reuse will always be correlated through X. As an example, in Figure 9 two of the texels sampled by line A (the upper two shown in red) were sampled at X=4; the remaining re-used texel was sampled at X=5. By limiting the associative search aperture to sampled texels adjacent to and including X (X+1, X, and X-1), we simplify the lookup and compare hardware: only a three-read-port, one- (or two-) write-port cache tag memory is needed, with at most twelve comparators. A fully associative cache would require an N-read-port, four-write-port cache tag memory with N*4 comparators.

Figure 9. Line to line texel reuse, shown in both screen space and texture space.
By selectively saving one (or two) of the four texel samples for each pixel, indexed by X, the sample will be available for the next line of a particular draw object. The choice of which texel to save is easily determined by examining the sign bits of the object's u and v gradients. For example, if the next line in pixel space is going to sample down and to the right of the current line in texel space, then we save the bottom-right of the four current texels (or the bottom two texels, given a two-write-port cache tag memory). To determine whether the line-to-line cache has a hit, the pixel space indices X+1, X, and X-1 are used to look up the previously stored addresses (stored as texel cache tags) for the one selected texel associated with each X value. A match indicates a hit. At the end of that cycle, a new cache tag (the selected texel address) is written to the cache tag store.

In our implementation, we chose to allow the line cache to report exactly one hit from the possible three texels examined at X+1, X, and X-1. To minimize the overall cache performance degradation due to redundant hits between the most-recent-four and line caches, an inhibit signal is passed from the four cache module to the line cache module that suppresses reporting of redundant texel hits. In this way, the line cache will always report a unique hit if one is available.

With the four cache hitting an average of two texels per pixel, and the line cache frequently hitting a unique texel per pixel, an average of nearly three of the four needed texels can be supplied from cache in the high resolution mipmap level. The low resolution level has a texel stride half that of the high resolution level and therefore achieves an even better average hit rate. The remaining one to two texels needed per trilinear pixel, on average, can be comfortably read from our pipelined SDRAM memory system. Due to the 10 to 12 cycle read latency of the SDRAM memory system, it was necessary to split the cache tags and cache data in the pipeline.
The cache tag control module shown in Figure 10 contains all of the cache tag state for data that will arrive, and be cached, some time later in the cache data controller. The synchronizer module aligns the arrival of texel color data from the SDRAM memory system with the arrival of cache tag data from the Texel Cache Tag Control module.

Figure 10. Split cache tag-data architecture: Texture Lookup ALUs feed the Texel Cache Tag Control; tags pass through a Tag FIFO while texel data from the Pipelined SDRAM Memory System passes through a Data FIFO and synchronizer into the Texel Cache Data Control and Color Blend ALUs.

The Texture Lookup ALUs in Figure 10 calculate texel sample addresses and pass these addresses to the Texel Cache Tag Control: four addresses per cycle are generated for each bilinear pixel, and eight for each trilinear pixel. At the bottom of the texel cache pipeline a corresponding number of colors (four or eight) is presented to the Color Blend ALUs, where color highlights and color modulation are applied to the raw bilinear or trilinear pixel color. The final rendered pixel color is passed further down the pipeline for compositing with other pixels.

3.5 Front-to-back Compositing

The final pixel processing is performed by compositing the incoming pixel layers in front-to-back order, after which the resulting ARGB values are output to the frame buffer. There is enough storage for two layers of image data; during the first pass only one layer is used. The second layer contains the final image data for the previous partition and can be scanned out of the ASIC while the next partition is being composited (at least during the first pass). This is a standard double-buffering technique.

3.6 Scalable Performance

A low cost, single-card implementation of the 3D accelerator is shown in Figure 11. Texture map data is stored in the SDRAM connected to the 3D accelerator. The SDRAM connected to the I/O interface is optional; it can be used to store additional texture map data or vertex data. Storing texture map data locally reduces the PCI bandwidth consumed when texture mapping is used. Storing vertex data locally reduces PCI bandwidth when triangles cross partition boundaries, since they are loaded onto the card only once; it also reduces PCI bandwidth if resubmission is required to resolve the final pixel values for a partition.

Figure 11. Low cost card implementation: PCI connects to an I/O interface and the 3D accelerator, each with 2 MB of SDRAM.

To improve performance it is possible to use multiple rasterizer ASICs in parallel, as described by Fuchs [6]. The I/O interface ASIC can drive up to four rasterizer ASICs, as shown in Figure 12. The image buffer outputs of each rasterizer are merged by a Frame Buffer Interface (not designed as part of this project), which transfers each partition to the frame buffer (VRAM). The frame buffer must be local to the rasterizer ASICs, since PCI could not sustain the bandwidth required to support a 1280x1152 display at 30 frames/sec (177 MBytes/sec). A typical input bandwidth is 100 MBytes/sec, which can be sustained by 32-bit 33 MHz PCI.

Figure 12. High performance card implementation: the I/O Interface (with 64 MB SDRAM) drives four rasterizers, each with its own 64 MB SDRAM; a Frame Buffer Interface merges their outputs into VRAM for the display.

4 CONCLUSIONS

These are the main design goals met by the system:

High Performance Antialiasing

Performance with the modified A-buffer antialiasing enabled is only 40% slower than with antialiasing disabled. This is a much smaller degradation than that incurred by the accumulation buffer or supersampling methods of antialiasing.

Low Memory Bandwidth and Capacity

The off-chip memory requirement for the depth buffer is eliminated. There is an on-chip image buffer, so the output is a write-only path to the frame buffer memory. Implementing these buffers on chip also reduces memory bandwidth. The on-chip texture cache reduces the memory bandwidth needed for texture mapping. Dedicated memory can be used for texture and vertex data to further reduce system memory bandwidth and improve rasterization performance.

Balance Between CPU and Rendering ASIC

The ASIC interrupts the CPU only at the end of each frame. This is required so the CPU can process the 3D geometry as quickly as possible. Using dedicated memory with the 3D accelerator also improves system performance, since it reduces the bandwidth load on system memory.

5 FUTURE WORK

As mentioned, it is necessary to develop a robust method for merging pixel fragments. This will reduce the number of passes required to perform antialiasing, improving image quality and performance. The quality of antialiasing can be further improved by increasing the number of subpixel samples from 16 to 64. A method of antialiasing interpenetrating objects must also be incorporated. Finally, additional data should be included in the z and image buffers to properly antialias shadows and CSG.

ACKNOWLEDGMENTS

The authors wish to thank Paul Baker and Jack McHenry for supporting this project in Apple's Interactive 3D Graphics group. Thanks to Bill Garrett and Sun-Inn Shih for their reviews.

APPENDIX: PSEUDO-CODE

The following pseudo-code summarizes the rendering algorithm:

    RenderFrame()
    {
        /* object loop: transform, shade, sort */
        foreach (object) {
            Transform(object);
            Shade(object);
            PartitionSort(object);
        }
        /* partition loop: rasterize */
        foreach (partition) {
            InitPartition(partition);
            Rasterize(partition);
        }
    }

The object loop is executed by the host CPU, and the partition loop functionality is embodied in the ASIC.

The following pseudo-code is a simplified version of the multipass rasterization loop:

    /* rasterize loop: one pass; firstDepth holds the depth
       resolved by the previous pass */
    foreach (object) {
        foreach (pixel) {
            if (depth_pixel[x][y] > firstDepth[x][y]) {
                if (depth_pixel[x][y] !> depth_buf[x][y]) {
                    depth_buf[x][y] = depth_pixel[x][y];
                    composite_buf[x][y] = CompositeFirst(pixel);
                }
            }
        }
    }

REFERENCES

[1] Kurt Akeley and T. Jermoluk. High-Performance Polygon Rendering. Computer Graphics (SIGGRAPH 88 Conference Proceedings), volume 22, number 4, pages 239-246. August 1988.

[2] Kurt Akeley. RealityEngine Graphics. SIGGRAPH 93 Conference Proceedings, pages 109-116. August 1993. ISBN 0-89791-601-8.

[3] Loren Carpenter. The A-buffer, an Antialiased Hidden Surface Method. Computer Graphics (SIGGRAPH 84 Conference Proceedings), volume 18, number 3, pages 103-108. July 1984. ISBN 0-89791-138-5.

[4] Michael Deering and S. Nelson. Leo: A System for Cost Effective 3D Shaded Graphics. SIGGRAPH 93 Conference Proceedings, pages 101-108. August 1993.

[5] Michael Deering, S. Winner, B. Schediwy, C. Duffy and N. Hunt. The Triangle Processor and Normal Vector Shader: A VLSI System for High Performance Graphics. Computer Graphics (SIGGRAPH 88 Conference Proceedings), volume 22, number 4, pages 21-30. August 1988.

[6] Henry Fuchs. Distributing a Visible Surface Algorithm over Multiple Processors. Proceedings of the 6th ACM-IEEE Symposium on Computer Architecture, pages 58-67. April 1979.

[7] Henry Fuchs et al. Fast Spheres, Shadows, Textures, Transparencies, and Image Enhancements in Pixel-Planes. Computer Graphics (SIGGRAPH 85 Conference Proceedings), volume 19, number 3, pages 111-120. July 1985.

[8] Paul Haeberli and Kurt Akeley. The Accumulation Buffer: Hardware Support for High-Quality Rendering. Computer Graphics (SIGGRAPH 90 Conference Proceedings), volume 24, number 4, pages 309-318. August 1990. ISBN 0-89791-344-2.

[9] Chandlee Harrell and F. Fouladi. Graphics Rendering Architecture for a High Performance Desktop Workstation. SIGGRAPH 93 Conference Proceedings, pages 93-100. August 1993.

[10] Michael Kelley, K. Gould, B. Pease, S. Winner, and A. Yen. Hardware Accelerated Rendering of CSG and Transparency. SIGGRAPH 94 Conference Proceedings, pages 177-184. 1994.

[11] Abraham Mammen. Transparency and Antialiasing Algorithms Implemented with the Virtual Pixel Maps Technique. IEEE Computer Graphics and Applications, 9(4), pages 43-55. July 1989. ISSN 0272-1716.

[12] Steven Molnar, John Eyles, and John Poulton. PixelFlow: High-Speed Rendering Using Image Composition. Computer Graphics (SIGGRAPH 92 Conference Proceedings), volume 26, number 2, pages 231-240. July 1992.

[13] F. Park. Simulation and Expected Performance Analysis of Multiple Processor Z-Buffer Systems. Computer Graphics (SIGGRAPH 80 Conference Proceedings), pages 48-56. 1980.

[14] Thomas Porter and Tom Duff. Compositing Digital Images. Computer Graphics (SIGGRAPH 84 Conference Proceedings), volume 18, number 3, pages 253-259. July 1984. ISBN 0-89791-138-5.

[15] PowerVR. NEC/VideoLogic, 1996.

[16] Andreas Schilling. A New Simple and Efficient Antialiasing with Subpixel Masks. Computer Graphics (SIGGRAPH 91 Conference Proceedings), volume 25, number 4, pages 133-141. July 1991.

[17] Jay Torborg and James Kajiya. Talisman: Commodity Realtime 3D Graphics for the PC. SIGGRAPH 96 Conference Proceedings, pages 353-363. 1996.

[18] G. Watkins. A Real-Time Visible Surface Algorithm. Computer Science Department, University of Utah, UTECH-CSC-70-101. June 1970.

[19] Lance Williams. Pyramidal Parametrics. SIGGRAPH 83 Conference Proceedings, pages 1-11. July 1983.