Vector graphics on GPU

Despite vector graphics being used on every computer with a screen, rendering vector shapes and text is still mostly a task for the CPU. This is wrong and needs to change. Here I describe a general approach to rasterization and how we can ask the GPU for help when rendering vector paths.

Basic rasterization operation.

First, let's go over the general principle of rasterizing vector shapes. Here is an image with a clockwise circle in it, and I'll demonstrate a very rudimentary algorithm for drawing one row of pixels of that image. For simplicity, anti-aliasing will be left out at this time.

For each pixel in this row we only need to know if its center is inside this shape or outside. Outside pixels are left untouched, while inside pixels are filled with the color of that shape.

We start with a winding number set to zero. The winding number is an integer which indicates whether the pixel center is inside or outside of the shape.

Next, imagine a horizontal ray being cast through the centers of all pixels in this pixel row. The ray goes from infinity on the left side to infinity on the right side. Every time an intersection with an upward segment is found, the winding number is increased by 1. When an intersection is found with a downward segment, the winding number is decreased by 1. As we go from one pixel to another, left to right, if the current winding number at the center of a pixel is zero, the pixel is not visible. If it is something else, the pixel is filled according to the fill rule. For example, with the non-zero fill rule, all pixels with a winding number other than zero are filled with color.

In this picture the circle is drawn clockwise, but if it was drawn counterclockwise, the intersection on the left would decrease the winding number by 1 instead of increasing it.

Now, pixel centers between both intersections have winding numbers of -1, but according to the non-zero winding rule, they are filled the same way as for a clockwise circle.

Note that intersections are found while the ray travels from left to right, so they are automatically sorted by the X position of each intersection.

Repeat the same operation for each row of pixels of the output image, and in the end the image will be complete.
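To make these steps concrete, here is a minimal sketch of the row-by-row algorithm in Python. This is my own illustration, not the renderer's actual code; it handles only straight line segments and the non-zero fill rule:

```python
def rasterize_row(segments, y, width):
    """Fill one pixel row, no anti-aliasing. segments is a list of
    ((x0, y0), (x1, y1)) line segments forming a closed path."""
    # Cast a horizontal ray through the pixel centers at height y and
    # record every crossing: +1 for upward segments, -1 for downward.
    crossings = []
    for (x0, y0), (x1, y1) in segments:
        if y0 == y1:
            continue  # horizontal segments never cross the ray
        if min(y0, y1) <= y < max(y0, y1):
            t = (y - y0) / (y1 - y0)
            crossings.append((x0 + t * (x1 - x0), 1 if y1 > y0 else -1))
    crossings.sort()  # sorted by X, just like the ray travels

    # Walk the row left to right, updating the winding number as we
    # pass each crossing; non-zero winding means the pixel is filled.
    row, winding, i = [], 0, 0
    for px in range(width):
        center = px + 0.5
        while i < len(crossings) and crossings[i][0] <= center:
            winding += crossings[i][1]
            i += 1
        row.append(1 if winding != 0 else 0)
    return row
```

For a counterclockwise path the winding numbers come out negative, but non-zero is still non-zero, so the filled pixels are the same.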

Of course, state-of-the-art rasterizers have a lot of optimizations, and actual algorithms may look very different from this, but the basic idea is the same and it works for Bézier paths of any complexity.

Now do this on GPU.

Unfortunately, GPUs work a bit differently from CPUs. A CPU is like one person reading a book. You start at page 1 and read all the pages one by one, in order. But if you read books like I do, it may take up to 6 years to finish one book. So you come up with a faster way. You hire 14,000 people to read that book, each reading a single line from it in parallel. They finish in about 8 seconds, but there are new challenges. You need a way to quickly assign which line to read to which person, and when they finish, they somehow need to tell you the whole story in order. They cannot all speak at the same time randomly. These are the kinds of challenges you face when programming for GPUs.

GPUs are extremely parallel. They can run thousands of threads at the same time. For example, the original Apple M1 GPU can execute 24,576 threads in parallel. Ideally, data should be prepared for the GPU in such a way that all these threads can work with small amounts of information and produce an even smaller output of fixed size. This is not a requirement, especially for more modern GPUs, but I like to keep it simple because of my deep distrust of graphics drivers.

Rasterize one pixel at a time.

An intuitive way to do rasterization on GPU is one pixel at a time. Screens contain a lot of pixels and GPUs can run a lot of threads. So it looks like each pixel can be processed in parallel.

To do this, we need to modify the algorithm to work on each pixel separately instead of processing each shape and each segment of it in order in a single thread.

Now, for each pixel, go over all segments of a shape and if a segment intersects with this imaginary horizontal line going through the center of that pixel, the winding number is modified. Only intersections on the left side of the pixel center need to be considered. Intersections on the right side of the pixel center can be completely ignored. The order in which segments are processed does not matter.

In this example there is only one intersection to the left of the highlighted pixel center. So if we were rendering it, this intersection would be contributing to the final winding number of this pixel by increasing it by one.

After processing all intersections on the left side of the pixel center, the final winding number is used to determine if a pixel needs to be filled according to the fill rule of that shape.
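This per-pixel formulation can be sketched in a few lines of Python. Again this is my illustration of the idea, not the actual shader; one call corresponds to the work of one GPU thread:

```python
def winding_at(segments, px, py):
    """Winding number at the center of pixel (px, py), computed the way
    a single GPU thread would: visit every segment in any order and
    count only ray crossings to the left of this pixel's center."""
    cx, cy = px + 0.5, py + 0.5
    winding = 0
    for (x0, y0), (x1, y1) in segments:
        if y0 == y1 or not min(y0, y1) <= cy < max(y0, y1):
            continue  # no crossing with the ray through this pixel row
        t = (cy - y0) / (y1 - y0)
        x = x0 + t * (x1 - x0)
        if x <= cx:  # crossings to the right are ignored entirely
            winding += 1 if y1 > y0 else -1
    return winding
```

Note that the loop never depends on the order of segments, which is what makes it safe to run one instance per pixel in parallel.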

Add anti-aliasing.

Up to this point, anti-aliasing was left out of the rasterizer for simplicity. But sharp shape edges have a 1995 vibe to them. People demand smooth pixels in 2022. Fortunately, there are methods that can achieve smooth edges without heavily modifying the entire rasterizer.


One of the simplest methods of anti-aliasing is supersampling. It does not require modifying the rasterizer at all. The idea is to draw the image at a bigger resolution first, then scale the result down to the desired size. While scaling down, an average of a few pixels produces a single output pixel, making edges look smooth. Supersampling, however, has a few problems. The biggest one is that it is extremely expensive. To achieve somewhat decent quality, we would have to render at 4 times the original size in both directions. This means rendering a 16 times larger image and taking an average of 16 pixels to produce one pixel in the final image, which in turn means 16 times more memory and at least 16 times longer to render the image. And in the end, 16 samples per pixel is not that good. A lot of people will quickly notice the low quality of anti-aliasing at the edges of shapes with only 16 samples per pixel.
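The scaling-down step is just a box filter. As a rough sketch (function name and layout are mine), a 4× supersampled buffer is reduced by averaging each 4 × 4 group of samples:

```python
def downsample(hires, out_w, out_h, factor=4):
    """Box-filter a supersampled grayscale image stored as a flat
    row-major list; every output pixel averages factor * factor
    samples (16 of them when factor is 4)."""
    hi_w = out_w * factor
    out = []
    for y in range(out_h):
        for x in range(out_w):
            total = 0.0
            for sy in range(factor):
                for sx in range(factor):
                    total += hires[(y * factor + sy) * hi_w + x * factor + sx]
            out.append(total / (factor * factor))
    return out
```

The inner loops make the cost argument obvious: 16 samples read, and earlier rendered, for every single output pixel.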

Analytic anti-aliasing.

A much better approach for vector graphics is analytic anti-aliasing. It turns out to be not just almost as fast as a rasterizer with no anti-aliasing, but also much higher quality than what can practically be achieved with supersampling. It requires small changes to the rasterization algorithm, but nothing fundamental.

I introduced the concept of a winding number for deciding whether a pixel is inside the shape or not. This concept can be extended to also include information about how much of a pixel is covered by the shape, not just a binary covered or not covered.

The value indicating how much a shape contributes to the coverage of a pixel will be called alpha from now on. Just like the winding number, alpha is not the same as pixel opacity. It will need to be converted to opacity according to the fill rule.

From here on I will assume floating point arithmetic is used and the pixel size is 1.0 × 1.0. 100% of something is equal to 1.0, 50% is equal to 0.5.

To calculate alpha for any given pixel, two values need to be calculated for each segment of a shape. The first value is called cover and the other is called area. If a segment is to the right of the pixel's right edge, neither value needs to be calculated. Obviously, if the segment is below or above the pixel, no computation is needed either. The remaining situation is when the segment is inside the pixel, to the left of the pixel, or both. The area value is calculated for the part of the segment which intersects the pixel being rendered. The cover value is calculated for the part of the segment which is to the left of the pixel. Once both of these values are calculated for a segment, they are added to the final alpha value. Once all segments in a shape are considered, the final alpha value is converted to pixel opacity according to the fill rule of that shape.

Calculating area and cover values.

Here is an example of how to calculate area and cover values for the highlighted pixel. The shape being drawn is the same clockwise circle, but now we are drawing a pixel which is located somewhere at the top left side of this shape.

The first step is to find the part of the circle which will be contributing to the alpha value of this pixel. We are only interested in the part which is on the left of the right edge of our pixel.

Now calculate the area. Take only the part of the segment which is within our pixel, cutting off any parts of the segment that are outside of the pixel, and calculate the area that is to the right of the segment. For segments that go down instead of up, the area value is negated.

To calculate cover, isolate the part of the segment which is to the left of our pixel, but still in between the top and bottom edges of the pixel row. The cover value is then calculated by subtracting the minimum Y value of the isolated segment from the maximum Y value of the isolated segment. Like with area, if a segment goes down instead of up, the cover value is negated. In this particular case, the difference is around 0.45 and the pixel size is 1.0, so the cover value is 0.45.

Note that cover needs to be calculated for all segments which are to the left of the pixel and between the top and bottom edges of the pixel row, even if they do not intersect the pixel at all.

Once both area and cover values are calculated, they are added to alpha value. Here is an illustration of both area and cover values combined for this case.
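Here is one way to compute both values for a straight line segment in Python. This is a sketch under my own conventions: the pixel is taken to be the unit square, and the segment is assumed to be already clipped vertically to this pixel row:

```python
def area_cover(x0, y0, x1, y1):
    """Area and cover contribution of one line segment to the pixel
    occupying [0, 1] x [0, 1]. The segment must already be clipped to
    the pixel row, so 0 <= y <= 1; upward segments contribute positive
    values, downward segments negative ones."""
    dx, dy = x1 - x0, y1 - y0
    # Split the segment where it crosses the pixel's left and right
    # edges, so every resulting piece lies entirely left of, inside,
    # or right of the pixel.
    ts = [0.0, 1.0]
    for edge in (0.0, 1.0):
        if (x0 - edge) * (x1 - edge) < 0.0:
            ts.append((edge - x0) / dx)
    ts.sort()
    area = cover = 0.0
    for ta, tb in zip(ts, ts[1:]):
        xa, ya = x0 + ta * dx, y0 + ta * dy
        xb, yb = x0 + tb * dx, y0 + tb * dy
        xm = 0.5 * (xa + xb)
        if xm >= 1.0:
            continue                        # right of the pixel: ignored
        if xm <= 0.0:
            cover += yb - ya                # left of the pixel: full width
        else:
            area += (yb - ya) * (1.0 - xm)  # inside: area to its right
    return area, cover
```

For example, a vertical edge at x = 0.5 going up yields an area of 0.5 and a cover of 0, while the same edge moved entirely to the left of the pixel yields a cover of 1.0.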

Then repeat the process for all segments of a shape, continuously accumulating their area and cover values. In the end we will have the final alpha value, which must then be converted to opacity according to the fill rule of the shape. If the fill rule is non-zero, opacity is calculated using the formula min(abs(alpha), 1.0). If the fill rule is even-odd, the formula is abs(alpha − 2.0 × round(0.5 × alpha)). The resulting opacity value is in the range from 0.0 to 1.0.
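The two fill-rule formulas translate directly to code. A small sketch (the function name is mine):

```python
def to_opacity(alpha, fill_rule):
    """Convert an accumulated alpha value to pixel opacity in
    [0.0, 1.0] according to the shape's fill rule."""
    if fill_rule == "non-zero":
        return min(abs(alpha), 1.0)
    # even-odd: fold alpha back into [0, 1], so a winding of 2
    # produces an empty pixel while a winding of 1 or 3 fills it
    return abs(alpha - 2.0 * round(0.5 * alpha))
```

Note that negative alphas, like the -1 from a counterclockwise circle, come out fully opaque under the non-zero rule, exactly as in the winding number example earlier.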

I first discovered this algorithm in FreeType, which cites LibArt as an inspiration. Some form of the same algorithm, with various modifications, is used on pretty much every computer now.


At this point we can rasterize shapes on GPU with very good quality anti-aliasing.

The following algorithm is run in parallel for every pixel on screen -

- Start with alpha set to zero.
- For each segment of each shape, calculate area and cover values and add them to alpha.
- Convert the final alpha value to opacity according to the fill rule of the shape and fill the pixel.

For small amounts of small shapes this approach will be extremely fast. However, it becomes slower as more shapes are added, due to the increasing number of segments that need to be processed at each pixel. To make this renderer usable in the real world, some fundamental changes need to be made. We have to reduce the number of segments that require processing for each pixel. The simplest way is to divide the screen into smaller regions, each with its own list of segments, and process only this reduced list of segments for each pixel.

Divide input into smaller chunks.

Let's split the entire destination area into virtual fixed-size blocks. The size of these blocks does not depend on threadgroup size or SIMD group size. Since the renderer will be splitting segments at block boundaries, larger blocks require less CPU processing power to cut segments, but more processing power on the GPU to rasterize segments within a block. It is always best to choose some nice round number. I chose 32 × 32 pixels.

The origin of the first block is always at 0, 0 relative to the destination image. If the destination image size is not a multiple of 32, the last block is still 32 × 32 pixels and extends a little beyond the image.
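The block grid arithmetic is simple. A sketch in Python, with names of my own choosing:

```python
BLOCK = 32  # block size in pixels

def block_counts(width, height):
    """Number of blocks covering the destination image; the last row
    and column of blocks may extend beyond the image edges."""
    return (width + BLOCK - 1) // BLOCK, (height + BLOCK - 1) // BLOCK

def block_of(px, py):
    """Block coordinates of the block containing pixel (px, py)."""
    return px // BLOCK, py // BLOCK
```

For a 1920 × 1080 destination, 1080 is not a multiple of 32, so the last row of blocks (row 33) hangs 8 pixels below the image.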

What is in the block.

Each block contains two main things. First, there is a list of shapes that intersect with the block. Each shape comes with a list of segments that make up that shape. Second, there is a cover table. This is all that is needed to draw a pixel.

Cover table.

Previously I mentioned the cover value and how it is calculated when implementing anti-aliasing. This is the same concept. Within a block we only have the list of segments which intersect that block, so the cover table carries cover information from the blocks to the left of the block our pixel is in.

The cover table is needed because we do not know anything about shapes or segments outside of any given block, but segments outside of the block can still contribute to the coverage of pixels within the block.

Final algorithm.

At this point, most of the things we need are there. Let's overview the full process.

Prepare blocks on CPU.

For each shape in the composition, do the following -

- Convert the shape's path into segments (simple line segments in my implementation).
- Split the segments at block boundaries.
- For every block the shape intersects, add the shape and its segments to that block's list.

Lastly, calculate cover table values for each shape in each block by examining the contents of the blocks to the left.

Now the work on the CPU is done and the data can be uploaded to the GPU.

Dispatch one thread for each pixel to run on GPU.

The following algorithm is run in parallel for every pixel on screen -

- Find the block the pixel belongs to.
- For each shape in that block, skip it if the pixel row is outside of the shape's vertical bounds.
- Start alpha with the value from the shape's cover table for this pixel row.
- For each segment of the shape, calculate area and cover values and add them to alpha.
- Convert the final alpha to opacity according to the fill rule and blend the shape's color into the pixel.

This is it.

Now I'll write some thoughts about certain parts of the pipeline and what I discovered so far.

What is a segment?

In this article I talk about shapes being made of segments. But what exactly is a segment? The answer is that it does not matter much. The simplest to handle are straight line segments. The hardest are cubic curves. And there can be everything in-between, like quadratic curves or monotonic cubic curves.

In my implementation I use simple lines encoded as 8.8 fixed point numbers. Since the block size is 32 × 32 pixels, 8.8 fixed point numbers are enough to describe lines with 1/256th of a pixel precision. Using this encoding, one line segment can be described using only 64 bits. It also makes the shader which then processes these lines simpler: it can operate on 16-bit integers quite far into the algorithm, until it needs higher precision.
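A sketch of such an encoding in Python. The article only fixes the format (8.8 fixed point, 64 bits per segment); the exact bit layout below is my own illustration:

```python
def fixed_8_8(v):
    """Encode a block-local coordinate in [0, 32) as an unsigned 8.8
    fixed point value: 8 integer bits, 8 fractional bits, so the
    precision is 1/256th of a pixel and the result fits in 16 bits."""
    return int(round(v * 256.0)) & 0xFFFF

def pack_segment(x0, y0, x1, y1):
    """Pack one line segment, in block-local coordinates, into a
    single 64-bit integer as four 16-bit fixed point values."""
    return (fixed_8_8(x0)
            | fixed_8_8(y0) << 16
            | fixed_8_8(x1) << 32
            | fixed_8_8(y1) << 48)
```

Coordinates up to 32 need only 6 integer bits, so 8.8 even leaves a little headroom for segments that poke slightly outside the block.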

Monotonic quadratic curves are also a tempting thing to try. You would need to find up to 2 roots of a quadratic curve to get a line which can then be used to calculate cover and area values. For input which is mostly curves, not lines, encoding segments as curves may save some space, and iteration over the list of segments in the shader would be shorter. However, if the input is mostly lines, it adds processing cost and memory requirements. So far, I'm sticking with simple line segments.
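The root-finding step that curve-encoded segments would require looks like this. A sketch under my assumptions (the curve is Y-monotonic and actually crosses the scanline), not code from the renderer:

```python
import math

def quad_x_at_y(p0, p1, p2, yc):
    """X position where a Y-monotonic quadratic Bezier with control
    points p0, p1, p2 crosses the horizontal line y = yc."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    # y(t) = (1-t)^2*y0 + 2t(1-t)*y1 + t^2*y2 = a*t^2 + b*t + y0
    a = y0 - 2.0 * y1 + y2
    b = 2.0 * (y1 - y0)
    c = y0 - yc
    if abs(a) < 1e-12:
        t = -c / b                      # the curve is linear in y
    else:
        d = math.sqrt(b * b - 4.0 * a * c)
        t = (-b + d) / (2.0 * a)
        if not 0.0 <= t <= 1.0:         # pick the root on the curve
            t = (-b - d) / (2.0 * a)
    mt = 1.0 - t
    return mt * mt * x0 + 2.0 * t * mt * x1 + t * t * x2
```

Monotonicity is what guarantees at most one crossing per scanline, which is why non-monotonic curves would first have to be split at their extrema.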

SIMD group configuration is important.

Once you ask the GPU to dispatch a list of threads, it divides them into threadgroups. Each threadgroup has a list of threads to run, and threads within threadgroups are further divided into SIMD groups. The last part is significant, because every time the code path diverges within a SIMD group, the shader runs longer, since it has to execute both sides of a branch.

Given that, a very good performance optimization is not to try to reject segments on the X axis in the shader. Rejecting segments which are below or above current pixel boundaries is fine. Rejecting segments which are to the right of the current pixel will most likely increase shader execution time.

Another point is to choose a threadgroup size whose width is equal to the SIMD group width on that hardware, to make sure that within the threadgroup, each SIMD group covers only 1 row of pixels.

Calculate vertical bounds of shapes.

It is good to be able to skip entire shapes in the shader and avoid processing the segments within. Again, since the block size is 32 × 32 pixels, the bounding box can be just two 8-bit values: only the vertical minimum and maximum are needed. As I said before, I do not attempt to reject shapes by X pixel position, to avoid diverging branches within SIMD groups, but rejecting them by Y position gives a good performance boost.
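A sketch of this rejection test in Python (my own names; in the renderer the two bounds would be stored as 8-bit values per shape per block):

```python
import math

def vertical_bounds(segments):
    """Vertical bounding range of a shape's segments within a block,
    rounded outward to whole pixel rows. segments are (x0, y0, x1, y1)
    tuples in block-local coordinates."""
    ys = [y for x0, y0, x1, y1 in segments for y in (y0, y1)]
    return int(math.floor(min(ys))), int(math.ceil(max(ys)))

def row_hits_shape(bounds, row):
    """True when the pixel row may contain part of the shape; false
    lets the shader skip the shape without touching its segments."""
    ymin, ymax = bounds
    return ymin <= row < ymax
```

Because every thread in a SIMD group sits on the same pixel row (given the threadgroup shape above), this branch is taken uniformly and causes no divergence.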

Why not prepare blocks on the GPU too?

This certainly can be achieved, as demonstrated by the brilliant piet-gpu. However, it requires more advanced GPU compute features than I'd like. In the simple approach where blocks are prepared on the CPU, there is only one GPU kernel dispatch. The simplest kernel, which renders shapes with a solid color, is about 80 lines of code, easy to fix and extend. I managed to optimize block preparation on the CPU quite well, and this preparation step is predictable when allocating memory. I think it is quite well balanced.

Other work.

The primary inspiration for me to try rendering vector graphics on the GPU was piet-metal. The next generation of that concept is piet-gpu. It seems to have gone in the opposite direction from my implementation and now attempts to do everything on the GPU, including flattening of shapes into individual segments. The method presented in Efficient GPU Path Rendering Using Scanline Rasterization is very interesting, although a bit too complex for my taste, since it relies on prefix sum calculation and sorting, both of which are not particularly trivial to do on a GPU. Pathfinder is quite interesting as well, and I suspect it is very similar to what I described here, but I have not attempted to figure out exactly how it works. Then there is Slug, which is a popular library for rendering scalable text in games. Its algorithm seems to be patented.


Overall, I really like this approach of asking the GPU to perform some of the steps of rasterization. The most expensive elements of vector graphics rendering are blending, finding which pixels to fill, and calculating fill colors. Calculating fill colors can become really expensive if shapes are filled with images using a bilinear filter. Radial gradients are not so cheap for the CPU to process either. With this approach, the most expensive parts already run on the GPU.

Trying to come up with an exact number of milliseconds spent, in an attempt to compare approaches, is a more or less pointless task in my opinion. However, compared to rendering exactly the same input with a heavily optimized software renderer, the GPU-assisted version is 10-15 times faster. Image fills see the biggest performance boost. I think this is a good start.

Blog of Aurimas Gasiulis