Parallel vector graphics rasterization on CPU.

During the last couple of weeks I got some free time and as a fun programming exercise I decided to write a CPU-based, multithreaded vector graphics rasterizer. The goal was to outperform rasterizers of all publicly available vector graphics libraries. And today I’m sharing what I came up with.

I called the project Blaze, its source is available here. There is a web-based demo here. The demo runs on WebAssembly and uses WebGL2 to display the final image. GPU is not involved in generating pixels in any way. Be sure to use a decent web browser for best performance.

Introduction.

Traditional scanline rasterizers follow path outline, update some sort of edge list for each scanline, then convert this edge list to arrays of spans for final composition on the destination surface. While there are methods to optimize this process quite well, it suffers from being bound to a single CPU core. The biggest contributor to the problem is the fact that any segment in the path can affect how it covers any particular pixel. You cannot examine a few segments of an entire path and decide if pixel X is covered or not, like it is with triangle meshes, for example. For this reason it is a challenge to find a way to distribute workload equally across multiple threads.

In the next few paragraphs I’ll provide a brief explanation of how my attempt to solve this problem works.

Input.

Input for the rasterizer is a list of classic Bézier paths. Each path is encoded as an array of path commands indicating the types of segments and an array of points describing these segments. Only color fill and source over blending mode is supported at this time which is enough to demonstrate the concept.

Creating segment lists.

Screen, or destination image, is first subdivided into fixed height intervals. Rasterizer then creates a segment list for every interval each path intersects. Every input path gets as many segment lists as they intersect intervals. Next, path segments are transformed by transformation matrix, clipped to destination image bounds and flattened to lines. Lines are further subdivided so they do not cross interval boundaries. This happens for each path in parallel. Finally, line segments are inserted into segment lists they belong to.

Below is an illustration which demonstrates what is the result of segment list preparation for an image containing two shapes. Black borders indicate line segments.

Line segment compression.

Line segments are compressed when inserting to segment lists. In all cases, Y values are encoded as 8.8 fixed point numbers. X values are encoded in two different ways. If path width is less than 256 pixels, X values have the same, 8.8 fixed point encoding. Otherwise they are encoded as larger, 24.8 numbers. Narrow encoding uses 8 bytes per line segment, wide encoding uses 12. In terms of speed, this only gives modest improvement on systems with ridiculously fast memory bandwidth, like Apple M1 Max, but can be up to 10% overall improvement on slower systems, depending on input.

Construction of row lists.

Each screen space interval needs to get a list of paths it intersects with. This is what the second stage is responsible for. Additionally, since the rasterizer has a better knowledge of input geometry, it can skip paths which did not contribute to particular rows.

Output of this stage is an array of paths for each screen space interval with non-empty segment lists.

The image below shows row lists as rectangles with gray background and which bits of which shapes they contain for each interval.

Rasterization and blending.

This is where pixels are generated. For each screen space interval, a thread is spawned to complete rasterization for it. Rasterizer iterates over a list of paths created for the current interval in back to front order and rasterizes segments each path has assigned to it.

The rasterization of segments is a fairly simple scanline converter. But since rasterization happens only one interval at a time with its height known when compiling, there are opportunities for a few optimizations.

Dense cover and area tables.

This rasterizer calculates exact area values for each pixel segment intersects. This type of rasterizer uses two integers per pixel. The first is area, indicating how much pixel was covered by a segment intersecting that pixel. And the second is cover, indicating how much the segment affects all pixels to the right of the pixel intersected by the segment. It is a well known technique, but with many different implementations. Some rasterizers maintain a linked list of cells, adding a new cell once pixel-segment intersection is found. This can become quite slow and not great in terms of memory layout.

Since this rasterizer operates on a very limited area of image, allocating memory for area and cover tables for an entire interval is not a deal breaker. This memory is then reused for the next interval once the rasterizer is done with the current one. To know which pixels were crossed by segments, a bit vector is used. In the end it is a simple left to right bit scan to calculate what pixels to fill. Only bit vectors are zero-filled before rasterizing a path. Cover and area tables are left with old content, further avoiding expensive memory writes.

Avoiding floating point numbers.

Except at the initial segment transform and curve clipping, no floating point numbers are used. Even curve flattening is done entirely with fixed point numbers also known as integers. It would not be a complete surprise for me if using floats for some operations would be faster overall, but I like how simple and predictable integers are.

Results.

Performance numbers are always fun. I decided to compare the rasterizer with CoreGraphics, Skia and Blend2D. Blend2D has a multi-threaded context which does parallel rasterization internally. It certainly sounds like a good candidate for comparison. Meanwhile CoreGraphics and Skia are two of the most popular renderers.

To measure the final number of milliseconds, each image is rendered 500 times in a row. Then results are sorted. 5 best and 5 worst results are removed and the average time for the rest of 490 runs is calculated.

Clearing the destination pixel buffer before rendering is not included in time calculation, in all cases renderers get an already zero-filled destination. Path objects are created up front as well and not included in timing. All strokes in all test images are pre-converted to fills. All libraries are fed the same exact data containing only color fills. Only simple color fills and source-over blending is used since the goal is to test rasterization performance, not other features.

All images are rendered at 200% scale.

The numbers below are measured on machine with Apple M1 Max CPU with 10 cores. The times are in milliseconds. Lower is better.

Tiger, 960×1014 pixels, 301 layer.
Blaze	0.6ms
Blend2D	0.8ms
Skia	5.9ms
CoreGraphics	11ms

Periodic Table, 1630×964 pixels, 5942 layers.
Blaze	1.1ms
Blend2D	2.7ms
Skia	10.5ms
CoreGraphics	22.8ms

Boston, 1412×1264 pixels, 1695 layers.
Blaze	1.2ms
Blend2D	2.8ms
Skia	16ms
CoreGraphics	32ms

Paris-30k, 4222×3474 pixels, 50681 layers.
Blaze	18ms
Blend2D	47ms
Skia	190ms
CoreGraphics	390ms

Paragraphs, 1650×26158 pixels, 11253 layers.
Blaze	6ms
Blend2D	12ms
Skia	80ms
CoreGraphics	132ms

This rasterizer renders Paris-30k at almost constant 60 FPS at any scale at full screen on MacBook Pro with M1 Max. Or massive amounts of text at 120 FPS. Not bad for software rasterizer.

A thing that did not work.

At first, my idea was to use a different method for the final stage. Segments were divided not into horizontal intervals, but small, 8×16 tiles. And scan was performed from top to bottom instead of from left to right.

At the rasterization stage, process would go as follows -

Start with the top most tile in the vertical interval.
Clear cover/area table. For 8×16 tiles it would be 8×16×2 amount of 32 bit integers.
Rasterize all segments for that tile.
Run cover value accumulator from top to bottom for that tile while calculating alpha values for all pixels.
Proceed to the next tile below it.
If the tile has no segments, fill it according to the last accumulation of cover values since the last tile.

Why top to bottom scan? Because it is SIMD-friendly. You can load four or eight cover values at once, add with previous row, add to area and apply fill rule without reverting to scalar instructions. And blending of such tiles is SIMD-friendly as well.

However, the performance numbers were not that great. 20-30% slower than the current implementation. In most cases, even tiny tiles like that are rarely dense. Usually, even at lower zoom levels, they do not contain enough segments to intersect the majority of pixels. So typically, the rasterizer operates on blocks containing mostly zeroes. And clearing of tiles before rasterization did show up in the profiler as well.

A few final words.

This rasterizer performs best when rendering larger images. Anything smaller than about 50×50 pixels and less than 30-50 layers is not likely to be rendered faster using this rasterizer compared to traditional rasterizers.

Despite being really fast for simple solid color fills, CPU-based rasterizer would become slow very quickly with more diverse input. Radial gradients, fancy blending modes, effects that involve blur and complex masking are only a few examples of things that have potential to drag performance down to “not interactive anymore” level. Some things are better suited for GPUs.