GPU Project 10 – Rasterizer and Shader

So, it’s been a long time since this project was started. Life has a strange way of getting in the way of hobbies, and sometimes we just can’t get enough time to finish things off. Thankfully, I had a chance a couple of weeks ago to finish off the last pieces I wanted to do, and we finally have a real 3D rendered object on the screen, with real faces and even a simple texture. Last time, this is how the architecture of the GPU looked:

The rasterizer and shader were still not written or tested, and a camera-space transform block was missing from the pipeline as well. To speed things up a bit, I decided to drop the camera-space transform block entirely; that leaves just two more blocks, the rasterizer and the shader. To be fair, the camera transform can be done in software and bundled into the world-space transform + rotation block, since transforms compose.
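In case that sounds hand-wavy, the idea is that the CPU can pre-combine the camera (view) transform with each object’s world transform before handing anything to the hardware. A minimal sketch of that folding, where mat4 and all the helper functions are my own placeholders, not anything from this project:

// Hedged sketch: fold the camera transform into the per-object world
// transform on the CPU, so the hardware only needs one transform stage.
// mat4, translate, rotate, inverse and upload_transform are assumptions.
mat4 model    = translate(obj.position) * rotate(obj.rotation);
mat4 view     = inverse(camera_pose);   // world space -> camera space
mat4 combined = view * model;           // one matrix for the HW block
upload_transform(obj.id, combined);     // hypothetical upload call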

For the rasterization process I tried several approaches, most of which ended up being bloated and/or not appropriate for implementation in an FPGA. I settled on simply finding out whether we are on the right side of the equations of three lines (the sides of the triangle). One of the rules I set when I started this project was that I would not look up other architectures or publicly available papers that could skew my design. It turns out that what I chose is pretty much what Juan Pineda proposed in 1988 in his paper “A Parallel Algorithm for Polygon Rasterization”.
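For anyone who hasn’t seen it, the “right side of a line” test boils down to evaluating one edge function per triangle side. A minimal sketch of the idea (my own illustration, not the project’s actual helper):

// Edge function for the directed segment (x1,y1) -> (x2,y2), evaluated at
// pixel (px,py): positive on one side of the line, negative on the other,
// and zero exactly on it.
int edge(int x1, int y1, int x2, int y2, int px, int py) {
	return (px - x1) * (y2 - y1) - (py - y1) * (x2 - x1);
}
// A pixel is inside the triangle when all three edge functions agree in
// sign. Stepping one pixel in x changes each value by the constant
// (y2 - y1), which is what makes the incremental form below so cheap.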

The awesome thing about this algorithm is that (once optimized, and after solving some setup equations) the complexity boils down to three sums per pixel, plus another three sums per line:

// Scan through the bounding rectangle
for (ap_uint<10> y = ymin; y < ymax; y++) {
	int cx1 = cy1;
	int cx2 = cy2;
	int cx3 = cy3;
	for (ap_uint<10> x = xmin; x < xmax; x++) {
		#pragma HLS PIPELINE II=2
		color_map_t a, b, c;
		barycentric(x, y, p1.x, p2.x, p3.x, p1.y, p2.y, p3.y, &a, &b, &c);
		// Inside the triangle when all three edge values agree in sign
		if (cx1 > 0 && cx2 > 0 && cx3 > 0) {
			output_pixel.depth = 1;
			output_pixel.u = a*u_a + b*u_b + c*u_c;
			output_pixel.v = a*v_a + b*v_b + c*v_c;
			output_pixel.x = x;
			output_pixel.y = y;
			raw_pixel_out << output_pixel;
		}
		// Step one pixel in x: one addition per edge
		cx1 -= dy12;
		cx2 -= dy23;
		cx3 -= dy31;
	}
	// Step one line in y: one addition per edge
	cy1 += dx12;
	cy2 += dx23;
	cy3 += dx31;
}
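The barycentric() helper called in the loop isn’t shown above. For completeness, the weights fall out of the same edge functions almost for free; here is a hedged sketch that matches the call signature, not necessarily the project’s exact implementation:

// Barycentric weights: each one is the (doubled, signed) area of the
// sub-triangle opposite a vertex, divided by the whole triangle's area.
void barycentric(int x, int y,
                 int x1, int x2, int x3,
                 int y1, int y2, int y3,
                 color_map_t *a, color_map_t *b, color_map_t *c) {
	float area = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1);
	*a = ((x2 - x) * (y3 - y) - (y2 - y) * (x3 - x)) / area;
	*b = ((x3 - x) * (y1 - y) - (y3 - y) * (x1 - x)) / area;
	*c = ((x1 - x) * (y2 - y) - (y1 - y) * (x2 - x)) / area;
}

In a real pipeline the division would be precomputed per triangle, since the area is constant across the whole rasterization, and the numerators can be stepped incrementally for the same reason the inside test can.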

The code shown above is the HLS heart of the rasterizer. As pixels are produced, their barycentric coordinates are generated as well; these are used in the shader module. The shader itself is very simple code: it essentially just looks up the right address of the texture in DDR memory (using nearest-neighbor interpolation) and uses that as the final color.

// Pump in
raw_pixel_in >> in;

// Basics
out.depth 	= in.depth;
out.x 		= in.x;
out.y 		= in.y;

// Generate texture address (nearest neighbor in a 256x256 texture)
uint32_t tex_x = (uint32_t)(in.u * 255) & 0xFF;
uint32_t tex_y = (uint32_t)(in.v * 255) & 0xFF;
uint32_t address = tex_y * 256 + tex_x;

// Get value
texture_pixel.col = texture_buffer[address];

// Color
local_color.chan.r = texture_pixel.chan.r;
local_color.chan.g = texture_pixel.chan.g;
local_color.chan.b = texture_pixel.chan.b;
local_color.chan.a = 255;
out.color 	= local_color.col;

// Pump out
cooked_pixel_out << out;

Nothing fancy, but looking up data in DDR involves an AXI4-Full master interface, so it ends up consuming a fair amount of logic anyway.
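For reference, in Vivado HLS the DDR access comes from mapping the texture pointer onto an AXI master with an interface pragma. Here is a self-contained sketch of what the shader’s top level could look like; the struct layouts, port names and bundle names are my assumptions, not the project’s actual code:

// Hedged sketch of the shader top level. The m_axi pragma is what turns
// texture_buffer accesses into an AXI4-Full master interface.
#include <hls_stream.h>
#include <stdint.h>

struct raw_pixel_t    { float u, v; uint16_t x, y; uint8_t depth; };
struct cooked_pixel_t { uint32_t color; uint16_t x, y; uint8_t depth; };

void shader(hls::stream<raw_pixel_t> &raw_pixel_in,
            hls::stream<cooked_pixel_t> &cooked_pixel_out,
            uint32_t *texture_buffer) {
#pragma HLS INTERFACE m_axi port=texture_buffer offset=slave bundle=gmem
#pragma HLS INTERFACE axis port=raw_pixel_in
#pragma HLS INTERFACE axis port=cooked_pixel_out
#pragma HLS INTERFACE s_axilite port=return

	raw_pixel_t in;
	raw_pixel_in >> in;

	cooked_pixel_t out;
	out.depth = in.depth;
	out.x = in.x;
	out.y = in.y;

	// Nearest-neighbor lookup, as in the body shown above
	uint32_t tex_x = (uint32_t)(in.u * 255) & 0xFF;
	uint32_t tex_y = (uint32_t)(in.v * 255) & 0xFF;
	out.color = texture_buffer[tex_y * 256 + tex_x];

	cooked_pixel_out << out;
}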

I wrote a little piece of code to convert a bitmap to a C array and loaded it on the ARM processor; the rough idea is sketched below. The processor initializes the array in DDR during boot-up, so things *should* just work as long as the texture coordinates are input correctly.
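The converter is nothing special. A hedged sketch of the idea, assuming an uncompressed 24-bit BMP and a 256×256 texture (the ARGB packing is also an assumption):

// Hypothetical bmp2array tool: converts an uncompressed 24-bit BMP into a
// C array of 32-bit ARGB texels. The pixel data offset is read from the
// standard BMP file header.
#include <cstdio>
#include <cstdint>
#include <vector>

int main(int argc, char **argv) {
	if (argc < 2) return 1;
	FILE *f = fopen(argv[1], "rb");
	if (!f) return 1;

	unsigned char header[54];
	if (fread(header, 1, 54, f) != 54) return 1;
	uint32_t offset = header[10] | (header[11] << 8)
	                | (header[12] << 16) | (header[13] << 24);
	fseek(f, offset, SEEK_SET);

	const int W = 256, H = 256;
	const int row_bytes = ((W * 3 + 3) / 4) * 4;   // BMP rows pad to 4 bytes
	std::vector<unsigned char> row(row_bytes);
	std::vector<uint32_t> tex(W * H);

	for (int y = 0; y < H; y++) {
		fread(row.data(), 1, row_bytes, f);
		for (int x = 0; x < W; x++) {
			unsigned char b = row[x * 3 + 0];
			unsigned char g = row[x * 3 + 1];
			unsigned char r = row[x * 3 + 2];
			// BMP rows are stored bottom-up; flip so (0,0) is top-left
			tex[(H - 1 - y) * W + x] =
			    (0xFFu << 24) | (r << 16) | (g << 8) | b;
		}
	}
	fclose(f);

	printf("const uint32_t texture[%d] = {\n", W * H);
	for (int i = 0; i < W * H; i++)
		printf("0x%08X,%s", tex[i], (i % 8 == 7) ? "\n" : " ");
	printf("};\n");
	return 0;
}

With the texture loaded this way, the block diagram now looks like this: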

Next up: some render tests!

GPU Project 09 – Rotations

Sorry for the long delay between posts. Being a hobby project, this usually takes low priority when other things are urgent. There have actually been many advances lately, but I haven’t had the chance to post them. Let’s go through these in order.

Although we have a fair point-cloud renderer in place, it really is not that useful unless we can move and rotate objects. For that purpose, we have the object transform block, highlighted below in red:

The idea is that we can have a list of object-specific parameters in a BRAM that the block will use to alter the vertex stream:

  • number of vertices in the object
  • position
  • rotation

The block will take in vertices from the vertex pump, rotate them around the object’s own axes, and then add the object’s position as an offset to each vertex. The position part is actually quite simple, since we only have to do an addition on each axis. The rotation part, well… not so much.
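In stream terms, the per-vertex work is tiny; a hedged sketch (the names are placeholders, and rotate() is exactly the hard part this post goes on to cover):

// Hedged sketch of the object transform's inner loop: rotate, then offset.
vertex_t v;
vertex_in >> v;                        // from the vertex pump
vertex_t r = rotate(v, obj.rotation);  // the hard part
r.x += obj.position.x;                 // the easy part: one add per axis
r.y += obj.position.y;
r.z += obj.position.z;
vertex_out << r;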

Continue reading

GPU Project 08 – Point cloud rendering in FPGA

After successfully running the simulation, it’s time to see how the rendering works on the real HW. And as always, SW needs to be written for the HW to know what to do. In this case, I will load the vertex data of a cube into memory to see it transformed, and then experiment with the same teapot data that I used in the C# version of the app.
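For reference, the cube’s vertex data is just its eight corners; how they are laid out in memory depends on the vertex pump’s format, so this is only a sketch:

// The eight corners of a 2-unit cube centered at the origin.
float cube_vertices[8][3] = {
	{-1, -1, -1}, { 1, -1, -1}, { 1,  1, -1}, {-1,  1, -1},
	{-1, -1,  1}, { 1, -1,  1}, { 1,  1,  1}, {-1,  1,  1},
};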

Continue reading

Minecraft Flandre Scarlet

So, this is something I did a long time ago. I’m probably looking at this through nostalgia goggles, so bear with me. I had seen lots of people do Touhou characters in Minecraft, but always in 2D! Though awesome in itself, the whole point of Minecraft is to do things in 3D, so I got to work on voxelizing a 3D model of Flandre Scarlet that I had lying around. After importing the voxels into Minecraft, I then replaced the blocks (one by one, inside the game) with the right colored blocks. It took forever, but it was definitely worth it. Here it is:

Continue reading

GPU Project 07 – Simulating the design

In order to see if the design works before committing to a full build on the FPGA, I wanted to simulate it and check whether it could render just a few pixels and return sensible pixel locations. There are of course lots of different ways of doing this, some quite elaborate, but I just wanted to functionally verify the design in the shortest amount of time (this is a hobby project, after all). So, I opted to use Xilinx BFMs. These are IP cores that can generate different kinds of traffic on AXI buses. Here’s the testbench that I created:

Continue reading

GPU Project 06 – HLS IPs

Hi!

Well, due to being very busy at work I hadn’t had a chance to post progress on the project, but we most definitely have progress! If you have been following these posts, you can see that last time we sketched out the overall architecture of the video card.

In order to render point clouds we only need the three blocks that are highlighted: a mechanism for pulling raw vertices in from memory, a block that transforms the 3D points to 2D screen space, and a block that takes those points and draws them into a frame buffer. So, here they are!
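Before the details, here is a hedged sketch of what the first of these, a vertex-pump-style block, can look like in HLS. This is my illustration only, not the project’s actual code, which follows in the post:

// Hedged sketch: pull raw vertices from DDR over an AXI master and push
// them into a stream for the next block. Names and layout are assumptions.
#include <hls_stream.h>

struct vertex_t { float x, y, z; };

void vertex_pump(vertex_t *ddr, int count, hls::stream<vertex_t> &out) {
#pragma HLS INTERFACE m_axi port=ddr offset=slave bundle=gmem
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=count
#pragma HLS INTERFACE s_axilite port=return
	for (int i = 0; i < count; i++) {
#pragma HLS PIPELINE II=1
		out << ddr[i];
	}
}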

Continue reading

GPU Project 05 – Sketching the RTL

So, the time has finally arrived: time to tackle the GPU in HW! A quick disclaimer: since this is a hobby project, I will use HLS to quickly iterate designs and reach a functional RTL. All of the blocks will be designed with RTL in mind and (given enough time) could be replaced by hand-coded VHDL/Verilog without too much hassle. This is the architecture that I am envisioning:

Continue reading

GPU Project 04 – Simple 3D object parser

So, last week we got a basic projection algorithm in place. We “rendered” the vertices of the cube into a bitmap, but we barely got to see it working. We definitely need something more complicated to see it operating. One option is to hard-code a longer list of vertices that describes a more complex object, but doing that by hand is cumbersome, inexact and prone to errors. Instead, I decided to rely on the vast world wide web and find some 3D objects that I could use. It turns out that there are millions of such objects on many, many websites, but they all come in different formats. After some hunting, I settled on the .obj format for these reasons (a tiny example follows the list):

  • Plain ASCII format: can’t beat this when it comes to ease of parsing
  • No compression
  • Simple 3D object structure
  • Vertices and faces are separate
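As promised, here is about the smallest valid .obj file: “v” lines declare vertices, and “f” lines build faces out of 1-based vertex indices.

# A single triangle in .obj format
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 0.0 1.0 0.0
f 1 2 3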


Continue reading

GPU Project 03 – World and space

So, now that we have a way of showing images, let’s start digging into how we will actually generate the images that we show. The basic idea is that a 3D scene is formed by a digital representation of the objects that we want to show, and a video card transforms this information into a 2D image that we can see on a screen. The first part is usually the task of 3D designers, game creators, artists and whimsical programmers who create collections of points, lights and textures that represent a 3D object. Since we are starting from scratch, we will begin by trying to render a collection of points in space. These points will simply be a set of (x, y, z) coordinates in the world. I always like to use coordinates the way PC game designers use them:

Continue reading

GPU Project 02 – Basic frame buffer and DMA code

The idea in the previous post was to create a suitable framebuffer display circuit that could be used as a generic part of the video card, sending the contents of the framebuffer to a monitor or TV. After I moved to a different computer, I realized how inconvenient it is to have a project slaved to a particular set of board files. These are not copied with the project, so I decided to remove the dependency on the board files and synthesize again. (The updated Vivado project and bitstream are attached.)

So, with the FPGA fabric in place, we needed to create a basic application that runs on the Zynq’s ARM to write to the framebuffer and configure the VDMA block. This verifies that the output VDMA and VGA circuit work correctly; the framebuffer-writing half is sketched below.
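The framebuffer part is plain memory writes. A minimal sketch, where the base address and resolution are assumptions for illustration (the VDMA side goes through Xilinx’s bare-metal driver and is covered in the rest of the post):

// Hedged sketch: paint a test pattern into the framebuffer from the ARM.
// FRAMEBUFFER_BASE, WIDTH and HEIGHT are illustrative values only.
#include <stdint.h>

#define FRAMEBUFFER_BASE 0x10000000
#define WIDTH  640
#define HEIGHT 480

void fill_test_pattern(void) {
	volatile uint32_t *fb = (volatile uint32_t *)FRAMEBUFFER_BASE;
	for (int y = 0; y < HEIGHT; y++)
		for (int x = 0; x < WIDTH; x++)
			fb[y * WIDTH + x] = (x ^ y) & 0xFF;  // XOR test pattern
}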

So, first things first. We need to get a bare-metal application working on the board. I exported the .HDF file from the Vivado project and fired up Xilinx SDK.

Continue reading