Simplified Nanite In MiniEngine
Overview
This is a simplified Nanite implementation (SimNanite) based on Unreal’s Nanite virtual geometry. We have implemented most of Unreal Nanite’s features.
Offline, we partition the triangles into clusters with the Metis graph partition library. SimNanite then partitions the clusters into cluster groups and builds the DAG (Directed Acyclic Graph) and BVH tree from the cluster groups. To avoid LOD cracks, SimNanite simplifies the mesh globally rather than simplifying the triangles per cluster.
At runtime, we implement a three-level GPU culling pipeline: instance culling, BVH node culling and cluster culling. In the BVH node culling pass, we use the MPMC (Multiple Producers, Multiple Consumers) model to traverse the BVH structure and integrate cluster culling into the BVH culling shader for work balance.
We generate an indirect draw or indirect dispatch command for the clusters that pass culling. If a cluster is small enough, we rasterize it in software with a compute shader.
During the rasterization pass, we write the cluster index and triangle index to the visibility buffer. In the next base pass or GBuffer rendering, we fetch these indices from the visibility buffer, calculate the pixel attributes (UV and normal) by barycentric coordinates and render the scene with these attributes.
Nanite Builder
Overview
SimNanite (Simplified Nanite) building is an offline mesh processing step that runs during mesh import. It splits the mesh into clusters with a graph partition algorithm to enable fine-grained mesh culling. Beyond the cluster level, SimNanite partitions the clusters into groups to accelerate mesh culling. Cluster groups are the leaf nodes of the BVH structure. In addition, SimNanite simplifies the merged clusters at the current level rather than each cluster separately, which avoids the LOD crack artifact without boundary edge locking.
Triangle Partition
SimNanite partitions the mesh into clusters in order to perform fine-grained GPU culling. A mesh can be viewed as a graph whose nodes are the vertices and whose edges are the triangle edges. With this graph representation, we can partition it with the Metis graph library.
Given a mesh without an index buffer, SimNanite processes the mesh triangles sequentially. For each triangle, SimNanite hashes the vertex positions to find each vertex's index in the adjacency list array. Each triangle contributes six directed edges, two per vertex. We add the edge endpoints to the corresponding vertex's adjacency unordered_set.
The graph passed to the Metis library’s partition function has two parts: the vertex adjacency list array and the vertex adjacency offsets. The offsets record where each vertex's neighbors start in the adjacency list array.
With the adjacent vertex list for each vertex, we can pack them into an array and record their offsets.
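The packing step can be sketched as below. This is a minimal illustration, not SimNanite's actual code: `CsrGraph`, `AddTriangle` and `PackAdjacency` are hypothetical names. The two arrays correspond to what Metis calls `xadj` (offsets) and `adjncy` (neighbors). Neighbors are sorted here only to make the output deterministic.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical container for the two arrays handed to the partitioner.
struct CsrGraph {
    std::vector<int32_t> offsets;   // size = vertexCount + 1
    std::vector<int32_t> neighbors; // all adjacency lists, back to back
};

// Records the six directed edges of one triangle (two per vertex).
void AddTriangle(std::vector<std::unordered_set<int32_t>>& adj,
                 int32_t a, int32_t b, int32_t c) {
    adj[a].insert(b); adj[a].insert(c);
    adj[b].insert(a); adj[b].insert(c);
    adj[c].insert(a); adj[c].insert(b);
}

// Flattens the per-vertex sets into one array and records where each list starts.
CsrGraph PackAdjacency(const std::vector<std::unordered_set<int32_t>>& adj) {
    CsrGraph g;
    g.offsets.reserve(adj.size() + 1);
    g.offsets.push_back(0);
    for (const auto& set : adj) {
        std::vector<int32_t> sorted(set.begin(), set.end());
        std::sort(sorted.begin(), sorted.end());
        g.neighbors.insert(g.neighbors.end(), sorted.begin(), sorted.end());
        g.offsets.push_back(static_cast<int32_t>(g.neighbors.size()));
    }
    return g;
}
```

The neighbors of vertex `v` are then `neighbors[offsets[v]]` up to, but not including, `neighbors[offsets[v + 1]]`.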
After the triangle partition, Metis outputs an array containing the partition index of each vertex. The next step is to batch the triangles into clusters with the partition result and find out the linked clusters for each cluster. If the vertices of one triangle belong to different partitions, this triangle can be viewed as an “edge-cluster”. We add these clusters to the linked clusters array in order to partition the clusters in the next step.
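A rough sketch of this batching step follows. The names (`ClusterBatch`, `BatchTriangles`) are hypothetical, and the rule that a triangle spanning several partitions is owned by the partition of its first vertex is an assumption made for the example; the text above does not say how such triangles are assigned.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical output of the batching step.
struct ClusterBatch {
    std::vector<std::vector<int32_t>> triangles; // triangle indices per cluster
    std::vector<std::set<int32_t>> linked;       // neighboring clusters per cluster
};

ClusterBatch BatchTriangles(const std::vector<std::array<int32_t, 3>>& tris,
                            const std::vector<int32_t>& vertexPartition,
                            int32_t clusterCount) {
    ClusterBatch out;
    out.triangles.resize(clusterCount);
    out.linked.resize(clusterCount);
    for (std::size_t t = 0; t < tris.size(); ++t) {
        const int32_t p[3] = {vertexPartition[tris[t][0]],
                              vertexPartition[tris[t][1]],
                              vertexPartition[tris[t][2]]};
        // Assumption: the first vertex's partition owns the triangle.
        out.triangles[p[0]].push_back(static_cast<int32_t>(t));
        // A triangle whose vertices span partitions links those clusters.
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                if (p[i] != p[j]) out.linked[p[i]].insert(p[j]);
    }
    return out;
}
```

The `linked` sets play the same role for the cluster partition pass that the vertex adjacency lists play for the triangle partition pass.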
Cluster Partition
With clusters and linked clusters, we can partition them into cluster groups as we do in the triangle partition pass. A cluster group is a leaf node in the BVH acceleration structure. It usually consists of four to eight clusters.
Below is the cluster visualization at LOD level 0, 1 and 2 in SimNanite:
Mesh Simplification
We simplify the mesh until the cluster count drops below 24 or the simplifier can no longer reduce the mesh. The mesh simplification library is Meshoptimizer. It uses the QEM (quadric error metrics) method to simplify the mesh, which is similar to what Unreal Engine does. Another important point is that we simplify the whole mesh rather than individual clusters, because the latter causes LOD cracks without boundary edge locking.
Build DAG
In the DAG (Directed Acyclic Graph) building pass, we organize the data and translate it into a GPU-friendly structure to accelerate the GPU culling performed later.
SimNanite merges the resources of all LOD levels into a global resource array. That is, a SimNanite mesh resource has only one vertex buffer, one index buffer, one cluster group array and one cluster array. UE’s Nanite performs an additional compression process after mesh building. We skip compression for simplicity.
A Nanite mesh resource contains several LOD resources. Each LOD resource stores the cluster group indices of its level; the maximum number of cluster groups per LOD is 8. It also stores the vertex and index location in the mesh vertex buffer. Nanite mesh building takes a long time. To speed up development iteration, we serialize the DAG structure to disk and load it at the next launch instead of rebuilding it.
Build BVH
UE’s Nanite uses a BVH to accelerate GPU cluster group culling and LOD selection. Offline, UE builds the BVH with the SAH (Surface Area Heuristic) method. In SimNanite, we find the largest dimension of the bounding box extents and sort the cluster groups by their position along that dimension. After that, we split the cluster groups into 4 nodes and build the whole BVH tree bottom-up. Each LOD has a root BVH node. For example, a mesh with four LODs contains four root nodes.
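One split step of this scheme might look like the sketch below. It only covers sorting along the largest axis and cutting the sorted list into four chunks; `GroupBounds`, `LargestAxis` and `SplitIntoFour` are hypothetical names, and using the group centers to measure the extents is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical per-group data: only the bounds center matters for this step.
struct GroupBounds { std::array<float, 3> center; };

// Returns the axis (0 = x, 1 = y, 2 = z) along which the centers spread the most.
int LargestAxis(const std::vector<GroupBounds>& groups) {
    std::array<float, 3> lo = groups[0].center, hi = groups[0].center;
    for (const auto& g : groups)
        for (int a = 0; a < 3; ++a) {
            lo[a] = std::min(lo[a], g.center[a]);
            hi[a] = std::max(hi[a], g.center[a]);
        }
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (hi[a] - lo[a] > hi[axis] - lo[axis]) axis = a;
    return axis;
}

// Sorts group indices along the largest axis and cuts them into up to four chunks.
std::vector<std::vector<int>> SplitIntoFour(const std::vector<GroupBounds>& groups) {
    std::vector<std::vector<int>> children;
    if (groups.empty()) return children;
    std::vector<int> order(groups.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    const int axis = LargestAxis(groups);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return groups[a].center[axis] < groups[b].center[axis];
    });
    const std::size_t per = (order.size() + 3) / 4; // chunk size, rounded up
    for (std::size_t start = 0; start < order.size(); start += per) {
        const std::size_t end = std::min(start + per, order.size());
        children.emplace_back(order.begin() + start, order.begin() + end);
    }
    return children;
}
```

Each chunk becomes one child node; repeating the grouping on the resulting nodes builds the tree upward.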
Culling
Instance Culling
This step performs instance-level GPU culling, which is straightforward to implement, so we do not detail it here. The input of this step is the scene instance data and the output is the instance list after camera culling.
scene instance data:
instances viewed by camera:
instances culled by camera:
Persistent Culling
I tried to implement a DAG traversal at first, but found it too complicated to traverse the DAG in a compute shader. So I use the BVH structure to traverse the cluster groups on the GPU instead, which is what Unreal does.
UE’s Nanite uses the MPMC (Multiple Producers, Multiple Consumers) model to traverse the BVH structure. It also integrates cluster culling into the BVH culling shader for work balance. SimNanite implements both features (MPMC and integrated cluster culling).
SimNanite processes the node culling tasks at first. After the previous node culling task has completed, the first thread in the group fetches the node tasks from the node MPMC culling task queue.
Then, the compute group counts the node tasks ready to process.
We start the node culling task when at least one node is ready in the group. SimNanite generates cluster culling tasks and pushes them to the cluster culling queue if a node is a leaf node and its cluster group error is small enough. Otherwise, SimNanite generates BVH node culling tasks and pushes them to the BVH node culling queue.
Cluster Culling
Cluster culling is a two-pass process. The first pass is performed at the persistent culling stage: a compute group processes cluster culling tasks when it has no node culling tasks left to process. SimNanite dispatches an additional cluster culling pass after the persistent culling stage to process tasks that were not handled in the previous stage.
Generate Indirect Draw Command
An indirect draw command is generated during cluster culling. SimNanite uses software rasterization for those clusters that are small enough, which is the same as UE’s Nanite solution.
SimNanite uses the instance ID to index the cluster buffer for hardware rasterization. For software rasterization, SimNanite uses the group ID to index the cluster buffer. Each compute group processes one cluster, with one triangle per thread.
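The bookkeeping behind the two commands can be sketched on the CPU as follows. This is a sequential stand-in for what the culling shader does with atomic appends. `DrawArgs` and `DispatchArgs` mirror the field order of D3D12's `D3D12_DRAW_ARGUMENTS` and `D3D12_DISPATCH_ARGUMENTS`; everything else (`RasterCommands`, `BuildCommands`, the screen-size metric, the threshold and the per-cluster vertex count) is a hypothetical placeholder.

```cpp
#include <cstdint>
#include <vector>

struct DrawArgs { uint32_t vertexCountPerInstance, instanceCount, startVertex, startInstance; };
struct DispatchArgs { uint32_t groupCountX, groupCountY, groupCountZ; };

struct RasterCommands {
    DrawArgs draw;
    DispatchArgs dispatch;
    std::vector<uint32_t> hardwareClusters; // looked up by instance ID
    std::vector<uint32_t> softwareClusters; // looked up by group ID
};

// Sorts visible clusters into the hardware or software list and sizes both commands.
RasterCommands BuildCommands(const std::vector<float>& clusterScreenSize,
                             float softwareThreshold,        // assumed "small enough" limit
                             uint32_t maxVerticesPerCluster) // assumed fixed upper bound
{
    RasterCommands cmd{};
    for (uint32_t i = 0; i < clusterScreenSize.size(); ++i) {
        if (clusterScreenSize[i] < softwareThreshold)
            cmd.softwareClusters.push_back(i);
        else
            cmd.hardwareClusters.push_back(i);
    }
    // One instance per hardware cluster, one thread group per software cluster.
    cmd.draw = {maxVerticesPerCluster,
                static_cast<uint32_t>(cmd.hardwareClusters.size()), 0, 0};
    cmd.dispatch = {static_cast<uint32_t>(cmd.softwareClusters.size()), 1, 1};
    return cmd;
}
```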
Hardware Rasterization
SimNanite uses indirect instanced draws for hardware-rasterized clusters. The vertex shader indexes clusters by instance ID. Each cluster stores its range in the vertex buffer. Note that all vertex buffers in the scene are merged into a single global buffer; otherwise, we could not rasterize the scene with a single indirect draw call. UE’s Nanite also implements a complicated streaming solution for managing the global scene vertex buffer.
We load the vertex position using the scene index buffer and the culled cluster buffer. After the MVP transform, we store the cluster index and triangle index in the visibility buffer. In addition, we store the material index in the material ID buffer, which is used as the depth-test buffer in a later rendering pass.
Below is the visibility buffer visualization:
Software Rasterization
When triangles are small enough, we rasterize them in software with a compute shader. Each thread in the compute group processes one triangle. First, we load the vertex positions based on the triangle index and the cluster index.
Then, calculate the screen position based on the cluster instance world matrix and view projection matrix.
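The last part of this step, going from a clip-space position (after the world and view-projection matrices have been applied) to pixel coordinates, can be sketched as below. `Float4`, `ScreenPos` and `ClipToScreen` are hypothetical names, and the D3D-style convention (NDC y up, screen y down) is an assumption.

```cpp
struct Float4 { float x, y, z, w; };
struct ScreenPos { float x, y, depth; };

// Perspective divide followed by a viewport transform.
// Assumption: NDC x and y span [-1, 1], NDC y points up, screen y points down.
ScreenPos ClipToScreen(Float4 clip, float width, float height) {
    const float invW = 1.0f / clip.w;
    const float ndcX = clip.x * invW;
    const float ndcY = clip.y * invW;
    return {(ndcX * 0.5f + 0.5f) * width,
            (0.5f - ndcY * 0.5f) * height,
            clip.z * invW};
}
```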
Execute the back face culling process.
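A common way to do this in screen space is to test the sign of the triangle's signed area, sketched below. The names are hypothetical, and which sign counts as front-facing depends on the winding order and on whether screen y points up or down; the convention here is an assumption.

```cpp
struct Float2 { float x, y; };

// Twice the signed area of triangle (a, b, c).
float SignedArea2(Float2 a, Float2 b, Float2 c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Assumption: triangles with positive signed area are front-facing.
// Zero-area (degenerate) triangles are culled as well.
bool IsBackFacing(Float2 a, Float2 b, Float2 c) {
    return SignedArea2(a, b, c) <= 0.0f;
}
```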
Rasterize the screen pixels covered by triangles. This is the simplest implementation and can be optimized in the future.
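A straightforward version of such a rasterizer walks the triangle's bounding box and tests each pixel center against the three edge functions. The sketch below is a CPU illustration under assumed names (`Point2`, `Pixel`, `CoveredPixels`); it assumes the triangle has positive signed area in this coordinate system and does not apply a top-left fill rule.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Point2 { float x, y; };
struct Pixel { int x, y; };

// Positive when p lies to the left of the directed edge a -> b.
float EdgeFunction(Point2 a, Point2 b, Point2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Returns every pixel whose center lies inside or on the triangle.
std::vector<Pixel> CoveredPixels(Point2 v0, Point2 v1, Point2 v2, int width, int height) {
    std::vector<Pixel> out;
    // Clamp the triangle's bounding box to the render target.
    const int minX = std::max(0, static_cast<int>(std::floor(std::min({v0.x, v1.x, v2.x}))));
    const int minY = std::max(0, static_cast<int>(std::floor(std::min({v0.y, v1.y, v2.y}))));
    const int maxX = std::min(width - 1, static_cast<int>(std::ceil(std::max({v0.x, v1.x, v2.x}))));
    const int maxY = std::min(height - 1, static_cast<int>(std::ceil(std::max({v0.y, v1.y, v2.y}))));
    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            const Point2 p{x + 0.5f, y + 0.5f}; // sample at the pixel center
            const float w0 = EdgeFunction(v1, v2, p);
            const float w1 = EdgeFunction(v2, v0, p);
            const float w2 = EdgeFunction(v0, v1, p);
            if (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f) out.push_back({x, y});
        }
    }
    return out;
}
```

Dividing `w0`, `w1` and `w2` by their sum gives the barycentric weights of the pixel, so the same values can be reused for depth interpolation.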
Finally, we calculate the barycentric coordinates and screen depth of this pixel. The compute shader is not able to perform a hardware depth test, so we convert the depth value type from float to int and use InterlockedMax to perform a software depth test.
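The idea can be illustrated on the CPU as below. For non-negative IEEE 754 floats, the raw bit pattern read as an unsigned integer preserves ordering, so an integer maximum picks the largest depth. `std::atomic` with a compare-exchange loop stands in for the shader's InterlockedMax; `DepthToUint` and `AtomicMaxDepth` are hypothetical names, and "larger value wins" is assumed from the use of a maximum.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Reinterprets a non-negative float depth as an order-preserving unsigned integer.
uint32_t DepthToUint(float depth) {
    uint32_t bits;
    std::memcpy(&bits, &depth, sizeof(bits));
    return bits;
}

// CPU stand-in for InterlockedMax. Returns true when `value` replaced the stored depth.
bool AtomicMaxDepth(std::atomic<uint32_t>& texel, uint32_t value) {
    uint32_t current = texel.load();
    while (value > current) {
        // On failure, `current` is refreshed with the latest stored value.
        if (texel.compare_exchange_weak(current, value)) return true;
    }
    return false;
}
```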
Below is a visualization of the software rasterization process. The green area is rasterized by the software compute shader.
Hardware Rasterization Part:
Software Rasterization Part:
Visualize Visibility Buffer:
BasePass or GBuffer Pass
With the visibility buffer, we can finally render the scene. First, SimNanite transfers the material index to the depth buffer, which will be used in the next pass’s depth test operation.
Then we draw a full screen quad for each material. The depth of this quad equals the material index.
Unreal’s Nanite dispatches an additional material classification pass to split the screen into tiles. Then Unreal uses indirect draw to draw the tiles generated in the previous pass, which reduces quad overdraw. In SimNanite, we remove the classification pass and draw the screen quad directly. For each material quad, we only render pixels that pass the material index depth test. Below are two material quad depth test figures. The first one is the bunny material and the second one is the teapot material.
For each pixel, SimNanite fetches the cluster index and triangle index from the visibility buffer. Then, we calculate the barycentric coordinates based on the vertex position buffer. Finally, SimNanite calculates the pixel attributes, such as UV and normal, from the barycentric coordinates, and renders the material with these attributes.
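The attribute reconstruction can be sketched as below, using screen-space triangle areas. The names are hypothetical. The sketch ignores perspective correction and assumes a non-degenerate triangle, so it only illustrates the interpolation itself.

```cpp
struct Vec2 { float x, y; };
struct Bary { float u, v, w; }; // weights of vertices a, b and c

// Barycentric weights of point p with respect to triangle (a, b, c).
Bary Barycentric(Vec2 a, Vec2 b, Vec2 c, Vec2 p) {
    const float area = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    // Each weight is the area of the sub-triangle opposite the vertex, over the full area.
    const float u = ((b.x - p.x) * (c.y - p.y) - (b.y - p.y) * (c.x - p.x)) / area;
    const float v = ((c.x - p.x) * (a.y - p.y) - (c.y - p.y) * (a.x - p.x)) / area;
    return {u, v, 1.0f - u - v};
}

// Blends a per-vertex attribute (here a UV pair) with the weights.
Vec2 InterpolateUV(Bary k, Vec2 uv0, Vec2 uv1, Vec2 uv2) {
    return {k.u * uv0.x + k.v * uv1.x + k.w * uv2.x,
            k.u * uv0.y + k.v * uv1.y + k.w * uv2.y};
}
```

Normals can be blended the same way and then renormalized.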
Below is the final SimNanite result: