3D Smart Gaussian Splatting
The Road to 3D Expertise
Don’t wait, start building a complete 3D Reconstruction System Today
Guaranteed results within 2 weeks: 20 hours for the main quest, 50 hours for all the side quests.
From Images to Semantic 3D Gaussian Splatting with Python
Familiarity with numpy array manipulation, basic linear algebra (projections), and the concept of 3D Gaussian Splatting is assumed. We will heavily reference Depth Anything V3 for geometry.
1. The Semantic Gap in 3D Reconstruction
Traditional 3D Gaussian Splatting (3DGS) excels at photorealism, optimizing
SH coefficients and opacity to match pixel colors. However, it is structurally
“blind”. A splat representing a car is mathematically indistinguishable from a splat representing the road—they
are just Gaussians in space. To build a true Semantic Scanner, we must bridge the gap between
appearance (RGB) and meaning (Semantics).
In this lesson, we hijack the standard 3DGS pipeline. Instead of just optimizing for color, we will inject a Semantic Channel into our Gaussian model. By leveraging the state-of-the-art Depth Anything V3 model, we can lift 2D interactions (clicking a pixel) into 3D space with millimetric precision, assigning labels to millions of points in milliseconds.
| Splat Type | Attributes | Result |
|---|---|---|
| Standard RGB Splat | Optimizes: position, covariance, alpha, SH_coeffs | Beautiful visual, no understanding |
| Semantic Splat | Adds: class_id, instance_id, probability | Queryable 3D Database |
But before we can label anything, we need to understand our geometry engine. How do we get 3D from a single image? Enter Depth Anything V3. Does it live up to the hype?
2. Depth Anything V3: The Geometry Engine
Depth Anything V3 represents a paradigm shift in Monocular Depth Estimation
(MDE). Unlike earlier models that struggled with thin structures or transparent surfaces, V3
utilizes a massive training set of 1.5M labeled images and 62M unlabeled ones. For our
scanner, we treat this model as a black box function f(I) -> D that maps an RGB image
I to a metric depth map D.
We will use the vitb (Vision Transformer Base) encoder for a balance between speed
(~30ms inference) and accuracy. The output is a relative depth map, which we must invert and rescale to recover metric consistency if we lack ground-truth scale. The precision of these depth maps is critical; a noisy depth map leads to “flying floaters” in our Gaussian Cloud.
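To make the black-box contract f(I) -> D concrete, here is a minimal sketch using the Hugging Face depth-estimation pipeline; the checkpoint id is a placeholder (swap in the Depth Anything V3 vitb weights you actually use), and the resize step simply aligns the prediction with the input resolution.

```python
import cv2
import numpy as np
from PIL import Image
from transformers import pipeline

# Placeholder checkpoint id -- substitute the Depth Anything V3 (vitb) weights you use.
depth_pipe = pipeline("depth-estimation", model="<depth-anything-v3-vitb-checkpoint>")

def estimate_depth(image_path: str) -> np.ndarray:
    """Black-box f(I) -> D: returns an HxW float32 depth map aligned with the input image."""
    image = Image.open(image_path).convert("RGB")
    out = depth_pipe(image)
    depth = out["predicted_depth"].squeeze().cpu().numpy().astype(np.float32)
    # The network predicts at its own resolution; resize back to the input frame.
    depth = cv2.resize(depth, image.size, interpolation=cv2.INTER_LINEAR)
    return depth
```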
The pixel-wise depth Z allows us to “lift” every pixel into 3D space. But correct lifting requires
understanding the camera geometry. How do we mathematically perform this unprojection?
3. From Pixels to Point Clouds (Unprojection)
To convert a 2D pixel (u, v) with depth Z into a 3D point (X, Y, Z), we
invert the standard Pinhole Camera Model. We need the Camera Intrinsics Matrix K,
generally a 3x3 matrix containing the focal lengths fx, fy and principal point
cx, cy.
The unprojection formula is computationally cheap but must be vectorized for performance. In Python, using
numpy broadcasting is essential to process 1920x1080 (approx 2 million) points
instantly. We avoid `for` loops like the plague.
Pinhole Unprojection
We transform pixel coordinates to normalized sensor coordinates, then scale by depth.
P_{3D} = Z \cdot K^{-1} \cdot [u, v, 1]^T
Performance tip: precompute the ray directions K^{-1} [u, v, 1]^T for all u, v coordinates once. During the loop, only the multiplication with Z (depth) changes. This reduces the operation to a simple element-wise multiplication.
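Here is a minimal vectorized sketch of that unprojection (the function name and signature are ours); it assumes the depth map and the intrinsics K share the same pixel convention.

```python
import numpy as np

def unproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift an HxW depth map to an (H*W, 3) point cloud via P = Z * K^-1 [u, v, 1]^T."""
    H, W = depth.shape
    # Pixel grid: u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # (H*W, 3)
    # Precompute the ray directions K^-1 [u, v, 1]^T once...
    rays = pixels @ np.linalg.inv(K).T                                    # (H*W, 3)
    # ...then scaling by Z is a single element-wise multiplication.
    return rays * depth.reshape(-1, 1)
```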
Now that we have a cloud of points, we need to convert them into our target representation: Semantic Gaussians. How do we structure this data class?
4. Initializing Semantic Gaussians
A standard Gaussian Splat is defined by its mean (position), covariance (scale +
rotation), opacity, and SH (color). For our semantic variant, we append a
label integer and potentially a confidence float. We optimize storage by using np.int8 for labels if our class count is small (at most 127).
We perform a “cold start” initialization: every projected point from Depth Anything V3 becomes the center of a
spherical Gaussian. We set the initial scale based on the distance to the nearest neighbor (or a simple
heuristic based on depth Z) to ensure coverage without excessive overlap.
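One possible layout for this container, as a structure-of-arrays sketch (field names are illustrative, not a fixed API), together with the cold-start initializer described above:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SemanticGaussians:
    """Structure-of-arrays container for N semantic splats."""
    means: np.ndarray        # (N, 3) float32 -- positions from unprojection
    scales: np.ndarray       # (N, 3) float32 -- per-axis extent, isotropic at init
    rotations: np.ndarray    # (N, 4) float32 -- unit quaternions, identity at init
    opacities: np.ndarray    # (N,)   float32
    colors: np.ndarray       # (N, 3) uint8   -- RGB sampled from the source image
    labels: np.ndarray       # (N,)   int8    -- class id, -1 = unlabeled
    confidence: np.ndarray   # (N,)   float32 -- label confidence in [0, 1]

def cold_start(points: np.ndarray, colors: np.ndarray, base_scale: float = 0.01) -> SemanticGaussians:
    """Every unprojected point becomes an isotropic Gaussian; scale grows with depth Z."""
    n = len(points)
    scales = (base_scale * points[:, 2:3].clip(min=1e-3)) * np.ones((n, 3), dtype=np.float32)
    rotations = np.tile(np.array([1, 0, 0, 0], dtype=np.float32), (n, 1))  # identity quaternion
    return SemanticGaussians(
        means=points.astype(np.float32),
        scales=scales.astype(np.float32),
        rotations=rotations,
        opacities=np.full(n, 0.8, dtype=np.float32),
        colors=colors.astype(np.uint8),
        labels=np.full(n, -1, dtype=np.int8),
        confidence=np.zeros(n, dtype=np.float32),
    )
```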
With our data structure ready, we need an interface to interact with it. How do we visualize and label this data in real-time?
5. The OpenCV Labelling GUI
While web apps are great, OpenCV provides a bare-metal, low-latency GUI window accessible directly from a Python script. We use cv2.setMouseCallback to register mouse interactions. When the user clicks on the 2D image, we capture the (u, v) coordinates.
We overlay the current segmentation mask on the video feed using cv2.addWeighted for transparency.
This feedback loop is essential: the user clicks, the system segments, the display updates. All in under
50ms.
GUI Logic Flow
- Event Listener: cv2.EVENT_LBUTTONDOWN triggers the segmentation logic at (x, y); we store these seed points in a list.
- Mask Overlay: blends the binary mask (colored red) with the original image; alpha=0.5 gives a clear view of boundaries.
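A bare-bones sketch of that loop (the window name and helper names are ours):

```python
import cv2
import numpy as np

seeds = []  # clicked (u, v) seed points, consumed by the segmentation step

def on_mouse(event, x, y, flags, param):
    # Left click stores a seed point.
    if event == cv2.EVENT_LBUTTONDOWN:
        seeds.append((x, y))

def show_overlay(frame: np.ndarray, mask: np.ndarray, alpha: float = 0.5) -> None:
    """Blend a binary mask (drawn in red) over the frame and display it."""
    overlay = frame.copy()
    overlay[mask > 0] = (0, 0, 255)                       # BGR red where the mask is active
    blended = cv2.addWeighted(frame, 1 - alpha, overlay, alpha, 0)
    cv2.imshow("Semantic Scanner", blended)
    cv2.waitKey(1)                                        # let the window refresh

cv2.namedWindow("Semantic Scanner")
cv2.setMouseCallback("Semantic Scanner", on_mouse)
```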
We have the 2D interaction. Now comes the core challenge: How do we translate a single 2D pixel click into a volumetric 3D selection?
6. Ray-Splat Intersection Strategy
A simple unprojection of the clicked pixel gives us a single 3D point. However, objects in 3DGS are composed of thousands of overlapping splats. We need to select all Gaussians relevant to the object, not just one.
We employ a Ray-Casting strategy. We cast a ray from the camera center through the pixel
(u, v). We then define a cylinder or cone around this ray and find all Gaussian centers that fall
within it. To filter occluded Gaussians (those behind the visible surface), we use the depth value
D from Depth Anything V3 as a hard cutoff threshold (D +/- epsilon).
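One way to implement this selection, assuming the Gaussian centers are expressed in the current camera's coordinate frame (a cylinder test around the ray plus the depth gate):

```python
import numpy as np

def select_splats(means, u, v, depth_uv, K, radius=0.05, eps=0.02):
    """Indices of Gaussian centers near the ray through (u, v), gated by the depth D +/- epsilon."""
    # Ray direction in camera coordinates (camera at the origin, looking down +Z).
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray /= np.linalg.norm(ray)
    # Perpendicular distance of each center to the ray.
    t = means @ ray                              # projection length along the ray
    closest = t[:, None] * ray[None, :]          # closest point on the ray for each center
    dist_to_ray = np.linalg.norm(means - closest, axis=1)
    # Cylinder around the ray + hard depth cutoff to reject occluded splats.
    in_cylinder = dist_to_ray < radius
    near_surface = np.abs(means[:, 2] - depth_uv) < eps
    return np.where(in_cylinder & near_surface)[0]
```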
Identifying the splats is step one. Step two is propagating this label to the neighbors to fill the volume of the object.
7. Propagating Labels in 3D (KNN)
We use a K-Nearest Neighbors (KNN) approach, often utilizing a highly optimized
KDTree (from scipy.spatial or pynanoflann). When a set of “seed” splats
is labeled via the ray intersection, we query the tree for their neighbors within a radius R.
To prevent “bleeding” into unconnected objects (e.g., labeling the floor when selecting a shoe), we enforce color consistency and normal consistency checks. If a neighbor is spatially close but has a vastly different color, the label propagation stops.
Performance tip: rebuild the KDTree only when the geometry changes. Since our geometry is static (only labels change), build it once at startup.
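A sketch of the propagation step using scipy's cKDTree (it assumes the SemanticGaussians container from section 4; the radius and color tolerance are arbitrary defaults):

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_labels(gaussians, seed_idx, label, radius=0.03, color_tol=30.0, tree=None):
    """Grow `label` from seed splats to neighbors within `radius`, with a color-consistency gate."""
    if tree is None:
        tree = cKDTree(gaussians.means)          # geometry is static: build once, reuse
    gaussians.labels[list(seed_idx)] = label
    frontier = list(seed_idx)
    while frontier:
        idx = frontier.pop()
        for n in tree.query_ball_point(gaussians.means[idx], r=radius):
            if gaussians.labels[n] == label:
                continue
            # Stop the flood if the neighbor's color diverges too much (prevents bleeding).
            color_gap = np.linalg.norm(gaussians.colors[n].astype(float) - gaussians.colors[idx].astype(float))
            if color_gap > color_tol:
                continue
            gaussians.labels[n] = label
            frontier.append(n)
    return gaussians
```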
Now that we have modified the internal state of our Gaussians, we need to render the result. How do we visualize semantic masks in a 3D viewer?
8. Real-Time Projection Algorithm
We are not using the differentiable Gaussian Rasterizer here (unless you are integrating into a heavy pipeline).
For a lightweight Python viewer, we can use Point-Based Rendering with Open3D or a
custom OpenGL shader.
We map each unique label_id to a distinct color. When rendering, we simply swap the
cloud.rgb buffer with a color_map[cloud.labels] buffer. This allows us to toggle
between “RGB Mode” and “Semantic Mode” instantly. The overhead is negligible—just a memory copy.
Render Modes
Comparison of rendering buffers sent to the GPU.
| Mode | Data Source | Visualization |
|---|---|---|
| RGB | cloud.rgb | Photorealistic |
| Semantic | palette[cloud.labels] | Flat coded colors |
| Confidence | heatmap(cloud.conf) | Gradient (Red=Low, Green=High) |
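For illustration, a minimal Open3D point-based viewer that performs the buffer swap (it assumes the SemanticGaussians container and a palette array with values in [0, 1]):

```python
import numpy as np
import open3d as o3d

def render(gaussians, mode="rgb", palette=None):
    """Toggle between 'rgb' and 'semantic' by swapping the color buffer sent to the viewer."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(gaussians.means.astype(np.float64))
    if mode == "semantic":
        colors = palette[gaussians.labels]        # unlabeled (-1) indexes the last palette entry
    else:
        colors = gaussians.colors / 255.0
    pcd.colors = o3d.utility.Vector3dVector(colors.astype(np.float64))
    o3d.visualization.draw_geometries([pcd])
```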
The system works. We can click, label, propagate, and render. But scanning a whole room image by image is tedious. How do we scale this?
9. Scaling to Video Sequences
To process a video, we assume temporal consistency. If a pixel (u, v) is labeled “Chair” in Frame 1, it should remain “Chair” in Frame 2, provided the camera movement is small.
We project the labeled 3D Gaussians back onto the 2D camera plane of Frame 2. This generates a “pseudo-ground-truth” mask for Frame 2. We then run Depth Anything V3 on Frame 2, unproject new points, and if they align with existing labeled points, they inherit the label. This Label Inheritance loop allows us to label an object once and have the label “stick” as we move around it.
- Forward Projection: project the 3D labels to a 2D mask for the next frame (sketched below).
- Fusion: merge new unprojected points with the existing octree/grid.
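A sketch of the forward-projection step, assuming a world-to-camera pose (R, t) for the next frame; occlusion handling (z-buffering) is deliberately omitted:

```python
import numpy as np

def project_labels(gaussians, K, R, t, image_shape):
    """Project labeled 3D centers into a 2D label map for the next frame (pinhole model)."""
    H, W = image_shape
    cam_pts = gaussians.means @ R.T + t              # world -> camera coordinates
    in_front = cam_pts[:, 2] > 1e-6
    uvz = cam_pts[in_front] @ K.T                    # camera -> homogeneous pixel coordinates
    uv = (uvz[:, :2] / uvz[:, 2:3]).round().astype(int)
    labels = gaussians.labels[in_front]
    mask = np.full((H, W), -1, dtype=np.int8)        # -1 = no label projected here
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    mask[uv[valid, 1], uv[valid, 0]] = labels[valid] # note: no z-buffer, last write wins
    return mask
```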
Finally, we need to save our work. A proprietary format is useless. How do we export this for use in other tools?
10. Exporting for Downstream Tasks
The .ply (Polygon File Format) is the gold standard for point clouds. It is flexible enough to carry custom per-vertex properties. We define a custom property scalar_class in the header to store our labels.
Standard viewers like CloudCompare will read this as a scalar field, allowing you to colorize the cloud by “scalar_class” immediately upon loading. No conversion needed.
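A plain-ASCII export sketch that writes the scalar_class property by hand (no third-party PLY library; again assuming the SemanticGaussians container):

```python
import numpy as np

def export_ply(gaussians, path="semantic_cloud.ply"):
    """Write an ASCII .ply with a custom per-vertex scalar_class property for the label."""
    n = len(gaussians.means)
    header = "\n".join([
        "ply",
        "format ascii 1.0",
        f"element vertex {n}",
        "property float x", "property float y", "property float z",
        "property uchar red", "property uchar green", "property uchar blue",
        "property int scalar_class",   # CloudCompare exposes this as a scalar field
        "end_header",
    ])
    rows = np.hstack([
        gaussians.means.astype(np.float64),
        gaussians.colors.astype(np.float64),
        gaussians.labels.astype(np.float64).reshape(-1, 1),
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        np.savetxt(f, rows, fmt="%.6f %.6f %.6f %d %d %d %d")
    return path
```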
So, we have built a pipeline that goes from raw pixels to semantic meaning.
Recap & Future
| Component | Role | Key Libs |
|---|---|---|
| Geometry | Depth Extraction | Depth Anything V3 |
| Representation | Storage & Rendering | Gaussian Splatting |
| Interaction | Selection & Masking | OpenCV, KDTree |
Further Reading
- [Paper] Depth Anything V3: Robust Monocular Depth
- [Repo] 3D Gaussian Splatting for Real-Time Rendering
- [Florent’s Code] Semantic Splatting Toolkit v0.1
Practical Exercises
Exercise 1: The Integrity Check
Modify the Gaussian class to filter points with low confidence from Depth Anything V3. Visualize the result.
Exercise 2: The Auto-Segmenter
Integrate Segment Anything (SAM). Instead of manual clicks, use SAM masks to label clusters of Gaussians automatically.
