3D Smart Gaussian Splatting

From Images to Semantic 3D Gaussian Splatting with Python

Build an interactive 3D semantic scanner using Python and Depth Anything V3. Transform 2D images/videos into labeled Gaussian Splats in milliseconds.
📈 Advanced ⏱ 60 min ⚡ Python / Open3D / PyTorch
💡
Prerequisites: Familiarity with numpy array manipulation, basic linear algebra (projections), and the concept of 3D Gaussian Splatting is assumed. We will heavily reference Depth Anything V3 for geometry.

1. The Semantic Gap in 3D Reconstruction

Traditional 3D Gaussian Splatting (3DGS) excels at photorealism, optimizing SH coefficients and opacity to match pixel colors. However, it is structurally “blind”. A splat representing a car is mathematically indistinguishable from a splat representing the road—they are just Gaussians in space. To build a true Semantic Scanner, we must bridge the gap between appearance (RGB) and meaning (Semantics).

In this lesson, we hijack the standard 3DGS pipeline. Instead of just optimizing for color, we will inject a Semantic Channel into our Gaussian model. By leveraging the state-of-the-art Depth Anything V3 model, we can lift 2D interactions (clicking a pixel) into 3D space with per-pixel geometric precision, assigning labels to millions of points in milliseconds.

Standard RGB Splat

Optimizes: position, covariance, alpha, SH_coeffs.

Result: Beautiful visual, no understanding.

Semantic Splat

Adds: class_id, instance_id, probability.

Result: Queryable 3D Database.

💡
Note: We are not training a network here. We are performing test-time optimization and direct projection, which allows for real-time interactivity without hours of GPU training.

But before we can label anything, we need to understand our geometry engine. How do we get 3D from a single image? Enter Depth Anything V3. Does it live up to the hype?

2. Depth Anything V3: The Geometry Engine

Depth Anything V3 represents a paradigm shift in Monocular Depth Estimation (MDE). Unlike earlier models that struggled with thin structures or transparent surfaces, V3 utilizes a massive training set of 1.5M labeled images and 62M unlabeled ones. For our scanner, we treat this model as a black-box function f(I) -> D that maps an RGB image I to a dense depth map D.

We will use the vitb (Vision Transformer Base) encoder for a balance between speed (~30ms inference) and accuracy. The output is a relative depth map, which we must inverse-normalize to get metric consistency if we lack ground truth scale. The precision of these depth maps is critical; a noisy depth map leads to “flying floaters” in our Gaussian Cloud.

Pipeline: RGB Image → Depth Anything V3 (ViT-B encoder) → Depth Map
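As a concrete starting point, here is a minimal inference sketch using the Hugging Face transformers depth-estimation pipeline. The model identifier below is an assumption (substitute the actual Depth Anything V3 checkpoint you use), and whether V3 ships through this exact pipeline depends on your installed versions; treat this as a sketch of the black-box f(I) -> D call, not a verified recipe.

import cv2
import numpy as np
import torch
from PIL import Image
from transformers import pipeline

# Hypothetical checkpoint name -- replace with the Depth Anything V3 weights you actually use.
depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V3-Base",  # assumption, not a verified model id
    device=0 if torch.cuda.is_available() else -1,
)

image = Image.open("frame_000.png").convert("RGB")
result = depth_estimator(image)

# "predicted_depth" is the raw network output; resize it back to the image resolution.
depth = result["predicted_depth"].squeeze().cpu().numpy().astype(np.float32)
depth = cv2.resize(depth, (image.width, image.height))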

The pixel-wise depth Z allows us to “lift” every pixel into 3D space. But correct lifting requires understanding the camera geometry. How do we mathematically perform this unprojection?

3. From Pixels to Point Clouds (Unprojection)

To convert a 2D pixel (u, v) with depth Z into a 3D point (X, Y, Z), we invert the standard Pinhole Camera Model. We need the Camera Intrinsics Matrix K, a 3x3 matrix containing the focal lengths fx, fy and the principal point cx, cy.

The unprojection formula is computationally cheap but must be vectorized for performance. In Python, using numpy broadcasting is essential to process 1920x1080 (approx 2 million) points instantly. We avoid `for` loops like the plague.

Pinhole Unprojection

We transform pixel coordinates to normalized sensor coordinates, then scale by depth.

Vector Form:
P_{3D} = Z \cdot K^{-1} \cdot [u, v, 1]^T
🚀
Optimization: Pre-calculate the meshgrid of u, v coordinates once. During the loop, only the multiplication with Z (depth) changes. This reduces the operation to a simple element-wise multiplication.
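Here is a minimal vectorized sketch of this unprojection, assuming depth is an (H, W) float array and K is the 3x3 intrinsics matrix; the function name is illustrative.

import numpy as np

def unproject(depth, K):
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Meshgrid of pixel coordinates (u along x, v along y).
    # Per the optimization note above, hoist this out of any per-frame loop.
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # P_3D = Z * K^-1 * [u, v, 1]^T, expanded element-wise:
    Z = depth
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy

    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3).astype(np.float32)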

Now that we have a cloud of points, we need to convert them into our target representation: Semantic Gaussians. How do we structure this data class?

4. Initializing Semantic Gaussians

A standard Gaussian Splat is defined by its mean (position), covariance (scale + rotation), opacity, and SH (color). For our semantic variant, we append a label integer and potentially a confidence float. We optimize storage by using np.int8 for labels if our class count is small (127 classes or fewer).

We perform a “cold start” initialization: every projected point from Depth Anything V3 becomes the center of a spherical Gaussian. We set the initial scale based on the distance to the nearest neighbor (or a simple heuristic based on depth Z) to ensure coverage without excessive overlap.

gaussian_model.py
import numpy as np

class SemanticGaussianModel:
    def __init__(self, points, colors):
        # Standard Attributes
        self.xyz = points        # (N, 3) float32
        self.rgb = colors        # (N, 3) uint8
        self.scaling = np.full(points.shape[0], 0.01)

        # Semantic Attributes
        self.labels = np.zeros(points.shape[0], dtype=np.int8)        # 0 = Unlabeled
        self.confidence = np.zeros(points.shape[0], dtype=np.float16)
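Tying Sections 3 and 4 together, a short usage sketch; it relies on the illustrative unproject helper from Section 3 and assumes image_rgb is the (H, W, 3) uint8 frame.

points = unproject(depth, K)                   # (H*W, 3) float32
colors = image_rgb.reshape(-1, 3)              # (H*W, 3) uint8
cloud = SemanticGaussianModel(points, colors)  # all labels start at 0 (unlabeled)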

With our data structure ready, we need an interface to interact with it. How do we visualize and label this data in real-time?

5. The OpenCV Labelling GUI

While web apps are great, OpenCV provides a bare-metal, low-latency GUI window accessible directly from a Python script. We use cv2.setMouseCallback to register a handler for pixel interactions. When the user clicks on the 2D image, we capture the (u, v) coordinates.

We overlay the current segmentation mask on the video feed using cv2.addWeighted for transparency. This feedback loop is essential: the user clicks, the system segments, the display updates. All in under 50ms.

GUI Logic Flow

The Event Listener captures mouse clicks. cv2.EVENT_LBUTTONDOWN triggers the segmentation logic at (x, y). We store these seed points in a list.

if event == cv2.EVENT_LBUTTONDOWN:
    seeds.append((x, y))
    update_mask(seeds)

The Mask Overlay blends the binary mask (colored red) with the original image. alpha=0.5 gives a clear view of boundaries.

overlay = image.copy()
overlay[mask == 1] = (0, 0, 255)
disp = cv2.addWeighted(overlay, 0.5, image, 0.5, 0)
Warning: OpenCV’s highgui is blocking. Ensure your heavy processing runs in a separate thread or is extremely optimized to prevent the window from freezing.
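A minimal event-loop sketch wiring these pieces together. update_mask and seeds are the ones from the fragments above; image and mask stand for the current frame and current binary mask, and the window name is illustrative.

import cv2

seeds = []

def on_mouse(event, x, y, flags, param):
    # Left click adds a seed point and triggers segmentation
    if event == cv2.EVENT_LBUTTONDOWN:
        seeds.append((x, y))
        update_mask(seeds)

cv2.namedWindow("scanner")
cv2.setMouseCallback("scanner", on_mouse)

while True:
    # Blend the current mask over the frame for visual feedback
    overlay = image.copy()
    overlay[mask == 1] = (0, 0, 255)
    disp = cv2.addWeighted(overlay, 0.5, image, 0.5, 0)

    cv2.imshow("scanner", disp)
    if cv2.waitKey(16) & 0xFF == ord("q"):  # ~60 Hz refresh; 'q' quits
        break

cv2.destroyAllWindows()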

We have the 2D interaction. Now comes the core challenge: How do we translate a single 2D pixel click into a volumetric 3D selection?

6. Ray-Splat Intersection Strategy

A simple unprojection of the clicked pixel gives us a single 3D point. However, objects in 3DGS are composed of thousands of overlapping splats. We need to select all Gaussians relevant to the object, not just one.

We employ a Ray-Casting strategy. We cast a ray from the camera center through the pixel (u, v). We then define a cylinder or cone around this ray and find all Gaussian centers that fall within it. To filter occluded Gaussians (those behind the visible surface), we use the depth value D from Depth Anything V3 as a hard cutoff threshold (D +/- epsilon).

Diagram: Ray(u, v) cast from the camera → Gaussians selected near the visible surface
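A sketch of this selection. For a single frame, the points unprojected in Section 3 already live in that camera's frame, so the ray starts at the origin; function name, radius, and epsilon are illustrative.

import numpy as np

def select_along_ray(xyz_cam, u, v, K, depth_click, radius=0.03, eps=0.05):
    # Ray direction through pixel (u, v): K^-1 [u, v, 1]^T, normalized
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray /= np.linalg.norm(ray)

    # Cylinder test: distance of every Gaussian center from the ray
    t = xyz_cam @ ray                 # projection length along the ray
    closest = np.outer(t, ray)        # closest point on the ray for each center
    dist_to_ray = np.linalg.norm(xyz_cam - closest, axis=1)

    # Depth cutoff from Depth Anything V3: keep splats near the visible surface (D +/- eps)
    near_surface = np.abs(xyz_cam[:, 2] - depth_click) < eps

    return np.where((dist_to_ray < radius) & (t > 0) & near_surface)[0]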

Identifying the splats is step one. Step two is propagating this label to the neighbors to fill the volume of the object.

7. Propagating Labels in 3D (KNN)

We use a K-Nearest Neighbors (KNN) approach, often utilizing a highly optimized KDTree (from scipy.spatial or pynanoflann). When a set of “seed” splats is labeled via the ray intersection, we query the tree for their neighbors within a radius R.

To prevent “bleeding” into unconnected objects (e.g., labeling the floor when selecting a shoe), we enforce color consistency and normal consistency checks. If a neighbor is spatially close but has a vastly different color, the label propagation stops.

propagation.py
import numpy as np
from scipy.spatial import cKDTree

def propagate_labels(cloud, seed_indices, label, radius=0.05):
    tree = cKDTree(cloud.xyz)
    # Find neighbors for all seeds
    indices = tree.query_ball_point(cloud.xyz[seed_indices], r=radius)
    # Flatten and unique
    all_indices = np.unique([item for sublist in indices for item in sublist])
    # Assign the current label to every neighbor
    cloud.labels[all_indices] = label
    return cloud
💡
Tip: For large scenes, rebuild the KDTree only when the geometry changes. Since our geometry is static (only labels change), build it once at startup.

Now that we have modified the internal state of our Gaussians, we need to render the result. How do we visualize semantic masks in a 3D viewer?

8. Real-Time Projection Algorithm

We are not using the differentiable Gaussian Rasterizer here (unless you are integrating into a heavy pipeline). For a lightweight Python viewer, we can use Point-Based Rendering with Open3D or a custom OpenGL shader.

We map each unique label_id to a distinct color. When rendering, we simply swap the cloud.rgb buffer with a color_map[cloud.labels] buffer. This allows us to toggle between “RGB Mode” and “Semantic Mode” instantly. The overhead is negligible—just a memory copy.

👁

Render Modes

Comparison of rendering buffers sent to the GPU.

Mode         Data Source                 Visualization
RGB          cloud.rgb                   Photorealistic
Semantic     palette[cloud.labels]       Flat coded colors
Confidence   heatmap(cloud.confidence)   Gradient (Red = Low, Green = High)
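A sketch of the buffer swap using Open3D point-based rendering. It assumes the SemanticGaussianModel fields from Section 4 and a hypothetical palette array of shape (num_classes, 3) with values in 0-255.

import numpy as np
import open3d as o3d

def show(cloud, mode="rgb", palette=None):
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(cloud.xyz.astype(np.float64))

    if mode == "semantic":
        colors = palette[cloud.labels]   # (N, 3) lookup -- just a memory copy
    else:
        colors = cloud.rgb
    pcd.colors = o3d.utility.Vector3dVector(colors.astype(np.float64) / 255.0)

    o3d.visualization.draw_geometries([pcd])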

The system works. We can click, label, propagate, and render. But scanning a whole room image by image is tedious. How do we scale this?

9. Scaling to Video Sequences

To process a video, we assume temporal consistency. If a pixel (u, v) is labeled “Chair” in Frame 1, it should remain “Chair” in Frame 2, provided the camera movement is small.

We project the labeled 3D Gaussians back onto the 2D camera plane of Frame 2. This generates a “pseudo-ground-truth” mask for Frame 2. We then run Depth Anything V3 on Frame 2, unproject new points, and if they align with existing labeled points, they inherit the label. This Label Inheritance loop allows us to label an object once and have the label “stick” as we move around it.

Forward Projection

Project 3D labels to 2D mask for the next frame.

Fusion

Merge new unprojected points with existing octree/grid.
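A sketch of the forward-projection step, assuming a known world-to-camera pose (R, t) for the next frame; occlusion handling (z-buffering) is omitted for brevity, and all names are illustrative.

import numpy as np

def project_labels_to_mask(xyz, labels, K, R, t, height, width):
    # World -> camera frame of the next view
    cam = xyz @ R.T + t
    in_front = cam[:, 2] > 1e-6
    cam, lbl = cam[in_front], labels[in_front]

    # Perspective projection with the intrinsics K
    uv = (cam / cam[:, 2:3]) @ K.T
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    mask = np.zeros((height, width), dtype=np.int8)
    mask[v[inside], u[inside]] = lbl[inside]   # pseudo-ground-truth labels for the next frame
    return mask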

Drift Warning: Without loop closure or a SLAM backend, position errors accumulate. Your labels might start “floating” away from the object after a few hundred frames.

Finally, we need to save our work. A proprietary format is useless. How do we export this for use in other tools?

10. Exporting for Downstream Tasks

The .ply (Polygon File Format) is the gold standard for point clouds. It is flexible enough to handle custom headers. We define a custom property scalar_class in the header to store our labels.

Standard viewers like CloudCompare will read this as a scalar field, allowing you to colorize the cloud by “scalar_class” immediately upon loading. No conversion needed.

ply_exporter.py
# Construct PLY Header
header = """ply
format binary_little_endian 1.0
element vertex {}
property float x
property float y
property float z
property uchar red
property uchar green
property uchar blue
property uchar scalar_class
end_header
""".format(len(cloud.xyz))
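To finish the export, a minimal sketch that writes the binary body after the header string built above. It assumes the SemanticGaussianModel fields from Section 4; the structured dtype must mirror the header's property order exactly.

import numpy as np

def export_ply(cloud, path, header):
    # Structured dtype matching the header's property order (little-endian)
    vertex = np.empty(len(cloud.xyz), dtype=[
        ("x", "<f4"), ("y", "<f4"), ("z", "<f4"),
        ("red", "u1"), ("green", "u1"), ("blue", "u1"),
        ("scalar_class", "u1"),
    ])
    vertex["x"], vertex["y"], vertex["z"] = cloud.xyz.astype(np.float32).T
    vertex["red"], vertex["green"], vertex["blue"] = cloud.rgb.astype(np.uint8).T
    vertex["scalar_class"] = cloud.labels.astype(np.uint8)

    with open(path, "wb") as f:
        f.write(header.encode("ascii"))
        f.write(vertex.tobytes())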

So, we have built a pipeline that goes from raw pixels to semantic meaning.

Pipeline Recap & Future

Component        Role                  Key Libs
Geometry         Depth Extraction      Depth Anything V3
Representation   Storage & Rendering   Gaussian Splatting
Interaction      Selection & Masking   OpenCV, KDTree


🚀
Florent’s Vision: By 2026, manual semantic labeling will be obsolete. Foundation models like SAM 2 combined with real-time Splatting will allow for “Text-to-Label” interactions where you simply type “Show me the chairs” and the 3D scene segments itself instantly.

Practical Exercises

Exercise 1: The Integrity Check

Modify the Gaussian class to filter points with low confidence from Depth Anything V3. Visualize the result.

Exercise 2: The Auto-Segmenter

Integrate Segment Anything (SAM). Instead of manual clicks, use SAM masks to label clusters of Gaussians automatically.