AI Architect of 3D Systems in Production
Preface
This book brings together in a single volume all the technical guides, tutorials and articles produced as part of the AI Architect of 3D Systems in Production training delivered by the 3D Geodata Academy. I designed it to work two ways at once: as a daily reference for the practitioner, and as a structured learning path for the ambitious beginner.
The field of 3D Data Science and Spatial Artificial Intelligence is accelerating faster than anything I've seen before. In 2026, LiDAR point clouds, BIM models, neural rendering, and AI agents are all converging on the same goal: building digital representations of the physical world that are not just accurate, but intelligent and queryable. This book hands you the keys to every link in that chain.
The audience is wide on purpose: geomatics engineers who want to fold AI into their workflows, data scientists curious about the spatial side, architects and BIM managers looking to automate their processes, and 3D computer vision researchers. Each chapter stands on its own. Read them in order, though, and you'll see the full arc: we start with Python 3D fundamentals and end with spatial AI agents deployed in real companies.
Introduction
Spatial AI sits on one of the most active frontiers of modern computing. It lives at the intersection of computer vision, computational geometry, deep learning, and natural language processing, and it aims at one thing: giving machines a deep understanding of the physical world in three dimensions.
The field of 3D Data Science has organized itself around a handful of technical pillars. Processing LiDAR and photogrammetric point clouds. Building Information Modeling (BIM). Neural rendering with NeRF and 3D Gaussian Splatting. And more recently, AI agents that can reason about 3D scenes in plain language. Every one of these pillars gets its own chapter in this book.
This compilation isn't a random pile of articles. It follows the order of the training program, from the basics of geometric processing all the way up to AI agent architectures running in production. Every chapter has runnable code, in-depth theory, and practical advice taken from real projects. The goal is simple: a complete, coherent path you can apply directly to your own work.
Part I. 3D Python Fundamentals
Modules 1-2: environment, geometric processing, formats, segmentation, and LiDAR big data
Before getting into 3D data processing, you need solid foundations. This first part walks you from installing Anaconda to handling massive point clouds, with stops along the way for file formats and the core segmentation algorithms.
Chapter 1. Setting Up a Python Environment for 3D
1. What Is a Python Environment?
A Python environment is an isolated workspace built from three core pieces: a Python interpreter (the engine that runs your code), a set of packages (the libraries installed), and a dependency tree (the version relationships between those libraries). Getting this concept right is the first step toward working calmly in 3D Data Science.
Picture yourself working on two projects at the same time. The first one uses open3d 0.17 with numpy 1.24. The second needs open3d 0.18 with numpy 1.26. If both share the same system Python, installing what one needs will break the other. That's the classic "dependency hell", and it's everywhere in the Python scientific stack.
The clearest analogy is a toolbox. Every project has its own box, with its own tools in the right versions. You don't mix metric and imperial screwdrivers. Same idea here: don't mix the dependencies of two different 3D projects. Isolation is what gets you reproducibility. If your pipeline works today, it will still work in six months, on a different machine.
Python gives you three isolation levels. The system Python is the one your OS shipped with. Never install scientific packages there. Virtual environments (via venv) create a lightweight copy with its own pip, but they share the system interpreter. Then there are Conda environments, which give you the strongest isolation: they bundle their own Python interpreter, their own compiled C/C++ libraries, and they handle binary dependencies (like CUDA or OpenGL) that pip can't manage.
For 3D Data Science, I recommend Conda without a second thought. Libraries like Open3D, PyTorch, and PDAL rely on compiled binaries that need specific versions of CUDA, OpenGL, and libstdc++. Conda handles that mess for you. With venv + pip, you're on your own when the linker errors start.
2. Installing Anaconda and Creating the Environment
Installing Anaconda is the first concrete step on your path. Anaconda is a Python distribution built for scientific computing. It includes conda (the environment manager), a set of pre-installed packages, and the Anaconda Navigator graphical interface.
Head to anaconda.com/download and grab the version that matches your operating system (Windows, macOS, or Linux). Run the installer and accept the defaults. On Windows, only check "Add Anaconda to PATH" if you don't already have another Python distribution. Once it's done, open a terminal (or Anaconda Prompt on Windows) and check with conda --version. You should see something like conda 24.x.x.
Next step: create a dedicated environment. I pick Python 3.11 because in 2026 it offers the best balance between performance and compatibility. Python 3.12 and 3.13 are faster, sure, but some 3D libraries (Open3D's C++ bindings and PyTorch's CUDA extensions in particular) still aren't fully certified on those versions. Python 3.11 is the sweet spot: stable, fast, supported everywhere.
# Step 1: Verify the conda installation
conda --version
# Step 2: Create the environment dedicated to 3D Data Science
conda create -n spatial-ai python=3.11 -y
# Step 3: Activate the environment
conda activate spatial-ai
# Step 4: Verify the right Python is active
python --version # Should print Python 3.11.x
which python # Should point to ~/anaconda3/envs/spatial-ai/bin/python
After activation, your prompt shows the (spatial-ai) prefix. That's your visual confirmation that you're working inside the isolated environment. Any pip install or python command you run from now on only touches this environment. Your system Python stays untouched.
Miniconda over Anaconda. Miniconda installs only conda and Python, without Anaconda's 250+ pre-installed packages. You save 3 GB of disk space and install exactly what you need. On a workstation, Anaconda works fine thanks to Anaconda Navigator and the pre-configured scientific packages.
On Windows (Anaconda Prompt), just run
conda activate spatial-ai. On macOS/Linux with recent shells (bash 4+, zsh), the same command works once you've run conda init once. On older systems or in some CI scripts, you'll need source activate spatial-ai instead. If conda activate fails with "CommandNotFoundError", run conda init bash (or conda init zsh), then restart your terminal.
3. Installing 3D Packages with pip
pip is Python's standard package manager. It downloads and installs libraries from PyPI (the Python Package Index), the largest Python package repository in the world. A package is a reusable code library. For example, open3d gives you 3D processing tools, laspy lets you read LiDAR files, and numpy handles high-performance numerical computing.
We're going to install two layers of packages. The core stack covers the basics: reading data, geometric processing, and visualization. The extended stack adds Deep Learning, BIM/IFC formats, and raster processing. Always start with the core stack, then add the extensions when you need them.
One key detail: we install laspy[lazrs], not just laspy. The bracket syntax enables an "extra". Here, the lazrs backend, a LAZ decompressor written in Rust. Without it, reading .LAZ files (the standard compressed LiDAR format) is either impossible, or 10x slower through a pure Python fallback. That's the kind of detail that turns a 30-second pipeline into a 5-minute one.
# Core stack: 3D processing and visualization
pip install open3d laspy[lazrs] numpy scipy matplotlib
# Extended stack: Deep Learning, BIM, raster
pip install torch torchvision pyvista trimesh ifcopenshell pdal rasterio
# Quick check of installed versions
python -c "
import open3d; print(f'Open3D: {open3d.__version__}')
import laspy; print(f'Laspy: {laspy.__version__}')
import numpy; print(f'NumPy: {numpy.__version__}')
import torch; print(f'PyTorch:{torch.__version__}')
print(f'CUDA: {torch.cuda.is_available()}')
"
| Package | Role | Category |
|---|---|---|
open3d | Processing and visualization of point clouds, meshes, 3D reconstruction | 3D core |
laspy[lazrs] | Reading/writing LAS/LAZ (LiDAR) files with fast Rust decompression | LiDAR I/O |
numpy | Numerical computing on N-dimensional arrays, the base of the whole stack | Compute |
scipy | Scientific algorithms: interpolation, optimization, KDTree, linear algebra | Compute |
matplotlib | 2D plots: histograms, scatter plots, learning curves | Visualization |
torch | Deep Learning framework with GPU/CUDA support for PointNet++, Point Transformers | AI |
pyvista | Interactive 3D visualization built on VTK, great for meshes and volumes | Visualization |
trimesh | Triangle mesh manipulation, volume computation, STL/OBJ export | Meshes |
ifcopenshell | Reading/writing IFC files (Building Information Modeling) | BIM |
pdal | Point data processing pipelines (filtering, reprojection, tiling) | Pipeline |
rasterio | Reading/writing geospatial rasters (GeoTIFF, DEM, orthoimages) | Raster |
conda install and pip install for the same package. If you install numpy via conda and then another version via pip, you'll get silent conflicts. The rule: use conda for the environment and Python, then pip for every package. If a package is only available through conda-forge (like pdal), install it first, before any pip install.
Now that the packages are installed, how do we make sure this setup is reproducible on another machine, or six months from now when the versions have moved on?
4. Managing Dependencies with requirements.txt
Reproducibility is non-negotiable in Data Science. A pipeline that runs on your machine but fails on a colleague's is a useless pipeline. The requirements.txt file is the Python ecosystem's standard answer: it lists every package and its exact version, so you can rebuild an identical environment with a single command.
The pip freeze command captures the full state of your environment: every installed package with its exact version number. The output is already in requirements.txt format. The catch: pip freeze also pulls in all transitive dependencies (the dependencies of your dependencies), which produces a long file that's painful to maintain.
The good practice is to keep a "human-readable" requirements.txt with only your direct dependencies and loose version constraints. The open3d>=0.18,<0.19 format means: "install Open3D 0.18.x, any patch is fine". You get bug fixes without the risk of a major update breaking your code.
# ===== Core 3D Data Science Stack =====
open3d>=0.18,<0.19
laspy[lazrs]>=2.5,<3.0
numpy>=1.26,<2.0
scipy>=1.12,<2.0
matplotlib>=3.8,<4.0
# ===== Deep Learning =====
torch>=2.3,<3.0
torchvision>=0.18,<1.0
# ===== Visualization & Meshes =====
pyvista>=0.43,<1.0
trimesh>=4.0,<5.0
# ===== Geospatial Formats =====
rasterio>=1.3,<2.0
# ===== Development Tools =====
jupyterlab>=4.0,<5.0
ipykernel>=6.0,<7.0
# Capture the exact state of the environment (all dependencies)
pip freeze > requirements-lock.txt
# Reproduce the environment on another machine
conda create -n spatial-ai python=3.11 -y
conda activate spatial-ai
pip install -r requirements.txt
For advanced Conda users, the environment.yml file is an alternative that captures both conda and pip packages. You generate it with conda env export > environment.yml and rebuild it with conda env create -f environment.yml. The upside is that it also includes binary dependencies (CUDA, MKL, etc.) that requirements.txt doesn't cover.
requirements.txt with your direct dependencies and loose constraints, and a requirements-lock.txt generated by pip freeze for exact reproducibility. Commit both to Git. The first one is for understanding the project, the second one for reproducing it bit-for-bit.
For maximum reproducibility, wrap your 3D environment in a
Docker container. Start from a base image like nvidia/cuda:12.4-runtime-ubuntu22.04, add Miniconda, and your requirements.txt. Every team member, every CI server, and every GPU compute machine will run the exact same environment. Same CUDA version, same libstdc++, same Python. Pre-built images like pytorch/pytorch or jupyter/scipy-notebook speed things up even more. Pair Docker with docker-compose to orchestrate a full pipeline: one container for LiDAR processing, one for GPU training, one for the 3D web server.
5. Automation and IDE Integration
A well-installed environment is useless if it isn't wired into your development tools. This section covers integration with VS Code, JupyterLab, and writing automation scripts so anyone on your team can get going with a single command.
In VS Code, open the command palette (Ctrl+Shift+P) and type "Python: Select Interpreter". Pick the interpreter that matches your spatial-ai environment. VS Code will then activate that environment in every integrated terminal automatically. Check the bottom bar shows spatial-ai (Python 3.11.x). This step matters: without it, autocomplete, linting, and debugging will all use the wrong Python.
For JupyterLab, you need to register your environment as an available "kernel". Two commands do it: install ipykernel in the environment, then register it with a clear name. Once that's done, you'll see "spatial-ai" in the kernel list when you create a new notebook.
The final step is full automation. The script below creates the environment, installs the packages, configures Jupyter, and verifies everything. Drop it in your Git repo so any new team member can get started by running one command.
#!/bin/bash
# ==============================================
# Automatic install script: spatial-ai
# Usage: bash setup_spatial_ai.sh
# ==============================================
ENV_NAME="spatial-ai"
PYTHON_VERSION="3.11"
echo "[1/5] Creating the conda environment..."
conda create -n $ENV_NAME python=$PYTHON_VERSION -y
echo "[2/5] Activating..."
conda activate $ENV_NAME
echo "[3/5] Installing packages..."
pip install open3d laspy[lazrs] numpy scipy matplotlib
pip install torch torchvision pyvista trimesh
pip install jupyterlab ipykernel
echo "[4/5] Registering the Jupyter kernel..."
python -m ipykernel install --user --name $ENV_NAME --display-name "Python 3.11 (spatial-ai)"
echo "[5/5] Verifying..."
python -c "
import open3d; print(f' Open3D: {open3d.__version__}')
import laspy; print(f' Laspy: {laspy.__version__}')
import numpy; print(f' NumPy: {numpy.__version__}')
import torch; print(f' PyTorch: {torch.__version__}')
print(' Installation completed successfully!')
"
When you work with an AI assistant like Claude or GitHub Copilot, always start by telling it your environment. Say: "I'm working with Python 3.11, Open3D 0.18, PyTorch 2.3 on CUDA 12.4". The assistant will then generate code compatible with your specific versions instead of guessing or suggesting code for outdated ones.
python -c "import open3d; print(open3d.__version__)". If that command fails, your environment isn't configured properly, and nothing else will work. Bake these checks into your CI/CD scripts.
Install
k3d (pip install k3d) to visualize point clouds and meshes directly inside your Jupyter notebooks, with rotation, zoom, and interactive colors. For more advanced rendering, pyvista ships a Jupyter backend via pyvista.set_jupyter_backend('trame') that gives you full VTK rendering in the browser. These extensions turn JupyterLab into a real 3D visualization studio, so you don't have to keep flipping between the IDE and an external viewer like CloudCompare.
Your environment is ready and automated. But what do you do when things don't work as planned? The next section covers the most common problems and how to fix them.
6. Diagnosing and Fixing Common Issues
Even with a careful install, you'll hit errors. The trick isn't avoiding them (you can't), it's diagnosing them fast. This section is your survival guide: the most frequent errors, what causes them, and the commands that will save you.
The single most common error in scientific Python is ModuleNotFoundError. 90% of the time, it doesn't mean the package isn't installed. It means you haven't activated the right environment. Your terminal is using the system Python instead of spatial-ai. The first thing to check is always which python (or where python on Windows). If the path doesn't contain envs/spatial-ai, you're in the wrong environment.
The second classic trap is PyTorch and CUDA. You install PyTorch, then torch.cuda.is_available() returns False. The cause: you installed the CPU build of PyTorch, or your CUDA version doesn't match. PyTorch needs a specific CUDA toolkit version installed on your system. Check the driver version with nvidia-smi, then install PyTorch from pytorch.org with the matching index.
The third frequent issue is Open3D on a headless server. Open3D visualization requires an OpenGL context. On a server with no screen (SSH, Docker, cloud), any call to o3d.visualization.draw_geometries() will fail. The fix: use offscreen mode with o3d.visualization.rendering.OffscreenRenderer, or export your results to files (.ply, .png) instead of displaying them.
# Full environment diagnostic
conda info --envs # List all environments
which python # Which Python is active?
python --version # Which version?
pip list # All installed packages
pip list --outdated # Packages with updates available
# Verify CUDA for PyTorch
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"
# Nuclear option: drop and rebuild
conda deactivate
conda remove -n spatial-ai --all -y
conda create -n spatial-ai python=3.11 -y
conda activate spatial-ai
pip install -r requirements.txt
| Error | Likely Cause | Fix |
|---|---|---|
ModuleNotFoundError: No module named 'open3d' | Wrong environment active, or package not installed | conda activate spatial-ai && pip install open3d |
torch.cuda.is_available() = False | CPU build of PyTorch installed, or CUDA version mismatch | Reinstall from pytorch.org with the right CUDA index |
GLFWError: No OpenGL context | Headless server, no graphics GPU | Use OffscreenRenderer or export_image() |
ERROR: Could not build wheels for lazrs | Rust compiler missing (required by lazrs) | Install Rust via rustup.rs, or use pip install laspy[laszip] |
EnvironmentError: Conda prefix already exists | Environment partially created from a previous attempt | conda remove -n spatial-ai --all -y, then recreate |
pip install very slow or timing out | Slow network, proxy, or saturated PyPI index | Use pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple/ (mirror) |
sudo pip install. That command installs packages into the system Python with root privileges, which can break critical system tools (like apt on Ubuntu). Always use a Conda environment or venv, which need no admin rights.
If nothing works after several attempts, go nuclear: drop the environment completely with conda remove -n spatial-ai --all -y, then rebuild it from your requirements.txt. That's the fundamental benefit of isolation. Tearing down and rebuilding takes a few minutes, and it never risks damaging your system.
Chapter 2. Master Guide: Point Cloud Processing with Python
1. Orchestrating the Production Environment
Deploying a 3D Data Science pipeline starts with strict dependency isolation. We use Anaconda to create a dedicated Python environment, and pip to install the specialized 3D processing packages.
# Create an environment tuned for 3D computing
conda create -n spatial-ai python=3.11 -y
conda activate spatial-ai
pip install open3d laspy[lazrs] numpy torture-test-3d
Pay close attention to the lazrs extension. Without this Rust backend, decompressing .LAZ files becomes a major bottleneck and caps your IOPS (Input/Output Operations Per Second).
python >= 3.10 to get the type-hinting optimizations and the GIL (Global Interpreter Lock) performance improvements when you call into C++-compiled libraries.
2. I/O Optimization: The Offset Syndrome
Reading massive LiDAR files (several GB) calls for careful memory management. The LAS/LAZ format often stores coordinates with offsets and scales to save bytes by encoding them as int32 instead of float64.
import laspy
import numpy as np
def stream_safe_read(path):
las = laspy.read(path)
# Direct access to scaled coordinates (float64)
points = np.vstack((las.x, las.y, las.z)).transpose()
return points, las.header
Reading the header matters a lot. Ignoring the offsets during geometric operations can introduce catastrophic float-precision errors if the project origin sits millions of meters away (the WGS84 system).
.LAS file larger than 10GB into RAM in one shot. Use tiling or a streaming iterator.
laspy.read(path, laz_backend=laspy.LazBackend.Laszip) and iterate by chunks with laspy.open() in streaming mode.
3. Voxelization: The Survival Grid
Voxel Downsampling isn't random point removal. It's spatial discretization where each voxel (3D cube) gets replaced by the centroid of the points inside it. That gives you a uniform spatial distribution, which is vital for Deep Learning algorithms like PointNet++.
import open3d as o3d
def optimize_pcd(pcd, voxel_size=0.05):
# Smart 5cm downsampling
pcd_down = pcd.voxel_down_sample(voxel_size=voxel_size)
return pcd_down
For an urban Digital Twin, a voxel_size of 0.10 (10 cm) is often the optimal balance between visual fidelity and compute performance (a 90% reduction in point count).
voxel_size of 0.01 m (1 cm) is ideal for indoor scans, 0.05 m for building facades, and 0.5 m for aerial surveys. Rule of thumb: the voxel should be at least 3x larger than the sensor's measurement noise.
| Voxel Size (m) | Points (Millions) | Memory (MB) | Latency (ms) |
|---|---|---|---|
| Raw | 50.0 | 1200 | N/A |
| 0.01 | 22.5 | 540 | 1250 |
| 0.05 | 4.8 | 115 | 320 |
| 0.10 | 1.2 | 28 | 85 |
4. Normal Estimation: The Direction of the Surface
Normals are unit vectors perpendicular to the local tangent plane. Without them, no PBR rendering, no curvature-based partitioning. A KDTree is mandatory here to find nearest neighbors in logarithmic time O(log n).
# Hybrid search: 10cm radius, max 30 neighbors
search_param = o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30)
pcd.estimate_normals(search_param=search_param)
# Consistent upward orientation
pcd.orient_normals_to_align_with_direction(direction=[0, 0, 1])
Computing the normal relies on Principal Component Analysis (PCA) of the local covariance matrix. The eigenvector tied to the smallest eigenvalue gives the normal direction.
pcd.estimate_normals on GPU via the o3d.t.geometry module for a 15x speedup on CUDA.
5. RANSAC: The Art of Robust Consensus
The RANSAC algorithm (Random Sample Consensus) is the bulldozer of 3D processing. It iterates to find the mathematical model (here a plane) that maximizes the number of inliers (matching points) despite environmental noise.
plane_model, inliers = pcd.segment_plane(
distance_threshold=0.02,
ransac_n=3,
num_iterations=1000
)
# Geometric extraction
ground = pcd.select_by_index(inliers)
objects = pcd.select_by_index(inliers, invert=True)
A distance_threshold of 0.02 means any point within 2 cm of the mathematical plane counts as part of the ground.
6. DBSCAN Clustering: Density as Identity
Unlike K-Means, DBSCAN doesn't need to know the number of clusters up front. It groups points by density. It's the perfect tool to isolate cars, pedestrians, or street furniture in a street LiDAR scan.
with o3d.utility.VerbosityContextManager(o3d.utility.VerbosityLevel.Debug) as cm:
labels = np.array(objects.cluster_dbscan(eps=0.25, min_points=10))
max_label = labels.max()
print(f"Point cloud has {max_label + 1} clusters")
voxel_down_sample first.
7. Bounding Boxes: Metric Containment
You extract dimensions with two box types: AABB (Axis-Aligned) and OBB (Oriented). The OBB wins because it aligns with the object's real orientation, giving you exact length and width.
obb = objects.get_oriented_bounding_box()
obb.color = (1, 0, 0) # Red for visualization
print(f"Dimensions: {obb.extent} m")
| Box Type | Computation | Metric Accuracy | Usage |
|---|---|---|---|
| AABB | Simple Min-Max | Low | Rough collision |
| OBB | PCA-based | High | Scan-to-BIM |
8. From Open3D to PyTorch: The Tensor Bridge
In 2026, inference rarely runs on raw CPU. We have to convert our Open3D objects into Tensors to feed them into a Point Transformers network.
import torch
def pcd_to_tensor(pcd):
points = np.asarray(pcd.points)
tensor = torch.from_numpy(points).float()
return tensor.cuda() # Direct send to GPU
9. Summary and Compute Architectures
| Stage | Complexity | Recommended Hardware |
|---|---|---|
| I/O (Lazrs) | O(n) | NVMe SSD + multicore |
| Voxel Down | O(n) | RAM (high frequency) |
| Normals/KDTree | O(n log n) | CPU multithreading |
| RANSAC/DBSCAN | O(n^2) brutal | GPU (o3d.t) |
Master these tools and you hold the keys to tomorrow's cities.
Chapter 3. 3D Data Formats: From Point Cloud to Web
1. Panorama of 3D Formats
Why are there so many 3D formats? The answer fits in three words: history, domains, trade-offs. Each community (geospatial, film, BIM, video games, web) built its own standards, tuned for its specific constraints around precision, performance, and metadata.
You can sort formats into broad families. Point clouds favour raw geometric fidelity (LAS/LAZ, PLY, E57). Meshes encode triangulated surfaces (OBJ, STL, FBX). BIM formats carry rich semantics (IFC). And web formats optimise transmission and real-time rendering (glTF/GLB, 3D Tiles).
The classic beginner trap is believing a single format can do it all. In reality, a professional pipeline chains several formats depending on the stage: acquisition in LAS, processing in PLY or NumPy, delivery in glTF. Understanding these transitions is essential if you want to avoid silent data loss.
| Format | Type | Binary/Text | Compression | Metadata | Typical use case |
|---|---|---|---|---|---|
| LAS 1.4 | Point cloud | Binary | No (LAZ = yes) | CRS, classification | Aerial/terrestrial LiDAR surveys |
| PLY | Cloud / Mesh | Both | No | Custom properties | Research, prototyping |
| E57 | Point cloud | Binary (XML header) | Yes (internal) | Images, scanner poses | Static scanning, BIM scan-to-model |
| OBJ | Mesh | Text | No | Materials (.mtl) | Simple mesh exchange |
| STL | Mesh | Both | No | None | 3D printing |
| FBX | Full scene | Binary | No | Animations, skeletons | Film, video games |
| IFC | BIM | Text (STEP) | No | Full semantics | Construction, architecture |
| glTF/GLB | 3D scene | JSON+Bin / Binary | Draco (optional) | PBR materials, animations | Web 3D, visualisation |
| 3D Tiles | Hierarchical tiling | Binary | Draco, gzip | Bounding volumes | Cesium, urban Digital Twins |
2. LAS/LAZ: The LiDAR Standard
The LAS format (LASer) is the de facto standard for LiDAR data. Developed by the ASPRS (American Society for Photogrammetry and Remote Sensing), it encodes each point with its coordinates, intensity, classification, and laser-return metadata.
A LAS file breaks down into three blocks. The header holds the global information: coordinate system (CRS via VLRs), offsets, scales, point count, and bounding box. Variable Length Records (VLR) store the extended metadata. And Point Data Records encode each point according to a numbered format (PDRF 0 to 10).
The LAZ compression, invented by Martin Isenburg (LASzip), applies lossless compression that shrinks size by 5 to 10 times. The principle: encode deltas between consecutive points rather than absolute values. It's become a de facto standard, natively supported by PDAL, CloudCompare, and laspy.
The LAS 1.4 version brings major improvements over 1.2: 64-bit point counters (beyond 4.3 billion points), extended point formats (PDRF 6-10) with 8-bit classification (256 classes instead of 32), and native support for full-waveform data.
import laspy
import numpy as np
def inspect_las_header(filepath):
"""Inspect the header and point format of a LAS/LAZ file."""
las = laspy.read(filepath)
h = las.header
print(f"Version : LAS {h.version}")
print(f"Point Format : PDRF {las.point_format.id}")
print(f"Point Count : {h.point_count:,}")
print(f"Scales : {h.scales}")
print(f"Offsets : {h.offsets}")
print(f"Bounding Box : {h.mins} -> {h.maxs}")
# List the dimensions available in this PDRF
print(f"\nAvailable dimensions:")
for dim in las.point_format.dimensions:
print(f" - {dim.name} ({dim.dtype})")
return las
LAS files store coordinates as 32-bit integers with
scales and offsets that preserve millimetre precision even for 6-digit geographic coordinates (e.g. X=350000.001 m). When you convert to PLY or NumPy in float32, you lose that precision: a float32 only offers ~7 significant digits, which translates to ~5 cm of error at these magnitudes. Always use float64 (double precision), or subtract the offset before conversion: points = np.vstack((las.x - las.header.offsets[0], ...)). This step is critical for BIM and high-precision topography.
3. PLY: Universal Simplicity
The PLY format (Polygon File Format, also called Stanford Triangle Format) is one of the oldest and simplest formats in the 3D ecosystem. Its plain-text header describes the data structure, followed by the payload in ASCII or binary.
Why does PLY stay popular in research? Three reasons: its simplicity (a human-readable header), its flexibility (you can add arbitrary properties: curvature, confidence, semantic label), and its speed in binary mode (no text parsing, direct sequential reads). Open3D, trimesh, CloudCompare, and nearly every academic tool support it natively.
The flip side: PLY has no notion of CRS (coordinate system), no standardised compression, and no hierarchical structure. It's perfect for prototyping and research datasets, but unsuitable for large-scale geospatial data delivery.
import laspy
import open3d as o3d
import numpy as np
def las_to_ply_with_colors(las_path, ply_path):
"""Convert a LAS to PLY while preserving RGB colors."""
las = laspy.read(las_path)
# Extract coordinates
points = np.vstack((las.x, las.y, las.z)).T
# Build the Open3D cloud
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
# Add colors (16-bit normalization -> [0,1])
if hasattr(las, 'red'):
colors = np.vstack((las.red, las.green, las.blue)).T
colors = colors / 65535.0 # LAS stores as uint16
pcd.colors = o3d.utility.Vector3dVector(colors)
o3d.io.write_point_cloud(ply_path, pcd, write_ascii=False)
print(f"Exported {len(points):,} points to {ply_path}")
return pcd
By default, Open3D exports in binary (
write_ascii=False). Never change that default without a reason. A 10-million-point PLY file weighs ~230 MB in ASCII versus ~115 MB in binary, and binary reads are 10 to 15 times faster because they skip line-by-line text parsing. Reserve ASCII mode for debugging (inspecting values by eye) or interop with legacy tools. In production, binary is always the right call.
4. glTF/GLB: The Web 3D Standard
The glTF format (GL Transmission Format), developed by the Khronos Group, is often called the "JPEG of 3D". Its goal: deliver a compact, efficient, universal format for transmitting and rendering 3D scenes on the web and in real-time applications.
A glTF file is made of a JSON describing the scene (nodes, meshes, PBR materials, textures, animations) and binary buffers holding the geometry data. The GLB variant packs everything into a single binary file, eliminating multiple HTTP requests and speeding up loading.
The hierarchical structure of glTF organises into scenes containing nodes, which in turn carry meshes with materials. That architecture maps naturally to the scene graph used by Three.js, Babylon.js, and Cesium, which is why it's been adopted so widely across the web 3D ecosystem.
A notable fact for Gaussian Splatting: .splat and .ply exports from 3DGS reconstructions can be converted to custom glTF extensions, allowing direct in-browser visualisation through specialised viewers. That opens the door to mainstream distribution of photorealistic reconstructions.
import trimesh
def mesh_to_glb(input_path, output_path):
"""Export a mesh to GLB (binary glTF)."""
mesh = trimesh.load(input_path)
# trimesh automatically handles material conversion
# to the glTF PBR (Physically Based Rendering) model
mesh.export(output_path, file_type='glb')
print(f"GLB exported: {output_path}")
print(f" Vertices : {len(mesh.vertices):,}")
print(f" Faces : {len(mesh.faces):,}")
# Example: OBJ to GLB
mesh_to_glb("model.obj", "model.glb")
Draco compression (Google) can cut a glTF's size by 80 to 95% by compressing the geometry buffers. Enable it via trimesh.exchange.gltf.export_glb(mesh, include_normals=True) or use the CLI tool gltf-pipeline -i input.glb -o output.glb -d (the -d flag turns on Draco).
Draco compression has become a must for web deployment of 3D models. On a typical urban mesh (500k triangles, normals + UVs + colors), Draco cuts size by 90%+: a 45 MB GLB drops to 3-4 MB. Client-side decoding is near-instant thanks to the WebAssembly decoders shipped by Google. For massive point clouds, pair Draco with 3D Tiles and Cesium's hierarchical tiling: the viewer only loads the tiles visible at screen resolution, letting you navigate billions of points with sub-second latency. Also look at meshoptimizer (the EXT_meshopt_compression extension) as an alternative to Draco, sometimes faster at decoding.
5. Conversions and Interoperability
In a production pipeline, data inevitably moves through several formats. Conversion is never innocuous: each step can cause information loss (CRS, attributes, precision). Mastering the tools and anticipating the traps is a key skill for any 3D practitioner.
The most common conversion pipelines follow a logic of progressive specialisation. For web delivery: LAS → PLY → glTF. For BIM: IFC → OBJ → glTF. For deep learning: LAS → PLY or NumPy. Each transition has its rules and its preferred tools.
Python offers a rich ecosystem for automating these conversions. open3d excels for point clouds and simple meshes, trimesh for textured meshes and glTF exports, and pyvista for volumetric meshes and scientific visualisation. For heavy geospatial pipelines, PDAL remains the reference.
import open3d as o3d
import trimesh
from pathlib import Path
def convert_3d(input_path, output_path):
"""Universal converter: detect the type and route to the right engine."""
src = Path(input_path).suffix.lower()
dst = Path(output_path).suffix.lower()
# Point cloud: Open3D
if src in ('.ply', '.pcd', '.xyz'):
pcd = o3d.io.read_point_cloud(input_path)
o3d.io.write_point_cloud(output_path, pcd)
# Mesh to web: trimesh
elif dst in ('.glb', '.gltf'):
mesh = trimesh.load(input_path)
mesh.export(output_path)
# Generic mesh: Open3D
else:
mesh = o3d.io.read_triangle_mesh(input_path)
o3d.io.write_triangle_mesh(output_path, mesh)
print(f"Conversion done: {src} -> {dst}")
CRS (coordinate reference system). PLY, OBJ, and glTF have no field to store the projection. Always export the CRS separately (a .prj or .json file) and reapply it after conversion with pyproj or PDAL.
| Conversion | Recommended tool | Preserves CRS | Preserves colors | Notes |
|---|---|---|---|---|
| LAS → PLY | laspy + Open3D | No | Yes | Normalise 16-bit RGB → float |
| LAS → NumPy | laspy | Via header | Yes | Ideal for ML pipelines |
| PLY → glTF | trimesh | No | Yes (vertex colors) | Needs a mesh or splat |
| OBJ → GLB | trimesh | No | Yes (via .mtl) | Materials converted to PBR |
| IFC → OBJ | IfcOpenShell | Partial | No | BIM semantics lost |
| LAS → 3D Tiles | PDAL + py3dtiles | Yes | Yes | Full geospatial pipeline |
Chapter 4. Advanced Segmentation: The RANSAC vs DBSCAN Duel
1. The Philosophy of 3D Segmentation
Segmentation is the cornerstone of scene understanding. It splits into three families: Semantic Segmentation (label per class), Instance Segmentation (label per object), and Primitive Segmentation (geometry).
2. RANSAC: Statistical Resilience
RANSAC starts from a simple premise: your data contains clean geometric structures polluted by noise. By randomly picking a minimal subset of points (3 for a plane), it builds a candidate model and tests it against the whole dataset.
import open3d as o3d
def extract_dominant_plane(pcd, threshold=0.02):
model, inliers = pcd.segment_plane(
distance_threshold=threshold,
ransac_n=3,
num_iterations=1000
)
[a, b, c, d] = model
# Equation: ax + by + cz + d = 0
print(f"Plane Equation: {a:.2f}x + {b:.2f}y + ... = 0")
return pcd.select_by_index(inliers), pcd.select_by_index(inliers, invert=True)
distance_threshold to your sensor noise. For a Leica RTC360 TLS (~1.9 mm noise at 10 m), use 0.005 m. For a DJI L2 drone LiDAR (~3 cm noise), use 0.05 m. Too low and you fragment detected planes, too high and you merge adjacent ones.
3. DBSCAN: The Density Archipelago
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is my go-to for non-parametric clustering. You don't need to define the number of clusters K up front.
import matplotlib.pyplot as plt
# Cluster isolation
labels = np.array(objects.cluster_dbscan(
eps=0.30, # Neighborhood radius (30cm)
min_points=15, # Minimum density
print_progress=True
))
max_label = labels.max()
print(f"Found {max_label + 1} individual objects")
K-Distance Graph to find the optimal EPS. Look for the "knee point" in the curve of distances to the k-th neighbor.
4. Head-to-Head: The Match
Picking the right algorithm means knowing each one's strengths and weaknesses. This table sums up the core differences between RANSAC and DBSCAN.
| Criterion | RANSAC | DBSCAN |
|---|---|---|
| Nature | Geometric Model | Spatial Density |
| Shapes | Planes, Spheres, Cylinders | Arbitrary Shapes |
| Parameters | Threshold, Iterations | Epsilon, Min Points |
| Complexity | O(k * N) | O(N log N) with index |
| Noise Resilience | Excellent (outliers ignored) | Good (labels noise as -1) |
5. Applied: Automatic Cleaning Pipeline
Chaining both algorithms in a sequential pipeline gives you a full cleaning flow: noise removal, ground extraction, object clustering.
def auto_clean(pcd):
# 1. Statistical Outlier Removal
cl, ind = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
clean_pcd = pcd.select_by_index(ind)
# 2. Ground Removal
_, inliers = clean_pcd.segment_plane(distance_threshold=0.05, ransac_n=3, num_iterations=1000)
objects = clean_pcd.select_by_index(inliers, invert=True)
# 3. Object Clustering
labels = objects.cluster_dbscan(eps=0.25, min_points=10)
return objects, labels
This 12-line pipeline handles 80% of the preprocessing work for most modern machine perception tasks.
Chapter 5. Big Data LiDAR: Processing at Terabit Scale
1. The Memory Bottleneck
A standard LiDAR point (X, Y, Z, intensity, R, G, B, GPS time) weighs in around 32 to 64 bytes uncompressed. A billion points adds up to roughly 50 GB of raw RAM.
[[x,y,z], ...] eats up to 5x more memory than a contiguous NumPy array.
2. Strategy 1: Chunking with Laspy
Chunking means reading the file in fixed-size batches, never loading the whole thing into memory. Laspy ships a native iterator that makes this trivial to wire up.
import laspy
import numpy as np
def process_massive_file(filepath, chunk_size=1_000_000):
with laspy.open(filepath) as f:
# Iterate over points in batches of 1 million
for chunk in f.chunk_iterator(chunk_size):
points = np.column_stack((chunk.x, chunk.y, chunk.z))
# Local processing (e.g. height filter)
mask = points[:, 2] > 50.0
save_result(points[mask])
3. Strategy 2: Spatial Tiling
PDAL (Point Data Abstraction Library) is the king of tile generation, via its CLI or Python binding.
[
"input.laz",
{
"type": "filters.splitter",
"length": 100.0,
"buffer": 2.0
},
{
"type": "writers.las",
"filename": "tile_#.laz"
}
]
4. Strategy 3: Multiprocessing (Parallelism)
Once data is cut into independent tiles, parallel processing puts every available CPU core to work and slashes compute time by a nearly linear factor.
from multiprocessing import Pool, cpu_count
def worker_task(tile_path):
# Isolated process: load, process, save
pcd = load(tile_path)
result = heavy_algorithm(pcd)
return result.status
if __name__ == "__main__":
files = glob.glob("tiles/*.laz")
# Use every available core
with Pool(processes=cpu_count()) as pool:
results = pool.map(worker_task, files)
psutil.virtual_memory() and cap max_workers accordingly.
5. Memory Mapping (Memmap)
Memory mapping lets the OS treat a file on disk as if it were in RAM, without explicitly loading it. It's the sweet spot between access speed and capacity.
| Method | Access Speed | Max Capacity | Complexity |
|---|---|---|---|
| RAM (Standard) | Extreme (~50 GB/s) | Capped by physical RAM | Low |
| HDD (Seek/Read) | Slow (~150 MB/s) | Unlimited | Medium |
| Memmap (NVMe) | High (~3-7 GB/s) | Capped by disk | Medium |
Summary: The Ideal Architecture
- Split: PDAL to cut into 100m2 tiles with buffer.
- Process: Python multiprocessing script firing one worker per tile.
- Merge: Merge lightweight results (vectors, metadata) into a spatial database (PostGIS).
You've got the fundamentals down now. The Python environment is set up, you know how to load, manipulate, and segment point clouds, you understand the 3D formats, and you can process massive datasets. What's next? Giving your data meaning. Part II dives into 3D semantics: segmentation, classification, and object detection, the three pillars of automated scene understanding.
Part II. 3D Semantics
Module 2: Segmentation, classification, and object detection in point clouds
With the fundamentals in hand, it's time to give your data meaning. This part covers the three core operations of 3D semantic analysis: cutting space into coherent regions (segmentation), assigning an identity to each element (classification), and precisely locating objects of interest (detection). These three skills form the bedrock of any automated 3D scene understanding pipeline.
Chapter 6. 3D Segmentation: Extracting Meaning from Point Clouds
1. What Is 3D Segmentation?
3D segmentation partitions a point cloud into homogeneous, meaningful groups. Each point gets assigned to a segment, turning a shapeless mass of coordinates into a semantic map that downstream algorithms can act on.
There are three main families. Geometric segmentation groups points by shape similarity (planes, cylinders, spatial clusters). Semantic segmentation assigns a class (ground, building, vegetation, vehicle) to every individual point. Instance segmentation takes things further by distinguishing each object: not just "car", but "car_1", "car_2", "car_3".
Segmentation is the foundation for every downstream task: classification, detection, BIM reconstruction, digital twin. Without prior partitioning, no 3D processing pipeline runs efficiently. It's the base layer that gives raw data meaning.
Knowing each approach's strengths and limits lets you pick the right strategy for the context. An airborne LiDAR survey doesn't segment like an indoor scan, and real-time constraints force different trade-offs than post-processing does.
2. Geometric Segmentation: Regions and Surfaces
Geometric segmentation taps into the local spatial properties of points (normals, curvatures, distances) to group regions that share shape coherence. It's the oldest and most reliable approach, and it needs no prior training.
The Region Growing algorithm starts from seed points picked by minimum curvature. At each iteration, neighbors whose normal lines up closely enough with the current region's get absorbed. The process runs until no new point passes the angle criterion.
Euclidean clustering (a DBSCAN variant applied to 3D) groups points by pure spatial proximity. Two points belong to the same cluster if they're connected by a chain of neighbors closer than eps meters. This approach works especially well after a RANSAC ground extraction, to isolate the remaining individual objects.
import open3d as o3d
import numpy as np
# Load and prepare the cloud
pcd = o3d.io.read_point_cloud("scene.ply")
pcd.estimate_normals(search_param=o3d.geometry.KDTreeSearchParamHybrid(
radius=0.1, max_nn=30))
# Region growing segmentation based on normals
# Criterion: angle between normals < 25 degrees
normals = np.asarray(pcd.normals)
points = np.asarray(pcd.points)
tree = o3d.geometry.KDTreeFlann(pcd)
labels = -np.ones(len(points), dtype=int)
current_label = 0
angle_threshold = np.cos(np.radians(25))
for seed_idx in range(len(points)):
if labels[seed_idx] != -1:
continue
queue = [seed_idx]
labels[seed_idx] = current_label
while queue:
idx = queue.pop(0)
[_, nn_idx, _] = tree.search_radius_vector_3d(points[idx], 0.15)
for ni in nn_idx:
if labels[ni] == -1:
dot = np.abs(np.dot(normals[idx], normals[ni]))
if dot > angle_threshold:
labels[ni] = current_label
queue.append(ni)
current_label += 1
print(f"Regions detected: {current_label}")
radius = 3 * voxel_size as a starting point.
For Region Growing, the angle threshold between normals is the most critical parameter. Use
10-15 degrees for planar surfaces (walls, floors, roofs) to get clean segments. Bump it to 30 degrees or more for natural terrain or irregular surfaces (vegetation, rocks). When in doubt, start at 15 degrees and raise it until you hit acceptable over-segmentation rather than under-segmentation.
3. Semantic Segmentation: One Class per Point
The goal of semantic segmentation is to attach a class label to each point: ground, building, vegetation, vehicle, pedestrian. Unlike geometric segmentation (which groups by shape), semantic segmentation grasps the nature of each element in the scene.
The classical approach sits on a two-step pipeline: (1) compute local geometric features for each point (curvature, relative height, local density, covariance matrix eigenvalues), then (2) train a supervised classifier (Random Forest, SVM) on those features. This method is still very competitive for airborne LiDAR surveys, where geometric features are highly discriminative.
Deep learning has reshaped the field since PointNet (2017). Modern architectures like KPConv, RandLA-Net, and Point Transformers learn straight from raw coordinates, with no hand-engineered features. The core difference from 2D vision: there's no regular grid, density varies enormously, and spatial relations are in 3D.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
def compute_geometric_features(points, normals, k=20):
"""Compute geometric features for each point."""
from scipy.spatial import cKDTree
tree = cKDTree(points)
features = []
for i in range(len(points)):
_, idx = tree.query(points[i], k=k)
neighbors = points[idx]
cov = np.cov(neighbors.T)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]
e1, e2, e3 = eigenvalues
# Geometric features derived from eigenvalues
linearity = (e1 - e2) / (e1 + 1e-8)
planarity = (e2 - e3) / (e1 + 1e-8)
sphericity = e3 / (e1 + 1e-8)
height_rel = points[i, 2] - neighbors[:, 2].min()
features.append([linearity, planarity, sphericity, height_rel])
return np.array(features)
# Classification pipeline
features = compute_geometric_features(points, normals)
clf = RandomForestClassifier(n_estimators=200, max_depth=15, n_jobs=-1)
scores = cross_val_score(clf, features, labels, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
| Method | Typical mIoU | Speed | Data needed |
|---|---|---|---|
| Random Forest + features | 65-75% | Fast | A handful of annotated scans |
| PointNet++ | 70-80% | Moderate | Thousands of scenes |
| KPConv | 75-85% | Moderate | Thousands of scenes |
| RandLA-Net | 75-82% | Fast (GPU) | Thousands of scenes |
| Point Transformer v3 | 80-88% | Slow | Tens of thousands |
In urban LiDAR scenes, the "ground" class often makes up 60% or more of the points, swamping rare classes like "vehicle" or "pedestrian". If you train without correction, the model will predict "ground" everywhere and still hit high accuracy. Use a
weighted loss (weighted cross-entropy) or oversampling of minority classes. Compute weights inversely proportional to each class's frequency.
4. Instance Segmentation: Identifying Each Object
Semantic segmentation answers "what is this?". Instance segmentation answers "how many are there and where?". Every individual object gets a unique ID, which is a must for counting, measuring, and temporal tracking.
The most pragmatic approach pairs semantic segmentation with spatial clustering. First you segment points by semantic class, then you run DBSCAN on each class to split out individual instances. This top-down method is simple, reliable, and often plenty for industrial applications.
End-to-end approaches like 3D-BoNet and Mask3D predict instance masks directly, no intermediate step. They're more elegant but need huge volumes of annotated data and significant training time.
import numpy as np
from sklearn.cluster import DBSCAN
def instance_segmentation(points, semantic_labels, class_names):
"""Instance segmentation by post-semantic clustering."""
instance_labels = -np.ones(len(points), dtype=int)
current_instance = 0
for cls_id, cls_name in enumerate(class_names):
mask = semantic_labels == cls_id
if mask.sum() < 10:
continue
cls_points = points[mask]
# eps tuned to the class: larger for buildings
eps = 0.5 if cls_name == "building" else 0.25
clustering = DBSCAN(eps=eps, min_samples=10).fit(cls_points)
for label in set(clustering.labels_) - {-1}:
inst_mask = clustering.labels_ == label
indices = np.where(mask)[0][inst_mask]
instance_labels[indices] = current_instance
current_instance += 1
print(f"Instances detected: {current_instance}")
return instance_labels
# Usage
classes = ["ground", "building", "vegetation", "vehicle"]
instances = instance_segmentation(points, sem_labels, classes)
Panoptic segmentation is the next step: it unifies semantic and instance segmentation in a single framework. Architectures like
Panoptic-PolarNet and EfficientLPS produce class labels and instance IDs at the same time. The big win is consistency, no more reconciling two independent predictions. It's the dominant research direction in 3D scene understanding since 2023.
Panoptic segmentation combines semantics and instances in a unified framework. Each point gets both a semantic class and an instance ID. "Stuff" classes (ground, sky) have no instances, while "things" classes (car, pedestrian) do. It's the state of the art in 3D scene understanding.
5. Evaluating Segmentation
Measuring segmentation quality matters as much as the segmentation itself. Standard metrics let you compare methods objectively and catch pipeline weaknesses before production.
Per-class IoU (Intersection over Union) measures the overlap between prediction and ground truth for each category. mIoU (mean IoU) averages across all classes and is the reference metric in the community. Watch out though: a high mIoU can mask disastrous performance on rare classes if they get drowned out by the average.
Visual inspection in 3D is still a must. An 80% mIoU can mean object boundaries are systematically misclassified, which is critical for Scan-to-BIM applications where tolerance is on the order of a centimeter.
import numpy as np
def compute_iou(pred, gt, num_classes):
"""Compute per-class IoU and mIoU."""
ious = []
for cls in range(num_classes):
intersection = np.sum((pred == cls) & (gt == cls))
union = np.sum((pred == cls) | (gt == cls))
if union == 0:
ious.append(float('nan'))
else:
ious.append(intersection / union)
valid = [x for x in ious if not np.isnan(x)]
miou = np.mean(valid)
return ious, miou
# Usage example
class_names = ["ground", "building", "vegetation", "vehicle"]
ious, miou = compute_iou(predictions, ground_truth, len(class_names))
for name, iou in zip(class_names, ious):
print(f" {name:12} IoU = {iou:.3f}")
print(f" mIoU = {miou:.3f}")
Chapter 7. 3D Classification: Recognizing Objects in Space
1. Point Cloud Classification
3D classification assigns a single label to a point cloud or to a pre-segmented object. Unlike semantic segmentation (which works point by point), classification judges the object as a whole: is this cluster of points a chair, a table, a tree, or a vehicle?
Use cases are wide open: object recognition on benchmarks like ModelNet40 (40 CAD object categories), scene understanding for robotics, industrial quality control, and automatic sorting of elements in urban LiDAR surveys. In every scenario, the goal is the same: turn raw geometry into usable categorical information.
The split from segmentation matters. Segmentation assigns a class to every point of a full scene. Classification assigns a class to a whole object that's already been isolated. In practice, the two are often chained: you segment the scene into individual objects first, then classify each one.
2. Geometric Features for Classification
Before the deep learning era, classification relied on hand-crafted geometric descriptors. These features are still highly relevant in 2026, especially when annotated data is scarce or model interpretability is required.
Local features derive from the eigenvalues of the local covariance matrix. From a neighborhood of k points, you get three eigenvalues e1 >= e2 >= e3 that encode the spatial distribution of points. From those, you derive linearity (wire-like structure), planarity (surface structure), sphericity (volume structure), and anisotropy (degree of elongation).
Global features capture overall shape: bounding box dimensions, volume, estimated surface, point count, height/width ratio. Multi-scale features combine local descriptors at different radii (0.5m, 1m, 2m, 5m) to capture context at multiple levels of granularity.
import numpy as np
import open3d as o3d
from scipy.spatial import cKDTree
def extract_features(points, radii=[0.5, 1.0, 2.0]):
"""Full multi-scale feature extraction pipeline."""
tree = cKDTree(points)
all_features = []
for radius in radii:
features_r = []
for i in range(len(points)):
idx = tree.query_ball_point(points[i], radius)
if len(idx) < 3:
features_r.append([0] * 5)
continue
neighbors = points[idx]
cov = np.cov(neighbors.T)
evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
e1, e2, e3 = evals + 1e-10
linearity = (e1 - e2) / e1
planarity = (e2 - e3) / e1
sphericity = e3 / e1
anisotropy = (e1 - e3) / e1
density = len(idx) / (4/3 * np.pi * radius**3)
features_r.append([linearity, planarity, sphericity,
anisotropy, density])
all_features.append(np.array(features_r))
# Global features
bbox_dims = points.max(axis=0) - points.min(axis=0)
volume = np.prod(bbox_dims)
height = bbox_dims[2]
aspect_ratio = bbox_dims.max() / (bbox_dims.min() + 1e-8)
n_points = len(points)
global_feats = np.tile([volume, height, aspect_ratio, n_points],
(len(points), 1))
return np.hstack(all_features + [global_feats])
| Feature | Formula | What it captures |
|---|---|---|
| Linearity | (e1 - e2) / e1 | Wire-like structure (cables, poles) |
| Planarity | (e2 - e3) / e1 | Surface structure (walls, roofs) |
| Sphericity | e3 / e1 | Volume structure (bushes, rocks) |
| Anisotropy | (e1 - e3) / e1 | Overall elongation |
| Local density | k / V_sphere | Neighborhood compactness |
| Relative height | z - z_min_local | Vertical position in the scene |
Compute your geometric features at at least 3 different radii (e.g. 0.5m, 1.5m, and 5m). The same object shows very different signatures depending on the scale of observation. This multi-scale approach typically lifts accuracy by 5 to 15 points versus a single scale, for a linear extra compute cost.
3. Random Forest and Gradient Boosting for 3D
Tree-based methods are still among the best tools for point cloud classification, especially when features are well designed. Fast to train, interpretable, and tolerant of heterogeneous features, they're often the first choice in production.
Random Forest builds hundreds of independent decision trees and votes for the final class. XGBoost and LightGBM push the idea further with gradient boosting: each tree corrects the errors of the previous one, allowing finer convergence. On well-computed geometric features, these models match neural networks in performance, with training and inference times orders of magnitude smaller.
The big win is interpretability. Feature importance analysis reveals which geometric descriptors are most discriminative for each class. That info is gold for refining the feature pipeline and understanding the physics of the problem.
import xgboost as xgb
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np
# Feature extraction (see previous section)
X = extract_features(points, radii=[0.5, 1.0, 2.0, 5.0])
y = labels # Classes: 0=ground, 1=building, 2=vegetation, 3=vehicle
# XGBoost config tuned for 3D
clf = xgb.XGBClassifier(
n_estimators=500,
max_depth=8,
learning_rate=0.05,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softprob',
eval_metric='mlogloss',
n_jobs=-1,
tree_method='hist' # Fast for large volumes
)
# Stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_weighted')
print(f"F1 weighted: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Train the final model and analyze features
clf.fit(X, y)
importances = clf.feature_importances_
feature_names = [f"{f}_r{r}" for r in [0.5,1.0,2.0,5.0]
for f in ["lin","plan","sph","aniso","dens"]]
feature_names += ["volume", "height", "aspect", "n_pts"]
top_k = np.argsort(importances)[::-1][:5]
for i in top_k:
print(f" {feature_names[i]:15} importance = {importances[i]:.4f}")
height dominates every other feature, your model is probably over-dependent on altitude and will generalize poorly to scenes with different elevations. Diversify your features for a solid model.
Never split your train/test data randomly point by point. Neighboring points are strongly correlated spatially: if a training point sits 10 cm from a test point, the model "cheats" by memorizing local context. Use a
spatial split: reserve entire geographic zones for test (a whole street, an entire building). Otherwise your metrics get artificially inflated by 10 to 20 points.
After a first training run, use the feature importance ranking to drop features under 1% of the total. Fewer features means faster inference, lower overfitting risk, and often better generalization. On a 20-feature set, it's not rare to see the top 8 carry 95% of the information.
4. From Classification to Detection
Classification answers "what is it?". But in a full scene, the key question is "what is it, AND where is it?". This move from classification to detection is the bridge between identification and location.
The simplest approach walks the scene with a 3D sliding window (a fixed-size cube) and classifies the contents of each window. That brute method is far too costly in 3D: a 100m x 100m x 30m scene at 1m steps generates 300,000 evaluations. Smart sampling strategies (region proposals, 3D anchors) cut that cost drastically.
Modern detection combines classification and bounding box regression in a single pass. The next chapter lays out these architectures, from VoxelNet to 3DETR, the state of the art in 3D object detection.
semantic segmentation -> clustering -> classification. End-to-end methods (direct detection) score better on benchmarks, but the modular approach gives you finer error diagnosis and independent tuning at each stage.
Self-supervised pre-training techniques like
Point-MAE and Point-BERT learn rich 3D representations without manual annotation by reconstructing masked portions of the point cloud. After fine-tuning on a few hundred annotated examples, these models rival classifiers trained on thousands of samples. It's an especially promising avenue when annotated data is scarce or expensive to produce.
Chapter 8. 3D Object Detection: Locate and Identify
1. The 3D Detection Challenge
3D object detection goes beyond classification and segmentation. You have to pin down each object with an oriented 3D bounding box (OBB) defined by seven parameters: position (x, y, z), dimensions (length, width, height), and rotation angle theta.
It's a lot harder than 2D detection. In 2D a box has 4 parameters (x, y, w, h). In 3D you jump to at least 7, with rotations possible around three axes. The search space blows up combinatorially, and exhaustive approaches stop being practical.
The applications matter. Self-driving cars need to spot vehicles, pedestrians and cyclists in real time with centimeter accuracy. City planning relies on detection to inventory street furniture, trees and buildings from airborne LiDAR. Automated BIM uses detection to pick out doors, windows and mechanical equipment in indoor scans.
2. Voxelization-Based Approaches
Voxelization turns an irregular point cloud into a regular 3D grid, which lets you apply 3D convolutions the same way image networks apply 2D convolutions. Regularizing the format is the bridge between the discrete world of points and the structured world of tensors.
The VoxelNet architecture is a good example. The cloud gets discretized into voxels, each voxel is encoded by a small network (VFE, for Voxel Feature Encoding), and then sparse 3D convolutions extract hierarchical features. A detection head finally produces bounding box proposals.
Picking the voxel resolution is the make-or-break decision. Fine resolution (0.05m) captures detail but blows up the data volume. Coarse resolution (0.5m) runs fast but loses small objects. In practice, modern architectures use sparse 3D convolutions (through spconv or MinkowskiEngine) that only process non-empty voxels, which slashes memory cost.
import open3d as o3d
import numpy as np
def voxelize_scene(pcd, voxel_size=0.2):
"""Voxelize a cloud and extract per-voxel features."""
# Build the voxel grid
voxel_grid = o3d.geometry.VoxelGrid.create_from_point_cloud(
pcd, voxel_size=voxel_size)
points = np.asarray(pcd.points)
# Compute voxel indices for each point
min_bound = points.min(axis=0)
voxel_indices = ((points - min_bound) / voxel_size).astype(int)
# Aggregate features per voxel
unique_voxels = {}
for i, vi in enumerate(voxel_indices):
key = tuple(vi)
if key not in unique_voxels:
unique_voxels[key] = []
unique_voxels[key].append(points[i])
# Per-voxel features: centroid, point count, extent
voxel_features = {}
for key, pts in unique_voxels.items():
pts = np.array(pts)
centroid = pts.mean(axis=0)
n_pts = len(pts)
extent = pts.max(axis=0) - pts.min(axis=0)
voxel_features[key] = np.concatenate([centroid, [n_pts], extent])
print(f"Active voxels: {len(voxel_features)} / theoretical grid")
return voxel_features, voxel_grid
features, grid = voxelize_scene(pcd, voxel_size=0.2)
2000 x 2000 x 500 = 2 billion theoretical voxels. Even with sparse convolutions, the hash grid can saturate your RAM. Always run a spatial crop before voxelization.
Voxel size is the single most important parameter in voxel-based 3D detection. Rules of thumb:
0.05-0.1m for indoor scans (furniture, office objects), 0.2-0.5m for ground-level urban scenes (cars, pedestrians), 1-2m for airborne LiDAR surveys (buildings, vegetation). Too small and you drown the signal in noise. Too big and you merge distinct objects together.
3. Point-Based Approaches: VoteNet and 3DETR
Point-based approaches skip the information loss baked into voxelization by working directly on raw coordinates. Two standout architectures show off this philosophy: VoteNet and 3DETR.
VoteNet borrows from the generalized Hough voting system. Each point "votes" for the position of the center of its object. The votes then get clustered, and each vote cluster produces a bounding box proposal. The intuition is straightforward: a point on a car's hood votes toward the geometric center of the car, while a point on the roof votes the opposite direction. Where votes converge, you've found an object center.
3DETR adapts the DETR Transformer architecture (originally built for 2D detection) to 3D. A set of learnable "queries" probes the features extracted from the point cloud through cross-attention. Each query specializes on a potential object and directly predicts its class and bounding box. The big win: no Non-Maximum Suppression (NMS) in post-processing, because the queries naturally divide the objects among themselves.
3DETR, Group-Free 3D, CAGroup3D and Point Transformer v3 dominate the ScanNet and SUN RGB-D benchmarks. Why? Attention naturally captures long-range relations between the different parts of an object, which local convolutions struggle to do.
VoteNet assumes foreground points (objects) have already been separated from background (ground, walls). On a raw, unsegmented point cloud, ground points vote in random directions and pollute the vote clusters. In practice, always run a RANSAC ground extraction before VoteNet, or use a variant like
ImVoteNet that folds in 2D features to disambiguate.
4. Detection in Urban LiDAR Scenes
Urban detection comes with its own constraints: scenes that stretch hundreds of meters, density that varies with distance from the sensor, frequent occlusions, and objects at very different scales (a pedestrian is 1.7m, a building is 30m).
The practical pipeline starts with rigorous preprocessing: RANSAC ground extraction to normalize heights, outlier removal, and cutting the scene into manageable tiles. Post-processing matters just as much: 3D NMS to kill redundant detections, confidence-score filtering, and optionally temporal tracking if the data is sequential.
For plenty of urban use cases, a simple detector based on geometric rules (height, density, shape) is surprisingly effective. The example below shows a bare-bones pipeline that spots trees and buildings using height and shape criteria after ground extraction.
import numpy as np
import open3d as o3d
from sklearn.cluster import DBSCAN
def detect_urban_objects(pcd, ground_threshold=0.3):
"""Simple detector based on geometric rules."""
points = np.asarray(pcd.points)
# 1. RANSAC ground extraction
plane, inliers = pcd.segment_plane(
distance_threshold=0.15, ransac_n=3, num_iterations=2000)
non_ground = pcd.select_by_index(inliers, invert=True)
ng_pts = np.asarray(non_ground.points)
# 2. Normalize height against the ground
z_ground = np.mean(points[inliers, 2])
heights = ng_pts[:, 2] - z_ground
# 3. Cluster the objects
clustering = DBSCAN(eps=0.8, min_samples=20).fit(ng_pts)
detections = []
for label in set(clustering.labels_) - {-1}:
mask = clustering.labels_ == label
cluster_pts = ng_pts[mask]
cluster_h = heights[mask]
# Compute dimensions
bbox = cluster_pts.max(axis=0) - cluster_pts.min(axis=0)
h_max = cluster_h.max()
aspect = bbox[2] / (max(bbox[0], bbox[1]) + 1e-8)
# Rule-based classification
if h_max > 8 and bbox[0] > 3 and bbox[1] > 3:
cls = "building"
conf = min(0.95, 0.6 + h_max / 50)
elif 2 < h_max < 15 and aspect > 1.5:
cls = "tree"
conf = min(0.90, 0.5 + aspect / 5)
elif 1 < h_max < 3 and bbox[0] < 6:
cls = "vehicle"
conf = 0.70
else:
cls = "unknown"
conf = 0.30
# Oriented bounding box
o3d_cluster = o3d.geometry.PointCloud()
o3d_cluster.points = o3d.utility.Vector3dVector(cluster_pts)
obb = o3d_cluster.get_oriented_bounding_box()
detections.append({
"class": cls, "confidence": conf,
"center": obb.center, "extent": obb.extent,
"n_points": mask.sum()
})
print(f"Detected objects: {len(detections)}")
return detections
| Method | Accuracy (mAP) | Speed | Ease of use |
|---|---|---|---|
| Geometric rules | 50-65% | Very fast | Instant, no training |
| VoxelNet / SECOND | 75-82% | Real-time (GPU) | Needs training + GPU |
| VoteNet | 70-78% | Moderate | Moderate training |
| 3DETR / CAGroup3D | 80-88% | Slow | Heavy GPU + large dataset |
| PointPillars | 72-80% | Very fast (GPU) | Good production trade-off |
The top detectors in 2026 fuse LiDAR and camera.
BEVFusion projects LiDAR and camera features into a shared BEV (Bird's Eye View) space, then runs detection in that unified space. You typically pick up 5 to 8 mAP points over LiDAR alone, especially for distant objects where LiDAR goes sparse. TransFusion uses cross-attention between the two modalities for a tighter alignment.
5. Toward Complete Scene Understanding
Segmentation, classification and detection are just the building blocks. Complete scene understanding combines these three pillars to build a structured, intelligible representation of the 3D environment.
The scene graph concept pushes that integration further. Each detected object becomes a graph node, and spatial relations between objects become edges: "the car is parked in front of the building", "the tree is adjacent to the sidewalk", "the window is in the facade". This relational representation is the substrate on which spatial AI agents reason.
This unified view, where every point in the cloud is simultaneously segmented, classified and placed in a relational graph, is what modern digital twins and automated BIM systems are built on. We'll dig into those in the next parts of the book.
3D Foundation Models like OpenScene and SAM3D are starting to unify segmentation, classification and detection under a single framework. By combining 2D embeddings (CLIP, SAM) with 3D features, they enable zero-shot scene understanding: describe what you're looking for in plain language, and the model pins it down in the point cloud. That's the frontier between 3D perception and general AI.
3D IoU is a lot stricter than its 2D counterpart. Two boxes that look well aligned from above can have a poor 3D IoU if their heights differ. A threshold of
3D IoU > 0.25 is standard on indoor benchmarks (ScanNet), while 3D IoU > 0.5 or 0.7 is used for self-driving (KITTI, nuScenes). Never compare detectors evaluated at different IoU thresholds.
Segmentation, classification, detection: you've got the full toolbox for pulling meaning out of point clouds. But what's the point of detecting walls, floors and openings if you can't turn them into usable models? Part III shows you how to automate the Scan-to-BIM pipeline and build urban digital twins.
Part III. Automation and BIM
Modules 2-3: Scan-to-BIM, IFC and Urban Digital Twins
Now that you know how to segment, classify and detect objects in point clouds, we're moving on to the industrial application that gets asked for the most: turning 3D scans into structured BIM models. This part gives you what you need to automate the Scan-to-BIM pipeline, work with the IFC format in Python, and fold your models into urban digital twins.
Chapter 9. Automatic Scan-to-BIM: The Digital Transformation
1. The "Slice" Strategy
Detecting walls directly in 3D is hard and expensive. The engineer's trick is to project the problem into 2D. By pulling a slice out of the building at a strategic height (say 1.5m), we turn a 3D problem into a 2D binary image.
import numpy as np
def create_floor_slice(pcd, height_min=1.2, height_max=1.8):
# Band-pass filter on Z
points = np.asarray(pcd.points)
mask = (points[:, 2] > height_min) & (points[:, 2] < height_max)
slice_pcd = pcd.select_by_index(np.where(mask)[0])
# 2D projection (flatten Z)
grid, transform = points_to_raster(slice_pcd, pixel_size=0.05)
return grid # Binary image (0 or 255)
2. Hough Transform: Detecting Lines
From the binary slice image, the Hough transform picks out the straight segments that correspond to wall surfaces. OpenCV gives you an optimized implementation through HoughLinesP.
import cv2
def find_walls(image_grid):
lines = cv2.HoughLinesP(
image_grid,
rho=1, # Distance resolution (1 pixel)
theta=np.pi/180, # Angular resolution (1 degree)
threshold=50, # Minimum votes
minLineLength=30, # Min length (e.g. 1.5m)
maxLineGap=10 # Allowed gap for merging
)
return lines
3. Topology Reconstruction: From Lines to Walls
Shapely library to buffer the lines and run boolean operations (Union) to produce clean wall polygons.
4. IFC Generation with IfcOpenShell
The final step of the Scan-to-BIM pipeline turns the detected geometry into structured IFC entities. IfcOpenShell lets you build a full BIM model programmatically from scratch.
import ifcopenshell
from ifcopenshell.api import run
# Create empty project
model = ifcopenshell.file()
project = run("root.create_entity", model, ifc_class="IfcProject", name="AutoBIM")
# Create a wall from coords p1, p2
wall = run("root.create_entity", model, ifc_class="IfcWall")
run("geometry.edit_object_placement", model, product=wall)
CGAL library, through its Python bindings (scikit-geometry), gives you solid geometric primitives for reconstructing complex shapes.
Wrap-up: The Limits of Automation
| Step | Difficulty | Auto Success Rate |
|---|---|---|
| Straight walls (Manhattan World) | Low | 95% |
| Curved walls | Medium | 60% |
| Openings (Doors/Windows) | High | 40% |
| MEP Networks | Extreme | 10% |
Chapter 10. IfcOpenShell: Hack BIM with Python
1. The IFC Spatial Hierarchy
An IFC file is organized as a tree: Project > Site > Building > Storey > Spaces > Elements. Walking that hierarchy is the first skill to master if you want to pull out targeted information.
import ifcopenshell
model = ifcopenshell.open("model.ifc")
project = model.by_type("IfcProject")[0]
def print_hierarchy(object, indent=0):
print(" " * indent + f"-> {object.Name} ({object.is_a()})")
if hasattr(object, "IsDecomposedBy"):
for rel in object.IsDecomposedBy:
for related_obj in rel.RelatedObjects:
print_hierarchy(related_obj, indent+1)
print_hierarchy(project)
GlobalId identifiers are UUIDs compressed in base64. To generate compliant GlobalIds, use ifcopenshell.guid.new() rather than uuid.uuid4(). An invalid GlobalId can silently corrupt the IFC file and cause crashes in BIM viewers like BIMcollab or Solibri.
2. Data Mining: Extracting Quantities
IFC Property Sets and Quantity Sets hold the measured information for every element. Export them into a Pandas DataFrame and you turn a BIM file into a queryable database.
import pandas as pd
import ifcopenshell.util.element
data = []
walls = model.by_type("IfcWall")
for w in walls:
psets = ifcopenshell.util.element.get_psets(w)
qto = psets.get("Qto_WallBaseQuantities", {})
data.append({
"GlobalId": w.GlobalId,
"Name": w.Name,
"Volume": qto.get("NetVolume", 0.0),
"Area": qto.get("NetSideArea", 0.0)
})
df = pd.DataFrame(data)
print(df.groupby("Name")["Volume"].sum())
In 10 lines of code, you've replaced a paid Revit plugin.
3. Geometry: From Implicit to Mesh
IFC geometries are stored implicitly (extruded profiles, CSG). IfcOpenShell's geometric engine tessellates them into triangular meshes you can actually use for visualization and analysis.
import ifcopenshell.geom
settings = ifcopenshell.geom.settings()
settings.set(settings.USE_WORLD_COORDS, True)
for wall in walls:
shape = ifcopenshell.geom.create_shape(settings, wall)
verts = shape.geometry.verts # Flat list X,Y,Z...
faces = shape.geometry.faces # Triangle indices
4. Modify and Write
IfcOpenShell isn't read-only. You can add properties, modify attributes and save the updated model. That's the key to BIM automation.
for wall in walls:
ifcopenshell.api.run("pset.edit_pset", model,
product=wall,
pset_name="Pset_QualityControl",
properties={"Status": "Checked", "Date": "2026-01-10"})
model.write("model_checked.ifc")
Chapter 11. Urban Digital Twin: The CityGML Standard
1. The LOD (Level of Detail) Concept
- LOD 0: 2.5D terrain (DTM). Use case: hydrology.
- LOD 1: Extruded blocks ("Shoebox"). Use case: macro-scale urban planning.
- LOD 2: Differentiated roofs. Use case: solar potential.
- LOD 3: Detailed facades (windows, doors). Use case: architecture.
- LOD 4: Building interiors (BIM). Use case: crisis management.
2. Parsing CityGML with Python
CityGML is an XML format based on GML. The lxml library lets you move through its dense namespaces to pull out the semantic attributes of each building.
from lxml import etree
ns = {'core': "http://www.opengis.net/citygml/2.0",
'bldg': "http://www.opengis.net/citygml/building/2.0",
'gml': "http://www.opengis.net/gml"}
def parse_buildings(file_path):
tree = etree.parse(file_path)
root = tree.getroot()
buildings = root.findall('.//bldg:Building', namespaces=ns)
print(f"Found {len(buildings)} buildings.")
for b in buildings:
measured_height = b.find('.//bldg:measuredHeight', namespaces=ns)
height = float(measured_height.text) if measured_height is not None else 0.0
print(f"Building Height: {height}m")
etree.parse (which loads everything into RAM). Go with etree.iterparse for streaming reads.
3. From CityGML to the Web: 3D Tiles
To view an urban digital twin in a browser, you have to convert the CityGML into 3D Tiles, Cesium's spatial streaming format. The py3dtiles tool handles the conversion.
# Conversion via Py3DTiles CLI
py3dtiles convert input.gml --out output_tileset/
4. Visualization with CesiumJS
CesiumJS is the reference web engine for 3D geospatial rendering. In a few lines of JavaScript, it loads a 3D Tiles tileset and applies dynamic styling based on semantic attributes.
const viewer = new Cesium.Viewer('cesiumContainer');
const tileset = viewer.scene.primitives.add(
new Cesium.Cesium3DTileset({
url: './output_tileset/tileset.json'
})
);
tileset.readyPromise.then(function(tileset) {
viewer.zoomTo(tileset);
tileset.style = new Cesium.Cesium3DTileStyle({
color: {
conditions: [
['${Height} >= 50', 'color("purple")'],
['${Height} >= 20', 'color("red")'],
['true', 'color("white")']
]
}
});
});
citygml-tools or FME. The 3D Tiles format uses a hierarchical LOD with bounding volumes, which lets you stream millions of buildings without saturating browser memory.
Part IV. Machine Learning & Deep Learning 3D
Module 4: From classical features to neural architectures for point clouds
Automated BIM runs on rules and heuristics. To go further, to generalize, learn, adapt, you have to step into Machine Learning and then Deep Learning. This part starts with the fundamentals of ML applied to 3D (geometric features, Random Forest, SVM) before digging into the neural architectures (PointNet, PointNet++) that reshaped the field.
Chapter 12. Machine Learning for 3D: Fundamentals Demystified
1. AI, Machine Learning, Deep Learning: What's What?
These terms get tossed around interchangeably in the media, but they describe nested concepts with increasing levels of specificity. Grasping this hierarchy matters if you want to avoid the buzzword trap and pick the right approach for each problem.
Artificial Intelligence is the broadest field: any system that mimics intelligent behavior. In 3D, a plain RANSAC algorithm that detects a plane already counts as AI in the wide sense. Machine Learning is a subset of AI: algorithms that learn patterns from data without being explicitly programmed for each case. In 3D, a Random Forest trained on geometric features to classify ground/building/vegetation is classical ML.
Deep Learning is itself a subset of ML. It uses multi-layer neural networks to automatically learn hierarchical representations. PointNet, which takes raw XYZ coordinates and figures out on its own which features matter, is DL. Beyond that, Foundation Models (SAM, CLIP) are DL models pre-trained on massive data, adaptable to specific tasks through fine-tuning or zero-shot.
In 2026, the reality on the ground is more nuanced. LLMs and "GenAI" dominate the media conversation, but in production 3D, classical ML is still massively deployed. Random Forests with geometric features run point cloud classification pipelines at national scale (IGN, Kadaster, Ordnance Survey). Deep Learning is catching up fast, but it needs more data, more GPU, and more expertise to be reliable.
2. Classical Machine Learning on Point Clouds
Classical ML on point clouds rests on one core principle: feature engineering. You manually extract geometric descriptors for each point (or group of points), then train a supervised classifier on those feature vectors.
The most discriminative geometric features come from the local covariance matrix (PCA over k-nearest neighbors). The three eigenvalues lambda_1 >= lambda_2 >= lambda_3 let you derive: planarity (lambda_2 - lambda_3) / lambda_1, linearity (lambda_1 - lambda_2) / lambda_1, sphericity lambda_3 / lambda_1, and curvature lambda_3 / (lambda_1 + lambda_2 + lambda_3). A flat ground will have strong planarity, a power cable strong linearity, a tree strong sphericity.
The Random Forest is the go-to algorithm for this task. Why? It natively handles features at different scales, it's robust to noise, it needs no normalization, and it gives you a feature importance score that helps make sense of the model. With 15 multi-scale geometric features, a Random Forest hits 85-90% accuracy on urban benchmarks.
import open3d as o3d
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
def compute_geometric_features(pcd, k=30):
"""Compute geometric features via local PCA."""
tree = o3d.geometry.KDTreeFlann(pcd)
points = np.asarray(pcd.points)
features = []
for i in range(len(points)):
[_, idx, _] = tree.search_knn_vector_3d(points[i], k)
neighbors = points[idx]
cov = np.cov(neighbors.T)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]
l1, l2, l3 = eigenvalues + 1e-10 # Avoid division by zero
linearity = (l1 - l2) / l1
planarity = (l2 - l3) / l1
sphericity = l3 / l1
curvature = l3 / (l1 + l2 + l3)
anisotropy = (l1 - l3) / l1
features.append([linearity, planarity, sphericity, curvature, anisotropy])
return np.array(features)
# Full pipeline
pcd = o3d.io.read_point_cloud("scene.ply")
X = compute_geometric_features(pcd, k=30)
y = np.load("labels.npy") # 0=ground, 1=building, 2=vegetation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier(n_estimators=100, max_depth=15, n_jobs=-1)
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.3f}")
Normalize your features with
StandardScaler before using an SVM or k-NN (those algorithms are scale-sensitive). On the flip side, never normalize before a Random Forest: decision trees are invariant under monotonic transformations, and normalization only hurts the interpretability of split thresholds without improving accuracy.
3. From Feature Engineering to Feature Learning
Classical ML has an Achilles heel: features are human-designed. No matter how clever, they only capture what the designer imagined. Complex spatial patterns (multi-scale relationships between objects, scene contexts) slip past manual geometric descriptors.
That's the fundamental shift Deep Learning brings: instead of designing features, you let the network learn them directly from raw data. The network jointly optimizes feature extraction AND classification, finding representations a human would never have dreamed up.
The bridge between both worlds is PointNet (Qi et al., 2017). It takes raw XYZ coordinates as input, applies learned transformations (shared MLPs), and produces a global representation through symmetric max-pooling. No voxelization, no 2D projection, just direct processing of the point cloud. That's what sets up the next chapters on Deep Learning for 3D.
4. When to Use What: The Practitioner's Decision Tree
When you face a 3D classification or segmentation problem, the choice of method should never follow tech fashion. It depends on concrete criteria: dataset size, available hardware, need for interpretability, and latency constraints.
The golden rule: start simple. If you have fewer than 10,000 annotated points, a Random Forest with geometric features will be more reliable than an under-trained PointNet. If you have a GPU and hundreds of thousands of points, PointNet or PointNet++ become competitive. For massive datasets (millions of points, dozens of classes), pre-trained Foundation Models like SAM-3D or OpenScene offer the best accuracy/annotation effort ratio.
Interpretability is an often-ignored factor. A Random Forest tells you which features matter (planarity dominates? curvature?). A neural network is a black box. If your client or your team needs to understand why a point is classified as "building", classical ML is still unbeatable.
| Method | Min. data | Hardware | Typical accuracy | Training | Interpretability |
|---|---|---|---|---|---|
| Random Forest + features | 1k-10k pts | CPU | 85-90% (urban) | Seconds | High |
| SVM + features | 1k-50k pts | CPU | 83-88% | Minutes | Medium |
| PointNet | 50k-500k pts | GPU (4 GB+) | 87-91% | Hours | Low |
| PointNet++ | 100k-1M pts | GPU (8 GB+) | 90-94% | Hours | Low |
| Foundation Models | Zero-shot possible | GPU (16 GB+) | 91-96% | Minutes (fine-tune) | Very low |
The boundary between classical ML and Deep Learning isn't watertight. The top-performing approaches in 2026 often combine both: geometric features concatenated with features learned by a neural network. For example, extract eigenvalue features at multiple scales, then inject them as additional features into a PointNet++. This hybrid strategy consistently gains 2-4 mIoU points over DL alone, especially on small datasets.
5. Evaluation Metrics for 3D
Properly evaluating a 3D classification or segmentation model is subtler than it looks. Global accuracy, the most intuitive metric, is often misleading because of the massive class imbalance in real point clouds.
In a typical urban point cloud, the ground accounts for 60-70% of points, buildings 15-20%, vegetation 10-15%, and rare objects (cars, pedestrians, street furniture) under 5%. A model that predicts "ground" for every point hits 65% accuracy, a flattering but totally useless number. That's why per-class metrics are essential.
The reference metric in 3D segmentation is mIoU (mean Intersection over Union). For each class, IoU measures the ratio between the intersection and the union of predictions and ground truth. mIoU is the average of these per-class IoUs, giving equal weight to each category regardless of frequency. An mIoU of 70% is considered good on standard benchmarks (S3DIS, SemanticKITTI).
import numpy as np
def compute_miou(predictions, labels, num_classes):
"""Compute mIoU (mean Intersection over Union)."""
ious = []
for c in range(num_classes):
pred_c = predictions == c
label_c = labels == c
intersection = np.logical_and(pred_c, label_c).sum()
union = np.logical_or(pred_c, label_c).sum()
if union == 0:
ious.append(float('nan')) # Class absent
else:
ious.append(intersection / union)
miou = np.nanmean(ious)
# Per-class output
for c, iou in enumerate(ious):
print(f" Class {c}: IoU = {iou:.3f}")
print(f" Global mIoU: {miou:.3f}")
return miou, ious
# Usage example
preds = np.array([0, 0, 1, 2, 2, 1, 0, 2])
gt = np.array([0, 1, 1, 2, 0, 1, 0, 2])
miou, per_class = compute_miou(preds, gt, num_classes=3)
Spatial data leakage is a sneaky trap in 3D. Points close in space are strongly correlated (spatial autocorrelation). If your train/test split randomly partitions points from the same object, the model "memorizes" the local context instead of learning generalizable patterns. Solution: always split by spatial zones (per building, per tile, per scan), never by individual points.
mIoU alone can hide catastrophic performance on rare classes. In the example above, the 72% mIoU looks fine, but the Vehicle IoU at 58% is unacceptable for an urban mapping application. Always report mIoU and per-class IoU in your reports. In production, set a minimum threshold per critical class (e.g., Vehicle IoU > 70% for autonomous driving).
Self-supervised pre-training approaches (
Point-MAE, Point-BERT) let you learn 3D representations from unlabeled data, then fine-tune on your specific task with very few labels. This is the dominant trend in 2026: pre-train on millions of unlabeled points (Waymo, nuScenes), then adapt with a few hundred annotated points from your domain. The typical gain is 5-10 mIoU points over training from scratch.
Chapter 13. PointNet: The Architecture That Changed Everything
1. The Ordering Problem
A point cloud is a set of N points. If you permute the order of the points, the object stays the same. But a classical neural network (MLP, CNN) is sensitive to input order. PointNet solves this by using a symmetric function (Max Pooling) that makes the network permutation-invariant.
2. T-Net: Spatial Alignment
The T-Net is a mini-network that learns a transformation matrix to spatially align the input cloud. It makes the model robust to arbitrary rotations and translations.
class TNet(nn.Module):
def __init__(self, k=3):
super().__init__()
self.conv1 = nn.Conv1d(k, 64, 1)
self.fc1 = nn.Linear(1024, 512)
self.fc3 = nn.Linear(256, k*k)
def forward(self, x):
# Learns a k x k transformation matrix
matrix = self.fc3(x).view(-1, k, k)
identity = torch.eye(k).cuda()
return matrix + identity
3. Backbone Implementation
The PointNet backbone chains the T-Net, shared MLP layers, and a global Max Pooling to produce a 1024-dimensional feature vector that's permutation-invariant.
class PointNetCls(nn.Module):
def __init__(self, k=10, feature_transform=False):
super().__init__()
self.feat = PointNetFeat(global_feat=True)
self.fc1 = nn.Linear(1024, 512)
self.fc2 = nn.Linear(512, 256)
self.fc3 = nn.Linear(256, k) # k classes
self.dropout = nn.Dropout(p=0.3)
def forward(self, x):
x, trans, trans_feat = self.feat(x)
x = F.relu(self.fc1(x))
x = F.log_softmax(self.fc3(x), dim=1)
return x, trans_feat
L2 regularization on the transformation matrix: reg_loss = 0.001 * torch.norm(I - torch.bmm(trans, trans.transpose(2,1))). Without this regularization, the matrix diverges and the network won't converge.
4. Classification vs Segmentation
PointNet supports two core tasks. Classification assigns a global label to an object (chair, table, airplane), while semantic segmentation assigns a class to each point individually. The architectural difference is tiny, but the practical implications are massive.
| Task | Output | Architecture |
|---|---|---|
| Classification | 1 class per cloud | Global Max Pooling -> FC Layers |
| Segmentation | 1 class per point | Concatenate Global Feat (1024) to each Point Feat (64) |
For segmentation, the clever idea is to reinject global information (the whole-shape context) back at the local level of each point.
Chapter 14. PointNet++: The Power of Local Context
1. The Hierarchical Architecture
PointNet++ splits the cloud into increasingly larger zones. The base unit is the Set Abstraction Level (SA), made of three steps:
- Sampling: Pick representative centroids (FPS).
- Grouping: Find the neighbors around these centroids (Ball Query).
- PointNet: Encode these neighborhoods into feature vectors.
2. Furthest Point Sampling (FPS)
FPS iteratively selects the point farthest from the already-chosen set, guaranteeing uniform spatial coverage of the cloud. This is the sampling step of the Set Abstraction Layer.
def farthest_point_sample(xyz, npoint):
# xyz: [B, N, 3]
centroids = torch.zeros(B, npoint, dtype=torch.long).cuda()
distance = torch.ones(B, N).cuda() * 1e10
farthest = torch.randint(0, N, (B,), dtype=torch.long).cuda()
for i in range(npoint):
centroids[:, i] = farthest
centroid = xyz[batch_indices, farthest, :].view(B, 1, 3)
dist = torch.sum((xyz - centroid) ** 2, -1)
mask = dist < distance
distance[mask] = dist[mask]
farthest = torch.max(distance, -1)[1]
return centroids
3. Multi-Scale Grouping (MSG)
The PointNet++ solution: look at multiple scales (e.g., radius 0.1m and 0.4m) at once and concatenate the results.
4. Feature Propagation (Upsampling)
For segmentation, you need to propagate the features learned at higher levels back to each original point. Inverse-distance weighted interpolation performs this upsampling.
def feature_interpolate(points, centroids, feats):
# Find the 3 nearest centroids for each original point
dists, idx = k_nearest_neighbors(points, centroids, k=3)
inv_dists = 1.0 / (dists + 1e-8)
norm = torch.sum(inv_dists, dim=2, keepdim=True)
weights = inv_dists / norm
# Weighted sum of features
interpolated_feats = torch.sum(group_features * weights.view(B, N, 3, 1), dim=2)
return interpolated_feats
State of the Art
| Model | mIoU (S3DIS) | Inference Speed |
|---|---|---|
| PointNet | 47.6% | 150 fps (Ultra Fast) |
| PointNet++ | 54.5% | 35 fps (Fast) |
| RandLA-Net | 62.4% | 60 fps (Very Fast) |
| Point Transformer | 70.1% | 5 fps (Heavy) |
Part V. Neural Rendering and 3D Reconstruction
Module 5: Gaussian Splatting, NeRF, and Photogrammetry
Neural networks aren't just for classification. They can also generate whole 3D scenes from plain photographs. This part covers the three 3D reconstruction and rendering approaches: 3D Gaussian Splatting, NeRFs, and classical photogrammetry revisited through Python. You'll learn how to build photorealistic 3D models from images.
Chapter 15. 3D Gaussian Splatting: The Rendering Revolution
1. The Theory: Why Gaussians?
Unlike voxels (discrete) or meshes (rigid), a 3D Gaussian is a soft primitive defined by: a mean position, a covariance matrix (stretch/rotation), an opacity, and a view-dependent color (Spherical Harmonics).
2. Initialization: The Role of COLMAP
3DGS starts from a sparse point cloud generated by Structure-from-Motion. COLMAP is the reference tool for this initial reconstruction, which provides camera poses and the starting 3D points.
# Feature extraction and matching
colmap feature_extractor --database_path db.db --image_path images/
colmap exhaustive_matcher --database_path db.db
# Sparse reconstruction
colmap mapper --database_path db.db --image_path images/ --output_path sparse/
3. The Training Loop (Optimization Loop)
Training iteratively optimizes each gaussian's parameters (position, covariance, opacity, color) by minimizing the difference between the rendered image and the real photo through backpropagation.
def training_step(iteration):
# 1. Differentiable Rasterization
rendered_image = rasterizer(viewpoint_cam, pc)
# 2. Loss computation
gt_image = viewpoint_cam.original_image
loss = (1.0 - lambda_dssim) * l1_loss(rendered_image, gt_image) + \
lambda_dssim * (1.0 - ssim(rendered_image, gt_image))
# 3. Backprop & Optimizer Step
loss.backward()
optimizer.step()
optimizer.zero_grad()
# 4. Densification (Adaptive Control)
if iteration % densification_interval == 0:
pc.densify_and_prune(grad_threshold, min_opacity, extent)
4. Compression and Export
| Attribute | Raw Type | Compressed Type | Gain |
|---|---|---|---|
| Position | float32 (12B) | float16 (6B) | 50% |
| Rotation | float32 (16B) | uint8 (4B) | 75% |
| SH (Color) | float32 (45 * 4B) | Codebook Index (1B) | 99% |
.splat format (16 bits per attribute instead of 32) cuts size by 60% with no perceptible visual loss. The antimatter15/splat implementation on GitHub is the reference for WebGL rendering.
5. The Python Ecosystem
Nerfstudio to get started. Their pipeline wires up Colmap, training, and a web socket viewer (Viser) automatically.
ns-train splatfacto --data data/my_scene
Chapter 16. NeRF vs 3D Gaussian Splatting: The Definitive 2026 Comparison
1. Fundamentals: Implicit vs Explicit
The core difference between NeRF and 3DGS lies in how they represent the scene: an implicit neural network versus explicit geometric primitives. This architectural choice dictates every downstream property.
| Criterion | NeRF (Implicit) | 3DGS (Explicit) |
|---|---|---|
| Representation | MLP / Hash grid | Anisotropic 3D Gaussians |
| Rendering | Ray marching (slow) | Differentiable rasterization (fast) |
| Editability | Low (network weights) | High (direct manipulation) |
| Model size | ~50 MB | ~200-800 MB |
2. NeRF Pipeline with Nerfstudio
Nerfstudio unifies the NeRF workflow end-to-end: data prep via COLMAP, model training, and interactive visualization. Two commands are enough to get a result.
# Data preparation with COLMAP
ns-process-data images --data ./photos/ --output-dir ./processed/
# Nerfacto training
ns-train nerfacto --data ./processed/ \
--max-num-iterations 30000
3. 3D Gaussian Splatting Pipeline
For finer control than Nerfstudio, the gsplat library exposes the differentiable rasterizer directly in Python. Here's how to run a custom render from raw gaussians.
import torch
from gsplat import rasterization
def render_gaussians(means3d, scales, rotations, opacities, sh_coeffs, camera):
"""Differentiable rasterization rendering with gsplat."""
rendered_image, rendered_alpha, info = rasterization(
means=means3d, quats=rotations,
scales=scales, opacities=opacities,
colors=sh_coeffs,
viewmats=camera.world_to_cam[None],
Ks=camera.intrinsics[None],
width=camera.width, height=camera.height,
sh_degree=3,
)
return rendered_image, rendered_alpha
4. Comparative Benchmark
The numbers speak for themselves. This benchmark on the Mip-NeRF 360 dataset compares the four main methods across visual quality, training time, and rendering performance.
| Method | PSNR (dB) | SSIM | Train (min) | FPS 1080p | Size (MB) |
|---|---|---|---|---|---|
| Nerfacto | 28.4 | 0.874 | 35 | 0.8 | 48 |
| Instant-NGP | 29.1 | 0.891 | 8 | 4.2 | 55 |
| 3DGS | 30.2 | 0.918 | 22 | 148 | 410 |
| 3DGS-MCMC | 30.5 | 0.922 | 28 | 135 | 320 |
pip install lpips and evaluate with lpips.LPIPS(net='vgg').
5. Use Cases: When to Pick What
| Use case | Recommendation | Reason |
|---|---|---|
| Real estate virtual tour | 3DGS | Real-time rendering required (>60 FPS) |
| Infrastructure inspection | NeRF (Nerfacto) | Better handling of specular reflections |
| Urban digital twin | 3DGS | Editability and scalability |
| Heritage documentation | 3DGS + NeRF hybrid | Max quality + compact archiving |
| AI training (synthetic data) | NeRF | Fine control over lighting |
6. Production Optimization: Compression and Streaming
A raw 3DGS model can weigh hundreds of MB. Pruning low-opacity gaussians and quantizing spherical harmonics drastically shrink the size for web and mobile deployment.
def compress_gaussians(model_path, output_path, target_ratio=0.2):
"""Compress a 3DGS model via pruning + quantization."""
data = torch.load(model_path)
means = data["means"]
opacities = data["opacities"]
# Prune by opacity
opacity_threshold = torch.quantile(opacities, 1.0 - target_ratio)
keep_mask = opacities > opacity_threshold
# Prune by size
scales = data["scales"]
size_mask = scales.max(dim=1).values < 0.5
final_mask = keep_mask & size_mask
# Quantize SH (float32 -> float16)
compressed = {}
for key, val in data.items():
if "sh" in key:
compressed[key] = val[final_mask].half()
else:
compressed[key] = val[final_mask]
torch.save(compressed, output_path)
7. 2026 Trends: Toward Convergence
- 4D Gaussian Splatting: Dynamic scenes with temporal deformation of gaussians.
- LangSplat: Semantic embeddings directly inside gaussians.
- Neural compression: Learned codecs that shrink 3DGS models by 10x.
- Relighting: PBR material decomposition for realistic re-lighting.
Chapter 17. Python Photogrammetry: The Art of 3D Reconstruction
1. The SfM Pipeline (Structure-from-Motion)
SfM is the beating heart of the pipeline. It recovers camera positions (Motion) and scene geometry (Structure) at the same time.
2. Automating with COLMAP and Python
COLMAP runs from the command line, which makes it easy to script in Python through subprocess. This pattern lets you slot SfM reconstruction into an automated batch processing pipeline.
import subprocess
def run_colmap(project_path):
database = project_path + "/database.db"
images = project_path + "/images"
# 1. Feature Extraction
subprocess.run(["colmap", "feature_extractor",
"--database_path", database,
"--image_path", images])
# 2. Exhaustive Matching
subprocess.run(["colmap", "exhaustive_matcher",
"--database_path", database])
# 3. Mapper (Reconstruction)
subprocess.run(["colmap", "mapper",
"--database_path", database,
"--image_path", images,
"--output_path", project_path + "/sparse"])
exiftool -FocalLength *.jpg. If EXIF data is missing (video frames, screenshots), set the camera model manually with --ImageReader.camera_model PINHOLE.
3. Multi-View Stereo (MVS): Going Dense
4. Post-Processing and Meshing
The dense cloud coming out of MVS needs to be turned into a triangle mesh before you can export it to CAD or visualization software. Poisson reconstruction builds a smooth, watertight mesh from the estimated normals.
import open3d as o3d
def poisson_mesh(pcd_path, depth=9):
pcd = o3d.io.read_point_cloud(pcd_path)
pcd.estimate_normals()
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=depth)
vertices_to_remove = densities < np.quantile(densities, 0.1)
mesh.remove_vertices_by_mask(vertices_to_remove)
return mesh
Summary: When to Use Photogrammetry
| Technology | Hardware Cost | Accuracy | Use Case |
|---|---|---|---|
| Terrestrial LiDAR | $$$$ (20k+) | Millimeter | Industry, Metrology |
| Photogrammetry | $ (Smartphone) | Centimeter | Heritage, Virtual Tours |
| iPhone LiDAR | $$ (1k) | Decimeter | Fast scanning, Indoor AR |
Part VI. Advanced Spatial AI
Modules 5-6: Foundation Models, AI Agents and LangChain 3D
We're stepping into the most cutting-edge territory of 3D Data Science. Foundation Models enable zero-shot segmentation of 3D scenes. Spatial AI agents reason in natural language over geometric data. LangChain wires LLMs into your 3D pipelines. This part pushes you to the frontier of what's possible in 2026.
Chapter 18. Foundation Models for 3D: Zero-Shot Segmentation in 2026
1. The Era of Spatial Foundation Models
Foundation Models reshaped text and image processing. In 2026, that shift is hitting 3D. Architectures like SAM 3D, OpenScene and LERF now segment point clouds with zero training on the target domain.
2. CLIP and the Vision-Language Bridge to 3D
CLIP (Contrastive Language-Image Pre-training) lines up text and images in a shared embedding space. Project those features onto a point cloud and every point picks up an open-vocabulary semantic "understanding".
import torch
import clip
from PIL import Image
model, preprocess = clip.load("ViT-L/14", device="cuda")
def extract_clip_features(images_list):
features = []
for img_path in images_list:
image = preprocess(Image.open(img_path)).unsqueeze(0).to("cuda")
with torch.no_grad():
feat = model.encode_image(image)
features.append(feat.cpu().numpy())
return np.vstack(features)
def text_query_embedding(text_prompt):
tokens = clip.tokenize([text_prompt]).to("cuda")
with torch.no_grad():
text_feat = model.encode_text(tokens)
return text_feat.cpu().numpy()
3. OpenScene: Multi-View Fusion for 3D Understanding
OpenScene fuses CLIP features pulled from multiple 2D views to enrich every 3D point with a semantic embedding. The payoff: a natural-language text query is enough to segment the cloud.
from openscene import OpenSceneModel
model = OpenSceneModel(backbone="ViT-L/14", feature_dim=768, fusion_strategy="weighted_average")
point_features = model.compute_3d_features(
points=points, images_dir="./renders/",
poses_file="./poses.json", intrinsics_file="./intrinsics.json"
)
query = "fire extinguisher"
scores = model.query(point_features, query)
mask = scores > 0.75
result = pcd.select_by_index(np.where(mask)[0])
| Method | Type | Vocabulary | ScanNet mIoU |
|---|---|---|---|
| PointNet++ (supervised) | Closed-set | 20 classes | 53.5% |
| OpenScene | Open-vocabulary | Unlimited | 47.2% |
| SAM 3D | Zero-shot | Unlimited | 51.1% |
4. LERF: Language Embedded Radiance Fields
LERF bakes CLIP embeddings straight into a NeRF radiance field. That lets you generate 3D relevancy maps from a text query without any explicit segmentation step up front.
from nerfstudio.pipelines.base_pipeline import Pipeline
pipeline = Pipeline.load("outputs/scene/lerf/2026-02-15/")
model = pipeline.model
query_text = "emergency exit door"
relevancy_map = model.get_relevancy(query=query_text, resolution=512, scale_range=(0.1, 1.5))
spatial_mask = relevancy_map > 0.6
5. SAM 3D: Segment Anything in Space
SAM 3D extends Meta's Segment Anything model into the third dimension by merging 2D masks produced across multiple views into a 3D instance segmentation without any task-specific training.
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from sam3d import MultiViewFusion, MaskMerger
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
mask_gen = SamAutomaticMaskGenerator(model=sam, points_per_side=32, pred_iou_thresh=0.86)
fusion = MultiViewFusion(pcd_path="scene.ply", images_dir="./views/", depth_dir="./depth/", poses_file="./camera_poses.json")
all_masks = fusion.generate_2d_masks(mask_gen)
merger = MaskMerger(iou_threshold=0.5, min_views=3)
segments_3d = merger.merge(all_masks, fusion.point_cloud)
print(f"3D segments detected: {len(segments_3d)}")
6. Production Pipeline: Multi-Model Orchestration
In production you orchestrate Foundation Models as a pipeline: SAM 3D segments instances, CLIP classifies them, and everything feeds a scene graph. This class wraps the whole workflow.
class ZeroShotPipeline:
"""Complete zero-shot pipeline for 3D understanding."""
def __init__(self, sam_ckpt, clip_model="ViT-L/14"):
self.sam = self._load_sam(sam_ckpt)
self.clip = self._load_clip(clip_model)
def process_scene(self, pcd_path, images_dir, categories=None):
# Step 1: Geometric segmentation (SAM 3D)
segments = self.segment_3d(pcd_path, images_dir)
# Step 2: Open-vocabulary classification (CLIP)
if categories:
segments = self.classify_segments(segments, categories)
# Step 3: Scene graph construction
scene_graph = self.build_scene_graph(segments)
return scene_graph
pipeline = ZeroShotPipeline(sam_ckpt="sam_vit_h.pth")
graph = pipeline.process_scene(
pcd_path="building_scan.ply", images_dir="./captures/",
categories=["wall", "floor", "ceiling", "door", "window", "furniture"]
)
Chapter 19. Deploying Spatial AI Agents in the Enterprise: A Practical Guide
1. What Is a Spatial AI Agent?
A Spatial AI Agent is an autonomous system that pairs an LLM with 3D perception and reasoning tools. It can query point clouds, walk scene graphs and run geometric operations in response to natural-language requests.
2. 3D Scene Graph: The Spatial Knowledge Structure
The Scene Graph is the agent's core data structure: a graph where every node is a 3D object and every edge encodes a spatial relation (proximity, containment, adjacency).
@dataclass
class SpatialNode:
id: str
label: str
centroid: np.ndarray
bbox_extent: np.ndarray
volume: float
class SceneGraph:
def compute_spatial_relations(self):
nodes = list(self.nodes.values())
for i, a in enumerate(nodes):
for b in nodes[i+1:]:
dist = np.linalg.norm(a.centroid - b.centroid)
if dist < 2.0:
rel = self._classify_relation(a, b)
self.edges.append(SpatialEdge(a.id, b.id, rel, dist))
3. Building the Agent with LangGraph
LangGraph gives you the framework to build a ReAct agent that picks and chains geometric tools automatically based on the user's natural-language request.
from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langgraph.prebuilt import create_react_agent
@tool
def find_objects(label: str, max_results: int = 10) -> str:
"""Find 3D objects matching a label in the scene."""
results = scene_graph.query_by_label(label, limit=max_results)
return json.dumps([{"id": r.id, "label": r.label} for r in results])
@tool
def measure_distance(object_id_a: str, object_id_b: str) -> str:
"""Compute the Euclidean distance between two 3D objects."""
a = scene_graph.nodes[object_id_a]
b = scene_graph.nodes[object_id_b]
dist = float(np.linalg.norm(a.centroid - b.centroid))
return f"Distance: {dist:.2f} meters"
llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = create_react_agent(model=llm, tools=[find_objects, measure_distance])
4. REST API with FastAPI
To expose the agent in production, FastAPI gives you a fast, self-documenting REST API. Each user query routes to the agent, which orchestrates the tools and returns a structured response.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
app = FastAPI(title="Spatial AI Agent API", version="1.0.0")
class QueryRequest(BaseModel):
question: str = Field(..., example="Which fire extinguishers are more than 15m from an exit?")
scene_id: str = Field(..., example="building_floor_3")
@app.post("/api/v1/query")
async def query_scene(request: QueryRequest):
result = agent.invoke({"messages": [{"role": "user", "content": request.question}]})
return {"answer": result["messages"][-1].content}
asyncio.wait_for() with a 30-second timeout per step, and cap the total tool calls at 10 per request.
5. Production Infrastructure
| Component | Technology | Role |
|---|---|---|
| API Gateway | Traefik / Kong | Rate limiting, JWT auth |
| Agent Runtime | FastAPI + LangGraph | Business logic |
| Scene Graph DB | Neo4j | Spatial relationship storage |
| Cache | Redis | Frequent queries |
| Monitoring | Prometheus + Grafana | Metrics and alerts |
6. Error Handling and Guardrails
A production AI agent has to validate its own answers. Guardrails independently recompute the numerical values the LLM reports, so geometric hallucinations get caught.
class SpatialGuardrails:
def validate_distance(self, claimed_distance, object_ids):
actual = self.compute_actual_distance(*object_ids)
error = abs(claimed_distance - actual) / actual
if error > 0.1:
return ValidationResult(is_valid=False,
reason=f"Incorrect distance: {claimed_distance:.2f}m vs {actual:.2f}m")
return ValidationResult(is_valid=True)
Chapter 20. Spatial Agent with LangChain: AI That Reasons in 3D
1. The Concept: LLM as a Controller
A model like GPT-4 doesn't "see" 3D. But it's great at logical planning and tool calling. The goal is to build a text interface where the LLM acts as the site foreman, telling specialized Python scripts to run the geometric analysis.
2. Building the Geometric Tools
Each LangChain tool is a Python function decorated with @tool, and its docstring is what the LLM reads as a description. The quality of that docstring directly decides whether the agent picks the right tool.
from langchain.tools import tool
@tool
def count_objects_in_scene(category: str) -> str:
"""Count the number of objects of a given category in the loaded scene."""
count = db.query(f"SELECT COUNT(*) FROM objects WHERE label='{category}'")
return f"There are {count} {category}s in the scene."
@tool
def get_object_dimensions(object_id: int) -> str:
"""Return the dimensions (L x W x H) of a specific object."""
bbox = get_bbox(object_id)
return f"Dimensions: {bbox.extent} m"
3. The Agent Architecture
We use an OpenAI Functions Agent. The agent takes the user's question, decides which tool to call, reads back the tool output (Observation), and puts together a final answer.
4. Example Conversation Flow
Query: "Is there enough room for a 3m wide truck between buildings A and B?"
- Thought: I need to find the distance between building A and B.
- Action 1:
find_object_id(name="Building A")-> ID: 412 - Action 2:
find_object_id(name="Building B")-> ID: 890 - Action 3:
measure_min_distance(id1=412, id2=890)-> 2.85m - Final Answer: No, the minimum distance is 2.85m, not enough for a 3m truck.
from langchain.agents import initialize_agent, AgentType
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0, model="gpt-4-turbo")
tools = [count_objects_in_scene, get_object_dimensions, measure_distance]
agent = initialize_agent(tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=True)
agent.run("Analyze the scan.laz file and list the potential conflicts.")
len(points) < MAX_POINTS in every tool before execution. An unconstrained agent can bring a server to its knees in a single request.
5. Limitations and Security
Since we run Python code (sometimes generated), sandboxing matters a lot. Never give your agent os.system access. Start by limiting the tools to read-only functions.
Part VII. Field Applications
Module 6: Drone acquisition and production pipeline
Theory is in hand. Now it's end-to-end practice. This final part covers two field applications: the full processing of a drone LiDAR survey (from flight to digital twin) and using AI assistants to speed up your 3D pipeline development. You'll learn to orchestrate everything you've picked up into real production workflows.
Chapter 21. Drone LiDAR Processing with Python: From Flight to Digital Twin
1. Flight Planning and Acquisition Parameters
The quality of a drone LiDAR survey is decided before takeoff. Getting the physics of acquisition right is how you size a flight properly and hit the point density your spec demands. Four parameters interact to set ground density: the Pulse Repetition Rate (PRR), the flight altitude (AGL, Above Ground Level), the drone's ground speed, and the scanner's Field of View (FOV).
The principle is geometric. At every instant, the LiDAR scanner fires a pulsed laser beam that sweeps perpendicular to the drone's trajectory. The resulting swath width scales with altitude and aperture angle. Lower altitude means a narrower swath, but higher ground density, because the same pulses concentrate on a smaller surface. Climb higher and you cover more ground per flight line, at the cost of density. Ground speed works the same way: slow the drone down and you get more pulses per linear meter on the ground. In practice, the density equation is simple: density = PRR / (speed x swath). With a 240 kHz sensor at 60 m AGL and 4 m/s, you typically land 500 to 800 pts/m2, plenty for most topographic and BIM applications.
Beyond pure physics, operational constraints shape planning heavily. Weather is a deal-breaker: wind above 25 km/h compromises drone stability and therefore INS/GNSS positioning accuracy. Rain is a no-go, since droplets create parasitic returns (noise). Time of day also matters: for vegetation surveys, flying during peak foliage cuts signal penetration to the ground. GNSS constellation visibility (GPS, GLONASS, Galileo, BeiDou) has to be checked ahead of time with PDOP prediction tools (Position Dilution of Precision). A PDOP above 3 degrades post-processed accuracy. Finally, battery life is still the main limiting factor in 2026: count on 20 to 35 minutes of effective flight depending on drone and sensor. A 10-hectare site typically needs 2 to 3 batteries, and you'll need to plan recharge time.
from dataclasses import dataclass
@dataclass
class FlightPlan:
altitude_m: float = 80.0
speed_ms: float = 5.0
overlap_pct: float = 70.0
scan_rate_hz: int = 240000
fov_deg: float = 70.0
@property
def swath_width(self):
import math
return 2 * self.altitude_m * math.tan(math.radians(self.fov_deg / 2))
@property
def point_density(self):
return self.scan_rate_hz / (self.speed_ms * self.swath_width)
plan = FlightPlan(altitude_m=60, speed_ms=4)
print(f"Density: {plan.point_density:.0f} pts/m2")
2. LAZ File Ingestion and Quality Control
A typical drone LiDAR flight produces between 10 and 50 GB of raw data, depending on duration, point density and number of returns. The files ship in LAZ format (LASzip), the compressed version of the LAS format standardized by ASPRS (American Society for Photogrammetry and Remote Sensing). Getting the difference between the two formats matters: a LAS file stores coordinates, intensities, classifications and returns as uncompressed binary. The same dataset in LAZ takes 5 to 10 times less disk space thanks to Martin Isenburg's lossless compression algorithm. In production, always work in LAZ for storage and transfer, and only decompress to LAS when a tool demands it.
Before any processing, checking the Coordinate Reference System (CRS) is a critical step. Onboard LiDAR sensors record raw coordinates in the navigation frame (usually geocentric WGS84), and the post-processing software (DJI Terra, POSPac, TerraSolid) projects them into the target system. In France, the standard system is RGF93 / Lambert-93 (EPSG:2154) with elevations in IGN69 (NGF). Always check the LAZ file header to confirm the CRS before merging flight strips.
RAF20 conversion grid to switch between them. In Python, the pyproj library handles these transformations via Transformer.from_crs() with the right geoid grid.
def ingest_drone_laz(laz_dir, output_path):
laz_files = sorted(Path(laz_dir).glob("*.laz"))
all_points = []
for laz_file in laz_files:
las = laspy.read(str(laz_file))
if las.header.point_count == 0:
continue
coords = np.vstack((las.x, las.y, las.z)).T
valid = (coords[:, 2] > 0) & (coords[:, 2] < 500)
all_points.append(coords[valid])
merged = np.vstack(all_points)
return merged
3. Automatic Classification with PDAL
Point classification is where the raw cloud turns into semantically rich data. The main goal is to separate ground (ASPRS class 2) from everything else: vegetation, buildings, power lines. That split is the prerequisite for any topographic analysis. No correctly classified ground means no reliable DTM.
The SMRF algorithm (Simple Morphological Filter), implemented in PDAL via filters.smrf, is one of the strongest performers for ground classification on drone data. It works on mathematical morphology: it applies a series of progressive morphological openings on a rasterized grid of the point cloud. With each iteration the window grows, which gradually erases objects above the ground. The three key parameters are: slope (max tolerated slope between two neighbor cells, typically 0.15 to 0.30 for rolling terrain), window (max window size in meters, set based on the largest structure to remove, like a 18 m wide building), and threshold (vertical tolerance in meters, 0.5 m is a good starting point for drone data). Slope too low and you'll classify embankments as non-ground; window too small and you'll leave building roofs in the ground class.
A popular alternative is CSF (Cloth Simulation Filter), which simulates a virtual cloth falling onto the inverted point cloud. The cloth "drapes" over the terrain and points close to the cloth get classified as ground. CSF is often more intuitive to tune and handles very rugged terrain better. PDAL ships it via filters.csf. In practice, the best approach is to run both algorithms on a sample and visually compare the results. For flat urban terrain, SMRF shines; for natural terrain with steep slopes, CSF often wins.
After ground classification, the filters.hag_nn filter computes HAG (Height Above Ground) for every point in the cloud. HAG is the elevation difference between a point and the ground interpolated beneath it (by nearest neighbor). That extra dimension is key for vegetation analysis: it lets you tell apart low vegetation (HAG < 2 m), medium vegetation (2-5 m) and high vegetation (> 5 m) without knowing the absolute terrain elevation. HAG is also the basis for computing the CHM (Canopy Height Model), a foundational derived product in forestry.
import pdal, json
def classify_ground(input_laz, output_laz):
pipeline_json = {
"pipeline": [
input_laz,
{"type": "filters.smrf", "slope": 0.15, "window": 18, "threshold": 0.5},
{"type": "filters.hag_nn"},
{"type": "writers.las", "filename": output_laz, "compression": "laszip"}
]
}
pipeline = pdal.Pipeline(json.dumps(pipeline_json))
count = pipeline.execute()
return pipeline.arrays[0]
| ASPRS Code | Class | Description |
|---|---|---|
| 2 | Ground | Natural terrain |
| 3 | Low vegetation | 0.3m - 2.0m |
| 5 | High vegetation | 2.0m - 30.0m |
| 6 | Building | Structures |
filters.elm (Extended Local Minimum) to refine classification on steep slopes. Flat roofs are the classic trap: SMRF often labels them as ground if window is too small. Push window to at least 25 m in urban contexts.
4. DTM/DSM Generation with Rasterio
The DTM (Digital Terrain Model) represents the bare ground surface, stripped of all vegetation and construction. The DSM (Digital Surface Model) represents the highest visible surface, including buildings, vegetation and any object above the ground. The difference between DSM and DTM gives the nDSM (normalized DSM), the raster equivalent of HAG, which reveals the height of objects above the terrain.
Going from classified point cloud to raster requires interpolation. You've got three main methods. Linear interpolation (based on Delaunay triangulation) is fast and respects the exact values at known points. It's the recommended default. IDW (Inverse Distance Weighting) is smoother but can produce "bull's-eye" artifacts around isolated points. The TIN (Triangulated Irregular Network) approach preserves the exact cloud topology but shows visible triangle facets at low resolution. Picking the raster resolution is a trade-off: at 0.1 m you catch micro-relief but the GeoTIFF file is huge; at 1.0 m the file is compact but fine details disappear. For typical drone surveys (densities above 100 pts/m2), resolutions of 0.25 to 0.50 m are usually the sweet spot.
gdal_grid and gdal_translate give you solid rasterization pipelines. PDAL can also write rasters directly via writers.gdal, which skips the intermediate Python export step. That said, Rasterio is still the best pick for fine-grained control in Python, especially for CRS handling, NoData values and compression profiles.
import rasterio
from scipy.interpolate import griddata
def generate_raster(points, values, resolution, output_path, crs="EPSG:2154"):
x_min, y_min = points[:, :2].min(axis=0)
x_max, y_max = points[:, :2].max(axis=0)
cols = int((x_max - x_min) / resolution)
rows = int((y_max - y_min) / resolution)
xi = np.linspace(x_min, x_max, cols)
yi = np.linspace(y_min, y_max, rows)
grid_x, grid_y = np.meshgrid(xi, yi)
grid_z = griddata(points[:, :2], values, (grid_x, grid_y), method="linear")
transform = from_bounds(x_min, y_min, x_max, y_max, cols, rows)
with rasterio.open(output_path, "w", driver="GTiff",
height=rows, width=cols, count=1, dtype="float32",
crs=crs, transform=transform, compress="lzw") as dst:
dst.write(grid_z.astype(np.float32), 1)
5. Computing Derived Metrics
A DTM on its own is raw data. Its real value comes out when you derive morphological metrics that feed thematic analyses. The two core derivatives are slope and aspect, computed from the spatial gradient of the elevation raster.
Slope measures terrain steepness in degrees (or percent). It matters for hydrology (runoff and erosion modeling), geotechnics (embankment stability, landslide risk), and urban planning (buildability of parcels). Aspect (or orientation) tells you the direction a slope faces (0-360 degrees, 0 = North). It directly shapes the solar irradiation of a parcel, which is vital information for photovoltaic siting studies, precision agriculture and building energy modeling. Beyond these two classic derivatives, the CHM (Canopy Height Model) is a simple subtraction: CHM = DSM - DTM. The CHM is the reference product in forestry for estimating tree heights, biomass volume and forest cover density.
def compute_slope_aspect(dtm_path, output_slope, output_aspect):
with rasterio.open(dtm_path) as src:
dem = src.read(1)
res = src.res[0]
dy, dx = np.gradient(dem, res)
slope = np.degrees(np.arctan(np.sqrt(dx**2 + dy**2)))
aspect = np.degrees(np.arctan2(-dy, dx))
aspect = (aspect + 360) % 360
return slope, aspect
6. BIM Integration: From Point Cloud to IFC Model
Bringing a drone LiDAR survey into a BIM environment is the step that gives field data its operational value for architecture, engineering and construction teams. The IFC format (Industry Foundation Classes) is the open standard for exchanging digital models. But representing irregular terrain in a format designed mainly for straight-edged building elements brings its own challenges.
In IFC4, terrain is typically represented by an IfcSite containing an IfcGeographicElement with a geometry of type IfcTriangulatedFaceSet, essentially a TIN mesh. The IFC4.3 standard (published in 2024) significantly improves support for linear infrastructure and terrain with new types like IfcGeoScienceModel and better handling of geographic coordinate systems. In practice, the Scan-to-BIM terrain workflow follows three steps: (1) generate a DTM from ground-classified points, (2) create a TIN (Delaunay) mesh at a resolution suited to the BIM context (typically 0.5-2.0 m), and (3) export that mesh to IFC through IfcOpenShell. Subsampling is often necessary because BIM models get heavy above 50,000 terrain triangles. To dig deeper into this methodology, Chapter 4 of this book walks through the full Scan-to-BIM pipeline.
def create_terrain_ifc(dtm_points, output_path):
ifc = ifcopenshell.file(schema="IFC4")
project = api.run("root.create_entity", ifc, ifc_class="IfcProject", name="Drone Survey")
site = api.run("root.create_entity", ifc, ifc_class="IfcSite", name="Site")
step = max(1, len(dtm_points) // 50000)
sampled = dtm_points[::step]
from scipy.spatial import Delaunay
tri = Delaunay(sampled[:, :2])
ifc.write(output_path)
7. Pipeline Validation and Quality Control
A LiDAR processing pipeline only has value if its results are quantitatively valid. Validation comes from comparing the produced DTM against GCPs (Ground Control Points), points surveyed on the ground with higher accuracy than the drone data (typically RTK GNSS, around 1-2 cm).
GCP placement strategy decides whether validation is reliable. The core principle is representative coverage of the whole study area: GCPs at the edges, in the middle, on flat ground and on slopes. ASPRS recommends a minimum of 20 independent control points (not used for georeferencing) for statistically meaningful validation. You want to separate two kinds of accuracy. Absolute accuracy measures the gap between model coordinates and ground truth (through the GCPs). That's the RMSE reported to the client. Relative accuracy measures internal consistency (for instance, the elevation difference between two adjacent flight strips in the overlap zone). A model can have good relative accuracy (the strips line up) but poor absolute accuracy (the whole model is shifted by 20 cm).
For the client report, the ASPRS Positional Accuracy Standards (2023 edition) defines clear accuracy classes. Always present vertical RMSE, horizontal RMSE (if measured), bias (mean error), standard deviation, and the number of GCPs used. An error distribution plot (histogram) and a spatial residuals map are welcome deliverables that help catch directional biases. In 2026, a vertical RMSE under 5 cm is reachable with an RTK drone and careful PPK post-processing.
def validate_dtm(dtm_path, gcp_file):
gcps = np.loadtxt(gcp_file, delimiter=",", skiprows=1)
with rasterio.open(dtm_path) as src:
errors = []
for x, y, z_gnss in gcps:
row, col = src.index(x, y)
z_dtm = src.read(1)[row, col]
if not np.isnan(z_dtm):
errors.append(z_dtm - z_gnss)
errors = np.array(errors)
rmse = np.sqrt(np.mean(errors**2))
print(f"RMSE: {rmse:.3f} m")
return {"rmse": rmse}
| Metric | Acceptable Threshold | Excellent Threshold | Unit |
|---|---|---|---|
| Vertical RMSE | < 0.10 | < 0.05 | meters |
| MAE | < 0.08 | < 0.03 | meters |
| Bias | < 0.05 | < 0.02 | meters |
Chapter 22. AI in the Service of 3D Code: Coding Faster and Smarter
1. The AI Pair-Programming Workflow
The rise of generative language models, Claude (Anthropic), GPT-4 (OpenAI), Copilot (GitHub), has reshaped the way developers tackle 3D code. The usual approach is no longer the lone developer wading through dense documentation. It's AI pair-programming: an iterative dialogue between human intent and the machine's generative capacity.
The typical workflow breaks down into four phases. Phase 1, Instruction: the developer describes in natural language what they want (for instance, "load a LAZ file, filter ground points, generate a DTM raster at 50 cm resolution"). Phase 2, Generation: the AI produces a first draft of code, often 80-90% functional. Phase 3, Review: the developer examines the code, checks the consistency of spatial parameters (CRS, units, axes), tests on a sample. Phase 4, Refinement: errors or adjustments get fed back to the AI, which fixes things iteratively.
The areas where generative AI shines in 3D are well known: writing boilerplate code (format reading/writing, pipeline configuration), exploring sparsely documented APIs (IfcOpenShell, pyvista, laspy), converting between formats (LAZ to PLY, OBJ to GLTF), and debugging spatial algorithms (projection errors, rotation issues). On the flip side, AI hits its ceiling on novel geometric algorithms (new mesh simplification methods), domain-specific optimizations (memory handling for 10-billion-point clouds) and hardware tuning (custom CUDA kernels for voxel hashing).
2. Effective Prompting for 3D
The quality of AI-generated code scales directly with prompt precision. In 3D development, a generic prompt like "process my point cloud" produces generic, unusable code. An effective prompt has to specify five key elements: the input format, the coordinate system, the expected operation, the performance constraints, and the target library.
Here's the gap between a bad prompt and a good one. Bad prompt: "Downsample my point cloud with Open3D." Good prompt: "I have a 50-million-point LAZ file in Lambert-93 (EPSG:2154). I want to decimate this cloud to 1 point per 10 cm voxel with Open3D 0.18. The file is 2 GB, so memory handling matters. Generate Python 3.11 with type hints." The second prompt gives enough context for the AI to produce fit-for-purpose code on the first try.
Another key point: specify the library version. 3D APIs move fast. Open3D 0.17 and 0.18 have different tensor interfaces, laspy 2.x completely changed the API compared to 1.x. Stating the version in the prompt saves hours of debugging over deprecated or renamed methods.
# Prompt sent to the AI:
# "Generate a Python function (Open3D 0.18) that loads a PLY file,
# applies voxel downsampling at 5cm, and saves the result.
# Log the point count before/after."
# Code generated by the AI:
import open3d as o3d
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def decimate_point_cloud(input_path: str, output_path: str, voxel_size: float = 0.05) -> None:
"""Voxel downsampling with Open3D 0.18."""
pcd = o3d.io.read_point_cloud(input_path)
n_before = len(pcd.points)
logger.info(f"Points before downsampling: {n_before}")
pcd_down = pcd.voxel_down_sample(voxel_size=voxel_size)
n_after = len(pcd_down.points)
logger.info(f"Points after downsampling: {n_after} (ratio: {n_after/n_before:.2%})")
o3d.io.write_point_cloud(output_path, pcd_down)
logger.info(f"Saved: {output_path}")
3. Spatial Debugging with AI
Debugging spatial code is especially tricky because errors show up visually rather than as Python exceptions. A point cloud that looks "mirrored", an unexpected 90-degree rotation, a BIM model offset by 500 km. These symptoms are hard to diagnose without experience with 3D conventions. That's exactly where generative AI shines.
The spatial debugging pattern with AI is this: describe the visual symptom, not the code. For example: "My LiDAR point cloud loaded from a LAZ file appears upside-down in Open3D (buildings point down)." The AI will immediately spot an axis convention issue: the file uses Y-up (common in CG/gaming) while Open3D expects Z-up. Another classic: "After reprojecting EPSG:4326 to EPSG:2154, all my points clump in the same spot". The AI will diagnose a latitude/longitude swap or a unit issue (degrees vs meters).
Coordinate transformation (CRS) errors account for most spatial bugs in production. The AI knows the classic traps: axis order in EPSG:4326 (lat, lon vs lon, lat depending on the library), datum differences (WGS84 vs ETRS89, negligible in France but not worldwide), and UTM zone mistakes.
# AI-assisted debugging session
# Symptom: "My points are shifted ~200m east after reprojection"
# AI diagnosis: datum shift not applied (WGS84 to RGF93 via NTv2 grid)
from pyproj import Transformer
import numpy as np
# INCORRECT: direct transformation without correction grid
# transformer = Transformer.from_crs("EPSG:4326", "EPSG:2154")
# CORRECT: force the pipeline with the NTv2 grid
transformer = Transformer.from_crs(
"EPSG:4326",
"EPSG:2154",
always_xy=True # Force (longitude, latitude) order
)
# Check against a known point (Eiffel Tower)
lon, lat = 2.2945, 48.8584
x, y = transformer.transform(lon, lat)
print(f"Lambert-93: X={x:.2f}, Y={y:.2f}")
# Expected: X~648240, Y~6862260
# If offset > 1m: check the proj version and grids
4. Generating 3D Visualizations
Visualization is one of the areas where generative AI brings the fastest productivity gain. Describe the target rendering in natural language ("show my point cloud colored by classification with a color bar and black background") and get functional pyvista code in 30 seconds. The friction of wading through complex APIs just disappears.
Python 3D visualization libraries have large and sometimes unintuitive APIs. pyvista wraps VTK with a Pythonic interface, plotly delivers interactive WebGL rendering, matplotlib gives you basic 3D plots, and three.js (via Python-to-JS) enables advanced web visualizations. The AI knows the idioms of each and generates code suited to the presentation context: scientific publication (matplotlib), interactive exploration (plotly), web application (three.js) or technical analysis (pyvista).
The big win is fast customization. Instead of digging through documentation to find how to change the camera angle, color palette, or lighting, you just ask in natural language. The AI iterates on parameters until the rendering hits what you want, a process that would typically eat 10 to 15 minutes of documentation reading per adjustment.
# Code generated by AI: classified point cloud visualization with pyvista
import pyvista as pv
import numpy as np
import laspy
def visualize_classified(laz_path: str) -> None:
"""Display a LiDAR cloud colored by ASPRS class."""
las = laspy.read(laz_path)
points = np.vstack((las.x, las.y, las.z)).T
classes = las.classification
# Standard ASPRS palette
cmap = {2: "#8B4513", 3: "#228B22", 4: "#32CD32",
5: "#006400", 6: "#DC143C", 9: "#1E90FF"}
cloud = pv.PolyData(points)
cloud["classification"] = classes
plotter = pv.Plotter(window_size=[1920, 1080])
plotter.set_background("black")
plotter.add_mesh(cloud, scalars="classification",
cmap="tab10", point_size=2,
render_points_as_spheres=True)
plotter.add_scalar_bar(title="ASPRS Class", n_labels=6)
plotter.camera_position = "iso"
plotter.show()
5. Automating Pipelines with AI
One of the most powerful use cases of generative AI is the rapid creation of complete CLI pipelines. From a natural-language description ("build a command-line tool that takes a folder of LAZ files, applies SOR filtering, classifies the ground with CSF, generates a DTM raster and exports to GeoTIFF"), the AI can produce a functional script including argparse, logging, error handling and parallel processing.
The productivity gain is huge for scaffolding: setting up program structure (CLI arguments, configuration, logging, path handling). This "infrastructure" code often makes up 40-50% of the total code in a 3D processing pipeline but delivers zero business value. The AI spits it out in seconds, freeing the developer to focus on the real spatial logic.
Integration with CI/CD systems (GitHub Actions, GitLab CI) is another area where AI speeds up development. Generating a YAML workflow that automatically runs the 3D processing pipeline on every push, with automatic validation against reference GCPs, takes one well-structured prompt. Field benchmarks show a 3 to 5x acceleration factor on standard pipelines when you use AI as a development assistant.
#!/usr/bin/env python3
# CLI pipeline generated with AI assistance
import argparse
import logging
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor, as_completed
import open3d as o3d
import numpy as np
import laspy
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("pipeline3d")
def process_tile(laz_path: Path, voxel_size: float, output_dir: Path) -> dict:
"""Process a LAZ tile: load, filter, downsample, export."""
logger.info(f"Processing: {laz_path.name}")
try:
las = laspy.read(str(laz_path))
pts = np.vstack((las.x, las.y, las.z)).T
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(pts)
# SOR filtering (Statistical Outlier Removal)
pcd_clean, mask = pcd.remove_statistical_outlier(
nb_neighbors=20, std_ratio=2.0
)
# Voxel downsampling
pcd_down = pcd_clean.voxel_down_sample(voxel_size=voxel_size)
out_path = output_dir / laz_path.with_suffix(".ply").name
o3d.io.write_point_cloud(str(out_path), pcd_down)
return {"file": laz_path.name, "status": "ok",
"n_in": len(pts), "n_out": len(pcd_down.points)}
except Exception as e:
logger.error(f"Failed {laz_path.name}: {e}")
return {"file": laz_path.name, "status": "error", "msg": str(e)}
def main():
parser = argparse.ArgumentParser(description="Pipeline 3D batch processing")
parser.add_argument("input_dir", type=Path, help="Source LAZ folder")
parser.add_argument("output_dir", type=Path, help="Output folder")
parser.add_argument("--voxel", type=float, default=0.05, help="Voxel size (m)")
parser.add_argument("--workers", type=int, default=4, help="Number of workers")
args = parser.parse_args()
args.output_dir.mkdir(parents=True, exist_ok=True)
tiles = list(args.input_dir.glob("*.laz"))
logger.info(f"{len(tiles)} tiles found")
results = []
with ProcessPoolExecutor(max_workers=args.workers) as pool:
futures = {pool.submit(process_tile, t, args.voxel, args.output_dir): t for t in tiles}
for fut in as_completed(futures):
results.append(fut.result())
ok = sum(1 for r in results if r["status"] == "ok")
logger.info(f"Done: {ok}/{len(tiles)} tiles processed")
if __name__ == "__main__":
main()
x5 acceleration factor. The AI-generated code needed about 20% of manual edits, mostly for sensor-specific parameters and business validation thresholds.
6. Toward the 3D Developer Agent
The next frontier is the jump from assistant (which answers one-off requests) to autonomous agent (which plans, runs and validates a full end-to-end pipeline). In 2026, the first prototypes of 3D developer agents are emerging: they receive a spec ("process this LiDAR dataset, classify the ground, generate a DTM and a quality report") and orchestrate the steps on their own, handle errors and produce the deliverables.
Integration with CI/CD pipelines (Continuous Integration / Continuous Deployment) matters a lot here. An AI agent can automatically generate unit tests for each 3D function (output dimension checks, CRS consistency, NaN absence), run those tests on reference datasets, and deploy the pipeline to production if everything passes. Reference datasets like S3DIS (Stanford 3D Indoor Scenes), ScanNet and DALES (Dayton Annotated LiDAR Earth Scan) serve as benchmarks to automatically validate that code changes don't degrade performance.
Ethical considerations and quality control sit at the heart of this shift. An autonomous agent that generates and deploys code without human supervision can spread systematic errors, a ground classification bias, for example, that degrades every DTM it produces. The 3D architect's role shifts toward governance: defining the guardrails, quality thresholds, and mandatory human validation points. Module 5 of the training (Spatial AI Agents) goes deeper into this supervised agent architecture and secure production deployment patterns.
7. Best Practices for AI-Assisted 3D Code
The experience built up since AI code assistants arrived lets us pull out clear rules for the 3D domain:
Always validate geometrically, not just syntactically. 3D code can run without errors and still produce a geometrically absurd result. After every AI generation, visualize the output, check the bounding boxes, confirm the CRS is right, and compare against a known ground truth. Syntactic validation (the code runs) is just the first step. Geometric validation (the result is spatially correct) is the one that matters.
Use unit tests with reference datasets. Build a library of test datasets with known expected results. For instance: a synthetic point cloud of a perfect cube (for testing plane detection), an S3DIS dataset excerpt (for testing semantic segmentation), a reference DTM with GCPs (for testing interpolation). Every AI-generated function has to pass these tests before going to production.
Keep expertise at the center. AI speeds things up, but the expert decides. Knowing why you pick a voxel size of 0.05 m over 0.10 m, why SMRF beats CSF on urban terrain, why linear interpolation suits a dense DTM better than IDW, these decisions demand deep domain understanding the AI doesn't have. The 3D engineer of 2026 is a decision architect assisted by an executing AI, not the other way around.
Conclusion
We've come a long way, from reading a plain .LAZ file to deploying AI agents that can reason in natural language about complex 3D scenes. This progression mirrors the day-to-day reality of the 3D AI Architect role in 2026: a professional who has to command geometric fundamentals, deep learning architectures, neural rendering pipelines, and intelligent-agent frameworks all at once.
The skills covered in this book form a coherent toolkit you can put to work right away:
- Fundamentals: Open3D, Laspy, and NumPy for geometric processing of point clouds.
- Segmentation: RANSAC, DBSCAN, and automated cleaning pipelines.
- Big Data: Chunking, tiling, and multiprocessing for datasets spanning hundreds of gigabytes.
- BIM: Automated Scan-to-BIM, IfcOpenShell, and urban digital twins with CityGML.
- 3D Deep Learning: PointNet and PointNet++ for classification and segmentation.
- Neural Rendering: 3D Gaussian Splatting and NeRF with head-to-head benchmarks.
- Foundation Models: SAM 3D, CLIP, OpenScene, and LERF for zero-shot segmentation.
- AI Agents: LangChain, LangGraph, and FastAPI for production-grade systems.
- Terrain: End-to-end drone LiDAR pipeline with DTM/DSM and BIM integration.
- Generative AI: Pair-programming with Claude, GPT-4, and Copilot to speed up 3D development by 3 to 5x.
To take things further, the 3D Geodata Academy offers personalized coaching with hands-on projects built on your own data, one-on-one mentoring, and access to a community of professionals in the field. Every chapter of this book maps to a training module, with practical exercises, real datasets, and progressive assessments.
The future of spatial intelligence is already underway. You now hold the keys to help shape it.
About the Author
Dr. Florent Poux holds a PhD in geomatic sciences (University of Liège, EuroSDR 2019 Award) and served as a postdoctoral researcher at RWTH Aachen University. He specializes in spatial artificial intelligence, 3D Data Science, and point cloud processing, with more than 10 years of experience across research and industry. He's published 15 articles in international peer-reviewed journals (h-index 21, 1,563+ citations), 19 conference papers, and trained more than 12,000 professionals across 50+ countries.
Founder of the 3D Geodata Academy, he's devoted to passing on cutting-edge skills at the intersection of 3D geodata, deep learning, and generative artificial intelligence. His "Theory-to-Code" teaching approach pairs scientific rigor with industrial pragmatism: every theoretical concept ships alongside an executable Python implementation.
His areas of expertise: massive point cloud processing, 3D semantic segmentation, neural rendering (NeRF, 3D Gaussian Splatting), Foundation Models for 3D, spatial AI agents, Scan-to-BIM automation, and digital twins.
Ready to go deeper
Check out 3D Data Science with Python (O'Reilly Media, 2025), 687 pages, 18 chapters, and 200+ code examples. An O'Reilly Python Best Seller, adopted by 100+ universities.
Get the book on O'Reilly →