Ref: 3DGA-EBOOK-ARCH3D | Ver: 1.0
Date: 03/31/2026
Status: Approved Document

AI Architect of 3D Systems in Production


From point cloud processing to the deployment of spatial AI agents: the complete guide
Dr. Florent Poux
3D Geodata Academy
March 2026

Preface

This book brings together in a single volume all the technical guides, tutorials and articles produced as part of the AI Architect of 3D Systems in Production training delivered by the 3D Geodata Academy. I designed it to work two ways at once: as a daily reference for the practitioner, and as a structured learning path for the ambitious beginner.

The field of 3D Data Science and Spatial Artificial Intelligence is accelerating faster than anything I've seen before. In 2026, LiDAR point clouds, BIM models, neural rendering, and AI agents are all converging on the same goal: building digital representations of the physical world that are not just accurate, but intelligent and queryable. This book hands you the keys to every link in that chain.

The audience is wide on purpose: geomatics engineers who want to fold AI into their workflows, data scientists curious about the spatial side, architects and BIM managers looking to automate their processes, and 3D computer vision researchers. Each chapter stands on its own. Read them in order, though, and you'll see the full arc: we start with Python 3D fundamentals and end with spatial AI agents deployed in real companies.

Introduction

Spatial AI sits on one of the most active frontiers of modern computing. It lives at the intersection of computer vision, computational geometry, deep learning, and natural language processing, and it aims at one thing: giving machines a deep understanding of the physical world in three dimensions.

The field of 3D Data Science has organized itself around a handful of technical pillars. Processing LiDAR and photogrammetric point clouds. Building Information Modeling (BIM). Neural rendering with NeRF and 3D Gaussian Splatting. And more recently, AI agents that can reason about 3D scenes in plain language. Every one of these pillars gets its own chapter in this book.

This compilation isn't a random pile of articles. It follows the order of the training program, from the basics of geometric processing all the way up to AI agent architectures running in production. Every chapter has runnable code, in-depth theory, and practical advice taken from real projects. The goal is simple: a complete, coherent path you can apply directly to your own work.

X Z Y 3D Point Cloud

Part I. 3D Python Fundamentals

Modules 1-2: environment, geometric processing, formats, segmentation, and LiDAR big data

Before getting into 3D data processing, you need solid foundations. This first part walks you from installing Anaconda to handling massive point clouds, with stops along the way for file formats and the core segmentation algorithms.


Chapter 1. Setting Up a Python Environment for 3D

Before you write a single line of code, you have to build the foundations: a Python environment that's isolated, reproducible, and tuned for 3D Data Science.
"A well-configured environment is an invisible investment that saves you hundreds of hours. A badly configured one is technical debt that will cost you thousands."
System Python (conflict) numpy 1.24 numpy 1.26 (conflict!) open3d 0.17 torch 2.0 (CUDA 11) torch 2.3 (CUDA 12) Incompatible versions spatial-ai Python 3.11 open3d 0.18 numpy 1.26 laspy 2.5 torch 2.3 web-app Python 3.12 flask 3.0 numpy 1.24 sqlalchemy deep-learn Python 3.10 torch 2.0 numpy 1.24 CUDA 11.8 Isolated environments = zero conflicts Each project has its own "toolbox"

1. What Is a Python Environment?

A Python environment is an isolated workspace built from three core pieces: a Python interpreter (the engine that runs your code), a set of packages (the libraries installed), and a dependency tree (the version relationships between those libraries). Getting this concept right is the first step toward working calmly in 3D Data Science.

Picture yourself working on two projects at the same time. The first one uses open3d 0.17 with numpy 1.24. The second needs open3d 0.18 with numpy 1.26. If both share the same system Python, installing what one needs will break the other. That's the classic "dependency hell", and it's everywhere in the Python scientific stack.

The clearest analogy is a toolbox. Every project has its own box, with its own tools in the right versions. You don't mix metric and imperial screwdrivers. Same idea here: don't mix the dependencies of two different 3D projects. Isolation is what gets you reproducibility. If your pipeline works today, it will still work in six months, on a different machine.

Python gives you three isolation levels. The system Python is the one your OS shipped with. Never install scientific packages there. Virtual environments (via venv) create a lightweight copy with its own pip, but they share the system interpreter. Then there are Conda environments, which give you the strongest isolation: they bundle their own Python interpreter, their own compiled C/C++ libraries, and they handle binary dependencies (like CUDA or OpenGL) that pip can't manage.

For 3D Data Science, I recommend Conda without a second thought. Libraries like Open3D, PyTorch, and PDAL rely on compiled binaries that need specific versions of CUDA, OpenGL, and libstdc++. Conda handles that mess for you. With venv + pip, you're on your own when the linker errors start.

2. Installing Anaconda and Creating the Environment

Installing Anaconda is the first concrete step on your path. Anaconda is a Python distribution built for scientific computing. It includes conda (the environment manager), a set of pre-installed packages, and the Anaconda Navigator graphical interface.

Head to anaconda.com/download and grab the version that matches your operating system (Windows, macOS, or Linux). Run the installer and accept the defaults. On Windows, only check "Add Anaconda to PATH" if you don't already have another Python distribution. Once it's done, open a terminal (or Anaconda Prompt on Windows) and check with conda --version. You should see something like conda 24.x.x.

Next step: create a dedicated environment. I pick Python 3.11 because in 2026 it offers the best balance between performance and compatibility. Python 3.12 and 3.13 are faster, sure, but some 3D libraries (Open3D's C++ bindings and PyTorch's CUDA extensions in particular) still aren't fully certified on those versions. Python 3.11 is the sweet spot: stable, fast, supported everywhere.

setup_conda.sh
# Step 1: Verify the conda installation
conda --version

# Step 2: Create the environment dedicated to 3D Data Science
conda create -n spatial-ai python=3.11 -y

# Step 3: Activate the environment
conda activate spatial-ai

# Step 4: Verify the right Python is active
python --version   # Should print Python 3.11.x
which python        # Should point to ~/anaconda3/envs/spatial-ai/bin/python

After activation, your prompt shows the (spatial-ai) prefix. That's your visual confirmation that you're working inside the isolated environment. Any pip install or python command you run from now on only touches this environment. Your system Python stays untouched.

Tip: On a server or compute machine, pick Miniconda over Anaconda. Miniconda installs only conda and Python, without Anaconda's 250+ pre-installed packages. You save 3 GB of disk space and install exactly what you need. On a workstation, Anaconda works fine thanks to Anaconda Navigator and the pre-configured scientific packages.
💡 Multi-OS activation
On Windows (Anaconda Prompt), just run conda activate spatial-ai. On macOS/Linux with recent shells (bash 4+, zsh), the same command works once you've run conda init once. On older systems or in some CI scripts, you'll need source activate spatial-ai instead. If conda activate fails with "CommandNotFoundError", run conda init bash (or conda init zsh), then restart your terminal.
Step 1 conda create -n spatial-ai python=3.11 Step 2 conda activate spatial-ai (spatial-ai) $ Step 3 pip install open3d laspy numpy torch ... Conda Workflow: Create → Activate → Install Each step isolates your working environment further

3. Installing 3D Packages with pip

pip is Python's standard package manager. It downloads and installs libraries from PyPI (the Python Package Index), the largest Python package repository in the world. A package is a reusable code library. For example, open3d gives you 3D processing tools, laspy lets you read LiDAR files, and numpy handles high-performance numerical computing.

We're going to install two layers of packages. The core stack covers the basics: reading data, geometric processing, and visualization. The extended stack adds Deep Learning, BIM/IFC formats, and raster processing. Always start with the core stack, then add the extensions when you need them.

One key detail: we install laspy[lazrs], not just laspy. The bracket syntax enables an "extra". Here, the lazrs backend, a LAZ decompressor written in Rust. Without it, reading .LAZ files (the standard compressed LiDAR format) is either impossible, or 10x slower through a pure Python fallback. That's the kind of detail that turns a 30-second pipeline into a 5-minute one.

install_packages.sh
# Core stack: 3D processing and visualization
pip install open3d laspy[lazrs] numpy scipy matplotlib

# Extended stack: Deep Learning, BIM, raster
pip install torch torchvision pyvista trimesh ifcopenshell pdal rasterio

# Quick check of installed versions
python -c "
import open3d; print(f'Open3D: {open3d.__version__}')
import laspy;  print(f'Laspy:  {laspy.__version__}')
import numpy;  print(f'NumPy:  {numpy.__version__}')
import torch;  print(f'PyTorch:{torch.__version__}')
print(f'CUDA:   {torch.cuda.is_available()}')
"
PackageRoleCategory
open3dProcessing and visualization of point clouds, meshes, 3D reconstruction3D core
laspy[lazrs]Reading/writing LAS/LAZ (LiDAR) files with fast Rust decompressionLiDAR I/O
numpyNumerical computing on N-dimensional arrays, the base of the whole stackCompute
scipyScientific algorithms: interpolation, optimization, KDTree, linear algebraCompute
matplotlib2D plots: histograms, scatter plots, learning curvesVisualization
torchDeep Learning framework with GPU/CUDA support for PointNet++, Point TransformersAI
pyvistaInteractive 3D visualization built on VTK, great for meshes and volumesVisualization
trimeshTriangle mesh manipulation, volume computation, STL/OBJ exportMeshes
ifcopenshellReading/writing IFC files (Building Information Modeling)BIM
pdalPoint data processing pipelines (filtering, reprojection, tiling)Pipeline
rasterioReading/writing geospatial rasters (GeoTIFF, DEM, orthoimages)Raster
Warning: Never mix conda install and pip install for the same package. If you install numpy via conda and then another version via pip, you'll get silent conflicts. The rule: use conda for the environment and Python, then pip for every package. If a package is only available through conda-forge (like pdal), install it first, before any pip install.
3D Python Stack: Layered Architecture Deep Learning / AI torch torchvision PointNet ▼ depends on ▼ 3D Processing / I-O open3d laspy trimesh pyvista ▼ depends on ▼ Numerical Foundations numpy scipy matplotlib

Now that the packages are installed, how do we make sure this setup is reproducible on another machine, or six months from now when the versions have moved on?

4. Managing Dependencies with requirements.txt

Reproducibility is non-negotiable in Data Science. A pipeline that runs on your machine but fails on a colleague's is a useless pipeline. The requirements.txt file is the Python ecosystem's standard answer: it lists every package and its exact version, so you can rebuild an identical environment with a single command.

The pip freeze command captures the full state of your environment: every installed package with its exact version number. The output is already in requirements.txt format. The catch: pip freeze also pulls in all transitive dependencies (the dependencies of your dependencies), which produces a long file that's painful to maintain.

The good practice is to keep a "human-readable" requirements.txt with only your direct dependencies and loose version constraints. The open3d>=0.18,<0.19 format means: "install Open3D 0.18.x, any patch is fine". You get bug fixes without the risk of a major update breaking your code.

requirements.txt
# ===== Core 3D Data Science Stack =====
open3d>=0.18,<0.19
laspy[lazrs]>=2.5,<3.0
numpy>=1.26,<2.0
scipy>=1.12,<2.0
matplotlib>=3.8,<4.0

# ===== Deep Learning =====
torch>=2.3,<3.0
torchvision>=0.18,<1.0

# ===== Visualization & Meshes =====
pyvista>=0.43,<1.0
trimesh>=4.0,<5.0

# ===== Geospatial Formats =====
rasterio>=1.3,<2.0

# ===== Development Tools =====
jupyterlab>=4.0,<5.0
ipykernel>=6.0,<7.0
requirements_commands.sh
# Capture the exact state of the environment (all dependencies)
pip freeze > requirements-lock.txt

# Reproduce the environment on another machine
conda create -n spatial-ai python=3.11 -y
conda activate spatial-ai
pip install -r requirements.txt

For advanced Conda users, the environment.yml file is an alternative that captures both conda and pip packages. You generate it with conda env export > environment.yml and rebuild it with conda env create -f environment.yml. The upside is that it also includes binary dependencies (CUDA, MKL, etc.) that requirements.txt doesn't cover.

Good practice: Keep two files. A readable requirements.txt with your direct dependencies and loose constraints, and a requirements-lock.txt generated by pip freeze for exact reproducibility. Commit both to Git. The first one is for understanding the project, the second one for reproducing it bit-for-bit.
🚀 Going further
For maximum reproducibility, wrap your 3D environment in a Docker container. Start from a base image like nvidia/cuda:12.4-runtime-ubuntu22.04, add Miniconda, and your requirements.txt. Every team member, every CI server, and every GPU compute machine will run the exact same environment. Same CUDA version, same libstdc++, same Python. Pre-built images like pytorch/pytorch or jupyter/scipy-notebook speed things up even more. Pair Docker with docker-compose to orchestrate a full pipeline: one container for LiDAR processing, one for GPU training, one for the 3D web server.

5. Automation and IDE Integration

A well-installed environment is useless if it isn't wired into your development tools. This section covers integration with VS Code, JupyterLab, and writing automation scripts so anyone on your team can get going with a single command.

In VS Code, open the command palette (Ctrl+Shift+P) and type "Python: Select Interpreter". Pick the interpreter that matches your spatial-ai environment. VS Code will then activate that environment in every integrated terminal automatically. Check the bottom bar shows spatial-ai (Python 3.11.x). This step matters: without it, autocomplete, linting, and debugging will all use the wrong Python.

For JupyterLab, you need to register your environment as an available "kernel". Two commands do it: install ipykernel in the environment, then register it with a clear name. Once that's done, you'll see "spatial-ai" in the kernel list when you create a new notebook.

The final step is full automation. The script below creates the environment, installs the packages, configures Jupyter, and verifies everything. Drop it in your Git repo so any new team member can get started by running one command.

setup_spatial_ai.sh
#!/bin/bash
# ==============================================
# Automatic install script: spatial-ai
# Usage: bash setup_spatial_ai.sh
# ==============================================

ENV_NAME="spatial-ai"
PYTHON_VERSION="3.11"

echo "[1/5] Creating the conda environment..."
conda create -n $ENV_NAME python=$PYTHON_VERSION -y

echo "[2/5] Activating..."
conda activate $ENV_NAME

echo "[3/5] Installing packages..."
pip install open3d laspy[lazrs] numpy scipy matplotlib
pip install torch torchvision pyvista trimesh
pip install jupyterlab ipykernel

echo "[4/5] Registering the Jupyter kernel..."
python -m ipykernel install --user --name $ENV_NAME --display-name "Python 3.11 (spatial-ai)"

echo "[5/5] Verifying..."
python -c "
import open3d; print(f'  Open3D:  {open3d.__version__}')
import laspy;  print(f'  Laspy:   {laspy.__version__}')
import numpy;  print(f'  NumPy:   {numpy.__version__}')
import torch;  print(f'  PyTorch: {torch.__version__}')
print('  Installation completed successfully!')
"

When you work with an AI assistant like Claude or GitHub Copilot, always start by telling it your environment. Say: "I'm working with Python 3.11, Open3D 0.18, PyTorch 2.3 on CUDA 12.4". The assistant will then generate code compatible with your specific versions instead of guessing or suggesting code for outdated ones.

Tip: Always verify your installs with Python one-liners. For example: python -c "import open3d; print(open3d.__version__)". If that command fails, your environment isn't configured properly, and nothing else will work. Bake these checks into your CI/CD scripts.
VS Code (spatial-ai) EXPLORER src/ pipeline.py utils.py data/ requirements.txt File tree pipeline.py import open3d as o3d import numpy as np # Load the cloud pcd = o3d.io.read... (spatial-ai) $ python pipeline.py Open3D 0.18.0 | 1,234,567 points Integrated terminal Python 3.11.9 ('spatial-ai') UTF-8 | LF | Python Python interpreter selector
💡 Jupyter extensions for 3D
Install k3d (pip install k3d) to visualize point clouds and meshes directly inside your Jupyter notebooks, with rotation, zoom, and interactive colors. For more advanced rendering, pyvista ships a Jupyter backend via pyvista.set_jupyter_backend('trame') that gives you full VTK rendering in the browser. These extensions turn JupyterLab into a real 3D visualization studio, so you don't have to keep flipping between the IDE and an external viewer like CloudCompare.

Your environment is ready and automated. But what do you do when things don't work as planned? The next section covers the most common problems and how to fix them.

6. Diagnosing and Fixing Common Issues

Even with a careful install, you'll hit errors. The trick isn't avoiding them (you can't), it's diagnosing them fast. This section is your survival guide: the most frequent errors, what causes them, and the commands that will save you.

The single most common error in scientific Python is ModuleNotFoundError. 90% of the time, it doesn't mean the package isn't installed. It means you haven't activated the right environment. Your terminal is using the system Python instead of spatial-ai. The first thing to check is always which python (or where python on Windows). If the path doesn't contain envs/spatial-ai, you're in the wrong environment.

The second classic trap is PyTorch and CUDA. You install PyTorch, then torch.cuda.is_available() returns False. The cause: you installed the CPU build of PyTorch, or your CUDA version doesn't match. PyTorch needs a specific CUDA toolkit version installed on your system. Check the driver version with nvidia-smi, then install PyTorch from pytorch.org with the matching index.

The third frequent issue is Open3D on a headless server. Open3D visualization requires an OpenGL context. On a server with no screen (SSH, Docker, cloud), any call to o3d.visualization.draw_geometries() will fail. The fix: use offscreen mode with o3d.visualization.rendering.OffscreenRenderer, or export your results to files (.ply, .png) instead of displaying them.

diagnostic.sh
# Full environment diagnostic
conda info --envs           # List all environments
which python                # Which Python is active?
python --version            # Which version?
pip list                    # All installed packages
pip list --outdated         # Packages with updates available

# Verify CUDA for PyTorch
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"

# Nuclear option: drop and rebuild
conda deactivate
conda remove -n spatial-ai --all -y
conda create -n spatial-ai python=3.11 -y
conda activate spatial-ai
pip install -r requirements.txt
Python error? ModuleNotFoundError which python + conda activate pip install <package> CUDA not detected nvidia-smi + torch version Reinstall from pytorch.org OpenGL / display error Headless server? OffscreenRenderer / export Decision tree: diagnosing the most common errors
ErrorLikely CauseFix
ModuleNotFoundError: No module named 'open3d'Wrong environment active, or package not installedconda activate spatial-ai && pip install open3d
torch.cuda.is_available() = FalseCPU build of PyTorch installed, or CUDA version mismatchReinstall from pytorch.org with the right CUDA index
GLFWError: No OpenGL contextHeadless server, no graphics GPUUse OffscreenRenderer or export_image()
ERROR: Could not build wheels for lazrsRust compiler missing (required by lazrs)Install Rust via rustup.rs, or use pip install laspy[laszip]
EnvironmentError: Conda prefix already existsEnvironment partially created from a previous attemptconda remove -n spatial-ai --all -y, then recreate
pip install very slow or timing outSlow network, proxy, or saturated PyPI indexUse pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple/ (mirror)
Hard rule: On Linux, never use sudo pip install. That command installs packages into the system Python with root privileges, which can break critical system tools (like apt on Ubuntu). Always use a Conda environment or venv, which need no admin rights.

If nothing works after several attempts, go nuclear: drop the environment completely with conda remove -n spatial-ai --all -y, then rebuild it from your requirements.txt. That's the fundamental benefit of isolation. Tearing down and rebuilding takes a few minutes, and it never risks damaging your system.

Your Python environment is now installed, configured, and diagnosable. You've got a complete, isolated toolbox for 3D Data Science. In the next chapter, we finally write some code: load a first LiDAR point cloud and explore its structure.

Chapter 2. Master Guide: Point Cloud Processing with Python

Precision engineering for Spatial AI: from raw LiDAR to semantic structure.
"The point cloud is the atom of spatial perception. In 2026, the engineer doesn't just 'see' points anymore. They orchestrate massive data flows where every micron matters."

1. Orchestrating the Production Environment

Deploying a 3D Data Science pipeline starts with strict dependency isolation. We use Anaconda to create a dedicated Python environment, and pip to install the specialized 3D processing packages.

setup_env.sh
# Create an environment tuned for 3D computing
conda create -n spatial-ai python=3.11 -y
conda activate spatial-ai
pip install open3d laspy[lazrs] numpy torture-test-3d

Pay close attention to the lazrs extension. Without this Rust backend, decompressing .LAZ files becomes a major bottleneck and caps your IOPS (Input/Output Operations Per Second).

Florent's Tip: Always use python >= 3.10 to get the type-hinting optimizations and the GIL (Global Interpreter Lock) performance improvements when you call into C++-compiled libraries.

2. I/O Optimization: The Offset Syndrome

Reading massive LiDAR files (several GB) calls for careful memory management. The LAS/LAZ format often stores coordinates with offsets and scales to save bytes by encoding them as int32 instead of float64.

io_manager.py
import laspy
import numpy as np

def stream_safe_read(path):
    las = laspy.read(path)
    # Direct access to scaled coordinates (float64)
    points = np.vstack((las.x, las.y, las.z)).transpose()
    return points, las.header

Reading the header matters a lot. Ignoring the offsets during geometric operations can introduce catastrophic float-precision errors if the project origin sits millions of meters away (the WGS84 system).

Pitfall: Never try to load a .LAS file larger than 10GB into RAM in one shot. Use tiling or a streaming iterator.
Warning: Watch out for memory with LAS files over 2 GB. Laspy loads everything into RAM by default. For massive files, use laspy.read(path, laz_backend=laspy.LazBackend.Laszip) and iterate by chunks with laspy.open() in streaming mode.

3. Voxelization: The Survival Grid

Voxel Downsampling isn't random point removal. It's spatial discretization where each voxel (3D cube) gets replaced by the centroid of the points inside it. That gives you a uniform spatial distribution, which is vital for Deep Learning algorithms like PointNet++.

voxel_filter.py
import open3d as o3d

def optimize_pcd(pcd, voxel_size=0.05):
    # Smart 5cm downsampling
    pcd_down = pcd.voxel_down_sample(voxel_size=voxel_size)
    return pcd_down

For an urban Digital Twin, a voxel_size of 0.10 (10 cm) is often the optimal balance between visual fidelity and compute performance (a 90% reduction in point count).

Florent's Tip: A voxel_size of 0.01 m (1 cm) is ideal for indoor scans, 0.05 m for building facades, and 0.5 m for aerial surveys. Rule of thumb: the voxel should be at least 3x larger than the sensor's measurement noise.
Voxel Size (m)Points (Millions)Memory (MB)Latency (ms)
Raw50.01200N/A
0.0122.55401250
0.054.8115320
0.101.22885

4. Normal Estimation: The Direction of the Surface

Normals are unit vectors perpendicular to the local tangent plane. Without them, no PBR rendering, no curvature-based partitioning. A KDTree is mandatory here to find nearest neighbors in logarithmic time O(log n).

normals_engine.py
# Hybrid search: 10cm radius, max 30 neighbors
search_param = o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30)
pcd.estimate_normals(search_param=search_param)
# Consistent upward orientation
pcd.orient_normals_to_align_with_direction(direction=[0, 0, 1])

Computing the normal relies on Principal Component Analysis (PCA) of the local covariance matrix. The eigenvector tied to the smallest eigenvalue gives the normal direction.

k-neighbors Normals estimated via local PCA (k-NN)
Advanced: For dense scenes, run pcd.estimate_normals on GPU via the o3d.t.geometry module for a 15x speedup on CUDA.

5. RANSAC: The Art of Robust Consensus

The RANSAC algorithm (Random Sample Consensus) is the bulldozer of 3D processing. It iterates to find the mathematical model (here a plane) that maximizes the number of inliers (matching points) despite environmental noise.

plane_segmenter.py
plane_model, inliers = pcd.segment_plane(
    distance_threshold=0.02,
    ransac_n=3,
    num_iterations=1000
)
# Geometric extraction
ground = pcd.select_by_index(inliers)
objects = pcd.select_by_index(inliers, invert=True)

A distance_threshold of 0.02 means any point within 2 cm of the mathematical plane counts as part of the ground.

6. DBSCAN Clustering: Density as Identity

Unlike K-Means, DBSCAN doesn't need to know the number of clusters up front. It groups points by density. It's the perfect tool to isolate cars, pedestrians, or street furniture in a street LiDAR scan.

cluster_engine.py
with o3d.utility.VerbosityContextManager(o3d.utility.VerbosityLevel.Debug) as cm:
    labels = np.array(objects.cluster_dbscan(eps=0.25, min_points=10))

max_label = labels.max()
print(f"Point cloud has {max_label + 1} clusters")
Warning: DBSCAN is CPU-hungry. For clouds > 1M points, always run a voxel_down_sample first.

7. Bounding Boxes: Metric Containment

You extract dimensions with two box types: AABB (Axis-Aligned) and OBB (Oriented). The OBB wins because it aligns with the object's real orientation, giving you exact length and width.

bbox_factory.py
obb = objects.get_oriented_bounding_box()
obb.color = (1, 0, 0) # Red for visualization
print(f"Dimensions: {obb.extent} m")
Box TypeComputationMetric AccuracyUsage
AABBSimple Min-MaxLowRough collision
OBBPCA-basedHighScan-to-BIM

8. From Open3D to PyTorch: The Tensor Bridge

In 2026, inference rarely runs on raw CPU. We have to convert our Open3D objects into Tensors to feed them into a Point Transformers network.

tensor_bridge.py
import torch

def pcd_to_tensor(pcd):
    points = np.asarray(pcd.points)
    tensor = torch.from_numpy(points).float()
    return tensor.cuda() # Direct send to GPU

9. Summary and Compute Architectures

StageComplexityRecommended Hardware
I/O (Lazrs)O(n)NVMe SSD + multicore
Voxel DownO(n)RAM (high frequency)
Normals/KDTreeO(n log n)CPU multithreading
RANSAC/DBSCANO(n^2) brutalGPU (o3d.t)

Master these tools and you hold the keys to tomorrow's cities.

Now that we can read and manipulate point clouds, how do we extract geometric shapes automatically from these millions of points?

Chapter 3. 3D Data Formats: From Point Cloud to Web

Understanding, comparing, and converting the formats that structure the modern 3D ecosystem.
"A data format isn't just a container: it's a contract between the producer and the consumer. Picking the wrong format guarantees hours lost to conversion and information loss."

1. Panorama of 3D Formats

Why are there so many 3D formats? The answer fits in three words: history, domains, trade-offs. Each community (geospatial, film, BIM, video games, web) built its own standards, tuned for its specific constraints around precision, performance, and metadata.

You can sort formats into broad families. Point clouds favour raw geometric fidelity (LAS/LAZ, PLY, E57). Meshes encode triangulated surfaces (OBJ, STL, FBX). BIM formats carry rich semantics (IFC). And web formats optimise transmission and real-time rendering (glTF/GLB, 3D Tiles).

The classic beginner trap is believing a single format can do it all. In reality, a professional pipeline chains several formats depending on the stage: acquisition in LAS, processing in PLY or NumPy, delivery in glTF. Understanding these transitions is essential if you want to avoid silent data loss.

FormatTypeBinary/TextCompressionMetadataTypical use case
LAS 1.4Point cloudBinaryNo (LAZ = yes)CRS, classificationAerial/terrestrial LiDAR surveys
PLYCloud / MeshBothNoCustom propertiesResearch, prototyping
E57Point cloudBinary (XML header)Yes (internal)Images, scanner posesStatic scanning, BIM scan-to-model
OBJMeshTextNoMaterials (.mtl)Simple mesh exchange
STLMeshBothNoNone3D printing
FBXFull sceneBinaryNoAnimations, skeletonsFilm, video games
IFCBIMText (STEP)NoFull semanticsConstruction, architecture
glTF/GLB3D sceneJSON+Bin / BinaryDraco (optional)PBR materials, animationsWeb 3D, visualisation
3D TilesHierarchical tilingBinaryDraco, gzipBounding volumesCesium, urban Digital Twins
Taxonomy of 3D Formats 3D Formats Point Clouds LAS LAZ PLY E57 Meshes OBJ STL FBX BIM / Semantic IFC Web / Real-time glTF GLB 3D Tiles Every domain spawned its own standards · the key is mastering the bridges

2. LAS/LAZ: The LiDAR Standard

The LAS format (LASer) is the de facto standard for LiDAR data. Developed by the ASPRS (American Society for Photogrammetry and Remote Sensing), it encodes each point with its coordinates, intensity, classification, and laser-return metadata.

A LAS file breaks down into three blocks. The header holds the global information: coordinate system (CRS via VLRs), offsets, scales, point count, and bounding box. Variable Length Records (VLR) store the extended metadata. And Point Data Records encode each point according to a numbered format (PDRF 0 to 10).

The LAZ compression, invented by Martin Isenburg (LASzip), applies lossless compression that shrinks size by 5 to 10 times. The principle: encode deltas between consecutive points rather than absolute values. It's become a de facto standard, natively supported by PDAL, CloudCompare, and laspy.

The LAS 1.4 version brings major improvements over 1.2: 64-bit point counters (beyond 4.3 billion points), extended point formats (PDRF 6-10) with 8-bit classification (256 classes instead of 32), and native support for full-waveform data.

inspect_las.py
import laspy
import numpy as np

def inspect_las_header(filepath):
    """Inspect the header and point format of a LAS/LAZ file."""
    las = laspy.read(filepath)
    h = las.header

    print(f"Version      : LAS {h.version}")
    print(f"Point Format : PDRF {las.point_format.id}")
    print(f"Point Count  : {h.point_count:,}")
    print(f"Scales       : {h.scales}")
    print(f"Offsets      : {h.offsets}")
    print(f"Bounding Box : {h.mins} -> {h.maxs}")

    # List the dimensions available in this PDRF
    print(f"\nAvailable dimensions:")
    for dim in las.point_format.dimensions:
        print(f"  - {dim.name} ({dim.dtype})")

    return las
Florent's Tip: The PDRF (Point Data Record Format) decides which fields are available. PDRF 0 = XYZ + intensity. PDRF 2 = adds RGB. PDRF 6 (LAS 1.4) = extended classification + GPS time + scanner channel. Always check the PDRF before you code your pipeline. Accessing a missing field raises a silent exception in some libraries.
⚠️ float32 precision loss
LAS files store coordinates as 32-bit integers with scales and offsets that preserve millimetre precision even for 6-digit geographic coordinates (e.g. X=350000.001 m). When you convert to PLY or NumPy in float32, you lose that precision: a float32 only offers ~7 significant digits, which translates to ~5 cm of error at these magnitudes. Always use float64 (double precision), or subtract the offset before conversion: points = np.vstack((las.x - las.header.offsets[0], ...)). This step is critical for BIM and high-precision topography.
Structure of a LAS File Header (375 bytes) Version, PDRF, scales, offsets, CRS, bbox, point count Variable Length Records (VLR) CRS WKT, classification lookup, custom metadata Point Data Records X, Y, Z, intensity, classification, RGB, GPS time... ~95% of the file size Extended VLRs (LAS 1.4 only) Fixed Variable Data

3. PLY: Universal Simplicity

The PLY format (Polygon File Format, also called Stanford Triangle Format) is one of the oldest and simplest formats in the 3D ecosystem. Its plain-text header describes the data structure, followed by the payload in ASCII or binary.

Why does PLY stay popular in research? Three reasons: its simplicity (a human-readable header), its flexibility (you can add arbitrary properties: curvature, confidence, semantic label), and its speed in binary mode (no text parsing, direct sequential reads). Open3D, trimesh, CloudCompare, and nearly every academic tool support it natively.

The flip side: PLY has no notion of CRS (coordinate system), no standardised compression, and no hierarchical structure. It's perfect for prototyping and research datasets, but unsuitable for large-scale geospatial data delivery.

las_to_ply.py
import laspy
import open3d as o3d
import numpy as np

def las_to_ply_with_colors(las_path, ply_path):
    """Convert a LAS to PLY while preserving RGB colors."""
    las = laspy.read(las_path)

    # Extract coordinates
    points = np.vstack((las.x, las.y, las.z)).T

    # Build the Open3D cloud
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)

    # Add colors (16-bit normalization -> [0,1])
    if hasattr(las, 'red'):
        colors = np.vstack((las.red, las.green, las.blue)).T
        colors = colors / 65535.0  # LAS stores as uint16
        pcd.colors = o3d.utility.Vector3dVector(colors)

    o3d.io.write_point_cloud(ply_path, pcd, write_ascii=False)
    print(f"Exported {len(points):,} points to {ply_path}")
    return pcd
💡 Binary PLY vs ASCII: 10x faster
By default, Open3D exports in binary (write_ascii=False). Never change that default without a reason. A 10-million-point PLY file weighs ~230 MB in ASCII versus ~115 MB in binary, and binary reads are 10 to 15 times faster because they skip line-by-line text parsing. Reserve ASCII mode for debugging (inspecting values by eye) or interop with legacy tools. In production, binary is always the right call.
PLY: ASCII vs Binary ASCII Format ply format ascii 1.0 element vertex 1000000 property float x property float y property float z end_header 1.234 5.678 9.012 1.235 5.679 9.013 ... 230 MB | Read: 12s Binary Format ply format binary_le 1.0 element vertex 1000000 property float x property float y property float z end_header Compact binary data 115 MB | Read: 1.1s vs Same readable header, radically different payload · 10x faster in binary

4. glTF/GLB: The Web 3D Standard

The glTF format (GL Transmission Format), developed by the Khronos Group, is often called the "JPEG of 3D". Its goal: deliver a compact, efficient, universal format for transmitting and rendering 3D scenes on the web and in real-time applications.

A glTF file is made of a JSON describing the scene (nodes, meshes, PBR materials, textures, animations) and binary buffers holding the geometry data. The GLB variant packs everything into a single binary file, eliminating multiple HTTP requests and speeding up loading.

The hierarchical structure of glTF organises into scenes containing nodes, which in turn carry meshes with materials. That architecture maps naturally to the scene graph used by Three.js, Babylon.js, and Cesium, which is why it's been adopted so widely across the web 3D ecosystem.

A notable fact for Gaussian Splatting: .splat and .ply exports from 3DGS reconstructions can be converted to custom glTF extensions, allowing direct in-browser visualisation through specialised viewers. That opens the door to mainstream distribution of photorealistic reconstructions.

export_glb.py
import trimesh

def mesh_to_glb(input_path, output_path):
    """Export a mesh to GLB (binary glTF)."""
    mesh = trimesh.load(input_path)

    # trimesh automatically handles material conversion
    # to the glTF PBR (Physically Based Rendering) model
    mesh.export(output_path, file_type='glb')
    print(f"GLB exported: {output_path}")
    print(f"  Vertices : {len(mesh.vertices):,}")
    print(f"  Faces    : {len(mesh.faces):,}")

# Example: OBJ to GLB
mesh_to_glb("model.obj", "model.glb")
Going further: Draco compression (Google) can cut a glTF's size by 80 to 95% by compressing the geometry buffers. Enable it via trimesh.exchange.gltf.export_glb(mesh, include_normals=True) or use the CLI tool gltf-pipeline -i input.glb -o output.glb -d (the -d flag turns on Draco).
🚀 Going further
Draco compression has become a must for web deployment of 3D models. On a typical urban mesh (500k triangles, normals + UVs + colors), Draco cuts size by 90%+: a 45 MB GLB drops to 3-4 MB. Client-side decoding is near-instant thanks to the WebAssembly decoders shipped by Google. For massive point clouds, pair Draco with 3D Tiles and Cesium's hierarchical tiling: the viewer only loads the tiles visible at screen resolution, letting you navigate billions of points with sub-second latency. Also look at meshoptimizer (the EXT_meshopt_compression extension) as an alternative to Draco, sometimes faster at decoding.
glTF Ecosystem: Components → GLB .gltf JSON scenes, nodes PBR materials animations .bin Binary vertices, indices normals, UVs keyframes .png/.jpg Textures baseColor normalMap metalRough .GLB Single file Everything packed into one binary · ideal for web and streaming

5. Conversions and Interoperability

In a production pipeline, data inevitably moves through several formats. Conversion is never innocuous: each step can cause information loss (CRS, attributes, precision). Mastering the tools and anticipating the traps is a key skill for any 3D practitioner.

The most common conversion pipelines follow a logic of progressive specialisation. For web delivery: LAS → PLY → glTF. For BIM: IFC → OBJ → glTF. For deep learning: LAS → PLY or NumPy. Each transition has its rules and its preferred tools.

Python offers a rich ecosystem for automating these conversions. open3d excels for point clouds and simple meshes, trimesh for textured meshes and glTF exports, and pyvista for volumetric meshes and scientific visualisation. For heavy geospatial pipelines, PDAL remains the reference.

universal_converter.py
import open3d as o3d
import trimesh
from pathlib import Path

def convert_3d(input_path, output_path):
    """Universal converter: detect the type and route to the right engine."""
    src = Path(input_path).suffix.lower()
    dst = Path(output_path).suffix.lower()

    # Point cloud: Open3D
    if src in ('.ply', '.pcd', '.xyz'):
        pcd = o3d.io.read_point_cloud(input_path)
        o3d.io.write_point_cloud(output_path, pcd)

    # Mesh to web: trimesh
    elif dst in ('.glb', '.gltf'):
        mesh = trimesh.load(input_path)
        mesh.export(output_path)

    # Generic mesh: Open3D
    else:
        mesh = o3d.io.read_triangle_mesh(input_path)
        o3d.io.write_triangle_mesh(output_path, mesh)

    print(f"Conversion done: {src} -> {dst}")
Watch out: Most conversions drop the CRS (coordinate reference system). PLY, OBJ, and glTF have no field to store the projection. Always export the CRS separately (a .prj or .json file) and reapply it after conversion with pyproj or PDAL.
ConversionRecommended toolPreserves CRSPreserves colorsNotes
LAS → PLYlaspy + Open3DNoYesNormalise 16-bit RGB → float
LAS → NumPylaspyVia headerYesIdeal for ML pipelines
PLY → glTFtrimeshNoYes (vertex colors)Needs a mesh or splat
OBJ → GLBtrimeshNoYes (via .mtl)Materials converted to PBR
IFC → OBJIfcOpenShellPartialNoBIM semantics lost
LAS → 3D TilesPDAL + py3dtilesYesYesFull geospatial pipeline
Conversion Pipeline: Acquisition → Web LAS Raw LiDAR laspy Python numpy arrays processing PLY open3d I/O binary trimesh mesh PBR mat. glTF + Draco compression Web Three.js Cesium Maximum precision Universal exchange Mainstream delivery Each step specialises the data for its final use

Chapter 4. Advanced Segmentation: The RANSAC vs DBSCAN Duel

Pulling signal out of noise: the algorithmic craft of isolating 3D instances.
"A raw 3D scene is unstructured chaos. Segmentation is the creative act that turns that chaos into an ontology of objects you can actually work with."
n Inliers (plane) Outliers (noise) RANSAC: Plane Detection by Consensus

1. The Philosophy of 3D Segmentation

Segmentation is the cornerstone of scene understanding. It splits into three families: Semantic Segmentation (label per class), Instance Segmentation (label per object), and Primitive Segmentation (geometry).

Insight: Unlike deep learning, which is data-hungry, geometric methods like RANSAC run zero-shot on any scene with no prior training.

2. RANSAC: Statistical Resilience

RANSAC starts from a simple premise: your data contains clean geometric structures polluted by noise. By randomly picking a minimal subset of points (3 for a plane), it builds a candidate model and tests it against the whole dataset.

ransac_core.py
import open3d as o3d

def extract_dominant_plane(pcd, threshold=0.02):
    model, inliers = pcd.segment_plane(
        distance_threshold=threshold,
        ransac_n=3,
        num_iterations=1000
    )
    [a, b, c, d] = model
    # Equation: ax + by + cz + d = 0
    print(f"Plane Equation: {a:.2f}x + {b:.2f}y + ... = 0")
    return pcd.select_by_index(inliers), pcd.select_by_index(inliers, invert=True)
Florent's Tip: Calibrate RANSAC's distance_threshold to your sensor noise. For a Leica RTC360 TLS (~1.9 mm noise at 10 m), use 0.005 m. For a DJI L2 drone LiDAR (~3 cm noise), use 0.05 m. Too low and you fragment detected planes, too high and you merge adjacent ones.

3. DBSCAN: The Density Archipelago

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is my go-to for non-parametric clustering. You don't need to define the number of clusters K up front.

density_cluster.py
import matplotlib.pyplot as plt

# Cluster isolation
labels = np.array(objects.cluster_dbscan(
    eps=0.30,       # Neighborhood radius (30cm)
    min_points=15,  # Minimum density
    print_progress=True
))

max_label = labels.max()
print(f"Found {max_label + 1} individual objects")
Tech Tip: Use a K-Distance Graph to find the optimal EPS. Look for the "knee point" in the curve of distances to the k-th neighbor.

4. Head-to-Head: The Match

Picking the right algorithm means knowing each one's strengths and weaknesses. This table sums up the core differences between RANSAC and DBSCAN.

CriterionRANSACDBSCAN
NatureGeometric ModelSpatial Density
ShapesPlanes, Spheres, CylindersArbitrary Shapes
ParametersThreshold, IterationsEpsilon, Min Points
ComplexityO(k * N)O(N log N) with index
Noise ResilienceExcellent (outliers ignored)Good (labels noise as -1)

5. Applied: Automatic Cleaning Pipeline

Chaining both algorithms in a sequential pipeline gives you a full cleaning flow: noise removal, ground extraction, object clustering.

production_pipeline.py
def auto_clean(pcd):
    # 1. Statistical Outlier Removal
    cl, ind = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    clean_pcd = pcd.select_by_index(ind)

    # 2. Ground Removal
    _, inliers = clean_pcd.segment_plane(distance_threshold=0.05, ransac_n=3, num_iterations=1000)
    objects = clean_pcd.select_by_index(inliers, invert=True)

    # 3. Object Clustering
    labels = objects.cluster_dbscan(eps=0.25, min_points=10)
    return objects, labels

This 12-line pipeline handles 80% of the preprocessing work for most modern machine perception tasks.

RANSAC and DBSCAN work beautifully on samples. But what happens when the file is 50 GB and no longer fits in memory?

Chapter 5. Big Data LiDAR: Processing at Terabit Scale

When RAM isn't enough anymore: software architectures for massive point clouds.
500 GB LASChunksCPU 0CPU 1CPU 2CPU 3MergedResultChunking & Parallelism
"Processing a 100 MB file is trivial. Processing a 500 GB airborne LiDAR campaign demands a shift in mindset: from in-memory to streaming and out-of-core."

1. The Memory Bottleneck

A standard LiDAR point (X, Y, Z, intensity, R, G, B, GPS time) weighs in around 32 to 64 bytes uncompressed. A billion points adds up to roughly 50 GB of raw RAM.

Reality Check: Python adds heavy overhead to objects. A Python list of lists [[x,y,z], ...] eats up to 5x more memory than a contiguous NumPy array.

2. Strategy 1: Chunking with Laspy

Chunking means reading the file in fixed-size batches, never loading the whole thing into memory. Laspy ships a native iterator that makes this trivial to wire up.

chunk_processor.py
import laspy
import numpy as np

def process_massive_file(filepath, chunk_size=1_000_000):
    with laspy.open(filepath) as f:
        # Iterate over points in batches of 1 million
        for chunk in f.chunk_iterator(chunk_size):
            points = np.column_stack((chunk.x, chunk.y, chunk.z))
            # Local processing (e.g. height filter)
            mask = points[:, 2] > 50.0
            save_result(points[mask])

3. Strategy 2: Spatial Tiling

PDAL (Point Data Abstraction Library) is the king of tile generation, via its CLI or Python binding.

pdal_tiling.json
[
    "input.laz",
    {
        "type": "filters.splitter",
        "length": 100.0,
        "buffer": 2.0
    },
    {
        "type": "writers.las",
        "filename": "tile_#.laz"
    }
]

4. Strategy 3: Multiprocessing (Parallelism)

Once data is cut into independent tiles, parallel processing puts every available CPU core to work and slashes compute time by a nearly linear factor.

parallel_engine.py
from multiprocessing import Pool, cpu_count

def worker_task(tile_path):
    # Isolated process: load, process, save
    pcd = load(tile_path)
    result = heavy_algorithm(pcd)
    return result.status

if __name__ == "__main__":
    files = glob.glob("tiles/*.laz")
    # Use every available core
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(worker_task, files)
Florent's Benchmark: On a Ryzen 9 5950X (16 cores), parallel tile processing cuts compute time by a factor of 12x versus sequential.
Heads up: Python multiprocessing forks the whole process for each worker. With 8 workers processing 500 MB chunks, you can burn 4 GB of RAM before you even see results. Watch memory with psutil.virtual_memory() and cap max_workers accordingly.

5. Memory Mapping (Memmap)

Memory mapping lets the OS treat a file on disk as if it were in RAM, without explicitly loading it. It's the sweet spot between access speed and capacity.

MethodAccess SpeedMax CapacityComplexity
RAM (Standard)Extreme (~50 GB/s)Capped by physical RAMLow
HDD (Seek/Read)Slow (~150 MB/s)UnlimitedMedium
Memmap (NVMe)High (~3-7 GB/s)Capped by diskMedium

Summary: The Ideal Architecture

  1. Split: PDAL to cut into 100m2 tiles with buffer.
  2. Process: Python multiprocessing script firing one worker per tile.
  3. Merge: Merge lightweight results (vectors, metadata) into a spatial database (PostGIS).
Building Information Model

You've got the fundamentals down now. The Python environment is set up, you know how to load, manipulate, and segment point clouds, you understand the 3D formats, and you can process massive datasets. What's next? Giving your data meaning. Part II dives into 3D semantics: segmentation, classification, and object detection, the three pillars of automated scene understanding.

Part II. 3D Semantics

Module 2: Segmentation, classification, and object detection in point clouds

With the fundamentals in hand, it's time to give your data meaning. This part covers the three core operations of 3D semantic analysis: cutting space into coherent regions (segmentation), assigning an identity to each element (classification), and precisely locating objects of interest (detection). These three skills form the bedrock of any automated 3D scene understanding pipeline.


Chapter 6. 3D Segmentation: Extracting Meaning from Point Clouds

Partitioning space into meaningful regions: the first step toward scene understanding.
"A point cloud without segmentation is just digital fog. Segmentation turns chaos into structure, noise into usable knowledge."

1. What Is 3D Segmentation?

3D segmentation partitions a point cloud into homogeneous, meaningful groups. Each point gets assigned to a segment, turning a shapeless mass of coordinates into a semantic map that downstream algorithms can act on.

There are three main families. Geometric segmentation groups points by shape similarity (planes, cylinders, spatial clusters). Semantic segmentation assigns a class (ground, building, vegetation, vehicle) to every individual point. Instance segmentation takes things further by distinguishing each object: not just "car", but "car_1", "car_2", "car_3".

Segmentation is the foundation for every downstream task: classification, detection, BIM reconstruction, digital twin. Without prior partitioning, no 3D processing pipeline runs efficiently. It's the base layer that gives raw data meaning.

Knowing each approach's strengths and limits lets you pick the right strategy for the context. An airborne LiDAR survey doesn't segment like an indoor scan, and real-time constraints force different trade-offs than post-processing does.

Building Vegetation Grnd Semantic segmentation of an urban LiDAR scene

2. Geometric Segmentation: Regions and Surfaces

Geometric segmentation taps into the local spatial properties of points (normals, curvatures, distances) to group regions that share shape coherence. It's the oldest and most reliable approach, and it needs no prior training.

The Region Growing algorithm starts from seed points picked by minimum curvature. At each iteration, neighbors whose normal lines up closely enough with the current region's get absorbed. The process runs until no new point passes the angle criterion.

Euclidean clustering (a DBSCAN variant applied to 3D) groups points by pure spatial proximity. Two points belong to the same cluster if they're connected by a chain of neighbors closer than eps meters. This approach works especially well after a RANSAC ground extraction, to isolate the remaining individual objects.

region_growing.py
import open3d as o3d
import numpy as np

# Load and prepare the cloud
pcd = o3d.io.read_point_cloud("scene.ply")
pcd.estimate_normals(search_param=o3d.geometry.KDTreeSearchParamHybrid(
    radius=0.1, max_nn=30))

# Region growing segmentation based on normals
# Criterion: angle between normals < 25 degrees
normals = np.asarray(pcd.normals)
points = np.asarray(pcd.points)
tree = o3d.geometry.KDTreeFlann(pcd)

labels = -np.ones(len(points), dtype=int)
current_label = 0
angle_threshold = np.cos(np.radians(25))

for seed_idx in range(len(points)):
    if labels[seed_idx] != -1:
        continue
    queue = [seed_idx]
    labels[seed_idx] = current_label
    while queue:
        idx = queue.pop(0)
        [_, nn_idx, _] = tree.search_radius_vector_3d(points[idx], 0.15)
        for ni in nn_idx:
            if labels[ni] == -1:
                dot = np.abs(np.dot(normals[idx], normals[ni]))
                if dot > angle_threshold:
                    labels[ni] = current_label
                    queue.append(ni)
    current_label += 1

print(f"Regions detected: {current_label}")
Florent's Tip: The quality of normal estimation drives the quality of region growing directly. Too small a radius gives you noisy normals, too large smooths out the detail. Always test with radius = 3 * voxel_size as a starting point.
🔧 Optimal angle threshold
For Region Growing, the angle threshold between normals is the most critical parameter. Use 10-15 degrees for planar surfaces (walls, floors, roofs) to get clean segments. Bump it to 30 degrees or more for natural terrain or irregular surfaces (vegetation, rocks). When in doubt, start at 15 degrees and raise it until you hit acceptable over-segmentation rather than under-segmentation.
Region Growing: step-by-step expansion from a seed point Step 1: Seed Seed selected Step 2: Neighbors Normals checked Step 3: Region Final region

3. Semantic Segmentation: One Class per Point

The goal of semantic segmentation is to attach a class label to each point: ground, building, vegetation, vehicle, pedestrian. Unlike geometric segmentation (which groups by shape), semantic segmentation grasps the nature of each element in the scene.

The classical approach sits on a two-step pipeline: (1) compute local geometric features for each point (curvature, relative height, local density, covariance matrix eigenvalues), then (2) train a supervised classifier (Random Forest, SVM) on those features. This method is still very competitive for airborne LiDAR surveys, where geometric features are highly discriminative.

Deep learning has reshaped the field since PointNet (2017). Modern architectures like KPConv, RandLA-Net, and Point Transformers learn straight from raw coordinates, with no hand-engineered features. The core difference from 2D vision: there's no regular grid, density varies enormously, and spatial relations are in 3D.

semantic_seg_classical.py
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compute_geometric_features(points, normals, k=20):
    """Compute geometric features for each point."""
    from scipy.spatial import cKDTree
    tree = cKDTree(points)
    features = []
    for i in range(len(points)):
        _, idx = tree.query(points[i], k=k)
        neighbors = points[idx]
        cov = np.cov(neighbors.T)
        eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]
        e1, e2, e3 = eigenvalues
        # Geometric features derived from eigenvalues
        linearity = (e1 - e2) / (e1 + 1e-8)
        planarity = (e2 - e3) / (e1 + 1e-8)
        sphericity = e3 / (e1 + 1e-8)
        height_rel = points[i, 2] - neighbors[:, 2].min()
        features.append([linearity, planarity, sphericity, height_rel])
    return np.array(features)

# Classification pipeline
features = compute_geometric_features(points, normals)
clf = RandomForestClassifier(n_estimators=200, max_depth=15, n_jobs=-1)
scores = cross_val_score(clf, features, labels, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
MethodTypical mIoUSpeedData needed
Random Forest + features65-75%FastA handful of annotated scans
PointNet++70-80%ModerateThousands of scenes
KPConv75-85%ModerateThousands of scenes
RandLA-Net75-82%Fast (GPU)Thousands of scenes
Point Transformer v380-88%SlowTens of thousands
Heads up: Academic benchmarks (S3DIS, SemanticKITTI, ScanNet) don't always reflect real-world performance. A model at 85% mIoU on a benchmark can drop to 60% on your field data if the distribution differs. Always validate on a subset of your own scans.
⚠️ Class imbalance
In urban LiDAR scenes, the "ground" class often makes up 60% or more of the points, swamping rare classes like "vehicle" or "pedestrian". If you train without correction, the model will predict "ground" everywhere and still hit high accuracy. Use a weighted loss (weighted cross-entropy) or oversampling of minority classes. Compute weights inversely proportional to each class's frequency.
Geometric vs Semantic: two views of the same cloud Geometric Segmentation Cluster A Cluster B Cluster C Semantic Segmentation Building Vegetation Grnd Vehicle Clusters = local shape | Classes = semantic identity

4. Instance Segmentation: Identifying Each Object

Semantic segmentation answers "what is this?". Instance segmentation answers "how many are there and where?". Every individual object gets a unique ID, which is a must for counting, measuring, and temporal tracking.

The most pragmatic approach pairs semantic segmentation with spatial clustering. First you segment points by semantic class, then you run DBSCAN on each class to split out individual instances. This top-down method is simple, reliable, and often plenty for industrial applications.

End-to-end approaches like 3D-BoNet and Mask3D predict instance masks directly, no intermediate step. They're more elegant but need huge volumes of annotated data and significant training time.

instance_seg.py
import numpy as np
from sklearn.cluster import DBSCAN

def instance_segmentation(points, semantic_labels, class_names):
    """Instance segmentation by post-semantic clustering."""
    instance_labels = -np.ones(len(points), dtype=int)
    current_instance = 0

    for cls_id, cls_name in enumerate(class_names):
        mask = semantic_labels == cls_id
        if mask.sum() < 10:
            continue
        cls_points = points[mask]

        # eps tuned to the class: larger for buildings
        eps = 0.5 if cls_name == "building" else 0.25
        clustering = DBSCAN(eps=eps, min_samples=10).fit(cls_points)

        for label in set(clustering.labels_) - {-1}:
            inst_mask = clustering.labels_ == label
            indices = np.where(mask)[0][inst_mask]
            instance_labels[indices] = current_instance
            current_instance += 1

    print(f"Instances detected: {current_instance}")
    return instance_labels

# Usage
classes = ["ground", "building", "vegetation", "vehicle"]
instances = instance_segmentation(points, sem_labels, classes)
Instance segmentation: same class, distinct objects Class "vehicle" - 3 instances Vehicle #1 Vehicle #2 Vehicle #3 Same semantic label, different instance IDs ID=1 ID=2 ID=3
📚 Panoptic segmentation
Panoptic segmentation is the next step: it unifies semantic and instance segmentation in a single framework. Architectures like Panoptic-PolarNet and EfficientLPS produce class labels and instance IDs at the same time. The big win is consistency, no more reconciling two independent predictions. It's the dominant research direction in 3D scene understanding since 2023.
Going further: Panoptic segmentation combines semantics and instances in a unified framework. Each point gets both a semantic class and an instance ID. "Stuff" classes (ground, sky) have no instances, while "things" classes (car, pedestrian) do. It's the state of the art in 3D scene understanding.

5. Evaluating Segmentation

Measuring segmentation quality matters as much as the segmentation itself. Standard metrics let you compare methods objectively and catch pipeline weaknesses before production.

Per-class IoU (Intersection over Union) measures the overlap between prediction and ground truth for each category. mIoU (mean IoU) averages across all classes and is the reference metric in the community. Watch out though: a high mIoU can mask disastrous performance on rare classes if they get drowned out by the average.

Visual inspection in 3D is still a must. An 80% mIoU can mean object boundaries are systematically misclassified, which is critical for Scan-to-BIM applications where tolerance is on the order of a centimeter.

eval_segmentation.py
import numpy as np

def compute_iou(pred, gt, num_classes):
    """Compute per-class IoU and mIoU."""
    ious = []
    for cls in range(num_classes):
        intersection = np.sum((pred == cls) & (gt == cls))
        union = np.sum((pred == cls) | (gt == cls))
        if union == 0:
            ious.append(float('nan'))
        else:
            ious.append(intersection / union)
    valid = [x for x in ious if not np.isnan(x)]
    miou = np.mean(valid)
    return ious, miou

# Usage example
class_names = ["ground", "building", "vegetation", "vehicle"]
ious, miou = compute_iou(predictions, ground_truth, len(class_names))
for name, iou in zip(class_names, ious):
    print(f"  {name:12} IoU = {iou:.3f}")
print(f"  mIoU = {miou:.3f}")
Pitfall: Never trust numeric metrics alone. A model can hit 85% mIoU on a benchmark and still fail on your own scenes because of sensor, density, or class distribution differences. Always validate visually on a representative sample of your real data.
IoU (Intersection over Union): the key segmentation metric Prediction Ground truth Intersection IoU = Intersection / Union Union = Prediction + Ground truth - Intersection IoU = 0.2 (poor) IoU = 0.5 (fair) IoU = 0.8 (good)
Segmentation gives us groups of points. But how do we automatically identify what each group represents? The next chapter tackles 3D classification: the craft of giving every object in the scene an identity.

Chapter 7. 3D Classification: Recognizing Objects in Space

From geometric features to gradient boosting models: classifying every object with precision.
"Classification is the act of naming. Naming a 3D object gives it existence in the machine's vocabulary. Without classification, the point cloud stays a silent space."

1. Point Cloud Classification

3D classification assigns a single label to a point cloud or to a pre-segmented object. Unlike semantic segmentation (which works point by point), classification judges the object as a whole: is this cluster of points a chair, a table, a tree, or a vehicle?

Use cases are wide open: object recognition on benchmarks like ModelNet40 (40 CAD object categories), scene understanding for robotics, industrial quality control, and automatic sorting of elements in urban LiDAR surveys. In every scenario, the goal is the same: turn raw geometry into usable categorical information.

The split from segmentation matters. Segmentation assigns a class to every point of a full scene. Classification assigns a class to a whole object that's already been isolated. In practice, the two are often chained: you segment the scene into individual objects first, then classify each one.

Chair Table Plane Classification: one object = one global label

2. Geometric Features for Classification

Before the deep learning era, classification relied on hand-crafted geometric descriptors. These features are still highly relevant in 2026, especially when annotated data is scarce or model interpretability is required.

Local features derive from the eigenvalues of the local covariance matrix. From a neighborhood of k points, you get three eigenvalues e1 >= e2 >= e3 that encode the spatial distribution of points. From those, you derive linearity (wire-like structure), planarity (surface structure), sphericity (volume structure), and anisotropy (degree of elongation).

Global features capture overall shape: bounding box dimensions, volume, estimated surface, point count, height/width ratio. Multi-scale features combine local descriptors at different radii (0.5m, 1m, 2m, 5m) to capture context at multiple levels of granularity.

feature_extraction.py
import numpy as np
import open3d as o3d
from scipy.spatial import cKDTree

def extract_features(points, radii=[0.5, 1.0, 2.0]):
    """Full multi-scale feature extraction pipeline."""
    tree = cKDTree(points)
    all_features = []

    for radius in radii:
        features_r = []
        for i in range(len(points)):
            idx = tree.query_ball_point(points[i], radius)
            if len(idx) < 3:
                features_r.append([0] * 5)
                continue
            neighbors = points[idx]
            cov = np.cov(neighbors.T)
            evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
            e1, e2, e3 = evals + 1e-10

            linearity = (e1 - e2) / e1
            planarity = (e2 - e3) / e1
            sphericity = e3 / e1
            anisotropy = (e1 - e3) / e1
            density = len(idx) / (4/3 * np.pi * radius**3)
            features_r.append([linearity, planarity, sphericity,
                               anisotropy, density])
        all_features.append(np.array(features_r))

    # Global features
    bbox_dims = points.max(axis=0) - points.min(axis=0)
    volume = np.prod(bbox_dims)
    height = bbox_dims[2]
    aspect_ratio = bbox_dims.max() / (bbox_dims.min() + 1e-8)
    n_points = len(points)
    global_feats = np.tile([volume, height, aspect_ratio, n_points],
                           (len(points), 1))

    return np.hstack(all_features + [global_feats])
FeatureFormulaWhat it captures
Linearity(e1 - e2) / e1Wire-like structure (cables, poles)
Planarity(e2 - e3) / e1Surface structure (walls, roofs)
Sphericitye3 / e1Volume structure (bushes, rocks)
Anisotropy(e1 - e3) / e1Overall elongation
Local densityk / V_sphereNeighborhood compactness
Relative heightz - z_min_localVertical position in the scene
💡 Multi-scale features
Compute your geometric features at at least 3 different radii (e.g. 0.5m, 1.5m, and 5m). The same object shows very different signatures depending on the scale of observation. This multi-scale approach typically lifts accuracy by 5 to 15 points versus a single scale, for a linear extra compute cost.
Geometric shapes encoded by eigenvalues Line e1 >> e2, e3 High linearity Cables, poles Plane e1 ≈ e2 >> e3 High planarity Walls, roofs, ground Sphere e1 ≈ e2 ≈ e3 High sphericity Bushes, rocks The 3 fundamental shapes encoded by the local covariance matrix
Florent's Tip: Multi-scale features are the key to reliable classification. A power pole has high linearity at small scale (cable) AND at large scale (mast). A tree has high sphericity at small scale (foliage) but high linearity at large scale (trunk). That multi-scale signature is what makes objects distinguishable.

3. Random Forest and Gradient Boosting for 3D

Tree-based methods are still among the best tools for point cloud classification, especially when features are well designed. Fast to train, interpretable, and tolerant of heterogeneous features, they're often the first choice in production.

Random Forest builds hundreds of independent decision trees and votes for the final class. XGBoost and LightGBM push the idea further with gradient boosting: each tree corrects the errors of the previous one, allowing finer convergence. On well-computed geometric features, these models match neural networks in performance, with training and inference times orders of magnitude smaller.

The big win is interpretability. Feature importance analysis reveals which geometric descriptors are most discriminative for each class. That info is gold for refining the feature pipeline and understanding the physics of the problem.

xgboost_classifier.py
import xgboost as xgb
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

# Feature extraction (see previous section)
X = extract_features(points, radii=[0.5, 1.0, 2.0, 5.0])
y = labels  # Classes: 0=ground, 1=building, 2=vegetation, 3=vehicle

# XGBoost config tuned for 3D
clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softprob',
    eval_metric='mlogloss',
    n_jobs=-1,
    tree_method='hist'  # Fast for large volumes
)

# Stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_weighted')
print(f"F1 weighted: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Train the final model and analyze features
clf.fit(X, y)
importances = clf.feature_importances_
feature_names = [f"{f}_r{r}" for r in [0.5,1.0,2.0,5.0]
                 for f in ["lin","plan","sph","aniso","dens"]]
feature_names += ["volume", "height", "aspect", "n_pts"]

top_k = np.argsort(importances)[::-1][:5]
for i in top_k:
    print(f"  {feature_names[i]:15} importance = {importances[i]:.4f}")
Florent's Tip: Feature importance analysis is a strong diagnostic tool. If height dominates every other feature, your model is probably over-dependent on altitude and will generalize poorly to scenes with different elevations. Diversify your features for a solid model.
⚠️ Spatial autocorrelation
Never split your train/test data randomly point by point. Neighboring points are strongly correlated spatially: if a training point sits 10 cm from a test point, the model "cheats" by memorizing local context. Use a spatial split: reserve entire geographic zones for test (a whole street, an entire building). Otherwise your metrics get artificially inflated by 10 to 20 points.
💡 Feature pruning
After a first training run, use the feature importance ranking to drop features under 1% of the total. Fewer features means faster inference, lower overfitting risk, and often better generalization. On a 20-feature set, it's not rare to see the top 8 carry 95% of the information.
Random Forest: majority vote of multiple trees Point features Tree 1 building sees: lin, sph, dens Tree 2 building sees: plan, h_rel, aniso Tree 3 vegetation sees: sph, vol, h_rel . . . Vote: BUILDING (2/3)

4. From Classification to Detection

Classification answers "what is it?". But in a full scene, the key question is "what is it, AND where is it?". This move from classification to detection is the bridge between identification and location.

The simplest approach walks the scene with a 3D sliding window (a fixed-size cube) and classifies the contents of each window. That brute method is far too costly in 3D: a 100m x 100m x 30m scene at 1m steps generates 300,000 evaluations. Smart sampling strategies (region proposals, 3D anchors) cut that cost drastically.

Modern detection combines classification and bounding box regression in a single pass. The next chapter lays out these architectures, from VoxelNet to 3DETR, the state of the art in 3D object detection.

Advanced: In production, the most reliable pipeline often stays the chain semantic segmentation -> clustering -> classification. End-to-end methods (direct detection) score better on benchmarks, but the modular approach gives you finer error diagnosis and independent tuning at each stage.
From raw data to scene understanding Raw points Classification bld. veg. veh. grd Detection bld veg veh grd adjacent Understanding Progressive pipeline: each stage enriches the previous one
📚 Self-supervised pre-training
Self-supervised pre-training techniques like Point-MAE and Point-BERT learn rich 3D representations without manual annotation by reconstructing masked portions of the point cloud. After fine-tuning on a few hundred annotated examples, these models rival classifiers trained on thousands of samples. It's an especially promising avenue when annotated data is scarce or expensive to produce.
We now know how to identify objects. The next step is locating them precisely in 3D space: the domain of 3D object detection, where each entity gets an oriented bounding box.

Chapter 8. 3D Object Detection: Locate and Identify

From raw point cloud to oriented bounding boxes: spotting every entity in space.
"Detecting in 3D means answering two questions at once: what is it, and where exactly is it? The answer is a box in space. Six dimensions plus a rotation."

1. The 3D Detection Challenge

3D object detection goes beyond classification and segmentation. You have to pin down each object with an oriented 3D bounding box (OBB) defined by seven parameters: position (x, y, z), dimensions (length, width, height), and rotation angle theta.

It's a lot harder than 2D detection. In 2D a box has 4 parameters (x, y, w, h). In 3D you jump to at least 7, with rotations possible around three axes. The search space blows up combinatorially, and exhaustive approaches stop being practical.

The applications matter. Self-driving cars need to spot vehicles, pedestrians and cyclists in real time with centimeter accuracy. City planning relies on detection to inventory street furniture, trees and buildings from airborne LiDAR. Automated BIM uses detection to pick out doors, windows and mechanical equipment in indoor scans.

Vehicle (0.92) Tree (0.88) Building (0.95) 3D detection: oriented bounding boxes with confidence scores

2. Voxelization-Based Approaches

Voxelization turns an irregular point cloud into a regular 3D grid, which lets you apply 3D convolutions the same way image networks apply 2D convolutions. Regularizing the format is the bridge between the discrete world of points and the structured world of tensors.

The VoxelNet architecture is a good example. The cloud gets discretized into voxels, each voxel is encoded by a small network (VFE, for Voxel Feature Encoding), and then sparse 3D convolutions extract hierarchical features. A detection head finally produces bounding box proposals.

Picking the voxel resolution is the make-or-break decision. Fine resolution (0.05m) captures detail but blows up the data volume. Coarse resolution (0.5m) runs fast but loses small objects. In practice, modern architectures use sparse 3D convolutions (through spconv or MinkowskiEngine) that only process non-empty voxels, which slashes memory cost.

voxelization.py
import open3d as o3d
import numpy as np

def voxelize_scene(pcd, voxel_size=0.2):
    """Voxelize a cloud and extract per-voxel features."""
    # Build the voxel grid
    voxel_grid = o3d.geometry.VoxelGrid.create_from_point_cloud(
        pcd, voxel_size=voxel_size)

    points = np.asarray(pcd.points)
    # Compute voxel indices for each point
    min_bound = points.min(axis=0)
    voxel_indices = ((points - min_bound) / voxel_size).astype(int)

    # Aggregate features per voxel
    unique_voxels = {}
    for i, vi in enumerate(voxel_indices):
        key = tuple(vi)
        if key not in unique_voxels:
            unique_voxels[key] = []
        unique_voxels[key].append(points[i])

    # Per-voxel features: centroid, point count, extent
    voxel_features = {}
    for key, pts in unique_voxels.items():
        pts = np.array(pts)
        centroid = pts.mean(axis=0)
        n_pts = len(pts)
        extent = pts.max(axis=0) - pts.min(axis=0)
        voxel_features[key] = np.concatenate([centroid, [n_pts], extent])

    print(f"Active voxels: {len(voxel_features)} / theoretical grid")
    return voxel_features, voxel_grid

features, grid = voxelize_scene(pcd, voxel_size=0.2)
Watch out: Memory explodes at fine resolutions. A 0.1m voxel on a 200m x 200m x 50m urban scene gives you 2000 x 2000 x 500 = 2 billion theoretical voxels. Even with sparse convolutions, the hash grid can saturate your RAM. Always run a spatial crop before voxelization.
Voxelization Process Continuous cloud 3D grid Occupied voxels Per voxel: xyz centroid point count variance mean intensity XYZ extent Features Typical voxel size: 0.1m (indoor) | 0.2-0.5m (outdoor) | 1m+ (aerial)
🔧 Picking voxel size
Voxel size is the single most important parameter in voxel-based 3D detection. Rules of thumb: 0.05-0.1m for indoor scans (furniture, office objects), 0.2-0.5m for ground-level urban scenes (cars, pedestrians), 1-2m for airborne LiDAR surveys (buildings, vegetation). Too small and you drown the signal in noise. Too big and you merge distinct objects together.

3. Point-Based Approaches: VoteNet and 3DETR

Point-based approaches skip the information loss baked into voxelization by working directly on raw coordinates. Two standout architectures show off this philosophy: VoteNet and 3DETR.

VoteNet borrows from the generalized Hough voting system. Each point "votes" for the position of the center of its object. The votes then get clustered, and each vote cluster produces a bounding box proposal. The intuition is straightforward: a point on a car's hood votes toward the geometric center of the car, while a point on the roof votes the opposite direction. Where votes converge, you've found an object center.

3DETR adapts the DETR Transformer architecture (originally built for 2D detection) to 3D. A set of learnable "queries" probes the features extracted from the point cloud through cross-attention. Each query specializes on a potential object and directly predicts its class and bounding box. The big win: no Non-Maximum Suppression (NMS) in post-processing, because the queries naturally divide the objects among themselves.

Go deeper: The 2026 trend is clearly Transformer architectures for 3D detection. 3DETR, Group-Free 3D, CAGroup3D and Point Transformer v3 dominate the ScanNet and SUN RGB-D benchmarks. Why? Attention naturally captures long-range relations between the different parts of an object, which local convolutions struggle to do.
VoteNet Pipeline. Generalized Hough Voting Cloud N pts Seeds (FPS) Votes to centers Clustering car table OBB + class Each seed votes toward the presumed center of its object. Converging votes reveal the objects.
⚠️ VoteNet and pre-segmentation
VoteNet assumes foreground points (objects) have already been separated from background (ground, walls). On a raw, unsegmented point cloud, ground points vote in random directions and pollute the vote clusters. In practice, always run a RANSAC ground extraction before VoteNet, or use a variant like ImVoteNet that folds in 2D features to disambiguate.

4. Detection in Urban LiDAR Scenes

Urban detection comes with its own constraints: scenes that stretch hundreds of meters, density that varies with distance from the sensor, frequent occlusions, and objects at very different scales (a pedestrian is 1.7m, a building is 30m).

The practical pipeline starts with rigorous preprocessing: RANSAC ground extraction to normalize heights, outlier removal, and cutting the scene into manageable tiles. Post-processing matters just as much: 3D NMS to kill redundant detections, confidence-score filtering, and optionally temporal tracking if the data is sequential.

For plenty of urban use cases, a simple detector based on geometric rules (height, density, shape) is surprisingly effective. The example below shows a bare-bones pipeline that spots trees and buildings using height and shape criteria after ground extraction.

urban_detector.py
import numpy as np
import open3d as o3d
from sklearn.cluster import DBSCAN

def detect_urban_objects(pcd, ground_threshold=0.3):
    """Simple detector based on geometric rules."""
    points = np.asarray(pcd.points)

    # 1. RANSAC ground extraction
    plane, inliers = pcd.segment_plane(
        distance_threshold=0.15, ransac_n=3, num_iterations=2000)
    non_ground = pcd.select_by_index(inliers, invert=True)
    ng_pts = np.asarray(non_ground.points)

    # 2. Normalize height against the ground
    z_ground = np.mean(points[inliers, 2])
    heights = ng_pts[:, 2] - z_ground

    # 3. Cluster the objects
    clustering = DBSCAN(eps=0.8, min_samples=20).fit(ng_pts)
    detections = []

    for label in set(clustering.labels_) - {-1}:
        mask = clustering.labels_ == label
        cluster_pts = ng_pts[mask]
        cluster_h = heights[mask]

        # Compute dimensions
        bbox = cluster_pts.max(axis=0) - cluster_pts.min(axis=0)
        h_max = cluster_h.max()
        aspect = bbox[2] / (max(bbox[0], bbox[1]) + 1e-8)

        # Rule-based classification
        if h_max > 8 and bbox[0] > 3 and bbox[1] > 3:
            cls = "building"
            conf = min(0.95, 0.6 + h_max / 50)
        elif 2 < h_max < 15 and aspect > 1.5:
            cls = "tree"
            conf = min(0.90, 0.5 + aspect / 5)
        elif 1 < h_max < 3 and bbox[0] < 6:
            cls = "vehicle"
            conf = 0.70
        else:
            cls = "unknown"
            conf = 0.30

        # Oriented bounding box
        o3d_cluster = o3d.geometry.PointCloud()
        o3d_cluster.points = o3d.utility.Vector3dVector(cluster_pts)
        obb = o3d_cluster.get_oriented_bounding_box()

        detections.append({
            "class": cls, "confidence": conf,
            "center": obb.center, "extent": obb.extent,
            "n_points": mask.sum()
        })

    print(f"Detected objects: {len(detections)}")
    return detections
MethodAccuracy (mAP)SpeedEase of use
Geometric rules50-65%Very fastInstant, no training
VoxelNet / SECOND75-82%Real-time (GPU)Needs training + GPU
VoteNet70-78%ModerateModerate training
3DETR / CAGroup3D80-88%SlowHeavy GPU + large dataset
PointPillars72-80%Very fast (GPU)Good production trade-off
Florent's Tip: For a quick production deployment, start with a rule-based geometric detector. It gives you a solid baseline and exposes the edge cases in your scene. Only after that should you invest in a Deep Learning model for the classes where rules break down.
BEV Detection. Urban Scene 0.95 0.91 0.88 0.97 0.93 0.96 Building 0.99 Building 0.98 Vehicle Tree Building Scores = detector confidence
🎓 Multi-modal fusion
The top detectors in 2026 fuse LiDAR and camera. BEVFusion projects LiDAR and camera features into a shared BEV (Bird's Eye View) space, then runs detection in that unified space. You typically pick up 5 to 8 mAP points over LiDAR alone, especially for distant objects where LiDAR goes sparse. TransFusion uses cross-attention between the two modalities for a tighter alignment.

5. Toward Complete Scene Understanding

Segmentation, classification and detection are just the building blocks. Complete scene understanding combines these three pillars to build a structured, intelligible representation of the 3D environment.

The scene graph concept pushes that integration further. Each detected object becomes a graph node, and spatial relations between objects become edges: "the car is parked in front of the building", "the tree is adjacent to the sidewalk", "the window is in the facade". This relational representation is the substrate on which spatial AI agents reason.

This unified view, where every point in the cloud is simultaneously segmented, classified and placed in a relational graph, is what modern digital twins and automated BIM systems are built on. We'll dig into those in the next parts of the book.

Advanced: 3D Foundation Models like OpenScene and SAM3D are starting to unify segmentation, classification and detection under a single framework. By combining 2D embeddings (CLIP, SAM) with 3D features, they enable zero-shot scene understanding: describe what you're looking for in plain language, and the model pins it down in the point cloud. That's the frontier between 3D perception and general AI.
From Detection to Scene Graph Car Building Tree Road Sidewalk parked_on in_front_of adjacent_to on_the runs_along Nodes = detected objects | Edges = semantic spatial relations The Scene Graph is the reasoning substrate for Spatial AI Agents (Part VI)
💡 3D IoU vs 2D IoU
3D IoU is a lot stricter than its 2D counterpart. Two boxes that look well aligned from above can have a poor 3D IoU if their heights differ. A threshold of 3D IoU > 0.25 is standard on indoor benchmarks (ScanNet), while 3D IoU > 0.5 or 0.7 is used for self-driving (KITTI, nuScenes). Never compare detectors evaluated at different IoU thresholds.
With segmentation, classification and detection, you have a full toolbox for analyzing point clouds. The rest of the book digs into how these techniques fit into production systems: BIM reconstruction, digital twins, and AI agents that can reason about spatial data.

Segmentation, classification, detection: you've got the full toolbox for pulling meaning out of point clouds. But what's the point of detecting walls, floors and openings if you can't turn them into usable models? Part III shows you how to automate the Scan-to-BIM pipeline and build urban digital twins.

Part III. Automation and BIM

Modules 2-3: Scan-to-BIM, IFC and Urban Digital Twins

Now that you know how to segment, classify and detect objects in point clouds, we're moving on to the industrial application that gets asked for the most: turning 3D scans into structured BIM models. This part gives you what you need to automate the Scan-to-BIM pipeline, work with the IFC format in Python, and fold your models into urban digital twins.


Chapter 9. Automatic Scan-to-BIM: The Digital Transformation

From raw point cloud to structured IFC model: automate the modeler's job.
"The BIM modeler spends 70% of their time tracing walls on point cloud slices. Our mission: cut that time to zero with computer vision algorithms."
Point Cloud Segmen- tation Geometric Fitting IFC Model Automatic Scan-to-BIM Pipeline Cloud → Segmentation → Fitting → IFC

1. The "Slice" Strategy

Detecting walls directly in 3D is hard and expensive. The engineer's trick is to project the problem into 2D. By pulling a slice out of the building at a strategic height (say 1.5m), we turn a 3D problem into a 2D binary image.

slicer.py
import numpy as np

def create_floor_slice(pcd, height_min=1.2, height_max=1.8):
    # Band-pass filter on Z
    points = np.asarray(pcd.points)
    mask = (points[:, 2] > height_min) & (points[:, 2] < height_max)
    slice_pcd = pcd.select_by_index(np.where(mask)[0])
    # 2D projection (flatten Z)
    grid, transform = points_to_raster(slice_pcd, pixel_size=0.05)
    return grid # Binary image (0 or 255)

2. Hough Transform: Detecting Lines

From the binary slice image, the Hough transform picks out the straight segments that correspond to wall surfaces. OpenCV gives you an optimized implementation through HoughLinesP.

Points + Hough Detected plan Wall Detection via Hough Transform
line_detector.py
import cv2

def find_walls(image_grid):
    lines = cv2.HoughLinesP(
        image_grid,
        rho=1,              # Distance resolution (1 pixel)
        theta=np.pi/180,    # Angular resolution (1 degree)
        threshold=50,       # Minimum votes
        minLineLength=30,   # Min length (e.g. 1.5m)
        maxLineGap=10       # Allowed gap for merging
    )
    return lines

3. Topology Reconstruction: From Lines to Walls

Tip: Use the Shapely library to buffer the lines and run boolean operations (Union) to produce clean wall polygons.

4. IFC Generation with IfcOpenShell

The final step of the Scan-to-BIM pipeline turns the detected geometry into structured IFC entities. IfcOpenShell lets you build a full BIM model programmatically from scratch.

ifc_builder.py
import ifcopenshell
from ifcopenshell.api import run

# Create empty project
model = ifcopenshell.file()
project = run("root.create_entity", model, ifc_class="IfcProject", name="AutoBIM")

# Create a wall from coords p1, p2
wall = run("root.create_entity", model, ifc_class="IfcWall")
run("geometry.edit_object_placement", model, product=wall)
Go deeper: For professional-grade Scan-to-BIM, combine RANSAC detection with a region-growing algorithm to capture curved surfaces (columns, arches). The CGAL library, through its Python bindings (scikit-geometry), gives you solid geometric primitives for reconstructing complex shapes.

Wrap-up: The Limits of Automation

StepDifficultyAuto Success Rate
Straight walls (Manhattan World)Low95%
Curved wallsMedium60%
Openings (Doors/Windows)High40%
MEP NetworksExtreme10%
The Scan-to-BIM pipeline spits out IFC files. But how do you move through the dense structure of that format to pull out the useful information?

Chapter 10. IfcOpenShell: Hack BIM with Python

The IFC format is heavy and complex. Python is light and nimble. Learn how to tame the beast.
IfcProject IfcSite IfcBuilding IfcStorey IfcWall IfcSlab IfcWindow Root Spatial Elements IFC Spatial Hierarchy
"An IFC file isn't just a 3D model. It's a dense object-oriented relational database (EXPRESS). Going through it without tooling is like reading the Matrix in binary."

1. The IFC Spatial Hierarchy

An IFC file is organized as a tree: Project > Site > Building > Storey > Spaces > Elements. Walking that hierarchy is the first skill to master if you want to pull out targeted information.

hierarchy_walker.py
import ifcopenshell

model = ifcopenshell.open("model.ifc")
project = model.by_type("IfcProject")[0]

def print_hierarchy(object, indent=0):
    print("  " * indent + f"-> {object.Name} ({object.is_a()})")
    if hasattr(object, "IsDecomposedBy"):
        for rel in object.IsDecomposedBy:
            for related_obj in rel.RelatedObjects:
                print_hierarchy(related_obj, indent+1)

print_hierarchy(project)
Florent's Tip: IFC GlobalId identifiers are UUIDs compressed in base64. To generate compliant GlobalIds, use ifcopenshell.guid.new() rather than uuid.uuid4(). An invalid GlobalId can silently corrupt the IFC file and cause crashes in BIM viewers like BIMcollab or Solibri.

2. Data Mining: Extracting Quantities

IFC Property Sets and Quantity Sets hold the measured information for every element. Export them into a Pandas DataFrame and you turn a BIM file into a queryable database.

bim_pandas.py
import pandas as pd
import ifcopenshell.util.element

data = []
walls = model.by_type("IfcWall")

for w in walls:
    psets = ifcopenshell.util.element.get_psets(w)
    qto = psets.get("Qto_WallBaseQuantities", {})
    data.append({
        "GlobalId": w.GlobalId,
        "Name": w.Name,
        "Volume": qto.get("NetVolume", 0.0),
        "Area": qto.get("NetSideArea", 0.0)
    })

df = pd.DataFrame(data)
print(df.groupby("Name")["Volume"].sum())

In 10 lines of code, you've replaced a paid Revit plugin.

3. Geometry: From Implicit to Mesh

IFC geometries are stored implicitly (extruded profiles, CSG). IfcOpenShell's geometric engine tessellates them into triangular meshes you can actually use for visualization and analysis.

geom_engine.py
import ifcopenshell.geom

settings = ifcopenshell.geom.settings()
settings.set(settings.USE_WORLD_COORDS, True)

for wall in walls:
    shape = ifcopenshell.geom.create_shape(settings, wall)
    verts = shape.geometry.verts # Flat list X,Y,Z...
    faces = shape.geometry.faces # Triangle indices

4. Modify and Write

IfcOpenShell isn't read-only. You can add properties, modify attributes and save the updated model. That's the key to BIM automation.

bim_writer.py
for wall in walls:
    ifcopenshell.api.run("pset.edit_pset", model,
        product=wall,
        pset_name="Pset_QualityControl",
        properties={"Status": "Checked", "Date": "2026-01-10"})

model.write("model_checked.ifc")
IfcOpenShell gives you full control over a single building. How do you scale up to an entire city with CityGML?

Chapter 11. Urban Digital Twin: The CityGML Standard

A smart city isn't just a 3D mesh. It's a semantic geospatial database.
LOD 0 2D footprint LOD 1 Extruded block LOD 2 Differentiated roof LOD 3 Detailed facade CityGML Levels of Detail (LOD 0-3)
"Google Earth shows you what the city looks like. A CityGML Digital Twin tells you what it's made of, how much energy it consumes, and which roof is ideal for solar."

1. The LOD (Level of Detail) Concept

  • LOD 0: 2.5D terrain (DTM). Use case: hydrology.
  • LOD 1: Extruded blocks ("Shoebox"). Use case: macro-scale urban planning.
  • LOD 2: Differentiated roofs. Use case: solar potential.
  • LOD 3: Detailed facades (windows, doors). Use case: architecture.
  • LOD 4: Building interiors (BIM). Use case: crisis management.

2. Parsing CityGML with Python

CityGML is an XML format based on GML. The lxml library lets you move through its dense namespaces to pull out the semantic attributes of each building.

city_parser.py
from lxml import etree

ns = {'core': "http://www.opengis.net/citygml/2.0",
      'bldg': "http://www.opengis.net/citygml/building/2.0",
      'gml': "http://www.opengis.net/gml"}

def parse_buildings(file_path):
    tree = etree.parse(file_path)
    root = tree.getroot()
    buildings = root.findall('.//bldg:Building', namespaces=ns)
    print(f"Found {len(buildings)} buildings.")
    for b in buildings:
        measured_height = b.find('.//bldg:measuredHeight', namespaces=ns)
        height = float(measured_height.text) if measured_height is not None else 0.0
        print(f"Building Height: {height}m")
Performance: For files over 500MB, don't use etree.parse (which loads everything into RAM). Go with etree.iterparse for streaming reads.

3. From CityGML to the Web: 3D Tiles

To view an urban digital twin in a browser, you have to convert the CityGML into 3D Tiles, Cesium's spatial streaming format. The py3dtiles tool handles the conversion.

tiler.sh
# Conversion via Py3DTiles CLI
py3dtiles convert input.gml --out output_tileset/

4. Visualization with CesiumJS

CesiumJS is the reference web engine for 3D geospatial rendering. In a few lines of JavaScript, it loads a 3D Tiles tileset and applies dynamic styling based on semantic attributes.

viewer.html
const viewer = new Cesium.Viewer('cesiumContainer');

const tileset = viewer.scene.primitives.add(
    new Cesium.Cesium3DTileset({
        url: './output_tileset/tileset.json'
    })
);

tileset.readyPromise.then(function(tileset) {
    viewer.zoomTo(tileset);
    tileset.style = new Cesium.Cesium3DTileStyle({
        color: {
            conditions: [
                ['${Height} >= 50', 'color("purple")'],
                ['${Height} >= 20', 'color("red")'],
                ['true', 'color("white")']
            ]
        }
    });
});
Go deeper: For top performance in CesiumJS on whole cities, convert your CityGML to 3D Tiles with citygml-tools or FME. The 3D Tiles format uses a hierarchical LOD with bounding volumes, which lets you stream millions of buildings without saturating browser memory.
Input Hidden 1 Hidden 2 Output Deep Learning 3D Architecture

Part IV. Machine Learning & Deep Learning 3D

Module 4: From classical features to neural architectures for point clouds

Automated BIM runs on rules and heuristics. To go further, to generalize, learn, adapt, you have to step into Machine Learning and then Deep Learning. This part starts with the fundamentals of ML applied to 3D (geometric features, Random Forest, SVM) before digging into the neural architectures (PointNet, PointNet++) that reshaped the field.


Chapter 12. Machine Learning for 3D: Fundamentals Demystified

Before diving into Deep Learning, grasp the basics of classical ML and know when to use it.
"The best model isn't the most complex one. It's the one that solves your problem with the least data, time, and risk. In 3D, a well-tuned Random Forest often beats a poorly trained neural network."

1. AI, Machine Learning, Deep Learning: What's What?

These terms get tossed around interchangeably in the media, but they describe nested concepts with increasing levels of specificity. Grasping this hierarchy matters if you want to avoid the buzzword trap and pick the right approach for each problem.

Artificial Intelligence is the broadest field: any system that mimics intelligent behavior. In 3D, a plain RANSAC algorithm that detects a plane already counts as AI in the wide sense. Machine Learning is a subset of AI: algorithms that learn patterns from data without being explicitly programmed for each case. In 3D, a Random Forest trained on geometric features to classify ground/building/vegetation is classical ML.

Deep Learning is itself a subset of ML. It uses multi-layer neural networks to automatically learn hierarchical representations. PointNet, which takes raw XYZ coordinates and figures out on its own which features matter, is DL. Beyond that, Foundation Models (SAM, CLIP) are DL models pre-trained on massive data, adaptable to specific tasks through fine-tuning or zero-shot.

In 2026, the reality on the ground is more nuanced. LLMs and "GenAI" dominate the media conversation, but in production 3D, classical ML is still massively deployed. Random Forests with geometric features run point cloud classification pipelines at national scale (IGN, Kadaster, Ordnance Survey). Deep Learning is catching up fast, but it needs more data, more GPU, and more expertise to be reliable.

Artificial Intelligence RANSAC, A*, rules Machine Learning Random Forest, SVM, k-NN Deep Learning PointNet, MinkowskiNet Foundation Models SAM-3D, OpenScene Each inner circle is a subset of the previous one

2. Classical Machine Learning on Point Clouds

Classical ML on point clouds rests on one core principle: feature engineering. You manually extract geometric descriptors for each point (or group of points), then train a supervised classifier on those feature vectors.

The most discriminative geometric features come from the local covariance matrix (PCA over k-nearest neighbors). The three eigenvalues lambda_1 >= lambda_2 >= lambda_3 let you derive: planarity (lambda_2 - lambda_3) / lambda_1, linearity (lambda_1 - lambda_2) / lambda_1, sphericity lambda_3 / lambda_1, and curvature lambda_3 / (lambda_1 + lambda_2 + lambda_3). A flat ground will have strong planarity, a power cable strong linearity, a tree strong sphericity.

The Random Forest is the go-to algorithm for this task. Why? It natively handles features at different scales, it's robust to noise, it needs no normalization, and it gives you a feature importance score that helps make sense of the model. With 15 multi-scale geometric features, a Random Forest hits 85-90% accuracy on urban benchmarks.

classical_ml_pipeline.py
import open3d as o3d
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def compute_geometric_features(pcd, k=30):
    """Compute geometric features via local PCA."""
    tree = o3d.geometry.KDTreeFlann(pcd)
    points = np.asarray(pcd.points)
    features = []

    for i in range(len(points)):
        [_, idx, _] = tree.search_knn_vector_3d(points[i], k)
        neighbors = points[idx]
        cov = np.cov(neighbors.T)
        eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]
        l1, l2, l3 = eigenvalues + 1e-10  # Avoid division by zero

        linearity  = (l1 - l2) / l1
        planarity  = (l2 - l3) / l1
        sphericity = l3 / l1
        curvature  = l3 / (l1 + l2 + l3)
        anisotropy = (l1 - l3) / l1
        features.append([linearity, planarity, sphericity, curvature, anisotropy])

    return np.array(features)

# Full pipeline
pcd = o3d.io.read_point_cloud("scene.ply")
X = compute_geometric_features(pcd, k=30)
y = np.load("labels.npy")  # 0=ground, 1=building, 2=vegetation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier(n_estimators=100, max_depth=15, n_jobs=-1)
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.3f}")
Florent's Tip: A Random Forest with 15 multi-scale geometric features (k=10, k=30, k=50) reaches 85-90% accuracy on urban point clouds, on par with the earliest PointNet implementations. And it trains in seconds on CPU, not hours on GPU.
XYZ Cloud k-NN Neighborhood PCA λ1 ≥ λ2 ≥ λ3 Eigenvalues Features Planarity: 0.85 Linearity: 0.03 Sphericity: 0.12 Curvature: 0.02 5D Vector Geometric Feature Extraction Pipeline Repeated for each cloud point at different scales (k=10, 30, 50)
⚠️ Feature normalization
Normalize your features with StandardScaler before using an SVM or k-NN (those algorithms are scale-sensitive). On the flip side, never normalize before a Random Forest: decision trees are invariant under monotonic transformations, and normalization only hurts the interpretability of split thresholds without improving accuracy.

3. From Feature Engineering to Feature Learning

Classical ML has an Achilles heel: features are human-designed. No matter how clever, they only capture what the designer imagined. Complex spatial patterns (multi-scale relationships between objects, scene contexts) slip past manual geometric descriptors.

That's the fundamental shift Deep Learning brings: instead of designing features, you let the network learn them directly from raw data. The network jointly optimizes feature extraction AND classification, finding representations a human would never have dreamed up.

The bridge between both worlds is PointNet (Qi et al., 2017). It takes raw XYZ coordinates as input, applies learned transformations (shared MLPs), and produces a global representation through symmetric max-pooling. No voxelization, no 2D projection, just direct processing of the point cloud. That's what sets up the next chapters on Deep Learning for 3D.

Classical ML vs Deep Learning Classical ML XYZ Cloud Feature Engineering (planarity, curvature...) Random Forest Prediction Manual design Deep Learning XYZ Cloud Neural Network (PointNet) Automatically learned features Prediction End-to-end learning

4. When to Use What: The Practitioner's Decision Tree

When you face a 3D classification or segmentation problem, the choice of method should never follow tech fashion. It depends on concrete criteria: dataset size, available hardware, need for interpretability, and latency constraints.

The golden rule: start simple. If you have fewer than 10,000 annotated points, a Random Forest with geometric features will be more reliable than an under-trained PointNet. If you have a GPU and hundreds of thousands of points, PointNet or PointNet++ become competitive. For massive datasets (millions of points, dozens of classes), pre-trained Foundation Models like SAM-3D or OpenScene offer the best accuracy/annotation effort ratio.

Interpretability is an often-ignored factor. A Random Forest tells you which features matter (planarity dominates? curvature?). A neural network is a black box. If your client or your team needs to understand why a point is classified as "building", classical ML is still unbeatable.

MethodMin. dataHardwareTypical accuracyTrainingInterpretability
Random Forest + features1k-10k ptsCPU85-90% (urban)SecondsHigh
SVM + features1k-50k ptsCPU83-88%MinutesMedium
PointNet50k-500k ptsGPU (4 GB+)87-91%HoursLow
PointNet++100k-1M ptsGPU (8 GB+)90-94%HoursLow
Foundation ModelsZero-shot possibleGPU (16 GB+)91-96%Minutes (fine-tune)Very low
Advanced: The "no free lunch" theorem (Wolpert, 1996) states that no algorithm is universally superior across all problems. A Random Forest can beat PointNet++ on a specific dataset where geometric features perfectly capture the structure. Always test a simple baseline before deploying a complex model.
3D Practitioner's Decision Tree Labeled data available? < 1 000 pts Random Forest CPU - Seconds - Interpretable 1k - 100k pts XGBoost / SVM CPU - Minutes - Good accuracy > 100k pts Deep Learning ? No GPU Multi-scale RF GPU 8GB+ PointNet++ 0 labels? Use Foundation Models SAM-3D, OpenScene: zero-shot, no annotation
💡 Hybrid approaches
The boundary between classical ML and Deep Learning isn't watertight. The top-performing approaches in 2026 often combine both: geometric features concatenated with features learned by a neural network. For example, extract eigenvalue features at multiple scales, then inject them as additional features into a PointNet++. This hybrid strategy consistently gains 2-4 mIoU points over DL alone, especially on small datasets.

5. Evaluation Metrics for 3D

Properly evaluating a 3D classification or segmentation model is subtler than it looks. Global accuracy, the most intuitive metric, is often misleading because of the massive class imbalance in real point clouds.

In a typical urban point cloud, the ground accounts for 60-70% of points, buildings 15-20%, vegetation 10-15%, and rare objects (cars, pedestrians, street furniture) under 5%. A model that predicts "ground" for every point hits 65% accuracy, a flattering but totally useless number. That's why per-class metrics are essential.

The reference metric in 3D segmentation is mIoU (mean Intersection over Union). For each class, IoU measures the ratio between the intersection and the union of predictions and ground truth. mIoU is the average of these per-class IoUs, giving equal weight to each category regardless of frequency. An mIoU of 70% is considered good on standard benchmarks (S3DIS, SemanticKITTI).

compute_miou.py
import numpy as np

def compute_miou(predictions, labels, num_classes):
    """Compute mIoU (mean Intersection over Union)."""
    ious = []
    for c in range(num_classes):
        pred_c = predictions == c
        label_c = labels == c
        intersection = np.logical_and(pred_c, label_c).sum()
        union = np.logical_or(pred_c, label_c).sum()
        if union == 0:
            ious.append(float('nan'))  # Class absent
        else:
            ious.append(intersection / union)
    miou = np.nanmean(ious)
    # Per-class output
    for c, iou in enumerate(ious):
        print(f"  Class {c}: IoU = {iou:.3f}")
    print(f"  Global mIoU: {miou:.3f}")
    return miou, ious

# Usage example
preds = np.array([0, 0, 1, 2, 2, 1, 0, 2])
gt    = np.array([0, 1, 1, 2, 0, 1, 0, 2])
miou, per_class = compute_miou(preds, gt, num_classes=3)
Warning: Spatial data leakage is a sneaky trap in 3D. Points close in space are strongly correlated (spatial autocorrelation). If your train/test split randomly partitions points from the same object, the model "memorizes" the local context instead of learning generalizable patterns. Solution: always split by spatial zones (per building, per tile, per scan), never by individual points.
Confusion Matrix - Urban Classification Prediction → Truth → Grnd Bldg. Veg. Veh. Grnd Bldg. Veg. Veh. 92% 3% 4% 1% 5% 88% 6% 1% 7% 4% 85% 4% 8% 5% 12% 75% IoU: 0.82 IoU: 0.78 IoU: 0.71 IoU: 0.58 mIoU = 0.72 - The rare Vehicle class drags the mIoU down
📈 Always report per-class IoU
mIoU alone can hide catastrophic performance on rare classes. In the example above, the 72% mIoU looks fine, but the Vehicle IoU at 58% is unacceptable for an urban mapping application. Always report mIoU and per-class IoU in your reports. In production, set a minimum threshold per critical class (e.g., Vehicle IoU > 70% for autonomous driving).
🎓 Transfer Learning in 3D
Self-supervised pre-training approaches (Point-MAE, Point-BERT) let you learn 3D representations from unlabeled data, then fine-tune on your specific task with very few labels. This is the dominant trend in 2026: pre-train on millions of unlabeled points (Waymo, nuScenes), then adapt with a few hundred annotated points from your domain. The typical gain is 5-10 mIoU points over training from scratch.
You now own the fundamentals of Machine Learning applied to 3D. But manual features hit a ceiling: they only capture what a human imagined. The next chapter crosses that threshold with PointNet, the architecture that proved a neural network can learn directly from raw XYZ coordinates, no features, no voxels, no projections.

Chapter 13. PointNet: The Architecture That Changed Everything

How do you process raw point clouds without voxelization or 2D projection? Stanford's answer that reshaped 3D.
"Before 2017, 3D Deep Learning was a memory-hungry voxel grid hell. PointNet proved you could learn directly from unordered XYZ coordinates."
N x 3 Points Shared MLP MAX Pool Global Feature Classification Segmentation Simplified PointNet Architecture Permutation invariance via global Max Pooling

1. The Ordering Problem

A point cloud is a set of N points. If you permute the order of the points, the object stays the same. But a classical neural network (MLP, CNN) is sensitive to input order. PointNet solves this by using a symmetric function (Max Pooling) that makes the network permutation-invariant.

2. T-Net: Spatial Alignment

The T-Net is a mini-network that learns a transformation matrix to spatially align the input cloud. It makes the model robust to arbitrary rotations and translations.

tnet_module.py
class TNet(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.conv1 = nn.Conv1d(k, 64, 1)
        self.fc1 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(256, k*k)

    def forward(self, x):
        # Learns a k x k transformation matrix
        matrix = self.fc3(x).view(-1, k, k)
        identity = torch.eye(k).cuda()
        return matrix + identity

3. Backbone Implementation

The PointNet backbone chains the T-Net, shared MLP layers, and a global Max Pooling to produce a 1024-dimensional feature vector that's permutation-invariant.

pointnet_cls.py
class PointNetCls(nn.Module):
    def __init__(self, k=10, feature_transform=False):
        super().__init__()
        self.feat = PointNetFeat(global_feat=True)
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, k) # k classes
        self.dropout = nn.Dropout(p=0.3)

    def forward(self, x):
        x, trans, trans_feat = self.feat(x)
        x = F.relu(self.fc1(x))
        x = F.log_softmax(self.fc3(x), dim=1)
        return x, trans_feat
Florent's Tip: PointNet's T-Net (Spatial Transformer) is notoriously unstable during training. Add L2 regularization on the transformation matrix: reg_loss = 0.001 * torch.norm(I - torch.bmm(trans, trans.transpose(2,1))). Without this regularization, the matrix diverges and the network won't converge.

4. Classification vs Segmentation

PointNet supports two core tasks. Classification assigns a global label to an object (chair, table, airplane), while semantic segmentation assigns a class to each point individually. The architectural difference is tiny, but the practical implications are massive.

Confusion Matrix - 3D Classification Predicted Ground truth Grnd Veg. Bldg. Other Grnd Veg. Bldg. Other 94% 3% 1% 2% 5% 88% 2% 5% 2% 1% 91% 6% 4% 7% 4% 85%
TaskOutputArchitecture
Classification1 class per cloudGlobal Max Pooling -> FC Layers
Segmentation1 class per pointConcatenate Global Feat (1024) to each Point Feat (64)

For segmentation, the clever idea is to reinject global information (the whole-shape context) back at the local level of each point.

PointNet ignores the local structure of the cloud. How does PointNet++ solve this core problem by introducing a spatial hierarchy?

Chapter 14. PointNet++: The Power of Local Context

PointNet was short-sighted. PointNet++ sees the detail. Grasp the Set Abstraction Layers.
N points FPS Centroids Group Neighborhoods PointNet Features Repeat Hierarchical Set Abstraction 1. Input 2. Sampling 3. Grouping 4. Encoding
"PointNet's flaw? It processes the cloud globally. PointNet++ brings in hierarchy: learn local features at small scale, then group them, the way a CNN does with its receptive fields."

1. The Hierarchical Architecture

PointNet++ splits the cloud into increasingly larger zones. The base unit is the Set Abstraction Level (SA), made of three steps:

  1. Sampling: Pick representative centroids (FPS).
  2. Grouping: Find the neighbors around these centroids (Ball Query).
  3. PointNet: Encode these neighborhoods into feature vectors.

2. Furthest Point Sampling (FPS)

FPS iteratively selects the point farthest from the already-chosen set, guaranteeing uniform spatial coverage of the cloud. This is the sampling step of the Set Abstraction Layer.

fps_alg.py
def farthest_point_sample(xyz, npoint):
    # xyz: [B, N, 3]
    centroids = torch.zeros(B, npoint, dtype=torch.long).cuda()
    distance = torch.ones(B, N).cuda() * 1e10
    farthest = torch.randint(0, N, (B,), dtype=torch.long).cuda()

    for i in range(npoint):
        centroids[:, i] = farthest
        centroid = xyz[batch_indices, farthest, :].view(B, 1, 3)
        dist = torch.sum((xyz - centroid) ** 2, -1)
        mask = dist < distance
        distance[mask] = dist[mask]
        farthest = torch.max(distance, -1)[1]

    return centroids
Florent's Tip: The number of centroids in each Set Abstraction layer sets the detail/globality trade-off. A good heuristic: divide by 4 at each layer (e.g., 1024 -> 256 -> 64 -> 16). For indoor scenes, start with N/4 centroids where N is the input point count.

3. Multi-Scale Grouping (MSG)

The PointNet++ solution: look at multiple scales (e.g., radius 0.1m and 0.4m) at once and concatenate the results.

4. Feature Propagation (Upsampling)

For segmentation, you need to propagate the features learned at higher levels back to each original point. Inverse-distance weighted interpolation performs this upsampling.

propagation.py
def feature_interpolate(points, centroids, feats):
    # Find the 3 nearest centroids for each original point
    dists, idx = k_nearest_neighbors(points, centroids, k=3)
    inv_dists = 1.0 / (dists + 1e-8)
    norm = torch.sum(inv_dists, dim=2, keepdim=True)
    weights = inv_dists / norm

    # Weighted sum of features
    interpolated_feats = torch.sum(group_features * weights.view(B, N, 3, 1), dim=2)
    return interpolated_feats
Warning: Inverse-distance feature propagation is sensitive to low-density zones. If your cloud has holes (occlusions), the features propagated in those areas will be dominated by distant points and therefore unreliable. Pre-filter these zones with a local density test.

State of the Art

ModelmIoU (S3DIS)Inference Speed
PointNet47.6%150 fps (Ultra Fast)
PointNet++54.5%35 fps (Fast)
RandLA-Net62.4%60 fps (Very Fast)
Point Transformer70.1%5 fps (Heavy)
3D Gaussian Splatting

Part V. Neural Rendering and 3D Reconstruction

Module 5: Gaussian Splatting, NeRF, and Photogrammetry

Neural networks aren't just for classification. They can also generate whole 3D scenes from plain photographs. This part covers the three 3D reconstruction and rendering approaches: 3D Gaussian Splatting, NeRFs, and classical photogrammetry revisited through Python. You'll learn how to build photorealistic 3D models from images.


Chapter 15. 3D Gaussian Splatting: The Rendering Revolution

Forget meshes, forget NeRFs. Welcome to the age of gaussian rasterization.
"3DGS isn't just an incremental improvement. It's a shift that allows photorealistic rendering at 100+ FPS in 1080p, where NeRFs struggled to hit 5 FPS."
Camera 3D Gaussians Splat Rendered Image 3D Gaussian Splatting: Concept Differentiable rasterization of 3D gaussians

1. The Theory: Why Gaussians?

Unlike voxels (discrete) or meshes (rigid), a 3D Gaussian is a soft primitive defined by: a mean position, a covariance matrix (stretch/rotation), an opacity, and a view-dependent color (Spherical Harmonics).

2. Initialization: The Role of COLMAP

3DGS starts from a sparse point cloud generated by Structure-from-Motion. COLMAP is the reference tool for this initial reconstruction, which provides camera poses and the starting 3D points.

sfm_prep.sh
# Feature extraction and matching
colmap feature_extractor --database_path db.db --image_path images/
colmap exhaustive_matcher --database_path db.db

# Sparse reconstruction
colmap mapper --database_path db.db --image_path images/ --output_path sparse/

3. The Training Loop (Optimization Loop)

Training iteratively optimizes each gaussian's parameters (position, covariance, opacity, color) by minimizing the difference between the rendered image and the real photo through backpropagation.

Rasterize Gaussians Compare L1 + SSIM Backprop Gradients Update Params Adam optim. 30,000 iterations (7 min on RTX 4090)
train_step.py
def training_step(iteration):
    # 1. Differentiable Rasterization
    rendered_image = rasterizer(viewpoint_cam, pc)

    # 2. Loss computation
    gt_image = viewpoint_cam.original_image
    loss = (1.0 - lambda_dssim) * l1_loss(rendered_image, gt_image) + \
           lambda_dssim * (1.0 - ssim(rendered_image, gt_image))

    # 3. Backprop & Optimizer Step
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # 4. Densification (Adaptive Control)
    if iteration % densification_interval == 0:
        pc.densify_and_prune(grad_threshold, min_opacity, extent)

4. Compression and Export

AttributeRaw TypeCompressed TypeGain
Positionfloat32 (12B)float16 (6B)50%
Rotationfloat32 (16B)uint8 (4B)75%
SH (Color)float32 (45 * 4B)Codebook Index (1B)99%
Florent's Tip: To deploy 3D Gaussian Splatting in web production, the compressed .splat format (16 bits per attribute instead of 32) cuts size by 60% with no perceptible visual loss. The antimatter15/splat implementation on GitHub is the reference for WebGL rendering.

5. The Python Ecosystem

Pro Tip: Use Nerfstudio to get started. Their pipeline wires up Colmap, training, and a web socket viewer (Viser) automatically.
ns-train splatfacto --data data/my_scene
3D Gaussian Splatting delivers impressive real-time rendering. But how does it stack up against NeRF, its predecessor that kicked off the neural rendering wave?

Chapter 16. NeRF vs 3D Gaussian Splatting: The Definitive 2026 Comparison

Two approaches, one goal: photorealistic 3D rendering. Deep technical analysis with benchmarks, code, and architecture recommendations.
NeRF MLP Implicit Volume Ray Marching (slow) VS 3DGS Explicit Gaussians Rasterization (fast) NeRF (Implicit) vs 3DGS (Explicit)
"The NeRF vs 3DGS debate isn't a fight. It's evolution. Implicit ray marching gradually gives way to explicit splatting, but each approach keeps irreducible advantages depending on context."

1. Fundamentals: Implicit vs Explicit

The core difference between NeRF and 3DGS lies in how they represent the scene: an implicit neural network versus explicit geometric primitives. This architectural choice dictates every downstream property.

CriterionNeRF (Implicit)3DGS (Explicit)
RepresentationMLP / Hash gridAnisotropic 3D Gaussians
RenderingRay marching (slow)Differentiable rasterization (fast)
EditabilityLow (network weights)High (direct manipulation)
Model size~50 MB~200-800 MB

2. NeRF Pipeline with Nerfstudio

Nerfstudio unifies the NeRF workflow end-to-end: data prep via COLMAP, model training, and interactive visualization. Two commands are enough to get a result.

train_nerf.sh
# Data preparation with COLMAP
ns-process-data images --data ./photos/ --output-dir ./processed/

# Nerfacto training
ns-train nerfacto --data ./processed/ \
    --max-num-iterations 30000

3. 3D Gaussian Splatting Pipeline

For finer control than Nerfstudio, the gsplat library exposes the differentiable rasterizer directly in Python. Here's how to run a custom render from raw gaussians.

gsplat_custom.py
import torch
from gsplat import rasterization

def render_gaussians(means3d, scales, rotations, opacities, sh_coeffs, camera):
    """Differentiable rasterization rendering with gsplat."""
    rendered_image, rendered_alpha, info = rasterization(
        means=means3d, quats=rotations,
        scales=scales, opacities=opacities,
        colors=sh_coeffs,
        viewmats=camera.world_to_cam[None],
        Ks=camera.intrinsics[None],
        width=camera.width, height=camera.height,
        sh_degree=3,
    )
    return rendered_image, rendered_alpha

4. Comparative Benchmark

The numbers speak for themselves. This benchmark on the Mip-NeRF 360 dataset compares the four main methods across visual quality, training time, and rendering performance.

MethodPSNR (dB)SSIMTrain (min)FPS 1080pSize (MB)
Nerfacto28.40.874350.848
Instant-NGP29.10.89184.255
3DGS30.20.91822148410
3DGS-MCMC30.50.92228135320
Going further: PSNR and SSIM don't capture the perceptual quality of 3D renders well. Use LPIPS (Learned Perceptual Image Patch Similarity) as a complementary metric: it correlates much better with human judgment. Install with pip install lpips and evaluate with lpips.LPIPS(net='vgg').

5. Use Cases: When to Pick What

Use caseRecommendationReason
Real estate virtual tour3DGSReal-time rendering required (>60 FPS)
Infrastructure inspectionNeRF (Nerfacto)Better handling of specular reflections
Urban digital twin3DGSEditability and scalability
Heritage documentation3DGS + NeRF hybridMax quality + compact archiving
AI training (synthetic data)NeRFFine control over lighting

6. Production Optimization: Compression and Streaming

A raw 3DGS model can weigh hundreds of MB. Pruning low-opacity gaussians and quantizing spherical harmonics drastically shrink the size for web and mobile deployment.

compress_3dgs.py
def compress_gaussians(model_path, output_path, target_ratio=0.2):
    """Compress a 3DGS model via pruning + quantization."""
    data = torch.load(model_path)
    means = data["means"]
    opacities = data["opacities"]

    # Prune by opacity
    opacity_threshold = torch.quantile(opacities, 1.0 - target_ratio)
    keep_mask = opacities > opacity_threshold

    # Prune by size
    scales = data["scales"]
    size_mask = scales.max(dim=1).values < 0.5
    final_mask = keep_mask & size_mask

    # Quantize SH (float32 -> float16)
    compressed = {}
    for key, val in data.items():
        if "sh" in key:
            compressed[key] = val[final_mask].half()
        else:
            compressed[key] = val[final_mask]

    torch.save(compressed, output_path)

7. 2026 Trends: Toward Convergence

  • 4D Gaussian Splatting: Dynamic scenes with temporal deformation of gaussians.
  • LangSplat: Semantic embeddings directly inside gaussians.
  • Neural compression: Learned codecs that shrink 3DGS models by 10x.
  • Relighting: PBR material decomposition for realistic re-lighting.
NeRF and 3DGS start from photos. But how do you capture those images and turn them into metric 3D geometry? That's the domain of photogrammetry.

Chapter 17. Python Photogrammetry: The Art of 3D Reconstruction

Turning plain 2D images into dense 3D models through Structure-from-Motion (SfM) and Multi-View Stereo (MVS).
Cam 1 Cam 2 Cam 3 3D Mesh Multi-View Reconstruction (SfM + MVS)
"Photogrammetry isn't magic, it's pure projective geometry. If you get triangulation, you can rebuild the world with a smartphone."

1. The SfM Pipeline (Structure-from-Motion)

SfM is the beating heart of the pipeline. It recovers camera positions (Motion) and scene geometry (Structure) at the same time.

2. Automating with COLMAP and Python

COLMAP runs from the command line, which makes it easy to script in Python through subprocess. This pattern lets you slot SfM reconstruction into an automated batch processing pipeline.

sfm_batch.py
import subprocess

def run_colmap(project_path):
    database = project_path + "/database.db"
    images = project_path + "/images"

    # 1. Feature Extraction
    subprocess.run(["colmap", "feature_extractor",
        "--database_path", database,
        "--image_path", images])

    # 2. Exhaustive Matching
    subprocess.run(["colmap", "exhaustive_matcher",
        "--database_path", database])

    # 3. Mapper (Reconstruction)
    subprocess.run(["colmap", "mapper",
        "--database_path", database,
        "--image_path", images,
        "--output_path", project_path + "/sparse"])
Heads up: COLMAP fails silently if your images lack EXIF metadata (focal length, camera model). Before you launch a pipeline, check with exiftool -FocalLength *.jpg. If EXIF data is missing (video frames, screenshots), set the camera model manually with --ImageReader.camera_model PINHOLE.

3. Multi-View Stereo (MVS): Going Dense

Tech Tip: MVS is a serious VRAM hog. Make sure you've got a CUDA-capable NVIDIA card with at least 8 GB of VRAM if you want to process 4K footage.

4. Post-Processing and Meshing

The dense cloud coming out of MVS needs to be turned into a triangle mesh before you can export it to CAD or visualization software. Poisson reconstruction builds a smooth, watertight mesh from the estimated normals.

mesh_gen.py
import open3d as o3d

def poisson_mesh(pcd_path, depth=9):
    pcd = o3d.io.read_point_cloud(pcd_path)
    pcd.estimate_normals()
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=depth)
    vertices_to_remove = densities < np.quantile(densities, 0.1)
    mesh.remove_vertices_by_mask(vertices_to_remove)
    return mesh

Summary: When to Use Photogrammetry

TechnologyHardware CostAccuracyUse Case
Terrestrial LiDAR$$$$ (20k+)MillimeterIndustry, Metrology
Photogrammetry$ (Smartphone)CentimeterHeritage, Virtual Tours
iPhone LiDAR$$ (1k)DecimeterFast scanning, Indoor AR
IA 3D NLP BIM API DB Spatial AI Agent

Part VI. Advanced Spatial AI

Modules 5-6: Foundation Models, AI Agents and LangChain 3D

We're stepping into the most cutting-edge territory of 3D Data Science. Foundation Models enable zero-shot segmentation of 3D scenes. Spatial AI agents reason in natural language over geometric data. LangChain wires LLMs into your 3D pipelines. This part pushes you to the frontier of what's possible in 2026.


Chapter 18. Foundation Models for 3D: Zero-Shot Segmentation in 2026

Where foundation models meet geometry: understanding, segmenting and reasoning about the 3D world without task-specific training.
"In 2026, a Foundation Model doesn't just 'see' an image anymore, it understands space. Zero-shot 3D segmentation dissolves the line between human perception and spatial AI."
Image "Find the chair" Text Foundation Model (CLIP + SAM 3D) 3D Mask Zero-Shot 3D Segmentation Natural language query to spatial segmentation

1. The Era of Spatial Foundation Models

Foundation Models reshaped text and image processing. In 2026, that shift is hitting 3D. Architectures like SAM 3D, OpenScene and LERF now segment point clouds with zero training on the target domain.

2. CLIP and the Vision-Language Bridge to 3D

CLIP (Contrastive Language-Image Pre-training) lines up text and images in a shared embedding space. Project those features onto a point cloud and every point picks up an open-vocabulary semantic "understanding".

clip_feature_fusion.py
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-L/14", device="cuda")

def extract_clip_features(images_list):
    features = []
    for img_path in images_list:
        image = preprocess(Image.open(img_path)).unsqueeze(0).to("cuda")
        with torch.no_grad():
            feat = model.encode_image(image)
        features.append(feat.cpu().numpy())
    return np.vstack(features)

def text_query_embedding(text_prompt):
    tokens = clip.tokenize([text_prompt]).to("cuda")
    with torch.no_grad():
        text_feat = model.encode_text(tokens)
    return text_feat.cpu().numpy()

3. OpenScene: Multi-View Fusion for 3D Understanding

OpenScene fuses CLIP features pulled from multiple 2D views to enrich every 3D point with a semantic embedding. The payoff: a natural-language text query is enough to segment the cloud.

openscene_pipeline.py
from openscene import OpenSceneModel

model = OpenSceneModel(backbone="ViT-L/14", feature_dim=768, fusion_strategy="weighted_average")

point_features = model.compute_3d_features(
    points=points, images_dir="./renders/",
    poses_file="./poses.json", intrinsics_file="./intrinsics.json"
)

query = "fire extinguisher"
scores = model.query(point_features, query)
mask = scores > 0.75
result = pcd.select_by_index(np.where(mask)[0])
MethodTypeVocabularyScanNet mIoU
PointNet++ (supervised)Closed-set20 classes53.5%
OpenSceneOpen-vocabularyUnlimited47.2%
SAM 3DZero-shotUnlimited51.1%

4. LERF: Language Embedded Radiance Fields

LERF bakes CLIP embeddings straight into a NeRF radiance field. That lets you generate 3D relevancy maps from a text query without any explicit segmentation step up front.

lerf_query.py
from nerfstudio.pipelines.base_pipeline import Pipeline

pipeline = Pipeline.load("outputs/scene/lerf/2026-02-15/")
model = pipeline.model

query_text = "emergency exit door"
relevancy_map = model.get_relevancy(query=query_text, resolution=512, scale_range=(0.1, 1.5))
spatial_mask = relevancy_map > 0.6
Going further: LERF (Language Embedded Radiance Fields) encodes CLIP embeddings directly in the radiance field. That gives you 3D text queries without any prior segmentation. The big catch is training time: plan on 4-8h on an RTX 4090 for a mid-size scene, versus 7 minutes for a standard 3DGS.

5. SAM 3D: Segment Anything in Space

SAM 3D extends Meta's Segment Anything model into the third dimension by merging 2D masks produced across multiple views into a 3D instance segmentation without any task-specific training.

sam3d_segmentation.py
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from sam3d import MultiViewFusion, MaskMerger

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
mask_gen = SamAutomaticMaskGenerator(model=sam, points_per_side=32, pred_iou_thresh=0.86)

fusion = MultiViewFusion(pcd_path="scene.ply", images_dir="./views/", depth_dir="./depth/", poses_file="./camera_poses.json")
all_masks = fusion.generate_2d_masks(mask_gen)

merger = MaskMerger(iou_threshold=0.5, min_views=3)
segments_3d = merger.merge(all_masks, fusion.point_cloud)
print(f"3D segments detected: {len(segments_3d)}")
Florent's Tip: SAM (Segment Anything) segments instances but doesn't label them. The best pipeline pairs SAM for the masks with CLIP for semantic labeling. This "SAM + CLIP" combo holds up better than end-to-end models because you can update each component independently.

6. Production Pipeline: Multi-Model Orchestration

In production you orchestrate Foundation Models as a pipeline: SAM 3D segments instances, CLIP classifies them, and everything feeds a scene graph. This class wraps the whole workflow.

Image Encoder Text Encoder Fusion Multi-Modal 3D Backbone Segmentation 3D Semantic Multi-Modal Pipeline: Vision + Language to 3D Understanding
production_pipeline.py
class ZeroShotPipeline:
    """Complete zero-shot pipeline for 3D understanding."""

    def __init__(self, sam_ckpt, clip_model="ViT-L/14"):
        self.sam = self._load_sam(sam_ckpt)
        self.clip = self._load_clip(clip_model)

    def process_scene(self, pcd_path, images_dir, categories=None):
        # Step 1: Geometric segmentation (SAM 3D)
        segments = self.segment_3d(pcd_path, images_dir)
        # Step 2: Open-vocabulary classification (CLIP)
        if categories:
            segments = self.classify_segments(segments, categories)
        # Step 3: Scene graph construction
        scene_graph = self.build_scene_graph(segments)
        return scene_graph

pipeline = ZeroShotPipeline(sam_ckpt="sam_vit_h.pth")
graph = pipeline.process_scene(
    pcd_path="building_scan.ply", images_dir="./captures/",
    categories=["wall", "floor", "ceiling", "door", "window", "furniture"]
)
Foundation Models understand 3D scenes. Next step: giving them the ability to act, reason and run spatial tasks on their own.

Chapter 19. Deploying Spatial AI Agents in the Enterprise: A Practical Guide

A full architecture to build, test and deploy AI agents that reason over 3D data in production.
User NL Query LLM Reasoning ? Measure 3D Classify Export IFC Results to Next Step Agent Loop: Reasoning and Action
"A Spatial AI Agent doesn't just answer questions. It perceives space, reasons about geometry and acts on the physical world through its tools. That's the shift from conversational AI to operational AI."

1. What Is a Spatial AI Agent?

A Spatial AI Agent is an autonomous system that pairs an LLM with 3D perception and reasoning tools. It can query point clouds, walk scene graphs and run geometric operations in response to natural-language requests.

2. 3D Scene Graph: The Spatial Knowledge Structure

The Scene Graph is the agent's core data structure: a graph where every node is a 3D object and every edge encodes a spatial relation (proximity, containment, adjacency).

scene_graph.py
@dataclass
class SpatialNode:
    id: str
    label: str
    centroid: np.ndarray
    bbox_extent: np.ndarray
    volume: float

class SceneGraph:
    def compute_spatial_relations(self):
        nodes = list(self.nodes.values())
        for i, a in enumerate(nodes):
            for b in nodes[i+1:]:
                dist = np.linalg.norm(a.centroid - b.centroid)
                if dist < 2.0:
                    rel = self._classify_relation(a, b)
                    self.edges.append(SpatialEdge(a.id, b.id, rel, dist))

3. Building the Agent with LangGraph

LangGraph gives you the framework to build a ReAct agent that picks and chains geometric tools automatically based on the user's natural-language request.

spatial_agent.py
from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def find_objects(label: str, max_results: int = 10) -> str:
    """Find 3D objects matching a label in the scene."""
    results = scene_graph.query_by_label(label, limit=max_results)
    return json.dumps([{"id": r.id, "label": r.label} for r in results])

@tool
def measure_distance(object_id_a: str, object_id_b: str) -> str:
    """Compute the Euclidean distance between two 3D objects."""
    a = scene_graph.nodes[object_id_a]
    b = scene_graph.nodes[object_id_b]
    dist = float(np.linalg.norm(a.centroid - b.centroid))
    return f"Distance: {dist:.2f} meters"

llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = create_react_agent(model=llm, tools=[find_objects, measure_distance])

4. REST API with FastAPI

To expose the agent in production, FastAPI gives you a fast, self-documenting REST API. Each user query routes to the agent, which orchestrates the tools and returns a structured response.

api_server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI(title="Spatial AI Agent API", version="1.0.0")

class QueryRequest(BaseModel):
    question: str = Field(..., example="Which fire extinguishers are more than 15m from an exit?")
    scene_id: str = Field(..., example="building_floor_3")

@app.post("/api/v1/query")
async def query_scene(request: QueryRequest):
    result = agent.invoke({"messages": [{"role": "user", "content": request.question}]})
    return {"answer": result["messages"][-1].content}
Florent's Tip: For production AI agent endpoints, always add a timeout and a circuit breaker. A hallucinating LLM can spin up infinite tool-call loops. Use asyncio.wait_for() with a 30-second timeout per step, and cap the total tool calls at 10 per request.

5. Production Infrastructure

ComponentTechnologyRole
API GatewayTraefik / KongRate limiting, JWT auth
Agent RuntimeFastAPI + LangGraphBusiness logic
Scene Graph DBNeo4jSpatial relationship storage
CacheRedisFrequent queries
MonitoringPrometheus + GrafanaMetrics and alerts

6. Error Handling and Guardrails

A production AI agent has to validate its own answers. Guardrails independently recompute the numerical values the LLM reports, so geometric hallucinations get caught.

guardrails.py
class SpatialGuardrails:
    def validate_distance(self, claimed_distance, object_ids):
        actual = self.compute_actual_distance(*object_ids)
        error = abs(claimed_distance - actual) / actual
        if error > 0.1:
            return ValidationResult(is_valid=False,
                reason=f"Incorrect distance: {claimed_distance:.2f}m vs {actual:.2f}m")
        return ValidationResult(is_valid=True)
Pitfall: LLMs hallucinate numbers all the time. Never trust a distance or an area reported by the LLM without recomputing it through a geometric tool.
The agent architecture is set. Time for concrete implementation with LangChain, where every geometric tool becomes a function the LLM can call.

Chapter 20. Spatial Agent with LangChain: AI That Reasons in 3D

Where Natural Language Processing meets Computational Geometry.
"The future isn't clicking buttons. The future is saying: 'Hey AI, find every cracked wall over 2 meters and export them to IFC'."
User Query LLM (Reasoning) Tools RANSAC IFC Export 3D Measure Result Structured response LangChain 3D Agent Architecture The LLM orchestrates the geometric tools

1. The Concept: LLM as a Controller

A model like GPT-4 doesn't "see" 3D. But it's great at logical planning and tool calling. The goal is to build a text interface where the LLM acts as the site foreman, telling specialized Python scripts to run the geometric analysis.

2. Building the Geometric Tools

Each LangChain tool is a Python function decorated with @tool, and its docstring is what the LLM reads as a description. The quality of that docstring directly decides whether the agent picks the right tool.

geo_tools.py
from langchain.tools import tool

@tool
def count_objects_in_scene(category: str) -> str:
    """Count the number of objects of a given category in the loaded scene."""
    count = db.query(f"SELECT COUNT(*) FROM objects WHERE label='{category}'")
    return f"There are {count} {category}s in the scene."

@tool
def get_object_dimensions(object_id: int) -> str:
    """Return the dimensions (L x W x H) of a specific object."""
    bbox = get_bbox(object_id)
    return f"Dimensions: {bbox.extent} m"
Prompt Engineering: The function docstring carries real weight. Spell out what the tool takes as input and what it returns. It's the "user manual" for the AI.

3. The Agent Architecture

We use an OpenAI Functions Agent. The agent takes the user's question, decides which tool to call, reads back the tool output (Observation), and puts together a final answer.

4. Example Conversation Flow

Query: "Is there enough room for a 3m wide truck between buildings A and B?"

  1. Thought: I need to find the distance between building A and B.
  2. Action 1: find_object_id(name="Building A") -> ID: 412
  3. Action 2: find_object_id(name="Building B") -> ID: 890
  4. Action 3: measure_min_distance(id1=412, id2=890) -> 2.85m
  5. Final Answer: No, the minimum distance is 2.85m, not enough for a 3m truck.
agent_build.py
from langchain.agents import initialize_agent, AgentType
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-4-turbo")
tools = [count_objects_in_scene, get_object_dimensions, measure_distance]

agent = initialize_agent(tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=True)
agent.run("Analyze the scan.laz file and list the potential conflicts.")
Heads up: LangChain agents with access to geometric tools can trigger expensive operations (e.g. computing the convex hull of 100M points). Put guardrails on input size: check len(points) < MAX_POINTS in every tool before execution. An unconstrained agent can bring a server to its knees in a single request.

5. Limitations and Security

Since we run Python code (sometimes generated), sandboxing matters a lot. Never give your agent os.system access. Start by limiting the tools to read-only functions.

Drone LiDAR Acquisition

Part VII. Field Applications

Module 6: Drone acquisition and production pipeline

Theory is in hand. Now it's end-to-end practice. This final part covers two field applications: the full processing of a drone LiDAR survey (from flight to digital twin) and using AI assistants to speed up your 3D pipeline development. You'll learn to orchestrate everything you've picked up into real production workflows.


Chapter 21. Drone LiDAR Processing with Python: From Flight to Digital Twin

A complete industrial pipeline: drone acquisition, LAZ processing, automatic classification, DTM/DSM generation and integration into a BIM digital twin.
"Drone LiDAR has democratized precision 3D capture. In 2026, an engineer armed with Python can turn a 15-minute flight into a metric digital twin an entire project team can put to work."
Swath Alt. Flight Path Drone LiDAR Pipeline: Acquisition & Terrain Coverage

1. Flight Planning and Acquisition Parameters

The quality of a drone LiDAR survey is decided before takeoff. Getting the physics of acquisition right is how you size a flight properly and hit the point density your spec demands. Four parameters interact to set ground density: the Pulse Repetition Rate (PRR), the flight altitude (AGL, Above Ground Level), the drone's ground speed, and the scanner's Field of View (FOV).

The principle is geometric. At every instant, the LiDAR scanner fires a pulsed laser beam that sweeps perpendicular to the drone's trajectory. The resulting swath width scales with altitude and aperture angle. Lower altitude means a narrower swath, but higher ground density, because the same pulses concentrate on a smaller surface. Climb higher and you cover more ground per flight line, at the cost of density. Ground speed works the same way: slow the drone down and you get more pulses per linear meter on the ground. In practice, the density equation is simple: density = PRR / (speed x swath). With a 240 kHz sensor at 60 m AGL and 4 m/s, you typically land 500 to 800 pts/m2, plenty for most topographic and BIM applications.

Beyond pure physics, operational constraints shape planning heavily. Weather is a deal-breaker: wind above 25 km/h compromises drone stability and therefore INS/GNSS positioning accuracy. Rain is a no-go, since droplets create parasitic returns (noise). Time of day also matters: for vegetation surveys, flying during peak foliage cuts signal penetration to the ground. GNSS constellation visibility (GPS, GLONASS, Galileo, BeiDou) has to be checked ahead of time with PDOP prediction tools (Position Dilution of Precision). A PDOP above 3 degrades post-processed accuracy. Finally, battery life is still the main limiting factor in 2026: count on 20 to 35 minutes of effective flight depending on drone and sensor. A 10-hectare site typically needs 2 to 3 batteries, and you'll need to plan recharge time.

Sensor choice, DJI L2 vs Riegl miniVUX: The DJI Zenmuse L2 (paired with the Matrice 350 RTK) is the pragmatic pick: 240 kHz, 5 returns, vertical accuracy around 3 cm, accessible price and a seamless DJI Terra workflow. It's a great fit for everyday topography and small to mid-size BIM projects. The Riegl miniVUX-3UAV is the professional pick: 200 kHz but with a circular scan pattern that multiplies incidence angles, full-waveform technology for better vegetation penetration, accuracy around 1.5 cm. Its price (3 to 5 times the DJI) pays off for forestry projects, linear corridors (power lines) and missions where centimeter accuracy is contractual. Rule of thumb: if the spec demands vertical RMSE below 3 cm, go Riegl.
flight_params.py
from dataclasses import dataclass

@dataclass
class FlightPlan:
    altitude_m: float = 80.0
    speed_ms: float = 5.0
    overlap_pct: float = 70.0
    scan_rate_hz: int = 240000
    fov_deg: float = 70.0

    @property
    def swath_width(self):
        import math
        return 2 * self.altitude_m * math.tan(math.radians(self.fov_deg / 2))

    @property
    def point_density(self):
        return self.scan_rate_hz / (self.speed_ms * self.swath_width)

plan = FlightPlan(altitude_m=60, speed_ms=4)
print(f"Density: {plan.point_density:.0f} pts/m2")

2. LAZ File Ingestion and Quality Control

A typical drone LiDAR flight produces between 10 and 50 GB of raw data, depending on duration, point density and number of returns. The files ship in LAZ format (LASzip), the compressed version of the LAS format standardized by ASPRS (American Society for Photogrammetry and Remote Sensing). Getting the difference between the two formats matters: a LAS file stores coordinates, intensities, classifications and returns as uncompressed binary. The same dataset in LAZ takes 5 to 10 times less disk space thanks to Martin Isenburg's lossless compression algorithm. In production, always work in LAZ for storage and transfer, and only decompress to LAS when a tool demands it.

Before any processing, checking the Coordinate Reference System (CRS) is a critical step. Onboard LiDAR sensors record raw coordinates in the navigation frame (usually geocentric WGS84), and the post-processing software (DJI Terra, POSPac, TerraSolid) projects them into the target system. In France, the standard system is RGF93 / Lambert-93 (EPSG:2154) with elevations in IGN69 (NGF). Always check the LAZ file header to confirm the CRS before merging flight strips.

Heads up, datum confusion: A common trap is mixing up ellipsoidal heights (WGS84) and orthometric heights (IGN69/NGF). In mainland France the difference swings from 44 to 55 meters depending on geographic position. If your DTMs show a systematic vertical offset of about 47 m, it's almost certainly a vertical datum issue. Use IGN's RAF20 conversion grid to switch between them. In Python, the pyproj library handles these transformations via Transformer.from_crs() with the right geoid grid.
ingest_laz.py
def ingest_drone_laz(laz_dir, output_path):
    laz_files = sorted(Path(laz_dir).glob("*.laz"))
    all_points = []
    for laz_file in laz_files:
        las = laspy.read(str(laz_file))
        if las.header.point_count == 0:
            continue
        coords = np.vstack((las.x, las.y, las.z)).T
        valid = (coords[:, 2] > 0) & (coords[:, 2] < 500)
        all_points.append(coords[valid])
    merged = np.vstack(all_points)
    return merged

3. Automatic Classification with PDAL

Point classification is where the raw cloud turns into semantically rich data. The main goal is to separate ground (ASPRS class 2) from everything else: vegetation, buildings, power lines. That split is the prerequisite for any topographic analysis. No correctly classified ground means no reliable DTM.

The SMRF algorithm (Simple Morphological Filter), implemented in PDAL via filters.smrf, is one of the strongest performers for ground classification on drone data. It works on mathematical morphology: it applies a series of progressive morphological openings on a rasterized grid of the point cloud. With each iteration the window grows, which gradually erases objects above the ground. The three key parameters are: slope (max tolerated slope between two neighbor cells, typically 0.15 to 0.30 for rolling terrain), window (max window size in meters, set based on the largest structure to remove, like a 18 m wide building), and threshold (vertical tolerance in meters, 0.5 m is a good starting point for drone data). Slope too low and you'll classify embankments as non-ground; window too small and you'll leave building roofs in the ground class.

A popular alternative is CSF (Cloth Simulation Filter), which simulates a virtual cloth falling onto the inverted point cloud. The cloth "drapes" over the terrain and points close to the cloth get classified as ground. CSF is often more intuitive to tune and handles very rugged terrain better. PDAL ships it via filters.csf. In practice, the best approach is to run both algorithms on a sample and visually compare the results. For flat urban terrain, SMRF shines; for natural terrain with steep slopes, CSF often wins.

After ground classification, the filters.hag_nn filter computes HAG (Height Above Ground) for every point in the cloud. HAG is the elevation difference between a point and the ground interpolated beneath it (by nearest neighbor). That extra dimension is key for vegetation analysis: it lets you tell apart low vegetation (HAG < 2 m), medium vegetation (2-5 m) and high vegetation (> 5 m) without knowing the absolute terrain elevation. HAG is also the basis for computing the CHM (Canopy Height Model), a foundational derived product in forestry.

classify_pdal.py
import pdal, json

def classify_ground(input_laz, output_laz):
    pipeline_json = {
        "pipeline": [
            input_laz,
            {"type": "filters.smrf", "slope": 0.15, "window": 18, "threshold": 0.5},
            {"type": "filters.hag_nn"},
            {"type": "writers.las", "filename": output_laz, "compression": "laszip"}
        ]
    }
    pipeline = pdal.Pipeline(json.dumps(pipeline_json))
    count = pipeline.execute()
    return pipeline.arrays[0]
ASPRS CodeClassDescription
2GroundNatural terrain
3Low vegetation0.3m - 2.0m
5High vegetation2.0m - 30.0m
6BuildingStructures
Florent's Tip: For drone surveys in urban areas, chain SMRF (ground) with filters.elm (Extended Local Minimum) to refine classification on steep slopes. Flat roofs are the classic trap: SMRF often labels them as ground if window is too small. Push window to at least 25 m in urban contexts.

4. DTM/DSM Generation with Rasterio

The DTM (Digital Terrain Model) represents the bare ground surface, stripped of all vegetation and construction. The DSM (Digital Surface Model) represents the highest visible surface, including buildings, vegetation and any object above the ground. The difference between DSM and DTM gives the nDSM (normalized DSM), the raster equivalent of HAG, which reveals the height of objects above the terrain.

Going from classified point cloud to raster requires interpolation. You've got three main methods. Linear interpolation (based on Delaunay triangulation) is fast and respects the exact values at known points. It's the recommended default. IDW (Inverse Distance Weighting) is smoother but can produce "bull's-eye" artifacts around isolated points. The TIN (Triangulated Irregular Network) approach preserves the exact cloud topology but shows visible triangle facets at low resolution. Picking the raster resolution is a trade-off: at 0.1 m you catch micro-relief but the GeoTIFF file is huge; at 1.0 m the file is compact but fine details disappear. For typical drone surveys (densities above 100 pts/m2), resolutions of 0.25 to 0.50 m are usually the sweet spot.

DTM Bare ground only Ground surface DSM Visible surface (ground + objects) nDSM Visible surface DTM (bare ground) DSM (surface) nDSM = DSM - DTM
GDAL alternative: For very large datasets or if you prefer staying in the command-line ecosystem, gdal_grid and gdal_translate give you solid rasterization pipelines. PDAL can also write rasters directly via writers.gdal, which skips the intermediate Python export step. That said, Rasterio is still the best pick for fine-grained control in Python, especially for CRS handling, NoData values and compression profiles.
dtm_dsm_generator.py
import rasterio
from scipy.interpolate import griddata

def generate_raster(points, values, resolution, output_path, crs="EPSG:2154"):
    x_min, y_min = points[:, :2].min(axis=0)
    x_max, y_max = points[:, :2].max(axis=0)
    cols = int((x_max - x_min) / resolution)
    rows = int((y_max - y_min) / resolution)
    xi = np.linspace(x_min, x_max, cols)
    yi = np.linspace(y_min, y_max, rows)
    grid_x, grid_y = np.meshgrid(xi, yi)
    grid_z = griddata(points[:, :2], values, (grid_x, grid_y), method="linear")

    transform = from_bounds(x_min, y_min, x_max, y_max, cols, rows)
    with rasterio.open(output_path, "w", driver="GTiff",
            height=rows, width=cols, count=1, dtype="float32",
            crs=crs, transform=transform, compress="lzw") as dst:
        dst.write(grid_z.astype(np.float32), 1)

5. Computing Derived Metrics

A DTM on its own is raw data. Its real value comes out when you derive morphological metrics that feed thematic analyses. The two core derivatives are slope and aspect, computed from the spatial gradient of the elevation raster.

Slope measures terrain steepness in degrees (or percent). It matters for hydrology (runoff and erosion modeling), geotechnics (embankment stability, landslide risk), and urban planning (buildability of parcels). Aspect (or orientation) tells you the direction a slope faces (0-360 degrees, 0 = North). It directly shapes the solar irradiation of a parcel, which is vital information for photovoltaic siting studies, precision agriculture and building energy modeling. Beyond these two classic derivatives, the CHM (Canopy Height Model) is a simple subtraction: CHM = DSM - DTM. The CHM is the reference product in forestry for estimating tree heights, biomass volume and forest cover density.

terrain_metrics.py
def compute_slope_aspect(dtm_path, output_slope, output_aspect):
    with rasterio.open(dtm_path) as src:
        dem = src.read(1)
        res = src.res[0]
    dy, dx = np.gradient(dem, res)
    slope = np.degrees(np.arctan(np.sqrt(dx**2 + dy**2)))
    aspect = np.degrees(np.arctan2(-dy, dx))
    aspect = (aspect + 360) % 360
    return slope, aspect

6. BIM Integration: From Point Cloud to IFC Model

Bringing a drone LiDAR survey into a BIM environment is the step that gives field data its operational value for architecture, engineering and construction teams. The IFC format (Industry Foundation Classes) is the open standard for exchanging digital models. But representing irregular terrain in a format designed mainly for straight-edged building elements brings its own challenges.

In IFC4, terrain is typically represented by an IfcSite containing an IfcGeographicElement with a geometry of type IfcTriangulatedFaceSet, essentially a TIN mesh. The IFC4.3 standard (published in 2024) significantly improves support for linear infrastructure and terrain with new types like IfcGeoScienceModel and better handling of geographic coordinate systems. In practice, the Scan-to-BIM terrain workflow follows three steps: (1) generate a DTM from ground-classified points, (2) create a TIN (Delaunay) mesh at a resolution suited to the BIM context (typically 0.5-2.0 m), and (3) export that mesh to IFC through IfcOpenShell. Subsampling is often necessary because BIM models get heavy above 50,000 terrain triangles. To dig deeper into this methodology, Chapter 4 of this book walks through the full Scan-to-BIM pipeline.

export_ifc.py
def create_terrain_ifc(dtm_points, output_path):
    ifc = ifcopenshell.file(schema="IFC4")
    project = api.run("root.create_entity", ifc, ifc_class="IfcProject", name="Drone Survey")
    site = api.run("root.create_entity", ifc, ifc_class="IfcSite", name="Site")

    step = max(1, len(dtm_points) // 50000)
    sampled = dtm_points[::step]
    from scipy.spatial import Delaunay
    tri = Delaunay(sampled[:, :2])
    ifc.write(output_path)

7. Pipeline Validation and Quality Control

A LiDAR processing pipeline only has value if its results are quantitatively valid. Validation comes from comparing the produced DTM against GCPs (Ground Control Points), points surveyed on the ground with higher accuracy than the drone data (typically RTK GNSS, around 1-2 cm).

GCP placement strategy decides whether validation is reliable. The core principle is representative coverage of the whole study area: GCPs at the edges, in the middle, on flat ground and on slopes. ASPRS recommends a minimum of 20 independent control points (not used for georeferencing) for statistically meaningful validation. You want to separate two kinds of accuracy. Absolute accuracy measures the gap between model coordinates and ground truth (through the GCPs). That's the RMSE reported to the client. Relative accuracy measures internal consistency (for instance, the elevation difference between two adjacent flight strips in the overlap zone). A model can have good relative accuracy (the strips line up) but poor absolute accuracy (the whole model is shifted by 20 cm).

For the client report, the ASPRS Positional Accuracy Standards (2023 edition) defines clear accuracy classes. Always present vertical RMSE, horizontal RMSE (if measured), bias (mean error), standard deviation, and the number of GCPs used. An error distribution plot (histogram) and a spatial residuals map are welcome deliverables that help catch directional biases. In 2026, a vertical RMSE under 5 cm is reachable with an RTK drone and careful PPK post-processing.

validation.py
def validate_dtm(dtm_path, gcp_file):
    gcps = np.loadtxt(gcp_file, delimiter=",", skiprows=1)
    with rasterio.open(dtm_path) as src:
        errors = []
        for x, y, z_gnss in gcps:
            row, col = src.index(x, y)
            z_dtm = src.read(1)[row, col]
            if not np.isnan(z_dtm):
                errors.append(z_dtm - z_gnss)
    errors = np.array(errors)
    rmse = np.sqrt(np.mean(errors**2))
    print(f"RMSE: {rmse:.3f} m")
    return {"rmse": rmse}
MetricAcceptable ThresholdExcellent ThresholdUnit
Vertical RMSE< 0.10< 0.05meters
MAE< 0.08< 0.03meters
Bias< 0.05< 0.02meters
Going further: For rigorous ASPRS-compliant validation, compute not just the RMSE but also the 95th percentile of absolute errors (CE95/LE95). That surfaces systematic errors the RMSE hides. In France, IGN's spec demands LE95 < 0.10 m for class A topographic surveys.
The drone LiDAR pipeline shows how deep 3D production code can run. What if generative AI could speed up every one of those steps?

Chapter 22. AI in the Service of 3D Code: Coding Faster and Smarter

How LLMs and AI assistants are reshaping the way you build 3D pipelines in production.
"The future of spatial development isn't AI replacing the developer. It's the rise of a new profile: the augmented 3D architect, steering generative agents to deliver in hours what used to demand weeks of hand-written code."
Developer Intent + Review Generative AI Code + Explain Prompt Code Review + Refine

1. The AI Pair-Programming Workflow

The rise of generative language models, Claude (Anthropic), GPT-4 (OpenAI), Copilot (GitHub), has reshaped the way developers tackle 3D code. The usual approach is no longer the lone developer wading through dense documentation. It's AI pair-programming: an iterative dialogue between human intent and the machine's generative capacity.

The typical workflow breaks down into four phases. Phase 1, Instruction: the developer describes in natural language what they want (for instance, "load a LAZ file, filter ground points, generate a DTM raster at 50 cm resolution"). Phase 2, Generation: the AI produces a first draft of code, often 80-90% functional. Phase 3, Review: the developer examines the code, checks the consistency of spatial parameters (CRS, units, axes), tests on a sample. Phase 4, Refinement: errors or adjustments get fed back to the AI, which fixes things iteratively.

The areas where generative AI shines in 3D are well known: writing boilerplate code (format reading/writing, pipeline configuration), exploring sparsely documented APIs (IfcOpenShell, pyvista, laspy), converting between formats (LAZ to PLY, OBJ to GLTF), and debugging spatial algorithms (projection errors, rotation issues). On the flip side, AI hits its ceiling on novel geometric algorithms (new mesh simplification methods), domain-specific optimizations (memory handling for 10-billion-point clouds) and hardware tuning (custom CUDA kernels for voxel hashing).

Florent's Tip: Treat AI like an expert colleague who still makes mistakes. It knows the syntax of hundreds of libraries, but it doesn't know your dataset, your CRS, or your production constraints. The developer's role stays that of the conductor.

2. Effective Prompting for 3D

The quality of AI-generated code scales directly with prompt precision. In 3D development, a generic prompt like "process my point cloud" produces generic, unusable code. An effective prompt has to specify five key elements: the input format, the coordinate system, the expected operation, the performance constraints, and the target library.

Here's the gap between a bad prompt and a good one. Bad prompt: "Downsample my point cloud with Open3D." Good prompt: "I have a 50-million-point LAZ file in Lambert-93 (EPSG:2154). I want to decimate this cloud to 1 point per 10 cm voxel with Open3D 0.18. The file is 2 GB, so memory handling matters. Generate Python 3.11 with type hints." The second prompt gives enough context for the AI to produce fit-for-purpose code on the first try.

Another key point: specify the library version. 3D APIs move fast. Open3D 0.17 and 0.18 have different tensor interfaces, laspy 2.x completely changed the API compared to 1.x. Stating the version in the prompt saves hours of debugging over deprecated or renamed methods.

prompt_example.py
# Prompt sent to the AI:
# "Generate a Python function (Open3D 0.18) that loads a PLY file,
#  applies voxel downsampling at 5cm, and saves the result.
#  Log the point count before/after."

# Code generated by the AI:
import open3d as o3d
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def decimate_point_cloud(input_path: str, output_path: str, voxel_size: float = 0.05) -> None:
    """Voxel downsampling with Open3D 0.18."""
    pcd = o3d.io.read_point_cloud(input_path)
    n_before = len(pcd.points)
    logger.info(f"Points before downsampling: {n_before}")

    pcd_down = pcd.voxel_down_sample(voxel_size=voxel_size)
    n_after = len(pcd_down.points)
    logger.info(f"Points after downsampling: {n_after} (ratio: {n_after/n_before:.2%})")

    o3d.io.write_point_cloud(output_path, pcd_down)
    logger.info(f"Saved: {output_path}")
Florent's Tip: Always include in your prompt: (1) the exact library version, (2) the format and rough size of the input data, (3) the coordinate system, and (4) memory/CPU constraints. These four items cut back-and-forth with the AI by about 80%.

3. Spatial Debugging with AI

Debugging spatial code is especially tricky because errors show up visually rather than as Python exceptions. A point cloud that looks "mirrored", an unexpected 90-degree rotation, a BIM model offset by 500 km. These symptoms are hard to diagnose without experience with 3D conventions. That's exactly where generative AI shines.

The spatial debugging pattern with AI is this: describe the visual symptom, not the code. For example: "My LiDAR point cloud loaded from a LAZ file appears upside-down in Open3D (buildings point down)." The AI will immediately spot an axis convention issue: the file uses Y-up (common in CG/gaming) while Open3D expects Z-up. Another classic: "After reprojecting EPSG:4326 to EPSG:2154, all my points clump in the same spot". The AI will diagnose a latitude/longitude swap or a unit issue (degrees vs meters).

Coordinate transformation (CRS) errors account for most spatial bugs in production. The AI knows the classic traps: axis order in EPSG:4326 (lat, lon vs lon, lat depending on the library), datum differences (WGS84 vs ETRS89, negligible in France but not worldwide), and UTM zone mistakes.

debug_crs.py
# AI-assisted debugging session
# Symptom: "My points are shifted ~200m east after reprojection"
# AI diagnosis: datum shift not applied (WGS84 to RGF93 via NTv2 grid)

from pyproj import Transformer
import numpy as np

# INCORRECT: direct transformation without correction grid
# transformer = Transformer.from_crs("EPSG:4326", "EPSG:2154")

# CORRECT: force the pipeline with the NTv2 grid
transformer = Transformer.from_crs(
    "EPSG:4326",
    "EPSG:2154",
    always_xy=True  # Force (longitude, latitude) order
)

# Check against a known point (Eiffel Tower)
lon, lat = 2.2945, 48.8584
x, y = transformer.transform(lon, lat)
print(f"Lambert-93: X={x:.2f}, Y={y:.2f}")
# Expected: X~648240, Y~6862260
# If offset > 1m: check the proj version and grids
Warning: Never trust AI-generated spatial code blindly without visual verification. Always open the result in a 3D viewer (Open3D, CloudCompare, QGIS) and check against at least 3 known control points. A reprojection that "seems to work" can introduce a systematic bias of several meters that's invisible without GCPs.
Florent's Tip: When you use generative AI for 3D code, always feed it a sample of your data (the first 10 lines of the file, the bounding box dimensions, the CRS). That cuts the AI's error rate from ~40% to ~5% on coordinate transformation tasks, where context is everything.

4. Generating 3D Visualizations

Visualization is one of the areas where generative AI brings the fastest productivity gain. Describe the target rendering in natural language ("show my point cloud colored by classification with a color bar and black background") and get functional pyvista code in 30 seconds. The friction of wading through complex APIs just disappears.

Python 3D visualization libraries have large and sometimes unintuitive APIs. pyvista wraps VTK with a Pythonic interface, plotly delivers interactive WebGL rendering, matplotlib gives you basic 3D plots, and three.js (via Python-to-JS) enables advanced web visualizations. The AI knows the idioms of each and generates code suited to the presentation context: scientific publication (matplotlib), interactive exploration (plotly), web application (three.js) or technical analysis (pyvista).

The big win is fast customization. Instead of digging through documentation to find how to change the camera angle, color palette, or lighting, you just ask in natural language. The AI iterates on parameters until the rendering hits what you want, a process that would typically eat 10 to 15 minutes of documentation reading per adjustment.

viz_classified_cloud.py
# Code generated by AI: classified point cloud visualization with pyvista
import pyvista as pv
import numpy as np
import laspy

def visualize_classified(laz_path: str) -> None:
    """Display a LiDAR cloud colored by ASPRS class."""
    las = laspy.read(laz_path)
    points = np.vstack((las.x, las.y, las.z)).T
    classes = las.classification

    # Standard ASPRS palette
    cmap = {2: "#8B4513", 3: "#228B22", 4: "#32CD32",
            5: "#006400", 6: "#DC143C", 9: "#1E90FF"}

    cloud = pv.PolyData(points)
    cloud["classification"] = classes

    plotter = pv.Plotter(window_size=[1920, 1080])
    plotter.set_background("black")
    plotter.add_mesh(cloud, scalars="classification",
                     cmap="tab10", point_size=2,
                     render_points_as_spheres=True)
    plotter.add_scalar_bar(title="ASPRS Class", n_labels=6)
    plotter.camera_position = "iso"
    plotter.show()
Data Processing Rendering Display LAZ / PLY NumPy / Open3D PyVista / VTK Screen / Web

5. Automating Pipelines with AI

One of the most powerful use cases of generative AI is the rapid creation of complete CLI pipelines. From a natural-language description ("build a command-line tool that takes a folder of LAZ files, applies SOR filtering, classifies the ground with CSF, generates a DTM raster and exports to GeoTIFF"), the AI can produce a functional script including argparse, logging, error handling and parallel processing.

The productivity gain is huge for scaffolding: setting up program structure (CLI arguments, configuration, logging, path handling). This "infrastructure" code often makes up 40-50% of the total code in a 3D processing pipeline but delivers zero business value. The AI spits it out in seconds, freeing the developer to focus on the real spatial logic.

Integration with CI/CD systems (GitHub Actions, GitLab CI) is another area where AI speeds up development. Generating a YAML workflow that automatically runs the 3D processing pipeline on every push, with automatic validation against reference GCPs, takes one well-structured prompt. Field benchmarks show a 3 to 5x acceleration factor on standard pipelines when you use AI as a development assistant.

pipeline_cli.py
#!/usr/bin/env python3
# CLI pipeline generated with AI assistance
import argparse
import logging
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor, as_completed
import open3d as o3d
import numpy as np
import laspy

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("pipeline3d")

def process_tile(laz_path: Path, voxel_size: float, output_dir: Path) -> dict:
    """Process a LAZ tile: load, filter, downsample, export."""
    logger.info(f"Processing: {laz_path.name}")
    try:
        las = laspy.read(str(laz_path))
        pts = np.vstack((las.x, las.y, las.z)).T

        pcd = o3d.geometry.PointCloud()
        pcd.points = o3d.utility.Vector3dVector(pts)

        # SOR filtering (Statistical Outlier Removal)
        pcd_clean, mask = pcd.remove_statistical_outlier(
            nb_neighbors=20, std_ratio=2.0
        )

        # Voxel downsampling
        pcd_down = pcd_clean.voxel_down_sample(voxel_size=voxel_size)

        out_path = output_dir / laz_path.with_suffix(".ply").name
        o3d.io.write_point_cloud(str(out_path), pcd_down)

        return {"file": laz_path.name, "status": "ok",
                "n_in": len(pts), "n_out": len(pcd_down.points)}
    except Exception as e:
        logger.error(f"Failed {laz_path.name}: {e}")
        return {"file": laz_path.name, "status": "error", "msg": str(e)}

def main():
    parser = argparse.ArgumentParser(description="Pipeline 3D batch processing")
    parser.add_argument("input_dir", type=Path, help="Source LAZ folder")
    parser.add_argument("output_dir", type=Path, help="Output folder")
    parser.add_argument("--voxel", type=float, default=0.05, help="Voxel size (m)")
    parser.add_argument("--workers", type=int, default=4, help="Number of workers")
    args = parser.parse_args()

    args.output_dir.mkdir(parents=True, exist_ok=True)
    tiles = list(args.input_dir.glob("*.laz"))
    logger.info(f"{len(tiles)} tiles found")

    results = []
    with ProcessPoolExecutor(max_workers=args.workers) as pool:
        futures = {pool.submit(process_tile, t, args.voxel, args.output_dir): t for t in tiles}
        for fut in as_completed(futures):
            results.append(fut.result())

    ok = sum(1 for r in results if r["status"] == "ok")
    logger.info(f"Done: {ok}/{len(tiles)} tiles processed")

if __name__ == "__main__":
    main()
Field Benchmark: On a 120-tile LAZ project (48 GB total), generating the complete CI/CD pipeline (processing script + GitHub Actions workflow + validation tests) took 45 minutes with AI versus 3 days without AI, a x5 acceleration factor. The AI-generated code needed about 20% of manual edits, mostly for sensor-specific parameters and business validation thresholds.

6. Toward the 3D Developer Agent

The next frontier is the jump from assistant (which answers one-off requests) to autonomous agent (which plans, runs and validates a full end-to-end pipeline). In 2026, the first prototypes of 3D developer agents are emerging: they receive a spec ("process this LiDAR dataset, classify the ground, generate a DTM and a quality report") and orchestrate the steps on their own, handle errors and produce the deliverables.

Integration with CI/CD pipelines (Continuous Integration / Continuous Deployment) matters a lot here. An AI agent can automatically generate unit tests for each 3D function (output dimension checks, CRS consistency, NaN absence), run those tests on reference datasets, and deploy the pipeline to production if everything passes. Reference datasets like S3DIS (Stanford 3D Indoor Scenes), ScanNet and DALES (Dayton Annotated LiDAR Earth Scan) serve as benchmarks to automatically validate that code changes don't degrade performance.

Ethical considerations and quality control sit at the heart of this shift. An autonomous agent that generates and deploys code without human supervision can spread systematic errors, a ground classification bias, for example, that degrades every DTM it produces. The 3D architect's role shifts toward governance: defining the guardrails, quality thresholds, and mandatory human validation points. Module 5 of the training (Spatial AI Agents) goes deeper into this supervised agent architecture and secure production deployment patterns.

7. Best Practices for AI-Assisted 3D Code

The experience built up since AI code assistants arrived lets us pull out clear rules for the 3D domain:

Always validate geometrically, not just syntactically. 3D code can run without errors and still produce a geometrically absurd result. After every AI generation, visualize the output, check the bounding boxes, confirm the CRS is right, and compare against a known ground truth. Syntactic validation (the code runs) is just the first step. Geometric validation (the result is spatially correct) is the one that matters.

Use unit tests with reference datasets. Build a library of test datasets with known expected results. For instance: a synthetic point cloud of a perfect cube (for testing plane detection), an S3DIS dataset excerpt (for testing semantic segmentation), a reference DTM with GCPs (for testing interpolation). Every AI-generated function has to pass these tests before going to production.

Keep expertise at the center. AI speeds things up, but the expert decides. Knowing why you pick a voxel size of 0.05 m over 0.10 m, why SMRF beats CSF on urban terrain, why linear interpolation suits a dense DTM better than IDW, these decisions demand deep domain understanding the AI doesn't have. The 3D engineer of 2026 is a decision architect assisted by an executing AI, not the other way around.

Wrap-up: Generative AI is reshaping 3D development on three axes: (1) acceleration, boilerplate code, PDAL pipelines and visualizations in seconds, (2) debugging, fast diagnosis of geometric and precision errors, (3) documentation, auto-generated docstrings, comments and reports. What it doesn't replace is (4) spatial expertise, parameter choices, geometric validation and pipeline governance stay human responsibilities.
From reading a LAZ file to pair-programming with generative AI, the path is complete. Let's step back on the skills you've picked up.

Conclusion

We've come a long way, from reading a plain .LAZ file to deploying AI agents that can reason in natural language about complex 3D scenes. This progression mirrors the day-to-day reality of the 3D AI Architect role in 2026: a professional who has to command geometric fundamentals, deep learning architectures, neural rendering pipelines, and intelligent-agent frameworks all at once.

The skills covered in this book form a coherent toolkit you can put to work right away:

  • Fundamentals: Open3D, Laspy, and NumPy for geometric processing of point clouds.
  • Segmentation: RANSAC, DBSCAN, and automated cleaning pipelines.
  • Big Data: Chunking, tiling, and multiprocessing for datasets spanning hundreds of gigabytes.
  • BIM: Automated Scan-to-BIM, IfcOpenShell, and urban digital twins with CityGML.
  • 3D Deep Learning: PointNet and PointNet++ for classification and segmentation.
  • Neural Rendering: 3D Gaussian Splatting and NeRF with head-to-head benchmarks.
  • Foundation Models: SAM 3D, CLIP, OpenScene, and LERF for zero-shot segmentation.
  • AI Agents: LangChain, LangGraph, and FastAPI for production-grade systems.
  • Terrain: End-to-end drone LiDAR pipeline with DTM/DSM and BIM integration.
  • Generative AI: Pair-programming with Claude, GPT-4, and Copilot to speed up 3D development by 3 to 5x.

To take things further, the 3D Geodata Academy offers personalized coaching with hands-on projects built on your own data, one-on-one mentoring, and access to a community of professionals in the field. Every chapter of this book maps to a training module, with practical exercises, real datasets, and progressive assessments.

The future of spatial intelligence is already underway. You now hold the keys to help shape it.

About the Author

Dr. Florent Poux

Dr. Florent Poux holds a PhD in geomatic sciences (University of Liège, EuroSDR 2019 Award) and served as a postdoctoral researcher at RWTH Aachen University. He specializes in spatial artificial intelligence, 3D Data Science, and point cloud processing, with more than 10 years of experience across research and industry. He's published 15 articles in international peer-reviewed journals (h-index 21, 1,563+ citations), 19 conference papers, and trained more than 12,000 professionals across 50+ countries.

Founder of the 3D Geodata Academy, he's devoted to passing on cutting-edge skills at the intersection of 3D geodata, deep learning, and generative artificial intelligence. His "Theory-to-Code" teaching approach pairs scientific rigor with industrial pragmatism: every theoretical concept ships alongside an executable Python implementation.

His areas of expertise: massive point cloud processing, 3D semantic segmentation, neural rendering (NeRF, 3D Gaussian Splatting), Foundation Models for 3D, spatial AI agents, Scan-to-BIM automation, and digital twins.

Ready to go deeper

Check out 3D Data Science with Python (O'Reilly Media, 2025), 687 pages, 18 chapters, and 200+ code examples. An O'Reilly Python Best Seller, adopted by 100+ universities.

Get the book on O'Reilly →
3D DATA SCIENCE with Python </> O'Reilly Media
Book a Call