Doctoral Thesis Analysis
Geometric Deep Learning for Camera Pose Prediction, Registration, Depth Estimation, and 3D Reconstruction
Author: Xueyang Kang
University: The University of Melbourne & KU Leuven, Belgium
Award Date: September 2025
Executive Impact & Key Achievements
This dissertation pioneers geometric deep learning frameworks for advanced 3D vision, delivering robust, scalable, and high-quality solutions with significant real-world applications across various sectors.
This research bridges traditional geometric constraints with deep learning, leading to significant advancements in VR/AR, robotics, and cultural heritage digitization through more robust, accurate, and efficient 3D models.
Deep Analysis & Enterprise Applications
Each module below explores specific findings from the research, framed as enterprise-focused applications.
Camera Pose Estimation Using Natural Geometry Cues and Manifold Constraint
Chapter 3 introduces a robust vision-based orientation tracking and fusion algorithm for camera pose estimation in challenging natural environments, specifically for UAVs. It leverages geometric primitives such as skylines and ground planes, combined with an adaptive particle filter, to ensure stable and accurate orientation estimates. The system, implemented on embedded hardware, significantly mitigates the effects of motion blur and long-term drift, outperforming purely IMU-based solutions.
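As an illustration of the vision front end, here is a minimal sketch of a lightweight ResNet-18 encoder with a simple segmentation head for binary sky/ground prediction, written in PyTorch. The decoder (a 1x1 projection followed by bilinear upsampling), the label convention, and the input resolution are illustrative assumptions, not the thesis's exact architecture.

```python
# Minimal sketch (assumptions: torchvision ResNet-18 encoder, 1x1 head + bilinear
# upsampling as the decoder, label 0 = ground / 1 = sky); not the thesis's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class SkyGroundSegmenter(nn.Module):
    """Lightweight encoder-decoder for binary sky/ground segmentation."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop the classification head; keep conv stem + residual stages (stride 32).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.head = nn.Conv2d(512, 2, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.encoder(x)                      # (B, 512, H/32, W/32)
        logits = self.head(feats)                    # (B, 2, H/32, W/32)
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)

if __name__ == "__main__":
    model = SkyGroundSegmenter().eval()
    frame = torch.randn(1, 3, 256, 384)              # one camera frame
    with torch.no_grad():
        mask = model(frame).argmax(dim=1)            # per-pixel sky/ground labels
    print(mask.shape)                                # torch.Size([1, 256, 384])
```

The resulting sky/ground boundary (the skyline) and the ground region then serve as the geometric cues for orientation estimation.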
Key Contributions:
- Developed a lightweight ResNet-18 backbone for real-time binary segmentation of ground and sky regions on embedded devices.
- Introduced a geometry-based framework utilizing skyline and ground cues for robust visual tracking in outdoor conditions.
- Proposed a nonlinear particle filter with adaptive sampling resolution on a multi-resolution manifold surface to fuse IMU and vision-based orientation estimates, ensuring robust tracking (a simplified fusion sketch follows this list).
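The next sketch illustrates, in heavily simplified form, how IMU and vision-based orientation estimates can be fused with a particle filter whose process noise adapts to the current particle spread. It assumes a 2-DoF roll/pitch state and a Gaussian likelihood around the vision measurement; the thesis's multi-resolution manifold sampling is not reproduced here.

```python
# Illustrative particle-filter fusion of IMU and vision orientation estimates.
# Assumptions: 2-DoF roll/pitch state (rad), Gaussian vision likelihood, process
# noise tied to the current particle spread; not the thesis's manifold formulation.
import numpy as np

def pf_step(particles, weights, gyro_delta, vision_rp, sigma_vision=0.05, min_spread=0.002):
    """One predict / update / resample cycle for orientation fusion."""
    n = len(particles)
    # Predict: move every particle by the IMU-integrated increment, with noise
    # that shrinks as the particle cloud converges (coarse-to-fine sampling).
    spread = max(particles.std(axis=0).mean(), min_spread)
    particles = particles + gyro_delta + np.random.normal(0.0, spread, particles.shape)

    # Update: re-weight by agreement with the vision-based roll/pitch measurement.
    err = np.linalg.norm(particles - vision_rp, axis=1)
    weights = weights * np.exp(-0.5 * (err / sigma_vision) ** 2)
    weights /= weights.sum()

    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * n:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)

    return particles, weights, np.average(particles, axis=0, weights=weights)

np.random.seed(0)
particles = np.random.normal(0.0, 0.1, size=(500, 2))   # initial roll/pitch hypotheses
weights = np.full(500, 1.0 / 500)
particles, weights, estimate = pf_step(
    particles, weights,
    gyro_delta=np.array([0.01, -0.02]),                  # from integrated gyro rates
    vision_rp=np.array([0.03, -0.05]))                   # from skyline / ground-plane cues
print(estimate)
```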
Challenges Addressed:
- Degradation of pose estimation accuracy under abrupt tilting, minimal height change, or pure rotation with little overlap between frames.
- Motion blur introducing jitter and making feature extraction/correspondence matching challenging.
- Long-term drift and noise in conventional IMU-based camera pose tracking in mountainous regions with unpredictable rotations.
Potential Enterprise Impact:
- Enables stable and high-quality image capture for UAVs in dynamic, unstructured outdoor environments.
- Improves image quality and pose precision for mapping and surveying applications.
- Increases operational efficiency by reducing post-processing needs for 3D applications like reconstruction.
- Provides a foundational ability for autonomous navigation and aerial mapping in natural environments.
Orientation Tracking Error Comparison (Roll, Pitch, and Mixed Tests; Lower Is Better)
Sequence | ORB3[25] | R-VIO[91] | DM-VIO[192] | Ours |
---|---|---|---|---|
Roll test (125s) test01 | 0.01011 | 0.02071 | 0.00942 | 0.00862 |
Roll test (125s) test02 | 0.00994 | 0.03904 | 0.00300 | 0.00824 |
Roll test (125s) test03 | 0.01309 | 0.06566 | 0.00324 | 0.00945 |
Pitch test (127s) test01 | 0.01207 | 0.02828 | 0.04666 | 0.01657 |
Pitch test (127s) test02 | 0.04254 | — | — | 0.03446 |
Pitch test (127s) test03 | 0.00914 | — | — | 0.00875 |
Mixed test (9200s) test01 | 0.01845 | 0.03975 | — | 0.01972 |
Mixed test (9200s) test02 | 0.01828 | 0.05635 | — | 0.01804 |
Mixed test (9200s) test03 | 0.01617 | 0.03982 | — | 0.01615 |
UAV-Based Camera Stabilization for Aerial Mapping
The developed camera pose estimation system is designed for UAVs, leveraging natural geometric cues like skylines and ground planes to stabilize camera orientation in dynamic outdoor environments.
Point Cloud Registration Using 2D Surfel-Based Equivariance Constraint
Chapter 4 introduces a novel surfel-based SE(3)-equivariant deep learning framework for point cloud registration. This method overcomes limitations of traditional feature-based approaches by leveraging 2D Gaussian surfel features that encode point position, normal, and uncertainty. It delivers robust and generalizable alignment, particularly under noisy inputs, small overlaps, and high outlier ratios, demonstrating state-of-the-art accuracy on indoor and outdoor datasets.
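To make the surfel representation concrete, the sketch below derives per-pixel position, normal, and uncertainty from a depth map under a pinhole camera model. Normals come from finite differences of the back-projected point map, and the uncertainty is a simple depth-proportional scalar; both are simplifying assumptions rather than the thesis's initialization pipeline.

```python
# Minimal sketch of building surfels (position, normal, uncertainty) from a depth map.
# Assumptions: pinhole intrinsics fx, fy, cx, cy; normals via finite differences of the
# back-projected point map; isotropic depth-dependent uncertainty.
import numpy as np

def depth_to_surfels(depth, fx, fy, cx, cy, sigma_scale=0.01):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every pixel to a 3D point in the camera frame.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)                        # (H, W, 3)

    # Normals from the cross product of the point map's image-space tangents.
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    normals = np.cross(du, dv)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8

    # Simple uncertainty model: noise grows with depth (sensor-dependent in practice).
    sigma = sigma_scale * z

    valid = z > 0
    return pts[valid], normals[valid], sigma[valid]

# Usage with a synthetic depth ramp:
depth = np.tile(np.linspace(1.0, 2.0, 64), (48, 1))
points, normals, sigmas = depth_to_surfels(depth, fx=300, fy=300, cx=32, cy=24)
print(points.shape, normals.shape, sigmas.shape)
```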
Key Contributions:
- Developed a surfel-based initialization pipeline from RGB-D depth maps or LiDAR scans for pose regression.
- Introduced an SE(3)-equivariant deep learning model using surfel features for robust and efficient registration.
- Implemented a differentiable SE(3) Huber loss function for supervision with soft correspondences, enhancing robustness to outliers (a simplified sketch follows this list).
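The sketch below shows a Huber-robust alignment loss supervised with soft correspondences, in PyTorch. The correspondence-weight matrix and the unconstrained 3x3 rotation are illustrative assumptions; the thesis's differentiable SE(3) Huber loss operates on proper SE(3) quantities.

```python
# Illustrative Huber-robust registration loss over soft correspondences (not the
# thesis's exact differentiable SE(3) formulation; R here is an unconstrained matrix).
import torch

def soft_correspondence_huber_loss(src, tgt, weights, R, t, delta=0.1):
    """
    src     : (N, 3) source points
    tgt     : (M, 3) target points
    weights : (N, M) soft correspondence weights (rows sum to 1, e.g. a softmax)
    R, t    : (3, 3), (3,) predicted rigid transform mapping src into tgt's frame
    """
    src_tf = src @ R.T + t                     # transform source points
    tgt_soft = weights @ tgt                   # soft (expected) target match per source point
    residual = torch.linalg.norm(src_tf - tgt_soft, dim=-1)
    # Huber penalty: quadratic near zero, linear in the tails -> robust to outliers.
    return torch.nn.functional.huber_loss(
        residual, torch.zeros_like(residual), delta=delta)

# Usage with random tensors (gradients flow to R, t, and the correspondence logits):
N, M = 128, 160
src = torch.randn(N, 3)
tgt = torch.randn(M, 3)
weights = torch.softmax(torch.randn(N, M, requires_grad=True), dim=-1)
R = torch.eye(3, requires_grad=True)
t = torch.zeros(3, requires_grad=True)
loss = soft_correspondence_huber_loss(src, tgt, weights, R, t)
loss.backward()
print(float(loss))
```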
Challenges Addressed:
- Traditional methods struggling with sparse, noisy, or ambiguous keypoint features, leading to unreliable alignment.
- Existing deep learning models often neglecting point orientations and uncertainties, affecting generalization to noisy/unseen data.
- Inefficient training due to extensive data augmentation required by non-rotation-equivariant models.
Potential Enterprise Impact:
- Enhances 3D mapping for robotics, digital twins, and infrastructure inspection by improving alignment accuracy in real-world scans.
- Provides a robust foundation for consistent 3D global mapping in varied environments.
- Offers improved generalization and efficiency in point cloud registration for autonomous navigation and large-scale scene modeling.
Method | RE(◦) ↓ (3DMatch) | TE(cm) ↓ (3DMatch) | RR(%) ↑ (3DMatch) | F1(%) ↑ (3DMatch) | RE(◦) ↓ (KITTI) | TE(cm) ↓ (KITTI) | RR(%) ↑ (KITTI) | F1(%) ↑ (KITTI) |
---|---|---|---|---|---|---|---|---|
DGR [41] | 2.40 | 7.48 | 91.30 | 89.76 | 1.45 | 14.60 | 76.62 | 73.84 |
D3Feat [12] | 2.57 | 8.16 | 89.70 | 87.40 | 2.07 | 18.92 | 70.06 | 65.31 |
RoReg [243] | 1.84 | 6.28 | 93.70 | 91.60 | — | — | — | — |
SpinNet [8] | 1.93 | 6.24 | 93.74 | 92.07 | 1.08 | 10.75 | 82.83 | 80.91 |
PointDSC [11] | 2.06 | 6.55 | 93.28 | 89.35 | 1.63 | 12.31 | 74.41 | 70.08 |
MAC [282] | 1.89 | 6.03 | 93.72 | 91.46 | 1.42 | 8.46 | 91.37 | 89.25 |
MAC+GeoTF [282, 193] | 1.74 | 6.01 | 95.02 | 91.80 | 1.37 | 8.01 | 90.59 | 88.45 |
Ours | 1.34 | 5.72 | 95.08 | 93.32 | 1.57 | 6.09 | 92.05 | 90.61 |
3D Assembly of Cultural Heritage Fragments
Our point cloud registration method is applied in archaeology to reconstruct fragile artifacts from numerous fragments. By aligning scanned pieces, it creates a complete and accurate 3D model, preserving historical and cultural heritage.
This intricate process involves robust 3D alignment for large-scale reconstruction from LiDAR scans. The surfel-based SE(3)-equivariant features ensure reliable registration even with minimal overlaps and high outlier ratios. This technique aids archaeologists and historians in digitally recovering and interpreting cultural artifacts, making preservation more efficient and accurate. Advanced algorithms like PuzzleFusion++ are used for denoising, verifying, and incrementally refining fits, particularly for pieces with irregular shapes and minimal overlap.
Depth Prediction from Focal Stack Using Focal Geometry Constraint
Chapter 5 introduces FocDepthFormer, a novel Transformer-based network with an LSTM module for depth estimation from focal stack images. This model addresses the limitations of fixed input lengths and local receptive fields in conventional CNN-based methods. By integrating self-attention and recurrent modules, it efficiently processes focal stacks of arbitrary lengths, capturing global spatial relationships and fine-grained focus/defocus cues to achieve state-of-the-art depth prediction accuracy.
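To show how a recurrent module decouples the network from the stack length, the sketch below applies a shared convolutional encoder to each focal slice, aggregates features across the stack with an LSTM at every spatial location, and decodes a single depth map. Layer sizes are arbitrary, and FocDepthFormer's Transformer self-attention and multi-scale kernels are omitted.

```python
# Illustrative variable-length focal stack fusion: shared per-slice encoder,
# LSTM over the stack dimension at each spatial location, simple depth decoder.
# (Not the full FocDepthFormer; self-attention and multi-scale kernels omitted.)
import torch
import torch.nn as nn

class FocalStackDepthSketch(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(                       # shared across slices
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_dim, 1, 3, padding=1))

    def forward(self, stack):                               # (B, S, 3, H, W), S arbitrary
        b, s, c, h, w = stack.shape
        feats = self.encoder(stack.flatten(0, 1))           # (B*S, F, h', w')
        f, hp, wp = feats.shape[1:]
        # Run the LSTM over the stack dimension independently at each location.
        seq = feats.view(b, s, f, hp, wp).permute(0, 3, 4, 1, 2).reshape(b * hp * wp, s, f)
        out, _ = self.lstm(seq)
        fused = out[:, -1].view(b, hp, wp, f).permute(0, 3, 1, 2)   # last state per location
        return self.decoder(fused)                          # (B, 1, H, W)

# The same weights handle stacks of 3, 5, or 10 slices without architectural changes:
model = FocalStackDepthSketch()
for s in (3, 5, 10):
    depth = model(torch.randn(2, s, 3, 64, 64))
    print(s, tuple(depth.shape))                            # (2, 1, 64, 64)
```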
Key Contributions:
- Proposed a novel Transformer-based network, FocDepthFormer, for depth estimation from focal stack images, leveraging self-attention to capture non-local spatial visual features.
- Integrated an LSTM-based recurrent module to handle arbitrary numbers of input images in focal stacks, enhancing flexibility and generalization.
- Employed multi-scale convolutional kernels in an early-stage encoder to capture low-level focus/defocus cues at various scales.
Challenges Addressed:
- Limitations of CNN-based methods with fixed 2D/3D kernels, struggling to generalize across varying focal stack lengths.
- Ineffective capture of long-range dependencies due to local receptive fields in traditional CNN architectures.
- Reliance on large-scale, difficult-to-obtain focal stack training data.
Potential Enterprise Impact:
- Enables flexible and accurate depth prediction from focal stacks of arbitrary lengths, expanding applications in 3D photography and industrial quality control.
- Reduces reliance on expensive 3D sensor hardware and large-scale focal stack datasets through pre-training strategies.
- Contributes to advancing 3D reconstruction techniques and dense point cloud generation from a single viewpoint, particularly for fragile objects where camera motion is prohibited.
Model | RMSE↓ | logRMSE↓ | absRel↓ | sqrRel↓ | Bump↓ | δ ↑ | δ² ↑ | δ³ ↑ |
---|---|---|---|---|---|---|---|---|
DDFFNet [79] | 2.91e-2 | 0.320 | 0.293 | 1.2e-2 | 0.59 | 61.95 | 85.14 | 92.98 |
DefocusNet [161] | 2.55e-2 | 0.230 | 0.180 | 6.0e-3 | 0.46 | 72.56 | 94.15 | 97.92 |
DFVNet [264] | 2.13e-2 | 0.210 | 0.171 | 6.2e-3 | 0.32 | 76.74 | 94.23 | 98.14 |
AiFNet [246] | 2.32e-2 | 0.290 | 0.251 | 8.3e-3 | 0.63 | 68.33 | 87.40 | 93.96 |
Ours (w/o Pre-training) | 2.01e-2 | 0.206 | 0.173 | 5.7e-3 | 0.26 | 78.01 | 95.04 | 98.32 |
Ours (w/ Pre-training) | 1.96e-2 | 0.197 | 0.161 | 5.4e-3 | 0.23 | 79.06 | 96.08 | 98.57 |
3D Modeling of Fragile Artworks and Sculptures
FocDepthFormer enables precise 3D modeling of cultural paintings and sculptures by analyzing focus differences across focal stack images, crucial for digitizing fragile artifacts without physical contact.
This technique extracts 3D information from 2D artworks with strong perspective principles, capturing fine details like painting strokes at millimeter-level precision. Unlike traditional methods, it doesn't require camera motion, making it ideal for preserving fragile heritage. The model's ability to handle arbitrary focal stack sizes offers flexibility for museums, cultural heritage organizations, and educational institutions to create high-accuracy digital replicas for exhibitions, in-depth analysis, and immersive virtual experiences.
3D Reconstruction Using Implicit SDF with Wavelet Feature-Based Prior
Chapter 6 proposes a wavelet-conditioned implicit SDF model for high-fidelity 3D reconstruction from multi-view images. It addresses the challenge of preserving fine-grained geometric details often lost in implicit representations by integrating pre-trained wavelet-transformed depth features and triplane latent space fusion. This approach significantly enhances surface accuracy and detail preservation, outperforming state-of-the-art explicit and implicit methods on various datasets.
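As a simplified picture of the wavelet conditioning, the sketch below computes a one-level Haar decomposition of a depth map and concatenates bilinearly sampled wavelet features with each 3D query coordinate before a small SDF MLP. The thesis instead uses a pre-trained multi-scale wavelet autoencoder, triplane-aligned projection, and a UNet-based fusion module; the query-to-pixel projection is assumed given here.

```python
# Illustrative wavelet-conditioned SDF: one-level Haar decomposition of a depth map,
# bilinear sampling of the wavelet bands at each query's projected pixel, small MLP.
# (Assumptions: projection uv is precomputed; not the thesis's autoencoder/triplane setup.)
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(depth):
    """One-level Haar wavelet transform of a (B,1,H,W) depth map -> (B,4,H/2,W/2)."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)      # (4, 1, 2, 2)
    return F.conv2d(depth, kernels, stride=2)

class WaveletConditionedSDF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, xyz, wavelet_feats, uv):
        """xyz: (B,N,3) queries; wavelet_feats: (B,4,h,w); uv: (B,N,2) in [-1,1]."""
        sampled = F.grid_sample(wavelet_feats, uv.unsqueeze(2), align_corners=False)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)         # (B, N, 4)
        return self.mlp(torch.cat([xyz, sampled], dim=-1))     # (B, N, 1) signed distance

# Usage with random data (uv would come from projecting xyz with camera intrinsics):
depth = torch.rand(1, 1, 128, 128) * 3.0
feats = haar_dwt(depth)                                        # (1, 4, 64, 64)
xyz = torch.rand(1, 1024, 3) * 2 - 1
uv = torch.rand(1, 1024, 2) * 2 - 1
sdf = WaveletConditionedSDF()(xyz, feats, uv)
print(sdf.shape)                                               # torch.Size([1, 1024, 1])
```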
Key Contributions:
- Introduced Wavelet-Transformed Depth Feature Conditioning, using a pre-trained multi-scale wavelet autoencoder to enhance geometric detail preservation in implicit SDF training.
- Developed Triplane-Aligned Wavelet Feature Projection to seamlessly integrate 2D wavelet features with 3D implicit representations.
- Implemented a Hybrid Feature Fusion for SDF Prediction, utilizing a UNet-based module to combine implicit 3D features with wavelet-transformed depth representations for accurate isosurface mesh extraction.
Challenges Addressed:
- Implicit SDF models struggling to capture fine-grained geometric details and high-frequency information due to limited learning capability of MLPs.
- Oversmoothed reconstructions and loss of sharp edges in existing implicit methods.
- Computational demands and generalization limitations of implicit models on diverse datasets.
Potential Enterprise Impact:
- Enables high-fidelity 3D reconstructions for digital cultural heritage preservation, virtual museums, and immersive VR/AR environments.
- Provides more complete and detailed 3D models of objects and architectural structures from limited image views.
- Advances the integration of geometric constraints in high-frequency bands with deep learning frameworks for robust 3D vision solutions.
3D Reconstruction Comparison: Chamfer Distance (mm) on DTU Scans (Lower Is Better)
Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VolSDF [270] | 1.14 | 1.26 | 0.81 | 0.49 | 1.25 | 0.70 | 0.72 | 1.29 | 1.18 | 0.70 | 0.66 | 1.08 | 0.42 | 0.61 | 0.55 |
NeuS [247] | 1.00 | 1.37 | 0.93 | 0.43 | 1.10 | 0.65 | 0.57 | 1.48 | 1.09 | 0.83 | 0.52 | 1.20 | 0.35 | 0.49 | 0.54 |
Neuralangelo [134] | 0.41 | 0.36 | 0.35 | 0.35 | 1.29 | 0.54 | 0.73 | 0.52 | 0.97 | 0.56 | 0.48 | 0.73 | 0.32 | 0.40 | 0.36 |
BakedSDF [271] | 0.63 | 0.58 | 0.40 | 0.52 | 1.37 | 0.63 | 0.81 | 0.56 | 1.02 | 0.81 | 0.50 | 0.82 | 0.39 | 0.45 | 0.38 |
SuGaR [73] | 1.47 | 1.33 | 1.13 | 0.61 | 2.25 | 1.71 | 1.15 | 1.63 | 1.62 | 1.07 | 0.79 | 2.45 | 0.98 | 0.88 | 0.79 |
GOF [94] | 0.50 | 0.37 | 0.38 | 0.74 | 1.18 | 0.76 | 0.90 | 0.47 | 1.29 | 0.68 | 0.77 | 0.90 | 0.42 | 0.41 | 0.42 |
2DGS [93] | 0.48 | 0.39 | 0.41 | 0.83 | 1.36 | 0.83 | 1.04 | 0.70 | 1.27 | 0.76 | 0.70 | 1.40 | 0.40 | 0.43 | 0.40 |
Ours | 0.45 | 0.34 | 0.32 | 0.34 | 0.97 | 0.52 | 0.54 | 0.50 | 0.82 | 0.53 | 0.45 | 0.68 | 0.30 | 0.36 | 0.34 |
Digital Preservation and Virtual Tourism for Cultural Heritage
Our 3D reconstruction model enables the creation of high-fidelity digital twins of historical sites and artifacts, supporting virtual museums, interactive learning, and 3D printing.
This technology is crucial for preserving cultural heritage, allowing virtual visitors to explore reconstructed ancient sites and artifacts with close-up detail and manipulation capabilities. It integrates wavelet-transformed depth features and triplane embeddings to achieve superior detail. Case studies include virtual tours of Chinese temples (Figure 8.7), game environments such as 'Black Myth: Wukong' (Figure 8.9), and 3D-printed replicas of the Dunhuang Mogao Grottoes (Figure 8.10), demonstrating wide-ranging applications in education, entertainment, and digital preservation.
Calculate Your Potential AI Impact
Estimate the operational efficiency gains and cost savings your enterprise could achieve by integrating advanced geometric deep learning solutions.
Your AI Implementation Roadmap
A phased approach to integrating geometric deep learning into your enterprise, ensuring robust and scalable deployment.
Phase 1: Discovery & Strategy Alignment (1-2 Weeks)
Initial consultations to understand your specific 3D vision challenges and business objectives. We'll identify optimal geometric deep learning solutions and define clear KPIs for success.
Phase 2: Data Preparation & Model Customization (4-8 Weeks)
Collection and annotation of enterprise-specific 3D data. Customization of chosen geometric deep learning models (e.g., surfel-based registration, FocDepthFormer) to your unique environments and requirements.
Phase 3: Integration & Pilot Deployment (6-10 Weeks)
Seamless integration of the AI models into your existing infrastructure. Conduct pilot programs to validate performance in real-world scenarios, gather feedback, and fine-tune for optimal results.
Phase 4: Scaling & Continuous Optimization (Ongoing)
Full-scale deployment across your operations. Establish continuous monitoring, performance tuning, and regular updates to ensure long-term robustness, efficiency, and adaptability to evolving needs.
Ready to Transform Your 3D Vision Capabilities?
Leverage cutting-edge geometric deep learning to enhance camera pose estimation, point cloud registration, depth estimation, and 3D reconstruction in your enterprise.