Doctoral Thesis Analysis
Geometric Deep Learning for Camera Pose Prediction, Registration, Depth Estimation, and 3D Reconstruction
Author: Xueyang Kang
University: The University of Melbourne & KU Leuven, Belgium
Award Date: September 2025
Executive Impact & Key Achievements
This dissertation pioneers geometric deep learning frameworks for advanced 3D vision, delivering robust, scalable, and high-quality solutions with significant real-world applications across various sectors.
This research bridges traditional geometric constraints with deep learning, leading to significant advancements in VR/AR, robotics, and cultural heritage digitization through more robust, accurate, and efficient 3D models.
Deep Analysis & Enterprise Applications
Each module below explores specific findings from the research, framed as enterprise-focused applications.
Camera Pose Estimation Using Natural Geometry Cues and Manifold Constraint
Chapter 3 introduces a robust vision-based orientation tracking and fusion algorithm for camera pose estimation in challenging natural environments, specifically for UAVs. It leverages geometric primitives such as skylines and ground planes, combined with an adaptive particle filter, to ensure stable and accurate orientation estimates. The system, implemented on embedded hardware, significantly mitigates the effects of motion blur and long-term drift, outperforming purely IMU-based solutions.
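As an illustration of the vision front end, here is a minimal sketch of a lightweight ResNet-18 encoder with a simple segmentation head for binary sky/ground prediction, written in PyTorch. The decoder (a 1x1 projection followed by bilinear upsampling), the label convention, and the input resolution are illustrative assumptions, not the thesis's exact architecture.

```python
# Minimal sketch (assumptions: torchvision ResNet-18 encoder, 1x1 head + bilinear
# upsampling as the decoder, label 0 = ground / 1 = sky); not the thesis's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class SkyGroundSegmenter(nn.Module):
    """Lightweight encoder-decoder for binary sky/ground segmentation."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop the classification head; keep conv stem + residual stages (stride 32).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.head = nn.Conv2d(512, 2, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.encoder(x)                      # (B, 512, H/32, W/32)
        logits = self.head(feats)                    # (B, 2, H/32, W/32)
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)

if __name__ == "__main__":
    model = SkyGroundSegmenter().eval()
    frame = torch.randn(1, 3, 256, 384)              # one camera frame
    with torch.no_grad():
        mask = model(frame).argmax(dim=1)            # per-pixel sky/ground labels
    print(mask.shape)                                # torch.Size([1, 256, 384])
```

The resulting sky/ground boundary (the skyline) and the ground region then serve as the geometric cues for orientation estimation.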
Key Contributions:
- Developed a lightweight ResNet-18 backbone for real-time binary segmentation of ground and sky regions on embedded devices.
- Introduced a geometry-based framework utilizing skyline and ground cues for robust visual tracking in outdoor conditions.
- Proposed a nonlinear particle filter with adaptive sampling resolution on a multi-resolution manifold surface to fuse IMU and vision-based orientation estimates, ensuring robust tracking (a simplified fusion sketch follows this list).
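The next sketch illustrates, in heavily simplified form, how IMU and vision-based orientation estimates can be fused with a particle filter whose process noise adapts to the current particle spread. It assumes a 2-DoF roll/pitch state and a Gaussian likelihood around the vision measurement; the thesis's multi-resolution manifold sampling is not reproduced here.

```python
# Illustrative particle-filter fusion of IMU and vision orientation estimates.
# Assumptions: 2-DoF roll/pitch state (rad), Gaussian vision likelihood, process
# noise tied to the current particle spread; not the thesis's manifold formulation.
import numpy as np

def pf_step(particles, weights, gyro_delta, vision_rp, sigma_vision=0.05, min_spread=0.002):
    """One predict / update / resample cycle for orientation fusion."""
    n = len(particles)
    # Predict: move every particle by the IMU-integrated increment, with noise
    # that shrinks as the particle cloud converges (coarse-to-fine sampling).
    spread = max(particles.std(axis=0).mean(), min_spread)
    particles = particles + gyro_delta + np.random.normal(0.0, spread, particles.shape)

    # Update: re-weight by agreement with the vision-based roll/pitch measurement.
    err = np.linalg.norm(particles - vision_rp, axis=1)
    weights = weights * np.exp(-0.5 * (err / sigma_vision) ** 2)
    weights /= weights.sum()

    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * n:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)

    return particles, weights, np.average(particles, axis=0, weights=weights)

np.random.seed(0)
particles = np.random.normal(0.0, 0.1, size=(500, 2))   # initial roll/pitch hypotheses
weights = np.full(500, 1.0 / 500)
particles, weights, estimate = pf_step(
    particles, weights,
    gyro_delta=np.array([0.01, -0.02]),                  # from integrated gyro rates
    vision_rp=np.array([0.03, -0.05]))                   # from skyline / ground-plane cues
print(estimate)
```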
Challenges Addressed:
- Degradation of pose estimation accuracy under abrupt tilting, minimal height change, or pure rotation with little overlap between frames.
- Motion blur introducing jitter and making feature extraction/correspondence matching challenging.
- Long-term drift and noise in conventional IMU-based camera pose tracking in mountainous regions with unpredictable rotations.
Potential Enterprise Impact:
- Enables stable and high-quality image capture for UAVs in dynamic, unstructured outdoor environments.
- Improves image quality and pose precision for mapping and surveying applications.
- Increases operational efficiency by reducing post-processing needs for 3D applications like reconstruction.
- Provides a foundational ability for autonomous navigation and aerial mapping in natural environments.
Orientation Tracking Error Comparison (Roll, Pitch, and Mixed Tests; Lower Is Better)
Sequence | ORB3[25] | R-VIO[91] | DM-VIO[192] | Ours |
---|---|---|---|---|
Roll test (125s) test01 | 0.01011 | 0.02071 | 0.00942 | 0.00862 |
Roll test (125s) test02 | 0.00994 | 0.03904 | 0.00300 | 0.00824 |
Roll test (125s) test03 | 0.01309 | 0.06566 | 0.00324 | 0.00945 |
Pitch test (127s) test01 | 0.01207 | 0.02828 | 0.04666 | 0.01657 |
Pitch test (127s) test02 | 0.04254 | — | — | 0.03446 |
Pitch test (127s) test03 | 0.00914 | — | — | 0.00875 |
Mixed test (9200s) test01 | 0.01845 | 0.03975 | — | 0.01972 |
Mixed test (9200s) test02 | 0.01828 | 0.05635 | — | 0.01804 |
Mixed test (9200s) test03 | 0.01617 | 0.03982 | — | 0.01615 |
UAV-Based Camera Stabilization for Aerial Mapping
The developed camera pose estimation system is designed for UAVs, leveraging natural geometric cues like skylines and ground planes to stabilize camera orientation in dynamic outdoor environments.
Point Cloud Registration Using 2D Surfel-Based Equivariance Constraint
Chapter 4 introduces a novel surfel-based SE(3)-equivariant deep learning framework for point cloud registration. This method overcomes limitations of traditional feature-based approaches by leveraging 2D Gaussian surfel features that encode point position, normal, and uncertainty. It delivers robust and generalizable alignment, particularly under noisy inputs, small overlaps, and high outlier ratios, demonstrating state-of-the-art accuracy on indoor and outdoor datasets.
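To make the surfel representation concrete, the sketch below derives per-pixel position, normal, and uncertainty from a depth map under a pinhole camera model. Normals come from finite differences of the back-projected point map, and the uncertainty is a simple depth-proportional scalar; both are simplifying assumptions rather than the thesis's initialization pipeline.

```python
# Minimal sketch of building surfels (position, normal, uncertainty) from a depth map.
# Assumptions: pinhole intrinsics fx, fy, cx, cy; normals via finite differences of the
# back-projected point map; isotropic depth-dependent uncertainty.
import numpy as np

def depth_to_surfels(depth, fx, fy, cx, cy, sigma_scale=0.01):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every pixel to a 3D point in the camera frame.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)                        # (H, W, 3)

    # Normals from the cross product of the point map's image-space tangents.
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    normals = np.cross(du, dv)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8

    # Simple uncertainty model: noise grows with depth (sensor-dependent in practice).
    sigma = sigma_scale * z

    valid = z > 0
    return pts[valid], normals[valid], sigma[valid]

# Usage with a synthetic depth ramp:
depth = np.tile(np.linspace(1.0, 2.0, 64), (48, 1))
points, normals, sigmas = depth_to_surfels(depth, fx=300, fy=300, cx=32, cy=24)
print(points.shape, normals.shape, sigmas.shape)
```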
Key Contributions:
- Developed a surfel-based initialization pipeline from RGB-D depth maps or LiDAR scans for pose regression.
- Introduced an SE(3)-equivariant deep learning model using surfel features for robust and efficient registration.
- Implemented a differentiable SE(3) Huber loss function for supervision with soft correspondences, enhancing robustness to outliers (a simplified sketch follows this list).
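The sketch below shows a Huber-robust alignment loss supervised with soft correspondences, in PyTorch. The correspondence-weight matrix and the unconstrained 3x3 rotation are illustrative assumptions; the thesis's differentiable SE(3) Huber loss operates on proper SE(3) quantities.

```python
# Illustrative Huber-robust registration loss over soft correspondences (not the
# thesis's exact differentiable SE(3) formulation; R here is an unconstrained matrix).
import torch

def soft_correspondence_huber_loss(src, tgt, weights, R, t, delta=0.1):
    """
    src     : (N, 3) source points
    tgt     : (M, 3) target points
    weights : (N, M) soft correspondence weights (rows sum to 1, e.g. a softmax)
    R, t    : (3, 3), (3,) predicted rigid transform mapping src into tgt's frame
    """
    src_tf = src @ R.T + t                     # transform source points
    tgt_soft = weights @ tgt                   # soft (expected) target match per source point
    residual = torch.linalg.norm(src_tf - tgt_soft, dim=-1)
    # Huber penalty: quadratic near zero, linear in the tails -> robust to outliers.
    return torch.nn.functional.huber_loss(
        residual, torch.zeros_like(residual), delta=delta)

# Usage with random tensors (gradients flow to R, t, and the correspondence logits):
N, M = 128, 160
src = torch.randn(N, 3)
tgt = torch.randn(M, 3)
weights = torch.softmax(torch.randn(N, M, requires_grad=True), dim=-1)
R = torch.eye(3, requires_grad=True)
t = torch.zeros(3, requires_grad=True)
loss = soft_correspondence_huber_loss(src, tgt, weights, R, t)
loss.backward()
print(float(loss))
```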
Challenges Addressed:
- Traditional methods struggling with sparse, noisy, or ambiguous keypoint features, leading to unreliable alignment.
- Existing deep learning models often neglecting point orientations and uncertainties, affecting generalization to noisy/unseen data.
- Inefficient training due to extensive data augmentation required by non-rotation-equivariant models.
Potential Enterprise Impact:
- Enhances 3D mapping for robotics, digital twins, and infrastructure inspection by improving alignment accuracy in real-world scans.
- Provides a robust foundation for consistent 3D global mapping in varied environments.
- Offers improved generalization and efficiency in point cloud registration for autonomous navigation and large-scale scene modeling.
Method | RE(◦) ↓ (3DMatch) | TE(cm) ↓ (3DMatch) | RR(%) ↑ (3DMatch) | F1(%) ↑ (3DMatch) | RE(◦) ↓ (KITTI) | TE(cm) ↓ (KITTI) | RR(%) ↑ (KITTI) | F1(%) ↑ (KITTI) |
---|---|---|---|---|---|---|---|---|
DGR [41] | 2.40 | 7.48 | 91.30 | 89.76 | 1.45 | 14.60 | 76.62 | 73.84 |
D3Feat [12] | 2.57 | 8.16 | 89.70 | 87.40 | 2.07 | 18.92 | 70.06 | 65.31 |
RoReg [243] | 1.84 | 6.28 | 93.70 | 91.60 | — | — | — | — |
SpinNet [8] | 1.93 | 6.24 | 93.74 | 92.07 | 1.08 | 10.75 | 82.83 | 80.91 |
PointDSC [11] | 2.06 | 6.55 | 93.28 | 89.35 | 1.63 | 12.31 | 74.41 | 70.08 |
MAC [282] | 1.89 | 6.03 | 93.72 | 91.46 | 1.42 | 8.46 | 91.37 | 89.25 |
MAC+GeoTF [282, 193] | 1.74 | 6.01 | 95.02 | 91.80 | 1.37 | 8.01 | 90.59 | 88.45 |
Ours | 1.34 | 5.72 | 95.08 | 93.32 | 1.57 | 6.09 | 92.05 | 90.61 |
3D Assembly of Cultural Heritage Fragments
Our point cloud registration method is applied in archaeology to reconstruct fragile artifacts from numerous fragments. By aligning scanned pieces, it creates a complete and accurate 3D model, preserving historical and cultural heritage.
This intricate process involves robust 3D alignment for large-scale reconstruction from LiDAR scans. The surfel-based SE(3)-equivariant features ensure reliable registration even with minimal overlaps and high outlier ratios. This technique aids archaeologists and historians in digitally recovering and interpreting cultural artifacts, making preservation more efficient and accurate. Advanced algorithms like PuzzleFusion++ are used for denoising, verifying, and incrementally refining fits, particularly for pieces with irregular shapes and minimal overlap.
Depth Prediction from Focal Stack Using Focal Geometry Constraint
Chapter 5 introduces FocDepthFormer, a novel Transformer-based network with an LSTM module for depth estimation from focal stack images. This model addresses the limitations of fixed input lengths and local receptive fields in conventional CNN-based methods. By integrating self-attention and recurrent modules, it efficiently processes focal stacks of arbitrary lengths, capturing global spatial relationships and fine-grained focus/defocus cues to achieve state-of-the-art depth prediction accuracy.
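To show how a recurrent module decouples the network from the stack length, the sketch below applies a shared convolutional encoder to each focal slice, aggregates features across the stack with an LSTM at every spatial location, and decodes a single depth map. Layer sizes are arbitrary, and FocDepthFormer's Transformer self-attention and multi-scale kernels are omitted.

```python
# Illustrative variable-length focal stack fusion: shared per-slice encoder,
# LSTM over the stack dimension at each spatial location, simple depth decoder.
# (Not the full FocDepthFormer; self-attention and multi-scale kernels omitted.)
import torch
import torch.nn as nn

class FocalStackDepthSketch(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(                       # shared across slices
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_dim, 1, 3, padding=1))

    def forward(self, stack):                               # (B, S, 3, H, W), S arbitrary
        b, s, c, h, w = stack.shape
        feats = self.encoder(stack.flatten(0, 1))           # (B*S, F, h', w')
        f, hp, wp = feats.shape[1:]
        # Run the LSTM over the stack dimension independently at each location.
        seq = feats.view(b, s, f, hp, wp).permute(0, 3, 4, 1, 2).reshape(b * hp * wp, s, f)
        out, _ = self.lstm(seq)
        fused = out[:, -1].view(b, hp, wp, f).permute(0, 3, 1, 2)   # last state per location
        return self.decoder(fused)                          # (B, 1, H, W)

# The same weights handle stacks of 3, 5, or 10 slices without architectural changes:
model = FocalStackDepthSketch()
for s in (3, 5, 10):
    depth = model(torch.randn(2, s, 3, 64, 64))
    print(s, tuple(depth.shape))                            # (2, 1, 64, 64)
```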
Key Contributions:
- Proposed a novel Transformer-based network, FocDepthFormer, for depth estimation from focal stack images, leveraging self-attention to capture non-local spatial visual features.
- Integrated an LSTM-based recurrent module to handle arbitrary numbers of input images in focal stacks, enhancing flexibility and generalization.
- Employed multi-scale convolutional kernels in an early-stage encoder to capture low-level focus/defocus cues at various scales.
Challenges Addressed:
- Limitations of CNN-based methods with fixed 2D/3D kernels, struggling to generalize across varying focal stack lengths.
- Ineffective capture of long-range dependencies due to local receptive fields in traditional CNN architectures.
- Reliance on large-scale, difficult-to-obtain focal stack training data.
Potential Enterprise Impact:
- Enables flexible and accurate depth prediction from focal stacks of arbitrary lengths, expanding applications in 3D photography and industrial quality control.
- Reduces reliance on expensive 3D sensor hardware and large-scale focal stack datasets through pre-training strategies.
- Contributes to advancing 3D reconstruction techniques and dense point cloud generation from a single viewpoint, particularly for fragile objects where camera motion is prohibited.
Model | RMSE↓ | logRMSE↓ | absRel↓ | sqrRel↓ | Bump↓ | δ ↑ | δ² ↑ | δ³ ↑ |
---|---|---|---|---|---|---|---|---|
DDFFNet [79] | 2.91e-2 | 0.320 | 0.293 | 1.2e-2 | 0.59 | 61.95 | 85.14 | 92.98 |
DefocusNet [161] | 2.55e-2 | 0.230 | 0.180 | 6.0e-3 | 0.46 | 72.56 | 94.15 | 97.92 |
DFVNet [264] | 2.13e-2 | 0.210 | 0.171 | 6.2e-3 | 0.32 | 76.74 | 94.23 | 98.14 |
AiFNet [246] | 2.32e-2 | 0.290 | 0.251 | 8.3e-3 | 0.63 | 68.33 | 87.40 | 93.96 |
Ours (w/o Pre-training) | 2.01e-2 | 0.206 | 0.173 | 5.7e-3 | 0.26 | 78.01 | 95.04 | 98.32 |
Ours (w/ Pre-training) | 1.96e-2 | 0.197 | 0.161 | 5.4e-3 | 0.23 | 79.06 | 96.08 | 98.57 |
3D Modeling of Fragile Artworks and Sculptures
FocDepthFormer enables precise 3D modeling of cultural paintings and sculptures by analyzing focus differences across focal stack images, crucial for digitizing fragile artifacts without physical contact.
This technique extracts 3D information from 2D artworks with strong perspective principles, capturing fine details like painting strokes at millimeter-level precision. Unlike traditional methods, it doesn't require camera motion, making it ideal for preserving fragile heritage. The model's ability to handle arbitrary focal stack sizes offers flexibility for museums, cultural heritage organizations, and educational institutions to create high-accuracy digital replicas for exhibitions, in-depth analysis, and immersive virtual experiences.
3D Reconstruction Using Implicit SDF with Wavelet Feature-Based Prior
Chapter 6 proposes a wavelet-conditioned implicit SDF model for high-fidelity 3D reconstruction from multi-view images. It addresses the challenge of preserving fine-grained geometric details often lost in implicit representations by integrating pre-trained wavelet-transformed depth features and triplane latent space fusion. This approach significantly enhances surface accuracy and detail preservation, outperforming state-of-the-art explicit and implicit methods on various datasets.
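As a simplified picture of the wavelet conditioning, the sketch below computes a one-level Haar decomposition of a depth map and concatenates bilinearly sampled wavelet features with each 3D query coordinate before a small SDF MLP. The thesis instead uses a pre-trained multi-scale wavelet autoencoder, triplane-aligned projection, and a UNet-based fusion module; the query-to-pixel projection is assumed given here.

```python
# Illustrative wavelet-conditioned SDF: one-level Haar decomposition of a depth map,
# bilinear sampling of the wavelet bands at each query's projected pixel, small MLP.
# (Assumptions: projection uv is precomputed; not the thesis's autoencoder/triplane setup.)
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(depth):
    """One-level Haar wavelet transform of a (B,1,H,W) depth map -> (B,4,H/2,W/2)."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)      # (4, 1, 2, 2)
    return F.conv2d(depth, kernels, stride=2)

class WaveletConditionedSDF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, xyz, wavelet_feats, uv):
        """xyz: (B,N,3) queries; wavelet_feats: (B,4,h,w); uv: (B,N,2) in [-1,1]."""
        sampled = F.grid_sample(wavelet_feats, uv.unsqueeze(2), align_corners=False)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)         # (B, N, 4)
        return self.mlp(torch.cat([xyz, sampled], dim=-1))     # (B, N, 1) signed distance

# Usage with random data (uv would come from projecting xyz with camera intrinsics):
depth = torch.rand(1, 1, 128, 128) * 3.0
feats = haar_dwt(depth)                                        # (1, 4, 64, 64)
xyz = torch.rand(1, 1024, 3) * 2 - 1
uv = torch.rand(1, 1024, 2) * 2 - 1
sdf = WaveletConditionedSDF()(xyz, feats, uv)
print(sdf.shape)                                               # torch.Size([1, 1024, 1])
```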
Key Contributions:
- Introduced Wavelet-Transformed Depth Feature Conditioning, using a pre-trained multi-scale wavelet autoencoder to enhance geometric detail preservation in implicit SDF training.
- Developed Triplane-Aligned Wavelet Feature Projection to seamlessly integrate 2D wavelet features with 3D implicit representations.
- Implemented a Hybrid Feature Fusion for SDF Prediction, utilizing a UNet-based module to combine implicit 3D features with wavelet-transformed depth representations for accurate isosurface mesh extraction.
Challenges Addressed:
- Implicit SDF models struggling to capture fine-grained geometric details and high-frequency information due to limited learning capability of MLPs.
- Oversmoothed reconstructions and loss of sharp edges in existing implicit methods.
- Computational demands and generalization limitations of implicit models on diverse datasets.
Potential Enterprise Impact:
- Enables high-fidelity 3D reconstructions for digital cultural heritage preservation, virtual museums, and immersive VR/AR environments.
- Provides more complete and detailed 3D models of objects and architectural structures from limited image views.
- Advances the integration of geometric constraints in high-frequency bands with deep learning frameworks for robust 3D vision solutions.
3D Reconstruction Comparison: Chamfer Distance (mm) on DTU Scans (Lower Is Better)
Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VolSDF [270] | 1.14 | 1.26 | 0.81 | 0.49 | 1.25 | 0.70 | 0.72 | 1.29 | 1.18 | 0.70 | 0.66 | 1.08 | 0.42 | 0.61 | 0.55 |
NeuS [247] | 1.00 | 1.37 | 0.93 | 0.43 | 1.10 | 0.65 | 0.57 | 1.48 | 1.09 | 0.83 | 0.52 | 1.20 | 0.35 | 0.49 | 0.54 |
Neuralangelo [134] | 0.41 | 0.36 | 0.35 | 0.35 | 1.29 | 0.54 | 0.73 | 0.52 | 0.97 | 0.56 | 0.48 | 0.73 | 0.32 | 0.40 | 0.36 |
BakedSDF [271] | 0.63 | 0.58 | 0.40 | 0.52 | 1.37 | 0.63 | 0.81 | 0.56 | 1.02 | 0.81 | 0.50 | 0.82 | 0.39 | 0.45 | 0.38 |
SuGaR [73] | 1.47 | 1.33 | 1.13 | 0.61 | 2.25 | 1.71 | 1.15 | 1.63 | 1.62 | 1.07 | 0.79 | 2.45 | 0.98 | 0.88 | 0.79 |
GOF [94] | 0.50 | 0.37 | 0.38 | 0.74 | 1.18 | 0.76 | 0.90 | 0.47 | 1.29 | 0.68 | 0.77 | 0.90 | 0.42 | 0.41 | 0.42 |
2DGS [93] | 0.48 | 0.39 | 0.41 | 0.83 | 1.36 | 0.83 | 1.04 | 0.70 | 1.27 | 0.76 | 0.70 | 1.40 | 0.40 | 0.43 | 0.40 |
Ours | 0.45 | 0.34 | 0.32 | 0.34 | 0.97 | 0.52 | 0.54 | 0.50 | 0.82 | 0.53 | 0.45 | 0.68 | 0.30 | 0.36 | 0.34 |
Digital Preservation and Virtual Tourism for Cultural Heritage
Our 3D reconstruction model enables the creation of high-fidelity digital twins of historical sites and artifacts, supporting virtual museums, interactive learning, and 3D printing.
This technology is crucial for preserving cultural heritage, allowing virtual visitors to explore reconstructed ancient sites and artifacts with close-up detail and manipulation capabilities. It integrates wavelet-transformed depth features and triplane embeddings to achieve superior detail. Case studies include virtual tours of Chinese temples (Figure 8.7), game environments such as 'Black Myth: Wukong' (Figure 8.9), and 3D-printed replicas of the Dunhuang Mogao Grottoes (Figure 8.10), demonstrating wide-ranging applications in education, entertainment, and digital preservation.
Calculate Your Potential AI Impact
Estimate the operational efficiency gains and cost savings your enterprise could achieve by integrating advanced geometric deep learning solutions.
Your AI Implementation Roadmap
A phased approach to integrating geometric deep learning into your enterprise, ensuring robust and scalable deployment.
Phase 1: Discovery & Strategy Alignment (1-2 Weeks)
Initial consultations to understand your specific 3D vision challenges and business objectives. We'll identify optimal geometric deep learning solutions and define clear KPIs for success.
Phase 2: Data Preparation & Model Customization (4-8 Weeks)
Collection and annotation of enterprise-specific 3D data. Customization of chosen geometric deep learning models (e.g., surfel-based registration, FocDepthFormer) to your unique environments and requirements.
Phase 3: Integration & Pilot Deployment (6-10 Weeks)
Seamless integration of the AI models into your existing infrastructure. Conduct pilot programs to validate performance in real-world scenarios, gather feedback, and fine-tune for optimal results.
Phase 4: Scaling & Continuous Optimization (Ongoing)
Full-scale deployment across your operations. Establish continuous monitoring, performance tuning, and regular updates to ensure long-term robustness, efficiency, and adaptability to evolving needs.
Ready to Transform Your 3D Vision Capabilities?
Leverage cutting-edge geometric deep learning to enhance camera pose estimation, point cloud registration, depth estimation, and 3D reconstruction in your enterprise.