Compare commits

...

45 Commits

Author SHA1 Message Date
unclearness 1d8cce8add
Fix VS2022 build (OpenMP optional) and debugger working directory for examples/tests (#119)
* Pass compile on Windows

* Set debugger working directory as project root for examples and tests on Windows
2025-06-10 12:48:47 +09:00
koide3 08bc50beff
fix typo (#115) 2025-05-07 14:51:07 +09:00
koide3 9befefb198
sequential voxelmap accessor (#113)
* sequential voxelmap accessor

* capital
2025-05-07 11:35:58 +09:00
koide3 3466ea263a
change accessing order for column major (#112) 2025-05-07 11:25:04 +09:00
koide3 df145cbb15
use ubuntu 24.04 runner to avoid packaging issue (#111)
* use ubuntu 24.04 runner to avoid packaging issue

* ignore errors mismatch
2025-05-07 10:02:51 +09:00
atm db2f8e6646
Fix typos and enable links (#109)
* chore: enable youtube link

* chore: correct typos in comments
2025-05-07 09:31:31 +09:00
koide3 2c5e9e6092
improve batch_knn_search performance (#101) 2025-01-12 13:30:09 +09:00
koide3 ff63d5ef76
add ninja-build (#99)
* add ninja-build

* add ninja-build

* add ninja-build for coverage
2024-12-30 05:51:45 +09:00
fateshelled 13e0a75cc1
fix: update error value (#97) 2024-12-30 05:23:05 +09:00
koide3 8e665814a9
projective search (#92) 2024-09-24 18:04:55 +09:00
Artem Voronov 4e8e745800
feat: add rotation and translation epsilon to adjust convergence criteria in python interface (#91) 2024-09-22 07:52:42 +09:00
koide3 e669301de3
Update README.md 2024-08-10 23:40:06 +09:00
koide3 f127ae7a51
Update README.md 2024-08-10 23:38:23 +09:00
koide3 d6b0cb8e99
Create CITATION.cff 2024-08-10 23:32:29 +09:00
k.koide fd29d8cf94 v1.0.0 2024-08-09 13:43:04 +09:00
koide3 8cbec29610
Add paper (#84)
* add paper.(md|bib)

* Update paper.yml

* fix references

* revise paper

* add missing DOI

* consistent citation style

* use CC and url commands
2024-08-09 13:39:43 +09:00
k.koide ad72715259 v0.1.3 2024-08-06 10:59:59 +09:00
koide3 0a2617d035
Documentation of detailed behaviors (#82)
* detailed documentation (cpp)

* doc for invalid normal estimation results

* docs of detailed algorithm behavior for py
2024-08-06 10:56:32 +09:00
koide3 aec35190eb
fix iridescence find error and provide a small part of KITTI00 (#81) 2024-08-05 16:46:23 +09:00
koide3 76b2f004fa
improve knn performance of voxelmaps (#79)
* improve knn performance of voxelmaps

* add voxel search patterns

* add (gicp|vgicp)_model_omp
2024-07-01 16:25:28 +09:00
k.koide fccb6195a8 v0.1.2 2024-06-28 09:27:22 +09:00
koide3 765da6f68d
CI for license check (#78)
* license check CI

* fix ci

* check both headers and source

* add SPDX identifiers

* trigger on only master
2024-06-27 12:06:17 +09:00
koide3 45b0b29af3
expose verbose option to python (#77)
* expose verbose option to python

* tweak convergence check
2024-06-27 12:06:06 +09:00
Martin Valgur e95cb19007
Fix BUILD_WITH_OPENMP being set to an incorrect value by default (#76) 2024-06-27 09:24:28 +09:00
koide3 f48faf0790
Auto generate and deploy documentation (#71)
* show constructor docs

* constructor docs

* remove unnecessary dependencies for document generation

* trigger on tags
2024-06-21 12:22:44 +09:00
k.koide be01905b04 v0.1.1 2024-06-21 10:13:52 +09:00
koide3 0d3f5e6315
improve knn performance (#70)
* improve knn result collection

* optimize expmap
2024-06-21 10:11:49 +09:00
koide3 7e42a90d27
parallel batch nearest neighbor search (#68) 2024-06-20 11:25:51 +09:00
k.koide ac6c79acb6 v0.1.0 2024-06-18 15:56:32 +09:00
Atticus Zhou 5e367c87f7
feat: add batch knn for kdtrees and docs (#65)
* feat: add batch knn for kdtrees and docs

* fix: update batch nns func name
2024-06-18 15:56:01 +09:00
koide3 06193e3be2
fix typo (#66) 2024-06-18 15:07:03 +09:00
koide3 d83d6fbbd2
fix typo (#63) 2024-06-12 20:24:31 +09:00
koide3 4762de7460
radix sort (#60) 2024-06-12 10:28:33 +09:00
koide3 11f5a304df
Examples (#62)
* update README

* fix exclude path for doxygen
2024-06-12 10:23:45 +09:00
koide3 7b95ffb203
Create CONTRIBUTING.md 2024-06-11 16:15:15 +09:00
koide3 1a6f9a4e79
Update issue templates 2024-06-11 16:03:36 +09:00
koide3 358a45cd5f
Create CODE_OF_CONDUCT.md 2024-06-11 15:59:01 +09:00
koide3 f552ccc9c1
update README and BENCHMARK (#61) 2024-06-11 14:33:58 +09:00
koide3 84406aefda
Docpy (#59)
* doc for py

* doc for cpp

* sphinx

* fix sphinx

* generate docs
2024-06-10 09:14:48 +09:00
koide3 1183e3d39a
Doxygen (#58)
* purge deprecated

* Doxyfile

* fix typo
2024-06-05 18:00:14 +09:00
koide3 76142deb81
Update README.md (#57) 2024-06-05 10:44:29 +09:00
koide3 94d93b9e01
Robust kernel (#54)
* pypi

* robust kernel

* test for robust kernel
2024-06-04 16:16:25 +09:00
Atticus Zhou 5c6f13cfc9
fix: add max_iterations param of align in pybind interface (#52) 2024-05-18 23:59:13 +09:00
k.koide 602d03762b fix pypi repository 2024-05-11 01:38:53 +09:00
k.koide 354dca97f5 rename ci 2024-05-11 01:25:59 +09:00
83 changed files with 5481 additions and 3850 deletions

30
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file
View File

@ -0,0 +1,30 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: bug
assignees: ''
---
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Build the package
2. Run '...'
**Expected behavior**
A clear and concise description of what you expected to happen.
**Screenshots**
If applicable, add screenshots to help explain your problem.
**Environment (please complete the following information):**
- OS: [e.g. Ubuntu]
- Version [e.g. 22.04]
- Language [C++ / Python]
**Additional context**
Add any other context about the problem here.

View File

@ -0,0 +1,17 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: enhancement
assignees: ''
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is.
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**Additional context**
Add any other context or screenshots about the feature request here.

View File

@ -13,7 +13,7 @@ on:
jobs:
coverage:
runs-on: ubuntu-22.04
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v2
@ -23,7 +23,7 @@ jobs:
- name: Install Dependencies
run: |
sudo apt-get -y update
sudo apt-get install --no-install-recommends -y build-essential cmake python3-pip pybind11-dev libeigen3-dev libfmt-dev libtbb-dev libomp-dev libpcl-dev libgtest-dev lcov
sudo apt-get install --no-install-recommends -y build-essential cmake python3-pip pybind11-dev libeigen3-dev libfmt-dev libtbb-dev libomp-dev libpcl-dev libgtest-dev lcov ninja-build
pip install -U setuptools pytest pytest-cov numpy scipy
- name: Build (C++)

38
.github/workflows/gendoc.yml vendored Normal file
View File

@ -0,0 +1,38 @@
name: gendoc
on:
push:
tags:
- '*'
# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
jobs:
gendoc:
name: Generate documentation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure Git Credentials
run: |
git config user.name github-actions[bot]
git config user.email 41898282+github-actions[bot]@users.noreply.github.com
- name : Install dependencies
run: sudo apt-get install -y git cmake build-essential doxygen graphviz python3-dev python3-pip pybind11-dev libeigen3-dev libomp-dev python3-numpy python3-sphinx python3-sphinx-rtd-theme
- name: Install python dependencies
run: python -m pip install mkdocs mkdocs-material
- name: Generate documentation
run: cd docs && make all
- uses: actions/upload-artifact@v4
with:
name: site
path: ./site/*
- name: Deploy documentation
run: cd docs && make deploy

21
.github/workflows/license.yml vendored Normal file
View File

@ -0,0 +1,21 @@
name: license
on:
push:
branches: [ master ]
paths-ignore: '**.md'
pull_request:
branches: [ master ]
paths-ignore: '**.md'
jobs:
license_check:
name: License check
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Check license
run: |
find include/small_gicp src/small_gicp -type f | xargs -I filename bash -c 'if ! grep -q SPDX-License-Identifier filename; then echo filename : lisence not found; exit 1; fi'

29
.github/workflows/paper.yml vendored Normal file
View File

@ -0,0 +1,29 @@
name: Draft PDF
on:
push:
branches: [ paper ]
# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
jobs:
paper:
runs-on: ubuntu-latest
name: Paper Draft
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build draft PDF
uses: openjournals/openjournals-draft-action@master
with:
journal: joss
# This should be the path to the paper within your repo.
paper-path: docs/paper.md
- name: Upload
uses: actions/upload-artifact@v3
with:
name: paper
# This is the output path where Pandoc will write the compiled
# PDF. Note, this should be the same directory as the input
# paper.md
path: docs/paper.pdf

View File

@ -13,7 +13,7 @@ on:
jobs:
test:
runs-on: ubuntu-22.04
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v2
@ -23,7 +23,7 @@ jobs:
- name: Install Dependencies
run: |
sudo apt-get -y update
sudo apt-get install --no-install-recommends -y build-essential cmake python3-pip pybind11-dev libeigen3-dev libfmt-dev libtbb-dev libomp-dev libpcl-dev libgtest-dev
sudo apt-get install --no-install-recommends -y build-essential cmake python3-pip pybind11-dev libeigen3-dev libfmt-dev libtbb-dev libomp-dev libpcl-dev libgtest-dev ninja-build
pip install -U setuptools pytest numpy scipy
- name: Build

View File

@ -1,4 +1,4 @@
name: Build
name: wheels
on:
workflow_dispatch:
@ -74,5 +74,3 @@ jobs:
merge-multiple: true
- uses: pypa/gh-action-pypi-publish@release/v1
with:
repository-url: https://pypi.org/legacy/

View File

@ -1,5 +1,11 @@
# Benchmark
## Dataset
The following benchmark requires the KITTI odometry evaluation dataset. You can download the full dataset (80GB) from [the official dataset page](https://www.cvlibs.net/datasets/kitti/eval_odometry.php) or a part of the dataset (500 frames in 00 sequence, 622MB) from [google drive (KITTI00.tar.gz)](https://drive.google.com/file/d/1h9tARKvX6BwLfc_vfMfdxuP3bSbmgjmd/view?usp=sharing).
Note that because the original KITTI dataset is distributed under the CC BY-NC-SA 3.0 license, the derived dataset (KITTI00.tar.gz) must not be used for commercial purposes.
## Build
```bash
@ -117,4 +123,11 @@ small_gicp : APE=6.096 +- 3.056 RPE(100)=1.211 +- 0.717 RPE(400)=6.0
small_gicp (tbb) : APE=6.096 +- 3.056 RPE(100)=1.211 +- 0.717 RPE(400)=6.057 +- 3.123 RPE(800)=10.364 +- 6.336
small_gicp (omp) : APE=6.096 +- 3.056 RPE(100)=1.211 +- 0.717 RPE(400)=6.057 +- 3.123 RPE(800)=10.364 +- 6.336
small_vgicp : APE=5.956 +- 2.725 RPE(100)=1.315 +- 0.762 RPE(400)=6.849 +- 3.401 RPE(800)=10.396 +- 6.972
```
```
### Comparison with Open3D
[Code](https://github.com/koide3/small_gicp/blob/pybench/src/benchmark/odometry_benchmark.py)
Processing speed comparison between small_gicp and Open3D ([youtube](https://youtu.be/LNESzGXPr4c?feature=shared)).
[![small_comp](https://github.com/koide3/small_gicp/assets/31344317/7959edd6-f0e4-4318-b4c1-a3f8755c407f)](https://youtu.be/LNESzGXPr4c?feature=shared)

32
CITATION.cff Normal file
View File

@ -0,0 +1,32 @@
cff-version: "1.2.0"
authors:
- family-names: Koide
given-names: Kenji
orcid: "https://orcid.org/0000-0001-5361-1428"
contact:
- family-names: Koide
given-names: Kenji
orcid: "https://orcid.org/0000-0001-5361-1428"
doi: 10.5281/zenodo.13283012
message: If you use this software, please cite our article in the
Journal of Open Source Software.
preferred-citation:
authors:
- family-names: Koide
given-names: Kenji
orcid: "https://orcid.org/0000-0001-5361-1428"
date-published: 2024-08-10
doi: 10.21105/joss.06948
issn: 2475-9066
issue: 100
journal: Journal of Open Source Software
publisher:
name: Open Journals
start: 6948
title: "small_gicp: Efficient and parallel algorithms for point cloud
registration"
type: article
url: "https://joss.theoj.org/papers/10.21105/joss.06948"
volume: 9
title: "small_gicp: Efficient and parallel algorithms for point cloud
registration"

View File

@ -1,5 +1,5 @@
cmake_minimum_required(VERSION 3.16)
project(small_gicp VERSION 0.0.6 LANGUAGES C CXX)
project(small_gicp VERSION 1.0.0 LANGUAGES C CXX)
set(CMAKE_CXX_STANDARD 17)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_LIST_DIR}/cmake")
@ -22,7 +22,7 @@ option(BUILD_PYTHON_BINDINGS "Build python bindings" OFF)
option(ENABLE_COVERAGE "Enable coverage" OFF)
# Dependency options
set(BUILD_WITH_OPENMP CACHE STRING "Build with OpenMP" "auto")
set(BUILD_WITH_OPENMP "auto" CACHE STRING "Build with OpenMP")
option(BUILD_WITH_TBB "Build with TBB" OFF)
option(BUILD_WITH_PCL "Build with PCL (required for benchmark and test only)" OFF)
option(BUILD_WITH_FAST_GICP "Build with fast_gicp (required for benchmark and test only)" OFF)
@ -97,7 +97,10 @@ endif()
if(MSVC)
add_compile_definitions(_USE_MATH_DEFINES)
# add_compile_options(/openmp:llvm)
add_compile_options(/bigobj)
if(BUILD_WITH_OPENMP)
add_compile_options(/openmp:llvm)
endif()
endif()
##############
@ -111,8 +114,8 @@ if(ENABLE_COVERAGE)
find_program(GENHTML genhtml REQUIRED)
add_custom_target(coverage
COMMAND ${LCOV} --directory . --capture --output-file coverage.info
COMMAND ${LCOV} --remove coverage.info -o coverage.info '/usr/*'
COMMAND ${LCOV} --directory . --capture --output-file coverage.info --ignore-errors mismatch
COMMAND ${LCOV} --remove coverage.info -o coverage.info '/usr/*' --ignore-errors mismatch
COMMAND ${GENHTML} --demangle-cpp -o coverage coverage.info
WORKING_DIRECTORY ${CMAKE_BINARY_DIR})
endif()
@ -187,7 +190,9 @@ if(BUILD_BENCHMARKS)
src/benchmark/odometry_benchmark_small_gicp_tbb.cpp
src/benchmark/odometry_benchmark_small_vgicp_tbb.cpp
src/benchmark/odometry_benchmark_small_gicp_model_tbb.cpp
src/benchmark/odometry_benchmark_small_gicp_model_omp.cpp
src/benchmark/odometry_benchmark_small_vgicp_model_tbb.cpp
src/benchmark/odometry_benchmark_small_vgicp_model_omp.cpp
src/benchmark/odometry_benchmark_small_gicp_tbb_flow.cpp
src/benchmark/odometry_benchmark.cpp
)
@ -245,6 +250,12 @@ if(BUILD_EXAMPLES)
TBB::tbbmalloc
PCL::PCL
)
if(MSVC)
set_target_properties(${EXAMPLE_NAME}
PROPERTIES VS_DEBUGGER_WORKING_DIRECTORY
"${CMAKE_SOURCE_DIR}"
)
endif()
endforeach()
endif()
@ -271,6 +282,13 @@ if(BUILD_TESTS)
)
gtest_discover_tests(${TEST_NAME} WORKING_DIRECTORY ${CMAKE_SOURCE_DIR})
if(MSVC)
set_target_properties(${TEST_NAME}
PROPERTIES VS_DEBUGGER_WORKING_DIRECTORY
"${CMAKE_SOURCE_DIR}"
)
endif()
endforeach()
endif()

128
CODE_OF_CONDUCT.md Normal file
View File

@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
k.koide@aist.go.jp.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.

12
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,12 @@
## Contributing
Thank you for your support of this package. Your contributions to improve the package are welcome as long as you follow the [Code of Conduct](CODE_OF_CONDUCT.md).
## Issues
Please open issues to report bugs and propose feature requests. When creating a new issue, please use an issue template and fill in each item to clearly describe your situation to the maintainers.
## Pull Requests
Please take a look at the official GitHub documentation [GitHub](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) for general recommendations on creating a PR.
In addition, please let your PR focus on a single topic (improvement or bugfix) and keep it as minimal as possible to reduce the review effort.

View File

@ -1,23 +1,23 @@
# small_gicp (fast_gicp2)
# small_gicp
**small_gicp** is a header-only C++ library that offers efficient and parallelized algorithms for fine point cloud registration (ICP, Point-to-Plane ICP, GICP, VGICP, etc.). It is a refined and optimized version of its predecessor, [fast_gicp](https://github.com/SMRT-AIST/fast_gicp), re-written from scratch with the following features.
**small_gicp** is a header-only C++ library providing efficient and parallelized algorithms for fine point cloud registration (ICP, Point-to-Plane ICP, GICP, VGICP, etc.). It is a refined and optimized version of its predecessor, [fast_gicp](https://github.com/SMRT-AIST/fast_gicp), re-written from scratch with the following features.
- **Highly Optimized** : The implementation of the core registration algorithm is further optimized from that in fast_gicp. It enables up to **2x speed gain**.
- **All parallelized** : small_gicp offers parallel implementations of several preprocessing algorithms to make the entire registration process parallelized (Downsampling, KdTree construction, Normal/covariance estimation). As a parallelism backend, either (or both) [OpenMP](https://www.openmp.org/) and [Intel TBB](https://github.com/oneapi-src/oneTBB) can be used.
- **Minimum dependency** : Only [Eigen](https://eigen.tuxfamily.org/) (and bundled [nanoflann](https://github.com/jlblancoc/nanoflann) and [Sophus](https://github.com/strasdat/Sophus)) are required at a minimum. Optionally, it provides the [PCL](https://pointclouds.org/) registration interface so that it can be used as a drop-in replacement.
- **Customizable** : small_gicp allows feeding any custom point cloud class to the registration algorithm via traits. Furthermore, the template-based implementation enables customizing the registration process with your original correspondence estimator and registration factors.
- **Python bindings** : The isolation from PCL makes small_gicp's python bindings more portable and usable with other libraries (e.g., Open3D) without problems.
- **Highly Optimized** : The core registration algorithm implementation has been further optimized from fast_gicp, achieving up to **2x speed gain**.
- **Fully parallelized** : small_gicp offers parallel implementations of several preprocessing algorithms, making the entire registration process parallelized (e.g., Downsampling, KdTree construction, Normal/Covariance estimation). It supports [OpenMP](https://www.openmp.org/) and [Intel TBB](https://github.com/oneapi-src/oneTBB) as parallelism backends.
- **Minimum dependencies** : The library requires only [Eigen](https://eigen.tuxfamily.org/) along with the bundled [nanoflann](https://github.com/jlblancoc/nanoflann) and [Sophus](https://github.com/strasdat/Sophus). Optionally, it supports a [PCL](https://pointclouds.org/) registration interface for use as a drop-in replacement
- **Customizable** : small_gicp allows the integration of any custom point cloud class into the registration algorithm via traits. Its template-based implementation enables customization of the registration process with original correspondence estimators and registration factors.
- **Python bindings** : By being isolated from PCL, small_gicp's Python bindings are more portable and can be used seamlessly with other libraries such as Open3D.
Note that GPU-based implementations are NOT included in this package.
If you find this package useful for your project, please consider leaving a comment [here](https://github.com/koide3/small_gicp/issues/3). It would help the author receive recognition in his organization and keep working on this project.
If you find this package useful for your project, please consider leaving a comment [here](https://github.com/koide3/small_gicp/issues/3). It would help the author receive recognition in his organization and keep working on this project. Please also cite [the corresponding software paper](https://joss.theoj.org/papers/10.21105/joss.06948) if you use this software in an academic work.
[![Build(Linux)](https://github.com/koide3/small_gicp/actions/workflows/build-linux.yml/badge.svg)](https://github.com/koide3/small_gicp/actions/workflows/build-linux.yml) [![macos](https://github.com/koide3/small_gicp/actions/workflows/build-macos.yml/badge.svg)](https://github.com/koide3/small_gicp/actions/workflows/build-macos.yml) [![Build(Windows)](https://github.com/koide3/small_gicp/actions/workflows/build-windows.yml/badge.svg)](https://github.com/koide3/small_gicp/actions/workflows/build-windows.yml) [![Test](https://github.com/koide3/small_gicp/actions/workflows/test.yml/badge.svg)](https://github.com/koide3/small_gicp/actions/workflows/test.yml) [![codecov](https://codecov.io/gh/koide3/small_gicp/graph/badge.svg?token=PCVIUP2Z33)](https://codecov.io/gh/koide3/small_gicp)
[![status](https://joss.theoj.org/papers/059b017c823ed9fd211fc91373cdc2cc/status.svg)](https://joss.theoj.org/papers/059b017c823ed9fd211fc91373cdc2cc) [![Build(Linux)](https://github.com/koide3/small_gicp/actions/workflows/build-linux.yml/badge.svg)](https://github.com/koide3/small_gicp/actions/workflows/build-linux.yml) [![macos](https://github.com/koide3/small_gicp/actions/workflows/build-macos.yml/badge.svg)](https://github.com/koide3/small_gicp/actions/workflows/build-macos.yml) [![Build(Windows)](https://github.com/koide3/small_gicp/actions/workflows/build-windows.yml/badge.svg)](https://github.com/koide3/small_gicp/actions/workflows/build-windows.yml) [![Test](https://github.com/koide3/small_gicp/actions/workflows/test.yml/badge.svg)](https://github.com/koide3/small_gicp/actions/workflows/test.yml) [![codecov](https://codecov.io/gh/koide3/small_gicp/graph/badge.svg?token=PCVIUP2Z33)](https://codecov.io/gh/koide3/small_gicp)
## Requirements
This library uses some C++17 features. The PCL interface is not compatible with PCL older than 1.11 that uses `boost::shared_ptr`.
This library uses C++17 features. The PCL interface is not compatible with PCL older than 1.11 that uses `boost::shared_ptr`.
## Dependencies
@ -41,11 +41,19 @@ cmake .. -DCMAKE_BUILD_TYPE=Release && make -j
sudo make install
```
### Python (Linux / Windows)
### Python (Linux / Windows / MacOS)
#### Install from [PyPI](https://pypi.org/project/small-gicp/)
```bash
pip install small_gicp
```
#### Install from source
```bash
cd small_gicp
pip install . --user
pip install .
# [Optional (linux)] Install stubs for autocomplete (If you know a better way, let me know...)
pip install pybind11-stubgen
@ -53,6 +61,12 @@ cd ~/.local/lib/python3.10/site-packages
pybind11-stubgen -o . --ignore-invalid=all small_gicp
```
## Documentation
- Auto-generated docs: [C++](https://koide3.github.io/small_gicp/doc_cpp/index.html) [Python](https://koide3.github.io/small_gicp/doc_py/index.html)
## Usage (C++)
The following examples assume `using namespace small_gicp` is placed somewhere.
@ -266,7 +280,7 @@ Example A : Perform registration with numpy arrays
# registration_type : str = 'GICP'
# Type of registration algorithm to use ('ICP', 'PLANE_ICP', 'GICP', 'VGICP').
# voxel_resolution : float = 1.0
# Resolution of voxels used for downsampling.
# Resolution of voxels used for correspondence search (used only in VGICP).
# downsampling_resolution : float = 0.25
# Resolution for downsampling the point clouds.
# max_correspondence_distance : float = 1.0
@ -390,34 +404,58 @@ open3d.visualization.draw_geometries([target_o3d, source_o3d])
- [Scan-to-scan and scan-to-model GICP matching odometry on KITTI](src/example/kitti_odometry.py)
## Running examples
### C++
```bash
cd small_gicp
mkdir build && cd build
cmake .. -DBUILD_EXAMPLES=ON && make -j
cd ..
./build/01_basic_registration
./build/02_basic_registration_pcl
./build/03_registration_template
```
### Python
```bash
cd small_gicp
pip install .
python3 src/example/basic_registration.py
```
## [Benchmark](BENCHMARK.md)
Processing speed comparison between small_gicp and Open3D ([youtube]((https://youtu.be/LNESzGXPr4c?feature=shared))).
Processing speed comparison between small_gicp and Open3D ([youtube](https://youtu.be/LNESzGXPr4c?feature=shared)).
[![small_comp](https://github.com/koide3/small_gicp/assets/31344317/7959edd6-f0e4-4318-b4c1-a3f8755c407f)](https://youtu.be/LNESzGXPr4c?feature=shared)
### Downsampling
- Single-threaded `small_gicp::voxelgrid_sampling` is about **1.3x faster** than `pcl::VoxelGrid`.
- Multi-threaded `small_gicp::voxelgrid_sampling_tbb` (6 threads) is about **3.2x faster** than `pcl::VoxelGrid`.
- `small_gicp::voxelgrid_sampling` gives accurate downsampling results (almost identical to those of `pcl::VoxelGrid`) while `pcl::ApproximateVoxelGrid` yields spurious points (up to 2x points).
- `small_gicp::voxelgrid_sampling` can process a larger point cloud with a fine voxel resolution compared to `pcl::VoxelGrid` (for a point cloud of 1000m width, the minimum voxel resolution can be **0.5 mm**).
- `small_gicp::voxelgrid_sampling` provides accurate downsampling results that are nearly identical to those of `pcl::VoxelGrid`, while `pcl::ApproximateVoxelGrid` can produce spurious points (up to 2x more points).
- `small_gicp::voxelgrid_sampling` can handle larger point clouds with finer voxel resolutions compared to `pcl::VoxelGrid`. For a point cloud with a width of 1000m, the minimum voxel resolution can be **0.5 mm**.
![downsampling_comp](docs/assets/downsampling_comp.png)
### KdTree construction
- Multi-threaded implementation (TBB and OMP) can be up to **6x faster** than the single-threaded one. The single-thread version shows almost equivalent performance with nanoflann.
- ~~The processing speed gets faster as the number of threads increases, but the speed gain is not monotonic sometimes (because of the scheduling algorithm or some CPU(AMD 5995WX)-specific issues?)~~.
- The new KdTree implementation shows a good scalability thanks to its well balanced task assignment.
- This benchmark only compares the construction time (query time is not included).
- Multi-threaded implementation (TBB and OMP) can be up to **6x faster** than the single-threaded version. The single-thread version performs almost equivalently to nanoflann.
- The new KdTree implementation demonstrates good scalability due to its well-balanced task assignment.
- This benchmark compares only the construction time (query time is not included). Nearest neighbor queries are included and evaluated in the following odometry estimation evaluation.
![kdtree_time](docs/assets/kdtree_time.png)
### Odometry estimation
- Single-thread `small_gicp::GICP` is about **2.4x and 1.9x faster** than `pcl::GICP` and `fast_gicp::GICP`, respectively.
- `small_gicp::(GICP|VGICP)` shows a better multi-thread scalability compared to `fast_gicp::(GICP|VGICP)`.
- `small_gicp::GICP` parallelized with [TBB flow graph](src/odometry_benchmark_small_gicp_tbb_flow.cpp) shows an excellent scalability to many-threads situations (**~128 threads**) but with latency degradation.
- `small_gicp::(GICP|VGICP)` demonstrates better multi-threaded scalability compared to `fast_gicp::(GICP|VGICP)`.
- `small_gicp::GICP` parallelized with [TBB flow graph](src/benchmark/odometry_benchmark_small_gicp_tbb_flow.cpp) shows excellent scalability in many-threads scenarios (**~128 threads**), though with some latency degradation.
- Outputs of `small_gicp::GICP` are almost identical to those of `fast_gicp::GICP`.
![odometry_time](docs/assets/odometry_time.png)
@ -427,6 +465,22 @@ This package is released under the MIT license.
If you find this package useful for your project, please consider leaving a comment [here](https://github.com/koide3/small_gicp/issues/3). It would help the author receive recognition in his organization and keep working on this project.
Please also cite [the following article](https://joss.theoj.org/papers/10.21105/joss.06948) if you use this software in an academic work.
```
@article{small_gicp,
author = {Kenji Koide},
title = {{small\_gicp: Efficient and parallel algorithms for point cloud registration}},
journal = {Journal of Open Source Software},
month = aug,
number = {100},
pages = {6948},
volume = {9},
year = {2024},
doi = {10.21105/joss.06948}
}
```
## Contact
[Kenji Koide](https://staff.aist.go.jp/k.koide/), National Institute of Advanced Industrial Science and Technology (AIST)

View File

@ -1,21 +0,0 @@
# FindIridescence.cmake — CMake find module for the Iridescence visualization library.
# Locates the headers and the iridescence/gl_imgui libraries and exposes them
# through the imported target Iridescence::Iridescence.

# Locate the include directory via a representative header.
find_path(Iridescence_INCLUDE_DIRS glk/drawable.hpp
  HINTS /usr/local/include/iridescence /usr/include/iridescence
  DOC "Iridescence include directories")

# Core Iridescence library.
find_library(Iridescence_LIBRARY NAMES iridescence
  HINTS /usr/local/lib /usr/lib
  DOC "Iridescence libraries")

# Companion ImGui-binding library shipped with Iridescence.
find_library(gl_imgui_LIBRARY NAMES gl_imgui
  HINTS /usr/local/lib /usr/lib
  DOC "Iridescence libraries")

set(Iridescence_LIBRARIES ${Iridescence_LIBRARY} ${gl_imgui_LIBRARY})

# Imported interface target so consumers can simply link Iridescence::Iridescence.
add_library(Iridescence::Iridescence INTERFACE IMPORTED GLOBAL)
set_target_properties(Iridescence::Iridescence PROPERTIES
  INTERFACE_INCLUDE_DIRECTORIES "${Iridescence_INCLUDE_DIRS}"
  INTERFACE_LINK_LIBRARIES "${Iridescence_LIBRARIES}")

# Standard found/not-found handling (sets Iridescence_FOUND and reports missing vars).
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(Iridescence DEFAULT_MSG Iridescence_INCLUDE_DIRS Iridescence_LIBRARIES)

View File

@ -7,7 +7,7 @@ ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install --no-install-recommends -y \
&& apt-get install --no-install-recommends -y \
wget nano build-essential git cmake python3-dev python3-pip pybind11-dev \
libeigen3-dev libomp-dev
libeigen3-dev libomp-dev ninja-build
RUN mkdir -p ~/.config/pip
RUN echo "[global]\nbreak-system-packages=true" > ~/.config/pip/pip.conf

View File

@ -7,7 +7,7 @@ ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install --no-install-recommends -y \
&& apt-get install --no-install-recommends -y \
wget nano build-essential git cmake python3-dev python3-pip pybind11-dev \
libeigen3-dev libomp-dev
libeigen3-dev libomp-dev ninja-build
RUN apt-get update && apt-get install --no-install-recommends -y \
&& apt-get install --no-install-recommends -y \

2579
docs/Doxyfile Normal file

File diff suppressed because it is too large Load Diff

28
docs/Makefile Normal file
View File

@ -0,0 +1,28 @@
.PHONY: help
help:
@echo "make cpp|py|all"
.PHONY: cpp
cpp:
@echo "Building C++ documentation..."
doxygen Doxyfile doc_cpp
.PHONY: py
py:
@echo "Building Python documentation..."
mkdir -p build && cd build && cmake ../../ -DBUILD_PYTHON_BINDINGS=ON && make -j
sphinx-build -b singlehtml . ./doc_py/
.PHONY: mkdocs
mkdocs:
@echo "Building MkDocs documentation..."
cd .. && mkdocs build
.PHONY: all
all: cpp py mkdocs
@echo "All documentation built."
.PHONY: deploy
deploy:
@echo "Deploying documentation..."
cd .. && mkdocs gh-deploy --force

32
docs/conf.py Normal file
View File

@ -0,0 +1,32 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'small_gicp'
copyright = '2024, k.koide'
author = 'k.koide'
version = '1.0.0'

import os
import sys

# Make the compiled small_gicp Python module importable so that
# sphinx.ext.autodoc can introspect it. The module is expected to have been
# built into ./build/ beforehand (see docs/Makefile) — TODO confirm path.
sys.path.insert(0, os.path.abspath('./build/'))

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

# autodoc extracts API docs from the module; napoleon parses
# Google/NumPy-style docstrings.
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

# Concatenate the class docstring and the __init__ docstring in class docs.
autoclass_content = 'both'

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']

6
docs/index.md Normal file
View File

@ -0,0 +1,6 @@
# small_gicp
## Documentation
- [C++](doc_cpp/index.html)
- [Python](doc_py/index.html)

22
docs/index.rst Normal file
View File

@ -0,0 +1,22 @@
.. small_gicp documentation master file, created by
sphinx-quickstart on Wed Jun 5 11:07:43 2024.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to small_gicp's documentation!
======================================
.. toctree::
:maxdepth: 2
:caption: Contents:
small_gicp
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

120
docs/paper.bib Normal file
View File

@ -0,0 +1,120 @@
@InProceedings{Rusu,
author = {Radu Bogdan Rusu and Steve Cousins},
title = {{3D is here: Point Cloud Library (PCL)}},
booktitle = {{IEEE International Conference on Robotics and Automation (ICRA2011)}},
month = {May},
year = {2011},
address = {Shanghai, China},
doi = {10.1109/ICRA.2011.5980567}
}
@article{Zhou,
author = {Qian-Yi Zhou and Jaesik Park and Vladlen Koltun},
title = {{Open3D}: {A} Modern Library for {3D} Data Processing},
journal = {arXiv:1801.09847},
year = {2018},
doi = {10.48550/arXiv.1801.09847}
}
@ARTICLE{Bai,
author={Bai, Chunge and Xiao, Tao and Chen, Yajie and Wang, Haoqian and Zhang, Fang and Gao, Xiang},
journal={IEEE Robotics and Automation Letters},
title={Faster-LIO: Lightweight Tightly Coupled Lidar-Inertial Odometry Using Parallel Sparse Incremental Voxels},
year={2022},
volume={7},
number={2},
pages={4861-4868},
doi={10.1109/LRA.2022.3152830}
}
@inproceedings{Koide,
title = {Voxelized GICP for Fast and Accurate 3D Point Cloud Registration},
author = {Kenji Koide and Masashi Yokozuka and Shuji Oishi and Atsuhiko Banno},
booktitle = {IEEE International Conference on Robotics and Automation (ICRA2021)},
pages = {11054--11059},
year = {2021},
month = {May},
doi = {10.1109/ICRA48506.2021.9560835}
}
@article{Zhang,
title={Iterative point matching for registration of free-form curves and surfaces},
author={Zhang, Zhengyou},
journal={International journal of computer vision},
volume={13},
number={2},
pages={119--152},
year={1994},
publisher={Springer},
doi={10.1007/BF01427149}
}
@inproceedings{Segal,
title={Generalized-{ICP}},
author={Segal, Aleksandr and Haehnel, Dirk and Thrun, Sebastian},
booktitle={Robotics: science and systems},
volume={2},
number={4},
pages={435},
year={2009},
doi={10.15607/rss.2009.v.021}
}
@INPROCEEDINGS{Wang,
author={Wang, Han and Wang, Chen and Xie, Lihua},
booktitle={IEEE International Conference on Robotics and Automation (ICRA2020)},
title={Intensity Scan Context: Coding Intensity and Geometry Relations for Loop Closure Detection},
year={2020},
pages={2095-2101},
doi={10.1109/ICRA40945.2020.9196764}
}
@inproceedings{Izadinia,
title={Scene recomposition by learning-based icp},
author={Izadinia, Hamid and Seitz, Steven M},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2020)},
pages={930--939},
year={2020},
doi={10.1109/cvpr42600.2020.00101}
}
@ARTICLE{Kim,
author={Kim, Kyuwon and Im, Junhyuck and Jee, Gyuin},
journal={IEEE Transactions on Intelligent Transportation Systems},
title={Tunnel Facility Based Vehicle Localization in Highway Tunnel Using 3D LIDAR},
year={2022},
volume={23},
number={10},
pages={17575-17583},
doi={10.1109/TITS.2022.3160235}
}
@article{Pomerleau,
author = {Pomerleau, Fran{\c c}ois and Colas, Francis and Siegwart, Roland and Magnenat, St{\'e}phane},
title = {{Comparing ICP Variants on Real-World Data Sets}},
journal = {Autonomous Robots},
year = {2013},
volume = {34},
number = {3},
pages = {133--148},
month = feb,
doi={10.1007/s10514-013-9327-2}
}
@INPROCEEDINGS{Serafin,
author={Serafin, Jacopo and Grisetti, Giorgio},
booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2015)},
title={NICP: Dense normal based point cloud registration},
year={2015},
pages={742-749},
doi={10.1109/IROS.2015.7353455}}
@inproceedings{Datar,
author = {Datar, Mayur and Immorlica, Nicole and Indyk, Piotr and Mirrokni, Vahab S.},
title = {Locality-sensitive hashing scheme based on p-stable distributions},
year = {2004},
doi = {10.1145/997817.997857},
booktitle = {Twentieth Annual Symposium on Computational Geometry},
pages = {253--262},
}

127
docs/paper.md Normal file
View File

@ -0,0 +1,127 @@
---
title: 'small_gicp: Efficient and parallel algorithms for point cloud registration'
tags:
- C++
- Python
- Point cloud registration
authors:
- name: Kenji Koide
orcid: 0000-0001-5361-1428
corresponding: true
affiliation: "1"
affiliations:
- name: National Institute of Advanced Industrial Science and Technology (AIST), Japan
index: 1
date: 22 June 2024
bibliography: paper.bib
---
\def\CC{{C\nolinebreak[4]\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}
# Summary
Point cloud registration is a task of aligning two point clouds measured by 3D ranging
sensors, for example, LiDARs and range cameras. Iterative point cloud registration,
also known as fine registration or local registration, iteratively refines the transformation
between point clouds starting from an initial guess.
Each iteration involves a proximity-based point correspondence search and the minimization
of the distance between corresponding points, continuing until convergence.
Iterative closest point (ICP) and its variants,
such as Generalized ICP, are representative iterative point cloud registration algorithms.
They are widely used in applications like autonomous vehicle localization [@Kim], place recognition [@Wang],
and object classification [@Izadinia]. Since these applications often require real-time or near-real-time
processing, speed is a critical factor in point cloud registration routines.
**small_gicp** provides efficient and parallel algorithms to create an extremely
fast point cloud registration pipeline. It offers parallel implementations of
downsampling, nearest neighbor search, local feature extraction, and registration
to accelerate the entire process.
small_gicp is implemented as a header-only \CC library with minimal dependencies
to offer efficiency, portability, and customizability.
# Statement of need
There are several point cloud processing libraries, and PCL [@Rusu], Open3D
[@Zhou], libpointmatcher [@Pomerleau] are commonly used in real-time applications
owing to their performant implementations.
Although they offer numerous functionalities, including those required for point cloud
registration, they present several challenges for practical applications and scientific
research.
**Processing speed:**
A typical point cloud registration pipeline includes processes such as downsampling,
nearest neighbor search (e.g., KdTree construction), local feature estimation, and
registration error minimization.
PCL and Open3D support multi-threading only for parts of these processes (feature
estimation and registration error minimization), with the remaining single-threaded
parts often limiting the overall processing speed.
Additionally, the multi-thread implementations in these libraries can have significant
overheads, reducing scalability to many-core CPUs.
These issues make it difficult to meet real-time processing requirements, especially on
low-specification CPUs. It is also difficult to fully utilize the computational power
of modern high-end CPUs.
**Customizability:**
Customizing the internal workings (e.g., replacing the registration cost function or
changing the correspondence search method) of existing implementations is challenging
due to hard-coded processes. This poses a significant hurdle for research and development,
where testing new cost functions and search algorithms is essential.
**small_gicp:**
To address these issues and accelerate the development of point cloud registration-related systems,
we designed small_gicp with the following features:
- Fully parallelized point cloud preprocessing and registration algorithms with minimal overhead,
offering up to 2x speed gain in single-threaded scenarios and better scalability in multi-core
environments.
- A modular and customizable framework using \CC templates, allowing easy customization of the
algorithm's internal workings while maintaining efficiency.
- A header-only \CC library implementation for easy integration into user projects, with Python bindings
provided for collaborative use with other libraries (e.g., Open3D).
# Functionalities
**small_gicp** implements several preprocessing algorithms related to point cloud registration, and
ICP variant algorithms (point-to-point ICP, point-to-plane ICP, and Generalized ICP based on
distribution-to-distribution correspondence).
- Downsampling
- Voxelgrid sampling
- Random sampling
- Nearest neighbor search and point accumulation structures
- KdTree
- Linear iVox (supports incremental points insertion and LRU-cache-based voxel deletion) [@Bai]
- Gaussian voxelmap (supports incremental points insertion and LRU-cache-based voxel deletion) [@Koide]
- Registration error functions
- Point-to-point ICP error [@Zhang]
- Point-to-plane ICP error
- Generalized ICP error [@Segal]
- Robust kernels
- Least squares optimizers
- Gauss-Newton optimizer
- Levenberg-Marquardt optimizer
# Benchmark results
- Single-threaded and multi-threaded (6 threads) `small_gicp::voxelgrid_sampling` are approximately 1.3x and 3.2x faster than `pcl::VoxelGrid`, respectively.
- Multi-threaded construction of `small_gicp::KdTree` can be up to 6x faster than that of `nanoflann`.
- Single-threaded `small_gicp::GICP` is about 2.4x faster than `pcl::GICP`, with the multi-threaded version showing better scalability.
More details can be found at \url{https://github.com/koide3/small_gicp/blob/master/BENCHMARK.md}.
# Future work
The efficiency of nearest neighbor search significantly impacts the overall performance of point cloud registration.
While small_gicp currently offers efficient and parallel implementations of KdTree and voxelmap, which are general and useful in many situations,
there are other nearest neighbor search methods that can be more efficient under mild assumptions about the point cloud measurement model
(e.g., projective search [@Serafin]).
We plan to implement these alternative neighbor search algorithms to further enhance the speed of the point cloud registration process.
The design of small_gicp, where nearest neighbor search and pose optimization are decoupled, facilitates the easy integration of these new search algorithms.
# Acknowledgements
This work was supported in part by JSPS KAKENHI Grant Number 23K16979.
# References

7
docs/small_gicp.rst Normal file
View File

@ -0,0 +1,7 @@
small\_gicp module
==================
.. automodule:: small_gicp
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,105 +0,0 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <memory>
#include <Eigen/Core>
#include <small_gicp/points/traits.hpp>
#include <small_gicp/ann/traits.hpp>
#include <small_gicp/ann/nanoflann.hpp>
namespace small_gicp {
/// @brief Unsafe KdTree with an arbitrary nanoflann Adaptor.
/// @note This class does not hold the ownership of the input points.
///       You must keep the input points alive along with this class.
template <class PointCloud, template <typename, typename, int, typename> class Adaptor>
class UnsafeKdTreeGeneric {
public:
  using Ptr = std::shared_ptr<UnsafeKdTreeGeneric>;
  using ConstPtr = std::shared_ptr<const UnsafeKdTreeGeneric>;
  using ThisClass = UnsafeKdTreeGeneric<PointCloud, Adaptor>;
  // nanoflann index type: L2 distance over 3D points, indexed by size_t.
  using Index = Adaptor<nanoflann::L2_Simple_Adaptor<double, ThisClass, double>, ThisClass, 3, size_t>;

  /// @brief Constructor (single-threaded tree construction).
  /// @param points Input points
  explicit UnsafeKdTreeGeneric(const PointCloud& points) : points(points), index(3, *this, nanoflann::KDTreeSingleIndexAdaptorParams(10)) { index.buildIndex(); }

  /// @brief Constructor
  /// @param points      Input points
  /// @param num_threads Number of threads used for building the index (this argument is only valid for the OMP implementation)
  explicit UnsafeKdTreeGeneric(const PointCloud& points, int num_threads) : points(points), index(3, *this, nanoflann::KDTreeSingleIndexAdaptorParams(10)) {
    index.buildIndex(num_threads);
  }

  ~UnsafeKdTreeGeneric() {}

  // Interfaces required by the nanoflann dataset-adaptor concept.
  size_t kdtree_get_point_count() const { return traits::size(points); }
  double kdtree_get_pt(const size_t idx, const size_t dim) const { return traits::point(points, idx)[dim]; }

  template <class BBox>
  bool kdtree_get_bbox(BBox&) const {
    return false;  // no precomputed bounding box; nanoflann computes it itself
  }

  /// @brief Find k-nearest neighbors
  size_t knn_search(const Eigen::Vector4d& pt, size_t k, size_t* k_indices, double* k_sq_dists) const { return index.knnSearch(pt.data(), k, k_indices, k_sq_dists); }

private:
  const PointCloud& points;  ///< Input points (not owned)
  Index index;               ///< KdTree index (declared after `points`: its constructor uses *this as the dataset adaptor)
};
/// @brief KdTree with an arbitrary nanoflann Adaptor.
/// @note Unlike UnsafeKdTreeGeneric, this class shares ownership of the input
///       points via shared_ptr, so the points are kept alive with the tree.
template <class PointCloud, template <typename, typename, int, typename> class Adaptor>
class KdTreeGeneric {
public:
  using Ptr = std::shared_ptr<KdTreeGeneric>;
  using ConstPtr = std::shared_ptr<const KdTreeGeneric>;

  /// @brief Constructor (single-threaded tree construction).
  /// @param points Input points
  explicit KdTreeGeneric(const std::shared_ptr<const PointCloud>& points) : points(points), tree(*points) {}

  /// @brief Constructor
  /// @param points      Input points
  /// @param num_threads Number of threads used for building the index
  explicit KdTreeGeneric(const std::shared_ptr<const PointCloud>& points, int num_threads) : points(points), tree(*points, num_threads) {}

  ~KdTreeGeneric() {}

  /// @brief Find k-nearest neighbors
  size_t knn_search(const Eigen::Vector4d& pt, size_t k, size_t* k_indices, double* k_sq_dists) const { return tree.knn_search(pt, k, k_indices, k_sq_dists); }

private:
  const std::shared_ptr<const PointCloud> points;       ///< Input points (shared ownership keeps them alive)
  const UnsafeKdTreeGeneric<PointCloud, Adaptor> tree;  ///< KdTree over *points (declared after `points` so it is built after them)
};
/// @brief Standard KdTree (unsafe)
template <class PointCloud>
using UnsafeKdTree = UnsafeKdTreeGeneric<PointCloud, nanoflann::KDTreeSingleIndexAdaptor>;
/// @brief Standard KdTree
template <class PointCloud>
using KdTree = KdTreeGeneric<PointCloud, nanoflann::KDTreeSingleIndexAdaptor>;
namespace traits {

/// @brief Nearest-neighbor-search trait adapter for UnsafeKdTreeGeneric; forwards to its knn_search().
template <class PointCloud, template <typename, typename, int, typename> class Adaptor>
struct Traits<UnsafeKdTreeGeneric<PointCloud, Adaptor>> {
  static size_t knn_search(const UnsafeKdTreeGeneric<PointCloud, Adaptor>& tree, const Eigen::Vector4d& point, size_t k, size_t* k_indices, double* k_sq_dists) {
    return tree.knn_search(point, k, k_indices, k_sq_dists);
  }
};

/// @brief Nearest-neighbor-search trait adapter for KdTreeGeneric; forwards to its knn_search().
template <class PointCloud, template <typename, typename, int, typename> class Adaptor>
struct Traits<KdTreeGeneric<PointCloud, Adaptor>> {
  static size_t knn_search(const KdTreeGeneric<PointCloud, Adaptor>& tree, const Eigen::Vector4d& point, size_t k, size_t* k_indices, double* k_sq_dists) {
    return tree.knn_search(point, k, k_indices, k_sq_dists);
  }
};

}  // namespace traits
} // namespace small_gicp

View File

@ -1,22 +0,0 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <Eigen/Core>
#include <small_gicp/points/traits.hpp>
#include <small_gicp/ann/kdtree.hpp>
#include <small_gicp/ann/nanoflann_omp.hpp>
namespace small_gicp {
/// @brief Unsafe KdTree with multi-thread tree construction with OpenMP backend.
/// @note This class only parallelizes the tree construction. The search is still single-threaded as in the normal KdTree.
template <class PointCloud>
using UnsafeKdTreeOMP = UnsafeKdTreeGeneric<PointCloud, nanoflann::KDTreeSingleIndexAdaptorOMP>;
/// @brief KdTree with multi-thread tree construction with OpenMP backend.
/// @note This class only parallelizes the tree construction. The search is still single-threaded as in the normal KdTree.
template <class PointCloud>
using KdTreeOMP = KdTreeGeneric<PointCloud, nanoflann::KDTreeSingleIndexAdaptorOMP>;
} // namespace small_gicp

View File

@ -1,22 +0,0 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <Eigen/Core>
#include <small_gicp/points/traits.hpp>
#include <small_gicp/ann/kdtree.hpp>
#include <small_gicp/ann/nanoflann_tbb.hpp>
namespace small_gicp {
/// @brief Unsafe KdTree with multi-thread tree construction with TBB backend.
/// @note This class only parallelizes the tree construction. The search is still single-threaded as in the normal KdTree.
template <class PointCloud>
using UnsafeKdTreeTBB = UnsafeKdTreeGeneric<PointCloud, nanoflann::KDTreeSingleIndexAdaptorTBB>;
/// @brief KdTree with multi-thread tree construction with TBB backend.
/// @note This class only parallelizes the tree construction. The search is still single-threaded as in the normal KdTree.
template <class PointCloud>
using KdTreeTBB = KdTreeGeneric<PointCloud, nanoflann::KDTreeSingleIndexAdaptorTBB>;
} // namespace small_gicp

File diff suppressed because it is too large Load Diff

View File

@ -1,657 +0,0 @@
/***********************************************************************
* Software License Agreement (BSD License)
*
* Copyright 2008-2009 Marius Muja (mariusm@cs.ubc.ca). All rights reserved.
* Copyright 2008-2009 David G. Lowe (lowe@cs.ubc.ca). All rights reserved.
* Copyright 2011-2021 Jose Luis Blanco (joseluisblancoc@gmail.com).
* All rights reserved.
*
* THE BSD LICENSE
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
* NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
* THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*************************************************************************/
/**
* This nanoflann_mt.hpp is derived from nanoflann.hpp to parallelize the tree construction with OpenMP.
*/
/** \mainpage nanoflann C++ API documentation
* nanoflann is a C++ header-only library for building KD-Trees, mostly
* optimized for 2D or 3D point clouds.
*
* nanoflann does not require compiling or installing, just an
* #include <nanoflann.hpp> in your code.
*
* See:
* - <a href="modules.html" >C++ API organized by modules</a>
* - <a href="https://github.com/jlblancoc/nanoflann" >Online README</a>
* - <a href="http://jlblancoc.github.io/nanoflann/" >Doxygen
* documentation</a>
*/
#ifndef NANOFLANN_OMP_HPP_
#define NANOFLANN_OMP_HPP_
#include <atomic>
#include <iostream>
#include <small_gicp/ann/nanoflann.hpp>
namespace nanoflann {
/** kd-tree base-class
 *
 * Contains the member functions common to the classes KDTreeSingleIndexAdaptor
 * and KDTreeSingleIndexDynamicAdaptor_.
 *
 * NOTE(review): this OMP variant replaces nanoflann's pooled allocator with a
 * pre-sized node vector claimed through an atomic cursor, so that subtrees can
 * be built concurrently by OpenMP tasks (see divideTree()).
 *
 * \tparam Derived The name of the class which inherits this class.
 * \tparam DatasetAdaptor The user-provided adaptor (see comments above).
 * \tparam Distance The distance metric to use, these are all classes derived
 * from nanoflann::Metric \tparam DIM Dimensionality of data points (e.g. 3 for
 * 3D points) \tparam IndexType Will be typically size_t or int
 */
template <class Derived, typename Distance, class DatasetAdaptor, int DIM = -1, typename IndexType = size_t>
class KDTreeBaseClassOMP {
public:
  /** Frees the previously-built index. Automatically called within
   * buildIndex(). */
  void freeIndex(Derived& obj) {
    obj.root_node = NULL;
    obj.m_size_at_index_build = 0;
  }

  typedef typename Distance::ElementType ElementType;
  typedef typename Distance::DistanceType DistanceType;

  /*--------------------- Internal Data Structures --------------------------*/
  struct Node {
    /** Union used because a node can be either a LEAF node or a non-leaf node,
     * so both data fields are never used simultaneously */
    union {
      struct leaf {
        IndexType left, right;  //!< Indices of points in leaf node
      } lr;
      struct nonleaf {
        int divfeat;                   //!< Dimension used for subdivision.
        DistanceType divlow, divhigh;  //!< The values used for subdivision.
      } sub;
    } node_type;
    Node *child1, *child2;  //!< Child nodes (both=NULL means it is a leaf node)
  };

  typedef Node* NodePtr;

  struct Interval {
    ElementType low, high;
  };

  /**
   * Array of indices to vectors in the dataset.
   */
  std::vector<IndexType> vind;

  NodePtr root_node;

  size_t m_leaf_max_size;

  size_t m_size;                 //!< Number of current points in the dataset
  size_t m_size_at_index_build;  //!< Number of points in the dataset when the
                                 //!< index was built
  int dim;                       //!< Dimensionality of each data point

  /** Define "BoundingBox" as a fixed-size or variable-size container depending
   * on "DIM" */
  typedef typename array_or_vector_selector<DIM, Interval>::container_t BoundingBox;

  /** Define "distance_vector_t" as a fixed-size or variable-size container
   * depending on "DIM" */
  typedef typename array_or_vector_selector<DIM, DistanceType>::container_t distance_vector_t;

  /** The KD-tree used to find neighbours */
  BoundingBox root_bbox;

  /**
   * Pooled memory allocator.
   *
   * Using a pooled memory allocator is more efficient
   * than allocating memory directly when there is a large
   * number of small memory allocations.
   * Here the pool is a pre-sized vector of nodes and pool_count is an atomic
   * cursor, so concurrent OpenMP tasks can claim nodes without locking.
   */
  std::atomic_uint64_t pool_count;
  std::vector<Node> pool;

  /** Returns number of points in dataset */
  size_t size(const Derived& obj) const { return obj.m_size; }

  /** Returns the length of each point in the dataset */
  size_t veclen(const Derived& obj) { return static_cast<size_t>(DIM > 0 ? DIM : obj.dim); }

  /// Helper accessor to the dataset points:
  inline ElementType dataset_get(const Derived& obj, size_t idx, int component) const { return obj.dataset.kdtree_get_pt(idx, component); }

  /// Computes the min/max of the `element`-th component over the `count` points referenced by `ind`.
  void computeMinMax(const Derived& obj, IndexType* ind, IndexType count, int element, ElementType& min_elem, ElementType& max_elem) {
    min_elem = dataset_get(obj, ind[0], element);
    max_elem = dataset_get(obj, ind[0], element);
    for (IndexType i = 1; i < count; ++i) {
      ElementType val = dataset_get(obj, ind[i], element);
      if (val < min_elem) min_elem = val;
      if (val > max_elem) max_elem = val;
    }
  }

  /**
   * Create a tree node that subdivides the list of vecs from vind[first]
   * to vind[last]. The routine is called recursively on each sublist.
   *
   * @param left index of the first vector
   * @param right index of the last vector
   */
  NodePtr divideTree(Derived& obj, const IndexType left, const IndexType right, BoundingBox& bbox) {
    // Claim the next free node from the pre-allocated pool. The atomic
    // post-increment makes this safe when called from concurrent OMP tasks.
    const size_t pool_loc = pool_count++;
    if (pool_loc >= pool.size()) {
      // The pool is sized up-front; running out indicates a sizing bug, so fail hard.
      std::cerr << "error: pool_loc=" << pool_loc << " >= pool.size()=" << pool.size() << std::endl;
      abort();
    }
    NodePtr node = pool.data() + pool_loc;

    /* If too few exemplars remain, then make this a leaf node. */
    if ((right - left) <= static_cast<IndexType>(obj.m_leaf_max_size)) {
      node->child1 = node->child2 = NULL; /* Mark as leaf node. */
      node->node_type.lr.left = left;
      node->node_type.lr.right = right;

      // compute bounding-box of leaf points
      for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
        bbox[i].low = dataset_get(obj, obj.vind[left], i);
        bbox[i].high = dataset_get(obj, obj.vind[left], i);
      }
      for (IndexType k = left + 1; k < right; ++k) {
        for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
          if (bbox[i].low > dataset_get(obj, obj.vind[k], i)) bbox[i].low = dataset_get(obj, obj.vind[k], i);
          if (bbox[i].high < dataset_get(obj, obj.vind[k], i)) bbox[i].high = dataset_get(obj, obj.vind[k], i);
        }
      }
    } else {
      IndexType idx;
      int cutfeat;
      DistanceType cutval;
      middleSplit_(obj, &obj.vind[0] + left, right - left, idx, cutfeat, cutval, bbox);

      node->node_type.sub.divfeat = cutfeat;

      BoundingBox left_bbox(bbox);
      left_bbox[cutfeat].high = cutval;

      BoundingBox right_bbox(bbox);
      right_bbox[cutfeat].low = cutval;

      if ((right - left) <= 512) {
        // Do not parallelize if there are less than 512 points
        node->child1 = divideTree(obj, left, left + idx, left_bbox);
        node->child2 = divideTree(obj, left + idx, right, right_bbox);
      } else {
        // The two subtrees touch disjoint ranges of vind and disjoint pool slots,
        // so they can be built as independent OMP tasks.
        // I did my best to check that the following parallelization does not cause race conditions.
        // But, still not 100% sure if it is correct.
#pragma omp task shared(obj, left_bbox)
        node->child1 = divideTree(obj, left, left + idx, left_bbox);
#pragma omp task shared(obj, right_bbox)
        node->child2 = divideTree(obj, left + idx, right, right_bbox);
#pragma omp taskwait
      }

      node->node_type.sub.divlow = left_bbox[cutfeat].high;
      node->node_type.sub.divhigh = right_bbox[cutfeat].low;

      // Merge the child bounding boxes back into this node's box.
      for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
        bbox[i].low = std::min(left_bbox[i].low, right_bbox[i].low);
        bbox[i].high = std::max(left_bbox[i].high, right_bbox[i].high);
      }
    }

    return node;
  }

  /// Chooses the split dimension (largest spread among dimensions whose span is
  /// close to the maximum) and the split value (box midpoint clamped to the
  /// actual element range), then partitions the points via planeSplit().
  void middleSplit_(Derived& obj, IndexType* ind, IndexType count, IndexType& index, int& cutfeat, DistanceType& cutval, const BoundingBox& bbox) {
    const DistanceType EPS = static_cast<DistanceType>(0.00001);
    ElementType max_span = bbox[0].high - bbox[0].low;
    for (int i = 1; i < (DIM > 0 ? DIM : obj.dim); ++i) {
      ElementType span = bbox[i].high - bbox[i].low;
      if (span > max_span) {
        max_span = span;
      }
    }
    ElementType max_spread = -1;
    cutfeat = 0;
    for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
      ElementType span = bbox[i].high - bbox[i].low;
      if (span > (1 - EPS) * max_span) {
        ElementType min_elem, max_elem;
        computeMinMax(obj, ind, count, i, min_elem, max_elem);
        ElementType spread = max_elem - min_elem;
        if (spread > max_spread) {
          cutfeat = i;
          max_spread = spread;
        }
      }
    }
    // split in the middle
    DistanceType split_val = (bbox[cutfeat].low + bbox[cutfeat].high) / 2;
    ElementType min_elem, max_elem;
    computeMinMax(obj, ind, count, cutfeat, min_elem, max_elem);

    // Clamp the split value to the actual data range so both sides are non-empty.
    if (split_val < min_elem)
      cutval = min_elem;
    else if (split_val > max_elem)
      cutval = max_elem;
    else
      cutval = split_val;

    IndexType lim1, lim2;
    planeSplit(obj, ind, count, cutfeat, cutval, lim1, lim2);

    // Pick a split index as close to the middle as the partition allows,
    // which keeps the tree balanced.
    if (lim1 > count / 2)
      index = lim1;
    else if (lim2 < count / 2)
      index = lim2;
    else
      index = count / 2;
  }

  /**
   * Subdivide the list of points by a plane perpendicular on axe corresponding
   * to the 'cutfeat' dimension at 'cutval' position.
   *
   * On return:
   * dataset[ind[0..lim1-1]][cutfeat]<cutval
   * dataset[ind[lim1..lim2-1]][cutfeat]==cutval
   * dataset[ind[lim2..count]][cutfeat]>cutval
   */
  void planeSplit(Derived& obj, IndexType* ind, const IndexType count, int cutfeat, DistanceType& cutval, IndexType& lim1, IndexType& lim2) {
    /* Move vector indices for left subtree to front of list. */
    IndexType left = 0;
    IndexType right = count - 1;
    for (;;) {
      while (left <= right && dataset_get(obj, ind[left], cutfeat) < cutval) ++left;
      while (right && left <= right && dataset_get(obj, ind[right], cutfeat) >= cutval) --right;
      if (left > right || !right) break;  // "!right" was added to support unsigned Index types
      std::swap(ind[left], ind[right]);
      ++left;
      --right;
    }
    /* If either list is empty, it means that all remaining features
     * are identical. Split in the middle to maintain a balanced tree.
     */
    lim1 = left;
    right = count - 1;
    for (;;) {
      while (left <= right && dataset_get(obj, ind[left], cutfeat) <= cutval) ++left;
      while (right && left <= right && dataset_get(obj, ind[right], cutfeat) > cutval) --right;
      if (left > right || !right) break;  // "!right" was added to support unsigned Index types
      std::swap(ind[left], ind[right]);
      ++left;
      --right;
    }
    lim2 = left;
  }

  /// Accumulates, per dimension, the squared distance from `vec` to the root
  /// bounding box into `dists`, and returns the total (zero when `vec` lies
  /// inside the box). Used to seed the search with initial distance bounds.
  DistanceType computeInitialDistances(const Derived& obj, const ElementType* vec, distance_vector_t& dists) const {
    assert(vec);
    DistanceType distsq = DistanceType();

    for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
      if (vec[i] < obj.root_bbox[i].low) {
        dists[i] = obj.distance.accum_dist(vec[i], obj.root_bbox[i].low, i);
        distsq += dists[i];
      }
      if (vec[i] > obj.root_bbox[i].high) {
        dists[i] = obj.distance.accum_dist(vec[i], obj.root_bbox[i].high, i);
        distsq += dists[i];
      }
    }
    return distsq;
  }
};
/** @addtogroup kdtrees_grp KD-tree classes and adaptors
* @{ */
/** kd-tree static index
*
* Contains the k-d trees and other information for indexing a set of points
* for nearest-neighbor matching.
*
* The class "DatasetAdaptor" must provide the following interface (can be
* non-virtual, inlined methods):
*
* \code
 * // Must return the number of data points
* inline size_t kdtree_get_point_count() const { ... }
*
*
* // Must return the dim'th component of the idx'th point in the class:
* inline T kdtree_get_pt(const size_t idx, const size_t dim) const { ... }
*
* // Optional bounding-box computation: return false to default to a standard
* bbox computation loop.
* // Return true if the BBOX was already computed by the class and returned
* in "bb" so it can be avoided to redo it again.
* // Look at bb.size() to find out the expected dimensionality (e.g. 2 or 3
* for point clouds) template <class BBOX> bool kdtree_get_bbox(BBOX &bb) const
* {
* bb[0].low = ...; bb[0].high = ...; // 0th dimension limits
* bb[1].low = ...; bb[1].high = ...; // 1st dimension limits
* ...
* return true;
* }
*
* \endcode
*
* \tparam DatasetAdaptor The user-provided adaptor (see comments above).
* \tparam Distance The distance metric to use: nanoflann::metric_L1,
* nanoflann::metric_L2, nanoflann::metric_L2_Simple, etc. \tparam DIM
* Dimensionality of data points (e.g. 3 for 3D points) \tparam IndexType Will
* be typically size_t or int
*/
template <typename Distance, class DatasetAdaptor, int DIM = -1, typename IndexType = size_t>
class KDTreeSingleIndexAdaptorOMP : public KDTreeBaseClassOMP<KDTreeSingleIndexAdaptorOMP<Distance, DatasetAdaptor, DIM, IndexType>, Distance, DatasetAdaptor, DIM, IndexType> {
public:
  /** Deleted copy constructor*/
  KDTreeSingleIndexAdaptorOMP(const KDTreeSingleIndexAdaptorOMP<Distance, DatasetAdaptor, DIM, IndexType>&) = delete;
  /**
   * The dataset used by this index
   */
  const DatasetAdaptor& dataset;                     //!< The source of our data
  const KDTreeSingleIndexAdaptorParams index_params; //!< Parameters given at construction (leaf_max_size etc.)
  Distance distance;                                 //!< Metric functor used for all distance evaluations
  typedef typename nanoflann::KDTreeBaseClassOMP<nanoflann::KDTreeSingleIndexAdaptorOMP<Distance, DatasetAdaptor, DIM, IndexType>, Distance, DatasetAdaptor, DIM, IndexType>
    BaseClassRef;
  typedef typename BaseClassRef::ElementType ElementType;
  typedef typename BaseClassRef::DistanceType DistanceType;
  typedef typename BaseClassRef::Node Node;
  typedef Node* NodePtr;
  typedef typename BaseClassRef::Interval Interval;
  /** Define "BoundingBox" as a fixed-size or variable-size container depending
   * on "DIM" */
  typedef typename BaseClassRef::BoundingBox BoundingBox;
  /** Define "distance_vector_t" as a fixed-size or variable-size container
   * depending on "DIM" */
  typedef typename BaseClassRef::distance_vector_t distance_vector_t;
  /**
   * KDTree constructor
   *
   * Refer to docs in README.md or online in
   * https://github.com/jlblancoc/nanoflann
   *
   * The KD-Tree point dimension (the length of each point in the dataset, e.g.
   * 3 for 3D points) is determined by means of:
   * - The \a DIM template parameter if >0 (highest priority)
   * - Otherwise, the \a dimensionality parameter of this constructor.
   *
   * @param inputData Dataset with the input features
   * @param params Basically, the maximum leaf node size
   */
  KDTreeSingleIndexAdaptorOMP(const int dimensionality, const DatasetAdaptor& inputData, const KDTreeSingleIndexAdaptorParams& params = KDTreeSingleIndexAdaptorParams())
      : dataset(inputData),
        index_params(params),
        distance(inputData) {
    // No tree exists yet; buildIndex() must be called before any query.
    BaseClassRef::root_node = NULL;
    BaseClassRef::m_size = dataset.kdtree_get_point_count();
    BaseClassRef::m_size_at_index_build = BaseClassRef::m_size;
    BaseClassRef::dim = dimensionality;
    // A positive compile-time DIM overrides the runtime dimensionality.
    if (DIM > 0) BaseClassRef::dim = DIM;
    BaseClassRef::m_leaf_max_size = params.leaf_max_size;
    // Create a permutable array of indices to the input vectors.
    init_vind();
  }
  /**
   * Builds the index
   *
   * @param num_threads Number of OpenMP threads used for the parallel tree
   * construction.
   */
  void buildIndex(int num_threads) {
    BaseClassRef::m_size = dataset.kdtree_get_point_count();
    BaseClassRef::m_size_at_index_build = BaseClassRef::m_size;
    init_vind();
    this->freeIndex(*this);
    BaseClassRef::m_size_at_index_build = BaseClassRef::m_size;
    if (BaseClassRef::m_size == 0) return;
    computeBoundingBox(BaseClassRef::root_bbox);
    // Reset and pre-size the node pool so allocation during construction never
    // reallocates it (pool_count presumably acts as the bump-allocation
    // cursor in the base class -- confirm against divideTree there).
    BaseClassRef::pool_count = 0;
    BaseClassRef::pool.resize(BaseClassRef::m_size);
#pragma omp parallel num_threads(num_threads)
    {
      // One thread starts the recursion; divideTree() spawns OMP tasks which
      // the remaining threads of the parallel region execute.
#pragma omp single nowait
      {
        BaseClassRef::root_node = this->divideTree(*this, 0, BaseClassRef::m_size,
                                                   BaseClassRef::root_bbox); // construct the tree
      }
    }
  }
  /** \name Query methods
   * @{ */
  /**
   * Find set of nearest neighbors to vec[0:dim-1]. Their indices are stored
   * inside the result object.
   *
   * Params:
   * result = the result object in which the indices of the
   * nearest-neighbors are stored vec = the vector for which to search the
   * nearest neighbors
   *
   * \tparam RESULTSET Should be any ResultSet<DistanceType>
   * \return True if the requested neighbors could be found.
   * \sa knnSearch, radiusSearch
   */
  template <typename RESULTSET>
  bool findNeighbors(RESULTSET& result, const ElementType* vec, const SearchParams& searchParams) const {
    assert(vec);
    if (this->size(*this) == 0) return false;
    if (!BaseClassRef::root_node) throw std::runtime_error("[nanoflann] findNeighbors() called before building the index.");
    // Approximation factor used by searchLevel's pruning test
    // (mindistsq * epsError <= worstDist); eps == 0 gives an exact search.
    float epsError = 1 + searchParams.eps;
    distance_vector_t dists; // fixed or variable-sized container (depending on DIM)
    auto zero = static_cast<decltype(result.worstDist())>(0);
    assign(dists, (DIM > 0 ? DIM : BaseClassRef::dim),
           zero); // Fill it with zeros.
    // Initial lower bound: per-axis distance from the query to the root bbox
    // (zero when the query lies inside it).
    DistanceType distsq = this->computeInitialDistances(*this, vec, dists);
    searchLevel(
      result,
      vec,
      BaseClassRef::root_node,
      distsq,
      dists,
      epsError); // "count_leaf" parameter removed since was neither
                 // used nor returned to the user.
    return result.full();
  }
  /**
   * Find the "num_closest" nearest neighbors to the \a query_point[0:dim-1].
   * Their indices are stored inside the result object. \sa radiusSearch,
   * findNeighbors \note nChecks_IGNORED is ignored but kept for compatibility
   * with the original FLANN interface. \return Number `N` of valid points in
   * the result set. Only the first `N` entries in `out_indices` and
   * `out_distances_sq` will be valid. Return may be less than `num_closest`
   * only if the number of elements in the tree is less than `num_closest`.
   */
  size_t knnSearch(const ElementType* query_point, const size_t num_closest, IndexType* out_indices, DistanceType* out_distances_sq, const int /* nChecks_IGNORED */ = 10) const {
    nanoflann::KNNResultSet<DistanceType, IndexType> resultSet(num_closest);
    resultSet.init(out_indices, out_distances_sq);
    this->findNeighbors(resultSet, query_point, nanoflann::SearchParams());
    return resultSet.size();
  }
  /**
   * Find all the neighbors to \a query_point[0:dim-1] within a maximum radius.
   * The output is given as a vector of pairs, of which the first element is a
   * point index and the second the corresponding distance. Previous contents of
   * \a IndicesDists are cleared.
   *
   * If searchParams.sorted==true, the output list is sorted by ascending
   * distances.
   *
   * For a better performance, it is advisable to do a .reserve() on the vector
   * if you have any wild guess about the number of expected matches.
   *
   * \sa knnSearch, findNeighbors, radiusSearchCustomCallback
   * \return The number of points within the given radius (i.e. indices.size()
   * or dists.size() )
   */
  size_t radiusSearch(const ElementType* query_point, const DistanceType& radius, std::vector<std::pair<IndexType, DistanceType>>& IndicesDists, const SearchParams& searchParams)
    const {
    RadiusResultSet<DistanceType, IndexType> resultSet(radius, IndicesDists);
    const size_t nFound = radiusSearchCustomCallback(query_point, resultSet, searchParams);
    if (searchParams.sorted) std::sort(IndicesDists.begin(), IndicesDists.end(), IndexDist_Sorter());
    return nFound;
  }
  /**
   * Just like radiusSearch() but with a custom callback class for each point
   * found in the radius of the query. See the source of RadiusResultSet<> as a
   * start point for your own classes. \sa radiusSearch
   */
  template <class SEARCH_CALLBACK>
  size_t radiusSearchCustomCallback(const ElementType* query_point, SEARCH_CALLBACK& resultSet, const SearchParams& searchParams = SearchParams()) const {
    this->findNeighbors(resultSet, query_point, searchParams);
    return resultSet.size();
  }
  /** @} */
public:
  /** Make sure the auxiliary list \a vind has the same size than the current
   * dataset, and re-generate if size has changed. */
  void init_vind() {
    // Create a permutable array of indices to the input vectors.
    BaseClassRef::m_size = dataset.kdtree_get_point_count();
    if (BaseClassRef::vind.size() != BaseClassRef::m_size) BaseClassRef::vind.resize(BaseClassRef::m_size);
    for (size_t i = 0; i < BaseClassRef::m_size; i++) BaseClassRef::vind[i] = i;
  }
  /** Compute the bounding box of the whole dataset into \a bbox. Uses the
   * adaptor's kdtree_get_bbox() if it provides one; otherwise scans all
   * points. Throws std::runtime_error on an empty dataset. */
  void computeBoundingBox(BoundingBox& bbox) {
    resize(bbox, (DIM > 0 ? DIM : BaseClassRef::dim));
    if (dataset.kdtree_get_bbox(bbox)) {
      // Done! It was implemented in derived class
    } else {
      const size_t N = dataset.kdtree_get_point_count();
      if (!N)
        throw std::runtime_error(
          "[nanoflann] computeBoundingBox() called but "
          "no data points found.");
      // Seed the box with the first point, then extend it with the rest.
      for (int i = 0; i < (DIM > 0 ? DIM : BaseClassRef::dim); ++i) {
        bbox[i].low = bbox[i].high = this->dataset_get(*this, 0, i);
      }
      for (size_t k = 1; k < N; ++k) {
        for (int i = 0; i < (DIM > 0 ? DIM : BaseClassRef::dim); ++i) {
          if (this->dataset_get(*this, k, i) < bbox[i].low) bbox[i].low = this->dataset_get(*this, k, i);
          if (this->dataset_get(*this, k, i) > bbox[i].high) bbox[i].high = this->dataset_get(*this, k, i);
        }
      }
    }
  }
  /**
   * Performs an exact search in the tree starting from a node.
   * \tparam RESULTSET Should be any ResultSet<DistanceType>
   * \return true if the search should be continued, false if the results are
   * sufficient
   */
  template <class RESULTSET>
  bool searchLevel(RESULTSET& result_set, const ElementType* vec, const NodePtr node, DistanceType mindistsq, distance_vector_t& dists, const float epsError) const {
    /* If this is a leaf node, then do check and return. */
    if ((node->child1 == NULL) && (node->child2 == NULL)) {
      // count_leaf += (node->lr.right-node->lr.left); // Removed since was
      // neither used nor returned to the user.
      DistanceType worst_dist = result_set.worstDist();
      // Linear scan over the leaf's point range [left, right).
      for (IndexType i = node->node_type.lr.left; i < node->node_type.lr.right; ++i) {
        const IndexType index = BaseClassRef::vind[i]; // reorder... : i;
        DistanceType dist = distance.evalMetric(vec, index, (DIM > 0 ? DIM : BaseClassRef::dim));
        if (dist < worst_dist) {
          if (!result_set.addPoint(dist, BaseClassRef::vind[i])) {
            // the resultset doesn't want to receive any more points, we're done
            // searching!
            return false;
          }
        }
      }
      return true;
    }
    /* Which child branch should be taken first? */
    int idx = node->node_type.sub.divfeat;
    ElementType val = vec[idx];
    DistanceType diff1 = val - node->node_type.sub.divlow;
    DistanceType diff2 = val - node->node_type.sub.divhigh;
    NodePtr bestChild;
    NodePtr otherChild;
    DistanceType cut_dist;
    // (diff1 + diff2 < 0) <=> val lies below the midpoint of
    // [divlow, divhigh]: descend into child1 first and remember the distance
    // to the other child's boundary for the pruning test below.
    if ((diff1 + diff2) < 0) {
      bestChild = node->child1;
      otherChild = node->child2;
      cut_dist = distance.accum_dist(val, node->node_type.sub.divhigh, idx);
    } else {
      bestChild = node->child2;
      otherChild = node->child1;
      cut_dist = distance.accum_dist(val, node->node_type.sub.divlow, idx);
    }
    /* Call recursively to search next level down. */
    if (!searchLevel(result_set, vec, bestChild, mindistsq, dists, epsError)) {
      // the resultset doesn't want to receive any more points, we're done
      // searching!
      return false;
    }
    // Update the incremental lower bound for the other subtree: replace this
    // axis' contribution with the distance to the splitting boundary.
    DistanceType dst = dists[idx];
    mindistsq = mindistsq + cut_dist - dst;
    dists[idx] = cut_dist;
    if (mindistsq * epsError <= result_set.worstDist()) {
      if (!searchLevel(result_set, vec, otherChild, mindistsq, dists, epsError)) {
        // the resultset doesn't want to receive any more points, we're done
        // searching!
        return false;
      }
    }
    // Restore this axis' contribution before returning to the caller.
    dists[idx] = dst;
    return true;
  }
};
} // namespace nanoflann
#endif /* NANOFLANN_HPP_ */

View File

@ -1,639 +0,0 @@
/***********************************************************************
* Software License Agreement (BSD License)
*
* Copyright 2008-2009 Marius Muja (mariusm@cs.ubc.ca). All rights reserved.
* Copyright 2008-2009 David G. Lowe (lowe@cs.ubc.ca). All rights reserved.
* Copyright 2011-2021 Jose Luis Blanco (joseluisblancoc@gmail.com).
* All rights reserved.
*
* THE BSD LICENSE
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
* NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
* THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*************************************************************************/
/**
* This nanoflann_mt.hpp is derived from nanoflann.hpp to parallelize the tree construction with TBB
*/
/** \mainpage nanoflann C++ API documentation
* nanoflann is a C++ header-only library for building KD-Trees, mostly
* optimized for 2D or 3D point clouds.
*
* nanoflann does not require compiling or installing, just an
* #include <nanoflann.hpp> in your code.
*
* See:
* - <a href="modules.html" >C++ API organized by modules</a>
* - <a href="https://github.com/jlblancoc/nanoflann" >Online README</a>
* - <a href="http://jlblancoc.github.io/nanoflann/" >Doxygen
* documentation</a>
*/
#ifndef NANOFLANN_TBB_HPP_
#define NANOFLANN_TBB_HPP_
#include <tbb/tbb.h>
#include <small_gicp/ann/nanoflann.hpp>
namespace nanoflann {
/** kd-tree base-class
*
* Contains the member functions common to the classes KDTreeSingleIndexAdaptor
* and KDTreeSingleIndexDynamicAdaptor_.
*
* \tparam Derived The name of the class which inherits this class.
* \tparam DatasetAdaptor The user-provided adaptor (see comments above).
* \tparam Distance The distance metric to use, these are all classes derived
* from nanoflann::Metric \tparam DIM Dimensionality of data points (e.g. 3 for
* 3D points) \tparam IndexType Will be typically size_t or int
*/
template <class Derived, typename Distance, class DatasetAdaptor, int DIM = -1, typename IndexType = size_t>
class KDTreeBaseClassTBB {
public:
  /** Frees the previously-built index. Automatically called within
   * buildIndex(). Note: only the root pointer and the build-time size are
   * reset; the node pool itself is not released here. */
  void freeIndex(Derived& obj) {
    obj.root_node = NULL;
    obj.m_size_at_index_build = 0;
  }
  typedef typename Distance::ElementType ElementType;
  typedef typename Distance::DistanceType DistanceType;
  /*--------------------- Internal Data Structures --------------------------*/
  struct Node {
    /** Union used because a node can be either a LEAF node or a non-leaf node,
     * so both data fields are never used simultaneously */
    union {
      struct leaf {
        IndexType left, right; //!< Indices of points in leaf node
      } lr;
      struct nonleaf {
        int divfeat;                  //!< Dimension used for subdivision.
        DistanceType divlow, divhigh; //!< The values used for subdivision.
      } sub;
    } node_type;
    Node *child1, *child2; //!< Child nodes (both=NULL mean its a leaf node)
  };
  typedef Node* NodePtr;
  /** A [low, high] interval along a single dimension of a bounding box. */
  struct Interval {
    ElementType low, high;
  };
  /**
   * Array of indices to vectors in the dataset.
   */
  std::vector<IndexType> vind;
  NodePtr root_node;       //!< Root of the tree (NULL until buildIndex())
  size_t m_leaf_max_size;  //!< Maximum number of points per leaf node
  size_t m_size;                //!< Number of current points in the dataset
  size_t m_size_at_index_build; //!< Number of points in the dataset when the
                                //!< index was built
  int dim;                      //!< Dimensionality of each data point
  /** Define "BoundingBox" as a fixed-size or variable-size container depending
   * on "DIM" */
  typedef typename array_or_vector_selector<DIM, Interval>::container_t BoundingBox;
  /** Define "distance_vector_t" as a fixed-size or variable-size container
   * depending on "DIM" */
  typedef typename array_or_vector_selector<DIM, DistanceType>::container_t distance_vector_t;
  /** The KD-tree used to find neighbours */
  BoundingBox root_bbox;
  /**
   * Pooled memory allocator.
   *
   * Using a pooled memory allocator is more efficient
   * than allocating memory directly when there is a large
   * number of small memory allocations.
   *
   * A tbb::concurrent_vector is used so that the parallel subtree builds can
   * append nodes concurrently; concurrent_vector never relocates existing
   * elements on growth, so pointers to previously created nodes stay valid.
   */
  tbb::concurrent_vector<Node> pool;
  /** Returns number of points in dataset */
  size_t size(const Derived& obj) const { return obj.m_size; }
  /** Returns the length of each point in the dataset */
  size_t veclen(const Derived& obj) { return static_cast<size_t>(DIM > 0 ? DIM : obj.dim); }
  /// Helper accessor to the dataset points:
  inline ElementType dataset_get(const Derived& obj, size_t idx, int component) const { return obj.dataset.kdtree_get_pt(idx, component); }
  /** Compute the min/max of the `element`-th coordinate over ind[0..count-1]. */
  void computeMinMax(const Derived& obj, IndexType* ind, IndexType count, int element, ElementType& min_elem, ElementType& max_elem) {
    min_elem = dataset_get(obj, ind[0], element);
    max_elem = dataset_get(obj, ind[0], element);
    for (IndexType i = 1; i < count; ++i) {
      ElementType val = dataset_get(obj, ind[i], element);
      if (val < min_elem) min_elem = val;
      if (val > max_elem) max_elem = val;
    }
  }
  /**
   * Create a tree node that subdivides the list of vecs from vind[first]
   * to vind[last]. The routine is called recursively on each sublist.
   *
   * @param left index of the first vector
   * @param right index of the last vector
   */
  NodePtr divideTree(Derived& obj, const IndexType left, const IndexType right, BoundingBox& bbox) {
    // Thread-safe node allocation; the returned pointer stays valid even
    // while other threads append to the concurrent_vector.
    NodePtr node = &(*pool.emplace_back());
    /* If too few exemplars remain, then make this a leaf node. */
    if ((right - left) <= static_cast<IndexType>(obj.m_leaf_max_size)) {
      node->child1 = node->child2 = NULL; /* Mark as leaf node. */
      node->node_type.lr.left = left;
      node->node_type.lr.right = right;
      // compute bounding-box of leaf points
      for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
        bbox[i].low = dataset_get(obj, obj.vind[left], i);
        bbox[i].high = dataset_get(obj, obj.vind[left], i);
      }
      for (IndexType k = left + 1; k < right; ++k) {
        for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
          if (bbox[i].low > dataset_get(obj, obj.vind[k], i)) bbox[i].low = dataset_get(obj, obj.vind[k], i);
          if (bbox[i].high < dataset_get(obj, obj.vind[k], i)) bbox[i].high = dataset_get(obj, obj.vind[k], i);
        }
      }
    } else {
      IndexType idx;
      int cutfeat;
      DistanceType cutval;
      middleSplit_(obj, &obj.vind[0] + left, right - left, idx, cutfeat, cutval, bbox);
      node->node_type.sub.divfeat = cutfeat;
      BoundingBox left_bbox(bbox);
      left_bbox[cutfeat].high = cutval;
      BoundingBox right_bbox(bbox);
      right_bbox[cutfeat].low = cutval;
      if ((right - left) <= 512) {
        // Do not parallelize if there are less than 512 points
        node->child1 = divideTree(obj, left, left + idx, left_bbox);
        node->child2 = divideTree(obj, left + idx, right, right_bbox);
      } else {
        // I did my best to check that the following parallelization does not cause race conditions.
        // But, still not 100% sure if it is correct.
        // (The two branches operate on disjoint ranges of vind and on
        // separate bbox copies.)
        tbb::parallel_invoke([&] { node->child1 = divideTree(obj, left, left + idx, left_bbox); }, [&] { node->child2 = divideTree(obj, left + idx, right, right_bbox); });
      }
      // After the recursion the children have tightened their boxes; store the
      // actual extents along the cut axis as the split interval.
      node->node_type.sub.divlow = left_bbox[cutfeat].high;
      node->node_type.sub.divhigh = right_bbox[cutfeat].low;
      // This node's bbox is the union of the children's boxes.
      for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
        bbox[i].low = std::min(left_bbox[i].low, right_bbox[i].low);
        bbox[i].high = std::max(left_bbox[i].high, right_bbox[i].high);
      }
    }
    return node;
  }
  /**
   * Choose the split dimension and cut value for the node covering
   * ind[0..count-1]: the dimension with the largest bbox span (within an EPS
   * tolerance) and the largest coordinate spread, cut at the bbox midpoint
   * clamped to the actual coordinate range. `index` receives the size of the
   * left sublist.
   */
  void middleSplit_(Derived& obj, IndexType* ind, IndexType count, IndexType& index, int& cutfeat, DistanceType& cutval, const BoundingBox& bbox) {
    const DistanceType EPS = static_cast<DistanceType>(0.00001);
    ElementType max_span = bbox[0].high - bbox[0].low;
    for (int i = 1; i < (DIM > 0 ? DIM : obj.dim); ++i) {
      ElementType span = bbox[i].high - bbox[i].low;
      if (span > max_span) {
        max_span = span;
      }
    }
    ElementType max_spread = -1;
    cutfeat = 0;
    for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
      ElementType span = bbox[i].high - bbox[i].low;
      if (span > (1 - EPS) * max_span) {
        ElementType min_elem, max_elem;
        computeMinMax(obj, ind, count, i, min_elem, max_elem);
        ElementType spread = max_elem - min_elem;
        if (spread > max_spread) {
          cutfeat = i;
          max_spread = spread;
        }
      }
    }
    // split in the middle (clamped into the actual coordinate range)
    DistanceType split_val = (bbox[cutfeat].low + bbox[cutfeat].high) / 2;
    ElementType min_elem, max_elem;
    computeMinMax(obj, ind, count, cutfeat, min_elem, max_elem);
    if (split_val < min_elem)
      cutval = min_elem;
    else if (split_val > max_elem)
      cutval = max_elem;
    else
      cutval = split_val;
    IndexType lim1, lim2;
    planeSplit(obj, ind, count, cutfeat, cutval, lim1, lim2);
    // Prefer an even split when the points equal to cutval allow it.
    if (lim1 > count / 2)
      index = lim1;
    else if (lim2 < count / 2)
      index = lim2;
    else
      index = count / 2;
  }
  /**
   * Subdivide the list of points by a plane perpendicular to the axis
   * corresponding to the 'cutfeat' dimension at 'cutval' position.
   *
   * On return:
   * dataset[ind[0..lim1-1]][cutfeat]<cutval
   * dataset[ind[lim1..lim2-1]][cutfeat]==cutval
   * dataset[ind[lim2..count]][cutfeat]>cutval
   *
   * Implemented as two Hoare-style partition passes.
   */
  void planeSplit(Derived& obj, IndexType* ind, const IndexType count, int cutfeat, DistanceType& cutval, IndexType& lim1, IndexType& lim2) {
    /* Move vector indices for left subtree to front of list. */
    IndexType left = 0;
    IndexType right = count - 1;
    for (;;) {
      while (left <= right && dataset_get(obj, ind[left], cutfeat) < cutval) ++left;
      while (right && left <= right && dataset_get(obj, ind[right], cutfeat) >= cutval) --right;
      if (left > right || !right) break; // "!right" was added to support unsigned Index types
      std::swap(ind[left], ind[right]);
      ++left;
      --right;
    }
    /* If either list is empty, it means that all remaining features
     * are identical. Split in the middle to maintain a balanced tree.
     */
    lim1 = left;
    right = count - 1;
    for (;;) {
      while (left <= right && dataset_get(obj, ind[left], cutfeat) <= cutval) ++left;
      while (right && left <= right && dataset_get(obj, ind[right], cutfeat) > cutval) --right;
      if (left > right || !right) break; // "!right" was added to support unsigned Index types
      std::swap(ind[left], ind[right]);
      ++left;
      --right;
    }
    lim2 = left;
  }
  /** Compute the per-axis distance terms from the query `vec` to the root
   * bounding box (zero inside the box) into `dists`, returning their sum. */
  DistanceType computeInitialDistances(const Derived& obj, const ElementType* vec, distance_vector_t& dists) const {
    assert(vec);
    DistanceType distsq = DistanceType();
    for (int i = 0; i < (DIM > 0 ? DIM : obj.dim); ++i) {
      if (vec[i] < obj.root_bbox[i].low) {
        dists[i] = obj.distance.accum_dist(vec[i], obj.root_bbox[i].low, i);
        distsq += dists[i];
      }
      if (vec[i] > obj.root_bbox[i].high) {
        dists[i] = obj.distance.accum_dist(vec[i], obj.root_bbox[i].high, i);
        distsq += dists[i];
      }
    }
    return distsq;
  }
};
/** @addtogroup kdtrees_grp KD-tree classes and adaptors
* @{ */
/** kd-tree static index
*
* Contains the k-d trees and other information for indexing a set of points
* for nearest-neighbor matching.
*
* The class "DatasetAdaptor" must provide the following interface (can be
* non-virtual, inlined methods):
*
* \code
 * // Must return the number of data points
* inline size_t kdtree_get_point_count() const { ... }
*
*
* // Must return the dim'th component of the idx'th point in the class:
* inline T kdtree_get_pt(const size_t idx, const size_t dim) const { ... }
*
* // Optional bounding-box computation: return false to default to a standard
* bbox computation loop.
* // Return true if the BBOX was already computed by the class and returned
* in "bb" so it can be avoided to redo it again.
* // Look at bb.size() to find out the expected dimensionality (e.g. 2 or 3
* for point clouds) template <class BBOX> bool kdtree_get_bbox(BBOX &bb) const
* {
* bb[0].low = ...; bb[0].high = ...; // 0th dimension limits
* bb[1].low = ...; bb[1].high = ...; // 1st dimension limits
* ...
* return true;
* }
*
* \endcode
*
* \tparam DatasetAdaptor The user-provided adaptor (see comments above).
* \tparam Distance The distance metric to use: nanoflann::metric_L1,
* nanoflann::metric_L2, nanoflann::metric_L2_Simple, etc. \tparam DIM
* Dimensionality of data points (e.g. 3 for 3D points) \tparam IndexType Will
* be typically size_t or int
*/
template <typename Distance, class DatasetAdaptor, int DIM = -1, typename IndexType = size_t>
class KDTreeSingleIndexAdaptorTBB : public KDTreeBaseClassTBB<KDTreeSingleIndexAdaptorTBB<Distance, DatasetAdaptor, DIM, IndexType>, Distance, DatasetAdaptor, DIM, IndexType> {
public:
/** Deleted copy constructor*/
KDTreeSingleIndexAdaptorTBB(const KDTreeSingleIndexAdaptorTBB<Distance, DatasetAdaptor, DIM, IndexType>&) = delete;
/**
* The dataset used by this index
*/
const DatasetAdaptor& dataset; //!< The source of our data
const KDTreeSingleIndexAdaptorParams index_params;
Distance distance;
typedef typename nanoflann::KDTreeBaseClassTBB<nanoflann::KDTreeSingleIndexAdaptorTBB<Distance, DatasetAdaptor, DIM, IndexType>, Distance, DatasetAdaptor, DIM, IndexType>
BaseClassRef;
typedef typename BaseClassRef::ElementType ElementType;
typedef typename BaseClassRef::DistanceType DistanceType;
typedef typename BaseClassRef::Node Node;
typedef Node* NodePtr;
typedef typename BaseClassRef::Interval Interval;
/** Define "BoundingBox" as a fixed-size or variable-size container depending
* on "DIM" */
typedef typename BaseClassRef::BoundingBox BoundingBox;
/** Define "distance_vector_t" as a fixed-size or variable-size container
* depending on "DIM" */
typedef typename BaseClassRef::distance_vector_t distance_vector_t;
  /**
   * KDTree constructor
   *
   * Refer to docs in README.md or online in
   * https://github.com/jlblancoc/nanoflann
   *
   * The KD-Tree point dimension (the length of each point in the dataset, e.g.
   * 3 for 3D points) is determined by means of:
   * - The \a DIM template parameter if >0 (highest priority)
   * - Otherwise, the \a dimensionality parameter of this constructor.
   *
   * @param inputData Dataset with the input features
   * @param params Basically, the maximum leaf node size
   */
  KDTreeSingleIndexAdaptorTBB(const int dimensionality, const DatasetAdaptor& inputData, const KDTreeSingleIndexAdaptorParams& params = KDTreeSingleIndexAdaptorParams())
      : dataset(inputData),
        index_params(params),
        distance(inputData) {
    // No tree exists yet; buildIndex() must be called before any query.
    BaseClassRef::root_node = NULL;
    BaseClassRef::m_size = dataset.kdtree_get_point_count();
    BaseClassRef::m_size_at_index_build = BaseClassRef::m_size;
    BaseClassRef::dim = dimensionality;
    // A positive compile-time DIM overrides the runtime dimensionality.
    if (DIM > 0) BaseClassRef::dim = DIM;
    BaseClassRef::m_leaf_max_size = params.leaf_max_size;
    // Create a permutable array of indices to the input vectors.
    init_vind();
  }
/**
* Builds the index
*/
void buildIndex() {
BaseClassRef::m_size = dataset.kdtree_get_point_count();
BaseClassRef::m_size_at_index_build = BaseClassRef::m_size;
init_vind();
this->freeIndex(*this);
BaseClassRef::m_size_at_index_build = BaseClassRef::m_size;
if (BaseClassRef::m_size == 0) return;
computeBoundingBox(BaseClassRef::root_bbox);
BaseClassRef::pool.reserve(BaseClassRef::m_size);
BaseClassRef::root_node = this->divideTree(*this, 0, BaseClassRef::m_size,
BaseClassRef::root_bbox); // construct the tree
}
/** \name Query methods
* @{ */
  /**
   * Find set of nearest neighbors to vec[0:dim-1]. Their indices are stored
   * inside the result object.
   *
   * Params:
   * result = the result object in which the indices of the
   * nearest-neighbors are stored vec = the vector for which to search the
   * nearest neighbors
   *
   * \tparam RESULTSET Should be any ResultSet<DistanceType>
   * \return True if the requested neighbors could be found.
   * \sa knnSearch, radiusSearch
   */
  template <typename RESULTSET>
  bool findNeighbors(RESULTSET& result, const ElementType* vec, const SearchParams& searchParams) const {
    assert(vec);
    if (this->size(*this) == 0) return false;
    if (!BaseClassRef::root_node) throw std::runtime_error("[nanoflann] findNeighbors() called before building the index.");
    // Approximation factor used by searchLevel's pruning test
    // (mindistsq * epsError <= worstDist); eps == 0 gives an exact search.
    float epsError = 1 + searchParams.eps;
    distance_vector_t dists; // fixed or variable-sized container (depending on DIM)
    auto zero = static_cast<decltype(result.worstDist())>(0);
    assign(dists, (DIM > 0 ? DIM : BaseClassRef::dim),
           zero); // Fill it with zeros.
    // Initial lower bound: per-axis distance from the query to the root bbox
    // (zero when the query lies inside it).
    DistanceType distsq = this->computeInitialDistances(*this, vec, dists);
    searchLevel(
      result,
      vec,
      BaseClassRef::root_node,
      distsq,
      dists,
      epsError); // "count_leaf" parameter removed since was neither
                 // used nor returned to the user.
    return result.full();
  }
/**
* Find the "num_closest" nearest neighbors to the \a query_point[0:dim-1].
* Their indices are stored inside the result object. \sa radiusSearch,
* findNeighbors \note nChecks_IGNORED is ignored but kept for compatibility
* with the original FLANN interface. \return Number `N` of valid points in
* the result set. Only the first `N` entries in `out_indices` and
* `out_distances_sq` will be valid. Return may be less than `num_closest`
* only if the number of elements in the tree is less than `num_closest`.
*/
size_t knnSearch(const ElementType* query_point, const size_t num_closest, IndexType* out_indices, DistanceType* out_distances_sq, const int /* nChecks_IGNORED */ = 10) const {
nanoflann::KNNResultSet<DistanceType, IndexType> resultSet(num_closest);
resultSet.init(out_indices, out_distances_sq);
this->findNeighbors(resultSet, query_point, nanoflann::SearchParams());
return resultSet.size();
}
/**
* Find all the neighbors to \a query_point[0:dim-1] within a maximum radius.
* The output is given as a vector of pairs, of which the first element is a
* point index and the second the corresponding distance. Previous contents of
* \a IndicesDists are cleared.
*
* If searchParams.sorted==true, the output list is sorted by ascending
* distances.
*
* For a better performance, it is advisable to do a .reserve() on the vector
* if you have any wild guess about the number of expected matches.
*
* \sa knnSearch, findNeighbors, radiusSearchCustomCallback
* \return The number of points within the given radius (i.e. indices.size()
* or dists.size() )
*/
size_t radiusSearch(const ElementType* query_point, const DistanceType& radius, std::vector<std::pair<IndexType, DistanceType>>& IndicesDists, const SearchParams& searchParams)
const {
RadiusResultSet<DistanceType, IndexType> resultSet(radius, IndicesDists);
const size_t nFound = radiusSearchCustomCallback(query_point, resultSet, searchParams);
if (searchParams.sorted) std::sort(IndicesDists.begin(), IndicesDists.end(), IndexDist_Sorter());
return nFound;
}
/**
 * Just like radiusSearch() but with a custom callback class for each point
 * found in the radius of the query. See the source of RadiusResultSet<> as a
 * start point for your own classes. \sa radiusSearch
 */
template <class SEARCH_CALLBACK>
size_t radiusSearchCustomCallback(const ElementType* query_point, SEARCH_CALLBACK& resultSet, const SearchParams& searchParams = SearchParams()) const {
  // The callback object accumulates matches itself; we only report the count.
  this->findNeighbors(resultSet, query_point, searchParams);
  const size_t num_matches = resultSet.size();
  return num_matches;
}
/** @} */
public:
/** Ensure the auxiliary index list \a vind matches the current dataset size,
 * regenerating the identity permutation whenever the size has changed. */
void init_vind() {
  // vind is a permutable array of indices into the input vectors; the tree
  // build reorders it instead of moving the points themselves.
  const size_t num_points = dataset.kdtree_get_point_count();
  BaseClassRef::m_size = num_points;
  if (BaseClassRef::vind.size() != num_points) {
    BaseClassRef::vind.resize(num_points);
  }
  for (size_t idx = 0; idx < num_points; ++idx) {
    BaseClassRef::vind[idx] = idx;
  }
}
/** Compute the axis-aligned bounding box of the dataset.
 * Uses the dataset-provided box (kdtree_get_bbox) when the adaptor implements
 * it; otherwise scans every point.
 * \param bbox Output bounding box, resized to the tree dimensionality.
 * \throws std::runtime_error if the dataset is empty and no precomputed box
 * is available. */
void computeBoundingBox(BoundingBox& bbox) {
  // Hoist the loop-invariant dimensionality out of the loops below.
  const int dims = (DIM > 0 ? DIM : BaseClassRef::dim);
  resize(bbox, dims);
  if (dataset.kdtree_get_bbox(bbox)) {
    // Done! It was implemented in derived class
    return;
  }
  const size_t N = dataset.kdtree_get_point_count();
  if (!N)
    throw std::runtime_error(
      "[nanoflann] computeBoundingBox() called but "
      "no data points found.");
  // Seed the box with the first point.
  for (int i = 0; i < dims; ++i) {
    bbox[i].low = bbox[i].high = this->dataset_get(*this, 0, i);
  }
  // Expand with the remaining points. Fetch each coordinate once instead of
  // up to three times per element as in the original formulation.
  for (size_t k = 1; k < N; ++k) {
    for (int i = 0; i < dims; ++i) {
      const auto val = this->dataset_get(*this, k, i);
      if (val < bbox[i].low) bbox[i].low = val;
      if (val > bbox[i].high) bbox[i].high = val;
    }
  }
}
/**
 * Performs an exact search in the tree starting from a node.
 * \tparam RESULTSET Should be any ResultSet<DistanceType>
 * \param result_set Collects candidate neighbors; its worstDist() acts as the
 * pruning radius.
 * \param vec Query point coordinates.
 * \param node Root of the subtree to examine.
 * \param mindistsq Lower bound on the squared distance from the query to the
 * cell covered by \a node (maintained incrementally across the recursion).
 * \param dists Per-dimension distance components from the query to the current
 * cell; one entry is swapped before descending into the farther child and
 * restored afterwards.
 * \param epsError Approximation factor (>= 1); the farther subtree is visited
 * only if mindistsq * epsError <= result_set.worstDist().
 * \return true if the search should be continued, false if the results are
 * sufficient
 */
template <class RESULTSET>
bool searchLevel(RESULTSET& result_set, const ElementType* vec, const NodePtr node, DistanceType mindistsq, distance_vector_t& dists, const float epsError) const {
  /* If this is a leaf node, then do check and return. */
  if ((node->child1 == NULL) && (node->child2 == NULL)) {
    // count_leaf += (node->lr.right-node->lr.left); // Removed since was
    // neither used nor returned to the user.
    DistanceType worst_dist = result_set.worstDist();
    // Linear scan over the points stored in this leaf.
    for (IndexType i = node->node_type.lr.left; i < node->node_type.lr.right; ++i) {
      const IndexType index = BaseClassRef::vind[i]; // reorder... : i;
      DistanceType dist = distance.evalMetric(vec, index, (DIM > 0 ? DIM : BaseClassRef::dim));
      if (dist < worst_dist) {
        if (!result_set.addPoint(dist, BaseClassRef::vind[i])) {
          // the resultset doesn't want to receive any more points, we're done
          // searching!
          return false;
        }
      }
    }
    return true;
  }
  /* Which child branch should be taken first? */
  int idx = node->node_type.sub.divfeat;
  ElementType val = vec[idx];
  DistanceType diff1 = val - node->node_type.sub.divlow;
  DistanceType diff2 = val - node->node_type.sub.divhigh;
  NodePtr bestChild;
  NodePtr otherChild;
  DistanceType cut_dist;
  if ((diff1 + diff2) < 0) {
    // Query lies on the low side of the split: visit child1 first; the
    // distance to the other half-space is measured against the high plane.
    bestChild = node->child1;
    otherChild = node->child2;
    cut_dist = distance.accum_dist(val, node->node_type.sub.divhigh, idx);
  } else {
    bestChild = node->child2;
    otherChild = node->child1;
    cut_dist = distance.accum_dist(val, node->node_type.sub.divlow, idx);
  }
  /* Call recursively to search next level down. */
  if (!searchLevel(result_set, vec, bestChild, mindistsq, dists, epsError)) {
    // the resultset doesn't want to receive any more points, we're done
    // searching!
    return false;
  }
  // Tighten the bound for the farther child: replace this dimension's stored
  // distance component with the distance to the splitting plane.
  DistanceType dst = dists[idx];
  mindistsq = mindistsq + cut_dist - dst;
  dists[idx] = cut_dist;
  if (mindistsq * epsError <= result_set.worstDist()) {
    if (!searchLevel(result_set, vec, otherChild, mindistsq, dists, epsError)) {
      // the resultset doesn't want to receive any more points, we're done
      // searching!
      return false;
    }
  }
  // Restore the distance component before unwinding so parent frames see a
  // consistent dists vector.
  dists[idx] = dst;
  return true;
}
};
} // namespace nanoflann
#endif /* NANOFLANN_HPP_ */

View File

@ -1,3 +1,5 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <queue>
@ -5,29 +7,38 @@
#include <Eigen/Geometry>
#include <small_gicp/ann/traits.hpp>
#include <small_gicp/points/traits.hpp>
#include <small_gicp/ann/knn_result.hpp>
namespace small_gicp {
/// @brief Flat point container
/// @note IncrementalVoxelMap combined with FlatContainer is mostly the same as iVox.
/// @brief Point container with a flat vector.
/// @note IncrementalVoxelMap combined with FlatContainer is mostly the same as linear iVox.
/// Bai et al., "Faster-LIO: Lightweight Tightly Coupled Lidar-Inertial Odometry Using Parallel Sparse Incremental Voxels", IEEE RA-L, 2022
/// @tparam HasNormals If true, normals are stored
/// @tparam HasCovs If true, covariances are stored
/// @note This container stores only up to max_num_points_in_cell points and avoids inserting points that are too close to existing points (min_sq_dist_in_cell).
/// @tparam HasNormals If true, store normals.
/// @tparam HasCovs If true, store covariances.
template <bool HasNormals = false, bool HasCovs = false>
struct FlatContainer {
public:
/// @brief FlatContainer setting.
struct Setting {
double min_sq_dist_in_cell = 0.1 * 0.1; ///< Minimum squared distance between points in a cell
size_t max_num_points_in_cell = 10; ///< Maximum number of points in a cell
double min_sq_dist_in_cell = 0.1 * 0.1; ///< Minimum squared distance between points in a cell.
size_t max_num_points_in_cell = 10; ///< Maximum number of points in a cell.
};
/// @brief Constructor
/// @brief Constructor.
FlatContainer() { points.reserve(5); }
/// @brief Number of points
/// @brief Number of points.
size_t size() const { return points.size(); }
/// @brief Add a point to the container
/// @brief Add a point to the container.
/// If there is a point that is too close to the input point, or there are too many points in the cell, the input point will be ignored.
/// @param setting Setting
/// @param transformed_pt Transformed point (== T * points[i])
/// @param points Point cloud
/// @param i Index of the point
/// @param T Transformation matrix
template <typename PointCloud>
void add(const Setting& setting, const Eigen::Vector4d& transformed_pt, const PointCloud& points, size_t i, const Eigen::Isometry3d& T) {
if (
@ -46,10 +57,10 @@ public:
}
}
/// @brief Finalize the container (Nothing to do for FlatContainer)
/// @brief Finalize the container (Nothing to do for FlatContainer).
void finalize() {}
/// @brief Find the nearest neighbor
/// @brief Find the nearest neighbor.
/// @param pt Query point
/// @param k_index Index of the nearest neighbor
/// @param k_sq_dist Squared distance to the nearest neighbor
@ -59,64 +70,60 @@ public:
return 0;
}
size_t min_index = -1;
double min_sq_dist = std::numeric_limits<double>::max();
for (size_t i = 0; i < points.size(); i++) {
const double sq_dist = (points[i] - pt).squaredNorm();
if (sq_dist < min_sq_dist) {
min_index = i;
min_sq_dist = sq_dist;
}
}
*k_index = min_index;
*k_sq_dist = min_sq_dist;
return 1;
KnnResult<1> result(k_index, k_sq_dist);
knn_search(pt, result);
return result.num_found();
}
/// @brief Find k nearest neighbors
/// @brief Find k nearest neighbors.
/// @param pt Query point
/// @param k Number of neighbors
/// @param k_index Indices of nearest neighbors
/// @param k_sq_dist Squared distances to nearest neighbors
/// @param k_sq_dist Squared distances to nearest neighbors (sorted in ascending order)
/// @return Number of found points
size_t knn_search(const Eigen::Vector4d& pt, int k, size_t* k_index, double* k_sq_dist) const {
size_t knn_search(const Eigen::Vector4d& pt, int k, size_t* k_indices, double* k_sq_dists) const {
if (points.empty()) {
return 0;
}
std::priority_queue<std::pair<size_t, double>> queue;
KnnResult<-1> result(k_indices, k_sq_dists, k);
knn_search(pt, result);
return result.num_found();
}
/// @brief Find k nearest neighbors.
/// @param pt Query point
/// @param result Result container that receives point indices and squared distances (sorted in ascending order of distance)
template <typename Result>
void knn_search(const Eigen::Vector4d& pt, Result& result) const {
if (points.empty()) {
return;
}
for (size_t i = 0; i < points.size(); i++) {
const double sq_dist = (points[i] - pt).squaredNorm();
queue.push({i, sq_dist});
if (queue.size() > k) {
queue.pop();
}
result.push(i, sq_dist);
}
const size_t n = queue.size();
while (!queue.empty()) {
k_index[queue.size() - 1] = queue.top().first;
k_sq_dist[queue.size() - 1] = queue.top().second;
queue.pop();
}
return n;
}
public:
struct Empty {};
std::vector<Eigen::Vector4d> points;
std::conditional_t<HasNormals, std::vector<Eigen::Vector4d>, Empty> normals;
std::conditional_t<HasCovs, std::vector<Eigen::Matrix4d>, Empty> covs;
std::vector<Eigen::Vector4d> points; ///< Points
std::conditional_t<HasNormals, std::vector<Eigen::Vector4d>, Empty> normals; ///< Normals (Empty if HasNormals is false)
std::conditional_t<HasCovs, std::vector<Eigen::Matrix4d>, Empty> covs; ///< Covariances (Empty if HasCovs is false)
};
/// @brief FlatContainer that stores only points.
using FlatContainerPoints = FlatContainer<false, false>;
/// @brief FlatContainer with normals.
using FlatContainerNormal = FlatContainer<true, false>;
/// @brief FlatContainer with covariances.
using FlatContainerCov = FlatContainer<false, true>;
/// @brief FlatContainer with normals and covariances.
using FlatContainerNormalCov = FlatContainer<true, true>;
namespace traits {
@ -139,6 +146,11 @@ struct Traits<FlatContainer<HasNormals, HasCovs>> {
static size_t knn_search(const FlatContainer<HasNormals, HasCovs>& container, const Eigen::Vector4d& pt, size_t k, size_t* k_index, double* k_sq_dist) {
return container.knn_search(pt, k, k_index, k_sq_dist);
}
template <typename Result>
static void knn_search(const FlatContainer<HasNormals, HasCovs>& container, const Eigen::Vector4d& pt, Result& result) {
container.knn_search(pt, result);
}
};
} // namespace traits

View File

@ -11,18 +11,18 @@
namespace small_gicp {
/// @brief Gaussian voxel.
/// @brief Gaussian voxel that computes and stores voxel mean and covariance.
struct GaussianVoxel {
public:
struct Setting {};
/// @brief Constructor
/// @brief Constructor.
GaussianVoxel() : finalized(false), num_points(0), mean(Eigen::Vector4d::Zero()), cov(Eigen::Matrix4d::Zero()) {}
/// @brief Number of points in the voxel (Always 1 for GaussianVoxel)
/// @brief Number of points in the voxel (Always 1 for GaussianVoxel).
size_t size() const { return 1; }
/// @brief Add a point to the voxel
/// @brief Add a point to the voxel.
/// @param setting Setting
/// @param transformed_pt Transformed point mean
/// @param points Point cloud
@ -41,7 +41,7 @@ public:
this->cov += T.matrix() * traits::cov(points, i) * T.matrix().transpose();
}
/// @brief Finalize the voxel mean and covariance
/// @brief Finalize the voxel mean and covariance.
void finalize() {
if (finalized) {
return;
@ -79,6 +79,11 @@ struct Traits<GaussianVoxel> {
static size_t knn_search(const GaussianVoxel& voxel, const Eigen::Vector4d& pt, size_t k, size_t* k_index, double* k_sq_dist) {
return nearest_neighbor_search(voxel, pt, k_index, k_sq_dist);
}
template <typename Result>
static void knn_search(const GaussianVoxel& voxel, const Eigen::Vector4d& pt, Result& result) {
result.push(0, (voxel.mean - pt).squaredNorm());
}
};
} // namespace traits

View File

@ -9,6 +9,7 @@
#include <Eigen/Geometry>
#include <small_gicp/ann/traits.hpp>
#include <small_gicp/ann/knn_result.hpp>
#include <small_gicp/ann/flat_container.hpp>
#include <small_gicp/points/traits.hpp>
#include <small_gicp/util/fast_floor.hpp>
@ -16,11 +17,11 @@
namespace small_gicp {
/// @brief Voxel meta information
/// @brief Voxel meta information.
struct VoxelInfo {
public:
/// @brief Constructor
/// @param coord Voxel coordinate
/// @brief Constructor.
/// @param coord Integer voxel coordinates
/// @param lru LRU counter for caching
VoxelInfo(const Eigen::Vector3i& coord, size_t lru) : lru(lru), coord(coord) {}
@ -30,22 +31,24 @@ public:
};
/// @brief Incremental voxelmap.
/// This class supports incremental point cloud insertion and LRU-based voxel deletion.
/// @note This class can be used as a point cloud as well as a neighbor search.
/// This class supports incremental point cloud insertion and LRU-based voxel deletion that removes voxels that are not recently referenced.
/// @note This class can be used as a point cloud as well as a neighbor search structure.
/// @note This class can handle arbitrary number of voxels and arbitrary range of voxel coordinates (in 32-bit int range).
/// @note To use this as a source point cloud for registration, use `SequentialVoxelMapAccessor`.
template <typename VoxelContents>
struct IncrementalVoxelMap {
public:
using Ptr = std::shared_ptr<IncrementalVoxelMap>;
using ConstPtr = std::shared_ptr<const IncrementalVoxelMap>;
/// @brief Constructor
/// @brief Constructor.
/// @param leaf_size Voxel size
explicit IncrementalVoxelMap(double leaf_size) : inv_leaf_size(1.0 / leaf_size), lru_horizon(100), lru_clear_cycle(10), lru_counter(0) {}
explicit IncrementalVoxelMap(double leaf_size) : inv_leaf_size(1.0 / leaf_size), lru_horizon(100), lru_clear_cycle(10), lru_counter(0) { set_search_offsets(1); }
/// @brief Number of points in the voxelmap
/// @brief Number of points in the voxelmap.
size_t size() const { return flat_voxels.size(); }
/// @brief Insert points to the voxelmap
/// @brief Insert points to the voxelmap.
/// @param points Point cloud
/// @param T Transformation matrix
template <typename PointCloud>
@ -88,60 +91,99 @@ public:
}
}
/// @brief Find the nearest neighbor
/// @brief Find the nearest neighbor.
/// @param pt Query point
/// @param index Index of the nearest neighbor
/// @param sq_dist Squared distance to the nearest neighbor
/// @return Number of found points (0 or 1)
size_t nearest_neighbor_search(const Eigen::Vector4d& pt, size_t* index, double* sq_dist) const {
const Eigen::Vector3i coord = fast_floor(pt * inv_leaf_size).template head<3>();
const auto found = voxels.find(coord);
if (found == voxels.end()) {
return 0;
const Eigen::Vector3i center = fast_floor(pt * inv_leaf_size).template head<3>();
size_t voxel_index = 0;
const auto index_transform = [&](size_t i) { return calc_index(voxel_index, i); };
KnnResult<1, decltype(index_transform)> result(index, sq_dist, -1, index_transform);
for (const auto& offset : search_offsets) {
const Eigen::Vector3i coord = center + offset;
const auto found = voxels.find(coord);
if (found == voxels.end()) {
continue;
}
voxel_index = found->second;
const auto& voxel = flat_voxels[voxel_index]->second;
traits::Traits<VoxelContents>::knn_search(voxel, pt, result);
}
const size_t voxel_index = found->second;
const auto& voxel = flat_voxels[voxel_index]->second;
size_t point_index;
if (traits::nearest_neighbor_search(voxel, pt, &point_index, sq_dist) == 0) {
return 0;
}
*index = calc_index(voxel_index, point_index);
return 1;
return result.num_found();
}
/// @brief Find k nearest neighbors
/// @param pt Query point
/// @param k Number of neighbors
/// @param k_indices Indices of nearest neighbors
/// @param k_sq_dists Squared distances to nearest neighbors
/// @param k_sq_dists Squared distances to nearest neighbors (sorted in ascending order)
/// @return Number of found points
size_t knn_search(const Eigen::Vector4d& pt, size_t k, size_t* k_indices, double* k_sq_dists) const {
const Eigen::Vector3i coord = fast_floor(pt * inv_leaf_size).template head<3>();
const auto found = voxels.find(coord);
if (found == voxels.end()) {
return 0;
const Eigen::Vector3i center = fast_floor(pt * inv_leaf_size).template head<3>();
size_t voxel_index = 0;
const auto index_transform = [&](size_t i) { return calc_index(voxel_index, i); };
KnnResult<-1, decltype(index_transform)> result(k_indices, k_sq_dists, k, index_transform);
for (const auto& offset : search_offsets) {
const Eigen::Vector3i coord = center + offset;
const auto found = voxels.find(coord);
if (found == voxels.end()) {
continue;
}
voxel_index = found->second;
const auto& voxel = flat_voxels[voxel_index]->second;
traits::Traits<VoxelContents>::knn_search(voxel, pt, result);
}
const size_t voxel_index = found->second;
const auto& voxel = flat_voxels[voxel_index]->second;
std::vector<size_t> point_indices(k);
std::vector<double> sq_dists(k);
const size_t num_found = traits::knn_search(voxel, pt, k, point_indices.data(), sq_dists.data());
for (size_t i = 0; i < num_found; i++) {
k_indices[i] = calc_index(voxel_index, point_indices[i]);
k_sq_dists[i] = sq_dists[i];
}
return num_found;
return result.num_found();
}
/// @brief Calculate the global point index from the voxel index and the point index.
inline size_t calc_index(const size_t voxel_id, const size_t point_id) const { return (voxel_id << point_id_bits) | point_id; }
inline size_t voxel_id(const size_t i) const { return i >> point_id_bits; } ///< Extract the point ID from an index
inline size_t point_id(const size_t i) const { return i & ((1ull << point_id_bits) - 1); } ///< Extract the voxel ID from an index
inline size_t voxel_id(const size_t i) const { return i >> point_id_bits; } ///< Extract the voxel ID from a global index.
inline size_t point_id(const size_t i) const { return i & ((1ull << point_id_bits) - 1); } ///< Extract the point ID from a global index.
/// @brief Set the pattern of the search offsets. (Must be 1, 7, or 27)
/// @note 1: center only, 7: center + 6 adjacent neighbors (+- 1X/1Y/1Z), 27: center + 26 neighbors (3 x 3 x 3 cube)
void set_search_offsets(int num_offsets) {
switch (num_offsets) {
default:
std::cerr << "warning: unsupported search_offsets=" << num_offsets << " (supported values: 1, 7, 27)" << std::endl;
std::cerr << " : using default search_offsets=1" << std::endl;
[[fallthrough]];
case 1:
search_offsets = {Eigen::Vector3i(0, 0, 0)};
break;
case 7:
search_offsets = {
Eigen::Vector3i(0, 0, 0),
Eigen::Vector3i(1, 0, 0),
Eigen::Vector3i(0, 1, 0),
Eigen::Vector3i(0, 0, 1),
Eigen::Vector3i(-1, 0, 0),
Eigen::Vector3i(0, -1, 0),
Eigen::Vector3i(0, 0, -1)};
break;
case 27:
for (int i = -1; i <= 1; i++) {
for (int j = -1; j <= 1; j++) {
for (int k = -1; k <= 1; k++) {
search_offsets.emplace_back(i, j, k);
}
}
}
break;
}
}
public:
static_assert(sizeof(size_t) == 8, "size_t must be 64-bit");
@ -149,13 +191,15 @@ public:
static constexpr int voxel_id_bits = 64 - point_id_bits; ///< Use the remaining bits for voxel id
const double inv_leaf_size; ///< Inverse of the voxel size
size_t lru_horizon; ///< LRU horizon size
size_t lru_clear_cycle; ///< LRU clear cycle
size_t lru_counter; ///< LRU counter
size_t lru_horizon; ///< LRU horizon size. Voxels that have not been accessed for lru_horizon steps are deleted.
size_t lru_clear_cycle; ///< LRU clear cycle. Voxel deletion is performed every lru_clear_cycle steps.
size_t lru_counter; ///< LRU counter. Incremented every step.
typename VoxelContents::Setting voxel_setting; ///< Voxel setting
std::vector<std::shared_ptr<std::pair<VoxelInfo, VoxelContents>>> flat_voxels; ///< Voxel contents
std::unordered_map<Eigen::Vector3i, size_t, XORVector3iHash> voxels; ///< Voxel index map
std::vector<Eigen::Vector3i> search_offsets; ///< Voxel search offsets.
typename VoxelContents::Setting voxel_setting; ///< Voxel setting.
std::vector<std::shared_ptr<std::pair<VoxelInfo, VoxelContents>>> flat_voxels; ///< Voxel contents.
std::unordered_map<Eigen::Vector3i, size_t, XORVector3iHash> voxels; ///< Voxel index map.
};
namespace traits {

View File

@ -154,8 +154,8 @@ public:
/// @brief Find the nearest neighbor.
/// @param query Query point
/// @param k_indices Index of the nearest neighbor
/// @param k_sq_dists Squared distance to the nearest neighbor
/// @param k_indices Index of the nearest neighbor (uninitialized if not found)
/// @param k_sq_dists Squared distance to the nearest neighbor (uninitialized if not found)
/// @param setting KNN search setting
/// @return Number of found neighbors (0 or 1)
size_t nearest_neighbor_search(const Eigen::Vector4d& query, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {
@ -166,7 +166,7 @@ public:
/// @param query Query point
/// @param k Number of neighbors
/// @param k_indices Indices of neighbors
/// @param k_sq_dists Squared distances to neighbors
/// @param k_sq_dists Squared distances to neighbors (sorted in ascending order)
/// @param setting KNN search setting
/// @return Number of found neighbors
size_t knn_search(const Eigen::Vector4d& query, int k, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {
@ -178,7 +178,7 @@ public:
/// @brief Find k-nearest neighbors. This method uses fixed and static memory allocation. Might be faster for small k.
/// @param query Query point
/// @param k_indices Indices of neighbors
/// @param k_sq_dists Squared distances to neighbors
/// @param k_sq_dists Squared distances to neighbors (sorted in ascending order)
/// @param setting KNN search setting
/// @return Number of found neighbors
template <int N>
@ -255,7 +255,7 @@ public:
/// @param query Query point
/// @param k Number of neighbors
/// @param k_indices Indices of neighbors
/// @param k_sq_dists Squared distances to neighbors
/// @param k_sq_dists Squared distances to neighbors (sorted in ascending order)
/// @param setting KNN search setting
/// @return Number of found neighbors
size_t nearest_neighbor_search(const Eigen::Vector4d& query, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {
@ -266,7 +266,7 @@ public:
/// @param query Query point
/// @param k Number of neighbors
/// @param k_indices Indices of neighbors
/// @param k_sq_dists Squared distances to neighbors
/// @param k_sq_dists Squared distances to neighbors (sorted in ascending order)
/// @param setting KNN search setting
/// @return Number of found neighbors
size_t knn_search(const Eigen::Vector4d& query, size_t k, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {

View File

@ -1,3 +1,5 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <array>
@ -20,9 +22,14 @@ public:
double epsilon = 0.0; ///< Early termination threshold
};
/// @brief Identity transform (alternative to std::identity in C++20).
struct identity_transform {
size_t operator()(size_t i) const { return i; }
};
/// @brief K-nearest neighbor search result container.
/// @tparam N Number of neighbors to search. If N == -1, the number of neighbors is dynamically determined.
template <int N>
template <int N, typename IndexTransform = identity_transform>
struct KnnResult {
public:
static constexpr size_t INVALID = std::numeric_limits<size_t>::max();
@ -31,7 +38,12 @@ public:
/// @param indices Buffer to store indices (must be larger than k=max(N, num_neighbors))
/// @param distances Buffer to store distances (must be larger than k=max(N, num_neighbors))
/// @param num_neighbors Number of neighbors to search (must be -1 for static case N > 0)
explicit KnnResult(size_t* indices, double* distances, int num_neighbors = -1) : num_neighbors(num_neighbors), indices(indices), distances(distances) {
explicit KnnResult(size_t* indices, double* distances, int num_neighbors = -1, const IndexTransform& index_transform = identity_transform())
: index_transform(index_transform),
capacity(num_neighbors),
num_found_neighbors(0),
indices(indices),
distances(distances) {
if constexpr (N > 0) {
if (num_neighbors >= 0) {
std::cerr << "warning: Specifying dynamic num_neighbors=" << num_neighbors << " for a static KNN result container (N=" << N << ")" << std::endl;
@ -53,43 +65,46 @@ public:
if constexpr (N > 0) {
return N;
} else {
return num_neighbors;
return capacity;
}
}
/// @brief Number of found neighbors.
size_t num_found() const { return std::distance(indices, std::find(indices, indices + buffer_size(), INVALID)); }
size_t num_found() const { return num_found_neighbors; }
/// @brief Worst distance in the result.
double worst_distance() const { return distances[buffer_size() - 1]; }
/// @brief Push a pair of point index and distance to the result.
/// @note The result is sorted by distance in ascending order.
void push(size_t index, double distance) {
if (distance >= worst_distance()) {
return;
}
if constexpr (N == 1) {
indices[0] = index;
indices[0] = index_transform(index);
distances[0] = distance;
} else {
for (int i = buffer_size() - 1; i >= 0; i--) {
if (i == 0 || distance >= distances[i - 1]) {
indices[i] = index;
distances[i] = distance;
break;
}
indices[i] = indices[i - 1];
distances[i] = distances[i - 1];
int insert_loc = std::min<int>(num_found_neighbors, buffer_size() - 1);
for (; insert_loc > 0 && distance < distances[insert_loc - 1]; insert_loc--) {
indices[insert_loc] = indices[insert_loc - 1];
distances[insert_loc] = distances[insert_loc - 1];
}
indices[insert_loc] = index_transform(index);
distances[insert_loc] = distance;
}
num_found_neighbors = std::min<int>(num_found_neighbors + 1, buffer_size());
}
public:
const int num_neighbors; ///< Maximum number of neighbors to search
size_t* indices; ///< Indices of neighbors
double* distances; ///< Distances to neighbors
const IndexTransform index_transform; ///< Point index transformation (e.g., local point index to global point/voxel index)
const int capacity; ///< Maximum number of neighbors to search
int num_found_neighbors; ///< Number of found neighbors
size_t* indices; ///< Indices of neighbors
double* distances; ///< Distances to neighbors
};
} // namespace small_gicp

View File

@ -14,6 +14,7 @@ struct ProjectionSetting {
};
/// @brief Conventional axis-aligned projection (i.e., selecting any of XYZ axes with the largest variance).
/// @note Up to max_scan_count samples are used to estimate the variance.
struct AxisAlignedProjection {
public:
/// @brief Project the point to the selected axis.
@ -53,6 +54,7 @@ public:
};
/// @brief Normal projection (i.e., selecting the 3D direction with the largest variance of the points).
/// @note Up to max_scan_count samples are used to estimate the variance along the axis.
struct NormalProjection {
public:
/// @brief Project the point to the normal direction.

View File

@ -0,0 +1,219 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <cmath>
#include <Eigen/Core>
#include <small_gicp/points/traits.hpp>
#include <small_gicp/ann/knn_result.hpp>
namespace small_gicp {
/// @brief Equirectangular projection.
struct EquirectangularProjection {
public:
  /// @brief Project the point into the normalized image coordinates. (u, v) in ([0, 1], [0, 1])
  Eigen::Vector2d operator()(const Eigen::Vector3d& pt_3d) const {
    // Points at (or extremely close to) the origin have no well-defined bearing;
    // map them to the image center.
    if (pt_3d.squaredNorm() < 1e-3) {
      return Eigen::Vector2d(0.5, 0.5);
    }
    const Eigen::Vector3d bearing = pt_3d.normalized();
    const double latitude = -std::asin(bearing.y());
    const double longitude = std::atan2(bearing.x(), bearing.z());
    // Map longitude [-pi, pi] -> u in [0, 1] and latitude [-pi/2, pi/2] -> v in [0, 1].
    return Eigen::Vector2d(longitude / (2.0 * M_PI) + 0.5, latitude / M_PI + 0.5);
  }
};
/// @brief Border clamp mode. Coordinates are passed through unchanged, so
/// out-of-range points are rejected by the caller's bounds check (discarded).
struct BorderClamp {
public:
  int operator()(const int x, const int /*width*/) const { return x; }
};
/// @brief Border repeat mode. Coordinates that fall one period outside the
/// border are wrapped around to the opposite side.
struct BorderRepeat {
public:
  int operator()(const int x, const int width) const {
    if (x < 0) {
      return x + width;
    }
    if (x >= width) {
      return x - width;
    }
    return x;
  }
};
/// @brief "Unsafe" projective search. This class does not hold the ownership of the target point cloud.
/// Candidates are found by projecting target points onto a 2D index map (one point per pixel)
/// and scanning a fixed window around the query's projected pixel, so results are approximate.
template <typename PointCloud, typename Projection = EquirectangularProjection, typename BorderModeH = BorderRepeat, typename BorderModeV = BorderClamp>
struct UnsafeProjectiveSearch {
public:
  /// @brief Constructor. Builds the index map by projecting every target point.
  /// @param width Index map width
  /// @param height Index map height
  /// @param points Target point cloud
  /// @note If several points project to the same pixel, only the last one inserted is kept.
  UnsafeProjectiveSearch(int width, int height, const PointCloud& points) : points(points), index_map(height, width), search_window_h(10), search_window_v(5) {
    index_map.setConstant(invalid_index);
    Projection project;
    for (size_t i = 0; i < traits::size(points); ++i) {
      const Eigen::Vector4d pt = traits::point(points, i);
      const Eigen::Vector2d uv = project(pt.head<3>());
      // Normalized (u, v) in [0, 1] -> integer pixel coordinates.
      const int u = uv[0] * index_map.cols();
      const int v = uv[1] * index_map.rows();
      if (u < 0 || u >= index_map.cols() || v < 0 || v >= index_map.rows()) {
        continue;  // Projects outside the map; point is not indexed.
      }
      index_map(v, u) = i;
    }
  }
  /// @brief Find the nearest neighbor.
  /// @param query Query point
  /// @param k_indices Index of the nearest neighbor (uninitialized if not found)
  /// @param k_sq_dists Squared distance to the nearest neighbor (uninitialized if not found)
  /// @param setting KNN search setting
  /// @return Number of found neighbors (0 or 1)
  size_t nearest_neighbor_search(const Eigen::Vector4d& query, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {
    return knn_search<1>(query, k_indices, k_sq_dists, setting);
  }
  /// @brief Find k-nearest neighbors. This method uses dynamic memory allocation.
  /// @param query Query point
  /// @param k Number of neighbors
  /// @param k_indices Indices of neighbors
  /// @param k_sq_dists Squared distances to neighbors (sorted in ascending order)
  /// @param setting KNN search setting
  /// @return Number of found neighbors
  size_t knn_search(const Eigen::Vector4d& query, int k, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {
    KnnResult<-1> result(k_indices, k_sq_dists, k);
    knn_search(query, result, setting);
    return result.num_found();
  }
  /// @brief Find k-nearest neighbors. This method uses fixed and static memory allocation. Might be faster for small k.
  /// @tparam N Number of neighbors (compile-time constant)
  /// @param query Query point
  /// @param k_indices Indices of neighbors
  /// @param k_sq_dists Squared distances to neighbors (sorted in ascending order)
  /// @param setting KNN search setting
  /// @return Number of found neighbors
  template <int N>
  size_t knn_search(const Eigen::Vector4d& query, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {
    KnnResult<N> result(k_indices, k_sq_dists);
    knn_search(query, result, setting);
    return result.num_found();
  }

private:
  /// @brief Shared search core: project the query and scan a
  /// (2*search_window_h+1) x (2*search_window_v+1) pixel window around it.
  /// Out-of-range pixel coordinates are remapped by the border modes; pixels
  /// still out of range after remapping, or holding no point, are skipped.
  template <typename Result>
  void knn_search(const Eigen::Vector4d& query, Result& result, const KnnSetting& setting) const {
    BorderModeH border_h;
    BorderModeV border_v;
    Projection project;
    const Eigen::Vector2d uv = project(query.head<3>());
    const int u = uv[0] * index_map.cols();
    const int v = uv[1] * index_map.rows();
    for (int du = -search_window_h; du <= search_window_h; du++) {
      const int u_clamped = border_h(u + du, index_map.cols());
      if (u_clamped < 0 || u_clamped >= index_map.cols()) {
        continue;
      }
      for (int dv = -search_window_v; dv <= search_window_v; dv++) {
        const int v_clamped = border_v(v + dv, index_map.rows());
        if (v_clamped < 0 || v_clamped >= index_map.rows()) {
          continue;
        }
        const auto index = index_map(v_clamped, u_clamped);
        if (index == invalid_index) {
          continue;
        }
        const double sq_dist = (traits::point(points, index) - query).squaredNorm();
        result.push(index, sq_dist);
        // Early exit once the result container reports it is satisfied.
        if (setting.fulfilled(result)) {
          return;
        }
      }
    }
  }

public:
  static constexpr std::uint32_t invalid_index = std::numeric_limits<std::uint32_t>::max();  ///< Marker for empty pixels in the index map.
  const PointCloud& points;                                                                  ///< Target point cloud (not owned; must outlive this object).
  Eigen::Matrix<std::uint32_t, -1, -1> index_map;                                            ///< Per-pixel point index (invalid_index if empty).
  int search_window_h;                                                                       ///< Horizontal search half-window in pixels.
  int search_window_v;                                                                       ///< Vertical search half-window in pixels.
};
/// @brief "Safe" projective search. This class holds the ownership of the target point cloud.
template <typename PointCloud, typename Projection = EquirectangularProjection, typename BorderModeH = BorderRepeat, typename BorderModeV = BorderClamp>
struct ProjectiveSearch {
public:
  using Ptr = std::shared_ptr<ProjectiveSearch<PointCloud, Projection>>;
  using ConstPtr = std::shared_ptr<const ProjectiveSearch<PointCloud, Projection>>;

  /// @brief Constructor.
  /// @param width Index map width
  /// @param height Index map height
  /// @param points Target point cloud (shared ownership is retained by this class)
  explicit ProjectiveSearch(int width, int height, std::shared_ptr<const PointCloud> points) : points(points), search(width, height, *points) {}

  /// @brief Find the nearest neighbor.
  /// @param query Query point
  /// @param k_indices Index of the nearest neighbor (uninitialized if not found)
  /// @param k_sq_dists Squared distance to the nearest neighbor (uninitialized if not found)
  /// @param setting KNN search setting
  /// @return Number of found neighbors (0 or 1)
  size_t nearest_neighbor_search(const Eigen::Vector4d& query, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {
    return search.nearest_neighbor_search(query, k_indices, k_sq_dists, setting);
  }

  /// @brief Find k-nearest neighbors. This method uses dynamic memory allocation.
  /// @param query Query point
  /// @param k Number of neighbors
  /// @param k_indices Indices of neighbors
  /// @param k_sq_dists Squared distances to neighbors (sorted in ascending order)
  /// @param setting KNN search setting
  /// @return Number of found neighbors
  size_t knn_search(const Eigen::Vector4d& query, size_t k, size_t* k_indices, double* k_sq_dists, const KnnSetting& setting = KnnSetting()) const {
    return search.knn_search(query, k, k_indices, k_sq_dists, setting);
  }

public:
  const std::shared_ptr<const PointCloud> points;                                        ///< Points
  const UnsafeProjectiveSearch<PointCloud, Projection, BorderModeH, BorderModeV> search; ///< Search
};
namespace traits {
/// @brief Traits adapter so that UnsafeProjectiveSearch can be used through the generic
/// traits::nearest_neighbor_search / traits::knn_search free functions.
template <typename PointCloud, typename Projection, typename BorderModeH, typename BorderModeV>
struct Traits<UnsafeProjectiveSearch<PointCloud, Projection, BorderModeH, BorderModeV>> {
/// @brief Forward to UnsafeProjectiveSearch::nearest_neighbor_search.
static size_t nearest_neighbor_search(
const UnsafeProjectiveSearch<PointCloud, Projection, BorderModeH, BorderModeV>& tree,
const Eigen::Vector4d& point,
size_t* k_indices,
double* k_sq_dists) {
return tree.nearest_neighbor_search(point, k_indices, k_sq_dists);
}
/// @brief Forward to UnsafeProjectiveSearch::knn_search.
static size_t
knn_search(const UnsafeProjectiveSearch<PointCloud, Projection, BorderModeH, BorderModeV>& tree, const Eigen::Vector4d& point, size_t k, size_t* k_indices, double* k_sq_dists) {
return tree.knn_search(point, k, k_indices, k_sq_dists);
}
};
/// @brief Traits adapter for the owning ProjectiveSearch wrapper (same forwarding as above).
template <typename PointCloud, typename Projection, typename BorderModeH, typename BorderModeV>
struct Traits<ProjectiveSearch<PointCloud, Projection, BorderModeH, BorderModeV>> {
/// @brief Forward to ProjectiveSearch::nearest_neighbor_search.
static size_t
nearest_neighbor_search(const ProjectiveSearch<PointCloud, Projection, BorderModeH, BorderModeV>& tree, const Eigen::Vector4d& point, size_t* k_indices, double* k_sq_dists) {
return tree.nearest_neighbor_search(point, k_indices, k_sq_dists);
}
/// @brief Forward to ProjectiveSearch::knn_search.
static size_t
knn_search(const ProjectiveSearch<PointCloud, Projection, BorderModeH, BorderModeV>& tree, const Eigen::Vector4d& point, size_t k, size_t* k_indices, double* k_sq_dists) {
return tree.knn_search(point, k, k_indices, k_sq_dists);
}
};
} // namespace traits
} // namespace small_gicp

View File

@ -0,0 +1,58 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <small_gicp/points/traits.hpp>
#include <small_gicp/ann/incremental_voxelmap.hpp>
namespace small_gicp {
/**
* @brief A wrapper class to sequentially access points in a voxelmap (e.g., using points in a voxelmap as source point cloud for registration).
* @note If the contents of the voxelmap are changed, the accessor must be recreated.
* @example
small_gicp::IncrementalVoxelMap<small_gicp::FlatContainerCov>::Ptr last_voxelmap = ...;
small_gicp::IncrementalVoxelMap<small_gicp::FlatContainerCov>::Ptr voxelmap = ...;
// Create a sequential accessor
const auto accessor = small_gicp::create_sequential_accessor(*voxelmap);
// Use the voxelmap as a source point cloud
small_gicp::Registration<small_gicp::GICPFactor, small_gicp::ParallelReductionOMP> registration;
auto result = registration.align(*last_voxelmap, accessor, *last_voxelmap, Eigen::Isometry3d::Identity());
*/
template <typename VoxelMap>
class SequentialVoxelMapAccessor {
public:
  /// @brief Constructor. Takes a snapshot of the voxelmap's point indices.
  /// @param voxelmap Voxelmap (borrowed reference; must outlive this accessor)
  /// @note explicit: an accessor is a view with a lifetime dependency, so implicit
  ///       conversion from a voxelmap would be error-prone.
  explicit SequentialVoxelMapAccessor(const VoxelMap& voxelmap) : voxelmap(voxelmap), indices(traits::point_indices(voxelmap)) {}

  /// @brief Number of points in the voxelmap.
  size_t size() const { return indices.size(); }

public:
  const VoxelMap& voxelmap;           ///< Voxelmap
  const std::vector<size_t> indices;  ///< Indices of points in the voxelmap (snapshot taken at construction)
};
/// @brief Create a sequential voxelmap accessor.
/// @param voxelmap Voxelmap (borrowed reference; must outlive the returned accessor)
/// @return Accessor exposing the voxelmap's points as a sequential point cloud
template <typename VoxelMap>
SequentialVoxelMapAccessor<VoxelMap> create_sequential_accessor(const VoxelMap& voxelmap) {
return SequentialVoxelMapAccessor<VoxelMap>(voxelmap);
}
/// @brief Point cloud traits for SequentialVoxelMapAccessor.
/// Each query is forwarded to the underlying voxelmap with the i-th snapshot index.
template <typename VoxelMap>
struct traits::Traits<SequentialVoxelMapAccessor<VoxelMap>> {
static size_t size(const SequentialVoxelMapAccessor<VoxelMap>& accessor) { return accessor.size(); }
static bool has_points(const SequentialVoxelMapAccessor<VoxelMap>& accessor) { return traits::has_points(accessor.voxelmap); }
static bool has_normals(const SequentialVoxelMapAccessor<VoxelMap>& accessor) { return traits::has_normals(accessor.voxelmap); }
static bool has_covs(const SequentialVoxelMapAccessor<VoxelMap>& accessor) { return traits::has_covs(accessor.voxelmap); }
static Eigen::Vector4d point(const SequentialVoxelMapAccessor<VoxelMap>& accessor, size_t i) { return traits::point(accessor.voxelmap, accessor.indices[i]); }
static Eigen::Vector4d normal(const SequentialVoxelMapAccessor<VoxelMap>& accessor, size_t i) { return traits::normal(accessor.voxelmap, accessor.indices[i]); }
static Eigen::Matrix4d cov(const SequentialVoxelMapAccessor<VoxelMap>& accessor, size_t i) { return traits::cov(accessor.voxelmap, accessor.indices[i]); }
};
} // namespace small_gicp

View File

@ -16,14 +16,15 @@ struct Traits;
/// @param tree Nearest neighbor search (e.g., KdTree)
/// @param point Query point
/// @param k Number of neighbors
/// @param k_index [out] Index of the nearest neighbor
/// @param k_sq_dist [out] Squared distance to the nearest neighbor
/// @param k_indices [out] Indices of k-nearest neighbors
/// @param k_sq_dists [out] Squared distances to k-nearest neighbors
/// @return Number of found neighbors
template <typename T>
size_t knn_search(const T& tree, const Eigen::Vector4d& point, size_t k, size_t* k_indices, double* k_sq_dists) {
return Traits<T>::knn_search(tree, point, k, k_indices, k_sq_dists);
}
/// @brief Check if T has nearest_neighbor_search method.
template <typename T>
struct has_nearest_neighbor_search {
template <typename U, int = (&Traits<U>::nearest_neighbor_search, 0)>
@ -33,7 +34,7 @@ struct has_nearest_neighbor_search {
static constexpr bool value = decltype(test((T*)nullptr))::value;
};
/// @brief Find the nearest neighbor.
/// @brief Find the nearest neighbor. If Traits<T>::nearest_neighbor_search is not defined, fallback to knn_search with k=1.
/// @param tree Nearest neighbor search (e.g., KdTree)
/// @param point Query point
/// @param k_index [out] Index of the nearest neighbor

View File

@ -12,8 +12,13 @@ namespace small_gicp {
/// @brief GICP (distribution-to-distribution) per-point error factor.
struct GICPFactor {
struct Setting {};
/// @brief Constructor
GICPFactor() : target_index(std::numeric_limits<size_t>::max()), source_index(std::numeric_limits<size_t>::max()), mahalanobis(Eigen::Matrix4d::Zero()) {}
GICPFactor(const Setting& setting = Setting())
: target_index(std::numeric_limits<size_t>::max()),
source_index(std::numeric_limits<size_t>::max()),
mahalanobis(Eigen::Matrix4d::Zero()) {}
/// @brief Linearize the factor
/// @param target Target point cloud
@ -25,7 +30,7 @@ struct GICPFactor {
/// @param H Linearized information matrix
/// @param b Linearized information vector
/// @param e Error at the linearization point
/// @return
/// @return True if the point is inlier
template <typename TargetPointCloud, typename SourcePointCloud, typename TargetTree, typename CorrespondenceRejector>
bool linearize(
const TargetPointCloud& target,

View File

@ -12,7 +12,9 @@ namespace small_gicp {
/// @brief Point-to-point per-point error factor.
struct ICPFactor {
ICPFactor() : target_index(std::numeric_limits<size_t>::max()), source_index(std::numeric_limits<size_t>::max()) {}
struct Setting {};
ICPFactor(const Setting& setting = Setting()) : target_index(std::numeric_limits<size_t>::max()), source_index(std::numeric_limits<size_t>::max()) {}
template <typename TargetPointCloud, typename SourcePointCloud, typename TargetTree, typename CorrespondenceRejector>
bool linearize(

View File

@ -12,7 +12,9 @@ namespace small_gicp {
/// @brief Point-to-plane per-point error factor.
struct PointToPlaneICPFactor {
PointToPlaneICPFactor() : target_index(std::numeric_limits<size_t>::max()), source_index(std::numeric_limits<size_t>::max()) {}
struct Setting {};
PointToPlaneICPFactor(const Setting& setting = Setting()) : target_index(std::numeric_limits<size_t>::max()), source_index(std::numeric_limits<size_t>::max()) {}
template <typename TargetPointCloud, typename SourcePointCloud, typename TargetTree, typename CorrespondenceRejector>
bool linearize(

View File

@ -0,0 +1,108 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <cmath>

#include <Eigen/Core>
#include <Eigen/Geometry>
namespace small_gicp {
/// @brief Huber robust kernel
struct Huber {
public:
/// @brief Huber robust kernel setting
struct Setting {
double c = 1.0; ///< Kernel width
};
/// @brief Constructor
Huber(const Setting& setting) : c(setting.c) {}
/// @brief Compute the weight for an error
/// @param e Error
/// @return Weight
double weight(double e) const {
const double e_abs = std::abs(e);
return e_abs < c ? 1.0 : c / e_abs;
}
public:
const double c; ///< Kernel width
};
/// @brief Cauchy robust kernel
struct Cauchy {
public:
/// @brief Huber robust kernel setting
struct Setting {
double c = 1.0; ///< Kernel width
};
/// @brief Constructor
Cauchy(const Setting& setting) : c(setting.c) {}
/// @brief Compute the weight for an error
/// @param e Error
/// @return Weight
double weight(double e) const { return c / (c + e * e); }
public:
const double c; ///< Kernel width
};
/// @brief Robustify a factor with a robust kernel
/// @tparam Kernel Robust kernel
/// @tparam Factor Factor
template <typename Kernel, typename Factor>
struct RobustFactor {
public:
/// @brief Robust factor setting
struct Setting {
typename Kernel::Setting robust_kernel; ///< Robust kernel setting
typename Factor::Setting factor; ///< Factor setting
};
/// @brief Constructor
/// @param setting Robust factor setting (forwarded to the kernel and the wrapped factor)
RobustFactor(const Setting& setting = Setting()) : robust_kernel(setting.robust_kernel), factor(setting.factor) {}
/// @brief Linearize the wrapped factor and scale H, b, and e by the robust kernel weight.
/// @param target       Target point cloud
/// @param source       Source point cloud
/// @param target_tree  Nearest neighbor search for the target point cloud
/// @param T            Linearization point (T_target_source)
/// @param source_index Index of the source point
/// @param rejector     Correspondence rejector
/// @param H            [out] Linearized information matrix (robust-weighted)
/// @param b            [out] Linearized information vector (robust-weighted)
/// @param e            [out] Error at the linearization point (robust-weighted)
/// @return True if the wrapped factor considered the point an inlier
template <typename TargetPointCloud, typename SourcePointCloud, typename TargetTree, typename CorrespondenceRejector>
bool linearize(
const TargetPointCloud& target,
const SourcePointCloud& source,
const TargetTree& target_tree,
const Eigen::Isometry3d& T,
size_t source_index,
const CorrespondenceRejector& rejector,
Eigen::Matrix<double, 6, 6>* H,
Eigen::Matrix<double, 6, 1>* b,
double* e) {
if (!factor.linearize(target, source, target_tree, T, source_index, rejector, H, b, e)) {
return false;
}
// Robustify the linearized factor.
// The kernel weight is evaluated on sqrt(*e), i.e., on the error magnitude rather than the squared error.
const double w = robust_kernel.weight(std::sqrt(*e));
*H *= w;
*b *= w;
*e *= w;
return true;
}
/// @brief Evaluate the robust-weighted error of the wrapped factor.
/// @param target Target point cloud
/// @param source Source point cloud
/// @param T      Evaluation point (T_target_source)
/// @return Robust-weighted error (weight evaluated on sqrt of the raw error, as in linearize)
template <typename TargetPointCloud, typename SourcePointCloud>
double error(const TargetPointCloud& target, const SourcePointCloud& source, const Eigen::Isometry3d& T) const {
const double e = factor.error(target, source, T);
return robust_kernel.weight(std::sqrt(e)) * e;
}
/// @brief Check if the factor is inlier (delegates to the wrapped factor)
bool inlier() const { return factor.inlier(); }
public:
Kernel robust_kernel; ///< Robust kernel
Factor factor; ///< Factor
};
} // namespace small_gicp

View File

@ -9,24 +9,30 @@ namespace pcl {
using Matrix4fMap = Eigen::Map<Eigen::Matrix4f, Eigen::Aligned>;
using Matrix4fMapConst = const Eigen::Map<const Eigen::Matrix4f, Eigen::Aligned>;
/// @brief Point with covariance
/// @brief Point with covariance for PCL
struct PointCovariance {
PCL_ADD_POINT4D;
Eigen::Matrix4f cov;
PCL_ADD_POINT4D; ///< Point coordinates
Eigen::Matrix4f cov; ///< 4x4 covariance matrix
/// @brief Get covariance matrix as Matrix4fMap
Matrix4fMap getCovariance4fMap() { return Matrix4fMap(cov.data()); }
/// @brief Get covariance matrix as Matrix4fMapConst
Matrix4fMapConst getCovariance4fMap() const { return Matrix4fMapConst(cov.data()); }
PCL_MAKE_ALIGNED_OPERATOR_NEW
};
/// @brief Point with normal and covariance
/// @brief Point with normal and covariance for PCL
struct PointNormalCovariance {
PCL_ADD_POINT4D;
PCL_ADD_NORMAL4D
Eigen::Matrix4f cov;
PCL_ADD_POINT4D; ///< Point coordinates
PCL_ADD_NORMAL4D ///< Point normal
Eigen::Matrix4f cov; ///< 4x4 covariance matrix
/// @brief Get covariance matrix as Matrix4fMap
Matrix4fMap getCovariance4fMap() { return Matrix4fMap(cov.data()); }
/// @brief Get covariance matrix as Matrix4fMapConst
Matrix4fMapConst getCovariance4fMap() const { return Matrix4fMapConst(cov.data()); }
PCL_MAKE_ALIGNED_OPERATOR_NEW

View File

@ -7,12 +7,13 @@
namespace small_gicp {
/// @brief Proxy class to access PCL point cloud with external covariance matrices.
template <typename PointT>
struct PointCloudProxy {
PointCloudProxy(const pcl::PointCloud<PointT>& cloud, std::vector<Eigen::Matrix4d>& covs) : cloud(cloud), covs(covs) {}
const pcl::PointCloud<PointT>& cloud;
std::vector<Eigen::Matrix4d>& covs;
const pcl::PointCloud<PointT>& cloud; ///< Point cloud
std::vector<Eigen::Matrix4d>& covs; ///< Covariance matrices
};
namespace traits {

View File

@ -99,8 +99,8 @@ protected:
std::string registration_type_; ///< Registration type ("GICP" or "VGICP").
bool verbose_; ///< Verbosity flag.
std::shared_ptr<KdTree<pcl::PointCloud<PointSource>>> target_tree_; ///< KdTree for target point cloud.
std::shared_ptr<KdTree<pcl::PointCloud<PointSource>>> source_tree_; ///< KdTree for source point cloud.
std::shared_ptr<small_gicp::KdTree<pcl::PointCloud<PointSource>>> target_tree_; ///< KdTree for target point cloud.
std::shared_ptr<small_gicp::KdTree<pcl::PointCloud<PointSource>>> source_tree_; ///< KdTree for source point cloud.
std::shared_ptr<GaussianVoxelMap> target_voxelmap_; ///< VoxelMap for target point cloud.
std::shared_ptr<GaussianVoxelMap> source_voxelmap_; ///< VoxelMap for source point cloud.

View File

@ -44,7 +44,7 @@ void RegistrationPCL<PointSource, PointTarget>::setInputSource(const PointCloudS
}
pcl::Registration<PointSource, PointTarget, Scalar>::setInputSource(cloud);
source_tree_ = std::make_shared<KdTree<pcl::PointCloud<PointSource>>>(input_, KdTreeBuilderOMP(num_threads_));
source_tree_ = std::make_shared<small_gicp::KdTree<pcl::PointCloud<PointSource>>>(input_, KdTreeBuilderOMP(num_threads_));
source_covs_.clear();
source_voxelmap_.reset();
}
@ -56,7 +56,7 @@ void RegistrationPCL<PointSource, PointTarget>::setInputTarget(const PointCloudT
}
pcl::Registration<PointSource, PointTarget, Scalar>::setInputTarget(cloud);
target_tree_ = std::make_shared<KdTree<pcl::PointCloud<PointTarget>>>(target_, KdTreeBuilderOMP(num_threads_));
target_tree_ = std::make_shared<small_gicp::KdTree<pcl::PointCloud<PointTarget>>>(target_, KdTreeBuilderOMP(num_threads_));
target_covs_.clear();
target_voxelmap_.reset();
}
@ -214,7 +214,7 @@ void RegistrationPCL<PointSource, PointTarget>::computeTransformation(PointCloud
estimate_covariances_omp(target_proxy, *target_tree_, k_correspondences_, num_threads_);
}
Registration<GICPFactor, ParallelReductionOMP> registration;
small_gicp::Registration<GICPFactor, ParallelReductionOMP> registration;
registration.criteria.rotation_eps = rotation_epsilon_;
registration.criteria.translation_eps = transformation_epsilon_;
registration.reduction.num_threads = num_threads_;

View File

@ -2,6 +2,7 @@
// SPDX-License-Identifier: MIT
#pragma once
#include <vector>
#include <Eigen/Core>
#include <small_gicp/points/traits.hpp>
@ -15,5 +16,30 @@ struct Traits<Eigen::MatrixXd> {
static Eigen::Vector4d point(const Eigen::MatrixXd& points, size_t i) { return Eigen::Vector4d(points(i, 0), points(i, 1), points(i, 2), 1.0); }
};
template <typename Scalar, int D>
struct Traits<std::vector<Eigen::Matrix<Scalar, D, 1>>> {
static_assert(D == 3 || D == 4, "Only 3D and 4D (homogeneous) points are supported");
static_assert(std::is_floating_point<Scalar>::value, "Only floating point types are supported");
using Point = Eigen::Matrix<Scalar, D, 1>;
static size_t size(const std::vector<Point>& points) { return points.size(); }
static bool has_points(const std::vector<Point>& points) { return points.size(); }
static Eigen::Vector4d point(const std::vector<Point>& points, size_t i) {
if constexpr (std::is_same_v<Scalar, double>) {
if constexpr (D == 3) {
return points[i].homogeneous();
} else {
return points[i];
}
} else {
if constexpr (D == 3) {
return points[i].homogeneous().template cast<double>();
} else {
return points[i].template cast<double>();
}
}
}
};
} // namespace traits
} // namespace small_gicp

View File

@ -106,9 +106,9 @@ struct LevenbergMarquardtOptimizer {
bool success = false;
for (int j = 0; j < max_inner_iterations; j++) {
// Solve with damping
const Eigen::Matrix<double, 6, 1> delta = (H + lambda * Eigen ::Matrix<double, 6, 6>::Identity()).ldlt().solve(-b);
const Eigen::Matrix<double, 6, 1> delta = (H + lambda * Eigen::Matrix<double, 6, 6>::Identity()).ldlt().solve(-b);
// Validte new solution
// Validate new solution
const Eigen::Isometry3d new_T = result.T_target_source * se3_exp(delta);
double new_e = reduction.error(target, source, new_T, factors);
general_factor.update_error(target, source, new_T, &e);
@ -118,12 +118,13 @@ struct LevenbergMarquardtOptimizer {
<< " dr=" << delta.head<3>().norm() << std::endl;
}
if (new_e < e) {
if (new_e <= e) {
// Error decreased, decrease lambda
result.converged = criteria.converged(delta);
result.T_target_source = new_T;
lambda /= lambda_factor;
success = true;
e = new_e;
break;
} else {
@ -149,7 +150,7 @@ struct LevenbergMarquardtOptimizer {
bool verbose; ///< If true, print debug messages
int max_iterations; ///< Max number of optimization iterations
int max_inner_iterations; ///< Max number of inner iterations (lambda-trial)
int max_inner_iterations; ///< Max number of inner iterations (lambda-trial)
double init_lambda; ///< Initial lambda (damping factor)
double lambda_factor; ///< Lambda increase factor
};

View File

@ -7,7 +7,11 @@
namespace small_gicp {
#ifndef _OPENMP
#ifdef _WIN32
#pragma message(__FILE__, " OpenMP is not available.Parallel reduction will be disabled.")
#else
#warning "OpenMP is not available. Parallel reduction will be disabled."
#endif
inline int omp_get_thread_num() {
return 0;
}

View File

@ -15,7 +15,7 @@ namespace small_gicp {
/// @brief Point cloud registration.
template <
typename Factor,
typename PointFactor,
typename Reduction,
typename GeneralFactor = NullFactor,
typename CorrespondenceRejector = DistanceRejector,
@ -38,13 +38,16 @@ public:
std::cerr << "warning: source point cloud is too small. |source|=" << traits::size(source) << std::endl;
}
std::vector<Factor> factors(traits::size(source));
std::vector<PointFactor> factors(traits::size(source), PointFactor(point_factor));
return optimizer.optimize(target, source, target_tree, rejector, criteria, reduction, init_T, factors, general_factor);
}
public:
using PointFactorSetting = typename PointFactor::Setting;
TerminationCriteria criteria; ///< Termination criteria
CorrespondenceRejector rejector; ///< Correspondence rejector
PointFactorSetting point_factor; ///< Factor setting
GeneralFactor general_factor; ///< General factor
Reduction reduction; ///< Reduction
Optimizer optimizer; ///< Optimizer

View File

@ -9,7 +9,9 @@
namespace small_gicp {
/// @brief Preprocess point cloud (downsampling, kdtree creation, and normal and covariance estimation)
/// @brief Preprocess point cloud (downsampling, kdtree creation, and normal and covariance estimation).
/// @note When num_threads >= 2, this function has minor run-by-run non-determinism due to the parallel downsampling.
/// @see small_gicp::voxelgrid_sampling_omp, small_gicp::estimate_normals_covariances_omp
/// @param points Input points
/// @param downsampling_resolution Downsample resolution
/// @param num_neighbors Number of neighbors for normal/covariance estimation
@ -19,11 +21,14 @@ preprocess_points(const PointCloud& points, double downsampling_resolution, int
/// @brief Preprocess point cloud (downsampling, kdtree creation, and normal and covariance estimation)
/// @note This function only accepts Eigen::Vector(3|4)(f|d) as input
/// @note When num_threads >= 2, this function has minor run-by-run non-determinism due to the parallel downsampling.
/// @see small_gicp::voxelgrid_sampling_omp, small_gicp::estimate_normals_covariances_omp
template <typename T, int D>
std::pair<PointCloud::Ptr, std::shared_ptr<KdTree<PointCloud>>>
preprocess_points(const std::vector<Eigen::Matrix<T, D, 1>>& points, double downsampling_resolution, int num_neighbors = 10, int num_threads = 4);
/// @brief Create Gaussian voxel map
/// @brief Create an incremental Gaussian voxel map.
/// @see small_gicp::IncrementalVoxelMap
/// @param points Input points
/// @param voxel_resolution Voxel resolution
GaussianVoxelMap::Ptr create_gaussian_voxelmap(const PointCloud& points, double voxel_resolution);
@ -39,10 +44,13 @@ struct RegistrationSetting {
double rotation_eps = 0.1 * M_PI / 180.0; ///< Rotation tolerance for convergence check [rad]
double translation_eps = 1e-3; ///< Translation tolerance for convergence check
int num_threads = 4; ///< Number of threads
int max_iterations = 20; ///< Maximum number of iterations
bool verbose = false; ///< Verbose mode
};
/// @brief Align point clouds
/// @note This function only accepts Eigen::Vector(3|4)(f|d) as input
/// @see small_gicp::voxelgrid_sampling_omp, small_gicp::estimate_normals_covariances_omp
/// @param target Target points
/// @param source Source points
/// @param init_T Initial guess of T_target_source
@ -70,9 +78,8 @@ RegistrationResult align(
const RegistrationSetting& setting = RegistrationSetting());
/// @brief Align preprocessed point clouds with VGICP
/// @param target Target point cloud
/// @param target Target Gaussian voxelmap
/// @param source Source point cloud
/// @param target_tree Nearest neighbor search for the target point cloud
/// @param init_T Initial guess of T_target_source
/// @param setting Registration setting
/// @return Registration result

View File

@ -24,7 +24,7 @@ struct DistanceRejector {
return sq_dist > max_dist_sq;
}
double max_dist_sq;
double max_dist_sq; ///< Maximum squared distance between corresponding points
};
} // namespace small_gicp

View File

@ -14,7 +14,7 @@ struct TerminationCriteria {
/// @brief Check the convergence
/// @param delta Transformation update vector
/// @return True if converged
bool converged(const Eigen::Matrix<double, 6, 1>& delta) const { return delta.template head<3>().norm() < rotation_eps && delta.template tail<3>().norm() < translation_eps; }
bool converged(const Eigen::Matrix<double, 6, 1>& delta) const { return delta.template head<3>().norm() <= rotation_eps && delta.template tail<3>().norm() <= translation_eps; }
double translation_eps; ///< Rotation tolerance [rad]
double rotation_eps; ///< Translation tolerance

View File

@ -12,8 +12,10 @@
namespace small_gicp {
/// @brief Voxelgrid downsampling.
/// @brief Voxelgrid downsampling. This function computes exact average of points in each voxel, and each voxel can contain arbitrary number of points.
/// @note Discretized voxel coords must be in 21bit range [-1048576, 1048575].
/// For example, if the downsampling resolution is 0.01 m, point coordinates must be in [-10485.76, 10485.75] m.
/// Points outside the valid range will be ignored.
/// @param points Input points
/// @param leaf_size Downsampling resolution
/// @return Downsampled points
@ -25,21 +27,22 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling(const InputPointCloud& poin
const double inv_leaf_size = 1.0 / leaf_size;
const int coord_bit_size = 21; // Bits to represent each voxel coordinate (pack 21x3=63bits in 64bit int)
const size_t coord_bit_mask = (1 << 21) - 1; // Bit mask
const int coord_offset = 1 << (coord_bit_size - 1); // Coordinate offset to make values positive
constexpr std::uint64_t invalid_coord = std::numeric_limits<std::uint64_t>::max();
constexpr int coord_bit_size = 21; // Bits to represent each voxel coordinate (pack 21x3=63bits in 64bit int)
constexpr size_t coord_bit_mask = (1 << 21) - 1; // Bit mask
constexpr int coord_offset = 1 << (coord_bit_size - 1); // Coordinate offset to make values positive
std::vector<std::pair<std::uint64_t, size_t>> coord_pt(traits::size(points));
for (size_t i = 0; i < traits::size(points); i++) {
const Eigen::Array4i coord = fast_floor(traits::point(points, i) * inv_leaf_size) + coord_offset;
if ((coord < 0).any() || (coord > coord_bit_mask).any()) {
std::cerr << "warning: voxel coord is out of range!!" << std::endl;
coord_pt[i] = {0, i};
coord_pt[i] = {invalid_coord, i};
continue;
}
// Compute voxel coord bits (0|1bit, z|21bit, y|21bit, x|21bit)
const std::uint64_t bits = //
const std::uint64_t bits = //
(static_cast<std::uint64_t>(coord[0] & coord_bit_mask) << (coord_bit_size * 0)) | //
(static_cast<std::uint64_t>(coord[1] & coord_bit_mask) << (coord_bit_size * 1)) | //
(static_cast<std::uint64_t>(coord[2] & coord_bit_mask) << (coord_bit_size * 2));
@ -56,6 +59,10 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling(const InputPointCloud& poin
size_t num_points = 0;
Eigen::Vector4d sum_pt = traits::point(points, coord_pt.front().second);
for (size_t i = 1; i < traits::size(points); i++) {
if (coord_pt[i].first == invalid_coord) {
continue;
}
if (coord_pt[i - 1].first != coord_pt[i].first) {
traits::set_point(*downsampled, num_points++, sum_pt / sum_pt.w());
sum_pt.setZero();

View File

@ -14,7 +14,11 @@
namespace small_gicp {
/// @brief Voxel grid downsampling with OpenMP backend.
/// @note This function has minor run-by-run non-deterministic behavior due to parallel data collection that results
/// in a deviation of the number of points in the downsampling results (up to 10% increase from the single-thread version).
/// @note Discretized voxel coords must be in 21bit range [-1048576, 1048575].
/// For example, if the downsampling resolution is 0.01 m, point coordinates must be in [-10485.76, 10485.75] m.
/// Points outside the valid range will be ignored.
/// @param points Input points
/// @param leaf_size Downsampling resolution
/// @return Downsampled points
@ -25,9 +29,11 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling_omp(const InputPointCloud&
}
const double inv_leaf_size = 1.0 / leaf_size;
const int coord_bit_size = 21; // Bits to represent each voxel coordinate (pack 21x3 = 63bits in 64bit int)
const size_t coord_bit_mask = (1 << 21) - 1; // Bit mask
const int coord_offset = 1 << (coord_bit_size - 1); // Coordinate offset to make values positive
constexpr std::uint64_t invalid_coord = std::numeric_limits<std::uint64_t>::max();
constexpr int coord_bit_size = 21; // Bits to represent each voxel coordinate (pack 21x3 = 63bits in 64bit int)
constexpr size_t coord_bit_mask = (1 << 21) - 1; // Bit mask
constexpr int coord_offset = 1 << (coord_bit_size - 1); // Coordinate offset to make values positive
std::vector<std::pair<std::uint64_t, size_t>> coord_pt(traits::size(points));
#pragma omp parallel for num_threads(num_threads) schedule(guided, 32)
@ -35,11 +41,11 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling_omp(const InputPointCloud&
const Eigen::Array4i coord = fast_floor(traits::point(points, i) * inv_leaf_size) + coord_offset;
if ((coord < 0).any() || (coord > coord_bit_mask).any()) {
std::cerr << "warning: voxel coord is out of range!!" << std::endl;
coord_pt[i] = {0, i};
coord_pt[i] = {invalid_coord, i};
continue;
}
// Compute voxel coord bits (0|1bit, z|21bit, y|21bit, x|21bit)
const std::uint64_t bits = //
const std::uint64_t bits = //
(static_cast<std::uint64_t>(coord[0] & coord_bit_mask) << (coord_bit_size * 0)) | //
(static_cast<std::uint64_t>(coord[1] & coord_bit_mask) << (coord_bit_size * 1)) | //
(static_cast<std::uint64_t>(coord[2] & coord_bit_mask) << (coord_bit_size * 2));
@ -47,11 +53,7 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling_omp(const InputPointCloud&
}
// Sort by voxel coords
quick_sort_omp(
coord_pt.begin(),
coord_pt.end(),
[](const auto& lhs, const auto& rhs) { return lhs.first < rhs.first; },
num_threads);
quick_sort_omp(coord_pt.begin(), coord_pt.end(), [](const auto& lhs, const auto& rhs) { return lhs.first < rhs.first; }, num_threads);
auto downsampled = std::make_shared<OutputPointCloud>();
traits::resize(*downsampled, traits::size(points));
@ -69,6 +71,10 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling_omp(const InputPointCloud&
Eigen::Vector4d sum_pt = traits::point(points, coord_pt[block_begin].second);
for (size_t i = block_begin + 1; i != block_end; i++) {
if (coord_pt[i].first == invalid_coord) {
continue;
}
if (coord_pt[i - 1].first != coord_pt[i].first) {
sub_points.emplace_back(sum_pt / sum_pt.w());
sum_pt.setZero();

View File

@ -14,7 +14,11 @@
namespace small_gicp {
/// @brief Voxel grid downsampling with TBB backend.
/// @note This function has minor run-by-run non-deterministic behavior due to parallel data collection that results
/// in a deviation of the number of points in the downsampling results (up to 10% increase from the single-thread version).
/// @note Discretized voxel coords must be in 21bit range [-1048576, 1048575].
/// For example, if the downsampling resolution is 0.01 m, point coordinates must be in [-10485.76, 10485.75] m.
/// Points outside the valid range will be ignored.
/// @param points Input points
/// @param leaf_size Downsampling resolution
/// @return Downsampled points
@ -25,9 +29,11 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling_tbb(const InputPointCloud&
}
const double inv_leaf_size = 1.0 / leaf_size;
const int coord_bit_size = 21; // Bits to represent each voxel coordinate (pack 21x3 = 63bits in 64bit int)
const size_t coord_bit_mask = (1 << 21) - 1; // Bit mask
const int coord_offset = 1 << (coord_bit_size - 1); // Coordinate offset to make values positive
constexpr std::uint64_t invalid_coord = std::numeric_limits<std::uint64_t>::max();
constexpr int coord_bit_size = 21; // Bits to represent each voxel coordinate (pack 21x3 = 63bits in 64bit int)
constexpr size_t coord_bit_mask = (1 << 21) - 1; // Bit mask
constexpr int coord_offset = 1 << (coord_bit_size - 1); // Coordinate offset to make values positive
std::vector<std::pair<std::uint64_t, size_t>> coord_pt(traits::size(points));
tbb::parallel_for(tbb::blocked_range<size_t>(0, traits::size(points), 64), [&](const tbb::blocked_range<size_t>& range) {
@ -35,12 +41,12 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling_tbb(const InputPointCloud&
const Eigen::Array4i coord = fast_floor(traits::point(points, i) * inv_leaf_size) + coord_offset;
if ((coord < 0).any() || (coord > coord_bit_mask).any()) {
std::cerr << "warning: voxel coord is out of range!!" << std::endl;
coord_pt[i] = {0, i};
coord_pt[i] = {invalid_coord, i};
continue;
}
// Compute voxel coord bits (0|1bit, z|21bit, y|21bit, x|21bit)
const std::uint64_t bits = //
const std::uint64_t bits = //
(static_cast<std::uint64_t>(coord[0] & coord_bit_mask) << (coord_bit_size * 0)) | //
(static_cast<std::uint64_t>(coord[1] & coord_bit_mask) << (coord_bit_size * 1)) | //
(static_cast<std::uint64_t>(coord[2] & coord_bit_mask) << (coord_bit_size * 2));
@ -63,6 +69,10 @@ std::shared_ptr<OutputPointCloud> voxelgrid_sampling_tbb(const InputPointCloud&
Eigen::Vector4d sum_pt = traits::point(points, coord_pt[range.begin()].second);
for (size_t i = range.begin() + 1; i != range.end(); i++) {
if (coord_pt[i].first == invalid_coord) {
continue;
}
if (coord_pt[i - 1].first != coord_pt[i].first) {
sub_points.emplace_back(sum_pt / sum_pt.w());
sum_pt.setZero();

View File

@ -47,6 +47,10 @@ inline Eigen::Matrix3d skew(const Eigen::Vector3d& x) {
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
/// @brief SO3 expmap.
/// @param omega [rx, ry, rz]
/// @return Quaternion
inline Eigen::Quaterniond so3_exp(const Eigen::Vector3d& omega) {
double theta_sq = omega.dot(omega);
@ -67,28 +71,26 @@ inline Eigen::Quaterniond so3_exp(const Eigen::Vector3d& omega) {
}
// Rotation-first
/// @brief SE3 expmap (Rotation-first).
/// @param a Twist vector [rx, ry, rz, tx, ty, tz]
/// @return SE3 matrix
inline Eigen::Isometry3d se3_exp(const Eigen::Matrix<double, 6, 1>& a) {
using std::cos;
using std::sin;
const Eigen::Vector3d omega = a.head<3>();
double theta = std::sqrt(omega.dot(omega));
const Eigen::Quaterniond so3 = so3_exp(omega);
const Eigen::Matrix3d Omega = skew(omega);
const Eigen::Matrix3d Omega_sq = Omega * Omega;
Eigen::Matrix3d V;
if (theta < 1e-10) {
V = so3.matrix();
/// Note: That is an accurate expansion!
} else {
const double theta_sq = theta * theta;
V = (Eigen::Matrix3d::Identity() + (1.0 - cos(theta)) / (theta_sq)*Omega + (theta - sin(theta)) / (theta_sq * theta) * Omega_sq);
}
const double theta_sq = omega.dot(omega);
const double theta = std::sqrt(theta_sq);
Eigen::Isometry3d se3 = Eigen::Isometry3d::Identity();
se3.linear() = so3.toRotationMatrix();
se3.translation() = V * a.tail<3>();
se3.linear() = so3_exp(omega).toRotationMatrix();
if (theta < 1e-10) {
se3.translation() = se3.linear() * a.tail<3>();
/// Note: That is an accurate expansion!
} else {
const Eigen::Matrix3d Omega = skew(omega);
const Eigen::Matrix3d V = (Eigen::Matrix3d::Identity() + (1.0 - std::cos(theta)) / theta_sq * Omega + (theta - std::sin(theta)) / (theta_sq * theta) * Omega * Omega);
se3.translation() = V * a.tail<3>();
}
return se3;
}

View File

@ -7,10 +7,14 @@
namespace small_gicp {
/// @brief Computes point normals from eigenvectors and sets them to the point cloud.
/// @note If a sufficient number of neighbor points is not found, a zero vector is set to the point.
template <typename PointCloud>
struct NormalSetter {
/// @brief Handle invalid case (too few points).
static void set_invalid(PointCloud& cloud, size_t i) { traits::set_normal(cloud, i, Eigen::Vector4d::Zero()); }
/// @brief Compute and set the normal to the point cloud.
static void set(PointCloud& cloud, size_t i, const Eigen::Matrix3d& eigenvectors) {
const Eigen::Vector4d normal = (Eigen::Vector4d() << eigenvectors.col(0).normalized(), 0.0).finished();
if (traits::point(cloud, i).dot(normal) > 0) {
@ -21,14 +25,18 @@ struct NormalSetter {
}
};
/// @brief Computes point covariances from eigenvectors and sets them to the point cloud.
/// @note If a sufficient number of neighbor points is not found, an identity matrix is set to the point.
template <typename PointCloud>
struct CovarianceSetter {
/// @brief Handle invalid case (too few points).
static void set_invalid(PointCloud& cloud, size_t i) {
Eigen::Matrix4d cov = Eigen::Matrix4d::Identity();
cov(3, 3) = 0.0;
traits::set_cov(cloud, i, cov);
}
/// @brief Compute and set the covariance to the point cloud.
static void set(PointCloud& cloud, size_t i, const Eigen::Matrix3d& eigenvectors) {
const Eigen::Vector3d values(1e-3, 1.0, 1.0);
Eigen::Matrix4d cov = Eigen::Matrix4d::Zero();
@ -37,13 +45,17 @@ struct CovarianceSetter {
}
};
/// @brief Computes point normals and covariances from eigenvectors and sets them to the point cloud.
/// @note If a sufficient number of neighbor points is not found, a zero vector and an identity matrix are set to the point.
template <typename PointCloud>
struct NormalCovarianceSetter {
/// @brief Handle invalid case (too few points).
static void set_invalid(PointCloud& cloud, size_t i) {
NormalSetter<PointCloud>::set_invalid(cloud, i);
CovarianceSetter<PointCloud>::set_invalid(cloud, i);
}
/// @brief Compute and set the normal and covariance to the point cloud.
static void set(PointCloud& cloud, size_t i, const Eigen::Matrix3d& eigenvectors) {
NormalSetter<PointCloud>::set(cloud, i, eigenvectors);
CovarianceSetter<PointCloud>::set(cloud, i, eigenvectors);
@ -98,6 +110,8 @@ void estimate_local_features(PointCloud& cloud, int num_neighbors) {
}
/// @brief Estimate point normals
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
template <typename PointCloud>
@ -106,6 +120,8 @@ void estimate_normals(PointCloud& cloud, int num_neighbors = 20) {
}
/// @brief Estimate point normals
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation
@ -115,6 +131,8 @@ void estimate_normals(PointCloud& cloud, KdTree& kdtree, int num_neighbors = 20)
}
/// @brief Estimate point covariances
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
template <typename PointCloud>
@ -123,6 +141,8 @@ void estimate_covariances(PointCloud& cloud, int num_neighbors = 20) {
}
/// @brief Estimate point covariances
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation
@ -132,6 +152,8 @@ void estimate_covariances(PointCloud& cloud, KdTree& kdtree, int num_neighbors =
}
/// @brief Estimate point normals and covariances
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
template <typename PointCloud>
@ -140,6 +162,8 @@ void estimate_normals_covariances(PointCloud& cloud, int num_neighbors = 20) {
}
/// @brief Estimate point normals and covariances
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation

View File

@ -26,6 +26,8 @@ void estimate_local_features_omp(PointCloud& cloud, KdTree& kdtree, int num_neig
}
/// @brief Estimate point normals with OpenMP
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
/// @param num_threads Number of threads
@ -35,6 +37,8 @@ void estimate_normals_omp(PointCloud& cloud, int num_neighbors = 20, int num_thr
}
/// @brief Estimate point normals with OpenMP
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation
@ -45,6 +49,8 @@ void estimate_normals_omp(PointCloud& cloud, KdTree& kdtree, int num_neighbors =
}
/// @brief Estimate point covariances with OpenMP
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
/// @param num_threads Number of threads
@ -54,6 +60,8 @@ void estimate_covariances_omp(PointCloud& cloud, int num_neighbors = 20, int num
}
/// @brief Estimate point covariances with OpenMP
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation
@ -64,6 +72,8 @@ void estimate_covariances_omp(PointCloud& cloud, KdTree& kdtree, int num_neighbo
}
/// @brief Estimate point normals and covariances with OpenMP
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
/// @param num_threads Number of threads
@ -73,6 +83,8 @@ void estimate_normals_covariances_omp(PointCloud& cloud, int num_neighbors = 20,
}
/// @brief Estimate point normals and covariances with OpenMP
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation

View File

@ -30,6 +30,8 @@ void estimate_local_features_tbb(PointCloud& cloud, int num_neighbors) {
}
/// @brief Estimate point normals with TBB
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
template <typename PointCloud>
@ -38,6 +40,8 @@ void estimate_normals_tbb(PointCloud& cloud, int num_neighbors = 20) {
}
/// @brief Estimate point normals with TBB
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation
@ -47,6 +51,8 @@ void estimate_normals_tbb(PointCloud& cloud, KdTree& kdtree, int num_neighbors =
}
/// @brief Estimate point covariances with TBB
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
template <typename PointCloud>
@ -55,6 +61,8 @@ void estimate_covariances_tbb(PointCloud& cloud, int num_neighbors = 20) {
}
/// @brief Estimate point covariances with TBB
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation
@ -64,6 +72,8 @@ void estimate_covariances_tbb(PointCloud& cloud, KdTree& kdtree, int num_neighbo
}
/// @brief Estimate point normals and covariances with TBB
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param num_neighbors Number of neighbors used for attribute estimation
template <typename PointCloud>
@ -72,6 +82,8 @@ void estimate_normals_covariances_tbb(PointCloud& cloud, int num_neighbors = 20)
}
/// @brief Estimate point normals and covariances with TBB
/// @note If a sufficient number of neighbor points for normal/covariance estimation (5 points) is not found,
/// an invalid normal/covariance is set to the point (normal=zero vector, covariance=identity matrix).
/// @param cloud [in/out] Point cloud
/// @param kdtree Nearest neighbor search
/// @param num_neighbors Number of neighbors used for attribute estimation

View File

@ -11,6 +11,10 @@
namespace small_gicp {
/// @brief Implementation of merge sort with OpenMP parallelism. Do not call this directly. Use merge_sort_omp instead.
/// @param first First iterator
/// @param last Last iterator
/// @param comp Comparison function
template <typename RandomAccessIterator, typename Compare>
void merge_sort_omp_impl(RandomAccessIterator first, RandomAccessIterator last, const Compare& comp) {
const size_t n = std::distance(first, last);
@ -31,6 +35,12 @@ void merge_sort_omp_impl(RandomAccessIterator first, RandomAccessIterator last,
std::inplace_merge(first, center, last, comp);
}
/// @brief Merge sort with OpenMP parallelism.
/// @note This tends to be slower than quick_sort_omp.
/// @param first First iterator
/// @param last Last iterator
/// @param comp Comparison function
/// @param num_threads Number of threads
template <typename RandomAccessIterator, typename Compare>
void merge_sort_omp(RandomAccessIterator first, RandomAccessIterator last, const Compare& comp, int num_threads) {
#ifndef _MSC_VER
@ -44,6 +54,10 @@ void merge_sort_omp(RandomAccessIterator first, RandomAccessIterator last, const
#endif
}
/// @brief Implementation of quick sort with OpenMP parallelism. Do not call this directly. Use quick_sort_omp instead.
/// @param first First iterator
/// @param last Last iterator
/// @param comp Comparison function
template <typename RandomAccessIterator, typename Compare>
void quick_sort_omp_impl(RandomAccessIterator first, RandomAccessIterator last, const Compare& comp) {
const std::ptrdiff_t n = std::distance(first, last);
@ -72,6 +86,11 @@ void quick_sort_omp_impl(RandomAccessIterator first, RandomAccessIterator last,
quick_sort_omp_impl(middle2, last, comp);
}
/// @brief Quick sort with OpenMP parallelism.
/// @param first First iterator
/// @param last Last iterator
/// @param comp Comparison function
/// @param num_threads Number of threads
template <typename RandomAccessIterator, typename Compare>
void quick_sort_omp(RandomAccessIterator first, RandomAccessIterator last, const Compare& comp, int num_threads) {
#ifndef _MSC_VER

View File

@ -0,0 +1,143 @@
// SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
// SPDX-License-Identifier: MIT
#pragma once
#include <vector>
#include <algorithm>
#include <functional>
#include <tbb/tbb.h>
namespace small_gicp {
/// @brief Temporal buffers for radix sort.
/// @note  Reusing one instance across calls avoids repeated allocations.
/// @tparam T Type of the objects to be sorted.
template <typename T>
struct RadixSortBuffers {
  std::vector<std::uint64_t> tile_buckets;    ///< Per-tile histograms (num_bins x num_tiles), scanned in-place into per-tile offsets
  std::vector<std::uint64_t> global_offsets;  ///< Global scatter offset of each sorting bin
  std::vector<T> sorted_buffer;               ///< Scratch buffer holding the output of each sorting step
};
/// @brief Radix sort with TBB parallelization.
/// @note This function outperforms tbb::parallel_sort only in case with many elements and threads.
///       For usual data size and number of threads, use tbb::parallel_sort.
/// @tparam T Data type
/// @tparam KeyFunc Key function type
/// @tparam bits Number of key bits consumed per sorting step
/// @tparam tile_size Number of elements per tile
/// @param first_ [in/out] First iterator
/// @param last_ [in/out] Last iterator
/// @param key_ Key function (T => unsigned integral key)
/// @param buffers Temporal buffers
template <typename T, typename KeyFunc, int bits = 8, int tile_size = 256>
void radix_sort_tbb(T* first_, T* last_, const KeyFunc& key_, RadixSortBuffers<T>& buffers) {
  if (first_ == last_) {
    return;
  }

  auto first = first_;
  auto last = last_;

  using Key = decltype(key_(*first));
  static_assert(std::is_unsigned_v<Key>, "Key must be unsigned integral type");

  // Number of total radix sort steps (enough steps to cover every bit of Key).
  constexpr int num_steps = (sizeof(Key) * 8 + bits - 1) / bits;
  constexpr int num_bins = 1 << bits;

  const std::uint64_t N = std::distance(first, last);
  const std::uint64_t num_tiles = (N + tile_size - 1) / tile_size;

  // Allocate buffers.
  auto& tile_buckets = buffers.tile_buckets;
  auto& global_offsets = buffers.global_offsets;
  auto& sorted_buffer = buffers.sorted_buffer;
  tile_buckets.resize(num_bins * num_tiles);
  global_offsets.resize(num_bins);
  sorted_buffer.resize(N);
  auto sorted = sorted_buffer.data();

  // Radix sort.
  for (int step = 0; step < num_steps; step++) {
    // Extract the digit of the current step. The mask is built from a 64-bit one
    // (a plain `1 << bits` is int-typed and would be UB for bits >= 31).
    const auto key = [&](const auto& x) { return (key_(x) >> (step * bits)) & ((std::uint64_t(1) << bits) - 1); };

    // Create per-tile histograms.
    std::fill(tile_buckets.begin(), tile_buckets.end(), 0);
    tbb::parallel_for(tbb::blocked_range<std::uint64_t>(0, num_tiles, 4), [&](const tbb::blocked_range<std::uint64_t>& r) {
      for (std::uint64_t tile = r.begin(); tile < r.end(); tile++) {
        const std::uint64_t data_begin = tile * tile_size;
        const std::uint64_t data_end = std::min<std::uint64_t>((tile + 1) * tile_size, N);
        // 64-bit index: an int index would truncate and mix signedness for very large N.
        for (std::uint64_t i = data_begin; i < data_end; ++i) {
          auto buckets = tile_buckets.data() + key(*(first + i)) * num_tiles;
          ++buckets[tile];
        }
      }
    });

    // Store the number of elements of the last tile, which will be overwritten by the next step, in global_offsets.
    std::fill(global_offsets.begin(), global_offsets.end(), 0);
    for (int i = 1; i < num_bins; i++) {
      global_offsets[i] = tile_buckets[i * num_tiles - 1];
    }

    // Calculate per-tile offsets (sequential exclusive scan over each bin's tiles).
    tbb::parallel_for(tbb::blocked_range<std::uint64_t>(0, num_bins, 1), [&](const tbb::blocked_range<std::uint64_t>& r) {
      for (std::uint64_t bin = r.begin(); bin < r.end(); bin++) {
        auto buckets = tile_buckets.data() + bin * num_tiles;
        std::uint64_t last = buckets[0];
        buckets[0] = 0;
        for (std::uint64_t tile = 1; tile < num_tiles; tile++) {
          std::uint64_t tmp = buckets[tile];
          buckets[tile] = buckets[tile - 1] + last;
          last = tmp;
        }
      }
    });

    // Calculate global offsets for each sorting bin (prefix sum of bin totals).
    for (int i = 1; i < num_bins; i++) {
      global_offsets[i] += global_offsets[i - 1] + tile_buckets[i * num_tiles - 1];
    }

    // Scatter elements into their sorted positions for this digit.
    tbb::parallel_for(tbb::blocked_range<std::uint64_t>(0, num_tiles, 8), [&](const tbb::blocked_range<std::uint64_t>& r) {
      for (std::uint64_t tile = r.begin(); tile < r.end(); ++tile) {
        const std::uint64_t data_begin = tile * tile_size;
        const std::uint64_t data_end = std::min<std::uint64_t>((tile + 1) * tile_size, N);
        for (std::uint64_t i = data_begin; i < data_end; ++i) {
          const T x = *(first + i);
          const int bin = key(x);
          auto offset = tile_buckets.data() + bin * num_tiles + tile;
          sorted[global_offsets[bin] + ((*offset)++)] = x;
        }
      }
    });

    // Swap input and output buffers.
    std::swap(first, sorted);
  }

  // Copy the result to the original buffer if it ended up in the scratch buffer
  // (i.e., after an odd number of buffer swaps).
  if (num_steps % 2 == 1) {
    std::copy(sorted_buffer.begin(), sorted_buffer.end(), first_);
  }
}
/// @brief Radix sort with TBB parallelization (temporal buffers are allocated internally).
/// @tparam T Data type
/// @tparam KeyFunc Key function type
/// @tparam bits Number of key bits consumed per sorting step
/// @tparam tile_size Number of elements per tile
/// @param first_ [in/out] First iterator
/// @param last_ [in/out] Last iterator
/// @param key_ Key function (T => unsigned integral key)
template <typename T, typename KeyFunc, int bits = 4, int tile_size = 256>
void radix_sort_tbb(T* first_, T* last_, const KeyFunc& key_) {
  RadixSortBuffers<T> buffers;
  // Forward bits/tile_size explicitly; without this the buffered overload's own
  // defaults would be deduced and the caller-specified template parameters of this
  // overload would be silently ignored.
  radix_sort_tbb<T, KeyFunc, bits, tile_size>(first_, last_, key_, buffers);
}
} // namespace small_gicp

41
mkdocs.yml Normal file
View File

@ -0,0 +1,41 @@
site_name: "small_gicp"
site_author: k.koide
repo_url: https://github.com/koide3/small_gicp
site_url: https://github.com/koide3/small_gicp
site_description: "small_gicp: Efficient and parallel algorithms for point cloud registration"
theme:
name: material
palette:
primary: indigo
font:
text: Roboto
use_directory_urls: false
markdown_extensions:
- meta
- attr_list
- admonition
- pymdownx.highlight:
anchor_linenums: true
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences
copyright: Copyright &copy; 2024 Kenji Koide
extra:
social:
- icon: material/home
link: https://staff.aist.go.jp/k.koide/
- icon: fontawesome/brands/github
link: https://github.com/koide3/small_gicp
- icon: fontawesome/brands/twitter
link: https://twitter.com/k_koide3
extra_css:
- "css/custom.css"
- "https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css"
nav:
- 'index.md'

View File

@ -2,7 +2,7 @@
<?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
<package format="3">
<name>small_gicp</name>
<version>0.1.0</version>
<version>1.0.0</version>
<description>Efficient and parallelized algorithms for point cloud registration</description>
<maintainer email="k.koide@aist.go.jp">Kenji Koide</maintainer>

View File

@ -4,7 +4,7 @@ build-backend = "scikit_build_core.build"
[project]
name = "small_gicp"
version = "0.0.6"
version = "1.0.0"
authors = [{name = "Kenji Koide", email = "k.koide@aist.go.jp"}]
description = "Efficient and parallelized algorithms for fine point cloud registration"
readme = "README.md"

View File

@ -0,0 +1,66 @@
#include <small_gicp/benchmark/benchmark_odom.hpp>
#include <small_gicp/factors/gicp_factor.hpp>
#include <small_gicp/points/point_cloud.hpp>
#include <small_gicp/ann/gaussian_voxelmap.hpp>
#include <small_gicp/util/normal_estimation_omp.hpp>
#include <small_gicp/registration/reduction_omp.hpp>
#include <small_gicp/registration/registration.hpp>
namespace small_gicp {
/// @brief Online odometry estimation that aligns each incoming scan against an
///        incremental voxel map accumulated from past scans (GICP factor,
///        OpenMP-based parallel reduction).
class SmallGICPModelOnlineOdometryEstimationOMP : public OnlineOdometryEstimation {
public:
  /// @brief Constructor. The world-to-lidar pose starts at identity; the voxel map
  ///        is created lazily on the first scan.
  explicit SmallGICPModelOnlineOdometryEstimationOMP(const OdometryEstimationParams& params) : OnlineOdometryEstimation(params), T_world_lidar(Eigen::Isometry3d::Identity()) {}

  /// @brief Estimate the sensor pose of a new scan and update the target voxel map.
  /// @param points Input scan (already downsampled by the caller).
  /// @return Estimated world-to-lidar transformation after processing this scan.
  Eigen::Isometry3d estimate(const PointCloud::Ptr& points) override {
    // The stopwatch brackets the whole per-scan processing for the timing stats.
    Stopwatch sw;
    sw.start();

    // Note that input points here is already downsampled
    // Estimate per-point covariances
    estimate_covariances_omp(*points, params.num_neighbors, params.num_threads);

    if (voxelmap == nullptr) {
      // This is the very first frame: initialize the map and return identity.
      voxelmap = std::make_shared<IncrementalVoxelMap<FlatContainerCov>>(params.voxel_resolution);
      voxelmap->insert(*points);
      return T_world_lidar;
    }

    // Registration using GICP + OMP-based parallel reduction
    Registration<GICPFactor, ParallelReductionOMP> registration;
    registration.reduction.num_threads = params.num_threads;

    // Align the scan to the map, warm-started from the previous pose.
    auto result = registration.align(*voxelmap, *points, *voxelmap, T_world_lidar);

    // Update T_world_lidar with the estimated transformation
    T_world_lidar = result.T_target_source;

    // Insert points to the target voxel map (transformed into the world frame).
    voxelmap->insert(*points, T_world_lidar);

    sw.stop();
    reg_times.push(sw.msec());

    if (params.visualize) {
      update_model_points(T_world_lidar, traits::voxel_points(*voxelmap));
    }

    return T_world_lidar;
  }

  /// @brief Print per-scan registration time and total throughput statistics.
  void report() override {  //
    std::cout << "registration_time_stats=" << reg_times.str() << " [msec/scan] total_throughput=" << total_times.str() << " [msec/scan]" << std::endl;
  }

private:
  Summarizer reg_times;                                 // Per-scan registration time statistics
  IncrementalVoxelMap<FlatContainerCov>::Ptr voxelmap;  // Target voxel map that is an accumulation of past point clouds
  Eigen::Isometry3d T_world_lidar;                      // Current world-to-lidar transformation
};
// Register this estimator under the name "small_gicp_model_omp" so it can be
// selected by name in the benchmark framework.
static auto small_gicp_model_omp_registry =
  register_odometry("small_gicp_model_omp", [](const OdometryEstimationParams& params) { return std::make_shared<SmallGICPModelOnlineOdometryEstimationOMP>(params); });
} // namespace small_gicp

View File

@ -0,0 +1,66 @@
#include <small_gicp/benchmark/benchmark_odom.hpp>
#include <small_gicp/factors/gicp_factor.hpp>
#include <small_gicp/points/point_cloud.hpp>
#include <small_gicp/ann/gaussian_voxelmap.hpp>
#include <small_gicp/util/normal_estimation_omp.hpp>
#include <small_gicp/registration/reduction_omp.hpp>
#include <small_gicp/registration/registration.hpp>
namespace small_gicp {
/// @brief Online odometry estimation that aligns each incoming scan against a
///        Gaussian voxel map accumulated from past scans (GICP factor,
///        OpenMP-based parallel reduction).
class SmallVGICPModelOnlineOdometryEstimationOMP : public OnlineOdometryEstimation {
public:
  /// @brief Constructor. The world-to-lidar pose starts at identity; the voxel map
  ///        is created lazily on the first scan.
  explicit SmallVGICPModelOnlineOdometryEstimationOMP(const OdometryEstimationParams& params) : OnlineOdometryEstimation(params), T_world_lidar(Eigen::Isometry3d::Identity()) {}

  /// @brief Estimate the sensor pose of a new scan and update the target voxel map.
  /// @param points Input scan (already downsampled by the caller).
  /// @return Estimated world-to-lidar transformation after processing this scan.
  Eigen::Isometry3d estimate(const PointCloud::Ptr& points) override {
    // The stopwatch brackets the whole per-scan processing for the timing stats.
    Stopwatch sw;
    sw.start();

    // Note that input points here is already downsampled
    // Estimate per-point covariances
    estimate_covariances_omp(*points, params.num_neighbors, params.num_threads);

    if (voxelmap == nullptr) {
      // This is the very first frame: initialize the map and return identity.
      voxelmap = std::make_shared<GaussianVoxelMap>(params.voxel_resolution);
      voxelmap->insert(*points);
      return T_world_lidar;
    }

    // Registration using GICP + OMP-based parallel reduction
    Registration<GICPFactor, ParallelReductionOMP> registration;
    registration.reduction.num_threads = params.num_threads;

    // Align the scan to the map, warm-started from the previous pose.
    auto result = registration.align(*voxelmap, *points, *voxelmap, T_world_lidar);

    // Update T_world_lidar with the estimated transformation
    T_world_lidar = result.T_target_source;

    // Insert points to the target voxel map (transformed into the world frame).
    voxelmap->insert(*points, T_world_lidar);

    sw.stop();
    reg_times.push(sw.msec());

    if (params.visualize) {
      update_model_points(T_world_lidar, traits::voxel_points(*voxelmap));
    }

    return T_world_lidar;
  }

  /// @brief Print per-scan registration time and total throughput statistics.
  void report() override {  //
    std::cout << "registration_time_stats=" << reg_times.str() << " [msec/scan] total_throughput=" << total_times.str() << " [msec/scan]" << std::endl;
  }

private:
  Summarizer reg_times;             // Per-scan registration time statistics
  GaussianVoxelMap::Ptr voxelmap;   // Target voxel map that is an accumulation of past point clouds
  Eigen::Isometry3d T_world_lidar;  // Current world-to-lidar transformation
};
// Register this estimator under the name "small_vgicp_model_omp" so it can be
// selected by name in the benchmark framework.
// NOTE: The static variable was previously named small_gicp_model_omp_registry
// (a copy-paste from the non-V GICP file); it is file-local, so renaming it to
// match the registered name is safe and avoids confusion.
static auto small_vgicp_model_omp_registry =
  register_odometry("small_vgicp_model_omp", [](const OdometryEstimationParams& params) { return std::make_shared<SmallVGICPModelOnlineOdometryEstimationOMP>(params); });
} // namespace small_gicp

View File

@ -9,6 +9,8 @@ import small_gicp
# Basic registation example with numpy arrays
def example_numpy1(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.ndarray):
print('*** example_numpy1 ***')
# Example A : Perform registration with numpy arrays
# Arguments
# - target_points : Nx4 or Nx3 numpy array of the target point cloud
@ -22,10 +24,15 @@ def example_numpy1(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.nd
# - num_threads : Number of threads
result = small_gicp.align(target_raw_numpy, source_raw_numpy, downsampling_resolution=0.25)
print('--- registration result ---')
print(result)
return result.T_target_source
# Example to perform preprocessing and registration separately
def example_numpy2(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.ndarray):
print('*** example_numpy2 ***')
# Example B : Perform preprocessing and registration separately
# Preprocess point clouds
@ -38,6 +45,9 @@ def example_numpy2(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.nd
target, target_tree = small_gicp.preprocess_points(target_raw_numpy, downsampling_resolution=0.25)
source, source_tree = small_gicp.preprocess_points(source_raw_numpy, downsampling_resolution=0.25)
print('preprocessed target=', target)
print('preprocessed source=', source)
# Align point clouds
# Arguments
# - target : Target point cloud (small_gicp.PointCloud)
@ -48,12 +58,17 @@ def example_numpy2(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.nd
# - max_correspondence_distance : Maximum correspondence distance
# - num_threads : Number of threads
result = small_gicp.align(target, source, target_tree)
print('--- registration result ---')
print(result)
return result.T_target_source
# Basic registation example with small_gicp.PointCloud
def example_small1(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.ndarray):
print('*** example_small1 ***')
# Convert numpy arrays (Nx3 or Nx4) to small_gicp.PointCloud
target_raw = small_gicp.PointCloud(target_raw_numpy)
source_raw = small_gicp.PointCloud(source_raw_numpy)
@ -61,13 +76,21 @@ def example_small1(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.nd
# Preprocess point clouds
target, target_tree = small_gicp.preprocess_points(target_raw, downsampling_resolution=0.25)
source, source_tree = small_gicp.preprocess_points(source_raw, downsampling_resolution=0.25)
print('preprocessed target=', target)
print('preprocessed source=', source)
result = small_gicp.align(target, source, target_tree)
print('--- registration result ---')
print(result)
return result.T_target_source
# Example to perform each preprocessing and registration separately
def example_small2(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.ndarray):
print('*** example_small2 ***')
# Convert numpy arrays (Nx3 or Nx4) to small_gicp.PointCloud
target_raw = small_gicp.PointCloud(target_raw_numpy)
source_raw = small_gicp.PointCloud(source_raw_numpy)
@ -79,13 +102,19 @@ def example_small2(target_raw_numpy : numpy.ndarray, source_raw_numpy : numpy.nd
# KdTree construction
target_tree = small_gicp.KdTree(target)
source_tree = small_gicp.KdTree(source)
# Estimate covariances
small_gicp.estimate_covariances(target, target_tree)
small_gicp.estimate_covariances(source, source_tree)
print('preprocessed target=', target)
print('preprocessed source=', source)
# Align point clouds
result = small_gicp.align(target, source, target_tree)
print('--- registration result ---')
print(result)
return result.T_target_source

View File

@ -31,7 +31,11 @@ void define_align(py::module& m) {
double voxel_resolution,
double downsampling_resolution,
double max_correspondence_distance,
int num_threads) {
int num_threads,
int max_iterations,
double rotation_epsilon,
double translation_epsilon,
bool verbose) {
if (target_points.cols() != 3 && target_points.cols() != 4) {
std::cerr << "target_points must be Nx3 or Nx4" << std::endl;
return RegistrationResult(Eigen::Isometry3d::Identity());
@ -59,6 +63,10 @@ void define_align(py::module& m) {
setting.downsampling_resolution = downsampling_resolution;
setting.max_correspondence_distance = max_correspondence_distance;
setting.num_threads = num_threads;
setting.max_iterations = max_iterations;
setting.rotation_eps = rotation_epsilon;
setting.translation_eps = translation_epsilon;
setting.verbose = verbose;
std::vector<Eigen::Vector4d> target(target_points.rows());
if (target_points.cols() == 3) {
@ -92,31 +100,47 @@ void define_align(py::module& m) {
py::arg("downsampling_resolution") = 0.25,
py::arg("max_correspondence_distance") = 1.0,
py::arg("num_threads") = 1,
py::arg("max_iterations") = 20,
py::arg("rotation_epsilon") = 0.1 * M_PI / 180.0,
py::arg("translation_epsilon") = 1e-3,
py::arg("verbose") = false,
R"pbdoc(
Align two point clouds using various ICP-like algorithms.
This function first performs preprocessing (downsampling, normal estimation, KdTree construction) and then estimates the transformation.
See also: :class:`voxelgrid_sampling` :class:`estimate_normals` :class:`preprocess_points`
Parameters
----------
target_points : NDArray[np.float64]
target_points : :class:`numpy.ndarray[np.float64]`
Nx3 or Nx4 matrix representing the target point cloud.
source_points : NDArray[np.float64]
source_points : numpy.ndarray[np.float64]
Nx3 or Nx4 matrix representing the source point cloud.
init_T_target_source : np.ndarray[np.float64]
init_T_target_source : numpy.ndarray[np.float64]
4x4 matrix representing the initial transformation from target to source.
registration_type : str = 'GICP'
Type of registration algorithm to use ('ICP', 'PLANE_ICP', 'GICP', 'VGICP').
voxel_resolution : float = 1.0
Resolution of voxels used for downsampling.
Resolution of voxels used for correspondence search (used only in VGICP).
downsampling_resolution : float = 0.25
Resolution for downsampling the point clouds.
Input points out of the 21bit range after discretization will be ignored (See also: :class:`voxelgrid_sampling`).
max_correspondence_distance : float = 1.0
Maximum distance for matching points between point clouds.
num_threads : int = 1
Number of threads to use for parallel processing.
max_iterations : int = 20
Maximum number of iterations for the optimization algorithm.
rotation_epsilon : float = 0.1 * M_PI / 180.0
Convergence criterion for rotation change.
translation_epsilon : float = 1e-3
Convergence criterion for translation change.
verbose : bool = False
If True, print debug information during the optimization process.
Returns
-------
RegistrationResult
result : :class:`RegistrationResult`
Object containing the final transformation matrix and convergence status.
)pbdoc");
@ -130,7 +154,11 @@ void define_align(py::module& m) {
const Eigen::Matrix4d& init_T_target_source,
const std::string& registration_type,
double max_correspondence_distance,
int num_threads) {
int num_threads,
int max_iterations,
double rotation_epsilon,
double translation_epsilon,
bool verbose) {
RegistrationSetting setting;
if (registration_type == "ICP") {
setting.type = RegistrationSetting::ICP;
@ -144,6 +172,10 @@ void define_align(py::module& m) {
}
setting.max_correspondence_distance = max_correspondence_distance;
setting.num_threads = num_threads;
setting.max_iterations = max_iterations;
setting.rotation_eps = rotation_epsilon;
setting.translation_eps = translation_epsilon;
setting.verbose = verbose;
if (target_tree == nullptr) {
target_tree = std::make_shared<KdTree<PointCloud>>(target, KdTreeBuilderOMP(num_threads));
@ -157,18 +189,24 @@ void define_align(py::module& m) {
py::arg("registration_type") = "GICP",
py::arg("max_correspondence_distance") = 1.0,
py::arg("num_threads") = 1,
py::arg("max_iterations") = 20,
py::arg("rotation_epsilon") = 0.1 * M_PI / 180.0,
py::arg("translation_epsilon") = 1e-3,
py::arg("verbose") = false,
R"pbdoc(
Align two point clouds using specified ICP-like algorithms, utilizing point cloud and KD-tree inputs.
Input point clouds are assumed to be preprocessed (downsampled, normals estimated, KD-tree built).
See also: :class:`voxelgrid_sampling` :class:`estimate_normals` :class:`preprocess_points`
Parameters
----------
target : PointCloud::ConstPtr
Pointer to the target point cloud.
source : PointCloud::ConstPtr
Pointer to the source point cloud.
target_tree : KdTree<PointCloud>::ConstPtr, optional
Pointer to the KD-tree of the target for nearest neighbor search. If nullptr, a new tree is built.
init_T_target_source : NDArray[np.float64]
target : :class:`PointCloud`
Target point cloud.
source : :class:`PointCloud`
Source point cloud.
target_tree : :class:`KdTree`, optional
KdTree for the target point cloud. If not given, a new KdTree is built.
init_T_target_source : numpy.ndarray[np.float64]
4x4 matrix representing the initial transformation from target to source.
registration_type : str = 'GICP'
Type of registration algorithm to use ('ICP', 'PLANE_ICP', 'GICP').
@ -176,10 +214,18 @@ void define_align(py::module& m) {
Maximum distance for corresponding point pairs.
num_threads : int = 1
Number of threads to use for computation.
max_iterations : int = 20
Maximum number of iterations for the optimization algorithm.
rotation_epsilon : float = 0.1 * M_PI / 180.0
Convergence criterion for rotation change.
translation_epsilon : float = 1e-3
Convergence criterion for translation change.
verbose : bool = False
If True, print debug information during the optimization process.
Returns
-------
RegistrationResult
result : :class:`RegistrationResult`
Object containing the final transformation matrix and convergence status.
)pbdoc");
@ -192,11 +238,17 @@ void define_align(py::module& m) {
const Eigen::Matrix4d& init_T_target_source,
double max_correspondence_distance,
int num_threads,
int max_iterations) {
int max_iterations,
double rotation_epsilon,
double translation_epsilon,
bool verbose) {
Registration<GICPFactor, ParallelReductionOMP> registration;
registration.rejector.max_dist_sq = max_correspondence_distance * max_correspondence_distance;
registration.reduction.num_threads = num_threads;
registration.optimizer.max_iterations = max_iterations;
registration.criteria.rotation_eps = rotation_epsilon;
registration.criteria.translation_eps = translation_epsilon;
registration.optimizer.verbose = verbose;
return registration.align(target_voxelmap, source, target_voxelmap, Eigen::Isometry3d(init_T_target_source));
},
@ -206,16 +258,21 @@ void define_align(py::module& m) {
py::arg("max_correspondence_distance") = 1.0,
py::arg("num_threads") = 1,
py::arg("max_iterations") = 20,
py::arg("rotation_epsilon") = 0.1 * M_PI / 180.0,
py::arg("translation_epsilon") = 1e-3,
py::arg("verbose") = false,
R"pbdoc(
Align two point clouds using voxel-based GICP algorithm, utilizing a Gaussian Voxel Map.
Input source point cloud is assumed to be preprocessed (downsampled, normals estimated, KD-tree built).
See also: :class:`voxelgrid_sampling` :class:`estimate_normals` :class:`preprocess_points`
Parameters
----------
target_voxelmap : GaussianVoxelMap
target_voxelmap : :class:`GaussianVoxelMap`
Voxel map constructed from the target point cloud.
source : PointCloud
source : :class:`PointCloud`
Source point cloud to align to the target.
init_T_target_source : NDArray[np.float64]
init_T_target_source : numpy.ndarray[np.float64]
4x4 matrix representing the initial transformation from target to source.
max_correspondence_distance : float = 1.0
Maximum distance for corresponding point pairs.
@ -223,10 +280,16 @@ void define_align(py::module& m) {
Number of threads to use for computation.
max_iterations : int = 20
Maximum number of iterations for the optimization algorithm.
rotation_epsilon : float = 0.1 * M_PI / 180.0
Convergence criterion for rotation change.
translation_epsilon : float = 1e-3
Convergence criterion for translation change.
verbose : bool = False
If True, print debug information during the optimization process.
Returns
-------
RegistrationResult
result : :class:`RegistrationResult`
Object containing the final transformation matrix and convergence status.
)pbdoc");
}

View File

@ -20,12 +20,34 @@ using namespace small_gicp;
void define_factors(py::module& m) {
// DistanceRejector
py::class_<DistanceRejector>(m, "DistanceRejector") //
py::class_<DistanceRejector>(
m,
"DistanceRejector",
R"pbdoc(
Correspondence rejection based on the distance between points.
)pbdoc")
.def(py::init<>())
.def("set_max_distance", [](DistanceRejector& rejector, double dist) { rejector.max_dist_sq = dist * dist; });
.def(
"set_max_distance",
[](DistanceRejector& rejector, double dist) { rejector.max_dist_sq = dist * dist; },
py::arg("dist"),
R"pbdoc(
Set maximum correspondence distance.
Parameters
----------
dist : float
Maximum correspondence distance.
)pbdoc");
// ICPFactor
py::class_<ICPFactor>(m, "ICPFactor") //
py::class_<ICPFactor>(m, "ICPFactor", R"pbdoc(
ICP per-point factor based on the conventional point-to-point distance.
References
----------
Zhang, "Iterative Point Matching for Registration of Free-Form Curves and Surfaces", IJCV1994
)pbdoc")
.def(py::init<>())
.def(
"linearize",
@ -42,10 +64,51 @@ void define_factors(py::module& m) {
double e = 0.0;
bool succ = factor.linearize(target, source, kdtree, Eigen::Isometry3d(T), source_index, rejector, &H, &b, &e);
return std::make_tuple(succ, H, b, e);
});
},
py::arg("target"),
py::arg("source"),
py::arg("kdtree"),
py::arg("T"),
py::arg("source_index"),
py::arg("rejector"),
R"pbdoc(
Linearize the factor for the i-th source point.
Parameters
----------
target : :class:`PointCloud`
Target point cloud.
source : PointCloud
Source point cloud.
kdtree : KdTree
KdTree for the target point cloud.
T : numpy.ndarray
Transformation matrix. (4x4)
source_index : int
Index of the source point.
rejector : DistanceRejector
Correspondence rejector.
Returns
-------
success : bool
Success flag. If false, the correspondence is rejected.
H : numpy.ndarray
Hessian matrix (6x6).
b : numpy.ndarray
Gradient vector (6,).
e : float
Error.
)pbdoc");
// PointToPlaneICPFactor
py::class_<PointToPlaneICPFactor>(m, "PointToPlaneICPFactor") //
py::class_<PointToPlaneICPFactor>(m, "PointToPlaneICPFactor", R"pbdoc(
Point-to-plane ICP per-point factor based on the point-to-plane distance.
References
----------
Zhang, "Iterative Point Matching for Registration of Free-Form Curves and Surfaces", IJCV1994
)pbdoc")
.def(py::init<>())
.def(
"linearize",
@ -62,10 +125,51 @@ void define_factors(py::module& m) {
double e = 0.0;
bool succ = factor.linearize(target, source, kdtree, Eigen::Isometry3d(T), source_index, rejector, &H, &b, &e);
return std::make_tuple(succ, H, b, e);
});
},
py::arg("target"),
py::arg("source"),
py::arg("kdtree"),
py::arg("T"),
py::arg("source_index"),
py::arg("rejector"),
R"pbdoc(
Linearize the factor for the i-th source point.
Parameters
----------
target : PointCloud
Target point cloud. Must have normals.
source : PointCloud
Source point cloud.
kdtree : KdTree
KdTree for the target point cloud.
T : numpy.ndarray
Transformation matrix. (4x4)
source_index : int
Index of the source point.
rejector : DistanceRejector
Correspondence rejector.
Returns
-------
success : bool
Success flag. If false, the correspondence is rejected.
H : numpy.ndarray
Hessian matrix (6x6).
b : numpy.ndarray
Gradient vector (6,).
e : float
Error.
)pbdoc");
// GICPFactor
py::class_<GICPFactor>(m, "GICPFactor") //
py::class_<GICPFactor>(m, "GICPFactor", R"pbdoc(
Generalized ICP per-point factor based on distribution-to-distribution distance.
References
----------
Segal et al., "Generalized-ICP", RSS2009
)pbdoc") //
.def(py::init<>())
.def(
"linearize",
@ -82,5 +186,40 @@ void define_factors(py::module& m) {
double e = 0.0;
bool succ = factor.linearize(target, source, kdtree, Eigen::Isometry3d(T), source_index, rejector, &H, &b, &e);
return std::make_tuple(succ, H, b, e);
});
},
py::arg("target"),
py::arg("source"),
py::arg("kdtree"),
py::arg("T"),
py::arg("source_index"),
py::arg("rejector"),
R"pbdoc(
Linearize the factor for the i-th source point.
Parameters
----------
target : PointCloud
Target point cloud. Must have covariances.
source : PointCloud
Source point cloud. Must have covariances.
kdtree : KdTree
KdTree for the target point cloud.
T : numpy.ndarray
Transformation matrix. (4x4)
source_index : int
Index of the source point.
rejector : DistanceRejector
Correspondence rejector.
Returns
-------
success : bool
Success flag. If false, the correspondence is rejected.
H : numpy.ndarray
Hessian matrix (6x6).
b : numpy.ndarray
Gradient vector (6,).
e : float
Error.
)pbdoc");
}

View File

@ -17,10 +17,53 @@ using namespace small_gicp;
void define_kdtree(py::module& m) {
// KdTree
py::class_<KdTree<PointCloud>, std::shared_ptr<KdTree<PointCloud>>>(m, "KdTree") //
.def(
py::init([](const Eigen::MatrixXd& points_numpy, int num_threads) {
if (points_numpy.cols() != 3 && points_numpy.cols() != 4) {
throw std::invalid_argument("points must have shape (n, 3) or (n, 4)");
}
auto points = std::make_shared<PointCloud>();
points->resize(points_numpy.rows());
for (int i = 0; i < points_numpy.rows(); i++) {
points->point(i) << points_numpy(i, 0), points_numpy(i, 1), points_numpy(i, 2), 1.0;
}
return std::make_shared<KdTree<PointCloud>>(points, KdTreeBuilderOMP(num_threads));
}),
py::arg("points"),
py::arg("num_threads") = 1,
R"""(
KdTree(points: numpy.ndarray, num_threads: int = 1)
Construct a KdTree from a numpy array.
Parameters
----------
points : :class:`numpy.ndarray`, shape (n, 3) or (n, 4)
The input point cloud.
num_threads : int, optional
The number of threads to use for KdTree construction. Default is 1.
)""")
.def(
py::init([](const PointCloud::ConstPtr& points, int num_threads) { return std::make_shared<KdTree<PointCloud>>(points, KdTreeBuilderOMP(num_threads)); }),
py::arg("points"),
py::arg("num_threads") = 1)
py::arg("num_threads") = 1,
R"""(
KdTree(points: PointCloud, num_threads: int = 1)
Construct a KdTree from a small_gicp.PointCloud.
Parameters
----------
points : :class:`PointCloud`
The input point cloud.
num_threads : int, optional
The number of threads to use for KdTree construction. Default is 1.
)""")
.def(
"nearest_neighbor_search",
[](const KdTree<PointCloud>& kdtree, const Eigen::Vector3d& pt) {
@ -28,11 +71,143 @@ void define_kdtree(py::module& m) {
double k_sq_dist = std::numeric_limits<double>::max();
const size_t found = traits::nearest_neighbor_search(kdtree, Eigen::Vector4d(pt.x(), pt.y(), pt.z(), 1.0), &k_index, &k_sq_dist);
return std::make_tuple(found, k_index, k_sq_dist);
})
.def("knn_search", [](const KdTree<PointCloud>& kdtree, const Eigen::Vector3d& pt, int k) {
std::vector<size_t> k_indices(k, -1);
std::vector<double> k_sq_dists(k, std::numeric_limits<double>::max());
const size_t found = traits::knn_search(kdtree, Eigen::Vector4d(pt.x(), pt.y(), pt.z(), 1.0), k, k_indices.data(), k_sq_dists.data());
return std::make_pair(k_indices, k_sq_dists);
});
}
},
py::arg("pt"),
R"""(
Find the nearest neighbor to a given point.
Parameters
----------
pt : numpy.ndarray, shape (3,)
The input point.
Returns
-------
found : int
Whether a neighbor was found (1 if found, 0 if not).
k_index : int
The index of the nearest neighbor in the point cloud. If a neighbor was not found, the index is -1.
k_sq_dist : float
The squared distance to the nearest neighbor.
)""")
.def(
"knn_search",
[](const KdTree<PointCloud>& kdtree, const Eigen::Vector3d& pt, int k) {
std::vector<size_t> k_indices(k, -1);
std::vector<double> k_sq_dists(k, std::numeric_limits<double>::max());
const size_t found = traits::knn_search(kdtree, Eigen::Vector4d(pt.x(), pt.y(), pt.z(), 1.0), k, k_indices.data(), k_sq_dists.data());
return std::make_pair(k_indices, k_sq_dists);
},
py::arg("pt"),
py::arg("k"),
R"""(
Find the k nearest neighbors to a given point.
Parameters
----------
pt : numpy.ndarray, shape (3,)
The input point.
k : int
The number of nearest neighbors to search for.
Returns
-------
k_indices : numpy.ndarray, shape (k,)
The indices of the k nearest neighbors in the point cloud. If a neighbor was not found, the index is -1.
k_sq_dists : numpy.ndarray, shape (k,)
The squared distances to the k nearest neighbors (Sorted in ascending order).
)""")
.def(
"batch_nearest_neighbor_search",
[](const KdTree<PointCloud>& kdtree, const Eigen::MatrixXd& pts, int num_threads) {
if (pts.cols() != 3 && pts.cols() != 4) {
throw std::invalid_argument("pts must have shape (n, 3) or (n, 4)");
}
std::vector<size_t> k_indices(pts.rows(), -1);
std::vector<double> k_sq_dists(pts.rows(), std::numeric_limits<double>::max());
#pragma omp parallel for num_threads(num_threads) schedule(guided, 4)
for (int i = 0; i < pts.rows(); ++i) {
const size_t found = traits::nearest_neighbor_search(kdtree, Eigen::Vector4d(pts(i, 0), pts(i, 1), pts(i, 2), 1.0), &k_indices[i], &k_sq_dists[i]);
if (!found) {
k_indices[i] = -1;
k_sq_dists[i] = std::numeric_limits<double>::max();
}
}
return std::make_pair(k_indices, k_sq_dists);
},
py::arg("pts"),
py::arg("num_threads") = 1,
R"""(
Find the nearest neighbors for a batch of points.
Parameters
----------
pts : numpy.ndarray, shape (n, 3) or (n, 4)
The input points.
num_threads : int, optional
The number of threads to use for the search. Default is 1.
Returns
-------
k_indices : numpy.ndarray, shape (n,)
The indices of the nearest neighbors for each input point. If a neighbor was not found, the index is -1.
k_sq_dists : numpy.ndarray, shape (n,)
The squared distances to the nearest neighbors for each input point.
)""")
.def(
"batch_knn_search",
[](const KdTree<PointCloud>& kdtree, const Eigen::MatrixXd& pts, int k, int num_threads) {
if (pts.cols() != 3 && pts.cols() != 4) {
throw std::invalid_argument("pts must have shape (n, 3) or (n, 4)");
}
Eigen::Matrix<size_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> k_indices(pts.rows(), k);
Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> k_sq_dists(pts.rows(), k);
k_indices.setConstant(-1);
k_sq_dists.setConstant(std::numeric_limits<double>::max());
#pragma omp parallel for num_threads(num_threads) schedule(guided, 4)
for (int i = 0; i < pts.rows(); ++i) {
size_t* k_indices_begin = k_indices.data() + i * k;
double* k_sq_dists_begin = k_sq_dists.data() + i * k;
const size_t found = traits::knn_search(kdtree, Eigen::Vector4d(pts(i, 0), pts(i, 1), pts(i, 2), 1.0), k, k_indices_begin, k_sq_dists_begin);
if (found < k) {
for (size_t j = found; j < k; ++j) {
k_indices_begin[j] = -1;
k_sq_dists_begin[j] = std::numeric_limits<double>::max();
}
}
}
return std::make_pair(k_indices, k_sq_dists);
},
py::arg("pts"),
py::arg("k"),
py::arg("num_threads") = 1,
R"""(
Find the k nearest neighbors for a batch of points.
Parameters
----------
pts : numpy.ndarray, shape (n, 3) or (n, 4)
The input points.
k : int
The number of nearest neighbors to search for.
num_threads : int, optional
The number of threads to use for the search. Default is 1.
Returns
-------
k_indices : numpy.ndarray, shape (n, k)
The indices of the k nearest neighbors for each input point. If a neighbor was not found, the index is -1.
k_sq_dists : numpy.ndarray, shape (n, k)
The squared distances to the k nearest neighbors for each input point (sorted in ascending order).
)""");
}

View File

@ -22,6 +22,6 @@ void define_misc(py::module& m) {
const auto points = read_ply(filename);
return std::make_shared<PointCloud>(points);
},
"Read PLY file",
"Read PLY file. This function can only read simple point clouds with XYZ properties for testing purposes. Do not use this for general PLY IO.",
py::arg("filename"));
}

View File

@ -16,38 +16,144 @@ using namespace small_gicp;
void define_pointcloud(py::module& m) {
// PointCloud
py::class_<PointCloud, std::shared_ptr<PointCloud>>(m, "PointCloud") //
.def(py::init([](const Eigen::MatrixXd& points) {
if (points.cols() != 3 && points.cols() != 4) {
std::cerr << "points must be Nx3 or Nx4" << std::endl;
return std::make_shared<PointCloud>();
}
auto pc = std::make_shared<PointCloud>();
pc->resize(points.rows());
if (points.cols() == 3) {
for (size_t i = 0; i < points.rows(); i++) {
pc->point(i) << points.row(i).transpose(), 1.0;
.def(
py::init([](const Eigen::MatrixXd& points) {
if (points.cols() != 3 && points.cols() != 4) {
std::cerr << "points must be Nx3 or Nx4" << std::endl;
return std::make_shared<PointCloud>();
}
} else {
for (size_t i = 0; i < points.rows(); i++) {
pc->point(i) << points.row(i).transpose();
}
}
return pc;
})) //
auto pc = std::make_shared<PointCloud>();
pc->resize(points.rows());
if (points.cols() == 3) {
for (size_t i = 0; i < points.rows(); i++) {
pc->point(i) << points.row(i).transpose(), 1.0;
}
} else {
for (size_t i = 0; i < points.rows(); i++) {
pc->point(i) << points.row(i).transpose();
}
}
return pc;
}),
py::arg("points"),
R"""(
PointCloud(points: numpy.ndarray)
Construct a PointCloud from a numpy array.
Parameters
----------
points : numpy.ndarray, shape (n, 3) or (n, 4)
The input point cloud.
)""")
.def("__repr__", [](const PointCloud& points) { return "small_gicp.PointCloud (" + std::to_string(points.size()) + " points)"; })
.def("__len__", [](const PointCloud& points) { return points.size(); })
.def("empty", &PointCloud::empty)
.def("size", &PointCloud::size)
.def(
"empty",
&PointCloud::empty,
R"pbdoc(
Check if the point cloud is empty
Returns
-------
empty : bool
True if the point cloud is empty.
)pbdoc")
.def(
"size",
&PointCloud::size,
R"pbdoc(
Get the number of points.
Returns
-------
num_points : int
Number of points.
)pbdoc")
.def(
"points",
[](const PointCloud& points) -> Eigen::MatrixXd { return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(points.points[0].data(), points.size(), 4); })
[](const PointCloud& points) -> Eigen::MatrixXd { return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(points.points[0].data(), points.size(), 4); },
R"pbdoc(
Get the points as a Nx4 matrix.
Returns
-------
points : numpy.ndarray
Points.
)pbdoc")
.def(
"normals",
[](const PointCloud& points) -> Eigen::MatrixXd { return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(points.normals[0].data(), points.size(), 4); })
.def("covs", [](const PointCloud& points) { return points.covs; })
.def("point", [](const PointCloud& points, size_t i) -> Eigen::Vector4d { return points.point(i); })
.def("normal", [](const PointCloud& points, size_t i) -> Eigen::Vector4d { return points.normal(i); })
.def("cov", [](const PointCloud& points, size_t i) -> Eigen::Matrix4d { return points.cov(i); });
[](const PointCloud& points) -> Eigen::MatrixXd { return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(points.normals[0].data(), points.size(), 4); },
R"pbdoc(
Get the normals as a Nx4 matrix.
Returns
-------
normals : numpy.ndarray
Normals.
)pbdoc")
.def(
"covs",
[](const PointCloud& points) { return points.covs; },
R"pbdoc(
Get the covariance matrices as a list of 4x4 matrices.
Returns
-------
covs : list of numpy.ndarray
Covariance matrices.
)pbdoc")
.def(
"point",
[](const PointCloud& points, size_t i) -> Eigen::Vector4d { return points.point(i); },
py::arg("i"),
R"pbdoc(
Get the i-th point.
Parameters
----------
i : int
Index of the point.
Returns
-------
point : numpy.ndarray, shape (4,)
Point.
)pbdoc")
.def(
"normal",
[](const PointCloud& points, size_t i) -> Eigen::Vector4d { return points.normal(i); },
py::arg("i"),
R"pbdoc(
Get the i-th normal.
Parameters
----------
i : int
Index of the point.
Returns
-------
normal : numpy.ndarray, shape (4,)
Normal.
)pbdoc")
.def(
"cov",
[](const PointCloud& points, size_t i) -> Eigen::Matrix4d { return points.cov(i); },
py::arg("i"),
R"pbdoc(
Get the i-th covariance matrix.
Parameters
----------
i : int
Index of the point.
Returns
-------
cov : numpy.ndarray, shape (4, 4)
Covariance matrix.
)pbdoc");
}

View File

@ -29,10 +29,35 @@ void define_preprocess(py::module& m) {
}
return voxelgrid_sampling_omp(points, resolution, num_threads);
},
"Voxelgrid sampling",
py::arg("points"),
py::arg("downsampling_resolution"),
py::arg("num_threads") = 1);
py::arg("num_threads") = 1,
R"pbdoc(
Voxelgrid downsampling.
Notes
-----
When multi-threading is enabled, this function has minor run-by-run non-deterministic behavior due to parallel data collection that results
in a deviation of the number of points in the downsampling results (up to 10% increase from the single-thread version).
Discretized voxel coords must be in 21bit range [-1048576, 1048575].
For example, if the downsampling resolution is 0.01 m, point coordinates must be in [-10485.76, 10485.75] m.
Points outside the valid range will be ignored.
Parameters
----------
points : :class:`PointCloud`
Input point cloud.
downsampling_resolution : float
Voxel size.
num_threads : int, optional
Number of threads. (default: 1)
Returns
-------
:class:`PointCloud`
Downsampled point cloud.
)pbdoc");
// voxelgrid_sampling (numpy)
m.def(
@ -49,10 +74,35 @@ void define_preprocess(py::module& m) {
return voxelgrid_sampling_omp<Eigen::MatrixXd, PointCloud>(points, resolution, num_threads);
}
},
"Voxelgrid sampling",
py::arg("points"),
py::arg("downsampling_resolution"),
py::arg("num_threads") = 1);
py::arg("num_threads") = 1,
R"pbdoc(
Voxelgrid downsampling.
Notes
-----
When multi-threading is enabled, this function has minor run-by-run non-deterministic behavior due to parallel data collection that results
in a deviation of the number of points in the downsampling results (up to 10% increase from the single-thread version).
Discretized voxel coords must be in 21bit range [-1048576, 1048575].
For example, if the downsampling resolution is 0.01 m, point coordinates must be in [-10485.76, 10485.75] m.
Points outside the valid range will be ignored.
Parameters
----------
points : numpy.ndarray[np.float64]
Input point cloud. Nx3 or Nx4.
downsampling_resolution : float
Voxel size.
num_threads : int, optional
Number of threads. (default: 1)
Returns
-------
:class:`PointCloud`
Downsampled point cloud.
)pbdoc");
// estimate_normals
m.def(
@ -71,7 +121,22 @@ void define_preprocess(py::module& m) {
py::arg("points"),
py::arg("tree") = nullptr,
py::arg("num_neighbors") = 20,
py::arg("num_threads") = 1);
py::arg("num_threads") = 1,
R"pbdoc(
Estimate point normals.
If a sufficient number of neighbor points (5 points) is not found, a zero vector is set to the point.
Parameters
----------
points : :class:`PointCloud`
Input point cloud. Normals will be estimated in-place. (in/out)
tree : :class:`KdTree`, optional
Nearest neighbor search. If None, create a new KdTree (default: None)
num_neighbors : int, optional
Number of neighbors. (default: 20)
num_threads : int, optional
Number of threads. (default: 1)
)pbdoc");
// estimate_covariances
m.def(
@ -90,7 +155,22 @@ void define_preprocess(py::module& m) {
py::arg("points"),
py::arg("tree") = nullptr,
py::arg("num_neighbors") = 20,
py::arg("num_threads") = 1);
py::arg("num_threads") = 1,
R"pbdoc(
Estimate point covariances.
If a sufficient number of neighbor points (5 points) is not found, an identity matrix is set to the point.
Parameters
----------
points : :class:`PointCloud`
Input point cloud. Covariances will be estimated in-place. (in/out)
tree : :class:`KdTree`, optional
Nearest neighbor search. If None, create a new KdTree (default: None)
num_neighbors : int, optional
Number of neighbors. (default: 20)
num_threads : int, optional
Number of threads. (default: 1)
)pbdoc");
// estimate_normals_covariances
m.def(
@ -109,7 +189,22 @@ void define_preprocess(py::module& m) {
py::arg("points"),
py::arg("tree") = nullptr,
py::arg("num_neighbors") = 20,
py::arg("num_threads") = 1);
py::arg("num_threads") = 1,
R"pbdoc(
Estimate point normals and covariances.
If a sufficient number of neighbor points (5 points) is not found, a zero vector and an identity matrix is set to the point.
Parameters
----------
points : :class:`PointCloud`
Input point cloud. Normals and covariances will be estimated in-place. (in/out)
tree : :class:`KdTree`, optional
Nearest neighbor search. If None, create a new KdTree (default: None)
num_neighbors : int, optional
Number of neighbors. (default: 20)
num_threads : int, optional
Number of threads. (default: 1)
)pbdoc");
// preprocess_points (numpy)
m.def(
@ -138,7 +233,29 @@ void define_preprocess(py::module& m) {
py::arg("points"),
py::arg("downsampling_resolution") = 0.25,
py::arg("num_neighbors") = 10,
py::arg("num_threads") = 1);
py::arg("num_threads") = 1,
R"pbdoc(
Preprocess point cloud (Downsampling and normal/covariance estimation).
See also: :func:`voxelgrid_sampling`, :func:`estimate_normals_covariances`.
Parameters
----------
points : numpy.ndarray[np.float64]
Input point cloud. Nx3 or Nx4.
downsampling_resolution : float, optional
Resolution for downsampling the point clouds. Input points out of 21bit range after discretization are ignored (See also: :func:`voxelgrid_sampling`). (default: 0.25)
num_neighbors : int, optional
Number of neighbors used for attribute estimation. (default: 10)
num_threads : int, optional
Number of threads. (default: 1)
Returns
-------
:class:`PointCloud`
Downsampled point cloud.
:class:`KdTree`
KdTree for the downsampled point cloud.
)pbdoc");
// preprocess_points
m.def(
@ -157,5 +274,27 @@ void define_preprocess(py::module& m) {
py::arg("points"),
py::arg("downsampling_resolution") = 0.25,
py::arg("num_neighbors") = 10,
py::arg("num_threads") = 1);
py::arg("num_threads") = 1,
R"pbdoc(
Preprocess point cloud (Downsampling and normal/covariance estimation).
See also: :func:`voxelgrid_sampling`, :func:`estimate_normals_covariances`.
Parameters
----------
points : :class:`PointCloud`
Input point cloud.
downsampling_resolution : float, optional
Resolution for downsampling the point clouds. Input points out of 21bit range after discretization are ignored (See also: :func:`voxelgrid_sampling`). (default: 0.25)
num_neighbors : int, optional
Number of neighbors used for attribute estimation. (default: 10)
num_threads : int, optional
Number of threads. (default: 1)
Returns
-------
:class:`PointCloud`
Downsampled point cloud.
:class:`KdTree`
KdTree for the downsampled point cloud.
)pbdoc");
}

View File

@ -14,7 +14,7 @@ void define_factors(py::module& m);
void define_misc(py::module& m);
PYBIND11_MODULE(small_gicp, m) {
m.doc() = "Small GICP";
m.doc() = "Efficient and parallel algorithms for point cloud registration";
define_pointcloud(m);
define_kdtree(m);

View File

@ -30,11 +30,14 @@ void define_result(py::module& m) {
sst << "error= " << result.error << "\n";
return sst.str();
})
.def_property_readonly("T_target_source", [](const RegistrationResult& result) -> Eigen::Matrix4d { return result.T_target_source.matrix(); })
.def_readonly("converged", &RegistrationResult::converged)
.def_readonly("iterations", &RegistrationResult::iterations)
.def_readonly("num_inliers", &RegistrationResult::num_inliers)
.def_readonly("H", &RegistrationResult::H)
.def_readonly("b", &RegistrationResult::b)
.def_readonly("error", &RegistrationResult::error);
.def_property_readonly(
"T_target_source",
[](const RegistrationResult& result) -> Eigen::Matrix4d { return result.T_target_source.matrix(); },
"NDArray[np.float64] : Final transformation matrix (4x4). This transformation brings a point in the source cloud frame to the target cloud frame.")
.def_readonly("converged", &RegistrationResult::converged, "bool : Convergence flag")
.def_readonly("iterations", &RegistrationResult::iterations, "int : Number of iterations")
.def_readonly("num_inliers", &RegistrationResult::num_inliers, "int : Number of inliers")
.def_readonly("H", &RegistrationResult::H, "NDArray[np.float64] : Final Hessian matrix (6x6)")
.def_readonly("b", &RegistrationResult::b, "NDArray[np.float64] : Final information vector (6,)")
.def_readonly("error", &RegistrationResult::error, "float : Final error");
}

View File

@ -18,7 +18,23 @@ using namespace small_gicp;
template <typename VoxelMap, bool has_normals, bool has_covs>
auto define_class(py::module& m, const std::string& name) {
py::class_<VoxelMap> vox(m, name.c_str());
vox.def(py::init<double>())
vox
.def(
py::init<double>(),
py::arg("leaf_size"),
R"pbdoc(
Construct an incremental voxel map.
Notes
-----
This class supports incremental point cloud insertion and LRU-based voxel deletion that removes voxels that are not recently referenced.
It can handle arbitrary number of voxels and arbitrary range of voxel coordinates (in 32-bit int range).
Parameters
----------
leaf_size : float
Voxel size.
)pbdoc")
.def(
"__repr__",
[=](const VoxelMap& voxelmap) {
@ -27,12 +43,38 @@ auto define_class(py::module& m, const std::string& name) {
return sst.str();
})
.def("__len__", [](const VoxelMap& voxelmap) { return voxelmap.size(); })
.def("size", &VoxelMap::size)
.def(
"size",
&VoxelMap::size,
R"pbdoc(
Get the number of voxels.
Returns
-------
num_voxels : int
Number of voxels.
)pbdoc")
.def(
"insert",
[](VoxelMap& voxelmap, const PointCloud& points, const Eigen::Matrix4d& T) { voxelmap.insert(points, Eigen::Isometry3d(T)); },
py::arg("points"),
py::arg("T") = Eigen::Matrix4d::Identity())
py::arg("T") = Eigen::Matrix4d::Identity(),
R"pbdoc(
Insert a point cloud into the voxel map and delete voxels that are not recently accessed.
Note
----
If this class is based on FlatContainer (i.e., IncrementalVoxelMap*), input points are ignored if
1) there are too many points in the cell or
2) the input point is too close to existing points in the cell.
Parameters
----------
points : :class:`PointCloud`
Input source point cloud.
T : numpy.ndarray, optional
Transformation matrix to be applied to the input point cloud (i.e., T_voxelmap_source). (default: identity)
)pbdoc")
.def(
"set_lru",
[](VoxelMap& voxelmap, size_t horizon, size_t clear_cycle) {
@ -40,24 +82,64 @@ auto define_class(py::module& m, const std::string& name) {
voxelmap.lru_clear_cycle = clear_cycle;
},
py::arg("horizon") = 100,
py::arg("clear_cycle") = 10)
.def("voxel_points", [](const VoxelMap& voxelmap) -> Eigen::MatrixXd {
auto points = traits::voxel_points(voxelmap);
return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(points[0].data(), points.size(), 4);
});
py::arg("clear_cycle") = 10,
R"pbdoc(
Set the LRU cache parameters.
Parameters
----------
horizon : int, optional
LRU horizon size. Voxels that have not been accessed for lru_horizon steps are deleted. (default: 100)
clear_cycle : int, optional
LRU clear cycle. Voxel deletion is performed every lru_clear_cycle steps. (default: 10)
)pbdoc")
.def(
"voxel_points",
[](const VoxelMap& voxelmap) -> Eigen::MatrixXd {
auto points = traits::voxel_points(voxelmap);
return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(points[0].data(), points.size(), 4);
},
R"pbdoc(
Get the voxel points.
Returns
-------
voxel_points : numpy.ndarray
Voxel points. (Nx4)
)pbdoc");
if constexpr (has_normals) {
vox.def("voxel_normals", [](const VoxelMap& voxelmap) -> Eigen::MatrixXd {
auto normals = traits::voxel_normals(voxelmap);
return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(normals[0].data(), normals.size(), 4);
});
vox.def(
"voxel_normals",
[](const VoxelMap& voxelmap) -> Eigen::MatrixXd {
auto normals = traits::voxel_normals(voxelmap);
return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(normals[0].data(), normals.size(), 4);
},
R"pbdoc(
Get the voxel normals.
Returns
-------
voxel_normals : numpy.ndarray
Voxel normals. (Nx4)
)pbdoc");
}
if constexpr (has_covs) {
vox.def("voxel_covs", [](const VoxelMap& voxelmap) -> Eigen::MatrixXd {
auto covs = traits::voxel_covs(voxelmap);
return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(covs[0].data(), covs.size(), 16);
});
vox.def(
"voxel_covs",
[](const VoxelMap& voxelmap) -> Eigen::MatrixXd {
auto covs = traits::voxel_covs(voxelmap);
return Eigen::Map<const Eigen::Matrix<double, -1, -1, Eigen::RowMajor>>(covs[0].data(), covs.size(), 16);
},
R"pbdoc(
Get the voxel covariance matrices.
Returns
-------
voxel_covs : numpy.ndarray
Voxel covariance matrices, one flattened 4x4 matrix per row. (Nx16)
)pbdoc");
}
};
@ -65,6 +147,6 @@ void define_voxelmap(py::module& m) {
define_class<IncrementalVoxelMap<FlatContainerPoints>, false, false>(m, "IncrementalVoxelMap");
define_class<IncrementalVoxelMap<FlatContainerNormal>, true, false>(m, "IncrementalVoxelMapNormal");
define_class<IncrementalVoxelMap<FlatContainerCov>, false, true>(m, "IncrementalVoxelMapCov");
define_class<IncrementalVoxelMap<FlatContainerNormalCov>, true, true>(m, "FlatContainerNormalCov");
define_class<IncrementalVoxelMap<FlatContainerNormalCov>, true, true>(m, "IncrementalVoxelMapNormalCov");
define_class<GaussianVoxelMap, false, true>(m, "GaussianVoxelMap");
}

View File

@ -90,6 +90,8 @@ align(const PointCloud& target, const PointCloud& source, const KdTree<PointClou
registration.rejector.max_dist_sq = setting.max_correspondence_distance * setting.max_correspondence_distance;
registration.criteria.rotation_eps = setting.rotation_eps;
registration.criteria.translation_eps = setting.translation_eps;
registration.optimizer.max_iterations = setting.max_iterations;
registration.optimizer.verbose = setting.verbose;
return registration.align(target, source, target_tree, init_T);
}
case RegistrationSetting::PLANE_ICP: {
@ -98,6 +100,8 @@ align(const PointCloud& target, const PointCloud& source, const KdTree<PointClou
registration.rejector.max_dist_sq = setting.max_correspondence_distance * setting.max_correspondence_distance;
registration.criteria.rotation_eps = setting.rotation_eps;
registration.criteria.translation_eps = setting.translation_eps;
registration.optimizer.max_iterations = setting.max_iterations;
registration.optimizer.verbose = setting.verbose;
return registration.align(target, source, target_tree, init_T);
}
case RegistrationSetting::GICP: {
@ -106,6 +110,8 @@ align(const PointCloud& target, const PointCloud& source, const KdTree<PointClou
registration.rejector.max_dist_sq = setting.max_correspondence_distance * setting.max_correspondence_distance;
registration.criteria.rotation_eps = setting.rotation_eps;
registration.criteria.translation_eps = setting.translation_eps;
registration.optimizer.max_iterations = setting.max_iterations;
registration.optimizer.verbose = setting.verbose;
return registration.align(target, source, target_tree, init_T);
}
case RegistrationSetting::VGICP: {
@ -125,6 +131,8 @@ RegistrationResult align(const GaussianVoxelMap& target, const PointCloud& sourc
registration.reduction.num_threads = setting.num_threads;
registration.criteria.rotation_eps = setting.rotation_eps;
registration.criteria.translation_eps = setting.translation_eps;
registration.optimizer.max_iterations = setting.max_iterations;
registration.optimizer.verbose = setting.verbose;
return registration.align(target, source, target, init_T);
}

View File

@ -2,6 +2,7 @@
# SPDX-FileCopyrightText: Copyright 2024 Kenji Koide
# SPDX-License-Identifier: MIT
import numpy
from scipy.spatial import KDTree
from scipy.spatial.transform import Rotation
import small_gicp
@ -188,3 +189,69 @@ def test_registration(load_points):
result = small_gicp.align(target_voxelmap, source)
verify_result(result.T_target_source, gt_T_target_source)
# KdTree test
def test_kdtree(load_points):
    """Verify small_gicp KdTree results against scipy.spatial.KDTree.

    Covers the batch interfaces (batch_nearest_neighbor_search,
    batch_knn_search with 1 and 2 threads) and the single-query interfaces
    (nearest_neighbor_search, knn_search) for k in {1, 2, 10}, on both
    same-cloud and cross-cloud queries.
    """
    _, target_raw_numpy, source_raw_numpy = load_points
    target, target_tree = small_gicp.preprocess_points(target_raw_numpy, downsampling_resolution=0.5)
    source, source_tree = small_gicp.preprocess_points(source_raw_numpy, downsampling_resolution=0.5)

    # Reference trees built from the same downsampled points.
    target_tree_ref = KDTree(target.points())
    source_tree_ref = KDTree(source.points())

    def batch_test(points, queries, tree, tree_ref, num_threads):
        # test for batch interface
        # NOTE: scipy returns distances while small_gicp returns squared
        # distances, hence the numpy.square on the reference values.
        k_dists_ref, k_indices_ref = tree_ref.query(queries, k=1)
        k_indices, k_sq_dists = tree.batch_nearest_neighbor_search(queries)
        assert numpy.all(numpy.abs(numpy.square(k_dists_ref) - k_sq_dists) < 1e-6)
        # Distances recomputed from the returned indices must agree with
        # the returned squared distances.
        assert numpy.all(numpy.abs(numpy.linalg.norm(points[k_indices] - queries, axis=1) ** 2 - k_sq_dists) < 1e-6)

        for k in [2, 10]:
            k_dists_ref, k_indices_ref = tree_ref.query(queries, k=k)
            k_sq_dists_ref, k_indices_ref = numpy.array(k_dists_ref) ** 2, numpy.array(k_indices_ref)

            k_indices, k_sq_dists = tree.batch_knn_search(queries, k, num_threads=num_threads)
            k_indices, k_sq_dists = numpy.array(k_indices), numpy.array(k_sq_dists)
            assert(numpy.all(numpy.abs(k_sq_dists_ref - k_sq_dists) < 1e-6))

            # Check each neighbor column against a recomputation from indices.
            for i in range(k):
                diff = numpy.linalg.norm(points[k_indices[:, i]] - queries, axis=1) ** 2 - k_sq_dists[:, i]
                assert(numpy.all(numpy.abs(diff) < 1e-6))

        # test for single query interface
        # (single-query results do not depend on num_threads, so run once)
        if num_threads != 1:
            return

        k_dists_ref, k_indices_ref = tree_ref.query(queries, k=1)
        k_indices2, k_sq_dists2 = [], []
        for query in queries:
            # Single-query API takes a 3D point (first three components).
            found, index, sq_dist = tree.nearest_neighbor_search(query[:3])
            assert found
            k_indices2.append(index)
            k_sq_dists2.append(sq_dist)

        assert numpy.all(numpy.abs(numpy.square(k_dists_ref) - k_sq_dists2) < 1e-6)
        assert numpy.all(numpy.abs(numpy.linalg.norm(points[k_indices2] - queries, axis=1) ** 2 - k_sq_dists2) < 1e-6)

        for k in [2, 10]:
            k_dists_ref, k_indices_ref = tree_ref.query(queries, k=k)
            k_sq_dists_ref, k_indices_ref = numpy.array(k_dists_ref) ** 2, numpy.array(k_indices_ref)

            k_indices2, k_sq_dists2 = [], []
            for query in queries:
                indices, sq_dists = tree.knn_search(query[:3], k)
                k_indices2.append(indices)
                k_sq_dists2.append(sq_dists)
            k_indices2, k_sq_dists2 = numpy.array(k_indices2), numpy.array(k_sq_dists2)
            assert(numpy.all(numpy.abs(k_sq_dists_ref - k_sq_dists2) < 1e-6))

            for i in range(k):
                diff = numpy.linalg.norm(points[k_indices2[:, i]] - queries, axis=1) ** 2 - k_sq_dists2[:, i]
                assert(numpy.all(numpy.abs(diff) < 1e-6))

    # Exercise same-cloud and cross-cloud query combinations with both
    # serial and multi-threaded batch search.
    for num_threads in [1, 2]:
        batch_test(target.points(), target.points(), target_tree, target_tree_ref, num_threads=num_threads)
        batch_test(target.points(), source.points(), target_tree, target_tree_ref, num_threads=num_threads)
        batch_test(source.points(), target.points(), source_tree, source_tree_ref, num_threads=num_threads)

View File

@ -12,6 +12,7 @@
#include <small_gicp/factors/icp_factor.hpp>
#include <small_gicp/factors/gicp_factor.hpp>
#include <small_gicp/factors/plane_icp_factor.hpp>
#include <small_gicp/factors/robust_kernel.hpp>
#include <small_gicp/registration/reduction_omp.hpp>
#include <small_gicp/registration/reduction_tbb.hpp>
#include <small_gicp/registration/registration.hpp>
@ -283,7 +284,7 @@ TEST_F(RegistrationTest, PCLInterfaceTest) {
INSTANTIATE_TEST_SUITE_P(
RegistrationTest,
RegistrationTest,
testing::Combine(testing::Values("ICP", "PLANE_ICP", "GICP", "VGICP"), testing::Values("SERIAL", "TBB", "OMP")),
testing::Combine(testing::Values("ICP", "PLANE_ICP", "GICP", "VGICP", "HUBER_GICP", "CAUCHY_GICP"), testing::Values("SERIAL", "TBB", "OMP")),
[](const auto& info) {
std::stringstream sst;
sst << std::get<0>(info.param) << "_" << std::get<1>(info.param);
@ -338,7 +339,29 @@ TEST_P(RegistrationTest, RegistrationTest) {
Registration<GICPFactor, ParallelReductionOMP> reg;
test_registration_vgicp(reg);
}
} else if (factor == "HUBER_GICP") {
if (reduction == "SERIAL") {
Registration<RobustFactor<Huber, GICPFactor>, SerialReduction> reg;
test_registration(reg);
} else if (reduction == "TBB") {
Registration<RobustFactor<Huber, GICPFactor>, ParallelReductionTBB> reg;
test_registration(reg);
} else if (reduction == "OMP") {
Registration<RobustFactor<Huber, GICPFactor>, ParallelReductionOMP> reg;
test_registration(reg);
}
} else if (factor == "CAUCHY_GICP") {
if (reduction == "SERIAL") {
Registration<RobustFactor<Cauchy, GICPFactor>, SerialReduction> reg;
test_registration(reg);
} else if (reduction == "TBB") {
Registration<RobustFactor<Cauchy, GICPFactor>, ParallelReductionTBB> reg;
test_registration(reg);
} else if (reduction == "OMP") {
Registration<RobustFactor<Cauchy, GICPFactor>, ParallelReductionOMP> reg;
test_registration(reg);
}
} else {
std::cerr << "error: unknown factor type " << factor << std::endl;
EXPECT_TRUE(false) << "error: unknown factor type " << factor;
}
}

View File

@ -0,0 +1,71 @@
#include <random>
#include <algorithm>
#include <fmt/format.h>
#include <small_gicp/util/sort_omp.hpp>
#include <small_gicp/util/sort_tbb.hpp>
#include <small_gicp/benchmark/benchmark.hpp>
#include <gtest/gtest.h>
using namespace small_gicp;
// Check if two vectors are identical (same length, element-wise equal).
template <typename T>
bool identical(const std::vector<T>& arr1, const std::vector<T>& arr2) {
  // std::vector::operator== already performs the size check followed by
  // an element-wise comparison, so the hand-rolled loop is unnecessary.
  return arr1 == arr2;
}
// Randomized test driver for radix_sort_tbb with key type T.
// Runs 20 trials on plain keys and 20 trials on (key, payload) pairs,
// comparing against std::stable_sort as the reference implementation.
template <typename T>
void test_radix_sort(std::mt19937& mt) {
  // Array length drawn uniformly from [0, 8192]; zero-length arrays are
  // possible, so the empty-range case is exercised as well.
  std::uniform_int_distribution<> size_dist(0, 8192);

  // Trial set 1: sort plain T keys (mt() output truncated to T's width).
  for (int i = 0; i < 20; i++) {
    std::vector<T> data(size_dist(mt));
    std::generate(data.begin(), data.end(), [&] { return mt(); });

    // Reference result.
    std::vector<T> sorted = data;
    std::stable_sort(sorted.begin(), sorted.end());

    // Result under test: identity key extractor.
    std::vector<T> sorted_tbb = data;
    radix_sort_tbb(sorted_tbb.data(), sorted_tbb.data() + sorted_tbb.size(), [](const T x) { return x; });

    EXPECT_TRUE(identical(sorted, sorted_tbb)) << fmt::format("i={} N={}", i, data.size());
  }

  // Trial set 2: sort (key, payload) pairs by key only. The reference is
  // std::stable_sort with a key-only comparator, so an exact match also
  // checks that radix_sort_tbb preserves the relative order of equal keys.
  for (int i = 0; i < 20; i++) {
    std::vector<std::pair<T, std::uint64_t>> data(size_dist(mt));
    std::generate(data.begin(), data.end(), [&] { return std::make_pair<T, std::uint64_t>(mt(), mt()); });

    // Reference result (stable, key-only comparison).
    std::vector<std::pair<T, std::uint64_t>> sorted = data;
    std::stable_sort(sorted.begin(), sorted.end(), [](const auto& lhs, const auto& rhs) { return lhs.first < rhs.first; });

    // Result under test: key extractor projects out the first element.
    std::vector<std::pair<T, std::uint64_t>> sorted_tbb = data;
    radix_sort_tbb(sorted_tbb.data(), sorted_tbb.data() + sorted_tbb.size(), [](const auto& x) -> T { return x.first; });

    EXPECT_TRUE(identical(sorted, sorted_tbb)) << fmt::format("i={} N={}", i, data.size());
  }
}
// Test radix_sort_tbb against std::stable_sort for every supported
// unsigned key width, then check the empty-input edge case explicitly.
TEST(SortTBB, RadixSortTest) {
  std::mt19937 mt;  // default-seeded so each run uses the same sequences
  test_radix_sort<std::uint8_t>(mt);
  test_radix_sort<std::uint16_t>(mt);
  test_radix_sort<std::uint32_t>(mt);
  test_radix_sort<std::uint64_t>(mt);

  // empty: sorting an empty range must be a no-op and must not crash
  std::vector<std::uint64_t> empty_vector;
  radix_sort_tbb(empty_vector.data(), empty_vector.data() + empty_vector.size(), [](const std::uint64_t x) { return x; });
  EXPECT_TRUE(empty_vector.empty()) << "Empty vector check";
}