
      Feature matching based on local windows aggregation

      research-article
      iScience
      Elsevier
      Applied sciences, Computer science, Network modeling


          Summary

The core goal of feature matching is to establish correspondences between two images. Current detector-free methods achieve impressive results but tend to focus on global features while neglecting regions with subtle textures, yielding fewer matches in weakly textured areas. This paper proposes a feature-matching method based on local window aggregation that balances global features against local texture variations to produce more accurate matches, especially in weak-texture regions. The method first applies a local window aggregation module, which uses window attention to suppress irrelevant interference, followed by global attention, producing coarse and fine-grained feature maps. A matching module then processes these maps: coarse matches are obtained first via the nearest-neighbor principle and subsequently refined on the fine-grained maps through local window refinement. Experimental results show that the method surpasses state-of-the-art techniques in pose estimation, homography estimation, and visual localization under the same training conditions.
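As a rough illustration of the pipeline the summary describes, the PyTorch sketch below shows the two generic ingredients: self-attention restricted to non-overlapping local windows, and coarse matching by mutual nearest neighbors on a feature-similarity matrix. All function names, shapes, and the single-head formulation are our own assumptions for illustration, not the paper's code; the actual local window aggregation and matching modules will differ in detail.

```python
import torch

def window_attention(feat, window_size=8):
    """Single-head self-attention restricted to non-overlapping local
    windows (illustrative; the paper's module is more elaborate).

    feat: (B, C, H, W) feature map; H and W are assumed to be
    divisible by window_size.
    """
    B, C, H, W = feat.shape
    ws = window_size
    # Partition the map into (B * num_windows, ws*ws, C) token groups.
    x = feat.reshape(B, C, H // ws, ws, W // ws, ws)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
    # Scaled dot-product attention within each window only, so a token
    # never attends outside its local neighborhood.
    attn = torch.softmax(x @ x.transpose(1, 2) / C ** 0.5, dim=-1)
    x = attn @ x
    # Reverse the window partition back to (B, C, H, W).
    x = x.reshape(B, H // ws, W // ws, ws, ws, C)
    return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

def mutual_nearest_coarse_matches(desc0, desc1, threshold=0.2):
    """Coarse matching by mutual nearest neighbors.

    desc0: (N, C) and desc1: (M, C) L2-normalized coarse descriptors
    from the two images; the threshold value here is arbitrary.
    """
    sim = desc0 @ desc1.t()                # (N, M) cosine similarity
    nn01 = sim.argmax(dim=1)               # best match, image 0 -> 1
    nn10 = sim.argmax(dim=0)               # best match, image 1 -> 0
    idx0 = torch.arange(desc0.shape[0])
    mutual = nn10[nn01] == idx0            # cross-consistency check
    confident = sim[idx0, nn01] > threshold
    keep = mutual & confident
    return idx0[keep], nn01[keep]
```

Restricting attention to windows keeps the cost linear in the number of windows and, as the summary notes, limits interference from irrelevant distant regions; the subsequent global attention stage (not shown) restores cross-window context.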


          Highlights

• We propose an optimization method to obtain more accurate sub-pixel matching positions (see the refinement sketch after this list)

• We design a local window aggregation module to obtain better image feature points

• Integrating coarse and fine features yields outstanding matching in weak-texture regions
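The sub-pixel optimization mentioned in the first highlight is commonly realized as correlation followed by a soft-argmax: correlate the descriptor at a coarse match against a fine-level patch around its counterpart, then take the probability-weighted average of pixel offsets. The sketch below shows that generic recipe under our own naming and shape assumptions; the paper's local window refinement may differ in detail.

```python
import torch

def subpixel_refine(query_desc, fine_patch):
    """Refine one coarse match to a sub-pixel offset (illustrative).

    query_desc: (C,) fine descriptor at the match center in image 0.
    fine_patch: (C, w, w) fine features around the coarse match in
    image 1, with w odd so the patch has a well-defined center.
    """
    C, w, _ = fine_patch.shape
    # Correlate the query against every position in the local window.
    scores = (fine_patch.reshape(C, -1) * query_desc[:, None]).sum(0)
    # Softmax turns correlation scores into a match distribution.
    prob = torch.softmax(scores / C ** 0.5, dim=0).reshape(w, w)
    # Expected offset from the patch center (soft-argmax), which is
    # continuous and hence sub-pixel accurate.
    coords = torch.arange(w, dtype=prob.dtype) - w // 2
    dy = (prob.sum(dim=1) * coords).sum()
    dx = (prob.sum(dim=0) * coords).sum()
    return torch.stack([dx, dy])           # (x, y) sub-pixel offset
```

The refined location is the coarse match position plus this offset, scaled back to full image resolution.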



                Author and article information

Journal: iScience (Elsevier)
ISSN: 2589-0042
Published online: 28 August 2024
Issue date: 20 September 2024
Volume: 27
Issue: 9
Article number: 110825
Affiliations
[1] Heilongjiang University, No. 74 Xuefu Road, Harbin 150080, Heilongjiang, China
[2] Qiqihar University, No. 42 Wenhua Street, Qiqihar 161006, Heilongjiang, China
[3] Anhui Wenda University of Information Engineering, No. 3 Forest Avenue, Hefei 231201, Anhui, China
Author notes
Corresponding author: leewenpeng@126.com
[4] Lead contact

Article
PII: S2589-0042(24)02050-9
DOI: 10.1016/j.isci.2024.110825
PMCID: PMC11416493
PMID: 39310757
                © 2024 The Author(s)

                This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

History
Received: 19 May 2024
Revised: 31 July 2024
Accepted: 22 August 2024
Categories: Article

Keywords: applied sciences, computer science, network modeling
