
Aquila gives satellites a smarter voice

06/05/2026 TranSpread

Remote sensing is now central to tracking crops, cities, coastlines, ecosystems, and emergency events, yet many AI systems still struggle to understand overhead imagery the way experts do. Earlier remote sensing vision-language models often depend on shallow fusion, loosely connecting image features with language outputs. They also face a scale problem: roads, buildings, harbors, and fields can look very different depending on resolution and ground sampling distance. Even when high-resolution data are available, many models cannot fully preserve fine-grained spatial structure during reasoning. These challenges call for deeper research into high-resolution, multi-scale remote sensing vision-language modeling.

Researchers from Wuhan Kotei Informatics Co. Ltd., the Chinese Academy of Surveying and Mapping, Emory University, the University of Science and Technology Beijing, and the China Aero Geophysical Survey and Remote Sensing Center for Natural Resources reported the study (DOI: 10.34133/remotesensing.1041) in the Journal of Remote Sensing on March 3, 2026. Their model, Aquila, was designed to tackle a persistent bottleneck in Earth-observation AI: how to connect rich visual detail with language-based reasoning without losing the spatial clues that make remote sensing imagery meaningful.

Aquila improves remote sensing image comprehension through two linked innovations. First, it accepts image inputs up to 1,024 × 1,024 pixels, far larger than the 448 × 448 inputs supported by many earlier systems. Second, it combines multi-scale image features and repeatedly re-injects them into the language model, rather than aligning vision and text only once. This strategy produced clear gains: on the challenging FIT_RSFG-Captions benchmark, Aquila outperformed SkySenseGPT by 7.77%, and on FIT_RSFG-VQA it reached 83.87% accuracy, beating SkySenseGPT by 4.11%.
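
The release does not include reference code, but the core idea of repeated re-injection can be sketched in PyTorch: project visual features from several scales into the language model's width and let the text stream attend to them more than once, rather than fusing vision and text a single time at the input. All module names, dimensions, and the number of injection points below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the authors' code is not reproduced in this release,
# and all module names, dimensions, and the number of scales and injection
# points are assumptions made to mirror the description above.
import torch
import torch.nn as nn

class MultiScaleReInjection(nn.Module):
    """Fuse visual features from several scales and re-inject them into the
    text stream repeatedly, instead of aligning vision and text only once."""

    def __init__(self, vis_dims=(256, 512, 1024, 2048), lm_dim=4096, n_inject=4):
        super().__init__()
        # One projection per visual scale into the language-model width.
        self.proj = nn.ModuleList(nn.Linear(d, lm_dim) for d in vis_dims)
        # One cross-attention block per injection point in the decoder stack.
        self.inject = nn.ModuleList(
            nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)
            for _ in range(n_inject)
        )

    def forward(self, text_states, vis_feats):
        # vis_feats: list of (batch, tokens_i, dim_i) tensors, one per scale.
        vis_tokens = torch.cat([p(f) for p, f in zip(self.proj, vis_feats)], dim=1)
        # Let the text hidden states attend to the fused visual tokens at
        # several points, standing in for injection into multiple LM layers.
        for attn in self.inject:
            attended, _ = attn(text_states, vis_tokens, vis_tokens)
            text_states = text_states + attended
        return text_states
```

In this toy form a single module simply loops over injection points; in the architecture described by the authors, the visual evidence is instead fed into different layers of the Llama-3 decoder as reasoning proceeds.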

The model is built from three core parts: an Aquila-CLIP ConvNeXt vision encoder, a hierarchical spatial feature integration module, and a multi-layer deep alignment language model based on Llama-3. Instead of relying on a single visual summary, Aquila extracts features from four scales and fuses them with a spatially aware cross-attention design that preserves local structure. This matters in remote sensing, where small objects and spatial layouts often carry the key meaning. In ablation tests, the spatial feature integration module improved captioning by 5.62% and VQA by 6.85% over a concatenation baseline. Adding deep alignment further raised performance by 2.55% in captioning and 4.64% in VQA. Aquila also showed broader grounding ability, reaching an mIoU of 68.33 on the DIOR-RSVG test set.
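
As a rough illustration of where the "four scales" come from, the snippet below pulls intermediate feature maps from a hierarchical convolutional backbone at a 1,024 × 1,024 input size. A generic ConvNeXt from the timm library stands in for the Aquila-CLIP ConvNeXt encoder, which is not publicly identified in this release, so the exact channel counts and strides are assumptions.

```python
# Stand-in backbone only: a generic timm ConvNeXt is used in place of the
# Aquila-CLIP ConvNeXt encoder described in the paper.
import torch
import timm

# features_only=True exposes the backbone's intermediate stage outputs.
backbone = timm.create_model("convnext_base", pretrained=False, features_only=True)

image = torch.randn(1, 3, 1024, 1024)   # high-resolution input, no cropping or padding
stage_feats = backbone(image)           # list of four feature maps, fine to coarse

for i, feat in enumerate(stage_feats):
    # e.g. (1, 128, 256, 256) down to (1, 1024, 32, 32) for this backbone
    print(f"scale {i}: {tuple(feat.shape)}")
```

A spatially aware fusion module then has to combine these maps while keeping their 2-D layout, which is what lets small objects and spatial arrangements survive into the language stage.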

In the paper, the authors argue that Aquila’s gains come from modeling remote sensing imagery the way the domain demands: with high resolution, multi-scale perception, and persistent image-language interaction throughout reasoning. Their results suggest that fine-grained Earth-observation understanding depends not just on bigger models, but on architectures that preserve spatial evidence instead of compressing it away too early.

The team trained Aquila in two stages. First, they aligned image and language features using about 1 million remote sensing image-text pairs while freezing both the vision encoder and the language model. Second, they instruction-tuned the system on 1.8 million high-quality pairs using LoRA. Training ran on four NVIDIA A800 GPUs, with images resized to 1,024 × 1,024 without cropping or padding, and evaluation covered captioning, visual question answering, and grounding benchmarks.
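
A minimal sketch of that two-stage recipe, assuming a Hugging Face Llama-3 checkpoint and the peft library for LoRA, is shown below; the hyperparameters, target modules, and model identifier are illustrative guesses rather than the authors' released configuration.

```python
# Sketch of the two-stage recipe described above; hyperparameters, target
# modules, and the model identifier are assumptions, not the released setup.
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def freeze(module: nn.Module) -> None:
    """Stage 1 helper: keep a component fixed while the alignment layers train."""
    for p in module.parameters():
        p.requires_grad = False

# Stage 1: freeze the language model (and, likewise, the vision encoder) and
# train only the vision-language alignment layers on ~1M image-text pairs.
language_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
freeze(language_model)

# Stage 2: instruction-tune on ~1.8M curated pairs with LoRA adapters, so only
# small low-rank matrices inside the attention projections are updated.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
language_model = get_peft_model(language_model, lora_cfg)
language_model.print_trainable_parameters()  # adapters are the only trainable weights
```

Adapter-based tuning of this kind is one common way to fit instruction tuning of a large decoder onto a handful of GPUs, consistent with the four-GPU setup mentioned above.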

Aquila points toward a future in which analysts can interact with satellite and aerial imagery through natural language while still retaining expert-level spatial precision. The authors note that the system remains computationally intensive and currently focuses on single-temporal RGB imagery. But its design offers a foundation for broader geo-foundation models that could integrate multi-temporal, multispectral, or SAR data, expanding applications in urban growth tracking, disaster assessment, environmental surveillance, and intelligent geospatial decision-making.

References

DOI: 10.34133/remotesensing.1041
Original source URL: https://doi.org/10.34133/remotesensing.1041

Funding information

This research was supported by the Faculty Startup Fund of Emory College of Arts & Sciences, the Fundamental Research Funds for the Central Universities (FRF-TP-25-008), the Postdoctoral Fellowship Program of China Postdoctoral Science Foundation (GZC20250171), the National Natural Science Foundation of China (42201440 and 42401500), and the Fundamental Research Funds for Chinese Academy of Surveying and Mapping (AR2410).

About Journal of Remote Sensing

The Journal of Remote Sensing, an online-only Open Access journal published in association with AIR-CAS, promotes the theory, science, and technology of remote sensing, as well as interdisciplinary research within earth and information science.

Paper title: Aquila: A Hierarchically Aligned Vision-Language Model for Enhanced Remote Sensing Image Comprehension
Attached files
  • Architecture overview of the proposed Aquila.
Regions: North America, United States, Asia, China
Keywords: Applied science, Technology, Artificial Intelligence
