Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

Figure 7 gives an overview of the architecture. QFormer is initialized as a BERT-based model [8] with L = 12 layers. Unlike typical BERT models, which process textual inputs, QFormer takes R = 32 learnable query embeddings as inputs. These embeddings are used to extract visual information from the input visual data during Stage-1 pretraining in BLIP2 [22]. After projection, they serve as visual prompt embeddings for the LLM inputs.
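To make the role of the query embeddings concrete, the following is a minimal sketch of queries attending over visual embeddings. It is not the authors' implementation: it assumes a single attention head, omits the learned query/key/value projections, and uses toy values for the number of visual tokens `N` and the hidden size `D`. Only `R = 32` comes from the text. The names `cross_attention`, `queries`, and `visual` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, visual):
    """Single-head scaled dot-product attention (assumption: no learned
    projections). Each query attends over all visual embeddings."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every visual embedding.
        scores = [sum(qi * vi for qi, vi in zip(q, v)) / math.sqrt(d)
                  for v in visual]
        w = softmax(scores)
        # Weighted sum of the visual embeddings.
        out = [sum(w[j] * visual[j][k] for j in range(len(visual)))
               for k in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


random.seed(0)
R, N, D = 32, 50, 8  # R is from the text; N and D are toy values
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(R)]
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

prompts, attn = cross_attention(queries, visual)
print(len(prompts), len(prompts[0]))  # 32 8
```

Whatever the number of visual embeddings `N`, the output always has `R` rows, which is why a fixed number of prompt embeddings can be passed on to the LLM.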

Inside the QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (consisting of Linear, LayerNorm, and Residual Connection). A cross-attention module, randomly initialized, is inserted every G layers; this is where the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and forward modules into single self-(cross-)attention modules. We also illustrated only the modifications made to the cross-attention module in MIVPG, since the self-attention modules remain unchanged. The final QFormer output is the query embeddings of the last layer.
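The layer layout described above can be sketched as a simple plan of modules per layer. This is an illustration only: the text does not give a value for G, so `cross_every=2` is an assumption, as is the choice to place cross-attention in layers whose index is divisible by G. The function name `qformer_layer_plan` is hypothetical.

```python
def qformer_layer_plan(num_layers=12, cross_every=2):
    """Return the modules in each layer of a QFormer-like stack.

    Assumptions (not stated in the text): cross_every (G) defaults to 2,
    and cross-attention sits in layers whose index is divisible by G.
    """
    plan = []
    for i in range(num_layers):
        # Every layer has a self-attention module over the queries.
        modules = ["self_attention"]
        if i % cross_every == 0:
            # Queries interact with visual embeddings here.
            modules.append("cross_attention")
        plan.append(modules)
    return plan


plan = qformer_layer_plan()
cross_layers = [i for i, mods in enumerate(plan) if "cross_attention" in mods]
print(cross_layers)  # [0, 2, 4, 6, 8, 10]
```

Under these assumptions, half of the 12 layers let the queries read from the visual embeddings, while the remaining layers only mix information among the queries.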

For a more comprehensive understanding, readers are encouraged to refer to [22].

:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington;

(2) Wenyi Wu, Amazon;

(3) Qi Li, Amazon;

(4) Rob Barton, Amazon;

(5) Boxin Du, Amazon;

(6) Shioulin Sam, Amazon;

(7) Karim Bouyarmane, Amazon;

(8) Ismail Tutar, Amazon;

(9) Junzhou Huang, The University of Texas at Arlington.

:::


:::info This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
