Open‑YOLO 3D replaces costly SAM/CLIP steps with 2D detection, LG label‑maps, and parallelized visibility, enabling fast and accurate 3D OV segmentation.

Drop the Heavyweights: YOLO‑Based 3D Segmentation Outpaces SAM/CLIP

Abstract and 1 Introduction

  1. Related works
  2. Preliminaries
  3. Method: Open-YOLO 3D
  4. Experiments
  5. Conclusion and References

A. Appendix

3 Preliminaries

Problem formulation: 3D instance segmentation aims to segment individual objects within a 3D scene and assign a class label to each segmented object. In the open-vocabulary (OV) setting, the class label can belong to classes seen in the training set as well as to novel, previously unseen classes. To this end, let P denote a reconstructed 3D point cloud of the scene, obtained from a sequence of RGB-D images. We denote the RGB image frames as I and their corresponding depth frames as D. Similar to recent methods [35, 42, 34], we assume that the camera poses and intrinsic parameters are available for the input 3D scene.
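For concreteness, the sketch below collects the inputs named in this formulation into a single container; the class name and field layout are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
import torch

@dataclass
class SceneInputs:
    """Hypothetical container for one OV 3D scene (field names are illustrative)."""
    points: torch.Tensor        # (N, 3) reconstructed point cloud P
    rgb_frames: torch.Tensor    # (F, H, W, 3) RGB frames I used for reconstruction
    depth_frames: torch.Tensor  # (F, H, W) depth frames D aligned with I
    intrinsics: torch.Tensor    # (F, 3, 3) per-frame camera intrinsics
    poses: torch.Tensor         # (F, 4, 4) per-frame camera poses
```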


3.1 Baseline Open-Vocabulary 3D Instance Segmentation

We base our approach on OpenMask3D [42], the first method to perform open-vocabulary 3D instance segmentation in a zero-shot manner. OpenMask3D has two main modules: a class-agnostic mask proposal head and a mask-feature computation module. The class-agnostic mask proposal head uses a transformer-based pre-trained 3D instance segmentation model [39] to predict a binary mask for each object in the point cloud. The mask-feature computation module first generates 2D segmentation masks by projecting each 3D mask into the views in which the instance is highly visible and refining the projections with the SAM [23] model. A pre-trained CLIP vision-language model [55] then produces image embeddings for these 2D segmentation masks, which are aggregated across all frames into a single 3D mask-feature representation.

Limitations: OpenMask3D leverages advances in 2D segmentation (SAM) and vision-language models (CLIP) to generate and aggregate 2D feature representations, enabling instances to be queried with open-vocabulary concepts. However, this approach carries a high computational burden and therefore slow inference, with a processing time of 5-10 minutes per scene. The burden stems mainly from two sub-tasks: 2D segmentation of the large number of objects across the various 2D views, and 3D feature aggregation based on object visibility. We next introduce our proposed method, which aims to reduce this computational burden while improving task accuracy.


4 Method: Open-YOLO 3D

Motivation: Here we present our proposed 3D open-vocabulary instance segmentation method, Open-YOLO 3D, which aims to generate 3D instance predictions efficiently. Our method introduces efficient and improved modules at both the task level and the data level.

Task Level: Unlike OpenMask3D, which generates segmentations of the projected 3D masks, we pursue a more efficient approach that relies on 2D object detection. Since the end goal is to assign labels to the 3D masks, the extra computation of a full 2D segmentation task is unnecessary.

Data Level: OpenMask3D computes the visibility of 3D masks in 2D frames by iteratively counting visible points for each mask across all frames. This is time-consuming, and we instead propose to compute the visibility of all 3D masks within all frames at once.


4.1 Overall Architecture


4.2 3D Object Proposal


4.3 Low Granularity (LG) Label-Maps


4.4 Accelerated Visibility Computation (VAcc)

To associate 2D label maps with 3D proposals, we compute the visibility of each 3D mask in every frame. To this end, we propose a fast approach that computes the visibility of all 3D masks across all frames via highly parallelizable tensor operations.
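The exact tensor formulation is not spelled out in this excerpt, so the following is a minimal sketch of how such a batched visibility computation could look in PyTorch. The function name, the argument conventions (world-to-camera poses, metric depth maps), and the depth tolerance are assumptions for illustration, not the authors' implementation.

```python
import torch

def visibility_matrix(points, masks, intrinsics, poses, depth_maps, depth_tol=0.05):
    """Sketch of a batched, per-frame-parallel visibility computation (assumed conventions).

    points:      (N, 3) scene points
    masks:       (M, N) boolean instance proposals
    intrinsics:  (F, 3, 3) camera intrinsics
    poses:       (F, 4, 4) world-to-camera extrinsics (invert camera-to-world poses if needed)
    depth_maps:  (F, H, W) metric depth frames
    Returns an (M, F) matrix with the fraction of each mask's points visible in each frame.
    """
    F, H, W = depth_maps.shape
    N = points.shape[0]

    # Transform all points into every camera frame with one batched matmul: (F, N, 3).
    homog = torch.cat([points, torch.ones(N, 1, dtype=points.dtype, device=points.device)], dim=1)
    cam = torch.einsum('fij,nj->fni', poses, homog)[..., :3]

    # Project with the per-frame intrinsics to pixel coordinates: (F, N, 2).
    proj = torch.einsum('fij,fnj->fni', intrinsics, cam)
    z = proj[..., 2].clamp(min=1e-6)
    uv = proj[..., :2] / z.unsqueeze(-1)
    u, v = uv[..., 0].round().long(), uv[..., 1].round().long()

    # A point is visible if it lands inside the image, lies in front of the camera,
    # and its depth agrees with the sensor depth at that pixel.
    in_bounds = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (cam[..., 2] > 0)
    u_c, v_c = u.clamp(0, W - 1), v.clamp(0, H - 1)
    sensor_depth = torch.gather(depth_maps.view(F, -1), 1, v_c * W + u_c)     # (F, N)
    visible = in_bounds & ((cam[..., 2] - sensor_depth).abs() < depth_tol)    # (F, N)

    # Aggregate per mask with a single matrix product: (M, N) @ (N, F) -> (M, F).
    vis_counts = masks.float() @ visible.float().T
    return vis_counts / masks.float().sum(dim=1, keepdim=True).clamp(min=1)
```

The key point of the sketch is that no per-mask, per-frame loop is needed: point projection is batched over all frames, and the per-mask aggregation collapses into one matrix multiplication.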

Figure 3: Multi-View Prompt Distribution (MVPDist). After creating the LG label maps for all frames, we select the top-k label maps based on the 2D projection of the 3D proposal. Using the (x, y) coordinates of the 2D projection, we read the labels from the selected LG label maps to build the MVPDist, from which we predict the text-prompt ID with the highest probability.
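Since this excerpt only gives the caption-level description of MVPDist, the sketch below shows one plausible way to assemble it. The function signature, the use of a per-frame visibility vector for top-k frame selection, and the background label convention (-1) are assumptions, not the authors' code.

```python
import torch

def mvp_dist(proposal_uv, label_maps, visibility, num_prompts, top_k=5):
    """Illustrative sketch of building a multi-view prompt distribution (MVPDist).

    proposal_uv: (F, N, 2) pixel coordinates of the proposal's projected points per frame
    label_maps:  (F, H, W) LG label maps holding a text-prompt ID per pixel (-1 = background)
    visibility:  (F,) fraction of the proposal's points visible in each frame
    Returns (distribution over prompt IDs, predicted prompt ID).
    """
    F, H, W = label_maps.shape
    # Keep only the top-k frames in which the 3D proposal is most visible.
    top_frames = torch.topk(visibility, k=min(top_k, F)).indices

    votes = []
    for f in top_frames:
        u = proposal_uv[f, :, 0].long().clamp(0, W - 1)
        v = proposal_uv[f, :, 1].long().clamp(0, H - 1)
        # Read the prompt ID stored in the LG label map at each projected point.
        labels = label_maps[f, v, u]
        votes.append(labels[labels >= 0])            # drop background pixels

    votes = torch.cat(votes)
    # Normalized histogram over prompt IDs = the multi-view prompt distribution.
    dist = torch.bincount(votes, minlength=num_prompts).float()
    dist = dist / dist.sum().clamp(min=1)
    return dist, int(dist.argmax())
```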


4.5 Multi-View Prompt Distribution (MVPDist)

Table 1: State-of-the-art comparison on the ScanNet200 validation set. We use Mask3D trained on the ScanNet200 training set to generate class-agnostic mask proposals. Our method outperforms approaches that generate 3D proposals by fusing 2D masks with proposals from a 3D network (highlighted in gray in the table), and it outperforms state-of-the-art methods by a wide margin under the same conditions when using only proposals from a 3D network.


4.6 Instance Prediction Confidence Score


:::info Authors:

(1) Mohamed El Amine Boudjoghra, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) ([email protected]);

(2) Angela Dai, Technical University of Munich (TUM) ([email protected]);

(3) Jean Lahoud, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) ([email protected]);

(4) Hisham Cholakkal, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) ([email protected]);

(5) Rao Muhammad Anwer, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Aalto University ([email protected]);

(6) Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University ([email protected]);

(7) Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University ([email protected]).

:::


:::info This paper is available on arxiv under CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::
