The inference process of SAM has two main stages: (1) generating the image embedding, and (2) generating masks based on prompts.
Stage (1) takes up most of the compute time, so for interactive GUIs it is common practice to run (1) once and (2) multiple times.
Does the ultralytics package support this use case?
Toxite
July 26, 2025, 5:30pm
You can use `set_image`:
```python
def setup_source(self, source):
    """
    >>> predictor.setup_source(None)  # Uses default source if available

    Notes:
        - If source is None, the method may use a default source if configured.
        - The method adapts to different source types and prepares them for subsequent inference steps.
        - Supported source types may include local files, directories, URLs, and video streams.
    """
    if source is not None:
        super().setup_source(source)

def set_image(self, image):
    """
    Preprocess and set a single image for inference.

    This method prepares the model for inference on a single image by setting up the model if not already
    initialized, configuring the data source, and preprocessing the image for feature extraction. It
    ensures that only one image is set at a time and extracts image features for subsequent use.

    Args:
        image (str | np.ndarray): Path to the image file as a string, or a numpy array representing
            an image read by cv2.
    """
```

After setting the image, the embedding is available on the predictor:

```python
model.predictor.set_image("image.jpg")
embeddings = model.predictor.features
```
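Putting that together, here is a rough sketch of the encode-once, prompt-many pattern. `chunked` and `segment_with_prompts` are helper names introduced here for illustration (not part of the Ultralytics API); the predictor is assumed to expose `set_image`, prompt-based calling, and `reset_image` as in the excerpt above — check the current Ultralytics docs for the exact predictor construction.

```python
# Sketch of an encode-once / prompt-many loop for SAM-style inference.
# Assumptions: `predictor` exposes set_image(), __call__(bboxes=...), and
# reset_image(); `chunked` and `segment_with_prompts` are hypothetical
# helpers defined here, not Ultralytics functions.

def chunked(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def segment_with_prompts(predictor, image_path, bboxes, chunk_size=64):
    """Encode the image once, then decode masks chunk by chunk."""
    predictor.set_image(image_path)  # heavy image encoder runs only here
    results = []
    for batch in chunked(bboxes, chunk_size):
        # Only the lightweight prompt/mask decoder runs per batch.
        results.extend(predictor(bboxes=batch))
    predictor.reset_image()  # release the cached embedding
    return results
```

The chunk size is a tuning knob: larger chunks reduce per-call overhead at the cost of memory in the mask decoder.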
Check this too:
(GitHub issue, opened 23 Jan 2025, closed 14 Feb 2025; labels: enhancement, question, segment)
### Search before asking
- [x] I have searched the Ultralytics YOLO [issues](https://github.com/ultralytics/ultralytics/issues) and [discussions](https://github.com/orgs/ultralytics/discussions) and found no similar questions.
### Question
# Optimizing SAM Inference for High-Resolution Image Segmentation with Hundreds of Prompts
## **Description**
I'm currently working on segmenting objects within high-resolution images using SAM 2.1 and MobileSAM. My workflow involves the following steps:
1. **Bounding Box Detection:**
- I utilize another model to perform bounding box (bbox) detection.
- A single image typically contains **hundreds of objects**, resulting in **hundreds of bbox annotations**.
2. **Segmentation with SAM:**
- Mainly using bbox prompts, I employ SAM to segment each detected object.
- **Challenge:** While SAM's inference time is relatively fast, I followed the method described in the [Ultralytics SAM Documentation](https://docs.ultralytics.com/ko/models/sam-2/) (`results = model("path/to/image.jpg", bboxes=[100, 100, 200, 200])`). As a result, there are **hundreds of high-resolution image inputs** and **repeated image encodings**, which cause **significant time delays**.
## **Problem Statement**
My primary goal is to efficiently segment hundreds of objects in a single high-resolution image. Specifically:
- **Multiple Prompts Handling:**
- For a single image, I need to input **hundreds of bbox or point prompts for different objects**.
- **Performance Bottleneck:**
- Current approaches cause significant time lags due to hundreds of high-resolution image inputs and repeated image encoding, even though there is only a single input image.
- **Segmentation Accuracy:**
- Whole-image segmentation without specific prompts does not distinguish objects accurately, so bbox prompts must be used.
## **Desired Outcome**
To optimize the segmentation process, I aim to:
1. **Batch Processing of Prompts:**
- If possible, I want to input all (or a large batch of) bbox or point prompts **simultaneously** to minimize processing time.
2. **Single Image Encoding:**
- Ensure the **image encoding step is performed only once**, even if prompts are processed sequentially.
## **Request for Assistance**
I'm seeking guidance or potential solutions to address the following:
- **Efficient Prompt Handling:**
- How can I input hundreds of prompts for a single high-resolution image without incurring significant processing delays?
- **Optimizing Image Encoding:**
- Is there a way to reuse the image encoding across multiple prompt inputs, or to pre-encode the image and supply the embedding when needed?
## **Additional Information**
- **Tools & Versions:**
- SAM 2.1 / MobileSAM
- Ultralytics' SAM implementation
- **Image Specifications:**
- Ultra-high resolution (>10MB per image)
- Hundreds of objects per image
I would like to know whether an official feature for this is provided. If you have any reference materials, insights, or suggestions for similar implementations, I would greatly appreciate them. Thank you for your support!
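On the "pre-encode and enter it when needed" question above, one pattern is to persist the embedding after the first encode and restore it on later runs. This is a sketch only: it assumes the predictor stores the embedding in a `features` attribute after `set_image` (as the excerpt earlier in this thread suggests) and that restoring that attribute is sufficient for subsequent prompt calls — verify against the predictor internals before relying on it. `cache_embedding` and `load_embedding` are hypothetical helper names, and `pickle` stands in for `torch.save`/`torch.load` to keep the sketch dependency-free.

```python
import pickle
from pathlib import Path

# Hypothetical helpers: persist and restore a predictor's image embedding so
# the expensive encoder pass can be skipped on later runs. Assumes the
# embedding lives in `predictor.features` after set_image(); for real tensor
# embeddings, torch.save/torch.load would be the natural choice instead.

def cache_embedding(predictor, image_path, cache_path):
    """Encode the image once and write the resulting embedding to disk."""
    predictor.set_image(image_path)
    Path(cache_path).write_bytes(pickle.dumps(predictor.features))

def load_embedding(predictor, cache_path):
    """Restore a previously cached embedding, skipping the encoder."""
    predictor.features = pickle.loads(Path(cache_path).read_bytes())
```

Whether restoring `features` alone is enough depends on what other state `set_image` configures, so treat this as a starting point rather than a drop-in solution.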