YOLOv8 ONNX inference in OpenCV DNN CUDA throws error: Reshape2LayerImpl getOutShape

Hello,

The C++ example code from Ultralytics fails with an assertion error in Reshape2LayerImpl::getOutShape.

My setup:
Windows 10.
CUDA Version - 12.6
cuDNN Version - 9.6
ONNX Runtime (GPU) Version - 1.20.1
ONNX Version - 1.17 (using opset 12).
OpenCV Version 5.0.0-Alpha (I have compiled OpenCV with CUDA, cuDNN and ONNX support).

I then cloned the Ultralytics repository with Git and compiled the C++ example called “YOLOv8-CPP-Inference”.

When I run the program:

Yolov8CPPInference.exe

I get the following error:

Running on CUDA
[ WARN:0@1.047] global net_impl_backend.cpp:193 cv::dnn::dnn5_v20241127::Net::Impl::setPreferableBackend Back-ends are not supported by the new graph engine for now
[ WARN:0@1.047] global net_impl_backend.cpp:229 cv::dnn::dnn5_v20241127::Net::Impl::setPreferableTarget Targets are not supported by the new graph engine for now
OpenCV: terminate handler is called! The last OpenCV error is:
OpenCV(5.0.0alpha) Error: Assertion failed (outTotal == inpTotal) in cv::dnn::Reshape2LayerImpl::getOutShape, file F:\AI_Componets\OpenCV\opencv\modules\dnn\src\layers\reshape2_layer.cpp, line 113

Here is how I exported the YOLOv8 model to .ONNX:

(myenv) F:\AI_Componets\AI_Projects\mangus_ai>yolo export model=F:\AI_Componets\Models\YOLOv8\Models\detection_coco\yolov8x.pt format=onnx dynamic=False imgsz=640 half=True simplify=False batch=8 workspace=4 device=0 opset=12

Ultralytics 8.3.57 🚀 Python-3.12.5 torch-2.6.0.dev20250102+cu126 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)
YOLOv8x summary (fused): 268 layers, 68,200,608 parameters, 0 gradients, 257.8 GFLOPs

PyTorch: starting from 'F:\AI_Componets\Models\YOLOv8\Models\detection_coco\yolov8x.pt' with input shape (8, 3, 640, 640) BCHW and output shape(s) (8, 84, 8400) (130.5 MB)

ONNX: starting export with onnx 1.17.0 opset 12...
ONNX: export success ✅ 6.2s, saved as 'F:\AI_Componets\Models\YOLOv8\Models\detection_coco\yolov8x.onnx' (130.2 MB)

Export complete (9.7s)
Results saved to F:\AI_Componets\Models\YOLOv8\Models\detection_coco
Predict:         yolo predict task=detect model=F:\AI_Componets\Models\YOLOv8\Models\detection_coco\yolov8x.onnx imgsz=640 half
Validate:        yolo val task=detect model=F:\AI_Componets\Models\YOLOv8\Models\detection_coco\yolov8x.onnx imgsz=640 data=coco.yaml half
Visualize:       https://netron.app
💡 Learn more at https://docs.ultralytics.com/modes/export

The images I am testing with have only been resized to 640px.

Thanks for any advice.

Hello,

The issue you’re experiencing (Reshape2LayerImpl getOutShape) when using the exported ONNX model with OpenCV DNN is likely related to the input dimensions of your model and how OpenCV processes them. Here are a few suggestions to help resolve the problem:

  1. Verify Input Dimensions:
    Ensure that the input dimensions of the image you are providing to the OpenCV model match the expected input shape of the ONNX model. The model expects an input shape of (batch_size, 3, 640, 640) as noted in the export logs. If OpenCV resizes or alters the input shape differently, a mismatch could trigger this error. Make sure your input image is resized to (640, 640) and properly normalized.

  2. Static vs. Dynamic Shapes:
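    Since you exported the ONNX model with dynamic=False and batch=8, the graph expects a fixed input of (8, 3, 640, 640). A blob holding a single image has one eighth of the elements the graph was exported for, which is consistent with the outTotal == inpTotal assertion in your log. When testing in OpenCV, make sure the batch size matches the exported value. If you only want to test single images, re-export the model with batch=1, or enable dynamic input shapes by setting dynamic=True during export.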

  3. ONNX Opset Compatibility:
    You used opset=12 for the ONNX export. While this should generally work, certain operations in OpenCV might not fully support specific ONNX opset versions. Consider re-exporting the model with a higher opset version (e.g., opset=16) and test again.

  4. Simplify ONNX Graph:
    Although you set simplify=False, simplifying the ONNX graph can often resolve compatibility issues. Try re-exporting the model with simplify=True.

  5. Debugging in OpenCV:
    OpenCV DNN has limited support for some ONNX operations. The warnings in your log also say that the new graph engine in OpenCV 5.0.0-alpha does not support back-ends and targets yet, so the CUDA request is likely being ignored. If possible, run the ONNX model with ONNX Runtime directly to confirm that the issue is isolated to OpenCV DNN.

  6. Check OpenCV Build:
    Ensure your OpenCV build with CUDA, cuDNN, and ONNX support is correctly configured. Verify that the compiled version matches your system setup and CUDA/cuDNN versions.

  7. Try Alternative Backends:
    If the issue persists, consider using ONNX Runtime (GPU) for inference instead of OpenCV DNN. It provides more robust support for ONNX models and might bypass these compatibility issues.
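
To make the batch-size point in suggestion 2 concrete, here is a minimal sketch in plain C++ (no OpenCV calls) that compares the shape of the blob you feed with the shape the model was exported for. The names Shape4, elementCount and matchesExportedShape are hypothetical helpers for illustration, not part of OpenCV or Ultralytics, and the sketch assumes the reshape failure comes from an element-count mismatch like the one in the log.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// NCHW tensor shape: batch, channels, height, width.
using Shape4 = std::array<std::int64_t, 4>;

// Total number of elements in a tensor with the given static shape.
std::int64_t elementCount(const Shape4& shape) {
    std::int64_t total = 1;
    for (std::int64_t dim : shape) {
        assert(dim > 0);  // static shapes only, no dynamic (-1) dimensions
        total *= dim;
    }
    return total;
}

// Hypothetical pre-flight check: a model exported with dynamic=False
// only accepts a blob whose shape equals the exported shape exactly.
bool matchesExportedShape(const Shape4& blob, const Shape4& exported) {
    return blob == exported;
}
```

With the shapes from this thread, elementCount gives 1,228,800 for a single (1, 3, 640, 640) image and 9,830,400 for the exported (8, 3, 640, 640) batch, so matchesExportedShape returns false. Running a check like this before the forward pass turns the assertion deep inside the reshape layer into an early, readable error.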

Lastly, if the issue persists, try updating to the latest versions of all relevant dependencies (OpenCV, ONNX, ONNX Runtime, CUDA, cuDNN) to ensure compatibility. Let us know how it goes! :blush:

Hello pderrenger,

Thank you very much for your reply. It has resolved the issues I was having.

I finally got OpenCV to run the Yolov8 inference with your suggestions.

The main goal is actually to convert the .ONNX model to TensorRT format and run YOLOv8 on the TensorRT backend. I will probably open a new thread on that, since I was only using OpenCV DNN for testing and to get an idea of how to write the pre-processing logic.

For a live video feed instead of a single image, can the dynamic parameter be set to False, or must it be True? I assume that enabling the dynamic option would add some latency.

Thank you again for your response.

You’re welcome! For live video feed inference with YOLO models, setting dynamic=False is generally more efficient if input dimensions remain consistent, as it avoids the overhead of dynamic shape handling. However, if your input dimensions vary (e.g., different resolutions), dynamic=True is necessary to accommodate them. For minimal latency, ensure your input dimensions are fixed and match the exported model’s settings. Let us know if you need further assistance!

Hello pderrenger,

Thank you very much for your reply.

I plan to maintain a fixed-resolution video input of 640 x 640 pixels with dynamic=False for minimal latency. However, I am unsure how to handle the frame rate, as YOLOv8 supports batch processing.

Is it possible to set the input video stream to 30 FPS and also configure the batch processing to handle 30 FPS? Would this approach be more efficient and reduce latency through parallel computation, or is it better for YOLOv8 to process frames one by one? Do I first need to do some video handling, such as taking 30 frames from the video, storing them in a buffer, and then somehow passing that buffer to the YOLOv8 inference?

I couldn’t find any C++ examples or explanations on how to properly interface YOLOv8’s batch processing for video feeds.

I also plan on using the GPU, but OpenCV seems to lag well behind the recent CUDA libraries. I also couldn’t find any ONNX Runtime or TensorRT examples with YOLOv8. All examples seem to use OpenCV, which is problematic on Windows, so everything would have to be written from scratch without the OpenCV library.

Thanks again for your response.

To ensure efficient processing of a 30 FPS video stream using YOLOv8 while maintaining minimal latency, here are some suggestions:

  1. Batch Processing vs. Frame-by-Frame: YOLOv8 supports batch inference, which can improve throughput by processing multiple frames simultaneously. However, if minimizing latency is your priority (e.g., for real-time applications), processing frames one by one might be better. Batch processing is more suitable for offline or high-throughput tasks.

  2. Buffering Frames for Batch Inference: Yes, to use batch processing, you would need to buffer frames from the video stream (e.g., storing 30 frames for a batch size of 30). Once the batch is ready, it can be passed to YOLOv8 for inference. This approach can leverage GPU parallelism but adds latency while the batch fills: at 30 FPS, collecting 30 frames takes about one second before inference can even start.

  3. Dynamic Parameter: If your input resolution is fixed at 640x640, you can set dynamic=False during ONNX export to disable dynamic shapes. This reduces overhead and latency during inference, as the model does not need to handle varying input sizes.

  4. ONNX Runtime or TensorRT: OpenCV DNN can be limiting, as you’ve observed. For better performance and compatibility with CUDA, consider using TensorRT or ONNX Runtime directly. The Ultralytics documentation on exporting models provides guidance on exporting YOLOv8 models for TensorRT.

    For TensorRT, ensure you use the export command with the engine format:

    yolo export model=yolov8x.pt format=engine imgsz=640 batch=8 device=0
    

    For ONNX Runtime, you can use its Python or C++ API to load the exported ONNX model and perform inference.

  5. Video Stream Handling: For video feeds, you’ll need to use a library like OpenCV (for frame extraction) or a custom video processing pipeline to maintain the 30 FPS rate. Once frames are captured, they can be resized to 640x640 and passed to YOLOv8 for inference.
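
As a sketch of the "resize, then pass to YOLOv8" step in point 5, the function below converts one frame that has already been resized to the model size into the planar float layout the exported model takes as input. It uses only the standard library. The function name bgrHwcToRgbChw is hypothetical, and the sketch assumes an 8-bit BGR frame stored as interleaved rows without padding (as a continuous cv::Mat would be), RGB channel order on the model side, and scaling to [0, 1].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts an interleaved 8-bit BGR image (HWC order, no row padding)
// into a planar RGB float buffer (CHW order) scaled to [0, 1].
std::vector<float> bgrHwcToRgbChw(const std::uint8_t* bgr,
                                  std::size_t height,
                                  std::size_t width) {
    assert(bgr != nullptr);
    const std::size_t plane = height * width;  // elements per channel plane
    std::vector<float> chw(3 * plane);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t pixel = y * width + x;
            const std::uint8_t* p = bgr + 3 * pixel;  // p[0]=B, p[1]=G, p[2]=R
            chw[0 * plane + pixel] = p[2] / 255.0f;   // R plane
            chw[1 * plane + pixel] = p[1] / 255.0f;   // G plane
            chw[2 * plane + pixel] = p[0] / 255.0f;   // B plane
        }
    }
    return chw;
}
```

For a 640x640 frame the returned buffer holds 3 * 640 * 640 floats, i.e. one (3, 640, 640) slice of the input tensor. Resizing or letterboxing is deliberately left out and would have to happen before this call.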

For C++ examples specifically, while Ultralytics does not currently provide detailed C++ examples for batch processing with video feeds, you can adapt Python-based pipelines to C++. Libraries like OpenCV and TensorRT provide sufficient APIs to handle video frames and perform inference.
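
To give an idea of what the buffering could look like in C++, here is a minimal, library-independent sketch. FrameBatcher is a hypothetical class, not an Ultralytics, OpenCV or TensorRT API. It assumes each frame has already been pre-processed into a flat float buffer of fixed size and simply concatenates frames until a full batch is ready.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Collects pre-processed frames (flat CHW float buffers of equal size)
// into one contiguous NCHW batch of fixed size.
class FrameBatcher {
public:
    FrameBatcher(std::size_t batchSize, std::size_t frameElems)
        : batchSize_(batchSize), frameElems_(frameElems) {
        assert(batchSize > 0 && frameElems > 0);
        buffer_.reserve(batchSize * frameElems);
    }

    // Appends one frame; returns true once the batch is complete.
    bool push(const std::vector<float>& frame) {
        assert(frame.size() == frameElems_);
        assert(!full());  // take() must be called before pushing again
        buffer_.insert(buffer_.end(), frame.begin(), frame.end());
        return full();
    }

    bool full() const { return buffer_.size() == batchSize_ * frameElems_; }

    // Hands the collected batch to the caller and starts a fresh one.
    std::vector<float> take() {
        std::vector<float> batch;
        batch.swap(buffer_);
        buffer_.reserve(batchSize_ * frameElems_);
        return batch;
    }

private:
    std::size_t batchSize_;
    std::size_t frameElems_;
    std::vector<float> buffer_;
};
```

For the model exported in this thread you would construct it as FrameBatcher(8, 3 * 640 * 640), call push for every captured frame, and pass the result of take() to the inference call whenever push returns true. The frames of one batch end up back to back in memory, which matches the (8, 3, 640, 640) input layout.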

If you encounter issues with OpenCV’s compatibility with recent CUDA versions, I recommend focusing on TensorRT or ONNX Runtime directly for better GPU integration. Let me know if you need more clarification or assistance!

Hello pderrenger,

Thank you very much for your informative reply. I now have a basis for putting everything together.

You’re welcome, rajhlinux! It’s great to hear that my previous replies have been helpful. For your latest query about video input and batch processing with YOLOv8, here are some considerations:

YOLOv8 can process frames in batches to leverage GPU parallelism, which is generally more efficient than processing frames individually. To implement this for a 30 FPS video, you can read and buffer 30 frames from the video stream into a batch and then pass the batch to YOLOv8 for inference. However, keep in mind that batching introduces a trade-off: while it can reduce per-frame processing time, it may also introduce latency due to the time required to accumulate a full batch.

For minimal latency, especially in a live feed scenario, you might consider processing frames one by one. This approach avoids the delay of batching and is simpler to implement.
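
The trade-off can be put into numbers with simple arithmetic. The two helper functions below are hypothetical names used only for illustration. They assume a constant frame rate and ignore inference time itself.

```cpp
#include <cassert>

// Time in milliseconds the first frame of a batch waits until the
// last frame of that batch has arrived from the stream.
double batchFillDelayMs(int batchSize, double fps) {
    assert(batchSize >= 1 && fps > 0.0);
    return (batchSize - 1) * 1000.0 / fps;
}

// Time in milliseconds available to process one batch before the
// next batch is complete, i.e. the budget for keeping up in real time.
double batchBudgetMs(int batchSize, double fps) {
    assert(batchSize >= 1 && fps > 0.0);
    return batchSize * 1000.0 / fps;
}
```

At 30 FPS, a batch of 8 adds roughly 233 ms of waiting for the first frame in the batch, while a batch of 1 adds none. In both cases the pipeline keeps up with the stream only if inference on one batch finishes within the budget, about 267 ms for 8 frames or about 33 ms for a single frame.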

Regarding your concerns about OpenCV on Windows lagging behind recent CUDA releases, you can explore using TensorRT directly for inference, as it is highly optimized for NVIDIA GPUs. TensorRT typically provides better performance than OpenCV DNN when running ONNX models. You can export your YOLOv8 model to TensorRT using the ultralytics library, as explained in the export documentation.

For C++ examples or guidance on TensorRT and ONNX Runtime with YOLOv8, while we don’t currently provide specific examples, the Ultralytics export utility enables you to generate models compatible with these frameworks. You can then integrate them into your custom pipeline using TensorRT or ONNX Runtime APIs.
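
Whichever runtime you choose, a custom pipeline has to decode the raw output itself. As a rough sketch in plain C++: the export log in this thread reports an output shape of (8, 84, 8400), which this example reads as 4 box values plus 80 class scores for each of 8400 candidates per image. Detection and decodeOutput are hypothetical names. The sketch assumes the output of one image has already been copied to float32 memory and that the boxes are stored as centre x, centre y, width, height. Non-maximum suppression is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Detection {
    int classId;
    float score;
    float cx, cy, w, h;  // box centre and size in model-input pixels
};

// Decodes one image's output laid out as [4 + numClasses][numAnchors]:
// rows 0..3 hold cx, cy, w, h; the remaining rows hold per-class scores.
std::vector<Detection> decodeOutput(const float* out,
                                    std::size_t numAnchors,
                                    std::size_t numClasses,
                                    float scoreThreshold) {
    assert(out != nullptr && numClasses > 0);
    std::vector<Detection> detections;
    for (std::size_t a = 0; a < numAnchors; ++a) {
        // Find the best-scoring class for this candidate.
        int bestClass = 0;
        float bestScore = out[4 * numAnchors + a];
        for (std::size_t c = 1; c < numClasses; ++c) {
            const float score = out[(4 + c) * numAnchors + a];
            if (score > bestScore) {
                bestScore = score;
                bestClass = static_cast<int>(c);
            }
        }
        if (bestScore < scoreThreshold) continue;  // drop weak candidates
        detections.push_back(Detection{bestClass, bestScore,
                                       out[0 * numAnchors + a],
                                       out[1 * numAnchors + a],
                                       out[2 * numAnchors + a],
                                       out[3 * numAnchors + a]});
    }
    return detections;
}
```

For a batch, the same function would be called once per image with the pointer advanced by 84 * 8400 floats each time. The surviving boxes still overlap heavily, so a non-maximum suppression pass is needed before drawing or tracking.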

If you encounter further issues with the setup or implementation, feel free to open a new thread. The community and the YOLO ecosystem are always here to support you.