New and with a project that got dumped in my lap YESTERDAY!

First off, I do IT work, but, not this kind.

I have been seeing YOLO for a while and thought it was neat; I was thinking of using my Raspberry Pi with my Ring camera to count how many cars go up and down the street, but I never got around to it due to workload.

Well, yesterday, the web-developer side of a project I'm helping with (mostly as hardware) turned into a mess, so now I'm basically it (not I.T., it).

So, the idea was, count the vehicles in a specific area, nothing more fancy than that.

To start, I used an Intel Q6600 with no GPU. After getting the installation working, the project I tested was: Create a Car Counter in Python using YOLO and OpenCV.

It ran slowly, as I expected, but it did work as the example video showed.

So I thought this would be a good spot to learn by reading into what was going on, and it made sense.

This is my setup:

An IP Camera (4k 2.8mm, which can be read by RTSP or pull a single frame).
A computer to run this software (currently Q6600 testing, most likely N150 NUC).
A way to send the image (not a stream) somewhere else (somehow; I did say I don't do this normally).

The idea is, to count the vehicles within a masked area, nothing more, nothing less, just get the count of the vehicles in the frame (picture).

I want to pull 1 frame (as a picture) at a time, because these vehicles don't move very fast and aren't always facing forward.

I don't need this in real time; in fact, an N150 NUC would probably be more than sufficient. With Linux on it, it would just be doing that one frame every 5 seconds, and with the Q6600 taking an 800-1500 ms inference cost, the N150 should be 30 to 50% faster, which is much more than I need.

What I aim to do is get the frame and count the vehicles in the masked area. Then I'd need to reduce the image down to something around 320x??? and write the count on top of it. What to do with it after that, I'm not sure yet, but having the count on top of it is all I really need (after it is shrunk, because I'm not uploading a 4K image for that).

So in a nutshell, count cars in a masked picture, shrink it, write the count on it, put it somewhere, wait 3 seconds, repeat.

It was really interesting to see an old (2007) machine actually do this work; not fast (as expected), but it gave me a reference for what I could get away with hardware-wise. (Unless someone has any suggestions otherwise.)

Welcome to the community, and thanks for sharing your project! It’s great that you’ve got a basic counter running and are exploring how to refine it.

For counting vehicles within a specific masked area, you might find our RegionCounter solution quite helpful. This allows you to define one or more polygonal regions of interest and count objects within them. You can see examples and usage in our Object Counting in Different Regions guide.

While many examples process video streams, the RegionCounter operates on individual frames. You can adapt this by loading a single image (e.g., using cv2.imread("your_image.jpg")) and then passing this image frame to an initialized regioncounter instance, like results = regioncounter(your_image_frame). The results.plot_im attribute will then contain the image with detections and counts visually displayed for your defined regions. You can subsequently use OpenCV functions to resize this image. If you need the numerical counts themselves (for example, to draw custom text or send the data elsewhere), these are typically accessible as an attribute of the regioncounter object (e.g., regioncounter.counts).
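As a rough sketch of that single-image flow (the region points and file paths below are placeholders, and this assumes the current solutions.RegionCounter API):

import cv2
from ultralytics import solutions

# Placeholder polygon for the region of interest -- replace with your own coordinates
region_points = [(100, 800), (1200, 800), (1200, 1400), (100, 1400)]

regioncounter = solutions.RegionCounter(
    model="yolo11n.pt",    # lightweight model, good for CPU-only inference
    region=region_points,  # polygonal region where vehicles are counted
    show=False,
)

frame = cv2.imread("your_image.jpg")       # single frame instead of a video stream
results = regioncounter(frame)             # detection + region counting on one image

annotated = results.plot_im                # image with detections and counts drawn
small = cv2.resize(annotated, (320, 180))  # shrink before sending it anywhere
cv2.imwrite("carcount.png", small)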

Regarding your hardware, for the N150 NUC (or even the Q6600 for testing), using a lightweight model like yolo11n.pt is a good choice for faster CPU inference. You can specify this model when you initialize the RegionCounter.

Good luck with your project, and feel free to ask further questions as you progress!

Thank you for the response.

I've been working on the Python script (with a lot of headaches getting the frame from the camera; the methods I tried don't seem to work, so I'm using subprocess to run a shell script that grabs the frame). I'm using the yolov8n.pt I found; not sure if there is a better one to work with? I presume I have 11 installed, since it was a recent install. I am getting an error during the enumeration of the vehicles (because there is no AVX2 on the Q6600), but the test I ran of the video car counting did work (painfully slowly, though).

I have managed to read the image and mask it with the mask (more on that). The test using yolov8n.pt is "working", but cars in the distance are not being detected, and I'm not entirely sure why. I am planning on showing the masked image before it checks it, to see if there is an issue with the mask.

The reason I'm using a mask is that the area I need to check is like a large backwards J, and the camera is in the corner of the bottom half of the J, where it curves to go upward to the top of the J. It is about 8 or 9 feet up; not sure if that is an issue. There are sections the camera can see (it is wide angle) where I don't want the vehicle traffic to be counted. I'm thinking you might be right that yolov8n.pt may not be producing the right results. It is finding a skateboard, but the results show it as a surfboard, plus vehicles in front of the camera about 50 feet away are not being recognized.

I'm also wondering if I need to fine-tune the confidence value I'm limiting it to.

Thanks for the reply!

Hello! Thanks for the detailed update on your project.

It sounds like you’re making good progress. For the issue with distant vehicles not being detected and misclassifications, using yolov8n.pt (which is a YOLOv8 nano model) is a good starting point for speed, but larger models like yolov8s.pt or even yolo11s.pt might provide better accuracy for smaller, more distant objects, and potentially reduce misclassifications. This would come at the cost of some inference speed, but given your 5-second interval, it might be an acceptable trade-off.

Experimenting with the conf argument is definitely a good idea. Lowering it might help detect more objects, but you’ll want to find a balance to avoid too many false positives.
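For example, a quick sketch (the image path here is a placeholder, and 0.25 is just a starting point to experiment from):

from ultralytics import YOLO

model = YOLO("yolo11s.pt")

# Lower conf to pick up more distant vehicles, then raise it if false positives creep in
results = model.predict("frame.jpg", conf=0.25, classes=[2, 5, 7])  # car, bus, truck only
print(len(results[0].boxes), "vehicles detected")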

You can find more information on object counting, which might provide some additional context, in our Object Counting using Ultralytics YOLO11 guide. Keep up the great work!

I did just try yolo11s.pt and it went from 1 car (right) to 3 cars on the next frame (same 1 car) and 5 on the next one (same car again).

I'm wondering if the camera itself (it has settings) is causing this misinterpretation; it has the usual trio of settings, and I wonder what I would need to do to help the detection with the camera's output. I have given it an imgsz=(1280,768) to minimize the aspect-ratio distortion (2560x1440 originally). Any suggestions on what I can do to fix this? Most of the time, very dark vehicles are being merged as one; without a high aerial view it isn't going to count them right, and I'm not really upset about that, but I'm hoping adjusting the camera will help reduce the highly incorrect vehicle counts. As for yolo11s.pt, I really didn't see much difference in speed.

What type of camera stream is it? And what’s the code that you’re using?

Instead of visually masking the image (preprocessing before detection), you could try coordinate-filtering the results. When results are returned, you can count only the detections you care about by using something like this:

import numpy as np
from ultralytics import YOLO

# Load a pretrained YOLO11s model
model = YOLO("yolo11s.pt")

results = model.predict(0)  # inference on camera device '0'

# View results
for r in results:
    xywh = r.boxes.xywh.cpu().numpy()  # (N, 4) array of x-center, y-center, width, height

    # keep boxes with bbox x-center > 300 and y-center > 500
    keep = xywh[(xywh[:, 0] > 300) & (xywh[:, 1] > 500)]  # filter
    ...  # counting logic using `keep` variable, e.g. len(keep)

You can also use imgsz=320 during inference (model.predict(..., imgsz=320)) to lower the resolution of the image used for detection. Note that doing this is likely to impact the detection results, so you'll need to experiment to determine what value(s) work best for your use case.

I’m retrieving frames, not from a stream, the camera offers either RTSP or a single image (frame grab). I opted for the frame grab, as I’m not needing real time (5 second intervals), so even the CPU doesn’t need to be a barn burner.

import cv2
import math
import cvzone
import numpy as np
import requests
import torch
import time
import subprocess
from sort import Sort
from ultralytics import YOLO
from hikvisionapi import Client

# Function to get class colors
def getColours(cls_num):
    base_colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
    color_index = cls_num % len(base_colors)
    increments = [(1, -2, 1), (-2, 1, -1), (1, -1, 2)]
    color = [base_colors[color_index][i] + increments[color_index][i] *
             (cls_num // len(base_colors)) % 256 for i in range(3)]
    return tuple(color)

# Return code for the image-upload script (set by subprocess.call below)
ret = 0

filename = "/home/wipe/Car_Counter/Media/rawframe.jpg"
maskname = "/home/wipe/Car_Counter/Media/rawmask.png"
webname = "/home/wipe/Car_Counter/Media/carcount.png"
maskedname = "/home/wipe/Car_Counter/Media/maskedframe.png"
putimage = "/home/wipe/Car_Counter/putimage.sh"

class_labels = [
    "person", "bicycle", "car", "motorbike", "aeroplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench",
    "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
    "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", 
    "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket",
    "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl",
    "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake",
    "chair", "sofa", "pottedplant", "bed", "diningtable", "toilet", "tvmonitor", "laptop", "mouse", 
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book", 
    "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

# Load region mask
region_mask = cv2.imread(maskname)

# size to shrink to.
down_width = 320
down_height = 180
down_points = (down_width, down_height)
scaled_down = (640,384)

#font
myFont = cv2.FONT_HERSHEY_SIMPLEX
myOrg = (0, 170)
fontScale = 1
fontColor = (255, 255, 255)
myThickness = 2

torch.backends.nnpack.enabled = False

url = "http://192.168.0.64"
usernm = "admin"
passwd = "Password"

i = 1
vcount = 0

cam = Client(url, usernm, passwd)
while(i):
    # Load YOLO model with custom weights
    yolo_model = YOLO("Weights/yolo11s.pt")
    good = False  # reset before the grab so a failed request doesn't reuse the previous frame
    try:
        vid = cam.Streaming.channels[1].picture(method='get', type='opaque_data')
        bytes = b''
        for chunk in vid.iter_content(chunk_size=10240):
            bytes += chunk

        a = bytes.find(b'\xff\xd8')
        b = bytes.find(b'\xff\xd9')
        if a != -1 and b != -1:
            jpg = bytes[a:b+2]
            bytes = bytes[b+2:]
            frame = cv2.imdecode(np.frombuffer(jpg, dtype='uint8'), cv2.IMREAD_COLOR)
            good = True

    except Exception as eRR:
        print(eRR)
        if str(eRR).find("ConnectTimeoutError") != -1:
            time.sleep(30)
            pass
        else:
            break

    if good:
        # Masking and pre-scale frame
        masked_frame = cv2.bitwise_and(frame, region_mask)
        scaled_frame = cv2.resize(masked_frame, scaled_down, interpolation = cv2.INTER_LINEAR)
        # Perform object detection
        detection_results = yolo_model(scaled_frame, imgsz=scaled_down)

        vcount = 0
        for result in detection_results:
            for box in result.boxes:
                x1, y1, x2, y2 = map(int, box.xyxy[0])
                width, height = x2 - x1, y2 - y1
                confidence = math.ceil((box.conf[0] * 100)) / 100
                class_id = int(box.cls[0])
                class_name = class_labels[class_id]
                print(f'{class_name}:{confidence}')
                # get coordinates
                [x1, y1, x2, y2] = box.xyxy[0]
                # convert to int
                x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)

                # get the respective color
                colour = getColours(class_id)

                # draw the rectangle
                cv2.rectangle(masked_frame, (x1, y1), (x2, y2), colour, 2)

                # put the class name and confidence on the image
                cv2.putText(masked_frame, f'{class_name} {confidence:.2f}', (x1, y1), cv2.FONT_HERSHEY_SIMPLEX, 1, colour, 2)

                cv2.imwrite(maskedname, masked_frame)

                if class_name in ["car", "truck", "bus"] and confidence > 0.40:
                    vcount += 1

        txt = "Queue: {}".format(vcount)
        if vcount > 7:
            txt += "+"
        print(txt)

        resized_down = cv2.resize(frame, down_points, interpolation = cv2.INTER_LINEAR)
        if vcount:
            cvzone.putTextRect(resized_down, txt, myOrg, scale=fontScale, thickness=myThickness, colorT=(255, 255, 255), colorR=(185, 114, 17), font=myFont)
        cv2.imwrite(webname, resized_down)
        ret = subprocess.call(['sh', putimage])
        time.sleep(1)

I can't use that idea; the lanes I'm watching and the way the camera was put up cause an odd angle, so I have to block out traffic the camera can see in areas I don't want counted. If it were straight on to the lanes, I could use regions, but it isn't, just a bizarre angle.

I edited the mask some after pushing the image to 1280x768; this seems to catch the vehicles all the way down. The mask also had to be edited a bit because it was catching parked vehicles.

My issue right now is, when a vehicle comes in on the right (entering the bottom of the J at a hard right angle to the camera), it is being seen as both a car and a truck (counted twice), with the boxes practically on top of each other. Is there a way to filter this out?
(Attached screenshot: "Seeing Double")

RTSP can be directly used in Ultralytics.

You also shouldn't be resizing the image manually.

That frame grabbing and writing to disk seems really inefficient and may even outweigh any benefit of frame grabbing over RTSP, given all the overhead.

Also, you shouldn't be loading the model each iteration. It should be outside the loop.

All this drawing code is unnecessary. Ultralytics has a built-in plot() method and a classes filter argument.
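Roughly, putting those points together (the RTSP URL below is a placeholder for your camera's actual stream address):

from ultralytics import YOLO

# Load the model once, outside the loop
model = YOLO("Weights/yolo11s.pt")

# Ultralytics can read the RTSP stream directly; stream=True yields results frame by frame
for result in model.predict(
    "rtsp://admin:Password@192.168.0.64:554/Streaming/Channels/101",  # placeholder URL
    stream=True,
    classes=[2, 5, 7],  # keep only car, bus, truck detections
    conf=0.4,
):
    vcount = len(result.boxes)  # everything left after the class filter is a vehicle
    annotated = result.plot()   # built-in drawing instead of the manual cv2 calls
    # ... resize `annotated`, stamp the count on it, and upload it here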


The main issue with using RTSP is I’d only grab one frame, then close the connection per frame check.

As for resizing it first, I'm doing that because the camera is 4K and the results were not really that good if I didn't specify the size first. I'll try it and see if that makes any difference performance-wise, but the car/truck boxes on top of each other seem to be the bigger issue. Oddly, some colors of cars don't work at a far distance, as the camera distorts the distance.

Thanks for the note about the model; I thought I'd moved that outside the loop but might have forgotten to. As for the plotting, I've changed it so it's only for debugging; it won't do anything if I set the dBug flag I added to False. It's just so I can see the actual identifications.

Hello! Thanks for sharing your code and the detailed description of the issue.

When you see a single object detected as both “car” and “truck” with highly overlapping boxes, it’s often due to the model being slightly uncertain and assigning probabilities to closely related classes. You can often address this by making the Non-Maximum Suppression (NMS) process class-agnostic and adjusting its IoU threshold.

Try modifying your model prediction call like this:

detection_results = yolo_model(scaled_frame, imgsz=scaled_down, agnostic_nms=True, iou=0.5) 

Setting agnostic_nms=True tells the NMS algorithm to suppress overlapping boxes across different classes. In your case, if the “car” detection has higher confidence, it might suppress the “truck” detection for the same object if they overlap significantly. You can also experiment with the iou value (e.g., try values between 0.3 and 0.7) to fine-tune the suppression. A lower iou makes NMS stricter.

This should help reduce such duplicate detections for the same vehicle. Let me know if that improves the results!

Thank you, I was looking at that, but wasn’t 100% sure if that would have gotten me further. I’ve raised the iou to 0.7 and might try higher, as it might be interfering with distant vehicle detection (60 feet away, but the camera makes it look 100 feet away).

So this is what I have and I am running into a serious issue.

torch.backends.nnpack.enabled = False
classids = (2, 5, 7)

        detection_results = yolo_model(scaled_frame, conf=0.25, agnostic_nms=True, iou=0.7, imgsz=scaled_down, max_det=20, classes=classids, stream=True, save=False)
        print("Count Results.")
        vcount = 0
        for result in detection_results:
            for box in result.boxes:
                x1, y1, x2, y2 = map(int, box.xyxy[0])
                width, height = x2 - x1, y2 - y1
                confidence = math.ceil((box.conf[0] * 100)) / 100
                class_id = int(box.cls[0])
                class_name = class_labels[class_id]
                print(f'{class_name}:{confidence}')

This is what I have at the moment. The iou might be causing it to not see distant vehicles; most of the time it's dark-colored vehicles (cars, trucks, buses aka extended vans). For the most part, dark vehicles in the distance either get detected or not: most of the time it sees the vehicle, then the next frame it doesn't, then the next one it does, with the vehicle not moving.

Lately, as of yesterday, the print("Count Results.") is the last thing on the screen, as the for result in detection_results loop locks the computer up, requiring a hardware reset. RAM is at 24%, CPU not at 100% but close, temp is fine. Is there a way to get the results calculated before detection_results is iterated, so I can check whether the results are crashing or something else is freezing the machine? It happens about 1 to 2 hours in.

I’m waiting for another truck to be up in front to see if it is seen as both, guess I should have saved that frame, but I wasn’t fast enough.

If you remove stream=True, it will return the result directly.
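For example (a sketch using the same call from your snippet):

# With stream=True the call returns a lazy generator, so inference only happens while you
# iterate. Dropping it returns a plain list, and all detection work finishes before the loop:
detection_results = yolo_model(
    scaled_frame, conf=0.25, agnostic_nms=True, iou=0.7,
    imgsz=scaled_down, max_det=20, classes=classids,
)
print("Count Results.")           # by this point inference has already completed
for result in detection_results:  # iterating now just reads stored results
    ...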

It's probably related to subprocess. Launching a process for each frame is not really a good approach.