System Freeze When Performing Heavy YOLO Inferencing

Hello,

I have been using YOLO inference for about a year to analyze 24 camera feeds. I have one dedicated computer for this purpose, a new ASUS desktop with powerful specs. Inference runs roughly every 1 second for each of the 24 camera feeds, using the yolo26x.pt model. Each camera is assigned to one of 17 zones, and each zone is a separate process with its own YOLO model loaded. Each zone process then hands off its YOLO model to worker threads that run the inference on the camera feeds. The camera feeds are pulled from a Hikvision system as picture snapshots rather than RTSP, because RTSP is too heavy. These snapshots are fed to YOLO inference every second. After some time, though, the entire computer hard-freezes, and I have to manually shut it down and power it back on. The computer does not freeze if I don't run YOLO inference; when I do, it freezes after some time, usually around the 1-2 hour mark. These are my computer specs and the code in reference:

ASUS ROG G700TF-XS987 - Ultra 9-285K - RTX 5080
ADVANCED LIQUID COOLING
Video Card
NVIDIA® GeForce RTX™ 5080 PRIME w/ 16 GB GDDR7
Processor
Intel® Core Arrow Lake Ultra 9-285K 24 Core - 24 Thread Processor, 3.2 GHz (Max Turbo Frequency 5.7 GHz), 36 MB Smart Cache
Thermal Interface Materials
Stock TIM - Stock thermal compound and thermal pads
Power Supply
ASUS 850W power supply (80+ Gold, peak 900W)
Memory
HIDevolution Approved Standard 64 GB Dual Channel DDR5 6000MHz (2 x 32 GB) - Speeds subject to system capability - installed by HIDevolution

_____________________

RELEVANT CODE (SIMPLIFIED)

import threading
import time
import multiprocessing
import yaml
from ultralytics import YOLO
import torch
import cv2
import logging
import numpy as np   
from multiprocessing import freeze_support
from multiprocessing import Manager

def load_camera_config():
    try:
        with open("configs.yaml", "r") as f:
            return yaml.safe_load(f)
    except Exception as e:
        return {}

#SNAPSHOTS AND QUEUE ARE POPULATED ELSEWHERE
snapshots = None
queued_zones = None
sound_queue = None
manager = None

config = load_camera_config()
cameras_config = config["cameras"]
zones_config = config["zones"]       

def poll_armed_zones_from_ha(previously_armed_zones=None):
    #CODE TO PULL ARMED ZONES, 17 ZONES TOTAL
    return {}       

def monitor_zone_process(zone, queued_zones, sound_queue, snapshots):
    logging.getLogger("ultralytics").setLevel(logging.ERROR)

    local_model = YOLO("yolo26x.pt")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    local_model.model.to(device)

    zone_name = zone["name"]
    camera_keys = zone.get("cameras", [])

    cam_procs = {}

    while True:
        for camera_key in camera_keys:
            if camera_key not in cam_procs or not cam_procs[camera_key].is_alive():
                cam_proc = threading.Thread(
                    target=monitor_camera_using_still_image,
                    args=(camera_key, zone_name, local_model, queued_zones, sound_queue, snapshots),
                    daemon=True
                )
                cam_proc.start()
                cam_procs[camera_key] = cam_proc

        time.sleep(1)  # avoid busy-spinning the thread supervision loop

def monitor_camera_using_still_image(camera_key, zone_name, local_model, queued_zones, sound_queue, snapshots):
    while True:
        image_bytes = snapshots.get(camera_key)

        try:
            np_arr = np.frombuffer(image_bytes, dtype=np.uint8)
            frame = cv2.imdecode(np_arr, cv2.IMREAD_COLOR)
        except Exception as decode_err:
            time.sleep(0.2)
            continue

        if frame is None:  # imdecode returns None (no exception) on bad data
            time.sleep(0.2)
            continue

        results = local_model(frame)
            
        detected = False

        for x1, y1, x2, y2, conf, cls in results[0].boxes.data.tolist():
            if int(cls) != 0:
                continue
            detected = True

        if detected:
            sound_queue.put({"zone": zone_name, "camera": camera_key})
            queued_zones[zone_name] = True
            
        time.sleep(1)  

def main_loop():
    monitor_processes = {}
    previously_armed_zones = set()

    while True:
        armed_zones = poll_armed_zones_from_ha(previously_armed_zones)
        previously_armed_zones = set(armed_zones)
        
        for zone in zones_config:
            zone_name = zone["name"]
            if zone_name in armed_zones:
                if zone_name not in monitor_processes:
                    proc = multiprocessing.Process(target=monitor_zone_process, args=(zone, queued_zones, sound_queue, snapshots), daemon=True)
                    monitor_processes[zone_name] = (proc)
                    proc.start()
        time.sleep(1)

if __name__ == '__main__':
    freeze_support()

    manager = Manager()
    sound_queue = manager.Queue()
    queued_zones = manager.dict()
    
    while True:
        main_loop()

Thanks for the detailed repro — the biggest red flag in this Ultralytics YOLO setup is that each zone process loads yolo26x.pt and then shares that same YOLO instance across multiple threads. That pattern is not thread-safe, as shown in the thread-safe inference guide. On top of that, if several zones are armed at once, you may be loading multiple yolo26x.pt models into one 16 GB GPU, which can push the NVIDIA driver pretty hard.

I’d first retest on the latest ultralytics, then simplify the architecture: ideally use one GPU inference worker process per GPU, load one model there, and feed frames to it through a queue, batching cameras where possible. If you keep threads, don’t pass local_model into multiple threads unless you serialize access:

from ultralytics import YOLO
from ultralytics.utils import ThreadingLocked
import torch

model = YOLO("yolo26s.pt")  # test with s/m first

@ThreadingLocked()
def infer(frame):
    with torch.inference_mode():
        return model.predict(frame, verbose=False, half=True)

Also, a full machine hard-freeze usually points more to GPU driver / VRAM / power / thermal stress than a normal Python bug. If a single-process, single-model, single-camera test still freezes, I’d check GPU temps, VRAM usage, NVIDIA driver, and Windows Event Viewer next.

If you want, I can sketch a safer “1 inference process + camera queue” version of your current design.
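For reference, here is a minimal sketch of that single-inference-worker pattern, with a dummy stand-in for the YOLO model so it runs anywhere; the queue names and the batching logic are assumptions, not from your code:

```python
import multiprocessing as mp

def inference_worker(frame_queue, result_queue):
    # One process owns the GPU: load the model exactly once, here.
    # Real code would do: from ultralytics import YOLO; model = YOLO("yolo26x.pt")
    model = lambda frames: [f"det:{cam}" for cam, _ in frames]  # dummy stand-in

    while True:
        batch = [frame_queue.get()]                  # block until a frame arrives
        while not frame_queue.empty() and len(batch) < 24:
            batch.append(frame_queue.get())          # opportunistically batch cameras
        cams = [cam for cam, _ in batch]
        for cam, det in zip(cams, model(batch)):
            result_queue.put((cam, det))

def main():
    # "spawn" gives the worker a fresh interpreter (and, with CUDA, its own context)
    ctx = mp.get_context("spawn")
    frame_queue, result_queue = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=inference_worker,
                         args=(frame_queue, result_queue), daemon=True)
    worker.start()
    for cam in ("cam1", "cam2"):
        frame_queue.put((cam, b"jpeg-bytes"))        # camera pollers feed frames in
    results = dict(result_queue.get() for _ in range(2))
    worker.terminate()
    return results

if __name__ == "__main__":
    print(main())
```

The key property: only one process ever touches the GPU, so VRAM use is one model's worth regardless of how many zones are armed.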

Those are really good points. I made the following changes based on your own documentation to make it thread-safe, and also downgraded the model used (although x is REALLY good and catches things the other models don't):

#downgraded from x to l
model = YOLO("yolo26l.pt")

#each model instance has a thread lock that is passed to each of the threads
model_lock = threading.Lock()

#within the worker in each thread the lock is used
with model_lock:
    results = local_model(frame)

Instead of the thread-lock decorator I decided to implement my own locking, since I need the locking to happen at the zone level, not globally across all zones. In each zone that has its own model instantiated, I create a lock that is then passed to each of the threads within that zone that share the same model.
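To check that the lock does what I expect, the per-zone pattern can be reduced to this self-contained sketch, where a DummyModel stands in for YOLO (so it runs without a GPU) and records how many threads are ever inside it at once:

```python
import threading

class DummyModel:
    """Stand-in for a YOLO model; records the peak number of concurrent callers."""
    def __init__(self):
        self.active = 0
        self.max_active = 0
        self._guard = threading.Lock()

    def __call__(self, frame):
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        for _ in range(1000):   # simulate some inference work
            pass
        with self._guard:
            self.active -= 1
        return frame

def camera_worker(model, model_lock, n_frames):
    for frame in range(n_frames):
        with model_lock:        # zone-level lock serializes access to the shared model
            model(frame)

def run_zone(model, n_threads=8, n_frames=50):
    model_lock = threading.Lock()   # one lock per zone, shared by that zone's threads
    threads = [threading.Thread(target=camera_worker, args=(model, model_lock, n_frames))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return model.max_active

print(run_zone(DummyModel()))  # prints 1: the lock keeps callers strictly serialized
```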

With these changes, I stress-tested everything at maximum inference across all zones and cameras and saw a peak of 10 GB of GPU VRAM, compared with the 15 GB peak I was getting before. I also think I might have hit the VRAM ceiling and caused a system freeze. Or maybe it was the thread issue. Either way, I've addressed both issues. I'll let you know what happens.

I had another freeze at around 11:37 AM this morning, even with these changes. I analyzed my logs at that time and didn't observe any spikes in GPU or VRAM, so the issue has to be something else ASSOCIATED with YOLO, because it only happens when I do YOLO inferencing. I decided to create a new YOLO model instance FOR EACH thread, rather than sharing one across different threads, in case the sharing across threads was what was causing the crashes. Here are my logs from this last crash, as analyzed by ChatGPT:

I looked at the CSV around the last samples before logging stopped at 2026-03-21 11:37:06.

What stands out:

  • This does not look like a thermal event. GPU temp was only about 47–49°C, and RAM usage stayed around 47.5–48.3%.

  • It also does not look like a system-wide memory exhaustion event. Your Python memory in that final window was roughly 1.5–1.6 GB, and there was no last-second runaway climb.

  • Worker load was steady: alive_worker_count = 7 and yolo_zone_count = 7 the whole time.

  • CPU was high but not extreme: mostly 72–82% in the final minute.

  • NVIDIA GPU utilization was bursty, which is normal for intermittent inference: it jumped between low values and peaks like 55% and 81%, but VRAM stayed basically flat at 6863–6864 MB.

The last recorded samples were basically:

  • 11:36:44 — CPU 82.0%, Python RAM 1540.7 MB, NVIDIA GPU 7%, temp 47°C

  • 11:36:55 — CPU 75.7%, Python RAM 1506.1 MB, NVIDIA GPU 14%, temp 47°C

  • 11:37:00 — CPU 77.2%, Python RAM 1520.0 MB, NVIDIA GPU 10%, temp 47°C

  • 11:37:06 — CPU 74.1%, Python RAM 1516.1 MB, NVIDIA GPU 18%, temp 47°C

So the crash does not appear to have been caused by:

  • overheating,

  • VRAM filling up,

  • system RAM maxing out,

  • or a sudden explosion in worker count.

I am still getting a freeze when using one YOLO model instance per camera thread. There are 24 YOLO instances running concurrently, one per thread, on one NVIDIA 16 GB GPU. I am using yolo26x.pt. I just downgraded to yolo26l.pt to see if it still freezes. Am I missing something here that's causing the freeze? The freeze only happens some time after I start YOLO inferencing. I also switched the NVIDIA driver to the Studio driver; I read that the Game Ready driver could be less stable. I'm running out of options here, and am even considering switching to CPU inferencing, buying another video card, or buying another computer. Not sure what to do.

I got a recommendation from ChatGPT to change threads to processes in Python, because of this:

We suspect your system freezes were caused by unsafe concurrent GPU (CUDA) access from multiple threads. Your script was running YOLO inference in parallel threads within the same process; even though each thread had its own separate YOLO model instance (as recommended by Ultralytics), all threads still shared a single CUDA context. Running up to 24 models simultaneously likely led to race conditions or internal GPU driver deadlocks, since PyTorch GPU operations are not fully thread-safe at that level, resulting in hard system freezes with no logs.

I am going to go ahead and try it and see if the freezes stop.

The freezing is now resolved by changing threads to processes. The problem was that even though I created a new YOLO instance per thread, the underlying CUDA context was still shared, which caused an eventual absolute system freeze under concurrent inference from all the different YOLO threads. When Python processes are used instead, each process gets its own CUDA context. I think Ultralytics needs to absolutely discourage the use of threads altogether; threads should never be used with YOLO inferencing. All these weeks I have been experiencing system-wide freezes that also halted my security systems, just because of this. Not cool.
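For anyone landing here later, the process-per-worker shape that fixed it looks roughly like this (a dummy result stands in for YOLO so the sketch runs without a GPU; in the real code each worker loads its own model after the process starts, which is what gives it its own CUDA context):

```python
import multiprocessing as mp

def camera_worker(camera_key, result_queue):
    # In the real system this process would do:
    #   from ultralytics import YOLO
    #   model = YOLO("yolo26l.pt")   # loaded AFTER the process starts -> own CUDA context
    # A dummy result stands in here so the sketch runs anywhere.
    result_queue.put((camera_key, "ok"))

def main():
    # "spawn" starts each worker in a fresh interpreter, so no CUDA state is inherited
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()
    procs = [ctx.Process(target=camera_worker, args=(key, result_queue), daemon=True)
             for key in ("cam1", "cam2", "cam3")]
    for p in procs:
        p.start()
    results = dict(result_queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(main())
```

The trade-off versus threads is higher per-worker overhead (one interpreter and one model copy each), which is why a single shared inference process is usually the lighter option.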

For a multi-camera setup, it's better to load the model in a different process and make it act as a server, instead of loading it within the same process. Triton Inference Server is one way to do this.

Threading would have probably worked fine if you had used that setup instead