Roast me

Hi! This is my first ever post on a technical forum, and the whole developer thing is uncharted territory. Maybe this isn't even how you use forums, but I'll give it a shot. One year ago I found a passion for computer vision and AI, and I have since been working with an Nvidia Jetson Orin Nano, learning Linux, how to code in Python, etc. Quite early on I decided to make a self-driving golf trolley as a project, because it sounds fun and is something you could actually sell - just something to keep the motivation going. And man, it feels like I've learned so much and been down so many side tracks since then.

But the only colleague I've had is ChatGPT, and while that is a very good coach, it's difficult to know if you're on the right track without a human in the loop. So I wonder if someone could please roast my project a little: check the architecture, the way the code is written, the directory structure, and whether the AI part is done the way you usually do it professionally. I'm not asking for a code review in the sense of finding bugs, just for someone to point out the obvious noob mistakes I've made, because I'm fumbling in the dark a little without human colleagues.

I do have an engineering background, but it's mostly testing and QA with tools other than classic coding. I had some coding in university, but that is long forgotten.

Some info about the project:

The idea is to control the golf trolley with hand gestures, as in telling it to follow you or to stop. For that I use YOLO for person tracking and MediaPipe for gestures. I want the trolley to lock on to the person who holds up an open hand, for example, and once a person is being followed, only that person can stop the trolley by showing an open hand again. After that it looks for the next person who holds up an open hand. This is a work in progress: I have built a little minicart that can move, but it has no steering yet. I just got the person and hand detection working, and I will use that to give the cart something to steer towards and a way to control it.

Github repo: Seb0daniem/GolfTrolley_yolo: Self driving golf trolley with YOLO

Wish you a good 2026!

Fun project, and you’re already past the hardest part (getting something moving + a perception loop running on a Jetson). If you want a “roast” from a robotics/vision POV, the biggest beginner traps in this exact “follow me + gesture control” setup are usually less about model choice and more about system glue.

The main thing I’d challenge is any control logic that reacts to single-frame outputs. A trolley that starts/stops/locks target based on one open-hand detection will feel “possessed” in real life (jitter, false positives, missed frames). Make the behavior a tiny state machine (“searching → candidate → locked → following → stopping”) and require the gesture to be stable for a short window before changing states, plus a cooldown so it can’t flip-flop. This is the difference between a demo and something you can safely test around people.
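
To make that concrete, here is a minimal two-state sketch of the kind of state machine I mean (the state names, hold time, and cooldown are illustrative, not taken from your repo):

import time
from enum import Enum, auto

class State(Enum):
    SEARCHING = auto()   # looking for someone holding up an open hand
    LOCKED = auto()      # following one specific track_id

class TrolleyFSM:
    def __init__(self, hold_s=1.0, cooldown_s=2.0):
        self.state = State.SEARCHING
        self.hold_s = hold_s            # gesture must persist this long to count
        self.cooldown_s = cooldown_s    # ignore gestures right after a transition
        self.locked_id = None
        self.gesture_since = None
        self.last_transition = float("-inf")

    def update(self, open_hand_id, now=None):
        """Call once per frame with the track_id currently showing an open hand, or None."""
        now = time.monotonic() if now is None else now
        if now - self.last_transition < self.cooldown_s:
            return self.state
        # while LOCKED, only the locked person's gesture is relevant
        relevant = open_hand_id is not None and (
            self.state is State.SEARCHING or open_hand_id == self.locked_id)
        if not relevant:
            self.gesture_since = None
            return self.state
        if self.gesture_since is None:
            self.gesture_since = now
        if now - self.gesture_since >= self.hold_s:
            if self.state is State.SEARCHING:
                self.state, self.locked_id = State.LOCKED, open_hand_id
            else:
                self.state, self.locked_id = State.SEARCHING, None
            self.last_transition, self.gesture_since = now, None
        return self.state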

For the “only the person I locked can stop it” requirement, don’t try to solve identity from scratch—just lean on track IDs. Ultralytics YOLO has tracking built-in, so you can lock to a track_id and ignore everyone else until you explicitly release it. The tracking docs under Track mode are the right reference point. Minimal shape of it looks like:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # small "nano" model, a good default for Jetson

for r in model.track(source=0, stream=True, persist=True, tracker="bytetrack.yaml"):
    boxes = r.boxes
    # boxes.id holds track IDs when tracking is enabled (it is None if nothing is tracked)
    if boxes.id is not None:
        track_ids = boxes.id.int().cpu().tolist()
        # gate "lock" + "stop" decisions using a single chosen track_id

On Jetson, the other classic “noob pain” is accidentally building a memory leak / latency spiral by buffering frames or accumulating results. Make sure you’re streaming your inference loop (Ultralytics supports this directly, as shown with stream=True in this robotics primer), and keep your per-frame work bounded (no unbounded lists, no saving every frame to disk while debugging unless you rotate).
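
As a small illustration of “bounded”, reusing model from the snippet above (the deque size and writer settings are just examples, not taken from your repo):

from collections import deque

import cv2

debug_frames = deque(maxlen=300)  # keep only the last ~10 s at 30 fps, never grows unbounded

# open the debug writer once, outside the loop; the frame size must match your camera
writer = cv2.VideoWriter("debug.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30, (640, 480))

for r in model.track(source=0, stream=True, persist=True):
    annotated = r.plot()            # annotated copy of the current frame
    debug_frames.append(annotated)  # old frames are dropped automatically
    writer.write(annotated)         # or disable this entirely outside debug runs

writer.release()                    # finalize the debug file on shutdown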

On the gesture side, MediaPipe is fine, but you might eventually simplify your stack by moving to a single model family. Ultralytics YOLO11 can do hand keypoints via pose (custom-trained), which can make deployment and optimization cleaner on embedded, and the background/context is covered in this YOLO11 hand pose overview. Even if you stick with MediaPipe, I’d still “gate” gesture inference to the locked person’s ROI (crop to that person) instead of running hands on the full frame every time.
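
If you go the ROI route, a rough sketch of that gating with MediaPipe's Hands solution and a YOLO xyxy box for the locked person (names and padding are illustrative):

import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)

def detect_hands_in_roi(frame_bgr, box_xyxy, pad=20):
    """Run MediaPipe Hands only inside the locked person's box (plus some padding)."""
    h, w = frame_bgr.shape[:2]
    x1, y1, x2, y2 = [int(v) for v in box_xyxy]
    x1, y1 = max(x1 - pad, 0), max(y1 - pad, 0)
    x2, y2 = min(x2 + pad, w), min(y2 + pad, h)
    roi = frame_bgr[y1:y2, x1:x2]
    if roi.size == 0:
        return None
    # MediaPipe expects RGB input; OpenCV frames are BGR
    return hands.process(cv2.cvtColor(roi, cv2.COLOR_BGR2RGB))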

If you want more concrete feedback on architecture/directory structure, point me to the current entrypoint you actually run on the Jetson (the one that starts the camera loop + motor commands). That’s usually where the “mixing hardware I/O, vision, and UI in one file” issue shows up first, and it’s the easiest place to suggest a clean split without doing a full code review.

Thank you very much for your input!

Yes, my idea was to require that the hand gesture is shown for ~2 seconds or something. For that I was going to do something like a meter filling up between frames, so that if the gesture accidentally shows up for just a couple of frames, the meter goes down again during the frames where the gesture is no longer there. Once the meter is filled up → track the person.
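
Roughly what I had in mind is something like this (just a sketch of the idea, the frame counts are made up):

class GestureMeter:
    """Fills while the gesture is seen, drains while it isn't."""

    def __init__(self, fill_frames=60, drain_frames=15):
        self.level = 0.0
        self.fill_step = 1.0 / fill_frames    # ~2 s at 30 fps
        self.drain_step = 1.0 / drain_frames  # drains faster than it fills

    def update(self, gesture_seen):
        step = self.fill_step if gesture_seen else -self.drain_step
        self.level = min(max(self.level + step, 0.0), 1.0)
        return self.level >= 1.0  # True once the meter is full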

I just learned about state machines now and it does sound like the clear choice! And the cooldown sounds like a good thing, thank you.

Yes, I'm using YOLO's track mode for the IDs!

The memory leak I'm a bit worried about, because since I use two different models (YOLO and MediaPipe), I can't really use stream=True, right? I feel I need to take it frame by frame and put each frame through an inference pipeline. I do use persist=True, which I read is what you use when you go frame by frame “manually”. I don't know much about memory and wouldn't know where the memory gets lost in that pipeline. I will keep an eye out for that, thank you!

I was happy when I saw that I could do hand landmark detection with YOLO pose. I followed the Ultralytics guide Hand Keypoints Dataset - Ultralytics YOLO Docs. But frankly, the predictions are just not good enough I think; I wasn't able to trust the landmarks that came out of it. So I went with MediaPipe, and yes, it does make the stack a bit messy unfortunately. I was thinking of cropping the person ROI and doing the hand inference on that, but maybe only if I notice that it's needed for latency? Also maybe check hands every 3 frames or something.

The entrypoint, I suppose, is main.py in the repository, where the looping through the video happens. Since I use two different models, I fetch the frame from a video stream module and process each frame. Right now the hardware part is not integrated into the full system; I have only created some separate modules for making it move. The plan is to add the motor control into the state machine soon.

So main.py is basically the following (a rough sketch of the loop is below the list):

  1. Grab a frame
  2. Run yolo track
  3. Run mediapipe
  4. Put the results from that into the state machine
  5. Then I save a video for debugging
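
Roughly this shape (simplified; run_mediapipe, state_machine and draw_debug just stand in for my own modules, the names are not exactly what's in the repo):

import cv2
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
cap = cv2.VideoCapture(0)
# frame size must match the camera resolution
writer = cv2.VideoWriter("debug.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30, (1280, 720))

while True:
    ok, frame = cap.read()                          # 1. grab a frame
    if not ok:
        break
    result = model.track(frame, persist=True)[0]    # 2. YOLO track, frame by frame
    hands = run_mediapipe(frame, result)            # 3. MediaPipe (illustrative helper)
    command = state_machine.update(result, hands)   # 4. feed results into the state machine
    writer.write(draw_debug(frame, result, hands))  # 5. save annotated video for debugging

cap.release()
writer.release()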

I appreciate your input :slight_smile: