@BurhanQ @Jordan_Cooper
I’m working on an object detection system for human intrusion detection. Right now I’m using a YOLO-based model for frame-by-frame detection. The issue I’m facing is false positives when there’s sudden movement in the scene — the model sometimes flags motion artifacts or partial shapes as a person.
To fix this, I want to add temporal context so that the detector remembers information from previous frames and avoids spurious detections. My plan (rough sketch right after this list) was to:
- take the YOLO feature maps or head outputs,
- feed them into an LSTM (or ConvLSTM) to capture temporal information,
- then output detections that are temporally consistent.
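Here's a rough, untested sketch of the wiring I have in mind, in PyTorch. The `ConvLSTMCell` is hand-rolled, the channel count and spatial size are placeholders, and the actual YOLO backbone/head are stubbed out with dummy tensors:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal hand-rolled ConvLSTM cell (no external deps)."""
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        # A single conv computes all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

    def init_state(self, batch, height, width, device="cpu"):
        zeros = torch.zeros(batch, self.hidden_ch, height, width, device=device)
        return (zeros, zeros.clone())

# Toy loop over a 3-frame clip; `features` stands in for one FPN level
# of the YOLO neck output, and `fused` is what would feed the head.
cell = ConvLSTMCell(in_ch=256, hidden_ch=256)
state = cell.init_state(batch=1, height=20, width=20)
for _ in range(3):
    features = torch.randn(1, 256, 20, 20)  # placeholder for real features
    fused, state = cell(features, state)    # state carries across frames
```

The idea is that `fused` replaces the raw features going into the detection head, and `state` is carried from frame to frame.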
Questions:
- Is this approach reasonable for reducing false positives?
- Where exactly is the best place to inject an LSTM (backbone features vs. the detection head)?
- Are there simpler or more robust alternatives — e.g. optical flow, temporal smoothing, or post-processing with a tracker (like ByteTrack/DeepSORT) — instead of modifying YOLO internals? (There's a sketch of the smoothing idea right after this list.)
- For real-time inference, how do people usually maintain LSTM state between frames? (My current guess is in the last snippet below.)
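To make the temporal-smoothing alternative concrete, this is the kind of post-filter I'd otherwise try (untested sketch; the window size, hit count, and IoU threshold are numbers I made up): only report a box once it has been IoU-matched in enough of the last few frames.

```python
from collections import deque

def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class TemporalFilter:
    def __init__(self, window=5, min_hits=3, iou_thresh=0.5):
        self.history = deque(maxlen=window)  # detections from recent frames
        self.min_hits = min_hits
        self.iou_thresh = iou_thresh

    def update(self, boxes):
        """boxes: list of (x1, y1, x2, y2) from the current frame.
        Returns only the boxes supported by enough recent frames."""
        confirmed = []
        for box in boxes:
            hits = sum(
                any(iou(box, prev) >= self.iou_thresh for prev in frame)
                for frame in self.history
            )
            if hits + 1 >= self.min_hits:  # +1 counts the current frame
                confirmed.append(box)
        self.history.append(boxes)
        return confirmed
```

One-off flickers get suppressed because a box that only appears for a single frame never accumulates enough hits, at the cost of a few frames of latency before a real person is confirmed.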
My end goal is a detector that works better on live video streams, not just individual frames. I’m open to either training-time modifications (YOLO+LSTM end-to-end) or post-processing methods if they’re more practical.
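On the practical side, this is how I'm currently imagining the per-stream state handling at inference (a plain `nn.LSTMCell` on pooled features just as a stand-in for whatever recurrent cell ends up in the model; the 256 dimension is arbitrary). Is a dict keyed by camera/stream id the usual pattern, and is resetting on scene cuts enough?

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=256, hidden_size=256)  # stand-in recurrent cell
states = {}  # per-stream (h, c), keyed by camera/stream id

@torch.no_grad()  # inference only, so no autograd graph is kept around
def step(stream_id, pooled_features):
    if stream_id not in states:
        zeros = torch.zeros(1, 256)
        states[stream_id] = (zeros, zeros.clone())  # lazy zero init per stream
    h, c = cell(pooled_features, states[stream_id])
    states[stream_id] = (h, c)  # carry state forward to the next frame
    return h

def reset(stream_id):
    # Call on scene cuts / camera reconnects so stale state doesn't leak in.
    states.pop(stream_id, None)

# Toy usage: two consecutive frames from one camera.
for _ in range(2):
    out = step("cam0", torch.randn(1, 256))
```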
Any guidance or examples would be really appreciated!
Thanks in advance!