How to add a ConvLSTM to the detection head of yolov11n for temporal context

@BurhanQ @Jordan_Cooper
I’m working on an object detection system for human intrusion detection. Right now I’m using a YOLO-based model for frame-by-frame detection. The issue I’m facing is false positives when there’s sudden movement in the scene — the model sometimes flags motion artifacts or partial shapes as a person.

To fix this, I want to add temporal context so that the detector remembers information from previous frames and avoids spurious detections. My plan was to:

  • take the YOLO feature maps or head outputs,

  • feed them into an LSTM (or ConvLSTM) to capture temporal information (rough sketch after this list),

  • then output detections that are temporally consistent.
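
To make the idea concrete, here's a rough sketch of the kind of ConvLSTM cell I'd insert on a neck-level feature map before the head. This is placeholder code of my own, not anything from the Ultralytics codebase, and the channel/shape numbers are made up:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM cell: LSTM gating with convolutions over feature maps."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        # A single conv emits all four gates (input, forget, cell, output) at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        # x: (B, in_ch, H, W) feature map from one frame
        if state is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
            c = h.clone()
        else:
            h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# per-frame use on a neck-level feature map (shapes are illustrative):
cell = ConvLSTMCell(in_ch=256, hid_ch=256)
state = None
for feat in [torch.randn(1, 256, 20, 20) for _ in range(4)]:  # fake frame features
    fused, state = cell(feat, state)  # fused would go on to the detection head
```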

Questions:

  1. Is this approach reasonable for reducing false positives?

  2. Where exactly is the best place to inject an LSTM (backbone features vs detection head)?

  3. Are there simpler or more robust alternatives — e.g. optical flow, temporal smoothing, or post-processing with a tracker (like ByteTrack/DeepSORT) — instead of modifying YOLO internals?

  4. For real-time inference, how do people usually maintain LSTM state between frames?

My end goal is a detector that works better on live video streams, not just individual frames. I’m open to either training-time modifications (YOLO+LSTM end-to-end) or post-processing methods if they’re more practical.

Any guidance or examples would be really appreciated!

Thanks in advance!

  1. It can work.
  2. It's involved enough that a full walkthrough won't fit in a single reply.
  3. Computing optical flow or a frame difference and feeding it in as a 4th input channel when training YOLO is the simplest option, with the fewest modifications required (first sketch below).
  4. From what I know, you reset the state when there's no activity in the scene, or right after a positive detection has been handled (second sketch below).
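
To illustrate point 3, here's a minimal frame-difference sketch using OpenCV. The video path is a placeholder, and note that YOLO's first conv layer would also need to be changed to accept 4 input channels, which isn't shown here:

```python
import cv2
import numpy as np

def add_motion_channel(frame_bgr, prev_gray):
    """Stack a frame-difference map onto a BGR frame as a 4th channel."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    if prev_gray is None:
        diff = np.zeros_like(gray)  # first frame: no motion information yet
    else:
        diff = cv2.absdiff(gray, prev_gray)
    # HWC uint8 with 4 channels: B, G, R, motion
    return np.dstack([frame_bgr, diff]), gray

# usage over a stream
cap = cv2.VideoCapture("intrusion.mp4")  # placeholder input
prev = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x, prev = add_motion_channel(frame, prev)
    # x is (H, W, 4); feed into a YOLO variant whose stem accepts 4 channels
```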
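
And for point 4, the usual streaming pattern is to carry the recurrent state in a variable across frames and clear it on a trigger. A sketch, where `cell` is a ConvLSTM cell like the one sketched in the question, and `backbone` / `head` stand in for the per-frame feature extractor and the detection head:

```python
import torch

IDLE_RESET = 30  # frames without a detection before resetting (~1 s at 30 fps, tunable)

class TemporalDetector:
    """Carries ConvLSTM state across frames and resets it on inactivity."""

    def __init__(self, backbone, cell, head):
        self.backbone, self.cell, self.head = backbone, cell, head
        self.state = None
        self.idle = 0

    @torch.no_grad()
    def step(self, frame_tensor):
        feat = self.backbone(frame_tensor)               # per-frame features
        fused, self.state = self.cell(feat, self.state)  # temporal fusion
        dets = self.head(fused)
        if len(dets) == 0:
            self.idle += 1
            if self.idle >= IDLE_RESET:
                self.state = None                        # scene went quiet: reset
                self.idle = 0
        else:
            self.idle = 0                                # activity: keep the context
        return dets
```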