Guidance on Freezing Layers for YOLOv8x-seg Transfer Learning

Hello,

I’m working with the YOLOv8x-seg model (yolov8x-seg.pt), trained for a building-footprint segmentation task in my study area, and I’m currently setting up transfer learning for a different region. I want to freeze the entire backbone during training and have set freeze=12 in my training configuration. This freezes the first 12 layers, but I’m unsure whether that covers all of the backbone layers.

Could someone clarify:

  1. How many layers are there in the backbone of the YOLOv8 segmentation model?
  2. What is the correct freeze value I should use to freeze the entire backbone during training?
  3. Which sections (backbone, neck, head) do the frozen layers belong to when using freeze=12?

Additionally, I have the following questions regarding transfer learning:
  4. Should I freeze only the backbone layers, or should I also consider freezing the neck layers for transfer learning? (What is the correct freeze value to freeze the entire backbone + neck during training?)
  5. How does freezing only the backbone, versus freezing both the backbone and neck layers, impact model accuracy?

Ideally, I’m more interested in increasing the Recall.

Any guidance on these points would be greatly appreciated!
Thanks in advance!

@Bhanu_Prasad_CHINTAK welcome to the forums!

Below is the YAML file for the YOLOv8-seg (segment) models, with a link at the bottom to the GitHub source location. In the backbone you can see there are multiple layers, some with a trailing comment # (i)-P(j)/(k), where i denotes the zero-index of a given layer. The last layer under backbone is the SPPF layer, with an index of 9 (# 9), which means you would use freeze=10 to freeze only the layers in the backbone. For each of the related questions:

  1. There are 10 layers in the backbone of YOLOv8-seg models.
  2. Using model.train(..., freeze=10) (other args still required) will freeze all the layers of the backbone. There are other ways to do this as well, but this is the easiest and shortest; a minimal sketch follows after the YAML.
  3. Using freeze=12 will freeze the entire backbone plus the first two head layers (indices 10 and 11: an nn.Upsample and the first Concat in the head).
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8-seg instance segmentation model. For Usage examples see https://docs.ultralytics.com/tasks/segment

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-seg.yaml' will call yolov8-seg.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024]
  s: [0.33, 0.50, 1024]
  m: [0.67, 0.75, 768]
  l: [1.00, 1.00, 512]
  x: [1.00, 1.25, 512]

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2f, [512]] # 12

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2f, [256]] # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2f, [512]] # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2f, [1024]] # 21 (P5/32-large)

  - [[15, 18, 21], 1, Segment, [nc, 32, 256]] # Segment(P3, P4, P5)

source YAML file
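
For question 2, here’s a minimal sketch of what that training call could look like (the dataset YAML, epochs, and image size are placeholders for your own configuration):

```python
from ultralytics import YOLO

# Load your trained weights (path is a placeholder for your own checkpoint).
model = YOLO("yolov8x-seg.pt")

# freeze=10 freezes backbone layers 0-9, i.e. everything up to and including SPPF.
model.train(data="buildings.yaml", epochs=100, imgsz=640, freeze=10)
```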

Freezing layers, like many other aspects of training neural network models, is highly subjective. Even if you only vary the number of layers frozen, many other variables will still factor into the performance of the final transfer-learned model: chiefly the dataset used to train the “source” model and the dataset for the new model you’re training via transfer learning. The performance of the “source” model is also likely to matter, which is partly an extension of the original dataset, but it also means the original training arguments become a factor.
All of this means you’ll have to test things out yourself. It’s highly empirical, so it’s really not feasible for anyone to know with a high level of certainty what’s “best” for your situation; this is the case with most aspects of neural networks. So, to directly address your questions:

  1. You’ll need to test to see what works “best” for you; of course, this implies you’ve established a definition of what “best” means in your case. You can freeze any number of layers up to 22. The final layer has a zero-index of 22 (it’s the 23rd layer), so freezing everything up to, but not including, the final Segment layer would use freeze=22 for training.
  2. As mentioned earlier, there’s no way for me (or anyone) to tell you a priori how various amounts of layer freezing will impact your model’s performance. The only way to understand this is to test it yourself; a quick sketch of the freeze=22 option follows below.
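
To make the freeze=22 option concrete, here’s a minimal sketch; as before, the dataset and other arguments are placeholders, and the list form at the end is only there in case you ever want finer-grained control (my understanding is that freeze also accepts a list of layer indices):

```python
from ultralytics import YOLO

model = YOLO("yolov8x-seg.pt")

# freeze=22 freezes layers 0-21 (backbone + neck); only the Segment head (index 22) stays trainable.
model.train(data="buildings.yaml", epochs=100, imgsz=640, freeze=22)

# As far as I'm aware, freeze also accepts a list of layer indices,
# e.g. the equivalent of the call above:
# model.train(data="buildings.yaml", epochs=100, imgsz=640, freeze=list(range(22)))
```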

I’ll try to share some input on how you can decide what might make sense to start with; however, know that this comes from my own subjective and limited experience. I’ve only tested layer freezing for a single project, so I can only share what I learned from that experience, plus some additional “textbook” understanding I have.

  • If the “source” model dataset and the new dataset are very similar, you might be able to freeze more layers. For instance, if my “source” model contains book as a class, but I want to train a model to learn phone book, I could probably freeze many if not all the layers in the model.
  • Freezing more layers means your model will likely train faster, so it could make sense to test freezing more first, but if it doesn’t perform well, then I would try freezing 10 layers next.
  • Definitely set a benchmark for what “best” means. This could be something like
    • Training your new model without any transfer learning.
    • Check performance against a model with 22 layers frozen.
    • It could also just be a metric benchmark, something like scoring higher than X on metric Z.
    • You could also try transfer learning from the COCO model weights with the same number of layers frozen, say freeze=10, to compare against transfer learning from your “source” model (see the sketch after this list).
  • Freezing layers might also allow you to train a larger model, since GPU resource usage is reduced when layers are frozen. My project was using an s model with no layers frozen, but I tested freeze=10 with an l model and saw a lot of improvement (I was also using the COCO weights for transfer learning).
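
One way to run that kind of comparison is to sweep a few freeze values against the same data and look at the recall-oriented mask metrics. This is only a rough sketch; the dataset path, epochs, image size, and the chosen freeze values are assumptions to adapt to your setup:

```python
from ultralytics import YOLO

# Sweep a few freeze settings on the same dataset; paths, epochs, and image size
# are placeholders to adapt to your own configuration.
for n_frozen in (0, 10, 22):
    model = YOLO("yolov8x-seg.pt")  # reload the pretrained weights for each run
    model.train(data="buildings.yaml", epochs=100, imgsz=640,
                freeze=n_frozen, name=f"freeze_{n_frozen}")  # separate runs/segment/freeze_* folders
    metrics = model.val(data="buildings.yaml")  # validate the best weights from this run
    # results_dict includes Box/Mask precision, recall, and mAP; the '(M)' keys are the mask metrics.
    print(n_frozen, metrics.results_dict)
```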

Lots of info here, I know, but I hope it’s helpful! It’s also possible that my answer is only part of the whole picture, and there could be other important factors to consider, but I tried to cover the primary ones.

@BurhanQ

Thank you for the detailed and insightful response! I appreciate you taking the time to explain the nuances of freezing layers and how it can affect model performance, especially in the context of transfer learning.

Thanks again for your guidance!


Hello,

Thank you so much for your response earlier. When I tried freeze=10 with my yolov8x-seg.pt weights, I got the output below:
Transferred 657/657 items from pretrained weights

Freezing layer 'model.0.conv.weight'
Freezing layer 'model.0.bn.weight'
Freezing layer 'model.0.bn.bias'
Freezing layer 'model.1.conv.weight'
Freezing layer 'model.1.bn.weight'
Freezing layer 'model.1.bn.bias'
Freezing layer 'model.2.cv1.conv.weight'
Freezing layer 'model.2.cv1.bn.weight'
Freezing layer 'model.2.cv1.bn.bias'
Freezing layer 'model.2.cv2.conv.weight'
Freezing layer 'model.2.cv2.bn.weight'
Freezing layer 'model.2.cv2.bn.bias'
Freezing layer 'model.2.m.0.cv1.conv.weight'
Freezing layer 'model.2.m.0.cv1.bn.weight'
Freezing layer 'model.2.m.0.cv1.bn.bias'
Freezing layer 'model.2.m.0.cv2.conv.weight'
Freezing layer 'model.2.m.0.cv2.bn.weight'
Freezing layer 'model.2.m.0.cv2.bn.bias'
Freezing layer 'model.2.m.1.cv1.conv.weight'
Freezing layer 'model.2.m.1.cv1.bn.weight'
Freezing layer 'model.2.m.1.cv1.bn.bias'
Freezing layer 'model.2.m.1.cv2.conv.weight'
Freezing layer 'model.2.m.1.cv2.bn.weight'
Freezing layer 'model.2.m.1.cv2.bn.bias'
Freezing layer 'model.2.m.2.cv1.conv.weight'
Freezing layer 'model.2.m.2.cv1.bn.weight'
Freezing layer 'model.2.m.2.cv1.bn.bias'
Freezing layer 'model.2.m.2.cv2.conv.weight'
Freezing layer 'model.2.m.2.cv2.bn.weight'
Freezing layer 'model.2.m.2.cv2.bn.bias'
Freezing layer 'model.3.conv.weight'
Freezing layer 'model.3.bn.weight'
Freezing layer 'model.3.bn.bias'
Freezing layer 'model.4.cv1.conv.weight'
Freezing layer 'model.4.cv1.bn.weight'
Freezing layer 'model.4.cv1.bn.bias'
Freezing layer 'model.4.cv2.conv.weight'
Freezing layer 'model.4.cv2.bn.weight'
Freezing layer 'model.4.cv2.bn.bias'
Freezing layer 'model.4.m.0.cv1.conv.weight'
Freezing layer 'model.4.m.0.cv1.bn.weight'
Freezing layer 'model.4.m.0.cv1.bn.bias'
Freezing layer 'model.4.m.0.cv2.conv.weight'
Freezing layer 'model.4.m.0.cv2.bn.weight'
Freezing layer 'model.4.m.0.cv2.bn.bias'
Freezing layer 'model.4.m.1.cv1.conv.weight'
Freezing layer 'model.4.m.1.cv1.bn.weight'
Freezing layer 'model.4.m.1.cv1.bn.bias'
Freezing layer 'model.4.m.1.cv2.conv.weight'
Freezing layer 'model.4.m.1.cv2.bn.weight'
Freezing layer 'model.4.m.1.cv2.bn.bias'
Freezing layer 'model.4.m.2.cv1.conv.weight'
Freezing layer 'model.4.m.2.cv1.bn.weight'
Freezing layer 'model.4.m.2.cv1.bn.bias'
Freezing layer 'model.4.m.2.cv2.conv.weight'
Freezing layer 'model.4.m.2.cv2.bn.weight'
Freezing layer 'model.4.m.2.cv2.bn.bias'
Freezing layer 'model.4.m.3.cv1.conv.weight'
Freezing layer 'model.4.m.3.cv1.bn.weight'
Freezing layer 'model.4.m.3.cv1.bn.bias'
Freezing layer 'model.4.m.3.cv2.conv.weight'
Freezing layer 'model.4.m.3.cv2.bn.weight'
Freezing layer 'model.4.m.3.cv2.bn.bias'
Freezing layer 'model.4.m.4.cv1.conv.weight'
Freezing layer 'model.4.m.4.cv1.bn.weight'
Freezing layer 'model.4.m.4.cv1.bn.bias'
Freezing layer 'model.4.m.4.cv2.conv.weight'
Freezing layer 'model.4.m.4.cv2.bn.weight'
Freezing layer 'model.4.m.4.cv2.bn.bias'
Freezing layer 'model.4.m.5.cv1.conv.weight'
Freezing layer 'model.4.m.5.cv1.bn.weight'
Freezing layer 'model.4.m.5.cv1.bn.bias'
Freezing layer 'model.4.m.5.cv2.conv.weight'
Freezing layer 'model.4.m.5.cv2.bn.weight'
Freezing layer 'model.4.m.5.cv2.bn.bias'
Freezing layer 'model.5.conv.weight'
Freezing layer 'model.5.bn.weight'
Freezing layer 'model.5.bn.bias'
Freezing layer 'model.6.cv1.conv.weight'
Freezing layer 'model.6.cv1.bn.weight'
Freezing layer 'model.6.cv1.bn.bias'
Freezing layer 'model.6.cv2.conv.weight'
Freezing layer 'model.6.cv2.bn.weight'
Freezing layer 'model.6.cv2.bn.bias'
Freezing layer 'model.6.m.0.cv1.conv.weight'
Freezing layer 'model.6.m.0.cv1.bn.weight'
Freezing layer 'model.6.m.0.cv1.bn.bias'
Freezing layer 'model.6.m.0.cv2.conv.weight'
Freezing layer 'model.6.m.0.cv2.bn.weight'
Freezing layer 'model.6.m.0.cv2.bn.bias'
Freezing layer 'model.6.m.1.cv1.conv.weight'
Freezing layer 'model.6.m.1.cv1.bn.weight'
Freezing layer 'model.6.m.1.cv1.bn.bias'
Freezing layer 'model.6.m.1.cv2.conv.weight'
Freezing layer 'model.6.m.1.cv2.bn.weight'
Freezing layer 'model.6.m.1.cv2.bn.bias'
Freezing layer 'model.6.m.2.cv1.conv.weight'
Freezing layer 'model.6.m.2.cv1.bn.weight'
Freezing layer 'model.6.m.2.cv1.bn.bias'
Freezing layer 'model.6.m.2.cv2.conv.weight'
Freezing layer 'model.6.m.2.cv2.bn.weight'
Freezing layer 'model.6.m.2.cv2.bn.bias'
Freezing layer 'model.6.m.3.cv1.conv.weight'
Freezing layer 'model.6.m.3.cv1.bn.weight'
Freezing layer 'model.6.m.3.cv1.bn.bias'
Freezing layer 'model.6.m.3.cv2.conv.weight'
Freezing layer 'model.6.m.3.cv2.bn.weight'
Freezing layer 'model.6.m.3.cv2.bn.bias'
Freezing layer 'model.6.m.4.cv1.conv.weight'
Freezing layer 'model.6.m.4.cv1.bn.weight'
Freezing layer 'model.6.m.4.cv1.bn.bias'
Freezing layer 'model.6.m.4.cv2.conv.weight'
Freezing layer 'model.6.m.4.cv2.bn.weight'
Freezing layer 'model.6.m.4.cv2.bn.bias'
Freezing layer 'model.6.m.5.cv1.conv.weight'
Freezing layer 'model.6.m.5.cv1.bn.weight'
Freezing layer 'model.6.m.5.cv1.bn.bias'
Freezing layer 'model.6.m.5.cv2.conv.weight'
Freezing layer 'model.6.m.5.cv2.bn.weight'
Freezing layer 'model.6.m.5.cv2.bn.bias'
Freezing layer 'model.7.conv.weight'
Freezing layer 'model.7.bn.weight'
Freezing layer 'model.7.bn.bias'
Freezing layer 'model.8.cv1.conv.weight'
Freezing layer 'model.8.cv1.bn.weight'
Freezing layer 'model.8.cv1.bn.bias'
Freezing layer 'model.8.cv2.conv.weight'
Freezing layer 'model.8.cv2.bn.weight'
Freezing layer 'model.8.cv2.bn.bias'
Freezing layer 'model.8.m.0.cv1.conv.weight'
Freezing layer 'model.8.m.0.cv1.bn.weight'
Freezing layer 'model.8.m.0.cv1.bn.bias'
Freezing layer 'model.8.m.0.cv2.conv.weight'
Freezing layer 'model.8.m.0.cv2.bn.weight'
Freezing layer 'model.8.m.0.cv2.bn.bias'
Freezing layer 'model.8.m.1.cv1.conv.weight'
Freezing layer 'model.8.m.1.cv1.bn.weight'
Freezing layer 'model.8.m.1.cv1.bn.bias'
Freezing layer 'model.8.m.1.cv2.conv.weight'
Freezing layer 'model.8.m.1.cv2.bn.weight'
Freezing layer 'model.8.m.1.cv2.bn.bias'
Freezing layer 'model.8.m.2.cv1.conv.weight'
Freezing layer 'model.8.m.2.cv1.bn.weight'
Freezing layer 'model.8.m.2.cv1.bn.bias'
Freezing layer 'model.8.m.2.cv2.conv.weight'
Freezing layer 'model.8.m.2.cv2.bn.weight'
Freezing layer 'model.8.m.2.cv2.bn.bias'
Freezing layer 'model.9.cv1.conv.weight'
Freezing layer 'model.9.cv1.bn.weight'
Freezing layer 'model.9.cv1.bn.bias'
Freezing layer 'model.9.cv2.conv.weight'
Freezing layer 'model.9.cv2.bn.weight'
Freezing layer 'model.9.cv2.bn.bias'
Freezing layer 'model.22.dfl.conv.weight'

Why is it freezing the 'model.22.dfl.conv.weight' layer? Can I omit it? How do I solve it?

@Bhanu_Prasad_CHINTAK the model.22.dfl.conv.weight layer is always frozen. Even when you don’t specify any layers to be frozen, this module will always show as frozen; see the code here. To be 100% honest with you, I haven’t looked into the why for this, but I wanted to let you know it is normal and expected.
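
If you want to confirm exactly which weights end up frozen in your run, one option is a small training callback. This is only a sketch based on my understanding of the callback API (the dataset YAML and epochs are placeholders), so double-check the details:

```python
from ultralytics import YOLO

def report_frozen(trainer):
    # List every parameter tensor that will not receive gradient updates.
    frozen = [name for name, p in trainer.model.named_parameters() if not p.requires_grad]
    print(f"{len(frozen)} frozen parameter tensors")
    print(frozen[:3], "...", frozen[-1])  # the DFL conv weight should appear regardless of the freeze value

model = YOLO("yolov8x-seg.pt")
model.add_callback("on_train_start", report_frozen)  # runs after freezing has been applied
model.train(data="buildings.yaml", epochs=1, freeze=10)  # placeholder args, short run just to inspect
```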

Glad my explanation was helpful and thank you for marking it as solved! I think this is the first thread marked as solved since relaunching the forums :pepejams:


Understood. Thank you so much for your prompt response @BurhanQ .

I am currently facing a new error while running transfer learning with freeze=10 on multiple GPUs. Could you please help me solve it?

  Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
    285/500        12G      1.163      1.991     0.6634      1.029       1068        640: 100%|██████████| 56/56 [00:33<00:00,  1.66it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 32/32 [00:20<00:00,  1.55it
                   all       1022      48870      0.661      0.532      0.599      0.305       0.63      0.492      0.557      0.247
EarlyStopping: Training stopped early as no improvement observed in last 50 epochs. Best results observed at epoch 235, best model saved as best.pt.
To update EarlyStopping(patience=50) pass a new patience value, i.e. `patience=300` or use `patience=0` to disable EarlyStopping.

285 epochs completed in 4.536 hours.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/envs/ultralytics/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/ultralytics/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/opt/conda/envs/ultralytics/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/envs/ultralytics/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/opt/conda/envs/ultralytics/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/opt/conda/envs/ultralytics/lib/python3.8/multiprocessing/reduction.py", line 157, in recvfds
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/envs/ultralytics/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Optimizer stripped from runs/segment/train/weights/last.pt, 144.0MB
Optimizer stripped from runs/segment/train/weights/best.pt, 144.0MB

Validating runs/segment/train/weights/best.pt...
Ultralytics YOLOv8.2.77 🚀 Python-3.8.16 torch-2.0.1+cu117 CUDA:0 (NVIDIA A10G, 22592MiB)
                                                           CUDA:1 (NVIDIA A10G, 22592MiB)
                                                           CUDA:2 (NVIDIA A10G, 22592MiB)
                                                           CUDA:3 (NVIDIA A10G, 22592MiB)
YOLOv8x-seg summary (fused): 295 layers, 71,721,619 parameters, 0 gradients, 343.7 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95):   0%|          | 0/32 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/.config/Ultralytics/DDP/_temp_wm5upll9139621694695552.py", line 13, in <module>
    results = trainer.train()
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 208, in train
    self._do_train(world_size)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 473, in _do_train
    self.final_eval()
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 651, in final_eval
    self.metrics = self.validator(model=f)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/engine/validator.py", line 187, in __call__
    preds = self.postprocess(preds)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/models/yolo/segment/val.py", line 83, in postprocess
    proto = preds[1][-1] if len(preds[1]) == 3 else preds[1]  # second output is len 3 if pt, but only 1 if exported
TypeError: object of type 'NoneType' has no len()
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4173) of binary: /opt/conda/envs/ultralytics/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/envs/ultralytics/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/ultralytics/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/ubuntu/.config/Ultralytics/DDP/_temp_wm5upll9139621694695552.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-16_10:31:32
  host      : ip-172-30-2-70
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4173)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/opt/conda/envs/ultralytics/bin/yolo", line 8, in <module>
    sys.exit(entrypoint())
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/cfg/__init__.py", line 834, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/engine/model.py", line 811, in train
    self.trainer.train()
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 203, in train
    raise e
  File "/opt/conda/envs/ultralytics/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 201, in train
    subprocess.run(cmd, check=True)
  File "/opt/conda/envs/ultralytics/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/conda/envs/ultralytics/bin/python3.8', '-m', 'torch.distributed.run', '--nproc_per_node', '4', '--master_port', '38953', '/home/ubuntu/.config/Ultralytics/DDP/_temp_wm5upll9139621694695552.py']' returned non-zero exit status 1.

Thank you so much for your help.

@Bhanu_Prasad_CHINTAK try updating to ultralytics==8.2.78, as there was an issue with DDP training/validation that might be the source of this problem. If that doesn’t fix it, here are the next few things to try:

  1. Test with a toy dataset like data="coco128-seg.yaml" to verify whether the issue also occurs with a standard dataset (this helps us with testing too); see the sketch after this list.
  2. Create a new virtual environment, install everything fresh, and try again (the toy dataset first and, if that works, your custom dataset as well).
  3. If the above two fail at any point, it would be good to open up a Bug Report Issue on GitHub.
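
For step 1, something along these lines should reproduce your multi-GPU setup against a standard segmentation toy dataset (the device list, epochs, and image size are examples to match to your own environment):

```python
from ultralytics import YOLO

# Re-run the same multi-GPU configuration with a standard toy dataset to rule out
# dataset-specific problems; device indices and epochs are examples.
model = YOLO("yolov8x-seg.pt")
model.train(data="coco128-seg.yaml", epochs=3, imgsz=640, freeze=10, device=[0, 1, 2, 3])
```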

Dear @BurhanQ

Thank you for your prompt response. Could you please clarify whether the training process was completed successfully? My assumption is that the error occurred during the validation phase when checking for accuracy, after the training was halted due to no improvement observed in the last 50 epochs. Based on this, I believe I don’t need to retrain the model. Instead, should I proceed to validate the newly trained model using the model.val() command?

@Bhanu_Prasad_CHINTAK yes, it does appear that the training completed and that the error only occurred during the validation step. It would make sense to use model.val() after loading runs/segment/train/weights/best.pt as your model.
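
Something along these lines should work (the dataset YAML is a placeholder for your own):

```python
from ultralytics import YOLO

# Load the best checkpoint from the interrupted run and validate it explicitly.
model = YOLO("runs/segment/train/weights/best.pt")
metrics = model.val(data="buildings.yaml")  # placeholder dataset YAML
print(metrics.results_dict)  # Box/Mask precision, recall, mAP50, mAP50-95
```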
