Different results over same training and dataset

I’m new to the community and currently working on training a model for object detection using a custom dataset. So far, things have been going well, but I’ve reached the point where I need to experiment with data augmentation to address some slight color imbalances in my dataset.

Due to the time-consuming nature of the training process, I’ve been running the training on a server as well as occasionally on my personal computer. However, I’ve noticed something puzzling — despite using the same dataset, model (yolo11n.pt), and identical settings, the results differ between the two environments. I’ve double-checked by running the test twice, but I’m still seeing inconsistent outcomes.

I’d appreciate any insights or advice on what might be causing these differences. Thank you in advance for your help!

YOLO11n summary (fused): 238 layers, 2,582,542 parameters, 0 gradients, 6.3 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95):
                   all          5        126      0.356      0.315      0.305      0.137
                1          5         43      0.419      0.209       0.25     0.0974
                2          5         83      0.293       0.42      0.359      0.176
Speed: 0.4ms preprocess, 25.0ms inference, 0.0ms loss, 15.0ms postprocess per image

YOLO11n summary (fused): 238 layers, 2,582,542 parameters, 0 gradients, 6.3 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95):
                   all          5        126      0.442      0.302      0.379      0.157
                1          5         43      0.337      0.302       0.33      0.123
                2          5         83      0.546      0.301      0.428      0.192

Validating runs\detect\deterministic2\weights\best.pt...
Ultralytics 8.3.1 🚀 Python-3.12.5 torch-2.4.1+cu118 CUDA:0 (NVIDIA RTX A3000 Laptop GPU, 6144MiB)
YOLO11n summary (fused): 238 layers, 2,582,542 parameters, 0 gradients, 6.3 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95):
                   all          5        126     0.0179      0.201     0.0398     0.0108
                1          5         43    0.00994      0.186     0.0282    0.00582
                2          5         83     0.0259      0.217     0.0514     0.0158
Speed: 0.4ms preprocess, 2.7ms inference, 0.0ms loss, 1.4ms postprocess per image
Results saved to runs\detect\deterministic2

Ultralytics 8.3.2 🚀 Python-3.8.10 torch-2.4.1+cu121 CPU (Intel Xeon E5-2676 v3 2.40GHz)
YOLO11n summary (fused): 238 layers, 2,582,542 parameters, 0 gradients, 6.3 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95):
                   all          5        126     0.0127      0.143     0.0285     0.0137
                1          5         43    0.00587      0.093    0.00487    0.00136
                2          5         83     0.0196      0.193     0.0522      0.026

At first, I thought the issue might be related to differences between the server and my computer. But after running the test again on each machine (so I now have four results from the same test), all four runs produced different results. Is this expected behavior, or am I missing something?


Hello and welcome to the community! :blush:

It’s great to hear you’re diving into object detection with YOLO. The differences you’re observing can be puzzling, but there are a few common factors that might be causing this:

  1. Random Seed: Ensure that you set a random seed for reproducibility. This helps make runs repeatable, although seeding alone cannot guarantee identical results across different hardware. You can set it in your training script like this:

    import torch
    import random
    import numpy as np
    
    # Seed every RNG involved in training (PyTorch CPU and GPU, Python, NumPy)
    torch.manual_seed(0)
    torch.cuda.manual_seed_all(0)
    random.seed(0)
    np.random.seed(0)
    
  2. Hardware Differences: Variations in hardware, such as GPU vs. CPU or different GPU models, can lead to slight differences in floating-point calculations, which might affect the results.

  3. Software Versions: Double-check that the versions of Python, PyTorch, and other dependencies are consistent across both environments. Even minor version differences can impact results.

  4. Data Augmentation: If you’re using data augmentation, ensure that it’s applied consistently. Augmentation randomness can lead to different training outcomes (see the sketch after this list for one way to pin it down).

  5. Batch Size and Learning Rate: Differences in batch size or learning rate due to hardware constraints can also affect training results.
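
For points 1, 4, and 5, here is a minimal sketch of how these settings could be pinned in the train() call itself (the dataset file name and the specific values below are placeholders, adjust them to your setup):

    from ultralytics import YOLO

    model = YOLO("yolo11n.pt")
    results = model.train(
        data="my_dataset.yaml",   # placeholder: your dataset config
        epochs=100,
        batch=16,                 # keep the batch size identical on both machines
        seed=0,                   # fixed RNG seed (0 is already the default)
        deterministic=True,       # prefer deterministic kernels where available
        workers=0,                # single-process data loading avoids per-worker RNG
        # freeze the random color/geometry augmentations for the comparison
        hsv_h=0.0, hsv_s=0.0, hsv_v=0.0,
        degrees=0.0, translate=0.0, scale=0.0,
        fliplr=0.0, mosaic=0.0,
    )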

For more detailed troubleshooting, you might find our YOLO Common Issues Guide helpful.

If the issue persists, try running the training with logging enabled to capture more details about each run. This can help pinpoint where the differences might be occurring.

Feel free to reach out if you have more questions. Happy training! :rocket:

There are a few likely explanations (assuming your configurations are exactly the same):

  1. It appears that training is running on different devices.

Ultralytics 8.3.2 :rocket:
Python-3.8.10
torch-2.4.1+cu121
CPU (Intel Xeon E5-2676 v3 2.40GHz)

Ultralytics 8.3.1 :rocket:
Python-3.12.5
torch-2.4.1+cu118
CUDA:0 (NVIDIA RTX A3000 Laptop GPU, 6144MiB)

I would not expect the results from CPU and GPU to be exactly the same. This is the most probable reason given the same configuration.

  2. Differences in OSes or underlying infrastructure (dependent libraries or drivers) could lead to differences as well.
    • I have seen a user report this happening with annotations carrying >5 significant figures. This was due to differences in OS rounding, which led to deviation in the model training process.
  3. You could try using the seed argument in your training configuration, but the default is 0, so it’s likely already the same.
  4. (less likely) Differences in Python versions (3.8.10 vs 3.12.5) might result in different outcomes. The same can be said about the Ultralytics versions, but that’s even less likely (it would still be good to use the same version).

I would recommend using the Ultralytics Docker container for consistency in versions and Python environments across devices. This will remove a lot of variables (though not all) when determining what is causing the discrepancy.

I’m not clear on what you mean here. What would help is to understand how you went about testing, step by step, and to see the training configuration/settings you used. Something like:

Train settings

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
results = model.train(
    data=...,
    {{ other custom settings }},
)

Devices

Device A = Server
Output from yolo checks CLI command
{{ device A environment details }}
Device B = My PC
Output from yolo checks CLI command
{{ device B environment details }}

Experiment

  1. Run training with custom settings on both devices.
  2. Record outcomes
    • Results:
    {{ result from Device A }}
    {{ result from Device B }}
    
  3. Transfer weights from Device A to Device B and from Device B to Device A
  4. Run validation using weights A on Device B, record results (a sketch of this call follows after the list)
    • Results:
    {{ validation result with weights A on Device B }}
    
  5. Run validation using weights B on Device A, record results
    • Results:
    {{ validation result with weights B on Device A }}
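
For steps 4 and 5, the cross-check is just a val() run with the other device’s weights; a minimal sketch (the weight and dataset file names here are hypothetical):

from ultralytics import YOLO

# weights copied over from the other device (hypothetical file name)
model = YOLO("best_from_device_A.pt")

# validate on the same dataset split on this device
metrics = model.val(data="my_dataset.yaml")
print(metrics.box.map50, metrics.box.map)  # mAP50 and mAP50-95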
    

Besides precision differences, CPU training also doesn’t use multiple workers, while GPU training does. This matters because each dataloader worker has its own seed, and hence its own RNG state, so the random augmentations differ and the two runs won’t train on exactly the same augmented images.
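
As a standalone illustration of the worker point (plain PyTorch, not Ultralytics-specific), each DataLoader worker process starts from its own seed, so their RNG streams diverge:

import torch
from torch.utils.data import DataLoader, Dataset

class TinyDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx

def report_seed(worker_id):
    # Inside a worker, torch.initial_seed() is base_seed + worker_id,
    # so every worker draws from a different RNG stream.
    print(f"worker {worker_id}: torch.initial_seed() = {torch.initial_seed()}")

if __name__ == "__main__":
    torch.manual_seed(0)
    loader = DataLoader(TinyDataset(), num_workers=2, worker_init_fn=report_seed)
    for _ in loader:
        pass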


Hello pderrenger, Burhan, thanks for the quick replies!
I tried again with the latest version of Ultralytics and it seems to be working correctly now. Thanks for your help!