“OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training”, Sami Jaghouar, Jack Min Ong, Johannes Hagemann, 2024-07-10:

OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models.

We provide a reproducible implementation of the DiLoCo experiments, offered within a scalable, decentralized training framework built on the Hivemind library.
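For context, a Hivemind-based setup typically wraps a local optimizer in `hivemind.Optimizer`, which averages parameters with peers over a DHT. The sketch below follows the Hivemind quickstart; the model, learning rates, `run_id`, and batch sizes are placeholders, not OpenDiLoCo's actual configuration:

```python
import torch
import hivemind

# Start a DHT node; additional workers join by passing this peer's
# multiaddresses via `initial_peers=...`.
dht = hivemind.DHT(start=True)

model = torch.nn.Linear(512, 512)  # stand-in for the real language model
inner_opt = torch.optim.AdamW(model.parameters(), lr=4e-4)

# Wrap the inner optimizer so parameter averaging happens over the DHT.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="opendiloco_demo",   # placeholder run identifier
    batch_size_per_step=32,     # samples contributed per local step
    target_batch_size=10_000,   # global samples processed before peers average
    optimizer=inner_opt,
    use_local_updates=True,     # step locally; average parameters in background
    matchmaking_time=3.0,
    averaging_timeout=10.0,
)
```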

We demonstrate its effectiveness by training a model across two continents and three countries, while maintaining 90–95% compute utilization. Additionally, we conduct ablation studies focusing on the algorithm’s compute efficiency, scalability in the number of workers, and show that its gradients can be all-reduced using FP16 without any performance degradation.
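Concretely, the quantity being all-reduced in FP16 is DiLoCo's pseudo-gradient: each worker's drift from the last globally shared weights after a burst of local inner steps. The PyTorch sketch below is a simplified illustration, not the OpenDiLoCo code itself; `diloco_sync` and `global_params` are illustrative names, it assumes `torch.distributed` is already initialized, and the outer learning rate 0.7 with Nesterov momentum 0.9 follows the settings reported by the original DiLoCo paper:

```python
import torch
import torch.distributed as dist

# Setup (once): a shared copy of the weights plus DiLoCo's outer optimizer.
# global_params = [torch.nn.Parameter(p.detach().clone()) for p in model.parameters()]
# outer_opt = torch.optim.SGD(global_params, lr=0.7, momentum=0.9, nesterov=True)

@torch.no_grad()
def diloco_sync(model, global_params, outer_opt):
    """One outer step, run after every H local (inner) optimizer steps."""
    world_size = dist.get_world_size()
    for p, g in zip(model.parameters(), global_params):
        # Pseudo-gradient: this worker's drift from the last shared weights.
        delta = (g - p).to(torch.float16)   # FP16 all-reduce, per the ablation
        dist.all_reduce(delta, op=dist.ReduceOp.SUM)
        g.grad = (delta / world_size).to(g.dtype)
    outer_opt.step()                        # outer Nesterov-momentum SGD update
    outer_opt.zero_grad(set_to_none=True)
    for p, g in zip(model.parameters(), global_params):
        p.copy_(g)  # resume inner steps from the updated shared weights
```

Because only this per-parameter difference crosses the network, communication happens once per H inner steps rather than every step, which is what lets geographically separated workers stay busy.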

Furthermore, we scale OpenDiLoCo to 3× the size of the original work, demonstrating its effectiveness for billion-parameter models.