“OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training”, 2024-07-10:
OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models.
We provide a reproducible implementation of the DiLoCo experiments, offering it within a scalable, decentralized training framework using the Hivemind library.
We demonstrate its effectiveness by training a model across two continents and three countries, while maintaining 90–95% compute utilization. Additionally, we conduct ablation studies focusing on the algorithm’s compute efficiency and its scalability in the number of workers, and we show that its gradients can be all-reduced using FP16 without any performance degradation.
Furthermore, we scale OpenDiLoCo to 3× the size of the original work, demonstrating its effectiveness for billion parameter models.
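The core mechanism behind the abstract can be sketched in a few lines: each worker trains independently for many inner steps, the drift of each replica from the shared parameters is treated as a "pseudo-gradient", those pseudo-gradients are all-reduced (here cast to FP16, as in the ablation mentioned above), and an outer optimizer applies the averaged result. The sketch below is a simplified illustration under assumed names (`diloco_outer_step`, `fp16_allreduce_mean` are not from the paper or the Hivemind API); the real method uses Nesterov momentum for the outer optimizer, while plain SGD is used here for brevity.

```python
import numpy as np

def fp16_allreduce_mean(pseudo_grads):
    # Simulate an FP16 all-reduce: cast each worker's pseudo-gradient to
    # float16 before averaging, then promote back to float32. This mirrors
    # the paper's FP16 communication ablation, not an actual network op.
    return np.mean(
        [g.astype(np.float16) for g in pseudo_grads], axis=0
    ).astype(np.float32)

def diloco_outer_step(global_params, worker_params, outer_lr=0.7):
    # Pseudo-gradient: how far each worker drifted from the shared
    # parameters after its local inner steps (workers train independently
    # between synchronizations, so communication is infrequent).
    pseudo_grads = [global_params - w for w in worker_params]
    avg = fp16_allreduce_mean(pseudo_grads)
    # Outer update (plain SGD here; DiLoCo itself uses Nesterov momentum).
    return global_params - outer_lr * avg

# Toy usage: two workers that drifted by different amounts.
g = np.ones(4, dtype=np.float32)
new_g = diloco_outer_step(g, [g - 0.1, g - 0.3], outer_lr=1.0)
```

With an outer learning rate of 1.0, the shared parameters move to the workers' average, which is the intuition for why infrequent pseudo-gradient averaging can track ordinary data-parallel training.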