“MultiRay: Optimizing Efficiency for Large-Scale AI Models”, Nikhil Gupta, Michael Gschwind, Don Husa, Christopher Dewan, Madian Khabsa, 2022-11-18:

…As part of our push to make our AI systems more efficient, we’ve developed MultiRay, a new platform for running state-of-the-art AI models at scale. MultiRay allows multiple models to run on the same input, and share the majority of the processing costs while incurring only a small per-model cost. Doing this helps us optimize the total cost of performing these AI tasks. We can more easily introduce AI accelerators due to the concentration of company-wide computation into a single model, and we can also trade off between compute power and storage at the company level.

MultiRay’s universal models are trained to perform well across a wide set of tasks and domains. Such a jack-of-all-trades model delivers better quality than the much smaller per-task specialized models we used previously. With MultiRay, teams across Meta [Facebook] can more quickly improve and iterate on machine learning (ML) models for myriad applications, ranging from topic tagging of posts to hate speech detection. These tasks can also be achieved with better efficiency and less human effort than if each team were to build large end-to-end models from scratch.

MultiRay’s first model, TextRay, has been in production since 2020 and supports text understanding applications, such as detecting inauthentic content and improving users’ search experience.

PostRay, MultiRay’s second model, brings together text and image understanding into the same model. Since posts across FB and IG often contain both text and image data, PostRay reduces the need for teams to have their own text and image understanding. PostRay has several use cases across Meta, including topic classification, which is used for Instagram Reels.

PostRay models, because they incorporate cutting-edge research in multiple fields simultaneously, are more complex to train, deploy, and maintain. With MultiRay, we only have to do these tasks a single time, and the whole company reaps the benefits. A centralized system serving a jack-of-all-trades model allows us to work directly with cutting-edge research teams and bring their work to production soon after it is published.

How MultiRay works: MultiRay’s primary aim is to democratize access to large foundational models at Meta. It does so by centralizing execution on accelerators like GPUs and by using a cache to avoid recomputation wherever possible. Currently, MultiRay powers over 125 use cases across Meta, and it supports up to 20 million queries per second (QPS) while serving 800 billion queries per day.
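The cost-sharing idea described above can be illustrated with a minimal sketch. It is not MultiRay's actual code: `SharedEncoderService`, `toy_encoder`, `make_linear_head`, and `score_all` are hypothetical names, and the "large model" is replaced by a trivial feature function. The sketch only shows the shape of the design, where one expensive embedding is computed once per input and several cheap per-task heads reuse it.

```python
from typing import Callable, Dict, List

Embedding = List[float]


class SharedEncoderService:
    """Hypothetical stand-in for a centralized embedding service."""

    def __init__(self, encode: Callable[[str], Embedding]) -> None:
        self._encode = encode
        self._cache: Dict[str, Embedding] = {}
        self.encoder_calls = 0  # counts how often the expensive model runs

    def embed(self, text: str) -> Embedding:
        # Run the expensive encoder only on a cache miss.
        if text not in self._cache:
            self.encoder_calls += 1
            self._cache[text] = self._encode(text)
        return self._cache[text]


def toy_encoder(text: str) -> Embedding:
    """Placeholder for a large universal model (assumption, not a real model)."""
    n = max(len(text), 1)
    return [
        text.count(" ") / n,
        sum(c.isupper() for c in text) / n,
        min(len(text) / 280.0, 1.0),
    ]


def make_linear_head(weights: List[float], bias: float) -> Callable[[Embedding], float]:
    """A small per-task model that consumes the shared embedding."""
    def head(embedding: Embedding) -> float:
        return sum(w * x for w, x in zip(weights, embedding)) + bias
    return head


def score_all(
    service: SharedEncoderService,
    heads: Dict[str, Callable[[Embedding], float]],
    text: str,
) -> Dict[str, float]:
    """Score one input with every task head, paying the encoder cost at most once."""
    embedding = service.embed(text)
    return {name: head(embedding) for name, head in heads.items()}


demo_service = SharedEncoderService(toy_encoder)
demo_heads = {
    "topic": make_linear_head([1.0, 0.0, 0.0], 0.0),
    "spam": make_linear_head([0.0, 2.0, 0.0], 0.1),
}
demo_scores = score_all(demo_service, demo_heads, "An example post")
```

In this sketch, adding another client means adding another entry to `demo_heads`; the encoder call count for a given input does not grow with the number of heads.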

…Large models and latency constraints demand execution on accelerators like GPUs. Accelerators (specialized hardware) are in high demand across Meta, and even with them, state-of-the-art models consume a lot of energy to train and host. MultiRay’s client teams split the bill for training and hosting these large models, as the same hardware and processing can be used multiple times. The shared models are much larger and of higher quality than what each team could have hosted alone. In this case, the whole is greater than the sum of the parts… Since MultiRay is a centralized service used by over 125 clients, improvements benefit all the clients. As a result, MultiRay has become a sandbox for our ML and systems specialists to contribute key optimizations that support the broader PyTorch and accelerator ecosystem. MultiRay, for example, was the first large use case to deploy PyTorch’s BetterTransformer in production at Meta. This brought large capacity savings with no impact on quality.

Cache: Trading off compute and storage: MultiRay uses a cache to avoid recomputation wherever possible. The cache is multilayered to minimize cost and latency: each successive layer offers a higher hit rate at the cost of lower speed. The layers start with a fast but small per-host local cache in the RAM of every MultiRay server and end with a slower but much larger globally distributed cache in flash memory. The MultiRay models are large, and they produce large embeddings (many kilobytes) to preserve universality. For text understanding, these embeddings are much larger than the inputs themselves! Serving an embedding from cache takes less energy than recomputing it, but the cost is not zero.
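The layered lookup described above can be sketched as follows. This is an illustrative two-layer version only: the class names (`LRUCache`, `TieredCache`), the LRU eviction policy, and the use of an in-process object as the "global" layer are assumptions for the example, not details given in the post. A real global layer would be a distributed flash-backed store.

```python
from collections import OrderedDict
from typing import Callable, Dict, Hashable, Optional


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[Hashable, object]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[object]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: object) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the oldest entry


class TieredCache:
    """Check a small fast layer, then a large slow layer, then recompute."""

    def __init__(self, local: LRUCache, remote: LRUCache,
                 compute: Callable[[Hashable], object]) -> None:
        self.local = local      # stands in for per-host RAM
        self.remote = remote    # stands in for the global flash layer
        self.compute = compute  # stands in for running the model
        self.stats: Dict[str, int] = {"local": 0, "global": 0, "computed": 0}

    def get(self, key: Hashable) -> object:
        value = self.local.get(key)
        if value is not None:
            self.stats["local"] += 1
            return value
        value = self.remote.get(key)
        if value is not None:
            self.stats["global"] += 1
            self.local.put(key, value)  # promote into the fast layer
            return value
        value = self.compute(key)       # miss everywhere: pay full cost
        self.stats["computed"] += 1
        self.remote.put(key, value)
        self.local.put(key, value)
        return value
```

The `stats` counters make the trade-off visible: requests that fall out of the small local layer can still be answered by the larger layer without rerunning the model.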

Since the available cache storage is finite, it is not possible to cache results for a long time. MultiRay measures the request patterns across clients to determine the best cache settings (size, time-to-live, update policies) for reducing the total cost of the service. For example, we use these measurements to simulate the energy required for various cache lifetime settings, trading off the cost of recomputing a request on accelerators against the cost of serving it from cache. This feedback loop allows us to improve the efficiency of MultiRay even as client behavior constantly changes.
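A toy version of such a simulation might look like the sketch below. Everything in it is assumed for illustration: the function names, the fixed-expiry TTL policy (entries are not refreshed on a hit), the linear storage cost model, and the cost constants in the usage lines. The post does not describe the actual simulator; the sketch only shows how a measured request trace can be replayed against candidate time-to-live values to compare total cost.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Trace = List[Tuple[float, Hashable]]  # (timestamp in seconds, request key)


def simulate_ttl_cost(
    trace: Trace,
    ttl: float,
    recompute_cost: float,
    serve_cost: float,
    storage_cost_per_entry_second: float,
) -> float:
    """Replay a time-ordered trace and return the total cost for one TTL.

    Assumed cost model: a miss pays `recompute_cost` plus the cost of
    storing the new entry for `ttl` seconds; a hit pays `serve_cost`.
    """
    written_at: Dict[Hashable, float] = {}
    total = 0.0
    for timestamp, key in trace:
        last_write = written_at.get(key)
        if last_write is not None and timestamp - last_write <= ttl:
            total += serve_cost  # entry still fresh: serve from cache
        else:
            total += recompute_cost  # expired or never seen: recompute
            total += storage_cost_per_entry_second * ttl
            written_at[key] = timestamp
    return total


def best_ttl(
    trace: Trace,
    candidates: Iterable[float],
    recompute_cost: float,
    serve_cost: float,
    storage_cost_per_entry_second: float,
) -> float:
    """Return the candidate TTL with the lowest simulated total cost."""
    return min(
        candidates,
        key=lambda ttl: simulate_ttl_cost(
            trace, ttl, recompute_cost, serve_cost, storage_cost_per_entry_second
        ),
    )


# Hypothetical trace and cost constants, chosen only to exercise the code.
example_trace: Trace = [(0, "x"), (10, "x"), (100, "x"), (105, "y"), (110, "x")]
chosen = best_ttl(example_trace, [0, 20, 200, 2000], 10.0, 1.0, 0.01)
```

Rerunning the selection on fresh traces is what turns this into a feedback loop: a very short TTL forces frequent recomputation, a very long one pays for storage that is rarely read, and the cheapest setting shifts as request patterns change.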

…the research from Meta’s Foundational AI Research (FAIR) team that led to its development: