DeepSeek has captured the attention of the AI market with the launch of its open-source foundation models, demonstrating that cutting-edge models can be developed without spending billions of dollars on compute. The most interesting part of these models is not their reasoning performance compared to other leading models like GPT-4o. Instead, it’s the methods by which the models were trained, as described in the DeepSeek-R1 and DeepSeek-V3 papers. If validated, the release has fundamentally changed the perceived costs and the efficient frontier of training LLMs.
DeepSeek cleverly relieved a memory bottleneck in training by lowering the numerical precision, and consequently the memory needed, for certain arithmetic operations in the model. It worked around a networking bandwidth bottleneck by writing custom code in PTX, NVIDIA’s low-level instruction set for its GPUs, to dedicate a portion of the available compute (20 of a chip’s 132 streaming multiprocessors) to inter-server communication.
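To make both ideas concrete, here is a minimal CUDA sketch, not DeepSeek’s actual code (which targets FP8 on H800 GPUs): a dot product that stores operands in FP16 while accumulating in FP32, trading precision for memory traffic, and a toy kernel that uses inline PTX to read the %smid special register, the kind of below-CUDA detail involved in pinning work to specific streaming multiprocessors.

```cuda
// Illustrative sketch only; DeepSeek's production code uses FP8 formats
// and far more sophisticated PTX-level scheduling.
#include <cuda_fp16.h>
#include <cstdio>

// Operands stored in FP16 (half the memory traffic of FP32), with the
// running sum kept in FP32 to contain rounding error. DeepSeek applies
// the same store-low / accumulate-high idea with FP8 operands.
__global__ void dot_fp16(const __half* a, const __half* b, float* out, int n) {
    float acc = 0.0f;  // full-precision accumulator
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += __half2float(a[i]) * __half2float(b[i]);
    atomicAdd(out, acc);  // combine per-thread partial sums
}

// Inline PTX: read which streaming multiprocessor (SM) this block is
// running on. Controlling per-SM work placement requires dropping below
// CUDA C++ to this level of the stack.
__global__ void which_sm(unsigned int* smid_out) {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));  // PTX special register
    if (threadIdx.x == 0) smid_out[blockIdx.x] = smid;
}

int main() {
    const int n = 1 << 20;
    __half *a, *b;
    float* out;
    unsigned int* smids;
    cudaMallocManaged(&a, n * sizeof(__half));
    cudaMallocManaged(&b, n * sizeof(__half));
    cudaMallocManaged(&out, sizeof(float));
    cudaMallocManaged(&smids, 8 * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(0.5f);
        b[i] = __float2half(2.0f);
    }
    *out = 0.0f;

    dot_fp16<<<64, 256>>>(a, b, out, n);
    which_sm<<<8, 32>>>(smids);
    cudaDeviceSynchronize();

    printf("dot = %f (expect %d)\n", *out, n);  // 0.5 * 2.0 summed n times
    for (int i = 0; i < 8; ++i)
        printf("block %d ran on SM %u\n", i, smids[i]);

    cudaFree(a); cudaFree(b); cudaFree(out); cudaFree(smids);
    return 0;
}
```

Even this toy shows the level of the stack where such optimizations live: precision choices are made operation by operation, and SM-level scheduling is invisible to most frameworks.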
Performing GPU optimizations close to the hardware can provide meaningful improvements to the speed and cost of interacting with AI models. However, doing so is outside the capabilities of all but a handful of engineers and researchers.
We’re excited to see CentML’s rapid deployment of DeepSeek’s R1 model on its platform within days of the model’s release. Even with further optimizations still to come, CentML has already delivered inference with leading throughput, latency, and context window size (see the Artificial Analysis benchmarks here). As we wrote when we first partnered with CentML, deep parallel-computing optimization expertise is concentrated in a few teams in the world, and DeepSeek’s recent success shows how critical optimization is for both training and inference of AI models.
CentML was founded on the premise of bringing this expertise to every developer by making efficient AI training and inference broadly accessible. Its platform optimizes across compute, memory, networking, and other constraints to deliver strong performance on all major models and chips. We encourage you to try out their solution.