InfoTechTarget and Informa Tech's Digital Businesses Combine.

Together, we power an unparalleled network of 220+ online properties covering 10,000+ granular topics, serving an audience of 50+ million professionals with original, objective content from trusted sources. We help you gain critical insights and make more informed decisions across your business priorities.

Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Presented by

Tianle Cai - PhD Candidate, Princeton University, working on efficient inference of large models

About this talk

The inference process in Large Language Models (LLMs) is often limited due to the absence of parallelism in the auto-regressive decoding process, resulting in most operations being restricted by the memory bandwidth of accelerators. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. Recorded at SambaNova Systems, this seminar covers Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel.
SambaNova Systems

SambaNova Systems

457 subscribers62 talks
Generative AI built for the Enterprise
Customers turn to SambaNova to quickly deploy state-of-the-art generative AI capabilities within the enterprise. Our purpose-built enterprise-scale AI platform is the technology backbone for the next generation of AI computing.
Related topics