Episodes 448
Avg. Duration 16m
Activity Highly Active
Since Aug 2025
Latest Episode Feb 2026

Outreach Signals

Open to Sponsors

Publishing Details

Schedule Hourly
Format Episodic
Hosting anchor.fm

About This Podcast

The transformer architecture revolutionized the world of neural networks. It was a springboard for what we know today as modern artificial intelligence. This podcast reviews state-of-the-art research papers, starting from the transformer and onward.

Recent Episodes

MatFormer: Nested Transformer for Elastic Inference

Feb 28, 2026 20m

In a collaboration between Google DeepMind, the University of Texas at Austin, the University of Washington, and Harvard, published in December 2024, researchers introduce MatFormer, a novel elastic Transformer…

Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models

Feb 28, 2026 17m

Speculative Streaming is a novel inference method designed to accelerate large language model (LLM) generation without the need for traditional auxiliary "draft" models. By integrating multi-stream…

Apple's Mirror Speculative Decoding: Parallel LLM Inference via Heterogeneous Accelerators

Feb 28, 2026 20m

In December 2025, Apple researchers introduced Mirror Speculative Decoding (Mirror-SD), an advanced inference algorithm designed to accelerate large language models by overcoming the sequential…

EAGLE: Evolution of Lossless Acceleration for LLM Inference

Feb 28, 2026 19m

The provided documents describe the development and evolution of EAGLE, a high-efficiency framework designed to accelerate Large Language Model (LLM) inference through speculative sampling. By…

Fast Inference from Transformers via Speculative Decoding

Feb 28, 2026 24m

These sources give a historical review of speculative decoding, an innovative technique designed to accelerate Large Language Model (LLM) inference without reducing output quality. Large models are…

Building Production-Ready Speculative Decoding with TensorRT-LLM

Feb 28, 2026 17m

This article outlines how Baseten optimized speculative decoding using the TensorRT-LLM framework to accelerate model inference. The authors detail overcoming technical hurdles such as inefficient…

QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding

Feb 28, 2026 21m

QuantSpec is a novel self-speculative decoding framework designed to accelerate the inference of Large Language Models, particularly in long-context scenarios. The system addresses memory and latency…

CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation

Feb 28, 2026 22m

The researchers introduce CXL-SpecKV, a specialized architecture designed to overcome the memory bottlenecks of large language model serving by offloading key-value caches to remote memory. By…

Unified Latents (UL): How to train your latents

Feb 28, 2026 19m

In a February 19, 2026 paper, Google DeepMind introduces Unified Latents (UL), a novel framework for generative modeling that jointly trains an encoder, a diffusion prior, and a diffusion decoder.…

MagicDec: Breaking Latency-Throughput Tradeoffs via KV-Compressed Speculative Decoding

Feb 28, 2026 17m

We review an April 3, 2025 research collaboration between CMU, Moffett AI and Together AI which introduces MagicDec, a new framework designed to accelerate the serving of long-context large language…

KV selection algorithms: static (SnapKV) Vs dynamic (PQCache)

Feb 28, 2026 18m

We review three papers on KV cache optimization techniques that use different types of KV selection algorithms: static vs. dynamic. StreamingLLM and SnapKV use static KV…

Adaptive Control for Batched Speculative Decoding in LLM Serving

Feb 28, 2026 18m

We review two papers which examine the integration of speculative decoding and request batching to accelerate Large Language Model (LLM) inference. While both techniques aim to improve GPU hardware…

Optimizing Verification and Efficiency in Multi-Draft Speculative Decoding

Feb 26, 2026 21m

These sources explore advanced techniques for accelerating Large Language Model (LLM) inference through speculative decoding, a process where smaller "draft" models predict tokens for a…

Evaluating Collective Behaviour of Hundreds of LLM Agents

Feb 26, 2026 20m

This collaboration between King’s College London and Google DeepMind, published on February 19, 2026, introduces a novel framework for evaluating the collective behavior…

Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Feb 26, 2026 21m

The February 12, 2026 research from the University of Virginia and Google introduces the deep-thinking ratio (DTR), a novel metric designed to measure the true reasoning effort of large language…

Deep Learning Frameworks for Robust Quadrupedal Locomotion

Feb 26, 2026 21m

These sources detail advanced reinforcement learning frameworks designed to improve how quadruped robots navigate difficult, real-world environments. The first source introduces a…

MEDUSA: Parallel Decoding Heads for Accelerated LLM Inference

Feb 26, 2026 22m

MEDUSA is a novel framework, introduced on June 24, 2024, designed to accelerate Large Language Model (LLM) inference by overcoming the delays caused by sequential token generation. Instead of relying…

Taming the Long-Tail: Efficient Reasoning RL with Adaptive Drafters

Feb 26, 2026 18m

In a paper published January 21, 2026, researchers from MIT and NVIDIA explain how they developed a new system called Taming the Long Tail (TLT) to solve computational inefficiencies in…

FastGRPO: Concurrency-Aware Speculative Decoding for Policy Optimization

Feb 26, 2026 19m

The September 26, 2025 research paper introduces FastGRPO, a high-efficiency framework designed to accelerate the training of large language models using Group Relative Policy Optimization. The…

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Feb 26, 2026 16m

In May 2024, researchers introduced self-speculative decoding, a novel "plug-and-play" inference scheme designed to accelerate Large Language Models (LLMs) without requiring auxiliary models or extra…

Frequently Asked Questions

How many episodes does AI: post transformers have?

AI: post transformers has published 448 episodes since August 2025, covering topics in Technology.

Is AI: post transformers still active?

AI: post transformers is currently highly active, with new episodes published hourly. The average episode length is 16 minutes.

How do I contact AI: post transformers for sponsorship or guest appearances?

Sign up on Grep.FM to access contact details for AI: post transformers, including email and social media links.
