=== Assistant ===

Here's the context, what the statement actually implies, and why it's a big deal if true:

==== It's a direct reference to the current compiler-and-DSL arms race (Triton, ThunderKittens, Mojo, ThunderOpt, CUTLASS 4, Hidet, etc.) ====

The message is saying the breakthrough isn't a new language. It's that @makora_ai is ''automatically generating state-of-the-art kernels'', with no human-written DSL needed. Meaning:

* Kernels are auto-searched / auto-derived by an ML system.
* It replaces human CUDA/ROCm kernel engineers.
* It works across operators, shapes, and hardware.
* It produces kernels that rival or beat expert-tuned ones.

(A toy version of this kind of configuration search is sketched at the end of this reply.)

This parallels:

* Sakana AI's early kernel optimizer (but more stable).
* The Hidet autotuner, TVM, and auto-scheduled IRs.
* Google's AlphaDev/AlphaTensor search for optimal compute routines.
* Nvidia's "automated heuristics" efforts in the Blackwell compiler.

But here, Makora is claiming real, production-speed kernels, generated automatically.

==== It implies that the Makora-generated kernels, run on ROCm with MI300-series or MI325-class hardware, deliver a performance advantage over H100 for all sizes shown in their benchmark ====

This has several implications.

===== H100 offers: =====

* 3.35 TB/s of HBM3 memory bandwidth (80 GB SXM)
* Strong but inflexible shared memory and warp scheduling

MLA and related attention variants (MQA, GQA, fused attention, sliding-window attention) reward kernels that can exploit:

* large SRAM tiles
* high compute reuse
* high memory-level parallelism
* a flexible scheduler

→ AMD excels here (see the roofline arithmetic at the end of this reply for why memory bandwidth sets the floor).

===== MI300/MI325 offer: =====

* Larger shared memory pools
* Wide RV pipes
* Matrix engines with high FP8/FP6 throughput
* Higher internal bandwidth
* Transparent wavefront-level scheduling

MLA kernels can therefore use:

* larger tiles
* lower register pressure
* higher occupancy
* full bandwidth saturation

(The tile-size sketch at the end of this reply shows the tile-vs-traffic trade that larger shared memory buys.)

This is why AMD often beats H100 on:

* FlashAttention v1/v2/v3
* long-sequence attention
* fused MLP ops
* MoE router kernels
* fused QKV matmuls

===== If that holds, AMD's compiler story gets a huge upgrade: =====

* No more "ROCm is behind CUDA for kernels."
* No more "H100 dominates small/medium shapes."
* No more "AMD is only competitive at batch size 1."

Instead, Makora automatically bridges that gap, without human CUDA/ROCm engineers. That is enormous.

==== Three trends point in this exact direction ====

===== AlphaTensor showed RL can discover matrix decompositions humans never found =====

Makora claims the same for GPU kernels.

===== Auto-generated kernels can exploit wider search spaces than humans will hand-tune =====

===== MLA stress-tests: =====

* shared memory
* dataflow flexibility
* instruction mix
* partial-tile reuse

CDNA is unusually good at all four.

==== If Makora's benchmarks are real ====

Then this is a genuine performance advantage over H100 in one of the most important emerging workloads (MLA, sliding-window attention, long context). This means:

* AMD can leapfrog CUDA in kernel performance without a DSL war.
* ROCm gets turbocharged by algorithmic search, not human hand-tuning.
* MI350/MI355/MI400X could scale this advantage even further.
* Nvidia's moat (compiler + kernel ecosystem) erodes quickly.

If you want, I can break down exactly how MLA stresses H100 vs MI300/MI325 and why AMD wins here, or model what MI450X MLA speedups could look like.
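To make the "auto-searched kernels" idea concrete, here is a minimal sketch of its simplest form: sampled search over tile-size configurations, keeping the fastest candidate. The <code>benchmark</code> cost model below is an assumed stand-in for illustration; real systems (the Hidet autotuner, TVM's auto-scheduler, and presumably Makora) time compiled kernels on hardware and steer the search with learned cost models. None of this is Makora's actual method.

<syntaxhighlight lang="python">
import itertools
import random

random.seed(0)

def benchmark(config):
    """Stand-in for timing a compiled kernel on hardware (assumption: a toy
    analytic cost). Lower is better; configs over the SRAM budget are invalid."""
    bm, bn, bk = config["BLOCK_M"], config["BLOCK_N"], config["BLOCK_K"]
    smem_bytes = (bm * bk + bk * bn) * 2          # two fp16 input tiles per stage
    if smem_bytes > 64 * 1024:                    # toy shared-memory budget
        return float("inf")
    reuse = (bm * bn) / (bm + bn)                 # arithmetic-intensity proxy
    return 1.0 / reuse + random.gauss(0.0, 1e-4)  # noisy, like a real timer

space = {
    "BLOCK_M": [32, 64, 128, 256],
    "BLOCK_N": [32, 64, 128, 256],
    "BLOCK_K": [16, 32, 64],
}
candidates = [dict(zip(space, vals)) for vals in itertools.product(*space.values())]

best = min(random.sample(candidates, k=24), key=benchmark)  # sampled, not exhaustive
print("best config found:", best)
</syntaxhighlight>

The point of the sketch: once the search space and the timer exist, no human kernel engineer is in the loop, which is exactly the claim being made.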
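For the bandwidth point: a back-of-envelope roofline bound shows why long-context attention decode is memory-bound and why the 3.35 TB/s figure sets a hard floor. The workload shape below is an illustrative assumption, not taken from any published benchmark.

<syntaxhighlight lang="python">
HBM_BW = 3.35e12          # H100 SXM HBM3, bytes/s (figure quoted above)
FP16 = 2                  # bytes per element

# Assumed decode-style shapes (illustrative only):
batch, kv_heads, seq, head_dim = 1, 32, 32_768, 128

# Per generated token, the memory-bound core of attention decode reads the
# whole KV cache once: K and V, each [batch, kv_heads, seq, head_dim].
kv_bytes = 2 * batch * kv_heads * seq * head_dim * FP16
t_floor = kv_bytes / HBM_BW

print(f"KV cache: {kv_bytes / 1e9:.2f} GB -> at least {t_floor * 1e6:.0f} us/token")
# No kernel can beat this bound; "winning" means reaching it, which is where
# bigger on-chip tiles and more flexible scheduling pay off.
</syntaxhighlight>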
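And for the shared-memory point: the classic tiled-matmul analysis shows global traffic falling roughly as 1/T for tile side T, while the on-chip footprint grows as T². The shapes and fp16 element size below are illustrative assumptions; the scaling is the standard result, not anything specific to Makora.

<syntaxhighlight lang="python">
def global_traffic_gb(M, N, K, T, elem=2):
    """Global-memory traffic for a tiled matmul C[M,N] = A[M,K] @ B[K,N] that
    computes one TxT output tile at a time, streaming a TxK strip of A and a
    KxT strip of B per tile: total = 2*M*N*K*elem / T bytes."""
    tiles = (M // T) * (N // T)
    per_tile = (T * K + K * T) * elem
    return tiles * per_tile / 1e9

M = N = K = 8192
for T in (32, 64, 128, 256):
    sram_kib = 2 * T * T * 2 / 1024   # rough footprint: two TxT fp16 buffers
    print(f"tile {T:>3}: ~{global_traffic_gb(M, N, K, T):6.1f} GB moved, "
          f"~{sram_kib:5.0f} KiB on-chip")
# Doubling T halves global traffic but quadruples the on-chip footprint; a part
# with a larger shared-memory pool can push this trade further before it runs out.
</syntaxhighlight>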