jeanloolz 3 hours ago [-]
I am no expert on the matter, but I always thought ternary weights should be part of the neural net's nature, i.e. trained that way, rather than a compression measure applied at inference. Has any training been done directly on ternary-weight models that has proven effective?
fatihturker 24 hours ago [-]
I've been thinking about whether extreme weight compression could fundamentally change the hardware requirements for large language models.
Most LLM deployments assume large GPU clusters mainly because of memory constraints (VRAM / RAM). But if weights are aggressively compressed — for example using ternary representations ({-1, 0, +1}) — the memory footprint drops dramatically.
In theory this could reduce model size by roughly an order of magnitude compared to FP16 weights.
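A quick back-of-envelope check on that "order of magnitude" claim (numbers are illustrative; I'm assuming 2 bits/weight for practically packed ternary, though the information-theoretic floor is log2(3) ≈ 1.58 bits):

```python
# Back-of-envelope: memory footprint of a 70B-parameter model
# under FP16 vs. packed ternary weights.
# Assumes 2 bits/weight for ternary (4 weights per byte); the
# information-theoretic floor is log2(3) ~= 1.58 bits/weight.
import math

params = 70e9

fp16_gb = params * 2 / 1e9             # 2 bytes per weight
ternary_gb = params * 2 / 8 / 1e9      # 2 bits per weight, packed
ideal_gb = params * math.log2(3) / 8 / 1e9

print(f"FP16:            {fp16_gb:.1f} GB")     # 140.0 GB
print(f"Ternary (2-bit): {ternary_gb:.1f} GB")  # 17.5 GB
print(f"Ternary (ideal): {ideal_gb:.1f} GB")    # ~13.9 GB
print(f"Compression vs FP16: {fp16_gb / ternary_gb:.0f}x")  # 8x
```

So with realistic 2-bit packing you get 8x vs FP16, i.e. roughly an order of magnitude, and slightly more if you approach the 1.58-bit floor.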
If you combine that with:
• dynamic sparsity
• memory-mapped weight streaming from NVMe
• speculative decoding
• fast tensor unpacking on GPU/Metal
it raises an interesting possibility:
Could extremely large models (100B–500B+) become runnable on consumer machines, even if they stream weights from SSD instead of holding everything in RAM?
Of course bandwidth, latency, and compute efficiency become major bottlenecks.
Would love to hear thoughts on whether this approach is realistic or fundamentally limited by bandwidth and compute.
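To make the "fast tensor unpacking" point concrete, here's a toy sketch of 2-bit ternary packing/unpacking (4 weights per byte), the kind of step a runtime would fuse into its matmul kernels. Illustrative only; real formats (e.g. llama.cpp's ternary quant types) use block layouts with per-block scales:

```python
# Toy 2-bit ternary packing: map {-1, 0, +1} -> {0, 1, 2} and store
# 4 weights per byte. Real runtimes use block layouts with scales.

def pack_ternary(weights):
    """Pack a list of {-1, 0, +1} weights into bytes, 4 per byte."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i + 4]):
            b |= (w + 1) << (2 * j)        # 2 bits per weight
        out.append(b)
    return bytes(out)

def unpack_ternary(packed, n):
    """Inverse of pack_ternary: recover the first n weights."""
    return [((byte >> (2 * j)) & 0b11) - 1
            for byte in packed for j in range(4)][:n]

w = [-1, 0, 1, 1, 0, -1, 1]
packed = pack_ternary(w)
assert unpack_ternary(packed, len(w)) == w
print(f"{len(w)} weights -> {len(packed)} bytes")  # 7 weights -> 2 bytes
```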
MrDrMcCoy 6 hours ago [-]
My attempts to try ternary encodings from Unsloth with llama.cpp on ROCm failed miserably. Either ggml or ROCm simply can't run it at this time on gfx1201, and CPU isn't fast enough.
fatihturker 24 hours ago [-]
One question I’m particularly curious about:
At what point does SSD bandwidth become the main bottleneck for inference when weights are heavily compressed? If anyone has experience with streaming layers or low-bit runtimes, would love to hear how you approach it.
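Rough upper bound on that question, assuming a dense model where every weight must be streamed from SSD once per token (hypothetical numbers: ~7 GB/s for a fast PCIe 4.0 NVMe drive, 2-bit ternary packing):

```python
# Upper bound on tokens/sec when all dense-model weights are read
# from SSD once per generated token. Numbers are illustrative.

def max_tokens_per_sec(params, bits_per_weight, ssd_gbps):
    bytes_per_token = params * bits_per_weight / 8  # full weight pass
    return ssd_gbps * 1e9 / bytes_per_token

for p in (100e9, 500e9):
    t = max_tokens_per_sec(p, 2.0, 7.0)
    print(f"{p / 1e9:.0f}B params: <= {t:.2f} tok/s")
# 100B params: <= 0.28 tok/s
# 500B params: <= 0.06 tok/s
```

Which suggests dense SSD streaming is bandwidth-bound almost immediately; sparsity or MoE routing (reading only the active weights per token) seems necessary for usable speeds.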
LargoLasskhyfv 24 hours ago [-]
So MS-BitNet advanced next-gen, or what?
fatihturker 23 hours ago [-]
It’s inspired by ideas similar to BitNet, but I wouldn’t call it “next-gen BitNet.” BitNet focuses mainly on model representation, while OpenGraviton is about inference — pushing the limits of running large models efficiently on consumer hardware. Similar motivation (more efficient models), different layer (inference engine).
I'm curious if anyone here has experimented with:
• ternary / ultra-low-bit networks
• SSD-streamed inference
• sparse LLM architectures
• MoE-style routing combined with quantization
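On the SSD-streaming side, a minimal sketch of the memory-mapped approach: mmap the weight file so layers are paged in from disk on first touch instead of loaded up front (file name and toy layout are made up for the demo; the OS page cache handles residency):

```python
# Minimal mmap-based weight streaming sketch. The file layout here
# (fixed-size layers, no header) is a made-up toy, not a real format.
import mmap
import os
import tempfile

LAYERS, LAYER_BYTES = 4, 1024          # toy layout: fixed-size layers

# Write a toy packed-weight file (layer i filled with byte value i).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    for i in range(LAYERS):
        f.write(bytes([i]) * LAYER_BYTES)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # "Stream" layer 2: only these pages need to be resident in RAM.
    layer = mm[2 * LAYER_BYTES:3 * LAYER_BYTES]
    print(len(layer), layer[0])        # 1024 2
    mm.close()
```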