The new 1-trillion-parameter Kimi K2 Thinking model runs well on two M3 Ultras in its native format with no loss in quality: the model was quantization-aware trained (QAT) at INT4. Here it generated ~3,500 tokens at 15 tokens/sec using pipeline parallelism in mlx-lm:
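For reference, a minimal single-process sketch of generating with the model through mlx-lm's Python API (the `load`/`generate` functions are mlx-lm's real API; the model repo name here is an assumption, and the actual two-machine run used mlx-lm's distributed pipeline-parallel tooling rather than this single-process path):

```python
# Minimal sketch: generate from a Kimi K2 checkpoint with mlx-lm.
# Assumptions: the Hugging Face repo name below is hypothetical, and this
# runs on a single machine; the post's two-M3-Ultra setup used mlx-lm's
# distributed launcher for pipeline parallelism instead.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Kimi-K2-Thinking")  # hypothetical repo

messages = [{"role": "user", "content": "Write a Space Invaders game in Python."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints tokens as they stream, along with tokens/sec stats.
text = generate(model, tokenizer, prompt=prompt, max_tokens=4096, verbose=True)
```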
It generated a fully functional Space Invaders game with no trouble, using only a few hundred thinking tokens and about 3,500 tokens overall, which is quite nice.