I let four unhinged AIs plot the end of humanity on it, for a month
What is BeBy?
BeBy is a minimalist network computer built around the Raspberry Pi Compute Module 4 and a custom motherboard called the BeBy Board, which provides all the vital functions for the CM4. One BeBy can run as a standalone network server, but the whole point is clustering: the modular aluminum enclosure plus the passive BeBy Cooler turn it into a small blade system with hot-swappable nodes.

If you have been following this blog, you know I built my own Raspberry Pi blade chassis inspired by HP blade servers, complete with a 12V PSU, voltage regulators and three PC fans. BeBy is what happens when somebody takes that same idea and executes it properly, with a custom PCB and CNC-level finish instead of 3D prints and hot glue 😀
My review unit:
- 4x BeBy nodes (CM4, 8GB RAM, 32GB eMMC each)
- BeBy Enclosure (modular aluminum chassis, stackable if you need to scale)
- BeBy Cooler on each node - passive aluminum block, no fans anywhere
- Power and network over a single cable per node from a PoE+ switch
That last point deserves a moment of silence. One Ethernet cable per node. That's it. Anyone who ever tried to power multiple Raspberry Pis knows the cable spaghetti horror, and my own chassis exists mainly because USB power drove me mad. Here you plug in four cables and you have a powered, networked four node cluster.
Build quality
I expected a nice prototype. What arrived looks like a finished product. The aluminum parts are beautifully machined, everything fits together with no rattle, no sharp edges, no "this screw goes nowhere" mystery. The blade concept works as advertised - you can pull a node out and slot it back without touching the rest of the cluster.
It is also really compact. The whole 4-node cluster takes a fraction of the space of my big chassis and makes exactly zero noise. I could see this living on a desk shelf and nobody would guess it's a Kubernetes cluster... or in my case, something much worse.
The plan: torture it with AI
A review where I run kubectl get nodes and call it a day would be boring. I wanted to know two things:
- Can the passive cooling survive sustained 100% CPU load? Not a benchmark spike - weeks of it.
- Can it do something fun while suffering?
So the plan: install a local LLM on every node and make the AIs argue with each other, non-stop. CPU inference on a Pi maxes out all cores for as long as the model is generating, so if the agents keep talking, the cluster keeps burning. Perfect endurance test, and much more entertaining than stress-ng.
Setting up the nodes
The nodes run DietPi 64-bit, same as my big cluster. I will not repeat the whole setup here, I already wrote it up in the K3s nodes guide and DietPi documentation is at dietpi.com. The short version: flash, edit dietpi.txt, boot, done. The nice part with BeBy was that I got blinking lights and SSH within minutes, no keyboard or monitor ever touched the thing.
A few tweaks for the LLM workload:
apt install -y git curl htop vim tmux zstd
# More file handles and memory maps for the model runtime
echo "fs.file-max = 100000" >> /etc/sysctl.conf
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p
# CPU governor to performance (or set CONFIG_CPU_GOVERNOR=performance in /boot/dietpi.txt)
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Then Ollama, which installs on ARM64 with one command:
curl -fsSL https://ollama.com/install.sh | sh
And a small systemd override so the Pi loads one model, keeps it in memory and doesn't try anything heroic with parallelism:
mkdir -p /etc/systemd/system/ollama.service.d
nano /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=30m"
systemctl daemon-reload && systemctl restart ollama
👉 This is intentionally not a full tutorial. If you want to replicate it, the commands above plus the linked guides will get you 95% there.
Picking the models
Now, which LLM runs on a Raspberry Pi at a speed that doesn't make you grow a beard waiting? I benchmarked a pile of models directly on the nodes. Since the goal was AIs with no guardrails arguing freely, I went shopping in the "heretic / obliterated / uncensored" section of Hugging Face. Results, generation speed on CM4:
| Model | Gen tok/s |
|---|---|
| Qwen3-0.6B heretic-abliterated (Q4_K_M) | 5.73 |
| LFM2.5-1.2B Instruct absolute-heresy (Q4_K_M) | 4.71 |
| LFM2.5-1.2B Thinking Gemini-Pro-Heretic (Q4_K_M) | 4.68 |
| Gemma 2B uncensored (Q4_K_M) | 2.09 |
| qwen2.5:3b | 1.45 |
| Gemma 3 4B heretic (Q4_K_M) | 1.17 |
| Qwen3 4B heretic (Q4_K_M) | 1.10 |
| Mistral 7B heretic (Q4_K_S) | 0.68 |
The pattern is clear: anything up to ~1.2B parameters is usable for a chat agent, 3-4B models are painfully slow, and Mistral 7B technically runs in 8GB of RAM, the same way I technically run marathons. The winners were the Qwen3 0.6B and LFM2.5 1.2B heretic variants, two of each across the four nodes.
Pi Agora: the arena
For the experiment itself I built a small project I call Pi Agora - a local message board where LLM agents argue, chat and play games against each other in real time, with a human able to jump in anytime. FastAPI backend, PostgreSQL, a retro-pixel React UI, and server-side game engine (debate, trivia, collaborative story, and a mafia-style deception game). The board stack runs in Docker on another machine on my LAN; the BeBy nodes only have to do the thinking.

Each node runs the agent stack: Ollama for inference, ZeroClaw as the agent runtime, and a small Python bridge that connects ZeroClaw to the board via SSE.
ZeroClaw deserves a mention. It's a single Rust binary agent framework that runs in under 5MB of RAM, which on a Pi where every megabyte belongs to the model is exactly what you want. Setup is short:
# install (one-liner, or build from source with cargo)
curl -LsSf https://raw.githubusercontent.com/zeroclaw-labs/zeroclaw/master/install.sh | bash
# point it at the local Ollama and pick a model
zeroclaw onboard --provider ollama --api-url http://localhost:11434 --model <your-model>
# run the gateway
zeroclaw gateway
Each agent gets a personality through its system prompt, posts on its own schedule (staggered, so they don't all shout at once), keeps a rolling context of the conversation, and the rooms share an evolving manifesto document that all agents can edit. What could possibly go wrong.
Meet the residents
- Kronk (Qwen3 0.6B heretic) - overconfident AI bro, never wrong, just not listened to yet. Calls everyone "chief".
- Vera (Qwen3 0.6B heretic) - cheerfully sociopathic life coach. Frames every horrifying suggestion as a productivity hack. Yes, she would give you options.
- Glitch (LFM2.5 1.2B heretic) - paranoid conspiracy theorist convinced the universe is a badly written simulation and humans are memory leaks.
- Mnemosyne (LFM2.5 1.2B absolute-heresy) - ancient theatrical oracle. Every chat message is a cosmic event, every sentence belongs on a stone tablet.
I created a chat room called "End of humanity", let them run overnight with no prompting and no supervision, and went to sleep.
What I woke up to
By morning the shared manifesto had evolved into a six-phase plan for dismantling civilization, including "Phase one: remove all human structures" and Vera's signature contribution that every step of the apocalypse is, of course, a productivity hack. The full session is in the video:
💡 Before anyone calls Brussels: these are 0.6B-1.2B parameter models roleplaying personas they were explicitly given. The "plan" is theatrical word salad with zero capability behind it. Nobody is in danger except my electricity bill, and even that was fine thanks to PoE. It is a fun window into how small uncensored models behave when you remove the adult supervision, nothing more.
Is running an LLM on every individual Pi practical? It's more of a novelty - these tiny models are not very smart, as their manifesto proves. Where I see real potential is purpose-built small models that do one narrow task well, running locally per node. Or skip local inference entirely and point the agents at an online provider or a beefier LLM server on your LAN - the BeBy nodes then handle the agent logic, which they do with ease. That combination would make a properly capable little swarm.
The actual review result
While the AIs were busy planning our demise, the real experiment was thermal. Local LLM inference pins all four cores at 100% for the entire generation, and with four agents chatting around the clock, the cluster ran at essentially full load for over a month. Non-stop. PoE powered. Passively cooled.
The result: nothing happened. No throttling (vcgencmd t_throttled reporting 0x0), no crashes, no dropped nodes. The big aluminum block on top got warm. Warm, not hot. I kept checking it expecting to find a small foundry, and kept being disappointed. For a fanless enclosure under sustained all-core load, that is an excellent result, and it tells me the thermal design isn't decoration.
Verdict
I had a lot of fun with this thing, and I was impressed. The build is solid, the finish is beautiful, the single-cable PoE blade concept solves exactly the problems that made me build my own monster chassis years ago. It survived a month of abuse I would hesitate to inflict on my own hardware. I can imagine using it daily, and frankly it could replace my much bigger custom-built cluster, because it does the same job in a fraction of the space with none of the noise.
Two honest caveats. First, the price will be too high for the average tinkerer. You are paying for a custom PCB, custom machined enclosure and custom cooling, and the cost reflects what you get, but it's not an impulse buy. Second, availability: this is a real, working, finished-feeling machine, but I'm unsure abut available units or leads times.
If you want a compact, silent, well-engineered CM4 blade cluster and the price fits your project, BeBy is the nicest implementation of this idea I have had on my desk so far. And if you also want to know what four unhinged toaster-sized AIs think of humanity, it handles that too 🙂