Комментарии:
This was interesting. Thanks 👍🏻
ОтветитьI wish I had $2000 to spend to ask a computer what it's name is and to tell me a story
ОтветитьThis configuration for a LLM model will be suitable doing the merge of the machines using Kubernetes?
ОтветитьI do a lot of CFD work using OpenFOAM. It is insanely memory access intensive. Any chance you could run the OpenFOAM benchmark on your cluster ?
ОтветитьWhats to stop the 4090 from using the 1tb of ddr5?
ОтветитьThis video, the explanation and the tests were so well done. First time on your channel, now I have subscribed. Thank you.
ОтветитьWhat is that stackable tower you’re using for 5 minis? Is it custom?
ОтветитьDo you have an airconditioner?
ОтветитьThis guy takes every chance to flash his 4090 lol
ОтветитьThe audio transitions in your video are odd and jarring.
Like this isn't CSI ha
Try the caldigit thunderbolt hub Only thunderbolt ports no display out, I have noticed a much better data rate on it.
Ответитьhow did you get exo running!? i followed all the instructions on the git page. have a m3max mbp any tips would be great!
Ответитьexcuse me, I didn't get it, did you mention the same performance on a 4090 card? otherwise the comparison is pointless without knowing performance per watt
ОтветитьYou can either buy 1 M4 pro mini with 64gb ram and 8tb storage for $5000 or 9 M4 mini with a total combined 144gb ram and 2.3tb storage for $5400. Apple's bean counter computes maths.
ОтветитьHi, Alex, thank you for your excellent sharing. I am wondering that is it possible to apply such clustering to bioinformatics, which need more CPU(threads) and ram, rather than GPU. If yes, how can I set up for two mac mini.
ОтветитьMy god what a geek!😂😂😂 did not understand f*uck all! Ahhahahaha well i just get one base model with 512ssd that’s all😂😂😂😂
ОтветитьNice vid, there r some interesting conclusions, however I am missing direct comparison in the end with the Nvidia's cards like top 4090 u have showed in the video. I think making single video comparing AI performance per watt between M4 Mac Mini (cluster) vs MBP with M4 Max and 4090 would be much more informative and interesting.
ОтветитьYou could have created a connection using thunderbolt in a Mesh,
a -> b, c
b -> c, d
c -> d, e
d -> e, a
e -> a, b
15 Thunderbolt cables, setup as a bridge and just give each device it's own IP address, You could setup. It does mean to get from a to d it has to go through b or c.
tiktok ahh edit
ОтветитьSo i can only imagine how much sonnet is using for something like cursor AI
ОтветитьWould you use a M4 Mini stack to mine crypto?
ОтветитьWould it be possible to daisy chain Macs via thunderbolt? Wouldn’t this network topology allow more elements in the network and at greater speed?
Ответитьhow about 8700g??
ОтветитьCould you daisy chain the mac's 1-2 2-3 3-4 4-5 for its networking or could you do a ring ?
ОтветитьI want a cluster to run DaVinci Resolve on for colour grading. Too often Resolve chokes when it runs out of GPU, and CPU cores to allocate. Waiting for nodes and clips to render interrupts focus. Blackmagic do not facilitate cluster computing in Resolve. It surely someone can figure out how to throw dozens of lower cost Mac’s at Resolve.
ОтветитьGreat video, thank you!
ОтветитьYou need GPUs and RAM for training, that's what people buy them for. Inference can run in your wrist watch if it has enough RAM to handle the model.
ОтветитьGreat video! And I was thinking why my m2 air (8gb) stopped when try to run QwenCode 2.5 14b Lkkkkkkkkkk Just broke... Chatgpt still a good deal if you are no putting sensitive data.
ОтветитьI litearlly self learn how to run ai on my mac becasue i want to make studying materials with it for myself. And now i saw the vidioes you made
ОтветитьMac mini m4 looks nice and fast but i'll wait for a new one next year without the power button on the bottom,
ОтветитьHi Alex, nice job! It's so cool to see those setups. I was wondering if it would be viable to use a LattePanda Sigma as a larger model server. Could that be a potential next project for you?
ОтветитьI thought i'd try and test TPS with an i9-13900k and RTX3060ti 12gb vram 128gb ddr5 4000mt/s(slow ram, bleep quad channel)
root@226408baa9c8:~# ollama run llama3.2-vision "tell me a story about M4 Mac Mini cluster and RTX 3060ti comparison, do we beat 280 tps" --verbose
Here's a story for you:
In the heart of Silicon Valley, there was a team of developers at NovaTech, a cutting-edge AI research firm. They were on a mission to break through the boundaries of machine learning and deep learning performance. The team leader, Alex, had assembled an elite
squad of experts: John, a seasoned software engineer; Rachel, a brilliant data scientist; and Mark, a wizard with hardware.
Their challenge was to optimize a specific workload that required processing vast amounts of data in parallel. They were determined to beat the current benchmark of 280 transactions per second (tps) on a standard dataset.
The team decided to compare two systems: an M4 Mac Mini cluster and a single high-end PC equipped with an NVIDIA RTX 3060 Ti GPU.
*M4 Mac Mini Cluster*
Alex's team set up four M4 Mac Minis, each equipped with:
* Intel Core i9 processor (up to 3.6 GHz)
* 32 GB of DDR4 RAM
* Four 1 TB SSDs in RAID 10 configuration
The team installed a custom-built cluster management system, allowing them to distribute the workload evenly across all four nodes.
*RTX 3060 Ti PC*
John, being the hardware aficionado, opted for a high-end PC:
* Intel Core i9 processor (up to 3.6 GHz)
* 64 GB of DDR4 RAM
* NVIDIA GeForce RTX 3060 Ti GPU with 8 GB GDDR6 memory
The team ran the benchmark on both systems, using the same dataset and workload configuration.
*Results*
After hours of intense computing, the results were in:
M4 Mac Mini Cluster: *310 tps*
RTX 3060 Ti PC: *285 tps*
Alex's team was thrilled – they had beaten the 280 tps benchmark by a significant margin! The M4 Mac Mini cluster performed 30% better than the single high-end PC.
*What made it possible?*
Several factors contributed to this success:
1. **Parallel processing**: By distributing the workload across four nodes, the team was able to take advantage of multi-core processors and significantly improve overall performance.
2. **Custom-built cluster management**: The custom-built system allowed for efficient distribution of tasks, reducing communication overhead and ensuring that all nodes were utilized optimally.
3. **Optimized software**: The team had carefully optimized the software to minimize bottlenecks and maximize throughput.
*Conclusion*
Alex's team at NovaTech had successfully pushed the boundaries of machine learning performance. By leveraging the power of parallel processing and custom-built cluster management, they achieved a remarkable 30% improvement over the single high-end PC equipped with
an RTX 3060 Ti GPU.
The M4 Mac Mini cluster stood as a testament to the power of distributed computing and the potential for innovation in AI research. As Alex's team celebrated their victory, they knew that this was only the beginning – there were still many challenges to overcome,
but they were ready to take on the next frontier!
total duration: 30.436800181s
load duration: 4.443920349s
prompt eval count: 36 token(s)
prompt eval duration: 188ms
prompt eval rate: 191.49 tokens/s
eval count: 618 token(s)
eval duration: 25.803s
eval rate: 23.95 tokens/s
root@226408baa9c8:~# ;) Just a comparison at the end of the day, try open-webui, piplines all in docker.
I would have liked to see it running a model that can´t fit in any single computer, like a 180B parameters model, it may be super slow but it would be a use case for all that ram
Ответитьis no one going to meantion on how his play button is super crooked?
ОтветитьI would like to see the comparison of the system over 10gb lan and thunderbolt
ОтветитьTurn the bottom machine upside down to make sure it can breath.
Ответитьнеплохой английский. акцент практически отсутствует. вы либо переехали в подростковом возрасте либо очень усердно работали
ОтветитьFun fact, WiFi ethernet packets are up to 64kB, by default, with my 2020 wifi 6 router. When streaming videos with DLNA protocol it uses maximum length WiFi ethernet packets.
ОтветитьThank you for this video Alex, it's what I needed to get it. Coles notes for other folks.
If you want the top perf on the highest param models, you need to use CUDA capable cards, aka NVIDIA.
If you want to save energy and you have some mac mini's hanging around they do an OK job and will be most cost effective.
Cheers.!
can apple just make propriety apple silicon GPU's for the Mac Pro already
ОтветитьThis like button is for those that agree his play button is 100% level relative to flat earthers theory.
ОтветитьWasn't Thunderbolt daisy chaining designed to avoid issues with the hub you having ? It should probably handle everything in hardware so shouldn't introduce super delays even if traffic is passing through 3 computers on its way
ОтветитьGreat video Alex! You mention setting up for Jumbo MTU sizes...did that make any difference? In theory you could get 9K packets...but I have my doubts.
ОтветитьAmazing test. 4 x 64GB Mac Mini M4 Pro. Would that run Llama 405B?
ОтветитьThis kind of investigation is great and truly valuable. It’s a significant contribution to the community. This could become a rabbit hole once you start running tests. Thanks for sharing!
ОтветитьThe thumbnail got me instantly, I used to run a grip of Intel Mac Minis at home.
ОтветитьWhat if you got a bunch of thunderbolt to 10gbps adapters and plugged them all into a big switch. Then use link aggregation. If you set it up right it could get up to 40Gbps between any two machines if you had Mac minis with 10Gbps ports + three thunderbolt. Probably super expensive though and there might be thunderbolt to network adapters that can utilize more of the bandwidth than 10Gbps.
Ответить