How I Got a 32GB Local LLM to Run on My 28GB System Memory PC

2026-04-02 · 18m

01 Korean translation · Korean

Running a 32GB Local LLM on a PC with 28GB of Memory: An SSD Streaming Experiment

Source: https://www.youtube.com/watch?v=68MVhAU21ac · Uploaded: 2026-04-02 · Length: 19m · Channel: Onchain AI Garage

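Where the experiment started

In this video I want to share how I ran a 32GB AI model on my PC, which has only 28GB of total memory. On paper that should be impossible, but it worked once I combined a few clever techniques.

The idea came from a tweet by Daniel Isaac that I saw on Twitter. He showed a very clever approach: instead of loading the entire AI model into memory, stream only the parts you need from the SSD. On a MacBook with Apple's M4 Max chip, he reported read speeds of 19.6GB per second from the SSD. That is an impressive number.

About a year ago I bought an SSD so I could install more games, and I used it to test whether the same technique would work on my Windows gaming PC. It turned out that it did.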

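What is an MoE model?

This method works entirely because of MoE (Mixture of Experts) models. Many of the open-source local LLMs released recently use this architecture.

Think of a conventional AI model as one big brain. When a question comes in, every neuron fires to compute the answer. That works, but the larger the model gets, the more compute it needs.

An MoE model is more like a hospital. The hospital has 128 specialists on staff, but each patient only needs to see 8 of them. If you come in with a broken arm, you don't need a dermatologist. MoE works the same way. The model used here has 30 billion (30B) parameters in total, but only 3 billion (3B) are active at a time.

That leads to the key question: if only 8 of the 128 are used at any moment, do all 128 really need to be resident in memory?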
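The "8 of 128" selection can be sketched in a few lines. This is a minimal illustration of top-k routing in one common formulation (softmax over router scores, keep the k largest, renormalize); it is not taken from the video or from the Qwen3 source, and the random logits stand in for what a real router would compute from the token's hidden state.

```python
import math
import random

NUM_EXPERTS = 128   # experts per MoE layer, as stated in the video
TOP_K = 8           # experts activated per token, as stated in the video

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized so they sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical router output for one token (assumption for illustration).
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen = route(logits)
    print("experts used for this token:", sorted(i for i, _ in chosen))
    print("fraction of experts touched:", TOP_K / NUM_EXPERTS)
```

Only the experts returned by `route` need their weights for that token, which is what leaves room to keep the other 120 somewhere slower.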

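AI models keep getting bigger

Everyone has hit this problem at some point. AI models are big, and they get bigger with every release. A 7B-parameter model takes about 4GB, a 13B model about 8GB, and once you reach the 30B class the footprint depends on quantization but usually takes a substantial amount of VRAM.

My GPU is an RTX 3060. It has only 12GB of VRAM, so running a 32GB model sounds absurd. But Daniel's idea stuck in my head, so I decided to try it.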

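A three-tier memory hierarchy

The system I designed is a three-tier memory hierarchy.

GPU VRAM (12GB): The fastest storage. The attention layers and routing logic go here, that is, the "brain that decides which experts to call." The core of the model lives here and calls the experts.
System RAM (16GB): The RAM on the CPU side of my PC. It is slower than the GPU but larger, and it acts as a kind of cache. Think of it as a "waiting room" where frequently called experts stand by.
SSD (2TB): The largest tier but also the slowest. All 128 experts are stored here.

When the model needs an expert, it checks RAM first. If the expert is already loaded, it is used immediately; if not, it is loaded from the SSD. It is the same principle as Netflix: you don't download the whole library, you stream only what you need.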
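The "check RAM first, otherwise go to the SSD" path can be sketched as a small least-recently-used cache. This is a toy model only: in the setup described here the OS page cache does this work at page granularity, and the class name `ExpertCache` and the `fake_ssd_read` helper are hypothetical names made up for the illustration.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy RAM tier: keeps at most `capacity` experts, evicting the least recently used."""

    def __init__(self, capacity, load_from_ssd):
        self.capacity = capacity
        self.load_from_ssd = load_from_ssd  # callable: expert_id -> weights
        self.resident = OrderedDict()       # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1
        weights = self.load_from_ssd(expert_id)   # slow path: read from disk
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)     # evict least recently used
        return weights

def fake_ssd_read(expert_id):
    """Stand-in for a real disk read; returns placeholder 'weights'."""
    return f"weights-for-expert-{expert_id}"

if __name__ == "__main__":
    cache = ExpertCache(capacity=2, load_from_ssd=fake_ssd_read)
    for expert_id in [3, 7, 3, 9, 7]:
        cache.get(expert_id)
    print("hits:", cache.hits, "misses:", cache.misses)
```

An expert that keeps getting requested stays resident, while one that has not been touched for a while is the first to be dropped when space runs out.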

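Hardware specs and the bottleneck

GPU: NVIDIA RTX 3060, 12GB VRAM
CPU RAM: 16GB
SSD: Sabrent Rocket 4.0 2TB NVMe

The SSD is rated at 5GB per second, but because of the M.2 adapter in my PC it actually stayed around 1.5GB per second. I swapped adapters a few times without much improvement. The SSD also holds important project files from months of work, so I didn't want to handle it too roughly. Those of you who build PCs may be able to tell what I did wrong.

In any case, I decided to push ahead at 1.5GB/s, less than half of the 3.5GB/s I had originally expected.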

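llama.cpp and the mmap trick

The tool I used was llama.cpp. It is an open-source LLM inference engine that is very popular in the local LLM community. It can run on the CPU, the GPU, or a mix of both, and it is something of an alternative to Ollama.

What proved decisive was a flag that lets you specify where each tensor is placed. I set it up like this:

Attention layers (the core reasoning logic) → GPU
Expert weights → CPU side

Keeping the experts on the CPU side does not mean loading them all into RAM. This is where mmap (memory-mapped files) comes in. The operating system maps the model file on the SSD into virtual memory and loads only the parts that are actually accessed, on demand, from the SSD. As a result, the SSD becomes part of the memory hierarchy.

Windows automatically caches frequently used experts in RAM and streams the rest from the SSD when needed. In effect, the operating system acts as an "expert cache manager" for free.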
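To make the mmap idea concrete, here is a minimal sketch using Python's standard `mmap` module rather than llama.cpp itself. The file layout (16 fixed-size "experts", each filled with its own index byte) is an assumption invented for the demo; a real model file is laid out very differently. The point is only that mapping the file does not read it, and slicing one region touches just the pages behind that region.

```python
import mmap
import os
import tempfile

EXPERT_BYTES = 4096   # hypothetical size of one "expert" slice
NUM_EXPERTS = 16      # kept small so the demo file stays tiny

def write_fake_model(path):
    """Write a file where expert i is EXPERT_BYTES copies of the byte value i."""
    with open(path, "wb") as f:
        for i in range(NUM_EXPERTS):
            f.write(bytes([i]) * EXPERT_BYTES)

def read_expert(mapped, expert_id):
    """Slice one expert out of the mapping; only the touched pages need to be read."""
    start = expert_id * EXPERT_BYTES
    return mapped[start:start + EXPERT_BYTES]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake_model.bin")
        write_fake_model(path)
        with open(path, "rb") as f:
            # Length 0 maps the whole file; ACCESS_READ works on Windows and POSIX.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                chunk = read_expert(mapped, 5)
                print("mapped bytes:", len(mapped), "first bytes of expert 5:", chunk[:4])
```

Which pages stay in RAM afterwards is decided by the operating system, not by this code, which is the "free cache manager" behaviour described above.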

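Test model: Qwen3 30B A3B

The model I tested is Qwen3 30B A3B. It is an open-source MoE model with 30 billion parameters, 128 experts, and 8 experts active at a time.

The model file breaks down as follows:

Attention and routing logic (shared parameters): about 3GB → fits comfortably on the GPU
Expert weights: about 29GB → nearly twice the CPU RAM (16GB)
Total: 32GB

My PC's total memory (GPU + CPU RAM) is 28GB, so in principle this model should not run. I tried it anyway.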
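The numbers above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the video; the function name `memory_budget` is made up for illustration, and it ignores the RAM that Windows and other applications take, so the real gap is larger than what it prints.

```python
def memory_budget(vram_gb, ram_gb, shared_gb, experts_gb):
    """Summarize how a split MoE model compares against available memory (all values in GB)."""
    model_gb = shared_gb + experts_gb
    total_gb = vram_gb + ram_gb
    return {
        "model_gb": model_gb,
        "total_memory_gb": total_gb,
        "overall_shortfall_gb": max(0, model_gb - total_gb),
        # Shared tensors live in VRAM, so the experts compete for RAM alone.
        "experts_beyond_ram_gb": max(0, experts_gb - ram_gb),
        "vram_left_gb": vram_gb - shared_gb,
    }

if __name__ == "__main__":
    # Figures quoted in the video.
    budget = memory_budget(vram_gb=12, ram_gb=16, shared_gb=3, experts_gb=29)
    for key, value in budget.items():
        print(f"{key}: {value}")
```

The overall shortfall is 4GB, but because the experts are pinned to the CPU side in this setup, the more relevant figure is the 13GB of expert weights that cannot be in RAM at any one time and therefore has to come from the SSD.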

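How the pipeline flows

When the user enters a prompt:

The router on the GPU (which sits with the attention layers) selects 8 of the 128 experts.
If an expert is in RAM (the CPU-side page cache), it is read immediately. This is effectively instant.
If not, it is streamed from the SSD. In my setup that ran at about 1.5GB/s.
The experts run their computation, and the results go back to the GPU.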

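Results from the actual run

I launched the model with llama.cpp in a terminal. I also built a real-time monitor to watch SSD read speed, page faults, available RAM, and the cache hit rate. I plan to open-source it on my GitHub (Tobii Studio).

When the model first loads, SSD reads spike sharply. After that, with no prompt running, everything returns to green (the cache-hit state).

A simple greeting such as "Hello": about 4.3 tokens/s. Only the common experts were used, so there was almost no SSD access.
A technical question such as "What is an MoE model?": about 3.7 tokens/s. Rare experts had to be pulled from the SSD continually during reasoning, so disk access rose noticeably.

It is definitely slow. Some answers took a few minutes. But it is a readable speed, and above all my PC did not blow up. The system stayed smooth even with screen recording software, several browser tabs, and VS Code running at the same time.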

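Benchmark comparison

30B model without SSD streaming: not possible.
Q4 + CPU offload (18GB version): 9.3 tokens/s. All experts fit in RAM.
Q8 + SSD streaming (32GB version): 2.5 to 4+ tokens/s. True SSD streaming.

The variation in speed depends on the topic of the prompt. For common topics the hot experts are already in RAM; the more technical the topic, the more the SSD gets hit.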

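What this experiment proved

I confirmed that the SSD expert-streaming technique Daniel Isaac used on Apple hardware also works on Windows. It is not a one-to-one copy and I adapted it somewhat, but the core idea is the same. Where the MacBook version used pread, I relied on llama.cpp's mmap.

The OS page cache acts as an automatic LRU (Least Recently Used) cache, so hot experts stay in RAM and cold experts are streamed from disk. I expect that fixing the adapter bottleneck alone would double throughput.

What I really want to say is this: the future of AI may lie not just in bigger GPUs but in smarter streaming.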

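Wrapping up

Thanks again to Daniel Isaac for the original idea. If others researched this topic earlier, please let me know in the comments; I would like to credit them. Thanks also to the llama.cpp community for building such a great tool.

If you are interested, see the public GitHub repository moe-ssd-streaming-windows. Most of the tools I used are already publicly available, but the repository collects the custom utilities in one place: the benchmarks, the C code for measuring Windows NVMe throughput, and the real-time disk read monitor.

As with my TurboQuant experiment series, I plan to keep experimenting with different ways to run larger local models on a mid-range gaming PC. The local LLM community has many Mac users, but I think plenty of people want to try this on a Windows PC too. See you in the next video.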

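02 Research document · Document

MoE SSD Streaming for the GPU-Poor: How to Run a 32GB Model on a 28GB PC

Source video: YouTube · Onchain AI Garage · 2026-04-02

Introduction: Can you run a model bigger than your memory?

Anyone who has used a local LLM runs into a similar wall. Models keep growing, yet your GPU VRAM and system RAM combined cover only half the file size. A 30B-class model is a luxury and 70B is a fantasy. It is easy to end up at the resigned conclusion: "Do I just have to buy a more expensive GPU?"

A recent video from Onchain AI Garage puts a small crack in that resignation. It documents an experiment that actually ran the 32GB Qwen3 30B A3B model on an RTX 3060 (12GB VRAM) and 16GB of RAM, 28GB of memory in total. Two things make it work. First, the model architecture is MoE (Mixture of Experts). Second, combining the operating system's mmap (memory-mapped files) with llama.cpp's tensor placement control effectively promotes the SSD to a tier of the memory hierarchy.

This article explains why the experiment is possible, which principles it relies on, and how the same technique is being discussed in the community as of 2026.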

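Main body

1. Why MoE models make the "impossible" possible

A traditional dense transformer uses every parameter of the model to process each token. It is as if the whole brain fires for every question. MoE instead keeps multiple "expert" feed-forward networks, and for each input token a router selects and activates only a few of them. According to NVIDIA's summary, MoE gains efficiency through "sparse activation": the parameter count grows dramatically while the actual compute per token is held down (NVIDIA Glossary: Mixture of Experts).

Qwen3 30B A3B, the model used in the video, activates only 8 of its 128 experts per token. Of 30 billion total parameters, only 3 billion are active. The small number of active experts is the experiment's key lever: rather than keeping all 128 resident in DRAM, it is enough to fetch "the 8 needed right now" in real time.

A recent survey on MoE inference optimization shows that this idea is an established research topic in academia as well. When GPU memory cannot hold the whole model in edge settings, "expert offloading" techniques move parameters down to CPU memory or SSD and exploit the sparse activation pattern to fetch them only when needed, and these techniques have developed systematically (A Survey on Inference Optimization Techniques for Mixture of Experts Models).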

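2. Tensor placement and mmap in llama.cpp

The star of the video is the open-source inference engine llama.cpp. By default, llama.cpp maps model files into virtual memory through mmap. As a result, the operating system automatically manages which pages of the file are brought into physical memory and when they are dropped (roughly LRU, Least Recently Used). The file can be mapped even if it is larger than RAM, and only pages that are actually accessed are promoted to RAM.

On top of that, llama.cpp provides flags for tensor-level placement control. With an option such as --override-tensor, tensors whose names match a given pattern (for example, the expert weights ffn_*_exps) can be sent to the CPU side while the remaining attention and routing logic goes to the GPU. Practical guides use this to explain in detail how to run even huge MoE models such as Qwen-3 235B in GPU-poor environments (How to run big MoE models like Qwen-3-235B-A22B in Llama.cpp via partial offloading to CPU, Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp).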
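As a rough illustration of the placement rule, the sketch below builds a llama.cpp command line and checks which tensor names a pattern would send to the CPU side. It is written under several assumptions: the pattern `ffn_.*_exps` is a regex rendering of the `ffn_*_exps` naming mentioned above, the tensor names in the demo are hypothetical examples in the GGUF naming style, the layer count passed to `-ngl` is an arbitrary "offload everything" value, and nothing here is confirmed to be the exact command used in the video. The script only prints the command; it does not run it.

```python
import re
import shlex

# --override-tensor takes "<pattern>=<buffer>"; the pattern below is an assumed
# regex form of the ffn_*_exps naming, and CPU is the target buffer.
EXPERT_PATTERN = r"ffn_.*_exps"
OVERRIDE = f"{EXPERT_PATTERN}=CPU"

def build_command(model_path, binary="llama-cli", gpu_layers=99):
    """Build an argv list for llama.cpp; nothing is executed here."""
    return [
        binary,
        "-m", model_path,
        "-ngl", str(gpu_layers),        # request all layers on the GPU...
        "--override-tensor", OVERRIDE,  # ...then pin expert tensors to the CPU side
    ]

def goes_to_cpu(tensor_name):
    """True if the override pattern matches this tensor name."""
    return re.search(EXPERT_PATTERN, tensor_name) is not None

if __name__ == "__main__":
    # Hypothetical tensor names, for illustration only.
    names = [
        "blk.0.attn_q.weight",
        "blk.0.ffn_gate_inp.weight",
        "blk.0.ffn_up_exps.weight",
        "blk.0.ffn_down_exps.weight",
    ]
    for name in names:
        print(name, "->", "CPU" if goes_to_cpu(name) else "GPU")
    print(shlex.join(build_command("model.gguf")))
```

Because mmap stays enabled unless it is explicitly turned off, the tensors pinned to the CPU side are backed by the model file on disk rather than copied into RAM up front.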

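The video's setup follows the same recipe. The attention layers and routing logic go into the RTX 3060's 12GB of VRAM, the 29GB of expert weights are sent to the CPU side, and whatever does not fit in RAM is absorbed naturally by the SSD through mmap. The key point is that "SSD streaming was not implemented as a separate feature; the setup simply makes good use of what the OS page cache and mmap were already doing." In the llama.cpp community discussion as well, the answer to the question "does llama.cpp offload MoE to disk?" ultimately converges on "thanks to mmap, it effectively behaves that way" (Does llama.cpp offload MoE to disk?).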

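3. The three-tier memory hierarchy: VRAM → RAM → SSD

Viewing the experimental setup as a memory hierarchy makes the structure clear.

Tier 1 (GPU VRAM, 12GB): Attention, embeddings, and the router. These are the "shared" parameters used for every token. Speed is critical, so they go in the fastest storage.
Tier 2 (system RAM, 16GB): Frequently selected "hot" experts stay resident in the OS page cache. It can hold only part of the expert pool, but real traffic concentrates on particular experts, so the cache hit rate is higher than you might expect.
Tier 3 (NVMe SSD, 2TB): The original weights of every expert. On access they are streamed at 1.5GB per second (with the adapter bottleneck in the video). The Sabrent Rocket 4.0 is rated at 5GB/s, but the measured speed was well under half of that because of the M.2 slot adapter.

This setup follows the same philosophy as a CPU cache hierarchy: the hottest data in the fastest tier, cold data in the slower but larger tier. The difference is that movement between the RAM and SSD tiers is handled by the OS kernel rather than the application. User code does not need to schedule "which expert to load when."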
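The claim that skewed expert traffic raises the hit rate can be explored with a small simulation. Everything in it is an assumption for illustration: the Zipf-like popularity curve, the cache capacity of 64 experts, and the use of an exact LRU policy (a real OS page cache works on pages and is only approximately LRU). It says nothing about the actual routing statistics of Qwen3.

```python
import random
from collections import OrderedDict

def simulate_hit_rate(weights, capacity, tokens=2000, k=8, seed=0):
    """Route `tokens` tokens to k distinct experts each and return the LRU hit rate."""
    rng = random.Random(seed)
    experts = list(range(len(weights)))
    cache = OrderedDict()
    hits = lookups = 0
    for _ in range(tokens):
        # Draw k distinct experts, favouring those with larger weights.
        chosen = set()
        while len(chosen) < k:
            chosen.add(rng.choices(experts, weights=weights)[0])
        for e in chosen:
            lookups += 1
            if e in cache:
                hits += 1
                cache.move_to_end(e)           # mark as most recently used
            else:
                cache[e] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / lookups

if __name__ == "__main__":
    n = 128
    uniform = [1.0] * n
    skewed = [1.0 / (rank + 1) for rank in range(n)]  # Zipf-like popularity (assumed)
    cap = 64  # arbitrary: room for half of the experts
    print("uniform routing hit rate:", round(simulate_hit_rate(uniform, cap), 3))
    print("skewed routing hit rate: ", round(simulate_hit_rate(skewed, cap), 3))
```

With the same cache size, the skewed workload should report the higher hit rate, which is the effect Tier 2 relies on.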

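The llama.cpp issue tracker also has a proposal that goes a step further: a two-tier GPU+RAM expert cache with a pluggable replacement policy (Feature Request: Two-tier GPU+RAM expert cache for MoE offload). Today the approach relies on the OS, but finer-grained control at the engine level is likely to become possible.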

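4. Measured results and what drives the speed

The video is impressive because it goes beyond theory and shows numbers.

Configuration	Runs?	Tokens/s
30B without offload	No	n/a
Q4 + CPU offload (18GB)	Yes	9.3
Q8 + SSD streaming (32GB)	Yes	2.5 to 4+

A short greeting like "Hello" ran at 4.3 tokens/s. It only touches the common experts, so almost every access is a RAM hit. By contrast, a technical, less common topic such as "What is an MoE model?" dropped to 3.7 tokens/s, and the disk read metric on the monitoring screen visibly spiked into the red. The hit rate swings with the topic of the prompt, and that is the essence of this system.

On speed alone it falls far short of Claude or commercial APIs, but the point is that it "runs a theoretically impossible combination at a readable speed." The estimate that throughput would jump two- to threefold if the adapter bottleneck (1.5GB/s) were resolved and the SSD reached its rated speed (5GB/s) is also reasonable, provided SSD reads dominate the time per token.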
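A back-of-the-envelope model helps show how hit rate and SSD bandwidth interact. It is a rough sketch under stated assumptions, not a measurement: it assumes every token touches 8/128 of the 29GB of expert weights, that cache misses are read at the full sequential SSD rate, and that everything else costs a fixed `compute_s` per token (0.1 s is a placeholder loosely motivated by the 9.3 tokens/s all-in-RAM result, not a measured value).

```python
def tokens_per_second(hit_rate, ssd_gb_per_s, expert_gb=29.0,
                      active=8, total=128, compute_s=0.1):
    """Rough per-token time: fixed compute plus SSD reads for cache misses."""
    touched_gb = expert_gb * active / total   # expert weights used per token
    miss_gb = (1.0 - hit_rate) * touched_gb   # share that must come from the SSD
    return 1.0 / (compute_s + miss_gb / ssd_gb_per_s)

if __name__ == "__main__":
    for hit_rate in (0.70, 0.85, 0.95):
        slow = tokens_per_second(hit_rate, ssd_gb_per_s=1.5)  # adapter-limited
        fast = tokens_per_second(hit_rate, ssd_gb_per_s=5.0)  # rated speed
        print(f"hit rate {hit_rate:.2f}: {slow:.2f} tok/s at 1.5GB/s, "
              f"{fast:.2f} tok/s at 5GB/s")
```

Under these assumptions the gain from a faster SSD shrinks as the hit rate rises, because the fixed compute term starts to dominate; how large the real gain would be depends on numbers the video does not report.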

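5. Caveats: energy and endurance

Anyone seriously considering this technique for production should know about a few pitfalls. One recent study points out that when MoE weights are offloaded to SSD, the read energy per bit is much higher than for DRAM. Access latency can be hidden with prefetching, but energy consumption per token rises fundamentally (SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency).

From a data center perspective this is a significant constraint. For a personal PC that only occasionally runs a model that "would not have run anyway," however, "possible vs. impossible" matters far more than the electricity bill. In addition, the TBW (Total Bytes Written) limit of an NVMe SSD is largely unaffected by a read-only workload, but heavy page cache pressure can degrade perceived system performance.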

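Key insights

MoE's sparse activation is an opportunity for "memory tiering," not "memory compression." The fact that only 8 of 128 experts are used gives you the freedom to spread the 128 across storage tiers of different speeds.
mmap is a free LRU cache manager. The application does not have to manage an expert cache itself; the OS page cache automatically keeps hot experts in RAM. llama.cpp's contribution is "providing a tensor placement interface that makes it easy to exploit this deliberately."
The future of AI lies not only in bigger GPUs but in smarter data movement. The clearer it becomes that the bottleneck in MoE inference is memory bandwidth and I/O rather than FLOPS, the more engine-level cache policy, prefetching, and routing prediction become a competitive edge.
"GPU-poor" users can be at the frontier of experimentation too. As in this experiment, which ported an idea first demonstrated on an M4 Max MacBook to a Windows gaming PC, constrained environments can produce creative solutions.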

Further reading

NVIDIA Glossary: Mixture of Experts (MoE). A concise reference on the basic concept of MoE and its sparse activation structure.
A Survey on Inference Optimization Techniques for Mixture of Experts Models (ACM Computing Surveys). A survey of MoE inference optimization techniques, including expert offloading.
Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp (Hugging Face Blog). A detailed guide to configuring MoE offload in practice with llama.cpp.
How to run big MoE models like Qwen-3-235B-A22B in Llama.cpp via partial offloading to CPU (Medium). Specific flag combinations for running a 235B-class MoE on consumer hardware.
Does llama.cpp offload MoE to disk? (GitHub Discussion). Community Q&A on mmap-based disk offload behavior.
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency (arXiv). A recent study that quantifies the energy trade-off of SSD offloading.
Feature Request: Two-tier GPU+RAM expert cache for MoE offload (llama.cpp Issue #20757). An ongoing discussion on improving the MoE expert cache at the engine level.

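03 Pro/con debate · Debate

Debate: "Can SSD streaming break through the hardware limits of local LLMs?"

Motion: Is using SSD mmap streaming to run a 32GB MoE model on a mid-range PC such as an RTX 3060 + 16GB RAM the practical future of local LLMs and an approach worth recommending?

The video presents an attention-grabbing result: "I ran a 32GB model on 28GB of system memory." On the surface it is a technical success story, but there is ample room to ask whether this approach is truly a practical future or just an entertaining trick that holds only under limited conditions. The two sides exchange arguments over three rounds.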

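Round 1

🟢 Pro: "SSD streaming is the relief pitcher of the GPU-poor era"

First, this approach dramatically lowers the economic barrier to entry. Frontier open-source models of 70B and above arrive every month, while VRAM on consumer GPUs is not growing anywhere near fast enough to keep up. If Qwen3 30B A3B runs even on a card several years old like the RTX 3060, millions of potential users can try the model right now. The dichotomy of "use a cloud API or buy a 4090/H100" breaks down.

Second, the technique requires no new hardware. It works with an NVMe SSD that is already installed, mmap that is already built into the OS, and llama.cpp, which is already open source. The additional cost is zero; you only change settings. The academic survey also treats expert offloading as a standard MoE optimization technique and sees it as the most realistic option for edge inference.

Third, the performance of this approach has a clear theoretical ceiling and plenty of room to improve. The 1.5GB/s in the video is due to the adapter bottleneck, not a limit of the technique. Reaching the rated 5GB/s, or moving to PCIe 5.0 NVMe, makes 10GB/s or more possible. With research on prefetching and routing prediction (ExpertFlow and others) layered on top, per-token I/O stalls shrink further. Today's 2.5 to 4 tokens/s is a starting point, not the finish line.

🔴 Con: "It is slow, energy-inefficient, and works only under special conditions"

First, to be honest, 2.5 to 4 tokens/s falls short of real-world use. It is roughly human reading speed, so it gets described as "readable," but this is a single-user demo, not something you can build into a workflow. Real uses such as coding assistants, agents, and batch summarization require at least 20 to 30 tokens/s, and when speed drops to the low single digits, productivity collapses.

Second, quantitative evidence of poor energy efficiency already exists. A study published in 2025 showed with numbers that the energy per bit for reading MoE weights from SSD is far higher than for DRAM, which substantially increases the energy per generated token. Prefetching can hide latency, but it cannot hide energy. Declaring this "the future of local LLMs" is irresponsible from an environmental and power standpoint.

Third, the technique works only when an exact combination lines up: MoE + enough VRAM + fast NVMe + a well-behaved OS page cache. It is useless for dense models, no help on smaller GPUs that cannot even fit the attention logic in VRAM, and on HDDs or slow SATA SSDs it can freeze the system instead. The operating conditions are too narrow to call it "the relief pitcher for mid-range PCs."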

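Round 2

🟢 Pro (rebuttal): "Con's points are a snapshot of the present and miss the trend"

On Con's first point (insufficient speed): The 2.5 to 4 tokens/s figure comes from a worst-case configuration with an adapter bottleneck. The video itself estimated that "throughput would double" if the same method simply had full PCIe 4.0 speed (5GB/s). Add the two-tier GPU+RAM expert cache under discussion in llama.cpp and ExpertFlow-style prefetching based on routing prediction, and the 10 tokens/s range is realistic. Leaping from "it is slow now" to "it will never be usable" repeats the mistake of those who said "local won't work" when early local LLMs were slow.

On Con's second point (energy inefficiency): The context of the energy paper matters. That study assumed data-center-scale serving, an environment where energy cost per token feeds directly into operating costs and carbon emissions. For a personal workstation queried occasionally for a few hours a day, the absolute power draw is negligible. Above all, "energy cost is high" leads to the conclusion "use it for the right workload," not "the technique is wrong."

On Con's third point (dependence on conditions): The claim that "the operating conditions are narrow" has it backwards. Many of the major open-source frontier models released today are already MoE: DeepSeek, Qwen3, Mixtral, and the Llama 4 family. In other words, the model family this technique targets is the one with growing market share. NVMe SSDs are standard in mid-range PCs as of 2026, and HDDs are now an edge case. Far from narrow, it aims squarely at the center of the market.

🔴 Con (counter-rebuttal): "Pro's optimism underestimates benchmarks and community experience"

On Pro's first point (lower barrier to entry): A lower barrier to entry and a "satisfying" resulting experience are entirely different things. The pattern of users successfully installing Qwen3 30B, waiting for answers at 2 tokens/s, and eventually going back to "let's just use Claude" has been reported many times in the community. "It runs" does not translate into adoption. If anything, the barrier is so low that people more often form unrealistic expectations and end up disappointed.

On Pro's second point (free technology stack): "Free" applies only to the software. In practice the lifespan of a 2TB NVMe SSD, extra cooling, power consumption, and setup time are all costs. More important is the opportunity cost. Spending the same money and time on API credits gets you a much better model at much higher speed. "It works with what I already have" is only an emotional argument that sidesteps a sober cost comparison.

On Pro's third point (room for improvement): The argument that the theoretical ceiling is high is an empty promise that can be attached to any experimental technique. In practice, the llama.cpp issue tracker is full of ongoing stability problems, such as mmap-based SSD offload triggering kernel panics on macOS and a bug where Qwen3 30B A3B loads on the server and then crashes at inference time. "It will get better in the future" is not a meaningful answer for a user who wants to use it today.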

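Round 3

🟢 Pro: "Stability and speed are engineering problems, not a flaw in the principle of the approach"

On Con's first point (lack of user satisfaction): The story of "disappointed by 2 tokens/s and back to the API" is true, but it applies to the most extreme configuration, the 32GB Q8 model. Con conveniently skipped over the fact that the 18GB Q4 version reaches 9.3 tokens/s with the same technique. Real users choose between speed and quality, and SSD streaming has a solid place as "the option that makes extreme models possible." Not every user needs to run the full 32GB Q8.

On Con's second point (opportunity cost): The API credit comparison holds only if you ignore the intrinsic value of local LLMs: data sovereignty, privacy, offline use, and custom fine-tuning. In domains such as medicine, law, and security, "data cannot be sent over the internet" is a precondition. Under that precondition the comparison "the API is cheaper and faster" becomes meaningless, and "finding some way to run it on the PC I have" becomes the only path. SSD streaming is what makes that path practical.

On Con's third point (bugs and stability): The llama.cpp kernel panic issue and the Qwen3 loading bug that Con cited are precisely evidence that this technique sits in an area of active engineering. A project that accumulates bug reports is not a dead project but one that is actually used and improved. Early CUDA and early PyTorch were also notoriously full of bugs. "It has bugs now → it is wrong in principle" is a hasty generalization.

🔴 Con: "The problem is not the principle but the attitude of packaging this as 'the future'"

On Pro's first point (there is a Q4 version): That rebuttal is exactly what muddies this debate. The motion is "the SSD streaming approach to running a 32GB model on a 28GB PC." The 18GB Q4 model does not need SSD streaming; it fits in RAM to begin with. The moment Pro escapes with "a smaller quantization is faster," the video's core claim (the practicality of SSD streaming) collapses. Putting forward a detour instead of the configuration that needs proving is dodging the question.

On Pro's second point (data sovereignty): I agree that privacy matters, but that logic does not explain why this particular technique should be used. If data sovereignty is the goal, the answer is clear: buy a used 16GB VRAM GPU, or buy a Mac mini M4 Pro and run on unified memory. Both are 5 to 10 times faster than SSD streaming and far more stable. The sentiment of "running it somehow on the PC I have" is admirable, but it is not a solution to recommend to medical or legal professionals.

On Pro's third point (bugs are evidence of active development): That logic is only half right. Active development also means that APIs and behavior change every week, which is a nightmare for anyone trying to build production systems or repeatable workflows. People endured the early bugs in CUDA and PyTorch because those were the only options. Today local LLMs have many far more stable alternatives, including Ollama, LM Studio, vLLM, and MLX. A "cutting-edge experiment" and a "practical option worth recommending" are different things.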

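🧭 Synthesis

Points of agreement

Both sides agree that it works in practice. Running a 32GB model on 28GB of memory through the combination of MoE sparse activation + llama.cpp tensor placement + OS mmap is not magic but the product of well-documented systems engineering.
The ideal use case for this technique is the "extreme configuration." It is pointless when the model fits in RAM, and when the model would not run at all there is no alternative. The "it runs, but only just" zone is this technique's home ground.
Both also recognize that workload fit is decisive. It suits conversational, intermittent, research, and privacy-sensitive work, and does not suit agents, large batch jobs, or latency-sensitive workflows.

Open questions

If llama.cpp introduces its own expert cache and prefetcher beyond the OS page cache, how far can per-token throughput be pushed? How would benchmarks change if the two-tier GPU+RAM cache proposal (Issue #20757) is actually merged?
In terms of NVMe SSD endurance (TBW) and power, what do real lifespan and energy costs add up to when an MoE streaming workload runs over a long period? Does the conclusion of the energy study (2508.06978) hold at the scale of an individual user?
Can the same philosophy (hardware tiering + OS support) be applied to dense models that are not MoE? Layer-level hot/cold patterns look far less favorable, but sparsity that activates only for certain tokens is being studied on the attention side as well.
Is there room for model developers to create a "format friendly to this technique"? If files were sharded per expert, or if frequently used experts were placed at the front of the file, cache efficiency could change substantially.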

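A broader view

The essence of this debate is not "is SSD streaming fast or slow." It is a cultural question: how should the local LLM ecosystem evaluate "hacks that turn the impossible into the possible"?

Pro's position defends the value of openness and the experimental spirit. The attitude of "let's see how far I can push the hardware I have" is the DNA of the local LLM community. This video, which ported a technique Daniel Isaac tried on a MacBook to a Windows PC, is a specimen of that DNA. Being slow does not make it meaningless; the very fact that it is possible despite being slow is fuel that moves the community forward.

Con's position represents mature engineering discipline. The warning not to confuse an entertaining hack with a recommendable solution will resonate with anyone who has shipped code to production. The distance between "it works in a YouTube demo" and "I can use it every day" is greater than it seems.

In the end both are right. The technique straddles being "a frontier of the cutting-edge lab" and being "not yet a recommendable default." The sentence in which the video declares "the future" ("the future of AI is smarter streaming") is right, but reading it as though that future has already arrived is an exaggeration. More precisely, it is a prototype that demonstrates the direction toward that future. And when enough prototypes accumulate, they become the standard. That is how the open-source ecosystem works.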

04 English original · Transcript

so today i'm going to break down how i was able to run an ai model that's 32 gigabytes on my pc
that only has 28 gigabytes of total memory something that should be impossible but
that was made possible through a few clever techniques and this experiment was inspired
by daniel isaac here on twitter who i saw this tweet from him and he showed something really
clever instead of trying to fit an entire ai model into your memory at once you can stream
pieces of it from your ssd that's the fast storage on your computer and only load parts of
it that you actually need now he tested this on a macbook with apple's m4 max chip and got some
pretty impressive numbers 19.6 gigabytes per second reading straight from the ssd
so like a year back i had bought an ssd for my own gaming pc and this was mainly just to run
games that i couldn't fit on my c drive but i tried to adapt it and see if this
is going to work or not and i found out that it was going to work and
this could actually work on my budget windows gaming pc so this method is made possible thanks
to moe models which you may have heard of and that stands for a mixture of experts and there are a
lot of local llms open source models that use this structure but you can think of a regular ai model
like a single brain every time you ask it a question the entire brain has to think about it
every single neuron fires that works but it means bigger models need way more computing power a mix
of experts model though works differently think of it like the hospital here a hospital has 128
specialist doctors on staff but each patient only sees eight only the ones they need if you walk in
with a broken arm you don't need a skin doctor necessarily and that's how this model works it has
30 billion total parameters in the case we're going to be using but only 3 billion are active
so the key question is if only eight of the 128 experts are needed at any moment
do you really need to load all 128 in the
memory at the same time and here's the problem that i'm sure we've all encountered ai models
are huge and they're getting bigger with every release a 7 billion model a 7 billion parameter
model will fit in around 4 gigabytes and that fits comfortably on a modern gpu but a 13 billion
parameter model 8 gigabytes you go up into 30 billion and depending on your quantization they
take up a lot of vram so the gpu i'm running locally on my pc is
rtx 3060 which only has 12 gigs of vram so trying to run something like a 32 gig model would seem
pretty difficult right but i wanted to experiment with this trick that daniel had brought up and see
if we could actually do it so this was my experiment starting with the gpu vram which
is the brain attention and routing logic which has 12 gigs it's extremely fast and this is where the
model really lives and it contains all the routing logic that you need to decide which experts to
call in the middle here you have the system ram which is your cpu i have 16 gigs on my pc it's
slower than the gpu which is why i usually don't use it to run local models but it is bigger and
this acts as like a cache it's a waiting room where frequently used experts hang out so they
can be called up quickly at the bottom you have the ssd so i had a two terabyte ssd and it's the
slowest of the three but it can hold everything all 128 experts live here so when the model needs an
expert it first checks the ram and if the expert is already there great that's instant if not it
goes to the ssd it's like netflix you don't download the whole library you only stream
what you need and it would stream from 1.5 gigabytes per second off the ssd so this is
my setup your performance is going to depend a lot on exactly what kind of hardware you have
for my gpu i have the nvidia rtx 3060 12 gigs of vram for my cpu i have 16 gigs
and my ssd was a sabrent rocket 4.0 two terabytes nvme now it was rated at five gigabytes per
second in terms of speed but the bottleneck we hit is my adapter because i could only do one m2 slot
on my computer
speed I was supposed to get. Now I tried a couple different adapters. And I wasn't able to improve
on this, I didn't want to mess with it too much. My SSD contains a lot of my project files,
important stuff that I've been working on for months, I didn't want to screw around with it
too much. Some of you I'm sure are pros at building computers and could probably point me
in the direction of what I screwed up here. But the end result of my findings were that I could
only get 1.5 gigabytes per second instead of what should have been 3.5 gigabytes. But I didn't give
up, we're going to work with what we have. So the tool we used was llama.cpp. And this is a big
community. It's an open source LLM inference engine. And it's very popular among local LLM
community. It runs on CPU, GPU, or both. And it was really perfect for this. You could think of
it like an alternative for Ollama. And the critical feature that it had is this flag.
It lets you tell llama.cpp exactly where to put each tensor. So I told it to put the attention layers,
the core thinking logic on the GPU, because that needs to be fast. But keep the expert weights on
the CPU side. And this was kind of the trick here. When the expert weights are on the CPU side,
they're not loaded into RAM all at once. Instead, the operating system uses something called mmap
down here, mmap. And it's memory mapping. It's a technique where the operating system
maps the model file on your SSD directly into virtual memory, then it only loads the parts
that are actually accessed on demand from the SSD. So SSD becomes part of the memory hierarchy.
Windows automatically caches the most frequently used experts in the RAM in the CPU and streams
the rest from the drive when they're needed. The operating system essentially becomes our expert
cache manager for free. So this is the model I chose to use for this test. This is Qwen3
30B A3B. It's an open source mixture of experts model with 30 billion parameters.
You can see the 128 experts here, each one of these squares is a specialist. The green ones
that are the eight are active right now. So every time you ask a new question, you can see different
experts light up. And you can see the breakdown, the three gigabytes of shared, that's the attention
mechanism I was talking about, the routing logic, and that fits easily on the GPU. But the expert
weights are 29 gigabytes. So that's nearly double my total system RAM of 16 gigabytes in my CPU. So in total,
the model file is 32 gigabytes. So my entire PC, the entire system memory,
GPU and RAM from the CPU combined only has 28 gigabytes. So this model literally should not
be possible to run on my PC as it is right now. But I wanted to try anyway. So here's another
kind of breakdown of this system pipeline. So you type in a prompt like you usually do to a model.
The GPU router, right, which is the attention layer picks eight of 128 experts. So the question
is, is the expert in RAM, in my CPU? If it's yes, it reads from the page cache. And that's instant,
pretty much. If it's no, it has to stream from the SSD. And that's at 1.5 gigabytes per
second, like I said with my lousy adapter, and then the experts compute and all the results go
back to the GPU. Okay, so let's actually try this in terminal now. And I can show you what results
I got trying to apply this method. So we're starting this up in terminal now. You could see
I just ran this file. And we got to llama.cpp, which we're running this through. And this is
basically your normal terminal model that you could just write in questions, whatever you want,
these are put in your prompts, and it'll answer back to you. And for this to show you the speeds,
this is the monitor that I set up. I'll try to open source this. If people find it useful,
you'd find it on Tobii studio on GitHub. And you could see here at the beginning,
when it originally loads the experts, it's requiring a lot of SSD. And the disk reads
per second here, this is when it's actually reading off of the SSD, you can see at first,
when it first loaded, it was very high. When it's red, it means it's reading off the SSD a lot. But
now I haven't put in any prompts. So now it's in the green.
So this next column page faults, that's how many times the operating system tried to access a
memory page that wasn't available. Including both soft faults and hard faults, soft faults
being the ones that are found in page cache and fast, hard faults are when it had to read from the
SSD. So a high number with low disk reads means the cache is working. High number with high disk
reads means cold experts were streaming from the SSD. Free RAM here. So this is the
available physical memory. So you're going to watch this drop as expert weights fill the RAM. When
it gets very low, that means the operating system is going to force the model
to read more off of the SSD, you can see at the first, when it was loading it, it was very low.
Now it's back up to around 5000 ish. The estimated hit, this means it's still hitting the CPU.
Because I haven't put in any prompts. But
this starts to turn red. It means it's hitting the SSD more. So let's try some prompts. So let's just
say hello. So this, like I said, this is the model here, you could see Qwen3 30B A3B.
So you can see it's thinking, it's not going to be very fast. But it is operating. And that's the
key thing. It's not exploding my computer. Here we go. Hello, nice little smiley face. How can I
assist you? With your questions needs help?
With something or just want to chat? I'm here with you. I'm here for you. What's on your mind. So you
can see here we got generation at 4.3. T a second. And that stands for tokens per second. And this is
going to vary depending on the prompts. Like I said, how much it's going to hit off the SSD. And
you can see in the monitor. It was all green when I didn't have the prompt. And you can see here went
straight to red, it started reading. This is the reading off of the SSD didn't need to go too heavy.
Since it was
just a hello. But it did use it a little bit. So let's try something a little bit more technical. See how it
does. Let's have it explain itself. What is an MoE model? A technical question. And we could see it's
going to answer live. It will take some time. But let's see what the streaming so you could already see once I
put in the prompt, it's hitting the SSD file, the SSD.
To answer the question, you see it started thinking here, list all of the reasoning. And this isn't the actual
output. This is just its kind of reasoning that it's streaming. And you could see here as it's
answering a more technical question, it has to really dig deeper. Getting higher disk speeds
than we got previously, because it's loading more of the cold experts from the SSD. You can see here
this percentage in the last column, when this hits zero, that means it's reading all of the
experts from the SSD. When it's at 100%, like it was kind of earlier, that means it's only using
what's in RAM, what's on the CPU. You can see as it's answering a more technical question it has to really
dig deep here. Let's see, it ended its thinking and it's answering: a mixture of experts AI model is
a type of neural network architecture that uses specialized sub models called experts to handle
different parts of a task with a gate mechanism that routes input data to the most relevant
experts. It's inspired by the idea of having a team of specialists where each expert focuses
on a specific aspect of a problem and decides which expert to consult based on the input.
Maybe when
it's too slow or too slow. It's like a
little bit of a gamut. I can't explain it better than I did, but that's generally the explanation
I gave I think. So it's going into more detail here, pretty detailed response of how it works,
the key benefits. And you can see here back in the streaming, it's heavily going into the SSD
to find this information and produce the answer. And it is slow, slower than Claude, but it's
actually not that slow an output. But considering this is a model I shouldn't be able to even run without
destroying my computer, I obviously have a recording application running, I have a couple browser tabs
open, I have VS code open, and everything's still running fairly smooth. So there's the response
gave you example use cases challenges, generation at 3.7 tokens a second. So there you go, you got
to see it run live, we made the impossible possible. So this is a slide version from our
previous runs. But you can see the same thing in the green means everything is cached, just using
the hot experts. But then I give it a more technical prompt, the RAM's filling up here in
the yellow, oh, no, we need SSD streaming, full speed experts from the disk. And then after we
have the cache forming, and you can see it moving through these different devices, all in real time.
So the results were, without this SSD
streaming setup, we can't run a 30 billion parameter model like this, it would just be
impossible. With the Q4 and CPU offload, without the streaming, you could run the 18 gig version. And this
is something we did earlier, we got 9.3 tokens a second. Now that's experts just in RAM. And then
finally using Q8 and SSD streaming, the 32 gigabyte model, we got
between like 2.5 and a little over four tokens a second. Now that is true SSD streaming. So a 32 gigabyte
model on 28 gigabyte system. The speed does vary by topic, the hello response was fast because it
just relied on the common experts. Something more technical, like explain mixture of experts,
it was slower, and had to hit rare experts on the SSD. So what did we prove in this experiment? That
SSD expert streaming works on Windows, like it had worked
for Daniel Isaac on his Apple machine.
We adapted his method a little bit.
It's not an exact one-to-one,
but the central concept was the same.
So even with our bottleneck cheap adapter SSD,
we were still able to run a 30 billion model
at a readable speed.
I think you saw it.
It is slow, but it's readable.
I mean, it came out in a couple of minutes to answer.
So the OS page cache means automatic expert LRU cache.
So hot experts stay in the RAM, cold ones stream
from the disk.
So obviously fixing my adapter setup
would double the throughput and it'd be even faster.
So we found what we proved in this experiment
was that Dan Isaac's technique translated
with some tweaking to Windows.
Instead of using pread,
we used mmap from llama.cpp.
So the future of AI isn't just bigger GPUs,
it's smarter streaming.
So this was my experiment.
When I saw this tweet,
I thought it was interesting
and I wanted to try this locally.
And it was pretty successful as you saw,
I was able to run quite a large model locally on my computer.
Credit again to Daniel Isaac.
There may have been other researchers
who've worked on this issue.
This was just the one person I saw on Twitter doing this.
So I apologize if I'm missing somebody.
And please let me know in the comments,
I'd like to give credit.
And credit also to the llama.cpp community
for building this great tool
that I was able to do this with.
And if you're interested,
I have this public open source GitHub repo moe-ssd-streaming-windows.
And most of the tools I used were public like llama.cpp,
but this just gives you an overview of what I did
and specifically might be useful are the benchmarks.
It gives you a whole rundown of what we did.
And the monitoring,
this was something custom we did that might be useful.
Like you see, it shows the real-time disk reads
like I was showing earlier.
It could be useful if you want to test this out,
this method and see what kind of results you get,
as well as the benchmarks for the Windows NVMe throughput
benchmarks that we wrote in C.
So, most of the tools, like I said,
are available elsewhere,
but this just gives you a central place
to find some of the benchmarks and monitors.
And that's pretty much it.
I hope you learned something.
I hope you found this little experiment interesting.
I'm going to continue similar
to the TurboQuant experiments I've been running,
trying to find different ways to run larger local models
on my kind of mid-range gaming PC.
I know a lot of people in the local LLM community use Macs,
but I think there's a lot of people just with Windows PCs
that want to try to do this stuff.
So I'll keep trying different experiments for that.
But thank you for watching.
Please leave a comment.
Please subscribe to my YouTube channel.
I'll see you in the next video.
Bye.
Don't forget to subscribe.
Please leave a like.
Let me know what you think of this video or this experiment,
and I will see you in the next one.
Thank you for watching.