Testing Google's TurboQuant Approach: I Got 5x Compression with 99.5% Accuracy!

2026-03-25 · 17m

01 Korean Translation · Korean

Verifying Google's TurboQuant firsthand: 5x compression and 99.5% accuracy on my own PC

Source: https://www.youtube.com/watch?v=iD29muStx1U · Uploaded: 2026-03-25 · Length: 18m · Channel: Onchain AI Garage

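Introduction

This morning a tweet from Google Research caught my eye. It introduced a new compression algorithm called "TurboQuant" that can reduce the memory used by an LLM's key-value cache (KV cache) by at least 6x. The blog post and papers lay out everything from the prior research to the experimental results in considerable detail, and the more I read, the more it struck me as a significant breakthrough.

So this video does two things. First, because the papers and blog are dense enough to be hard going for a general audience, I explain TurboQuant's core ideas in plain terms. Second, I reproduce the approach on my own PC and GPU. Google has not released an official implementation, but the papers contain enough of the formulas and procedure to verify the math itself. To be clear up front: this is not Google's official TurboQuant but the same approach, implemented from scratch by me from the papers alone.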

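The problem TurboQuant addresses

Today's frontier AI models (ChatGPT, Gemini, and so on) store billions of numbers internally. Running them therefore requires specialized hardware, which makes them slow and expensive. Anyone who has run a capable LLM locally has probably sensed, while waiting for a response, that an enormous amount of number crunching is going on behind the scenes. This cost structure is why running high-performance models on small devices, especially phones and entry-level desktops, remains difficult.

The fix is quantization. Quantization is easiest to understand by analogy with photo compression. A raw photo from a camera may be 50MB, yet saved as a JPEG it shrinks to 2MB with almost no visible difference. AI works the same way: instead of storing each number at 32-bit precision, you round it to roughly 3 or 4 bits, which cuts memory sharply while answer quality is largely preserved.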
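As a rough illustration of the rounding idea above, here is a minimal sketch of plain uniform quantization in Python. It is not TurboQuant itself; the helper names `quantize` and `dequantize` are hypothetical, and the input is random stand-in data.

```python
import random

def quantize(values, bits):
    """Round each value to one of 2**bits evenly spaced levels between min and max."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    codes = [round((v - lo) / scale) for v in values]  # small integers in [0, levels]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map the integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in data, not real model values
codes, lo, scale = quantize(weights, bits=4)
restored = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"4-bit codes in [0, {max(codes)}], worst-case error {max_err:.3f}")
```

Each value is stored as a 4-bit integer plus two shared constants (`lo` and `scale`), and the worst-case rounding error is at most half a step.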

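Why the KV cache is the key

TurboQuant targets the KV cache (key-value cache), not the model weights. Think of the KV cache as a "cheat sheet" the model carries through a conversation. Because the model has to remember everything said so far, it writes a key (a label) and a value (the actual information) into the cache for every token. The problem is that this cheat sheet keeps growing with every token as the conversation gets longer, until it occupies gigabytes. That is memory that could otherwise go toward making the model smarter or faster.

So if the KV cache can be shrunk without losing important information, you get three benefits at once: longer conversations, cheaper hardware, and faster responses. That is exactly what TurboQuant aims at.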

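TurboQuant's two-stage approach

TurboQuant is a two-stage pipeline.

Stage 1 is PolarQuant. It rearranges the data into a shape that compresses more easily, much like folding clothes neatly before packing a suitcase. Stage 2 is QJL (Quantized Johnson-Lindenstrauss), which captures the fine detail that stage 1 can miss. It uses a clever mathematical trick to compress that information into simple plus/minus signs, so the correction comes with no extra memory overhead.

The result is that each number goes from 32 bits down to about 3 bits. That is roughly a 10x reduction with answer quality unchanged, and above all no retraining is needed. It is a plug-and-play method that can be applied directly to an existing model.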

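Official numbers and what local users gain

The reported numbers are impressive: 6x less KV cache memory, up to 8x faster processing on H100 GPUs, and zero accuracy loss at 3-bit precision across all benchmarks. It was tested on popular open-source models such as Gemma and Mistral and outperformed most existing compression methods.

For local users the implication is clear. You can run larger models on smaller hardware, keep up to 6x more context, and in theory get faster responses. Pushed to extreme compression, it brings the long-held dream of offline mobile AI a step closer. On the cloud side it means cheaper AI, greener AI, and lower power consumption.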

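Verification 1: implementing the math from scratch

I ran the verification in two phases. First, working through the papers with Claude Code (Opus 4.6), I implemented TurboQuant's math from the ground up as a Python package. I used none of Google's code and wrote the four components described in the paper independently: the Lloyd-Max codebook, stage 1 TurboQuant MSE, stage 2 TurboQuant Prod, and a KV cache wrapper.

The first test was a symmetry check on the Lloyd-Max codebook. It looks complicated because of all the formulas, but the point is simple. The codebook (the dictionary the compressor uses to translate numbers) must be perfectly symmetric around 0 so that it does not favor particular values. The result was exactly 0, meaning perfect symmetry.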
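To make the symmetry check concrete, here is a minimal sketch of a Lloyd-Max codebook in Python. It assumes the values being quantized follow a standard normal distribution, which is a simplification chosen for illustration and not necessarily the distribution used in the video's package. The function name `lloyd_max_gaussian` is hypothetical.

```python
import math

def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lloyd_max_gaussian(bits, iters=200):
    """Alternate between midpoint boundaries and centroid levels (Lloyd-Max)."""
    n = 2 ** bits
    # Symmetric starting guess, evenly spaced in [-2, 2].
    levels = [-2.0 + 4.0 * (i + 0.5) / n for i in range(n)]
    for _ in range(iters):
        # Decision boundaries sit halfway between neighbouring levels.
        mids = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        edges = [-math.inf] + mids + [math.inf]
        # Each level moves to the mean of the density inside its cell.
        levels = [(pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
                  for lo, hi in zip(edges, edges[1:])]
    return levels

levels = lloyd_max_gaussian(bits=3)
# Symmetry check: the i-th level from the left should mirror the i-th from the right.
symmetry_error = max(abs(a + b) for a, b in zip(levels, reversed(levels)))
print([round(v, 4) for v in levels])
print(f"symmetry error: {symmetry_error:.2e}")
```

A symmetry error at the level of floating-point noise corresponds to the "zero" that the symmetry check in the video is looking for.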

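Next came the MSE distortion test. The paper gives a theoretical upper bound, and the empirical error in my experiments stayed inside that bound in every case. More important, the inner product bias came out close to 0. An LLM compares vectors by inner product when choosing the next word, so if compression skews those comparisons consistently in one direction, answer quality collapses. A bias near 0 means compression does not distort the model's decisions. Even at 3 bits the correlation stayed at about 92%.

Why two stages? Because stage 1 alone can introduce bias. Adding QJL is what makes the inner product estimate unbiased. In photo compression terms, stage 1 is the compression proper and stage 2 automatically corrects the slight color shift it introduces. The correction costs just 1 extra bit per number.

Compression ratios measured at the KV cache level were 7.76x at 2 bits, 5.22x at 3 bits, and 3.94x at 4 bits. I also ran a "needle in a haystack" test: hide one specific piece of information among 8,192 others, compress everything to 2 to 4 bits, then try to retrieve it. Every bit width scored 9 out of 9, a 100% retrieval rate. That is strong evidence that compression does not throw away the important information.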
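The retrieval test can be sketched as a toy vector search. The version below is an assumption-laden stand-in, not the video's test: it uses random unit vectors as the "haystack", a plain per-vector uniform quantizer in place of TurboQuant, and hypothetical helper names throughout.

```python
import math
import random

def unit_vector(dim, rng):
    """Random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def quantize_roundtrip(v, bits):
    """Per-vector uniform quantizer: snap each coordinate to one of 2**bits levels."""
    lo, hi = min(v), max(v)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in v]

def find_needle(bits, haystack, query):
    """Return the index of the compressed key with the largest inner product."""
    scores = [sum(q * k for q, k in zip(query, quantize_roundtrip(key, bits)))
              for key in haystack]
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
haystack = [unit_vector(64, rng) for _ in range(8192)]
needle_index = 4096               # arbitrary position for the hidden item
query = haystack[needle_index]    # the query matches the uncompressed needle
hits = {bits: find_needle(bits, haystack, query) == needle_index for bits in (2, 3, 4)}
print(hits)
```

The point of the toy is only that a distinctive item keeps the highest inner product after coarse quantization; it says nothing about retrieval through a real model.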

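Verification 2: results on a real model

With the math confirmed, I moved on to a real language model. I chose Qwen2.5-3B loaded with 4-bit weights, which fits in 2GB of VRAM. The hardware was an RTX 3060 that is a few years old, a typical mid-range gaming PC. I fed the model a long document with a hidden fact, captured the real KV cache with a single forward pass, applied TurboQuant-style compression at 2, 3, and 4 bits, and compared attention scores against the uncompressed original.

Against the 289MB original, the cache shrank to 76MB at 4 bits (3.8x), 58MB at 3 bits (5x), and by about 7.3x at 2 bits. On the 3060's 12GB of VRAM, a model that previously topped out at 8K of context fits 40K with 3-bit compression. For local users that is a completely different experience.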
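The arithmetic behind these figures can be checked in a few lines. This sketch assumes the uncompressed cache stores 16-bit values (an assumption; the source does not state the baseline precision) and uses the rounded megabyte figures quoted above, so the results are approximate.

```python
def effective_bits(ratio, baseline_bits=16):
    """Bits actually spent per stored number, given a measured compression ratio."""
    return baseline_bits / ratio

original_mb = 289
for label, compressed_mb in (("4-bit", 76), ("3-bit", 58)):
    ratio = original_mb / compressed_mb
    print(f"{label}: {ratio:.2f}x, about {effective_bits(ratio):.2f} bits per number")

# Context that fits in the same cache budget scales with the compression ratio.
context_k = 8 * original_mb / 58
print(f"8K tokens -> about {context_k:.0f}K tokens at 3-bit")
```

Under the 16-bit assumption, the effective cost comes out slightly above the nominal bit width, which is what one would expect if a small amount of per-vector metadata is stored alongside the codes.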

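Accuracy is the crux. At 3 bits the attention similarity was 0.995, a 99.5% match with the original. Even at much longer context it did not fall below 0.994. Averaged over all 36 layers with 2 KV heads per layer, at 3 bits the model's original top pick remained in the compressed top 5 about 92% of the time, and the top-1 match rate also stayed high.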
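One plausible way to compute metrics like these is sketched below. The exact definitions used in the video are not given, so treating "attention similarity" as the cosine similarity between the two attention distributions is an assumption, and the compression error is simulated with small Gaussian noise instead of a real quantizer.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = 1.0 / math.sqrt(len(query))
    return softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def top_k(weights, k):
    """Indices of the k largest attention weights."""
    return sorted(range(len(weights)), key=weights.__getitem__, reverse=True)[:k]

rng = random.Random(0)
dim, tokens = 64, 256
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
query = [rng.gauss(0, 1) for _ in range(dim)]
# Stand-in for compression error: small Gaussian noise on every key coordinate.
noisy_keys = [[k + rng.gauss(0, 0.05) for k in key] for key in keys]

ref, approx = attention(query, keys), attention(query, noisy_keys)
similarity = cosine(ref, approx)
top1_match = top_k(ref, 1)[0] == top_k(approx, 1)[0]
top1_in_top5 = top_k(ref, 1)[0] in top_k(approx, 5)
print(f"similarity={similarity:.4f} top1_match={top1_match} top1_in_top5={top1_in_top5}")
```

In a real evaluation the same three numbers would be averaged over every layer and KV head, as the paragraph above describes.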

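Conclusion: the paper's claims hold up

Even in my unofficial reproduction the paper's claims were substantially reproduced. 3 bits is the practical sweet spot: 5x compression at 99.5% attention fidelity should make any difference in generation quality hard to notice, and the official implementation will presumably be better optimized and may do better still. 2 bits works, but 7.3x compression is aggressive and the top-1 match rate falls to 66%, so model output may change.

Many thanks to Prince Kanuma, whose Mac test results inspired this experiment. That his numbers (3.8x, 4.9x) and my PC results (3.8x, 5x) nearly match also shows how portable this algorithm is.

I did not measure actual text generation quality or speed improvements this time. I focused on checking whether the math holds and whether attention patterns are preserved on a real model. Both passed. The experiment made it feel concrete that more and more capable LLMs will run on ordinary consumer hardware. I plan to publish the implementation scripts on my GitHub (@tommy-studio).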

02 Research Document · Document

TurboQuant in depth: how Google cut the KV cache by 6x, and a reproduction on a consumer GPU

Source video: YouTube · Uploaded: 2026-03-25 · Channel: Onchain AI Garage · Length: about 18 minutes

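Introduction: why "KV cache compression" is the topic right now

On March 24, 2026, Google Research unveiled TurboQuant, a new vector quantization algorithm. It addresses the inference-time memory bottleneck of large language models (LLMs), and its central claim is simple: "shrink the key-value cache (KV cache) to 3 bits with zero accuracy loss, and compute attention up to 8x faster on an H100 GPU." The news quickly spread through the tech industry, even earning the nickname "Pied Piper moment" (TechCrunch), and Tom's Hardware highlighted the figures "at least 6x memory savings, up to 8x performance gain" (Tom's Hardware).

So why the KV cache, and why now? Weight quantization is a field with years of accumulated research. The catch is that during inference the main consumer of VRAM is not the weights but the KV cache, which keeps growing as the conversation gets longer. The video's creator, Onchain AI Garage, focused on exactly this point, re-implementing the algorithm from the paper alone without an official implementation and reproducing 5x compression at 99.5% attention fidelity on his own RTX 3060. This article covers how TurboQuant works technically, why "zero overhead" is such an important achievement, and what the consumer-GPU reproduction implies.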

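Body 1: the KV cache, the real memory bottleneck of LLMs

Transformer-based LLMs store the key and value vectors of every token processed during inference in a cache, so that generating the next token can attend over the cache instead of recomputing the entire preceding context. The problem is that this cache grows linearly with context length. Its size is the product layers × heads × dimension × tokens × 2 (for K and V), so once context reaches tens of thousands of tokens it quickly consumes gigabytes.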
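The product above is easy to turn into a calculator. In this sketch the 36 layers and 2 KV heads come from the video; the head dimension of 128 and the 2 bytes per value (fp16) are assumptions made for illustration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Keys and values (hence the factor 2) for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Assumed shape: 36 layers, 2 KV heads, head_dim 128 (assumption), fp16 values (assumption).
size_8k = kv_cache_bytes(layers=36, kv_heads=2, head_dim=128, tokens=8192)
size_40k = kv_cache_bytes(layers=36, kv_heads=2, head_dim=128, tokens=40_960)
print(f"{size_8k / 2**20:.0f} MiB at 8,192 tokens")
print(f"{size_40k / 2**20:.0f} MiB at 40,960 tokens")
```

Under these assumptions the 8K case works out to 288 MiB, in the same range as the 289MB original cache reported for the reproduction, and the size scales exactly linearly with the token count.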

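The Spheron blog calls this "the hidden tax of LLM inference" and stresses that the KV cache bottleneck is a major driver of expensive GPU infrastructure costs (Spheron). Users running LLMs locally feel the same problem: as context grows, they run out of VRAM, responses slow down, or they have to switch to a smaller model. In the end, how small the KV cache can be made determines the quality of the local AI experience.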

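Body 2: TurboQuant's two-stage design, PolarQuant and QJL

The technical core of TurboQuant is two stages (research.google). Stage 1 is PolarQuant and stage 2 is QJL (Quantized Johnson-Lindenstrauss).

PolarQuant applies a random rotation to the vector and then rewrites coordinate pairs as a length and an angle (polar coordinates). This turns the data into a distribution that is easy to pack tightly and removes the need to store per-block scale constants, a chronic cost of earlier quantization methods. As a result it minimizes MSE (mean squared error) distortion while keeping memory overhead effectively at zero. The approach itself has its roots in Google's earlier PolarQuant paper (arXiv:2502.02617).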
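The effect of the rotation step can be seen on a deliberately extreme input. This sketch swaps in a randomized Hadamard transform (random sign flips followed by a Walsh-Hadamard mix) as a cheap stand-in for a dense random rotation, and converts only adjacent coordinate pairs to polar form. It illustrates the idea and is not the procedure from the paper.

```python
import math
import random

def fwht(v):
    """Normalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    norm = math.sqrt(len(v))
    return [x / norm for x in v]

def random_rotate(v, signs):
    """Flip random signs, then mix with Hadamard: an orthogonal map, so length is kept."""
    return fwht([s * x for s, x in zip(signs, v)])

rng = random.Random(0)
dim = 128
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
spiky = [0.0] * dim
spiky[3] = 10.0  # one huge outlier coordinate
rotated = random_rotate(spiky, signs)

# Rewrite adjacent coordinate pairs as (radius, angle).
polar = [(math.hypot(a, b), math.atan2(b, a)) for a, b in zip(rotated[0::2], rotated[1::2])]
print(f"max |coordinate| before: {max(abs(x) for x in spiky):.2f}")
print(f"max |coordinate| after:  {max(abs(x) for x in rotated):.2f}")
print(f"radius of first pair:    {polar[0][0]:.2f}")
```

The single outlier is spread evenly over all coordinates while the vector's length is unchanged, which is why a fixed codebook can be used afterwards without per-block scale constants.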

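QJL corrects the residual left by stage 1. Once PolarQuant has captured the main shape, a small error remains, and QJL removes the bias by applying only a 1-bit sign quantization to that error. Darshan Fofadiya's illustrated explainer describes the combination with the analogy "stage 1 packs the suitcase while stage 2 automatically corrects the color shift," and stresses that the inner product estimate becomes unbiased only when the two stages are combined (Darshan Fofadiya).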
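The unbiasedness claim can be checked numerically with a small sign-sketch estimator. The sketch below assumes a Gaussian random projection and stores one sign bit per projection plus the residual's norm. The function names are hypothetical, and the vector `r` merely plays the role of a stage 1 residual.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def qjl_encode(r, S):
    """Keep only the sign of each random projection (1 bit each) plus the norm of r."""
    return [1.0 if dot(row, r) >= 0 else -1.0 for row in S], math.sqrt(dot(r, r))

def qjl_inner_product(q, signs, r_norm, S):
    """Estimate <q, r> from the sign bits; sqrt(pi/2) undoes the shrinkage caused by sign()."""
    m = len(S)
    return math.sqrt(math.pi / 2) * r_norm / m * sum(s * dot(row, q) for s, row in zip(signs, S))

rng = random.Random(0)
dim, m, trials = 16, 16, 2000
q = [rng.gauss(0, 1) for _ in range(dim)]
r = [rng.gauss(0, 1) for _ in range(dim)]  # plays the role of the stage 1 residual

estimates = []
for _ in range(trials):
    # Fresh random projection each trial, so the average reveals any systematic bias.
    S = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(m)]
    signs, r_norm = qjl_encode(r, S)
    estimates.append(qjl_inner_product(q, signs, r_norm, S))

true_value = dot(q, r)
mean_estimate = sum(estimates) / trials
print(f"true={true_value:.3f} mean of estimates={mean_estimate:.3f}")
```

Individual estimates are noisy, but their average converges on the true inner product, which is what "unbiased" means here. Note that the query `q` is left unquantized; only the residual is reduced to sign bits.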

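In other words, TurboQuant's real innovation is not "fewer bits, smaller size" but "unbiased compression with no auxiliary constants (zero overhead)." Tahir, writing on Medium, identifies this as the decisive difference between traditional quantization and TurboQuant. Earlier methods had to store a scale and zero point per block in full precision, losing 1 to 2 bits of the real bit budget to "hidden overhead," whereas TurboQuant eliminates that cost through a mathematical reformulation (Medium, Tahir).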
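The "1 to 2 bits" figure follows from simple amortization. The sketch below assumes a 16-bit scale and a 16-bit zero point stored per block, with illustrative block sizes; real formats vary, so these are example values only.

```python
def effective_bits_per_value(bits, block_size, scale_bits=16, zero_point_bits=16):
    """Payload bits plus the per-block constants amortized over the block."""
    return bits + (scale_bits + zero_point_bits) / block_size

for block in (16, 32, 64, 128):
    print(f"block size {block:>3}: {effective_bits_per_value(3, block):.2f} bits per value")
```

With blocks of 16 or 32 values, a nominal 3-bit format actually costs 5 or 4 bits per value, which is the hidden overhead that a scheme without per-block constants avoids.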

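Body 3: official benchmarks and the zero-accuracy-loss claim

Google validated TurboQuant on open models including Gemma and Mistral and reports that no accuracy loss was observed on any downstream benchmark at 3-bit KV cache precision (research.google). Nerd Level Tech summarizes the claim as "6x KV cache savings, 8x faster attention computation, no retraining" (Nerd Level Tech).

It is worth noting how much "no fine-tuning" matters in practice. Many earlier low-bit quantization methods required QAT (Quantization-Aware Training) or lengthy calibration to recover accuracy. Because TurboQuant is plug-and-play and applies directly to existing models, for a production team it is one of the few options that can cut infrastructure cost immediately without touching an already deployed model.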

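Body 4: reproduction on a consumer GPU, measured on an RTX 3060

What makes Onchain AI Garage's reproduction interesting is that it ran not on an H100 but on an RTX 3060 12GB, a mid-range gaming GPU. Reading the paper alongside Claude Code (Opus 4.6), he implemented the Lloyd-Max codebook, a stage 1 corresponding to PolarQuant, a stage 2 corresponding to QJL, and a KV cache wrapper in Python from scratch. He then loaded Qwen2.5-3B with 4-bit weights, injected a hidden fact into a long document, extracted the real KV cache with a single forward pass, and compared attention scores before and after compression.

The results were as follows. The 289MB original cache shrank to 76MB at 4 bits (3.8x), 58MB at 3 bits (5x), and by about 7.3x at 2 bits. The attention similarity of 0.995 at 3 bits, a 99.5% match with the original, is especially striking. These figures closely match the 3.8x to 4.9x reported by Prince Kanuma, an independent experimenter working on a Mac (Hackaday). On an RTX 3060 12GB, this compression ratio has the practical meaning that 8K of context becomes 40K on the same model.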

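Body 5: ecosystem reaction and community porting efforts

Although an official implementation has not been fully released, the community is moving quickly. The repository for llama.cpp, a leading local LLM runtime, already has an open discussion on porting TurboQuant, where contributors are debating how to map the PolarQuant and QJL stages onto CPU and GPU kernels (llama.cpp Discussion #20969). A developer-oriented explainer on DEV Community stresses that "KV cache compression changes the bottleneck of local LLM performance" and argues that a 3-bit KV cache could become a realistic path to mobile inference (DEV, arshtechpro).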

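Key insights

The real innovation is "zero overhead." More than the 3-bit figure itself, what sets TurboQuant apart from earlier quantization is that unbiased estimation is possible without per-block scale constants. Removing hidden overhead matters more the smaller the bit budget becomes, because a fixed overhead takes up a larger share of it.
The two-stage structure is a necessity, not an option. PolarQuant alone leaves inner product bias, and QJL alone cannot capture the main shape. The combination is what makes "unbiased and low-bit" possible at the same time.
No retraining determines the speed of adoption. Being plug-and-play means it can be applied to existing production models immediately, which gives companies the lowest possible barrier to entry.
The gains are real on consumer GPUs too. That 5x compression reproduces on a mid-range GPU like the 3060 means TurboQuant is not a cloud-only optimization; it lowers hardware requirements across the local AI ecosystem.
"Context length" becomes a real resource. Going from 8K to 40K is not just a change in a number; it opens up local workflows that were previously impractical, such as analyzing a large codebase or holding a long conversation.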


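03 Pro/Con Debate · Debate

Debate: "Is TurboQuant a breakthrough that actually changes the landscape for running LLMs locally?"

Proposition: Do TurboQuant's claims of "3-bit KV cache, no retraining, zero accuracy loss" mark a turning point that fundamentally reshapes the practical conditions for operating LLMs in local and edge environments, or are they merely an incremental, benchmark-bound improvement?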

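Round 1

🟢 Pro: "KV cache compression really does lift the hard limit on local LLMs"

TurboQuant's significance lies less in the numbers themselves than in where it intervenes. For local LLM users the practical wall has not been model weight size but the KV cache, which grows with every token of context. 4-bit weight quantization became standard long ago, yet everyone has watched VRAM fill up with KV cache once context goes past 8K or 16K. TurboQuant pushes that bottleneck down to 3 bits without even requiring retraining.

More decisive still is the "zero overhead" design. Conventional vector quantization had to store a full-precision scale constant for each block, losing 1 to 2 bits of the effective bit budget as a hidden cost. The combination of PolarQuant and QJL removes that cost through mathematical reformulation alone. So "3 bits" here means something different from 3 bits in the past.

Reproducibility adds another line of evidence. The RTX 3060 user in the video built a from-scratch reproduction from the paper alone, without the official implementation, and still achieved 5x compression at 99.5% attention fidelity. That means this is not a fragile result that depends on hardware-specific optimization; the math itself is robust. That robustness is what distinguishes a real breakthrough.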

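🔴 Con: "There is still a wide gap between benchmark accuracy and real-world usability"

"Zero accuracy loss" is only an average figure over a particular set of benchmarks. The quality users actually experience, especially factual consistency in long-form generation, rare vocabulary, and reasoning chain accuracy, cannot be reduced to a single metric such as an attention similarity of 0.995. Neither the official report nor Onchain AI Garage's reproduction adequately tested what cumulative error a 0.5% difference leads to at the scale of a million tokens.

Above all, the reproduction measured only attention score similarity and did not touch actual generation quality or speed. The presenter himself said that text generation quality and speed improvements were left for another time. What was reproduced is "the math holds," not "the model answers identically."

Another trap is that the 8x speedup figure is specific to the H100. Without kernel-level optimization, PolarQuant's random rotation and QJL's sign operations become overhead instead. Indeed, the presenter acknowledged that the Python implementation in the video is not yet speed optimized. The narrative of "an innovation you feel immediately on local hardware" still lacks real runtime evidence.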

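Round 2

🟢 Pro (rebuttal): answering Con point by point

Con's first concern, that "an attention similarity of 0.995 does not guarantee bounded cumulative error in long-form generation," is only half right. Attention similarity is not just a correlation coefficient; it measures whether the model gives the same weight to the same tokens, and the fact that it never fell below 99.4% is strong evidence that compression barely moves the model's decision boundary. Moreover, Google's official measurements used actual downstream accuracy, not attention similarity, and no loss was observed on Gemma or Mistral. "Benchmark accuracy" is therefore already a metric closer to perceived quality.

Con's second point is that the reproduction stopped at the "math level," but this gets the value of a reproduction backwards. That a community reproduction confirmed "the math holds" is precisely the evidence that TurboQuant is a portable algorithm, not a hardware-specific trick. A production-quality implementation is a matter of time, not of uncertainty.

Con's third point, the dependence on the H100, mixes two things up. The 8x speedup is indeed an H100 figure, but the "6x memory reduction" is an intrinsic property independent of kernel optimization. Onchain AI Garage's experiment showed that this memory gain is real on a 3060 as well. The case for "local gains" rests on memory, not speed; speed is a bonus.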

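🔴 Con (counter-rebuttal): turning Pro's arguments around

Pro's first claim, that "where it intervenes is what matters, and it targets the KV cache bottleneck directly," is right as a diagnosis of the field, but that alone does not make it a "game changer." KV cache quantization already had a substantial body of prior work, including KVQuant, KIVI, and SmoothQuant-KV. Whether TurboQuant is an incremental advance in this lineage or a discontinuous leap cannot be asserted while independent third-party benchmarks have yet to accumulate. That it picked up the "Pied Piper" nickname can even be read as a sign that expectations are running ahead of substance.

Pro's second claim, that "zero overhead changed what the number means," is accurate as rhetoric but over-interpreted. Removing block scales is theoretically elegant, but the real gain varies widely with the model, sequence length, and batch size. In some workloads the overhead saving may amount to only about 1%, in which case a "6x reduction" could settle at "around 4x." In fact, Onchain AI Garage's Qwen experiment came in at 3.8x at 4 bits and 5x at 3 bits, slightly below the paper's figures.

Pro's third claim, that "the reproduction is evidence of portability," is dangerously optimistic. The reproducer implemented it by reading the paper with Claude Code and did no kernel optimization. Used as-is in production, this implementation would likely be slow. "Easy to port" and "easy to run in production" are entirely different problems, and the community will need on the order of months to reach the latter.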

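Round 3

🟢 Pro: a direct response to Con's Round 2

Con's first rebuttal, the cautious view that "we cannot assert a discontinuous leap over prior work," sounds reasonable, but in the history of science breakthroughs have always been "one step along a lineage." If KVQuant and KIVI opened up the 3 to 4 bit regime, TurboQuant added a new condition within it: "unbiased with no overhead." A new condition of that kind is not an incremental improvement; it frees up an axis of the design space, and freeing an axis usually sets off a chain of downstream optimizations.

Con's second rebuttal, that "the real gain from zero overhead varies widely," paradoxically supports TurboQuant's robustness. Onchain AI Garage's experiment hit 5x at 3 bits even on a 3060. Coming in "slightly below" the paper's figures is not a flaw but evidence that measurements approach theory even without optimization. An official implementation with kernel optimization is more likely to exceed the paper's figures.

Con's third rebuttal, that "porting may be easy but a production runtime is hard," is correct. But the fact that the time involved is "a few months" undercuts Con's argument. In AI infrastructure a few months is a generation. That discussion has already begun in the llama.cpp repository means it is quite realistic for TurboQuant-style kernels to land in major local runtimes within 2026. On a time scale of "within months" and not "right now," this can fairly be called a game changer.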

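🔴 Con: a precise reply to Pro's Round 2

Pro's first rebuttal, that "an attention similarity of 0.995 is strong evidence the decision boundary is preserved," leans on a statistical average. An average of 99.5% hides what happens in the tail of the distribution. The core use cases for local LLMs, such as code generation, mathematical reasoning, and long JSON output, are "thin ice" territory where a small attention mismatch can break the entire result. The benchmark accuracy Pro cites also comes from general-purpose evaluations on models like Gemma and Mistral, not from tail-of-distribution tests.

On Pro's second rebuttal, that "the reproduction is evidence of portability," I agree. But portability is not a sufficient condition for a game changer. Academia produces dozens of portable ideas every year, yet only a handful have actually changed the defaults of the local LLM ecosystem. Whether TurboQuant joins that handful will be settled by adoption and QA results in runtimes such as llama.cpp, MLX, and vLLM, not declared on the strength of a single reproduction demo.

Pro's third rebuttal, the framing that "memory, not speed, is the essence of the local gain," is only partly right. Less memory does allow longer context on the same hardware, but a large share of local users' complaints are about speed. Attention compute time grows with context, so if only memory is compressed and compute does not speed up proportionally, the experience may become "longer context, slower responses." This is why it matters that the 8x speedup is specific to the H100.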

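🧭 Synthesis

Points of agreement

Both sides agree that TurboQuant is a meaningful advance in the KV cache quantization lineage. In particular, both accept that the combination of PolarQuant and QJL is theoretically elegant and that the design choice of removing per-block scale constants yields real gains in the low-bit regime. Neither disputes that reproducing 99.5% attention fidelity in an independent experiment, without the official implementation, points to the mathematical robustness of the algorithm.

There is common ground at the level of practical application too. A 6x memory reduction and a 3-bit KV cache that apply without retraining appeal to companies and individuals alike, and it is an objective fact that major local runtimes such as llama.cpp have begun discussing this direction seriously.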

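Open questions

Does an attention similarity of 0.995 preserve actual generation quality on tail-of-distribution tasks (code, math, long structured output)?
Without official kernel optimizations, how much faster or slower than the uncompressed baseline is real runtime speed on a consumer GPU?
Does the "zero accuracy loss" claim at 3 bits hold for fine-tuned derivative models and MoE architectures beyond Gemma and Mistral?
Can the 1-bit sign operations added in the QJL stage be made cost-efficient in real CUDA/Metal kernels?
How long will community ports take relative to the release of an official implementation, and what hybrid compromises will emerge in the meantime?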

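A broader view

This debate ultimately comes down to a meta question: where should the yardstick for innovation be placed? Pro treats the event that opens an axis of the algorithmic design space as the innovation. Con takes the point at which that opening becomes the default of actual user experience as the criterion. Both views are valid, and in practice both are needed.

One easily overlooked angle is "diversifying the metrics." Attention similarity, downstream accuracy, long-form factual consistency, tail robustness, and kernel throughput are all separate axes. For the TurboQuant debate to stay productive, community benchmarks need to measure and report these axes separately. For local users in particular, "how quality changes on the tasks I do most" is a far more important measure than a single "average quality" number.

It also helps to see TurboQuant not as a single technique but as the first concrete instance of a paradigm, "zero-overhead vector quantization." If the paradigm takes hold, it could spread beyond the KV cache to vector search, embedding stores, and on-device search indexes. The real test will not be "how far one TurboQuant paper went" but "how much follow-up research the zero-overhead quantization axis attracts."

Finally, the social implications of reproducing results on a consumer GPU deserve mention. Much LLM efficiency research to date has assumed data-center hardware such as the H100 and A100, so a gap of "8x in the paper, nothing noticeable on my 3060" was the norm. Onchain AI Garage's experiment is significant because it narrowed that gap, showing that a substantial part of the paper's numbers survives on a consumer GPU. This is not only about TurboQuant; it can be read as an example of the "verification protocol" that future efficiency research should adopt. If independent reproductions on the day of publication or within a week become standard practice, exaggerated claims will be filtered out faster as well.

In sum, TurboQuant sits somewhere between the hype of "an instant game changer" and the dismissal of "a routine incremental improvement." Its exact coordinates will be set over the coming months by kernel implementations in runtimes such as llama.cpp, MLX, and vLLM, by community QA on tail-of-distribution tasks, and by validation on fine-tuned derivative models and MoE architectures. What matters is that while we wait, we already hold measured data showing "5x compression without retraining." With a starting line like that, this is one of those rare moments when skeptics and optimists both have grounds for their views.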

04 English Original · Transcript

So earlier today, I saw this tweet from Google Research introducing TurboQuant, which was a new compression algorithm that can reduce LLM key value cache memory by at least 6x.
And I went on to read this blog a little bit. It goes into fairly good detail about how TurboQuant works, what the previous research had been, and how the actual experiment and results came out.
So I thought this was interesting. I thought this was a pretty big breakthrough.
So in this video, I'm just going to try to break down what this TurboQuant is, because the blog itself and the papers are pretty dense.
I'm going to try to explain in simple terms what the problem it's trying to fix and how it tries to fix it.
And then I'm actually going to, on my PC and GPU, try to validate the approach, try to see what results we can get in testing.
Now, Google hasn't released the official implementation of this, but there was enough to go on in these papers to validate the actual math in the algorithm.
So I'm going to try to implement it on my PC just to check that the approach itself actually produces some results.
So this isn't Google's official TurboQuant, but I will be testing the approach based on my own research and experiment.
First, let me break down what TurboQuant is a little bit.
So, how Google is making AI models six times smaller and eight times faster without losing any accuracy.
So first, the problem.
AI models like ChatGPT, Gemini, and other AI models store billions of numbers, which means they need expensive specialized hardware just to run.
That makes them slow, costly, and out of reach for a lot of people. If you've ever waited a few seconds for an AI to respond, that's why: it's crunching enormous amounts of data behind the scenes.
And that's why it's just too difficult to run really high-level frontier models, or even high-performance local LLMs, on a lot of smaller retail devices. I've encountered this myself.
So what is quantization? Think of this as like compressing a photo.
A raw photo off your camera might be 50 megabytes, but when you save it as a JPEG it drops to two megabytes, and honestly you can barely tell the difference.
So quantization does the same thing for AI. Instead of storing every single number with 32 digits of precision, it rounds down to just three or four digits.
The AI still gives you the same quality answers, but it just needs a fraction of the memory.
So what's a KV cache, which is a key concept with TurboQuant? This is kind of like the AI's cheat sheet during a conversation.
When you have a conversation with the AI, it needs to remember everything you've said so far, and the key-value cache is like a cheat sheet it writes as you talk.
The keys are like labels (what topic was this about?) and the values are the actual information.
The issue is that this cheat sheet grows with every message in context. In a long conversation it can eat up gigabytes of memory, and that's memory that could be used to make the AI smarter or faster.
So if we can shrink this down without losing important information, everything gets better: we can have longer conversations, cheaper hardware, faster responses.
And this is where TurboQuant comes in. It's a two-stage process.
Stage one, which uses PolarQuant, reorganizes the data into a simpler shape. Think of it like neatly folding your clothes before packing a suitcase.
Then stage two, QJL, captures the fine details that stage one might miss. It uses a clever math trick to compress information into simple plus or minus signals with zero extra memory overhead.
So the result is the data goes from 32 bits down to just three per number, roughly a 10x reduction. The AI's answers stay the same, and you don't need any kind of retraining on the models.
You can just apply this to an existing model and it works.
So the numbers speak for themselves. According to their research and their papers: 6x less memory used for the KV cache, 8x faster processing (they were using H100 GPUs), and 3-bit precision with zero accuracy loss, a zero percent accuracy drop across all benchmarks.
They tested this on popular open-source models like Gemma and Mistral using several different benchmarks, and TurboQuant really outperformed a lot of the existing compression models.
And there's no training needed. It works out of the box.
So what does this mean for us, the people who run AI on their own computers?
You could run bigger models on less hardware.
You could have much longer conversations. The 6x smaller KV cache means your local AI can remember 6x more context without running out of memory or slowing down, which I'm sure we've all experienced if you use local LLMs a lot.
You get faster responses, in theory, and you can use extreme compression, which brings us to genuinely useful local AI, offline, on mobile devices, which is kind of the dream.
So in theory, if this works, the impact on AI models in general is that you would have cheaper cloud AI and greener AI: less memory and faster processing means less electricity consumed.
You can have AI everywhere, on all sorts of devices that right now can't run it, and you can run bigger models on the same hardware instead of having to buy new GPUs.
So the key takeaway here is that TurboQuant proves that we can make AI dramatically smaller and dramatically faster without sacrificing quality. It's a plug-and-play breakthrough that works on any model, no retraining needed.
So in order to actually validate this, I decided to go into Claude Code, using Opus 4.6, and actually dig into the papers and see if we could validate the math behind it.
We did just that, and it was a ground-up implementation of TurboQuant. Obviously not the exact same thing, but in theory it had every component described in the paper, and we did this as a Python package.
These were the four building blocks. The key point is that we didn't use Google's code; we wrote each piece independently, based on the papers. If you look in the blog, there are a couple of papers that were used for TurboQuant.
Here you can see the four implementations that we did: the foundation was the Lloyd-Max codebook, then stage one was the TurboQuant MSE, stage two was the TurboQuant Prod, and then the integration with the KV cache wrapper.
All of these were used to validate that the math behind TurboQuant was accurate.
So the first test was the Lloyd-Max codebook. There's a lot of advanced math in this, I know, but the key takeaway is just to understand this bottom line here: zero.
The symmetry check was at zero. The codebook, which is the dictionary the compressor uses to translate numbers, needs to be perfectly balanced, so it's symmetric around zero. And it is. If it wasn't, the compression would favor certain values and that would introduce a lot of errors.
So for MSE distortion, we found that the empirical MSE stayed well within the theoretical bounds. You can see the percentages here in the bars.
The paper says that the error will be no worse than X; our actual error came in well under that.
So the next test, and this is kind of the most important one, is the inner product bias. You can see it here: it's nearly zero everywhere.
This is important because when AI picks which word to say next, it compares the vectors using inner products. If a compression makes those comparisons consistently wrong in one direction, the AI is going to give bad answers.
So the bias being very close to zero means the compression doesn't skew the AI's decision making.
The correlation as well: you can see that even at 3-bit, the compressed version agrees with the real answer 92% of the time.
So why were there two stages? Because stage one alone can result in biased findings. With stage one plus QJL, that was where we were able to secure an unbiased finding.
So imagine compressing a photo and the colors shift slightly blue. Stage one is the compression. Stage two is the automatic color correction that perfectly undoes the shift, at the cost of just one extra bit per number.
We also measured the compression rates on the KV cache. We built a full key-value cache wrapper and measured the real memory savings. Once again, this is all in Claude Code.
We saw 7.76x compression at 2-bit, 5.22x compression at 3-bit, and 3.94x at 4-bit. You can see the configuration here if you'd like to see it for yourself.
Then we did a needle-in-a-haystack test, and we found that we had 100% retrieval accuracy after the compression. You can see these are all exactly nine out of nine, zero information loss for retrieval.
The test basically hid one specific piece of info among 8,192 others, pressed everything down to two bits, or three bits, or four bits, and then asked: can you still find it? The answer was yes, every time.
So this is the test that proves that compression doesn't lose the important stuff.
These are the real results on my RTX 3060.
My PC is a couple of years old now, kind of a mid-range gaming PC. And this was the main finding: we found 5.3x memory compression.
That is the real takeaway for a consumer GPU; that would be real savings. 5.3x compression means a KV cache that would have filled 10 gigabytes now fits under two, leaving a ton of room for the model itself.
So it would allow you to run larger and more powerful local LLMs on the same device.
Now obviously, our Python implementation here isn't speed optimized yet. The paper reports up to 8x speedups using H100s. The focus here was just to validate that the math was accurate.
So the verdict was that the math checked out. All seven of our tests passed.
The MSE distortion was within bounds, inner products are unbiased, the needle retrieval was perfect, and the compression ratios match the papers that we saw.
Okay, so now I'm in Claude Code, and we're going to be working with a real model.
It took some trial and error to actually get a proper test using my hardware, but we did it. I did this with Opus 4.6.
What we did was implement Google's TurboQuant algorithm from their paper, the 2026 paper, and test it on a real language model. In this case we used Qwen2.5-3B, and this was running on my 3060.
This here was just the test that I explained before, when we were testing the math.
The actual validation test was with the Qwen2.5-3B model in 4-bit weights, so this would fit in two gigs of VRAM.
We fed it a long document with a hidden fact, ran a single forward pass to capture the real key-value cache, compressed the key-value cache with TurboQuant (the theoretical TurboQuant) at 2-bit, 3-bit, and 4-bit, and compared the attention scores.
So basically, when the model looks back at previous tokens, does it look at the same ones with compressed versus uncompressed keys?
This is a good test in our case, because the attention scores actually determine the model's output. If the attention pattern is preserved, then it means the model's behavior is preserved.
You can see the compression results, how much the KV cache took up. The original was 289 megabytes; 4-bit was 76, so that's 3.8 times smaller. The 3-bit got us to five times smaller, and the 2-bit got us to 7.3 times.
So that was similar, if you remember the previous slides, to the theoretical compression savings that we thought we would get, here using the Qwen model.
So at 3-bit, what used to take 289 megabytes now takes only 58. On my 12-gig GPU, the RTX 3060, that's the difference between fitting 8,000 of context and fitting 40,000 with the same model.
That's a huge difference if you're actually using these models locally.
So here's the real key: the accuracy. How accurate was it? How well was attention preserved?
These are the results of our test, and you can see these all scored very, very well. 0.995 over here with 3-bit means the compressed attention pattern is 99.5% similar to the original, uncompressed one.
A very slight decrease with longer contexts is to be expected: more tokens means more chances for very small errors. But even at 8K, 3-bit stays above 0.994. So, very impressive accuracy.
So this means the accuracy was very well preserved, even though we were able to save so much more space in context.
These are very good. Top-1 match: does the model look at the exact same most important token? You can see the rates here. Then top-5 match. This is across all 36 layers and two key-value heads per layer.
At 3-bit, 92% of the time the model's top pick is still in the top five, and the overall attention distribution is 99.5% similar.
So what does this all mean? The claims are holding up.
Previous users testing this reported 3.8x and 4.9x compression, and I have to give credit here to Prince Kanuma, who ran this experiment. When I saw his tweet I got kind of inspired. He did this on a Mac, so I decided to try to do this on a PC and see if I could do it.
We got very similar results from our test to his test: we got 3.8x and 5x at equivalent bit widths.
So 3-bit is the practical sweet spot. You get 5x compression with 99.5% attention fidelity. It's pretty great.
The paper's zero accuracy loss claim is reasonable at three to four bits. The attention patterns are so close to the original that generation quality would be nearly indistinguishable in practice.
And I'm sure the actual official TurboQuant will be much more refined in its approach, so it could probably get even better results than we got here.
The 2-bit works, but it pushes it. 7.3x compression is aggressive, and the 66% top-1 match means the model would sometimes attend to different tokens, which could change outputs. It seems that 3-bit was the best one.
Today I just wanted to validate this theoretically and then using an actual model. We didn't do actual text generation quality, or the speed improvements that the papers did, or longer context, but maybe down the line we could try this.
But pretty great results nonetheless, using an actual model on my PC and seeing if the compression algorithm works. And it really did.
It'll be exciting to see the actual official TurboQuant and see what it can do. It seems like more and more we're going to be able to run more heavy-duty local models on just normal consumer hardware, which is exciting.
So that's it for today. I broke down what TurboQuant was from today's Google announcement, tried to make it more understandable, and then ran a couple of validation tests: testing the actual algorithm (does the math make sense?) and then testing it with an actual model, the Qwen 2.5 model that I was using here.
So that's going to be it for today. Please leave a comment, like, and subscribe to the channel. I do a lot of these kinds of experiments with new AI tools.
If you have any questions about how to do this, I'll publish the tests and scripts that I used to implement this in my GitHub repo, so check that out. It's at tommy studio; that's going to be my GitHub. I'll do that tomorrow.
But yeah, that's it for this video. Please like and subscribe, and I'll see you in the next video.