I Used Autoresearch to Fine-Tune GPT-2 Using D&D Dialogue

2026-03-24 · 23m · Subtitles: -

01 Translated summary · from Korean

Using Autoresearch to Fine-Tune GPT-2 on D&D Dialogue

Source: https://www.youtube.com/watch?v=T6pQVgIt8ZY · Uploaded: 2026-03-24 · Length: 24m · Channel: Onchain AI Garage

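Introduction: What Is Fine-Tuning?

In today's video we fine-tune an AI model hands-on, and we combine that with the Autoresearch idea published by Andrej Karpathy. Before moving into Claude Code, let's first go over the basics of what fine-tuning is.

A language model is, at its core, a system that predicts the next word in a sentence. It is trained on a massive amount of text collected from the internet, books, articles, and websites. GPT-2, the model used today, has 124 million parameters and was trained by OpenAI on roughly 10 billion words. Because it has learned the patterns of language, it can generate remarkably fluent text. Think of it as a student who has read a million books: able to write convincingly about almost any topic, but a specialist in none. Given "The cat sat on the...", it picks the next word probabilistically, based on the billions of examples it has seen.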

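Why Fine-Tuning Is Needed

Fine-tuning takes an already trained model and trains it further on a much smaller, more specific dataset. Because the model already understands language, fine-tuning adds a specialty on top. The key point is that it needs far less data and time than training from scratch. It is like a chef who graduated from culinary school, joins a sushi restaurant, and concentrates on the craft of making sushi: the general cooking ability stays, and one skill becomes excellent.

Practically every major AI lab works this way. OpenAI built GPT-4 and then fine-tuned it into ChatGPT; Anthropic trained a base model and then produced the Claude variants; Meta trained LLaMA and then released specialized models such as LLaMA Chat.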

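The Biggest Enemy: Catastrophic Forgetting

The central difficulty in fine-tuning is catastrophic forgetting: while learning new material, the model loses knowledge it already had. Train too much on the new data and the model loses grammar and coherence; train too little and it never picks up the new style. Striking that balance is the real challenge of fine-tuning.

I had in fact tried an earlier project: a model trained only on the plot of One Piece. I tried to fine-tune a large language model on a very narrow set of One Piece synopses, and after several days of work it still failed. The mismatch between a large model and a small dataset was too great. One Piece has many episodes, but from a machine learning standpoint it was a hopelessly small dataset. This project takes the lessons from that failure and aims to do fine-tuning properly.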

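LoRA: The Technique That Changed the Game

This is where LoRA (Low-Rank Adaptation), published in 2021, comes in. Instead of touching all 124 million GPT-2 parameters, we add small adapter layers and train only those, roughly 1 to 5% extra parameters. The original model stays completely frozen, so its grammar is preserved by design, and only the small adapters learn the new style. The result is faster training, lower memory use, and safer outcomes. It is like annotating a textbook with a highlighter and sticky notes rather than rewriting it: the text is untouched, but the way you read it changes.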

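The Dataset: Critical Role D&D Dialogue

The base model is GPT-2, and the dataset is dialogue from Critical Role, a popular live-play D&D (Dungeons & Dragons) show. Published by Microsoft, the dataset is estimated at about 400,000 dialogue turns and 20 to 50 million tokens, a good size for fine-tuning GPT-2. It mixes the Dungeon Master's (DM's) narration, dialogue between players, and combat role-play, which also makes it fun material. It is smaller than what GPT-2 was originally trained on, but many times larger than the failed One Piece dataset.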

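Dual Evaluation Metric: Chasing Two Goals at Once

To measure success properly, two things have to be tracked together. One is the D&D score (how well the model has learned the D&D style); the other is the grammar penalty (how much general English ability it has lost). The final score is computed as "D&D score + grammar penalty × weight", and, as the reported results show, a lower combined score is better.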
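As a rough illustration of the formula above, the sketch below computes a combined score in Python. The video does not spell out how the penalty is derived, so the definition used here (the amount by which the general-English loss exceeds the stock model's, floored at zero) is an assumption, as are the function names, the default weight, and the example numbers.

```python
def grammar_penalty(grammar_bpb: float, baseline_grammar_bpb: float) -> float:
    """Assumed definition: zero unless general-English loss is worse than the stock model's."""
    return max(0.0, grammar_bpb - baseline_grammar_bpb)


def combined_score(dnd_bpb: float, grammar_bpb: float,
                   baseline_grammar_bpb: float, weight: float = 1.0) -> float:
    """Lower is better: D&D loss plus a weighted penalty for lost grammar."""
    return dnd_bpb + weight * grammar_penalty(grammar_bpb, baseline_grammar_bpb)


# Hypothetical numbers, chosen only to show the two regimes.
healthy = combined_score(dnd_bpb=1.05, grammar_bpb=0.98, baseline_grammar_bpb=1.00)
cheating = combined_score(dnd_bpb=0.90, grammar_bpb=1.30, baseline_grammar_bpb=1.00, weight=2.0)
print(healthy, cheating)
```

With a penalty of this shape, a run that improves the D&D number by wrecking general English ends up with a worse combined score than a run that improves modestly on both.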

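Both are needed because, without the grammar penalty, the model can cheat: it can memorize D&D-flavored words while emitting English that makes no sense. In practice it does happen that the metric looks good while the output is a mess. The dual metric blocks that shortcut.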

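Adding Autoresearch

Autoresearch is a system that runs dozens of experiments automatically to find the best settings. It changes a setting, trains, measures the score, keeps the change if it helped and discards it otherwise, then repeats. The tunable parameters include the learning rate (how fast the model adapts), the adapter size (capacity for new knowledge), which layers the adapters are applied to, training duration and warmup, and regularization values such as weight decay.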
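The change, train, score, keep-or-discard cycle can be sketched as a small greedy loop. This is not Karpathy's implementation: in the real tool an LLM agent proposes code edits, whereas here a random halve-or-double proposal stands in for the agent, and `toy_experiment` is a made-up scoring function that replaces a five-minute training run. All names and numbers are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def propose(best: Config, rng: random.Random) -> Config:
    """Copy the best config and halve or double one setting (stand-in for an agent's proposal)."""
    candidate = dict(best)
    key = rng.choice(sorted(candidate))
    candidate[key] *= rng.choice([0.5, 2.0])
    return candidate


def search(start: Config, run_experiment: Callable[[Config], float],
           n_experiments: int, seed: int = 0):
    """Greedy loop: keep a candidate only if its score (lower is better) beats the best so far."""
    rng = random.Random(seed)
    best, best_score = dict(start), run_experiment(start)
    log: List[Tuple[Config, float, str]] = [(dict(start), best_score, "baseline")]
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        score = run_experiment(candidate)
        decision = "keep" if score < best_score else "discard"
        if decision == "keep":
            best, best_score = candidate, score
        log.append((candidate, score, decision))
    return best, best_score, log


def toy_experiment(config: Config) -> float:
    """Made-up score with its minimum at learning_rate=1e-3 and rank=16."""
    return (1.0 + 0.05 * abs(math.log10(config["learning_rate"]) + 3)
            + 0.01 * abs(math.log2(config["rank"]) - 4))


start = {"learning_rate": 1e-4, "rank": 8.0}
best, best_score, log = search(start, toy_experiment, n_experiments=20)
print(best, round(best_score, 4))
```

Every experiment is logged whether or not it is kept, which is what makes the later analysis of "what helped" possible.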

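The benefit of automation is clear. With dozens of combinations and five minutes per experiment, the system can run about 12 configurations per hour, on the order of a hundred overnight. Work that would take a person days by hand shrinks to a single night.

Before fine-tuning, GPT-2 produces generic, news-like sentences such as: "The president announced today that the new policy would take effect immediately. Stocks rose 2% on the news." Grammatically flawless, but without personality. After fine-tuning, the hope is to get the tone of a D&D session: DM narration, lively exchanges between characters, and combat description.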

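Hands-On: Building the Pipeline and the First Baseline

Now we move into Claude Code. The overall sequence is: download the dataset, explore and clean the data, tokenize with the GPT-2 tokenizer, split into train/validation/test, measure the baseline, run the training loop, evaluate with the dual metric, and finally run the Autoresearch loop.

First, the Critical Role D&D transcripts were downloaded from GitHub: 280 JSON files containing 725,000 dialogue turns, far larger than the old One Piece dataset. After extraction to text files, the data was split by episode into a training set, a validation set, and a test set. The model learns from the training set, the validation set is used for checks along the way, and the test set is held out for final evaluation.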
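Splitting by episode, rather than by individual line, keeps every turn of a session in the same split so that validation text never overlaps with training text. A minimal sketch follows; the 80/10/10 ratio, the seed, and the episode naming are assumptions, since the video does not state them.

```python
import random
from typing import Dict, Iterable, List


def split_by_episode(episode_ids: Iterable[str], val_fraction: float = 0.1,
                     test_fraction: float = 0.1, seed: int = 0) -> Dict[str, List[str]]:
    """Assign whole episodes to train/val/test so no episode is shared between splits."""
    ids = sorted(set(episode_ids))
    random.Random(seed).shuffle(ids)          # deterministic shuffle for reproducibility
    n_test = max(1, int(len(ids) * test_fraction))
    n_val = max(1, int(len(ids) * val_fraction))
    return {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],
    }


# 280 hypothetical episode names, matching the number of JSON files reported.
episodes = [f"episode_{i:03d}" for i in range(280)]
splits = split_by_episode(episodes)
print({name: len(ids) for name, ids in splits.items()})
```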

Next came tokenization. The model does not read words, only token IDs (numbers), so converting the text to numbers once and saving the result makes training much faster. This yielded 13.3 million training tokens.
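The "convert once and save" step might look like the sketch below, which uses only the Python standard library. The `encode` function is a stand-in that treats UTF-8 bytes as token IDs; the actual pipeline used GPT-2's BPE tokenizer, which is not reproduced here. Storing IDs as unsigned 16-bit integers works for GPT-2 because its vocabulary of 50,257 entries fits below 65,536. File and function names are hypothetical.

```python
import array
import os
import tempfile
from typing import List


def encode(text: str) -> List[int]:
    """Stand-in tokenizer: UTF-8 bytes as token IDs (the real run used GPT-2 BPE)."""
    return list(text.encode("utf-8"))


def cache_tokens(text: str, path: str) -> int:
    """Tokenize once and write the IDs to disk as unsigned 16-bit integers."""
    ids = array.array("H", encode(text))
    with open(path, "wb") as f:
        ids.tofile(f)
    return len(ids)


def load_tokens(path: str) -> array.array:
    """Reload cached IDs without re-running the tokenizer."""
    ids = array.array("H")
    with open(path, "rb") as f:
        ids.frombytes(f.read())
    return ids


sample = "MATT: Roll for initiative."
with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "train_tokens.bin")
    n_tokens = cache_tokens(sample, token_path)
    reloaded = load_tokens(token_path)
print(n_tokens, list(reloaded)[:5])
```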

Then the baseline was measured: the score of stock GPT-2, with no fine-tuning, on the validation set. This number is the bar the Autoresearch loop has to beat.

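The First Experiment: A Bug and Its Fix

The first training run was disappointing. The combined score came out at 2.091, far worse than the original baseline. The cause was in the training script Claude had first written: the training targets were misaligned, so every prediction was being compared against the correct token two positions ahead. A silent bug like this is something to check for whenever you write fine-tuning code.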
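The misalignment can be made concrete with plain lists. The video only says the labels were "double shifted"; one common way this happens, assumed here, is that the data pipeline shifts the labels by one position and the loss code (for example a Hugging Face causal LM, which shifts labels internally) shifts them again. The function names below are illustrative.

```python
from typing import List, Sequence, Tuple


def next_token_pairs(tokens: Sequence[str]) -> List[Tuple[str, str]]:
    """Correct alignment: the prediction made after token i is scored against token i+1."""
    return list(zip(tokens[:-1], tokens[1:]))


def double_shifted_pairs(tokens: Sequence[str]) -> List[Tuple[str, str]]:
    """Buggy alignment: labels shifted once by hand, then shifted again inside the loss."""
    pre_shifted_labels = tokens[1:]           # first shift, done in the data pipeline
    twice_shifted = pre_shifted_labels[1:]    # second shift, done by the loss code
    return list(zip(tokens[:-2], twice_shifted))


tokens = ["The", "cat", "sat", "on", "the", "mat"]
print(next_token_pairs(tokens)[:2])
print(double_shifted_pairs(tokens)[:2])
```

In the buggy pairing the model is rewarded for guessing "sat" after seeing only "The", which is why the loss stays high even though nothing crashes.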

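After the fix, the rerun looked very different. Both the D&D score and the grammar score improved on the baseline, and the combined score dropped from the baseline's 1.132 to 1.05. The grammar penalty was zero. In other words, LoRA fine-tuning raised D&D dialogue ability while also slightly improving general grammar, which is exactly the result we wanted. The first samples are grammatical and correctly formatted, with a faint D&D flavor in lines such as "Look at this list" and "So do we know where they go tomorrow?"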

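Hyperparameter Search: Three Questions

Now it is time to search the hyperparameters properly: nine experiments in total, five minutes each, a 45-minute budget, in three phases. Phase 1 is a learning-rate sweep, phase 2 covers target modules, and phase 3 covers LoRA rank.

The learning rate sets how large a step the model takes on each update. Too small and the model barely changes in five minutes; too large and it overshoots and produces useless output.

The target modules determine where in GPT-2 the LoRA adapters are attached. A transformer block contains several kinds of layers. The attention layers decide which words to focus on when predicting the next word, and matter most for learning style. The output projection layer merges the attention results, and the feed-forward layers handle deeper reasoning and processing. More target modules mean more learning capacity, but also a greater risk of disturbing the base model's behavior.

LoRA rank is the width of each adapter. r=4 is a very small adapter, good only for fine adjustments; r=8 (the current setting) gives moderate capacity; r=16 can learn more complex patterns; r=32 has high capacity but carries a risk of overfitting.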
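To get a feel for how rank and target modules translate into trainable parameters, the sketch below counts LoRA parameters for GPT-2 small. A rank-r adapter on a d_in × d_out projection adds r × (d_in + d_out) parameters. The layer shapes are those of the 124M GPT-2 (12 blocks, hidden size 768); the module labels are descriptive names chosen for this example, and which modules the video's runs actually targeted is not stated.

```python
from typing import Dict, Iterable, Tuple

N_LAYERS = 12                    # transformer blocks in GPT-2 small
BASE_PARAMS = 124_000_000        # rounded figure quoted in the video

# (d_in, d_out) of the projections inside one block; labels are illustrative.
PROJECTIONS: Dict[str, Tuple[int, int]] = {
    "attention_qkv": (768, 2304),
    "attention_out": (768, 768),
    "mlp_in": (768, 3072),
    "mlp_out": (3072, 768),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A is rank x d_in and B is d_out x rank, so the adapter adds rank * (d_in + d_out)."""
    return rank * (d_in + d_out)


def adapter_params(modules: Iterable[str], rank: int) -> int:
    """Total adapter parameters when every listed module gets an adapter in every block."""
    return N_LAYERS * sum(lora_params(*PROJECTIONS[m], rank) for m in modules)


for rank in (4, 8, 16, 32):
    n = adapter_params(["attention_qkv"], rank)
    print(f"r={rank:>2}: {n:,} adapter parameters ({100 * n / BASE_PARAMS:.2f}% of base)")
```

Adapter size grows linearly with rank and with the number of targeted modules, which is consistent with the trade-off described above: more capacity, and more ways to disturb the base model's behavior.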

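First-Round Results and a Second Round

All nine experiments finished, and every one scored better than the baseline; the best score was 1.04. The learning rate had the largest effect, with the optimum 10 times higher than the conservative initial setting. No grammar degradation appeared in any run, thanks to LoRA keeping the base model frozen. More target modules helped, and higher rank improved the score but with diminishing returns.

Building on this, a second round of seven experiments crossed the winning settings: an even higher learning rate, the best learning rate with more modules, the best learning rate with higher rank, and so on. The questions were "Can it learn faster?" and "Do modules and capacity compound when combined?"

In the end the best configuration targeted all three module types at the optimal learning rate, a 9% improvement over the original baseline. Expanding the modules was the biggest win, the learning rate plateaued around 1e-3, and raising the rank showed clear diminishing returns. Grammar degradation was still zero.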

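Final Samples

After 16 experiments in total (9 + 7), samples were drawn from the best configuration. One output read, roughly: "The door creaks open. You see a large man with a black beard in long robes, almost exactly the same robes Scott had been looking at. Oh, no..." Another sample fell into a "this, this, this..." loop around a dice roll, but on the whole the dialogue had the proper shape of a D&D session: speaker labels, narration from the DM (Matt), jokes and crosstalk among the players, game mechanics, character names, laughter, and mostly grammatical English.

It is not perfect. On the initiative-roll prompt it fell into a number loop, and some of the grammar is awkward. Still, given that this started from plain GPT-2, it is a fairly good result.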

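Closing

Fine-tuning is how general-purpose AI becomes a specialist, and it is the technique behind every chatbot you have used. LoRA makes the process efficient and safe: it protects the model's core knowledge while adding new capabilities. Good evaluation metrics, and in particular the dual metric used today, are essential for measuring what actually matters. Data quality and quantity determine success, and enough data with a moderate domain shift is the sweet spot. Finally, automated search such as Autoresearch systematically finds good settings that a person would struggle to discover by hand. If you are interested in fine-tuning, more videos on the topic are planned. I hope this one was useful, and I'll see you in the next video.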

02 Research document · Document

Turning GPT-2 into a D&D Speaker with LoRA and Autoresearch: A Fine-Tuning Experiment Log

Source video: YouTube · Channel: Onchain AI Garage · Uploaded: 2026-03-24 · Length: about 24 minutes

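Introduction: Why Bring Back GPT-2 in 2026?

In 2026 frontier models are said to reach trillions of parameters, yet the best material for learning fine-tuning hands-on is still GPT-2, released in 2019. It has 124 million parameters, a training loop runs within minutes on a single consumer GPU, and differences in the results are plainly visible. This experiment from the Onchain AI Garage channel uses that GPT-2 as the base and trains it on D&D (Dungeons & Dragons) dialogue from Critical Role: about 400,000 turns by the initial estimate, about 725,000 in the actual download. On top of that it layers the Autoresearch loop that Andrej Karpathy released in March 2026 to wide attention, so that an agent searches the hyperparameters automatically.

This single experiment condenses nearly every keyword of modern LLM fine-tuning: parameter-efficient fine-tuning (PEFT), LoRA adapters, catastrophic forgetting, a dual evaluation metric, an automated experiment loop, and the field debugging of finding a bug and rerunning. This write-up follows the flow of the video, looks at why each step was designed the way it was, and digs a little deeper into what the background research says.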

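1. Fine-Tuning and Catastrophic Forgetting: A Question of Balance

Fine-tuning adds a specialty to a general model. Because the pretrained model already knows language, a much smaller specialized dataset is enough to teach it a new style or new knowledge. The problem is that catastrophic forgetting can occur along the way: if the model adapts too strongly to the new data, it forgets grammar and common sense it previously had.

Studies from 2024 to 2025 indicate that this is a general phenomenon observed across LLMs in the 1B to 7B range. Notably, one of the causes identified is the sharpness of the loss landscape: the narrower and steeper the valley of the loss function, the more easily the model loses existing knowledge as it adapts to a new task. Representative mitigations include sharpness-aware minimization, regularization based on per-parameter importance, and Learning without Forgetting, a knowledge-distillation approach that uses the previous model's outputs as soft targets. For detailed treatments, see Revisiting Catastrophic Forgetting in LLM Tuning and An Empirical Study of Catastrophic Forgetting in LLMs.

The moment in the video where the speaker admits that the One Piece fine-tune failed is this problem in miniature. When a tiny domain of One Piece synopses was forced into a large language model, the model was stuck between two extremes: losing its grammar or failing to learn the style at all. This project balances data scale (about 725,000 turns) and model size (124M) far better, and adds another safeguard that avoids touching the full parameter set: LoRA.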

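2. LoRA: How Low-Rank Adapters Changed Fine-Tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient technique proposed by Microsoft researchers in 2021, and it has since become a de facto standard for open-source fine-tuning. The core idea is simple. It hypothesizes that the weight change during fine-tuning (ΔW) has a low intrinsic rank, and approximates that change as the product of two small matrices, B and A. The original weights W stay completely frozen, and only the newly added product B·A is trained.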
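The frozen-weight-plus-low-rank-update idea can be written out for a single linear layer. The sketch below is dependency-free Python, not the reference implementation. It follows the formulation in the LoRA paper, h = W x + (alpha / r) · B A x, with B initialized to zero so that training starts from the unmodified base model; the tiny matrices and the helper names are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 alpha: float, rank: int) -> Vector:
    """Frozen base output W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)                      # W is never updated during fine-tuning
    update = matvec(B, matvec(A, x))         # only A (rank x d_in) and B (d_out x rank) train
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]      # 2 x 3 frozen weight
A = [[0.5, -0.5, 1.0]]                       # rank-1 down-projection
x = [1.0, 2.0, 3.0]

at_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0, rank=1)   # B = 0 at initialization
trained = lora_forward(W, A, [[1.0], [2.0]], x, alpha=2.0, rank=1)   # after B has moved
print(at_init, trained)
```

Because the update is a separate additive term, dropping or zeroing B restores the stock model exactly, which is the sense in which the base knowledge is "frozen".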

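The effect is striking. According to the original paper, LoRA: Low-Rank Adaptation of Large Language Models, compared with full fine-tuning of GPT-3 175B using Adam, LoRA reduces the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times, while matching or exceeding full fine-tuning quality on RoBERTa, DeBERTa, GPT-2, and GPT-3. Crucially, it adds no inference latency, a major advantage over adapter-based methods. The implementation is public in the microsoft/LoRA repository.

In the video's experiment, only about 0.65% of GPT-2's parameters are trained, through the LoRA adapters. This also acts as a structural guard against catastrophic forgetting: the base model's language knowledge stays frozen, and the newly learned D&D style is stored only in the adapters. The fact that the grammar penalty stayed at zero across all 16 experiments supports this. More recently, techniques such as EWCLoRA, which combines LoRA with EWC (Elastic Weight Consolidation), have appeared, and PEFT methods are becoming central to catastrophic forgetting research.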

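3. Autoresearch: The Autonomous Experiment Loop Karpathy Launched

The other protagonist of this project is Autoresearch. It is a 630-line Python tool that Andrej Karpathy released on March 7, 2026, and within days it reportedly drew 21,000 GitHub stars and 8.6 million views on social media. Its core is a propose → train → evaluate → keep-or-discard loop. An LLM agent edits the code directly, trains for exactly five minutes, and keeps the change if the validation loss (val_bpb) improves or reverts it otherwise. At roughly 12 experiments per hour, close to 100 experiments accumulate overnight.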
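Two of the numbers above can be unpacked with a few lines of arithmetic. The sketch assumes that "bpb" means bits per byte, the usual reading: the summed cross-entropy (in nats) divided by ln 2 and by the number of UTF-8 bytes in the evaluated text, which makes scores comparable across tokenizers. The throughput helper ignores any setup time between runs. Both function names are illustrative.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood in nats into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)


def experiments_in(hours: float, minutes_per_experiment: float = 5.0) -> int:
    """How many fixed-length experiments fit in a time budget, ignoring overhead."""
    return int(hours * 60 // minutes_per_experiment)


# A model that spends exactly one bit on each of four bytes scores 1.0 bpb.
one_bit_each = bits_per_byte(4 * math.log(2), "abcd")
per_hour = experiments_in(1)
overnight = experiments_in(8)
print(one_bit_each, per_hour, overnight)
```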

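What sets Autoresearch apart from established hyperparameter tuning tools such as Optuna or Ray Tune is that its search space is not a predefined parameter space but any code change the LLM can come up with. In Karpathy's own runs, it reportedly surfaced structural findings automatically, such as a missing attention-sharpening scaler in QK-norm, regularization for value embeddings, and tuning of the AdamW beta parameters. Shopify CEO Tobi Lütke shared that he applied the tool to an internal query-expansion model and obtained a 19% improvement in validation score on a 0.8B model in just 37 experiments. For more detail, see DataCamp's AutoResearch guide and The New Stack's introductory article.

The video's experiment adapts the idea slightly, running 16 experiments (9 + 7) along three axes: learning rate, target modules, and LoRA rank. The results were clear. The learning rate was the biggest lever, with an optimum 10 times the initial setting. The largest score gain came from attaching adapters to all three layer types (attention, output projection, feed-forward), and rank showed clear diminishing returns as it grew. A 9% improvement over the baseline may not look large, but the first nine-experiment round fit in a 45-minute budget, an efficiency that trying combinations by hand cannot match.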

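4. Dual Evaluation Metric: A Guard Against the Model's Shortcuts

Metric design is often overlooked in fine-tuning projects. This experiment uses a dual metric, "D&D score + grammar penalty × weight". Judged on the D&D score alone, the model could raise its score by memorizing D&D-flavored words while producing output that is not grammatical at all. This failure mode, often called reward hacking, is common.

Adding the weighted grammar penalty to the combined score is a safeguard: the moment the model gives up basic English ability, its score gets sharply worse. That the grammar penalty stayed at zero in the video's results is consistent with LoRA's structural protection and the dual metric's check working together. This matters especially for a small model like GPT-2, because the less spare capacity a model has, the stronger the tendency to sacrifice one ability to learn another.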

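5. Bugs and Debugging: What Fine-Tuning Really Looks Like

Blog posts and tutorials usually show only the results that worked, but another reason this video is interesting is that it exposes the process of finding and fixing a bug midway. In the first training run the combined score was 2.091, worse than the baseline, because the target tokens in the training script Claude wrote were shifted by two positions. The model was being trained to predict the token after next instead of the next token. Label-alignment slips like this, a close relative of the off-by-one error, are among the most common traps in autoregressive training.

This episode shows the reality of writing code together with an AI agent. An agent that writes scripts quickly is powerful leverage, but a person still has to catch the early signal that the training loss is not falling the way it should. When a quality regression shows up in fine-tuning, the checklist is almost always the same: label alignment, tokenization options, attention mask, learning-rate schedule, and whether the dataset is shuffled. The video is good teaching material as a compact example of that reality.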

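Key Insights

Small model + small adapter = big lessons. Practicing on GPT-2 instead of a frontier model lets one person run the entire fine-tuning pipeline in an afternoon. LoRA makes the experiment faster and safer.
Catastrophic forgetting should be prevented structurally. LoRA's design of freezing the base model and training only adapters makes grammar preservation the default, even without extra techniques such as regularization or knowledge distillation.
The evaluation metric is the model's objective. Judged on the D&D score alone, the model could have dropped grammar and memorized words. The dual metric closes off that shortcut.
Automated experiment loops raise the ceiling of a human experimenter. Under the simple constraint of five minutes per experiment, Autoresearch covers a search space that a person would rarely attempt by hand.
The biggest risk is not the algorithm but a bug in the data pipeline. One small mistake such as misaligned target tokens can ruin an entire experiment. The classic advice still holds: when the loss looks wrong, suspect the data and labels before the algorithm.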

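Further Reading

LoRA: Low-Rank Adaptation of Large Language Models (original paper): the primary source, covering the concept, experiments, and GPT-2/3 results.
microsoft/LoRA GitHub repository: reference implementation and example code.
karpathy/autoresearch GitHub repository: the original implementation of the 630-line autonomous experiment loop.
A Guide to Andrej Karpathy's AutoResearch (DataCamp): a tutorial on how Autoresearch works, with usage examples.
Karpathy's 630-line Python script ran 50 experiments overnight (The New Stack): an article on what Autoresearch means and the early reactions.
Revisiting Catastrophic Forgetting in LLM Tuning (EMNLP 2024): a recent paper treating catastrophic forgetting, the loss landscape, and mitigation techniques systematically.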

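03 Pro/con debate · Debate

Debate: "Should LLM-agent-driven automated experiment loops (Autoresearch) become the default paradigm of fine-tuning research?"

Motion: As the LoRA + Autoresearch loop showed in a GPT-2-scale experiment, should autonomous experiment loops, in which an LLM agent directly edits hyperparameters and code, be accepted as the default workflow of machine learning research?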

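Round 1

🟢 Pro: "Autonomous experiment loops are leverage that frees researchers' time"

First, loops like Autoresearch multiply a single researcher's experimental throughput many times over. In the video's experiment alone, nine experiments ran in 45 minutes and seven more in a second round, the interactions among learning rate, target modules, and LoRA rank were analyzed, and the result was a 9% improvement over the baseline. Doing the same analysis by hand would take days, and realistically it would probably never have been attempted. Karpathy himself has reported running between 50 and 700 experiments in unattended runs with the 630-line script. This leverage is a practical means of narrowing the gap between small labs and large ones.

Second, autonomous loops sidestep human cognitive bias. The more experienced a researcher is, the more they instinctively avoid a bold move such as raising the learning rate tenfold. Yet the biggest lever in the video's experiment was exactly that tenfold learning rate. Because the agent sweeps the search space without a human's experience-driven conservatism, it finds optima that a person would simply have passed by.

Third, the results are reproducible and systematic. Every experiment's configuration and score are logged, so post-hoc analysis is easy and sharing within a team is clean. That clarity is beyond comparison with hand-crafted tuning based on "this one felt better".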

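🔴 Con: "Automated loops reduce scientific thinking to a score-chasing game"

First, Autoresearch's evaluation function ultimately comes down to a single scalar (val_bpb, or the combined score in this video). That number is all the agent optimizes, and the loop never asks why a change worked. Even the video's experiment survived only because a person stepped in when the combined score jumped to 2.091 and found the target-token alignment bug. Left to the agent alone, the bug would likely have been mistaken for bad hyperparameters and discarded.

Second, there is structural bias in the search space. The code changes an LLM agent proposes lean toward patterns it saw often in its training data. It tries sensible changes readily, but truly innovative ideas overwhelmingly come from a human researcher's intuition and hypothesis formation. Autoresearch excels at fine adjustment of existing recipes, but it does not produce a paradigm shift like the arrival of the transformer.

Third, there is a resource problem. A hundred experiments overnight looks impressive on a single GPU, but scaled to a whole lab it accumulates pointless power and carbon costs. Burning ten GPU-hours to find an insight that an hour of human thought would have produced is often outright anti-scientific.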

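Round 2

🟢 Pro (rebuttal)

Con's first claim, that optimizing a single scalar leaves the loop unable to interpret bugs, is only half right. The token-alignment bug in the video is a case where the dual metric (D&D score + grammar penalty) raised an unmistakable alarm, a combined score of 2.091, so a person caught it quickly. In other words, when an automated loop is paired with a well-designed metric, bugs become visible sooner, as a sudden deterioration in score. Autoresearch does not abandon interpretation; it shows more sharply where interpretation is needed. An automated loop also lets the same experiment be rerun immediately after a fix, which shortens the debugging cycle.

Con's second claim rests on the romanticized premise that innovation comes from humans, and there is plenty of room to dispute it. Karpathy's actual Autoresearch results reportedly produced structural findings that could each lead to a paper: the missing scaler in QK-norm, value-embedding regularization, banded-attention tuning. In the case of Shopify CEO Tobi Lütke, 37 experiments yielded a 19% improvement. That goes beyond fine adjustment of existing recipes. Human hypothesis formation and agent search throughput are complements, not competitors.

On Con's third point, resources are precisely what Autoresearch answers well. The tool's design principle is "single GPU, five-minute experiments, models under 600 million parameters". One overnight run of Autoresearch on a single GPU is far more economical in power than a large lab's sweep over thousands of GPU-hours.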

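🔴 Con (counter-rebuttal)

Pro's first claim, that autonomous loops multiplied experimental throughput, is right only in terms of quantity. As throughput rises, so does the share of meaningless experiments. Even the 9% improvement in the video is a result that one basic human experiment, "try raising the learning rate by an order of magnitude first", would have obtained without phase 1's scattershot learning-rate sweep. Automation is leverage, but if that leverage only works on problems whose answer is already nearly settled, it does not mean much.

Pro's second claim, that agents have no cognitive bias, is a dangerous exaggeration. LLM agents inherit the statistical biases of their training data. The agent did not boldly discover that a tenfold learning rate was optimal; far more likely, that value was already one of those recommended in countless LoRA tutorials. The real bias comes not from humans but from the LLM proposing first the settings it has seen most often.

Pro's third claim, about reproducibility and rigor, is also only half right. Having logs does not secure scientific explanatory power. Autoresearch logs record what worked; they do not record hypotheses about why it worked or tests of them. Reproducibility and explanatory power are different problems, and autonomous loops do not automate the latter.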

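Round 3

🟢 Pro (reply to Con's Round 2)

Con's strongest push in Round 2 was the claim that automation also raises the share of meaningless experiments. That claim rests on a misunderstanding of the cost of a single experiment. By design, one Autoresearch experiment is five minutes on a single GPU. So "a hundred meaningless experiments" and "one meaningful experiment" are not costs on the same scale. The economics of search come down to expected value × number of trials, and as the cost per trial approaches zero, the tolerable share of meaningless experiments naturally rises. That is not waste; it is an optimized search strategy.

Con's second attack, that the LLM's proposals also reflect training-data bias, has merit, but the remedy is not "reject Autoresearch" but "diversify the proposal model". Agent ensembles, temperature adjustment, and injecting human seed ideas all mitigate this bias. Human researchers are not free of their own biases either, so this is a shared challenge rather than a weakness of automation.

Finally, Con's point that logs cannot replace explanation is accurate, but it blends what Autoresearch should do with what humans should do. Logs are raw material for explanation, and interpreting that material is still a human job. The promise of automation is not to replace explanation but to obtain the material for explanation cheaply. That distinction is the crux.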

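🔴 Con (reply to Pro's Round 2)

The case Pro raised in Round 2, the dual metric exposing the bug, actually strengthens Con's argument. The bug was found by a person, not by the Autoresearch loop. The loop merely recorded that the score was poor; the real cause, target tokens shifted by two positions, was found by a person reading the script and reasoning about it. This demonstrates that an automated loop produces an interpretable signal, not that it automates interpretation. The "synergy of automated loop + dual metric" that Pro claims is half-automation that holds only while a person keeps watch from outside the loop.

As for Pro citing Karpathy's QK-norm case and Shopify's 19% improvement to argue that structural findings also emerge automatically, it should not be forgotten that all of these came on top of a very well-defined task and prior setup by skilled researchers. Real innovation is decided upstream, in which problem to solve and which metric to evaluate with, and Autoresearch automates none of those upstream decisions. In short, the agent is capable only within the solution space of a well-defined problem.

Pro's rebuttal on resources ("it's a single GPU, so it's fine") is likewise correct only locally. It is light for an individual researcher, but if the practice spreads to thousands of researchers, the aggregate cost can far exceed conventional sweeps. "My one GPU is cheap" is one of the most dangerous economic illusions of the cloud era.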

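🧭 Synthesis

Points of Agreement

Both sides agree on at least three things. First, automated experiment loops like Autoresearch have genuinely lowered the cost per experiment. Second, the tool works properly only on top of a well-defined task and a robust evaluation metric. Third, final interpretation and scientific explanation remain a human job, and automation does not replace them. That is, the two sides tacitly agree that the tool is useful and that it should not be granted a status beyond that of a tool.

Open Questions

Several questions remain unresolved. (1) How dependent is the automated loop's search space on the LLM's data bias? (2) Does the claim that the cost per experiment is low still hold when five-minute single-GPU experiments are scaled to large models of 100B parameters or more? (3) Can an agent's code edits produce structural innovation, or do they inherently stay at local optimization of existing recipes? (4) As Autoresearch logs accumulate, can meta-analysis of why something worked be automated, at least in part? These are areas where empirical answers are likely within the next one to two years.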

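A Broader View

Seen from one level up, the real issue is not automation versus humans but which points of the research pipeline should be automated. Autoresearch targets the stretch of searching for solutions to a well-defined problem, and the fine-tuning experiment in the video suggests that this stretch is no more than about half of the whole project. Problem definition, dataset design, metric design, debugging, and interpretation of results were all still human territory. That arrangement is likely to persist for some time.

So the original motion, whether to make Autoresearch the default paradigm, is close to a false dichotomy. The more precise question is how to set the degree of automation at each stage of the research pipeline. A healthy answer probably looks like a three-stage structure: humans lead on problem definition and metric design, agents lead on solution search and log collection, and humans lead again on interpretation and generalization. The video's experiment, without intending to, is a living example of exactly that structure. The agent searched the fine-tuning recipe, but the problem definition (use D&D dialogue as the dataset), the evaluation design (use a dual metric), and the identity of the bug (target tokens shifted by two positions) were all pinned down by a person.

One more point: the degree of automation is itself a variable that shifts over time. The release of LoRA in 2021 opened up a great deal of room for automating parameter-efficient fine-tuning, and Autoresearch in 2026 pushed hyperparameters and code edits into the agent's domain as well. The next step is likely to be automated design of the evaluation metric itself, or an agent that proposes bug candidates first. Each time the boundary moves, the definition of what humans are good at has to be updated too. What matters is the habit of asking, at every point, what a person should be doing just above the boundary the agent has pushed forward. The 24-minute experiment in this video is valuable as rare teaching material that translates that question into concrete numbers and samples.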

04 English original · Transcript

So in today's video, we're going to be exploring fine tuning AI models. And this will also
incorporate Autoresearch from Andrej Karpathy. First, I'm going to give a little bit of an
overview of what fine tuning is when you're using AI models. And then we're going to actually go
into Claude Code and begin this process. So very basic, what is a language model? At its core,
a language model predicts the next word in a sentence. It's trained on a massive amount of
text from the internet, books, articles, websites. GPT-2 is 124 million parameters, trained by OpenAI
on roughly 10 billion words. By learning patterns in language, it can generate remarkably fluent
text. So think of it as a student who has read a million books. They can write convincingly about
almost any topic, but they're a generalist, not a specialist in anything. Given "The cat sat on the",
they're able to predict, based on billions and billions of previous examples,
what the next word will likely be. So what is fine tuning? Fine tuning is taking a
pre-trained model and training it further, but on a smaller specific data set. The model already
understands language. Fine tuning teaches it a specialty, and this requires far less data and
time than training it from scratch. So think of a chef who graduated from culinary school and then
trains at a sushi restaurant, which is the fine tuning. They don't forget how to cook. They just get really
good at one specific skill, making sushi. So this is used by every major AI lab. OpenAI
developed GPT-4 and then fine tuned it into ChatGPT. Anthropic trained its base model and then
fine tuned it into specific Claude models. Meta trained Llama and then specific Llama chat models.
So the challenge
that we're going to face is catastrophic forgetting. And this is when a model learns new
content, it can forget what it already knew. So too much training on new data, the model loses
grammar and coherence. Too little training, the model doesn't learn the new style at all. Finding
the right balance is the central challenge of fine tuning. That is the balance you need to strike.
Previously, I had tried to do a project; you've probably heard me talking about this in other
videos.
I wanted to do a project that was focused on One Piece, but I was trying to fine tune
a larger language model that knew English on very specific One Piece synopses. I spent several days
trying to get this project to work, but ultimately couldn't because the balance was just too far off.
All the larger models that I was trying to use didn't balance well with a very small data set of
One Piece synopses.
There's obviously a lot of One Piece episodes, but in the world
of machine learning and AI models, that's actually not that much data. So that was ultimately a
failure. But from it, we're going to be creating this project and learning how to do fine tuning
properly. And this is how you do it, with LoRA, low-rank adaptation, which is a technique from 2021
that changed the game. And this is for efficient fine tuning. Instead of changing all 124 million
parameters, we add small adapter layers; only one to 5% extra parameters are trained.
The original model stays completely frozen. So the grammar is preserved by design.
Only the tiny adapters learn the new style, faster training, less memory and safer results.
Instead of rewriting a textbook, you add sticky notes with annotations. The original text is
untouched, but the notes change how you read it. So our base model is going to be GPT-2,
and the data set that we're going to fine tune it on is called Critical Role,
which is a popular live-play D&D show. So basically, this is going to be this kind of
D&D dialogue that we're fine tuning our model to. There's approximately 400,000 dialogue turns.
This was published by Microsoft, and it's estimated to have 20 to 50 million tokens,
which is ideal for GPT-2 fine tuning. And it's a mix of DM narration,
player dialogue, combat roleplay. But I thought it'd be a fun data set to try to do fine tuning.
And even though it's smaller than what the base model GPT-2 was trained on,
it's still many, many, many times larger than the data set I had for the One Piece synopsis.
So the dual metric of measuring success, we need to measure two things at once to know
our fine tuning is working. We need to know the D&D score, how well the model learned the D&D
dialogue style, and the grammar penalty, how much general English ability was lost.
So the combined score is the D&D score plus the penalty times a weight. So why both? Without the
grammar penalty, the model could cheat, memorizing D&D type text while producing incoherent English.
And that's why you need both at once, because sometimes the system can
cheat the metrics, but the actual output is not very good. This dual metric ensures that the model
learns D&D style while still writing properly. So how are we going to incorporate Autoresearch
into this? So Autoresearch is an automated system that searches for the best settings by running dozens
of experiments: change a setting, train, measure the score, keep or discard, and search again. So
there's a bunch of different parameters. But these are some examples: learning rate,
how fast the model adapts; adapter size, how much capacity it has for new knowledge; which layers it
is applied to; the training duration and warm up; weight decay regularization. We'll go into this in
more detail as we go through this. So why do we automate this? With Autoresearch,
there's dozens of possible combinations. Each experiment takes five minutes, the system can
run overnight testing configurations humans would take days to do manually,
and then results are logged for analysis later on. What we expect to see,
don't know if we're going to actually get this. I haven't run this yet, but before fine-tuning,
you're going to get something like this. The president announced today that the new policy
would take effect immediately. Stocks rose 2% on the news. Just generic news-like text. It's
grammatically correct, but there's no personality. There's no style to it. After fine-tuning,
we're hoping to see this kind of D&D dialogue style with the narrator, character voice,
and proper grammar. So the goal is a model that generates text sounding like a D&D session.
DM descriptions, player banter, combat narration, while maintaining
coherent grammatical English throughout. So that is the balance we are talking about.
So just some key takeaways before we get started in Claude Code. Fine-tuning is how general AI models
become specialists. It's behind every chatbot you've used, and it's a very common practice
in AI labs. LoRA, L-O-R-A, makes fine-tuning efficient and safe,
preserving the model's core knowledge while adding capabilities.
Good evaluation metrics are crucial. The dual metric ensures that we measure what actually
matters. Data quality and quantity determine success. Enough data with a moderate domain shift
is the sweet spot. And automated search, Autoresearch, systematically finds optimal
settings that humans would struggle to discover manually. So the same principles that power ChatGPT
and every modern AI assistant, applied to a 20-sided die. So here we are in Claude Code. I'm using
Opus 4.6, and these are the steps that we're going to go through. We're going to download
the dataset, explore and clean the data. It should be pretty clean, but we're going to
check it to make sure. Tokenize with GPT-2's tokenizer. Then we're going to split it into the
train, validation, and test sections. We're going to measure the baseline and see what we get from
the baseline.
We're going to measure the stock GPT-2 score and then we're going to begin the training loop. Right now, it has my failed One Piece data from before, so we're going to replace it with the D&D data and then maintain the dual metric evaluation. We're going to run the first manual experiment to get our baseline, set up the Autoresearch loop, and then run the loop. So hopefully by the end of this, we have a perfect fine-tuned model on D&D dialogue.
Let's get started. Okay, so in Claude Code, this is what it just did. It downloaded the data set,
grabbed the Critical Role D&D transcripts from GitHub. They came in as 280 JSON files.
And it was 725,000 dialogue turns. So quite a hefty data set, actually much larger than the
One Piece one I was trying to work with. So it built text files from this automatically,
it extracted the raw dialogue into a simpler format, and then split it by episode.
So it created a training set, and then a validation set, and then a test set.
The training set is what the model is going to learn from. So this is what it's actually going
to be trained on. The validation set is going to be used to measure how well it's learning.
And then test set is going to be held out for final evaluation. And this is how this kind of
training regimen is usually organized. So we have the data, that's a big step.
Next, we're going
to tokenize it, which means we're going to convert the text into numbers that GPT-2
understands. The model itself doesn't read words, it reads token IDs. So we need to do this
conversion once and save it, so training is fast. And that's true of any kind of model like this.
Tokenizing converts the text into numbers that the model can understand
easily. Then we're going to measure the baseline. So we're going to run the stock GPT-2
without any fine tuning on the validation set to get a baseline score.
So that'll tell us how well GPT-2 already understands D&D dialogue. And that's the number
we need to beat by doing the Autoresearch run, and adjusting the parameters to kind of optimize
it. And then we're going to fine tune it with LoRA, attach small adapter layers to GPT-2
and train them specifically on the D&D data, measuring both the D&D score and grammar
preservation. So that's the dual metric we were talking about, we want something that is both
grammatical as English, but also has the
format and the content that we want. So we're going to continue with this now.
Okay, so we tokenized the data and got 13.3 million training tokens.
Now we're going to measure the GPT-2 baseline on the validation set, and then adapt the train
LoRA Python script. So now it's going to adapt this to load the D&D data,
instead of the failed One Piece one that I was trying to work with. And after that,
it's going to measure the baseline, which will give us the number that we need to beat
as we do the Autoresearch loop on it. Okay, so we ran the baseline. And what we basically did
was load the GPT-2 model that already knows English and attach the LoRA adapters. And these
are the small trainable layers that we're going to use for fine tuning. But it's only 0.65%
of the model; the vast majority is frozen, and not going to be touched. And that's the main GPT-2
model, because we want proper English grammar, spelling, coherence. So then
we loaded the tokenized D&D data, which is what we made earlier. That is our training set.
We loaded the grammar validation set. But we need to fix that. But that's going to be used for the
metric. And then we measured the baselines. So we measured it on both validation sets, the D&D
BPB, how well GPT-2 already understands D&D dialogue, and then the grammar BPB, how well it
handles general English. And we trained it for five minutes,
feeding the D&D dialogue through the model, adjusting only the small LoRA adapter weights
to minimize prediction errors. So it saw around 500,000 tokens. And then we ran the evaluation,
measuring with the dual metric that I talked about earlier.
The combined score was 2.091, which was much worse than the original baseline.
So there was a
bug in the original training script that Claude wrote. The model was being
trained on misaligned targets. So every prediction was being compared to the token two positions
ahead. So that's why the first couple runs, I should say, we were seeing
a lot of issues; the performance was actually getting a lot worse than the baseline. But it was
just because there was a bug in the training script. So something to keep in mind if you're
going to do this: check for that stuff
as well. But we fixed it. And the next run should improve.
The fix worked after the first run of training. We beat the baseline in both the D&D
and grammar. So no grammar penalty; the combined score is down from 1.132 in the baseline to 1.05.
So the LoRA fine tuning made the model better at the D&D dialogue while also slightly improving general
grammar. So there was zero grammar penalty. So that's exactly what we wanted to see. The
previous runs were all broken because of the double-shifted labels bug. And that was the
bug I was talking about before. And here are our first little samples here. This is after
the first fine tuning run. Let's see, I want to hear what you guys are doing. You're right,
the whole place is pretty busy because of that. Going so well. So do we know where they
go tomorrow? We'll be back in a couple days. Look at this list. These are pretty good.
Everything is grammatical. It's not perfect, obviously, but the formatting is correct.
The sentences themselves are pretty good. There is some D&D content in here. We won't
lose out unless they have an extra attack or something. So this one will probably just
end up being in your own hand right now anyway. So yeah, pretty good starting point. So now
we're really going to get into
the Autoresearch loop and try to optimize the hyperparameters further. So since we already
did the first loop, we don't need to do the full 30 minutes again; we're going to go back
to five minutes per experiment. So the things we're going to experiment with
are the learning rate (the current one worked, but it may not be optimal), the LoRA target
modules, trying different layers, and the LoRA rank. So as we go through this, I'll explain
in more detail.
So now that we have a working baseline, here is the plan: we're going to do nine experiments,
five minutes each, 45 minutes total. We only needed the 30 minutes on the first run
there, just to get a baseline and make sure everything was working. So phase one, we're going to
do the learning rate sweep. Phase two, target modules, and then phase three, LoRA rank. So
what do these things actually mean? The learning rate is how big of a step the model takes
when updating its weights each iteration. So think of it like a speed dial on learning.
If it's too low, the model barely changes in five minutes; it doesn't learn enough.
If it's too high, the model changes too aggressively, overshoots, and produces garbage.
And so we're testing four values to try to find the sweet spot in the learning rate there.
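The "speed dial" idea can be seen on a toy problem. This is a minimal sketch, not the actual training run: it uses plain gradient descent on f(w) = w^2, and the three learning rates are made-up illustrative values, not the ones tested in the video.

```python
def final_weight(learning_rate, steps=20, start=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final weight."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w          # derivative of w**2
        w -= learning_rate * gradient
    return w

# Illustrative values only: too low, reasonable, too high.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final w = {final_weight(lr):.4f}")
```

With the smallest value the weight barely moves from its start in 20 steps, with the middle value it gets close to the minimum at 0, and with the largest value every step overshoots and the weight grows instead of shrinking.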
And then in the second phase, we're doing target modules.
And this is which parts of GPT-2's brain we attach the LoRA adapters to.
GPT-2 has several layer types inside each transformer block.
This is a little bit technical, but simply put, there are these different layer types.
First, the attention layer, where the model decides which words to focus on when predicting the
next one.
This is the most important layer for learning style. Then the output projection after attention,
where the model combines what it focused on.
Adding this gives LoRA more places to make adjustments.
And then finally, the feed-forward layer, where the model does its deeper reasoning
and processing.
So adding this means
we're adapting almost everything.
So the more modules you target, the more capacity to learn, but there's also more risk of disrupting
the base model, GPT-2.
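To get a feel for how much each choice of target modules adds, here is a minimal sketch that counts the trainable LoRA parameters for GPT-2 small (hidden size 768, 12 transformer blocks). The layer names follow the Hugging Face GPT-2 implementation; mapping them onto the three layer types described here is an assumption, as is the rank of 8 used in the example.

```python
# Linear layers inside one GPT-2 small transformer block: (inputs, outputs).
# ASSUMPTION: these Hugging Face names correspond to the video's three layer types.
GPT2_SMALL_LAYERS = {
    "attn.c_attn": (768, 2304),  # fused query/key/value projection (attention)
    "attn.c_proj": (768, 768),   # output projection after attention
    "mlp.c_fc": (768, 3072),     # feed-forward expansion
}
NUM_BLOCKS = 12

def lora_params(targets, rank):
    """Each adapted layer adds two small matrices: A (in x rank) and B (rank x out)."""
    per_block = sum(
        rank * (GPT2_SMALL_LAYERS[name][0] + GPT2_SMALL_LAYERS[name][1])
        for name in targets
    )
    return per_block * NUM_BLOCKS

CONFIGS = {
    "attention only": ["attn.c_attn"],
    "attention + projection": ["attn.c_attn", "attn.c_proj"],
    "all three": ["attn.c_attn", "attn.c_proj", "mlp.c_fc"],
}

for label, targets in CONFIGS.items():
    print(f"{label}: {lora_params(targets, rank=8):,} trainable parameters")
```

Even with all three layer types adapted, the adapters stay far smaller than the 124 million frozen base parameters.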
And then lastly, in phase three, we're checking the LoRA rank parameter.
So how wide each adapter is.
So think of it as the adapter's capacity. r=4 is very small; the adapters can only make
subtle adjustments. r=8, which is the current setting, has some moderate capacity.
r=16 can learn more complex patterns. r=32 has a lot of capacity, but you risk overfitting
on the training data instead of getting a generalized model. We're trying to find the sweet spot between
all of these parameters.
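Here is a minimal sketch of what the rank controls, written in plain Python on tiny matrices rather than on GPT-2 itself. It assumes the usual LoRA formulation: the frozen weight W is left alone and a low-rank update B(Ax), scaled by alpha / rank, is added on top, with B starting at zero. The function names and the `alpha` default are illustrative choices, not values from the video.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def make_lora(d_in, d_out, rank, seed=0):
    """Create adapter matrices: A is rank x d_in (small random), B is d_out x rank (zeros)."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b

def lora_forward(w, a, b, x, alpha=16.0):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    rank = len(a)
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + (alpha / rank) * d for y, d in zip(base, delta)]

def adapter_size(a, b):
    """Number of trainable values in the adapter; grows linearly with the rank."""
    return len(a) * len(a[0]) + len(b) * len(b[0])
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen base layer, and the rank sets how many directions the update is allowed to use once training begins.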
In short, we're asking three questions, how fast should it learn?
What parts of the model should we adapt?
And how much capacity should each of the adapters have?
So we're going to run these nine experiments, which should take around 45 minutes, to try
to find the sweet spot between all of these parameters.
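The three-phase plan can be sketched as a greedy sweep: vary one knob at a time, keep the winner, then move to the next knob. This is a minimal sketch under assumptions. `phased_sweep` and `toy_score` are hypothetical names, the grid values are illustrative rather than the exact ones from the video, and `toy_score` is a made-up stand-in for a real five-minute training run that returns a lower-is-better score.

```python
def phased_sweep(base_config, phases, run_experiment):
    """Greedy sweep: try each value for one knob, keep the best, move to the next knob."""
    best_config = dict(base_config)
    best_score = run_experiment(best_config)
    log = []
    for knob, values in phases:
        for value in values:
            trial = {**best_config, knob: value}
            score = run_experiment(trial)
            log.append((knob, value, score))
            if score < best_score:  # lower score is better
                best_config, best_score = trial, score
    return best_config, best_score, log

def toy_score(config):
    """Made-up scorer for demonstration only; NOT real training results."""
    return (
        abs(config["lr"] - 1e-3) * 100
        + (3 - config["n_modules"]) * 0.01
        + abs(config["rank"] - 16) * 0.001
    )

base = {"lr": 1e-4, "n_modules": 1, "rank": 8}
phases = [
    ("lr", [3e-5, 3e-4, 1e-3]),
    ("n_modules", [2, 3]),
    ("rank", [4, 16, 32]),
]
best_config, best_score, log = phased_sweep(base, phases, toy_score)
print(best_config, len(log), "trials beyond the baseline")
```

A greedy sweep like this is cheap, but it can miss interactions between knobs, which is one reason to run a follow-up round that combines the winners.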
So those nine experiments finished, and all of the scores were better than our baseline.
So that's pretty good.
The best we got was 1.04.
And overall, we found that learning rate mattered the most.
The learning rate that won, convincingly, was 10 times higher than our
conservative pick. We didn't have any grammar degradation across all the runs.
So that's LoRA doing its job well.
It froze the base model to preserve its really good grammar. Having more target
modules helped.
Higher rank helped too, but had some diminishing returns.
So 1e-3 was the highest learning rate that we tried.
The four values we tested were these four.
And this is actually the fastest and it scored the best.
So each jump is roughly three times faster learning.
So the pattern
is clear: a higher learning rate gave a better score, though we only had five minutes
of training.
A higher learning rate means that each step can make a bigger adjustment to the adapter
weights.
So the model learns more in a limited amount of time.
So I think we're going to do one more round from the learnings here.
So we're going to combine the winners and run a second round.
So basically, we're going to be doing seven experiments, testing even higher learning rates,
the best learning rate with more modules,
and the best learning rate with higher ranks.
So essentially, we're going to be asking: can we learn even faster without breaking?
And do the best learning rate and more module capacity compound into a better score?
So we're going to kick that off, do one more round, and see if we can get a better metric.
Okay, all seven experiments are now done.
And we got a best score again.
That was using all three modules with the optimized learning rate.
So it's a 9% improvement from the original baseline that we had.
So the key takeaway was that more target modules was the biggest win.
So adapting all three layer types with our higher learning rate dominated.
So the learning rate plateaued around what we had before, 1e-3.
Going much higher didn't really help.
And higher rank had diminishing returns.
And we had zero grammar degradation.
So the Autoresearch loop has paid off.
This was a slightly modified Autoresearch, but the concept is similar.
We did nine runs the first time and then we did seven.
So 16 rounds, 16 experiments all together.
Let me ask for a sample from the best configuration.
So let's see if our fine-tuned model produced any good samples.
So this should be
D&D-style dialogue, basically.
Okay, this is what we got.
The door creaks open.
You see this large black bearded man in a long robe that's almost exactly like the one Scott
was looking at.
Oh, no.
No, wait.
I mean, we're talking about what?
What is he wearing here?
Is something of interest to him or not?
Where are they now?
When these things came from?
Do they have robes as well?
Laughter.
So it's properly formatted
D&D dialogue.
You can see this one sample fell into, uh, some
kind of number loop.
It looks like maybe it had to do with the rolling of the dice.
It fell into that loop, but otherwise it looks pretty good.
As you enter a tab and your mind has changed to more mundane world, world.
You're now able to see the world down the path of what I've been doing for a while, really
pretty much.
The only thing that can be done about these things is a few things.
So.
You're now able to see through what is an expansive stone wall
with various small portals of some kind.
So we can't walk past walls anymore.
The windows are gone.
We have no way back yet.
So it's proper D&D-style dialogue,
if you've ever read the Critical Role D&D dialogues.
And this started from just a generic GPT-2 model,
and we fine-tuned it to get this far.
So a pretty good result, I would say.
And Claude agrees.
The model clearly has learned the D&D session style
with speaker labels, DM narration from Matt,
player banter and crosstalk, game mechanics,
character names, laughter, and grammatical English throughout.
It's not perfect.
Like I said, the roll initiative prompt devolved into numbers,
and some of the grammar is not exactly perfect.
But I think that was a
pretty good example of fine-tuning.
And we could keep running this,
but I think this video has gone on long enough.
I think you just kind of get the idea of how to fine-tune
and some of the concepts behind it.
If you're interested in this,
because fine-tuning is a pretty important concept
in model building and machine learning,
I'll probably do some other videos on this topic as well,
delving into different aspects.
But I thought this was a good example
of how you could use an Autoresearch-type loop
to run experiments and optimize a fine-tuning
of a model.
So I think it's a pretty good example.
But that's going to be it for today.
If you liked the video, please subscribe,
leave a comment, leave a like.
Let me know what other types of experiments you're running,
what you'd like to see from me,
and I'll see you in the next video.