Learn How to Make AI Models w/ ML: 1. PyTorch

2026-03-31 · 37m · Subtitles: —

01 Korean Translation · Korean

How to Build AI Models Yourself, Part 1: Getting Started with PyTorch

Source: https://www.youtube.com/watch?v=WyMIlq5zJZo · Uploaded: 2026-03-31 · Length: 37m · Channel: Onchain AI Garage

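Starting the Series

This video is the first of a planned 12-part series on machine learning engineering (ML Engineering). The premise is a little unusual: I am not an expert in this field, so I chose a "learn alongside you while I teach" approach. To find out which skills a machine learning engineer actually needs, I had an agent scrape a couple dozen job postings from Indeed and distill the skill sets and tools they have in common. The result is a 12-episode curriculum written in the language of industry rather than textbooks, and by the end of the series we should have a foundation that holds up in real work.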

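The Philosophy of the Series: Concepts, Not Code

This series is not a coding tutorial. AI tools such as Claude Code and Cursor already write most boilerplate code on their own, and frankly they write it better than the average software engineer. So there is no point in covering "how to write a for loop in Python." What really matters is the ability to decide what to build, debug why it broke, and judge which approach fits your problem. The areas AI cannot yet replace (systems understanding, debugging intuition, architecture judgment, and domain expertise) are the ones we will hold on to. To be an engineer rather than a typist, you need durable mental models that survive changes in tools and libraries.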

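Episode Structure and the 12-Part Roadmap

Each video has two parts. The first is the Knowledge part, roughly 5 to 10 minutes, which explains what a concept is, why it exists, and when to use it through analogies and diagrams. There is no code and no wall of math. The second is the Work part, which uses the same concept to build a real project, not a toy. The projects target domains of interest such as prediction markets, crypto, blockchain, sports, and trading; AI tools are used where needed, but the work is done by hand as much as possible. The projects build on one another, so the output of episode 5, for example, is reused in episode 6.

The 12 topics extracted from the job postings are: PyTorch, Transformers, Hugging Face, embeddings, fine-tuning, LoRA and QLoRA, DPO, evaluation, GPU optimization, distributed training, reinforcement learning (RL), and Triton kernels, which belong to the infrastructure side. The four principles are clear: concepts over code, real projects, tutorial style, and a focus on skills that are hard for AI to replace.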

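Episode 1: PyTorch, the Framework That Powers Modern Machine Learning

What Is PyTorch?

PyTorch is a framework that ships with prebuilt tools and functions for building and training neural networks. Think of it as sitting at the core of nearly every AI model and large language model (LLM) today. Most researchers use it, and it is increasingly used in production as well. Its main competitor was once Google's TensorFlow, but for now, at least, PyTorch has effectively won the ecosystem war. Because it is Python-based, it fits naturally into most data science workflows.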

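Core Concept 1: Tensors and GPUs

A tensor is a container of numbers arranged in a grid. A scalar is zero-dimensional: a single number. Think of it as the price of one trading card. A vector is a one-dimensional row, such as a list of prices for four cards, and a matrix is two-dimensional. For example, each row is one card and each column holds a different feature such as rarity, price, or set. Stack several of these matrices and you get a three-dimensional tensor, like a workbook in Excel or Google Sheets. All a neural network does, in the end, is multiply, add, and transform these tensors to produce other tensors.

Every tensor has three properties. Shape is the size along each dimension; dtype is the kind of number stored (memory use varies widely with precision, which is decisive for huge LLMs); and device says whether the tensor lives on the CPU or the GPU. Tensors that take part in the same operation must be on the same device.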
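The three properties can be inspected directly on any tensor. The following is a minimal sketch, assuming PyTorch is installed; the card values are made up for illustration and are not taken from the video's dataset.

```python
import torch

# Hypothetical toy data: 3 cards x 4 features (values are made up).
features = torch.tensor(
    [[1.0, 0.0, 120.0, 3.0],
     [0.0, 1.0, 60.0, 1.0],
     [1.0, 1.0, 200.0, 5.0]],
    dtype=torch.float32,
)

print(features.shape)   # size along each dimension
print(features.dtype)   # kind of number stored
print(features.device)  # where the tensor lives

# float16 stores each element in 2 bytes instead of 4, halving memory use.
half = features.to(torch.float16)
bytes_fp32 = features.element_size() * features.numel()
bytes_fp16 = half.element_size() * half.numel()

# Use the GPU only if one is actually available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
weights = torch.ones(4, 1, device=device)
row_sums = features.to(device) @ weights  # both operands on the same device
```

Moving `features` with `.to(device)` before the matrix multiply is what keeps both operands on the same device; mixing a CPU tensor with a GPU tensor in one operation raises an error.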

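A CPU has a few powerful cores that handle tasks one at a time, which makes it strong at complex branching logic, whereas a GPU has thousands of weaker cores working simultaneously, which makes it far faster at math on huge arrays of numbers. A training job that takes 24 hours on a CPU may finish in about 30 minutes on a good GPU. That is why talk of machine learning always comes with the question "how much VRAM?": the more GPU memory you have, the larger the batches and models you can load.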

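Core Concept 2: Autograd

A neural network is made up of a mass of numbers called weights, all of which start out random. Training is the process of nudging these weights little by little to improve prediction accuracy. The tool for this is the gradient. A gradient is a number that tells you "if I raise this weight slightly, how much does the error change?" If it is positive, the weight should go down; if negative, it should go up. A large absolute value means the weight has a large effect on the error.

Autograd (short for "automatic gradients") is the PyTorch system that computes these gradients automatically. It records tensor operations internally as a graph, and when you call backward() it walks the graph in reverse, applying the chain rule, to obtain the gradient of every weight in one pass. You never have to derive the math yourself. The gradient itself points in the direction in which the error grows fastest (uphill), so to reduce the error you step the opposite way. The size of each step is set by the learning rate. Too large and you overshoot the optimum and bounce around; too small and you may never arrive. The learning rate is probably the single most important number in training.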
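The effect of the learning rate can be seen without any framework. The sketch below is plain Python on a made-up one-weight error function, error(w) = (w - 3)^2, chosen only because its gradient 2(w - 3) is easy to write by hand; the function name `descend` and the three learning rates are illustrative assumptions.

```python
def descend(lr, steps=50, w=0.0):
    """Run plain gradient descent on error(w) = (w - 3)^2."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # slope of the error at the current w
        w = w - lr * grad       # step against the gradient (downhill)
    return w

good = descend(lr=0.1)         # settles very close to the minimum at w = 3
too_small = descend(lr=0.001)  # has barely left the starting point after 50 steps
too_large = descend(lr=1.1)    # overshoots further on every step and diverges
```

With `lr=1.1` each step multiplies the distance from the minimum by -1.2, which is the "bouncing" the text describes taken to its extreme.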

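Core Concept 3: The Training Loop

The training loop is the heartbeat of all machine learning training. Everything from the tiniest demo to a giant model like GPT-4 repeats these four steps.

Forward pass: feed one batch of training data into the model and pass it through each layer to get predictions. At this point they are little more than random guesses.
Loss calculation: compare the predictions with the correct answers and express how wrong they are as a single number. This project uses mean squared error (MSE). The lower the loss, the more accurate the model.
Backward pass: calling backward() on the loss value has autograd compute the gradient for every weight.
Optimizer step: the optimizer uses those gradients to actually nudge the weights. The learning rate sets the size of this "step."

You repeat this thousands or tens of thousands of times, until the loss is small enough. There is one important idiom: after every iteration you must call zero_grad() to reset the previous gradients. PyTorch accumulates gradients by default, so if you do not reset them the old values keep piling up and the updates become garbage. This is the trap beginners fall into most often.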
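The four steps plus the zero_grad() idiom might look like the following in code. This is a sketch under stated assumptions, not the video's actual script: the data is random, the target (the sum of the features) is invented so that a single linear layer can fit it, and the learning rate and step count are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy data: 64 samples, 18 features, target = sum of the features.
x = torch.randn(64, 18)
y = x.sum(dim=1, keepdim=True)

model = nn.Linear(18, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

losses = []
for step in range(200):
    pred = model(x)           # 1. forward pass
    loss = loss_fn(pred, y)   # 2. loss calculation (MSE)
    loss.backward()           # 3. backward pass: autograd fills each .grad
    optimizer.step()          # 4. optimizer step: nudge the weights
    optimizer.zero_grad()     # clear gradients so they do not accumulate
    losses.append(loss.item())
```

Plotting or printing `losses` should show the value falling as the loop repeats, which is the whole point of the cycle.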

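Core Concept 4: `nn.Module`

nn.Module is a Python class provided by PyTorch, the universal building block that every neural network component inherits from. To build your own model, you write a class that inherits from nn.Module and define two things. One is the constructor (__init__), which declares the layers and components the model has. Each layer holds learnable weights, and nn.Module tracks all of them for you. The other is the forward method, which defines which operations run, and in what order, as data passes through the component.

The example model in this project takes 18 card features as input, expands them to 64 internal values, and passes them through an activation function such as F.relu. ReLU sets negative values to 0, which lets the network learn complex nonlinear patterns instead of only straight-line relationships. The final layer then collapses the output to a single number: the predicted price.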
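An 18 → 64 → 1 model of the kind described above could be written as follows. The class name `PricePredictor` and the attribute names are hypothetical; only the layer sizes and the use of F.relu come from the text.

```python
import torch
from torch import nn
import torch.nn.functional as F


class PricePredictor(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        # Declare the layers; nn.Module registers their weights automatically.
        self.hidden = nn.Linear(18, 64)
        self.out = nn.Linear(64, 1)

    def forward(self, x):
        # Define the order of operations as data flows through the model.
        x = F.relu(self.hidden(x))  # negative values become 0
        return self.out(x)          # one number per card: the predicted price


model = PricePredictor()
batch = torch.randn(5, 18)  # 5 made-up cards, 18 features each
prediction = model(batch)
n_params = sum(p.numel() for p in model.parameters())
```

Because the layers are assigned as attributes in `__init__`, `model.parameters()` finds every weight and bias without any bookkeeping in user code.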

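Core Concept 5: The DataLoader

You never feed in all the data at once, because real-world datasets easily exceed GPU memory. Instead you feed it in small chunks called batches. PyTorch's DataLoader does this automatically: it splits the dataset into batches, shuffles the order every epoch so the model cannot memorize the ordering, and can load in parallel so the GPU is not left idle.

Batch size is a trade-off. A large batch uses more GPU memory, but the direction of each weight update is more stable and the GPU is used more efficiently. A small batch adds a little randomness, which helps the model pick up patterns that generalize beyond the training data. The usual choice is the largest value that fits in GPU memory, though this depends on the dataset.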
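A short sketch of batching with a DataLoader, using random stand-in tensors of the same size as the project's dataset (2,142 rows, 18 features) and the batch size of 64 mentioned later in the text. Wrapping the tensors in a `TensorDataset` is an assumption about how the data might be packaged.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-ins with the same shapes as the project's data.
features = torch.randn(2142, 18)
prices = torch.rand(2142)

dataset = TensorDataset(features, prices)
loader = DataLoader(dataset, batch_size=64, shuffle=True)  # reshuffled every epoch

# One pass over the loader is one epoch.
batch_sizes = [xb.shape[0] for xb, yb in loader]
```

2,142 is not a multiple of 64, so the loader yields 33 full batches and a final batch of 30. Parallel loading is controlled by the `num_workers` argument, which is left at its default of 0 here.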

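Five-Point Summary

A tensor is a container of numbers arranged in a grid.
Autograd computes gradients automatically.
The training loop repeats four steps: forward pass → loss → backward pass → optimizer step.
nn.Module is the template for every model component; it holds the weights and defines the forward pass.
The DataLoader splits data into batches, shuffles it, and feeds it efficiently.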

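The Work Part: Building a Pokémon Card Price Predictor

This project builds a Pokémon card price prediction model in PyTorch. The data is real and comes from the TCGplayer API. Each card carries features such as rarity, type, HP, set, and card variant (holographic or not), and the goal is to predict the market price in dollars from those features. Environment setup, the API calls, and CSV formatting were delegated to Claude Code (Opus 4.6), while the actual PyTorch part was written by hand as far as possible. The starting point was 2,142 cards with 18 features per card.

Steps 1 and 2: Converting to Tensors and Defining the Model

torch.tensor converts both the feature array and the price array to float32 tensors. Printing the shapes gives (2142, 18) for the features and (2142,) for the prices; the tensors were moved from the CPU to the GPU and then saved as card_tensors.pt. Next comes a multi-layer perceptron (MLP) that inherits from nn.Module, with an 18 → 128 → 64 → 32 → 1 structure. Why "widen" 18 features to 128? So that 128 neurons can look at different combinations of the features at the same time and explore as many patterns as possible. The layers narrow as they go down, compressing the discovered patterns into a strong signal and finally converging to a single price. ReLU is applied between layers, but the last layer outputs its value unchanged. The total parameter count was 12,801.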
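The parameter count can be checked by hand. Assuming each layer is a standard fully connected layer with a bias term (which the reported total is consistent with), a layer from n_in to n_out values holds n_in × n_out weights plus n_out biases:

```python
layer_sizes = [18, 128, 64, 32, 1]

# weights (n_in * n_out) plus biases (n_out) for each consecutive pair of sizes
per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer)  # [2432, 8256, 2080, 33]
print(total)      # 12801
```

Most of the parameters sit in the 128 → 64 layer, not in the first "widening" layer.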

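Steps 3 and 4: One Pass Through the Training Loop, Then Full Training

With a DataLoader producing batches of 64, running a single step of the training loop shows gradients that did not exist before training appear after the backward pass, and the optimizer moves the weights only very slightly. The data was then split 70/15/15 into training, validation, and test sets and trained for 100 epochs. Validation measures loss on cards the model never saw during training; it is the device that separates "did it learn real patterns, or is it memorizing the training data?" Ideally the training loss and the validation loss fall together. If only the training loss keeps falling while the validation loss stalls, the model is overfitting. In the actual run, training loss fell from 0.38 at epoch 10 to 0.16, while validation loss fell only from 0.89 to 0.85.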
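A 70/15/15 split of 2,142 cards can be sketched in plain Python. The video does not say how the sizes were rounded or which seed was used, so the rounding rule (round down, remainder to the test set) and the seed below are assumptions.

```python
import random

n_cards = 2142  # dataset size reported in the video

# Assumption: round train and validation sizes down; the remainder goes to test.
n_train = int(n_cards * 0.70)
n_val = int(n_cards * 0.15)
n_test = n_cards - n_train - n_val

indices = list(range(n_cards))
random.Random(0).shuffle(indices)  # fixed seed so the split is repeatable

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
```

Shuffling before slicing matters: if the rows were sorted by set or by price, unshuffled slices would give the three sets very different distributions.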

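Step 5: Evaluation and Limitations

On the test set the mean absolute error (MAE) was $7.39 and the median absolute error was $0.08. Cheap cards were predicted very accurately, while a handful of expensive cards produced large errors that skewed the mean. The curves plotted with matplotlib show the validation loss wobbling midway, a hint of mild overfitting. The limitations are clear. Most of the dataset consists of low-priced cards, so the model never learned "what makes a card expensive," and this simple MLP treats the features more or less independently, so it fails to capture combination effects such as "Charizard + holographic + short print run."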
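How a single expensive card can pull the mean far away from the median is easy to reproduce. The ten error values below are invented purely to mimic the shape of the reported result; they are not the project's actual errors.

```python
from statistics import mean, median

# Hypothetical absolute errors in dollars: nine cheap cards, one expensive outlier.
abs_errors = [0.05, 0.06, 0.07, 0.08, 0.08, 0.08, 0.09, 0.10, 0.11, 73.00]

mae = mean(abs_errors)          # dominated by the single $73 miss
median_ae = median(abs_errors)  # describes the typical card

print(round(mae, 2), median_ae)  # 7.37 0.08
```

Reporting both numbers, as the video does, is what exposes the skew; either one alone would be misleading.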

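Wrapping Up

Getting a model to learn these "relationships between features" on its own calls for the Transformer, the subject of the next episode. The goal of this episode was to run through the PyTorch fundamentals (tensors, autograd, the training loop, modules, and the DataLoader) in one go and get a feel for them. Whatever tools and libraries change, having these five firmly in your head will be a major asset for understanding the advanced topics ahead.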

02 Research Document · Document

Starting Machine Learning with PyTorch: Why You Should Learn "Concepts" Before "Code"

Source video: YouTube · Channel: Onchain AI Garage · Uploaded: 2026-03-31 · Length: about 37 minutes

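Introduction: Becoming an "Engineer" No Longer Depends on Typing Speed

The hardest question for a machine learning beginner in 2026 is "what should I learn first?" On one side is the reality that AI coding tools such as Claude Code and Cursor pour out Python code in an instant; on the other are traditional tutorials that still insist "you only learn by typing every line yourself." Onchain AI Garage's new 12-part machine learning engineering (ML Engineering) series tackles this gap head-on. The topic of episode 1 is PyTorch. But the essence of the video is not "how to memorize the PyTorch API"; it is "how to build the mental models that let you judge what to ask an AI tool to do."

The creator used an agent to collect a couple dozen real Indeed job postings and built the curriculum from only the skills commonly required of working machine learning engineers. In other words, this series is a roadmap written in "the language of the job market," not a school textbook. And the fact that its first gate is PyTorch is not a passing fad; it reflects the balance of power in the machine learning ecosystem as of 2026.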

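Body 1: Why PyTorch? The Framework Landscape in 2026

For a long time the dichotomy "PyTorch for research, TensorFlow for production" held sway. According to Second Talent's 2026 comparison report, however, that split has effectively collapsed. About 85% of papers at the major AI conferences (NeurIPS, ICML, ICLR) use PyTorch, up from 80% at NeurIPS in 2023. On the production side, PyTorch's share had risen to 55% as of Q3 2025, erasing the traditional lead of TensorFlow (38%).

UpCloud's "Beyond PyTorch vs. TensorFlow 2026" analysis notes that the two frameworks are also converging functionally. With torch.compile(), PyTorch absorbed the graph-level optimization that was once TensorFlow's preserve, and Keras 3 supports PyTorch, TensorFlow, and JAX as backends, giving code portability. As a result many organizations adopt a hybrid strategy of "PyTorch for research, TensorFlow or ONNX for deployment," but for a beginner the answer to "where do I start?" is clear: PyTorch. There is a reason the creator in the video says that PyTorch "has won the ecosystem war," at least for now.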

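Body 2: The Five Core Concepts (Tensors, Autograd, the Training Loop, nn.Module, and the DataLoader)

The Knowledge part of the video boils down to five concepts. First, a **tensor** is a container of numbers arranged in a grid. It is an n-dimensional structure that generalizes scalars (0-D), vectors (1-D), and matrices (2-D), and it is the basic unit of every operation a neural network performs. Every tensor has three properties: shape, data type (dtype), and device. Device matters because, for this kind of workload, a GPU can be tens to hundreds of times faster than a CPU. A CPU has a few powerful cores that handle tasks sequentially, whereas a GPU has thousands of weaker cores running array math simultaneously.

Second, **autograd** is the real engine of model training. As both the official PyTorch tutorials and GeeksforGeeks' explanation of autograd stress, autograd automatically records the graph of tensor operations and, when loss.backward() is called, walks it in reverse using the chain rule to compute the gradient of every weight in one pass. You never have to derive the math by hand, and that alone dramatically lowers PyTorch's barrier to entry.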
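A one-weight example makes the mechanism concrete. The numbers below are arbitrary; the point is only that the gradient autograd returns matches the one obtained from the chain rule by hand.

```python
import torch

x = torch.tensor(2.0)        # input (fixed)
target = torch.tensor(10.0)  # correct answer (fixed)
w = torch.tensor(3.0, requires_grad=True)  # the weight being learned

error = (w * x - target) ** 2  # (3*2 - 10)^2 = 16
error.backward()               # autograd walks the recorded graph in reverse

# Chain rule by hand: d(error)/dw = 2 * (w*x - target) * x
manual = 2 * (w.item() * x.item() - target.item()) * x.item()

print(w.grad.item(), manual)  # -16.0 -16.0
```

The gradient is negative, so raising `w` would lower the error, which matches the sign rule described in the Korean translation section above.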

Third, the **training loop** is a repetition of four steps: forward pass → loss calculation → backward pass → optimizer step. As apxml's "Anatomy of a Training Loop" points out, the loop runs as a nested structure, an outer loop over epochs and an inner loop over batches, and it is crucial to call optimizer.zero_grad() at the start of each batch to reset the previous gradients, because PyTorch accumulates gradients by default. Leave out this one line and training silently breaks; it is the trap beginners hit most often.
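The accumulation behaviour is easy to observe on a single tensor. This sketch calls `.grad.zero_()` directly on the tensor, which is roughly what optimizer.zero_grad() arranges for every parameter it manages; the function 3·w is arbitrary.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3.0 * w).backward()
first = w.grad.item()        # 3.0: gradient of 3*w with respect to w

(3.0 * w).backward()         # second backward pass, no zeroing in between
accumulated = w.grad.item()  # 6.0: the new gradient was added to the old one

w.grad.zero_()               # reset, as zero_grad() would
(3.0 * w).backward()
after_reset = w.grad.item()  # 3.0 again
```

Nothing raises an error in the middle case, which is why the text calls the failure "silent": the update is simply computed from a gradient twice as large as intended.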

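Fourth, **nn.Module** is the base class of every neural network component. Declare the layers in __init__ and define the data flow in forward, and weight tracking, device moves, and saving are all handled for you. Fifth, the **DataLoader** splits a large dataset into batches that fit in GPU memory, shuffles them, and feeds them in parallel. Batch size is a classic trade-off among memory, convergence stability, and generalization.

These five concepts are not independent pieces of knowledge. They should be understood as one organic system: "tensors are the data, autograd is the gradients, the training loop is the algorithm, nn.Module is the structure, and the DataLoader is the supply pipe." That is also why Sebastian Raschka's "PyTorch in One Hour" weaves these five threads into a single flow.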

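Body 3: The Pokémon Card Price Predictor and the Power of "a Project, Not a Toy"

The Work part of the video collects data on 2,142 Pokémon cards with 18 features (rarity, type, HP, set, holographic or not, and so on) from the TCGplayer API and builds a price prediction model. The architecture is a simple multi-layer perceptron (MLP), 18 → 128 → 64 → 32 → 1, but for a beginner the question it raises matters more: "Why widen abruptly from 18 to 128 and then narrow again?" The creator's answer is crisp. At first you spread out a wide set of candidate patterns for the model to discover, and as the layers descend you distill only the useful signal. The general architectural philosophy of "explore → compress" is contained in this simple example.

The training result is instructive. After 100 epochs the median absolute error was $0.08 and the mean absolute error was $7.39. On the numbers alone it looks like a success, but the creator notes that "most of the cards were cheap (under a dollar), so the model never learned 'what makes an expensive card expensive.'" The fact that the data distribution sets the model's limits, and the video's point that a simple MLP looks at features independently and so misses the synergy of "Charizard + holographic + short print run," lead naturally to the reason for the next episode (the Transformer). It is a lesson that only surfaces when you run on real data rather than a toy.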

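Body 4: A Learning Strategy for the AI Coding Era: Focus on "What AI Cannot Do"

The sharpest message in the video concerns learning strategy. The creator says, "AI tools already write boilerplate code better than most software engineers. The problem is 'deciding what to build, debugging why it broke, and judging which approach is right.'" This is not an empty declaration; it matches market signals. As PyTorch vs TensorFlow 2026 comparisons point out, PyTorch and TensorFlow are converging functionally at the framework level. Beginners have less and less reason to obsess over the spelling of a particular API. Instead, the relative value of systems understanding, debugging intuition, architecture decisions, and domain expertise (that is, "what makes an engineer rather than a typist") rises.

Seen this way, the structure of the series becomes persuasive too. Each episode is split into a 5-to-10-minute "Knowledge part" (no math, no code) and a hands-on project. The flow is to build the mental model that anchors the concept first and then prove it with a project on top. It is a way to accumulate knowledge that survives even if the tools are upended once again.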

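Key Insights

PyTorch is the de facto standard as of 2026: a share that has reached 85% in research and 55% in production suggests it is not just "a framework worth learning" but "the framework to learn first."
Five concepts are all you need: tensors, autograd, the training loop, nn.Module, and the DataLoader. Once these five are connected in your head, advanced topics such as Transformers, LoRA, and DPO can be layered on the same skeleton.
Forget zero_grad() and training silently breaks: the fact that PyTorch accumulates gradients is not a technical detail but the starting point of debugging intuition.
The data distribution sets the model's ceiling: the gap between the $0.08 median error and the $7.39 mean error in the Pokémon card example shows that loss numbers should not be taken at face value.
Learning in the AI era should invest in "what AI cannot do": it is rational to spend time on systems, debugging, architecture, and domain expertise rather than API syntax.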

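03 Pro/Con Debate · Debate

Debate: "In the age of AI coding tools, should machine learning beginners learn 'concepts' before 'code'?"

Motion: Is the curriculum philosophy proposed by Onchain AI Garage, "learn concepts, debugging, and systems understanding first instead of typing code," the best strategy for an ML beginner in 2026?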

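Round 1

🟢 Pro: "Concept-first learning is the only rational starting point in the AI era"

Time spent memorizing PyTorch API syntax now earns almost no compound interest. Tools like Claude Code and Cursor already produce boilerplate faster and more accurately than the average software engineer. As the video points out, the competitive edge humans need is "the ability to decide what to build, diagnose why it broke, and judge which approach fits the problem." That comes from an eye that can read a tensor's shape, device, and dtype; an intuition for why autograd requires zero_grad(); and the discernment to look at a graph where training and validation loss diverge and call out overfitting. These abilities are not developed by typing code.

In particular, PyTorch's five core concepts (tensors, autograd, the training loop, nn.Module, and the DataLoader) form the skeleton of later advanced topics such as Transformers, LoRA, DPO, and distributed training. If the skeleton is shaky, no amount of code stacked on top will be anything but undebuggable spaghetti. Moreover, recent research warns that over-reliance on AI tools "degrades conceptual understanding, code reading, and debugging ability." You have to grasp the concepts first in order to direct AI tools.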

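🔴 Con: "Conceptual understanding gained without writing code yourself is an illusion"

The "concepts first" argument sounds plausible, but in practice concepts are internalized only when you run into them while typing the code yourself. The moment you create a torch.tensor and its shape prints differently from what you expected, the moment you leave out requires_grad=True and the gradient comes back as None, the moment you forget optimizer.zero_grad() and the loss diverges: all of these "details of failure" are what put flesh on a concept. The "5-to-10-minute analogies and diagrams" the video promises are a serviceable starting point, but on their own they do not make a mental model stick.

More fundamentally, the premise that "AI will write the code" legitimizes dependence. Beginners start using AI without the ability to judge whether the code it generates is right or wrong, which amounts to stacking one black box on another. Market signals point the other way as well. BrainStation's 2026 guide still states that "basic programming ability is required to implement machine learning effectively." The role may be shifting "from typist to editor," but that does not mean typing ability itself can disappear.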

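Round 2

🟢 Pro (rebuttal): Answering Con's criticism

Con's first claim, that "concepts are internalized only by typing code," is partly right but overstated. What the video proposes is not "never write code" but "do not make typing code the central axis of learning." In the Work part, in fact, the creator writes PyTorch code by hand, prints tensor shapes in the terminal, and checks how weight values change before and after a training step. This work simply happens after a coordinate system of concepts has been set up. The "details of failure" Con describes have meaning only on that coordinate system. A None gradient met without one is just confusion.

Con's second claim, that "beginners cannot judge whether AI code is right or wrong," actually strengthens Pro's case, because the ability to judge is "conceptual understanding." Judgment does not come from repeated typing. It comes from understanding why nn.Module tracks weights automatically and why the DataLoader shuffles batches. The "basic programming ability" BrainStation refers to is, in the 2026 context, less "the ability to type syntax" and more "the ability to read code and grasp its structure." The video's approach targets exactly this point.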

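🔴 Con (counter-rebuttal): Turning Pro's logic around

Pro's first claim, that "a 5-to-10-minute Knowledge part sets up the coordinate system," is optimistic. The idea that a single 37-minute video can "drive" tensors, autograd, the training loop, nn.Module, and the DataLoader "into your head" is unrealistic from a cognitive science standpoint. Learning research says that repeated exposure and active retrieval, that is, the experience of reconstructing code yourself, are decisive for retaining concepts. The video is a pleasant analysis tool but falls short as a learning tool. If anything, it risks planting the illusion that "I learned the concepts first."

Pro's second claim, that "judgment = conceptual understanding," is circular. Judgment cannot be gained from concepts alone, without experience. The "experience of printing weights before and after training" that Pro cites is exactly the "details of failure" I described. One video comes nowhere near supplying enough of it. Furthermore, the Pokémon card example showed an extreme distributional skew: an MAE of $7.39 against a median of $0.08. Even the creator admits that the model "did not learn what drives the price of expensive cards." This limitation stems not from a lack of theory but from a lack of data engineering experience, that is, a lack of "time spent hands-on." If "concepts first" were the answer, the creator should have anticipated this problem in advance.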

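Round 3

🟢 Pro: Rebuttal to Con's Round 2

Con's first point, that "one video cannot make concepts stick," misreads the structure of the series. The creator never said "one PyTorch episode and done." Across 12 episodes the same five concepts are reused in the context of Transformers, fine-tuning, LoRA, and distributed training. That is precisely the "repeated exposure and active retrieval" Con demands. Concepts turn into flesh not in a single video but within a spiral curriculum spanning 12 projects. Recall the creator's own statement that "the output of episode 5 is reused in episode 6."

Con's second point, that "the distributional skew in the Pokémon example is a failure of the concept-first approach," is actually evidence that conceptual understanding worked. The creator's diagnosis was precise: "the data was skewed toward cheap cards, so the model could not learn what drives high prices, and because features are processed independently it cannot capture combination effects." A beginner who takes loss numbers at face value could never make this diagnosis. It is possible only with the concepts in place, namely "a model learns only the distribution of the data it has seen" and "an MLP processes features independently." And this diagnosis leads directly to the need for the next episode (the Transformer). The "experience of details" Con demands and concept learning are not opposed. Details become signal when the concepts are there.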

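🔴 Con: Counter-rebuttal to Pro's Round 2

Pro's first point, that "the 12-part series provides spiral repetition," is a defense built on a future promise. Judged on episode 1 alone, Pro's argument does not hold. Besides, as the series accumulates, the conceptual density of each episode is diluted. Can the Transformer episode really leave room to "actively reconstruct" the PyTorch fundamentals? In most cases the next episode is busy adding new concepts and treats the earlier ones as a black box. A spiral curriculum is a theoretical ideal that is tricky to implement in real instructional design.

Pro's second point, that "the creator's diagnosis is evidence of conceptual understanding," conflates the creator with the beginner. The creator already has experience in prediction markets and trading. His analysis is more likely "the result of existing domain experience" than "the result of concept-first learning." Whether a beginner could reach the same diagnosis after watching the same video is an entirely different matter. And as Towards Data Science's analysis of AI coding experience and Towards AI's "How I'd Learn ML in 2026" both stress, effective ML learning hinges not on a "concepts versus code" dichotomy but on the volume of small projects and an immediate feedback loop. The video's philosophy emphasizes only the former of these two axes and replaces the latter with "watching." Watching is no substitute for doing.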

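🧭 Synthesis

Points of Agreement

The two sides agree on a surprising amount. First, in machine learning education in 2026, "memorizing code syntax" is no longer the top priority. Second, core concepts such as tensors, autograd, the training loop, nn.Module, and the DataLoader must be mastered. Third, AI coding tools are a "new partner" that can be neither ignored nor handed everything. Fourth, concepts and practice must ultimately be combined. The difference lies in "order" and "weight." Pro holds that concepts should establish the coordinate system first, with practice layered on top; Con holds that the coordinate system itself can only be built through practice.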

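Open Questions

How effective can a spiral curriculum really be at episode 1? The question can only be answered once the series is complete.
How do we distinguish "conceptual understanding" from "domain experience"? Which of the two the creator's diagnosis came from is hard to verify from outside.
With AI coding tools improving every month, the very boundary of "what AI cannot do" is moving. Will today's "judgment" still be valid next year?
What is the lower bound on the "number of projects" a beginner needs? Is 12 episodes × 1 project = 12 in total enough?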

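A Broader View

The most productive frame is not "concepts versus code" but "the speed of the closed loop." What determines real skill is how quickly and how many times a learner can run one cycle: form a hypothesis → execute → interpret the result and update the concept. The video's approach puts its weight on strengthening the "hypothesis" and "interpretation" stages of this loop. The traditional approach Con defends puts its weight on the volume of the "execution" stage. Both are different segments of the same loop, and the moment you remove either one the loop itself breaks. That is the structural truth this debate ultimately reveals.

The most rational strategy for a beginner is therefore not a binary choice but "adjusting the weights stage by stage." For the first few weeks, build the conceptual coordinate system quickly in the video's manner, but immediately run a small project alongside it (smaller than the video's Pokémon example, say a toy dataset with 3 to 5 features) to build execution sense as well. In this period you want to experience as many "small, cheap failures" as possible, the kind where a torch.tensor's shape goes wrong or zero_grad() is forgotten. After that, shift the weight toward scaling up projects and reusing the concepts repeatedly. If this is in effect what the creator is aiming for with the 12-part series, then the Pro/Con opposition may be two expressions of the same answer. At the time of episode 1, though, whether that promise can be kept remains an open question.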

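The second perspective worth noting is the "duality of learning goals." The outcomes a beginner pursues fall into two kinds. One is "the ability to write and debug a model yourself"; the other is "the ability to direct and supervise ML systems built by teams, agents, and external tools." Traditional education aimed only at the former. The curriculum the video proposes keeps the former to a minimum and elevates the latter to the core. The skills the 2026 job market actually demands are already shifting toward the latter. "The ability to direct a system" comes not from API syntax but from the power to judge "why did this model produce this number, at which stage did it break down, and what are the alternatives?" The five core concepts are the vocabulary of that judgment.

Finally, the video's message to "invest in what AI cannot do" is right in direction but must be continually recalibrated. The capability boundary of AI tools shifts every month. Today "systems understanding and debugging intuition" are safe ground from AI, but next year agents may automate much of the debugging loop. "Concepts first" should therefore be understood not as a fixed curriculum but as a meta-strategy that is continually updated. Internalizing that meta-strategy may itself be the real core skill of machine learning engineering from 2026 onward. In the end, the question a beginner should ask is not "which do I learn first, concepts or code?" but "how well does my learning loop mesh with the boundary of AI capability at this moment?" Anyone who can ask themselves that question every quarter will reach the same destination whichever side, Pro or Con, they take.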

04 English Original · Transcript

So in this video, I'm going to introduce this machine learning engineering video series
that I plan to do for 12 episodes, and then we're going to do the first episode as well,
which is on PyTorch. So I had this idea for this video series, structured a little bit different
from my other videos. But I wanted to learn more about machine learning, and what kind of skills
and knowledge is necessary to become a machine learning engineer. So I had my agent scrape
together a bunch of machine learning engineer job postings, a couple dozen of them from Indeed.
Each company obviously has their own requirements, but I had it distill and find the common
skill sets and knowledge and tool knowledge that is necessary for a modern machine learning
engineer, not based on theory or curriculums or academics, but on actual jobs: the actual
skills and knowledge that are expected of a machine learning engineer
and of the people who are designing AI models. And I formulated this curriculum with it,
which is going to span 12 episodes. And this is going to be me teaching it, but also me learning.
I've always felt that the best way to learn something is to actually teach it.
So that's what we're going to do, I'm going to be learning with you. And by the end of this,
we'll have a skill set for an actual professional machine learning engineer.
So this is my machine learning engineer video series, it's how to
learn to build machine learning systems, AI model training, basically, not by memorizing code,
but by understanding the concepts well enough to direct AI tools to do it for you.
So I should be clear, this is not a coding tutorial. AI tools like Claude, Codex, or Cursor
can write boilerplate code. What they can't do is decide what to build, debug why it's broken,
or know which approach fits your problem. I think a lot of people who are interested in coding and
programming are kind of feeling discouraged in that direction. Obviously, these tools now can
code pretty well, better than most professional engineers, software engineers. So the approach
we're going to take is not here's how to write a for loop in Python. It's more here's what you need
to understand to direct Claude to build machine learning systems. So we're going to focus on the
skills AI can't replace: systems understanding, debugging intuition, and domain expertise.
Every video is going to have two parts.
One is going to be the knowledge. And we'll see how well I can keep to this minute budget, but
hopefully five to 10 minutes of concept explanation and intuition. So what is this thing? Why does it
exist? Why would you use it? What are the mental models that make it click? There's not going to be
any coding in this section, just understanding, grasping the concepts and having a real firm
understanding of whatever domain we're learning about that day. So part two is going to be the
work. So we're going to use that knowledge to
build something real in each video, it won't be a toy example, it'll be an actual useful project
that applies the concepts, you can follow along yourself and see the process, it's not just going
to be the finished result. So this is where it becomes tangible. I will be using some AI tools
as appropriate to do the work. But I will try to do most of it as manual as possible so that we can
really see how everything works. So the format, the knowledge, the first section, what this covers,
is built to give you a durable mental model, something that stays useful even as tools and
libraries change. So for each concept, you're going to be looking at what it is,
why it exists, how it works, and when to use it. So obviously, AI and technology are advancing
incredibly fast, even since I started working in this area since last year,
everything is moving very fast. So we don't want our knowledge to actually be obsolete in a week.
What you won't see is walls of math notation, line-by-line
code walkthroughs, and theory disconnected from practice. What you will see is analogies
that make concepts stick, visuals and diagrams, real-world context for every idea,
and a "why should I care" for each concept. I'm going to try to do no knowledge sections without any
real connection to actual work. And the second part is the work. And that's what you'll build.
Every episode is a hands on project that applies the concept to a real problem,
not contrived exercises, but tools and systems you'd
actually use. I'm going to try to make these fun, and some areas that I'm interested in.
So for example, some of the project domains will work in with prediction markets,
crypto, blockchain, game, as well as other sort of topics you see me talk about here
on my video series, sports, developing agents, you know, any type of market or trading, or something
that could hopefully make us money at the end of the day, right. So the work sections will be
step by step showing the process, including my mistakes; there are sure to be a lot. So we're
going to use some AI tools to demonstrate and end up with a working model, tool, or system. So
the projects are going to build on each other. So some of the projects from, for instance, episode
five will get reused in six. So they will build on each other throughout.
So these are the 12 episode topics. Like I said, this was distilled from actual Indeed job postings,
finding the common skills and tools that you'd need to become a machine learning engineer.
So you can watch all of these or just specific areas that are of interest to you.
Just quickly, I'll run through them. We're going to start today with PyTorch,
then Transformers, Hugging Face, embeddings, fine-tuning, LoRA and QLoRA, DPO, evaluation,
GPU optimization, distributed training, reinforcement learning, and then lastly, Triton
kernels, which is more in the infrastructure area. Now, as we go, if there's a certain area
that we all seem more interested in, we can dive more into that one.
Starting out, this is the roadmap.
So the four principles, I talked about this before about concepts over code, working with real
projects tutorial style.
And then focus on AI-resistant skills. We focus on what AI tools can't give you. Like I said,
systems thinking,
debugging intuition, architecture decisions, and domain expertise, stuff that makes you an
engineer, not a typist. I know we're all concerned about AI taking all of our jobs.
So the real focus of this series is going to be trying to focus on areas and skills that can't
easily be taken over by AI. So this is for basically anyone, but developers who want to
become machine learning engineers, builders using AI coding tools. I know a lot of people are vibe
coding and using these tools, but this series should give you a better understanding of what
needs to be done. And people who learn by building, people who want to build their own
AI models, want to really understand what's going on and to help build for the future of AI.
It does assume you can at least understand some code. I'm not a major coder, but I can at least
understand a little. And you at least don't need machine learning explained from zero. You
understand the basics of what an AI model is. And you're willing to work on real projects. So that's it.
That's the introduction to the series. Let's get started with episode one, which is going to be on
PyTorch. So here we go. Episode one, the knowledge PyTorch, the framework that powers modern machine
learning. So what is PyTorch? PyTorch is a framework, a collection of prebuilt tools and
functions for building and training neural networks, programs that learn.
And it's at the core of AI models and LLMs. PyTorch is used by most researchers and increasingly in production
apps. The main alternative was TensorFlow by Google, but it seems PyTorch has won the ecosystem
war for now, at least. It's Python based, so it fits naturally into data science workflows because
most data scientists will use Python. So core concept number one is tensors. So this is a key
concept to understand. Tensors are numbers that are used to understand data. And they're used to
understand numbers in boxes. A tensor is just a container of numbers arranged in a grid. More
dimensions equal more axes in the grid. That's all. Sounds intimidating, but it's actually a fairly
easy concept, or simple at least. You'd see these all have names. I'm going to try not to be too
heavy on the vocab. But just quickly, scalar has zero dimensions, just the single number.
Imagine it like the price of one trading card. A vector is a list of numbers in a row.
Think of it like the price of four cards, but it's only one dimension. The matrix is two
dimensions. It's a shape. So imagine we're going to be using the example of trading cards because
the project's going to be related to trading cards, Pokemon cards in specific. But imagine
this matrix. Each line is for a card, and each column is for a different feature. It could be
the set. It could be the price. It could be anything. And then there are three dimensions.
The 3D tensor is a stack of these matrices. So it's like having
multiple spreadsheets in one workbook. And it's three dimensional. And this is critical because
neural networks do nothing but maths on these different tensors. Everything is just being
multiplied, added, or somehow transformed into another tensor through the data. And an easy way
is to think of it like a spreadsheet. A single cell is scalar. The row is a vector.
The sheet is a matrix. And then the workbook, if you've used Excel or Google Sheets, is like a whole
workbook with multiple sheets. So every tensor has three properties. One is shape, and that's the
size along each dimension, how many rows, columns, layers, you know. So this (3, 4) would mean
a table with three rows, and then four columns going down. dtype is the type of number being
stored. Is it whole numbers, decimals, decimals with fewer digits? And these are all, if you've
used Python or Rust or anything like that, you know these are different types of numbers. And
this is really important for saving memory, which with these huge LLMs is a very important
concept. Certain dtypes, certain types of numbers, use a lot less memory. And then device
is where the tensor lives in the computer, whether it's on your CPU, which is usually your main
processor, or on a graphics card, a GPU. And it's very important that all the
tensors are on the same device, if you're trying to work with them. Now why are GPUs faster?
If you've ever tried to run an LLM on your CPU versus your GPU, you'll have noticed this yourself.
So a CPU has a few very powerful cores that do tasks one by one. And each core can handle complex
branching logic quickly. It's great for running your operating system, web browser, things like that. But
a GPU can have over, like, 16,000 cores. Its individual cores are weaker than a CPU core, but they all work
simultaneously. And there are a lot more of them. And they're really great at math on huge arrays of
numbers, like tensors. So a training job that might take 24 hours on a CPU might take only 30
minutes on a good GPU. And that's why you hear so much talk about GPU memory VRAM, with machine
learning and running AI models and training them. The bigger GPU you have, the more memory you're
going to have, and the larger your batches and your models can be. So the second core concept
of PyTorch: autograd. And this is how models learn. A neural network is full of numbers called weights,
and it will start just with random weights, just random guesses. The goal of training is to adjust
those weights until the model can actually make accurate predictions. But it can't just randomly
change the weights and try to make random guesses. It has to use gradients. And a gradient is just a
number that tells you: if you increase the
weight slightly, how much does the error change? So if the gradient is positive, it means increasing
this weight would make the error actually worse. So you should decrease it. And if it's negative,
it means you should increase it. If the gradient is a very large number, whether positive or
negative, means that weight has a big effect on the error. And if it's near zero, that means the
weight barely matters right now. Autograd, short for automatic gradients (should make sense), is PyTorch's
system that computes all of these gradients for you automatically, no matter how complex your
network is. So PyTorch behind the scenes records every math operation you do on the tensors. It
builds a behind the scenes map of those operations. Then when you use this backward method, PyTorch
walks backwards through the graph, using the chain rule from calculus. And basically it
computes the gradient for every single weight in one pass. So you never have to derive or implement
gradient math yourself. It's all done through PyTorch, using this backward method. So another
little analogy with gradients is that gradients point uphill. Gradients tell you the direction
of the steepest increase in error. To reduce error, you need to go the opposite direction.
This is a little visualization. The gradient is telling you that the error is going this way.
So to reduce the error, you need to go in the opposite direction in order to reach your goal.
So this is where learning rate comes in. And learning rate just controls how big of a step
you take each time you adjust the weights. So if it's too large of a step, you might overshoot the
sweet spot and just bounce around. If it's too small, you inch forward little by little. And it
just takes forever to train. So you want to find the just-right area, where you can smoothly converge
towards the best weights. So learning rate is probably the single most important number during
training. Concept number three is the training loop.
Training loop, you've seen me do it in Claude Code in some of my auto research videos. But this is
kind of the heartbeat of all machine learning training. Every neural network from tiny demos to
GPT-4 trains using this four-step system. Step one is the forward pass. You just feed a batch of
training data into the model, the data flows through each layer, gets transformed, and then comes
out the other end as prediction. At this point, though, the model is just guessing based on the
current weights.
Step two is loss calculation. You compare the model's prediction to the actual correct number. And you've
probably seen loss if you've done some auto research experiments yourself. And very simple. It's the
single number measuring how wrong the model was in its predictions. Lower loss means obviously better
predictions. There are different ways to calculate loss will be using MSE mean squared error. But
depending on what you're trying to create, there's different approaches to that. Three is the backward
on the loss. This is where auto grad kicks in and computes the gradients of every weight in the model, telling you
exactly how each way contributed to the error, and which direction to adjust it. And then lastly, the optimizer step, the
optimizer takes those gradients and actually updates the weights. Each way it gets nudged in the direction that
reduces the lost by amount controlled by the learning rate. And then you repeat that you repeat that for 1000s of
iterations, until the loss is small enough.
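The four steps above can be sketched as a short loop. The toy data, the single linear layer, the SGD optimizer, and the learning rate are all assumptions for illustration; this is not the code shown in the video.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: 64 samples, 3 features, target is a fixed linear mix.
X = torch.randn(64, 3)
y = X @ torch.tensor([[2.0], [-1.0], [0.5]])

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()  # mean squared error
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    pred = model(X)            # 1. forward pass: current weights make a guess
    loss = loss_fn(pred, y)    # 2. loss: one number for how wrong the guess was
    loss.backward()            # 3. backward pass: autograd fills in the gradients
    optimizer.step()           # 4. optimizer step: nudge weights against the gradients
    optimizer.zero_grad()      # clear gradients so they do not carry into the next round
    losses.append(loss.item())

print(losses[0], losses[-1])
```

On this easy problem the loss drops close to zero; on real data you would stop when the loss is small enough for your purpose.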
Small enough for the model to work, that is. Those are the basics. Obviously, there's a lot more involved in more advanced models, but that's the basic
training loop. And this is the actual code, if you're interested. It's fairly simple. You do the forward pass, you calculate the
loss, you compute the gradients by calling backward on the loss, and then the optimizer updates the weights. And lastly, you call
zero_grad, which clears the old gradients so they don't pile up from the previous
round. This is actually kind of important. PyTorch accumulates gradients by default. If you don't zero them, each
round's gradients pile on top of the previous ones, and your updates just become garbage. This is a very common mistake for
beginners. Next is nn.Module. nn.Module is a Python class that comes with PyTorch. It's the base template that
every neural network component in PyTorch is built from. It's the universal building block.
So when you create your own model, you write a class that inherits from nn.Module and defines two main things. One is
__init__. This is the constructor, and it declares what layers or components
your model has. Each layer contains weights that are going to be trained, and nn.Module automatically keeps track of
all those weights for you. You don't need to do that manually. The other is forward, a method that you write that
defines what happens when data
passes through this component: which math operations to run, in which order. In this
little code example, it takes in 10 input numbers. These are, for example, the 10 card
features we want to test to see if we can predict the price of Pokemon cards. It then expands to 64
internal values. At that point, it uses F.relu, which is an activation function.
It zeroes out any negative values, which helps the network learn non-obvious patterns.
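To see the zeroing-out concretely, here is a tiny sketch of F.relu on a handful of made-up values (the input numbers are hypothetical).

```python
import torch
import torch.nn.functional as F

# Hypothetical internal values coming out of a layer.
values = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

# ReLU keeps positive values unchanged and replaces negatives with zero.
activated = F.relu(values)

print(activated.tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]
```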
Basically, it allows neural networks to learn complex patterns, rather than just straight-line
relationships. After that, it goes from the 64 down to one, which is the single price
prediction. That happens in the final layer there. A couple of other small things. Data loading:
you need to feed data in batches. You're rarely going to send an entire dataset through the
model at once. If you did, you'd run out of memory, because the datasets you're going to be working on are obviously
massive. So you need to split the data into small groups called batches. The DataLoader, which
is part of PyTorch, handles this automatically. It divides your data into small batches, shuffles the
order each time so that the model doesn't memorize the sequence, and loads the data
in parallel to keep the GPU busy. You've probably heard me talk about batch size in the
auto research videos. Batch size is important because larger batches need more GPU memory.
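A minimal sketch of batching with the DataLoader, assuming random stand-in data (1,000 rows with 10 features) and a batch size of 32. Parallel loading is controlled by the num_workers argument, which is left at its default here.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-in data: 1,000 rows, 10 features, one target per row.
features = torch.randn(1000, 10)
targets = torch.randn(1000, 1)

# TensorDataset pairs each row of features with its target.
dataset = TensorDataset(features, targets)

# shuffle=True reshuffles the row order every time the loader is iterated.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(len(loader))  # 32 batches: 31 full ones plus a final batch of 8 rows

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)
```

Raising batch_size means fewer, larger batches per pass over the data, and more memory needed per step.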
Larger batches give a more stable estimate of the right direction to adjust the weights,
which can improve gradient quality. Smaller batches add useful randomness that can help models learn
patterns that work on new data, not just training data. And larger batches use the GPU more
efficiently, up to a certain point. You've seen me in the auto research videos adjusting batch size
up and down to try to find the sweet spot. I usually try to pick the largest one that'll fit
in the GPU's memory, but it also depends on the dataset you're working with. So, five quick
takeaways. Tensors are containers of numbers arranged in grids. Autograd is the part
of PyTorch that automatically computes gradients. The training loop is the four steps on repeat.
nn.Module is the template for every model component. It holds the weights and defines
what happens when data flows through. And the DataLoader splits data into batches,
shuffles them, and then feeds them into the model efficiently.
These five things are really important. Basically, in every video and every lesson we're
going to do from here, we'll touch on them. So it's important to lock these things down. I hope my explanation was
helpful. I don't think they're too complicated on their own, but it's a lot of terminology to
handle when you're first getting into this stuff. Okay, so that was the end of the knowledge
section. Did I get it under 10 minutes? Probably not. But that's okay. Now we're going to move on to
the work. For the work section, we're going to be building a Pokemon card price predictor.
We're going to be using real data. This is from TCGplayer; they have an API available.
We're just going to pull data on a bunch of popular Pokemon cards. Each will have a certain
number of features: rarity, type, which collection, stuff like that. I am using
Claude Code just to set up the environment, and then I'm going to go a bit more manual for
the actual PyTorch part, just so we can see it in action.
I'm using Claude Code now with Opus 4.6 to set up the environment. It's going to install PyTorch,
which is kind of large. If you don't already have Python, you'll obviously need to install Python.
It's also pulling the data from the TCGplayer API and formatting it into a CSV.
Here, Claude is giving us an overview of the model we're going to build.
It's going to take in a lot of features, like rarity, type, HP, set, card variant, holographic or not,
stuff like that. And it's going to output a single predicted market price in dollars
based on these different features. We're going to be using PyTorch and all the skills
we learned in the knowledge section to build it. It obviously won't be perfect.
Card prices are driven by a lot of different factors. But for a simple project that
uses PyTorch, I thought it would be fun to do. This is the CSV Claude built for us from the API.
You can see it's pretty great data: all the different Pokemon names,
the sets they came from, set number, rarity, HP. There are a bunch of different features we've got
to work with here: when it was released, price variants, whether it's holo foil. This is actually
over 2,000 cards, and I think 25 or so features. So we should be able to
build a pretty decent model with this. Okay, so now we're going to work on the first step,
now that we have the data.
We're going to do this step by step. Usually all these steps could be put
together in one Python script, but for the sake of showing the different concepts that we learned,
I'll do it one step at a time. This is step one: we're going to turn the data that we got into
tensors. The front part here is just plumbing, properly formatting the columns.
But this is where you actually convert to tensors, right here, in these lines.
The feature tensor and y_tensor are each created with torch.tensor.
This is the conversion. This is where the
arrays become PyTorch tensors. We specify that the dtype that I talked about before is
going to be float32, which is the standard decimal precision. Then we're going to
inspect three properties that we talked about earlier: the shape,
the dtype, and the device. Then we're going to look at one of the cards' features.
And we're going to move it to the GPU in order to process it,
and that'll give us a tensor that we'll be able to see.
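A minimal sketch of this conversion step, assuming small random NumPy arrays in place of the real card data. The names X_tensor and the array sizes are hypothetical; only torch.tensor, float32, and the three inspected properties come from the walkthrough.

```python
import numpy as np
import torch

# Hypothetical stand-in arrays: 4 cards with 3 numeric features, one price each.
features = np.random.rand(4, 3)   # NumPy defaults to float64
prices = np.random.rand(4, 1)

# The conversion: arrays become PyTorch tensors, stored as float32.
X_tensor = torch.tensor(features, dtype=torch.float32)
y_tensor = torch.tensor(prices, dtype=torch.float32)

# The three properties worth inspecting on any tensor.
print(X_tensor.shape)    # rows by columns
print(X_tensor.dtype)    # torch.float32
print(X_tensor.device)   # cpu right after creation

# Move to the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X_device = X_tensor.to(device)
print(X_device.device)
```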
Okay, so here we are in a terminal. I just ran step one, the tensors .py script, and we converted the data. For the feature
shape, we have 2,142 cards with 18 features each. Then the price shape: there are 2,142 prices, one per card,
obviously. This shows the dtype, the float32 I was talking about, and that it was on the CPU at first.
And this is just one card's features, all 18 values. So the values were converted into these
numbers. Then it has the first card's price, which was actually 10 cents. And
then the script moved it to the GPU so that we could properly process it. If you remember
the shapes that I was talking about earlier, what we just did was convert the cards into this
matrix, basically this tensor. The matrix has 2,142 rows, one for each of the cards, and 18
columns, one per feature. You can see the end result of that is this card tensors
.pt file, which is what we're going to use in the next step to create the model. Okay, part
two. In this second script, we're going to be creating the model itself. You start
with this.
This should look familiar: we're using nn.Module from PyTorch, which we import at the top.
And this should look familiar as well. The class defines two methods. In __init__,
the first layer takes in input_dim, which is going to be the 18 features, and
outputs 128. You may ask, why do we expand from 18 right up to 128? What that gives us is
basically 128 neurons,
each looking at all 18 input features and computing a different combination of them.
If we just kept it at 18, we wouldn't have much room to find more combinations. By
expanding to 128, we give the model room to find a lot of different patterns. From
there, each layer gets smaller: it goes from 128 to 64, 64 to 32, and 32 to one,
which is the predicted price. So it boils down from a lot of room to find patterns to
combining the successful patterns into fewer, stronger signals until we finally get to one.
Each layer gets narrower and narrower. You're funneling 18 features down to that single
number. Then over here, in forward, each layer is followed by ReLU, like I talked about before,
which zeroes out the negatives. The exception is the last layer, which
outputs the raw predicted price, so we don't want to zero out negative values there. Then
we move on. We create an instance of the model itself and inspect the parameters,
which are the weights that get trained. Then we load the real data we had from step one,
the actual cards, and feed one real card through to get one predicted price.
That's just so you can see each level when we print it
in the terminal. Then we're going to feed a batch of the real cards through
to get a batch of predictions. Now, this is just the first run, basically creating the structure
and the flow of the model. The outputs are all going to be basically random,
because the model hasn't learned anything yet. We haven't trained it. So I'm going to run this
step two model script, which will go through everything that you just saw.
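A model with the layer sizes described above can be sketched like this. The class name, attribute names, and the random stand-in inputs are hypothetical; the layer widths (18, 128, 64, 32, 1) and the ReLU placement follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PricePredictor(nn.Module):  # hypothetical class name
    def __init__(self, input_dim=18):
        super().__init__()
        # Declaring layers here lets nn.Module track their weights automatically.
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.out = nn.Linear(32, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.out(x)  # no ReLU on the last layer: raw price prediction

model = PricePredictor()

# Count every trainable weight and bias the module is tracking.
total_params = sum(p.numel() for p in model.parameters())
print(total_params)  # 12801 for these layer sizes

# Random stand-ins for one card and for a batch of 32 cards.
one_card = torch.randn(1, 18)
batch = torch.randn(32, 18)
with torch.no_grad():  # just inspecting outputs, no training yet
    print(model(one_card).shape)
    print(model(batch).shape)
```

Each linear layer contributes inputs times outputs weights plus one bias per output, which is where the total comes from.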
The script prints everything out. You can see the different layers,
18 to 128, 128 to 64, 64 to 32, and 32 to one, and then the parameter counts. Total
parameters were 12,801. Then we did the forward pass with one card. This model is
untrained, so the actual output price is meaningless. Then we did a batch of 32 cards.
Like I said, these outputs aren't meaningful at all, because this step is just creating the model
structure and flow. In this next script, we're going to go through one iteration
of the training loop. We start with the dataset and DataLoader here. We're going to load the
tensors that we got from step one and use the DataLoader, like we talked about, to create the
batches. You can see how many batches we have, and we're going to take one batch. This is the
setup: we have the device, the model, the loss, and the optimizer. Then it's going to go through
the four steps we talked about before. It's going to snapshot the weights
beforehand, move the batch to the same device as the model, and do the forward pass that
we talked about.
This feeds the batch
through the model and prints predictions versus actual prices. They'll be way off, since the
model is random at first. Then it calculates the loss, which is how wrong
the model is. At first, it'll be very wrong, like I said. Then it does the backward pass,
which finds the gradients. This is the key,
because these are the numbers that tell each weight
which direction to adjust. Then finally, the optimizer step,
which prints the same weights before and after, so you'll be able to see the numbers
actually change, and then zero_grad. This is just going to be one run to show you
what it looks like. So here we go, I ran the script. This is the DataLoader part: the
total cards, and a batch size of 64, so we'll have 34 batches in total.
For one batch, you can see the shapes here:
the features are 64 cards by 18 features, and the prices are 64 cards with one price per card. We go into
the first training step and do the forward pass. It makes these predictions and we see how
wrong it is. It's pretty wrong. Then it goes through the backward pass
to find the gradient values. And then the optimizer step, where you can see the
weights shift slightly. One weight was negative 0.0494 before, and after it was negative
0.048.
So it shifted slightly. All of these shifted only slightly, because you've got to run this
training loop, like I said, probably thousands of times. The key thing to look at here is
the gradients. Before, there were none, obviously. After, you can see they have
these values, and these tell each of the weights which direction to move. The weights change by a very
small amount, because that's the training step doing its job. You don't want to overshoot.
You can see that, for example, this gradient value is negative, by a very small amount.
So comparing this weight before and after, it got slightly more positive:
still negative, but slightly closer to zero. That
was just one small batch, just one training step, and
we have to do this many times. So we're going to do the full training run now.
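Before the full run, the single step walked through above can be reproduced in a few lines. This sketch assumes a one-layer stand-in model, a random batch of 64 cards, and plain SGD with a learning rate of 0.01; the optimizer used in the video is not stated.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lr = 0.01
model = nn.Linear(18, 1)  # stand-in for the full model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # SGD is an assumption

# Hypothetical batch: 64 cards, 18 features, one price per card.
xb = torch.randn(64, 18)
yb = torch.randn(64, 1)

grad_before = model.weight.grad                  # None: nothing computed yet
weights_before = model.weight.detach().clone()   # snapshot before the update

loss = loss_fn(model(xb), yb)                    # forward pass and loss
loss.backward()                                  # backward pass fills in .grad
grads = model.weight.grad.clone()

optimizer.step()                                 # update the weights
weights_after = model.weight.detach().clone()
optimizer.zero_grad()                            # clear gradients for the next step

# With plain SGD, each weight moves by -lr * gradient:
# a negative gradient makes the weight slightly more positive.
delta = weights_after - weights_before
print(grad_before, grads[0, :3], delta[0, :3])
```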
In this final step, you can see it does the same thing with the DataLoader. This is all the same
model setup, and we're breaking it into small parts. It loads the data here
and then splits it: 70% is going to be used for training, 15% for validation,
and 15% for testing, which doesn't happen until the end. This is
the main training loop. You can see there are going to be 100 epochs. In each epoch, it
goes into train mode and runs every batch through the four steps: forward, loss, backward,
and then update. Then it does the evaluation, which runs the validation data
through without updating the weights. This validation step is there to check whether the model
is actually learning real patterns, or whether it's just memorizing the training data. The
validation set is a set of cards the model never trains on. After each
epoch, we check how well the model predicts prices for cards it's never seen, which it obviously
can't have memorized. Ideally, you want to see the training loss and
the validation loss go down together. That means the model is learning actual patterns.
What's concerning is if the training loss is going down but the validation loss starts going back
up. That means the model is memorizing the training cards instead of learning general
price patterns. This is overfitting, which we've explored before on this channel.
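The split and the train/validate rhythm can be sketched as follows. The synthetic data, the small two-layer model, the Adam optimizer, and the 20 epochs are assumptions to keep the example fast; only the 70/15/15 split and the train-then-validate structure come from the walkthrough.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, random_split

torch.manual_seed(0)

# Hypothetical stand-in data: 1,000 rows, 18 features, a noisy linear target.
X = torch.randn(1000, 18)
y = 0.5 * X[:, :1] + 0.1 * torch.randn(1000, 1)
dataset = TensorDataset(X, y)

# 70% train, 15% validation, 15% test (test is held back until the very end).
n_train = int(0.70 * len(dataset))
n_val = int(0.15 * len(dataset))
n_test = len(dataset) - n_train - n_val
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n_test])

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

model = nn.Sequential(nn.Linear(18, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer is an assumption

def run_epoch(loader, train):
    """Return the average loss over one pass; update weights only when train=True."""
    model.train(train)  # train mode when True, eval mode when False
    total, count = 0.0, 0
    with torch.set_grad_enabled(train):  # no gradient tracking during validation
        for xb, yb in loader:
            loss = loss_fn(model(xb), yb)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(xb)
            count += len(xb)
    return total / count

history = []
for epoch in range(20):
    train_loss = run_epoch(train_loader, train=True)
    val_loss = run_epoch(val_loader, train=False)
    history.append((train_loss, val_loss))

print(history[0], history[-1])
```

Watching both numbers in history side by side is the overfitting check: training loss falling while validation loss rises is the warning sign.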
After the full run, when all 100 training epochs are done, we're saving the
results for the final step. So I just ran it. You can see the training
loss going down epoch after epoch: at epoch 10 it was 0.38, and it went down
to 0.16. Same thing with the validation loss, which went from 0.89 down to 0.85. So that's good. That was
training the actual model. The next step is the evaluation. The model is trained
now in PyTorch, so we need to see how well it actually predicts prices on cards it's never seen,
and then generate some charts. This is the code. It loads everything from step
four, predicts on the test set, and then
prints out some metrics, which we didn't go into. The charts use matplotlib, which is
often used in data science.
Okay, so let's run it and see what we get. On the test set predictions, the mean absolute
error was $7.39 and the median absolute error was eight cents. The median is more useful in this
case, because a few expensive cards skew the mean. We can see what our model predicted for
certain cards and what the actual price was. Some of them were quite close, actually. There's a very
large difference on some of these, but even for the more expensive cards, you can see
the model knew they would be more expensive. Some of these are way off, but
at least you can tell the model knew the card was supposed to be more expensive.
These are the visualizations. The red line is the training loss,
which you can see consistently went down. The validation loss was a little bit spiky.
It went down from the start, but then kind of ranged in the middle
for a while, so that means there might have been some overfitting. What this really tells us
is that the model got very good at predicting the cheap cards, the ones in the dollar-or-less
range. The median error was only eight cents, so for most cards it's a very accurate
model. That's because most of the full sample we had were cheap cards. The model didn't
see that many very expensive cards, so it was never really able to learn what makes a card worth $100
plus.
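The gap between the mean and the median error is easy to reproduce with made-up numbers. The error values below are hypothetical, not the model's actual results: eight cheap cards predicted within a few cents and two expensive cards missed by a lot.

```python
from statistics import mean, median

# Hypothetical absolute errors in dollars.
abs_errors = [0.05, 0.08, 0.03, 0.10, 0.07, 0.08, 0.12, 0.06, 45.00, 80.00]

mae = mean(abs_errors)          # pulled up by the two large misses, about 12.56
median_ae = median(abs_errors)  # the typical card, 0.08

print(round(mae, 2), median_ae)
```

Two bad misses out of ten are enough to push the mean into double digits while the median stays at eight cents, which is the same pattern as in the test results.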
And in this specific case with Pokemon cards, there are a lot of other factors, obviously:
collectors, the art. Not everything was captured in the data. The model we built
learned that rarity and type affect price, but it looked at each feature more or less independently.
It has no sense of context. It doesn't know that Charizard plus alt art plus low print
run is more than the sum of its parts. It can't really read a card name and understand what that
means.
So in the next episode, part two of this series, we are going to build the architecture that
actually can, building on this basic model and moving into transformers. Transformers are used
to understand relationships between things, not just isolated features.
But that was it for episode one. With the intro, this ran way over what I was hoping,
but hopefully I can get these under 25-ish minutes. We'll see. I'll
try my best. Obviously, there's a lot more you can get into with PyTorch, and we're going to
continue to use it in future episodes. This was just to give you a fundamental
view of it and how it's used in basic training loops and basic model functions. From here on
out, we're going to get into much more complicated concepts, so it's good to have some foundation.
For these videos, I'm going to keep this format: explaining core concepts like this and
doing a simple project, so that you and I can both hopefully get a better
understanding while we work with larger models in my usual videos. Once the series is done,
if there's a certain topic that people seem really interested in, I'll be happy to do
more of a deep dive on one specific tool, topic, or process, because there's obviously a lot more
to PyTorch once you get into all the details. That's going to be it for this episode. If you liked it,
please leave a like,
subscribe, and leave a comment. Let me know if you actually like this kind of content. It's a little bit
different from what I usually do, but I thought it'd be interesting. And that's it. I'll see you
in the next one.