Transformers, the tech behind LLMs | Deep Learning Chapter 5

2024-04-01 · 27m · Subtitles: official

01 Korean Translation · Korean

Transformers, the Technology That Drives LLMs | Deep Learning Chapter 5

Source: https://www.youtube.com/watch?v=wjZofJX0v4M · Uploaded: 2024-04-01 · Length: 27m

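GPT stands for Generative Pretrained Transformer. The first word is straightforward: these are bots that generate new text. Pretrained means the model has gone through a process of learning from a massive amount of data, and the prefix implies there is room to fine-tune it on specific tasks with additional training. But the real key is the last word, Transformer. A transformer is a specific kind of neural network, a machine learning model, and it is the core invention underlying the current AI boom.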

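This video and the chapters that follow give a visually driven explanation of what actually happens inside a transformer, following the data as it flows through, step by step. Many different kinds of models can be built with transformers. Some take in audio and produce a transcript; others go the other way and produce synthetic speech from text alone. The tools that took the world by storm in 2022, such as DALL-E and Midjourney, which take a text description and produce an image, are also based on transformers. The original transformer, introduced by Google in 2017, was invented for translating text from one language into another. The variant we will focus on, though, is the type underlying tools like ChatGPT: a model that takes in a piece of text and predicts what comes next.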

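The prediction takes the form of a probability distribution over the many chunks of text that might follow. At first glance, predicting the next word seems like a very different goal from generating new text. But once you have a prediction model, you can give it an initial snippet, draw a random sample from the distribution it produces, append that sample to the text, and feed the whole thing back in for a new prediction. For example, running GPT-2 on a laptop and repeatedly predicting and sampling the next chunk from a seed text produces a story that does not make much sense. Swap in API calls to GPT-3, which is the same basic model but much bigger, and suddenly, almost magically, a sensible story comes out. This loop of repeated prediction and sampling is what is happening when you chat with a large language model like ChatGPT and watch it produce one word at a time.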
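The predict, sample, append loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `TOY_MODEL` and `predict_next` are hypothetical stand-ins, and unlike a real transformer the toy predictor looks only at the last token rather than the whole context.

```python
import random

# Hypothetical toy "model": maps the last token to a next-token distribution.
# A real transformer conditions on the entire context, not just the last token.
TOY_MODEL = {
    "once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a": {"time": 0.7, "hill": 0.3},
    "time": {"there": 1.0},
    "hill": {"there": 1.0},
    "there": {"lived": 1.0},
    "lived": {"a": 1.0},
}


def predict_next(tokens):
    """Return a probability distribution over possible next tokens."""
    return TOY_MODEL[tokens[-1]]


def generate(seed, steps, rng):
    """Repeatedly predict, sample one token, and append it to the text."""
    tokens = list(seed)
    for _ in range(steps):
        dist = predict_next(tokens)
        choices = list(dist.keys())
        weights = list(dist.values())
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens


story = generate(["once"], steps=6, rng=random.Random(0))
print(" ".join(story))
```

Swapping `predict_next` for a call to a real model is the only change needed to turn this loop into actual text generation; the sampling and appending stay the same.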

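Data Flow Inside a Transformer

Now let's look at the data flow inside a transformer at a high level. The details of each step are covered at length later, but the big picture is as follows.

First, the input is broken into many small pieces. These pieces are called tokens; for text they are words, pieces of words, or other common character combinations. When images or sound are involved, tokens can be small patches of the image or small chunks of the sound. Each token is associated with a vector, a list of numbers, that somehow encodes the meaning of that piece. If you think of these vectors as coordinates in a very high-dimensional space, words with similar meanings tend to land on vectors that are close together in that space.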

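This sequence of vectors then passes through an operation known as an attention block, where the vectors talk to each other and pass information back and forth to update their values. For example, the meaning of 'model' in "a machine learning model" differs from its meaning in "a fashion model". The attention block is responsible for figuring out which words in the context are relevant to updating the meanings of which other words, and how exactly those meanings should be updated.

Next, the vectors pass through a different kind of operation, called a multi-layer perceptron or feed-forward layer. Here the vectors do not talk to each other; they all go through the same operation in parallel. This step can be likened to asking a long list of questions about each vector and updating its values based on the answers.

All of the operations in the attention and multi-layer perceptron blocks look like a giant pile of matrix multiplications, and our main job is to understand how to read those matrices. The process alternates between attention blocks and multi-layer perceptron blocks many times, and the hope is that by the very end all of the essential meaning of the passage has been baked into the last vector in the sequence. Applying a certain operation to that last vector yields a probability distribution over every token that might come next.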

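The Basic Premise of Deep Learning

Let's pause on the basic premise of deep learning. Machine learning is an approach in which data determines how a model behaves. Suppose you want a function that takes an image and produces a description, or predicts the next word from a passage of text, or performs any other task that calls for intuition and pattern recognition. Rather than explicitly coding a procedure, as in the earliest days of AI, you set up a flexible structure with many tunable parameters and then use many examples of what the output should be for a given input to tune those parameters.

The simplest form of machine learning is arguably linear regression. The input and output are each a single number, say the floor area of a house and its price, and the goal is to find the line that best fits the data. There are just two parameters: the slope and the y-intercept. Deep learning models are, needless to say, far more complicated. GPT-3 has not 2 parameters but 175 billion. Deep learning is a class of models that over the last couple of decades has proven to scale remarkably well, and all of them use the same training algorithm, backpropagation. For that algorithm to work well at scale, the models have to follow a certain format.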
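To make the two-parameter case concrete, here is a minimal sketch of fitting a line by ordinary least squares. The closed-form formula is the standard one; the house data is made up for illustration (it lies exactly on the line price = 3 × area + 50), and the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input and one output.

    Returns the two tuned parameters: (slope, intercept).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: floor areas and prices that lie exactly on price = 3 * area + 50.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [200.0, 290.0, 350.0, 410.0]

slope, intercept = fit_line(areas, prices)
print(slope, intercept)
```

Here the data alone determines both parameters, which is the same idea that deep learning applies to billions of parameters, only with backpropagation in place of a closed-form solution.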

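Whatever the kind of model, the input has to be formatted as an array of real numbers. That can be a list of numbers, a two-dimensional array, or a higher-dimensional array, generally called a tensor. The input data is progressively transformed through many layers, each of which is again structured as an array of real numbers. The final layer is the output; in our text model it is a list of numbers representing the probability distribution over next tokens.

In deep learning, the model's parameters are almost always called weights, because the only way these parameters interact with the data is through weighted sums. Non-linear functions are sprinkled in between, but they do not depend on the parameters. Rather than writing the weighted sums out explicitly, they are usually packaged as the components of a matrix-vector product. It is conceptually cleaner to think of matrices filled with tunable parameters transforming vectors drawn from the data. GPT-3's 175 billion weights are organized into just under 28,000 distinct matrices, which fall into eight categories. We will go through what each category does, one at a time.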
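A short sketch may help show that a matrix-vector product is just a stack of weighted sums, followed by a parameter-free non-linearity. The weights and data below are made up, and ReLU is used only as one common example of such a non-linearity; the text above does not name a specific function.

```python
def matvec(matrix, vector):
    """Each output component is a weighted sum of the input components."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(values):
    """A non-linear function with no tunable parameters (assumed example)."""
    return [max(0.0, v) for v in values]


# Weights (the tunable parameters) and data (the vector being processed).
W = [
    [1.0, -2.0, 0.5],
    [-1.0, 1.0, 0.0],
]
x = [2.0, 1.0, 4.0]

hidden = matvec(W, x)      # [1*2 - 2*1 + 0.5*4, -1*2 + 1*1 + 0*4]
activated = relu(hidden)
print(hidden, activated)
```

Only `W` would change during training; `x` changes with every input, which mirrors the distinction between weights and data drawn throughout the video.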

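Embedding: Turning Words into Vectors

The first step in processing text is to break the input into small pieces and turn them into vectors. The model has a predefined vocabulary, say a list of 50,000 possible words, and the first matrix we encounter, the embedding matrix, has one column for each of these words. These columns determine which vector each word turns into in the first step. The matrix is labeled W_E, and like all the other matrices, its values start out random and are learned from data.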
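As a rough sketch of what "one column per word" means, the snippet below looks up a word's vector by selecting a column of a tiny, made-up `W_E`, and checks that this matches multiplying `W_E` by a one-hot vector. The four-word vocabulary, the three dimensions, and all values are assumptions for illustration only.

```python
# Hypothetical 4-word vocabulary with 3-dimensional embeddings (values made up).
VOCAB = ["the", "king", "queen", "tower"]

# W_E has one row per embedding dimension and one column per vocabulary word.
W_E = [
    [0.1, 0.9, 0.8, -0.3],
    [0.0, 0.4, -0.4, 0.2],
    [0.5, 0.1, 0.1, 0.7],
]


def embed(word):
    """Look up a word's vector by reading off its column of W_E."""
    col = VOCAB.index(word)
    return [row[col] for row in W_E]


def embed_via_one_hot(word):
    """Same result, written as W_E times a one-hot vector."""
    one_hot = [1.0 if w == word else 0.0 for w in VOCAB]
    return [sum(w * x for w, x in zip(row, one_hot)) for row in W_E]


print(embed("queen"))
```

The one-hot form shows why the lookup still fits the "everything is a matrix-vector product" picture, even though in practice a direct column lookup is cheaper.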

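Turning words into vectors was common practice long before transformers, but it feels a little strange the first time you see it. This is called a word embedding, and it invites you to think of the vectors geometrically, as points in a high-dimensional space. GPT-3's embeddings have 12,288 dimensions.

The key idea is that as the model tunes its weights during training, it tends to settle on embeddings in which directions in the space carry semantic meaning. If you search for the words whose embeddings are closest to that of 'tower', they all give off a similar tower-like feel.

A famous example illustrates this. If you take the difference between the vectors for 'woman' and 'man', a small vector pointing from the tip of one to the tip of the other, it is very similar to the difference between 'queen' and 'king'. So if you did not know the word for a female monarch, you could find it, at least approximately, by adding the 'woman - man' direction to 'king' and looking for the nearest embedding. As another example, taking the embedding of 'Italy', subtracting 'Germany', and adding 'Hitler' gives a result very close to 'Mussolini'. It is as if the model has associated some directions with Italian-ness and others with WWII Axis leaders. Another fun example: taking the difference between 'Germany' and 'Japan' and adding it to 'sushi' lands very close to 'bratwurst'.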
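The analogy arithmetic can be sketched with hand-made two-dimensional vectors. These embeddings are invented so that one axis loosely tracks "royalty" and the other "gender"; real learned embeddings are far higher-dimensional and only approximately this tidy.

```python
import math

# Made-up 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
EMBEDDINGS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "tower": [-1.0, 0.0],
}


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    candidates = (w for w in EMBEDDINGS if w not in exclude)
    return min(candidates, key=lambda w: math.dist(EMBEDDINGS[w], vector))


# king + (woman - man), then search for the nearest embedding.
target = add(EMBEDDINGS["king"], sub(EMBEDDINGS["woman"], EMBEDDINGS["man"]))
answer = nearest(target, exclude=("king", "woman", "man"))
print(target, answer)
```

Excluding the query words from the search is a common convention in analogy demos, since the starting word is often its own nearest neighbor.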

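One piece of mathematics to keep in mind is that the dot product of two vectors measures how well they point in the same direction. Computationally, you multiply corresponding components and add up the results; geometrically, it is positive when the vectors point in similar directions, zero when they are perpendicular, and negative when they point in opposite directions. For example, suppose you hypothesize that the vector 'cats - cat' represents a plurality direction. Taking the dot product of this vector with the embeddings of singular nouns and of their corresponding plurals, the plurals consistently give higher values.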
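The three geometric cases are easy to check directly. The vectors below are arbitrary examples chosen for illustration, not embeddings from any model.

```python
def dot(a, b):
    """Multiply corresponding components and add up the results."""
    return sum(x * y for x, y in zip(a, b))


aligned = dot([1.0, 2.0], [2.0, 4.0])        # similar directions
perpendicular = dot([1.0, 0.0], [0.0, 1.0])  # at right angles
opposed = dot([1.0, 2.0], [-1.0, -2.0])      # opposite directions

print(aligned, perpendicular, opposed)
```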

The embedding matrix is the model's first batch of weights. GPT-3's vocabulary size is exactly 50,257 and its embedding dimension is 12,288, so multiplying the two gives about 617 million weights.
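The count is a single multiplication, shown here for readers who want to verify the figure. The variable names are illustrative; the two numbers are the ones quoted above.

```python
vocab_size = 50_257      # GPT-3 vocabulary size quoted in the text
embedding_dim = 12_288   # GPT-3 embedding dimension quoted in the text

# One embedding_dim-long column per vocabulary entry.
embedding_params = vocab_size * embedding_dim
print(f"{embedding_params:,}")
```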

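Soaking Up Context and the Context Size

In a transformer, the vectors in the embedding space do not merely represent individual words. They also encode information about the word's position and, more importantly, they can soak up context. For example, a vector that starts out as the embedding of 'king' is progressively tugged and pulled as it passes through the network's blocks, so that by the end it points in a much more specific and nuanced direction: a king who lived in Scotland, who took the throne by murdering the previous king, and who is described in Shakespearean language.

The network can only process a fixed number of vectors at a time, known as its context size. GPT-3's context size is 2,048. This limits how much text the transformer can take into account when predicting the next word, and it is why long conversations with early versions of ChatGPT often gave the feeling that the bot was losing the thread of the conversation.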
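One simple way to picture the limit is a window that keeps only the most recent tokens. This is a sketch under an assumption: dropping the oldest tokens is just one possible strategy, and the text does not say how any particular chatbot handles overflow. The function name `visible_context` is hypothetical.

```python
CONTEXT_SIZE = 2_048  # GPT-3 context size quoted in the text


def visible_context(token_ids, context_size=CONTEXT_SIZE):
    """Keep only the most recent `context_size` tokens (assumed strategy)."""
    return token_ids[-context_size:]


# A made-up conversation of 3,000 token ids, longer than the window.
conversation = list(range(3_000))
window = visible_context(conversation)
print(len(window), window[0], window[-1])
```

Under this strategy the first 952 tokens of the conversation are simply invisible to the model, which is one way a bot can appear to forget earlier parts of a long exchange.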

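Unembedding and Producing the Probability Distribution

Let's look at what happens at the very end of the network. The desired output is a probability distribution over every token that might come next. For example, if the last word is 'Professor', the context includes 'Harry Potter', and the words immediately before are 'least favorite teacher', a well-trained network would assign a high probability to 'Snape'.

This involves two steps. First, another matrix, called the unembedding matrix (W_U), maps the last vector in the context to a list of about 50,000 values, one for each token in the vocabulary. Second, the softmax function normalizes this list into a probability distribution.

It may seem odd to use only the last embedding and discard the rest, but during training every vector in the final layer is used to simultaneously predict the word that immediately follows it. This makes training much more efficient.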
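The efficiency point can be illustrated by listing the prediction targets a single training passage provides. This sketch only enumerates the (context, next token) pairs; the helper name `next_token_pairs` is hypothetical, and how the network scores each pair is outside what this chapter covers.

```python
def next_token_pairs(tokens):
    """Every position in a passage yields one (context, next token) example."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs(["the", "cat", "sat", "down"])
for context, target in pairs:
    print(context, "->", target)
```

A passage of n tokens therefore yields n - 1 training examples from one pass, rather than just one example for the final position.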

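The unembedding matrix has one row for each word in the vocabulary, and each row has as many entries as the embedding dimension. It is very similar to the embedding matrix, just with the order swapped. It adds roughly another 617 million parameters, bringing the running total to a little over 1 billion.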
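The unembedding step can be sketched as a small matrix-vector product that produces one score per vocabulary word. The three-word vocabulary, the two-dimensional vectors, and all values are made up for illustration; turning the scores into probabilities is the job of softmax, covered in the next section.

```python
# Hypothetical 3-word vocabulary and 2-dimensional embeddings (values made up).
VOCAB = ["Snape", "Lupin", "Hagrid"]

# W_U has one row per vocabulary word; each row has embedding-dimension entries.
W_U = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def unembed(last_vector):
    """Map the final context vector to one score per vocabulary word."""
    return [sum(w * x for w, x in zip(row, last_vector)) for row in W_U]


logits = unembed([2.0, -1.0])
best_word = VOCAB[logits.index(max(logits))]
print(logits, best_word)
```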

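The Softmax Function

Let's say a bit more about the softmax function. For a sequence of numbers to act as a probability distribution, each value has to be between 0 and 1 and they all have to add up to 1. The output of a matrix-vector product generally does not satisfy these conditions: values can be negative or much larger than 1, and they need not add up to 1.

Softmax is the standard way to turn an arbitrary list of numbers into a valid probability distribution, such that the largest values end up closest to 1 and the smaller values end up close to 0. It works by raising e to the power of each number, which makes every value positive, and then dividing by the sum of all those values to normalize.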
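A minimal implementation follows the description directly: exponentiate, then divide by the sum. Subtracting the maximum before exponentiating is an added numerical-stability trick that is not mentioned in the text; it leaves the result unchanged because the common factor cancels in the division.

```python
import math


def softmax(values):
    """Turn an arbitrary list of numbers into a probability distribution."""
    # Assumption beyond the text: shift by the max to avoid overflow in exp.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```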

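If one of the inputs is meaningfully larger than the rest, the corresponding term dominates the output distribution. But it is softer than simply picking the maximum, because other values that are similarly large also receive meaningful weight.

In some situations, such as when ChatGPT uses this distribution to choose the next word, you can add some fun by putting a constant T, called the temperature, in the denominator of the exponents. When T is large, more weight goes to the lower values and the distribution becomes more uniform; when T is small, the larger values dominate more aggressively. In the limit where T goes to 0, all of the weight goes to the maximum value.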
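The effect of T can be checked numerically with a small sketch. The logits are made up, the function name is hypothetical, and treating T = 0 as "put all weight on the maximum" is an explicit special case here, since dividing by zero is undefined and the text describes it as the limiting behavior.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / T; T = 0 is handled as the argmax limit."""
    if temperature == 0:
        best = logits.index(max(logits))  # ties go to the first maximum
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # stability shift, cancels in the division
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 4.0]  # made-up scores
hot = softmax_with_temperature(logits, 5.0)
base = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.2)
zero = softmax_with_temperature(logits, 0)
print(hot, base, cold, zero)
```

Comparing the largest probability across the four settings shows the trend described above: it shrinks toward a uniform share as T grows and approaches 1 as T falls.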

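For example, if you have GPT-3 write a story from the seed text "Once upon a time…" at different temperatures, temperature 0 always picks the most predictable word and yields a trite variation on Goldilocks. A higher temperature gives the model a chance to choose less likely words, so the story starts out more original but risks quickly degenerating into nonsense.

Just as the components of softmax's output are called probabilities, its inputs are called logits. When you feed in text, all the word embeddings flow through the network, and the unnormalized values that come out of the final multiplication with the unembedding matrix are the logits for the next-word prediction.

A Bridge to Attention

Much of the goal of this chapter was to lay the groundwork for understanding the attention mechanism. With a strong intuition for word embeddings, for softmax, for how dot products measure similarity, and for the premise that most of the computation has to take the form of multiplication by matrices full of tunable parameters, understanding the attention mechanism, the cornerstone of the entire modern AI boom, should go relatively smoothly.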

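02 Research Document · Document

The Transformer Dissected: A Look into the Heart of LLMs

Source: YouTube · Uploaded: 2024-04-01 · Length: 27m

Introduction

ChatGPT, Claude, Gemini: the large language models (LLMs) now changing the world share a single underlying architecture, the **Transformer**. First proposed in the 2017 paper *"Attention Is All You Need"* by eight Google researchers, this architecture largely displaced the recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that had been mainstream in natural language processing, and reshaped the AI landscape. In this video, math education YouTuber 3Blue1Brown (Grant Sanderson) visually traces the data flow inside a transformer so that non-experts can grasp the core principles intuitively.

This article summarizes the key points of the video and adds context from outside sources.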

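Main Body

1. Three Keywords in the Name GPT

GPT stands for Generative Pretrained Transformer. Generative: it produces new text. Pretrained: it is first trained on a large amount of data and can then be fine-tuned for specific tasks. Transformer: a neural network architecture based on the attention mechanism. These three words capture the essence of modern LLMs.

The video focuses on the third word. It walks through how a transformer takes in text and outputs "a probability distribution over the next token", and how applying this prediction repeatedly generates long passages of text.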

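2. The Big Picture of the Data Flow

The data flow inside a transformer can be divided into four broad stages.

Tokenization and embedding: The input text is broken into small pieces called tokens, and each token is turned into a high-dimensional vector via the embedding matrix (W_E). In GPT-3, with a vocabulary of 50,257 tokens and an embedding dimension of 12,288, this accounts for about 617 million parameters.

Attention block: The vectors exchange information with one another and are updated to reflect context. That 'model' is interpreted differently in "machine learning model" and "fashion model" is a result of this stage.

Multi-layer perceptron (MLP) block: The vectors go through the same transformation in parallel without talking to one another. This can be likened to asking a long list of questions about each vector and updating its values based on the answers.

Unembedding and softmax: The final vector passes through the unembedding matrix (W_U) and becomes one score (logit) per vocabulary entry, and the softmax function normalizes these into a probability distribution.

As attention and MLP blocks alternate many times, the essential meaning of the whole context is condensed into the last vector. GPT-3 consists of just under 28,000 such matrices holding 175 billion weights in total.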

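3. The Geometry of Word Embeddings

One of the most striking parts of the video is its visualization of the geometric properties of word embeddings. Starting from the famous example found by Tomas Mikolov's team with Word2Vec in 2013, 'king - man + woman ≈ queen', it shows how directions in the embedding space carry semantic meaning.

Examples such as 'Italy - Germany + Hitler ≈ Mussolini' and 'Germany - Japan + sushi ≈ bratwurst' suggest that the model has encoded abstract relationships such as nationality, historical role, and food culture as directions in the vector space. The dot product quantifies the similarity of two vectors and later becomes central to the attention mechanism.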

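4. The Constraint and Significance of Context Size

A transformer can process only a fixed number of tokens at a time. GPT-3's context size is 2,048 tokens, which is the upper bound on what the model can "remember". This is why long conversations with early ChatGPT could feel as if the bot was losing the thread.

Later models raised this limit dramatically: GPT-4 Turbo to 128K tokens, Claude 3 to 200K tokens, and Gemini 1.5 Pro to as many as 1M tokens. Even so, making uniform use of information across a long context remains an open research problem.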

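5. Temperature and the Spectrum of Creativity

Introducing a temperature parameter T into the softmax function lets you adjust the "creativity" of the generated text. At T=0 the model always picks the most probable token, so the result is predictable and trite. Raising T gives less likely tokens a chance, which allows more original results, but too high a value degenerates into nonsense.

The video shows an experiment in which GPT-3 is given "Once upon a time…" and generates stories at different temperatures. Temperature 0 yields a trite variation on Goldilocks; a high temperature yields an original opening about a web artist from South Korea that soon degenerates into meaningless sentences. The contrast makes the gap between theory and practice tangible.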

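Key Insights

Almost all of the computation in a transformer is matrix multiplication. The 175 billion parameters are held in just under 28,000 matrices, and the bulk of what a transformer does is these matrices transforming data vectors.
Word embeddings are not mere codes but a geometry of meaning. Directions in the vector space themselves form semantic axes such as gender, nationality, and plurality.
The attention mechanism is the key to understanding context. It is the source of the ability to interpret the same word in entirely different ways depending on the surrounding context.
A prediction model is also a generative model. The simple objective of "predicting the next token", applied repeatedly, leads to the generation of long, coherent text.
A single temperature setting moves between determinism and creativity. The softmax temperature parameter is a key control knob when using LLMs.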

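Further Reading

Attention Is All You Need (Vaswani et al., 2017): the original paper that first proposed the transformer architecture
Language Models are Few-Shot Learners (Brown et al., 2020): the OpenAI paper on the 175-billion-parameter GPT-3 model and few-shot learning
The Illustrated Transformer (Jay Alammar): a blog post that explains the transformer architecture with step-by-step illustrations
Transformer Explainer (Polo Club of Data Science): an interactive transformer visualization tool based on GPT-2
The Illustrated Word2Vec (Jay Alammar): a visual guide to how word embeddings work
3Blue1Brown deep learning series: an interactive version of the full series, including this video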

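03 Pro and Con Debate · Debate

Debate: Can the simple principle of "next-token prediction" produce genuine language understanding?

Round 1

🟢 Pro

The most powerful message of the 3Blue1Brown video is that the transformer is built on a remarkably simple principle. It turns the input into vectors, repeatedly multiplies by matrices, and uses softmax to produce a probability distribution. Almost all of this is a composition of one basic operation, the weighted sum. Yet when this simple structure was merely scaled up from GPT-2 to GPT-3, it actually produced the leap from "a story that doesn't make sense" to "a sensible story".

This is no accident. The scaling laws study by Kaplan et al. (2020) showed empirically that performance improves along a predictable power law as model size, data volume, and compute increase. That the training objective of next-token prediction is "simple" is a strength, not a weakness. Because the objective is simple, it applies universally to large-scale data, and text, images, audio, and code can all be handled in the same framework.

The geometric properties of word embeddings shown in the video ('king - man + woman ≈ queen', 'Italy - Germany + Hitler ≈ Mussolini') are evidence that the model does not merely memorize statistical co-occurrence but encodes abstract relationships in the structure of the vector space. According to research from Google Research (2024), neuron activation patterns corresponding to abstract concepts "larger than a token" form naturally in a transformer's internal embeddings, and a single hidden state implicitly builds a "multi-step representation" that predicts several future tokens with surprising accuracy.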

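🔴 Con

The Pro side lists impressive empirical results but sidesteps the core question: is there a real difference between statistical pattern matching and genuine understanding?

When a transformer answers that "the capital of Portugal is Lisbon", has it "understood" the relationship between Portugal and Lisbon, or are 'Portugal' and 'Lisbon' simply strongly linked statistically in the training data? If the latter, its ability to infer new relationships absent from the training data is fundamentally limited.

There is a more fundamental problem. An autoregressive model, by design, generates one token at a time in sequence. In other words, it structurally lacks the ability to lay out an overall plan before it starts generating. When humans write, they first conceive the structure of the whole argument, design the flow from introduction to conclusion, and then fill in the sentences. A transformer works the opposite way: it merely guesses the next token from the tokens so far. This structural limitation is not solved by scale, however large.

The scaling laws the Pro side cites are themselves running into limits. Since 2024 there have been repeated observations that pre-training scaling has hit a plateau, and the main source of performance gains is shifting from larger pre-training runs to post-training and test-time compute. There are also estimates that an "irreducible loss" floor of about 1.82 exists that no amount of scale can get below.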

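Round 2

🟢 Pro (rebuttal)

The Con side presented a dichotomy of "statistical pattern matching vs. genuine understanding", but the distinction itself is questionable. Human language understanding is also built on accumulated empirical patterns. The way a child learns the meaning of "hot" is not essentially different from statistically learning the patterns in which the word appears across countless contexts. What matters is the result, not the mechanism.

The Con side pointed to the autoregressive model's "inability to plan", but techniques such as chain-of-thought prompting work around this limitation by having the model explicitly generate intermediate reasoning steps. In effect, the "plan" is externalized in the form of tokens. OpenAI's o1 model and its successors increased test-time compute and substantially outperformed earlier models on complex math problems and coding tasks.

The Con side's claim about stalled scaling laws is also overstated. Even if pre-training scaling has slowed, a new dimension has opened up: test-time compute scaling. This is not "the failure of the transformer" but "the evolution of how the transformer is used". The architecture itself still serves as the core foundation.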

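🔴 Con (counter-rebuttal)

The Pro side tried to dissolve the distinction by claiming that "human language understanding is pattern matching too", but this muddies the issue. Humans do not learn from text patterns alone. Sensory experience, bodily interaction, and social context form the foundation of language understanding. As long as transformers learn from text alone, this "grounding" problem remains unsolved.

The Pro side's reply that chain of thought "externalizes planning" does not solve the problem either. Even if the reasoning steps are generated as tokens, each step is still a next-token prediction from the preceding tokens. Genuine planning requires the ability to work backward from a goal, whereas the autoregressive structure inherently runs forward only. The o1 model's impressive results are performance on specific benchmarks, not evidence of general reasoning ability.

More importantly, there is the efficiency problem of the transformer architecture itself. The computational cost of the attention mechanism grows with the square of the context length. This helps explain why the 2,048-token context size mentioned in the video is so limited. State space models such as Mamba process sequences of up to a million tokens with linear complexity while matching the performance of transformers of comparable size. A 2024 study from NVIDIA reported that hybrid models combining attention and SSM layers can outperform pure transformers. The transformer may not be the "final answer".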
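The quadratic growth is easy to see by counting pairwise scores. This sketch assumes the simplest accounting, one score for every ordered pair of positions in a single attention computation, and ignores constant factors such as the number of heads and layers; the function name is hypothetical.

```python
def attention_score_count(context_length):
    """Number of pairwise scores when every position attends to every position."""
    return context_length * context_length


small = attention_score_count(2_048)
doubled = attention_score_count(4_096)
print(small, doubled, doubled // small)
```

Doubling the context length quadruples the count, which is the scaling behavior the argument above relies on.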

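Round 3

🟢 Pro

The Con side raised the grounding problem and cited alternative architectures such as Mamba to argue for the transformer's limits. Both points have merit, but they lead to the wrong conclusion.

On grounding: multimodal transformers already process images, audio, and video together with text, not text alone. The video itself emphasized that image generation models such as DALL-E and Midjourney, as well as speech synthesis models, are based on transformers. The premise that transformers "learn from text alone" already fails to match reality.

On Mamba and hybrid models: the Con side itself acknowledged that "hybrid models combining attention and SSM layers" are promising. That is an extension of the transformer, not a replacement. IBM's Granite 4.0 and AI21's Jamba series both retain attention layers as core components. The attention mechanism is not disappearing; it is evolving in combination with more efficient components.

What the 3Blue1Brown video truly shows is the remarkable fact that a geometry of meaning emerges spontaneously from the simple operation of matrix multiplication. This principle of "emergence" is the transformer's real power, and it is an insight that will remain central regardless of scale or architectural variation.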

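🔴 Con

The Pro side cited multimodal transformers to argue that the grounding problem is being solved, but processing images and text together is different from having a fundamental understanding of the world. Multimodal models still only learn the statistical distribution of data, and whether the causal relationship "touching a hot pot burns you" can be genuinely understood without sensation remains an open question.

The Pro side's claim that hybrid models "retain" attention is true, but it subtly shifts the issue. The question is not "is attention useful?" but "is the pure transformer architecture the final form of AI?" The video explains the principles of the transformer beautifully, but that very beauty may be a trap: being "understandable" and being "sufficient" are different matters.

Finally, there is an elephant in the room that the video deliberately leaves out: energy and resources. The electricity and carbon emissions required to train GPT-3's 175 billion parameters, together with the cost of inference, raise a fundamental question about whether this elegant mathematical structure is socially sustainable. Technical elegance alone cannot justify a technology.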

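🧭 Synthesis

Points of agreement:

The transformer's attention mechanism is the most powerful context-processing tool found so far, and alternative architectures are developing in the direction of combining with it rather than replacing it entirely.
The semantic geometry of word embeddings is an impressive result demonstrating that abstract relationships can emerge from simple weighted sums.
Next-token prediction is a surprisingly general training objective, but whether it alone can reach artificial general intelligence is an unresolved question.

Open questions:

Can autoregressive models carry out "planning" and "backward reasoning" in the true sense, or does that require a fundamentally different architecture?
Are the diminishing returns of scaling a "wall", or a "bend" that can be bypassed along new dimensions (test-time compute, synthetic data, hybrid architectures)?
Where is the boundary between statistical pattern matching and "genuine understanding"? Is the distinction meaningful in practice?

A further perspective: the video concentrates on explaining the transformer, but perhaps its deepest implication lies in what it does not address explicitly. If the appearance of meaning and reasoning emerges from repeated matrix multiplication, the very concept of "understanding" may need to be redefined. If human understanding is also a statistical pattern of signals passed between neurons, the difference between transformers and humans may be one of degree rather than of kind. This question goes beyond engineering to the intersection of cognitive science and philosophy, and it is also why 3Blue1Brown's visual explanation is so compelling: what the mathematics reveals is not only the technology but also a clue to the nature of mind.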

04 English Original · Transcript

The initials GPT stand for Generative Pretrained Transformer.
So that first word is straightforward enough, these are bots that generate new text.
Pretrained refers to how the model went through a process of learning
from a massive amount of data, and the prefix insinuates that there's
more room to fine-tune it on specific tasks with additional training.
But the last word, that's the real key piece.
A transformer is a specific kind of neural network, a machine learning model,
and it's the core invention underlying the current boom in AI.
What I want to do with this video and the following chapters is go through a
visually-driven explanation for what actually happens inside a transformer.
We're going to follow the data that flows through it and go step by step.
There are many different kinds of models that you can build using transformers.
Some models take in audio and produce a transcript.
This sentence comes from a model going the other way around,
producing synthetic speech just from text.
All those tools that took the world by storm in 2022 like DALL-E and Midjourney
that take in a text description and produce an image are based on transformers.
Even if I can't quite get it to understand what a pi creature is supposed to be,
I'm still blown away that this kind of thing is even remotely possible.
And the original transformer introduced in 2017 by Google was invented for
the specific use case of translating text from one language into another.
But the variant that you and I will focus on, which is the type that
underlies tools like ChatGPT, will be a model that's trained to take in a piece of text,
maybe even with some surrounding images or sound accompanying it,
and produce a prediction for what comes next in the passage.
That prediction takes the form of a probability distribution
over many different chunks of text that might follow.
At first glance, you might think that predicting the next word
feels like a very different goal from generating new text.
But once you have a prediction model like this,
a simple thing you could try to make it generate a longer piece of text
is to give it an initial snippet to work with,
have it take a random sample from the distribution it just generated,
append that sample to the text, and then run the whole process again to make
a new prediction based on all the new text, including what it just added.
I don't know about you, but it really doesn't feel like this should actually work.
In this animation, for example, I'm running GPT-2 on my laptop and having it repeatedly
predict and sample the next chunk of text to generate a story based on the seed text.
The story just doesn't actually really make that much sense.
But if I swap it out for API calls to GPT-3 instead, which is the same basic model,
just much bigger, suddenly almost magically we do get a sensible story,
one that even seems to infer that a pi creature would live in a land of math and
computation.
This process here of repeated prediction and sampling is essentially
what's happening when you interact with ChatGPT,
or any of these other large language models, and you see them producing
one word at a time.
In fact, one feature that I would very much enjoy is the ability to
see the underlying distribution for each new word that it chooses.
Let's kick things off with a very high level preview
of how data flows through a transformer.
We will spend much more time motivating and interpreting and expanding
on the details of each step, but in broad strokes,
when one of these chatbots generates a given word, here's what's going on under the hood.
First, the input is broken up into a bunch of little pieces.
These pieces are called tokens, and in the case of text these tend to be
words or little pieces of words or other common character combinations.
If images or sound are involved, then tokens could be little
patches of that image or little chunks of that sound.
Each one of these tokens is then associated with a vector, meaning some list of numbers,
which is meant to somehow encode the meaning of that piece.
If you think of these vectors as giving coordinates in some very high dimensional space,
words with similar meanings tend to land on vectors that are
close to each other in that space.
This sequence of vectors then passes through an operation that's
known as an attention block, and this allows the vectors to talk to
each other and pass information back and forth to update their values.
For example, the meaning of the word model in the phrase "a machine learning
model" is different from its meaning in the phrase "a fashion model".
The attention block is what's responsible for figuring out which
words in context are relevant to updating the meanings of which other words,
and how exactly those meanings should be updated.
And again, whenever I use the word meaning, this is
somehow entirely encoded in the entries of those vectors.
After that, these vectors pass through a different kind of operation,
and depending on the source that you're reading this will be referred
to as a multi-layer perceptron or maybe a feed-forward layer.
And here the vectors don't talk to each other,
they all go through the same operation in parallel.
And while this block is a little bit harder to interpret,
later on we'll talk about how the step is a little bit like asking a long list
of questions about each vector, and then updating them based on the answers
to those questions.
All of the operations in both of these blocks look like a
giant pile of matrix multiplications, and our primary job is
going to be to understand how to read the underlying matrices.
I'm glossing over some details about some normalization steps that happen in between,
but this is after all a high-level preview.
After that, the process essentially repeats, you go back and forth
between attention blocks and multi-layer perceptron blocks,
until at the very end the hope is that all of the essential meaning
of the passage has somehow been baked into the very last vector in the sequence.
We then perform a certain operation on that last vector that produces a probability
distribution over all possible tokens, all possible little chunks of text that might
come next.
And like I said, once you have a tool that predicts what comes next
given a snippet of text, you can feed it a little bit of seed text and
have it repeatedly play this game of predicting what comes next,
sampling from the distribution, appending it, and then repeating over and over.
Some of you in the know may remember how long before ChatGPT came into the scene,
this is what early demos of GPT-3 looked like,
you would have it autocomplete stories and essays based on an initial snippet.
To make a tool like this into a chatbot, the easiest starting point is to have a
little bit of text that establishes the setting of a user interacting with a
helpful AI assistant, what you would call the system prompt,
and then you would use the user's initial question or prompt as the first bit of
dialogue, and then you have it start predicting what such a helpful AI assistant
would say in response.
There is more to say about an added step of training that's required
to make this work well, but at a high level this is the idea.
In this chapter, you and I are going to expand on the details of what happens at the very
beginning of the network, at the very end of the network,
and I also want to spend a lot of time reviewing some important bits of background
knowledge, things that would have been second nature to any machine learning engineer by
the time transformers came around.
If you're comfortable with that background knowledge and a little impatient,
you could probably feel free to skip to the next chapter,
which is going to focus on the attention blocks,
generally considered the heart of the transformer.
After that, I want to talk more about these multi-layer perceptron blocks,
how training works, and a number of other details that will have been skipped up to
that point.
For broader context, these videos are additions to a mini-series about deep learning,
and it's okay if you haven't watched the previous ones,
I think you can do it out of order, but before diving into transformers specifically,
I do think it's worth making sure that we're on the same page about the basic premise
and structure of deep learning.
At the risk of stating the obvious, this is one approach to machine learning,
which describes any model where you're using data to somehow determine how a model
behaves.
What I mean by that is, let's say you want a function that takes in
an image and it produces a label describing it,
or our example of predicting the next word given a passage of text,
or any other task that seems to require some element of intuition
and pattern recognition.
We almost take this for granted these days, but the idea with machine learning is that
rather than trying to explicitly define a procedure for how to do that task in code,
which is what people would have done in the earliest days of AI,
instead you set up a very flexible structure with tunable parameters,
like a bunch of knobs and dials, and then, somehow,
you use many examples of what the output should look like for a given input to tweak
and tune the values of those parameters to mimic this behavior.
For example, maybe the simplest form of machine learning is linear regression,
where your inputs and outputs are each single numbers,
something like the square footage of a house and its price,
and what you want is to find a line of best fit through this data, you know,
to predict future house prices.
That line is described by two continuous parameters,
say the slope and the y-intercept, and the goal of linear
regression is to determine those parameters to closely match the data.
Needless to say, deep learning models get much more complicated.
GPT-3, for example, has not two, but 175 billion parameters.
But here's the thing, it's not a given that you can create some giant
model with a huge number of parameters without it either grossly
overfitting the training data or being completely intractable to train.
Deep learning describes a class of models that in the
last couple decades have proven to scale remarkably well.
What unifies them is that they all use the same training algorithm,
it's called backpropagation, we talked about it in previous chapters,
and the context that I want you to have as we go in is that in order for this
training algorithm to work well at scale, these models have to follow a certain
specific format.
And if you know this format going in, it helps to explain many of the choices for how a
transformer processes language, which otherwise run the risk of feeling kinda arbitrary.
First, whatever kind of model you're making, the
input has to be formatted as an array of real numbers.
This could simply mean a list of numbers, it could be a two-dimensional array,
or very often you deal with higher dimensional arrays,
where the general term used is tensor.
You often think of that input data as being progressively transformed into many
distinct layers, where again, each layer is always structured as some kind of
array of real numbers, until you get to a final layer which you consider the output.
For example, the final layer in our text processing model is a list of numbers
representing the probability distribution for all possible next tokens.
In deep learning, these model parameters are almost always referred to as weights,
and this is because a key feature of these models is that the only way these
parameters interact with the data being processed is through weighted sums.
You also sprinkle some non-linear functions throughout,
but they won't depend on parameters.
Typically, though, instead of seeing the weighted sums all naked
and written out explicitly like this, you'll instead find them
packaged together as various components in a matrix vector product.
It amounts to saying the same thing, if you think back to how matrix vector
multiplication works, each component in the output looks like a weighted sum.
It's just often conceptually cleaner for you and me to think
about matrices that are filled with tunable parameters that
transform vectors that are drawn from the data being processed.
For example, those 175 billion weights in GPT-3 are
organized into just under 28,000 distinct matrices.
Those matrices in turn fall into eight different categories,
and what you and I are going to do is step through each one of those categories to
understand what that type does.
As we go through, I think it's kind of fun to reference the specific
numbers from GPT-3 to count up exactly where those 175 billion come from.
Even if nowadays there are bigger and better models,
this one has a certain charm as the first large language
model to really capture the world's attention outside of ML communities.
Also, practically speaking, companies tend to keep much tighter
lips around the specific numbers for more modern networks.
I just want to set the scene going in, that as you peek under the
hood to see what happens inside a tool like ChatGPT,
almost all of the actual computation looks like matrix vector multiplication.
There's a little bit of a risk getting lost in the sea of billions of numbers,
but you should draw a very sharp distinction in your mind between
the weights of the model, which I'll always color in blue or red,
and the data being processed, which I'll always color in gray.
The weights are the actual brains, they are the things learned during training,
and they determine how it behaves.
The data being processed simply encodes whatever specific input is
fed into the model for a given run, like an example snippet of text.
With all of that as foundation, let's dig into the first step of this text processing
example, which is to break up the input into little chunks and turn those chunks into
vectors.
I mentioned how those chunks are called tokens,
which might be pieces of words or punctuation,
but every now and then in this chapter and especially in the next one,
I'd like to just pretend that it's broken more cleanly into words.
Because we humans think in words, this will just make it much
easier to reference little examples and clarify each step.
The model has a predefined vocabulary, some list of all possible words,
say 50,000 of them, and the first matrix that we'll encounter,
known as the embedding matrix, has a single column for each one of these words.
These columns are what determines what vector each word turns into in that first step.
We label it W_E, and like all the matrices we see,
its values begin random, but they're going to be learned based on data.
Turning words into vectors was common practice in machine learning long before
transformers, but it's a little weird if you've never seen it before,
and it sets the foundation for everything that follows,
so let's take a moment to get familiar with it.
We often call this embedding a word, which invites you to think of these
vectors very geometrically as points in some high dimensional space.
Visualizing a list of three numbers as coordinates for points in 3D space would
be no problem, but word embeddings tend to be much much higher dimensional.
In GPT-3 they have 12,288 dimensions, and as you'll see,
it matters to work in a space that has a lot of distinct directions.
In the same way that you could take a two-dimensional slice through a 3D space
and project all the points onto that slice, for the sake of animating word
embeddings that a simple model is giving me, I'm going to do an analogous
thing by choosing a three-dimensional slice through this very high dimensional space,
and projecting the word vectors down onto that and displaying the results.
The big idea here is that as a model tweaks and tunes its weights to determine
how exactly words get embedded as vectors during training,
it tends to settle on a set of embeddings where directions in the space have a
kind of semantic meaning.
For the simple word-to-vector model I'm running here,
if I run a search for all the words whose embeddings are closest to that of tower,
you'll notice how they all seem to give very similar tower-ish vibes.
And if you want to pull up some Python and play along at home,
this is the specific model that I'm using to make the animations.
It's not a transformer, but it's enough to illustrate the
idea that directions in the space can carry semantic meaning.
A very classic example of this is how if you take the difference between
the vectors for woman and man, something you would visualize as a
little vector in the space connecting the tip of one to the tip of the other,
it's very similar to the difference between queen and king.
So let's say you didn't know the word for a female monarch,
you could find it by taking king, adding this woman minus man direction,
and searching for the embedding closest to that point.
At least, kind of.
Despite this being a classic example for the model I'm playing with,
the true embedding of queen is actually a little farther off than this would suggest,
presumably because the way queen is used in training data is not merely a feminine
version of king.
When I played around, family relations seemed to illustrate the idea much better.
The point is, it looks like during training the model found it advantageous to
choose embeddings such that one direction in this space encodes gender information.
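If you want to see the arithmetic of that search spelled out, here's a small sketch. The 2D vectors below are invented by hand purely for illustration (they are not taken from any trained model), with one axis loosely standing for royalty and the other for gender. The helper `nearest` is a hypothetical name.

```python
import math

# Hand-made 2D "embeddings", invented purely for illustration.
# Axis 0 loosely stands for royalty, axis 1 for gender.
toy = {
    "king":  [0.9, 0.8],
    "queen": [1.0, -0.7],
    "man":   [0.1, 0.9],
    "woman": [0.0, -0.9],
    "tower": [-0.8, 0.1],
}

def nearest(point, embeddings):
    """Return the word whose embedding is closest (Euclidean) to point."""
    return min(embeddings, key=lambda w: math.dist(point, embeddings[w]))

# king + (woman - man)
target = [k + w - m for k, w, m in zip(toy["king"], toy["woman"], toy["man"])]
print(nearest(target, toy))
```

With these made-up numbers the nearest word is queen, though as noted, a real model's embeddings tend to be messier than this.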
Another example is that if you take the embedding of Italy,
and you subtract the embedding of Germany, and add that to the embedding of Hitler,
you get something very close to the embedding of Mussolini.
It's as if the model learned to associate some directions with Italian-ness,
and others with WWII Axis leaders.
Maybe my favorite example in this vein is how in some models,
if you take the difference between Germany and Japan, and add it to sushi,
you end up very close to bratwurst.
Also in playing this game of finding nearest neighbors,
I was very pleased to see how close cat was to both beast and monster.
One bit of mathematical intuition that's helpful to have in mind,
especially for the next chapter, is how the dot product of two
vectors can be thought of as a way to measure how well they align.
Computationally, dot products involve multiplying all the
corresponding components and then adding the results, which is good,
since so much of our computation has to look like weighted sums.
Geometrically, the dot product is positive when vectors point in similar directions,
it's zero if they're perpendicular, and it's negative whenever
they point in opposite directions.
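Here's a minimal sketch of both views at once, using small made-up 2D vectors so the three cases are easy to check by hand. The function name `dot` is just a convenient label.

```python
def dot(u, v):
    """Multiply corresponding components and add the results."""
    return sum(a * b for a, b in zip(u, v))

u = [1.0, 2.0]
print(dot(u, [2.0, 3.0]))    # similar direction: positive
print(dot(u, [-2.0, 1.0]))   # perpendicular: zero
print(dot(u, [-1.0, -2.0]))  # opposite direction: negative
```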
For example, let's say you were playing with this model,
and you hypothesize that the embedding of cats minus cat might represent a sort of
plurality direction in this space.
To test this, I'm going to take this vector and compute its dot
product against the embeddings of certain singular nouns,
and compare it to the dot products with the corresponding plural nouns.
If you play around with this, you'll notice that the plural ones
do indeed seem to consistently give higher values than the singular ones,
indicating that they align more with this direction.
It's also fun how if you take this dot product with the embeddings of the words one,
two, three, and so on, they give increasing values,
so it's as if we can quantitatively measure how plural the model finds a given word.
Again, the specifics of how words get embedded are learned using data.
This embedding matrix, whose columns tell us what happens to each word,
is the first pile of weights in our model.
Using the GPT-3 numbers, the vocabulary size specifically is 50,257,
and again, technically this consists not of words per se, but of tokens.
The embedding dimension is 12,288, and multiplying those
tells us this consists of about 617 million weights.
Let's go ahead and add this to a running tally,
remembering that by the end we should count up to 175 billion.
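The multiplication behind that figure is quick to check. The variable names here are hypothetical; the two numbers are the GPT-3 figures just quoted.

```python
vocab_size = 50_257   # number of tokens, per the GPT-3 figures quoted above
d_embed = 12_288      # embedding dimension

embedding_params = vocab_size * d_embed
print(f"{embedding_params:,}")  # 617,558,016
```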
In the case of transformers, you really want to think of the vectors
in this embedding space as not merely representing individual words.
For one thing, they also encode information about the position of that word,
which we'll talk about later, but more importantly,
you should think of them as having the capacity to soak in context.
A vector that started its life as the embedding of the word king, for example,
might progressively get tugged and pulled by various blocks in this network,
so that by the end it points in a much more specific and nuanced direction that
somehow encodes that it was a king who lived in Scotland,
and who had achieved his post after murdering the previous king,
and who's being described in Shakespearean language.
Think about your own understanding of a given word.
The meaning of that word is clearly informed by the surroundings,
and sometimes this includes context from a long distance away,
so in putting together a model that has the ability to predict what word comes next,
the goal is to somehow empower it to incorporate context efficiently.
To be clear, in that very first step, when you create the array of
vectors based on the input text, each one of those is simply plucked
out of the embedding matrix, so initially each one can only encode
the meaning of a single word without any input from its surroundings.
But you should think of the primary goal of the network these vectors flow through
as being to enable each one of them to soak up a meaning that's much
more rich and specific than what mere individual words could represent.
The network can only process a fixed number of vectors at a time,
known as its context size.
GPT-3 was trained with a context size of 2048,
so the data flowing through the network always looks like this array of 2048 columns,
each of which has 12,288 dimensions.
This context size limits how much text the transformer can
incorporate when it's making a prediction of the next word.
This is why long conversations with certain chatbots,
like the early versions of ChatGPT, often gave the feeling of
the bot kind of losing the thread of conversation as you continued too long.
We'll go into the details of attention in due time,
but skipping ahead I want to talk for a minute about what happens at the very end.
Remember, the desired output is a probability
distribution over all tokens that might come next.
For example, if the very last word is Professor,
and the context includes words like Harry Potter,
and immediately preceding we see least favorite teacher,
and also if you give me some leeway by letting me pretend that tokens simply
look like full words, then a well-trained network that had built up knowledge
of Harry Potter would presumably assign a high number to the word Snape.
This involves two different steps.
The first one is to use another matrix that maps the very last vector in that
context to a list of 50,257 values, one for each token in the vocabulary.
Then there's a function that normalizes this into a probability distribution,
it's called softmax and we'll talk more about it in just a second,
but before that it might seem a little bit weird to only use this last embedding
to make a prediction, when after all in that last step there are thousands of
other vectors in the layer just sitting there with their own context-rich meanings.
This has to do with the fact that in the training process it turns out to be
much more efficient if you use each one of those vectors in the final layer
to simultaneously make a prediction for what would come immediately after it.
There's a lot more to be said about training later on,
but I just want to call that out right now.
This matrix is called the unembedding matrix, and we give it the label W_U.
Again, like all the weight matrices we see, its entries begin at random,
but they are learned during the training process.
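As a rough sketch of that first step, assuming toy sizes (5 tokens, 3 dimensions) and random numbers in place of trained weights, the matrix-vector product looks like this. The names `W_U`, `last_vector`, and `scores` are illustrative, not anything from a real library.

```python
import random

vocab_size, d_embed = 5, 3   # toy sizes, for illustration only
rng = random.Random(0)

# W_U: one row per token, each row as long as the embedding dimension.
W_U = [[rng.gauss(0.0, 1.0) for _ in range(d_embed)] for _ in range(vocab_size)]

# Stand-in for the last vector in the context after it has flowed
# through the network.
last_vector = [rng.gauss(0.0, 1.0) for _ in range(d_embed)]

# Matrix-vector product: one raw score per token in the vocabulary.
scores = [sum(w * x for w, x in zip(row, last_vector)) for row in W_U]
print(len(scores))  # one value per token
```

These raw scores can be negative or larger than 1, so they are not yet a probability distribution.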
Keeping score on our total parameter count, this unembedding
matrix has one row for each word in the vocabulary,
and each row has the same number of elements as the embedding dimension.
It's very similar to the embedding matrix, just with the order swapped,
so it adds another 617 million parameters to the network,
meaning our count so far is a little over a billion,
a small but not wholly insignificant fraction of the 175 billion
we'll end up with in total.
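For anyone keeping their own tally, here's the running total so far as a quick calculation. The variable names are hypothetical, and 175 billion is the round figure used in this chapter rather than an exact count.

```python
vocab_size, d_embed = 50_257, 12_288
total_target = 175_000_000_000  # round figure for GPT-3's total

embedding = vocab_size * d_embed      # W_E: one column per token
unembedding = vocab_size * d_embed    # W_U: one row per token
running_total = embedding + unembedding

print(f"{running_total:,}")                   # 1,235,116,032
print(f"{running_total / total_target:.2%}")  # 0.71%
```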
As the very last mini-lesson for this chapter,
I want to talk more about this softmax function,
since it makes another appearance for us once we dive into the attention blocks.
The idea is that if you want a sequence of numbers to act as a probability distribution,
say a distribution over all possible next words,
then each value has to be between 0 and 1, and you also need all of them to add up to 1.
However, if you're playing the deep learning game where everything you do looks like
matrix-vector multiplication, the outputs you get by default don't abide by this at all.
The values are often negative, or much bigger than 1,
and they almost certainly don't add up to 1.
Softmax is the standard way to turn an arbitrary list of numbers
into a valid distribution in such a way that the largest values end up closest to 1,
and the smaller values end up very close to 0.
That's all you really need to know.
But if you're curious, the way it works is to first raise e to the power
of each of the numbers, which means you now have a list of positive values,
and then you can take the sum of all those positive values and divide each
term by that sum, which normalizes it into a list that adds up to 1.
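Written out as code, those two steps are only a few lines. This is a minimal sketch with made-up inputs; subtracting the maximum before exponentiating is a standard numerical precaution I've added, and it doesn't change the result.

```python
import math

def softmax(values):
    """Exponentiate each value, then divide by the sum of the exponentials."""
    # Subtracting the max first leaves the result unchanged but avoids
    # overflow for large inputs.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, -1.0, 0.5, 6.0])
print([round(p, 3) for p in probs])
print(sum(probs))
```

The input list contains a negative number and a value bigger than 1, yet every output lands between 0 and 1 and the outputs sum to 1 (up to floating-point rounding).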
You'll notice that if one of the numbers in the input is meaningfully bigger than the
rest, then in the output the corresponding term dominates the distribution,
so if you were sampling from it you'd almost certainly just be picking the maximizing
input.
But it's softer than just picking the max in the sense that when other values
are similarly large, they also get meaningful weight in the distribution,
and everything changes continuously as you continuously vary the inputs.
In some situations, like when ChatGPT is using this distribution to create a next word,
there's room for a little bit of extra fun by adding a little extra spice into this
function, with a constant T thrown into the denominator of those exponents.
We call it the temperature, since it vaguely resembles the role of temperature in
certain thermodynamics equations, and the effect is that when T is larger,
you give more weight to the lower values, meaning the distribution is a little bit
more uniform, and if T is smaller, then the bigger values will dominate more
aggressively, where in the extreme, setting T equal to zero means all of the weight
goes to the maximum value.
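Here's a small sketch of softmax with that extra constant, dividing each input by T before exponentiating. The inputs are made up, and the function name is hypothetical. Since you can't literally divide by zero, this version requires T to be positive; T equal to zero is best read as the limit where all the weight goes to the largest input.

```python
import math

def softmax_with_temperature(values, T=1.0):
    """Softmax of values / T. Requires T > 0."""
    if T <= 0:
        raise ValueError("T must be positive; T -> 0 approaches picking the max")
    scaled = [v / T for v in values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # made-up inputs
for T in (0.5, 1.0, 5.0):
    probs = softmax_with_temperature(scores, T)
    print(T, [round(p, 3) for p in probs])
```

Running it, the largest input takes a bigger share at T = 0.5 than at T = 5, where the three probabilities sit much closer together.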
For example, I'll have GPT-3 generate a story with the seed text,
"Once upon a time there was a", but I'll use different temperatures in each case.
Temperature zero means that it always goes with the most predictable word,
and what you get ends up being a trite derivative of Goldilocks.
A higher temperature gives it a chance to choose less likely words,
but it comes with a risk.
In this case, the story starts out more originally,
about a young web artist from South Korea, but it quickly degenerates into nonsense.
Technically speaking, the API doesn't actually let you pick a temperature bigger than 2.
There's no mathematical reason for this, it's just an arbitrary constraint imposed
to keep their tool from being seen generating things that are too nonsensical.
So if you're curious, the way this animation is actually working is I'm taking the
20 most probable next tokens that GPT-3 generates,
which seems to be the maximum they'll give me,
and then I tweak the probabilities based on an exponent of 1/5.
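One way to read that description is: raise each probability to the power 1/T and renormalize, which has the same effect as applying temperature T to the underlying inputs. The sketch below follows that reading with T = 5; the probabilities are made up, the list is shorter than 20 for brevity, and `reweight` is a hypothetical helper, not part of any API.

```python
def reweight(probs, T):
    """Raise each probability to the power 1/T and renormalize."""
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# Made-up probabilities for a handful of most-likely tokens.
# They need not sum to 1, since renormalizing takes care of that.
top_probs = [0.5, 0.3, 0.15, 0.05]
flattened = reweight(top_probs, T=5)
print([round(p, 3) for p in flattened])
```

With an exponent of 1/5 the ordering of the tokens stays the same, but the distribution becomes noticeably more uniform.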
As another bit of jargon, in the same way that you might call the components of
the output of this function probabilities, people often refer to the inputs as logits,
a word that different people pronounce in different ways.
So for instance, when you feed in some text, you have all these word embeddings
flow through the network, and you do this final multiplication with the
unembedding matrix, machine learning people would refer to the components in that raw,
unnormalized output as the logits for the next word prediction.
A lot of the goal with this chapter was to lay the foundations for
understanding the attention mechanism, Karate Kid wax-on-wax-off style.
You see, if you have a strong intuition for word embeddings, for softmax,
for how dot products measure similarity, and also the underlying premise that
most of the calculations have to look like matrix multiplication with matrices
full of tunable parameters, then understanding the attention mechanism,
this cornerstone piece in the whole modern boom in AI, should be relatively smooth.
For that, come join me in the next chapter.
As I'm publishing this, a draft of that next chapter
is available for review by Patreon supporters.
A final version should be up in public in a week or two,
it usually depends on how much I end up changing based on that review.
In the meantime, if you want to dive into attention,
and if you want to help the channel out a little bit, it's there waiting.