I Built a Better MLB Prediction Model Using New Bat Tracking Data + ML

2026-03-26 · 23m · Subtitles: -

01 Korean Translation · Korean

An MLB Prediction Model Built with Bat Tracking Data and Machine Learning (ML)

Source: https://www.youtube.com/watch?v=esLFua0u5fY · Uploaded: 2026-03-26 · Length: 24m · Channel: Onchain AI Garage

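Why baseball data

After a run of experiments building different types of models, I turned to sports data. I'm a baseball fan myself, and sports data has many uses: fans use it to follow games more closely, teams use it to analyze their rosters, and it feeds betting and prediction markets. With the Major League Baseball (MLB) season about to open and plenty of public data available, I decided to build a prediction model centered on bat tracking data. MLB first released bat tracking data in 2024 and added more metrics in 2025. The core question of this project was whether that data can describe a hitter's quality of contact better and, beyond that, predict the next season.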

WOBA, xWOBA, and their limits

The first concept to understand is WOBA (weighted on base average). Unlike batting average or on base percentage, WOBA measures a hitter's total offensive value. It weights singles, doubles, triples, home runs, and walks by how many expected runs each one actually produces. An average hitter's WOBA is roughly 0.310, and the very best hitters exceed 0.400. The stat was developed by Tom Tango, and the weights are so-called linear weights: run values derived from millions of historical plate appearances.
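As a rough sketch of how linear weights collapse into one number, the function below computes WOBA from a hitter's counting stats. The weight values are assumptions (rounded, typical-looking numbers, not the ones used in the video), real weights are recomputed every season, and the `woba` function and its argument names are hypothetical.

```python
# Illustrative linear weights. ASSUMPTION: rounded, typical-looking values,
# not the weights used in the video; real weights are recomputed each season.
WEIGHTS = {"bb": 0.69, "hbp": 0.72, "single": 0.88,
           "double": 1.25, "triple": 1.58, "hr": 2.03}


def woba(bb, hbp, single, double, triple, hr, ab, sf, ibb=0):
    """Weighted on base average from one hitter's counting stats."""
    numerator = (
        WEIGHTS["bb"] * (bb - ibb)      # unintentional walks only
        + WEIGHTS["hbp"] * hbp
        + WEIGHTS["single"] * single
        + WEIGHTS["double"] * double
        + WEIGHTS["triple"] * triple
        + WEIGHTS["hr"] * hr
    )
    denominator = ab + bb - ibb + sf + hbp
    return numerator / denominator


# Hypothetical season line, for illustration only.
season = woba(bb=50, hbp=5, single=90, double=30, triple=3, hr=25, ab=500, sf=5)
print(round(season, 3))
```

Because a double carries a larger weight than a single and a walk carries a nonzero weight, the two blind spots of batting average are handled by construction.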

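Batting average alone counts a lucky bloop single the same as a hard line drive double, and it ignores walks completely. WOBA fixes both problems at once. One step further is MLB's official metric, xWOBA (expected WOBA). It arrived with StatCast in 2015 and assigns each batted ball an “expected value” by estimating, from exit velocity and launch angle, how similar batted balls turned out in the past. The problem is that it looks only at those two variables. It takes no account of who hit the ball, where it went, or what pitch the hitter was facing.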

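The bat tracking revolution

Starting in 2024, MLB began releasing measurements of the bat itself, not just the ball. Twelve cameras per stadium track every swing in 3D, yielding bat speed, attack angle, barrel rate, and more. In 2025, attack direction (the horizontal direction the bat is traveling at contact) and swing path tilt were added. Researchers including FanGraphs, Sam Walsh, and Tom Nestico have each worked in this space in their own way, but I could not find any attempt to combine the bat tracking metrics added in 2025 with machine learning to build a prediction model. Nor could I find anyone who had tested whether bat tracking actually helps with season-to-season (year-to-year) prediction rather than single-game accuracy. Filling that gap became the goal of this experiment.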

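Two models: descriptive and predictive

I split the approach into two models. All of the work was done in Claude Code.

Descriptive model: answers “what should this batted ball have been worth?” It expresses the quality of an individual batted ball using bat tracking metrics.
Predictive model: answers “how will this hitter perform going forward?” It was trained on 272 player seasons.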

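The real difficulty in machine learning is not finding patterns but telling whether a pattern is real or a mirage made of noise. So I designed the whole process to be very strict. The data came from the Baseball Savant StatCast API, and I combined bat tracking and batted ball data to engineer 19 candidate features.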

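Cross validation, ablation, greedy selection

The first rule of machine learning is that you never test on the data you trained on. A student who saw the exam in advance and scores 100 has not necessarily learned anything, and likewise measuring performance on the training data is meaningless. So 5-fold cross validation was the default. I split the 230,000 WOBA events into five equal parts, trained on four and tested on the remaining one, repeated that five times so each part served as the test set once, and averaged the scores.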
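The fold rotation can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline from the video: the helper names (`k_fold_indices`, `cross_val_rmse`), the mean-predicting stand-in model, and the toy data are all assumptions made up for the example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the row indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_rmse(xs, ys, fit, predict, k=5):
    """Average test RMSE over k folds; each fold is scored by a model that never saw it."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        scores.append((sum(squared) / len(squared)) ** 0.5)
    return sum(scores) / len(scores)


# Stand-in "model": always predict the mean outcome of the training rows.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


# Toy data: 100 fake batted balls, outcome 0.0 (out) or 0.9 (roughly a single).
toy_x = list(range(100))
toy_y = [0.9 if i % 4 == 0 else 0.0 for i in range(100)]
cv_score = cross_val_rmse(toy_x, toy_y, fit_mean, predict_mean)
folds = k_fold_indices(100)
```

Every row lands in exactly one test fold, so the averaged score only ever reflects predictions on rows the fitted model had not seen.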

The player-level predictive model had only 272 data points, so I applied the stricter leave-one-out cross validation: train on 271, predict the one held out, and repeat 272 times so the model never sees its own answer.
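Leave-one-out is the same idea with one row per fold. The sketch below assumes a one-variable least squares line as a stand-in for the real model; the function names and the six toy data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_predictions(xs, ys):
    """For each player, fit on everyone else and predict the held-out player."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions


# Toy numbers (not real players): this season's WOBA -> next season's WOBA.
this_year = [0.280, 0.300, 0.320, 0.340, 0.360, 0.400]
next_year = [0.290, 0.310, 0.320, 0.330, 0.350, 0.380]
loo_preds = leave_one_out_predictions(this_year, next_year)
```

With 272 rows this means 272 separate fits, which is affordable for a small linear model and uses almost all of the data for training each time.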

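Next came feature ablation, a procedure for systematically filtering out which features actually contribute to performance. I ran it in two directions.

Forward ablation: start from MLB's baseline model, which uses only exit velocity and launch angle, and add one feature at a time. Here spray angle (the horizontal direction of the batted ball) improved performance by 18.3%. Attack direction, attack angle, and swing length also helped.
Backward ablation: start with all 19 features and remove one at a time. In this direction launch angle mattered most, followed by spray angle.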

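The interesting case was bat speed. It looked helpful in forward ablation, but in backward ablation it actually hurt performance very slightly. The explanation is simple: exit velocity already carries the bat speed signal, so in the full model bat speed is just redundant information. It was a good example of why you need both directions to see a model's real structure.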
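The two ablation directions can be sketched against any scoring function that maps a feature list to an error. In the example below `toy_rmse` is a made-up stand-in (its numbers are not from the video) that is deliberately built so one feature helps alone but becomes redundant in the full set; the overlap partner chosen in the toy, `ev_efficiency`, is arbitrary.

```python
def forward_ablation(baseline, candidates, score):
    """Error reduction from adding each candidate, alone, to the baseline."""
    base_error = score(baseline)
    return {f: base_error - score(baseline + [f]) for f in candidates}


def backward_ablation(all_features, removable, score):
    """Error increase from removing each feature, alone, from the full set."""
    full_error = score(all_features)
    return {
        f: score([g for g in all_features if g != f]) - full_error
        for f in removable
    }


def toy_rmse(features):
    """Made-up error function (NOT the video's numbers) with one redundant feature."""
    present = set(features)
    error = 0.500
    if "spray_angle" in present:
        error -= 0.080
    if "ev_efficiency" in present:
        error -= 0.006
    if "bat_speed" in present:
        # Helps a little on its own, adds slight noise once ev_efficiency is in.
        error += 0.001 if "ev_efficiency" in present else -0.004
    return error


baseline = ["exit_velocity", "launch_angle"]
candidates = ["spray_angle", "ev_efficiency", "bat_speed"]
forward = forward_ablation(baseline, candidates, toy_rmse)
backward = backward_ablation(baseline + candidates, candidates, toy_rmse)
```

In this toy, `forward["bat_speed"]` is positive (it looks useful) while `backward["bat_speed"]` is negative (removing it improves the full model), which is the same sign flip described for bat speed.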

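Then I built up the best feature set with greedy forward selection, which picks the feature that improves performance most in each round. Spray angle won round 1 by a wide margin (+18%), attack direction was chosen in round 2 (+3.2%), and attack angle in round 3 (+0.9%). After that, none of the remaining 14 features improved the model, so the algorithm stopped on its own. The result was five features in total: the two baseline inputs plus the three additions. The model itself showed that more features do not make a better model.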
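A minimal sketch of greedy forward selection with the same stopping rule (stop when no remaining feature lowers the error). The per-feature effects in `TOY_EFFECT` and the additive `toy_rmse` scorer are invented for illustration; in a real run the score would come from cross-validated model fits.

```python
def greedy_forward_selection(baseline, candidates, score):
    """Add the single most helpful feature each round; stop when nothing helps."""
    selected = list(baseline)
    remaining = list(candidates)
    best_error = score(selected)
    while remaining:
        trial_errors = {f: score(selected + [f]) for f in remaining}
        winner = min(trial_errors, key=trial_errors.get)
        if trial_errors[winner] >= best_error:
            break  # no remaining feature lowers the error
        selected.append(winner)
        remaining.remove(winner)
        best_error = trial_errors[winner]
    return selected, best_error


# Made-up per-feature effects on RMSE (negative = improvement). Toy values only.
TOY_EFFECT = {
    "spray_angle": -0.080,
    "attack_direction": -0.012,
    "attack_angle": -0.003,
    "noise_feature_a": 0.001,
    "noise_feature_b": 0.002,
}


def toy_rmse(features):
    return 0.500 + sum(TOY_EFFECT.get(f, 0.0) for f in features)


chosen, final_error = greedy_forward_selection(
    ["exit_velocity", "launch_angle"], list(TOY_EFFECT), toy_rmse
)
```

The loop re-scores every remaining candidate each round, so a feature that looked useful in isolation can still be skipped once an overlapping feature has been selected.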

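Algorithm comparison and hyperparameter tuning

I compared four algorithms on the same data and features. KNN (k-nearest neighbors), which MLB's official model uses, finds the most similar past batted balls and averages their outcomes. Gradient-boosted tree methods such as XGBoost and LightGBM instead chain together hundreds of small decision trees, with each new tree correcting the mistakes of the ones before it. LightGBM produced the best RMSE and became the engine of the descriptive model.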

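For hyperparameter tuning I ran 160 combinations each for XGBoost and LightGBM. With 5-fold CV, 160 × 2 × 5 comes to 1,600 trained models, which took quite a while even on a GPU. RMSE dropped from 0.377 with default settings to 0.371 with the best settings. That is small in absolute terms, but in a problem like this every decimal place matters.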
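The model count follows directly from the size of the grid. The sketch below enumerates a hypothetical grid that happens to have 160 combinations; the video states the count, not the parameter values, so the specific values here are assumptions.

```python
from itertools import product

# Hypothetical search grid: the video gives the count (160), not the values.
grid = {
    "learning_rate": [0.01, 0.03, 0.05, 0.1, 0.2],
    "max_depth": [3, 4, 6, 8],
    "n_estimators": [100, 200, 400, 800],
    "subsample": [0.8, 1.0],
}

# One dict per combination, e.g. {"learning_rate": 0.01, "max_depth": 3, ...}.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

algorithms = ["xgboost", "lightgbm"]
folds = 5
total_fits = len(combos) * len(algorithms) * folds
```

Each combination is fitted once per fold and per algorithm, which is why a modest-looking grid turns into well over a thousand training runs.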

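Descriptive model results: xWOBA+

The final descriptive model uses five features: exit velocity, launch angle, spray angle, EV efficiency, and bat speed. Against an RMSE of 0.479 for MLB's xWOBA baseline, this model, which I named xWOBA+, reached an RMSE of 0.371, describing the expected value of individual batted balls 22.5% more accurately.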
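For reference, RMSE and the percentage figures can be reproduced with a few lines. The `rmse` and `relative_improvement` helpers are hypothetical names; the two toy forecasts are made up, while 0.479, 0.377, and 0.371 are the figures quoted in the text.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error: square each miss, average, take the square root."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, model_rmse):
    """Fractional reduction in RMSE relative to the baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Two toy forecasts with the same total miss (0.4); the one big miss scores worse.
even_misses = rmse([0.2, 0.2], [0.0, 0.0])
one_big_miss = rmse([0.4, 0.0], [0.0, 0.0])

# RMSE figures quoted in the text.
vs_mlb = relative_improvement(0.479, 0.371)       # xWOBA baseline -> xWOBA+
from_tuning = relative_improvement(0.377, 0.371)  # default -> tuned settings
```

The 22.5% figure is a relative reduction in RMSE, (0.479 − 0.371) / 0.479, and by the same measure the tuning step alone accounts for about 1.6%.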

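But it failed at prediction

Then came an important finding: describing well does not mean predicting well. When I used xWOBA+ trained on 2024 data to predict 2025 performance, it did not beat MLB's plain xWOBA at all. The reason was spray angle, which varies a lot from season to season. A pull hitter in 2024 can become an opposite-field hitter in 2025. The feature was powerful for describing a single game or a single batted ball, but for season-to-season prediction it was actually poison.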

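Second attempt: PWOBA (predictive WOBA)

So I redefined the problem and built a new model. The key insight was that features useful for prediction are the ones that stabilize quickly and stay consistent from year to year. Bat speed stabilizes in just 3 swings, and barrel rate in about 50 batted balls. These features, which did little in the descriptive model, take the lead role in the prediction problem.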

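With only 272 data points, the model had to be simple. Complex boosted trees overfit easily at this sample size. The final choice was Elastic Net, a linear model that applies L1 and L2 regularization together. L1 drives the coefficients of unimportant features to exactly 0, and L2 shrinks all coefficients toward 0. I tuned the two penalty strengths across 72 combinations.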
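The different effects of the two penalties are easiest to see in the simplest possible case. The sketch below assumes an orthonormal design, where the Elastic Net solution for each coefficient has a closed form; the feature names, unpenalized coefficients, and penalty values are all made up for illustration and are not the tuned values from the video.

```python
def soft_threshold(value, l1):
    """L1 step: move the value toward 0 by l1, and clamp to exactly 0 if it crosses."""
    if value > l1:
        return value - l1
    if value < -l1:
        return value + l1
    return 0.0


def elastic_net_coefficient(ols_coefficient, l1, l2):
    """Elastic Net solution for one coefficient under an orthonormal design.

    Minimizes 0.5 * (b - ols)^2 + l1 * |b| + 0.5 * l2 * b^2.
    """
    return soft_threshold(ols_coefficient, l1) / (1.0 + l2)


# Toy unpenalized coefficients (hypothetical feature names and values).
ols = {"bat_speed": 0.60, "barrel_rate": 0.35, "noise_feature": 0.04}
shrunk = {name: elastic_net_coefficient(b, l1=0.05, l2=0.5) for name, b in ols.items()}
```

The small `noise_feature` coefficient is cut to exactly 0 by the L1 step, while the two larger coefficients survive but are scaled down by the L2 factor, which matches the division of labor described above.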

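Validation used leave-one-out cross validation, the strictest option, and the baseline was Marcel, a deliberately naive system that averages recent stats and regresses them toward the mean. A model that cannot beat Marcel has simply learned noise. PWOBA reached a year-to-year correlation of 0.633, ahead of both Marcel (0.561) and MLB's xWOBA (0.613). Relative to xWOBA that is a 3.3% higher correlation (0.633 / 0.613 ≈ 1.033). Interestingly, 78% of the players flagged as “overperforming” in 2024 did in fact decline in 2025.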
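To make the baseline concrete, here is a simplified Marcel-style projection together with the correlation used to score it. The 5/4/3 season weights and the 1,200 plate appearances of league-average regression follow the commonly described Marcel recipe, but treat them as assumptions here; the age adjustment is omitted, the exact variant used in the video is not specified, and the toy hitter is invented.

```python
def marcel_woba(seasons, league_woba=0.310, weights=(5, 4, 3), regression_pa=1200):
    """Simplified Marcel-style projection (no age adjustment).

    seasons: [(woba, plate_appearances), ...], most recent season first.
    """
    weighted_value = sum(w * pa * value for w, (value, pa) in zip(weights, seasons))
    weighted_pa = sum(w * pa for w, (_, pa) in zip(weights, seasons))
    # Regress toward the mean by blending in league-average plate appearances.
    return (weighted_value + regression_pa * league_woba) / (weighted_pa + regression_pa)


def pearson(xs, ys):
    """Pearson correlation between projections and what actually happened."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy hitter: three seasons of (WOBA, PA), newest first. Not a real player.
projection = marcel_woba([(0.400, 600), (0.380, 550), (0.360, 500)])
```

Scoring a system then amounts to calling `pearson(projected_values, actual_values)` across all players; the 0.561, 0.613, and 0.633 figures are correlations of this kind.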

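What I learned

The lessons from this project are clear.

Bat speed matters for prediction but less for description. It is nearly useless for evaluating a single swing, but because it reveals a player's physical ceiling it becomes the third strongest predictor of future performance.
More data does not always mean better prediction. The five-feature descriptive model described batted ball quality 22% more accurately, yet the much simpler predictive model was better at forecasting the future.
Description and prediction are different goals. Mixing the two causes trouble.
On small data, simple models win. With 272 samples, a regularized linear model beat XGBoost and LightGBM. Avoid over-engineering.
Opposing pitcher quality matters too. Adding the average WOBA of the pitchers a hitter faced as a feature improved the predictions meaningfully. A given WOBA recorded against elite pitchers is far more impressive than the same WOBA against weak pitchers.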
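One plausible way to build the opposing pitcher feature is to average, over a hitter's plate appearances, the WOBA each pitcher allowed. The source only says the average WOBA of opposing pitchers was used, so the per-plate-appearance weighting, the `opponent_quality` helper, and the pitcher IDs and values below are assumptions.

```python
def opponent_quality(pitchers_faced, woba_allowed):
    """Average WOBA allowed by the pitchers a hitter faced, one entry per plate appearance."""
    values = [woba_allowed[p] for p in pitchers_faced if p in woba_allowed]
    if not values:
        return None  # no known opponents, so no feature value
    return sum(values) / len(values)


# Hypothetical pitcher IDs and WOBA-allowed values, for illustration only.
woba_allowed = {"ace": 0.260, "mid_rotation": 0.310, "struggling": 0.360}

tough_schedule = opponent_quality(["ace", "ace", "ace", "struggling"], woba_allowed)
easy_schedule = opponent_quality(["struggling", "struggling", "mid_rotation"], woba_allowed)
```

A lower value means a tougher slate of pitchers, so the same raw WOBA paired with a lower `opponent_quality` reflects a stronger season.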

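2026 season outlook and next steps

For the opening week of the 2026 season I pulled the top hitters by PWOBA. First is former MVP Aaron Judge, followed by Giancarlo Stanton, whose quality of contact is consistently excellent, and then Ohtani, Schwarber, and Pete Alonso. Judge posted an unreal 0.601 WOBA in 2025, which makes him the top regression candidate, but even a drop to 0.484 would still put him at the very top.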

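These projections will be published on the sportsaigarage.com dashboard and tracked alongside established projection systems such as Steamer, ZiPS, and THE BAT X. By and large, the established systems are more conservative than this model and expect a larger regression.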

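The next topic is how performance changes after injuries and returns from the IL (injured list). I want to model how specific injuries and time missed affect performance after a return. I started with baseball because its data is the richest, but I plan to extend to other sports such as basketball and football. This project was my first attempt at building a serious machine learning model on sports data, and it left me convinced that there is much more to be done in this area with the latest AI tools.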

02 Research Document · Document

MLB prediction in the bat tracking era: description and prediction are not the same problem

Source video: YouTube · Onchain AI Garage · 2026-03-26

Introduction: why bat tracking now

In 2024, MLB made a major expansion to StatCast. With 12 cameras per stadium tracking every swing in 3D, it began publishing physical measurements of the bat itself, such as bat speed, attack angle, and barrel rate, and in 2025 it added swing path, attack direction, and ideal attack angle. MLB's official explainer for these metrics is “New Statcast metrics measure swing path, attack angle, attack direction”, and per-player numbers are published on the Statcast Bat Tracking leaderboard.

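Onchain AI Garage tested whether this new data can fill the blind spots of MLB's existing prediction model. The conclusion is almost textbook: a model that describes batted balls better and a model that predicts next year's performance better turn out to be nearly separate problems. This article summarizes the methodology of that experiment and looks at where baseball statistics is heading in the bat tracking era.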

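Main body

1. What xWOBA measures and what it misses

The modern metric for a hitter's total offensive value is wOBA (weighted on base average). It gives singles, doubles, triples, home runs, and walks different weights, which come from the linear weights Tom Tango computed from millions of plate appearances. Unlike batting average, it distinguishes a bloop single from a line drive double and gives walks proper credit.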

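xWOBA (expected wOBA), which MLB introduced alongside StatCast in 2015, goes one step further. It estimates what a batted ball “should have been worth” regardless of the actual result: it takes exit velocity and launch angle as inputs and averages the outcomes of similar past batted balls. Background on the mathematics of xwOBA and a KNN-based reconstruction are laid out in Thomas Nestico's article on modelling xwOBA with KNN.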

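The problem is that xWOBA looks only at exit velocity and launch angle. Who hit the ball, which direction it went, and which pitch from which pitcher the hitter was facing are not reflected in the model. Bat tracking data is a candidate for filling exactly this gap.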
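The “average the similar past batted balls” idea can be sketched as a tiny k-nearest-neighbors estimator over the two inputs. This is an illustration of the general technique only: MLB's actual implementation details are not given in the source, and the choice of k, the scaling constants, and the ten-row history are assumptions.

```python
def knn_expected_woba(exit_velocity, launch_angle, history, k=3,
                      ev_scale=10.0, la_scale=10.0):
    """Average the outcomes of the k most similar past batted balls.

    history: [(exit_velocity_mph, launch_angle_deg, woba_value), ...]
    ev_scale / la_scale put the two inputs on comparable footing (assumed values).
    """
    def distance(row):
        ev, la, _ = row
        return (((ev - exit_velocity) / ev_scale) ** 2
                + ((la - launch_angle) / la_scale) ** 2) ** 0.5

    nearest = sorted(history, key=distance)[:k]
    return sum(value for _, _, value in nearest) / len(nearest)


# Toy history (made up): hard-hit line drives and fly balls vs weak grounders and pop-ups.
history = [
    (105, 28, 2.0), (103, 25, 1.25), (108, 30, 2.0), (95, 15, 1.25), (90, 12, 0.9),
    (70, -10, 0.0), (75, -5, 0.0), (68, -15, 0.9), (85, 50, 0.0), (88, 45, 0.0),
]

barreled = knn_expected_woba(104, 27, history)
weak_grounder = knn_expected_woba(72, -8, history)
```

Nothing in the distance function knows about the hitter, the spray direction, or the pitch, which is exactly the limitation described above: two balls with the same exit velocity and launch angle always get the same expected value.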

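2. Experiment design: two questions, two models

Onchain AI Garage split the problem in two.

Descriptive model: “What should this batted ball have been worth?”
Predictive model: “How will this hitter perform next season?”

The two models differ in their inputs and in which algorithms suit them. The descriptive model uses about 230,000 individual batted ball events, while the predictive model uses just 272 player seasons. The former has plenty of room to carve out patterns; the latter is a minefield of overfitting.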

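3. Descriptive model pipeline: from 19 features to 5

Data for the 2024 and 2025 seasons was pulled from the Baseball Savant API and turned into 19 candidate features. The main steps were as follows.

5-fold cross validation splits the data into five parts and repeats, five times, holding one part out for testing while training on the rest.
Forward ablation and backward ablation double-checked each feature's contribution. The former adds features one at a time; the latter removes them one at a time from the full set.
Greedy forward selection built up the best combination.
For each of LightGBM and XGBoost, 160 hyperparameter combinations were evaluated with 5-fold CV to find the best settings.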

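Three findings stood out. First, adding spray angle alone improved RMSE by 18.3% over baseline xWOBA. Second, bat speed looked like a gain in forward ablation but slightly hurt performance in backward ablation, because exit velocity already carries the bat speed signal. Third, 14 features brought no improvement when added, so the algorithm stopped on its own. The final descriptive model, xWOBA+, had an RMSE of 0.371, 22.5% more accurate than the MLB baseline of 0.479.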

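4. Moving to prediction, everything fell apart

But when this descriptive model was trained on 2024 data and asked to predict 2025 performance, it did not beat MLB's xWOBA at all. The cause was clear. Spray angle swings widely between seasons: a player who pulled the ball in 2024 may go to the opposite field in 2025. It is decisive for describing the quality of a single batted ball, but pushed straight into forecasting it amounts to learning noise.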

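This matches a lesson the baseball analytics community has learned over decades. The MDPI review paper Machine Learning in Baseball Analytics stresses that the definition of a “good model” has to depend on the purpose, and Adam Salorio's analysis of offensive metrics points out that xwOBA and xwOBAcon stabilize at different rates. A predictive model that ignores sample size and stabilization time ends up as a map of noise.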

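5. PWOBA: fast-stabilizing features, a simple model

The second model, PWOBA (predictive WOBA), rebuilt the problem from the ground up. There is one core principle: features used for prediction must stabilize quickly and stay consistent from year to year. Bat speed stabilizes in as few as 3 swings, and barrel rate within about 50 batted balls. These metrics, which were of little use in the descriptive model, take the lead in the prediction problem.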

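With only 272 data points, the model also had to be simple. The choice was Elastic Net. Its L1 penalty drives the coefficients of unimportant features to exactly 0, and its L2 penalty shrinks all coefficients. 72 (L1, L2) combinations were tuned, and validation was leave-one-out: training on 271 and predicting the remaining one, repeated 272 times, so the model can never see its own answer.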

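The baseline was Marcel, a deliberately naive projection system that takes a weighted average of recent stats and regresses it toward the mean. Any model, however complex, that cannot beat Marcel has only learned noise. PWOBA reached a year-to-year prediction correlation of 0.633, ahead of both Marcel (0.561) and MLB's xWOBA (0.613). That is an improvement of about 3.3% over xWOBA, which is meaningful for a problem of this kind. As a bonus, 78% of the players flagged as “2024 overperformers” did in fact decline in 2025.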

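Key insights

Description and prediction use different loss functions. Even on the same dataset, a different goal means nearly every design decision has to be made again: feature selection, model complexity, and validation strategy.
The belief that adding more features raises performance is most dangerous on small data. With 272 samples, a regularized linear model was stronger than boosted trees.
Stabilization time is the first filter for choosing predictive features. That bat speed stabilizes within 3 swings suggests it reflects a player's “physical ceiling”. Spray angle, by contrast, is closer to a “tactical choice” that changes every year with the hitter's approach.
Adjusting for opposing pitcher quality is almost free points. A 0.350 WOBA posted against elite pitchers is qualitatively different from the same number against weak pitchers.
Set your baselines carefully. Without a deliberately naive baseline like Marcel, you cannot tell whether a 3.3% improvement is real or an illusion.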

Further reading

New Statcast metrics measure swing path, attack angle, attack direction (MLB.com): official explanation of the bat tracking metrics added in 2025
Statcast Bat Tracking leaderboard (Baseball Savant): per-player measured values and ideal attack angle rates
Test Driving Statcast’s Newest Bat Tracking Metrics (FanGraphs): 2025 breakout cases explained by changes in swing path and attack angle
Modelling xwOBA with KNN (Thomas Nestico): a write-up reproducing MLB's xWOBA with KNN, without bat tracking
Machine Learning in Baseball Analytics: Sabermetrics and Beyond (MDPI): a review paper on where sabermetrics and machine learning meet
Exploring Key Metrics and Methodology - 2025 Update (Adam Salorio): analysis of xwOBA/xwOBAcon stabilization rates and predictive value

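03 Pro and Con Debate · Debate

Debate: “Can bat tracking data and machine learning produce better hitter predictions than MLB's official prediction model?”

Motion: Combining the bat tracking metrics released in 2025 with machine learning can predict hitters' season-to-season performance better than MLB's official xWOBA.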

Round 1

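🟢 Pro: “Bat tracking stabilizes quickly and reveals physical ceilings, so it improves predictive power”

In the Onchain AI Garage experiment, PWOBA beat MLB's xWOBA by 3.3%, recording a year-to-year correlation of 0.633, ahead of both the deliberately naive Marcel baseline (0.561) and xWOBA (0.613). What matters is not merely that it improved, but that it cleared both the floor set by Marcel and the current ceiling set by MLB. Because the figure comes from strict leave-one-out validation, the risk of overfitting is also kept low. This is evidence that “bat tracking + machine learning” is not just a fad but brings a measurable gain in predictive power.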

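Above all, bat speed stabilizes in just 3 swings. From the standpoint of “stabilization time”, which the sabermetrics community has long emphasized, this is a decisive property. Traditional stats such as batting average and on base percentage need hundreds of plate appearances before signal beats noise, whereas bat tracking metrics are structurally low in noise. The reason is simple: bat speed measures a player's physical ceiling, and that ceiling does not change overnight.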

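Third, the experiment showed that 78% of the players classified as “2024 overperformers” actually declined in 2025. That hit rate is far above random guessing. It means the predictive model is not just getting the average ranking right but actually finding regression candidates that MLB's official model misses. This is where a more specific kind of predictive value arises.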

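🔴 Con: “A 3.3% improvement is within the noise for 272 samples, and bat tracking's contribution has not been verified”

First, no statistical significance is given for the 3.3% improvement. Going from a year-to-year correlation of 0.613 to 0.633 is an absolute difference of 0.02. With a dataset of 272 players, the confidence interval around a difference of this size is wide. We cannot be sure whether it is a real improvement or the luck of a single season pair (2024→2025). Replication will only be possible once at least 3 to 4 seasons have accumulated.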
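A rough sense of how wide that interval is can be had from the Fisher z-transform. This is a back-of-the-envelope sketch under assumptions: it takes n = 272 and treats each correlation as if it came from an independent sample, whereas the two systems were scored on the same players, so a proper comparison would need a test for dependent correlations.

```python
import math


def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


pwoba_low, pwoba_high = correlation_ci(0.633, 272)
xwoba_low, xwoba_high = correlation_ci(0.613, 272)

# True if each point estimate falls inside the other's interval.
intervals_overlap = pwoba_low < 0.613 < pwoba_high and xwoba_low < 0.633 < xwoba_high
```

Under these assumptions the interval for 0.633 runs from roughly 0.56 to 0.70 and the one for 0.613 from roughly 0.53 to 0.68, so each estimate sits comfortably inside the other's range.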

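Second, no ablation proving the standalone contribution of bat tracking data was published for the predictive model. The descriptive model is said to have gone through both forward and backward ablation, but we do not know whether PWOBA would have performed the same without bat tracking, or whether other features such as opposing pitcher quality produced most of the improvement. The presenter himself acknowledged that adjusting for opposing pitcher quality meaningfully improved the predictions. Bat tracking may not be the variable that made the difference.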

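Third, the fact that spray angle swings widely between seasons directly contradicts Pro's claim. Only bat speed partly fits the claim that bat tracking is “structurally stable”; attack direction and the spray angle family instead change from year to year with the hitter's conscious approach. The generalization that “bat tracking as a whole improves predictive power” is an overstatement.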

Round 2

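🟢 Pro (rebuttal): answering Con's three points in turn

On Con's first point (no significance test): the claim that a 0.02 gain in year-to-year correlation “could be luck” has to be weighed together with the 0.072 gap over Marcel (0.561 → 0.633). If PWOBA's edge were mere seasonal noise, there should also be years when Marcel comes close to 0.633. But the Marcel approach is structurally unable to rise above a certain level. PWOBA clearly cleared that ceiling, and the gain over xWOBA should be read not as a raw difference but as a difference shaved off an already mature baseline. Since 2016, few public models have improved on MLB's xWOBA by 0.02 points.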

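On Con's second point (bat tracking's contribution unverified): in the presentation the author explicitly named bat speed “the third strongest predictor of future performance” and explained that bat tracking metrics that mattered less in the descriptive model moved to the center of the predictive model. It is true that opposing pitcher quality contributed as well, but that feature too becomes meaningful when combined with the bat tracking signal. Existing public research has repeatedly found that pitcher quality adjustment alone does not produce this much improvement over xWOBA.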

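On Con's third point (stability differs by metric): as Con conceded, bat speed is stable. And PWOBA deliberately excluded spray angle from its predictive features. That is exactly why the descriptive and predictive models were separated. Con's objection therefore reopens a problem already solved at the model design stage. The actual claim is that “fast-stabilizing bat tracking metrics”, not “bat tracking as a whole”, improve predictive power, and there is no overstatement in that.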

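🔴 Con (counter-rebuttal): replying to Pro's three answers

On Pro's first answer (the gap over Marcel): beating Marcel is not a sufficient condition. Marcel is a baseline Tom Tango built in 2003 to be “deliberately dumb”, and already public systems such as ZiPS, Steamer, and THE BAT X have been ahead of Marcel by roughly 0.05 to 0.08 for years. So “PWOBA beat Marcel” means only that it caught a system at the level of MLB's xWOBA; it is not evidence that it is better than the projection systems already available to the public. The lack of control comparisons is its biggest weakness.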

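On Pro's second answer (bat speed's contribution): “third strongest predictor” is only a relative ranking inside the model; the size of the performance drop when that feature is removed was not published. If RMSE stays nearly the same after dropping bat speed and keeping only opposing pitcher quality and traditional stats, then bat tracking's contribution is negligible. The real test is a feature hold-out experiment, and PWOBA did not show its results. A ranking is not evidence.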

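On Pro's third answer (only stable features were selected): that very selection smells of data snooping. On 272 samples, the explanation “we picked only the stable ones” may amount to picking after seeing the results. Were stabilization rates fixed first on independent data, with features chosen afterward? Or were features picked after the fact while iterating on the same 272 samples? If the latter, even leave-one-out cannot fully remove the bias. Until Pro makes the order of the design steps clear, the claim of “no overstatement” should be put on hold.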

Round 3

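🟢 Pro: turning back Con's counter-rebuttal

On Con's first counter (no comparison against established systems): the point is fair but does not overturn the conclusion. At the end of the video, Onchain AI Garage committed to building a public dashboard (sportsaigarage.com) that tracks established systems such as Steamer, ZiPS, and THE BAT X side by side with PWOBA. In other words, exactly the comparison Con demands is set to run in real time throughout the 2026 season. Judging the result meaningless now because that comparison does not yet exist is premature. If anything, Pro's case is strengthened by the willingness to face public verification head-on rather than avoid it. Real projection systems have all built credibility through this kind of public tracking, and PWOBA is following the same path.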

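On Con's second counter (hold-out results not published): the missing hold-out results are a matter of presentation style, not a flaw in the methodology. Had the author published which features L1 set to exactly 0 during the Elastic Net tuning over 72 L1/L2 combinations, we would have the same information. The very fact that bat tracking metrics survive in the Elastic Net with nonzero coefficients is mathematical evidence that they contribute to the model. Saying there is “no contribution” because a feature hold-out experiment was not shown directly is a leap on Con's part.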

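On Con's third counter (data snooping concern): the stabilization times were not extracted from Onchain AI Garage's 272 samples; they are figures already established in decades of public sabermetrics literature. The numbers of 3 swings for bat speed and 50 batted balls for barrel rate were cited by the author, not created by him. So the charge of “after-the-fact selection” does not apply. This is not data snooping but prior feature selection grounded in domain knowledge, which is exactly the practice machine learning methodology recommends.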

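🔴 Con: countering Pro's Round 3 once more

On Pro's first answer (verification planned via a public dashboard): verification that will happen in the future cannot justify a claim made now. The motion is “can predict”, not “will try to verify later”. Moreover, if the dashboard runs for only a single season (2026), that is still a sample of one. At least 3 seasons are needed before the claim of beating established systems can stand, and if PWOBA does not win by then, the motion's claim should be withdrawn. “Verification planned” is not grounds for the claim; it is a reason to hold the claim in reserve.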

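On Pro's second answer (Elastic Net coefficients as evidence): it is true that a feature L1 did not zero out “contributes”, but that says nothing about the size of the contribution. A nonzero coefficient of 0.01 and one of 0.5 mean entirely different things for predictive power. Elastic Net also tends to spread coefficients across multicollinear features, so one cannot conclude that the bat tracking features “contribute on their own”. The real evidence is still a removal experiment, and without it Pro's case cannot escape the criticism that it is circumstantial.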

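On Pro's third answer (prior selection based on domain knowledge): it is true that the stabilization figures come from public literature. But if the decision about which features went into the final model was made while watching performance on the 272 samples, that in itself is after-the-fact selection. “Consulting” domain knowledge is one thing; changing the feature set in response to feedback from model performance is another. Pro has not drawn this line clearly. The burden of demonstrating that the methodology is clean rests with the side making the claim, and for now that demonstration falls short.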

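🧭 Synthesis

Points of agreement

The two sides clearly agree on several points. First, description and prediction are separate problems with different loss functions. Building two separate models was the right design by Onchain AI Garage, and Con does not dispute it. Second, both sides accept the empirical observation that bat speed in particular reflects a player's physical ceiling and stabilizes quickly. Third, there is no dispute that setting up Marcel as a deliberately naive baseline was methodologically sound.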

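Open questions

The real points of contention remain open. First, whether the 3.3% improvement is statistically significant cannot be judged from a single season-to-season comparison; at least 3 seasons of public tracking are needed. Second, a formal removal experiment measuring the standalone contribution of the bat tracking features should be published. Third, whether PWOBA can beat established systems such as Steamer, ZiPS, and THE BAT X cannot be answered yet, and only the season results on the sportsaigarage.com dashboard the author promised can answer it. Fourth, whether the feature selection pipeline was free of data snooping will be settled only when the author discloses the order of the process in more detail.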

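A broader view

What this debate really highlights is less “is bat tracking good?” than “how should we validate predictive models?” An improvement found on a small dataset like 272 samples, whether legitimate or lucky, must be re-validated over time. No study can claim victory on one year's numbers. Onchain AI Garage's contribution lies less in the model itself than in openly documenting a strict validation procedure on small data (leave-one-out, a Marcel baseline, a regularized linear model). As experiments of this type multiply, practices such as separating models for description from models for prediction, selecting features by stabilization time, and multi-year tracking against established systems are likely to become shared standards.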

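The debate also connects to a broader question: can an independent developer compete with established projection systems? Steamer, ZiPS, and THE BAT X have all been tuned and validated over long periods, backed by many domain experts and simulation infrastructure. That Onchain AI Garage could enter this field with a single AI coding environment, Claude Code, is itself significant. For Pro it is evidence that the technical barrier has dropped dramatically; for Con it is a warning that a lower barrier does not necessarily mean a better model. Once the 2026 season dashboard is actually running, the evidence will strengthen some of each view and reject the rest. That is the real conclusion of this debate.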

04 English Original · Transcript

So continuing our experiments into building different types of models, I decided to look into sports data, which is an interesting field.
I'm obviously a sports fan myself, and there's a lot of different uses for this as a fan to watch sports better, as a team using this kind of data to enhance your team.
And then obviously there's betting and prediction markets that this kind of data can be used for.
So specifically, I want to look at baseball data, the baseball season starting soon.
And there's a lot of data available, so I thought I would try to build a model for specifically bat tracking data.
And bat tracking data was released by the MLB for the first time in 2024, and even more data was released last year in 2025.
So I wanted to try to use this data to see what kind of edge we could find in making predictions and describing how hitters hit balls.
So first, key concept we need to understand is WOBA.
I know there's a lot of stats everyone's used to, like on base percentage and batting average.
But recently, one of the more popular advanced hitting stats is WOBA, weighted on base average.
And basically, it measures the hitter's total offensive value.
Unlike just batting average, it gives more credit to extra base hits and walks, weighted by how many runs each outcome is actually worth.
So the average hitter's WOBA is around 0.310, and elite hitters are over 0.400.
And this stat was developed by Tom Tango, who is now an MLB senior data architect.
And the weights come from linear weights, how many runs each event is actually worth on average.
And it's based on millions of historical plate appearances.
So you may be asking, why not just use batting average?
Batting average treats a bloop single the same as a line drive double.
It also ignores walks entirely, and WOBA fixes this.
So, how MLB predicts outcomes today.
So MLB's official stat is xWOBA, expected WOBA.
And it looks at how hard a ball was hit and at what angle.
And then estimates what should have happened based on thousands of similar batted balls.
So exit velocity, launch angle, and then the expected outcome.
And this was introduced in 2015.
When StatCast first came out.
And the problem is that this is all it uses.
It doesn't know who hit it, where it went, or what pitch they were facing.
The bat tracking revolution.
In 2024, MLB started releasing bat tracking data.
Measurements of the bat itself, not just the ball.
12 cameras per stadium tracked every swing in 3D.
And we got a lot more data from this.
We got bat speed, attack angle, barrel rate.
And like I said, in 2025, even more data was added:
attack direction and swing path tilt.
So the question I had is whether this new data could be used to build better predictions.
So others have obviously used this data.
This is a well-trodden space in terms of statistics and data models.
MLB's xWOBA, like I said.
Sam Walsh did one that included spray angle.
FanGraphs has done research.
Tom Nestico recreated MLB's xWOBA with a basic algorithm, but without bat tracking.
So the gap that I had found is that nobody had combined the new 2025 bat tracking metrics
with machine learning to build a prediction model.
And nobody had tested whether bat tracking improves year-to-year prediction.
Not just single game accuracy.
So this was based on my research.
I don't know if on an individual level people had done this.
So my approach: we were going to design two models, and I did this in Claude Code.
One was just a descriptive model, basically.
And this was our first kind of attack.
So what should this specific batted ball have been worth?
Kind of trying to describe the quality of a batted ball.
So this uses a lot of the bat tracking statistics that we gathered that have recently been released.
And then the second model is going to be more predictive.
How will this hitter perform?
And this is trained on 272 player seasons using the same kind of bat tracking data.
So we're going to go into this with more detail,
but just an overview of our process for the machine learning pipeline.
It was a fairly rigorous process.
So machine learning finds patterns in data that humans can't see.
But the real work is in making sure those patterns are real, not mirages,
which can often be the case.
So you have to be very careful.
Here's a general process we followed.
I'm going to go into each of these steps in more detail.
First, we collected all the data from StatCast, and they provide a lot of free and easy to access data through their API.
Then we engineered 19 candidate features using the batted ball data and the bat tracking data that we received.
And we used ablation, which is a way to test each feature's value.
We did this in several ways, using cross validation and using forward and backward ablation.
Then for select, we did greedy forward selection to pick the best features, the ones that were actually impactful.
Five, optimize: we tuned 160 hyperparameter combos to try to find the best hyperparameters.
And then for validation, we did a cross-season holdout test to validate everything and make sure it was actually important and not just noise.
So every step of this was designed to prevent overfitting.
Overfitting is when a model memorizes noise instead of learning real patterns, and it can be a concern, especially when you're working with a smaller data set.
We only had two years of data with the advanced bat tracking statistics.
So I was very careful at each of these steps to avoid overfitting, but let's go into more detail on each of these.
So the golden rule of machine learning is that you never test on data you trained on.
And cross validation enforces this by splitting the data into sections and rotating which section is used for testing.
You can see these are called folds: train, train, train, train, test.
Because if I gave you a test where you had already seen all the answers, getting a hundred percent wouldn't necessarily mean that you learned everything.
Cross validation like this enforces honesty. We split our 230,000 WOBA events into five equal chunks like this.
And we trained on four chunks while testing on the fifth. Then we rotated, so a different chunk became the test set each time.
So this happened five times. And the model's score is its average across the five tests, each on data it had never trained on.
So for a predictive model where we only had 272 players, we use an even stricter version called leave one out.
And this was for the predictive model. We trained on 271 players, predicted one, and then repeated that 272 times.
So the model literally never sees its own answer.
Five-fold CV is standard for large datasets, while leave one out is used for smaller datasets, because it maximizes training data per fold.
So a different type of validation is a simple train test split. You've probably seen something like an 80/20.
And this works for large datasets, but it can be unreliable for smaller ones: you might accidentally put all the easy-to-predict players in the test set.
So then we did feature ablation, like I mentioned
before. And this determines what actually matters. We ran two systematic tests across all 19 candidate features.
Forward ablation adds one feature at a time. Backward ablation removes one from the full set.
And they both use the five-fold cross validation that I talked about here.
So with this forward ablation, we added each feature to the baseline.
And the baseline was just using exit velocity and launch angle, like the MLB stat.
And then we just added these, and we saw spray angle had an 18.3% improvement.
And you can see also attack direction, attack angle, swing length, and other things had a positive impact.
And then we did backward ablation, which is kind of the opposite.
We started not from the baseline, but with the full model, with literally every feature,
and then removed them one by one.
And we saw from this that the most important was launch angle, and then spray angle was also important.
So the key insight from this is, if you look at bat speed here, it looks helpful when you add it in the forward ablation.
But then in backward ablation, it actually very slightly hurts the model.
And the key insight here is that once you have exit velocity, adding in bat speed doesn't matter.
It doesn't really make much of an impact. If anything, it's redundant.
And you can see here in the key insight: bat speed helps individually, but is redundant in the full model because exit velocity already captures its signal.
Forward and backward ablation tell different stories. And that's why you need both to form a proper model.
And it makes sense when you think of exit velocity and bat speed being connected.
And spray angle being very important is a significant insight, because that is left out of the base model that MLB uses for xWOBA.
And previous research had shown this, but our model just confirmed that spray angle was important in describing batted ball quality.
Now greedy forward selection, building the optimal set, was the next step.
Now instead of just testing the features in isolation, we built the model up one ingredient at a time, always adding whichever helped the most.
So round one: spray angle wins by a mile. An 18% improvement, you can see here. So we kept that.
Next, in round two, attack direction also improved the model, by 3.2%. And we kept that.
And then lastly, attack angle: a 0.9% improvement.
And you can see this RMSE. It's called root mean squared error. Basically, it's how far off our predictions are from reality on average.
So the base model that MLB uses has an RMSE of 0.479. And the key is to try to get this smaller, to make it closer to reality.
So our model predicts what the value of the WOBA should be.
And RMSE squares each of the errors, so big misses get penalized more than small ones, averages them all together, and then takes the square root to get back to the original scale.
So we were able to improve from 0.479 down to 0.377. And then we stopped here.
After these three additions, all 14 remaining features made the model worse. So the algorithm stopped correctly.
That's why we ended up with just five features. You don't necessarily need more features for a better model.
Each addition was rigorously tested with the cross validation I talked about before.
So then we had a comparison of different algorithms. We tested four different machine learning algorithms on the same data and features. Each brings a different approach.
So we compared all of these. MLB uses KNN, which finds the most similar past batted balls and averages their outcomes. We actually found that LightGBM worked the best and gave us the best RMSE.
And this is similar to XGBoost. XGBoost builds hundreds of small decision trees, each correcting the last one's mistakes. LightGBM uses a similar approach, but grows the trees leaf by leaf instead of level by level. So this is the algorithm that we ended up using for our descriptive model.
So then we did hyperparameter tuning, fine tuning the engine. Every machine learning algorithm has hyperparameters, settings that control how it learns. Think of it like tuning a car: how fast should it learn? How complex should each decision be? How much should it hedge? So these are all these different settings: learning rate, max depth, number of trees.
And if you watch any of my auto research videos, you'll notice these are the things that the auto research will change. And we ran a similar system here. So we tested 160 combinations of settings for both XGBoost and LightGBM, each evaluated with the five-fold cross validation.
So that's 1,600 models trained. And we had to run this on my GPU, and it took quite a while. It wasn't a huge difference: going from the default settings to the optimized ones dropped the RMSE from 0.377 to 0.371. Small in absolute terms, but with something like this, every fraction matters in prediction.
And this is the final result for the descriptive model. So after all of that process, here's the final model. It used exit velocity, launch angle, spray angle, EV (exit velocity) efficiency, and bat speed.
And compared to Major League Baseball's baseline xWOBA at 0.479, our xWOBA+, as I'm calling it, was 22.5% better as a descriptive model, resulting in an RMSE of 0.371.
So it is 22.5% more accurate than baseball's model at explaining what a batted ball should have been worth. It's describing the quality of the individual batted balls.
So this was the first model we developed. And this was specifically for, as I say, describing how well a ball is hit and how likely it is to become a hit.
However, the question is, does this model help us predict the future? And we discovered that no, it does not. A better description does not mean better prediction. And we tested this out ourselves to see if our model made better predictions for next season, comparing 2024 and 2025.
And we found that it did not beat Major League Baseball's xWOBA. So why?
Our key element was spray angle. And spray angle can shift year to year. So a player who pulls everything in 2024 might go more to the opposite field in 2025.
It was accurate in single-game models and for single batted balls, but it doesn't translate well into better year-to-year predictions. So we had to rebuild this as a completely different model, one designed for prediction, not description.
So then we built PWOBA, predictive WOBA.
Okay, so this was a different problem with different features.
So when you're trying to train a model, you need to think of what problem or question you're trying to answer.
So the key insight from our research, the best predictive features are the ones that stabilize quickly and stay consistent year to year.
Bat speed stabilizes in just three swings.
Barrel rate in around 50 batted balls.
So we started again from scratch and looked at all these features, some of which were not helpful in our descriptive model but were actually quite important for the predictive model.
So we only had 272 data points.
So we had to choose the right algorithm with this smaller data set.
And here, we found that Elastic Net won, which is a different type of algorithm.
It's a simpler algorithm, which is often the right choice with fewer data points.
It's a linear model, and it forces simplicity.
So a key lesson here, the best algorithm depends on your data.
A lot of times, complex models overfit with small data sets.
A simple linear model with strong regularization was the right tool here.
And overfitting was a key concern.
With such a small data set, we were using 2024 data to try to predict 2025 data.
So every design choice was aimed at preventing the model from seeing patterns that aren't real.
And we did this in three main ways.
One was through regularization.
And our algorithm had two built-in penalties.
L1 forced unimportant features to exactly zero.
And L2 shrunk all weights towards zero.
So we tuned both penalties across 72 combinations.
Think of these as guardrails that kept the model on the road.
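To make the two penalties concrete, here is a minimal pure-Python sketch of Elastic Net fitted by coordinate descent. It is not the code from the video, which does not say which library was used. The objective follows the same form that scikit-learn documents for its `ElasticNet`. The toy data, the function names, and the 12 by 6 shape of the 72-combination grid are all assumptions made for illustration.

```python
from itertools import product

def soft_threshold(value, threshold):
    """L1 step: shrink toward zero; anything inside the band becomes exactly 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net_fit(X, y, alpha, l1_ratio, n_iter=200):
    """Coordinate descent for
        (1 / 2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1
                                + 0.5 * alpha * (1 - l1_ratio) * ||w||^2
    Assumes features and target are already centered (no intercept term).
    """
    n, p = len(X), len(X[0])
    l1_penalty = alpha * l1_ratio
    l2_penalty = alpha * (1.0 - l1_ratio)
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # L1 zeroes small effects; dividing by (z + l2_penalty) shrinks the rest.
            w[j] = soft_threshold(rho, l1_penalty) / (z + l2_penalty)
    return w

# Toy data (made up): feature 0 drives the target, feature 1 barely matters.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0 * a + 0.1 * b for a, b in X]
weights = elastic_net_fit(X, y, alpha=0.5, l1_ratio=0.5)
print(weights)

# Hypothetical tuning grid: 12 penalty strengths x 6 L1/L2 mixes = 72 combinations.
alphas = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
grid = list(product(alphas, l1_ratios))
print(len(grid))
```

On this toy data an unpenalized fit would give weights of 2.0 and 0.1. With the penalties, the weak feature's weight comes out as exactly 0.0 and the strong feature's weight is shrunk to about 1.4.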
And then leave-one-out CV, cross-validation.
So this is the strictest form of cross-validation.
For each of the 272 players, we trained on the other 271 and predicted that one player.
And then repeated that 272 times.
These were all separate training runs, so the model never sees the player it's predicting during the training itself.
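The leave-one-out loop can be sketched in a few lines. This is an illustration only: it uses a one-feature least-squares line as a stand-in for the real model, and the player values are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_one_out_predictions(xs, ys):
    """Predict each point from a model trained on all the other points."""
    predictions = []
    for i in range(len(xs)):
        # Hold out player i; train on everyone else.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)  # a separate training run
        predictions.append(intercept + slope * xs[i])
    return predictions

# Made-up values: one 2024 feature per player and that player's 2025 outcome.
feature_2024 = [0.300, 0.320, 0.340, 0.360, 0.380, 0.400]
woba_2025 = [0.305, 0.315, 0.345, 0.350, 0.372, 0.390]

held_out = leave_one_out_predictions(feature_2024, woba_2025)
print(len(held_out))  # one held-out prediction per player
```

With 272 players the same loop would run 272 fits, which is affordable for a small linear model but gets expensive for heavier ones.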
And then we did a baseline comparison.
We benchmarked against Marcel, a deliberately naive system that just averages recent stats and regresses towards a mean.
So any real model must beat Marcel, or it's just fitting noise.
And we found that ours did: 0.633 versus 0.561.
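A Marcel-style baseline can be sketched as a weighted average of recent seasons pulled toward the league mean. This is a simplified illustration, not the exact benchmark used in the video. The 5/4/3 season weights and the 1200 plate appearances of regression are the commonly cited Marcel values and are assumptions here, as are the function name and the sample player.

```python
def marcel_style_projection(seasons, league_mean, weights=(5, 4, 3), regression_pa=1200):
    """Project a rate stat from recent seasons.

    seasons: list of (plate_appearances, rate) tuples, most recent first.
    Recent seasons count more, and regression_pa plate appearances of
    league-average performance are mixed in to regress toward the mean.
    """
    weighted_pa = sum(w * pa for w, (pa, _) in zip(weights, seasons))
    weighted_total = sum(w * pa * rate for w, (pa, rate) in zip(weights, seasons))
    return (weighted_total + regression_pa * league_mean) / (weighted_pa + regression_pa)

# Made-up hitter: three full seasons, improving each year.
seasons = [(600, 0.400), (600, 0.380), (600, 0.360)]
projection = marcel_style_projection(seasons, league_mean=0.310)
print(round(projection, 3))
```

A hitter with fewer plate appearances gets pulled harder toward the league mean, which is what makes this kind of baseline hard to beat with a noisy model.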
So the result was that our PWOBA beats Major League Baseball's XWOBA.
This is the year-to-year prediction correlation.
Higher is better.
The Marcel baseline, a simple formula, scored 0.561.
Major League Baseball's XWOBA scored 0.613, and ours scored 0.633, which is 3.3% better than Major League Baseball's predictor.
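For reference, a year-to-year prediction correlation and the 3.3% comparison can be computed as below. This sketch assumes the metric is a Pearson correlation between predicted and actual next-season values, which the video does not state explicitly; the sample values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: a model's 2025 predictions vs. what the hitters actually did.
predicted_2025 = [0.320, 0.350, 0.300, 0.410, 0.365]
actual_2025 = [0.310, 0.340, 0.315, 0.390, 0.380]
r = pearson(predicted_2025, actual_2025)

# Correlations quoted in the video: PWOBA 0.633 vs. MLB's XWOBA 0.613.
relative_gain = 0.633 / 0.613 - 1
print(f"{relative_gain:.1%}")
```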
And interestingly, on the regression flag, which marks players we flagged as overperforming in 2024: 78% of them actually declined in 2025.
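The video does not define the regression flag, so the following is only one plausible reading: flag a hitter when his actual 2024 WOBA beat the model's expected value by more than some margin, then check how many flagged hitters declined in 2025. The 0.020 margin, the field names, and the players are all hypothetical.

```python
def flag_overperformers(players, margin=0.020):
    """Players whose actual 2024 WOBA beat the expected value by more than margin."""
    return [p for p in players if p["actual_2024"] - p["expected_2024"] > margin]

def share_declined(flagged):
    """Fraction of flagged players whose WOBA dropped the following season."""
    if not flagged:
        return 0.0
    declined = [p for p in flagged if p["actual_2025"] < p["actual_2024"]]
    return len(declined) / len(flagged)

# Hypothetical players with made-up numbers.
players = [
    {"name": "Player A", "actual_2024": 0.400, "expected_2024": 0.350, "actual_2025": 0.360},
    {"name": "Player B", "actual_2024": 0.380, "expected_2024": 0.340, "actual_2025": 0.390},
    {"name": "Player C", "actual_2024": 0.330, "expected_2024": 0.328, "actual_2025": 0.325},
    {"name": "Player D", "actual_2024": 0.370, "expected_2024": 0.330, "actual_2025": 0.345},
]

flagged = flag_overperformers(players)
print([p["name"] for p in flagged], share_declined(flagged))
```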
So what did we learn here?
I think we learned a lot in this process.
Bat speed matters for prediction, not description.
It's useless for judging a single swing, but it's the third best predictor of future performance because it measures a hitter's physical ceiling.
More data doesn't always mean better predictions.
Our descriptive model used five features and was 22% more accurate, but a simpler model predicts the future better.
So precision and prediction are different goals.
And simple models can beat complex ones.
A regularized linear model outperformed XGBoost and LightGBM when we only had 272 training examples.
So don't over-engineer your models.
You have to consider your data set and what you're working with.
Opponent quality matters.
Makes sense logically.
So adjusting for the quality of the pitchers a hitter faced improved our prediction a lot.
And this was one of the features we used.
A 0.350 WOBA against elite pitching is more impressive than against weak pitching.
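The video does not describe how the opponent adjustment was computed. As one hypothetical way to express the idea, a hitter's WOBA can be shifted by how much tougher or weaker than league average his opposing pitchers were. The function name, the additive form, and the 0.310 league average are assumptions for illustration.

```python
def opponent_adjusted_woba(hitter_woba, avg_woba_allowed_by_pitchers_faced, league_woba=0.310):
    """Credit hitters who faced tougher pitching.

    Pitchers who allow a lower WOBA than the league are harder to hit,
    so the gap is added to the hitter's raw WOBA (and subtracted when
    the pitchers faced were weaker than average).
    """
    difficulty_bonus = league_woba - avg_woba_allowed_by_pitchers_faced
    return hitter_woba + difficulty_bonus

# The same raw 0.350 WOBA against two hypothetical schedules.
vs_elite = opponent_adjusted_woba(0.350, 0.290)  # pitchers allow only 0.290
vs_weak = opponent_adjusted_woba(0.350, 0.330)   # pitchers allow 0.330
print(round(vs_elite, 3), round(vs_weak, 3))
```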
So Opening Day is this week, and for 2026, these are our predictions for the top hitters by PWOBA, which is the prediction model that we're using.
It has Aaron Judge at number one, who was the MVP last year.
Makes sense.
Giancarlo Stanton also has very good batted ball quality.
He has trouble staying healthy, but when he is able to hit, he gets good quality hits.
And then you can see the rest of it.
Obviously, Ohtani and then some of the top hitters.
Schwarber, Pete Alonso.
And a lot of these are regression candidates.
Judge being the biggest, but obviously that's just because last year he had an incredible year.
A 0.601 WOBA, which is impressive but not considered sustainable.
But even at 0.484, he would still be at the top.
So what's next?
So I built out this dashboard for PWOBA, my prediction model, here for top hitters.
And this was kind of my first stab at incorporating some advanced machine learning techniques into baseball or sports stats.
There's a lot more I'd like to try to do with this.
Try to hone in a little bit more on more specific questions.
But you'll be able to find this data. I'm probably going to release this
dashboard. It's going to be at sportsaigarage.com. And we will keep track of this. These are all
different preseason predictions from Steamer, ZiPS, and THE BAT X.
And we will see how our prediction stacks up against some of these. If you look
at them, they tend to be more conservative. These are the top 30 hitters, and you can see there
they tend to be a lot lower. So they're expecting even more regression than our model showed.
But this will update once the actual 2026 season begins, and we will see where we
stand. Like I said, this is my first attempt at building a model using sports data.
I'm currently working on some more specific questions and building models for those.
It's a fun area to work in. I know it's kind of overdone,
but I still think there's a lot more that we can do with more recent AI capabilities
and advanced machine learning techniques. The question I'm working on right now is,
how do specific injuries to players, and how long they're out on the IL, affect them
when they actually come back to the game? How does it affect their gameplay? So that's kind
of the question I'm working on right now. So if I get good data from that, you'll see another video
on that question, or others. Right now we use baseball just because there's a lot of data provided by baseball. But
we can also look at basketball or football. Anyway, that's it for today's episode. Please
leave a comment, leave a like, subscribe to the channel, and I'll see you in the next one.