NVIDIA AI Podcast

Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298

24 min

Snap's head of engineering platforms, Prudviya Bhattala, explains how the company migrated its 10-petabyte-per-day data pipeline from CPU-based Apache Spark to GPU-accelerated Spark using NVIDIA Spark RAPIDS on Google Kubernetes Engine (GKE). The transition reused idle GPUs from Snap's online inference capacity and, with no code changes to existing jobs, cut job costs by 76%, cores by 62%, and memory footprint by 80%.

Summarized by Podsumo

✨ Key Takeaways

1. Snap processes over 10 petabytes of data daily for its experimentation platform, serving nearly a billion monthly active users.

2. By moving to GPU-accelerated Spark with NVIDIA Spark RAPIDS, Snap cut job costs by 76%, cores by 62%, and memory footprint by 80%, and eliminated 120 terabytes of disk and memory spill.

3. Zero code changes were required to adopt GPU acceleration, thanks to Spark RAPIDS' compatibility with existing Apache Spark pipelines.

4. Snap reused idle GPUs from its online inference stack by migrating workloads to a new Kubernetes-based data platform on Google Kubernetes Engine (GKE), with built-in fallback to CPUs and preemption support for live traffic spikes.

5. The migration from prototype to full production handling 10+ petabytes took only 8-9 months, enabled by a three-way partnership between Snap, NVIDIA, and Google Cloud.
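
The "zero code changes" point is easier to picture with a small sketch. GPU acceleration through the RAPIDS Accelerator is normally switched on through Spark configuration rather than by editing job code. The Python snippet below builds a `spark-submit` command for the same job with and without those settings. The property names are standard Spark and RAPIDS Accelerator settings, but the jar path, the GPU amounts, and the job name are illustrative assumptions. None of this is Snap's actual configuration, which the episode summary does not describe.

```python
import shlex

# Standard Spark / RAPIDS Accelerator property names.
# The values are illustrative assumptions, not Snap's settings.
GPU_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",   # loads the RAPIDS plugin
    "spark.rapids.sql.enabled": "true",              # run supported SQL ops on GPU
    "spark.executor.resource.gpu.amount": "1",       # one GPU per executor
    "spark.task.resource.gpu.amount": "0.25",        # four tasks share a GPU
}


def build_submit_command(app, use_gpu, rapids_jar="/opt/jars/rapids-4-spark.jar"):
    """Return spark-submit arguments for `app`.

    The job script is identical in both modes; only the launch
    configuration differs. `rapids_jar` is a hypothetical path.
    """
    cmd = ["spark-submit"]
    if use_gpu:
        cmd += ["--jars", rapids_jar]
        for key, value in GPU_CONF.items():
            cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # "experiment_metrics.py" is a made-up job name for illustration.
    print(shlex.join(build_submit_command("experiment_metrics.py", use_gpu=False)))
    print(shlex.join(build_submit_command("experiment_metrics.py", use_gpu=True)))
```

Because the only difference between the two commands is configuration, a scheduler could in principle launch the same job on CPU nodes when GPUs are unavailable. That is one plausible reading of the CPU fallback mentioned above, though the summary gives no implementation details.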

💬 Notable Quotes

"We were able to cut almost about 76% of our job costs. 76. It's phenomenal!" - Prudviya Bhattala, Head of Engineering Platforms at Snap
"With Spark RAPIDS, we didn't have to change a single thing about how we ran the jobs. That was the beauty of it." - Prudviya Bhattala
"We were able to cut out almost 120 terabytes of disk and memory spill from our pipelines. Just vanished once we started doing all of this." - Prudviya Bhattala