Snap's head of engineering platforms, Prudviya Bhattala, explains how the company migrated its 10-petabyte-per-day data pipeline from CPU-based Apache Spark to GPU-accelerated Spark using NVIDIA Spark RAPIDS on Google Kubernetes Engine (GKE). The transition reused idle GPUs from Snap's online inference capacity and cut job costs by 76%, cores by 62%, and memory footprint by 80%, with no code changes to existing jobs.
Summarized by Podsumo
Snap processes over 10 petabytes of data daily for its experimentation platform, serving nearly a billion monthly active users.
By moving to GPU-accelerated Spark with NVIDIA Spark RAPIDS, Snap cut job costs by 76%, cores by 62%, and memory footprint by 80%, and eliminated almost 120 terabytes of disk and memory spill.
No code changes were required to adopt GPU acceleration, thanks to Spark RAPIDS' compatibility with existing Apache Spark pipelines.
Snap reused idle GPUs from its online inference stack by migrating workloads to a new Kubernetes-based data platform on Google Kubernetes Engine (GKE), with built-in fallback to CPUs and support for preempting batch work when live traffic spikes.
The migration from prototype to full production handling 10+ petabytes took only 8-9 months, enabled by a three-way partnership between Snap, NVIDIA, and Google Cloud.
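The "no code changes" and CPU-fallback points above can be pictured as a launch-time configuration switch: the job itself stays the same, and only the Spark settings differ depending on whether a GPU is available. The following is a minimal sketch under stated assumptions, not Snap's actual setup. The jar path, the executor sizes, and the helper names `build_spark_conf` and `to_submit_args` are hypothetical; the `spark.plugins`, `spark.rapids.sql.*`, and `spark.*.resource.gpu.amount` keys are standard settings for Apache Spark and the RAPIDS Accelerator.

```python
# Illustrative sketch only: paths and sizes below are hypothetical placeholders.
RAPIDS_JAR = "/opt/spark/jars/rapids-4-spark.jar"  # hypothetical jar location

# Settings shared by CPU and GPU runs (example values, not Snap's).
BASE_CONF = {
    "spark.executor.cores": "8",
    "spark.executor.memory": "16g",
}

# Extra settings that turn on GPU acceleration without touching job code.
GPU_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",  # loads the RAPIDS plugin
    "spark.rapids.sql.enabled": "true",             # run supported SQL ops on GPU
    "spark.executor.resource.gpu.amount": "1",      # one GPU per executor
    "spark.task.resource.gpu.amount": "0.125",      # 8 tasks share that GPU
    "spark.rapids.sql.explain": "NOT_ON_GPU",       # log ops that stay on CPU
}


def build_spark_conf(gpu_available):
    """Return the Spark conf for a job; fall back to plain CPU settings."""
    conf = dict(BASE_CONF)
    if gpu_available:
        conf.update(GPU_CONF)
    return conf


def to_submit_args(conf, app):
    """Turn a conf dict into a spark-submit argument list for the same app."""
    args = ["spark-submit"]
    if "spark.plugins" in conf:
        args += ["--jars", RAPIDS_JAR]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # The application file is identical in both cases; only the flags change.
    print(" ".join(to_submit_args(build_spark_conf(True), "job.py")))
    print(" ".join(to_submit_args(build_spark_conf(False), "job.py")))
```

In this sketch the scheduler's only decision is the `gpu_available` flag, which is one simple way to model a fallback path: if GPUs are reclaimed for live traffic, the same job can be resubmitted with the CPU-only settings.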
"We were able to cut almost about 76% of our job costs. 76. It's phenomenal!" - Prudviya Bhattala, Head of Engineering Platforms at Snap
"With Spark RAPIDS, we didn't have to change a single thing about how we ran the jobs. That was the beauty of it." - Prudviya Bhattala
"We were able to cut out almost 120 terabytes of disk and memory spill from our pipelines. Just vanished once we started doing all of this." - Prudviya Bhattala
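To put the reported percentages in perspective, it can help to convert each reduction into the share of the baseline that remains and the corresponding improvement factor. The short calculation below is only a sketch of that arithmetic using the figures quoted in this summary; the function names are illustrative.

```python
def remaining_fraction(reduction_pct):
    """Share of the original baseline left after a percentage reduction."""
    return 1 - reduction_pct / 100


def improvement_factor(reduction_pct):
    """How many times smaller the new value is than the baseline."""
    return 1 / remaining_fraction(reduction_pct)


# Reductions reported in the talk summary (percent).
reported = {"job cost": 76, "cores": 62, "memory footprint": 80}

for name, pct in reported.items():
    left = remaining_fraction(pct)
    factor = improvement_factor(pct)
    print(f"{name}: {left:.0%} of baseline remains, about {factor:.1f}x lower")
```

By this arithmetic, a 76% cost cut means jobs run at roughly a quarter of their previous cost, and an 80% memory reduction means a footprint about one fifth of the original.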