Spark magic: How high-level pipelines become distributed hardcore

Day 2 / Track 2 / RU

Spark is the most popular tool for building data pipelines. Every data engineer knows Spark, blah-blah-blah… OK, but Spark is just distributed Java Streams, right? So how does it actually work? It turns out you can't just call "flatMap" or "groupBy" on a remote machine. Codegen! Interested? Come and find out more!


Speakers

Pasha Finkelstein
JetBrains

Pasha is a speaker and a developer on the Big Data Tools team at JetBrains, and the author of the Kotlin API for Apache Spark. He has worked in almost every IT position, from tech support to manager and data engineer. He loves talking to people about anything, but especially about IT. He's a Kotlin fan.