Apache® Spark™ DataFrames: Simple and Fast Analysis of Structured Data
Presented by
Michael Armbrust
About this talk
This session provides a technical overview of Apache Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources, such as Hive, relational databases, or structured file formats like Avro. We’ll then walk through example user programs that operate on DataFrames and point out common design patterns. The second half of the talk focuses on the technical implementation of DataFrames, including how Spark SQL’s Catalyst optimizer plans user programs and how fast binary data structures in Spark’s core engine substantially improve performance and memory use for common types of operations.