
Deep Dive: Apache Spark Memory Management

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Recorded: Jun 15 2016 43 mins
Presented by Andrew Or
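The execution/storage arbitration described above can be illustrated with a toy model of Spark's unified memory region (introduced in Spark 1.6), in which execution and storage share a single pool: execution requests may evict cached blocks, but caching may never evict execution memory. This is an illustrative sketch under that one assumption, not Spark's actual implementation; all class and method names are made up for the example.

```python
class UnifiedMemoryPool:
    """Toy model of Spark's unified memory region (illustrative only).

    Execution and storage share `total` bytes. Execution requests may
    evict cached (storage) blocks; storage requests may never evict
    execution memory, mirroring the asymmetry in Spark 1.6+.
    """

    def __init__(self, total):
        self.total = total
        self.execution_used = 0
        self.cached_blocks = {}  # block_id -> size in bytes

    @property
    def storage_used(self):
        return sum(self.cached_blocks.values())

    @property
    def free(self):
        return self.total - self.execution_used - self.storage_used

    def acquire_execution(self, size):
        """Grant execution memory, evicting cached blocks if needed."""
        while self.free < size and self.cached_blocks:
            evicted_id = next(iter(self.cached_blocks))
            del self.cached_blocks[evicted_id]
        if self.free >= size:
            self.execution_used += size
            return True
        return False  # a real task would spill to disk or block here

    def cache_block(self, block_id, size):
        """Cache a block only if space is free; never evict execution."""
        if self.free >= size:
            self.cached_blocks[block_id] = size
            return True
        return False


pool = UnifiedMemoryPool(total=100)
pool.cache_block("rdd_0_0", 60)   # storage takes 60 of 100
pool.acquire_execution(70)        # evicts the cached block to make room
print(pool.execution_used, pool.storage_used)  # 70 0
```

The asymmetry is the key design point of the talk's era: evicting a cached block only costs a recomputation, while evicting execution memory would fail a running task.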

  • Channel profile
  • Accelerate Innovation by unifying Data and AI Recorded: Jul 31 2019 57 mins
See how Apple, Finra, FIS, Overstock.com, Hewlett-Packard, Shell, Hotels.com and many others overcame the challenges of connecting data science and data engineering using Databricks, the company founded by the original creators of Apache Spark™. The results? Faster performance, scaled data processes, simplified infrastructure, streamlined workflows, and greater collaboration.
  • Introducing MLflow: Infrastructure for a Complete Machine Learning Lifecycle Recorded: Aug 30 2018 54 mins
    Matei Zaharia, Co-Founder and Chief Technologist at Databricks, Denny Lee, Technical Product Marketing Manager at Databricks
    ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.

    In our webinar, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.

    We will show how to:
    - Keep track of experiment runs and results across popular frameworks, including TensorFlow, with MLflow Tracking
    - Execute an MLflow Project published on GitHub from the command line or a Databricks notebook, as well as remotely execute your project on a Databricks cluster
    - Quickly deploy MLflow Models on-prem or in the cloud and expose them via REST APIs

    Get started now at https://www.mlflow.org/
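The tracking idea described above, recording each run's parameters and metrics so experiments can be compared and reproduced, can be sketched in plain Python. This is a conceptual stand-in for the design, not MLflow's real API; all names here are invented for the example.

```python
import uuid


class RunTracker:
    """Minimal stand-in for an MLflow-style tracking store (illustrative).

    Each run records its parameters and metrics so experiments can be
    compared and reproduced later, the core idea behind MLflow Tracking.
    """

    def __init__(self):
        self.runs = {}

    def start_run(self):
        run_id = uuid.uuid4().hex
        self.runs[run_id] = {"params": {}, "metrics": {}}
        return run_id

    def log_param(self, run_id, key, value):
        self.runs[run_id]["params"][key] = value

    def log_metric(self, run_id, key, value):
        self.runs[run_id]["metrics"][key] = value

    def best_run(self, metric):
        """Return the run id with the highest value of `metric`."""
        return max(self.runs, key=lambda r: self.runs[r]["metrics"][metric])


tracker = RunTracker()
for lr in (0.01, 0.1):
    run = tracker.start_run()
    tracker.log_param(run, "learning_rate", lr)
    tracker.log_metric(run, "accuracy", 0.90 if lr == 0.01 else 0.85)

best = tracker.best_run("accuracy")
print(tracker.runs[best]["params"])  # {'learning_rate': 0.01}
```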
  • From Data Prep to Deep Learning: How HP Unifies Analytics with Databricks Recorded: Jul 31 2018 47 mins
    Franco Vieira, Data Scientist at HP
HP has invested in a new product delivery paradigm called Device as a Service (DaaS). The success of the DaaS investment depends on automating the delivery, monitoring, replacement, user interaction, and servicing of the device. At the core of DaaS is a set of Virtual Assistants that optimize cost and the user experience, assuring customer satisfaction under an aggressive cost model. The key takeaway from this presentation is how HP is using the Databricks Unified Analytics Platform to develop Virtual Assistants that change the workplace. Additionally, the session will cover HP's approach to developing AI on Apache Spark™ and why HP chose Spark as a core technology for AI.
  • Scalable End-to-End Deep Learning using TensorFlow™ and Databricks Recorded: Jul 9 2018 42 mins
    Brooke Wenig, Data Science Solutions Consultant at Databricks, Siddarth Murching, Software Engineer at Databricks
    Deep Learning has shown tremendous success, and as we all know, the more data the better the models. However, we eventually hit a bottleneck on how much data we can process on a single machine. This necessitates a new way of training neural networks: in a distributed manner.

    In this webinar, we walk through how to use TensorFlow™ and Horovod (an open-source library from Uber to simplify distributed model training) on Databricks to build a more effective recommendation system at scale. We will cover:

    - The new Databricks Runtime for ML, shipped with pre-installed libraries such as Keras, TensorFlow, Horovod, and XGBoost to enable data scientists to get started with distributed Machine Learning more quickly
    - The newly-released HorovodEstimator API for distributed, multi-GPU training of deep learning models against data in Apache Spark™
    - How to make predictions at scale with deep learning pipelines
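The core idea behind the distributed training described above is synchronous data parallelism: each worker computes gradients on its own data shard, the gradients are averaged across workers (Horovod does this with a ring-allreduce), and every worker applies the same update. A plain-Python sketch of that averaging step, not Horovod's implementation:

```python
def allreduce_average(worker_grads):
    """Average per-worker gradients elementwise. This is the operation
    Horovod performs (via ring-allreduce) after each training step;
    here it is a plain-Python sketch, not Horovod's implementation."""
    n_workers = len(worker_grads)
    return [sum(g) / n_workers for g in zip(*worker_grads)]


def sgd_step(weights, grads, lr=0.1):
    """One synchronous data-parallel SGD step with averaged gradients."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Each worker computes gradients on its own shard of the data.
grads_per_worker = [
    [0.2, -0.4],  # worker 0
    [0.4, -0.2],  # worker 1
]
avg = allreduce_average(grads_per_worker)  # [0.3, -0.3]
weights = sgd_step([1.0, 1.0], avg)        # approximately [0.97, 1.03]
print(weights)
```

Because every worker applies identical averaged gradients, the replicas stay in sync without a central parameter server, which is why this pattern scales well across GPUs.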
  • Streaming Analytics Use Cases on Apache Spark™ Recorded: May 17 2018 60 mins
    Deepsha Menghani, Prod Mktg Mgr at Microsoft; Dhruv Kumar, Solutions Architect; Brian Dirking, Sr. Dir of Partner Mktg
    Real-time analytics is crucial to many use cases. Apache Spark™ provides the framework and high-volume analytics to deliver answers from your streaming data. Join us in this webinar and see a demonstration of how to build IoT and Clickstream Analytics Notebooks in Azure Databricks. These Notebooks will use Python and SQL code to capture data from Azure Event Hubs and Azure IoT Hub, parse the data, and make it available to run in machine learning models. See how your organization can start taking advantage of your streaming data.
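The basic shape of such a clickstream aggregation, counting events per fixed time window, can be hinted at in plain Python. This is illustrative only; a production notebook would use Spark Structured Streaming reading from Event Hubs rather than an in-memory list.

```python
from collections import Counter


def windowed_counts(events, window_secs=60):
    """Count clickstream events per fixed time window (illustrative).

    `events` is a list of (unix_timestamp, event_type) pairs; the result
    maps (window_start, event_type) to a count, the same shape a
    streaming group-by over a time window would produce.
    """
    counts = Counter()
    for ts, event_type in events:
        window_start = (ts // window_secs) * window_secs
        counts[(window_start, event_type)] += 1
    return dict(counts)


events = [(5, "click"), (42, "click"), (70, "view"), (95, "click")]
print(windowed_counts(events))
# {(0, 'click'): 2, (60, 'view'): 1, (60, 'click'): 1}
```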
  • Is Your Data Lake GDPR Ready? How to Avoid Drowning in Data Requests Recorded: May 9 2018 41 mins
    Arsalan Tavakoli-Shiraji, VP of Solutions; Justin Olsson, Senior Legal Counsel and Michael Armbrust, Software Engineer
    With GDPR enforcement rapidly approaching on May 25, many companies are still trying to figure out how to comply with one of the regulation’s biggest pain points - data subject requests (DSRs). Under GDPR, data subjects (individuals) in the EU have the right to request information on what personal data is collected, how it is being used, and to have that data changed or erased.

    For many organizations that rely on data lakes to store their big data, sifting through millions of files to locate and modify records for a DSR is at minimum a massive effort. And trying to do this within prescribed timelines is near impossible.

    Fortunately there’s a path forward. Through an optimized approach to data management, Databricks powered by Apache Spark™ makes it easy to quickly find, edit and erase data submerged deep within your data lake without disrupting your data pipelines.

    Join this webinar to learn:
    • The GDPR requirements of data subject requests
    • The challenges big data and data lakes create for organizations
    • How Databricks Delta, a powerful new offering within the Databricks Unified Analytics Platform, improves data lake management and makes it possible to quickly find and surgically remove or modify individual records
    • Best practices for GDPR data governance
    • Live demo on how to easily fulfill data requests with Databricks
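At its core, fulfilling an erasure request means rewriting the data so the subject's records are gone, which is expensive in a data lake because whole files must be rewritten; Delta's value is narrowing that rewrite to the affected files. A minimal sketch of the rewrite step itself, with an invented record schema for illustration:

```python
def erase_subject(records, subject_id):
    """Rewrite a dataset without one data subject's records, the core of
    fulfilling a GDPR erasure request. Illustrative sketch only: a real
    data lake must rewrite the files containing the records, and the
    `user_id` field here is an assumed example schema."""
    kept = [r for r in records if r["user_id"] != subject_id]
    removed = len(records) - len(kept)
    return kept, removed


records = [
    {"user_id": "u1", "event": "click"},
    {"user_id": "u2", "event": "view"},
    {"user_id": "u1", "event": "purchase"},
]
cleaned, n = erase_subject(records, "u1")
print(n, cleaned)  # 2 [{'user_id': 'u2', 'event': 'view'}]
```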
  • Collaboration to Production with Apache Spark on Azure Databricks Recorded: Apr 27 2018 54 mins
    Sandy May, Data Shepherd at Elastacloud
    Sandy is going to highlight some key aspects of the new Spark-as-a-Service offering in Azure, from Databricks. He will leverage the power of Databricks notebooks to showcase loading and cleaning data in SQL and Scala, through exploration, and all the way to putting a model into production.
  • Apache Spark™ for Machine Learning and AI Recorded: Apr 26 2018 61 mins
    Brian Dirking, Senior Director of Partner Marketing at Databricks, and Nauman Fakhar, System Architect at Databricks
    Azure Databricks is an Apache Spark™-based platform, providing the scale, collaborative workspace, and integration with your Azure environment that make it the best place to run your ML and AI workloads on Azure. This webinar will include an in-depth demo of key AI and ML use cases.
  • How Viacom Revolutionized Audience Experiences with Real-Time Analytics at Scale Recorded: Apr 25 2018 59 mins
    Mark Cohen, VP of Data Platform Engineering at Viacom; Chris Burns, Machine Learning Solutions Architect at AWS
    With 170+ global networks, Viacom is focused on providing an amazing audience experience to its billions of viewers around the world. Core to this strategy is leveraging big data and advanced analytics to offer the right content to the right audience and deliver it flawlessly on any device. To make this possible, Viacom set out to build a real-time, scalable data analytics platform on Apache Spark™.

    Join this webinar to learn how Viacom overcame the complexities of Spark with Databricks and AWS to build an end-to-end scalable self-service insights platform that delivers on a wide range of analytics use cases.

    This webinar will cover:
    - The challenges Viacom faced building a scalable, real-time data insights and AI platform
    - How they overcame these challenges with Spark, AWS and Databricks
    - How they leverage a unified analytics platform for data pipelines, analytics and machine learning to reduce video start delays and improve content delivery with stream analytics at scale
    - What it takes to create a data driven culture with self-service analytics that meet the needs of business users, data analysts and data scientists
  • Getting Started with Apache Spark™ on Azure Databricks Recorded: Mar 27 2018 60 mins
    Brian Dirking, Senior Director of Partner Marketing at Databricks, and Nauman Fakhar, System Architect at Databricks
    Learn the basics of Apache Spark™ on Azure Databricks. Designed by Databricks, in collaboration with Microsoft, Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click set up, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

    This webinar will cover the following topics:

    · RDDs, DataFrames, Datasets, and other fundamentals of Apache Spark.
    · How to quickly set up Azure Databricks, relieving you of DataOps duties.
    · How to use the Databricks interactive notebooks, which provide a collaborative space for your entire analytics team, and how you can schedule notebooks, immediately putting your work into production.
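The defining property of the RDD fundamentals listed above is lazy evaluation: transformations such as `map` and `filter` only record lineage, and nothing runs until an action such as `collect` is called. A conceptual plain-Python analogy (real RDDs are partitioned across a cluster; all names here are invented):

```python
class ToyRDD:
    """Conceptual stand-in for a Spark RDD (illustrative only).

    Transformations (map, filter) are lazy: they just extend the
    recorded lineage. Only an action (collect) evaluates the chain,
    which is what lets Spark optimize and recompute lost partitions.
    """

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops  # recorded lineage of transformations

    def map(self, fn):
        return ToyRDD(self.data, self.ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.data, self.ops + (("filter", pred),))

    def collect(self):
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out


rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```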
  • Fast and Reliable ETL Pipelines with Databricks Recorded: Mar 7 2018 57 mins
    Prakash Chockalingam, Product Manager at Databricks
    Building multiple ETL pipelines is complex and time-consuming, making it an expensive endeavor. As the number of data sources and the volume of data increase, ETL time also increases, delaying when an enterprise can derive value from its data.

    Join Prakash Chockalingam, Product Manager and data engineering expert at Databricks, to learn how to avoid the common pitfalls of data engineering and how the Databricks Unified Analytics Platform can ensure performance and reliability at scale to lower total cost of ownership (TCO).

    In this webinar, you will learn how Databricks can help to:
    - Remove infrastructure configuration complexity to reduce DevOps efforts
    - Optimize your ETL data pipelines for performance without compromising reliability
    - Unify data engineering and data science to accelerate innovation for the business.
  • Azure Databricks: Accelerating Innovation with Microsoft Azure and Databricks Recorded: Feb 15 2018 52 mins
    Brian Dirking, Senior Director of Partner Marketing at Databricks
    Data scientists and data engineers need a secure and scalable platform to collaborate on analytics. Register for this webinar and see how Azure Databricks provides a platform that enables teams to accelerate innovation, providing:

    - A collaborative workspace to experiment with models and datasets, and then put jobs into action instantly.
    - An automated infrastructure that enables you to autoscale compute and storage independently.

    The live demo portion of the webinar will show how Azure Databricks can bring in streaming data, run it in a machine learning model, and then output the results to PowerBI for visualization.
  • What's New in the Upcoming Apache Spark 2.3 Release? Recorded: Feb 8 2018 49 mins
    Reynold Xin, Chief Architect at Databricks, and Jules Damji, Spark Community and Developer Advocate
    The upcoming Spark 2.3 release marks a big step forward in speed, unification, and API support.

    Reynold Xin and Jules Damji from Databricks will walk through how you can benefit from the upcoming improvements:

    - New DataSource APIs that enable developers to more easily read and write data for Continuous Processing in Structured Streaming.
    - PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
    - Improved performance by taking advantage of NVMe SSDs.
    - Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
  • Ten Must-Haves to Deploy Machine Learning and AI in the Enterprise Recorded: Jan 25 2018 61 mins
    Forrester VP & Principal Analyst, Mike Gualtieri; Data Science Lead at Overstock, Chris Robison; PM at Overstock, Craig Kelly
    Enterprise data science teams are driving big innovations in machine learning, but this has put them under increased pressure to deliver more models, more frequently, and more rapidly.

    In this webinar, Forrester VP & Principal Analyst, Mike Gualtieri, will share data on the top trends in machine learning and lay out what data science teams need to do in order to maximize their output.

    Chris Robison, Head of Data Science at Overstock.com and Craig Kelly, Group Product Manager at Overstock.com, will showcase how they utilized big data and machine learning to

    -Create a one-to-one personalized shopping experience.
    -Decrease cost of moving models to production by nearly 50%.
    -Stand up new models 5x faster than before.
  • How Databricks helps iPass optimize for performance and availability Recorded: Jan 10 2018 60 mins
    Tomasz Magdanski, Director of Big Data and Analytics at iPass
    iPass is the world’s largest wifi network, serving over 160 network providers with more than 60 million hotspots in airports, hotels, airplanes, and public spaces in 120 countries across the globe.

    Analyzing the state of the world’s wifi in real time is a daunting task fraught with unpredictable challenges that can impact performance, reliability, and security. Join this webinar to learn why iPass moved from an on-premises Hadoop system to Databricks in the cloud and how they are able to deliver ground-breaking results with a small and nimble team.

    With Databricks, iPass can now focus on scalable business logic instead of building infrastructure. This newfound freedom has allowed their team to:
    -monitor the performance of millions of wifi hotspots around the world.
    -leverage machine learning and real-time analytics to understand the health of access points.
    -make recommendations to customers on the best access point to use to ensure optimal performance.
  • Continuous Integration & Continuous Delivery with Databricks Recorded: Dec 7 2017 45 mins
    Prakash Chockalingam, Product Manager at Databricks
    Continuous integration and continuous delivery (CI/CD) enables an organization to rapidly iterate on software changes while maintaining stability, performance, and security. Many organizations have adopted various tools to follow the best practices around CI/CD to improve developer productivity, code quality, and software delivery. However, following the best practices of CI/CD is still challenging for many big data teams.

    This webinar will highlight:
    *Key challenges in building a data pipeline for CI/CD.
    *Key integration points in a data pipeline's CI/CD cycle.
    *How Databricks facilitates iterative development, continuous integration and build.
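One practice behind the CI/CD integration points listed above is keeping pipeline transformations as pure functions, so a CI job can unit-test them on every commit before a notebook or job is promoted. A minimal sketch, with an invented transformation for illustration:

```python
def normalize_amounts(rows):
    """Example pipeline transformation: parse currency strings to floats.
    Keeping transformations as pure functions like this makes them easy
    to unit-test in a CI job on every commit (illustrative sketch; the
    schema and function name are invented for the example)."""
    return [{"id": r["id"], "amount": float(r["amount"].replace("$", ""))}
            for r in rows]


# The kind of check a CI pipeline would run automatically:
def test_normalize_amounts():
    out = normalize_amounts([{"id": 1, "amount": "$3.50"}])
    assert out == [{"id": 1, "amount": 3.5}]


test_normalize_amounts()
print("pipeline tests passed")
```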
  • Unified Data Management: The Best of Data Lakes, Data Warehouses and Streaming Recorded: Nov 16 2017 61 mins
    Jason Pohl, Software Engineer at Databricks, and Bill Chambers, Product Manager at Databricks
    Current data management architectures are a complex combination of siloed, single-purpose tools. There are data lakes that offer low-cost storage but are difficult to use for data discovery; data warehouses that are reliable and optimized for fast queries but come at a cost when having to scale; and various streaming and batch systems that shuffle data between them, often resulting in data integrity issues.

    Businesses have to create a patchwork of different tools, skillsets, and expertise just to solve one fundamental problem: How can I make data-driven decisions faster?

    Join this webinar to learn how Databricks Delta — a new unified data management system — takes advantage of the scale of a data lake, the reliability and performance of a data warehouse, and the low-latency updates of a streaming system, all in a unified and fully managed fashion.

    This webinar will cover:
    -How the need to process batch and streaming data creates challenges for enterprises with complex data architectures.
    -How Databricks Delta takes the best of data warehouses, data lakes and streaming systems to provide a highly scalable, performant, and reliable data management system.
    -A live demonstration of Databricks Delta to showcase how easy it is to cost-efficiently scale without impacting query performance.
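The mechanism that lets a system like Delta combine lake scale with warehouse reliability is an ordered transaction log over the files in the lake: writers append atomic commits, and readers reconstruct a consistent snapshot at any version. A toy model of that idea (illustrative only; not Delta's actual format, and all names are invented):

```python
class ToyDeltaLog:
    """Toy version of a Delta-style transaction log (illustrative only).

    Writers append atomic commits describing added/removed data files;
    readers replay the log to reconstruct a consistent snapshot of the
    table at any version, which is how a data lake gains
    warehouse-like reliability and time travel.
    """

    def __init__(self):
        self.commits = []  # ordered log of {"add": [...], "remove": [...]}

    def commit(self, add=(), remove=()):
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1  # version number

    def snapshot(self, version=None):
        """Files visible at `version` (defaults to the latest version)."""
        upto = len(self.commits) if version is None else version + 1
        files = set()
        for c in self.commits[:upto]:
            files |= set(c["add"])
            files -= set(c["remove"])
        return sorted(files)


log = ToyDeltaLog()
v0 = log.commit(add=["part-0.parquet"])
v1 = log.commit(add=["part-1.parquet"], remove=["part-0.parquet"])
print(log.snapshot(v0))  # ['part-0.parquet']
print(log.snapshot())    # ['part-1.parquet']
```

Because a compact-and-replace is a single commit, readers never see a half-rewritten table, which is what makes the GDPR-style record rewrites described elsewhere on this page safe to run against live pipelines.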
  • 5 Keys to Build Machine Learning and Visualization Into Your Application Recorded: Nov 8 2017 51 mins
    Databricks, Handshake, and Looker
    Machine learning has unlocked new possibilities that deliver significant business value. However, most companies don’t have the resources either to build and maintain the supporting infrastructure or to apply data science to build a smarter solution.

    Join us for this webinar and hear from John Huang, engineering and data analytics lead at Handshake, as he shares how he quickly and cost effectively scaled a small engineering team to build a machine-learning-powered recommendation engine that profiles users and behaviors to present relevant next steps. In this webinar you will learn how to:

    -Simplify and accelerate data engineering processes including data ingest and ETL
    -Incorporate machine learning into your production application without an army of data scientists
    -Choose an analytics engine that will enable key analytics such as attribution, step analysis, and linear regression
    -Embed visualizations into your application that drive stickiness
  • How to Put Cluster Management on Autopilot Recorded: Oct 19 2017 49 mins
    Prakash Chockalingam, Product Manager at Databricks
    A key obstacle for doing data engineering at scale is having a robust distributed infrastructure on which frameworks like Apache Spark can run efficiently. On top of building the infrastructure, ensuring it functions automatically and reliably is another critical piece of running production workloads.

    Join this webinar to learn how Databricks’ Unified Analytics Platform can help simplify your data engineering problems by configuring your distributed infrastructure to be in autopilot mode. Learn how:
    -Databricks’ automated infrastructure will allow you to autoscale compute and storage independently.
    -To significantly reduce cloud costs through cutting edge cluster management features.
    -To control certain features in the cluster management and balance between ease of use and manual control.
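The autoscaling behavior described above amounts to a policy that picks a cluster size from the current workload, bounded by user-set limits. A deliberately simple sketch of such a policy (illustrative only; a real cluster manager also weighs utilization history, scale-down delays, and spot pricing):

```python
def target_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    """Pick a cluster size from the backlog of pending tasks: a toy
    autoscaling policy (illustrative; the parameters and sizing rule
    are assumptions for the example, not Databricks' algorithm)."""
    needed = -(-pending_tasks // tasks_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))


print(target_workers(0, 8, min_workers=2, max_workers=20))    # 2
print(target_workers(100, 8, min_workers=2, max_workers=20))  # 13
print(target_workers(500, 8, min_workers=2, max_workers=20))  # 20
```

Clamping to a minimum keeps interactive users responsive when the cluster is idle, while the maximum caps cloud spend, the "balance between ease of use and manual control" mentioned above.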
  • How CardinalCommerce Significantly Improved Data Pipeline Speeds by 200% Recorded: Sep 21 2017 61 mins
    Christopher Baird from CardinalCommerce
    CardinalCommerce was acquired by Visa earlier this year for its critical role in payments authentication. Through predictive analytics and machine learning, Cardinal measures performance and behavior of the entire authentication process across checkout, issuing and ecosystem partners to recommend actions, reduce fraud and drive frictionless digital commerce.

    With Databricks, CardinalCommerce simplified data engineering to improve the performance of their ETL pipeline by 200% while reducing operational costs significantly via automation, seamless integration with key technologies, and improved process efficiencies.

    Join this webinar to learn how CardinalCommerce was able to:
    -Simplify access to data across the organization
    -Accelerate data processing by 200%
    -Reduce EC2 costs through faster performance and automated infrastructure
    -Visualize performance metrics to customers and stakeholders
Making Big Data Simple
Databricks’ mission is to accelerate innovation for its customers by unifying Data Science, Engineering and Business. Founded by the team who created Apache Spark™, Databricks provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. Users achieve faster time-to-value with Databricks by creating analytic workflows that go from ETL and interactive exploration to production. The company also makes it easier for its users to focus on their data by providing a fully managed, scalable, and secure cloud infrastructure that reduces operational complexity and total cost of ownership.

  • Title: Deep Dive: Apache Spark Memory Management
  • Live at: Jun 15 2016 4:00 pm
  • Presented by: Andrew Or