Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

In this webcast, Joseph Bradley from Databricks will be speaking about Apache Spark’s distributed Machine Learning Library - MLlib.

We will start with a quick primer on machine learning and Spark MLlib, along with an overview of some Spark machine learning use cases. We will continue with several MLlib quick-start demos. Afterwards, the talk will turn to integrating common data science tools like Python pandas, scikit-learn, and R with MLlib.
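
As a taste of what the quick-start demos cover, here is a minimal sketch of training a model with MLlib's DataFrame-based API (the data is a toy set; the Spark 2.x-style SparkSession is used for brevity):

```python
# A minimal MLlib quick-start sketch: train and apply a logistic regression
# model on a toy DataFrame. Data and app name are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-quickstart").getOrCreate()

training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)

# Predictions come back as a DataFrame, which converts easily to pandas
# (model.transform(training).toPandas()) for use with local tools.
model.transform(training).select("label", "prediction").show()
```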
Recorded: Feb 24 2016 59 mins
Presented by
Joseph Bradley
Channel profile
  • Apache® Spark™ - The Unified Engine for All Workloads Jan 12 2017 6:00 pm UTC 60 mins
    Tony Baer, Principal Analyst at Ovum
    The Apache® Spark™ compute engine has gone viral: not only is it the most active Apache big data open source project, it is also the fastest-growing big data analytics workload, on and off Hadoop. The major reason behind Spark's popularity with developers and enterprises is its flexibility to support a wide range of workloads, including SQL queries, machine learning, streaming, and graph analysis.


    This webinar features Ovum analyst Tony Baer, who will explain the real-world benefits to practitioners and enterprises when they build a technology stack based on a unified approach with Apache Spark.

    This webinar will cover:
    - Findings around the growth of Spark and diverse applications using machine learning and streaming.
    - The advantages of using Spark to unify all workloads, rather than stitching together many specialized engines like Presto, Storm, MapReduce, Pig, and others.
    - Use case examples that illustrate the flexibility of Spark in supporting various workloads.
  • Apache® Spark™ MLlib 2.x: Migrating ML Workloads to DataFrames Dec 8 2016 6:00 pm UTC 60 mins
    Joseph K. Bradley and Jules S. Damji
    In the Apache® Spark™ 2.x releases, Machine Learning (ML) is focusing on DataFrame-based APIs. This webinar is aimed at helping users take full advantage of the new APIs. Topics will include migrating workloads from RDDs to DataFrames, ML persistence for saving and loading models, and the roadmap ahead.
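
    As a hedged illustration of the ML persistence topic, a DataFrame-based model can be saved and reloaded in a few lines (the path is arbitrary and the `training` DataFrame is assumed to exist):

```python
# Sketch of ML persistence with the DataFrame-based API (Spark 2.x).
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

lr = LogisticRegression(maxIter=10)
model = lr.fit(training)   # `training`: a DataFrame of (label, features) rows

model.save("/tmp/lr-model")                               # write model + params
reloaded = LogisticRegressionModel.load("/tmp/lr-model")  # read it back
```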
  • How to Evaluate Cloud-based Apache® Spark™ Platforms Recorded: Nov 16 2016 62 mins
    Nik Rouda - ESG Global
    Since its release, Apache Spark has quickly become the fastest-growing big data processing engine. But few companies have the domain expertise and resources to build their own Spark-based infrastructure, often resulting in a mix of tools that are complex to stand up and time-consuming to maintain.

    There are several cloud-based platforms available that allow you to harness the power of Spark while reaping the advantages of the cloud. This webinar features ESG Global senior analyst Nik Rouda who will share research and best practices to help decision makers evaluate the most popular cloud-based Apache Spark solutions and to understand the differences between them.
  • Databricks for Data Engineers Recorded: Oct 26 2016 49 mins
    Prakash Chockalingam
    Apache Spark has become an indispensable tool for data engineering teams. Its performance and flexibility have made ETL one of Spark's most popular use cases. In this webinar, Prakash Chockalingam, a seasoned data engineer and PM, will discuss how Databricks allows data engineering teams to overcome common obstacles while building production-quality data pipelines with Spark. Specifically, you will learn:

    - Obstacles faced by data engineering teams while building ETL pipelines;
    - How Databricks simplifies Spark development;
    - A demonstration of key Databricks functionalities geared towards making data engineers more productive.
  • How Edmunds.com Leverages Apache® Spark™ on Databricks to Improve Customer Conversion Recorded: Oct 19 2016 60 mins
    Shaun Elliott, Christian Lugo
    Edmunds.com is a leading online car information and shopping marketplace serving nearly 20 million visitors to its website each month. With 10x growth in data, to hundreds of terabytes, over the past four years, their engineering team was looking for ways to increase consumer engagement and conversion by improving the data integrity of Edmunds' website.

    Databricks simplifies the management of their Apache Spark infrastructure while accelerating data exploration at scale. Now they can quickly analyze large datasets to determine the best sources for car data on their website.

    In this webinar, you will learn:

    - Why Edmunds.com moved from MapReduce to Databricks for ad hoc data exploration.
    - How Databricks democratized data access across teams to improve decision making and feature innovation.
    - Best practices for doing ETL and building a robust data pipeline with Databricks.
  • How Omega Point Helps Investors Optimize their Portfolios with Apache® Spark™ on Databricks Recorded: Aug 18 2016 56 mins
    Omer Cedar, CEO, Omega Point and Eran Cedar, CTO, Omega Point
    Omega Point uses big data analytics to enable investment professionals to reduce risk while increasing their returns. Databricks enables Omega Point to uncover performance drivers of investment portfolios using massive volumes of market data. Join us to learn how Omega Point built a next-generation investment analytics platform to isolate critical market signals from noise with a big data architecture built with Apache Spark on Databricks.
  • Databricks' Data Pipelines: Journey and Lessons Learned Recorded: Aug 4 2016 58 mins
    Burak Yavuz
    With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform.

    In this webinar, we discuss the role and importance of ETL and the common features of an ETL pipeline. We will then show how the same ETL fundamentals are applied and (more importantly) simplified within Databricks' data pipelines. By using Apache Spark as its foundation, we can simplify our ETL processes with one framework. With Databricks, you can develop your pipeline code in notebooks, create Jobs to productionize your notebooks, and use REST APIs to turn all of this into a continuous integration workflow. We will share tips and tricks for doing ETL with Spark and lessons learned from our own pipeline.
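
    A minimal sketch of the kind of ETL step described above, with hypothetical paths and column names:

```python
# One illustrative ETL step: read raw JSON, filter and enrich, write
# partitioned Parquet. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-step").getOrCreate()

raw = spark.read.json("/mnt/raw/events/")
cleaned = (raw
           .filter(F.col("event_type").isNotNull())
           .withColumn("date", F.to_date(F.col("timestamp"))))
cleaned.write.mode("append").partitionBy("date").parquet("/mnt/curated/events/")
```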
  • How DNV GL is removing analytic barriers in the energy industry with Databricks Recorded: Jul 20 2016 55 mins
    Jonathan Farland, Senior Data Scientist, DNV GL and Kyle Pistor, Solutions Engineer, Databricks
    Smart meter sensor data presents tremendous opportunities for the energy industry to better understand customers and anticipate their needs. With smart meter data, energy industry analysts and utilities can use hourly readouts to gain high-resolution insights into energy consumption patterns across structures and customer types, as well as near real-time insights into grid operations.

    Join Jonathan Farland, a technical consultant at DNV GL Energy, to learn how this globally renowned energy company is processing data at scale and mining deeper insights by leveraging statistical learning techniques. In this talk, Jon will share how DNV GL is using Apache Spark and Databricks to turn smart meter data into insights to better serve their customers by:

    - Accelerating data processing compared to competing platforms, at times by nearly 100x, without incurring additional operational costs.
    - Scaling to any size on-demand while being able to decouple compute and storage resources to minimize operational expense.
    - Eliminating the need to spend time on DevOps, allowing their data scientists and engineers to focus on solving data problems.
  • Better Sales Performance with Databricks Recorded: Jun 23 2016 57 mins
    Justin Mills and Anna Holschuh of Yesware
    In this webinar, you will learn how Yesware used Databricks to radically improve the reliability, scalability, and ease of development of Yesware’s Apache Spark data pipeline. Specifically the Yesware team will cover the workflow of taking an idea from the prototyping stage in a Databricks notebook to the final, fully-tested, peer reviewed, and versioned production feature that produces high quality data for Yesware customers on a daily basis.
  • Deep Dive: Apache Spark Memory Management Recorded: Jun 15 2016 43 mins
    Andrew Or
    Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
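
    For context, the unified memory manager this design work converged on (Spark 1.6+) is governed by a couple of settings; the values below are illustrative, not recommendations:

```python
# Illustrative knobs for Spark's unified memory manager (Spark 1.6+).
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.memory.fraction", "0.6")          # heap share for execution + storage
        .set("spark.memory.storageFraction", "0.5"))  # storage share protected from eviction
```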
  • Productionizing your Streaming Jobs Recorded: May 26 2016 60 mins
    Prakash Chockalingam
    Apache Spark™ Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming (a minimal code sketch follows the list):

    - Motivation and most common use cases for Spark Streaming
    - Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
    - Performance Optimization Techniques
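
    Here is the promised minimal sketch: the classic socket word count as a Spark Streaming job (host and port are placeholders):

```python
# Minimal DStream word count over 10-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=10)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```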
  • Enabling Exploratory Analysis of Large Data with Apache Spark and R Recorded: May 19 2016 60 mins
    Hossein Falaki
    R has evolved to become an ideal environment for exploratory data analysis. The language is highly flexible: there is an R package for almost any algorithm, and the environment comes with integrated help and visualization. SparkR brings distributed computing and the ability to handle very large data to this list. SparkR is an R package distributed within Apache Spark. It exposes Spark DataFrames, which were inspired by R data.frames, to R. With Spark DataFrames and Spark's in-memory computing engine, R users can interactively analyze and explore terabyte-size datasets.

    In this webinar, Hossein will introduce SparkR and how it integrates the two worlds of Spark and R. He will demonstrate one of the most important use cases of SparkR: the exploratory analysis of very large data. Specifically, he will show how Spark’s features and capabilities, such as caching distributed data and integrated SQL execution, complement R’s great tools such as visualization and diverse packages in a real world data analysis project with big data.
  • Apache Spark 2.0: Faster, Easier, and Smarter Recorded: May 5 2016 61 mins
    Reynold Xin
    In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.

    The major themes for Spark 2.0 are:
    - Unified APIs: Emphasis on building up higher-level APIs, including the merging of the DataFrame and Dataset APIs
    - Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames, allowing us to unify streaming, interactive, and batch queries (sketched below)
    - Tungsten Phase 2: Speed up Apache Spark by 10X
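
    As a sketch of the "continuous application" idea, the snippet below treats files arriving in a directory as an unbounded table and queries it with the ordinary DataFrame API; the directory and schema are hypothetical:

```python
# Structured Streaming sketch (Spark 2.0+): a streaming DataFrame is
# queried with the same API as a batch DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("structured-streaming").getOrCreate()

schema = StructType().add("user", StringType()).add("action", StringType())
events = spark.readStream.schema(schema).json("/data/in/")

counts = events.groupBy("action").count()   # identical to the batch API

query = (counts.writeStream
               .outputMode("complete")      # emit full updated counts each trigger
               .format("console")
               .start())
query.awaitTermination()
```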
  • GraphFrames: DataFrame-based graphs for Apache® Spark™ Recorded: Apr 14 2016 53 mins
    Joseph Bradley
    GraphFrames bring the power of Apache Spark DataFrames to interactive analytics on graphs.

    Expressive motif queries simplify pattern search in graphs, and DataFrame integration allows seamlessly mixing graph queries with Spark SQL and ML. By leveraging Catalyst and Tungsten, GraphFrames provide scalability and performance. Uniform language APIs expose the full functionality of GraphX to Java and Python users for the first time.

    In this talk, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. This talk will be generally accessible, covering major improvements from GraphX and providing resources for getting started. A running example of analyzing flight delays will be used to explain the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms.

    For experts, this talk will also include a few technical details on design decisions, the current implementation, and ongoing work on speed and performance optimizations.
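
    As a taste of the API described above, here is a hedged GraphFrames sketch (it assumes an existing SparkSession `spark` and the graphframes Spark package; the flight-delay data is a toy stand-in):

```python
# GraphFrames sketch: build a graph from two DataFrames, run a motif
# query mixed with a SQL-style filter, and call a graph algorithm.
from graphframes import GraphFrame

vertices = spark.createDataFrame([("SFO",), ("JFK",), ("ORD",)], ["id"])
edges = spark.createDataFrame(
    [("SFO", "JFK", 41), ("JFK", "ORD", 10), ("ORD", "SFO", -3)],
    ["src", "dst", "delay"])

g = GraphFrame(vertices, edges)

# Motif finding: two-hop routes a -> b -> c, filtered with a DataFrame expression.
g.find("(a)-[e1]->(b); (b)-[e2]->(c)").filter("e1.delay > 0").show()

# Standard algorithms are exposed on the same graph, e.g. PageRank.
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()
```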
  • Not Your Father's Database Recorded: Apr 7 2016 51 mins
    Vida Ha
    This session will cover a series of use cases where you can store your data cheaply in files and analyze the data with Apache Spark, as well as use cases where you want to store your data into a different data source to access with Spark DataFrames. Here’s an example outline of some of the topics that will be covered in the talk:

    Use cases to store in file systems to use with Apache Spark:

    1. Analyzing a large set of data files.
    2. Doing ETL of a large amount of data.
    3. Applying Machine Learning & Data Science to a large dataset.
    4. Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally.

    Use cases to store your data into databases for use with Apache Spark (a short sketch contrasting the two patterns follows this list):

    1. Random access, frequent inserts, and updates of rows of SQL tables. Databases have better performance for these use cases.
    2. Supporting incremental updates from databases into Spark. It's not performant to update Spark SQL tables backed by files. Instead, you can use message queues with Spark Streaming, or do an incremental select, to make sure your Spark SQL tables stay up to date with your production databases.
    3. External reporting with many concurrent requests. While Spark's ability to cache your file data in memory will allow you to get back to fast interactive querying, that may not be optimal for supporting many concurrent requests. If you have many concurrent users to support, it's better to use Spark to ETL your data into summary tables (or some other format) in a traditional database and serve your reports from there.
    4. Searching content. A Spark job can certainly be written to filter or search for any content in files that you'd like, but Elasticsearch is a specialized engine designed to return search results more quickly.
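
    The contrast between the two patterns can be sketched in a few lines of PySpark (assuming an existing SparkSession `spark`; paths, table names, and credentials are placeholders):

```python
# File-backed analysis: read Parquet directly and aggregate.
logs = spark.read.parquet("/data/logs/")
logs.groupBy("status").count().show()

# Database-backed access: expose a SQL table as a DataFrame via JDBC,
# then mix it freely with the file-backed data in one query.
users = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://db-host:5432/prod")
         .option("dbtable", "users")
         .option("user", "analyst")
         .option("password", "...")
         .load())
logs.join(users, "user_id").show()
```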
  • Just-in-Time Data Warehousing on Databricks: CDC and Schema On Read Recorded: Mar 7 2016 60 mins
    Jason Pohl
    In this webcast, Jason Pohl, Solution Engineer from Databricks, will cover how to build a Just-in-Time Data Warehouse on Databricks, with a focus on performing Change Data Capture from a relational database and joining that data to a variety of data sources. Not only do Apache Spark and Databricks allow you to do this more easily with less code, the routine will also automatically ingest changes to the source schema.

    Highlights of this webinar include:
    1. Starting with a Databricks notebook, Jason will build a classic Change Data Capture (CDC) ETL routine to extract data from an RDBMS.

    2. A deep-dive into selecting a delta of changes from tables in an RDBMS, writing it to Parquet, and querying it using Spark SQL.

    3. A demonstration of how to apply a schema at read time rather than before write (sketched below).
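
    A hedged sketch of the CDC pattern described above (connection details and table/column names are hypothetical; assumes an existing SparkSession `spark`):

```python
# Pull only rows changed since the last run via a JDBC pushdown query,
# land them as Parquet, and query with schema-on-read.
last_ts = "2016-03-01 00:00:00"   # high-water mark from the previous run

delta = (spark.read.format("jdbc")
         .option("url", "jdbc:mysql://db-host:3306/sales")
         .option("dbtable",
                 "(SELECT * FROM orders WHERE updated_at > '{}') t".format(last_ts))
         .load())

delta.write.mode("append").parquet("/mnt/warehouse/orders/")

# Schema-on-read: Parquet files carry their schema; register and query.
spark.read.parquet("/mnt/warehouse/orders/").createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) FROM orders").show()
```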
  • Apache® Spark™ MLlib: From Quick Start to Scikit-Learn Recorded: Feb 24 2016 59 mins
    Joseph Bradley
    In this webcast, Joseph Bradley from Databricks will be speaking about Apache Spark’s distributed Machine Learning Library - MLlib.

    We will start with a quick primer on machine learning and Spark MLlib, along with an overview of some Spark machine learning use cases. We will continue with several MLlib quick-start demos. Afterwards, the talk will turn to integrating common data science tools like Python pandas, scikit-learn, and R with MLlib.
  • Jump Start into Apache® Spark™ and Databricks Recorded: Feb 11 2016 61 mins
    Denny Lee
    Denny Lee, Technology Evangelist with Databricks, will provide a jump start into Apache Spark and Databricks. Spark is a fast, easy-to-use, and unified engine that allows you to solve many data science and big data (and many not-so-big-data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.

    This introductory level jump start will focus on the following scenarios:
    - Quick Start on Spark: Provides an introductory quick start to Spark using Python and Resilient Distributed Datasets (RDDs). We will review how RDDs have actions and transformations and their impact on your Spark workflow (a tiny RDD sketch follows this list).
    - A Primer on RDDs to DataFrames to Datasets: This will provide a high-level overview of our journey from RDDs (2011) to DataFrames (2013) to the newly introduced (as of Spark 1.6) Datasets (2015).
    - Just in Time Data Warehousing with Spark SQL: We will demonstrate a Just-in-Time Data Warehousing (JIT-DW) example using Spark SQL on an AdTech scenario. We will start with weblogs, create an external table with RegEx, make an external web service call via a Mapper, join DataFrames and register a temp table, add columns to DataFrames with UDFs, use Python UDFs with Spark SQL, and visualize the output - all in the same notebook.
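
    As promised above, a tiny RDD sketch for the first scenario (the log path is a placeholder):

```python
# RDD quick-start: transformations are lazy, actions trigger execution.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-quickstart")
lines = sc.textFile("/data/weblogs/*.log")

errors = lines.filter(lambda line: "ERROR" in line)  # transformation: builds a plan
print(errors.count())                                # action: runs the job
print(errors.take(3))                                # action: fetches a sample
```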
  • How Celtra Optimizes its Advertising Platform with Databricks Recorded: Dec 9 2015 59 mins
    Grega Kešpret
    Leading brands such as Pepsi and Macy’s use Celtra’s technology platform for brand advertising. To inform better product design and resolve issues faster, Celtra relies on Databricks to gather insights from large-scale, diverse, and complex raw event data. Learn how Celtra uses Databricks to simplify their Apache Spark deployment, achieve faster project turnaround time, and empower people to make data-driven decisions.

    In this webinar, you will learn how Databricks helps Celtra to:
    - Utilize Apache Spark to power their production analytics pipeline.
    - Build a “Just-in-Time” data warehouse to analyze diverse data sources such as Elastic Load Balancer access logs, raw tracking events, operational data, and reportable metrics.
    - Go beyond simple counting and group events into sequences (i.e., sessionization) and perform more complex analysis such as funnel analytics.
  • Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell Recorded: Dec 1 2015 61 mins
    Patrick Wendell
    In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.

    Spark 1.6 will include (among other things): a type-safe API called Dataset, built on top of DataFrames, that leverages all the work in Project Tungsten for more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999]; adaptive query execution [SPARK-9850]; and unified memory management that consolidates cache and execution memory [SPARK-10000].
Make big data simple
Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed.

Databricks is the largest contributor to the open source Apache Spark project, contributing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark and has the largest number of customers deploying Spark to date. Databricks provides a just-in-time data platform to simplify data integration, real-time experimentation, and robust deployment of production applications.
