Denny Lee, Technology Evangelist with Databricks, will provide a jump start into Apache Spark and Databricks. Spark is a fast, easy-to-use, unified engine that lets you solve many data science and big data (and many not-so-big data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
This introductory level jump start will focus on the following scenarios:
- Quick Start on Spark: Provides an introductory quick start to Spark using Python and Resilient Distributed Datasets (RDDs). We will review how RDDs have actions and transformations and their impact on your Spark workflow.
- A Primer on RDDs to DataFrames to Datasets: This will provide a high-level overview of our journey from RDDs (2011) to DataFrames (2013) to the newly introduced (as of Spark 1.6) Datasets (2015).
- Just in Time Data Warehousing with Spark SQL: We will demonstrate a Just-in-Time Data Warehousing (JIT-DW) example using Spark SQL on an AdTech scenario. We will start with weblogs, create an external table with RegEx, make an external web service call via a Mapper, join DataFrames and register a temp table, add columns to DataFrames with UDFs, use Python UDFs with Spark SQL, and visualize the output - all in the same notebook.
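To make the distinction between transformations and actions concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's API: the `LazyDataset` class and its methods are hypothetical stand-ins that only mimic how transformations are recorded lazily and how each action triggers evaluation.

```python
class LazyDataset:
    """Toy, single-machine stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection
        self._steps = steps    # recorded transformations, not yet applied

    # Transformations: record the step, return a new dataset, do no work.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Replay every recorded step over the source, one item at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a "filter" step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger evaluation of the whole recorded pipeline.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # tracks how often the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


evens_squared = LazyDataset(range(6)).map(square).filter(lambda x: x % 2 == 0)
print(len(calls))               # 0: transformations alone do no work
print(evens_squared.collect())  # [0, 4, 16]
print(evens_squared.count())    # 3
print(len(calls))               # 12: each action re-ran the whole pipeline
```

The last line hints at the workflow impact: in this toy model, as with an uncached RDD, every action recomputes the pipeline from the source, which is why persisting an intermediate result can matter when several actions reuse it.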
Recorded Feb 11, 2016 (61 mins)
Frank Austin Nothaft, Genomics Data Engineer at Databricks
With the drastic drop in the cost of sequencing a single genome, many organizations across biotechnology, pharmaceuticals, biomedical research, and agriculture have begun to make use of genome sequencing. While the sequence of a single genome may provide insight about the individual who was sequenced, to derive maximal insight from the genomic data, the ultimate goal is to query across a cohort of many hundreds to thousands of individuals.
Join this webinar to learn how Databricks — powered by Apache Spark — enables queries across a database of genomics in interactive time and simplifies the application of machine learning models and statistical tests to genomics data across patients, to derive more insight into the biological processes driven by genomic alterations.
In this webinar, we will:
- Demonstrate how Databricks can rapidly query annotated variants across a cohort of 1,000 samples.
- Look at a case study using Databricks to improve the performance of running an expression quantitative trait loci (eQTL) test across samples from the GEUVADIS project.
- Show how we can parallelize conventional genomics tools using Databricks.
Saket Mengle, Senior Principal Data Scientist at DataXu
The central premise of DataXu is to apply data science to better marketing. At its core is the Real-Time Bidding Platform, which processes 2 petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 continents. Built on top of this platform is DataXu’s analytics engine, which gives clients insightful analytics reports that address their marketing business questions. Common requirements for both platforms are real-time processing, scalable machine learning, and ad-hoc analytics.
This webinar will showcase DataXu’s successful use cases for the Apache® Spark™ framework and Databricks, which address all of the above challenges while maintaining the agility and rapid prototyping strengths needed to take a product from the initial R&D phase to full production.
We will also discuss in detail:
- Challenges of using Apache Spark in a petabyte-scale machine learning system and how we worked to solve them.
- Best practices for large-scale Spark ETL processing and model testing, all the way through to interactive analytics.
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Apache® Spark™ has become an indispensable tool for data science teams. Its performance and flexibility enable data scientists to do everything from interactive exploration and feature engineering to model tuning with ease. In this webinar, Maddie Schults, Databricks product manager, will discuss how Databricks allows data science teams to use Apache Spark for their day-to-day work.
You will learn:
- Obstacles faced by data science teams in the era of big data;
- How Databricks simplifies Spark development;
- A demonstration of key Databricks functionalities that help data scientists become more productive.
Apache Spark has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications?
In this webinar, we will discuss best practices from Databricks on how our customers productionize machine learning models, do a deep dive with actual customer case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Francis Lau, Senior Director, Product Intelligence at Smartsheet
Apache Spark is red hot, but without the requisite skill sets it can be a challenge to operationalize, making it difficult to build a robust production data pipeline that business users and data scientists across your company can use to unearth insights.
Smartsheet is the world’s leading SaaS platform for managing and automating collaborative work. With over 90,000 companies and millions of users, it helps teams get work done ranging from managing simple task lists to orchestrating the largest sporting events and construction projects.
In this webinar, you will learn how Smartsheet uses Databricks to overcome the complexities of Spark to build their own analysis platform that enables self-service insights at will, scale, and speed to better understand their customers’ diverse use cases. They will share valuable patterns and lessons learned in both technical and adoption areas to show how they achieved this, including:
- How to build a robust metadata-driven data pipeline that processes application and business systems data to provide a 360 view of customers and to drive smarter business systems integrations.
- How to provide an intuitive and valuable “pyramid” of datasets usable by both technical and business users.
- Their roll-out approach and materials used for company-wide adoption allowing users to go from zero to insights with Spark and Databricks in minutes.
The Apache® Spark™ compute engine has gone viral: not only is it the most active Apache big data open source project, but it is also the fastest growing big data analytics workload, on and off Hadoop. The major reason behind Spark’s popularity with developers and enterprises is its flexibility to support a wide range of workloads, including SQL queries, machine learning, streaming, and graph analysis.
This webinar features Ovum analyst Tony Baer, who will explain the real-world benefits to practitioners and enterprises when they build a technology stack based on a unified approach with Apache Spark.
This webinar will cover:
- Findings around the growth of Spark and diverse applications using machine learning and streaming.
- The advantages of using Spark to unify all workloads, rather than stitching together many specialized engines like Presto, Storm, MapReduce, Pig, and others.
- Use case examples that illustrate the flexibility of Spark in supporting various workloads.
In the Apache® Spark™ 2.x releases, Machine Learning (ML) is focusing on DataFrame-based APIs. This webinar is aimed at helping users take full advantage of the new APIs. Topics will include migrating workloads from RDDs to DataFrames, ML persistence for saving and loading models, and the roadmap ahead.
Since its release, Apache Spark has quickly become the fastest growing big data processing engine. But few companies have the domain expertise and resources to build their own Spark-based infrastructure, often resulting in a mix of tools that are complex to stand up and time-consuming to maintain.
There are several cloud-based platforms available that allow you to harness the power of Spark while reaping the advantages of the cloud. This webinar features ESG Global senior analyst Nik Rouda, who will share research and best practices to help decision makers evaluate the most popular cloud-based Apache Spark solutions and understand the differences between them.
Apache Spark has become an indispensable tool for data engineering teams. Its performance and flexibility have made ETL one of Spark’s most popular use cases. In this webinar, Prakash Chockalingam, seasoned data engineer and PM, will discuss how Databricks allows data engineering teams to overcome common obstacles while building production-quality data pipelines with Spark. Specifically, you will learn:
- Obstacles faced by data engineering teams while building ETL pipelines;
- How Databricks simplifies Spark development;
- A demonstration of key Databricks functionalities geared towards making data engineers more productive.
Edmunds.com is a leading online car information and shopping marketplace serving nearly 20 million visitors to its website each month. With 10x growth in data, to hundreds of terabytes, over the past four years, their engineering team was looking for ways to increase consumer engagement and conversion by improving the data integrity of Edmunds' website.
Databricks simplifies the management of their Apache Spark infrastructure while accelerating data exploration at scale. Now they can quickly analyze large datasets to determine the best sources for car data on their website.
In this webinar, you will learn:
- Why Edmunds.com moved from MapReduce to Databricks for ad hoc data exploration.
- How Databricks democratized data access across teams to improve decision making and feature innovation.
- Best practices for doing ETL and building a robust data pipeline with Databricks.
Omer Cedar, CEO, Omega Point and Eran Cedar, CTO, Omega Point
Omega Point uses big data analytics to enable investment professionals to reduce risk while increasing their returns. Databricks enables Omega Point to uncover performance drivers of investment portfolios using massive volumes of market data. Join us to learn how Omega Point built a next-generation investment analytics platform to isolate critical market signals from noise with a big data architecture built with Apache Spark on Databricks.
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform.
In this webinar, we discuss the role and importance of ETL and the common features of an ETL pipeline. We will then show how the same ETL fundamentals are applied and (more importantly) simplified within Databricks’ data pipelines. With Apache Spark as the foundation, we can simplify our ETL processes using one framework. With Databricks, you can develop your pipeline code in notebooks, create Jobs to productionize your notebooks, and utilize REST APIs to turn all of this into a continuous integration workflow. We will provide tips and tricks for doing ETL with Spark and lessons learned from our pipeline.
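For readers new to the pattern, the extract, transform, and load stages can be sketched in a few lines of plain Python. This is only an illustration of the general ETL shape using the standard library, not the Databricks pipeline itself; the sample data, the `purchases` table, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read files or a message queue.
RAW_CSV = """user_id,event,amount
1,purchase,19.99
2,view,
1,purchase,5.01
3,purchase,not-a-number
"""


def extract(text):
    """Extract: read raw rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep purchases with a valid amount and cast the types."""
    clean = []
    for row in rows:
        if row["event"] != "purchase":
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop malformed records instead of failing the run
        clean.append((int(row["user_id"]), amount))
    return clean


def load(records, conn):
    """Load: write the cleaned records into a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT user_id, ROUND(SUM(amount), 2) FROM purchases "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)  # [(1, 25.0)]
```

Keeping each stage a separate function makes it easy to test the stages in isolation, which is the same property that lets notebook code be promoted into scheduled jobs.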
Jonathan Farland, Senior Data Scientist, DNV GL and Kyle Pistor, Solutions Engineer, Databricks
Smart meter sensor data presents tremendous opportunities for the energy industry to better understand their customers and anticipate their needs. With smart meter data, energy industry data analysts and utilities are able to use hourly readouts to gain high resolution insights into energy consumption patterns across structures and customer types, and in addition gain near real time insights into grid operations.
Join Jonathan Farland, a technical consultant at DNV GL Energy, to learn how this globally renowned energy company is processing data at scale and mining deeper insights by leveraging statistical learning techniques. In this talk, Jon will share how DNV GL is using Apache Spark and Databricks to turn smart meter data into insights to better serve their customers by:
- Accelerating data processing compared to competing platforms, at times by nearly 100x, without incurring additional operational costs.
- Scaling to any size on-demand while being able to decouple compute and storage resources to minimize operational expense.
- Eliminating the need to spend time on DevOps, allowing their data scientists and engineers to focus on solving data problems.
In this webinar, you will learn how Yesware used Databricks to radically improve the reliability, scalability, and ease of development of its Apache Spark data pipeline. Specifically, the Yesware team will cover the workflow of taking an idea from the prototyping stage in a Databricks notebook to the final, fully tested, peer-reviewed, and versioned production feature that produces high-quality data for Yesware customers on a daily basis.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
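As a rough worked example of the execution/storage split, the sketch below estimates region sizes under the unified memory manager introduced in Spark 1.6. It assumes the defaults documented for Spark 2.x (`spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5, and a fixed 300 MiB reservation); the function name is hypothetical, and the talk itself may present different figures, so check the values for your Spark version.

```python
RESERVED_MIB = 300  # fixed reservation assumed for Spark's unified memory manager


def unified_memory_regions(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate memory regions (in MiB) for a given executor heap size.

    memory_fraction models spark.memory.fraction and storage_fraction models
    spark.memory.storageFraction; both defaults are assumptions taken from
    the Spark 2.x configuration documentation.
    """
    if heap_mib <= RESERVED_MIB:
        raise ValueError("heap must be larger than the reserved memory")
    unified = (heap_mib - RESERVED_MIB) * memory_fraction
    storage = unified * storage_fraction
    return {
        "unified": unified,                        # shared by execution and storage
        "storage_protected": storage,              # cached blocks immune to eviction
        "execution_initial": unified - storage,    # execution can borrow beyond this
    }


regions = unified_memory_regions(4096)  # a 4 GiB executor heap
for name, size in regions.items():
    print(f"{name}: {size:.1f} MiB")
```

For a 4 GiB heap this gives roughly 2277.6 MiB of unified memory, of which about 1138.8 MiB is protected for storage. The boundary is soft: execution may borrow unused storage memory and vice versa, which is the main usability difference from a fixed split.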
Apache Spark™ Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed.
Databricks is the largest contributor to the open source Apache Spark project, providing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark and has the largest number of customers deploying Spark to date. Databricks provides a just-in-time data platform to simplify data integration, real-time experimentation, and robust deployment of production applications.