Spark Structured Streaming on the Cloud: Introduction to Internals
Apache Spark has been gaining momentum rapidly, both in the headlines and in real-world adoption. Spark was developed in 2009 and open sourced in 2010. Since then, it has grown into one of the largest open source communities in big data, with over 200 contributors from more than 50 organizations. This open source analytics engine stands out for its ability to process large volumes of data significantly faster than contemporaries such as MapReduce, owing primarily to its in-memory computation model. Notably, one of the top real-world industry use cases for Apache Spark is processing streaming data.
With so much data being processed daily, it has become essential for companies to stream and analyze it all in real time, and Spark Streaming has the capability to handle this extra workload. Some experts even theorize that Spark could become the go-to platform for stream-computing applications of every type, because Spark Streaming unifies disparate data processing capabilities, allowing developers to use a single framework for all their processing needs. Common ways businesses use Spark Streaming today include streaming ETL, data enrichment, trigger event detection, and complex session analysis. In this webinar, we will cover an introduction to Structured Streaming in Spark, its internals, and industry use cases.
- Understanding of Data Processing Architecture
- Why and When to use Spark’s Structured Streaming
- Spark’s Structured Streaming Programming Paradigm
- Internals of Spark’s Structured Streaming
- Spark Structured Streaming in the Real World – examples of how customers of Qubole use it
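The programming paradigm covered above rests on one idea: treat a stream as an unbounded table and incrementally update a result as each micro-batch of new rows arrives. The following is a plain-Python sketch of that incremental model, with no Spark dependency; it is illustrative only, and real Structured Streaming jobs would use the `pyspark` API instead of these hand-rolled functions.

```python
# Plain-Python sketch of the micro-batch model behind Structured Streaming:
# each batch of new records folds into a running aggregate, the same way
# Spark incrementally maintains a result table over an unbounded input table.
# (Illustrative only; real jobs use the pyspark Structured Streaming API.)
from collections import Counter

def process_batch(running_counts, batch):
    """Fold one micro-batch of text lines into the running word counts."""
    for line in batch:
        running_counts.update(line.split())
    return running_counts

# Two micro-batches arriving over time from a hypothetical source.
counts = Counter()
counts = process_batch(counts, ["spark streams data", "spark scales"])
counts = process_batch(counts, ["data streams"])
print(counts["spark"])    # 2
print(counts["streams"])  # 2
```

The key property mirrored here is that the result after two batches is identical to what a batch job over all the data would produce, which is exactly the consistency guarantee Structured Streaming's incremental execution aims for.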
Recorded: Feb 7, 2018 | 30 mins
Nate Shea-han, Americas Global Black Belt, Data & AI at Microsoft and Shaun Van Staden, Solutions Architect at Qubole
Becoming more competitive with big data today means having the right technology to uncover new insights from your data and make critical business decisions in real time. Qubole and Microsoft help companies activate their big data in the cloud to uncover insights that improve customer engagement, increase revenue, and lower costs.
Join experts from Qubole and Microsoft as they discuss how to activate your big data and how to get the most out of open source technologies on the cloud. In this webinar, you'll learn:
- How to modernize with data lakes and data warehouses on the cloud
- Strategies for getting more business value from machine learning and advanced analytics with Qubole on Azure
- How to reduce costs, control risks, and improve data governance as you build your data pipelines
- The importance of data security and privacy
- Real world examples of successful companies activating their big data
Americas Global Black Belt, Data & AI at Microsoft
Nate Shea-han has been with Microsoft for 14 years and has spent the last 8 years helping Microsoft customers transform their businesses in the cloud on the Azure platform. He currently has responsibilities across the United States, Canada, and Latin America for Microsoft's AI, big data, and analytics offerings. Nate has also worked extensively with the Microsoft partner community.
Shaun Van Staden
Solutions Architect, Qubole
Shaun Van Staden has 19 years of experience in enterprise software managing advanced analytics projects, as a developer, DBA, business analyst and now a solutions architect. As a solutions architect manager, Shaun is responsible for supporting business development and sales at Qubole and helping customers transform their use cases for the cloud. Prior to Qubole, Shaun worked as a solutions architect at NICE Systems and Merced Systems (acquired by NICE).
Big data technologies can be complex and can involve time-consuming manual processes. Organizations that intelligently automate big data operations lower their costs, make their teams more productive, scale more efficiently, and reduce the risk of failure.
In our webinar, representatives from TiVo, creator of a digital recording platform for television content, will explain how they implemented a new big data and analytics platform that dynamically scales in response to changing demand. You’ll learn how the solution enables TiVo to easily orchestrate big data clusters using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EC2 Spot instances that read data from a data lake on Amazon Simple Storage Service (Amazon S3), and how this reduces the development cost and effort needed to support its network and advertiser users. TiVo will share lessons learned and best practices for quickly and affordably ingesting, processing, and making terabytes of streaming and batch viewership data from millions of households available for analysis.
Join our webinar to learn:
- How to dramatically reduce management complexities for big data analytics operations on AWS.
- Best practices for optimizing data lakes for self-service analytics that enable teams to productionize data science and accelerate data pipelines.
- About using Qubole’s auto-scaling to reduce the complexity and deployment time of big data projects.
- How to reduce the cost of big data workloads with Qubole’s automated Spot Instance Bidding and management.
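The auto-scaling described above can be pictured as a simple control loop: compare demand (such as pending work) against current capacity and resize the cluster within configured bounds. The sketch below is a hypothetical illustration of that idea in plain Python; the function name, thresholds, and sizing rule are invented for this example and are not Qubole's actual algorithm.

```python
# Hypothetical auto-scaling decision function: size the cluster to the
# pending workload, clamped between configured min and max node counts.
# Names and the sizing rule are invented for illustration; this is NOT
# Qubole's actual auto-scaling algorithm.

def target_nodes(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return the node count needed for the pending work, within bounds."""
    needed = -(-pending_tasks // tasks_per_node)  # ceiling division
    return max(min_nodes, min(needed, max_nodes))

print(target_nodes(pending_tasks=95, tasks_per_node=10, min_nodes=2, max_nodes=8))  # 8 (capped at max)
print(target_nodes(pending_tasks=25, tasks_per_node=10, min_nodes=2, max_nodes=8))  # 3
print(target_nodes(pending_tasks=0,  tasks_per_node=10, min_nodes=2, max_nodes=8))  # 2 (floor at min)
```

Clamping to a floor and ceiling is what keeps such a loop both cost-bounded and responsive: the cluster never shrinks below a serviceable minimum and never grows past a spending cap, even during demand spikes.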
In this keynote, Ashish Thusoo, CEO of Qubole, discusses the gap that enterprises face today when activating their big data. He makes a case for the shift that organizations need to make towards a big data activation strategy in order to put their data assets to use for differentiating and achieving business objectives. The session also covers key elements of big data activation, supported by usage trends from Qubole's cloud-native big data activation platform. He presents various ways enterprises can measure their own activation readiness and demonstrates why Qubole provides the right approach to big data activation.
Sumit Gupta, VP, AI, Machine Learning and HPC, IBM Cognitive Systems
From chat bots to recommendation engines to Google Voice and Apple Siri, AI has begun to permeate our lives. In this keynote, IBM's Sumit Gupta demystifies what AI is, presents the differences between machine learning and deep learning, explains why interest has surged now, shows some fun use cases and demos, and then discusses how deep-learning-based AI methods can be used to garner insights from enterprise data. Sumit also talks about what IBM is doing to make deep learning and machine learning more accessible and useful to a broader set of data scientists.
Jose Villacis, Senior Director of Product Marketing, Qubole
Every CEO aspires to create a data-driven culture that can activate hundreds or thousands of users and petabyte-scale data to continuously deliver real business value. This keynote panel explores the journeys of four companies, Comcast, Qubole, Fanatics, and MediaMath, whose successes and challenges are chronicled in two O'Reilly Media books about creating a data-driven enterprise. The panelists discuss not just their technology strategies and choices but also how data-driven insights are powering their businesses and transforming the competitive dynamics of their industries.
In this webinar, we introduce you to the self-managing, self-optimizing implementation of the Apache Presto open source project. Qubole's Presto-as-a-Service is primarily intended for Data Analysts who need to translate business questions into SQL queries. Since the questions are often ad-hoc, there is some trial and error involved and arriving at the final results may involve a series of SQL queries. By reducing the response time of these queries, the Qubole platform can reduce the time to insight and greatly benefit the business. Besides performance efficiencies, users benefit substantially from multiple supported data formats, continuous auto-scaling and efficient management of Presto clusters, improved user experience, and tightened security.
In this webinar, we cover:
* Presto-as-a-Service on Qubole
* How Qubole Presto is different from OSS Presto
* How Qubole Presto is different from EMR Presto, Athena
* Demo on:
- Join Reorder, DF
- S3 Optimization
- Auto Scaling
- Spot Support
* Customer Use Cases
Ashish Thusoo, CEO and Co-Founder, Qubole, and José Villacís, Senior Director, Product Marketing, Qubole
Join our Big Data Activation Report Webinar where our CEO Ashish Thusoo will go in-depth into our 2018 Qubole Big Data Activation Report findings and share how customers are using multiple engines to get the most out of their big data.
The report analyzes usage data from over 200 Qubole customers to provide answers to key questions such as:
- How fast is usage of open source big data engines like Apache Spark, Presto and Apache Hive/Hadoop growing?
- What engines are used most and for what?
- What engines and big data tools are rising stars?
- How successful are companies at providing their users access to data?
- What are the cost saving benefits of doing big data in the cloud?
You'll come away with both hard data and a few ideas for how to get more out of your big data initiatives.
Sasha will tell the story of building an end-to-end data product that feeds various parts of the Return Path business to optimize email programs for marketers. The session covers the discovery, development, and production of an email classification model that uses Apache Spark to fit classifiers such as random forests and support vector machines to read email text and classify its content. It will discuss the different methods of hyperparameter tuning and ensembling used, and describe the stages of production, from batch jobs in the Qubole Scheduler and Apache Airflow to streaming in Apache Kafka. Sasha will also reflect on what it means to be a full-stack data scientist, and how data science teams can be empowered to own their own data products.
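The ensembling step mentioned above can be illustrated with a toy majority-vote classifier. The sketch below is plain Python rather than the Spark ML pipeline the talk describes, and the keyword "classifiers" are invented rules standing in for trained models such as random forests or SVMs.

```python
# Toy majority-vote ensemble for email classification. The member
# "classifiers" are hypothetical keyword rules, standing in for trained
# models (random forests, SVMs) in a real Spark ML pipeline.
from collections import Counter

def keyword_rule(keywords):
    """Build a trivial classifier: 'promo' if any keyword appears, else 'other'."""
    def classify(text):
        return "promo" if any(k in text.lower() for k in keywords) else "other"
    return classify

ensemble = [
    keyword_rule({"sale", "discount"}),
    keyword_rule({"offer", "deal"}),
    keyword_rule({"unsubscribe"}),
]

def majority_vote(text):
    """Classify by taking the most common label across ensemble members."""
    votes = Counter(model(text) for model in ensemble)
    return votes.most_common(1)[0][0]

print(majority_vote("Huge sale! Special offer inside"))  # promo
print(majority_vote("Meeting notes attached"))           # other
```

The appeal of voting ensembles, whether over keyword rules or trained models, is that individually weak and differently biased classifiers can produce a more robust combined prediction than any single member.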
James Rowland-Jones, Principal Program Manager, Microsoft
Building a coherent platform for advanced analytics and reporting can feel quite overwhelming. A plethora of choices exists out there, and it can often feel like you are being asked to choose a side - it's almost like being asked to pick your favorite child! However, it doesn't have to be this way. Many of the services in a next-generation platform are actually complementary, working together to deliver your next-generation analytical architecture.
In this session we will walk through the core components of a next-generation analytical platform architecture, discussing key decision points along the way. At the end you will have a clear and concrete understanding of how you can easily stand up an advanced analytical platform in minutes and bring demonstrable value to your users.
Apache Spark applications are difficult to tune for optimal performance, and the use of cloud stores like S3 as a truth-store makes things even more complex. This talk will briefly cover SparkLens (Spark tuning tool), Spark with Rubix (distributed cache), and direct-write for Hive tables and its performance numbers.
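One reason a distributed cache like Rubix helps when S3 is the truth-store is that repeated reads of the same object can be served locally instead of re-fetching from the remote store. The sketch below illustrates that read-through pattern in plain Python; the dictionary "remote store", the object key, and the fetch counter are all invented stand-ins, not Rubix's actual implementation.

```python
# Minimal sketch of a read-through cache, the role a distributed cache like
# Rubix plays for S3 data. The "remote store" here is a dict standing in for
# S3, and the fetch counter shows cache hits avoiding remote reads.
# (Hypothetical illustration only; not how Rubix is implemented.)

REMOTE_STORE = {"s3://bucket/part-0000": b"col1,col2\n1,2\n"}  # stand-in for S3
fetches = {"count": 0}
cache = {}

def read_object(key):
    """Serve from the local cache; fetch from the remote store only on a miss."""
    if key not in cache:
        fetches["count"] += 1            # simulated slow remote read
        cache[key] = REMOTE_STORE[key]
    return cache[key]

read_object("s3://bucket/part-0000")
read_object("s3://bucket/part-0000")  # second read is a local cache hit
print(fetches["count"])  # 1 remote fetch despite two reads
```

In a real cluster the payoff compounds: many tasks across many queries touch the same hot partitions, so a shared local cache cuts both query latency and cloud-storage request costs.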
This presentation will discuss the importance of data quality and outline an approach to assess and measure the quality of product usage event logs. A data quality assessment framework helps build trust in our data and enables analysts to generate a deep understanding of product usage patterns, product stability, and utilization of purchased assets by our customers. Unlocking valuable insights from this data depends on the presence of high quality and complete data sets that provide the ability to link product usage events with back office accounts and entitlement data.
Justin Wainwright, Systems Analyst, Oracle Data Cloud, Oracle
This session highlights the model used within Oracle Data Cloud (ODC) for Apache Hadoop 2 and Apache Spark clusters. We'll talk about taking the guesswork out of cluster design, and about the keys for balancing cost and performance while minimizing administrative overhead.
This session will provide an overview of the enterprise fabric and the encapsulated view of the required capabilities. Some key components of the fabric include data and cognitive technologies. We will dive into the enterprise fabric-based architecture and why it is the core foundation for business transformation.
Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. This talk introduces a selection of Facebook use cases, which range from user-facing reporting applications to multi-hour ETL jobs, then explains the architecture, implementation, features, and performance optimizations that enable Presto to support these use cases.
Tom Silverstrim, Sr. Manager, Adobe Media Optimizer, Adobe Ad Cloud
Qubole has been the data warehouse of the DSP for the last six-plus years, and was selected as the ideal partner for mobilizing the considerable amount of diagnostic and base truth data contained within Amazon S3. From these origins, Qubole now powers our custom reporting infrastructure, machine learning algorithms, and user mapping reports, along with its evolving role in supporting system diagnoses and audits. We will touch on several use cases that demonstrate the flexibility and power of Qubole in democratizing data across the organization.
Erik Zwiefel, Advanced Analytics & AI Architect, Microsoft
The Team Data Science Process (TDSP) is an agile data science process meant to keep data science and business teams working together. In this session, we'll explore TDSP and walk through an example using Azure Machine Learning Services.
Azure Data Lake is one of the most powerful PaaS services that Microsoft Azure offers for managing big data. Built on well-known projects such as HDFS and YARN, it lets you focus on designing the solution instead of administering it. A new language, U-SQL, combines SQL and C# to work with data of any type and size. During the session, we will explore Azure Data Lake Store and Azure Data Lake Analytics, the core components of the Azure Data Lake offering.
Vignesh Sukumar, Senior Manager, Data Engineering, Optimizely
At Optimizely, we receive billions of user click stream events for the thousands of A/B experiments we run for our customers every day. Previously, customer inquiries related to the alignment of key experiment metrics between raw data and experiment results required expensive engineering analysis due to lack of scale and flexibility. In this talk, Vignesh will walk the audience through the journey of how we created a playbook to enhance customer trust in these situations and make self-service scalable for the entire organization.
Rajat Venkatesh, Senior Director of Engineering, Qubole
Popular SQL-on-Hadoop engines such as Spark SQL, Hive, and Presto have become much faster on the cloud. This talk will explore the major features, architectural changes, and best practices that supercharge these SQL engines, and will also peek at upcoming performance-related features.
At our core, we are a team of engineers who live, eat, and sleep big data. We believe that ubiquitous access to information is the key to unlocking a company's success. To achieve this, a big data platform must be agile, flexible, scalable, and proactive to anticipate a company's needs.