Spark Structured Streaming on the Cloud: Introduction to Internals
Apache Spark has been gaining momentum quickly, both in the headlines and in real-world adoption. Spark was developed in 2009 and open sourced in 2010. Since then, it has grown into one of the largest open source communities in big data, with over 200 contributors from more than 50 organizations. This open source analytics engine stands out for its ability to process large volumes of data significantly faster than contemporaries such as MapReduce, primarily because it keeps data in memory within its own processing framework. One of the top real-world industry use cases for Apache Spark is processing streaming data.
With so much data being processed on a daily basis, it has become essential for companies to stream and analyze it in real time, and Spark Streaming can handle this extra workload. Some experts even theorize that Spark could become the go-to platform for stream-computing applications of any type. The reason for this claim is that Spark Streaming unifies disparate data processing capabilities, allowing developers to use a single framework for all their processing needs. Businesses today commonly use Spark Streaming for streaming ETL, data enrichment, trigger event detection, and complex session analysis. In this webinar, we will cover an introduction to Structured Streaming in Spark, its internals, and industry use cases. Topics include:
- Understanding of Data Processing Architecture
- Why and When to use Spark’s Structured Streaming
- Spark’s Structured Streaming Programming Paradigm
- Internals of Spark’s Structured Streaming
- Spark Structured Streaming in the Real World – examples of how customers of Qubole use it
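The programming paradigm named in the topics above treats a stream as a table that keeps growing, with each new batch of records folded into a running result. The following is a minimal pure-Python sketch of that incremental idea, not Spark code: the `RunningCount` class, the `run` helper, and the sample events are hypothetical, and a real job would use Spark's `readStream` and `writeStream` APIs instead.

```python
from collections import Counter
from typing import Dict, Iterable, List


class RunningCount:
    """Toy stand-in for a stateful streaming aggregation (hypothetical, not a Spark API)."""

    def __init__(self) -> None:
        # State carried across micro-batches, loosely analogous to an engine's state store.
        self.state: Counter = Counter()

    def process_batch(self, events: Iterable[str]) -> Dict[str, int]:
        # Fold only the new records into the existing state instead of
        # recomputing over everything seen so far.
        self.state.update(events)
        # Return a snapshot of the full, updated result table.
        return dict(self.state)


def run(batches: List[List[str]]) -> List[Dict[str, int]]:
    """Feed micro-batches in arrival order and collect the result after each one."""
    query = RunningCount()
    return [query.process_batch(batch) for batch in batches]


if __name__ == "__main__":
    # Hypothetical page-view events arriving in two micro-batches.
    for result in run([["home", "cart"], ["home", "checkout"]]):
        print(result)
```

The point of the sketch is that the second batch updates the count for `home` rather than restarting it, which is the behavior a batch-style query over a continuously growing table would produce.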
Recorded: Feb 7 2018, 30 mins
As the volume, variety, and velocity of data increase, the cloud is the most efficient and cost-effective option for machine learning and advanced analytics. Organizations looking to scale their big data projects can do so with greater ease with a cloud-native data platform.
Qubole provides a single platform for data engineers, analysts, and scientists that supports multiple use cases -- from machine learning to predictive analytics. The platform saves organizations up to 50 percent in data processing costs by leveraging multiple engines like Apache Spark, Presto, and Hive, and automatically provisions, manages, and optimizes cloud resources.
Join experts from Qubole as they demonstrate how to get the most out of your data on the cloud. In this webinar, you'll learn:
- The benefits of a single platform and centralized access to data
- How to pick the right data processing engines and tools
- How to save money with intelligent cluster management and financial governance
- Key considerations to evaluate cloud data platforms
As corporations augment their corporate data warehouses and data marts with cloud data lakes in order to support new big data requirements, the question about how to grant governed access to those data lakes becomes more pressing. Certainly, capturing new and different types of data is important but deriving value from those datasets remains the ultimate goal.
Whether the data lake consumers write SQL or leverage third-party BI and visualization tools, what matters is that they can continue to be productive using the skills and tools they already know. The difference is that now those tools and skills should be used with back-end engines that can help them quickly sift through petabytes of data and at the same time provide support for fast interactive queries.
This means that in order for those data lake investments to succeed it is important for data admins to provide: SQL access to all authorized data, support for BI tools, cross-team collaboration capabilities, and governed self-service.
In this webinar we will cover:
- Data collaboration and access using SQL
- Tools that enable fast self-service for different teams
- Considerations for choosing the right SQL back-end for your use case
Data engineers today serve a wider audience than just a few years ago. Companies now need to apply machine learning (ML) techniques to their data in order to remain relevant. Among the new challenges faced by data engineers is the need to build and fill data lakes and to reliably deliver complete, large-volume data sets so that data scientists can train more accurate models.
Aside from dealing with larger data volumes, these pipelines need to be flexible in order to accommodate the variety of data and the high processing velocity required by the new ML applications. Qubole addresses these challenges by providing an auto-scaling cloud-native platform to build and run these data pipelines.
In this webinar we will cover:
- Some of the typical challenges faced by data engineers when building pipelines for machine learning.
- Typical uses of the various Qubole engines to address these challenges.
- Real-world customer examples
The biggest mistake businesses make when spending on data processing services in the cloud is assuming that the cloud will lower their overall cost. While the cloud has the potential to offer better economics in both the short and long term, the bursty nature of big data processing requires following cloud engineering best practices, such as upscaling and downscaling infrastructure and leveraging the spot market for best pricing, to realize such economics.
Businesses also fail to appreciate the potential of runaway costs in a 100% variable cost environment, something they rarely have to worry about in a fixed cost on-premise environment. In the absence of financial governance, companies leave themselves vulnerable to cost overruns where even a single rogue query can result in tens of thousands of dollars in unbudgeted spend.
In this webinar you’ll learn how to:
- Identify areas of cost optimization to drive maximum performance for the lowest TCO
- Monitor total costs at the application, user, and account level
- Provide admins the ability to control and design the infrastructure spend
- Automatically optimize clusters for lower infrastructure spend based on custom-defined parameters
José Villacís, Matthew Settipane, and Jon King from Qubole
With the volume of data and the scale of innovation happening in the cloud, it is only a matter of "when" your big data processing will move to the cloud. When that happens, you need to be ready with your choice of architecture and technology platform.
If you are using Cloudera, Hortonworks or MapR, you should attend this can't-miss webinar to learn best practices in areas such as:
- Difference between hosting an on-premise data platform in the cloud versus adopting a cloud-native architecture for data processing in the cloud
- Avoiding security and cost pitfalls that can derail your migration to the cloud
- Building a platform to cater to the expanding number of active users and data
- Supporting the next generation of machine learning and complex analytics use cases
- Using the scale and flexibility provided by the cloud to implement a data-driven business culture
James E. Curtis, Senior Analyst, Data Platforms & Analytics, 451 Research
The cloud has the potential to deliver on the promise of big data processing for machine learning and analytics to help organizations become more data-driven; however, it presents its own set of challenges.
This webinar covers best practices in areas such as:
- Using automation in the cloud to derive more value from big data by delivering self-service access to data lakes for machine learning and analytics
- Enabling collaboration among data engineers, data scientists, and analysts for end-to-end data processing
- Implementing financial governance to ensure a sustainable program
- Managing security and compliance
- Realizing business value through more users and use cases
In addition, this webinar provides an overview of the capabilities of Qubole’s cloud-native data platform in the areas described above.
About Our Speaker:
James Curtis is a Senior Analyst for the Data, AI & Analytics Channel at 451 Research. He has had experience covering the BI reporting and analytics sector and currently covers Hadoop, NoSQL and related analytic and operational database technologies.
James has over 20 years' experience in the IT and technology industry, serving in a number of senior roles in marketing and communications, touching a broad range of technologies. At iQor, he served as a VP for an upstart analytics group, overseeing marketing for custom, advanced analytic solutions. He also worked at Netezza and later at IBM, where he was a senior product marketing manager with responsibility for Hadoop and big data products. In addition, James has worked at Hewlett-Packard managing global programs and as a case editor at Harvard Business School.
James holds a bachelor's degree in English from Utah State University, a master's degree in writing from Northeastern University in Boston, and an MBA from Texas A&M University.
Prateek Shrivastava, Principal Product Manager, Qubole
Storage and compute are cheaper than ever. As a result, data engineering is undergoing a generational shift and is no longer defined by star-schema modeling techniques on data warehouses. Further, downstream operations are not just BI reporting and now include emerging use-cases such as data science. This means that modern day ETL tools should be dynamic, scalable, and extensible enough to handle complex business logic.
Airflow provides the level of abstraction that today’s data engineers need. The Qubole Data Platform provides single-click deployment of Apache Airflow, automates cluster and configuration management, and includes dashboards to visualize the Airflow Directed Acyclic Graphs (DAGs).
In this webinar we will cover:
- A brief introduction to Apache Airflow and its optimal use cases
- How to remove the complexity of spinning up and managing the Airflow cluster
- How to scale out horizontally with a multi-node Airflow cluster
- Real-world customer examples
Piero Cinquegrana, Sr. Data Science Product Manager, Qubole
Deep learning works on large volumes of unstructured data such as human speech, text, and images to enable powerful use cases such as speech-to-text transcription, voice identification, image classification, facial or object recognition, analysis of sentiment or intent from text, and many more. In the last few years, TensorFlow has become a very popular deep learning framework for image recognition and speech detection use cases.
All deep learning methods, including TensorFlow, require large volumes of data to train the model. Today, the most significant challenge in deep learning is the ever-increasing training time: as models get more complicated, the size of training data continues to increase. In order to address this challenge, cloud providers have launched instance types with many graphics processing units (GPUs) in a single node. However, using all of the GPUs in a single training job is not trivial. Qubole’s TensorFlow engine has been built to run on distributed GPUs on Amazon Web Services.
In this webinar we will:
- Discuss how Qubole has achieved single-node, multi-GPU parallelization using native TensorFlow and Keras with TensorFlow as a backend.
- Present results from our studies that show how training time varies with the number of GPUs in the cluster.
- Run through a demo of a TensorFlow use case on Qubole.
Presto is a distributed ANSI SQL engine designed for running interactive analytics queries. Presto outshines other data processing engines when used for business intelligence (BI) or data discovery because of its ability to join terabytes of unstructured and structured data in seconds, or cache queries intermittently for a rapid response upon later runs. Presto can also be used in place of other well-known interactive open source query engines such as Impala and Hive, or traditional SQL data warehouses.
Qubole Presto, a cloud-optimized version of open source Presto, allows for dynamic cluster sizing based on workload, and terminates idle clusters — ensuring high reliability while reducing compute costs. Qubole customers use Presto along with their favorite BI tools, including PowerBI, Looker, Tableau, or any ODBC- and JDBC-compliant BI tool, to explore data and run queries.
In this webinar, you’ll learn:
- Why Presto is better suited for ad-hoc queries than other engines like Apache Spark
- How to jumpstart analysts across your organization to harness the power of your big data
- How to generate interactive or ad hoc queries or scheduled reports using Qubole and Presto
- Real-world examples of companies using Presto
Ashwin Chandra Putta, Sr. Product Manager at Qubole
Apache Spark is a powerful open source engine used for processing complex, memory-intensive workloads to create data pipelines or to build and train machine learning models. Running Spark on a cloud data activation platform enables rapid processing of petabyte-size datasets.
Qubole runs the biggest Spark clusters in the cloud and supports a broad variety of use cases from ETL and machine learning to analytics. Qubole supports a performance-enhanced and cloud-optimized version of the open source framework Apache Spark. Qubole brings all of the cost and performance optimization features of Qubole’s cloud native data platform to Spark workloads.
Qubole improves the performance of Spark workloads with enhancements such as fast storage, distributed caching, advanced indexing, metadata caching, and job isolation on multi-tenant clusters. Qubole has open sourced SparkLens, a Spark profiler that provides insights into Spark applications that help users optimize their Spark workloads.
In this webinar, you’ll learn:
- Why Spark is essential for big data, machine learning, and artificial intelligence
- How a cloud-native platform allows you to scale Spark across your organization, enable all data users, and successfully deploy AI and ML at scale
- How Spark runs on Qubole in a live demo
- Real-world examples of companies using Spark on Qubole
Many companies today struggle to balance their users’ demands for data with the cost of scaling their data operations. As the volume, variety, and velocity of data grow, data teams are getting overwhelmed and traditional infrastructure is being pushed to the brink.
In this webinar, Qubole SVP of Product Mohit Bhatnagar will share how Qubole’s cloud-native platform helps companies scale their data operations, activate petabytes of data, and reach administrator-to-user ratios as high as 1:200 (compared to ratios of 1:20 with other platforms).
He’ll also share how Qubole customers like Lyft, Under Armour and Turner use our cloud-native platform and multiple open source engines to run their big data workloads more efficiently and cost-effectively, as well as how the cloud helps them rapidly scale operations while simultaneously reducing their overall big data costs.
In this webinar you’ll learn:
- How to handle a broad set of needs and data sources
- The importance of a cloud-native architecture for scaling big data operations
- How and when to leverage multiple engines like Apache Spark, Presto and Airflow
- The importance of a multi-layered approach to security
Amit Duvedi, VP of Business Value Engineering, Qubole
Every investment in big data, whether people or technology, should be measured by how quickly it generates value for the business. While big data use cases may vary, the need to prioritize investments, control costs and measure impact is universal.
Like most CTOs, CIOs, VPs or Directors overseeing big data projects, you’re likely somewhere in between putting out fires and demonstrating how your big data projects are driving growth. If your focus, for example, is improving your users’ experience, you need to be able to demonstrate a clear ROI in the form of higher customer retention or lifetime value.
However, in addition to driving growth, you’re also responsible for managing costs. Here’s the rub: if you’re successful in driving growth, your big data costs will only go up. That’s the consequence of successful big data use cases. How then, when you have success, do you limit and manage rising cloud costs?
In this webinar, you’ll learn:
- How to measure business value from big data use cases
- Typical bottlenecks that delay time to value and ways to address them
- Strategies for managing rising cloud and people costs
- How best-in-class companies are generating value from big data use cases while also managing their costs
Cloud service models have become the new norm for enterprise deployments in almost every category — and big data is no exception. As the volume, variety, and velocity of data increase exponentially, the cloud offers a more efficient and cost-effective option for managing the unpredictable and bursty workloads associated with big data compared to traditional on-premises data centers.
Organizations looking to scale their big data projects and implement a data-driven business culture can do so with greater ease on the cloud. However, adopting a cloud deployment model requires a cloud-first re-architecture and a platform approach rather than a simple lift and shift of data applications and pipelines.
A cloud-native data platform like Qubole helps organizations save on average 50 percent in total cost of ownership. Intelligent automation of cluster management tasks allows data teams to focus on business outcomes, thereby greatly improving SLAs and the end-user experience.
Join experts from Qubole as they discuss how to activate your big data and get the most out of open source technologies on the cloud. In this webinar, you'll learn:
- How big data projects benefit from a cloud-native data platform
- How intelligent cluster management can help you save in total cost of ownership
- About companies that successfully transitioned their big data to the cloud
- How to evaluate cloud data platforms for your big data needs
Barbara Eckman from Comcast; Brad Linder from Sling TV; John Slocum from MediaMath; Utpal Bhatt from Qubole
Every CEO aspires to create a data-driven culture that can activate hundreds or thousands of users and petabyte-scale data to continuously deliver true business value. This webcast panel discussion will explore the journey of three companies — Comcast, Sling TV, and MediaMath — that have chronicled their successes and challenges in a book by O’Reilly Media about creating a data-driven enterprise in media.
The panelists discuss:
- Their general technology strategy and choices
- How data-driven insights are powering their businesses
- Transforming the competitive dynamics of their industry through the power of data
This webinar covers how Qubole extended Apache Airflow to manage the operational inefficiencies that arise when managing data pipelines in a multi-tenant environment. Qubole also shares how to make data pipelines robust by adding data quality checks using CheckOperators.
- Overview of major types of data pipelines
- How Qubole manages deployments and upgrades of data pipelines in a multi-tenant environment
- The data quality issues that arise during data ingestion or transformation.
- The approach that Qubole has adopted using Apache Airflow Check operators
- The best practices in using Apache Airflow for data quality checks.
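A data quality check of the kind listed above can be pictured as a query whose single result row must contain only truthy values, with the pipeline step failing otherwise. The sketch below imitates that idea with Python's built-in `sqlite3` module. It is not Airflow code: the `run_check` helper, the `DataQualityError` type, the `orders` table, and the SQL are all assumptions made for illustration.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised when a check query returns a falsy value (hypothetical helper type)."""


def run_check(conn: sqlite3.Connection, sql: str) -> None:
    """Run a check query and fail if its first row is missing or has a falsy value."""
    row = conn.execute(sql).fetchone()
    if row is None:
        raise DataQualityError(f"query returned no rows: {sql}")
    if not all(bool(value) for value in row):
        raise DataQualityError(f"check failed: {sql} returned {row}")


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory table with one NULL amount to trigger a failed check."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, None)])
    return conn


if __name__ == "__main__":
    conn = build_sample_db()
    # Passes: the table is not empty.
    run_check(conn, "SELECT COUNT(*) FROM orders")
    try:
        # Fails: one row has a NULL amount, so the two counts differ.
        run_check(conn, "SELECT COUNT(*) = COUNT(amount) FROM orders")
    except DataQualityError as err:
        print(err)
```

Raising an exception rather than returning a flag is a deliberate choice in this sketch: in an orchestrated pipeline, a failed check should stop downstream tasks from consuming bad data.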
Nate Shea-han, Americas Global Black Belt, Data & AI at Microsoft and Shaun Van Staden, Solutions Architect at Qubole
Becoming more competitive with big data today means having the right technology to uncover new insights from your data and make critical business decisions in real time. Qubole and Microsoft help companies activate their big data in the cloud to uncover insights that improve customer engagement, increase revenue, and lower costs.
Join experts from Qubole and Microsoft as they discuss how to activate your big data and how to get the most out of open source technologies on the cloud. In this webinar, you'll learn:
- How to modernize with data lakes and data warehouses on the cloud
- Strategies for boosting business value out of Machine Learning and advanced analytics with Qubole on Azure
- How to reduce costs, control risks, and improve data governance as you build your data pipelines
- The importance of data security and privacy
- Real world examples of successful companies activating their big data
Nate Shea-han
Americas Global Black Belt, Data & AI at Microsoft
Nate Shea-han has been with Microsoft for 14 years and has spent the last 8 years focused on helping Microsoft customers transform their business in the cloud on the Azure platform. Currently he has responsibilities across the United States, Canada and Latin America for Microsoft’s AI, big data, and analytics offerings. Nate has also worked extensively with the Microsoft partner community.
Shaun Van Staden
Solutions Architect, Qubole
Shaun Van Staden has 19 years of experience in enterprise software managing advanced analytics projects, as a developer, DBA, business analyst and now a solutions architect. As a solutions architect manager, Shaun is responsible for supporting business development and sales at Qubole and helping customers transform their use cases for the cloud. Prior to Qubole, Shaun worked as a solutions architect at NICE Systems and Merced Systems (acquired by NICE).
Big data technologies can be complex and involve time-consuming manual processes. Organizations that intelligently automate big data operations lower their costs, make their teams more productive, scale more efficiently, and reduce the risk of failure.
In our webinar, representatives from TiVo, creator of a digital recording platform for television content, will explain how they implemented a new big data and analytics platform that dynamically scales in response to changing demand. You’ll learn how the solution enables TiVo to easily orchestrate big data clusters using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EC2 Spot instances that read data from a data lake on Amazon Simple Storage Service (Amazon S3) and how this reduces the development cost and effort needed to support its network and advertiser users. TiVo will share lessons learned and best practices for quickly and affordably ingesting, processing, and making available for analysis terabytes of streaming and batch viewership data from millions of households.
Join our webinar to learn:
- How to dramatically reduce management complexities for big data analytics operations on AWS.
- Best practices for optimizing data lakes for self-service analytics that enable teams to productionize data science and accelerate data pipelines.
- About using Qubole’s auto-scaling to reduce the complexity and deployment time of big data projects.
- How to reduce the cost of big data workloads with Qubole’s automated Spot Instance Bidding and management.
In this keynote, Ashish Thusoo, CEO of Qubole, discusses the gap that enterprises face today when activating their big data. He makes a case for the shift that organizations need to make towards a big data activation strategy in order to put their data assets to use for differentiating and achieving business objectives. The session also covers key elements of big data activation supported by usage trends of Qubole's cloud-native big data activation platform. He presents various ways that enterprises can measure their own activation readiness, and demonstrates why Qubole provides the right approach to big data activation.
Sumit Gupta, VP, AI, Machine Learning and HPC, IBM Cognitive Systems
From chat bots, to recommendation engines, to Google Voice and Apple Siri, AI has begun to permeate our lives. In this keynote, IBM's Sumit Gupta demystifies what AI is, presents the differences between machine learning and deep learning, explains why there is such huge interest now, shows some fun use cases and demos, and then discusses use cases of how deep learning based AI methods can be used to garner insights from data for enterprises. Sumit also talks about what IBM is doing to make deep learning and machine learning more accessible and useful to a broader set of data scientists.
Jose Villacis, Senior Director of Product Marketing, Qubole
Every CEO aspires to create a data-driven culture that can activate 100s or 1000s of users and petabyte-scale data to continuously deliver true business value. This keynote panel explores the journey of 4 companies: Comcast, Qubole, Fanatics and MediaMath, that have chronicled their successes and challenges in two books by O’Reilly Media about Creating a Data-Driven Enterprise. The panelists talk not just about their technology strategy and choices but also how data-driven insights are powering their business and transforming the competitive dynamics of their industry.
At our core, we are a team of engineers who eat, sleep, and live big data. We believe that ubiquitous access to information is the key to unlocking a company's success. To achieve this, a big data platform must be agile, flexible, scalable, and proactive to anticipate a company's needs.