Journey from exploration and visualization to machine learning and natural language processing. Discover how Return Path built a cloud-based, production-ready, enterprise-scale data solution without a dedicated DevOps team. Leveraging modern distributed computing frameworks like Spark and managed services like EMR and Qubole was key to the process.
Recorded Sep 19, 2017 · 45 mins
Apache Spark has been rapidly gaining steam, both in the headlines and in real-world adoption. Spark was developed in 2009 and open sourced in 2010. Since then, it has grown into one of the largest open source communities in big data, with over 200 contributors from more than 50 organizations. This open source analytics engine stands out for its ability to process large volumes of data significantly faster than contemporaries such as MapReduce, primarily owing to its in-memory processing model. That said, one of the top real-world industry use cases for Apache Spark is its ability to process streaming data.
With so much data being processed on a daily basis, it has become essential for companies to be able to stream and analyze it all in real time, and Spark Streaming has the capability to handle this extra workload. Some experts even theorize that Spark could become the go-to platform for stream-computing applications, no matter the type. The reason for this claim is that Spark Streaming unifies disparate data processing capabilities, allowing developers to use a single framework to accommodate all their processing needs. Among the common ways that Spark Streaming is being used by businesses today are streaming ETL, data enrichment, trigger event detection, and complex session analysis. In this webinar, we will cover an introduction to, the internals of, and industry use cases for Structured Streaming in Spark.
- Understanding of Data Processing Architecture
- Why and When to use Spark’s Structured Streaming
- Spark’s Structured Streaming Programming Paradigm
- Internals of Spark’s Structured Streaming
- Spark Structured Streaming in the Real World – examples of how customers of Qubole use it
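At the heart of the programming paradigm listed above is Structured Streaming's core idea: treat the stream as an unbounded table and keep an aggregate up to date as micro-batches arrive. As a minimal illustration of that incremental model (plain Python for readability, not Spark's actual API), here is a running word count updated batch by batch:

```python
from collections import Counter

# Running state: the "result table" that the streaming engine
# incrementally updates as each micro-batch of input arrives.
word_counts = Counter()

def process_batch(lines):
    """Fold one micro-batch of input lines into the running word count."""
    for line in lines:
        word_counts.update(line.lower().split())
    return dict(word_counts)

# Two micro-batches arriving over time; each call updates the
# running aggregate rather than recomputing it from scratch.
process_batch(["spark streaming", "structured streaming"])
totals = process_batch(["spark sql"])
print(totals["spark"])  # prints 2: "spark" appeared in both batches
```

In real Structured Streaming code the same update logic is expressed declaratively (a `groupBy`/`count` on a streaming DataFrame), and the engine manages the state, fault tolerance, and output modes for you.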
We have come a long way since the term "Big Data" swept the business world off its feet as the next frontier for innovation, competition and productivity. Hadoop, NoSQL and Spark have become members of the enterprise IT landscape, data lakes have evolved as a real strategy and migration to the cloud has accelerated across service and deployment models.
On the road ahead, the demand for real-time analytics will continue to skyrocket alongside growth in IoT, machine learning, and cognitive applications. Meeting the speed and scalability requirements of these types of workloads requires more flexible and efficient data management processes – both on-premises and in the cloud. Flexible deployment and integration options will become a must-have for projects.
Finally, the need for data governance and security is intensifying as businesses adopt new approaches to expand their data storage and access via data lakes and self-service analytics programs. As data, along with its sources and users, continues to proliferate, so do the risks and responsibilities of ensuring its quality and protection.
Join us to watch the replay of "What's Ahead in Big Data and Analytics" to get real direction and practical advice on the challenges and opportunities to tackle in 2018.
Discover the newly launched features in Qubole, powered by Data Intelligence, that automate mundane data model performance appraisal and simplify DataOps. This session will provide a detailed walkthrough of Qubole’s latest Data Intelligence offering, which includes data model insights and recommendations covering partitioning, formatting, and sorting that help optimize data models for improved performance and more efficient use of computing resources. In addition, learn about Qubole’s latest offering in self-service analytics and how it can improve analysts’ productivity by making data discovery easy through column and table name auto-suggestion and completion, and insights preview.
In the final session of Data Platforms Online 2017, Ashish Thusoo will offer some of his highlights from the week’s sessions, pick out some emerging themes and trends, and answer questions from the audience. Ashish built the original data team at Facebook, is a co-author of Apache Hive, co-author of “Building a Data-Driven Enterprise with DataOps,” and CEO of Qubole. He’ll be moderated by Horia Margarit, resident Data Scientist at Qubole. Get your questions ready for what will be a lively and entertaining discussion!
Andrew Reichman, Sr. Director of Cloud Strategy, Oracle
Cloud has changed the game when it comes to data analytics. Previously, organizations had to lock themselves into a particular architecture and level of capacity for three to seven years and do all the heavy lifting themselves. Cloud, on the other hand, allows them to experiment with different hardware and software options, consume more of the solution as a service, and scale up and down at will to meet project spikes and accelerate busy jobs. This makes it much more viable for any company to gain the advantages of advanced analytics against large data sets without an oversized IT staff or huge capital investments.
Oracle Cloud is specifically designed to help enterprises take advantage of cloud for data analytics—it offers massive, consistent performance, predictable low cost, and a broad choice of deployment and software options. Oracle and Qubole work together to deliver a new breed of data platform—capable of taming the scale, performance, cost, and complexity issues associated with gaining business insight from data of all types.
Watch this webinar to understand:
- Industry trends for big data in the cloud
- How Oracle Cloud Infrastructure is optimized for big data workloads from a cost, performance and flexibility perspective
- How Oracle Cloud Big Data solutions compare with on-premises and competing cloud options
James Curtis, Senior Analyst - Data Platforms & Analytics, 451 Research
The question is not so much whether to migrate to the cloud or not. That question has likely already been answered by many organizations, and the answer is a resounding full steam ahead. But the start of the journey can be daunting, especially with so much ‘as-a-service’ terminology floating around. Please join James Curtis, Senior Analyst at 451 Research, as he discusses not only industry trends and what many organizations are doing, but also a simplified approach to understanding cloud services and how they might best fit your organization. Because it’s not so much buyer beware; it’s more about buyer understand.
Aman Naimat, Senior Vice President, Technology, Demandbase
There is a surge in hype around Artificial Intelligence. Startups are raising hundreds of millions of dollars by bedazzling investors with Deep Learning, word embeddings, and reinforcement learning. This is a distraction from the very real problems that data and AI can solve if done right. By working across dozens of machine learning problems that are live in the real world, I’ve worked out the most common problems encountered and recurring design patterns on how to solve real-world problems using AI as a tool. This talk will arm you with a perspective on how to get pragmatic solutions with AI today.
This webinar is focused on:
- Reducing collaborative friction between engineers, analysts, and the business
- Process-driven iteration i.e. balancing agility with discipline
- Making the quantitative business case for moving from big bang to continuous enhancement (convincing your CFO/CIO to shift from CapEx to OpEx)
- Case studies and outcomes from our clients
Pratap Ramamurthy, Partner Solution Architect, Amazon Web Services
Today’s organizations are tasked with managing multiple data types coming from a wide variety of sources. Faced with massive volumes and heterogeneous types of data, organizations are finding that in order to deliver insights in a timely manner, they need a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. Every use case is different, and different use cases need different tools. AWS provides a variety of options for your needs, including RDS, EMR, Redshift, Athena, and QuickSight. In this talk we will discuss the different technologies available on AWS and their applications.
Once considered the "black magic" of digitally-born retailers like Amazon, personalizing the customer experience has now become table stakes for any retailer interested in surviving in the era of digital transformation. The techniques, tools, and scalable platforms necessary to optimize customer interactions are now available and accessible to companies of any shape or size. We'll discuss how the use of big data technology in the cloud eases the implementation of common retail use cases as well as how it helps to avoid typical pitfalls.
We will look at how DevOps is making Data Science more mainstream with automation, release trains, agility and operational readiness. In this talk, we will look at the various tools and techniques in building a successful Data Science practice and how DevOps can be introduced to provide Continuous Integration, Delivery, and Deployment of Data Science Models.
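One concrete flavor of continuous delivery for models is a CI gate that validates a candidate model before it is promoted to deployment. The sketch below is a minimal, self-contained illustration of that pattern (the toy model, data, and the 0.9 accuracy threshold are illustrative assumptions, not a specific team's pipeline):

```python
def train_model(data):
    """Toy stand-in for a training job: classify by thresholding the
    feature at the midpoint between the two class means."""
    mean0 = sum(x for x, y in data if y == 0) / sum(1 for _, y in data if y == 0)
    mean1 = sum(x for x, y in data if y == 1) / sum(1 for _, y in data if y == 1)
    threshold = (mean0 + mean1) / 2
    return lambda x: 1 if x > threshold else 0

def validate(model, holdout, min_accuracy=0.9):
    """CI gate: promote the model only if holdout accuracy clears the bar."""
    correct = sum(1 for x, y in holdout if model(x) == y)
    accuracy = correct / len(holdout)
    return accuracy >= min_accuracy, accuracy

train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
holdout = [(0.15, 0), (0.25, 0), (0.75, 1), (0.85, 1)]

model = train_model(train)
promote, acc = validate(model, holdout)
print(promote)  # prints True: separable data, so the gate passes
```

In a real release train, the `validate` step would run as an automated pipeline stage, failing the build (and blocking deployment) whenever a retrained model regresses below the agreed quality bar.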
Krishna Mamidipaka, Sr. Program Manager, Microsoft Azure
Continuous streams of data are generated in every industry from sensors, manufacturing IoT devices, business transactions, social media, network devices, clickstream logs, and more. Found within these streams are critical business insights waiting to be unlocked. Attend this session and learn how customers are creating solutions for fleet monitoring, smart grid, network monitoring, recommendations, and other real-time scenarios that turn multiple concurrent streams of data-in-motion into insights and actions for competitive advantage.
In this session you will see demos and learn how services like Azure Event Hubs, Stream Analytics, Machine Learning, and other Azure services work seamlessly together to create end-to-end real-time analytics solutions.
Scott Donohoo, Technology Solutions Professional, Microsoft, & Erik Zwiefel, Technology Solutions Professional, Microsoft
This session will cover how Azure Machine Learning and R can help data scientists overcome the following challenges:
- Development time - dramatically reduce the time needed to run initial ML experiment validations.
- Performance - options for best-in-class performance.
For deeper data science needs, the session will explore how hard-core data scientists can leverage R to attack the most complex scenarios.
Data governance, discovery, and lineage help data scientists find and integrate data of interest to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures. Comcast’s Streaming Data Platform comprises a wide variety of ingest, transformation, and storage services. Peer-reviewed Apache Avro schemas support end-to-end data governance. Apache Atlas is our metadata repository for data discovery and lineage. We have extended Atlas with custom data and process types, e.g., Avro schemas; AWS S3 buckets and prefixes; Kafka topics; and Kinesis streams. Custom asynchronous messaging libraries notify Atlas of new data and schema entities and lineage links as they are created.
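Avro schemas are plain JSON, which is part of what makes peer review and governance practical: they can be versioned, diffed, and reviewed like any other code. Below is a hypothetical schema of the kind such a review would cover (the record and field names are illustrative, not Comcast's actual schemas); note how every field is typed and `doc` strings carry governance metadata:

```python
import json

# Hypothetical peer-reviewable Avro schema (illustrative names only).
CLICK_EVENT_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "com.example.streaming",
  "doc": "One viewer click; owned by the streaming-platform team.",
  "fields": [
    {"name": "event_id",  "type": "string", "doc": "Globally unique id"},
    {"name": "timestamp", "type": "long",   "doc": "Epoch millis, UTC"},
    {"name": "device_id", "type": ["null", "string"], "default": null,
     "doc": "Optional; null when the device is anonymous"}
  ]
}
""")

field_names = [f["name"] for f in CLICK_EVENT_SCHEMA["fields"]]
print(field_names)  # prints ['event_id', 'timestamp', 'device_id']
```

Because the schema is structured data, the same definition can be registered in a metadata repository such as Atlas, letting lineage tooling link every downstream dataset back to a reviewed, versioned contract.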
At BloomReach we process around 100 million products every day across all of our customers. For each customer, the feed processing needs to be fast and reliable, and indexing must not impact serving. We will walk through how we've built this at BloomReach while also keeping the cost minimal.
Looker is the modern data platform that democratizes data analytics, creates meaningful insights, and powers critical business actions. The Looker platform allows you to analyze your data and act on it within a single interface, enabling both business users and analysts to add maximum value where it matters most. Stop jumping between tabs and tools - do it all in Looker.
Today, 91% of companies are ingesting data from third party partners to run their businesses. Additionally, 70% of companies either currently send or plan to send data to partners. This inter-company data collaboration powers insights, machine learning, and better consumer experiences. But, it also increases workloads for strapped engineering teams and creates challenges to data access. Learn how companies are streamlining and even automating their Data Operations to accelerate the time from data to business value.
Cathy Palmer, Ph.D., Principal Program Manager, Microsoft
Enterprises are building big data solutions with Azure Data Lake, an on-demand analytics service with a no-limits data lake built to support massively parallel analytics. Patterns of enterprise solutions are emerging and evolving as customers migrate their analytics workloads to the cloud and embrace new business opportunities. With an overview of Azure Data Lake, this webinar briefly explores some of the choices customers are making in building big data solutions with Azure Data Lake.
Shawn James, Sr. Director Technology Alliances, Talend
Talend provides the data agility businesses need to use the latest cloud technologies, act with insight across their organization, and win in an economy being deeply transformed by exploding data volumes, technology innovation, and fundamental changes to IT infrastructure. Join us to learn how Talend and Qubole together help companies’ business users execute data preparation workloads in the cloud at a fraction of the cost and resources.
In the biological sciences, hypothesis-driven experiments and bottom-up design experiments rely on predicting what will happen with new cells and molecules. Machine learning excels at prediction and has become more democratized, making it an important component in the biotech toolkit. We use Merck's Kaggle competition as a representative task in this domain that involves predicting molecular activity from numeric descriptors of chemical structure. Our approach utilizes deep neural networks using the Keras library in a Qubole notebook, which is conveniently attached to an autoscaled Spark cluster. We use Spark to distribute the hyperparameter search for optimizing the neural net.
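The hyperparameter search distributes naturally because each candidate configuration trains and scores independently of the others. The sketch below shows that pattern with Python's built-in thread pool standing in for Spark's parallelized map over the cluster (the objective function and grid are toy stand-ins, not the actual Keras models or Merck data):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(config):
    """Score one hyperparameter configuration. Toy stand-in: pretend
    validation loss is a known function of (learning rate, layers)."""
    lr, layers = config
    loss = (lr - 0.01) ** 2 + 0.001 * abs(layers - 3)
    return config, loss

# Grid of candidate configurations: (learning rate, hidden layers).
grid = [(lr, layers) for lr in (0.001, 0.01, 0.1) for layers in (2, 3, 4)]

# Each candidate is independent, so map them out in parallel --
# the same pattern Spark's parallelize/map spreads across executors.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

best_config, best_loss = min(results, key=lambda r: r[1])
print(best_config)  # prints (0.01, 3): the grid point nearest the optimum
```

On Spark, the grid becomes an RDD (e.g. via `parallelize`), `evaluate` becomes the mapped function training a Keras model on each executor, and an autoscaling cluster grows to match the width of the search.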
At our core, we are a team of engineers who live, eat, and sleep big data. We believe that ubiquitous access to information is the key to unlocking a company's success. To achieve this, a big data platform must be agile, flexible, scalable, and proactive to anticipate a company's needs.