Improve Amazon EMR Performance up to 4X

Are you running Amazon EMR but lack visibility into how your cluster is performing? Pepperdata for Amazon EMR enables users of Amazon Elastic MapReduce to run jobs up to four times faster while reducing costs. Users can see over 300 metrics even after the cluster has been terminated, giving them a historical view of performance.

Register for our webinar to learn how Amazon EMR can help streamline your big data projects, and how Pepperdata can help you get the most value from your investment.
Recorded: Oct 13 2016 36 mins
Presented by
Vinod Nair, Product Manager at Pepperdata

Channel profile
  • Pepperdata Application Summary Page Overview Dec 13 2017 7:00 pm UTC 60 mins
    Alex Pierce
    Intended for software engineers, developers, operators, architects, and technical leads who develop Spark applications, this session shows how Pepperdata has simplified application performance management. Pepperdata Field Engineer Alex Pierce demonstrates how to find any application easily with a simple new application search capability, identify bottlenecks, and get recommendations and insights to improve the performance of your application in one place.
  • Classifying Multi-Variate Time Series at Scale Recorded: Dec 7 2017 27 mins
    Ash Munshi
    Characterizing and understanding the runtime behavior of large-scale Big Data production systems is extremely important. Typical systems consist of hundreds to thousands of machines in a cluster with hundreds of terabytes of storage costing millions of dollars, solving problems that are business critical. By instrumenting each running process and measuring its resource utilization (CPU, memory, I/O, network, etc.) as time series, it is possible to understand and characterize the workload on these massive clusters. Each series consists of tens to tens of thousands of data points that must be ingested and then classified. At Pepperdata, our instrumentation of the clusters collects over three hundred metrics from each task every five seconds, resulting in millions of data points per hour. At this scale the data are equivalent to the biggest IoT data sets in the world. Our objective is to classify the collection of time series into a set of classes that represent different workload types. Phrased differently, our problem is essentially that of classifying multivariate time series.

    Intended for machine learning researchers and developers who use machine learning in their applications, Pepperdata CEO Ash Munshi presents a unique, off-the-shelf approach to classifying time series that achieves near best-in-class accuracy for univariate series and generalizes to multivariate time series.

    Before joining Pepperdata, Ash was executive chairman for Marianas Labs, a deep learning startup sold in December 2015. Prior to that he was CEO for Graphite Systems, a big data storage startup that was sold to EMC DSSD in August 2015. Munshi also served as CTO of Yahoo, as a CEO of both public and private companies, and is on the board of several technology startups.
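
    The abstract above does not publish Pepperdata's actual model, so the Python fragment below is only a rough, hypothetical sketch of the general technique: it summarizes each fixed-length multivariate resource series (CPU, memory, I/O, network) into a feature vector and trains an off-the-shelf classifier to assign workload classes. The synthetic data, feature choices, and classifier are assumptions for illustration, not Pepperdata's pipeline.

```python
# Hypothetical sketch: classify multivariate resource time series into workload types.
# All data here is synthetic; the real Pepperdata pipeline is not published in the abstract.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 1000 tasks, 4 metrics (CPU, memory, I/O, network),
# 120 samples per task (10 minutes at 5-second resolution), random workload labels.
n_tasks, n_metrics, n_samples = 1000, 4, 120
X_series = rng.random((n_tasks, n_metrics, n_samples))
y = rng.integers(0, 3, size=n_tasks)          # 3 assumed workload classes

def featurize(series):
    # series: (n_metrics, n_samples) -> fixed-length per-metric summary statistics
    return np.concatenate([
        series.mean(axis=1),
        series.std(axis=1),
        series.max(axis=1),
        np.percentile(series, 90, axis=1),
    ])

X = np.stack([featurize(s) for s in X_series])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))
```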
  • Building a Big Data Stack on Kubernetes Recorded: Dec 6 2017 55 mins
    Sean Suchter
    There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark).

    Intended for software engineers, developers, architects and technical leads who develop Spark applications, this session will discuss how to build a big data stack on Kubernetes. In particular, it will show how the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping of Kubernetes containers to physical nodes and to HDFS datanode daemons. You'll also learn how to give Spark the high availability of the critical HDFS namenode service when running HDFS in Kubernetes.
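
    The locality mechanism described above depends on knowing which physical node hosts each container. The fragment below is a hypothetical illustration of that first step using the official Kubernetes Python client: it maps executor pods to the nodes running them. The namespace and label selector are assumptions, and associating those nodes with co-located HDFS datanodes (and feeding the result to the Spark scheduler) is left out.

```python
# Hypothetical illustration: map running executor pods to the Kubernetes nodes hosting
# them, the first step toward matching pods with co-located HDFS datanodes.
from kubernetes import client, config

config.load_kube_config()                      # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Assumed namespace and label selector; adjust for your own Spark-on-Kubernetes deployment.
pods = v1.list_namespaced_pod("default", label_selector="spark-role=executor")

pod_to_node = {p.metadata.name: p.spec.node_name for p in pods.items}
for pod, node in pod_to_node.items():
    # A locality-aware scheduler would prefer HDFS blocks whose datanode runs on `node`.
    print(f"{pod} -> {node}")
```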
  • Fix Spark Failures and Bottlenecks Faster and Easier Recorded: Nov 16 2017 51 mins
    Vinod Nair
    Fix Spark Failures and Bottlenecks Faster and Easier is a webinar presentation intended for software engineers, developers, and technical leads who develop Spark applications. Pepperdata has gathered trillions of performance data points on production clusters running Spark, covering a variety of industries, applications, and workload types.

    This presentation discusses the results of analyzing many Spark jobs on many multi-tenant production clusters. Pepperdata Field Engineer Kirk Lewis will discuss common issues seen, the symptoms of those issues, and how developers can address them. This discussion includes key performance insights (best and worst practices, gotchas, and tuning recommendations) based on analyzing the behavior and performance of millions of Spark applications. In addition, Kirk will describe how we are turning these learnings into heuristics used in the open source Dr. Elephant project.

    This webinar is followed by a live Q & A. A replay of this webinar will be available within 24 hours at https://www.pepperdata.com/resources/webinars/.
  • Effective High-Speed Multi-Tenant Data Lakes Recorded: Oct 25 2017 45 mins
    Sean Suchter, CTO and founder, Pepperdata
    The growth of Big Data has increased the demand for data management solutions that operate at scale and meet business requirements. Big Data organizations realize quickly that scaling from small, pilot projects to large-scale production clusters involves a steep learning curve. Despite tremendous progress, multi-tenancy, performance optimization, and workflow monitoring remain critically important areas where the operations team still needs management help.

    Intended for enterprises that already have a data lake or are setting up their first one, this presentation will discuss how to implement data lakes with operations tools that automatically optimize clusters and provide solutions for monitoring, performance tuning, and troubleshooting in production environments.

    Sean is the co-founder and CTO of Pepperdata. Previously, Sean was the founding GM of Microsoft’s Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search. Prior to Microsoft, Sean managed the Yahoo Search Technology team, the first production user of Hadoop. Sean joined Yahoo through the acquisition of Inktomi, and holds a B.S. in Engineering and Applied Science from Caltech.
  • Solving Performance Bottlenecks For Spark Developers Recorded: Oct 11 2017 47 mins
    Vinod Nair, Director of Product Management
    Intended for software engineers, developers, architects and technical leads who develop Spark applications, this session features Vinod Nair discussing how the Pepperdata product suite helps developers in Big Data environments.
  • Strata Data Conference NYC – Pepperdata – HDFS on Kubernetes: Lessons Learned Recorded: Sep 28 2017 41 mins
    Kimoon Kim
    There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.
  • Top Considerations When Choosing a Big Data Management and Performance Solution Recorded: Sep 20 2017 54 mins
    Kirk Lewis
    The growing adoption of Hadoop and Spark has increased the demand for Big Data management solutions that operate at scale and meet business requirements. However, Big Data organizations realize quickly that scaling from small, pilot projects to large-scale production clusters involves a steep learning curve. Despite tremendous progress, there remain critically important areas, including multi-tenancy, performance optimization, and workflow monitoring, where the DevOps team still needs management help. In this webinar, field engineer Kirk Lewis discusses the top considerations when choosing a big data management and performance solution.
  • HDFS on Kubernetes: Lessons Learned Recorded: Sep 19 2017 46 mins
    Kimoon Kim, Pepperdata Software Engineer
    HDFS on Kubernetes: Lessons Learned is a webinar presentation intended for software engineers, developers, and technical leads who develop Spark applications and are interested in running Spark on Kubernetes. Pepperdata has been exploring Kubernetes as a potential Big Data platform with several other companies as part of a joint open source project.

    In this webinar, Kimoon Kim will show you how to: 

    – Run Spark applications natively on Kubernetes
    – Enable Spark on Kubernetes to read and write data securely on HDFS protected by Kerberos
  • Making OpenTSDB Perform at Massive Scale - Pepperdata Meetup Replay Recorded: Aug 29 2017 33 mins
    Simon King
    OpenTSDB is an open-source time series database built on top of HBase. Thanks to HBase, OpenTSDB scales very nicely to accommodate large amounts of data in terms of bytes or data points; at Pepperdata we ingest hundreds of billions of data points per day. Where OpenTSDB struggles to scale is in the number of distinct time series. Pepperdata stores time series data on all the hardware and processes across many Hadoop clusters: billions of discrete series per day. Speaker Simon King will discuss some of OpenTSDB's strengths and weaknesses, and some of the techniques Pepperdata uses to work around its limitations. Originally presented at Galvanize in San Francisco on 8/21/17.
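
    For readers unfamiliar with OpenTSDB's write path, the snippet below shows the standard way data points enter it, via the HTTP /api/put endpoint. The host, metric names, and tags are made up for illustration and are not Pepperdata's schema; the scaling issues discussed in the talk arise from the sheer number of distinct metric/tag combinations written this way.

```python
# Minimal example of pushing data points into OpenTSDB via its HTTP /api/put endpoint.
# Host, metric names, and tags are illustrative only.
import time
import requests

OPENTSDB_URL = "http://opentsdb.example.com:4242/api/put"

now = int(time.time())
datapoints = [
    {"metric": "proc.cpu.percent", "timestamp": now, "value": 42.5,
     "tags": {"host": "worker-01", "cluster": "prod"}},
    {"metric": "proc.mem.rss_bytes", "timestamp": now, "value": 1.2e9,
     "tags": {"host": "worker-01", "cluster": "prod"}},
]

resp = requests.post(OPENTSDB_URL, json=datapoints, timeout=5)
resp.raise_for_status()   # OpenTSDB returns 204 No Content on success
```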
  • Overcoming Performance Challenges of Building Spark Applications for AWS Recorded: Aug 16 2017 39 mins
    Vinod Nair, Director of Product Management at Pepperdata
    Overcoming Performance Challenges of Building Spark Applications for AWS is a webinar presentation intended for software engineers, developers, and technical leads who develop Spark applications for EMR or EC2 clusters.

    In this webinar, Vinod Nair will show you how to:

    – Identify which portion of your application consumes the most resources
    – Identify the bottlenecks slowing down your applications
    – Test your applications against development or production workloads
    – Significantly reduce troubleshooting issues due to ambient cluster conditions

    This webinar is followed by a live Q & A. A replay of this webinar will be available within 24 hours at https://www.pepperdata.com/resources/webinars/.
  • Kerberized HDFS – and How Spark on Yarn Accesses It Recorded: Aug 9 2017 50 mins
    Kimoon Kim
    Following up on his recent presentation, HDFS on Kubernetes: Lessons Learned, Senior Software Engineer Kimoon Kim presents on Kerberized HDFS and how Spark on YARN accesses it.
  • Creatively Visualizing Spark Data Recorded: Jul 18 2017 26 mins
    Christina Holland
    A Pepperdata tech talk by software engineer Christina Holland on creatively visualizing Spark data and designing new ways to see pipelines.
  • Production Spark Series Part 4: Spark Streaming Delivers Critical Patient Care Recorded: Jun 22 2017 58 mins
    Charles Boicey, Chief Innovation Officer, Clearsense
    Clearsense is a pioneer in healthcare data science solutions, using Spark Streaming to provide real-time updates to health care providers for critical health care needs. Clinicians can make timely decisions from the assessment of a patient's risk for Code Blue, Sepsis, and other conditions based on the analysis of information gathered from streaming physiological monitoring, streaming diagnostic data, and the patient's historical record. Additionally, this technology is used to monitor operational and financial processes for efficiency and cost savings. This talk discusses the architecture needed and the challenges associated with providing real-time SLAs along with 100% uptime expectations in a multi-tenant Hadoop cluster.
  • Spark Summit 2017 - Connect Code to Resource Consumption to Scale Production Recorded: Jun 6 2017 26 mins
    Vinod Nair, Director of Product Management
    Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will also discuss various sources of information on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. We will show how Pepperdata’s products can clearly identify such usages and tie them to specific lines of code. We will show how Spark application owners can quickly identify the root causes of such common problems as job slowdowns, inadequate memory configuration, and Java garbage collection issues.
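
    The talk works from Scala code, but the short PySpark sketch below illustrates the same mapping from code to execution: a narrow transformation like map stays within a stage, while a wide transformation like reduceByKey introduces a shuffle and a new stage, which is where much of the resource cost appears.

```python
# Small PySpark sketch: narrow vs. wide transformations and the shuffle boundary.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "emr", "spark", "hdfs", "yarn", "spark"], numSlices=3)

# Narrow transformation: runs within the same stage, no data movement.
pairs = words.map(lambda w: (w, 1))

# Wide transformation: requires a shuffle, so Spark starts a new stage here.
counts = pairs.reduceByKey(lambda a, b: a + b)

# The two-stage structure (and the shuffle between them) shows up in the Spark UI.
print(counts.collect())
```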
  • Spark Summit 2017 – Spark Summit Bay Area Apache Spark Meetup Recorded: Jun 5 2017 98 mins
    Sean Suchter, Pepperdata Founder and CTO
    Bay Area Apache Spark Meetup at the 10th Spark Summit featuring tech-talks about using Apache Spark at scale from Pepperdata’s CTO Sean Suchter, RISELab’s Dan Crankshaw, and Databricks’ Spark committers and contributors.
  • HDFS on Kubernetes: Lessons Learned Recorded: Jun 2 2017 36 mins
    Kimoon Kim, Engineer, Pepperdata
    There is growing interest in running Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark). Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When running Spark on Kubernetes, if the HDFS daemons run outside Kubernetes, applications will slow down while accessing the data remotely.

    In this webinar, we will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, we will show:

    - How the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping of Kubernetes containers to physical nodes and to HDFS datanode daemons.
  • Production Spark Series Part 3: Tuning Apache Spark Jobs Recorded: May 30 2017 40 mins
    Simon King, Engineer, Pepperdata
    A Spark application that worked well in a development environment or with sample data may not behave as expected when run against a much larger dataset in a production environment. Pepperdata Application Profiler, based on the open source Dr. Elephant project, can help you tune your Spark application based on current dataset characteristics and the cluster execution environment. Application Profiler uses a set of heuristics to provide actionable recommendations to help you quickly tune your applications.

    Occasionally an application will fail (or execute too slowly) due to circumstances outside your control: a busy cluster, another misbehaving YARN application, bad luck, or bad "cluster weather". We'll discuss ways to use Pepperdata's Cluster Analyzer to quickly determine when an application failure may not be your fault and how to diagnose and fix symptoms that you can affect.
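
    Tuning recommendations from a heuristics-based tool like Application Profiler or Dr. Elephant ultimately land as Spark configuration changes. The fragment below is only a generic illustration of applying a few common settings when building a session; the values are placeholders, not recommendations from Pepperdata or for any real workload.

```python
# Illustrative only: applying tuning recommendations as Spark configuration.
# The values below are placeholders, not recommendations for any real workload.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.memory", "4g")           # executor heap size
    .config("spark.executor.cores", "2")             # cores per executor
    .config("spark.sql.shuffle.partitions", "400")   # shuffle parallelism for DataFrame/SQL jobs
    .config("spark.memory.fraction", "0.6")          # share of heap for execution + storage
    .getOrCreate()
)

# Trivial job to exercise the settings.
df = spark.range(1_000_000)
print(df.groupBy((df.id % 10).alias("bucket")).count().collect())
```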
  • Production Spark Series Part 2: Connecting Your Code to Spark Internals Recorded: May 9 2017 39 mins
    Sean Suchter, CTO/Co-Founder, Pepperdata
    Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. Users and operators who are aware of the concepts will become more effective at their interactions with Spark.
  • Big Data for Big Data: Machine Learning Models of Hadoop Cluster Behavior Recorded: Apr 10 2017 37 mins
    Sean Suchter, CTO/Co-Founder, Pepperdata and Shekhar Gupta, Software Engineer, Pepperdata
    Learn how to use machine learning to improve cluster performance.

    This talk describes the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.

    Performance of batch processing systems such as YARN is generally determined by throughput, which measures the amount of workload (tasks) completed in a given time window. For a given cluster size, the throughput can be increased by running as much workload as possible on each host, to utilize all the free resources available on that host. Because each node is running a complex combination of different tasks/containers, the performance characteristics of the cluster are dynamically changing. As a result, there is always a danger of overutilizing host memory, which can result in extreme swapping, or thrashing. The impact of thrashing can be very severe; it can actually reduce throughput instead of increasing it.

    By using very fine-grained (5 second) data from many production clusters running very different workloads, we have trained a generalized model that very rapidly detects the onset of thrashing, within seconds from the first symptom. This detection has proven fast enough to enable effective mitigation of the negative symptom of thrashing, allowing the hosts to continuously provide high throughput.

    To build this system we used hand-labeling of bad events combined with large-scale data processing using Hadoop, HBase, Spark, and IPython for experimentation. We will discuss the methods used as well as the novel findings about Big Data cluster performance.
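
    The abstract does not spell out the features or the model, so the sketch below is only a hypothetical illustration of the general setup: a supervised classifier trained on hand-labeled windows of fine-grained host metrics that flags the onset of excessive swapping. The synthetic data, feature names, and model choice are assumptions.

```python
# Hypothetical sketch of thrashing detection: a supervised classifier over windows of
# fine-grained host metrics. Features and model are illustrative, not the talk's model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in: each row is one 5-second window of host metrics, e.g.
# [free_memory_ratio, swap_in_rate, swap_out_rate, page_fault_rate, cpu_iowait].
n_windows = 5000
X = rng.random((n_windows, 5))
# Synthetic stand-in for hand-labeled targets: 1 = onset of excessive swapping, 0 = healthy.
y = (X[:, 1] + X[:, 2] + 0.1 * rng.standard_normal(n_windows) > 1.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LogisticRegression().fit(X_train, y_train)

print("holdout accuracy:", model.score(X_test, y_test))
# In production the model would score each new window within seconds of arrival,
# so mitigation (e.g., throttling new containers) can start before thrashing spreads.
```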
DevOps for Big Data
Pepperdata is the DevOps for Big Data company. Leading enterprise companies depend on Pepperdata to manage and improve the performance of Hadoop and Spark. Developers and operators use Pepperdata products and services to diagnose and solve performance problems in production and increase cluster utilization. The Pepperdata product suite improves communication of performance issues between Dev and Ops, shortens time to production, and increases cluster ROI. Pepperdata products and services work with customer Big Data systems both on-premises and in the cloud.
