
Pepperdata

  • Overcoming Performance Challenges of Building Spark Applications for AWS
    Overcoming Performance Challenges of Building Spark Applications for AWS Vinod Nair, Director of Product Management at Pepperdata Recorded: Aug 16 2017 39 mins
    Overcome Performance Challenges in Building Spark Applications for AWS is a webinar presentation intended for software engineers, developers, and technical leads who develop Spark applications for EMR or EC2 clusters.

    In this webinar, Vinod Nair will show you how to:

    Identify which portion of your application consumes the most resources
    Identify the bottlenecks slowing down your applications
    Test your applications against development or production workloads
    Significantly reduce time spent troubleshooting issues caused by ambient cluster conditions

    This webinar is followed by a live Q & A. A replay of this webinar will be available within 24 hours at https://www.pepperdata.com/resources/webinars/.
  • Kerberized HDFS – and How Spark on Yarn Accesses It
    Kerberized HDFS – and How Spark on Yarn Accesses It Kimoon Kim Recorded: Aug 9 2017 50 mins
    Following up on his recent presentation, HDFS on Kubernetes and the Lessons Learned, Senior Software Engineer Kimoon Kim presents on Kerberized HDFS and how Spark on YARN accesses it.
  • Creatively Visualizing Spark Data
    Creatively Visualizing Spark Data Christina Holland Recorded: Jul 18 2017 26 mins
    A Pepperdata tech talk by software engineer Christina Holland on creatively visualizing Spark data and designing new ways to view data pipelines.
  • Production Spark Series Part 4: Spark Streaming Delivers Critical Patient Care
    Production Spark Series Part 4: Spark Streaming Delivers Critical Patient Care Charles Boicey, Chief Innovation Officer, Clearsense Recorded: Jun 22 2017 58 mins
    Clearsense is a pioneer in healthcare data science solutions, using Spark Streaming to provide real-time updates to health care providers for critical health care needs. Clinicians can make timely decisions by assessing a patient's risk for Code Blue, Sepsis, and other conditions based on the analysis of information gathered from streaming physiological monitoring, streaming diagnostic data, and the patient's historical record. Additionally, this technology is used to monitor operational and financial processes for efficiency and cost savings. This talk discusses the architecture needed and the challenges associated with providing real-time SLAs along with 100% uptime expectations in a multi-tenant Hadoop cluster.
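    The kind of streaming risk assessment described above can be sketched in miniature. This is a hypothetical illustration in plain Python rather than Spark Streaming, and the threshold values are made up for the example, not clinical guidance:

    ```python
    from collections import deque

    # Hypothetical thresholds -- illustrative only, not clinical guidance.
    HR_HIGH = 130      # heart rate, beats per minute
    SPO2_LOW = 90      # blood oxygen saturation, percent

    def risk_flags(readings, window=6):
        """Flag points in a stream of vitals where sustained abnormal
        values suggest escalating patient risk. Each reading is a
        (heart_rate, spo2) tuple sampled at a fixed interval."""
        recent = deque(maxlen=window)
        flags = []
        for hr, spo2 in readings:
            recent.append(hr > HR_HIGH or spo2 < SPO2_LOW)
            # Raise a flag only when the whole window is abnormal,
            # filtering out single-sample sensor noise.
            flags.append(len(recent) == window and all(recent))
        return flags
    ```

    In a real deployment the same windowed logic would run as a stateful streaming operation over micro-batches, with the window length tuned to the monitoring sample rate.
    
    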
  • Spark Summit 2017 - Connect Code to Resource Consumption to Scale Production
    Spark Summit 2017 - Connect Code to Resource Consumption to Scale Production Vinod Nair, Director of Product Management Recorded: Jun 6 2017 26 mins
    Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will also discuss various sources of information on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. We will show how Pepperdata’s products can clearly identify such usages and tie them to specific lines of code. We will show how Spark application owners can quickly identify the root causes of such common problems as job slowdowns, inadequate memory configuration, and Java garbage collection issues.
  • Spark Summit 2017 – Spark Summit Bay Area Apache Spark Meetup
    Spark Summit 2017 – Spark Summit Bay Area Apache Spark Meetup Sean Suchter, Pepperdata Founder and CTO Recorded: Jun 5 2017 98 mins
    Bay Area Apache Spark Meetup at the 10th Spark Summit featuring tech-talks about using Apache Spark at scale from Pepperdata’s CTO Sean Suchter, RISELab’s Dan Crankshaw, and Databricks’ Spark committers and contributors.
  • HDFS on Kubernetes: Lessons Learned
    HDFS on Kubernetes: Lessons Learned Kimoon Kim, Engineer, Pepperdata Recorded: Jun 2 2017 36 mins
    There is growing interest in running Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark). Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When running Spark on Kubernetes, if the HDFS daemons run outside Kubernetes, applications will slow down while accessing the data remotely.

    In this webinar, we will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, we will show:

    - How the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping from Kubernetes containers to physical nodes to HDFS datanode daemons.
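    The two-step lookup described above (container to physical node, physical node to datanode) can be sketched as a small mapping function. This is a simplified, hypothetical illustration of the idea, not Pepperdata's or Spark's actual scheduler code; the dictionary inputs stand in for metadata that would come from the Kubernetes API and the HDFS namenode:

    ```python
    def node_local_executors(block_replicas, datanode_to_node, executor_to_node):
        """For each HDFS block, find the executor pods running on the same
        physical node as one of the block's datanode replicas, so tasks
        reading that block can be scheduled node-locally."""
        prefs = {}
        for block, datanodes in block_replicas.items():
            # Physical nodes that hold a replica of this block.
            nodes = {datanode_to_node[dn] for dn in datanodes}
            # Executor pods co-located with one of those nodes.
            prefs[block] = sorted(
                pod for pod, node in executor_to_node.items() if node in nodes
            )
        return prefs
    ```

    With the mapping in hand, a scheduler can prefer the listed pods when assigning tasks that read the corresponding block, recovering locality even though executors and datanodes live in different layers.
    
    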
  • Production Spark Series Part 3: Tuning Apache Spark Jobs
    Production Spark Series Part 3: Tuning Apache Spark Jobs Simon King, Engineer, Pepperdata Recorded: May 30 2017 40 mins
    A Spark application that worked well in a development environment or with sample data may not behave as expected when run against a much larger dataset in a production environment. Pepperdata Application Profiler, based on the open source Dr. Elephant project, can help you tune your Spark application based on current dataset characteristics and the cluster execution environment. Application Profiler uses a set of heuristics to provide actionable recommendations to help you quickly tune your applications.

    Occasionally an application will fail (or execute too slowly) due to circumstances outside your control: a busy cluster, another misbehaving YARN application, bad luck, or bad "cluster weather". We'll discuss ways to use Pepperdata's Cluster Analyzer to quickly determine when an application failure may not be your fault, and how to diagnose and fix the symptoms that are within your control.
  • Production Spark Series Part 2: Connecting Your Code to Spark Internals
    Production Spark Series Part 2: Connecting Your Code to Spark Internals Sean Suchter, CTO/Co-Founder, Pepperdata Recorded: May 9 2017 39 mins
    Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. Users and operators who are aware of the concepts will become more effective at their interactions with Spark.
  • Big Data for Big Data: Machine Learning Models of Hadoop Cluster Behavior
    Big Data for Big Data: Machine Learning Models of Hadoop Cluster Behavior Sean Suchter, CTO/Co-Founder, Pepperdata and Shekhar Gupta, Software Engineer, Pepperdata Recorded: Apr 10 2017 37 mins
    Learn how to use machine learning to improve cluster performance.

    This talk describes the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.

    Performance of batch processing systems such as YARN is generally measured by throughput: the amount of workload (tasks) completed in a given time window. For a given cluster size, throughput can be increased by running as much workload as possible on each host, utilizing all of its free resources. Because each node runs a complex mix of different tasks/containers, the performance characteristics of the cluster change dynamically. As a result, there is always a danger of overutilizing host memory, which can result in extreme swapping, or thrashing. The impact of thrashing can be severe; it can actually reduce throughput instead of increasing it.

    By using very fine-grained (5-second) data from many production clusters running very different workloads, we have trained a generalized model that detects the onset of thrashing within seconds of the first symptom. This detection has proven fast enough to enable effective mitigation of the negative symptoms of thrashing, allowing the hosts to continuously provide high throughput.
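    A toy version of this kind of onset detection can be sketched as a threshold rule over the 5-second samples. This is a hypothetical stand-in for the trained model described above; the function name, threshold, and sample values are invented for illustration:

    ```python
    def thrashing_onset(swapped_pages, rate_threshold=1000, sustained=2):
        """Return the index of the first 5-second sample at which the
        page-swap rate has exceeded `rate_threshold` pages/sec for
        `sustained` consecutive samples, or None if it never does.
        Requiring consecutive samples suppresses one-off spikes."""
        streak = 0
        for i, pages in enumerate(swapped_pages):
            rate = pages / 5.0          # pages swapped per second
            streak = streak + 1 if rate > rate_threshold else 0
            if streak >= sustained:
                return i
        return None
    ```

    A learned model replaces the fixed threshold with a decision boundary fitted to labeled events across many clusters, which is what allows detection to generalize across very different workloads.
    
    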

    To build this system we used hand-labeling of bad events combined with large-scale data processing using Hadoop, HBase, Spark, and IPython for experimentation. We will discuss the methods used as well as the novel findings about Big Data cluster performance.
