Classifying Multi-Variate Time Series at Scale

Characterizing and understanding the runtime behavior of large-scale Big Data production systems is extremely important. Typical systems consist of hundreds to thousands of machines in a cluster with hundreds of terabytes of storage, cost millions of dollars, and solve business-critical problems. By instrumenting each running process and recording its resource utilization (CPU, memory, I/O, network, and so on) as time series, it is possible to understand and characterize the workload on these massive clusters. Each time series consists of tens to tens of thousands of data points that must be ingested and then classified. At Pepperdata, our cluster instrumentation collects over three hundred metrics from each task every five seconds, resulting in millions of data points per hour. At this scale, the data are comparable to the biggest IoT data sets in the world. Our objective is to classify the collection of time series into a set of classes that represent different workload types. In other words, our problem is essentially that of classifying multivariate time series.
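
As a rough back-of-the-envelope check on the volumes quoted above, the following Python sketch computes data points per hour from the stated figures (three hundred metrics per task, one sample every five seconds). The abstract does not say how many tasks run concurrently, so the task count used here is purely a hypothetical value for illustration.

```python
# Back-of-the-envelope data volume, using the figures from the abstract.
METRICS_PER_TASK = 300      # "over three hundred metrics", treated as a lower bound
SAMPLE_INTERVAL_S = 5       # one sample every five seconds
SECONDS_PER_HOUR = 3600


def points_per_hour(concurrent_tasks: int) -> int:
    """Data points collected per hour for a given number of concurrent tasks."""
    samples_per_hour = SECONDS_PER_HOUR // SAMPLE_INTERVAL_S  # 720 samples
    return METRICS_PER_TASK * samples_per_hour * concurrent_tasks


# ASSUMPTION: 10 concurrent tasks is a made-up figure, not taken from the talk.
per_task = points_per_hour(1)
for_ten_tasks = points_per_hour(10)
print(per_task, for_ten_tasks)
```

Under these assumptions a single task already produces a few hundred thousand points per hour, so even a handful of concurrent tasks reaches the "millions of data points per hour" range.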

Intended for machine learning researchers and developers who use machine learning in their applications, Pepperdata CEO Ash Munshi presents a unique, off-the-shelf approach to classifying time series that achieves near best-in-class accuracy for univariate series and generalizes to multivariate time series.
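
The abstract does not describe the approach itself. Purely to make the problem statement concrete, here is a minimal baseline sketch of multivariate time series classification: a 1-nearest-neighbour classifier using Euclidean distance over per-metric z-normalized series. This is a common textbook baseline, not the method presented in the talk; the metric names, labels, and sample values are all hypothetical, and the sketch assumes every series has the same length and the same number of metrics.

```python
import math
from statistics import mean, pstdev

Series = list[list[float]]  # one inner list per metric (channel), equal lengths


def z_normalize(channel: list[float]) -> list[float]:
    """Scale one channel to zero mean and unit variance (constant channels become zeros)."""
    mu, sigma = mean(channel), pstdev(channel)
    if sigma == 0:
        return [0.0] * len(channel)
    return [(x - mu) / sigma for x in channel]


def distance(a: Series, b: Series) -> float:
    """Euclidean distance across all channels after per-channel z-normalization."""
    total = 0.0
    for ch_a, ch_b in zip(a, b):
        za, zb = z_normalize(ch_a), z_normalize(ch_b)
        total += sum((x - y) ** 2 for x, y in zip(za, zb))
    return math.sqrt(total)


def classify(query: Series, labeled: list[tuple[Series, str]]) -> str:
    """Return the label of the nearest labeled series (1-nearest-neighbour)."""
    return min(labeled, key=lambda item: distance(query, item[0]))[1]


# HYPOTHETICAL two-metric examples (say, CPU and I/O); values are invented.
training = [
    ([[1, 2, 3, 4], [4, 3, 2, 1]], "cpu_ramp"),
    ([[4, 3, 2, 1], [1, 2, 3, 4]], "io_ramp"),
]
label = classify([[10, 20, 30, 40], [8, 6, 4, 2]], training)
print(label)
```

Z-normalizing each metric separately keeps a metric with large absolute values (for example, bytes of I/O) from dominating one with small values (for example, CPU fraction), which is why the query above matches a training series of the same shape despite its different scale.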

Before joining Pepperdata, Ash was executive chairman for Marianas Labs, a deep learning startup sold in December 2015. Prior to that he was CEO for Graphite Systems, a big data storage startup that was sold to EMC DSSD in August 2015. Munshi also served as CTO of Yahoo, as a CEO of both public and private companies, and is on the board of several technology startups.
Recorded Dec 7 2017 27 mins
Presented by
Ash Munshi
  • Spark Application Performance Management with Pepperdata Application Spotlight Feb 21 2018 7:00 pm UTC 60 mins
    Vinod Nair
    Pepperdata Application Spotlight analyzes all Hadoop and Spark jobs running on the cluster and provides developers with technical insights on how each job performed. Intended for software engineers, developers, and technical leads who develop Spark applications, this webinar demonstrates how Application Spotlight helps developers quickly improve application performance, reduce resource usage, and understand application failures. Participate in this webinar and learn how developers can:

    –Identify the lines of code and the stages that cause performance issues related to CPU, memory, garbage collection, network, and disk I/O

    –Easily disambiguate resources used during parallel stages

    –Understand why run-time variations occur for the same application

    –Determine whether performance issues are due to the application or other workloads on the cluster

    –Receive actionable recommendations for tuning jobs

    –Validate tuning changes made to applications with a before and after comparison

    –View highlights of the worst-performing phases of jobs

    –Improve MapReduce and Spark developer productivity

    –Improve cluster efficiency based on clear recommendations on how to modify workloads and configurations


    Vinod Nair leads product management at Pepperdata. He brings more than 20 years of experience in engineering and product management to the job, with a special interest in distributed systems and Hadoop. He has worked in software for telecommunications, financial management for small business, and big data. Vinod’s approach to product management is deeply influenced by his success in applying Lean Startup principles and rapid iteration to product design and development.
  • Pepperdata Hadoop and Spark Performance Solutions for Dev and Ops Recorded: Feb 7 2018 58 mins
    Kirk Lewis
    Despite tremendous progress, there remain critically important areas, including multi-tenancy, performance optimization, and workflow monitoring where the DevOps team still needs management help. Pepperdata is the first company to integrate deep performance measurement and understanding into the DevOps process for Big Data applications. Pepperdata products enable developers to rapidly debug, optimize, and understand production applications while also enabling operators to diagnose and automatically solve performance problems in production multi-tenant clusters. Presented by Field Engineer Kirk Lewis, this webinar is an overview of Pepperdata products and services.

    In this online webinar followed by a live Q and A, Field Engineer Kirk Lewis will show you how to:

    • Reduce time to problem resolution using comprehensive and detailed performance data–Pepperdata Platform Spotlight helps operators overcome Hadoop and Spark performance limitations by monitoring all facets of cluster performance in real time, including CPU, RAM, disk I/O, and network usage by user, job, and task.

    • Increase capacity utilization by 30-50% without adding new hardware–Pepperdata adaptively and automatically tunes the cluster based on real-time resource utilization with performance improvement results that cannot be achieved through manual tuning.

    • Help developers understand and improve application performance–Pepperdata Application Spotlight enables developers to identify and fix application performance problems, excessive usage of resources, and application errors.
  • Building a Big Data Stack on Kubernetes Recorded: Jan 25 2018 51 mins
    Pepperdata Founder and CTO, Sean Suchter
    There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark).

    Intended for software engineers, developers, architects and technical leads who develop Spark applications, this session will discuss how to build a big data stack on Kubernetes. In particular, Sean will demonstrate:

    –The official Apache Spark 2.3 Kubernetes integration
    –How Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping of Kubernetes containers to physical nodes to HDFS datanode daemons.
    –How you can provide Spark with the high availability of the critical HDFS namenode service when running HDFS in Kubernetes.
  • Fix Spark Failures and Bottlenecks Faster and Easier Recorded: Jan 17 2018 49 mins
    Vinod Nair
    Intended for software engineers, developers, and technical leads who develop Spark applications, this webinar discusses the results of analyzing many Spark jobs on many multi-tenant production clusters, the common issues seen, the symptoms of those issues, and how developers can address them. Pepperdata has gathered trillions of performance data points on production clusters running Spark, covering a variety of industries, applications, and workload types.

    Presenter Vinod Nair will talk about key performance insights (best and worst practices, gotchas, and tuning recommendations) based on analyzing the behavior and performance of millions of Spark applications. In addition, Vinod will describe how we are turning these learnings into heuristics leveraged from the open source Dr. Elephant project.

    This webinar is followed by a live Q & A. A replay of this webinar will be available within 24 hours at https://www.pepperdata.com/resources/webinars/.
  • Pepperdata Application Summary Page Overview Recorded: Dec 19 2017 22 mins
    Alex Pierce
    Find any application easily with a simple new application search capability. Intended for software engineers, developers, operators, architects and technical leads who develop Spark applications, Pepperdata has simplified the task of application performance management. Pepperdata Field Engineer Alex Pierce demonstrates how to identify bottlenecks and get recommendations and insights to improve the performance of your application in one place.
  • Classifying Multi-Variate Time Series at Scale Recorded: Dec 7 2017 27 mins
    Ash Munshi
  • Building a Big Data Stack on Kubernetes Recorded: Dec 6 2017 55 mins
    Sean Suchter
    There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark).

    Intended for software engineers, developers, architects and technical leads who develop Spark applications, this session will discuss how to build a big data stack on Kubernetes. In particular, it will show how Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping of Kubernetes containers to physical nodes to HDFS datanode daemons. You’ll also learn how you can provide Spark with the high availability of the critical HDFS namenode service when running HDFS in Kubernetes.
  • Fix Spark Failures and Bottlenecks Faster and Easier Recorded: Nov 16 2017 51 mins
    Vinod Nair
    Fix Spark Failures and Bottlenecks Faster and Easier is a webinar presentation intended for software engineers, developers, and technical leads who develop Spark applications. Pepperdata has gathered trillions of performance data points on production clusters running Spark, covering a variety of industries, applications, and workload types.

    This presentation discusses the results of analyzing many Spark jobs on many multi-tenant production clusters. Pepperdata Field Engineer Kirk Lewis will discuss common issues seen, the symptoms of those issues, and how developers can address them. This discussion includes key performance insights (best and worst practices, gotchas, and tuning recommendations) based on analyzing the behavior and performance of millions of Spark applications. In addition, Kirk will describe how we are turning these learnings into heuristics used in the open source Dr. Elephant project.

    This webinar is followed by a live Q & A. A replay of this webinar will be available within 24 hours at https://www.pepperdata.com/resources/webinars/.
  • Effective High-Speed Multi-Tenant Data Lakes Recorded: Oct 25 2017 45 mins
    Sean Suchter, CTO and founder, Pepperdata
    Big Data has increased the demand for big data management solutions that operate at scale and meet business requirements. Big Data organizations realize quickly that scaling from small, pilot projects to large-scale production clusters involves a steep learning curve. Despite tremendous progress, critically important areas including multi-tenancy, performance optimization, and workflow monitoring remain areas where the operations team still needs management help.

    Intended for enterprises who already have a data lake or are setting up their first data lake, this presentation will discuss how to implement data lakes with operations tools that automatically optimize clusters with solutions for monitoring, performance tuning, and troubleshooting in production environments.

    Sean is the co-founder and CTO of Pepperdata. Previously, Sean was the founding GM of Microsoft’s Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search. Prior to Microsoft, Sean managed the Yahoo Search Technology team, the first production user of Hadoop. Sean joined Yahoo through the acquisition of Inktomi, and holds a B.S. in Engineering and Applied Science from Caltech.
  • Solving Performance Bottlenecks For Spark Developers Recorded: Oct 11 2017 47 mins
    Vinod Nair, Director of Product Management
    Intended for software engineers, developers, architects and technical leads who develop Spark applications, Vinod Nair will discuss how the Pepperdata product suite helps developers in Big Data environments.
  • Strata Data Conference NYC –Pepperdata –HDFS on Kubernetes: Lessons Learned Recorded: Sep 28 2017 41 mins
    Kimoon Kim
    There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.
  • Top Considerations When Choosing a Big Data Management and Performance Solution Recorded: Sep 20 2017 54 mins
    Kirk Lewis
    The growing adoption of Hadoop and Spark has increased the demand for Big Data management solutions that operate at scale and meet business requirements. However, Big Data organizations realize quickly that scaling from small pilot projects to large-scale production clusters involves a steep learning curve. Despite tremendous progress, there remain critically important areas, including multi-tenancy, performance optimization, and workflow monitoring, where the DevOps team still needs management help. In this webinar, Field Engineer Kirk Lewis discusses the top considerations when choosing a big data management and performance solution.
  • HDFS on Kubernetes: Lessons Learned Recorded: Sep 19 2017 46 mins
    Kimoon Kim, Pepperdata Software Engineer
    HDFS on Kubernetes: Lessons Learned is a webinar presentation intended for software engineers, developers, and technical leads who develop Spark applications and are interested in running Spark on Kubernetes. Pepperdata has been exploring Kubernetes as a potential Big Data platform with several other companies as part of a joint open source project.

    In this webinar, Kimoon Kim will show you how to: 

    –Run Spark applications natively on Kubernetes
    –Enable Spark on Kubernetes to read and write data securely on HDFS protected by Kerberos
  • Making OpenTSDB Perform at Massive Scale - Pepperdata Meetup Replay Recorded: Aug 29 2017 33 mins
    Simon King
    OpenTSDB is an open-source time series database built on top of HBase. Thanks to HBase, OpenTSDB scales very nicely to accommodate large amounts of data in terms of bytes or data points; at Pepperdata we ingest hundreds of billions of data points per day. Where OpenTSDB struggles to scale is in the number of distinct time series. Pepperdata stores time series data on all the hardware and processes across many Hadoop clusters: billions of discrete series per day. Speaker Simon King will discuss some of OpenTSDB's strengths and weaknesses, and some of the techniques Pepperdata uses to work around its limitations. Originally presented at Galvanize in San Francisco on 8/21/17.
  • Overcoming Performance Challenges of Building Spark Applications for AWS Recorded: Aug 16 2017 39 mins
    Vinod Nair, Director of Product Management at Pepperdata
    Overcoming Performance Challenges of Building Spark Applications for AWS is a webinar presentation intended for software engineers, developers, and technical leads who develop Spark applications for EMR or EC2 clusters.

    In this webinar, Vinod Nair will show you how to:

    –Identify which portion of your application consumes the most resources
    –Identify the bottlenecks slowing down your applications
    –Test your applications against development or production workloads
    –Significantly reduce troubleshooting issues due to ambient cluster conditions

    This webinar is followed by a live Q & A. A replay of this webinar will be available within 24 hours at https://www.pepperdata.com/resources/webinars/.
  • Kerberized HDFS – and How Spark on Yarn Accesses It Recorded: Aug 9 2017 50 mins
    Kimoon Kim
    Following up on his recent presentation, HDFS on Kubernetes: Lessons Learned, Senior Software Engineer Kimoon Kim presents on Kerberized HDFS and how Spark on YARN accesses it.
  • Creatively Visualizing Spark Data Recorded: Jul 18 2017 26 mins
    Christina Holland
    Pepperdata tech talk by Pepperdata software engineer Christina Holland on creatively visualizing Spark data and designing new ways to see new pipelines.
  • Production Spark Series Part 4: Spark Streaming Delivers Critical Patient Care Recorded: Jun 22 2017 58 mins
    Charles Boicey, Chief Innovation Officer, Clearsense
    Clearsense is a pioneer in healthcare data science solutions, using Spark Streaming to provide real-time updates to health care providers for critical health care needs. Clinicians can make timely decisions by assessing a patient's risk for Code Blue, sepsis, and other conditions, based on the analysis of information gathered from streaming physiological monitoring along with streaming diagnostic data and the patient's historical record. Additionally, this technology is used to monitor operational and financial processes for efficiency and cost savings. This talk discusses the architecture needed and the challenges associated with providing real-time SLAs along with 100% uptime expectations in a multi-tenant Hadoop cluster.
  • Spark Summit 2017 - Connect Code to Resource Consumption to Scale Production Recorded: Jun 6 2017 26 mins
    Vinod Nair, Director of Product Management
    Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will also discuss various sources of information on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. We will show how Pepperdata’s products can clearly identify such usages and tie them to specific lines of code. We will show how Spark application owners can quickly identify the root causes of such common problems as job slowdowns, inadequate memory configuration, and Java garbage collection issues.
  • Spark Summit 2017 – Spark Summit Bay Area Apache Spark Meetup Recorded: Jun 5 2017 98 mins
    Sean Suchter, Pepperdata Founder and CTO
    Bay Area Apache Spark Meetup at the 10th Spark Summit featuring tech-talks about using Apache Spark at scale from Pepperdata’s CTO Sean Suchter, RISELab’s Dan Crankshaw, and Databricks’ Spark committers and contributors.
DevOps for Big Data
Pepperdata is the DevOps for Big Data company. Leading Enterprise companies depend on Pepperdata to manage and improve the performance of Hadoop and Spark. Developers and operators use Pepperdata products and services to diagnose and solve performance problems in production and increase cluster utilization. The Pepperdata product suite improves communication of performance issues between Dev and Ops, shortens time to production, and increases cluster ROI. Pepperdata products and services work with customer Big Data systems both on-premises and in the cloud.

  • Title: Classifying Multi-Variate Time Series at Scale
  • Live at: Dec 7 2017 7:00 pm
  • Presented by: Ash Munshi