Troubleshooting Apache Zookeeper Like a CSI

How would you investigate bugs in Apache Zookeeper?

Many engineering teams leverage Apache Zookeeper to build highly available systems at scale. As with any complex system, you hope everything performs as expected. But occasionally bugs arise, and investigating and fixing them takes people with forensic skills, similar to those of a detective, at the ready. Register to learn how PagerDuty engineers discovered the Zookeeper Poison Packet and how they:

Discovered bugs across multiple systems
Investigated the bugs to better understand underlying issues
Solved issues quickly and methodically to maintain high availability
Recorded May 31 2016 34 mins
Presented by
Evan Gilman, Operations Engineer and Arup Chakrabarti, Director of Infrastructure

  • Achieving Real-Time Operations: Latest Integrations for Process Automation Dec 13 2018 5:00 pm UTC 25 mins
    Andrew Marshall, PagerDuty; Sean Higgins, PagerDuty; Monte Montoya, Cprime
    Over 3,000 Atlassian users count on PagerDuty’s 300+ integrations to power their end-to-end real-time digital operations—and we’re excited to launch a comprehensive set of integrations for the Atlassian suite, which allows Atlassian users to add PagerDuty’s real-time digital operations capabilities to their everyday work.

    In this webinar, PagerDuty product manager Sean Higgins and Cprime’s Monte Montoya will show you how PagerDuty’s new Atlassian integrations help teams automate processes throughout the DevOps lifecycle, save time, and take full advantage of Jira Software, Jira Service Desk, Bitbucket, Statuspage, and other Atlassian tools.
  • Managing High-Severity Issues in Support Recorded: Nov 14 2018 32 mins
    Luke Kanter, Senior Product Manager, Business Operations, FanDuel; Lauren Wang, Director of Solutions Marketing, PagerDuty
    Availability is a must for leading daily fantasy sports site FanDuel, so its millions of users can play games during live sporting events. But the majority of these happen after working hours, so what happens if there’s a high-severity issue and you’re in Support? With FanDuel’s technical teams five time zones away, how do you mobilize the right resources, especially during peak traffic and usage times? In this session, learn how FanDuel addresses these challenges using PagerDuty for SupportOps.
  • Security and Developers: a match made in PagerDuty Recorded: Oct 25 2018 45 mins
    David Cliffe, PagerDuty
    As developers, it can be overwhelming trying to balance the onslaught of new features, bug fixes, and operational improvements, on top of being on-call for your services with a steady flow of alerts and incidents to handle.

    Why not throw some security vulnerabilities and patches into the mix, right? Thankfully, that's someone else's job - well, it used to be. This session will highlight ways that security ownership is changing with DevSecOps and how PagerDuty can help make that an easier transition for both sides: developers and security teams.
  • Don't Be a Bystander, Be an Incident Commander Recorded: Oct 16 2018 17 mins
    Rachael Byrne, Data Enablement, PagerDuty
    Many organizations have some kind of incident response process to coordinate during a major service outage. Some operationally mature companies incorporate a formal Incident Commander role in their process for a faster, more effective response. The Incident Commander serves as the final decision-maker during a major incident, delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution. Whether or not a company has a formal process that includes an Incident Commander, most companies believe that your most senior engineer is best suited to lead an incident response. Rachael challenges this assumption.

    Rachael has learned first hand that successful Incident Commanders do not need to be highly technical, let alone senior, to effectively lead a coordinated response to a major incident. Comfort with a structured process and soft skills such as communication are actually more important than technical knowledge for an effective Incident Commander.

    All organizations need to maximize the number of people able to lead a major incident response to avoid burn-out of their most senior technical leaders and increase overall availability of their service. In this webinar, you’ll learn how to develop an inclusive incident response process that welcomes more Incident Commanders without compromising response effectiveness, one you can immediately apply at your own organization.
  • Best Practices in Driving DevOps Transformation Recorded: Jul 17 2018 21 mins
    Paul Rechsteiner, Principal Product Manager, Incident Response
    The benefits of DevOps transformation are significant and well-understood, with over 70% of organizations having already implemented or planning to implement DevOps within the next 12 months according to Forrester. But it’s not a change that happens overnight. After empowering tens of thousands of operations teams to effectively manage their services in production, we’ve distilled what we’ve learned from them into a few best practices that have been foundational to driving change and transforming to be more agile.

    In this session, we’ll go over a few topics:

    - Common operational challenges across the organizational spectrum
    - The top best practices that organizations use to empower distributed teams
    - How to enable both central and distributed teams to gain comprehensive visibility, take action on issues fast, and be more productive
    - A couple of case studies of real organizations who have achieved successes in the journey toward DevOps transformation, and how they’re getting there
  • What's the Future of DevOps? Recorded: Jul 10 2018 40 mins
    Tori Wieldt, New Relic, Tony Hansmann, Pivotal, Matt Stratton, PagerDuty, and Vera Chen, PagerDuty
    The adoption of DevOps fundamentals has changed the game of software delivery. And DevOps has come a long way from being a methodology that was considered “for startups only” to what it is today: a culture- and people-first approach to engineering that enables teams to deploy faster, eliminate silos, and iteratively improve no matter your size. Do it right, and your team stands to release much faster, more predictably, and more safely, while mitigating unplanned work and making it easier to get ahead of customer experience.

    Join us as we discuss the future of DevOps with leaders from Pivotal, New Relic, and PagerDuty. We’ll look back at where DevOps has been, what it’s brought to the table, and where it’s going.

    • What are the biggest challenges organizations face when cultivating DevOps adoption?
    • What key fundamentals of DevOps are here to stay?
    • Can DevOps best practices and fundamentals be adopted beyond the engineering team?
    • Is it time for the enterprise to finally embrace DevOps?
    • What’s the next big thing for DevOps?

    Tori Wieldt, Developer Advocate, New Relic
    Tony Hansmann, Field CTO, Pivotal
    Matt Stratton, DevOps Evangelist, PagerDuty
    Vera Chen, Technical Marketing Manager, PagerDuty
  • Improving Your Employee Retention With Real-Time Ops Data Recorded: Jul 3 2018 32 mins
    Ophir Ronen, PagerDuty, Stephen O’Grady, RedMonk, and Mary Moore-Simmons, SendGrid
    Employee attrition is a challenge that every organization, regardless of industry, faces. People change jobs and move companies for many different reasons. This got us thinking—is on-call health a factor that can significantly affect employee attrition rates? If the health factors of on-call pain are bad enough, could it cause your employees to leave your company prematurely?

    Join us for our upcoming webinar, where we’ll address these questions and share results from a global survey of IT managers and practitioners on employee attrition.

    We’ll also show you how to tackle employee dissatisfaction with real-time operations health data so you can effectively build and retain teams responsible for your digital transformation. You’ll gain insight into the most impactful operations health indicators as they relate to average employee tenure length across your organization, including:

    • Notifications that interrupt work and life.
    • Notifications that wake up responders at night.
    • Notifications that interrupt weekends.
    • Notifications that interrupt consecutive weekend days.
  • Selecting the Right Data Science Approach for Your Operations Recorded: May 16 2018 16 mins
    Lilia Gutnik, Product Manager, PagerDuty
    Organizations want to modernize their IT Operations with the promise of data science and machine learning. The potential results from harnessing data are awesome: automate repetitive tasks, focus on innovation instead of maintenance, and profit from actionable insights that grow the business.

    However, many teams struggle with the challenge of applying data science to their operations in a way that is suited to their business, their resources, and their goals.

    In this webinar, we’ll answer key questions such as:

    • Which operations challenges can be improved with Data Science?
    • How can data science help your business?
    • How do you evaluate what data science is right for you?
  • 52% of Companies Sacrifice Cybersecurity for Speed Recorded: Apr 24 2018 47 mins
    Pete Cheslock, Head of Ops, Threat Stack and Franklin Mosley, Senior Application Security Engineer, PagerDuty
    A recent Threat Stack survey finds that over 50% of companies admit to cutting back on security measures to meet a business deadline or objective. As long as companies are willing to sacrifice security at the altar of speed, the long-held dream of marrying DevOps and security simply won’t come true.
    Join this webinar to hear Pete Cheslock, Threat Stack head of Ops, and Franklin Mosley, PagerDuty Senior Application Security Engineer discuss the current status, gaps, and obstacles of DevSecOps. Here are just a few of the survey findings:

    • 68% of companies say that their CEO demands DevOps and security teams not do anything that slows the business down
    • 57% of companies say their operations team pushes back on security best practices
    • 44% of developers are not trained to code securely
  • Organizing and Optimizing ITSM Toolsets Recorded: Nov 22 2017 29 mins
    Manish Kalra, Director of Product Marketing, PagerDuty
    IT Service Management (ITSM) is an approach for designing, delivering, managing and improving the way IT is used within an organization. To make that approach a reality, a core requirement is having the right strategic toolset for your unique organizational needs.

    But what are the right tools to choose to help you deliver optimal services and keep your application and critical infrastructure available? How do you organize all the information these tools are feeding your organization every day?

    Join PagerDuty as they take you through what it takes to:

    ● Consolidate multiple ITSM services into one hub
    ● Support a 2-speed IT infrastructure and multiple ITSM processes within a single organization
    ● Evaluate integrations and flexibility of potential toolsets
  • DevOps at Scale: Using AWS & PagerDuty to Improve Growth & Incident Resolution Recorded: Oct 19 2017 39 mins
    Thomas Robinson, Solutions Architect, AWS & Eric Sigler, Head of DevOps, PagerDuty; Christopher Hoey, SRE, Datadog
    Meeting the demands of ever-changing IT management and security requirements means evolving how you both respond to and resolve incidents. It’s critical for organizations to adopt a scalable DevOps solution that integrates with their current monitoring systems to enable collaboration across development and operations teams, reducing the mean time to resolution.

    PagerDuty works with AWS services like Amazon CloudWatch, to provide rapid incident response with rich, contextual details that allow you to analyze trends and monitor the performance of your applications and AWS environment.
  • Introduction to Being an Incident Responder Recorded: Sep 28 2017 30 mins
    Eric Sigler - Head of DevOps @ PagerDuty
    Best practices to succeed during major incident response

    What do you do when the unexpected happens and causes customer-impacting downtime? It’s of the utmost importance that you are prepared and can get your systems back into full working order as quickly as possible. It’s crucial to have a well-defined strategy to come together as a team, work the problem, and get to a solution quickly.

    Drawing from the experiences of thousands of operationally mature teams, this incident responder training will help you gain the understanding required to help support your team’s success when mitigating customer-impacting issues.

    Join us to learn:
    • What is incident response?
    • The roles involved in incident response
    • How to incorporate learnings from previous incident responses
    • Skills for success
  • The Definitive Incident Resolution Lifecycle for Modern Ops Recorded: Sep 20 2017 39 mins
    Dave Cliffe, Group Product Manager, PagerDuty & Sean Higgins, Product Manager, PagerDuty
    The stakes of managing complex infrastructure continue to increase alongside the ever-increasing costs of outages. And while many IT Operations teams are investing in monitoring and ITSM tools to detect issues, they are often forced to react to high volumes of event data without context and without any consistent, well-defined processes. This leads to costly operational inefficiencies, employee burnout, and extended customer downtime. In this webinar, you'll learn how to:

    • Optimize your ITSM toolsets by integrating people, data, and processes
    • Maximize cross-functional transparency and consistency
    • Prioritize incidents with well-defined rules
    • Automate troubleshooting and remediation
    • Improve problem management with postmortems and continuous learning across your team
  • Introduction to Being an Incident Commander Recorded: Jul 25 2017 45 mins
    Eric Sigler - Head of DevOps @ PagerDuty
    During a major customer-impacting incident, every minute counts. The team must work together seamlessly and quickly to a successful resolution — and that’s done best with an Incident Commander driving coordination.

    An Incident Commander is tasked with being the decision maker, delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution. This is a critical role and there are essential best practices you must follow to get it right.

    Join us to learn:
    • What is an Incident Commander?
    • The role and responsibilities of an Incident Commander
    • Incident call procedures and terminology
    • Incident Commander skills for success
  • Introduction to Being an Incident Commander Recorded: Jul 12 2017 40 mins
    Eric Sigler - Head of DevOps @ PagerDuty
    During a major customer-impacting incident, every minute counts. The team must work together seamlessly and quickly to a successful resolution — and that’s done best with an Incident Commander driving coordination.

    An Incident Commander is tasked with being the decision maker, delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution. This is a critical role and there are essential best practices you must follow to get it right.

    Join us to learn:
    • What is an Incident Commander?
    • The role and responsibilities of an Incident Commander
    • Incident call procedures and terminology
    • Incident Commander skills for success
  • Better Incident Management with ChatOps Recorded: Jun 27 2017 27 mins
    Sidd Singh (Senior Product Manager)
    ChatOps (conversation-driven development) is changing the way development and operations teams work, helping increase productivity by an average of 32% and team transparency by 80.4%. By bringing your tools into your conversations, you can automate tasks, develop, and fix issues more effectively by learning and working in a single environment.

    Join us as we share use cases and examples of how to leverage ChatOps for better collaboration with more flexibility and speed than ever before. We share and demo our ChatOps extensions to popular chat tools, including HipChat, Slack, Flowdock, Cisco Spark, and more!

    You’ll discover how to:
    - Streamline ChatOps incident management
    - Leverage message buttons to increase speed during response
    - Easily fix issues in context without toggling between tools
    - Automate ops-related tasks with bots and slash commands
    - Enforce user permissions and analytics
  • Reach On-Call Teams Faster With Live Call Routing Recorded: Mar 6 2017 30 mins
    Tyler Wells, Customer Support & Success Manager, PagerDuty
    Prevent incidents from becoming business-impacting by notifying the on-call team immediately. With Live Call Routing, anyone can reach your on-call teams in real-time to report incidents simply by calling a phone number. Teams can ensure incidents are received and resolved faster.

    Join us as we showcase configuration and real product examples of how PagerDuty’s Live Call Routing capability is enabling improved visibility and response times for customers.

    Discover how Live Call Routing provides:
    - Automatic call forwarding via schedules and escalations
    - Triggered incidents with just a phone call
    - Phone tree to reach specific teams
    - Global numbers
  • Full-Stack Anomaly Detection and Response Orchestration Recorded: Mar 6 2017 28 mins
    David Cooper, Product Manager, PagerDuty
    Microservices architectures have unleashed unprecedented amounts of application data on organizations. More often than not, there’s no way to correlate data coming from siloed tools looking at only a single part of critical apps or infrastructure, making it difficult to understand the overall health of the digital business, diagnose the root cause when service disruptions occur, and coordinate a response in real-time.

    With PagerDuty’s Operations Command Console, you can visualize the health of applications, services, and infrastructure while managing incident response workflows all in one place to easily mobilize, coordinate, and orchestrate both technical and business response to incidents.

    Discover how the Operations Command Console provides:
    - A single view for full-stack event intelligence and response workflows
    - Interactive and customizable applications for actionable insights
    - Shared context between infrastructure, service health, incidents, and response
    - Pattern and anomaly detection across all your data sources
  • Oracle Delivers Better Customer Experience With PagerDuty’s Operations Command C Recorded: Jan 26 2017 41 mins
    Nestor Camacho, Manager Network Operations Center at Oracle + Ophir Ronen, Product Director, Event Management at Pag
    Learn how Oracle is using PagerDuty to visualize every dimension of the customer experience and create unified views of application performance, infrastructure health, and incident response to deliver better software.

    Discover how Oracle is using PagerDuty and the newly released Operations Command Console to:
    - Respond to incidents in real-time
    - Accelerate Mean Time To Identification (MTTI)
    - Support better postmortems
    - Get full-stack visibility into the health of their applications, services, and infrastructure

    This webinar showcases real product examples of how the Operations Command Console from PagerDuty is enabling Oracle to deliver better software.
  • Streamline Critical Communications With Stakeholder Engagement Recorded: Jan 26 2017 35 mins
    Jeremy Bourque, Product Manager at PagerDuty
    When a major incident occurs, its impact is felt across the organization. While the technical response is underway, stakeholders from all areas of the business — including public relations, support, legal, executives, and more — must all be engaged and kept informed so they can immediately respond and minimize the overall business impact.

    With Stakeholder Engagement, you can streamline the process of identifying and notifying key business stakeholders and maintaining communication during a major IT incident.

    Discover how Stakeholder Engagement provides:
    - A single source of truth for critical, real-time updates
    - Improved alignment between IT and the business during incidents
    - Automation of communication tasks
    - Streamlined post-mortems
    - New licensing to support business user needs
Your Fastest Path to Incident Resolution
PagerDuty is helping IT Operations and DevOps professionals deliver on the promise of agility, performance and uptime. Our enterprise-grade incident management helps you orchestrate the ideal response to create better customer, employee, and business value.

  • Title: Troubleshooting Apache Zookeeper Like a CSI
  • Live at: May 31 2016 4:35 pm
  • Presented by: Evan Gilman, Operations Engineer and Arup Chakrabarti, Director of Infrastructure