
Working with Domino and Apache Spark

Being a data scientist requires using the right tool for the right job. In many organizations, that means harnessing an array of technologies on projects that span data engineering, ML modeling, and interactive visualization. Domino's open platform lets you use a variety of tools in one place while taking advantage of Domino's collaboration and reproducibility features.

Join us for this month's Continuing Education Webinar, where we review how, when, and why you should connect your Spark cluster to Domino. Whether you run CDH, HDP, EMR, or another Spark offering, we review best practices and walk through a use case for a project using Domino and Spark together.
Recorded: May 21 2019 5:00 pm 22 mins
Presented by
Guru Medasani, Domino Data Lab

  • Monitoring Models at Scale with Domino Dec 11 2019 6:00 pm UTC 49 mins
    Samit Thange, Senior Product Manager, Domino Data Lab
    As more models go to production, managing and monitoring them becomes more onerous. Without proactive monitoring of production models, organizations are exposed to the risk of poor predictions on evolving data affecting their business outcomes.

    Domino enables you to standardize model monitoring across all your data science teams. Key stakeholders get continuous visibility, and teams can initiate proactive actions before your business is negatively impacted by model decay. It frees your data scientists to work on building newer models, and lets your IT/Ops teams take charge of monitoring in-production models.

    Join this webinar to learn how Domino can help you monitor your models with:

    -Drift Checks: Detecting changes in the patterns of real-world data your models see in production.
    -Quality Checks: Tracking how model accuracy and other quality metrics are changing over time.
    -Health Alerts: Getting alerted when health checks fail so that resolution workflows can be triggered.
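    The drift checks described above boil down to comparing the distribution a model sees in production against the distribution it was trained on. As a minimal stdlib-only illustration (this is not Domino's implementation; the bin count and the conventional 0.2 alert threshold are assumptions), here is a population stability index (PSI) check:

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live sample.
    Values above ~0.2 are conventionally treated as significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(sample, i):
        left, right = edges[i], edges[i + 1]
        if i == bins - 1:  # last bin is closed on the right
            n = sum(1 for x in sample if left <= x <= right)
        else:
            n = sum(1 for x in sample if left <= x < right)
        return max(n / len(sample), 1e-6)  # floor to avoid log(0)

    return sum(
        (frac(actual, i) - frac(expected, i)) * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

random.seed(0)
baseline = [random.gauss(0.0, 1.0) for _ in range(2000)]  # training-time data
stable   = [random.gauss(0.0, 1.0) for _ in range(2000)]  # same distribution
shifted  = [random.gauss(1.0, 1.2) for _ in range(2000)]  # drifted production data

print(f"stable  PSI: {psi(baseline, stable):.3f}")   # small: no alert
print(f"shifted PSI: {psi(baseline, shifted):.3f}")  # large: flags drift
```

    In practice the baseline would be persisted at training time and the check scheduled against each new window of production data.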
  • Machine Learning Vital Signs: Metrics and Monitoring Models in Production Oct 23 2019 5:00 pm UTC 50 mins
    Samit Thange, Domino & Donald Miner, Miner & Kasch
    Many organizations have hundreds of models running in production, interacting with the real world, yet are not keeping track of how those models perform on live data. Bias and variance can creep into models over time, and we should know when that happens. The world changes, often slowly, and most models perform worse as time goes on. Ensuring everything is working well is a huge undertaking, and unfortunately, many organizations simply ignore the problem. Donald Miner, drawing upon his prior experience as a data scientist, engineer, and CTO, details how to track machine learning models in production to ensure model reliability, consistency, and performance into the future.

    In this webinar Miner covers:

    -why you should invest time in monitoring your machine learning models.
    -real-world anecdotes about some of the dangers of not paying attention to how a model’s performance can change over time.
    -a list of “vitals”: the metrics you should gather for each model, what they tell you, and how to measure them.
    -vitals that include classification label distribution over time, distribution of regression results, measurement of bias, measurement of variance, change in output from previous models, and changes in accuracy over time.
    -implementation strategies to keep watch on model drift over time.
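    One of the vitals listed above, classification label distribution over time, can be tracked by measuring the distance between each new window's label histogram and a deployment-time baseline. This is a minimal stdlib-only sketch; the total variation distance and the 0.1 alert threshold are illustrative choices, not anything prescribed in the talk:

```python
from collections import Counter

def label_distribution(labels):
    """Normalize a list of predicted labels into a {label: fraction} map."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Baseline: the label mix observed at deployment time.
baseline = label_distribution(["approve"] * 80 + ["deny"] * 20)

# Weekly windows of production predictions.
week1 = label_distribution(["approve"] * 78 + ["deny"] * 22)  # close to baseline
week2 = label_distribution(["approve"] * 55 + ["deny"] * 45)  # label mix has shifted

THRESHOLD = 0.1  # illustrative alert threshold
for name, window in [("week1", week1), ("week2", week2)]:
    dist = total_variation(baseline, window)
    status = "ALERT" if dist > THRESHOLD else "ok"
    print(f"{name}: distance={dist:.3f} {status}")
```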
  • Turbo-Charging Data Science with AutoML Recorded: Oct 3 2019 63 mins
    Josh Poduska, Chief Data Scientist, Domino Data Lab
    Although there are an increasing number of commercial AutoML products, the open-source ecosystem has been innovating here as well. In the early days of the AutoML movement, the focus was on those looking to leverage the power of ML models without a background in data science: citizen data scientists. Today, however, AutoML tools have a lot to offer experts too.

    In this webinar, we will dive into popular open source AutoML tools such as auto-sklearn, TPOT, MLBox, and AutoKeras. We will also walk through hands-on examples of how to install and use these tools, and highlight special features of each while providing Jupyter notebooks so you can start using these technologies in your work right away. Those who wish to follow along interactively during the webinar and download the notebooks and slides can do so by signing into Domino’s trial version at https://dominodatalab.com/try. Instructions on accessing the materials in Domino trial will be given during the webinar.
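    The tools named above differ in their search strategies, but all automate the same core loop: propose a model configuration, score it, keep the best. Here is a stdlib-only sketch of that loop, using random search over a toy nearest-neighbor classifier; none of this is the actual auto-sklearn or TPOT API, and the search space is an assumption for illustration:

```python
import random

def make_data(n=200, seed=1):
    """Two noisy 1-D clusters, labeled 0 and 1."""
    rng = random.Random(seed)
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n // 2)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n // 2)]
    rng.shuffle(data)
    return data

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (k odd, so no ties)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, test, k):
    hits = sum(1 for x, y in test if knn_predict(train, x, k) == y)
    return hits / len(test)

data = make_data()
train, test = data[:150], data[150:]

# The AutoML-style loop: sample configurations, score them, keep the best.
rng = random.Random(0)
best_k, best_acc = None, -1.0
for _ in range(10):
    k = rng.choice([1, 3, 5, 7, 9, 15, 25])
    acc = accuracy(train, test, k)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"best k={best_k}, held-out accuracy={best_acc:.2f}")
```

    Real AutoML tools score candidates on a validation split rather than the final test set, and search over whole pipelines (preprocessing, model family, hyperparameters) rather than a single parameter.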
  • Best Practices for Getting Data Science Web Apps in Production Recorded: Aug 29 2019 43 mins
    Josh Poduska, Chief Data Scientist, Domino Data Lab
    Data science is a team sport. Sharing your work is critical for getting early feedback from stakeholders and, ultimately, for having your work impact the business. There’s no better way to share your work than with a web app. Creating interactive visualizations lets data scientists build engaging tools for end consumers or prototype something before handing it off to an engineer.

    In this webinar we’ll review the pros and cons of the top data science web app frameworks (Shiny, Dash, Flask, etc.), discuss best practices for building and debugging web apps, and provide a step-by-step walk through on how to quickly deploy web apps in a secure environment while maintaining a system of record for your work.

    This will be a hands-on, technical webinar, aimed at helping data scientists do their job better. We look forward to having you join us.
  • A Data Science Playbook for Explainable ML/AI Recorded: Aug 14 2019 78 mins
    Domino Chief Data Scientist Josh Poduska, and VP of Marketing Jon Rooney
    Navigating Predictive and Interpretable Models

    Model ethics, interpretability, and trust will be seminal issues in data science in the coming decade. This technical webinar discusses traditional and modern approaches for interpreting black box models. Additionally, we will review cutting edge research coming out of UCSF, CMU, and industry. This new research reveals holes in traditional approaches like SHAP and LIME when applied to some deep net architectures and introduces a new approach to xML/xAI where interpretability is a hyperparameter in the model building phase rather than a post-modeling exercise. We will provide step-by-step guides that practitioners can use in their work to navigate this interesting space.

    We will review code examples of interpretability techniques. You can follow along with the presentation by running your own notebook hosted in Domino's trial environment. Create a free trial account at: http://dominodatalab.com/try?utm_source=brighttalk
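    SHAP and LIME require their own libraries, but the model-agnostic spirit of such post-hoc techniques can be illustrated with permutation importance: shuffle one feature and measure how much the model's error grows. The toy black-box model and synthetic data below are assumptions for illustration only:

```python
import random

random.seed(42)

# Toy "black box": depends strongly on feature 0, weakly on feature 1, not at all on feature 2.
def black_box(row):
    return 3.0 * row[0] + 0.5 * row[1]

# Synthetic dataset with a known target.
X = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(500)]
y = [black_box(row) for row in X]

def mse(model, X, y):
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(X)

def permutation_importance(model, X, y, feature):
    """Error increase when one feature column is shuffled, breaking its link to the target."""
    base = mse(model, X, y)
    col = [row[feature] for row in X]
    random.shuffle(col)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
    return mse(model, X_perm, y) - base

scores = [permutation_importance(black_box, X, y, f) for f in range(3)]
for f, s in enumerate(scores):
    print(f"feature {f}: importance {s:.3f}")
```

    Shuffling feature 0 hurts the most, feature 1 only slightly, and feature 2 not at all, matching how the black box actually uses its inputs.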
  • Data-driven to Model-driven Recorded: Jul 10 2019 71 mins
    Featuring Forrester Senior Analyst Kjell Carlsson, Ph.D and Domino Chief Data Scientist Josh Poduska
    Over the last two decades, business and IT leaders chased the “big data” dream. They were sold on a data-driven future, complete with competitive advantages automagically gained by collecting and storing “big data”. In the end, these enterprises need more than analyst dashboards; they need to make data science an organizational capability and become model-driven organizations.

    You can have all the data you want, do all the machine learning you want, but if you aren’t running your business on models, you’ll soon be left behind. In this webinar, we will demystify the model-driven business:

    -Urgency: why companies running on models are automating decisions and winning
    -Experts: how deep expertise is required to make this change
    -Purpose Built: which platforms incorporate data science as a core capability

    Finally, we will highlight the major Domino platform investments that enable a model-driven business:

    -Open, Flexible Infrastructure: agile scalable compute with accelerated experiments
    -Collaborative R&D: an experimental, iterative, and exploratory environment for teams
    -Governance of Data Science: explore new ways to review, quality control, and monitor
  • Providing unprecedented transparency into the health of data science teams Recorded: Jun 26 2019 14 mins
    Kelly Xu, Product Marketing Manager and Georgi Matev, Head of Product Management
    While working with Fortune 500 companies who have sophisticated data science organizations, we realized that existing tools don't meet collaboration, visibility, and governance needs.

    Domino 3.5 answers the needs of data science leaders. It allows data science managers to define their own data science project life cycle, and to easily track and manage projects with a holistic understanding of the latest developments. It also surfaces projects that need immediate attention by showing, in real time, which projects are blocked. Watch this short webinar to learn how Domino 3.5 equips data science leaders and IT admins with:

    -Project Portfolio Dashboard
    -Project Stage Configuration
    -License Usage Reporting
  • How Climate Grew Its Data Science Capabilities 10x in 2 Years Recorded: Jun 25 2019 34 mins
    Mir Yasir Ali - Sr. Staff Software Engineer, Climate Corporation / Kris Skrinak - Machine Learning Segment Lead, AWS
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Case Studies - Climate Corporation, AWS

    The Climate Corporation provides a platform for farmers around the world to use best-in-class analytic capabilities to digitize their operations and optimize their profits. Come learn about Climate’s journey from a scrappy startup to a mature company and hear how we grew our data science capabilities 10x, from supporting 20 data scientists to supporting 200 data scientists, over 2 years.

    In its early startup days, Climate had data scientists working on customized servers with a complex set of unstandardized libraries. Time and effort were lost to maintenance and overhead, with researchers spending 50% of their time maintaining and customizing research environments. Additionally, sharing work between groups was a significant challenge, and versioning of models and data was done manually, with a high risk of error. We identified a need to standardize our environments to minimize time spent configuring research environments and to simplify collaboration across data scientists on an ongoing basis.

    To meet this need, we developed a process and infrastructure whereby hardware and software are tailored to a researcher’s needs based on their domain, and we built out automation to enable this process by default. By automating the configuration of Yarn, Spark, Docker, AWS and Domino we drove standardized infrastructure for research and discovery within Climate. This enabled the data science team to deliver models to production faster and at less than half the previous cost, enabling farmers around the world to increase crop yields and grow more food for all of us!
  • Data Science as the Enabler of Digital Transformation Recorded: Jun 25 2019 32 mins
    Bill Groves - Chief Data Officer, Walmart
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Case Studies - Walmart

    Digital transformation is not primarily about digital technology itself; it is about the fact that digital technology lets people and companies solve their traditional problems in ways they prefer to the old solutions. It is about integrating digital technology into all areas of a business, fundamentally changing how you operate and deliver value to customers. Learn how Walmart is managing digital transformation as a cultural change, one that requires organizations of every size, from behemoth to startup, to continually challenge the status quo, experiment, and get comfortable with failure. The evolution of the data scientist role will be explored at length, along with the skill sets the future will require.
  • Auto-Adjudicating Veterinary Claims at Trupanion Recorded: Jun 25 2019 33 mins
    David Jaw - Lead Data Scientist, Trupanion
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Case Studies - Trupanion

    Trupanion is one of the fastest growing players in the pet medical insurance space. At Trupanion we aim to revolutionize the way pet owners are able to approach the costs of veterinary care by completely restructuring the claim reimbursement model. Leveraging NLP and machine learning we are able to auto-adjudicate claims in a matter of seconds compared to the industry standard of one week. Eliminating the laborious claims process not only allows our customers and partnering veterinary clinics to focus on providing the best care for their pets but also allows Trupanion to operate at scale as our business grows. In this talk, we will discuss how Domino has empowered the data science and engineering teams at Trupanion to create a multi-layer framework for auto-adjudicating veterinary claims in real time.
  • Data Science at The New York Times Recorded: Jun 25 2019 32 mins
    Chris Wiggins - Chief Data Scientist, The New York Times
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Case Studies - The New York Times

    The Data Science group at The New York Times develops and deploys machine learning solutions to newsroom and business problems. Re-framing real-world questions as machine learning tasks requires not only adapting and extending models and algorithms to new or special cases but also sufficient breadth to know the right method for the right challenge. I’ll first outline how unsupervised, supervised, and reinforcement learning methods are increasingly used in human applications for description, prediction, and prescription, respectively. I’ll then focus on the ‘prescriptive’ cases, showing how methods from the reinforcement learning and causal inference literature can be of direct impact in engineering, business, and decision-making more generally.
  • Best Practices for Advancing Human Progress with Models Recorded: Jun 25 2019 32 mins
    Marck Vaisman - Technology Solutions, Data and AI, Microsoft
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Leadership - Microsoft

    With his background spanning work in government, the commercial sector, and academia alike, Marck brings a unique and comprehensive perspective to this talk, where he’ll delve into the common themes and lessons learned across all kinds of organizations aiming to improve outcomes through data science.

    Topics that will be addressed include:

    - The importance and key considerations of building a data science community composed of diverse backgrounds and skill sets
    - How to recognize and communicate the impact of models and data science, and how to articulate that value to business stakeholders
    - How to tackle hard problems in human ways by upskilling domain experts
    - Cautions and mitigation strategies around giving non-technical users access to data science technologies
    - Analogies between the federal and commercial sectors, and how they can learn from each other
  • Catalyzing Your Transformation, Thanks to Early Model Risk Governance Recorded: Jun 25 2019 25 mins
    Sebastien Conort - Chief Data Scientist, BNP Paribas Cardif
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Leadership - BNP Paribas Cardif

    Establishing internal model risk governance is often not prioritized in the early stages of the transformation journey towards a model-driven company. Indeed, designing and validating such governance can be seen as a complex and heavy task that cannibalizes resources. In addition, some stakeholders in the transformation see governance as an innovation killer. In this talk, Sebastien will explain the benefits he experienced after releasing internal analytics model risk governance at BNP Paribas Cardif, how it boosted the pace of the transformation, and why he thinks it is an important deliverable to prioritize as early as possible on the transformation path.
  • Product Management for AI Recorded: Jun 25 2019 33 mins
    Peter Skomoroch - Head of Data Products, Workday
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Leadership - Workday

    Companies that understand how to apply AI will scale and win their respective markets over the next decade. That said, delivering on this promise and managing machine learning projects is much harder than most people anticipate. Many organizations hire teams of PhDs and data scientists, then fail to ship products that move business metrics. The root cause is often a lack of product strategy for AI, or the failure to adapt their product development processes to the needs of machine learning systems. This talk will cover some of the common ways machine learning fails in practice, the tactical responsibilities of AI product managers, and how to approach product strategy for AI.

    Peter Skomoroch, former Head of Data Products at Workday and LinkedIn, will describe how you can navigate these challenges to ship metric-moving AI products that matter to your business.

    Peter will provide practical advice on:

    - The role of an AI Product Manager
    - How to evaluate and prioritize your AI projects
    - The ways AI product management differs from traditional product management
    - Bridging the worlds of design and machine learning
    - Making trade-offs between data quality and other business metrics
  • Applied Machine Learning in Finance Recorded: Jun 25 2019 32 mins
    Chakri Cherukuri - Sr. Quantitative Researcher, Bloomberg
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Practitioner - Bloomberg

    Quantitative finance is a rich field in which advanced mathematical and statistical techniques are employed by both sell-side and buy-side institutions. Techniques like time-series analysis, stochastic calculus, multivariate statistics, and numerical optimization are often used by “quants” for modeling asset prices, constructing and optimizing portfolios, and building automated trading strategies. My talk will focus on how machine learning and deep learning techniques are being used in this field.

    In the first part of the talk, we will look at use cases involving both structured and unstructured data sets in finance, where machine learning techniques can be applied. Then we will pick a few case studies and examine in detail how machine learning models can be applied for predictive analytics.

    We’ll look at interactive plots running in Jupyter notebooks. The main focus of the talk will be on reproducible research and model interpretability.
  • Applying Exponential Family Embeddings in Natural Language Processing to Analyze Recorded: Jun 25 2019 34 mins
    Maryam Jahanshahi - Research Scientist, TapRecruit
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Practitioner - TapRecruit

    Many data scientists are familiar with word embedding models such as word2vec, which capture semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models either need significant amounts of data or require tuning through transfer learning to handle the domain-specific vocabulary that is unique to most commercial applications.

    In this talk, Maryam will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. She will demonstrate how they can be used to conduct advanced topic modeling on medium-sized datasets that are specialized enough to require significant modification of a word2vec model and that contain more general data types (including categorical, count, and continuous). Maryam will discuss how we implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last 3 years. Maryam will specifically focus the description of results on how tech and data science skill sets have developed, grown, and pollinated other types of jobs over time.

    Key takeaways: (1) Lessons learnt from implementing different word embedding methods (from pretrained to custom); (2) How to map trends from a combination of natural language and structured data; (3) How data science skills have varied across industries, functions, and over time.
  • Regularization of RNN Through Bayesian Networks Recorded: Jun 25 2019 30 mins
    Vishal Hawa - Principal Data Scientist, Vanguard
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Practitioner - Vanguard

    Data scientists are often confronted with a wide array of data patterns and signatures. While they analyze the data in the light of a particular problem, they often face challenges when data signatures do not match the patterns they are expecting. At that point, data scientists must improvise, customizing their techniques to the problem at hand.

    While deep learning has shown significant promise for model performance, it can quickly become untenable, particularly when the data size falls short of the problem space. One such situation regularly appears when modeling with RNNs: RNNs can quickly memorize and over-fit (a problem further aggravated when the data size is small to medium). The presentation exposes this shortcoming of RNNs and shows how a combination of RNNs and Bayesian networks (PGMs) can not only overcome it but also improve the sequence-modeling behavior of RNNs. We will learn this in the context of marketing channel attribution modeling.
  • Multi-task Deep Learning for Image Tagging Recorded: Jun 25 2019 32 mins
    Wayne Thompson - Chief Data Scientist, SAS
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - Practitioner - SAS

    A fundamental characteristic of human learning is that we learn multiple pieces of information simultaneously. We can describe an image verbally because we are natural multi-task agents. A comparable concept in machine learning is called multi-task learning (MTL) and it has become increasingly useful in practice. A common MTL use case is image tagging. For example, a retailer can use MTL to identify visual attributes for clothing items. Multiple attributes are learned simultaneously such as the type of clothing, texture, color, pattern, gender, and fit type. The tagged results can be used for customer profile analysis to make purchase recommendations. With a set of personal photos, it is possible to infer the fashion style of the shopper by analyzing the attributes of clothes and then recommend other clothing items for purchase. Tagging can also be used for retrieval systems like image search, or as part of feature engineering.

    In this presentation, we build a multi-task deep learning model using DLPy to tag fashion clothing items. Convolutional neural networks show extraordinary performance for image classification and object recognition applications. DLPy is a high-level and easy-to-use Python API for SAS Deep Learning models. We explain how DLPy can be applied to data preparation, data processing, multi-task model building, assessment and deployment for image tagging.
  • Data Science, Past & Future Recorded: Jun 25 2019 49 mins
    Paco Nathan - Managing Partner, Derwen, Inc.
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - General Session - Derwen, Inc.

    This talk explores big themes — the challenges in industry, the advances in technology — which brought us to this point. Origins of our field followed a simple formula at nearly every step along the way. We can use that formula as a lens to understand changes that are emerging.

    Looking through six decades since Tukey first described “data analytics”, challenges and advances have often upset the status quo. Hardware capabilities evolved in dramatic leaps. Software layers provided new kinds of control systems and conceptual abstractions in response. These manifested as surges in data rates and compute resources. Then industry teams applied increasingly advanced mathematics to solve for novel business cases. That’s the formula.

    For example, as spinning disks gave way to SSDs and commodity CPUs became multicore, Hadoop use cases gave way to Spark, which fit the hardware better. Cluster computing workloads that had been ETL or clickstream analysis gave way to more complex math used for recommender systems, anti-fraud, anti-churn, and other advanced predictive analytics.
    We’re at a point now where more of the predictive analytics are moving to Python or R in-memory processing (Arrow), while more advanced workloads such as deep learning take over the clusters. On the horizon, even more complex use cases such as knowledge graph work will be consuming what new hardware provides.

    Let’s examine this “lens” into how our field evolves. Also keep in mind that growing security threats and increasingly complex regulatory requirements drive from the top, placing even more premium on novel business cases. We’ll look through the trends in history, leading up to now, consider examples from this conference, then look at what’s on the horizon.
  • Panel Discussion: Data Science & the Future of Investing Recorded: Jun 25 2019 45 mins
    Thomas Laffont & Alex Izydorczyk, Coatue Management / Matthew Granade, Point72
    Recorded at Rev 2 | May 23-24, 2019 | New York

    Day 2 - General Session - Coatue Management, Point72

    Though active asset managers have used quantitative techniques for decades, data science has presented them with new challenges and opportunities. In this session, three leaders representing different hedge fund perspectives will join a conversation to answer questions and discuss topics including:

    - How are hedge funds leveraging data science to drive innovation and new strategies?
    - How are investment strategies in public and private markets changing with the evolution of data science and modeling practices?
    - What lessons and concepts from finance firms’ quant research experiences have been valuable to incorporate into the new data science regime? Conversely, what must hedge funds “unlearn” from their past ways of operating?
    - What are some concrete tactics and techniques for integrating data science into your fund? Which of those can be applied to businesses outside of the finance sector?
    - How can data scientists be integrated with traditional analysts and portfolio managers?
Learn to drive business impact with your predictive models at scale
Today, the best-run companies run their business on models, and those that don’t face an existential threat. Welcome to "The Model Driven Business" - a channel where we will share use cases and best practices for organizations striving to make data science an organizational capability that drives business impact.
