Hadoop & Spark

  • Analyse, Visualize, Share Social Network Interactions w Apache Spark & Zeppelin
    Eric Charles, Founder at Datalayer Recorded: Sep 11 2018 49 mins
    Apache Spark for Big Data Analysis, combined with Apache Zeppelin for Visualization, is a powerful tandem that eases the day-to-day job of Data Scientists.

    In this webinar, you will learn how to:

    + Collect streaming data from the Twitter API and store it in an efficient way
    + Analyse and Display the user interactions with graph-based algorithms wi.
    + Share and collaborate on the same note with peers and business stakeholders to get their buy-in.
  • Building a Fast, Scalable & Accurate NLP Pipeline on Apache Spark
    David Talby, CTO, Pacific AI Recorded: Sep 4 2018 62 mins
    Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

    This talk introduces the NLP library for Apache Spark. It natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP & ML pipelines that leverage all of Spark's built-in optimizations.

    The library implements core NLP algorithms including lemmatization, part of speech tagging, dependency parsing, named entity recognition, spell checking and sentiment detection. The talk will demonstrate using these algorithms to build commonly used pipelines, using PySpark on notebooks that will be made publicly available after the talk.

    David Talby has over a decade of experience building real-world machine learning, data mining, and NLP systems. He’s a member of the core team that built and open sourced the Spark NLP library.
  • How to Share State Across Multiple Apache Spark Jobs using Apache Ignite
    Akmal Chaudhri, Technology Evangelist, GridGain Systems Recorded: Aug 28 2018 42 mins
    Attend this session to learn how to easily share state in-memory across multiple Spark jobs, either within the same application or between different Spark applications using an implementation of the Spark RDD abstraction provided in Apache Ignite. During the talk, attendees will learn in detail how IgniteRDD – an implementation of native Spark RDD and DataFrame APIs – shares the state of the RDD across other Spark jobs, applications and workers. Examples will show how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or Data Frames.

    Akmal Chaudhri has over 25 years of experience in IT and has previously held roles as a developer, consultant, product strategist and technical trainer. He has worked for several blue-chip companies such as Reuters and IBM, and also the Big Data startups Hortonworks (Hadoop) and DataStax (Cassandra NoSQL Database). He holds a BSc (1st Class Hons.) in Computing and Information Systems, an MSc in Business Systems Analysis and Design and a PhD in Computer Science. He is a Member of the British Computer Society (MBCS) and a Chartered IT Professional (CITP).
  • Implementing a Sparse Logistic Regression Algorithm in Apache Spark
    Lorand Dali, Data Scientist, Zalando Recorded: Aug 21 2018 39 mins
    This talk tells the story of the implementation and optimization of a sparse logistic regression algorithm in Spark. I would like to share the lessons I learned and the steps I had to take to improve the speed of execution and convergence of my initial naive implementation. The message isn’t to convince the audience that logistic regression is great and my implementation is awesome; rather, it will give details about how it works under the hood, and general tips for implementing an iterative parallel machine learning algorithm in Spark.

    The talk is structured as a sequence of “lessons learned” that are shown in the form of code examples building on the initial naive implementation. The performance impact of each “lesson” on execution time and speed of convergence is measured on benchmark datasets.

    You will see how to formulate logistic regression in a parallel setting, how to avoid data shuffles, when to use a custom partitioner, how to use the ‘aggregate’ and ‘treeAggregate’ functions, how momentum can accelerate the convergence of gradient descent, and much more. I will assume basic understanding of machine learning and some prior knowledge of Spark. The code examples are written in Scala, and the code will be made available for each step in the walkthrough.

    Lorand is a data scientist working on risk management and fraud prevention for the payment processing system of Zalando, the leading fashion platform in Europe. Previously, Lorand developed highly scalable low-latency machine learning algorithms for real-time bidding in online advertising.
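
The sparse logistic regression abstract above mentions ‘aggregate’, ‘treeAggregate’ and momentum. The sketch below is not the speaker's code (which the abstract says is written in Scala); it is a plain-Python stand-in, with no Spark dependency, that may help illustrate the general pattern. Lists of examples play the role of partitions, and the names `seq_op`, `comb_op`, `tree_combine` and `train`, along with the learning rate and momentum values, are hypothetical choices made for illustration. It only mimics the shape of a per-partition fold followed by a pairwise merge; it does not reproduce how Spark schedules that work.

```python
import math
from functools import reduce


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def seq_op(acc, example, weights):
    # Fold one example into a partition-local (sparse gradient, count) pair.
    # An example is (features, label), with features as {index: value}.
    grad, count = acc
    features, label = example
    margin = sum(weights.get(i, 0.0) * v for i, v in features.items())
    error = sigmoid(margin) - label
    for i, v in features.items():
        grad[i] = grad.get(i, 0.0) + error * v
    return grad, count + 1


def comb_op(a, b):
    # Merge two partial (gradient, count) pairs; mutates and returns `a`.
    grad_a, n_a = a
    grad_b, n_b = b
    for i, g in grad_b.items():
        grad_a[i] = grad_a.get(i, 0.0) + g
    return grad_a, n_a + n_b


def tree_combine(partials):
    # Merge partial results pairwise in rounds instead of all at one place,
    # which is the idea behind a tree-shaped aggregation.
    while len(partials) > 1:
        merged = []
        for k in range(0, len(partials), 2):
            if k + 1 < len(partials):
                merged.append(comb_op(partials[k], partials[k + 1]))
            else:
                merged.append(partials[k])
        partials = merged
    return partials[0]


def train(partitions, steps=200, lr=0.5, momentum=0.9):
    # Full-batch gradient descent with a classical momentum term.
    # Assumes at least one example across all partitions.
    weights, velocity = {}, {}
    for _ in range(steps):
        partials = [
            reduce(lambda acc, ex: seq_op(acc, ex, weights), part, ({}, 0))
            for part in partitions
        ]
        grad, n = tree_combine(partials)
        for i, g in grad.items():
            velocity[i] = momentum * velocity.get(i, 0.0) - lr * g / n
            weights[i] = weights.get(i, 0.0) + velocity[i]
    return weights


# Toy data: feature 0 appears only with label 1, feature 1 only with label 0.
partitions = [
    [({0: 1.0}, 1), ({1: 1.0}, 0)],
    [({0: 1.0, 2: 1.0}, 1)],
    [({1: 1.0, 2: 1.0}, 0)],
]
weights = train(partitions)
```

On this toy data the learned weight for feature 0 ends up positive and the weight for feature 1 negative. Keeping the gradient as a dictionary keyed by feature index means each example only touches the coordinates it actually contains, which is the point of a sparse formulation.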
