Hadoop Ingestion Made Easy

Ingesting petabytes of data at scale in a native Hadoop environment raises several problems that the platform must handle. Known issues include handling failures, reading the data in parallel and accounting for updates that arrive while the data is being ingested.

This presentation takes a deep dive into ingesting unbounded file data into Hadoop using the Apache Apex platform.
Recorded May 12 2016 54 mins
Presented by
Dr. Sandeep Deshmukh, Committer Apache Apex, DataTorrent Engineer

  • Top 3 RFP Criteria for Streaming Big Data Recorded: Nov 30 2016 30 mins
    Teddy Rusli, Senior Product Manager at DataTorrent
    Enterprises need a reliable streaming analytics engine that can graduate from a lab project to a production application.

    Learn the top 3 RFP criteria to apply when you evaluate a streaming engine for your enterprise.

    Teddy Rusli, Senior Product Manager at DataTorrent, has vast experience in different aspects and roles in bringing analytics to enterprises.
  • Data in Motion: It All Starts With Ingestion Part 2 Recorded: Aug 25 2016 11 mins
    Gordon Hung, Senior Account Executive at DataTorrent
    Ingesting data into Hadoop is a frustrating, time-consuming activity. Further, the growth of data has created immense challenges that traditional legacy systems cannot meet. You have to ingest not only structured data but unstructured data as well, at scale. This ingestion also needs to run 24x7 without ever going down or losing data.

    Having a simplified big data application that collects, aggregates and moves volumes of data to and from Hadoop is necessary for an efficient data processing pipeline.
  • Data in Motion: It All Starts With Ingestion Recorded: Jul 27 2016 16 mins
    Gordon Hung, Account Executive at DataTorrent
    Ingesting and extracting data from Hadoop can be a frustrating, time-consuming activity for many enterprises. DataTorrent Data Ingestion is a standalone big data application that simplifies the collection, aggregation and movement of large amounts of data to and from Hadoop for a more efficient data processing pipeline. DataTorrent Data Ingestion makes configuring and running Hadoop data ingestion and data extraction a point-and-click process, enabling a smooth, easy path to your Hadoop-based big data project.
  • Harnessing Value from Data in Motion in Real-Time Recorded: Jul 20 2016 53 mins
    Mike Gualtieri, Principal Analyst at Forrester. Larry Neumann, SVP of Marketing at Solace Systems.
    Today, most enterprises perform analytics on data at rest, resulting in slow, outdated insights and untimely decisions. However, in today’s hyper-connected digital world, where speed and real-time decision making really matter, enterprises need the ability to capture and act on moving data streams, aka data in motion, in real time.

    Join guest speakers Mike Gualtieri, Principal Analyst at Forrester, and Larry Neumann, SVP of Marketing at Solace Systems, to learn how enterprises prepare for and use real-time streaming analytics platforms to capture, analyze, and act on data in motion at the very moment that data is created.

    Key agenda items:

    - Trends in big data and fast data

    - Why enterprises need to have a data in motion strategy

    - A primer on real-time streaming analytics technology – how streaming analytics is different

    - How your infrastructure requirements change for data in motion vs. data at rest

    - Architecture considerations for selecting a real-time streaming analytics platform for your data and application needs

    - Real customer use cases of finding insights from data in motion and building next gen apps with DataTorrent's real-time streaming analytics platform
  • 360° Real-Time Business Insights with Native Hadoop Big Data Platform Recorded: Jun 16 2016 36 mins
    Teddy Rusli, Senior Product Manager; Ian Gomez, Audience Marketing Manager at DataTorrent
    To achieve excellence in customer service, you need a thorough understanding of customer behaviors and usage patterns. Real-time streaming technology not only captures customer data from various sources as it is being created but also delivers faster time to insight and action for an improved customer experience. In this webinar, we will demonstrate how DataTorrent’s real-time native Hadoop stream processing platform enables telco providers to conduct a detailed real-time analysis of Call Data Records (CDR) to obtain deeper visibility into customer usage patterns and customer service intelligence. Telco providers can then leverage those real-time insights to enhance their customer centricity programs, improve customer satisfaction and reduce customer churn.

    You will also learn how DataTorrent’s real-time analytics platform can help telco providers to:

    • Quickly ingest large amounts of Call Data Records
    • Perform forensics on dropped calls by zip code for a given region
    • Reduce customer wait times for service calls
    • Maximize average revenue per user (ARPU)
  • Architectural Comparison of Apache Apex and Spark Streaming Recorded: Jun 8 2016 63 mins
    Thomas Weise, Co-Founder & Architect, PMC Member, Apache Apex.
    Apache Apex is a native Hadoop data-in-motion platform. In this presentation, we will discuss the architectural differences between Apache Apex and Spark Streaming. We will discuss how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.

    We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
  • Smart Partitioning with Apache Apex Recorded: May 19 2016 64 mins
    Pramod Immaneni, Architect; Thomas Weise, Architect & Co-founder at DataTorrent
    Processing big data often requires running the same computations in parallel in multiple processes or threads, called partitions, with each partition handling a subset of the data. This becomes all the more necessary when processing live data streams, where maintaining the SLA is paramount. Furthermore, an application is made up of multiple different computations, each of which may have different partitioning needs. Partitioning also needs to adapt to changing data rates, input sources and other application requirements like SLA.

    In this talk, we will introduce how Apache Apex, a distributed stream processing platform on Hadoop, handles partitioning. We will look at the different partitioning schemes provided by Apex, some of which are unique in this space. We will also look, with examples, at how Apex does dynamic partitioning, a feature unique to and pioneered by Apex, to handle varying data needs. Finally, we will talk about the utilities and libraries that Apex provides for users to implement their own custom partitioning.
  • Productization of Big Data Streaming Analytics Recorded: Apr 18 2016 51 mins
    Mike Gualtieri, Principal Analyst, Forrester Research. Amol Kekre, CTO & Co-Founder, DataTorrent.
    Most Hadoop projects fail. There is a need for a platform that focuses on operational success and time to market. Big Data streaming analytics is critical, and enterprises must succeed in operationalizing it.

    In this webinar, you will learn about Big Data streaming analytics and where the industry is heading.

    1. Scalability & Performance
    2. Analytical Operators & Connectors
    3. Feed Data Lakes
    4. Operable & Enterprise Ready
    5. Time to Market
  • IOT Ingestion & Analytics Using Apache Apex - A Native Hadoop Platform Recorded: Apr 6 2016 62 mins
    Pramod Immaneni, PPMC Member & Architect at DataTorrent; Ian Gomez, Audience Marketing Manager at DataTorrent
    Internet of Things (IoT) devices are becoming more ubiquitous in consumer, business and industrial landscapes. They are being widely used in applications ranging from home automation to the industrial internet. They pose a unique challenge in terms of the volume of data they produce, the velocity with which they produce it and the variety of sources that need to be handled. The challenge is to ingest and process this data at the speed at which it is being produced, in a real-time and fault-tolerant fashion. Apache Apex is an industrial-grade, scalable and fault-tolerant big data processing platform that runs natively on Hadoop. In this webinar, you will see how Apex is being used in IoT applications and how enterprise features such as dimensional analytics, real-time dashboards and monitoring play a key role.
  • Fault Tolerance and Processing Semantics with Apache Apex Recorded: Mar 24 2016 46 mins
    Thomas Weise, Architect & Co-founder; Pramod Immaneni, Architect
    Apache Apex (http://apex.incubator.apache.org/) is an open source stream processing and next generation analytics platform incubating at the Apache Software Foundation. Apex is Hadoop native and was built from the ground up for scalability, low-latency processing, high availability and operability.

    In this webinar, you will learn about Apache Apex fault tolerance, high availability and processing guarantees.

    From the user's perspective, fault tolerance of a stream processing platform should cover the state of the application/processor and the in-flight data. In the event of failure, the platform should recover, restore state and resume processing with no loss of data. We will cover:

    * Components of an Apex application and how they are made fault tolerant
    * How native YARN support is leveraged for fault tolerance
    * How operator checkpointing works and how the user can tune it
    * Failure scenarios, recovery from failures, incremental recovery
    * Processing guarantees and which option is appropriate for your application
    * Sample topology for highly available, low latency real-time processing
    * How fault tolerance in Apex differs from similar platforms such as Storm, Spark Streaming and Flink

    DataTorrent team
  • Introducing Apache Apex (incubating) Recorded: Feb 25 2016 60 mins
    Amol Kekre, CTO, DataTorrent, Thomas Weise, Architect, DataTorrent
    Apache Hadoop has become the de facto big data platform. It holds tremendous promise for using big data to transform business operations. Hadoop was developed as a solution to the need for efficient and scalable search indexing. The first version had the MapReduce programming model. Mastering MapReduce required a steep learning curve, and migrating applications to MapReduce needed a complete rewrite. This, along with the requirement of moving compute closer to data, made MapReduce an impediment that did little to bolster productization of big data. There are faster in-memory substitutes for MapReduce, but they too carry the same baggage. In hindsight, Hadoop should have modeled itself as a distributed operating system and enabled various programming models to run. Hadoop 2.0 (YARN) was the answer. Not only does YARN allow organizations to perform advanced analytics on data at unprecedented volume, but it has also broadened the use cases for big data across industry segments.
    What is now needed is a bleeding-edge YARN-based platform capable of fully realizing Hadoop’s potential.
    These new-age big data platforms must deliver the real business value of big data. This means easy deployment, ease of integration with existing IT infrastructure, ease of migrating and developing applications on Hadoop, and faster time to insight for the business.
    The key requirements for these new platforms are:

    • Simplicity and production readiness
    • Code Reuse
    • Operability
    • Ease of integration
    • Leveraging the power of Hadoop

    In this webinar we will discuss Apache Apex (incubating), the next generation native Hadoop big data platform. We will go into details on how Apex meets these requirements.
  • Powering IoT Applications With Real-time Streaming Technology Recorded: Jan 28 2016 52 mins
    Nick Durkin, Director, Solutions Engineering, DataTorrent; Jie Wu, Director, Product Marketing, DataTorrent
    IoT means data, lots of it. Capturing and analyzing this data in real time can lead to immediate business benefits. In this DataTorrent webinar, we will share a real-world use case on how a leading utility company leverages a well-designed real-time streaming platform to accelerate multiple IoT applications and achieve real business benefits. Join us to learn how a sophisticated streaming platform helped the company accomplish:

    • Automatic, fast ingestion of large volumes of smart grid data
    • Real-time data enrichment
    • Network load forecasting for more effective demand-side management
    • Real-time measurement and validation of demand response events
Harness Data in Motion
DataTorrent, powered by Apache Apex, is the industry’s only open source enterprise-grade unified stream and batch platform.
