Data Science Stack in the Cloud

Journey from exploration and visualization to machine learning and natural language processing. Discover how Return Path built a cloud-based, production-ready, enterprise-scale data solution without a dedicated DevOps team. Leveraging modern distributed computing frameworks like Spark and managed services like EMR and Qubole was key to the process.
Recorded Sep 19 2017 45 mins
Presented by
Evan Harris, Data Scientist, Return Path

  • How To Increase Value from Machine Learning and Advanced Analytics on Azure Jun 20 2018 4:00 pm UTC 60 mins
    Nate Shea-han, Americas Global Black Belt, Data & AI at Microsoft and Shaun Van Staden, Solutions Architect at Qubole
    Becoming more competitive with big data today means having the right technology to uncover new insights from your data and make critical business decisions in real time. Qubole and Microsoft help companies activate their big data in the cloud to uncover insights that improve customer engagement, increase revenue, and lower costs.

    Join experts from Qubole and Microsoft as they discuss how to activate your big data and how to get the most out of open source technologies on the cloud. In this webinar, you'll learn:

    - How to modernize with data lakes and data warehouses on the cloud
    - Strategies for boosting business value out of Machine Learning and advanced analytics with Qubole on Azure
    - How to reduce costs, control risks, and improve data governance as you build your data pipelines
    - The importance of data security and privacy
    - Real-world examples of successful companies activating their big data

    Webinar Speakers:

    Nate Shea-han
    Americas Global Black Belt, Data & AI at Microsoft

    Nate Shea-han has been with Microsoft for 14 years and has spent the last 8 years focused on helping Microsoft customers transform their business in the cloud on the Azure platform. Currently he has responsibilities across the United States, Canada, and Latin America for Microsoft’s AI, big data, and analytics offerings. Nate has also worked extensively with the Microsoft partner community.

    Shaun Van Staden
    Solutions Architect, Qubole

    Shaun Van Staden has 19 years of experience in enterprise software managing advanced analytics projects, as a developer, DBA, business analyst and now a solutions architect. As a solutions architect manager, Shaun is responsible for supporting business development and sales at Qubole and helping customers transform their use cases for the cloud. Prior to Qubole, Shaun worked as a solutions architect at NICE Systems and Merced Systems (acquired by NICE).
  • Data Platforms 2018: Opening Keynote - Big Data Activation Recorded: Apr 27 2018 38 mins
    Ashish Thusoo, CEO, Qubole
    In this keynote, Ashish Thusoo, CEO of Qubole, discusses the gap that enterprises face today when activating their big data. He makes a case for the shift that organizations need to make towards a big data activation strategy in order to put their data assets to use for differentiating and achieving business objectives. The session also covers key elements of big data activation, supported by usage trends of Qubole's cloud-native big data activation platform. He presents various ways enterprises can measure their own activation readiness and demonstrates why Qubole provides the right approach to big data activation.
  • Data Platforms 2018: Demystifying AI, Machine Learning & Deep Learning Recorded: Apr 27 2018 49 mins
    Sumit Gupta, VP, AI, Machine Learning and HPC, IBM Cognitive Systems
    From chatbots to recommendation engines to Google Voice and Apple Siri, AI has begun to permeate our lives. In this keynote, IBM's Sumit Gupta demystifies what AI is, presents the differences between machine learning and deep learning, explains why there is such huge interest now, shows some fun use cases and demos, and then discusses how deep learning based AI methods can be used to garner insights from data for enterprises. Sumit also talks about what IBM is doing to make deep learning and machine learning more accessible and useful to a broader set of data scientists.
  • Data Platforms 2018: Big Data Activation Panel Session Recorded: Apr 27 2018 33 mins
    Jose Villacis, Senior Director of Product Marketing, Qubole
    Every CEO aspires to create a data-driven culture that can activate hundreds or thousands of users and petabyte-scale data to continuously deliver true business value. This keynote panel explores the journey of four companies (Comcast, Qubole, Fanatics, and MediaMath) that have chronicled their successes and challenges in two books by O’Reilly Media about Creating a Data-Driven Enterprise. The panelists talk not just about their technology strategy and choices but also about how data-driven insights are powering their business and transforming the competitive dynamics of their industry.
  • Presto on Qubole - for Fast, Inexpensive, and Scalable Data Processing Recorded: Apr 27 2018 55 mins
    Shubham Tagra, Software Engineer at Qubole
    In this webinar, we introduce you to the self-managing, self-optimizing implementation of the Apache Presto open source project. Qubole's Presto-as-a-Service is primarily intended for Data Analysts who need to translate business questions into SQL queries. Since the questions are often ad-hoc, there is some trial and error involved and arriving at the final results may involve a series of SQL queries. By reducing the response time of these queries, the Qubole platform can reduce the time to insight and greatly benefit the business. Besides performance efficiencies, users benefit substantially from multiple supported data formats, continuous auto-scaling and efficient management of Presto clusters, improved user experience, and tightened security.

    In this webinar, we cover:
    * Presto-as-a-Service on Qubole
    * How Qubole Presto is different from OSS Presto
    * How Qubole Presto is different from EMR Presto, Athena
    * Demo on:
      - Join Reorder, DF
      - S3 Optimization
      - Auto Scaling
      - Spot Support
    * Customer Use Cases
  • 2018 Qubole Big Data Activation Report Webinar Recorded: Apr 26 2018 56 mins
    Ashish Thusoo, CEO and Co-Founder, Qubole; José Villacís, Senior Director, Product Marketing
    Join our Big Data Activation Report Webinar where our CEO Ashish Thusoo will go in-depth into our 2018 Qubole Big Data Activation Report findings and share how customers are using multiple engines to get the most out of their big data.

    The report analyzes usage data from over 200 Qubole customers to provide answers to key questions such as:

    - How fast is usage of open source big data engines like Apache Spark, Presto and Apache Hive/Hadoop growing?
    - What engines are used most and for what?
    - What engines and big data tools are rising stars?
    - How successful are companies at providing their users access to data?
    - What are the cost saving benefits of doing big data in the cloud?

    You'll come away with both hard data and a few ideas for how to get more out of your big data initiatives.
  • Email Text Classification: Building an End to End Data Product Recorded: Apr 13 2018 38 mins
    Sasha Mushovic, Data Scientist, Return Path
    Sasha will tell the story of building an end-to-end data product that feeds various parts of the Return Path business to optimize email programs for marketers. We will cover discovery, development, and production of an email classification model that uses Apache Spark to fit classifiers such as Random Forests and Support Vector Machines to read email text and classify the content. We will discuss the different methods of hyperparameter tuning and ensembling used, and will describe different stages of production from batch jobs in Qubole Scheduler and Apache Airflow to streaming in Apache Kafka. We will also reflect on what it means to be a full stack data scientist, and how data science teams can be empowered to own their own data products.
  • Democratizing the Data Pipeline Recorded: Apr 12 2018 36 mins
    Zack Shapiro, Lead Data Architect, Nextdoor
    Learn how the data team at Nextdoor.com stopped writing queries all day and developed a platform that empowered the entire company to build their own data pipelines.
  • Big Data + Data Warehouse = Better Together Recorded: Apr 12 2018 46 mins
    James Rowland-Jones, Principal Program Mgr., Microsoft
    Building a coherent platform for advanced analytics and reporting can feel quite overwhelming. A plethora of choices exist out there, and often it can feel like you are being asked to choose a side - it's almost like being asked to pick your favorite child! However, it doesn't have to be this way. Many of the services in a next-generation platform are actually complementary, working together to deliver your next-generation analytical architecture.

    In this session we will walk through the core components of a next-generation analytical platform architecture, discussing key decision points along the way. At the end you will have a clear and concrete understanding of how you can easily stand up an advanced analytical platform in minutes and bring demonstrable value to your users.
  • Supercharging the Performance of Spark Applications Recorded: Apr 12 2018 38 mins
    Venkat Sowrirajan, Software Engineer, Qubole
    Apache Spark applications are difficult to tune for optimal performance, and the use of cloud stores like S3 as a truth-store makes things even more complex. This talk will briefly cover SparkLens (Spark tuning tool), Spark with Rubix (distributed cache), and direct-write for Hive tables and its performance numbers.
  • A Framework for Assessing the Quality of Product Usage Data Recorded: Apr 12 2018 44 mins
    David Oh, Data Engineer for the ADP, Autodesk
    This presentation will discuss the importance of data quality and outline an approach to assess and measure the quality of product usage event logs. A data quality assessment framework helps build trust in our data and enables analysts to generate a deep understanding of product usage patterns, product stability, and utilization of purchased assets by our customers. Unlocking valuable insights from this data depends on the presence of high quality and complete data sets that provide the ability to link product usage events with back office accounts and entitlement data.
  • The 3S Method for Cluster Architecture Design Recorded: Apr 12 2018 36 mins
    Justin Wainwright, Systems Analyst, Oracle Data Cloud, Oracle
    This session highlights the model used within Oracle Data Cloud (ODC) for Apache Hadoop 2 and Apache Spark clusters. We'll talk about taking the guesswork out of cluster design, and about the keys for balancing cost and performance while minimizing administrative overhead.
  • Enterprise Fabric – A Concept/Essential Thread in Your Transformational Journeys Recorded: Apr 12 2018 40 mins
    Dan Sutherland, Distinguished Engineer & CTO, IBM
    This session will provide an overview of the enterprise fabric and the encapsulated view of the required capabilities. Some key components of the fabric include data and cognitive technologies. We will dive into the enterprise fabric-based architecture and why it is the core foundation for business transformation.
  • Presto: Fast SQL on Everything Recorded: Apr 12 2018 41 mins
    David Phillips, Software Engineer, Facebook
    Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. This talk introduces a selection of Facebook use cases, which range from user-facing reporting applications to multi-hour ETL jobs, then explains the architecture, implementation, features, and performance optimizations that enable Presto to support these use cases.
  • Using Qubole as the Data Lake for Programmatic Advertising Recorded: Apr 12 2018 44 mins
    Tom Silverstrim, Sr. Manager, Adobe Media Optimizer, Adobe Ad Cloud
    Qubole has been the data warehouse of the DSP for the last six-plus years, and was selected as the ideal partner for mobilizing the considerable amount of diagnostic and base truth data contained within Amazon S3. From these origins, Qubole now powers our custom reporting infrastructure, machine learning algorithms, and user mapping reports, along with its evolving role in supporting system diagnoses and audits. We will touch on several use cases that demonstrate the flexibility and power of Qubole in democratizing data across the organization.
  • Team Data Science Process (TDSP) and Azure Machine Learning Recorded: Apr 12 2018 40 mins
    Erik Zwiefel, Advanced Analytics & AI Architect, Microsoft
    TDSP is an agile data science process meant to keep data science and business teams working together. In this session, we'll explore the Team Data Science Process and walk through an example using Azure Machine Learning Services.
  • A Lap Around Azure Data Lake Recorded: Apr 12 2018 41 mins
    Francesco Diaz, Regional Solutions Manager Alps, Nordic & Southern Europe, Insight
    Azure Data Lake is one of the most powerful PaaS services that Microsoft Azure offers to manage big data. Built on well-known projects such as HDFS and YARN, it lets you focus on the design of the solution instead of its administration. A new language, U-SQL, combines SQL and C# to work with any type and size of data. During the session, we will explore Azure Data Lake Store and Azure Data Lake Analytics, the core components of the Azure Data Lake offering.
  • The Story of Building a Scalable Data Trust Playbook at Optimizely Recorded: Apr 12 2018 44 mins
    Vignesh Sukumar, Senior Manager, Data Engineering, Optimizely
    At Optimizely, we receive billions of user click stream events for the thousands of A/B experiments we run for our customers every day. Previously, customer inquiries related to the alignment of key experiment metrics between raw data and experiment results required expensive engineering analysis due to lack of scale and flexibility. In this talk, Vignesh will walk the audience through the journey of how we created a playbook to enhance customer trust in these situations and make self-service scalable for the entire organization.
  • Faster SQL on Apache Hadoop on Cloud Platforms - Ad-Hoc & Interactive Analysis Recorded: Apr 12 2018 35 mins
    Rajat Venkatesh, Senior Director of Engineering, Qubole
    Popular SQL on Hadoop engines like SparkSQL on Apache Spark, Hive, and Presto have become much faster on the cloud. This talk will explore the major features, architectural changes, and best practices to supercharge these SQL engines. The talk will also peek into the future for upcoming performance-related features.
  • Kubernetes for Data Engineers Recorded: Apr 12 2018 35 mins
    Rohit Agarwal, Software Engineer, Google
    The talk will give an introduction to Kubernetes in general and then focus on topics relevant to data engineers. In particular, we will talk about how to run stateful workloads on Kubernetes and how to run machine learning workloads that use GPUs on Kubernetes.
Elemental to Big Data
At our core, we are a team of engineers who live, eat, and sleep big data. We believe that ubiquitous access to information is the key to unlocking a company's success. To achieve this, a big data platform must be agile, flexible, scalable, and proactive to anticipate a company's needs.

  • Title: Data Science Stack in the Cloud
  • Live at: Sep 19 2017 8:00 pm
  • Presented by: Evan Harris, Data Scientist, Return Path