Apache Spark for Data Analytics

Apache Spark is the computational engine that powers big data. In this course, you will learn how to use Apache Spark to work with data, gain insight using machine learning, and analyze streams at scale.

The ability to store, aggregate, and analyze large amounts of data has transformed nearly every industry. Whether in finance, medicine, entertainment, government, or technology, the dream is the same: use enormous amounts of data to understand problems, predict outcomes, and take effective action. While many advances make the dream of “big data” possible, one of the most important components of the technical stack is the engine that provides distributed computing.

In many organizations, Apache Spark is the computational engine that powers big data. A general-purpose unified analytics engine built to transform, aggregate, and analyze large amounts of information, Spark has become the de facto brain behind large-scale data processing, machine learning, and graph analysis. It has seen rapid adoption by companies such as Netflix, Google, and eBay, which use it to process petabytes of data on clusters of thousands of nodes.

In this course, we will explore how Apache Spark can be used for data processing. We will cover the fundamentals of Spark, including its architecture and internals, the core APIs and data structures, and how Spark can be used for machine learning and for analyzing streaming datasets. Throughout the course, you will:

  • Understand when and where to use Spark.
  • Leverage strategies to create data-driven questions that can provide scientific or business value.
  • Learn how to use Apache Spark to load, summarize, query, and visualize structured and semi-structured data.
  • Apply common machine learning techniques to solve supervised and unsupervised problems inside of Spark.
  • Learn how to analyze streaming data using Spark Structured Streaming.
  • Gain hands-on experience with techniques for deploying Spark as part of a larger software system.

Target Audience

  • Software engineers who are seeking to understand Big Data analytics and extend their skills.
  • Data scientists and analysts who need to work with data at moderate to large scale.
  • Data and database professionals looking to add analytics skills in a big data environment.
  • Recent college graduates and graduate students with experience in a data discipline looking to move into the world of Data Science and big data.

Prerequisites

  • Participants should have a working knowledge of Python or Scala and should be familiar with core statistical concepts.

Objectives

  • Understand what differentiates “big data” from “small data” and know when to use tools such as Apache Spark.
  • Introduce the fundamental components and data structures of Apache Spark and describe how they are used.
  • Demonstrate the differences between resilient distributed datasets (RDDs), DataFrames, and Datasets (a brief sketch follows this list).
  • Show how Spark can be used to ingest data from different types of sources (files, databases, and other storage technologies) and how that data can be transformed, combined, aggregated, and analyzed using Spark SQL.
  • Introduce the Spark machine learning library, SparkML, and show how supervised and unsupervised machine learning techniques can be used.
  • Help students gain familiarity with common machine learning algorithms such as Decision Trees, Random Forests, Gradient Boosted Trees, Linear Regression, Collaborative Filtering, and K-Means.
  • Show how tools such as Natural Language Processing (NLP) can be used to perform classification or predictions using unstructured data.
  • Discuss streaming and demonstrate how Spark Structured Streaming allows for the analysis of datasets that never end.
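
As a preview of the RDD-versus-DataFrame distinction above, here is a minimal PySpark sketch that computes the same word count with both APIs. (The typed Dataset API is available only in Scala and Java, so it is not shown here; the input strings are illustrative.)

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

    # RDD API: low-level functional transformations over raw Python objects.
    lines = spark.sparkContext.parallelize(["spark is fast", "spark is general"])
    counts = (lines.flatMap(lambda s: s.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())

    # DataFrame API: declarative and columnar, optimized by Spark's query planner.
    df = spark.createDataFrame([("spark is fast",), ("spark is general",)], ["text"])
    (df.select(F.explode(F.split("text", " ")).alias("word"))
       .groupBy("word").count()
       .show())

The RDD version spells out each step; the DataFrame version describes the result and lets Spark plan the execution, which is why the DataFrame API is usually preferred for structured data.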

Course Outline

Day 1: Data at Scale

Practical Data Science with Spark

Describe how the use of data is changing, along with the emergence of the “Data Scientist” (or “a programmer who knows more statistics than a software engineer and more programming than a statistician”).

  • How is data being used innovatively to ask new and interesting questions?
  • What is Data Science?
  • Data Science, Machine Learning, and AI: What is the difference?
  • Case Study: Applied Data Science at Google
  • Case Study: Predictive Models in Advertising
  • Case Study: Recommender Systems in E-Commerce
  • Data Analytics Life-cycle
    • Discovery
    • Harvesting
    • Priming
    • Exploratory Data Analysis
    • Model Planning
    • Model Building
    • Validation
    • Production Roll-out

Python for Data Analysis

Quickly introduce the Python programming language, its syntax, and the core libraries used to work with data from inside of Spark.

  • Python Modules: Toolboxes
    • Importing modules
    • Listing modules
  • Python Syntax and Structure
    • Core programming language structure
    • functions
    • Comprehensions and syntactic sugar
  • Python Data Science Libraries
    • NumPy
    • NumPy Arrays
    • Pandas
  • Python Dev Tools and Analytic Environments
    • Jupyter

Spark SQL: Structured Data Fundamentals

Aggregating, repairing, normalizing, exploring, and visualizing data with Spark.

  • Introduction to Spark: A General Engine for Large Scale Data Processing
    • What is Spark?
    • How is it used in practice?
  • Spark: Explore and Visualize
    • Configuring an environment for data analysis
    • Importing data from external sources
    • Inspecting data schema and structure
    • Transforming data types, renaming columns, and managing values
    • Calculating descriptive statistics and relationships
    • Coding categorical data
    • Representing distributions graphically (histograms and related charts)
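
A minimal PySpark sketch of this explore-and-visualize workflow, assuming a hypothetical flights.csv whose columns (carrier, dest, dep_delay, distance) are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("explore").getOrCreate()

    # Import data from an external source, inferring the schema.
    df = spark.read.csv("flights.csv", header=True, inferSchema=True)

    # Inspect schema and structure.
    df.printSchema()
    df.show(5)

    # Transform types, rename columns, and manage missing values.
    df = (df.withColumn("dep_delay", F.col("dep_delay").cast("double"))
            .withColumnRenamed("dest", "destination")
            .na.drop(subset=["dep_delay"]))

    # Descriptive statistics and relationships.
    df.describe("dep_delay").show()
    print(df.stat.corr("dep_delay", "distance"))

    # Summarize a categorical column; counts like these feed bar charts and histograms.
    df.groupBy("carrier").count().orderBy(F.desc("count")).show()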

Day 2: Spark for Machine Learning and Streaming

SparkML: Machine Learning in Spark

  • “The Machines are Coming”: Machine Learning and Artificial Intelligence
    • What is machine learning, and what makes it different from artificial intelligence?
    • What are some ML techniques and how can they be used to solve business problems?
  • Supervised versus unsupervised learning: what are the differences?
    • Terminology and definitions
    • Features and observations
    • Labels
    • Continuous and categorical features
  • Machine Learning Algorithms and How They Work in Spark
    • Classification and Regression: How do you build machine learning models to “make guesses” and “put things in buckets”?
    • Classification
    • Regression
  • Case Study: Predicting Flight Delays Using SparkML
  • Clustering and Principal Components Analysis
  • Time Series
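
To make the supervised-learning workflow concrete, here is a minimal SparkML pipeline sketch in the spirit of the flight-delay case study. The input DataFrame df and its columns (carrier, distance, dep_hour, and a binary label delayed) are assumptions for illustration:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Assume df has columns: carrier (string), distance, dep_hour,
    # and a binary label "delayed" (1 if the flight left late).
    indexer = StringIndexer(inputCol="carrier", outputCol="carrier_ix")
    assembler = VectorAssembler(inputCols=["carrier_ix", "distance", "dep_hour"],
                                outputCol="features")
    tree = DecisionTreeClassifier(labelCol="delayed", featuresCol="features")

    pipeline = Pipeline(stages=[indexer, assembler, tree])
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)

    # Evaluate the fitted pipeline on held-out data.
    auc = BinaryClassificationEvaluator(labelCol="delayed").evaluate(model.transform(test))
    print(f"AUC = {auc:.3f}")

Swapping DecisionTreeClassifier for RandomForestClassifier or GBTClassifier exercises the other tree-based algorithms listed in the objectives without changing the rest of the pipeline.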

Case Study: Natural Language Processing

Show how machine learning techniques can be applied alongside feature engineering to solve complex problems.

  • Introduce natural language processing and the core constructs that can be used to work with human language.
  • Explore computational models of human language for classification and clustering.
  • Show how keyword extraction using NLP and data normalization can be used to locate patients with a specific condition or disease.
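
Here is a minimal sketch of the kind of text feature engineering this case study builds on, using SparkML's standard text transformers. The notes DataFrame, its text column, and its binary label are assumptions for illustration:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
    from pyspark.ml.classification import LogisticRegression

    # Assume "notes" has a free-text column "text" and a binary "label"
    # marking patients with the condition of interest.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=4096)
    idf = IDF(inputCol="tf", outputCol="features")
    lr = LogisticRegression(labelCol="label", featuresCol="features")

    pipeline = Pipeline(stages=[tokenizer, remover, tf, idf, lr])
    model = pipeline.fit(notes)  # raw text -> TF-IDF features -> classifier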

Spark Streaming

What is so great about “streaming data” and how does Spark facilitate its analysis?

  • Case Study: Analysis of Popular Uber Locations
  • Apache Kafka: A Streams Platform
  • Spark Structured Streaming: Working with data that never ends
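
As a closing sketch, here is a minimal Structured Streaming job in the spirit of these topics: reading ride events from a Kafka topic and keeping a running count per location. The topic name, broker address, and JSON fields are assumptions, and the Kafka source requires the spark-sql-kafka connector on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("stream").getOrCreate()

    schema = (StructType()
              .add("location", StringType())
              .add("lat", DoubleType())
              .add("lon", DoubleType()))

    # Read an unbounded stream of events from a Kafka topic.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "rides")
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # A running aggregation over a dataset that never ends.
    counts = events.groupBy("location").count()

    query = (counts.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()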