Data Science Foundations with Python
Data is the residue of every action that takes place in a company, with customers, and in the marketplace. It is created when customers buy products, users interact with services, and colleagues collaborate.
In an increasingly connected world, our ability to capture and use data has grown dramatically. We can track interactions, transactions, and encounters in real time. In the wrong hands, data is useless, if not dangerous; in the right hands, it can drive new insights and well-informed decisions. When combined with advances in artificial intelligence and machine learning, data can be transformational.
This course introduces fundamental techniques and technologies from data science, predictive analytics, and machine learning that can help you get a handle on the modern information flood. Using the Python programming language, you will:
- Learn analytics skills which will enable you to evaluate, query, and visualize data using open source tools: NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, and Apache Spark.
- Leverage strategies to create data-driven questions that can provide scientific or business value
- Use methods for assembling data from multiple sources and preparing powerful machine learning (ML) models
- Be exposed to common machine learning techniques used to solve supervised and unsupervised problems
- Gain hands-on experience with techniques for deploying models as part of larger systems
Target Audience.
- Software engineers who are seeking to understand analytics and extend their skills.
- Data scientists and analysts who wish to work with data in Python.
- Recent college graduates and graduate students with experience in a data discipline seeking to use Python for data exploration, visualization, analysis, or machine learning.
Prerequisites.
Participants should have a working knowledge of Python and be familiar with core statistical concepts (variance, correlation, etc.). Beyond that baseline, the course is suitable for all levels of Python and data science experience.
Objectives.
- Understand how Python fits into the data science ecosystem: how do Python tools such as NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, and Apache Spark empower data analysis and machine learning?
- Learn strategies that can help formulate data-driven questions to provide scientific or business value.
- Learn how to analyze data with Jupyter and Python data tools: gather, filter, transform, explore, and visualize data.
- Gain hands-on experience in the creation of machine learning models and tools to assess their accuracy and performance.
Course Outline
- The one-day version focuses on working with data in Pandas, with an introduction to machine learning techniques.
- The four-day course includes data fundamentals and classical machine learning.
- The five-day course includes data fundamentals, classical machine learning, and a full day of hands-on case studies that further explore the techniques.
- The three-day version omits formal coverage of Pandas (day 1) and excludes the case studies (day 5).
Day 1: Data Fundamentals
Introducing Python
Introduce the Python programming language, its syntax, and core libraries that are used for working with data.
- Python Modules: Toolboxes
- Importing modules
- Listing methods
- Creating modules
- Python Syntax and Structure
- Core programming language structure
- functions
- object oriented programming
- Comprehensions and other syntactic niceties
- Python Data Science Libraries
- NumPy
- NumPy Arrays
- SciPy
- Pandas
- Python Dev Tools, Analytic Environments, and REPLs
- IPython
- Jupyter
- Jupyter Operation Modes
- Anaconda
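As a rough illustration of the topics in this unit (importing a module, listing its methods, a comprehension, and a NumPy array), a minimal sketch might look like the following. The variable names are illustrative and are not taken from the course material.

```python
import math

import numpy as np

# Importing a module and listing its public names with dir()
public_names = [name for name in dir(math) if not name.startswith("_")]

# A list comprehension: squares of the even numbers below 10
even_squares = [n ** 2 for n in range(10) if n % 2 == 0]

# A NumPy array supports vectorized arithmetic without an explicit loop
values = np.array([1.0, 2.0, 3.0, 4.0])
scaled = values * 2.5

print(public_names[:5])
print(even_squares)   # [0, 4, 16, 36, 64]
print(scaled.mean())  # 6.25
```

The same statements can be run line by line in IPython or in a Jupyter notebook cell.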
Practical Data Science
Describe how the use of data is changing and how this has given rise to the “Data Scientist”: “a programmer who knows more statistics than a software engineer and more programming than a statistician.”
- How is data being used innovatively to ask new and interesting questions?
- What is Data Science?
- Data Science, Machine Learning, AI: What is the difference?
- Case Study: Applied Data Science at Google
- Case Study: Predictive Models in Advertising
- Case Study: Recommender Systems in E-Commerce
- Data Analytics Life-cycle
- Discovery
- Harvesting
- Priming
- Exploratory Data Analysis
- Model Planning
- Model Building
- Validation
- Production Roll-out
Data Fundamentals
Aggregating, repairing, normalizing, exploring, and visualizing data.
- Working with data in Python
- Importing data from external sources
- Dealing with missing data
- Dropping columns
- Interpolating missing data in Pandas
- Replacing data
- Scaling/normalizing data
- Exploratory Data Analysis and Visualization: Pandas, Matplotlib, and Plotly
- Transformation, validation, and interpretation
- Getting started with Matplotlib and Seaborn
- Plotting Windows and Figures
- Distributions and variance:
- Show how to represent a distribution in pictures (histogram and related charts) and numbers (summaries)
- Introduce outliers and describe the effect they might have on a distribution
- Variance: measuring the spread of a distribution
- Modeling distributions: normal, log-normal, and Pareto distributions
- Lab: Visualizing and Summarizing Distributions
- Analyzing Relationships
- Show how Pandas can be used to assess relationships amongst variables
- Visualizing relationships: scatterplots and beyond
- Measuring relationships: correlation and covariance
- Testing relationships: is it meaningful?
- Classical hypothesis testing: means, correlation, and proportions
- Demonstration: Analyzing Relationships
- Lab: More Relationship Analysis
- Data Grouping and Aggregation in Python
- Data aggregation and grouping
- pandas.core.groupby.SeriesGroupBy
- Grouping multiple columns
- Pivot Tables
- Cross-Tabulation
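Several of the data-preparation and aggregation topics above can be sketched in a few lines of Pandas. The example below is only an illustration under stated assumptions: the toy DataFrame and its column names (`region`, `channel`, `sales`) are invented for this sketch, and min-max scaling is just one of several possible normalization choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west", "west"],
        "channel": ["web", "store", "web", "web", "store"],
        "sales": [10.0, np.nan, 30.0, 40.0, 50.0],
    }
)

# Dealing with missing data: linear interpolation fills the gap from its neighbors
df["sales"] = df["sales"].interpolate()

# Scaling: min-max normalization to the range [0, 1]
sales_min, sales_max = df["sales"].min(), df["sales"].max()
df["sales_scaled"] = (df["sales"] - sales_min) / (sales_max - sales_min)

# Grouping and aggregation: mean sales per region
mean_by_region = df.groupby("region")["sales"].mean()

# Pivot table: total sales by region and channel
pivot = df.pivot_table(index="region", columns="channel", values="sales", aggfunc="sum")

# Cross-tabulation: number of rows per region/channel pair
counts = pd.crosstab(df["region"], df["channel"])

print(mean_by_region)
print(pivot)
print(counts)
```

Here `df.groupby("region")["sales"]` is a `SeriesGroupBy` object, the type named in the outline above; calling `.mean()` on it performs the aggregation.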
Day 2: Machine Learning Fundamentals
What is Machine Learning
- The Machines are Coming: Machine Learning and Artificial Intelligence
- What are machine learning and artificial intelligence?
- What are some ML techniques and how can they be used to solve business problems?
- Supervised versus unsupervised learning: what are the differences?
- Terminology and definitions
- Features and observations
- Labels
- Continuous and categorical features
- Practical Machine Learning
- Data preparation
- Model training
- Model validation and assessment
- scikit-learn: Estimators, Models, and Predictors
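The prepare, train, and validate workflow listed above maps onto scikit-learn's `fit`/`transform`/`predict` conventions. The following is a minimal sketch on synthetic data, not the course's own lab code; the dataset sizes, the choice of `LogisticRegression`, and the 25% hold-out split are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (X) and labels (y): 200 observations, 5 continuous features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Data preparation: hold out a test set, then scale using training statistics only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model training: estimators learn from data through fit()
model = LogisticRegression().fit(X_train_scaled, y_train)

# Model validation: predictors expose predict(); score() reports mean accuracy
predictions = model.predict(X_test_scaled)
accuracy = model.score(X_test_scaled, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training split alone keeps information from the test set out of the preparation step.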
Machine Learning Algorithms
Introduce common machine learning algorithms and explore their use.
- Classification and Regression
- How do you build machine learning models to “make guesses” and “put things into buckets”?
- Classification
- Regression
- Clustering and Principal Components Analysis
- Time Series
Classification
Building, tuning, and assessing classification models
- Classification Overview
- What is classification?
- When is it used?
- Creating Classification Models
- Logistic Regression
- Decision Tree Classifier
- K-Nearest Neighbors
- Gaussian Naive Bayes
- Support Vector Machines
- Random Forest
- Assessing Classification Models
- ROC Visualizations
- Confusion Matrices
- Precision Recall Curves
- Imbalanced Distributions
- Optimizing Classification Models
- Tuning Hyperparameters
- Answering the “Do I have enough data?” question
- Explaining Classification Model Results
- General Model Interpretation
- Linear Models
- Tree Models
- Model Interpretation using SHAP
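Tuning and assessing a classifier, as outlined above, might be sketched as follows. This is an illustrative example on synthetic data; the random forest, the small parameter grid, and the ROC AUC scoring choice are assumptions, not the course's prescribed settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Tuning hyperparameters: evaluate each candidate with 3-fold cross-validation
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="roc_auc"
)
search.fit(X_train, y_train)

# Assessing the tuned model on held-out data
best_model = search.best_estimator_
cm = confusion_matrix(y_test, best_model.predict(X_test))
auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])

print(search.best_params_)
print(cm)  # rows are true classes, columns are predicted classes
print(f"ROC AUC: {auc:.3f}")
```

For tree models such as this one, `best_model.feature_importances_` gives a first, coarse view of which features drive the predictions.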
Day 3: Regression and Forecasting
Regression
Building, tuning, and assessing regression models.
- Regression Overview
- What is regression?
- When is it used?
- Creating Regression Models
- Linear and Polynomial Regression
- Regression Trees and Random Forests
- Extreme Gradient Boosting
- Assessing Regression Models
- R² (coefficient of determination), MAE (mean absolute error), and MSE (mean squared error)
- Residuals Plots
- Prediction Error Plots
- Optimizing Regression Parameters
- Explaining Regression Model Results
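To make the comparison between linear and polynomial regression concrete, here is a minimal sketch on synthetic data. The quadratic data-generating curve, the noise level, and the degree-2 polynomial are assumptions chosen for illustration, and the metrics are computed in-sample for brevity rather than on a held-out set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data drawn from a quadratic curve plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + x.ravel() + rng.normal(scale=0.3, size=100)

# A straight-line fit and a degree-2 polynomial fit
linear = LinearRegression().fit(x, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

# Assessing both models with R², MAE, and MSE
for name, model in [("linear", linear), ("polynomial", poly)]:
    pred = model.predict(x)
    print(
        f"{name}: R2={r2_score(y, pred):.3f} "
        f"MAE={mean_absolute_error(y, pred):.3f} "
        f"MSE={mean_squared_error(y, pred):.3f}"
    )

# Residuals for a residuals plot: observed minus predicted
residuals = y - poly.predict(x)
```

Plotting `residuals` against `poly.predict(x)` gives the kind of residuals plot listed above; visible curvature in that plot usually signals a missing nonlinear term.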
Day 4: Unsupervised Learning
Dimensionality Reduction
- Dimensionality Reduction Primer
- What is dimensionality reduction?
- What problems does it solve?
- When should it be used?
- Principal Components Analysis (PCA)
- What is PCA?
- How and why does it work?
- What are the results and what do they mean?
- Working with PCA in scikit-learn
- Using PCA to Visualize and Understand Data
- How many components are optimal?
- Interpreting components
- Biplots
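One common way to approach the "how many components" question is to look at the cumulative explained variance. The sketch below uses the iris dataset bundled with scikit-learn; the 95% threshold is an assumed rule of thumb, not a recommendation from the course.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to feature scale, so standardize first
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Fit all components and inspect how much variance each one explains
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 95% of the variance
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Project onto two components for a 2-D scatterplot
projected = PCA(n_components=2).fit_transform(X_scaled)

# Each row of components_ holds one component's weights on the original features
loadings = pca.components_

print(cumulative)
print(n_components)
```

The rows of `loadings` are what a biplot draws as arrows over the projected points, which helps when interpreting each component.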
Clustering
- Clustering: Letting the Computer Tell Us About Differences
- K-Means
- Hierarchical Clustering
- Assessing Cluster Results
- Elbow Diagrams
- Visualizing Cluster Impacts on Data
- Silhouette Plots
- Data Exploration of Clusters
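The numbers behind an elbow diagram and a silhouette comparison can be computed as in the sketch below. The synthetic blobs and the range of candidate cluster counts (2 to 6) are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three underlying groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

inertias = {}     # within-cluster sum of squares, plotted in an elbow diagram
silhouettes = {}  # mean silhouette coefficient, between -1 and 1

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# One heuristic: pick the k with the highest mean silhouette
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("best k by silhouette:", best_k)
```

Inertia tends to keep falling as k grows, which is why the elbow diagram looks for the point where the decrease flattens rather than for a minimum.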
Day 5: Case Studies
Case Study: Machine Learning and Natural Language Processing
Show how machine learning techniques can be applied alongside feature engineering to solve complex problems.
- Introduce Natural Language Processing (NLP) and the core constructs that can be used to work with human language.
- Explore computational models of human language that can be used for classification and clustering.
- Show how keyword extraction using NLP and data normalization can be used to locate patients who have a specific condition or disease.
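A very small sketch of keyword matching with text normalization is shown below. It is not the case study's actual method: the notes, patient identifiers, and keyword list are invented purely for illustration, and the normalization (lowercasing plus alphanumeric tokenization) is an assumed, deliberately simple choice.

```python
import re

# Hypothetical free-text notes and keyword list, invented for this sketch
notes = {
    "patient_a": "Pt reports Type-2 DIABETES, managed with metformin.",
    "patient_b": "No history of diabetes. Seasonal allergies only.",
    "patient_c": "Follow-up for hypertension; diet discussed.",
}
keywords = {"diabetes", "metformin"}


def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Keywords found in each note after normalization
matches = {
    pid: sorted(keywords & set(normalize(text))) for pid, text in notes.items()
}

# Notes containing at least one keyword
flagged = [pid for pid, hits in matches.items() if hits]

print(matches)
print(flagged)
```

Note that `patient_b` is flagged even though the note negates the condition. Plain keyword matching cannot tell "diabetes" from "no history of diabetes", which is one reason to combine it with further feature engineering and machine learning.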
Deep Learning
Introduce neural networks and their basic function.
- What is a deep neural network? How does it differ from other types of machine learning techniques?
- What are the mathematical techniques behind neural networks? How do they work?
- How do we teach networks to “learn”?
- What are some of the applications for these types of tools in healthcare, finance, and advertising?
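To give a sense of how a network "learns", the following sketch trains a tiny one-hidden-layer network on the XOR problem using gradient descent and backpropagation in plain NumPy. It is a toy illustration only: the layer size, learning rate, activation functions, and number of steps are arbitrary assumptions, and practical deep learning work would normally use a dedicated framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: a problem a single linear model cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 4 units (the size is an arbitrary choice)
W1 = rng.normal(size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(inputs):
    """Return hidden activations and output probabilities."""
    hidden = np.tanh(inputs @ W1 + b1)
    return hidden, sigmoid(hidden @ W2 + b2)


def cross_entropy(p, targets):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))


initial_loss = cross_entropy(forward(X)[1], y)

learning_rate = 0.5
for _ in range(2000):
    hidden, p = forward(X)
    # Backpropagation: apply the chain rule from the output back to the inputs
    d_out = (p - y) / len(X)                       # loss gradient at the output pre-activation
    d_hidden = (d_out @ W2.T) * (1 - hidden ** 2)  # tanh derivative is 1 - tanh^2
    # Gradient descent: move each parameter against its gradient
    W2 -= learning_rate * hidden.T @ d_out
    b2 -= learning_rate * d_out.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ d_hidden
    b1 -= learning_rate * d_hidden.sum(axis=0, keepdims=True)

final_loss = cross_entropy(forward(X)[1], y)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Learning here simply means repeatedly nudging the weights in the direction that reduces the loss; deeper networks repeat the same chain-rule step across more layers.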