In collaboration with Professor Zico Kolter

Practical Data Science

In partnership with Carnegie Mellon University. Used in cohorts 4, 5, and 6.

Duration

~16 weeks

Level

Intermediate

About this course

Data science is the study and practice of how we can extract insight and knowledge from large amounts of data. This course provides a practical introduction to the “full stack” of data science analysis, covering data collection and processing, statistical modeling and machine learning, visualization, and big-data techniques for scaling these methods.

As the name suggests, the emphasis is practical. The course focuses on implementing the techniques, not just understanding them. In place of a midterm and final, learners complete a tutorial on an advanced topic and a group project applying these techniques to a real-world problem.

Instructor

Pat Virtue, Carnegie Mellon University, School of Computer Science.

What you’ll learn

Data collection and management

Ingesting data from unstructured and structured sources, and using relational models, time-series methods, graph and network processing, natural language processing, and geographic information systems to store and manage it.

Statistical modeling

Applying core statistical techniques to understand the properties of data and to design experimental setups for testing hypotheses or collecting new data.

Advanced ML techniques

Kernel methods, boosting, deep learning, anomaly detection, factorization models, and probabilistic modeling.

Data visualization

Visualizing data and the results of analysis, with particular attention to high-dimensional structured data.

Big data

Scaling these methods into regimes where distributed storage and computation become necessary.

Data science debugging

Diagnosing problems across a full data-science pipeline, including data collection, problem setup, ML models, and the conclusions drawn from them.

Lecture topics

The course is organized into four units.

Data collection and management

Introduction to data science
Data collection and scraping
Jupyter Notebook lab
Relational data
Visualization and data exploration
Vectors, matrices, and linear algebra
Graph and network processing
Free text and natural language processing

Statistical modeling and machine learning

Introduction to machine learning
Linear classification
Nonlinear modeling and cross-validation
Basics of probability
Maximum likelihood estimation and naive Bayes
Hypothesis testing and experimental design

Advanced modeling techniques

Unsupervised learning
Recommender systems
Decision trees and interpretable models
Deep learning

Additional topics

Big data and MapReduce methods
Debugging data science
The future of data science

Format

Lecture-based course with slides and lecture notes published openly
Programming homework emphasizing practical implementation
Tutorial on an advanced topic of the learner’s choice
Group project applying data science techniques to a real application
No midterm or final. Assessment is project-based.

Prerequisites

Comfort with Python, basic linear algebra, and basic probability. The course is designed for students with a technical background but does not assume prior data-science or machine-learning experience.

Course materials are openly available at datasciencecourse.org, including slides, lecture notes, and the full lecture schedule.