In collaboration with Professor Zico Kolter

Practical Data Science

In partnership with Carnegie Mellon University. Used in cohorts 4, 5, and 6.

Duration

~16 weeks

Level

Intermediate

About this course

Data science is the study and practice of how we can extract insight and knowledge from large amounts of data. This course provides a practical introduction to the “full stack” of data science analysis, covering data collection and processing, statistical modeling and machine learning, visualization, and big-data techniques for scaling these methods.

As the name suggests, the emphasis is practical. The course focuses on implementing the techniques, not just understanding them. In place of a midterm and final, learners complete a tutorial on an advanced topic and a group project applying these techniques to a real-world problem.

Instructor

Pat Virtue, Carnegie Mellon University, School of Computer Science.

What you’ll learn

Data collection and management

Ingesting data from unstructured and structured sources, and using relational models, time-series methods, graph and network processing, natural language processing, and geographic information systems to store and manage it.

Statistical modeling

Applying core statistical techniques to understand the properties of data and to design experimental setups for testing hypotheses or collecting new data.

Advanced ML techniques

Kernel methods, boosting, deep learning, anomaly detection, factorization models, and probabilistic modeling.

Data visualization

Visualizing data and the results of analysis, with particular attention to high-dimensional structured data.

Big data

Scaling these methods into regimes where distributed storage and computation become necessary.

Data science debugging

Diagnosing problems across a full data-science pipeline, including data collection, problem setup, ML models, and the conclusions drawn from them.

Lecture topics

The course is organized into four units.

Data collection and management

  • Introduction to data science
  • Data collection and scraping
  • Jupyter Notebook lab
  • Relational data
  • Visualization and data exploration
  • Vectors, matrices, and linear algebra
  • Graph and network processing
  • Free text and natural language processing

Statistical modeling and machine learning

  • Introduction to machine learning
  • Linear classification
  • Nonlinear modeling and cross-validation
  • Basics of probability
  • Maximum likelihood estimation and naive Bayes
  • Hypothesis testing and experimental design

Advanced modeling techniques

  • Unsupervised learning
  • Recommender systems
  • Decision trees and interpretable models
  • Deep learning

Additional topics

  • Big data and MapReduce methods
  • Debugging data science
  • The future of data science

Format

  • Lecture-based course with slides and lecture notes published openly
  • Programming homework emphasizing practical implementation
  • Tutorial on an advanced topic of the learner’s choice
  • Group project applying data science techniques to a real application
  • No midterm or final. Assessment is project-based.

Prerequisites

Comfort with Python, basic linear algebra, and basic probability. The course is designed for students with a technical background but does not assume prior data-science or machine-learning experience.


Course materials are openly available at datasciencecourse.org, including slides, lecture notes, and the full lecture schedule.
