Machine Learning Systems

New to machine learning? Not sure how ML works in production? Interested in getting involved in advanced ML+Systems research? This class is designed for you!
When we talk about Artificial Intelligence (AI) or Machine Learning (ML), we typically refer to a technique, a model, or an algorithm that gives computer systems the ability to learn from and reason with data. However, there is a lot more to ML than implementing an algorithm or a technique. In this course, we will learn the fundamental differences between AI/ML as a model and AI/ML as a system in production.

Learning Outcomes:

  • Learning to design ML Systems to solve practical problems.
  • Explaining differences between ML as a model and as a system deployed at scale.
  • Describing how an ML system works in production and the challenges it faces there (AIOps).
  • Locating technical debt in building ML systems.
  • Employing design strategies (such as concepts) and best practices to mitigate technical debt.
  • Incorporating ML-based components into a larger system (e.g., Cyber-Physical Systems).
  • Building systems that are more capable, both as software and as predictive systems.
  • Identifying system faults and applying strategies to find their root causes in ML systems.
  • Picking the right framework and compute infrastructure, and navigating the trade-off space.
  • Understanding the performance landscape (energy, inference time) and optimization.
  • Troubleshooting training and ensuring the reproducibility of results.
  • Deploying predictive models in resource-constrained environments (e.g., NVIDIA Xavier).

Topics:

  • Machine Learning Systems: Concepts, Challenges, and Solutions
  • Machine Learning Systems and Software Stack
  • Optimization, Neural Nets, and Learning Theory
  • Backprop and Automatic Differentiation
  • Hardware for Machine Learning Systems: GPU, CPU, TPU, Neuromorphic Computing
  • Machine Learning Accelerators (ML Compilers)
  • Quantized and Low-precision Machine Learning
  • Deployment and Low-latency Inference: Platforms and Model Serving
  • Distributed and Scalable Machine Learning
  • Machine Learning System Testing and Debugging at Scale
  • Setting up Machine Learning Projects and Teams
  • Research Directions: Robust Optimization, Causal AI, AI/ML Systems in Space, Adversarial ML, AI Systems for Social Good, AI Systems for Diversity & Inclusion, Safe AI, Transfer Learning, Deep Reinforcement Learning.

Intended Audience:

  • Computer Scientists, Data Scientists, and AI/ML/Systems/Math/Statistics Researchers often make great progress at building models with cutting-edge techniques, but turning those models into products is challenging. For example, data scientists may work with unversioned notebooks on static data sets and focus on prediction accuracy while ignoring scalability, robustness, update latency, or operational costs.
  • Software Engineers are trained to work from clear specifications and tend to focus on code, but may not be aware of the difficulties of working with data and unreliable models. They have a large toolset for decision making and quality assurance, but it is not obvious how to apply those tools to intelligent systems and their challenges.
  • Scientists, Researchers, Engineers, Lawyers, Medical Experts, Physicians, Psychologists, Business Experts, and many other Professionals have heard about AI/ML and would like to know more about the details and how to start using AI/ML in their profession or research. It is important to learn how to design ML systems in a principled and systematic way that satisfies properties such as safety and fairness. Without these principles, we risk building ML systems that undermine human values in our society.