Designing a Machine Learning System

Chip Huyen is the author of the book “Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications”. She has published several interesting pieces online, and I intend to comment on some of them here.
The material I decided to comment on today is about how to design a machine learning system. The original is in English; I will quote and comment on some passages I found interesting, but I recommend reading the full material.
The material focuses on best practices for designing a machine learning system, especially for those preparing for interviews in the field. Even if you are not interviewing, I recommend reading it: these are practices every data scientist should know.
I don’t just want to write a summary of the original material, so I will focus on the more general part and comment on some quotes that I found interesting.

The image above shows a generic flow for creating a machine learning system. It is clear that it is an iterative process involving four main components: project setup, data pipeline, modeling/training, and model serving. Each stage can influence and update the previous stages. For example, during training, you may realize that the data used is incorrect or insufficient, requiring you to go back to the previous stage.
Quotes
“In school, you work with available, clean datasets and can spend most of your time on building and training machine learning models. In industry, you probably spend most of your time collecting, annotating, and cleaning data.”
There are always differences between working with AI in real life and in academia. Both are important, but the time spent on data and problem definition really is a differentiator. Other points that make a big difference are thinking about fallbacks, scalability, and edge cases.
“Modeling, including model selection, training, and debugging, is what’s often covered in most machine learning courses. However, it’s only a small component of the entire process. Some might even argue that it’s the easiest component.”
This is the part that data scientists usually like the most, but it really is the smallest part, and often the “least important” one, in the entire flow of solving a problem with ML.
“When searching for a solution, your goal isn’t to show off your knowledge of the latest buzzwords but to use the simplest solution that can do the job. Simplicity serves two purposes. First, gradually adding more complex components makes it easier to debug step by step. Second, the simplest model serves as a baseline to which you can compare your more complex models.”
Wow, I even feel like framing this. The vast majority of data scientists get excited about the latest trendy technologies and forget to start simple.
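To make the point concrete, here is a minimal sketch of what “start simple” can look like in practice. The synthetic dataset and the model choices (a majority-class dummy and a plain logistic regression) are my own illustration, not something prescribed by the original material:

```python
# Minimal sketch: a trivial baseline as the reference point for anything more complex.
# Synthetic data and model choices are illustrative only.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# First "real" model: plain logistic regression, nothing fancy.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("model accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
```

Any fancier model now has a clear number to beat; if it cannot beat the baseline, the extra complexity is not paying for itself.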
“…weaker models with well-tuned hyperparameters can outperform stronger, more recent models.”
Here we have another point in favor of simplicity and of focusing on “basic” models. By taking good care of the input data and tuning the hyperparameters well, your SVM can beat a deep learning model with hundreds of layers.
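As an illustration (assuming a scikit-learn workflow, which is my choice here, not the original author’s), tuning a plain SVM with scaling and a small grid search might look like this; the dataset and grid values are placeholders:

```python
# Sketch: squeezing more out of a "basic" SVM with scaling + a small grid search.
# Dataset and grid values are placeholders, not recommendations.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.001],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test score (f1):", search.score(X_test, y_test))
```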
“And there’s the question of interpretability. If your model predicts that someone shouldn’t get a loan, that person deserves to know the reason why. You need to consider the performance/interpretability tradeoffs. Making a model more complex might increase its performance but make the results harder to interpret.”
Interpretability is a subject often forgotten in AI workflows. It is sometimes a hard requirement, as in the example above, but it can also help you improve the model.
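One simple, model-agnostic way to get at this is permutation importance, which measures how much the score drops when a feature is shuffled. The sketch below uses scikit-learn on synthetic data; the loan-style feature names are hypothetical, just to echo the example in the quote:

```python
# Sketch: model-agnostic feature importance via permutation importance.
# The loan-style feature names and the synthetic data are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = ["income", "debt_ratio", "credit_history_len", "num_late_payments"]
X, y = make_classification(n_samples=3000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the test score drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.3f}")
```

For individual decisions (the “why was my loan denied?” case), per-prediction explanation tools such as SHAP or LIME are the usual next step.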
“You also need to think about the potential biases and misuses of your model. Does it propagate any gender and racial biases from the data, and if so, how will you fix it?”
We have “ultra-futuristic” models, generative AI and much more, but some of the biggest challenges in AI are still privacy and model bias. I talked a bit about this in my text about the documentary “XXXXX”.
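A very basic first check (my own sketch, not from the original material) is to slice your evaluation metrics by a sensitive attribute and look for large gaps; the data and column names below are hypothetical:

```python
# Sketch: slicing evaluation metrics by a sensitive attribute to spot bias.
# The dataframe and its columns ("group", "y_true", "y_pred") are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1,   0,   1,   0,   1,   0,   0,   1],
    "y_pred": [1,   0,   0,   0,   0,   0,   1,   1],
})

df["correct"] = df["y_true"] == df["y_pred"]
report = df.groupby("group").agg(
    accuracy=("correct", "mean"),
    positive_rate=("y_pred", "mean"),
)
print(report)  # large gaps between groups deserve investigation
```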
Summary of the stages

Project Setup:
— Define objectives and goals.
— Understand user experience and use cases.
— Determine evaluation metrics for training and inference.
— Address project constraints (e.g., time, computing power).
Data Pipeline:
— Collect and annotate data.
— Preprocess and represent data for models.
— Handle data storage and privacy concerns.
— Address biases.
Modeling:
— Select appropriate models based on the type of problem (e.g., supervised vs unsupervised).
— Train models and solve common training problems (e.g., overfitting, underfitting).
— Define baselines (random, human, simple heuristic).
Serving:
— Deploy models and collect user feedback (see the sketch after this list).
— Update models based on feedback and new data.
— Ensure model interpretability and address biases.
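To illustrate the Serving stage, here is a minimal sketch of an endpoint that serves predictions and logs each request so that user feedback can later be joined back for retraining. FastAPI, the model path, and the file-based log are my own assumptions, not something prescribed by the original material:

```python
# Sketch: a minimal prediction endpoint that also logs requests for later feedback.
# FastAPI, the model path, and the feature schema are assumptions for illustration.
import json
from datetime import datetime, timezone

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path to a trained model


class LoanRequest(BaseModel):  # hypothetical feature schema
    income: float
    debt_ratio: float


@app.post("/predict")
def predict(req: LoanRequest):
    prediction = int(model.predict([[req.income, req.debt_ratio]])[0])
    # Log every input/prediction pair so user feedback can be joined to it later.
    with open("predictions.log", "a") as log:
        log.write(json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "input": {"income": req.income, "debt_ratio": req.debt_ratio},
            "prediction": prediction,
        }) + "\n")
    return {"prediction": prediction}
```

Assuming the file is saved as serving.py, it could be run with `uvicorn serving:app`; in a real system the log would go to a proper store, but the idea of capturing inputs and predictions for the feedback loop is the same.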
Conclusion
I recommend this material to all data scientists. The concepts and practices are simple, but many people forget them.