Apr–Nov 2020

Advanced debugging features for a major ML tool

Laptop showing debugger

A core feature of a major machine learning (ML) tool that makes it easy for data scientists to train models faster by automatically profiling and monitoring system utilization, and sending alerts with helpful suggestions when resource bottlenecks occur.

Timeline

Apr–Nov 2020

Client

Fortune 100 Co.

Role

UX Designer

What was the

Problem

Today's monitoring tools aren't enough.

In machine learning, training a model is a complex task, and the results can be hard to understand and trace back to the data. Visualizing system metrics in real time is not enough, because any number of factors can cause problems such as bottlenecks and low resource usage. Data Scientists and Machine Learning Engineers (MLEs) lack advanced tools that offer real-time suggestions tied to the specific issues, such as resource bottlenecks, that occur during training.

The Bonus Problem: COVID

COVID-19 changed the way we communicate and collaborate. It challenged our assumptions about productive ways of working and learning. The team overcame these challenges to deliver a critical and compelling feature in the ML Tool. The feature was presented by C-level stakeholders at the client's annual conference in 2020.

What was the

Solution

A comprehensive dashboard with interactive monitoring visualizations and contextual, real-time suggestions.

I created a comprehensive dashboard that helps data scientists train models faster. It automatically monitors system resource usage and alerts them when anomalies occur. The Deep Profiling tool is intended to take the debugging experience to the next level. It highlights the most important information related to training jobs. During and after training, it displays the performance of system resources. The tool aids users by offering suggestions to improve the efficiency of training.

What was my

Process

Overcoming unfavorable circumstances.

The complex nature of the project resulted in a scrappy process with a high volume of iteration.

The team consisted of many groups, such as design, product management, and software engineering. Project teams or “squads” consisted of product managers, machine learning experts, developers, and a designer.

Designers integrated with the project team and owned the product through delivery. We used a design system, managed by one member, and regularly met to ensure our projects seamlessly integrated into the ML Platform. There was no official process, so designers were expected to plan, organize, and apply the creative process to get things done.

Most members of my project team had never worked with a designer, which gave me the opportunity to guide them through the creative process with abundant grace. I stuck as closely as I could to the creative process, but often had to work backwards and let my design inform requirements. Requirements changed often, resulting in a lot of churn and refinement.

13 Data Visualizations Designed
4 Course Corrections
Computer showing the machine learning feature
My Role

My responsibilities included creating a design plan, interviewing subject-matter experts, designing and testing the experience, documenting the work, and collaborating with the frontend engineering team to deliver a complete experience. The work began in April 2020, and we needed to finish before the client’s annual conference in November.

I worked closely (remotely) with the PMs and engineers to design and iterate on concepts. I used my knowledge of best design practices and consulted the team to guide us toward creating a simple solution for a complex product. I handed off design to the development team in phases, touching base frequently since we were a fully remote team.

Introducing The

Final Product

Launched in December 2020.

Computer showing the machine learning feature
Monitoring Overview

During or after training, data scientists can view details of the model’s performance. For real-time analysis, users enable monitoring and profiling at the start of training or at any point while training is in progress.

Real-time Suggestions

If the tool detects issues, it will display an alert with a snapshot of the training job and suggestions for how to improve performance. For example, if a CPU bottleneck is detected, it might offer a suggestion to increase the instance size from medium to large.
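To make the CPU bottleneck example concrete, here is a minimal sketch of how a rule like this could work. It is not the tool's actual logic: the `Snapshot` shape, the `suggest` function, the thresholds, and the list of instance sizes are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical instance sizes, ordered smallest to largest.
INSTANCE_SIZES = ["medium", "large", "xlarge"]


@dataclass
class Snapshot:
    """One sample of system utilization during a training job (assumed shape)."""
    cpu_percent: float
    gpu_percent: float


def suggest(samples: List[Snapshot], instance_size: str,
            cpu_high: float = 90.0, gpu_low: float = 50.0) -> Optional[str]:
    """Return a suggestion if the samples look CPU-bound, otherwise None.

    Assumed rule: the job is CPU-bottlenecked when average CPU usage is at or
    above cpu_high while average GPU usage is at or below gpu_low.
    """
    if not samples:
        return None

    avg_cpu = sum(s.cpu_percent for s in samples) / len(samples)
    avg_gpu = sum(s.gpu_percent for s in samples) / len(samples)

    if avg_cpu >= cpu_high and avg_gpu <= gpu_low:
        idx = INSTANCE_SIZES.index(instance_size)
        if idx + 1 < len(INSTANCE_SIZES):
            bigger = INSTANCE_SIZES[idx + 1]
            return (f"CPU bottleneck detected: consider increasing the "
                    f"instance size from {instance_size} to {bigger}.")
        return "CPU bottleneck detected: already on the largest instance size."
    return None
```

Calling `suggest([Snapshot(95.0, 30.0), Snapshot(97.0, 20.0)], "medium")` would produce the medium-to-large suggestion described above, while a job with healthy GPU usage returns no alert.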

Computer showing the machine learning feature
Granular Monitoring

Users can dive into the model training statistics by investigating visualizations per node. They can monitor system resource usage (e.g. GPU usage over time) as well as framework metrics (e.g. time spent in each training cycle).

I like the way you took random thoughts and comments and put them in a structured design.
Satadal B., AI Product Management Leader
I believe that our team cannot afford NOT to have the monitoring and profiling capabilities.
Chaim R., ML Algorithm Developer for client's customer.
Related

Projects

Check out another high-impact project I did for this client.

Laptop showing machine learning tool

I designed a Bias Detection integration throughout a major ML tool to help data scientists understand a model’s predictions.