Migrating from TensorFlow to ML.NET

2022-01-29 .net machine learning auto ml whatsuper

Introduction

Whatsuper is one of my personal projects. It’s an app which aggregates supermarket deals and discounts and displays them for the user. The user can search for specific products (by title) or filter them based on categories, e.g. show all products from categories Fruit and Meat. They can also receive notifications based on these filters, so when a new discount is added, they will receive a push notification.

Every night new entries are added, and each one needs to be assigned to one of these categories.

TensorFlow and AutoML

I started with TensorFlow, specifically training with AutoKeras. This is a great AutoML tool. It reads your data and tries various parameters (so-called hyperparameters) while training the models. Basically, it changes some variables and checks whether your model’s accuracy goes up. If it doesn’t, it changes the variables again and starts the next iteration. It keeps doing this until some limit is reached (usually time or number of iterations) and returns the best model.
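
To make that loop concrete, here is a minimal sketch of the simplest variant of the idea, a random search. It is not how AutoKeras works internally (its search strategy is more sophisticated); the hyperparameter names LearningRate and Layers, the value ranges and the trainAndScore callback are all assumptions made up for illustration.

```csharp
using System;
using System.Diagnostics;

public static class RandomSearchSketch
{
    // Tries random hyperparameter combinations until the trial or time budget
    // runs out, and returns the best-scoring combination.
    // "LearningRate" and "Layers" are hypothetical hyperparameters.
    public static ((double LearningRate, int Layers) Best, double Score) Search(
        Func<(double LearningRate, int Layers), double> trainAndScore,
        int maxTrials,
        TimeSpan maxTime,
        int seed = 42)
    {
        if (maxTrials < 1) throw new ArgumentOutOfRangeException(nameof(maxTrials));

        var rng = new Random(seed);
        var clock = Stopwatch.StartNew();
        (double LearningRate, int Layers) best = default;
        double bestScore = double.NegativeInfinity;

        for (int trial = 0; trial < maxTrials; trial++)
        {
            // Always run the first trial; afterwards stop once the time budget is spent.
            if (trial > 0 && clock.Elapsed >= maxTime) break;

            // "Change the variables": sample a new random combination.
            var candidate = (
                LearningRate: Math.Pow(10, -1 - 3 * rng.NextDouble()), // between 0.0001 and 0.1
                Layers: rng.Next(1, 5));                               // 1, 2, 3 or 4

            // Train a model with these settings and measure its accuracy.
            double score = trainAndScore(candidate);

            // Keep the candidate only if it beats the best one so far.
            if (score > bestScore)
            {
                best = candidate;
                bestScore = score;
            }
        }
        return (best, bestScore);
    }
}
```

In a real setup, trainAndScore would train a model with the given settings and return its validation accuracy, which is exactly the expensive part that makes a GPU (or a time limit) so important.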

Easy?

Getting to the point where I had a model served over an HTTP endpoint was not that easy (even with AutoKeras). Granted, I did not have any experience with machine learning or Python, so it took me a couple of weeks. Some of the issues I had:

GPU drivers (CUDA)
Running AutoKeras on Windows (BERT is not supported on Windows, see GitHub issue)
- This means you’ll need to run on Linux (or maybe macOS), but you still want GPU support, or else training takes way too long. Which meant more driver issues…
Models return an array of scores. Each position in the output corresponds to the index you gave that category in the training input (I mixed these up and spent way too long figuring out why it returned the wrong category 😅).
How do you actually serve the models?
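
The index mix-up from the list above is easier to see in code. This is a minimal sketch, assuming the model returns one score per category and that you kept the category list in the same order you used when encoding the training labels. LabelDecoder and the labels in the usage note are hypothetical names, not part of TensorFlow or AutoKeras.

```csharp
using System;

public static class LabelDecoder
{
    // Returns the label whose position matches the highest score.
    // "labels" must be in exactly the order used to encode the training data.
    public static string Decode(float[] scores, string[] labels)
    {
        if (scores.Length == 0)
            throw new ArgumentException("No scores to decode.", nameof(scores));
        if (scores.Length != labels.Length)
            throw new ArgumentException("Expected one score per label.", nameof(labels));

        int best = 0;
        for (int i = 1; i < scores.Length; i++)
        {
            if (scores[i] > scores[best]) best = i;
        }
        return labels[best];
    }
}
```

For example, LabelDecoder.Decode(new[] { 0.1f, 0.7f, 0.2f }, new[] { "coffee", "meat", "softdrinks" }) picks the second entry, meat. Pass the labels in a different order and the same scores silently decode to the wrong category, which is the bug described above.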

Lesson learned: if you want to run AutoKeras, just use Google Colab.

If you’re interested, this is the Python script I used in Colab. It outputs a model that TensorFlow Serving can serve. I also made a Python proxy, which cleans the input before querying the TensorFlow Serving endpoint.

ML.NET (using AutoML)

I use .NET daily, so doing any machine learning task in .NET should have been a no-brainer for me. For some reason, I did not use ML.NET: I was not aware of its AutoML features. But now I am. And… WOW. I’m amazed, really. Sure, I do a very simple ML task (classification on text), but still. For contrast:

With AutoKeras, the average training time was 6 minutes (with GPU), with an average accuracy of 90%
With ML.NET, the maximum training time is set to 100 seconds, with an average accuracy of 93% (training on CPU only)
With ML.NET I do not have to clean up my input data
Comparing on production data, the model trained by ML.NET:
- predicted better
- did not make wildly wrong predictions, for example classifying hand soap in the category meat

I’m not saying AutoKeras or TensorFlow is bad. If the model is bad, it’s my fault, 100%. I’m no data scientist. I’m simply using AutoKeras and ML.NET as a “user”, as is, and comparing the results of those high-level tools.

Before you start

You’ll need a dataset. For Whatsuper, I needed a set that defines which category each product belongs to. Building it is manual labour unless a suitable dataset already exists. There are sets available on Kaggle, for example, but none for my use case.

A dataset looks like this (Product,Category):

Coca-Cola,softdrinks
Fanta,softdrinks
Nescafe caramel latte macchiato,coffee
500g of spareribs,meat
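
Before training, it can help to check how many examples each category has, since a category with only a handful of products is hard to learn. Here is a minimal sketch, assuming every line has the Product,Category shape shown above and that the category itself never contains a comma. DatasetStats is a hypothetical helper, not part of ML.NET.

```csharp
using System;
using System.Collections.Generic;

public static class DatasetStats
{
    // Counts how many products each category has in "Product,Category" lines.
    public static Dictionary<string, int> CountByCategory(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue; // skip empty lines

            // Split on the LAST comma, so product names containing commas survive.
            int comma = line.LastIndexOf(',');
            if (comma < 0)
                throw new FormatException("Expected 'Product,Category' but got: " + line);

            string category = line.Substring(comma + 1).Trim();
            counts[category] = counts.TryGetValue(category, out var n) ? n + 1 : 1;
        }
        return counts;
    }
}
```

To run it against a file, pass in System.IO.File.ReadLines("products.csv"). For the four sample lines above, it counts two products for softdrinks and one each for coffee and meat.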

Generally, the more data you have, the better your model becomes. The dataset behind Whatsuper’s model contains 64831 products (and it grows every day!). We started with machine learning when about 1500 products had been categorized. However, the predictions still require validation. There’s one dedicated person who validated all products in the database: this awesome dude called my dad! ❤️ So, thank you dad!!!

Getting started with ML.NET

You’ll need:

.NET SDK 6.0
My sample dataset (save it as products.csv)
An IDE with C# support (Visual Studio, Visual Studio Code, Rider)
The ML.NET tool: dotnet tool install -g mlnet

Once you have these prerequisites installed / downloaded, we can continue:

Start the classification on the products.csv file:

mlnet classification --dataset products.csv --has-header false --label-col 1 --train-time 60

mlnet classification: start mlnet with the classification task. Other tasks are available, but those are up to you to explore 😉.
--dataset the path to the dataset
--has-header whether the dataset starts with a header row (e.g. products,label)
--label-col which column holds the label. This value is a zero-based index
--train-time the number of seconds to spend on training

Once it completes, you’ll see a table with the trainers it tried. ML.NET tries different trainers to find the best model for the given dataset:

===============================================Experiment Results=================================================                                                                                                                                                                                                                                                        ------------------------------------------------------------------------------------------------------------------
|                                                     Summary                                                    |
------------------------------------------------------------------------------------------------------------------
|ML Task: multiclass-classification                                                                              |
|Dataset: /Users/gerwim/Downloads/mlnet/products.csv                                                             |
|Label : col1                                                                                                    |
|Total experiment time : 59.21571780000001 Secs                                                                  |
|Total number of models explored: 37                                                                             |
------------------------------------------------------------------------------------------------------------------

|                                              Top 5 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
|1    SdcaMaximumEntropyMulti                     0.7772         0.8066       0.6          1                     |
|2    LinearSvmOva                                0.7768         0.8087       0.7          2                     |
|3    LinearSvmOva                                0.7768         0.7987       1.5          3                     |
|4    LinearSvmOva                                0.7737         0.7821       1.0          4                     |
|5    SdcaMaximumEntropyMulti                     0.7677         0.7964       0.6          5                     |
------------------------------------------------------------------------------------------------------------------

Code Generated
Generated C# code for model consumption: /Users/gerwim/Downloads/mlnet/SampleClassification/SampleClassification.ConsoleApp
Check out log file for more information: /Users/gerwim/.mlnet/log.txt

So what does this mean? Based on the given dataset and my machine (a MacBook Pro), it determined that the best trainer is SdcaMaximumEntropyMulti. It has the highest MicroAccuracy, 0.7772 (roughly 78%) on the current data (learn more about metrics here). To improve this, you’ll need a lot more data, but it’s enough for this quick getting started.
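
The table reports both MicroAccuracy and MacroAccuracy, and the difference matters when some categories have far more products than others. As I understand the ML.NET definitions, micro accuracy is the share of all predictions that are correct, while macro accuracy first computes the accuracy per category and then averages those. The sketch below computes both by hand on made-up data; AccuracySketch is a hypothetical helper and does not call ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AccuracySketch
{
    // Micro accuracy: correct predictions divided by all predictions.
    public static double Micro(IReadOnlyList<string> actual, IReadOnlyList<string> predicted)
    {
        Validate(actual, predicted);
        int correct = 0;
        for (int i = 0; i < actual.Count; i++)
        {
            if (actual[i] == predicted[i]) correct++;
        }
        return (double)correct / actual.Count;
    }

    // Macro accuracy: accuracy per actual category, then the plain average,
    // so a small category weighs as much as a large one.
    public static double Macro(IReadOnlyList<string> actual, IReadOnlyList<string> predicted)
    {
        Validate(actual, predicted);
        var total = new Dictionary<string, int>();
        var correct = new Dictionary<string, int>();
        for (int i = 0; i < actual.Count; i++)
        {
            string label = actual[i];
            if (!total.ContainsKey(label))
            {
                total[label] = 0;
                correct[label] = 0;
            }
            total[label]++;
            if (label == predicted[i]) correct[label]++;
        }
        return total.Keys.Average(label => (double)correct[label] / total[label]);
    }

    private static void Validate(IReadOnlyList<string> actual, IReadOnlyList<string> predicted)
    {
        if (actual.Count == 0)
            throw new ArgumentException("Need at least one example.", nameof(actual));
        if (actual.Count != predicted.Count)
            throw new ArgumentException("Expected one prediction per example.", nameof(predicted));
    }
}
```

Take four meat products that are all predicted correctly and a single pets product that is predicted as coffee: micro accuracy is 4 out of 5 (0.8), but macro accuracy is only 0.5, because pets scores 0 and counts as much as meat.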

ML.NET generated a console app for us called SampleClassification. Open SampleClassification/SampleClassification.sln with your favourite IDE.
Run the default console app. You’ll see this output:

Col0: Smuldier


Predicted Col1 value pets 
Predicted Col1 scores: [0.9268945,0.03021481,0.018464386,0.024426412]

So for the input Smuldier it predicts the category pets. You might think “Smuldier was literally in the dataset”, and you’d be right. Try changing the value to dier:

ModelInput sampleData = new ModelInput()
{
    Col0 = @"dier",
};

and run the application:

Col0: dier


Predicted Col1 value pets 
Predicted Col1 scores: [0.6456386,0.1875865,0.078922115,0.08785279]

Wow! It predicted dier in the correct category. Try these variations too: deer, doer. You’ll see the prediction is still pets, even though the model has never seen this data. Now try different inputs. Completely different. Maybe bodka (as a typo for vodka)? Try others too!

Good job!

Learn more

Thanks for reading!