Wednesday, December 8, 2021

Jota’s ML Advent Calendar – 08/December

The topic – or reflection – of the day is MLOps, a term whose popularity probably comes from the fact that it’s a natural step when setting up enterprise ML/DS solutions. Unlike DevOps, which may be considered a largely solved problem, MLOps sits at the intersection of three different skill sets, and that makes implementations longer and more complex:

  • Data science – not only the ML/DS knowledge itself but, in platforms like Azure Machine Learning and others, also the ability to set up Machine Learning pipelines that go beyond “a set of consecutive calls of Python functions”, plus packages like mlflow for model and metadata tracking (a minimal mlflow sketch follows this list).
  • DevOps – and I’m thinking here of the ability to set up things like GitHub or Azure DevOps, from doing source control properly, commits to merges (a nightmare with Jupyter notebooks, by the way), to setting up multi-stage train/build/deploy pipelines across the different environments. Since there are specificities to doing this for data science, whoever does it must also have some of the previous skills – it’s not just about building and deploying code, but also about training, deploying and monitoring models.
  • Infrastructure and Security – deployments of the environments must be protected against things like data exfiltration; network-level security has to be in place, along with firewalls, access control, etc. While Infrastructure/Security skills for DevOps deployments do exist – e.g., setting up private build agents inside virtual networks – once you add the specificities of securing Machine Learning services, they become much rarer.
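To make the tracking point in the first bullet concrete, here is a minimal mlflow sketch – logging parameters, a metric and the trained model in the same run. The dataset, model and metric are placeholders I picked for illustration, not a recommendation.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    # parameters, metrics and the serialized model all land in the same run,
    # so the experiment stays reproducible and comparable to later runs
    mlflow.log_params(params)
    mlflow.log_metric("test_auc", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    mlflow.sklearn.log_model(model, artifact_path="model")
```

In Azure Machine Learning the same code can work against the workspace’s mlflow tracking URI, which is part of why I like mlflow as the common denominator here.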

In my view, part of the complexity of MLOps deployments comes from the fact that these are different skill sets – people speaking different languages, with limited knowledge of each other’s worlds. One more thing that makes it complex is that data, code and models are “source controlled” in different repositories: data sits in data lakes or databases, models go to model repositories, code goes to source control. But a model can be trained and its metrics stored along with it while the code is still sitting in a Jupyter Notebook, not checked in and with no connection to the model – and data is rarely versioned. There goes reproducibility.

There may be approaches in the market that already solve this with a five-click setup, but in my view we’re not yet in “simple and quick” territory.

To close, I could mention the set of best-of-breed, complementary and interconnected services for the above, from GitHub to Azure Machine Learning and Functions or Kubernetes, plus all the security mechanisms of an enterprise cloud. :-) But I’ll leave you instead with an excellent post on MLOps from the point of view of a data scientist, written by a colleague, Maggie Mhanna: https://towardsdatascience.com/mlops-practices-for-data-scientists-dbb01be45dd8 .

Tuesday, December 7, 2021

Jota’s ML Advent Calendar – 07/December

Today’s note is about data, the base of the pyramid on top of which [most] AI/ML sits. I’ll start by saying I’m not a fan of the “data is the new oil” expression, which leaves a lot unsaid (try searching for ‘data is not the new oil’ to find many different reasons), but the topic is actually inspired by this post by Rachel Thomas: https://www.fast.ai/2021/11/04/data-disasters/ , which goes into several perspectives I also hold – having context on how the data was collected and what it means, data work being undervalued, and the need to consider how people are impacted. A good read, and one that led me to two additional thoughts.

One – in my limited experience “solving” tabular machine learning problems, it has always paid off to invest time in data exploration and feature generation. Using AutoML or ensembles of models is probably what gets you over the edge to the best possible model, but understanding the data and creating new features that ‘help’ the models do their work is more likely to get you a real bump in quality. This is painstaking work, like a crime mystery to be cracked, and for some mysteries to be solved at all you’ll have to interview the witnesses again, i.e., collect more data. Many people are also familiar with research like this from Anaconda (reported by Datanami here - https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/), according to which 45% of data science time goes into data preparation tasks, versus 21% on model selection/training/scoring. I won’t be surprised to see that last number go down further with improvements in AutoML.
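Just to make “creating new features” tangible, here is a toy sketch of the kind of thing I mean – date-derived columns and per-entity aggregates over a made-up transactions table (the columns are hypothetical, purely for illustration).

```python
import pandas as pd

# a tiny, made-up transactions table, just to illustrate the idea
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 25.0, 5.0, 7.5, 100.0],
    "timestamp": pd.to_datetime([
        "2021-12-01 09:00", "2021-12-04 18:30", "2021-12-02 11:15",
        "2021-12-03 23:45", "2021-12-05 08:00"]),
})

# simple date-derived features
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5

# per-customer aggregates, joined back as features for each row
agg = (df.groupby("customer_id")["amount"]
         .agg(["mean", "std", "count"])
         .add_prefix("amount_")
         .reset_index())
df = df.merge(agg, on="customer_id", how="left")
print(df)
```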

Second – Rachel’s post includes this quote from Professor Berk Ustun: “when someone asks why their loan was denied, usually what they want is not just an explanation but to know what they could change in order to get a loan.” This again made me think of the Responsible AI Toolbox (https://github.com/microsoft/responsible-ai-toolbox) and EconML (https://github.com/microsoft/EconML), which allow you to identify the “minimum change that leads to a different model prediction”. It’s almost as if some ML problems are now easier to solve, but what happens before and after the models are trained (collecting/preparing the data, exploring it, understanding the predictions) takes ever more time.
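For that “minimum change” idea, here is a minimal counterfactual sketch using the dice-ml package, which, as far as I know, is what the toolbox builds on for its counterfactual analysis. The loans-style dataset and its columns are made up for illustration.

```python
import dice_ml
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# a made-up "loan approval" dataset, purely for illustration
df = pd.DataFrame({
    "income":   [30.0, 45.0, 60.0, 80.0, 25.0, 90.0, 55.0, 40.0],
    "age":      [22.0, 35.0, 41.0, 50.0, 28.0, 60.0, 33.0, 45.0],
    "approved": [0, 0, 1, 1, 0, 1, 1, 0],
})
model = RandomForestClassifier(random_state=0).fit(df[["income", "age"]], df["approved"])

data = dice_ml.Data(dataframe=df, continuous_features=["income", "age"], outcome_name="approved")
m = dice_ml.Model(model=model, backend="sklearn")
explainer = dice_ml.Dice(data, m, method="random")

# "what is the smallest change that would flip this rejection into an approval?"
query = df.drop(columns="approved").iloc[[0]]
cfs = explainer.generate_counterfactuals(query, total_CFs=3, desired_class="opposite")
cfs.visualize_as_dataframe(show_only_changes=True)
```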

PS: this just came out today, on the Responsible AI Toolbox: https://techcommunity.microsoft.com/t5/azure-ai-blog/responsible-ai-dashboard-a-one-stop-shop-for-operationalizing/ba-p/3030944 .

Monday, December 6, 2021

Jota’s ML Advent Calendar – 06/December

My highlight for the day is a publication that came out a few weeks ago, the 2021 State of AI Report ( https://www.stateof.ai/ ), an attempt at compiling the most interesting developments in AI/ML of the year, mostly without too much technical detail. It still runs to 188 slides, so there’s a lot to see. Some of the topics that especially interest me are the uses in biology, the impact of transformer models not only in NLP but also in Computer Vision, and the discussion of the trend towards larger and larger language models. AI chips also make an appearance, although, possibly because of the global chip shortage, with less prominence than usual.

As a somewhat related topic, something I found missing was any mention of either Causal Reasoning (think Judea Pearl’s “The Book of Why”) or hybrid approaches to AI – e.g. the “symbolic + connectionist” approach (think Gary Marcus and Ernest Davis’ “Rebooting AI” as a popular recent reference). Maybe we’re just still too amazed with the achievements of modern large language models.

On the above topic: on December 23rd, if the time and timezones work for you, it could be a good idea to register for the free “AI Debate #3” ( https://www.eventbrite.ca/e/ai-debate-3-live-streaming-tickets-133817911977 ), co-organized by MONTREAL.AI and Gary Marcus, with a great list of speakers.

As a very final note – when I was at university and briefly studied AI, Neural Networks were just a theoretical concept, and I had to hand-draw a SNePS network to represent a domain of knowledge and code A* in LISP for a checkers game. How far things have come.

Saturday, December 4, 2021

Jota’s ML Advent Calendar – 04/December

Today’s post is a short note on adversarial attacks in Computer Vision, prompted specifically by a new dataset and paper on arXiv from researchers at Scale AI, the Allen Institute for AI and ML Collective: “Natural Adversarial Objects” (arXiv:2111.04204). The word “Natural” comes from the fact that these 7,934 photos were not artificially or intentionally created to cause problems in detection, but were selected because they are mislabeled by 7 object detection models (including YOLOv3). And “Objects” in the title refers to the fact that the analysis focuses on object detection scenarios – not image classification.

The authors then measured mean average precision (mAP) against this dataset versus the MSCOCO dataset, and the difference in performance is huge (e.g., 74.5% worse)!

For anyone interested, the NAO dataset itself is shared via Google Drive (“Natural Adversarial Objects - Google Drive”).

And continuing on a related topic, what the above also shows is that when evaluating trained models it isn’t enough to look at how good a certain aggregate metric is (like mAP, F1 or AUC); it’s also important to look at the distribution of the errors. And to do this (surprise!) Microsoft has a Python package: the Error Analysis component of the Responsible AI Widgets, with capabilities for visual analysis and exploration of the errors. More information and sample notebooks are available here: https://github.com/microsoft/responsible-ai-toolbox/blob/main/docs/erroranalysis-dashboard-README.md .
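The dashboard does this interactively, but the underlying idea – don’t stop at the aggregate metric, check where the errors concentrate – can be sketched in a few lines of plain pandas/sklearn. The dataset and the feature I slice on are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# the overall accuracy hides where the model is failing
results = X_test.copy()
results["error"] = (model.predict(X_test) != y_test).astype(int)

# slice the test set by one feature and compare error rates per slice
results["radius_bucket"] = pd.qcut(results["mean radius"], q=4)
print(results.groupby("radius_bucket")["error"].agg(["mean", "count"]))
```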

Friday, December 3, 2021

Jota’s ML Advent Calendar – 03/December

Today’s post is about Model Explainability: techniques for trying to understand why models – often black-box models like neural networks – give certain results, and which features matter most.

There are two reasons I’m posting this. One, we have first-class, leading approaches in use that were either created at Microsoft or are owned by people currently at Microsoft; and two, not many people actually know this.

As one example, check out this link from June this year, looking at 7 packages for explainability in Python: https://towardsdatascience.com/explainable-ai-xai-a-guide-to-7-packages-in-python-to-explain-your-models-932967f0634b . The page features SHAP at #1, probably the most popular of all, but also LIME at #2, as well as EBM/InterpretML, the latter described as “an open source package from Microsoft”. What it doesn’t say, however, is that the creators of both SHAP (Scott Lundberg) and LIME (Marco Tulio Ribeiro) are researchers at MSR. So out of a list of 7 packages, 3 were either created at Microsoft or are owned by current employees. And if we add Fairlearn in a related area, plus the recent developments on top of SHAP for explainability in NLP and Vision, we are (in my view) unbeatable in this area.
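If you have never tried SHAP, this is roughly all it takes for a tree-based model – a minimal sketch with standard SHAP usage, on a toy dataset I picked just for illustration.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# global view: which features drive the model's predictions overall
shap.summary_plot(shap_values, X_test)
```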

Because EBMs are probably the least well known of these, I’m also leaving a link to an article focusing just on them: https://towardsdatascience.com/interpretml-another-way-to-explain-your-model-b7faf0a384f8 . And what I’d suggest is this: next time you are training a model with XGBoost or LightGBM, try an EBM as well. You may be surprised by how good it is.
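A minimal sketch of that suggestion, using InterpretML’s Explainable Boosting Machine with default settings – it follows the usual sklearn fit/predict interface, so swapping it in next to an XGBoost/LightGBM baseline is cheap (toy dataset again, for illustration only).

```python
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# glass-box model: accuracy often in the same ballpark as boosted trees, but interpretable
ebm = ExplainableBoostingClassifier(random_state=0)
ebm.fit(X_train, y_train)
print("test accuracy:", ebm.score(X_test, y_test))

# per-feature shape functions and interactions, viewable in a notebook
show(ebm.explain_global())
```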

Thursday, December 2, 2021

Jota’s ML Advent Calendar – 02/December

Today’s pick refers to a recent Databricks acquisition: that of 8080labs and their bamboolib product. You may be familiar with Python libraries like Pandas Profiling, which creates a profile of a tabular dataset – think of Azure Machine Learning’s dataset profile, but with more sophisticated information – good for a first feel of what’s in the data.
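If you haven’t used Pandas Profiling, the whole thing is a couple of lines – a minimal sketch below, with a toy DataFrame standing in for your own data.

```python
from pandas_profiling import ProfileReport
from sklearn.datasets import load_breast_cancer

# any tabular DataFrame works; a sklearn toy dataset stands in here
df = load_breast_cancer(return_X_y=True, as_frame=True)[0]

profile = ProfileReport(df, title="Dataset profile", minimal=True)
profile.to_file("profile.html")  # or profile.to_notebook_iframe() inside Jupyter
```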

Bamboolib is also a Python library, with a community (free) and a paid tier, that provides an interactive UI for exploring pandas dataframes, including capabilities like creating calculated columns, applying filters, generating Python code from the UI configuration, charting, and more. Parts of it remind me of the data reshaping capabilities in Power BI.

A short video on how it works is here: https://youtu.be/Qni8kX4hSOM , and the homepage is https://bamboolib.8080labs.com/ . I assume this will be fully integrated with Databricks/Spark DataFrames/Koalas soon, but in the meantime, for tabular data exploration, the free tier is worth checking out.

A final note about the Databricks acquisition of 8080labs, described here: https://databricks.com/blog/2021/10/06/bringing-lakehouse-to-the-citizen-data-scientist-announcing-the-acquisition-of-8080-labs.html. Databricks/Spark at heart does not target citizen data scientists, but this acquisition (much like that of Redash before it – now “Databricks SQL”) shows them clearly trying to move into that space.

Wednesday, December 1, 2021

Jota's ML Advent Calendar - 01/December

Hopping on the idea of the “advent calendar”, very typical in Germany (at least in Bavaria), during this month I’ll be sharing, every day, one link or piece of information I’ve read recently in the field of AI/ML and found interesting.

For a start, my suggestion of the day is FLAML (https://github.com/microsoft/FLAML). This is an AutoML Python library available on GitHub, created by people at MSR and based on research published at the end of last year. Why another one, especially considering we already have AutoML in Azure Machine Learning? A few reasons I like it (a minimal usage sketch follows the list below):

  • It’s fully Python based, with a super-simple API. You have control of all the parameters of your experiments, in code that you can source control.
  • In my experiments (for tabular data + classification/regression), I got consistently good results.
  • You can set a training budget – for how long you want it to train.
  • You can pick the algorithms you want to use in the training (the most common being LightGBM, XGBoost and CatBoost) – and if you pick only one, you’re effectively doing hyper-parameter tuning.
  • Supports sklearn pipelines (i.e., you can, for example, do AutoML as the final step of a training pipeline).
  • You can do an optimization run based on a previous run, to further optimize results you’ve already obtained.
  • Has support for ensembles/stacking, where a set of models is trained to make a first prediction, and a final estimator then builds on the outputs of those predictors to make the final prediction.
  • And obviously, it runs in AML (albeit without benefiting from clusters/parallelization).
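To make a few of the bullets above concrete (the budget, the estimator list, ensembling), here’s a minimal sketch of how I’ve been calling FLAML; defaults and exact option names can vary across versions, so treat it as a starting point rather than a recipe.

```python
from flaml import AutoML
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = AutoML()
automl.fit(
    X_train=X_train, y_train=y_train,
    task="classification",                # or "regression"
    metric="roc_auc",
    time_budget=60,                       # training budget, in seconds
    estimator_list=["lgbm", "xgboost"],   # pick only one for pure hyper-param tuning
    ensemble=True,                        # stack the individual learners
    log_file_name="flaml.log",
)

print("best estimator:", automl.best_estimator, automl.best_config)
print("test AUC:", roc_auc_score(y_test, automl.predict_proba(X_test)[:, 1]))
```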

Two other relevant links I’d also include are Optuna (https://optuna.org/), a library specifically for hyperparameter tuning (similar to Hyperopt), and LightAutoML (https://github.com/sberbank-ai-lab/LightAutoML); both are widely used in Kaggle competitions.
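And since Optuna came up, a minimal tuning sketch with it – the objective and search space below are toy choices of mine, just to show the shape of the API.

```python
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # the search space: a couple of RandomForest hyperparameters
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 400),
        "max_depth": trial.suggest_int("max_depth", 2, 12),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = RandomForestClassifier(**params, random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)
```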

Cheers!