Today’s note is about data, the base of the pyramid on top of which [most] AI/ML sits. I’ll start by saying I’m not a fan of the “data is the new oil” expression, which leaves a lot unsaid (try searching for ‘data is not the new oil’ to find many different reasons), but the topic is actually inspired by this post by Rachel Thomas: https://www.fast.ai/2021/11/04/data-disasters/ , which goes into several perspectives I also hold – on having context on how the data was collected and what it means, on data work being undervalued, and on the need to consider how people are impacted. A good read, and one that led me to two additional thoughts.
One – in my limited experience “solving” tabular machine learning problems, it always paid off to invest time in data exploration and feature generation. Using AutoML or ensembles of models is probably what gets you over the edge to the best possible model, but understanding the data and creating new features that ‘help’ the models do their work is more likely to get you a real bump in quality. This is painstaking work, though, like a crime mystery to be cracked, and for some mysteries to be solved at all you’ll have to interview the witnesses again to collect more data. Many people are also familiar with research like this survey from Anaconda (reported by Datanami here - https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/), according to which 45% of a data scientist’s time goes to data preparation tasks, versus 21% on model selection/training/scoring. I won’t be surprised to see that last number go down, with further improvements in AutoML.
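To make the feature-generation point concrete, here’s a minimal sketch of the kind of derived feature that ‘helps’ a model – the column names and values are entirely made up for illustration, and real pipelines would of course do far more:

```python
import pandas as pd

# Hypothetical loan-application data (synthetic, for illustration only)
df = pd.DataFrame({
    "income": [3000, 5200, 4100, 2500],
    "loan_amount": [9000, 10400, 20500, 5000],
    "n_late_payments": [0, 1, 4, 0],
})

# Derived features that encode domain knowledge the model would
# otherwise have to discover on its own:
df["debt_to_income"] = df["loan_amount"] / df["income"]    # ratio feature
df["ever_late"] = (df["n_late_payments"] > 0).astype(int)  # binary flag

print(df[["debt_to_income", "ever_late"]])
```

A ratio like debt-to-income is trivial to compute but can be hard for some model families to learn from the raw columns alone – which is exactly why this kind of exploration pays off.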
Second – Rachel’s post includes this quote from Professor Berk Ustun: “when someone asks why their loan was denied, usually what they want is not just an explanation but to know what they could change in order to get a loan.” This again made me think of the Responsible AI Toolbox (https://github.com/microsoft/responsible-ai-toolbox) and EconML (https://github.com/microsoft/EconML), which allow you to identify the “minimum change to lead to a different model prediction”. It’s almost as if some ML problems are now easier to solve, but what happens before and after the models are trained (data collection/preparation, and exploring the data/understanding the predictions) takes ever more time.
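To give a flavor of the “minimum change to flip a prediction” idea – and this is a toy brute-force sketch on synthetic data, not the actual API of the Responsible AI Toolbox or EconML – you can search for the smallest perturbation of one feature that changes a model’s output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: [income_k, debt_ratio] -> approved (1) / denied (0)
X = np.array([[20, 5.0], [30, 4.0], [60, 1.0], [80, 0.5],
              [25, 4.5], [70, 0.8], [40, 3.0], [90, 0.3]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
model = LogisticRegression().fit(X, y)

def minimal_income_change(applicant, step=1.0, max_steps=200):
    """Brute-force search: smallest income increase that flips a denial."""
    for k in range(max_steps + 1):
        candidate = applicant.copy()
        candidate[0] += k * step  # only perturb income, hold debt ratio fixed
        if model.predict([candidate])[0] == 1:
            return k * step
    return None  # no flip found within the search range

applicant = np.array([35.0, 3.5])  # currently denied
print(minimal_income_change(applicant))
```

Real counterfactual tooling does much more than this – it searches over multiple features at once, restricts itself to actionable changes, and keeps the counterfactual plausible – but the core question it answers is the one in Ustun’s quote.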
PS: this just came out today, on the Responsible AI Toolbox: https://techcommunity.microsoft.com/t5/azure-ai-blog/responsible-ai-dashboard-a-one-stop-shop-for-operationalizing/ba-p/3030944 .