Model Factories and Test-Driven Machine Learning
Prof Dr Detlef Nauck, Chief Research Scientist for Data Science, Applied Research, BT PLC [LON: BT.A]
Creating – or "learning" – a model from data can be done through statistical data analysis or through so-called Machine Learning (ML), a field in Computer Science and Artificial Intelligence (AI) that looks for algorithms that automatically improve their performance on a task by observing relevant data rather than through explicit programming. Data Science uses statistics and ML for Data Analytics – the process of turning data into insights that result in better decisions.
To get the maximum benefit out of ML models, we have to make sure that we have the best possible data available to us, that the models are created correctly and operated effectively, and that we build in mitigation for when they go wrong.
The truth is that Data Science and Machine Learning are still relatively new. Many organisations still think of data as something that comes out of the "exhaust" of their operations, and the way they store and manage it does not yet reflect its capacity to drive decisions and automation. When businesses decided to store data in operational stovepipes merely to support business processes, they incurred a technical debt that they now have to pay back before they can reap the benefits of ML and AI.
Businesses also often see ML as just another flavour of software engineering (it is not) or alternatively as something that has nothing to do with it at all (it does). Data Science and ML are in many respects similar to software engineering and stand to gain by learning from this highly evolved engineering discipline.
When we build models from data, we use software and produce software.
Each step in a data transformation means writing code. A model is either a piece of code or a collection of parameters specifying how to run some code. But unlike in software engineering, where a programme is the target artefact, in ML we create several artefacts: data for training and testing, models for computing decisions based on data, and finally the decisions that we use to automate or guide some business activity.
Each artefact that is created needs to be individually versioned, documented, and tested in order to ensure that our ML process is reliable and reproducible. In the same way that an organisation needs to know which software systems it is using, it needs to know which models are used in production, where they came from, for how long they are valid, and how to check they work as intended. The whole process needs to follow legal and ethical guidelines for producing and operating models and has to be auditable in view of potential future AI regulation.
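As a rough illustration of what versioning and documenting a model artefact could look like, the sketch below records a model together with fingerprints of its training data and parameters and a validity date. The names `ModelRecord`, `fingerprint` and `churn-model`, and all field names and values, are hypothetical and chosen only for this example; a real model factory would keep such records in a proper registry.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import date


def fingerprint(artefact) -> str:
    """Return a stable SHA-256 hash of a JSON-serialisable artefact."""
    payload = json.dumps(artefact, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical registry entry: what a model is and where it came from."""
    name: str
    version: str
    training_data_hash: str
    parameters_hash: str
    valid_until: date
    tests_passed: bool

    def is_usable(self, today: date) -> bool:
        # A model may be used only if it passed its tests and is still valid.
        return self.tests_passed and today <= self.valid_until


# Illustrative artefacts (made-up values, not real data).
training_rows = [{"age": 34, "churned": 0}, {"age": 51, "churned": 1}]
parameters = {"intercept": -2.0, "age": 0.04}

record = ModelRecord(
    name="churn-model",
    version="1.0.0",
    training_data_hash=fingerprint(training_rows),
    parameters_hash=fingerprint(parameters),
    valid_until=date(2030, 1, 1),
    tests_passed=True,
)

print(record.is_usable(date(2029, 12, 31)))
```

Because the fingerprints are derived from the content itself, any change to the training data or the parameters produces a different hash, which makes it possible to check later which data a production model was built from.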
Test-driven machine learning or test-driven data analysis tries to learn from test-driven development in software engineering. ML experts have always known that models need to be tested with data not used in the process of creating them (cross-validation). The insight that data needs to be tested before model building and during model operation is not yet that widespread. Using A/B testing (control groups) during model operation to check the quality of decision making is more widely known but rarely practised.
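Testing data before model building can start with very simple checks. The following is a minimal sketch, assuming the data arrives as a list of dictionaries; the column names `age` and `monthly_spend`, the thresholds and the test names are invented for illustration.

```python
def no_missing(rows, columns):
    """True if every row has a non-None value for every listed column."""
    return all(row.get(col) is not None for row in rows for col in columns)


def in_range(rows, column, low, high):
    """True if every value in the column is present and within [low, high]."""
    return all(
        row.get(column) is not None and low <= row[column] <= high
        for row in rows
    )


def run_data_tests(rows):
    """Run all data tests and return the names of those that failed."""
    tests = {
        "enough_rows": lambda: len(rows) >= 3,
        "no_missing_values": lambda: no_missing(rows, ["age", "monthly_spend"]),
        "age_in_range": lambda: in_range(rows, "age", 18, 120),
        "spend_non_negative": lambda: in_range(
            rows, "monthly_spend", 0, float("inf")
        ),
    }
    return [name for name, test in tests.items() if not test()]


# Made-up example rows; the last row of bad_rows violates several tests.
good_rows = [
    {"age": 34, "monthly_spend": 42.5},
    {"age": 51, "monthly_spend": 19.0},
    {"age": 27, "monthly_spend": 63.2},
]
bad_rows = good_rows + [{"age": 230, "monthly_spend": None}]

print(run_data_tests(good_rows))
print(run_data_tests(bad_rows))
```

An empty result means the data may proceed to model building; any failed test should stop the process, just as a failed unit test stops a software build.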
We can compare a decision made by a model to the build process of a piece of software. In test-driven development a build will fail unless all pre-defined test cases have been successfully passed. When using ML models, a decision should only be allowed to be used by an automated process if that decision has passed all test cases. If any test fails, the decision must be rejected, and an exception process has to kick in.
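The comparison with a software build can be sketched in a few lines. This is an illustration under assumptions rather than a prescribed design: the `Decision` fields, the three test cases, the `DecisionRejected` exception and the `refer_to_human` fallback are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    customer_id: str
    action: str
    score: float


class DecisionRejected(Exception):
    """Raised when a decision fails at least one test case."""

    def __init__(self, failed_tests):
        super().__init__("failed tests: " + ", ".join(failed_tests))
        self.failed_tests = failed_tests


# Hypothetical test cases that every decision has to pass.
DECISION_TESTS = {
    "score_is_probability": lambda d: 0.0 <= d.score <= 1.0,
    "action_is_known": lambda d: d.action in {"offer_discount", "no_action"},
    "confident_enough": lambda d: abs(d.score - 0.5) >= 0.1,
}


def gate(decision):
    """Return the decision only if it passes all tests, like a build gate."""
    failed = [name for name, test in DECISION_TESTS.items() if not test(decision)]
    if failed:
        raise DecisionRejected(failed)
    return decision


def decide_or_escalate(decision):
    """Use the decision if it passes; otherwise start the exception process."""
    try:
        return gate(decision).action
    except DecisionRejected:
        return "refer_to_human"


print(decide_or_escalate(Decision("c1", "offer_discount", 0.91)))
print(decide_or_escalate(Decision("c2", "offer_discount", 1.7)))
```

The automated process only ever sees decisions that have passed the gate; everything else is routed to the exception process, here reduced to a single fallback value.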
Following test-driven development practices should mean that the data used to build a model and the model itself have already passed all specified tests before the model is allowed into operation. However, in operation, the validity of the current input data, the behaviour of the model over time and the quality of the decisions still need to be tested continuously and in real time.
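One simple way to test input data continuously in operation is to compare a rolling window of recent values with statistics recorded at training time. The sketch below is one possible approach, not the only one: the class name `DriftMonitor`, the baseline values, the window size and the threshold of three standard errors are assumptions made for illustration.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Checks whether the recent mean of an input drifts from its baseline."""

    def __init__(self, baseline_mean, baseline_std, window=50, max_z=3.0):
        self.baseline_mean = baseline_mean
        self.baseline_std = baseline_std
        self.max_z = max_z
        # Only the most recent `window` observations are kept.
        self.values = deque(maxlen=window)

    def observe(self, value) -> bool:
        """Record a value; return True while the input still looks valid."""
        self.values.append(value)
        # Standard error of the window mean under the training distribution.
        standard_error = self.baseline_std / len(self.values) ** 0.5
        z = abs(mean(self.values) - self.baseline_mean) / standard_error
        return z <= self.max_z


# Baseline taken from (hypothetical) training data: mean 40, std 10.
monitor = DriftMonitor(baseline_mean=40.0, baseline_std=10.0, window=25)

stable = [monitor.observe(40.0) for _ in range(25)]
shifted = [monitor.observe(60.0) for _ in range(25)]

print(all(stable))
print(shifted[-1])
```

When `observe` returns `False`, the model is receiving data unlike the data it was built on, and its decisions should be treated as suspect until the cause has been investigated.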
The idea of a model factory is to support test-driven ML and Data Science with a collection of (mostly open source) tools that support development, test, deployment, and operation. In addition, the model factory provides an overlay of versioning, orchestration, reporting and governance. A model factory is a way of working, not a single tool or piece of software. It is unlikely that you can buy one from a vendor. No single solution will fit your organisation’s legacy IT and data infrastructure perfectly. Think of the latest Auto-ML tool a vendor is offering you as a flashy sports car. What you really need is a diverse fleet that can drive on everything from mud roads to motorways, and a highway code that makes everyone drive safely.
If you want to reliably churn out ML models with a high degree of automation, you need a model factory that fits your business. Go and build one.