In our previous three blog posts we wrote about the further challenges that remain, once the data scientist has obtained the data, on the route from the data to the extraction of insight.
We also mentioned that recent algorithmic advances, such as XGBoost and Random Forests, produce very good AI/ML models, using at their core the concept of variable ‘binning’. Given their performance, these techniques have become very popular amongst data scientists, and for good reason: the binning essentially performs a feature transformation, on which the algorithm then works further.
Practitioners in the scoring industry have known this concept of ‘binning’ for quite some time. It was their desire to increase the performance of the econometric models, to increase the user-level control of the statistician, and to ensure full explainability (in easy terms) and transparency that led them to this development.
At the core, the concept of binning is straightforward: a continuous variable is transformed into ‘bins’, which are essentially ranges of values. Ideally, each bin captures a different relationship between the variable range and the outcome. For example, young people in the range of 20 to 30 years may respond faster to a marketing campaign than people in the range of 50 to 60 years. The feature transformation then replaces the original values with a single value per bin, representing the relationship between the range (‘bin’) and the target or dependent variable.
For categorical variables, the process is much the same, but the ‘ranges’ may simply be the original values. The feature transformation, however, ensures a numeric replacement, effectively preparing the variable for use in the ML or AI algorithm.
This technique offers a number of clear advantages:
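To make the transformation concrete, here is a minimal sketch in Python of binning a continuous variable and replacing each value with the Weight of Evidence (WoE) of its bin. The bin edges, the sample data, the function names and the smoothing constant are all assumptions made for illustration, and the sign convention for WoE varies between practitioners; here it is the log of the share of responders over the share of non-responders.

```python
import math
from bisect import bisect_right


def assign_bin(value, edges):
    """Return the index of the bin a value falls into.

    `edges` holds the sorted interior cut points; a value equal to an
    edge is placed in the bin to its right.
    """
    return bisect_right(edges, value)


def woe_table(values, targets, edges, smoothing=0.5):
    """Compute one WoE value per bin for a binary target (1 = responder)."""
    n_bins = len(edges) + 1
    events = [0] * n_bins
    non_events = [0] * n_bins
    for value, target in zip(values, targets):
        b = assign_bin(value, edges)
        if target == 1:
            events[b] += 1
        else:
            non_events[b] += 1

    total_events = sum(events)
    total_non_events = sum(non_events)
    woe = []
    for e, ne in zip(events, non_events):
        # Additive smoothing (an illustrative choice) avoids log(0)
        # when a bin contains only responders or only non-responders.
        share_events = (e + smoothing) / (total_events + smoothing * n_bins)
        share_non_events = (ne + smoothing) / (total_non_events + smoothing * n_bins)
        woe.append(math.log(share_events / share_non_events))
    return woe


def transform(values, edges, woe):
    """Replace each original value by the WoE of its bin."""
    return [woe[assign_bin(v, edges)] for v in values]


# Hypothetical sample: age and whether the person responded to a campaign.
ages = [22, 25, 28, 35, 41, 47, 52, 55, 58, 63]
responded = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
edges = [30, 50]  # three bins: under 30, 30 to 49, 50 and over

woe = woe_table(ages, responded, edges)
ages_transformed = transform(ages, edges, woe)
```

In this made-up sample the youngest bin receives the highest WoE, which is the numeric form of the "young people respond faster" pattern described above.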
- It automatically takes care of missing data, as it replaces ‘missing’ with a meaningful value
- Categorical or nominal values are replaced by meaningful values
- The explanatory or predictive pattern will be captured in the way the variable is binned, and the new value (e.g. the WoE) assigned to each bin.
- Graphically, the binned variable can be easily understood by business colleagues and domain experts. The graphs are also easy to create.
- The data scientist can control the binning in an active manner, capturing business and domain expert feedback, and exerting user-level control over how a data field will enter the ML or AI.
- Explainability and transparency are therefore built-in, to the benefit of the business, the domain expert, and the data scientist.
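The first two advantages in the list can be illustrated with a short sketch: missing values are given a bin of their own, and every category (including the missing bin) is replaced by a numeric WoE value. The `MISSING` label, the sample data, the function names and the smoothing constant are hypothetical choices for this example, not part of any specific product.

```python
import math
from collections import defaultdict

MISSING = "MISSING"  # hypothetical label for the bin that holds missing values


def bin_key(category):
    """Map a raw category to its bin; None is treated as missing."""
    return MISSING if category is None else category


def categorical_woe(categories, targets, smoothing=0.5):
    """Return a dict mapping each bin to its WoE for a binary target."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [events, non_events]
    for category, target in zip(categories, targets):
        counts[bin_key(category)][0 if target == 1 else 1] += 1

    n_bins = len(counts)
    total_events = sum(c[0] for c in counts.values())
    total_non_events = sum(c[1] for c in counts.values())

    woe = {}
    for key, (e, ne) in counts.items():
        # Additive smoothing (illustrative) keeps the logarithm finite
        # for bins that contain no responders or no non-responders.
        share_events = (e + smoothing) / (total_events + smoothing * n_bins)
        share_non_events = (ne + smoothing) / (total_non_events + smoothing * n_bins)
        woe[key] = math.log(share_events / share_non_events)
    return woe


def encode(categories, woe):
    """Replace each category, including missing ones, by its bin's WoE."""
    return [woe[bin_key(c)] for c in categories]


# Hypothetical sample: acquisition channel with some missing entries.
channel = ["web", "web", None, "store", "store", None, "web", "store", None, "web"]
responded = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]

channel_woe = categorical_woe(channel, responded)
channel_encoded = encode(channel, channel_woe)
```

Because the missing bin gets its own value, the fact that a field is empty can itself carry predictive information instead of being dropped or imputed blindly.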
For all its benefits, binning can also be time consuming. Arriving at the ‘optimal binning’ may take time and may require several iterations. Newer algorithmic techniques such as XGBoost and Random Forests use the concept of decision trees, combined with quantising (binning), and then iteratively try to find the optimal combinations and ‘bins’.
These techniques have proven to be powerful, but have the disadvantage that they may not be as explainable and transparent as desired. Also, a data scientist cannot really exert variable-level control, as these systems are designed to run in a highly automated manner.
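As a rough illustration of how a tree-based search looks for a bin boundary, the sketch below scans every candidate cut point of one variable and keeps the one with the lowest weighted Gini impurity. This is a deliberately simplified, single-split example under assumed sample data and function names; it is not the actual implementation used by XGBoost or Random Forests, nor the Quantforce algorithm.

```python
def gini(positives, n):
    """Gini impurity of a group with `positives` ones out of `n` rows."""
    if n == 0:
        return 0.0
    p = positives / n
    return 2 * p * (1 - p)


def best_split(values, targets):
    """Find the cut point that minimises the weighted Gini impurity.

    Returns (cut, score); cut is None if all values are identical.
    """
    pairs = sorted(zip(values, targets))
    n = len(pairs)
    total_pos = sum(t for _, t in pairs)

    best_cut, best_score = None, float("inf")
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        # A cut is only possible between two distinct values.
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left_n, right_n = i, n - i
        score = (
            left_n * gini(left_pos, left_n)
            + right_n * gini(total_pos - left_pos, right_n)
        ) / n
        if score < best_score:
            best_score = score
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut, best_score


# Hypothetical sample in which everyone aged 35 or younger responded.
ages = [22, 25, 28, 35, 41, 47, 52, 55, 58, 63]
responded = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

cut, score = best_split(ages, responded)
```

Applying the same search again inside each resulting group would produce further boundaries, which is the iterative part of the process; a real implementation adds stopping rules and many other refinements.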
It is for these reasons that Quantforce has developed a specific Optimal Binning Algorithm, based on many years of experience in scoring and developing powerful yet explainable ML and AI solutions.
The algorithm is available to users and practitioners through two software applications:
- Quantforce Modeling suite: desktop or laptop based software, which includes the Optimal Binning Algorithm and also a Machine Learning section where a scorecard model can directly be built.
- QuantDiscovery: a web-based application, where a user can run the Optimal Binning Algorithm on his or her dataset.
Both platforms are low code and click-and-play, and all results are displayed both numerically and in a highly visual way. The analysis results, including the binning tables and graphical representation, can also be downloaded in Excel. This enables easy sharing of, and interaction on, the results with the business and domain experts.
The binning or feature transformation code is also immediately available and can be downloaded in the most common programming languages used by data scientists: Python, R, SAS, C#.
When utilising the Quantforce software, a data scientist will be able to reap the benefits of the binning as highlighted above, with two clear additional advantages:
- The Optimal Binning Algorithm significantly improves the search for the optimal feature transformation. This has proven to be an important time-saver for data scientists, and greatly improves the quality of the final ML or AI model.
- The highly visual results, on screen and in the Excel download, together with the direct export of the feature transformation code, avoid repetitive coding tasks. Again, the data scientist can use the time saved for interaction with the business or domain experts, and invest more time in delivering a great ML or AI model.