In last week’s blog, we discussed how being well organised in data collection, storage and extraction benefits the journey towards becoming a data-driven company.
Once the data has been made available to the data scientist, the next step in the process is for them to extract relevant insights from it, and to communicate these findings in a non-technical manner to their business partners or domain experts. As so often, visualisations of predictive or explanatory data patterns are most useful when interacting with colleagues: one graph will often be more powerful than a thousand words.
Luckily, software packages often used by data scientists, such as R, Python, SAS, …, offer a wide variety of functions, code and add-on tools to help them wade through the data and start extracting relevant insights.
Still, even with these tools, the data scientist may need a good deal of time to work their magic on the data. In nearly all projects, data scientists will face the following challenges:
- Progress can be slowed by repetitive, time-consuming coding and visualisation tasks. Graphs are very powerful, both for the data scientist and for the business manager or domain expert, but simple yet insightful graphs are not always easy to produce, and may require quite some data pre-processing.
- Loss of data or records: nearly all datasets contain missing data in one way or another. As our computing machines do not handle missing data well, either these records need to be removed, or data imputation needs to be performed. Theories and tools exist for handling and imputing missing data, but the final choice still resides with the individual data scientist, and at times it can be a difficult one to make.
- Powerful algorithms producing suboptimal results: over time, the AI and ML algorithms a data scientist can leverage have become ever more powerful. These techniques, however, are known to amplify the undesirable effects of ‘garbage in, garbage out’, which puts extra importance on feature engineering and transformation. Recent algorithmic advances such as XGBoost and Random Forests rely on a form of quantiling, or ‘binning’. This can be seen as a built-in feature transformation, and consequently these algorithms have delivered very good predictive or explanatory performance. This ‘binning’ is not really new: scoring professionals have known and leveraged the technique for many years, but its use has so far been limited to their industry.
- Low level of interaction with the business user: the outcome of nearly all data science projects is to support the business or the domain experts (such as medical doctors). During the data science process, communication and exchange of points of view between the data scientist and his or her colleagues is therefore of the utmost importance. Visualising the data and the analysis outcome in an easy-to-understand manner, and spending time reviewing it, will benefit all involved. However, as mentioned above, generating insightful visualisations takes time; often so much time that little is left for the interaction itself, to the detriment of the project’s outcome.
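To make the first of these challenges concrete, here is a minimal sketch of the pre-processing that often precedes even a simple bar chart, assuming pandas and matplotlib and a small, entirely hypothetical sales dataset:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Hypothetical raw records, as they might come out of the data store
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "revenue": [120, 80, 200, 150, 50],
})

# The pre-processing step: aggregate the raw records before plotting
summary = df.groupby("region")["revenue"].sum().sort_values(ascending=False)

# One bar chart often says more than a table of numbers
ax = summary.plot(kind="bar", title="Revenue by region")
ax.set_ylabel("Revenue")
plt.tight_layout()
plt.savefig("revenue_by_region.png")
```

Even this toy example needs a group-by, an aggregation and a sort before anything is drawn; on real data, joins, filters and date handling typically come on top.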
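The missing-data trade-off can be illustrated in a few lines, again with pandas and a hypothetical dataset. Neither option below is "the" right answer; choosing between them is exactly the decision left to the individual data scientist:

```python
import pandas as pd

# Hypothetical dataset with gaps (None becomes NaN in numeric columns)
df = pd.DataFrame({
    "age": [25, None, 47, 31, None],
    "income": [30000, 45000, None, 52000, 38000],
})

# Option 1: drop every record with a missing value (loses 3 of 5 rows here)
dropped = df.dropna()

# Option 2: impute with a simple statistic, e.g. each column's median
imputed = df.fillna(df.median())
```

Dropping keeps only clean records but shrinks the dataset; median imputation keeps every record but invents values, which can bias downstream models.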
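As a rough illustration of quantile binning as a feature transformation (a sketch of the general technique, not of the internals of XGBoost or Random Forests), pandas’ `qcut` splits a feature into bins that each hold roughly the same number of observations:

```python
import pandas as pd

# Hypothetical income feature with a heavy right tail
income = pd.Series([18000, 25000, 31000, 40000, 52000, 75000, 120000, 300000])

# Quantile binning: four bins with roughly equal counts, so the
# extreme 300000 value no longer dominates the feature's scale
bins = pd.qcut(income, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
```

After this transformation a model sees ordered categories rather than raw amounts, which is one reason binning-based methods tend to be robust to outliers and skew.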
If you want to react, add comments or ask us some questions, please leave a message here: Contact – Quantforce