Data Quality and Quantity for an Industrial AI Project

[Panel Question #2: How can you ensure that you have the correct data in sufficient quality and quantity to carry out an AI project]

Data is the lifeblood of an AI project. In a perfect world, an algorithm would train on all the data that ever existed; even then, it would not be a perfectly trained model.

This question is a hot research topic. The answer depends on several factors, such as the type of classifier, the number of weights, and the data quality. With insufficient quantity or quality of data, the model will not achieve the desired accuracy. With abundant but unbalanced data, the model will overfit the common cases and remain inaccurate in edge situations. These are the most problematic, because they are the cases where you most want accurate recommendations.
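A minimal sketch of why edge cases matter more than headline accuracy: the toy figures below (95 common cases, 5 rare faults) are hypothetical, but they show how an aggregate metric can look excellent while the model is useless exactly where accurate recommendations are most needed.

```python
# Hypothetical illustration: overall accuracy can hide poor edge-case
# performance when the dataset is dominated by common cases.

def accuracy(pairs):
    """Fraction of (predicted, actual) pairs that match."""
    return sum(p == a for p, a in pairs) / len(pairs)

# 95 common cases the model gets right, 5 rare edge cases it gets wrong.
common = [("ok", "ok")] * 95
edge = [("ok", "fault")] * 5

overall = accuracy(common + edge)   # 0.95 -- looks excellent
edge_only = accuracy(edge)          # 0.0  -- useless where it matters

print(f"overall accuracy:   {overall:.2f}")
print(f"edge-case accuracy: {edge_only:.2f}")
```

Evaluating accuracy per stratum rather than in aggregate is one simple guard against this effect.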

Data labelling is critical. The industrial projects that I have seen either buy pre-labelled data or farm out labelling to human-powered task services (Amazon, Fiverr, etc.). Neither approach guarantees good data. With farmed-out labelling, the probability of error is even higher.
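One common check on farmed-out labelling is to have two annotators label the same items and measure their chance-corrected agreement (Cohen's kappa). The annotator lists and "ok"/"fault" labels below are invented for illustration; a low kappa flags labels that need review before training.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Two annotators labelling the same 10 items ("ok" / "fault").
ann_1 = ["ok", "ok", "fault", "ok", "fault", "ok", "ok", "fault", "ok", "ok"]
ann_2 = ["ok", "ok", "fault", "ok", "ok",    "ok", "ok", "fault", "ok", "fault"]

print(f"kappa = {cohens_kappa(ann_1, ann_2):.2f}")
```

Here the annotators agree on 8 of 10 items, but much of that is expected by chance, so kappa comes out around 0.52 rather than 0.8.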

Another issue coming to the fore is the management of data warehouses and data lakes. The new data privacy laws in Québec will give users the right to opt out of databases at any time. Data management must be systematized with governance controls, cataloguing processes and high security. The quantity of data will fluctuate over time, and not always upward. Depending on the level of security and compliance with local legislation, delegating data management to a cloud service outside Québec may no longer be an option.
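Operationally, an opt-out right means training sets must be re-filtered against the current consent state before every run. A minimal sketch, assuming records carry a stable user identifier (the `user_id` field and the sample records are hypothetical):

```python
# Hypothetical sketch: honouring opt-out requests before each training run.

def filter_opted_out(records, opted_out_ids):
    """Drop records belonging to users who have withdrawn consent."""
    return [r for r in records if r["user_id"] not in opted_out_ids]

records = [
    {"user_id": "u1", "value": 0.7},
    {"user_id": "u2", "value": 0.4},
    {"user_id": "u3", "value": 0.9},
]
training_set = filter_opted_out(records, opted_out_ids={"u2"})
print(len(training_set))  # 2 of the 3 records remain
```

The point is the workflow, not the code: the dataset shrinks whenever a user opts out, which is one way the quantity of available data fluctuates over time.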

These issues impact how algorithms are defined, trained and deployed. Recommendations with imperfect, incomplete and fast-changing data carry responsibilities, liabilities and costs for industrial users. AI does not necessarily make worse decisions than humans but comes with ethical, social, cultural and political considerations that influence how the technology should be implemented.

The question is not how to obtain correct data in sufficient quantity and quality. The question, I believe, is whether an AI solution should be built at all. It is tempting to use the technology in the short term to compensate for a lack of personnel or to boost productivity and profitability. The best approach is to develop an operational understanding of an AI application’s suitability by building proofs of concept, test cases and case studies.

There is so much we do not understand about AI in the real world. The proposed application’s social, political, economic and ethical impacts must be considered before deciding on a technology solution. Only where there is a demonstrated advantage should the proofing and scaling of an AI solution be considered. This will point the way towards how to collect, classify and use data to create recommendations which lead to the benefits we seek.

Tech scales, but people don’t.