Category: Machine Learning

15 Sep

Big claims are made about the future of artificial intelligence and machine learning – but your datasets are small. What does this mean for you?

Elizabeth | Data analysis, Human-Computer Interaction, Machine Learning

In 2021, 76% of sampled enterprises prioritized artificial intelligence (AI) and machine learning (ML) over other IT initiatives, according to a study by Algorithmia (Columbus, 2021). Yet despite increased spending on AI and ML projects across industries, many organizations are left wondering how they can implement these solutions with the small datasets they actually have.

For analytics and prediction, machine learning solutions require large, high-quality datasets to train algorithmic models. Here, ‘high-quality’ and ‘large’ refer to data that is relevant, diverse and representative of your problem, and that has undergone adequate preparation (formatting, cleaning, feature extraction). As to how large a dataset needs to be to train even a relatively simple model, Telus International describes the ‘Rule of 10’:

“One common and much debated rule of thumb is that a model will often need ten times more data than it has degrees of freedom. A degree of freedom can be a parameter which affects the model’s output, an attribute of one of your data points or, more simply, a column in your dataset. The rule of 10 aims to compensate for the variability that those combined parameters bring to the model’s input.” (Telus International, 2021)
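As a back-of-the-envelope illustration (not part of the Telus International article), the sketch below applies the rule of 10 to a hypothetical tabular dataset, counting degrees of freedom simply as columns, as in the quote:

```python
import pandas as pd

def rule_of_ten(df: pd.DataFrame) -> int:
    """Rough minimum number of rows suggested by the 'Rule of 10':
    ten examples per degree of freedom, counted here as one per column."""
    return 10 * df.shape[1]

# Hypothetical example: a 40-column customer table would call for
# roughly 400 rows before even a simple model is worth training.
customers = pd.DataFrame(columns=[f"feature_{i}" for i in range(40)])
print(rule_of_ten(customers))  # 400
```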

With large enough datasets, the effects of both bias and variance are minimized, and very different machine learning algorithms end up performing virtually the same. This is a key reason why companies like Google, Facebook, Amazon, and Twitter are dominant in artificial intelligence research.

But what if your organization isn’t a tech giant with access to fine-grained data from a user base of millions? What if your datasets are “small”? Do machine learning and artificial intelligence solutions still have a role to play in your analytics strategy? The answer is yes, but with a little help. The following techniques can be used to work around the small-data challenge.

  • Synthetic data generation: synthetic data tools generate artificial records that mirror the collected ‘real-world’ sample while preserving its important statistical properties. According to Gartner, 60% of the data used for the development of artificial intelligence and analytics projects will be synthetically generated by 2024 (White, 2021). A simple sketch of the idea appears after this list.
  • Low-shot learning/Few-shot learning: a technique that enables a model to make predictions from only a small number of training examples. In practice, a machine learning model may be given thousands of simple inspection tasks, each with only a handful of training examples, which forces it to learn which patterns matter most when there is little data to draw from. See the second sketch after this list.
  • Transfer learning: a technique that supplements small datasets by reusing knowledge gained while solving a related (but different) problem for which ample data is available. That knowledge is then applied to the small-dataset problem. For example, knowledge gained while learning to identify trees could be applied when learning to identify shrubs. See the third sketch after this list.
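First, a deliberately simple sketch of synthetic data generation (not any particular vendor's tool): it fits a multivariate normal distribution to a small ‘real-world’ sample and draws new rows from it, so the synthetic data reproduces the sample's means and covariance. The feature values here are invented for illustration; production-grade generators are far more sophisticated, but the goal of preserving statistical properties is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical small 'real-world' sample: 50 rows, 3 numeric features.
real = rng.normal(loc=[10.0, 0.5, 200.0], scale=[2.0, 0.1, 30.0], size=(50, 3))

# Fit a simple generative model: the sample mean vector and covariance matrix.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw 5,000 synthetic rows that share those statistical properties.
synthetic = rng.multivariate_normal(mean, cov, size=5000)

print("real means:     ", np.round(mean, 2))
print("synthetic means:", np.round(synthetic.mean(axis=0), 2))
```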
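Second, a minimal sketch of the few-shot idea using nearest-prototype classification (the distance-based core of prototypical networks, without the learned embedding). The two inspection classes and their five examples each are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical support set: only 5 labelled examples per class.
support = {
    "defective": rng.normal(loc=[1.0, 1.0], scale=0.2, size=(5, 2)),
    "ok":        rng.normal(loc=[-1.0, -1.0], scale=0.2, size=(5, 2)),
}

# Each class is represented by a single prototype: the mean of its few examples.
prototypes = {label: pts.mean(axis=0) for label, pts in support.items()}

def classify(x: np.ndarray) -> str:
    """Assign x to the class whose prototype is nearest (Euclidean distance)."""
    return min(prototypes, key=lambda label: np.linalg.norm(x - prototypes[label]))

print(classify(np.array([0.9, 1.1])))    # expected: 'defective'
print(classify(np.array([-0.8, -1.2])))  # expected: 'ok'
```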
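Third, a common transfer-learning pattern sketched with Keras (assuming TensorFlow is installed and pretrained weights can be downloaded): a network pretrained on ImageNet, a large and related problem, is frozen and reused as a feature extractor, and only a small new head is trained on your small dataset, for example shrub images as in the trees-and-shrubs analogy above. The `small_train_ds` variable is a placeholder for your own data.

```python
from tensorflow import keras

# Reuse a network pretrained on ImageNet (a large, related problem)
# as a frozen feature extractor.
base = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3),
    include_top=False,   # drop the original ImageNet classifier
    weights="imagenet",
    pooling="avg",
)
base.trainable = False

# Only this small head is trained on the small dataset (e.g. shrub vs. not-shrub).
model = keras.Sequential([
    base,
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# small_train_ds is a placeholder for your own (small) labelled dataset.
# model.fit(small_train_ds, epochs=5)
```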

Tags: data center, artificial intelligence, machine learning, small datasets
Columbus, Louis. 76% Of Enterprises Prioritize AI & Machine Learning In 2021 IT Budgets. (2021). Retrieved from https://www.forbes.com/sites/louiscolumbus/2021/01/17/76-of-enterprises-prioritize-ai–machine-learning-in-2021-it-budgets/?sh=53ca5af2618a

Telus International. How much AI training data do you need? (2021). Retrieved from https://www.telusinternational.com/articles/how-much-ai-training-data-do-you-need

White, Andrew. By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. (2021). Retrieved from https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/