Big claims are made about the future of artificial intelligence and machine learning – but your datasets are small. What does this mean for you?
In 2021, 76% of sampled enterprises prioritized artificial intelligence (AI) and machine learning (ML) over other IT initiatives, according to a study by Algorithmia (Columbus, 2021). Despite increased spending on AI and ML projects across industries, many organizations with small datasets are left wondering how they can leverage their existing data to implement these solutions.
For the purpose of analytics and prediction, machine learning solutions require large, high-quality datasets to train algorithmic models. Here, ‘high-quality’ and ‘large’ refer to data that is essential, diverse and representative of your project, and that has undergone adequate feature transformation (formatting, cleaning, feature extraction). When considering the size of dataset needed to train even a relatively simple model, Telus International discusses the ‘Rule of 10’:
“One common and much debated rule of thumb is that a model will often need ten times more data than it has degrees of freedom. A degree of freedom can be a parameter which affects the model’s output, an attribute of one of your data points or, more simply, a column in your dataset. The rule of 10 aims to compensate for the variability that those combined parameters bring to the model’s input.” (Telus International, 2021)
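The arithmetic behind the Rule of 10 is straightforward; the sketch below applies it under the simplifying assumption that each column of the dataset counts as one degree of freedom (the function name and multiplier are illustrative, not from the article):

```python
def rule_of_ten(num_columns: int, multiplier: int = 10) -> int:
    """Rough floor on training examples: ten rows per degree of freedom,
    treating each dataset column as one degree of freedom."""
    return num_columns * multiplier

# e.g. a dataset with 12 feature columns suggests at least 120 rows
print(rule_of_ten(12))  # 120
```

As the quote notes, this is a much-debated rule of thumb, not a guarantee; more complex models can need far more data per parameter.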
With datasets that are large enough, the effects of both bias and variance are minimized, allowing very different machine learning algorithms to perform virtually identically. This is a key driver of why companies like Google, Facebook, Amazon and Twitter are dominant in artificial intelligence research.
But what if your organization isn’t a tech giant with access to granular data from a user base of millions? What if your datasets are “small”? Do machine learning and artificial intelligence solutions still have a role to play in your analytics strategy? The answer is yes, but with a little help. The following techniques can be used to work around the small-data challenge.
- Synthetic data generation: synthetic data tools generate artificial records that mirror a collected ‘real-world’ sample, while ensuring that the sample’s important statistical properties are reflected in the synthetic data. According to Gartner, 60% of the data used for the development of artificial intelligence and analytics projects will be synthetically generated by 2024 (White, 2021).
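A minimal sketch of the idea, assuming the real sample is a single numeric variable that is roughly Gaussian (real generators model far richer structure, including correlations between columns): fit the sample’s summary statistics, then draw new records from a distribution with those same statistics.

```python
import random
import statistics

random.seed(42)

# A small 'real-world' sample (hypothetical sensor readings).
real_sample = [random.gauss(50.0, 5.0) for _ in range(200)]

# Fit the sample's key statistics...
mu = statistics.mean(real_sample)
sigma = statistics.stdev(real_sample)

# ...then generate as many synthetic records as needed from a
# distribution sharing those statistics.
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

print(statistics.mean(synthetic), statistics.stdev(synthetic))
```

The synthetic set can be arbitrarily large while still reflecting the sample’s mean and spread, which is the property the bullet above describes.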
- Low-shot learning/few-shot learning: low-shot/few-shot learning is a technique that enables a model to make predictions from only a small number of training examples. In practice, a machine learning model may be given thousands of simple inspection tasks, each of which has only a handful of training examples. This forces the model to spot the most critical patterns, since it has only a small dataset to draw from for each task.
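One simple baseline in the few-shot spirit is nearest-centroid classification: with two labeled examples per class, average each class into a centroid and assign new points to the closest one. The example below is a hedged sketch with hypothetical 2-D inspection measurements, not a production few-shot system:

```python
def centroid(points):
    """Average a list of equal-length feature vectors."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def predict(query, support):
    """support maps each class label to its few training examples;
    return the label whose centroid is closest to the query."""
    centroids = {label: centroid(pts) for label, pts in support.items()}
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(query, centroids[label]))

# Only two labeled examples per class -- a 2-shot task.
support = {
    "pass": [[1.0, 1.1], [0.9, 1.0]],
    "fail": [[3.0, 2.9], [3.1, 3.2]],
}
print(predict([1.05, 1.0], support))  # pass
```

Modern few-shot methods (e.g. prototypical networks) apply the same centroid idea in a learned embedding space rather than on raw features.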
- Transfer learning: transfer learning is a technique that supplements a small dataset by reusing knowledge gained while solving a related (but different) problem for which ample data is available. This knowledge is then applied to the small-dataset model or problem. For example, knowledge gained while learning to identify trees could apply when seeking to identify shrubs.
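The trees-to-shrubs example can be sketched as pretraining followed by fine-tuning: train a simple logistic-regression classifier on a large related dataset, then continue training from those learned weights on the small target dataset instead of starting from scratch. Everything here (the features, datasets, and hyperparameters) is illustrative:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, w, b, epochs=200, lr=0.1):
    """Logistic regression via stochastic gradient descent.
    w and b are the starting point -- this is what makes transfer possible."""
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Large related task: 'tree vs. not-tree' from (height, leaf size),
# with plenty of data. Start from zero weights.
trees = ([([random.gauss(10, 2), random.gauss(5, 1)], 1) for _ in range(300)] +
         [([random.gauss(2, 1), random.gauss(1, 0.5)], 0) for _ in range(300)])
w, b = train(trees, [0.0, 0.0], 0.0, epochs=20)

# Small target task: 'shrub vs. not-shrub' shares the same feature space
# but has only four examples. Fine-tune from the pretrained w, b.
shrubs = [([3.0, 2.0], 1), ([0.5, 0.3], 0), ([2.8, 1.8], 1), ([0.7, 0.4], 0)]
w, b = train(shrubs, w, b)
```

The pretrained weights encode what the features mean for woody plants in general; the fine-tuning pass only has to adjust them for shrubs, which is a much easier job for four examples than learning from scratch.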
Columbus, Louis. (2021). 76% Of Enterprises Prioritize AI & Machine Learning In 2021 IT Budgets. Retrieved from https://www.forbes.com/sites/louiscolumbus/2021/01/17/76-of-enterprises-prioritize-ai–machine-learning-in-2021-it-budgets/?sh=53ca5af2618a
Telus International. (2021). How much AI training data do you need? Retrieved from https://www.telusinternational.com/articles/how-much-ai-training-data-do-you-need
White, Andrew. (2021). By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. Retrieved from https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/
What should you do: require that users register to participate, or allow anonymous submissions? That is the question. Traditional online community applications ask users to log in or register upfront in order to participate. As we’ve seen, asking users to register before they interact deters some of them from participating, so you end up with fewer submissions. But you don’t have to go all the way to allowing anonymous submissions. There’s a better way to do this.
Natural Resources Canada used 76engage to power its online consultations in support of the Canadian Minerals and Metals Plan. Read more to see an infographic of the early feedback received through the platform to date.
As more and more people have access to the Internet, the connected city of the future will unequivocally incorporate citizens’ input in everyday decisions. Therefore, organizations must encourage, promote, support and participate in active dialogue with the communities they serve.
Participants must feel at ease knowing that organizations will respect their privacy and protect their personal information. When trust deteriorates, participants will simply stop providing input.
Governments are increasingly keen and open to listening to citizens through consultations and online civic engagement. A small set of data points from the Pew Research Center reminds us that while we are keen to include, the digital divide is still very much alive.
A recent IAP2 event got me thinking about why more people don’t participate in public engagement opportunities. One of the topics discussed was what could be done to get more people involved. The assumption was that even though some people do participate, more could be done to make engagements more inclusive.
We’ve all been there. We see a flyer on a street post inviting us to attend a town hall, only to realize that it happened last week. Or we made it to the event, but it was so crowded that, stuck at the back of the room, we shy people didn’t feel confident enough to participate. All of this goes away when public engagement moves online.
The web is a great place for public engagement. But analyzing the growing amounts of data resulting from the increases in participation is discouraging some practitioners. Here’s how they can fight back.