15 Sep

Big claims are made about the future of artificial intelligence and machine learning – but your datasets are small. What does this mean for you?

Elizabeth Thornley Data analysis, Human-Computer Interaction, Machine Learning

In 2021, 76% of sampled enterprises prioritized artificial intelligence (AI) and machine learning (ML) over other IT initiatives, according to a study by Algorithmia (Columbus, 2021). Yet despite increased spending on AI and ML projects across industries, many organizations with small datasets are left wondering how they can leverage their existing data to implement these solutions.

For analytics and prediction, machine learning solutions require high-quality, large datasets to train algorithmic models. Here, ‘high-quality’ and ‘large’ refer to data that is relevant, diverse and representative for your project, and that has undergone adequate preparation (formatting, cleaning, feature extraction). When considering the size of dataset needed to train even a relatively simple model, Telus International describes the ‘Rule of 10’:

“One common and much debated rule of thumb is that a model will often need ten times more data than it has degrees of freedom. A degree of freedom can be a parameter which affects the model’s output, an attribute of one of your data points or, more simply, a column in your dataset. The rule of 10 aims to compensate for the variability that those combined parameters bring to the model’s input.” (Telus International, 2021)
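To put that heuristic in concrete (and purely illustrative) terms: a tabular model whose input has 15 columns would, by this rule, want on the order of 150 labelled rows, and one with 100 features roughly 1,000. Even those figures are only a starting point for relatively simple models.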

With sufficiently large datasets, the effects of both bias and variance on a model are minimized, allowing very different machine learning algorithms to perform virtually the same. This is a key reason why companies like Google, Facebook, Amazon, and Twitter are dominant in artificial intelligence research.

But what if your organization isn’t a tech giant with access to granular data from a user base of millions? What if your datasets are “small”? Do machine learning and artificial intelligence solutions still have a role to play in your analytics strategy? The answer is yes, with a little help. The following techniques can be used to work around the small-data challenge.

  • Synthetic data generation: synthetic data tools generate artificial records that match the collected ‘real-world’ sample data, while ensuring that the sample data’s important statistical properties are reflected in the synthetic data (a minimal sketch follows this list). According to Gartner, 60% of the data used for the development of artificial intelligence and analytics projects will be synthetically generated by 2024 (White, 2021).
  • Low-shot/few-shot learning: a technique that enables a model to make predictions from only a small number of training examples. In practice, a machine learning model may be given thousands of simple inspection tasks, each of which has only a small number of training examples; training across many such tasks teaches the model to spot the most critical patterns even though each individual task offers only a small dataset to draw from (see the prototype-classifier sketch after this list).
  • Transfer learning: a technique that allows small datasets to be supplemented by reusing knowledge gained while solving a related (but different) problem for which ample data is available. This knowledge is then applied to the model trained on the small dataset. For example, knowledge gained while learning to identify trees could be applied when seeking to identify shrubs (a fine-tuning sketch follows below).
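To make the synthetic data idea concrete, here is a minimal sketch in Python, assuming NumPy is available. It fits a multivariate normal distribution to a small ‘real’ numeric table and samples new rows that preserve its column means and covariance structure. The column meanings and sample sizes are purely hypothetical; dedicated synthetic data tools handle mixed data types, constraints and privacy far more carefully.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" sample: 50 rows x 3 numeric columns
# (stand-ins for your own collected data).
real = rng.normal(loc=[40.0, 55000.0, 6.0], scale=[10.0, 12000.0, 3.0], size=(50, 3))

# Estimate the statistical properties we want the synthetic data to preserve.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw 500 synthetic rows from the fitted distribution.
synthetic = rng.multivariate_normal(mean, cov, size=500)

# Sanity check: synthetic column means should roughly match the real ones.
print(np.round(mean, 1))
print(np.round(synthetic.mean(axis=0), 1))
```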
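Next, a minimal sketch of one common few-shot approach, a nearest-prototype classifier: each class is summarized by the mean of its handful of example embeddings, and new items are assigned to the closest prototype. The embeddings below are random placeholders; in a real pipeline they would come from a pretrained encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_prototypes(support_x, support_y):
    """Average the few labelled embeddings per class into one prototype each."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query_x, classes, protos):
    """Assign each query embedding to the class of its nearest prototype."""
    dists = np.linalg.norm(query_x[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Hypothetical 5-way, 3-shot task: 5 classes, 3 labelled examples each,
# 64-dimensional embeddings (random placeholders here).
support_x = rng.normal(size=(15, 64))
support_y = np.repeat(np.arange(5), 3)
classes, protos = build_prototypes(support_x, support_y)

query_x = rng.normal(size=(10, 64))
print(classify(query_x, classes, protos))
```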
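Finally, a sketch of the most common transfer-learning recipe, assuming PyTorch and torchvision are available: load an ImageNet-pretrained ResNet-18, freeze its backbone, and replace only the final layer so that just the new head is fine-tuned on your small, domain-specific dataset (the trees-and-shrubs example above, say). The class count, learning rate and weights argument are assumptions about your setup and library version.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # hypothetical: e.g. four shrub species in your small dataset

# Load a ResNet-18 pretrained on ImageNet
# (older torchvision versions use pretrained=True instead of weights=...).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the backbone so the knowledge learned on the large dataset is kept.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer; only this new head will be trained.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 8 images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```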

Columbus, L. (2021). 76% of enterprises prioritize AI & machine learning in 2021 IT budgets. Forbes. Retrieved from https://www.forbes.com/sites/louiscolumbus/2021/01/17/76-of-enterprises-prioritize-ai-machine-learning-in-2021-it-budgets/?sh=53ca5af2618a

Telus International. (2021). How much AI training data do you need? Retrieved from https://www.telusinternational.com/articles/how-much-ai-training-data-do-you-need

White, A. (2021). By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. Gartner Blog Network. Retrieved from https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/

27 Aug

Quantity or quality of participation? You don’t have to choose.

Marcelo Bursztein Public engagement, Usability, User experience

What should you do: require that users register to participate, or allow anonymous submissions? That is the question. Traditional online community applications ask users to log in or register upfront in order to participate. As we’ve seen, asking users to register before they interact deters some of them from participating, so you end up with fewer submissions. But you don’t have to go all the way to allowing fully anonymous submissions. There’s a better way to do this.

06 May

Ongoing community engagement fosters trust and support for change

Marcelo Bursztein Online engagement, Open dialogue, Public engagement, Transportation

As more and more people gain access to the Internet, the connected city of the future will inevitably incorporate citizens’ input into everyday decisions. Organizations must therefore encourage, promote, support and participate in active dialogue with the communities they serve.

05 Apr

Online civic engagement and the digital divide

Marcelo Bursztein Accessibility, Blog, Digital Divide, Online engagement, Public engagement

Governments are increasingly keen to listen to citizens through consultations and online civic engagement. A small set of data points from the Pew Research Center reminds us that, however keen we are to include everyone, the digital divide is still very much alive.