Do You Really Have Big Data?
How many times have you - or your data scientists - lamented the fact that the team didn't have enough data to properly train a model? How did the team handle the problem? Train the model anyway? Abandon the project? Muddle through some other way?
Did the team consider changing to a method that worked better with small data?
Small data is becoming valued again. For those who are not training self-driving cars or chess-playing robots, small data has potential. It is cheaper to store, easier to move, analyze and understand, and it is more personal.
What is small data?
Before researching this article, I erroneously assumed that small data was simply data small enough to analyze in Excel. But after some searching, I found that the term has acquired a more specific meaning:
From https://www.dataversity.net/big-data-small-data (2016): Small Data can be defined as small datasets that are capable of impacting decisions in the present. Anything that is currently ongoing and whose data can be accumulated in an Excel file. Small Data is also helpful in making decisions, but does not aim to impact business to a great extent, rather for a short span of time.
From https://cmr.berkeley.edu/2019/11/small-data (2019): Small data is concerned with identifying causations in data that are small and logical enough to be understood in the context of a given business and can be analyzed for insights that lead to better decisions. Relevant small data can typically be identified through the analysis of business processes, both internal and external, as well as through the analysis of key resources to the business.
From https://www.toptal.com/finance/data-analysis-consultants/big-data-vs-small-data: Small data, on the other hand, is a subclass of data deemed modest enough so as to make it accessible, informative, and actionable by people, without the need for overly complex analytical tools. Best reduced by ex-McKinsey consultant Allen Bonde, “Big data is about machines, while small data is about people”—specifically, meaningful insights organized and packaged for the derivation of causations, patterns, and the reasons “why” about people.
One interpretation of this may be that small data could be thought of as more useful at the tactical level, while big data is more useful at the strategic level.
A few blog posts ago, I wrote about the importance of having a purpose for acquiring something (Better Gardening Through User Stories, August 2020). I see this as a classic example: in order to understand whether or not you need big data, maybe you need to determine if your analytic requirement is driven more by tactical or strategic objectives.
Does anyone really value small data?
My search uncovered that yes, small data is gaining popularity:
From https://cmr.berkeley.edu/2019/11/small-data (2019): However, many executives are still left bewildered, thinking “How may I benefit from big data?” The answer is: “You probably won’t! Simply because big data is irrelevant for most companies worldwide.” Does this mean that executives should ignore everything there is to know about big data and the advantages of digitalization and business analytics? By no means. However, executives should be looking for data-strategies that are relevant to their companies and can help them improve the performance of their business model - and in some instances even open up the potential for new business models. We term these data-strategies “small data.”
From https://towardsdatascience.com/why-small-data-is-the-future-of-ai-cb7d705b7f0a (2018): This might be a distressing fact for the artificial intelligence community. For many, if not most, professional jobs there are no available big data sets in the community. Gathering a big dataset to represent that task may be prohibitively expensive. I believe that big-data is nearing the peak of its hype. As more and more companies reach maturity in their collection and usage of big datasets, they will begin to ask, “What’s Next?” I believe more and more companies will be looking towards automation using small datasets as the next phase of their data strategy.
"But I can't analyze the data I already have, there's too much of it and it's moving too fast."
OK, but is it really big data, or just a fast-moving volume of data that requires better methods to handle? Here's a question: how do you handle all of the email you receive each day? As you have progressed in your career and life, the volume, variety, and velocity of data have undoubtedly all increased, and you may find it difficult to keep up. Did you build and train a machine learning model in Python, or did you just find better ways of managing it?
You can build a message classification model in Excel, using small data.
So, let's say you want to analyze Twitter data using Natural Language Processing. You have a specific way you'd like to classify tweets for your immediate problem, and you need a model now; you may never have this particular need again. The data are already in Excel, and you don't have a model that meets your need. Well, you can preprocess and tokenize the data, derive conditional probabilities of tokens, and test and run a model in Excel using functions such as LOWER, SUBSTITUTE, VLOOKUP, MID, and FIND, plus pivot tables and some forethought about the data itself.
How much data can you do this with? If you were to try this with Twitter data, with each tweet at 50 words or fewer, you could use this process to build a model from 21,000 tweets, then use it to classify as many tweets as an Excel worksheet holds (and your computing power allows). The process is extremely transparent, can be tweaked easily, and meets a specific need you have right now. You will hear from others that 'you can't do data science in Excel'; I implore you to just try it.
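The Excel workflow described above (lowercase, strip punctuation, tokenize, derive conditional token probabilities) is essentially a naive Bayes classifier, and it translates almost line-for-line into a few dozen lines of Python. Here is a minimal sketch using only the standard library; the example tweets and labels are made up for illustration:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Mirror the Excel steps: lowercase, strip punctuation, split on whitespace
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def train(examples):
    # examples: list of (text, label) pairs, like two worksheet columns.
    # Count labels (for priors) and tokens per label (for likelihoods).
    label_counts = Counter(label for _, label in examples)
    token_counts = defaultdict(Counter)
    for text, label in examples:
        token_counts[label].update(tokenize(text))
    return label_counts, token_counts

def classify(text, label_counts, token_counts):
    total = sum(label_counts.values())
    vocab = {tok for counts in token_counts.values() for tok in counts}
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        # Log prior plus log likelihoods with add-one (Laplace) smoothing,
        # so unseen tokens don't zero out a class
        score = math.log(count / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled tweets, standing in for the Excel worksheet
data = [
    ("great service fast shipping", "positive"),
    ("love this product", "positive"),
    ("terrible support never again", "negative"),
    ("broken on arrival very disappointed", "negative"),
]
lc, tc = train(data)
print(classify("fast shipping love it", lc, tc))  # -> positive
```

The point stands either way: whether in Excel or in a short script, a transparent model on small data can meet an immediate need without any big-data machinery.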
No, Excel isn't the only place that small data is useful.
A while ago I shared an article on LinkedIn about classifiers for small data (https://www.data-cowboys.com/blog/which-machine-learning-classifiers-are-best-for-small-datasets). The article describes the author's structured experiment on 108 small datasets comparing linear SVM, Logistic Regression, Random Forest, AutoGluon, and LightGBM. While the best approach varied by dataset, the relevant point for this article is that the author found most of these methods worked very well on small datasets.
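Running that kind of comparison yourself is straightforward. Below is a hedged sketch using scikit-learn (my choice, not necessarily the linked article's setup), comparing three of the classifiers mentioned on a deliberately truncated dataset; AutoGluon and LightGBM are omitted to keep the example dependency-light. Cross-validation matters more than usual here, because a single train/test split on a small dataset is noisy:

```python
# Compare a few classifiers on a small dataset via 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X, y = X[:150], y[:150]  # keep it deliberately small

models = {
    "linear SVM": make_pipeline(StandardScaler(), LinearSVC(max_iter=5000)),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean cross-validated accuracy per model
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, acc in results.items():
    print(f"{name}: mean accuracy {acc:.3f}")
```

Swapping in your own small dataset is a matter of replacing the `X, y` lines; the scaffolding stays the same.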
Determine what you really have, and need.
The takeaway from this article is that you should determine what you really have in terms of dataset size, and what you really need in terms of analytic capability. Do you really have big data? Do you think you'll ever have big data? Do you really think big data exists for the specific type of problem you're trying to solve? If you honestly answer these questions and just can't see a future for big data in your organization, then you need to rethink your data analytics capability.
Reach out to Cybele Data Advisory for help with your small and big data problems!