The amount of data produced by humanity is growing exponentially, with no sign of stopping. It’s estimated that in just the first month of 2023 more data will be produced than in the entirety of 2013. With this inexorable rise come incredible new opportunities for innovation, but also the inevitable problem of information overload and the challenge of organising this information in a way that provides the most benefit to humanity.
There is clearly great business value in well-organised data. In executing its mission statement to “Organise the world's information and make it universally accessible and useful”, Google has become one of a handful of companies to achieve a market cap of over one trillion dollars. Indeed, the vast amounts of data a business generates are almost worthless without some means of organising and categorising it.
Imagine going into a library only to discover that all of the books had been taken off the shelves and piled up in an enormous heap in the middle of the floor. The vast body of knowledge towering in front of you would be completely useless without the books being categorised on the shelves in the familiar Dewey Decimal system.
This is why categorisation is essential for businesses to unlock the value in the data they own. Put simply: once the amount of data you hold is more than you can browse through, it becomes worthless without the means to categorise it.
Before we can start organising data into categories, it’s essential that we decide what those categories are. Going back to our jumbled library, it’s no use if I call a Hercule Poirot story a “Detective novel” but you call it a “Murder Mystery”. That’s where taxonomies come in. It might sound like a technical word, but a taxonomy is simply a tree structure in which everything has its place.
There are many types of taxonomies we are all familiar with, from the aforementioned Dewey Decimal System for books to Darwin’s Tree of Life, which breaks animals and plants down into their species.
There are many standardised taxonomies for different types of data which will allow you to freely exchange information and collaborate with other companies, but the most important thing is that the categories you use are consistent and meaningful to you.
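To make the idea of a taxonomy as a tree concrete, here is a minimal sketch in Python. The `Category` class, its methods and the library categories are all invented for illustration and are not taken from any standard taxonomy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """One node in a taxonomy tree (hypothetical helper class)."""
    name: str
    children: list[Category] = field(default_factory=list)

    def add(self, name: str) -> Category:
        """Create a child category and return it."""
        child = Category(name)
        self.children.append(child)
        return child

    def path_to(self, target: str) -> list[str] | None:
        """Return the names from this node down to `target`, or None if absent."""
        if self.name == target:
            return [self.name]
        for child in self.children:
            sub_path = child.path_to(target)
            if sub_path is not None:
                return [self.name] + sub_path
        return None


# Hypothetical library taxonomy, invented for this example.
library = Category("Library")
fiction = library.add("Fiction")
crime = fiction.add("Crime")
crime.add("Detective novel")
library.add("Non-fiction").add("History")

print(" > ".join(library.path_to("Detective novel")))
# Library > Fiction > Crime > Detective novel
```

Because every category has exactly one place in the tree, two people looking up the same item arrive at the same path, which is the consistency the taxonomy is there to provide.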
Once data has been categorised it unlocks a vast array of opportunities for us to use that data:
This is perhaps the most obvious. We find it fast and easy to find our way around a library or a book shop, or navigate through a hierarchically structured website like Wikipedia.
Once we have data arranged in a taxonomy, it becomes easy to ask questions of it, for example:
This is where taxonomies become really powerful. Categorised information can be used to drive automated behaviours:
As well as driving automation, categorisation can drive calculations:
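One calculation of this kind is totalling a figure per category and rolling it up through the taxonomy. The sketch below assumes a small, made-up spending taxonomy stored as child-to-parent links; the category names, amounts and the `roll_up` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical taxonomy expressed as child -> parent links (None marks the root).
PARENT = {
    "All spend": None,
    "Travel": "All spend",
    "Flights": "Travel",
    "Hotels": "Travel",
    "Office": "All spend",
    "Stationery": "Office",
}

# Hypothetical categorised records: (category, amount).
records = [
    ("Flights", 420.0),
    ("Hotels", 310.0),
    ("Stationery", 25.5),
    ("Flights", 180.0),
]


def roll_up(records, parent):
    """Add each record's amount to its own category and to every ancestor."""
    totals = defaultdict(float)
    for category, amount in records:
        node = category
        while node is not None:
            totals[node] += amount
            node = parent[node]
    return dict(totals)


totals = roll_up(records, PARENT)
print(totals["Travel"])     # 910.0
print(totals["All spend"])  # 935.5
```

The same walk up the tree answers questions at any level of detail, from a single leaf category to the whole data set.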
There are many ways to categorise data, from the straightforward to those using the most cutting-edge machine learning techniques.
The simplest way to categorise is manually. Humans are amazing at understanding and categorising data.
Manual categorisation works well when the items you need to categorise number in the hundreds or thousands, but once you are dealing with millions of items it can become cost-prohibitive unless the work is of very high value.
Even though manual categorisation is the least sophisticated approach, it still has a number of important uses in even the most sophisticated systems:
If you want to categorise large volumes of data, a rule-based system can make the process much faster and easier to scale.
Rules can be very simple or extremely complex, for example:
Many existing tools, such as email clients, have categorisation engines built in, and businesses that need to ingest large volumes of data may use dedicated categorisation tools.
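As a rough illustration of a simple rule-based categoriser, the sketch below checks a message against an ordered list of keyword patterns and returns the first category that matches. The categories, patterns and the `categorise` function are assumptions made up for this example, not the rules of any particular product.

```python
import re

# Hypothetical rules, checked in order: (category, pattern). The first match wins.
RULES = [
    ("Invoice", re.compile(r"\binvoice\b|\bpayment due\b", re.IGNORECASE)),
    ("Travel", re.compile(r"\bflight\b|\bhotel\b|\bboarding pass\b", re.IGNORECASE)),
    ("Newsletter", re.compile(r"\bunsubscribe\b", re.IGNORECASE)),
]


def categorise(text: str, default: str = "Uncategorised") -> str:
    """Return the category of the first rule whose pattern matches the text."""
    for category, pattern in RULES:
        if pattern.search(text):
            return category
    return default


print(categorise("Your invoice for March is attached"))  # Invoice
print(categorise("Hotel booking confirmation"))          # Travel
print(categorise("Lunch on Friday?"))                    # Uncategorised
```

Keeping the rules in an ordered list makes the outcome explainable: for any item you can point to the exact rule that categorised it.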
Real-world data is often messy and incomplete, and this is where rule-based systems benefit from cleaning and validation tools. These might:
This makes it much easier for simple rules to accurately categorise imperfect, real-world data.
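A small sketch of what such a cleaning and validation step could look like is shown below. The specific steps chosen here (Unicode normalisation, lower-casing, collapsing whitespace and checking for blank required fields) are assumptions for illustration, as are the function names and the field names.

```python
import re
import unicodedata


def clean(text: str) -> str:
    """Normalise text so that simple rules see a consistent form."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = text.strip().lower()                 # trim the ends and ignore case
    return re.sub(r"\s+", " ", text)            # collapse runs of whitespace


def missing_fields(record: dict, required=("title", "author")) -> list:
    """Return the names of required fields that are absent or blank."""
    return [name for name in required if not str(record.get(name) or "").strip()]


# Hypothetical messy record.
record = {"title": "  The  Mysterious\tAffair at Styles ", "author": ""}

print(clean(record["title"]))   # the mysterious affair at styles
print(missing_fields(record))   # ['author']
```

Running records through a step like this before the rules are applied means a rule only has to match one spelling of each value rather than every variation found in the raw data.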
The inflexible nature of rules is both a blessing and a curse. A rule-based system is always consistent and its decision making can always be explained (which is more than can be said for humans in many instances), but before we can create a rule we must find a subject matter expert who can explain in unambiguous terms exactly how the categorisation should be done. In many real-world situations, such as “Is this a picture of a cat or a dog?” or “What genre of music is this?”, humans can categorise very accurately but struggle to explain how they came to a conclusion.
This is where artificial intelligence techniques come in.
There are two main subtypes of artificial intelligence categorisation: supervised and unsupervised.
Supervised learning is the most common type of machine learning categoriser. In supervised learning a categoriser is trained using a set of pre-categorised data that is known to be correct. Typically this will be done using a set of data that has been manually categorised by humans. Sometimes this will have been done specially for the purpose of training the machine learning system, but often if a manual process is being automated then you can use the body of manual work that has been done in the past as the basis for training the model.
If we go back to our example of a library, a supervised learning system could learn from the current locations of all the books in the library in order to automatically put new books onto the right shelves as they arrive.
Supervised learning is especially good for categorisation tasks, such as image recognition, where you have a pre-existing set of categorised data from which it’s hard to extrapolate rules.
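To show the shape of supervised learning in code, here is a minimal sketch of a text categoriser trained on a handful of already-shelved book titles. It uses a small multinomial naive Bayes model written with the standard library only. The class name, the titles and the shelf labels are all invented for this example, and a real system would use far more training data and most likely an established machine learning library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lower-case words."""
    return text.lower().split()


class NaiveBayesCategoriser:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of examples
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_examples = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_examples in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_examples / total_examples)  # prior for the label
            for word in tokenize(text):
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-labelled titles standing in for books already on shelves.
titles = [
    "murder on the night train",
    "the detective and the stolen diamond",
    "a history of the roman empire",
    "the empire at war a history",
]
shelves = ["Crime", "Crime", "History", "History"]

model = NaiveBayesCategoriser().fit(titles, shelves)
print(model.predict("the detective investigates a murder"))  # Crime
print(model.predict("history of the byzantine empire"))      # History
```

Nobody wrote a rule saying that “detective” means Crime. The model inferred it from the labelled examples, which is what distinguishes this approach from a rule-based system.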
Sometimes we don’t have a set of categorised data to work from, and that’s where unsupervised learning comes in. Unsupervised learning tries to find patterns and clusters in existing data without relying on explicitly labelled data. The most common example is the recommendation engine used by Amazon and other online shops: when you buy something, it uses information about what other people who bought that item also bought to suggest other things that you might wish to buy.
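The “people who bought this also bought” idea can be sketched with simple co-occurrence counting, which needs no labels at all. This is only a toy illustration of the principle, not a description of how Amazon or any other shop actually builds its recommendations. The baskets and the `recommend` function are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase history: one set of items per customer basket.
baskets = [
    {"tent", "sleeping bag", "torch"},
    {"tent", "sleeping bag"},
    {"tent", "camping stove"},
    {"novel", "bookmark"},
]

# Count how often each ordered pair of items appears in the same basket.
bought_together = defaultdict(Counter)
for basket in baskets:
    for item, other in permutations(basket, 2):
        bought_together[item][other] += 1


def recommend(item, limit=2):
    """Return the items most often bought alongside `item`."""
    return [other for other, _ in bought_together[item].most_common(limit)]


print(recommend("tent", limit=1))  # ['sleeping bag']
print(recommend("novel"))          # ['bookmark']
```

No one told the system that tents and sleeping bags belong together. The grouping emerges from patterns in the purchase data itself.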
Machine learning is an extremely powerful technology, but it has two major weaknesses: explainability and bias.
Unlike a rule-based system, it’s not always obvious why a machine learning model has made the decision that it made and, unlike a human, you can’t ask it.
This is acceptable for a system like a spam filter, where it’s enough for it to work most of the time, but it might be completely unacceptable in financial or medical applications. Some techniques can be used with machine learning to make it easier to understand why certain decisions were made, but it will never be as black and white as a rule-based system.
Bias is a subject that has become increasingly important as more of our lives are affected by the output of machine learning. It’s easy to think of machine learning as a process as coldly logical as a Star Trek Vulcan, but in reality, when a machine learning model is trained from the output of a human, it may well take on the biases of that human. In a real-world example, a CV-reviewing application developed by Amazon displayed a gender bias in its behaviour because of the biased input data it was trained on.
As we have discussed, there are huge benefits to categorising your data and many ways in which this can be done. Here at Curvestone, data categorisation is one of our key technology pillars for helping businesses innovate and grow.
If you would like to know more, please get in touch.