How Machine Learning Improves Data Cleansing

by Olivier Audino | Mar 22, 2022 | Spend analytics


One of the many Holy Grails of machine learning within the spend analysis domain is the ability to disambiguate and classify customer purchases accurately, quickly, and automatically. It’s a fun problem to try to tackle since it’s approachable from many different angles.

At a very simple level, one could iterate through all item purchases and try to categorize each one by comparing the item name against the names of the available categories. For example, a “spoon” could be mapped to any of the following UNSPSC categories, all of which have the word “spoon” in them.

41123402 – Dosing Spoon
42181512 – Typhoid Carrier Examination Spoons
42294000 – Surgical Spatulas and Spoons and Scoops and Related Products
42294003 – Surgical Spoons
42294519 – Ophthalmic Spoons or Curettes
52151617 – Domestic Wooden Spoon
52151651 – Domestic Measuring Spoon
52151704 – Domestic Spoons

If an automated system were to use this scheme, which category of “spoon” would it select? Hopefully there would be some context that could provide hints, such as the word “kitchen” in the item description or the supplier from which the spoon was purchased, such as “Staples”, but that’s an additional layer of complexity that one would have to account for (think lots of $$$).
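The ambiguity described above can be sketched in a few lines of Python. This is only an illustration of naive name matching, not Sourcing Force's actual method: the category titles are the ones listed above, while the `tokens` and `candidates` helpers and the crude plural stripping are assumptions made for the example.

```python
import re

# UNSPSC codes and titles exactly as listed in the article.
UNSPSC_SPOONS = {
    "41123402": "Dosing Spoon",
    "42181512": "Typhoid Carrier Examination Spoons",
    "42294000": "Surgical Spatulas and Spoons and Scoops and Related Products",
    "42294003": "Surgical Spoons",
    "42294519": "Ophthalmic Spoons or Curettes",
    "52151617": "Domestic Wooden Spoon",
    "52151651": "Domestic Measuring Spoon",
    "52151704": "Domestic Spoons",
}


def tokens(text):
    """Lowercase the text, split on non-letters, and crudely strip a plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}


def candidates(description, categories=UNSPSC_SPOONS):
    """Return (code, shared_token_count) pairs, best match first (hypothetical helper)."""
    item = tokens(description)
    scored = []
    for code, title in categories.items():
        overlap = len(item & tokens(title))
        if overlap:
            scored.append((code, overlap))
    # Highest overlap first; ties broken by code so the order is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ambiguous = candidates("spoon")                 # every category ties on one shared word
with_hint = candidates("wooden spoon, kitchen")  # "wooden" breaks the tie
```

With only the word “spoon” to go on, all eight categories tie. One extra word such as “wooden” is enough to rank a single category first, which is why context in the description matters so much.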

Using Machine Learning to Classify

Sourcing Force is fortunate enough to have been in the business long enough to have developed a significant edge. Quite simply, we’ve classified a ton of items using custom-created classification rules.

When researchers toy with machine learning algorithms such as Neural Networks (NN), Naive Bayes Classifiers (NBC), Hidden Markov Models (HMM) for Word Sense Disambiguation, etc., they frequently run into a huge roadblock: to apply these algorithms effectively, one needs training data to teach and tune them.

Training data for some domains can be purchased, while other training data needs to be painfully constructed by the researchers (or probably grad students). In other words, it’s not easy to come by.

Our hard-working Analysts have, to date, written hundreds of thousands of distinct classification rules that map item descriptions to category codes, and we’ve used those rules to classify items for a lot of companies.

These rules allow us to do a great job classifying items for our clients, but they are also an undeniable treasure trove of implicit semantic knowledge that can be used for algorithm training.

A great example of the “implicit” semantics I refer to above is the problem of “how does one classify Tylenol?” There is no UNSPSC code for Tylenol, but there is one for Tylenol’s generic name: acetaminophen.

The code is 51142001.

Fortunately, I knew that important detail and could write a classification rule from it. Consider this: an algorithm trained on these Sourcing Force classification rules has just learned the mapping of Tylenol to 51142001 for free.

Once upon a time, I wrote a classification rule for a company, then turned around and used that rule to train an algorithm.

Now that rule, to a degree, can help me classify forever.
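To make the idea concrete, here is a minimal sketch of training a classifier on rule-derived examples. It uses a small hand-written Naive Bayes model with add-one smoothing, chosen only because Naive Bayes is one of the algorithms mentioned earlier; it is not a description of Sourcing Force's software. The UNSPSC codes come from this article, but the item descriptions in `TRAINING` are invented for illustration, and `tokenize` and `NaiveBayes` are hypothetical names.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical (description, UNSPSC code) pairs, as if produced by Analyst rules.
# Codes are from the article; the descriptions are made up for this example.
TRAINING = [
    ("tylenol extra strength caplets 500mg", "51142001"),
    ("acetaminophen tablets 325mg", "51142001"),
    ("tylenol pm", "51142001"),
    ("stainless steel kitchen spoon", "52151704"),
    ("soup spoon set of 12", "52151704"),
    ("surgical spoon sterile", "42294003"),
]


def tokenize(text):
    """Split a description into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # examples seen per code
        self.word_counts = defaultdict(Counter)  # word frequencies per code
        self.vocab = set()

    def fit(self, pairs):
        for text, label in pairs:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TRAINING)
tylenol_code = model.predict("tylenol 500mg bottle")
```

The model never sees the word “acetaminophen” in the query, yet it still lands on 51142001 because the rule-derived examples tied “tylenol” to that code. That is the sense in which a rule written once keeps paying off.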

Figuring out how to classify some items can be quite a nasty puzzle for a human, especially when it comes to chemicals, so having an Analyst work out the mapping for an obscure item becomes, in a sense, “a gift that keeps on giving.”

As an added benefit, the more obscure the item is, the more accurate algorithmic predictions tend to be.

The reason is that unusual purchases appear in a rather limited range of contexts, so there won’t be much “noise” in the data to confuse an automated system.
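One way to picture this “noise” is to measure how many different categories a word shows up in. The sketch below computes the Shannon entropy of the category distribution for each token; this is an illustrative measure chosen for the example, not something the article's system is stated to use. The purchase descriptions are invented, the codes are taken from this article, and `category_entropy` is a hypothetical helper.

```python
import math
from collections import Counter, defaultdict

# Hypothetical classified purchases; codes are from the article,
# descriptions are invented for illustration.
PURCHASES = [
    ("acetaminophen tablets", "51142001"),
    ("acetaminophen caplets", "51142001"),
    ("wooden spoon", "52151617"),
    ("measuring spoon", "52151651"),
    ("surgical spoon", "42294003"),
    ("dosing spoon", "41123402"),
]


def category_entropy(pairs):
    """Map each token to the entropy (in bits) of the categories it appears with."""
    by_token = defaultdict(Counter)
    for description, code in pairs:
        for token in set(description.lower().split()):
            by_token[token][code] += 1
    entropy = {}
    for token, counts in by_token.items():
        total = sum(counts.values())
        # Sum of p * log2(1/p) over the categories this token was seen with.
        entropy[token] = sum((n / total) * math.log2(total / n) for n in counts.values())
    return entropy


noise = category_entropy(PURCHASES)
```

In this toy data, “acetaminophen” only ever appears with one code, so its entropy is 0 bits. “spoon” is spread evenly over four codes, giving 2 bits. A low-entropy word points to a single category, which is the intuition behind obscure items being easier to predict.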

One last point is needed to come full circle within the machine learning domain of spend analysis: human beings are still the masters here. Algorithmic approaches to spend analysis, albeit cool, cannot match the pattern recognition capabilities wired into the human brain, especially a Sourcing Force Analyst’s brain.

Machine learning approaches so far can only mirror what they’ve learned and repeat back the answers that have the highest probability of being correct within the limited context they know. The large number of rules that Sourcing Force has to play with broadens a machine’s perception of reality and gives it a rich context to learn from.

Even though I personally have been the one trying to mature software to do automatic classification, I must give credit where credit is due.

The “parents” of our little electronic child are the Sourcing Force Analysts, none of whom were harmed during the training of any algorithms.
