There are many different ways of describing ‘Data Mining’ and probably just as many understandings of what it means. At the time of writing there is no official or standardised description of what it is. Datanology carry out a number of different processes, and several of these could be described as ‘Data Mining’. The art of data management differs between datasets and end user requirements.
We use the terms Data Mining, Data Capture, Data Extraction, Data Grabbing and so on for processes that are similar and reach similar results. The end objective is to retrieve or extract specific entities of data from a document, whether that is a single paper document, millions of paper documents (all scanned), an individual digital document or millions of digital files.
Whilst the process may be very similar for similar source datasets, each one needs its own unique steps to achieve the overall objective: separating the valuable data from all the surrounding noise. The technologies that Datanology use are only part of the solution. The most important aspect is using personal experience to create an innovative process. The technology can only do the job it is told to do, and if its instructions don’t cover every angle then it won’t be able to complete the job properly. There will be bits missing, anomalies in the data extract or, worse, the wrong data.
Extracting data from paper documents will usually start with scanning using up-to-date OCR (Optical Character Recognition) techniques. Whilst these are quite accurate, we know from experience that zeros and the letter o or O, or capital i and lower case l (Il, both shown here), can be translated incorrectly. This would pose a problem in standard data mining; however, our technology allows for it and carries out additional validation and sense checks.
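As a rough illustration of the kind of sense check described above (not Datanology's actual implementation), the sketch below repairs a field that is expected to contain only digits. The mapping `OCR_DIGIT_FIXES` and the function `repair_numeric_field` are hypothetical names invented for this example, and the approach assumes you already know which fields should be numeric.

```python
from typing import Optional

# Hypothetical mapping of letters that OCR commonly confuses with digits.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def repair_numeric_field(raw: str) -> Optional[str]:
    """Replace digit look-alikes in a field that should be numeric.

    Returns the repaired string, or None if the field still contains
    non-digit characters and so needs a human to look at it.
    """
    candidate = raw.strip().translate(OCR_DIGIT_FIXES)
    return candidate if candidate.isdigit() else None


if __name__ == "__main__":
    print(repair_numeric_field("4O12l"))
    print(repair_numeric_field("40A2"))
```

The substitution is only safe because the field is known to be numeric; applying it to free text would corrupt ordinary words, which is why the check is tied to the expected type of each field.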
We also take into account the many different file formats used around the world: word processor files, PDF, XML, XLS, plain text, zipped archives or even a full webpage, to name a few. Your collection of documents may contain one, a few or maybe all of them. Converting them to a standard format is the first step of our process, before the data extraction itself is carried out.
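A minimal sketch of this conversion step, using only the Python standard library, might look like the following. The function name `to_plain_text` is an assumption made for this example, and only text, HTML, XML and zip inputs are covered; PDF, word processor and spreadsheet files would need third-party libraries and are deliberately left out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML page and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(name: str, data: bytes) -> str:
    """Convert one file (name plus raw bytes) to plain text, chosen by extension."""
    ext = name.lower().rsplit(".", 1)[-1]
    if ext == "txt":
        return data.decode("utf-8", errors="replace")
    if ext in ("html", "htm"):
        parser = _TextOnly()
        parser.feed(data.decode("utf-8", errors="replace"))
        parser.close()
        return " ".join(p.strip() for p in parser.parts if p.strip())
    if ext == "xml":
        root = ET.fromstring(data)
        return " ".join(t.strip() for t in root.itertext() if t.strip())
    if ext == "zip":
        # Convert every member of the archive and join the results.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            members = [n for n in zf.namelist() if not n.endswith("/")]
            return "\n".join(to_plain_text(n, zf.read(n)) for n in members)
    raise ValueError(f"no converter registered for .{ext}")
```

Raising an error for unknown formats, rather than skipping them silently, means an unsupported file is noticed instead of quietly missing from the extract. Note that this simple HTML handler also keeps the contents of script and style elements, which a real converter would want to drop.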
It might seem extremely simple to extract all email addresses from millions of documents using a quick script:
Process to identify email addresses in large datasets
Scan entire text from beginning to end.
Whilst scanning, when an @ sign is identified, stop scanning through.
Go backwards to find the first occurrence of a space and stop.
Remember the position.
Scan forwards again until the next occurrence of a space is determined then stop.
Extract the text from the position of the first space to the second space.
Continue scanning and do the same…
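The steps above could be sketched in Python roughly as follows. The function name `naive_email_scan` is made up for this example, and the code follows the listed steps literally: it treats whatever sits between the nearest spaces on either side of an @ sign as an email address.

```python
from typing import List


def naive_email_scan(text: str) -> List[str]:
    """Extract the space-delimited token around every @ sign in the text."""
    found = []
    pos = text.find("@")                      # scan forward to the first @
    while pos != -1:
        # Go backwards to the previous space (start of text if there is none).
        start = text.rfind(" ", 0, pos) + 1
        # Go forwards to the next space (end of text if there is none).
        end = text.find(" ", pos)
        if end == -1:
            end = len(text)
        found.append(text[start:end])         # text between the two spaces
        pos = text.find("@", end)             # continue scanning
    return found


if __name__ == "__main__":
    print(naive_email_scan("Contact jane@example.com or bob@example.org today"))
```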
This process will extract the email addresses from your text documents, and seems quite logical. What could go wrong? Well, humans aren’t always as efficient and accurate as machines, and we have our own little quirks and ways we like to do things. Whilst this process will grab correctly structured email addresses, what if the person originally stored their email address prefixed with ‘E:’ or ‘Email:’ and didn’t include a space? Maybe they put apostrophes around the email address, or even inadvertently included a space within it? Our experience tells us to expect anything when a human originally entered the data, and to create our processes accordingly. Whilst our processes carry out the necessary extraction, we are also aware that the majority of any process we create must be devoted to error detection and correction to ensure clean and accurate ‘Data Mining’.
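One way to allow for these quirks, shown here only as a sketch and not as Datanology's own process, is to match on the shape of an address rather than on the surrounding spaces, and to flag any @ sign that could not be matched so that a person can review it. The pattern `EMAIL_RE` is a deliberately simplified assumption and does not cover every address the email standards allow; `extract_emails` is a hypothetical name.

```python
import re
from typing import List, Tuple

# Simplified address shape: local part, @, then a dotted domain.
# Colons and apostrophes are not allowed, so prefixes such as "E:" or
# surrounding quotes fall outside the match.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text: str) -> Tuple[List[str], bool]:
    """Return the addresses found, plus a flag saying whether any @ sign
    was left unmatched and therefore needs human review."""
    emails = [m.group(0) for m in EMAIL_RE.finditer(text)]
    # Every match contains exactly one @, so leftover @ signs are suspicious
    # (for example an address with a space typed inside it).
    needs_review = text.count("@") > len(emails)
    return emails, needs_review


if __name__ == "__main__":
    sample = "E:jane@example.com 'bob@example.org' sam @example.net"
    print(extract_emails(sample))
```

The review flag reflects the point made above: the extraction itself is short, and the more valuable part is detecting the cases it could not handle.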
Our processes can successfully extract email addresses, telephone numbers, postal addresses where valid postcodes are used, names (from our extensive names database), SKU codes, numerical barcodes, vehicle registrations with make and model, and other structured items.
Unstructured items can usually be extracted too, using a rules-based process similar to the one for structured items, specifically tailored for each dataset.
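To give one example of how a structured item can be validated as well as extracted, the sketch below picks out 13-digit numbers and keeps only those whose final digit is a correct EAN-13 check digit. Choosing EAN-13 is an assumption made for illustration, since the text above does not say which barcode formats are handled, and the function names are hypothetical.

```python
import re
from typing import List


def is_valid_ean13(code: str) -> bool:
    """Check the EAN-13 check digit: the first 12 digits are weighted
    1, 3, 1, 3, ... and the 13th digit must bring the total to a multiple of 10."""
    if not re.fullmatch(r"[0-9]{13}", code):
        return False
    digits = [int(c) for c in code]
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def find_ean13(text: str) -> List[str]:
    """Return every standalone 13-digit run in the text that passes the check."""
    candidates = re.findall(r"(?<![0-9])[0-9]{13}(?![0-9])", text)
    return [c for c in candidates if is_valid_ean13(c)]


if __name__ == "__main__":
    print(find_ean13("Item 4006381333931 shipped; ref 4006381333932 rejected"))
```

A check digit catches most single-character misreads, so a barcode that fails the test can be routed for correction instead of being stored as if it were good data.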
Data Mining handwritten documents
This is never an easy task for any automated process, and the possibility of errors is extremely high even using the best and newest technology. Scanning using handwriting OCR does have its merits and is usually the best place to start in order to get the information into a digital format. There is no substitute for human intervention here though, so we combine the OCR process with a unique data grabbing and checking step under human supervision to clean up the information. No matter how small or large the handwritten dataset is, we always recommend that a keen eye watches over it to ensure a smooth, accurate dataset.
We have seen that Data Mining, Data Capture and Data Extraction can be heavily automated, but there is also ‘Data Grabbing’ to consider in certain situations. A ‘Data Grabber’ is a bespoke system, usually built in Excel, that allows the user to import their text in bulk and shows this text on screen. The user can then highlight pieces of text and drag and drop them into a built-in data entry tool, populating a database accurately and very quickly in cases where Data Mining wouldn’t work.
An example of where our Grabbers are used is the capture of data from legal documents, claim forms and other documents where data is not structured in a list but sits within sentences, describing an event for instance. The objective may be to marry up the name, address, email address, telephone number, date of birth and so on of a person who is being described in a document. The Data Mining process might be able to extract these data entities successfully and very quickly, but it is very unlikely that the information can be spliced together correctly where absolute accuracy is required. The Grabber system does as much of the work as it can, highlighting specific entities such as names and telephone numbers when the mouse moves over them. Usually all that is then required is a click, drag and drop (or, in more advanced systems, a double click) to send that piece of information to the correct input area of the data entry form.
If Data Mining, Data Capture or Data Grabbing is something you are doing or are about to do, feel free to drop us a message for some advice and guidance using the contact form or leave a comment below.