BlackCat Technology Solutions

ETL project for document deduplication for a top FTSE 100 client

...quick productive progress delivering a web application that solved the business problem

Project Synopsis

BlackCat Technology Solutions, an agile software development consultancy, required Yobibyte Solutions to help with their top FTSE 100 client.

They had an urgent requirement for a new web application to map duplicate documents between the legal and risk sides of the business, so that the duplicates could be found and removed ahead of an upcoming release of a client-facing system.

De-duplicating this content was critical to the release of this client-facing system.

The Challenge

This new web application was required to map duplicate documents between the legal and risk sides of the business across more than 89,000 documents, so that editors could remove the duplicates from the system.

The application had to identify as many of the duplicate documents as possible automatically, to reduce the manual effort required of the editors. Identifying duplicates by hand across some 89,000 documents would have been very time-consuming. The more that could be automated, the quicker and more cost-effective the process would be.

Finding duplicate content was not going to be easy because of differences in document contents, structure and metadata between the different types of documents. Many of the documents were also versioned and, to complicate matters further, the effective start and end dates of the risk and legal documents were often not aligned.

What was required was a flexible application where the matching algorithm could be configured for different document sets.

The Process

We began by understanding the initial requirements of the system and populating the backlog with initial stories. As this was a greenfield project, we started with a skeleton of the application and then added the building blocks.

Yobibyte Solutions was able to utilise previous experience with the Quartz Scheduler to quickly add functionality to allow matching jobs to be scheduled. These jobs would become responsible for finding duplicate documents through a matching process.

Working with the product owner, functionality was added to the matching process to support the first document set. Once it was understood how duplicate documents could be found, it was possible to create the basics of a framework to support different algorithms for different document sets.

Yobibyte Solutions was responsible for most of the backend design and development. As more document sets were supported, the matching side of the application was refactored to support the ever-growing complexity of matching different document sets using different metadata and document contents.
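To illustrate the idea of a matching framework that can be configured per document set, here is a minimal sketch. It is not the code delivered on the project: the class names, the "title" and "jurisdiction" metadata fields and the document set names are all hypothetical, and the only rule shown is a simple field comparison. It avoids lambdas so that it stays compatible with the Java 7 stack listed below.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConfigurableMatcherSketch {

    /** Hypothetical document: an id plus free-form metadata fields. */
    static final class Doc {
        final String id;
        final Map<String, String> metadata = new HashMap<String, String>();

        Doc(String id) { this.id = id; }

        Doc with(String key, String value) {
            metadata.put(key, value);
            return this;
        }
    }

    /** One configurable criterion; a pair is a duplicate only if every rule agrees. */
    interface MatchRule {
        boolean matches(Doc legal, Doc risk);
    }

    /** Rule: the named field must be present on both sides and equal, ignoring case and outer whitespace. */
    static MatchRule sameField(final String field) {
        return new MatchRule() {
            public boolean matches(Doc legal, Doc risk) {
                String a = legal.metadata.get(field);
                String b = risk.metadata.get(field);
                return a != null && b != null && a.trim().equalsIgnoreCase(b.trim());
            }
        };
    }

    /** Rules registered per document set (the set names are invented for this sketch). */
    private final Map<String, List<MatchRule>> rulesBySet = new HashMap<String, List<MatchRule>>();

    void configure(String documentSet, MatchRule... rules) {
        rulesBySet.put(documentSet, Arrays.asList(rules));
    }

    boolean isDuplicate(String documentSet, Doc legal, Doc risk) {
        List<MatchRule> rules = rulesBySet.get(documentSet);
        if (rules == null || rules.isEmpty()) {
            return false; // unconfigured sets are left for manual review
        }
        for (MatchRule rule : rules) {
            if (!rule.matches(legal, risk)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ConfigurableMatcherSketch matcher = new ConfigurableMatcherSketch();
        matcher.configure("set-a", sameField("title"), sameField("jurisdiction"));

        Doc legal = new Doc("L-1").with("title", "Data Protection Overview").with("jurisdiction", "UK");
        Doc risk = new Doc("R-9").with("title", " data protection overview ").with("jurisdiction", "uk");

        System.out.println(matcher.isDuplicate("set-a", legal, risk));
    }
}
```

Keeping each criterion behind a small interface means a new document set can be supported by registering a different combination of rules rather than by changing the matching loop itself.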

The Success

Yobibyte Solutions was part of a small team that was seen as making quick productive progress delivering a web application that solved the business problem. This web application was delivered to the editors in advance of the release of the client-facing system to allow the editors to remove the duplicate documents.

The result?

Let's take a look at the numbers...

89,180 Documents processed

77,294 Documents auto-matched

86% Auto-match rate

The matching algorithms proved flexible enough to support all the required document sets and gave us an overall auto-match rate of 86% across the 89,180 documents.

That figure doesn't really do the result justice. Some of the documents were table-of-contents documents that had no equivalent on the legal side, so there was never a duplicate to find. The true auto-match rate is therefore higher than 86%. Unfortunately, it is not possible to get statistics for these table-of-contents documents, as identifying them is a manual process.
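The arithmetic behind that point can be sketched as follows. The processed and auto-matched counts are the figures reported above. The number of table-of-contents documents is not known, so the value of 1,000 used here is an invented placeholder, there only to show the direction of the effect.

```java
class MatchRateSketch {

    /**
     * Auto-match rate as a percentage of the documents that could have had a duplicate.
     * "unmatchable" is the count of documents with no possible counterpart
     * (for example table-of-contents documents); pass 0 to get the headline rate.
     */
    static double autoMatchRate(long autoMatched, long processed, long unmatchable) {
        long eligible = processed - unmatchable;
        if (unmatchable < 0 || eligible <= 0 || autoMatched < 0 || autoMatched > eligible) {
            throw new IllegalArgumentException("counts are inconsistent");
        }
        return 100.0 * autoMatched / eligible;
    }

    public static void main(String[] args) {
        long processed = 89180;   // reported above
        long autoMatched = 77294; // reported above

        // Headline rate over every processed document.
        System.out.printf("Headline rate: %.2f%%%n", autoMatchRate(autoMatched, processed, 0));

        // ASSUMPTION: 1,000 unmatchable documents is a made-up figure for illustration only.
        System.out.printf("Adjusted rate: %.2f%%%n", autoMatchRate(autoMatched, processed, 1000));
    }
}
```

Any unmatchable documents removed from the denominator push the rate up, which is why the reported 86% is best read as a lower bound.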

Tech Stack

Java 7
Spring
Quartz Scheduler
Hibernate/JPA
JUnit
Mockito
Tomcat
MySQL
Oracle
Docker
Maven