Yobibyte Solutions

Blackcat Technology Solutions

ETL project for document deduplication for top FTSE 100 client

...quick productive progress delivering a web application that solved the business problem

Project Synopsis

Blackcat Technology Solutions, an agile software development consultancy, engaged Yobibyte Solutions to help with their top FTSE 100 client.

They had an urgent requirement for a new web application to map duplicate documents between the legal and risk sides of the business, so that the duplicates could be found and removed ahead of an upcoming release of a client-facing system.

De-duplicating this content was critical to the release of this client-facing system.

The Challenge

This new web application was required to map duplicate documents between the legal and risk sides of the business across more than 89,000 documents, so that editors could remove the duplicates from the system.

The web application had to identify as many of the duplicate documents as possible automatically, to reduce the manual effort required of the editors. Identifying these duplicates by hand across the 89,000-odd documents would have been a very time-consuming process. The more that could be automated, the quicker and more cost-effective the process would be.

Finding duplicate content was not going to be easy because of differences in content, structure and metadata between the different types of documents. Many of the documents were also versioned and, to complicate matters further, the effective start/end dates of the risk and legal documents were often not aligned.
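To illustrate the date alignment problem, here is a minimal sketch of one way two document versions could be treated as candidates for the same content when their effective periods do not line up exactly: check whether the periods overlap at all. The `EffectivePeriod` class and its rule are assumptions made for illustration, not the approach the project actually used, and the sketch relies on `java.time` (Java 8 or later) rather than the Java 7 of the original stack.

```java
import java.time.LocalDate;

// Hypothetical effective period of one document version.
// A null end date means the version is still in force.
final class EffectivePeriod {
    final LocalDate start;
    final LocalDate end;

    EffectivePeriod(LocalDate start, LocalDate end) {
        this.start = start;
        this.end = end;
    }

    // True when the two periods share at least one day (both bounds inclusive).
    boolean overlaps(EffectivePeriod other) {
        boolean startsBeforeOtherEnds = other.end == null || !start.isAfter(other.end);
        boolean endsAfterOtherStarts = end == null || !end.isBefore(other.start);
        return startsBeforeOtherEnds && endsAfterOtherStarts;
    }
}
```

Under this assumed rule, a legal version effective for the whole of 2015 and a risk version effective from March 2015 onwards would overlap, while a version starting in 2016 would not overlap the 2015 one.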

What was required was a flexible application where the matching algorithm could be configured for different document sets.

The Process

We began by working to understand the initial requirements of the system and populating the backlog with the first stories. As this was a greenfield project, we started with a skeleton of the application and built up from the basic building blocks.

Yobibyte Solutions was able to utilise previous experience with the Quartz Scheduler to quickly add functionality to allow matching jobs to be scheduled. These jobs would become responsible for finding duplicate documents through a matching process.
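The idea of a scheduled matching job can be sketched as follows. The project used the Quartz Scheduler; to keep this example free of external dependencies, it swaps in the JDK's `ScheduledExecutorService` instead, so it shows the general shape of a scheduled job rather than the Quartz API. The `MatchingJob` and `MatchingScheduler` names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical matching job: counts its runs in place of real matching work.
final class MatchingJob implements Runnable {
    final String documentSet;
    final AtomicInteger runs = new AtomicInteger();
    private final CountDownLatch firstRun = new CountDownLatch(1);

    MatchingJob(String documentSet) {
        this.documentSet = documentSet;
    }

    @Override
    public void run() {
        // A real job would load the document set and run the matching process here.
        runs.incrementAndGet();
        firstRun.countDown();
    }

    // Blocks until the first run completes or the timeout elapses.
    boolean awaitFirstRun(long timeoutMillis) {
        try {
            return firstRun.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

// Stand-in scheduler built on the JDK executor (not Quartz).
final class MatchingScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // Runs the job once after the given delay.
    void scheduleOnce(MatchingJob job, long delayMillis) {
        executor.schedule(job, delayMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

A caller would create a `MatchingJob` for a document set, hand it to `scheduleOnce`, and shut the scheduler down once the work is finished. Quartz adds persistence, cron-style triggers and job management on top of this basic pattern.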

Working with the product owner, functionality was added to the matching process to support the first document set. Once it was understood how duplicate documents could be found, it became possible to create the basics of a framework to support different algorithms for different document sets.

Yobibyte Solutions was responsible for most of the backend design and development. As more document sets were supported, the matching side of the application was refactored to support the ever-growing complexity of matching different document sets using different metadata and document contents.
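A framework that supports different algorithms for different document sets is often built around a strategy interface and a registry. The sketch below is a minimal illustration under that assumption; the `Doc`, `MatchStrategy`, `MetadataFieldStrategy` and `MatcherRegistry` names are hypothetical and are not taken from the project's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document model: an id plus free-form metadata.
final class Doc {
    final String id;
    final Map<String, String> metadata;

    Doc(String id, Map<String, String> metadata) {
        this.id = id;
        this.metadata = metadata;
    }
}

// One matching algorithm; each document set can plug in its own.
interface MatchStrategy {
    boolean matches(Doc legal, Doc risk);
}

// Example strategy: compare a single metadata field,
// ignoring case and surrounding whitespace.
final class MetadataFieldStrategy implements MatchStrategy {
    private final String field;

    MetadataFieldStrategy(String field) {
        this.field = field;
    }

    @Override
    public boolean matches(Doc legal, Doc risk) {
        String a = legal.metadata.get(field);
        String b = risk.metadata.get(field);
        return a != null && b != null && a.trim().equalsIgnoreCase(b.trim());
    }
}

// Registry mapping a document set name to the strategy configured for it.
final class MatcherRegistry {
    private final Map<String, MatchStrategy> strategies =
            new HashMap<String, MatchStrategy>();

    void register(String documentSet, MatchStrategy strategy) {
        strategies.put(documentSet, strategy);
    }

    // Returns "legalId->riskId" for every pair the configured strategy accepts.
    List<String> findDuplicates(String documentSet, List<Doc> legal, List<Doc> risk) {
        MatchStrategy strategy = strategies.get(documentSet);
        if (strategy == null) {
            throw new IllegalArgumentException("No strategy for " + documentSet);
        }
        List<String> pairs = new ArrayList<String>();
        for (Doc l : legal) {
            for (Doc r : risk) {
                if (strategy.matches(l, r)) {
                    pairs.add(l.id + "->" + r.id);
                }
            }
        }
        return pairs;
    }
}
```

With this shape, supporting a new document set means writing one more `MatchStrategy` and registering it, without touching the code that runs the comparison. The pairwise loop is kept deliberately simple; at tens of thousands of documents a real implementation would likely index candidates first.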

The Success

Yobibyte Solutions was part of a small team that was seen as making quick, productive progress in delivering a web application that solved the business problem. The web application was delivered to the editors in advance of the release of the client-facing system, giving them time to remove the duplicate documents.

The result?

Let's take a look at the numbers...

89,180 Documents processed

77,294 Documents auto-matched

86 Percentage auto-matched

The matching algorithms proved flexible enough to support all the required document sets and allowed us to achieve an overall auto-match rate of 86% across the 89,180 documents.

That figure doesn't really do the statistics justice. Some of the documents were table-of-contents documents that had no equivalent on the legal side, so there was never a duplicate to find for them. The true auto-match rate is therefore higher than 86%. Unfortunately, it is not possible to get the statistics for these table-of-contents documents, as identifying them is a manual process.
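The arithmetic behind this point can be sketched in a few lines. The `MatchRate` class is a hypothetical helper, and the number of unmatchable table-of-contents documents is unknown, so that value is left as a parameter rather than filled in.

```java
// Hypothetical helper for the match-rate arithmetic.
final class MatchRate {
    // Percentage of documents auto-matched, after excluding documents
    // that could never have a match (for example table-of-contents documents).
    static double percent(long matched, long total, long unmatchable) {
        long eligible = total - unmatchable;
        if (unmatchable < 0 || eligible <= 0 || matched > eligible) {
            throw new IllegalArgumentException("inconsistent counts");
        }
        return 100.0 * matched / eligible;
    }
}
```

With nothing excluded, `MatchRate.percent(77294, 89180, 0)` gives roughly 86.7%, in line with the 86% headline figure. Any positive count of unmatchable documents shrinks the denominator and so raises the rate.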

"Ian worked for me on a very challenging large scale ETL project which needed flexible and pluggable methods to allow users to create batches of complex data cleansing activities configurable via a complex UI. The team comprised of 3 developers and a QA. He was instrumental in helping set a clear technical direction within the application, providing guidance and an unswerving drive towards clean, well factored code, making future extensions for further content sets trivial. Furthermore he's happy to interface directly with both product owner and external teams as needed and always arrives at a well considered and balanced solution. Always jovial and positive in his outlook he's a true asset to any development team and I'm more than happy to recommend him."

Craig Barker, Co-founder at Blackcat Technology Solutions

"Ian and I worked in the same Scrum teams tasked with implementing cloud migrations on legacy systems for a challenging client. Working with Ian was always a pleasure. As well as bringing to the table an excellent set of technical skills Ian displayed an ability to quickly pick up new technologies and apply them effectively. If Ian is working on something you can be sure that it is in a safe pair of hands and will be delivered in a timely manner and to the highest quality."

Mike Talbutt, Delivery Manager at Blackcat Technology Solutions

Tech Stack

Java 7

Spring

Quartz Scheduler

Hibernate/JPA

JUnit

Mockito

Tomcat

MySQL

Oracle

Docker

Maven