HSBC Mole Whacking Development

Last week a customer asked me to help him process a few hundred of his documents. He had PDFs for several bank accounts going back to 2018. I had a look a documents arranged a price and then got to work processing his documents. My idea was to run it through Bank Statement Converter’s (BSC) PDF to CSV processor. However a few errors occurred. BSC has a generic algorithm for detecting transaction records in a PDF file. This usually works pretty well, for many banks it works perfectly. The generic algorithm doesn’t work for a lot of statements though. To get around this, I detect the document type and then run custom code for that document.

Often a customer will report some sort of parsing error. I’ll ask them for the document that causes the parsing problem. Then I’ll create a custom parser for that document type. I believe this sort of process is called “Whack-A-Mole” development. For this blog post, I thought it might be fun to walk you through the development process of creating a beautiful custom processor for my HSBC bank statements.

What is it like now?

It looks pretty crap! The description heights are uneven. For some reason a date value is missing. The dates don’t have the year on them. The results are split into two tables. A currency value is missing. Let’s walk through each of the problems and come up with some sort of solution.

The description heights are uneven

This one is quite easy to deal with. When I merge descriptions that go across multiple lines I add in a new line character. Instead I think it would be better to add in a space character. No one lines uneven line heights in their CSV. I’m considering doing this for all documents, not just this HSBC one.

For some reason a date value is missing

This is a feature of HSBC’s bank statements. If more than one transaction happens on a day, they only print the date the first time. It’s a bit annoying isn’t it. It doesn’t make the statement easier to read at all. It also makes my life as a PDF King more difficult. It’s not too difficult to solve though, I can copy the date from the previous record if it’s blank.

The results are split into two tables

This happens because the statement includes my HKD savings accounts as well as my USD and AUD foreign currency accounts. The foreign currency accounts have an extra “CCY” column. The generic algorithm is smart enough to merge together tables with the same headers. If the headers are different, it outputs the tables side by side.

We can resolve this issue by adding in a “CCY” column to my HKD savings records.

A currency value is missing

We can use the same solution for the missing date value.

The date values don’t have a year

We can use the advanced technique we developed in a previous blog post to ascertain the year from the statement date. Let’s code it up.

The description heights are uneven

I changed RowMerger.kt so that it accepts a separator string. Then I passed in a space character. Pretty simple.

Missing date and currency values

I’ve dealt with this issue before in previous documents. These two functions calls solved the problem

The results are split into two tables

I’m not very happy about how I solved this issue. Basically I transform any TableHeaders that do not have a “CCY” column in them. This makes the TableHeaders the same across the document, which means they can join together nicely. It’s a hack for sure. It might be nicer to have a gridTransform method

The date values don’t have a year

I solved this with two steps. First I wrote some code to capture the statement date.

Then I transform the date column by calling the enrichDates on every date value. I’ve talked a bit about date enrichment code in a previous blog post.


It looks a lot better. However you could say it’s still not perfect. The CCY column has a few blank values. It might be nice to set these to HKD. Also, there are gaps in the balance column. This is because HSBC statements only show the balance once per day. It is possible to calculate this value. I think that would be asking for trouble though.

Join The Mailing List