Last week a customer asked me to help him process a few hundred of his documents. He had PDFs for several bank accounts going back to 2018. I had a look at the documents, arranged a price, and then got to work processing them. My idea was to run them through Bank Statement Converter’s (BSC) PDF to CSV processor. However, a few errors occurred. BSC has a generic algorithm for detecting transaction records in a PDF file.
Lately I’ve been getting a lot of complaints that the year is wrong in Bank Statement Converter’s resulting CSV. At first when I got these complaints I thought “What the hell you talking about? All we do is find the transaction data and then write it out to a CSV file. How can the year be wrong?” Let’s walk through what’s going on with one of my HSBC bank statements.
The other day I had a bit of a revelation about what problem Bank Statement Converter solves. The obvious answer seems to be “it solves the problem of extracting transaction data from PDF bank statements”. That’s true, but you could also say it solves a more general problem: “It gives users access to their bank transaction data”. However, it’s a bit of a pain to use. To get your 2021 transaction data you need to:
A few weeks ago BankStatementConverter (BSC) exceeded $1000 in Monthly Recurring Revenue (MRR). I figured you lot would be interested to hear the story. $0 MRR, March to July 2021: I got the idea to build BSC to help users process PDFs. I spent about a week playing around in Kotlin to see if my idea was feasible, and it was. Soon after, I met up with a friend for beers and told him my idea.
This is the second part in a series of blog posts where I explain how Bank Statement Converter works. In the previous article I talked about how I extract each character and its bounding box from a PDF. In this article I’ll talk about how I use the characters and bounding boxes to detect the headers of the transaction table. val pageRegion = Rectangle(0f, 0f, page.cropBox.width, page.cropBox.height) val lines = LineExtractor(page).
This is part one of a series of blog posts where I explain how BankStatementConverter works. In this post I’m going to explain the code that figures out the bounding boxes and other attributes of characters on a page. Lots of this code was lifted from DrawPrintTextLocations and PDFTextStripper. The first thing we do is load the PDF file using PDFBox, and then we process the document page by page. The PDFs are processed page by page so that we don’t run out of memory. Most documents are less than ten pages long, but there are documents out there that are over 10,000 pages long. If we tried to load all the data from a large document into memory, we would quickly run out and crash our app.
Originally I wanted to analyse my HSBC bank statements from 2014. 2014 was a great year for me: I started off the year by launching a new app, joined a game development company, and rented an apartment with a friend. Unfortunately my 2014 bank statements from HSBC’s internet banking are no longer available; it seems they only go back a few years. So let’s go through my 2015 bank statements instead.
I use Grafana to create graphs that show me various business and performance metrics for Bank Statement Converter. One of the graphs I created tracks the number of Internal Server Errors the server returns to its clients. I do this by writing a record into the database whenever a 500 is sent to the client. This graph has been really helpful for ironing out bugs I didn’t anticipate. Last Thursday at 12:55 AM HKT my servers started throwing Java’s infamous OutOfMemoryErrors.
The lowercase, friendly NAB logo reminds me of Jeb Bush’s 2016 ‘Jeb!’ logo. National Australia Bank was formed in 1982 after the merger of the National Bank of Australasia and the Commercial Banking Company of Sydney. The 21st largest bank by market capitalisation. They have an Irish subsidiary called Danske Bank, which was formerly known as National Irish Bank. Sounds like a pretty good bank, but how good are their PDFs?
Are the people in this building good at making PDFs? I’ve seen quite a few statements when helping customers get through issues converting bank statements. A lot of bank statements follow a logical, structured format that is easy for a human to read and easy for an algorithm to extract data from. You’d think it’s something that would normalise into a standard format: banks would look at statements from other banks and copy the bank with the best statements.