This is the second part in a series of blog posts where I explain how Bank Statement Converter works. In the previous article I talked about how I extract each character and its bounding box from a PDF. In this article I’ll talk about how I use the characters and bounding boxes to detect the headers of the transaction table.
val pageRegion = Rectangle(0f, 0f, page.cropBox.width, page.cropBox.height)
val lines = LineExtractor(page).
This is part one of a series of blog posts where I explain how Bank Statement Converter works. In this post I’m going to explain the code that figures out the bounding boxes and other attributes of characters on a page. Much of this code was lifted from DrawPrintTextLocations and PDFTextStripper.
The first thing we do is load the PDF file using PDFBox, and then we process the document page by page. The PDFs are processed page by page so that we don’t run out of memory. Most documents are less than ten pages long, but there are documents out there that are over 10,000 pages long. If we tried to load all the data from a large document into memory, we would quickly run out and crash our app.
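As a rough illustration of the page-by-page idea, here is a minimal Kotlin sketch. It does not use PDFBox and is not the actual Bank Statement Converter code: `PageText`, `PageSummary`, `summarisePages` and the `loadPage` callback are hypothetical names made up for this example. The assumption is that each page can be loaded on its own and reduced to a small result before the next page is touched.

```kotlin
// Hypothetical stand-in for the (potentially large) data extracted from one page.
data class PageText(val pageNumber: Int, val characters: List<Char>)

// Hypothetical small per-page result that is cheap to keep for every page.
data class PageSummary(val pageNumber: Int, val characterCount: Int)

// Processes pages one at a time. `loadPage` is an assumed callback that
// extracts the data for a single 1-based page number.
fun summarisePages(pageCount: Int, loadPage: (Int) -> PageText): List<PageSummary> =
    (1..pageCount).asSequence()
        // Sequences are lazy: a page is loaded only when the pipeline asks for it.
        .map { pageNumber -> loadPage(pageNumber) }
        // Reduce the page to a small summary; the full PageText is no longer
        // referenced after this step and can be garbage collected.
        .map { page -> PageSummary(page.pageNumber, page.characters.size) }
        .toList()
```

Because each page flows through the whole pipeline before the next one is loaded, only one page's worth of character data needs to be reachable at a time, regardless of how many pages the document has. Only the small summaries accumulate.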
Originally I wanted to analyse my HSBC bank statements from 2014. 2014 was a great year for me: I started off the year by launching a new app, joined a game development company and rented an apartment with a friend. Unfortunately my 2014 bank statements are no longer available from HSBC’s internet banking; it seems they only go back a few years. So let’s go through my 2015 bank statements instead.
I use Grafana to create graphs that show me various business and performance metrics for Bank Statement Converter. One of the graphs I created tracks the number of Internal Server Errors the server returns to its clients. I do this by writing a record into the database whenever a 500 is sent to the client. This graph has been really helpful for ironing out bugs I didn’t anticipate. Last Thursday at 12:55 AM HKT my servers started throwing Java’s infamous OutOfMemoryErrors.
The lowercase friendly NAB logo reminds me of Jeb Bush’s 2016 ‘Jeb!’ logo
National Australia Bank was formed in 1982 after the merger of the National Bank of Australasia and the Commercial Banking Company of Sydney. It is the 21st largest bank by market capitalisation. They used to own an Irish subsidiary called National Irish Bank, which they sold to Danske Bank. Sounds like a pretty good bank, but how good are their PDFs?
Are the people in this building good at making PDFs?
I’ve seen quite a few statements while helping customers with issues converting bank statements. A lot of bank statements follow a logical, structured format that is easy for a human to read and easy for an algorithm to extract data from. You’d think statements would normalise into a standard format, with banks looking at statements from other banks and copying the bank with the best ones.
I own a limited liability company in Hong Kong called Dragon King Creation Limited. The company has no employees. I’m the sole director and I own 100% of the shares. I created the company in 2015 to manage the revenues from my Android and iPhone application sales. The company doesn’t make a lot of money, but it’s still officially a company. Every year limited companies in Hong Kong must go through an audit.
Getting data from a PDF file into an Excel file is a major pain in the ass. A lot of people resort to manually copying it. If you want to automate the conversion, follow the steps below:
1. Go to bankstatementconverter.com
2. Click the Convert a PDF button
3. Select the PDF you want to convert
4. Wait for it to finish uploading
5. Press the inspect button

You should be taken to a page where you can see the PDF you uploaded.
I’ve been working on a new project that requires lots of different social sign-in providers. This meant I needed to learn what OAuth is and how to create the APIs to communicate with OAuth providers like Facebook, Google, GitHub and… Twitter. The first three providers were very easy to work with. Twitter was not. I spent about five hours reading Twitter’s documentation, going through blog posts and libraries, and writing code.
A few weeks ago I got an email from a user in Australia. It read: “I uploaded a bank statement and nothing happens How do I see the converted file?”. I took a look at my Grafana dashboards and figured out this user had uploaded an image-based PDF, probably a scanned bank statement. I’ve replied to quite a few users telling them that scanned documents don’t work. I fished out a reply from my sent folder and sent it to this user.