Detecting Headers of the Transaction Table

This is the second part in a series of blog posts where I explain how Bank Statement Converter works. In the previous article I talked about how I extract each character and its bounding box from a PDF. In this article I’ll talk about how I use the characters and bounding boxes to deteect the headers of the transaction table.

val pageRegion = Rectangle(0f, 0f, page.cropBox.width, page.cropBox.height)
val lines = LineExtractor(page).extractLines()
val headers = headerDetector.detect(allCharacters, lines, pageRegion).map(::headerTransformer)

In the code above, we determine the size of the page, extract the graphical lines from the page and call the detect method on our TransactionHeaderDetector class. Often the graphical lines can be used to improve the quality of the headers for a table, we’ll talk about that a bit more in detail later.

override fun detect(__unsortedCharacters: List<CharAndBound>, graphicalLines: List<Line>, pageRegion: Rectangle): List<TableHeader> {
    val words = merger.merge(__unsortedCharacters)
    ...
}

First we merge the list of characters into a list of words. We do this by sorting the characters by their Y and X positions and then measuring the horizontal and vertical distances between characters. If either value crosses a certain threshold it is considered to be the start of a new word. Here’s the interesting code from CharacterMerger.kt that is responsible for detecting boundaries between words.

val xDistance = current.bound.left() - previous.bound.right()
val yDistance = Math.abs(current.bound.bottom() - previous.bound.bottom())
val isCloseEnoughOnX = xDistance <= xThreshold
val isCloseEnoughOnY = yDistance <= yThreshold

if (!isCloseEnoughOnY || !isCloseEnoughOnX) {
    words.add(createTextAndBound(buffer))
    buffer.clear()
}

After that we form the words into lines, and go through each line to see if it looks like a header for a bank statement’s table of transactions. If we find one more headers we return them. If we don’t, we try look for headers across two lines at a time. If that works, great! If not, we look for headers across three lines at a time. This isn’t the most elegant solution, but in practice it works pretty well. Also it isn’t great from a performance point of view, a lot of work being done in the detectLines method is needlessly repeated.

override fun detect(__unsortedCharacters: List<CharAndBound>, graphicalLines: List<Line>, pageRegion: Rectangle): List<TableHeader> {
    val textLines = splitByYLevel(words)
    var headersFound = detectLines(textLines, graphicalLines, pageRegion, 1)

    if (headersFound.isNotEmpty()) {
        return headersFound
    }

    headersFound = detectLines(textLines, graphicalLines, pageRegion, 2)

    if (headersFound.isNotEmpty()) {
        return headersFound
    }

    return detectLines(textLines, graphicalLines, pageRegion, 3)
}

detectLines is actually a bit of a wrapper method. As you might have guessed, originally this algorithm only looked for headers across one line at a time. As users emailed me with errors I improved the algorithm to look across multiple lines. If you haven’t coded in Kotlin this may look a little scary, all it’s really doing is figuring out which lines to give to the parseTransactionHeader method. Later on it calls useLinesToEnrich to improve the quality of the header boundaries, we’ll talk about that method later. If you don’t understand this code, don’t worry, it is not interesting.

private fun detectLines(
    textLines: List<List<TextAndBound>>,
    graphicalLines: List<Line>,
    pageRegion: Rectangle,
    followingLineCount: Int
): List<TableHeader> {
    val headers = textLines.mapIndexedNotNull { index, _ ->
        val toIndex = index + followingLineCount

        if (toIndex > textLines.lastIndex) {
            null
        }
        else {
            val lines = textLines.subList(index, toIndex).flatten()
            parseTransactionHeader(lines)
        }
    }

    val enrichedHeaders = headers.map { useLinesToEnrich(it, graphicalLines, pageRegion, logger) }
    return enrichedHeaders
}

Now we’re getting to the interesting code. We attempt to convert each word into an XRange model, which is a simple model to represent one of the headers in table’s header. getXRange magically knows if a word is a word you might see in the header of a transaction table. We analyse the xRanges we pulled back to see if certain columns are present. If there’s no description, date, credit, debit or amount column then the code says “This list of words does NOT make up a transaction header”. We also consider the match percentage, that is, how many characters in the entire line were ‘keywords’. The match percentage needs to be greater than 35% to be considered a transaction header. This is because bank statements often have fine print with tons of words, and if a line or a group of lines contains the words “credit”, “date” and “description” then the parser will falsely say “Aha! I’ve found a header”, when in fact it has detected some fine print.

data class XRange(val name: String, val start: Float, val end: Float, val columnType: ColumnType?)

private fun parseTransactionHeader(
    lines: List<TextAndBound>
): TableHeader? {
    val result = lines.map(::getXRange)
    val xRanges = result.map { it.first }.sortedBy { it.start }
    val charactersMatched = result.sumOf { it.second }
    val totalCharacters = lines.sumOf { it.text.length }
    val matchPercent = charactersMatched / totalCharacters.toFloat()

    val hasDate = xRanges.any { it.columnType == ColumnType.DATE }
    val hasBalance = xRanges.any { it.columnType == ColumnType.BALANCE }
    val hasCredit = xRanges.any { it.columnType == ColumnType.CREDIT }
    val hasDebit = xRanges.any { it.columnType == ColumnType.DEBIT }
    val hasAmount = xRanges.any { it.columnType == ColumnType.AMOUNT }
    val hasDescription = xRanges.any { it.columnType == ColumnType.DESCRIPTION }
    val hasCreditDebitAmountOrBalance = hasCredit || hasDebit || hasBalance || hasAmount

    if (!hasDate || !hasDescription || !hasCreditDebitAmountOrBalance || matchPercent < 0.35) {
        return null
    }

    logger.info("Match Percent {}/{} = {}%", charactersMatched, totalCharacters, matchPercent * 100)

    val top = lines.minOf { it.bound.top() }
    val bottom = lines.minOf { it.bound.bottom() }
    return TableHeader(xRanges, top, bottom)
}

You might think I’m going to show you what getXRange looks like, but instead I’ll just show you the data getXRange uses and that should give you a better of what getXRange does. Below we have a list of words, and some matchers that use those word lists to determine whether a word could be for a certain header. If you pass in the word “date” to getXRange, it’ll go through the list of matchers in an attempt to classify the word as a ColumnType. The matchers actually look for a substring of the known words. So getXRange(“sabalance”) will be classified as a BALANCE ColumnType.

private val dateHeaders = listOf("date", "datum", "posted", "posdatum", "achat", "tanggal", "data")
private val balanceHeaders = listOf("balance", "saldo")
private val amountHeaders = listOf("amount", "bedrag", "mutasi")
private val creditHeaders = listOf( "credit", "moneyin", "paid in", "deposit", "krediet", "crédit", "entrate", "deposits")
private val debitHeaders = listOf("debit", "moneyout", "paid out", "withdrawal", "debiet", "débit", "uscite")
private val descriptionHeaders = listOf("description", "particulars", "details", "transaction", "beskrywing", "narrative", "transaksiebeskrywing", "transaksie", "texte", "keterangan", "descrizione")
private val chargeHeaders = listOf("koste", "charge")

private val matchers = listOf(
    { header: TextAndBound -> isDateColumn(header) } to ColumnType.DATE,
    { header: TextAndBound -> isBalanceColumn(header) } to ColumnType.BALANCE,
    { header: TextAndBound -> isCreditColumn(header) } to ColumnType.CREDIT,
    { header: TextAndBound -> isDebitColumn(header) } to ColumnType.DEBIT,
    { header: TextAndBound -> isAmountColumn(header) } to ColumnType.AMOUNT,
    { header: TextAndBound -> isDescriptionColumn(header) } to ColumnType.DESCRIPTION,
    { header: TextAndBound -> isChargeColumn(header) } to ColumnType.CHARGE
)

If you’re a steely-eyed missile man you will have noticed this won’t work for every language. Let’s say I was to cover every possible language, that would mean adding to the key word lists, which would mean more processing for every word in the document. I think a Trie data structure might be useful as the keyword lists increase in size. Finally let’s talk about the extendXRanges method.

val xRanges = header.xRanges.mapIndexed { index, xRange ->
    val mid = xRange.mid()
    val isFirstHeader = index == 0
    val isLastHeader = index == header.xRanges.lastIndex

    val lineBeforeX = getLineBeforeXPosition(linesAtThisLevel, mid, isFirstHeader)
    val lineAfterX = getLineAfterXPosition(linesAtThisLevel, mid, isLastHeader)

    // Wasn't able to find a line before or after this header. Not a valid table
    if (lineBeforeX == null || lineAfterX == null) {
        logger.info("Wasn't able to find a line before or after header named '${xRange.name}'. Not a valid table")
        return null
    }

    // The enriched headers have shrunk. Not a valid table
    if (lineBeforeX > xRange.start || lineAfterX < xRange.end) {
        logger.info("The enriched headers have shrunk. Not a valid table.")
        return null
    }

    xRange.copy(start = lineBeforeX, end = lineAfterX)
}

extendXRanges is called after a transaction header has been detected. It attempts to use vertical lines to extend the xRanges. This makes more sense when looking at a real transaction table with graphical lines.

The image above is from one of my Hang Seng Bank statements. If we use the bounding rectangles of the header text, we’ll end up with XRanges represented in red. extendXRanges uses the graphics lines in document to improve the XRanges so that we end up with XRanges represented in blue. The next step in the algorithm uses the XRanges to associate pieces of text with a column, it checks to see if a piece of text intersects with any of the XRanges.

As you can see above, the Date and Transaction Details values all intersect with the correct header, however none of the Deposit, Withdrawal or Balance values do. All the values resolve correctly after extending the XRanges.

That’s it for this blog post. Next time on Bank Statement Converter Blog we’ll in more detail about how values are associated with headers.