Automating Personal Finance Monitoring, Part 2

Published in PHP, Databases, Web Scraping on May 30, 2021

In part 1, I used PHP and a set of related libraries and tools to scrape data from my accounts with the financial institutions that I patronize. In this post, I'll talk about how I store, update, and process the data to useful ends.

Invoking Updates

I wrote a class that invokes code to update select accounts based on its input, then wrapped it in a symfony/console command to execute it from a shell.

With no input, this code defaults to checking the e-mail account that receives alerts and updating accounts with alerts present. It will also update accounts lacking an update within the last day. I've set up a cron job that uses this mode to check for updates sporadically over the course of each day.
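For illustration, the crontab entry for this might look something like the line below; the command name finance:update and the paths are placeholders of my own, not necessarily what the project uses.

# check for updates every few hours over the course of the day
0 */4 * * * cd /path/to/project && php bin/console finance:update >> var/log/update.log 2>&1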

I can also specify a list of one or more individual accounts to force an update of them specifically. This is useful when I'm debugging an issue, such as having to update my scraping code to account for a change in the DOM of an account's web site.

Lastly, I can specify an --all flag to update all accounts. Most accounts typically don't change daily, so this mode closely approximates what the cron job does when it detects accounts lacking an update within the last day and updates them.
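Putting these modes together, the console wrapper looks roughly like the sketch below. The names here (UpdateCommand, AccountUpdater, finance:update) are my own placeholders rather than the project's actual classes, and the sketch assumes Symfony 5.x and PHP 8.

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;

final class UpdateCommand extends Command
{
    protected static $defaultName = 'finance:update';

    public function __construct(private AccountUpdater $updater)
    {
        parent::__construct();
    }

    protected function configure(): void
    {
        $this
            ->addArgument('accounts', InputArgument::IS_ARRAY | InputArgument::OPTIONAL, 'Specific accounts to update')
            ->addOption('all', null, InputOption::VALUE_NONE, 'Update all accounts');
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // with no arguments or options, the updater falls back to checking
        // the alert e-mail account and any accounts without a recent update
        $this->updater->update(
            $input->getArgument('accounts'),
            $input->getOption('all')
        );

        return Command::SUCCESS;
    }
}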

I had to add to the chromedriver CLI flags from part 1 to allow this to run in headless mode (yet again, bless Stack Overflow) and to spoof the user agent so that sites couldn't detect that I was running headless (thanks to this post for that tip).

$client = \Symfony\Component\Panther\Client::createChromeClient(
    null,
    [
        // carried over from part 1: hide the usual automation fingerprint
        '--disable-blink-features=AutomationControlled',
        // new flags below
        '--headless',              // run Chrome without a visible window
        '--window-size=1920,1080', // headless Chrome defaults to a small viewport
        // spoof a desktop user agent so sites can't tell the browser is headless
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    ]
);

Storing Events

Late last year, Frank de Jonge gave a talk published to YouTube on event sourcing in general and his library EventSauce in particular. During the talk, he mentioned that the architecture worked well for financial applications. I'd never had the opportunity to work on a codebase that used this as a central facet of its architecture, and my finances project seemed like a good opportunity.

In my case, it made sense to make each account an aggregate root. Think of an aggregate root as a particular thing you want to capture -- here, an individual account -- whose changes are recorded as a stream of events. In a way, they're like long-lived orphan git branches that are never merged, with changes to state tracked like commits.

For my use case, I wanted to track the state -- specifically, balance and recent postings -- of each account that I'm scraping data from. I created an immutable value object class to represent this state, then wrapped it in an EventSauce event to associate it with an individual aggregate root and to handle serialization.
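Collapsing the value object and event into a single class for brevity, a sketch might look like the following; the names are mine, and it assumes PHP 8 and EventSauce 1.x's SerializablePayload interface, which ConstructingMessageSerializer relies on.

use EventSauce\EventSourcing\Serialization\SerializablePayload;

final class AccountStateCaptured implements SerializablePayload
{
    /**
     * @param string $balance  kept as a string to avoid float rounding issues
     * @param array  $postings each posting carries a date, description,
     *                         amount, and (once cleared) a settlement date
     */
    public function __construct(
        private string $balance,
        private array $postings
    ) {
    }

    public function balance(): string
    {
        return $this->balance;
    }

    public function postings(): array
    {
        return $this->postings;
    }

    public function toPayload(): array
    {
        return ['balance' => $this->balance, 'postings' => $this->postings];
    }

    public static function fromPayload(array $payload): self
    {
        return new self($payload['balance'], $payload['postings']);
    }
}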

For the sake of simplicity, I used the stock EventSauce classes SynchronousMessageDispatcher and ConstructingMessageSerializer for the message dispatcher and serializer respectively.
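Wiring those up is brief; in sketch form ($consumer being the consumer described later in this post):

use EventSauce\EventSourcing\Serialization\ConstructingMessageSerializer;
use EventSauce\EventSourcing\SynchronousMessageDispatcher;

// the serializer reconstructs events through their fromPayload() methods;
// the dispatcher hands each persisted message to its consumers in-process
$serializer = new ConstructingMessageSerializer();
$dispatcher = new SynchronousMessageDispatcher($consumer);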

Rather than using the stock persistence options, which require Doctrine or Laravel libraries, I decided to implement a simple custom repository. For this, I used the aura/sql library together with Ben Ramsey's ramsey/uuid library to store my data in a local SQLite database. Aside from the methods declared in EventSauce's MessageRepository interface, I also implemented a method to retrieve the most recent state of a given aggregate root, which I use to locate accounts lacking updates within the last day.
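A stripped-down version of that repository might look something like the sketch below, assuming EventSauce 1.x's MessageRepository interface and a single messages table; the class name and schema are my own. The extra most-recent-state method then amounts to the same query ordered by version descending with LIMIT 1.

use Aura\Sql\ExtendedPdo;
use EventSauce\EventSourcing\AggregateRootId;
use EventSauce\EventSourcing\Message;
use EventSauce\EventSourcing\MessageRepository;
use EventSauce\EventSourcing\Serialization\MessageSerializer;
use Ramsey\Uuid\Uuid;

final class SqliteMessageRepository implements MessageRepository
{
    public function __construct(
        private ExtendedPdo $pdo,
        private MessageSerializer $serializer
    ) {
    }

    public function persist(Message ...$messages): void
    {
        foreach ($messages as $message) {
            $this->pdo->perform(
                'INSERT INTO messages (event_id, aggregate_root_id, version, payload)
                 VALUES (:event_id, :aggregate_root_id, :version, :payload)',
                [
                    'event_id' => Uuid::uuid4()->toString(),
                    'aggregate_root_id' => $message->aggregateRootId()->toString(),
                    'version' => $message->aggregateVersion(),
                    'payload' => json_encode($this->serializer->serializeMessage($message)),
                ]
            );
        }
    }

    public function retrieveAll(AggregateRootId $id): \Generator
    {
        return $this->retrieveAllAfterVersion($id, 0);
    }

    public function retrieveAllAfterVersion(AggregateRootId $id, int $aggregateRootVersion): \Generator
    {
        $rows = $this->pdo->yieldAll(
            'SELECT payload FROM messages
             WHERE aggregate_root_id = :id AND version > :version
             ORDER BY version',
            ['id' => $id->toString(), 'version' => $aggregateRootVersion]
        );

        foreach ($rows as $row) {
            yield from $this->serializer->unserializePayload(json_decode($row['payload'], true));
        }
    }
}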

Consuming Updates

When I look at my accounts manually, my interest often centers on what has changed since I last looked.

This generally entails looking through recent postings and determining which are new, which have cleared, and, on rare occasion, which have disappeared entirely (e.g. a cancelled pending transaction left to "fall off" rather than clearing and being explicitly reversed).

This process is tedious, error-prone, too reliant on my memory (which isn't great), and takes time. My computer is better at crunching data than I am. Why not have it take over doing all this for me?

The code I wrote that retrieves updates also handles persisting them. When this happens, EventSauce dispatches the event to any consumers registered with the dispatcher.

I wrote a consumer to handle the event I created earlier. This consumer uses the retrieveAllAfterVersion() method of my aggregate repository to retrieve the two most recent events for the event's associated aggregate root.

The consumer then passes these two states off to another class, which analyzes them and returns a value object populated with the results of the analysis, which include the relative balance change as well as new, cleared, and dropped postings.
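In sketch form, using EventSauce 1.x's Consumer interface and the invented class names from the earlier sketches (the analyzer and publisher here are my placeholders):

use EventSauce\EventSourcing\Consumer;
use EventSauce\EventSourcing\Message;

final class AccountUpdateConsumer implements Consumer
{
    public function __construct(
        private SqliteMessageRepository $repository,
        private AccountStateAnalyzer $analyzer, // hypothetical analysis class
        private UpdatePublisher $publisher      // hypothetical Slack publisher
    ) {
    }

    public function handle(Message $message): void
    {
        if (!$message->event() instanceof AccountStateCaptured) {
            return;
        }

        // fetch the two most recent events for this aggregate root
        $messages = iterator_to_array($this->repository->retrieveAllAfterVersion(
            $message->aggregateRootId(),
            $message->aggregateVersion() - 2
        ), false);

        if (count($messages) < 2) {
            return; // first capture for this account; nothing to compare against
        }

        [$previous, $current] = array_slice($messages, -2);

        $results = $this->analyzer->analyze($previous->event(), $current->event());
        $this->publisher->publish($results);
    }
}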

Analyzing Updates

I'm not sure I've tested all possible cases of the algorithm that derives these results -- I'm using PestPHP to write automated tests, a change from my default choice of PHPUnit -- but the algorithm seems to be working so far.

  1. From postings in the current state, remove any that are not present in the previous state (i.e. that don't have the same date, description, and amount). Designate these removed postings as new.
  2. From these new postings, remove any that have a settlement date and where a posting exists in the previous state without a settlement date and with the same amount. Designate these removed postings as cleared.
  3. From postings in the previous state, remove any that (a) don't have a settlement date, or have one after the earliest settlement date of the postings in the current state, and (b) are not present among the postings from the current state or the pending postings from the previous state. Designate these removed postings as dropped.

Note that this algorithm assumes that you retrieve enough transactions that there is always some overlap between the previous and current states. The number retrieved must account for the possibility of the current state containing a large number of pending transactions in a short timeframe, which can "push down" those in the previous state and remove any such overlap.
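In code, a loose sketch of the diff might look like the following. It assumes a Posting value object with matches() (same date, description, and amount), amount(), and settledOn() (null while pending) methods, and it simplifies step 3 by checking only the current state for matches.

/**
 * @param Posting[] $previous
 * @param Posting[] $current
 */
function diffPostings(array $previous, array $current): array
{
    $hasMatch = function (Posting $posting, array $postings): bool {
        foreach ($postings as $candidate) {
            if ($posting->matches($candidate)) {
                return true;
            }
        }

        return false;
    };

    // 1. new: current postings with no match in the previous state
    $new = array_filter($current, fn (Posting $p) => !$hasMatch($p, $previous));

    // 2. cleared: new postings that have settled and correspond to a posting
    //    that was still pending (no settlement date) in the previous state
    $cleared = array_filter($new, function (Posting $p) use ($previous) {
        if ($p->settledOn() === null) {
            return false;
        }

        foreach ($previous as $candidate) {
            if ($candidate->settledOn() === null && $candidate->amount() === $p->amount()) {
                return true;
            }
        }

        return false;
    });
    $new = array_diff_key($new, $cleared);

    // earliest settlement date still visible in the current state; anything
    // settled before it is assumed to have scrolled out of the window
    $settled = array_filter(array_map(fn (Posting $p) => $p->settledOn(), $current));
    $earliest = $settled === [] ? null : min($settled);

    // 3. dropped: previous postings that should still be visible in the
    //    current state but have no match there
    $dropped = array_filter($previous, function (Posting $p) use ($current, $earliest, $hasMatch) {
        $stillVisible = $p->settledOn() === null
            || $earliest === null
            || $p->settledOn() > $earliest;

        return $stillVisible && !$hasMatch($p, $current);
    });

    return ['new' => $new, 'cleared' => $cleared, 'dropped' => $dropped];
}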

Publishing Results

Once it receives these results, the consumer formats them into a message and uses symfony/notifier and symfony/slack-notifier to send them to a channel on our household Slack instance. The result looks something like this.

MoneyWatcher APP 7:06 PM

New Account Update
Account: [account name]
Balance: $#,###.##
Change: $-###.##

New:

  • $-###.## -- [posting description]

Cleared:

  • $-###.## -- [posting description]

Dropped:

  • $-###.## -- [posting description]

The consumer removes any empty sections to reduce noise, but otherwise, account updates result in a message like the one above showing up in Slack.
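Outside of a full Symfony application, the sending side can be as small as the sketch below; the DSN, token, channel, and $body variable are placeholders of mine.

use Symfony\Component\Notifier\Chatter;
use Symfony\Component\Notifier\Message\ChatMessage;
use Symfony\Component\Notifier\Transport;

// build a Slack transport from a DSN (requires symfony/slack-notifier)
$chatter = new Chatter(Transport::fromDsn('slack://xoxb-your-bot-token@default?channel=money-watcher'));

// $body is the formatted update text assembled by the consumer
$chatter->send(new ChatMessage($body));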

All this happens without me having to log into my accounts and check them manually or check my e-mail inbox for account alerts. It makes me aware of any unexpected transactions soon after they show up, and it makes it easier to review past activity and even annotate it by starting a thread on the relevant update.

Further Developments

The next thing I'm hoping to do with this is to add the ability for me and my wife to log postings (probably via Slack) that may not show up in our accounts soon after they happen, so that we have a balance that takes those postings into account. This will require me to change my existing consumer to discount these postings once the associated pending or cleared postings do show up in the accounts.

I'd also like to automate the process of adding advance postings like these for our monthly bills using an RRULE-based configuration, so that we'll always know what funds we have available after we pay those bills. One possible implementation could be a script that generates an image of a burn-down chart of these funds over the period between paychecks, giving us a better visualization of where we are financially.
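For the RRULE piece, the configuration might pair each bill with a recurrence rule and expand it into dates using a library such as simshaun/recurr; this is a hypothetical sketch, and both the library choice and the config shape are mine.

use Recurr\Rule;
use Recurr\Transformer\ArrayTransformer;
use Recurr\Transformer\Constraint\BetweenConstraint;

// hypothetical bill configuration
$bills = [
    ['description' => 'Mortgage', 'amount' => '-1500.00', 'rrule' => 'FREQ=MONTHLY;BYMONTHDAY=1'],
    ['description' => 'Electric', 'amount' => '-115.00', 'rrule' => 'FREQ=MONTHLY;BYMONTHDAY=15'],
];

$transformer = new ArrayTransformer();
$windowStart = new DateTime('today');
$windowEnd = new DateTime('+2 weeks'); // e.g. through the next paycheck

foreach ($bills as $bill) {
    $rule = new Rule($bill['rrule'], $windowStart);

    // expand the rule into the concrete dates that fall within the window
    foreach ($transformer->transform($rule, new BetweenConstraint($windowStart, $windowEnd)) as $occurrence) {
        // log an advance posting for $bill dated $occurrence->getStart()
    }
}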

Other possible additions are monitors for our credit utilization ratio, how much we're paying in interest, how soon we'll pay off specific credit lines assuming we pay on them consistently, and so forth. There's a wealth of possibilities for transforming and visualizing this data.

Outro

I'm not sure if I'll publish another post in this series, but depending on interest, it's certainly possible. Regardless, I hope you've enjoyed reading about my journey in developing this project. Thanks for reading!