Serverless Forum Keyword Monitoring with PHP

Published in PHP, Web Scraping on Apr 6, 2020

I recently came across a post entitled "How To Monitor a Forum for Keywords Using Python and AWS Lambda" in my Twitter feed. While there are a number of things I like about Python as a language, the Python package management system isn't one of them. PHP is my lingua franca and Composer has ruined every other language package management system for me.

I read the post and thought, "I wonder how difficult it would be to do this in PHP?" My friend Rob Allen gave a talk on serverless PHP applications at the Midwest PHP 2020 conference recently. I recently published the second edition of my book on using PHP for web scraping. So, I rolled my sleeves up and got to it.

Prerequisites

As with the original post, we'll use the Serverless framework, but in conjunction with the Bref framework that Rob mentions in his talk.

To fetch and parse HTML from the forum, we'll use the Symfony HttpClient, BrowserKit, and CssSelector libraries. (If you have my book, you can find more information on these in Chapter 14.)

I personally like keeping my host operating system minimal and tidy, so I run as many things as I feasibly can using Docker containers. The easiest way I've found to do that with Bref is to use the same Docker image used for its development.

First, we'll install the dependencies mentioned above using Composer.

docker run --rm -it -v $(pwd):/var/task bref/dev-env \
    composer require bref/bref symfony/http-client \
    symfony/browser-kit symfony/css-selector

Next, we'll initialize the project to get our serverless configuration and initial boilerplate code, using defaults when prompted.

docker run --rm -it -v $(pwd):/var/task bref/dev-env \
    php ./vendor/bin/bref init

Opening the generated serverless.yml file, we see that Bref defaults to using the layer for PHP 7.3. We'll change it to use 7.4 instead, which is what its development Docker image uses, and update the function name while we're at it.

functions:
  indiehackers: # Change this to an appropriate function name
    handler: index.php
    description: ''
    layers:
      - ${bref:layer.php-74} # Change 73 on this line to 74

Keyword Monitoring

As in the original post, we're going to target Indie Hackers, specifically post titles on the landing page.

To do this, we'll open the generated index.php file and edit it to have these contents.

<?php declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

return function () {
    $base_url = 'https://www.indiehackers.com';
    $browser = new HttpBrowser(HttpClient::create());
    $browser->request('GET', $base_url);
    $crawler = $browser->getCrawler();
    $links = $crawler->filter('a.feed-item__title-link');
    $filtered = new CallbackFilterIterator(
        $links->getIterator(),
        function ($link) {
            return stripos($link->textContent, 'design') !== false;
        }
    );
    $formatted = array_map(
        function ($link) use ($base_url) {
            return $base_url . $link->getAttribute('href');
        },
        array_values(iterator_to_array($filtered))
    );
    return $formatted;
};

Using the aforementioned Symfony component libraries, we do the following.

  1. Fetch the landing page of the target web site.
  2. Parse the page for feed links.
  3. Filter those links for the phrase "design."
  4. Extract the link addresses and format them to be absolute URLs.
  5. Return the URLs.

If we wanted to integrate this with Slack, it would be simple to do so with the maknz/slack library. While it is no longer being maintained at the time of this writing, it works and makes sending Slack messages via an incoming webhook very simple.

The following code segment can be placed before the return statement in the above function.

<?php
$client = new Maknz\Slack\Client(WEBHOOK_URL);
foreach ($formatted as $link) {
    $client->send($link);
}

Deployment and Invocation

To have this function run once a day, open serverless.yml and modify the block you changed earlier to include the events block shown below.

functions:
  indiehackers:
    handler: index.php
    description: ''
    layers:
      - ${bref:layer.php-74}
    events: # Add this block
      - schedule: rate(1 day)

To deploy this, create AWS keys and run the following, replacing KEY and SECRET with the key and secret values for your IAM user respectively.

docker run --rm -it -v $(pwd):/var/task \
    --env AWS_ACCESS_KEY_ID=KEY \
    --env AWS_SECRET_ACCESS_KEY=SECRET \
    bref/dev-env serverless deploy

Finally, to invoke the function manually:

docker run --rm -it -v $(pwd):/var/task \
    --env AWS_ACCESS_KEY_ID=KEY \
    --env AWS_SECRET_ACCESS_KEY=SECRET \
    bref/dev-env serverless invoke -f indiehackers

Here's an example of this function's output.

[
    "https://www.indiehackers.com/post/advice-recommendations-for-improving-design-and-ux-of-dashboard-778bfdbe52"
]
 

Conclusion

The amount of effort involved in this seemed at least on par with what's required on the Python side, minus any potential package management system shenanigans. Overall, I was pleased with the results.

I hope you found this post useful. Thanks for reading!