Automating the Web


  • Web scraping (extracting data from web sites)
  • Programmatic interaction with web sites
  • Consuming web services / APIs
  • Acceptance testing

The Book Was Better

The book cover for 'Web Scraping with PHP, 2nd Edition'



Packages: find, install, and autoload them

Composer logo

Obligatory Disclaimer

  • Some uses of this content may be legally unclear or unlawful
  • This content is intended to be strictly educational, not prescriptive or legal advice
  • If in doubt, consult a lawyer (i.e. not me)

This is Lafayette

A map of Louisiana with Lafayette Parish marked in red

This is Lafayette 911

A screenshot of lafayette911.org


This is Ray

Raymond Camden


Ray Created a Viewer

Screenshot of Ray's 911 viewer mapping incidents using Google Maps

Blog Post circa 2010

Six Months Later...

... He Had a Lot of Data

Visualizations of 911 statistics gathered by Ray

Blog Post circa 2010

Can We Recreate It?

Use the Source, Luke

Chrome / Firefox: Right-click > View Page Source

Into the Fire & Frames

A screenshot of lafayette911.org with no incidents listed

Where's the Data?

Some Possible Sources

Learn the DevTools

  • Chrome: View > Developer > Developer Tools
  • Firefox: Tools > Browser Tools > Web Developer Tools

Check for XHRs

  1. Chrome / Firefox: DevTools > Network tab
  2. Chrome: Fetch/XHR filter, Firefox: XHR filter
  3. Chrome / Firefox: click request in table

Network tab in Firefox Developer Tools showing XHRs made by lafayette911.org

Inspect the Request

Firefox Developer Tools showing XHRs made by lafayette911.org

POST /L911/Service2.svc/getTrafficIncidents HTTP/2
Host: apps.lafayettela.gov


  • POST = method or operation
  • /L911/Service2.svc/getTrafficIncidents = Uniform Resource Identifier (URI)
  • HTTP/2 = client protocol and version
  • Host = header name
  • apps.lafayettela.gov = header value
  • ... = request body
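
The parts listed above can be pulled apart in code. This is a minimal sketch, assuming the request is available as raw text in the form DevTools displays; `parseRawRequest()` is a hypothetical helper, not part of any library.

```php
<?php
declare(strict_types=1);

/**
 * Split a raw HTTP request into method, URI, protocol, headers, and body.
 * Hypothetical helper for illustration only.
 */
function parseRawRequest(string $raw): array
{
    // The head and the body are separated by the first blank line.
    $parts = preg_split('/\r?\n\r?\n/', $raw, 2);
    $lines = preg_split('/\r?\n/', trim($parts[0]));

    // The request line holds the method, URI, and protocol version.
    [$method, $uri, $protocol] = explode(' ', array_shift($lines), 3);

    // Every remaining line is a "name: value" header.
    $headers = [];
    foreach ($lines as $line) {
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'method' => $method,
        'uri' => $uri,
        'protocol' => $protocol,
        'headers' => $headers,
        'body' => $parts[1] ?? '',
    ];
}

$request = parseRawRequest(
    "POST /L911/Service2.svc/getTrafficIncidents HTTP/2\n"
    . "Host: apps.lafayettela.gov\n"
    . "\n"
);
print_r($request);
```

Header names are lowercased because HTTP header names are case-insensitive.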

Inspect the Response

Firefox Developer Tools showing an XHR response

  • HTTP/2 = server protocol and version
  • 200 = status code
  • OK = status description
  • cache-control = header name
  • private = header value
  • {"d":"..."} = response body
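
The response breaks down the same way. The sketch below assumes a raw response with the shape shown above; the markup inside `d` is a made-up placeholder, since the real value is elided on the slide.

```php
<?php
declare(strict_types=1);

// Hypothetical raw response; only its overall shape comes from the slide.
$raw = "HTTP/2 200 OK\n"
    . "cache-control: private\n"
    . "\n"
    . '{"d":"<table><tr><td>Example</td></tr></table>"}';

// The head and the body are separated by the first blank line.
[$head, $body] = preg_split('/\r?\n\r?\n/', $raw, 2);
$headerLines = preg_split('/\r?\n/', $head);

// Status line: protocol, status code, status description.
[$protocol, $code, $description] = explode(' ', array_shift($headerLines), 3);

// JSON_THROW_ON_ERROR turns malformed JSON into an exception.
$data = json_decode($body, true, 512, JSON_THROW_ON_ERROR);
$markup = $data['d'];

printf("%s %s %s\n", $protocol, $code, $description);
echo $markup, "\n";
```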


Mimic the Request

Streams / Filesystem
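
A minimal sketch of mimicking the request with PHP streams. The host and path come from the request shown earlier; the `https` scheme, the `Content-Type` header, the timeout, and the empty body are assumptions, and `buildContext()` and `fetchIncidents()` are hypothetical names.

```php
<?php
declare(strict_types=1);

const INCIDENTS_URL = 'https://apps.lafayettela.gov/L911/Service2.svc/getTrafficIncidents';

/**
 * Build a stream context that sends a POST request.
 * The header and the empty default body are assumptions.
 */
function buildContext(string $body = '')
{
    return stream_context_create([
        'http' => [
            'method' => 'POST',
            'header' => "Content-Type: application/json\r\n",
            'content' => $body,
            'timeout' => 10,
            'ignore_errors' => true, // still return the body on 4xx/5xx responses
        ],
    ]);
}

/**
 * Send the request and return the raw response body.
 */
function fetchIncidents(): string
{
    $response = file_get_contents(INCIDENTS_URL, false, buildContext());
    if ($response === false) {
        throw new RuntimeException('Request failed');
    }
    return $response;
}

// Inspect the configured options without touching the network.
$options = stream_context_get_options(buildContext());
echo $options['http']['method'], "\n";
```

Calling `fetchIncidents()` performs the actual request; compare what it sends against DevTools before relying on the result.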

Text Fu

"Programmers manipulate text the same way woodworkers shape wood."
David Thomas and Andrew Hunt, "The Pragmatic Programmer: Your Journey to Mastery, 20th Anniversary Edition"

Extract the Markup


Inspect the Markup

Handle Malformations

  1. Install tidy extension
  2. Optionally, configure its options
  3. Parse markup
  4. Verify that malformations don't cause data loss
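
The steps above might look like the sketch below. The sample fragment and the chosen options are assumptions, and `repairMarkup()` is a hypothetical helper that returns its input unchanged when the tidy extension is not loaded.

```php
<?php
declare(strict_types=1);

/**
 * Repair malformed markup with tidy when the extension is available.
 */
function repairMarkup(string $markup): string
{
    if (!extension_loaded('tidy')) {
        return $markup;
    }

    $config = [
        'show-body-only' => true, // emit only the contents of <body>
        'wrap' => 0,              // disable line wrapping
    ];

    $repaired = tidy_repair_string($markup, $config, 'utf8');
    return $repaired === false ? $markup : $repaired;
}

// Hypothetical malformed fragment: unclosed <td> and <tr> tags.
$malformed = '<table><tr><td>One<td>Two</table>';
$repaired = repairMarkup($malformed);
echo $repaired, "\n";

// Step 4: confirm that the repair did not drop any data.
foreach (['One', 'Two'] as $text) {
    if (strpos($repaired, $text) === false) {
        throw new RuntimeException("Lost data: $text");
    }
}
```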

Parse the Markup

DOM / libxml

For complex queries: XPath + DOMXPath
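
A minimal sketch of parsing with DOM and querying with DOMXPath. The table markup and the `cause` and `address` field names are assumptions; confirm the real structure in DevTools first.

```php
<?php
declare(strict_types=1);

// Hypothetical markup standing in for the extracted response.
$markup = '<table>'
    . '<tr><td>Stalled Vehicle</td><td>100 MAIN ST</td></tr>'
    . '<tr><td>Vehicle Accident</td><td>200 OAK AVE</td></tr>'
    . '</table>';

// Collect libxml warnings instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($document);

$incidents = [];
foreach ($xpath->query('//tr') as $row) {
    // Query the cells relative to the current row.
    $cells = $xpath->query('./td', $row);
    $incidents[] = [
        'cause' => trim($cells->item(0)->textContent),
        'address' => trim($cells->item(1)->textContent),
    ];
}

print_r($incidents);
```

Passing the row as the second argument to `query()` scopes the `./td` expression to that row.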

To use CSS selectors: symfony/css-selector

MDN: Tutorial, Reference

CSS Selectors

A comic strip by Julia Evans on CSS selectors

Trim the Address

String / Multibyte String
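
Scraped text often carries stray whitespace, including non-breaking spaces that a plain `trim()` leaves behind. A minimal sketch, assuming a made-up address value; `cleanText()` is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Collapse runs of whitespace (including non-breaking spaces) to a
 * single space, then strip the leading and trailing space.
 */
function cleanText(string $text): string
{
    // The u modifier makes the pattern treat the input as UTF-8.
    $collapsed = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;
    return trim($collapsed);
}

// Hypothetical raw value with a non-breaking space and a doubled space.
$raw = "\u{00A0} 100  MAIN ST \n";
$clean = cleanText($raw);
echo $clean, "\n";
```

For case changes or length checks on non-ASCII text, the Multibyte String functions such as `mb_strtoupper()` are the safer choice.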

Parse the Address


Regular Expressions

Pattern Syntax / Modifiers
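
A sketch of parsing an address with named groups and the `x` modifier, which permits whitespace and comments inside the pattern. The address format (optional number, street, optional cross street after `&`) is an assumption, and `parseAddress()` is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Split an address into number, street, and optional cross street.
 * Returns null when the address does not match the assumed format.
 */
function parseAddress(string $address): ?array
{
    $pattern = '/
        ^(?<number>\d+)?\s*       # optional block or house number
        (?<street>[^&]+?)         # street name, up to an optional "&"
        (?:\s*&\s*(?<cross>.+))?  # optional cross street
        $/x';

    // PREG_UNMATCHED_AS_NULL reports unmatched optional groups as null.
    if (preg_match($pattern, trim($address), $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null;
    }

    return [
        'number' => $m['number'],
        'street' => trim($m['street']),
        'cross' => $m['cross'] ?? null,
    ];
}

// Hypothetical sample values.
$withCross = parseAddress('1400 W PINHOOK RD & KALISTE SALOOM RD');
$withoutCross = parseAddress('200 OAK AVE');
print_r($withCross);
print_r($withoutCross);
```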

Parse the Date

Date and Time
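
A minimal sketch of parsing a timestamp with the Date and Time extension. The input format is an assumption; check it against the real response. Lafayette, Louisiana observes US Central time, hence `America/Chicago`.

```php
<?php
declare(strict_types=1);

$timezone = new DateTimeZone('America/Chicago');

// Hypothetical timestamp; the real format may differ.
$reported = '03/15/2024 02:45:10 PM';

// createFromFormat() returns false rather than guessing at bad input.
$date = DateTimeImmutable::createFromFormat('m/d/Y h:i:s A', $reported, $timezone);
if ($date === false) {
    throw new RuntimeException("Unrecognized date: $reported");
}

$utc = $date->setTimezone(new DateTimeZone('UTC'));

echo $date->format(DATE_ATOM), "\n"; // 2024-03-15T14:45:10-05:00
echo $utc->format(DATE_ATOM), "\n";  // 2024-03-15T19:45:10+00:00
```

Converting to UTC before storing avoids ambiguity around daylight saving transitions.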

Thank This Guy

Derick Rethans

We Did It!

Now What?

  • Store data in JSON files or a database
  • Put it behind a web server or API
  • Add a Google Maps frontend
  • Profit!
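
For the first item, a minimal sketch of storing parsed incidents in a JSON file. The record fields and the file location are assumptions.

```php
<?php
declare(strict_types=1);

// Hypothetical parsed records.
$incidents = [
    [
        'cause' => 'Stalled Vehicle',
        'address' => '100 MAIN ST',
        'reported' => '2024-03-15T19:45:10+00:00',
    ],
];

$path = sys_get_temp_dir() . '/incidents.json';

// LOCK_EX prevents concurrent writers from interleaving output.
$json = json_encode($incidents, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR);
file_put_contents($path, $json, LOCK_EX);

// Read the file back to confirm that it round-trips.
$restored = json_decode(file_get_contents($path), true, 512, JSON_THROW_ON_ERROR);
print_r($restored);
```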

Lafayette Traffic

Google Maps showing Lafayette, LA with traffic incidents marked
Google Maps showing Lafayette, LA with a traffic incident overlay
Lafayette Traffic showing a live city traffic camera feed

Google Play / GitHub: Android, Data

So What?

On JIRA, Briefly

Screenshot of a project board in JIRA


Copy as cURL

  • Chrome: DevTools > Network tab > right-click request > Copy > Copy as cURL
  • Firefox: DevTools > Network tab > right-click request > Copy Value > Copy as cURL

Recreate Requests

  1. Install frizz925/curl-parser
  2. Provide copied cURL command to parser
  3. Extract request data
  4. Get request object

Send a Request

  1. Install guzzlehttp/guzzle
  2. Create the client
  3. Configure options and send the request

Using Responses

Debug Requests

  1. Install alexkart/curl-builder
  2. Download cURL
  3. Convert request to server request
  4. Convert server request to cURL command
  5. Tweak cURL command as needed
       • Add the -v flag to get more verbose output
       • If the request uses the POST method and has no body, add -X flag with POST as its argument
  6. Run the cURL command

Repeat Requests

  • Chrome: DevTools > Network tab > right-click in request pane > Save all as HAR with content
  • Firefox: DevTools > Network tab > right-click in request pane > Save All as HAR
  • HAR Analyzer
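
A saved HAR file is plain JSON, so its requests can be listed with `json_decode()`. This is a minimal sketch using a trimmed, hypothetical HAR document; real exports carry many more fields, and `listRequests()` is a hypothetical helper.

```php
<?php
declare(strict_types=1);

// Minimal hypothetical HAR document.
$har = <<<'JSON'
{
  "log": {
    "version": "1.2",
    "entries": [
      {
        "request": {
          "method": "POST",
          "url": "https://apps.lafayettela.gov/L911/Service2.svc/getTrafficIncidents",
          "headers": [{"name": "Accept", "value": "application/json"}],
          "postData": {"mimeType": "application/json", "text": ""}
        },
        "response": {"status": 200}
      }
    ]
  }
}
JSON;

/**
 * Reduce each HAR entry to the parts needed to repeat the request.
 */
function listRequests(string $har): array
{
    $data = json_decode($har, true, 512, JSON_THROW_ON_ERROR);

    $requests = [];
    foreach ($data['log']['entries'] ?? [] as $entry) {
        $request = $entry['request'];
        $requests[] = [
            'method' => $request['method'],
            'url' => $request['url'],
            // Flatten the name/value pairs into an associative array.
            'headers' => array_column($request['headers'] ?? [], 'value', 'name'),
            'body' => $request['postData']['text'] ?? '',
            'status' => $entry['response']['status'] ?? null,
        ];
    }

    return $requests;
}

$requests = listRequests($har);
print_r($requests);
```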

Bring Out the Big Guns

Automate the Browser

This is Fine

WebDriver Flag

Other WebDriver Flag

Headless Mode

Quick Demo

Debugging Tips

Free Project Idea

  • PsySH
  • Add integration with symfony/panther
  • e.g. custom commands to interactively fetch pages, filter and interact with elements, etc.

Other Resources

Leonardo DiCaprio in a tuxedo smiling and raising a glass toward the viewer

