Comparing directories with HTML output

If you have two copies of a static site and want to compare their output, here are some Linux command-line utilities I used for that task on a recent project.

HTML Tidy

I used HTML Tidy to normalize the HTML in both directories. This step formats the HTML files consistently, which helped reduce false positives.

I was happy to see that HTML Tidy has been resurrected. On Ubuntu, I installed it with:

apt-get install tidy

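I ran tidy against all the HTML files in each output directory. Apart from turning off warnings about proprietary attributes, I used the default configuration, and it worked fine for me. The -m flag modifies each file in place, and -q suppresses nonessential output: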

find . -name '*.html' -type f -print -exec tidy --warn-proprietary-attributes false -mq '{}' \;
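Because -m overwrites each file, it can be worth tidying a copy of the build output rather than the originals. The following is a minimal sketch of that idea, not the exact workflow I used: the directory names and the sample file are hypothetical placeholders, and the tidy step is skipped if the binary is not installed.

```shell
#!/bin/sh
# Hypothetical sample build output; replace with your own directory.
work=$(mktemp -d)
mkdir -p "$work/html"
printf '<!DOCTYPE html>\n<title>t</title>\n<p>Hello\n' > "$work/html/index.html"
cd "$work" || exit 1

# tidy -m rewrites files in place, so normalize a copy and keep the original.
cp -a html html-tidy

# Only run tidy when it is available on this machine.
if command -v tidy >/dev/null 2>&1; then
  find html-tidy -name '*.html' -type f \
    -exec tidy --warn-proprietary-attributes false -mq '{}' \;
fi
```

Repeat the copy-and-tidy step for the second directory, then diff the two tidied copies.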

Diff

You can compare two directories at the command line and redirect the output to a text file for review. Once you’re familiar with reading diff output, you can make sense of the lines that differ. I ended up using the --exclude and --ignore-matching-lines flags to get rid of differences caused by cache-busting flags or machine-generated CSS and JavaScript filenames. Doing so helped me focus on the changes that matter. In the command below, -r recurses into subdirectories, -u produces unified output, and -b and -w ignore whitespace differences.

diff -burw html/ html-new-nav/ > html-diff.txt
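The --exclude and --ignore-matching-lines flags mentioned above could be combined with that command roughly as follows. This is a sketch assuming GNU diff. The sample files, the 'app.*.css' glob, and the '[?]v=[0-9]' regex are made-up stand-ins for whatever generated filenames and cache-busting parameters your own site produces.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for the two site builds.
work=$(mktemp -d)
mkdir -p "$work/html/assets" "$work/html-new-nav/assets"
cd "$work" || exit 1

# index.html differs only in a cache-busting query string.
printf '<link href="site.css?v=111">\n<p>Hello</p>\n' > html/index.html
printf '<link href="site.css?v=222">\n<p>Hello</p>\n' > html-new-nav/index.html
# about.html has a real content change.
printf '<p>Old nav</p>\n' > html/about.html
printf '<p>New nav</p>\n' > html-new-nav/about.html
# Generated asset filenames differ between builds.
echo 'body{}' > html/assets/app.abc123.css
echo 'body{}' > html-new-nav/assets/app.def456.css

# --exclude skips files whose base name matches the glob.
# --ignore-matching-lines drops a hunk only when every changed line in it
# matches the regex.
# diff exits with 1 when it finds differences, so capture the status
# instead of treating it as a failure.
status=0
diff -burw \
  --exclude='app.*.css' \
  --ignore-matching-lines='[?]v=[0-9]' \
  html/ html-new-nav/ > html-diff.txt || status=$?
```

With these sample files, html-diff.txt should contain only the about.html change. Note that a cache-busting line sitting in the same hunk as a real change will still show up, because the whole hunk has to match before it is ignored.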

This diff output was useful for identifying which files were actually different. To see the changes in specific pairs of files, I like using a visual diff utility like WinMerge, Meld, or the diff built into PhpStorm.
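If all you need is the list of files that differ, diff's -q flag reports one line per differing pair instead of the full hunks. The sketch below again assumes GNU diff and uses hypothetical sample files; changed-files.txt is just an illustrative output name.

```shell
#!/bin/sh
# Hypothetical demo tree: one identical pair and one changed pair.
work=$(mktemp -d)
mkdir -p "$work/html" "$work/html-new-nav"
cd "$work" || exit 1
printf '<p>Same</p>\n' > html/same.html
printf '<p>Same</p>\n' > html-new-nav/same.html
printf '<p>Old nav</p>\n' > html/about.html
printf '<p>New nav</p>\n' > html-new-nav/about.html

# -q reports only whether each pair of files differs; -r, -b and -w behave
# as in the full diff. diff exits with 1 when differences exist, which is
# expected here rather than an error.
diff -rqbw html/ html-new-nav/ > changed-files.txt || true
cat changed-files.txt
```

Each reported pair can then be opened in a visual tool, for example with meld html/about.html html-new-nav/about.html.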