Remove unapproved comments from WordPress exports

Recently, I needed to migrate some WordPress blogs to another system. WordPress provides a handy way to export content in its WXR format. However, it exports all comments, whether approved or not. That’s good from a data backup standpoint, but I didn’t need to import them, and they were bloating the XML file and slowing down my import. I needed a way to remove unapproved comments. The following code does that, using PHP’s DOMDocument extension to walk an input file. The cleaned-up content is sent to STDOUT so you can pipe it to another file to save it.

<?php
if (!isset($_SERVER['argv'][1])) {
    echo "\nSpecify input file\n";
    exit(1);
}

$infile = $_SERVER['argv'][1];

$doc = new DOMDocument();
$doc->recover = TRUE;
$doc->load($infile);

$comments = $doc->getElementsByTagName('comment');
$to_remove = array();

foreach ($comments as $comment) {
    $approved = $comment->getElementsByTagName('comment_approved');
    if ($approved->length > 0) {
        // can't remove nodes while looping over a live DOMNodeList,
        // so collect unapproved comments and remove them afterwards
        if (0 == $approved->item(0)->nodeValue) {
            $to_remove[] = $comment;
        }
    }
}

if (count($to_remove)) {
    foreach ($to_remove as $elt) {
        $elt->parentNode->removeChild($elt);
    }
}

$doc->formatOutput = true;
$doc->preserveWhiteSpace = false;
echo $doc->saveXML();
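
If you save the script as strip-comments.php (the filename is arbitrary), running it and redirecting the output to a new file looks like this:

php strip-comments.php wordpress-export.xml > cleaned-export.xml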

Carousels are bad for Accessibility

I’ve never been a fan of carousels. They’ve become a crutch for designers and clients who want to spice up a homepage presentation with something that moves. ShouldIUseACarousel was shared by a lot of folks I follow, and .net magazine did an interview with the accessibility expert who created the site.

JS: Carousels are seemingly an easy fix to two universal design problems: how do I fit so much content into so little space, and how do I decide what content is the most important? It’s easy to justify away the usability issues of a carousel when you consider the benefits of presenting multiple content pieces in such little real estate

From: Accessibility expert warns: stop using carousels | News | .net magazine

From an information architecture perspective, Travis Lafleur provides a better alternative. In spirit, it’s very similar to the approach we used on DCUnited.com back when I was there.

Consider this simple, straightforward alternative. First, determine essential content to be featured on the page. Keep in mind the desired outcomes of the project as a whole, the mindset and goals of your users, and what actions you want them to take on the particular page. Next, prioritize. This can be as simple as assigning numbers to each item. If users notice only one thing on the page, what should that be? If they notice two, what should the second be? – and so on. If you’re having trouble prioritizing – or have too many items to promote – consider breaking the content into logical groups and spreading it over multiple pages.

From: Biggs|Gilmore – A Critique of Carousels

It turns out they also don’t lead users to take meaningful actions.

I’m sure you’ve come across dozens, if not hundreds of image sliders or carousels (also called ‘rotating offers’). You might even like them. But the truth is that they’re conversion killers.

From: Don’t Use Automatic Image Sliders or Carousels, Ignore the Fad

Eric Runyon has the stats to back this up; click through to see how many people click beyond the first slide.

Carousels. That gem of a web feature that clients love, and many developers hate. One thing is certain, they are the darling of HigherEd. In fact, they’re loved so much, I’ve been assigned many times to retroactively add them to sites that have already been live for years. This led me ask how much are users really interacting with the carousels.

From: Carousel Interaction Stats | WeedyGarden

Finally, Jack Shepard lists better alternatives to using a carousel.

Let me preface this by saying this discovery is not anything new, however unless you’re really geeking out you won’t be in the know on this stuff.

From: The cure for the common image slider carousel

Highest attended soccer matches in the USA

This started as a reply to a reddit poster claiming a USA-Turkey match in 2010 was “the highest attended soccer match ever”.

According to this, the attendance was 55,407. Nice, but not the highest ever for soccer.
http://www.ussoccer.com/news/mens-national-team/2010/05/turkey-game-report.aspx

But it’s not the largest for soccer that I can find. Portugal played the USA at RFK during the 1996 Olympics, and attendance was 58,012.
http://www.dcconvention.com/Venues/RFKStadium/UniqueSpaces.aspx

Also, MLS Cup 1997 at RFK, featuring home side D.C. United, was attended by 57,431 people.
http://en.wikipedia.org/wiki/MLS_Cup_’97

Also, the LA Coliseum would sell out for soccer matches, albeit ones not featuring the USA. Capacity is 92k
http://articles.latimes.com/1999/apr/29/news/mn-32450

Turns out the USSF has a page with attendance records, and neither the USA-Turkey game nor the others I mentioned above would make it, since the minimum cutoff is around 78,000. Maybe the USA-Turkey game was the best-attended USMNT match of the previous World Cup cycle?
http://www.ussoccer.com/teams/us-men/records/attendance-records/largest-crowds-in-us.aspx

Quick mysqldump snapshot script

I find myself needing this often on different servers, so you may find it useful too. The script creates a timestamped dump of a MySQL database in the current directory. It assumes it runs as a user who can connect to the database; otherwise, you can set credentials using the -u and -p command line switches for mysqldump.

#!/bin/bash
# retrieve a database and gzip it

E_BADARGS=65

if [ $# -ne 1 ]
then
  echo "Usage: `basename $0` {database_name}"
  exit $E_BADARGS
fi

DB="$1"

DATEZ=`date +%Y-%m-%d-%H%M`
OUTFILE="$DB.$DATEZ.sql.gz";

echo "mysqldump for $DB to $OUTFILE"
sudo mysqldump --opt -e -C -Q "$DB" | gzip -c > "$OUTFILE"
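
As an example, if the script is saved as db-snapshot.sh and your database is named blog (both made-up names here), a run looks like this; add -u and -p to the mysqldump line if you need explicit credentials:

./db-snapshot.sh blog
# creates something like blog.2012-11-05-1430.sql.gz in the current directory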

Extract images from an HTML snippet

The function below takes an HTML fragment and returns an array of the useful images it finds, or false if there are none.

<?php
/**
 * extractImages
 *
 * @param $text
 * @return array|bool
 */
function extractImages($text)
{
    $header = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />';
    $text = $header . $text;
    $dom = new DOMDocument();
    if (@$dom->loadHTML($text)) {
        $xpath = new DOMXpath($dom);
        if ($images = $xpath->evaluate("//img")) {
            $result = array();
            foreach ($images as $i => $img) {
                $ht = $img->getAttribute('height');
                $wd = $img->getAttribute('width');
                // if height & width are both 1 it's a tracking pixel (or a bug), ignore it
                if (1 === (int)$ht && 1 === (int)$wd) {
                    continue;
                }
                // if it doesn't end in an image file extension
                // then ignore it
                $src = $img->getAttribute('src');
                if (!preg_match('/\.(png|jpg|gif)$/i', $src)) {
                    continue;
                }
                // skip relative urls rather than trying to figure out the full url to the image
                if (!preg_match('#^https?://#', $src)) {
                    continue;
                }
                $alt = $img->getAttribute('alt');
                $result[$i] = array('src' => $src, 'alt' => $alt, 'height' => $ht, 'width' => $wd);
            }
            }
            if (!empty($result)) {
                return $result;
            }
        }
    }
    return false;
}
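
For example, feeding it a small snippet (the markup below is just an illustration) gives back the image details, while tracking pixels and relative src values are skipped:

$html = '<p>Hello <img src="http://example.com/photo.jpg" alt="A photo" width="400" height="300" /></p>';
$images = extractImages($html);
// $images[0] = array('src' => 'http://example.com/photo.jpg', 'alt' => 'A photo', 'height' => '300', 'width' => '400')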

Measuring Site Speed locally

This is a cool utility that uses YSlow and PhantomJS to measure your site’s speed across many pages. It should be good for identifying slow individual pages as well as practices that impact multiple pages.

This is how it works: You feed it with a start URL and how deep you want it to crawl. The pages are fetched and all links within the same domain from that page are analyzed, and a couple of HTML pages are created with the result. Sounds simple?

From: Performance Calendar » Do you sitespeed?

Smelly PHP code

Adam Culp posted the third article in his Clean Development Series this week, Dirty Code (how to spot/smell it). When you read it, you should keep in mind that he is pointing out practices which correlate with poorly written code, not prescribing a list of things to avoid. It’s a good list of things to look for, and it engendered quite a discussion in our internal Musketeers IRC channel.

Comments are valuable

Using good names for variables, functions, and methods does make your code self-commenting, but often that is not sufficient. Writing good comments is an art: too many comments get in the way, but a lack of comments is just as bad. Code can be dense to parse, and a comment will help you out. Comments also let you quickly scan through a longer code block, just skimming the comments, to find EXACTLY the bit you need to change/debug/fix/etc. Of course, that last benefit you can also get by breaking up large blocks of code into functions.

Comments should not explain what the code does; they should capture the “why” of how you are solving a problem. For example, if you’re looping over something, a bad comment is “// loop through results”, while a good comment is “// loop through results and extract any image tags”.
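
To make that concrete with a contrived snippet (the variables here are invented), the same loop reads very differently depending on which kind of comment you write:

// bad: restates what the code obviously does
// loop through results
foreach ($results as $row) {
    $images = extractImages($row['body']);
}

// better: captures why the loop exists
// loop through results and extract any image tags
foreach ($results as $row) {
    $images = extractImages($row['body']);
}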

Using Switch Statements

You definitely should not take this item in his list to mean that “Switch statements are evil.” You could write equally bad code with a long block of if/elseif statements. If you’re using them within a class, you’re better off using polymorphism, as he suggests, or perhaps coding to an interface instead of coding around multiple implementations.
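
Here’s a rough sketch of what coding to an interface looks like in practice; the payment example and all the names in it are invented for illustration:

// smelly: every new payment type means editing this switch
function chargeByType($type, $amount)
{
    switch ($type) {
        case 'credit':
            return "charged $amount to a credit card";
        case 'paypal':
            return "charged $amount via paypal";
        default:
            throw new InvalidArgumentException("Unknown payment type: $type");
    }
}

// better: each implementation knows how to charge itself
interface PaymentMethod
{
    public function charge($amount);
}

class CreditCardPayment implements PaymentMethod
{
    public function charge($amount)
    {
        // credit-card-specific logic lives here
        return "charged $amount to a credit card";
    }
}

class PaypalPayment implements PaymentMethod
{
    public function charge($amount)
    {
        // paypal-specific logic lives here
        return "charged $amount via paypal";
    }
}

// the calling code no longer cares which type it was handed;
// adding a new payment type means adding a class, not editing a switch
function processPayment(PaymentMethod $payment, $amount)
{
    return $payment->charge($amount);
}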

Other code smells

In reviewing the article, I thought of other smells that indicate bad code. Some are minor, but if they show up frequently, you know you’re dealing with someone who knows little more than how to copy and paste code from the Interwebs. These include:

  • Error suppression with @. There are very, very, very few cases where it’s OK to suppress an error instead of handling it or preventing it in the first place.
  • Using globals directly. Anything in $_GET, $_POST, $_REQUEST, or $_COOKIE should be filtered and validated before you use it. ’Nuff said.
  • Deep class hierarchy. A deep class hierarchy likely means you should be using composition instead of inheritance to change class behaviors.
  • Lack of prepared DB statements. Building SQL queries as strings instead of using PDO or the mysqli extension’s prepared statements can open up SQL injection vulnerabilities (see the sketch after this list).
  • Antiquated PHP practices. A catch-all for things we all did nearly a decade ago: depending on register_globals being on, using “or die()” to catch errors, using the mysql_* functions. PHP has evolved; there’s no reason for you not to evolve with it.
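
To sketch the prepared statement point (the DSN, credentials, and table name below are placeholders), compare building the query as a string with binding a parameter through PDO:

// vulnerable: user input is concatenated straight into the SQL
$sql = "SELECT * FROM users WHERE email = '" . $_GET['email'] . "'";

// safer: a prepared statement with a bound parameter
// (and per the earlier point, filter/validate $_GET['email'] before using it)
$pdo = new PDO('mysql:host=localhost;dbname=example', 'dbuser', 'dbpass');
$stmt = $pdo->prepare('SELECT * FROM users WHERE email = :email');
$stmt->execute(array(':email' => $_GET['email']));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);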

That’s generally what I look for when evaluating code quality. What are some things I missed?

Building CandiData

This past weekend, my colleague and friend Sandy Smith and I participated in Election Hackathon 2012 (read his take on the hackathon). We built our first public Musketeers.me product, Candidata.me. This was my first hackathon, and it was exciting and exhausting to bring something to life in little more than 24 hours. Our idea combined a number of APIs to produce a profile for every candidate running for President or Congress in the United States. The seed of the idea was good enough that we were among the 10 projects chosen to present to the group at large on Sunday afternoon.

Under the Hood and Hooking Up with APIs

We used our own PHP framework, Treb, as our foundation. It provides routing by convention, controllers, db access, caching, and a view layer. Along the way, we discovered a small bug in our db helper function that failed because of the nuances of autoloading.

I quickly wrote up a base class for making HTTP GET requests to REST APIs. The client uses PHP’s native stream functions to make the HTTP requests, which I’ve found easier to work with than the cURL extension; the latter is a cumbersome wrapper around the underlying cURL functionality.

To be good API clients, we cached the request responses in Memcached for anywhere between an hour and a month, depending on how often we expected each API’s response to change.
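
A minimal sketch of the idea, with invented names and a one-hour default TTL (this is not the actual class we wrote, just the approach), looks something like this:

function cachedApiGet(Memcached $cache, $url, $ttl = 3600)
{
    $key = 'api_' . md5($url);
    $cached = $cache->get($key);
    if ($cached !== false) {
        return $cached;
    }

    // PHP's http:// stream wrapper makes the GET request, no cURL required
    $context = stream_context_create(array(
        'http' => array('method' => 'GET', 'timeout' => 10),
    ));
    $response = file_get_contents($url, false, $context);

    if ($response !== false) {
        $cache->set($key, $response, $ttl);
    }

    return $response;
}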

Sandy also took on the tedious – but not thankless – task of creating a list of all the candidates, which we imported into a simple MySQL table. For each candidate, we could then pull in information such as:

  • Polling data from Huffington Post’s Pollster API, which we then plotted using jqplot. Polls weren’t available for every race, so we had to manually match available polls to candidates.
  • Basic biographical information from govtrack.us.
  • Campaign finance and fact-checked statements from the Washington Post’s APIs.
  • Latest news, courtesy of search queries to NPR’s Story API.
  • A simple GeoIP lookup on the homepage to populate the Congressional candidates when a user loads the page.

Bootstrap for UI goodness

I used this opportunity to check out Twitter’s Bootstrap framework. It let us get a clean design from the start, and we were able to use its classes and responsive grid to make the site look really nice on tablets and smartphones too. I found it a lot more feature-filled than Skeleton, which is just a responsive CSS framework and lacks the advanced UI elements found in Bootstrap, like navigation, drop-downs, and modals.

Improvements that we could make

We’ve already talked about a number of features we could add or rework to make the site better. Of course, given the shelf life this app will have after November 6th, we may not get to some of these.

  • Re-work the state navigation on the homepage so that it plays nice with the browser’s history. We did a simple ajax query on load, but a better way would be to put the state in the URL hash, e.g. “http://candidata.us/#VA”, and then pull in the list of candidates. This would also mean we only initiate the GeoIP lookup when the hash is missing.
  • Add a simple way to navigate to opponents from a candidate’s page.
  • Allow users to navigate to other state races from a candidate’s page.
  • Get more candidate information, ideally something that can provide us photos of each candidate. Other apps at the hackathon had this, but we didn’t find the API in time. Sunlight provides photos for Members of Congress.
  • Pull in statements made by a candidate via WaPo’s Issue API, maybe running it through the Trove API to pull out categories, people, and places mentioned in the statement.
  • Use the Trove API to organize or at least tag latest news stories and fact checks by Category.

Overall, I’m very happy with what we were able to build in 24 hours. The hackathon also exposed me to some cool ideas and approaches, particularly the visualizations done by some teams. I wish I’d spent a little more time meeting other people, but my energy was really focused on coding most of the time.

Please check out CandiData.me and let me know what you think either via email or in the comments below.

Drupal Workflow and Access control

Large organizations want, or may actually need, strict access control and content review workflows to manage the publishing process on their websites. Drupal’s default roles and permissions system is designed to handle the simplest of setups. However, there is a wide variety of modules that address these needs: many of them overlap, some complement each other, while others are abandoned and haven’t been ported to Drupal 7. In this article, I’ll look at some active modules that meet these requirements.

Revisioning for Basic Workflow

Unless your web publishing workflow is very complicated – in which case, I think you have other problems – the Revisioning module should suit a basic author-editor review setup. You’ll need at least two roles: an author role that can create and update but not publish new nodes, and an editor role that reviews and publishes changes made by authors.

Authors write content that prior to being made publicly visible must be reviewed (and possibly edited) by moderators. Once the moderators have published the content, authors should be prevented from modifying it while “live”, but they should be able to submit new revisions to their moderators.

Once installed, the module provides a view to show revision status at /content-summary.

Content Access for write privileges

The Content Access module allows more fine-grained control over view, edit, and delete permissions. Furthermore, it allows you to set access controls on a per-node basis. This lets you restrict the set of pages, stories, or other content types that a single role can change.

This module allows you to manage permissions for content types by role and author. It allows you to specify custom view, edit and delete permissions for each content type. Optionally you can enable per content access settings, so you can customize the access for each content node.

Related Modules

If you need “serious” workflow in your Drupal site, there is the aptly named Workflow module, as well as the newer State Machine and Workbench Moderation modules. These modules seem to be actively developed and, with enough time, tweaking, and experimentation, should help solve many workflow issues.