Remove unapproved comments from WordPress exports

Recently, I needed to migrate some WordPress blogs to another system. WordPress provides a handy way to export content in its WXR format. However, it exports all comments, whether approved or not. This is good from a data backup standpoint, but I didn’t need to import the unapproved ones. They were also bloating the XML file and slowing down my import. I needed a way to remove unapproved comments. The following code does that, using PHP’s DOMDocument extension to walk an input file. The cleaned-up content is sent to STDOUT so you can redirect it to another file to save.

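<?php
// Assumption: the export file path is passed as the first CLI argument.
$doc = new DOMDocument();
$doc->recover = TRUE;              // tolerate minor markup errors in the export
$doc->preserveWhiteSpace = false;  // set before load() so formatOutput can re-indent
$doc->load($argv[1]);

$comments = $doc->getElementsByTagName('comment');
$to_remove = array();

foreach ($comments as $comment) {
    $approved = $comment->getElementsByTagName('comment_approved');
    if ($approved->length > 0) {
        $app = $approved->item(0);

        // can't remove nodes while looping over the live node list,
        // so collect them first; anything other than "1"
        // (for example "0", "spam" or "trash") counts as unapproved
        if ('1' !== trim($app->nodeValue)) {
            $to_remove[] = $comment;
        }
    }
}

if (count($to_remove)) {
    foreach ($to_remove as $elt) {
        $elt->parentNode->removeChild($elt);
    }
}

$doc->formatOutput = true;
echo $doc->saveXML();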
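
If you prefer a single query over nested loops, the same filtering could be sketched with DOMXPath. This is only a sketch under a few assumptions: remove_unapproved_comments() is a hypothetical helper name, the sample XML is a tiny stand-in for a real WXR export, and its namespace URI is a placeholder. Matching on local-name() avoids hard-coding the wp: namespace URI.

```php
<?php
// Sketch: remove unapproved comments with one XPath query.
// remove_unapproved_comments() is a hypothetical helper, not part of WordPress or PHP.
function remove_unapproved_comments(DOMDocument $doc): int
{
    $xpath = new DOMXPath($doc);
    // Select comment elements whose comment_approved child is anything but "1".
    $query = '//*[local-name()="comment"]'
           . '[*[local-name()="comment_approved"][normalize-space(.) != "1"]]';

    // Copy the matches into a plain array before changing the tree.
    $unapproved = iterator_to_array($xpath->query($query));
    foreach ($unapproved as $comment) {
        $comment->parentNode->removeChild($comment);
    }
    return count($unapproved);
}

// Tiny stand-in for a WXR export; the namespace URI is only a placeholder.
$sample = <<<XML
<rss xmlns:wp="http://example.com/wp-export">
  <channel><item>
    <wp:comment><wp:comment_id>1</wp:comment_id><wp:comment_approved>1</wp:comment_approved></wp:comment>
    <wp:comment><wp:comment_id>2</wp:comment_id><wp:comment_approved>0</wp:comment_approved></wp:comment>
    <wp:comment><wp:comment_id>3</wp:comment_id><wp:comment_approved>spam</wp:comment_approved></wp:comment>
  </item></channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($sample);
$removed = remove_unapproved_comments($doc);
echo $doc->saveXML();
```

With this sample, comments 2 and 3 are dropped and only comment 1 remains in the output. For a real export you would swap loadXML($sample) for load() on your file.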

Frustration with Drupal core growing

When a prominent developer and contributor warns that Drupal is in dire straits, you’d better listen. You ought to read his critique of how Drupal core development is stalling, or at least stuck in the mud. That can’t be good news for anyone looking to upgrade to Drupal 7. My thoughts after the quote.

In addition to the half-baked, single-purpose product features mentioned above, Drupal core still carries around very old cruft from earlier days, which no one cares for. All of these features are not core functionality of a flexible, modular, and extensible system Drupal pretends to be. They are poor and inflexible product features being based on APIs and concepts that Drupal core allowed for, five and more years ago.

Where would Drupal be if they had worked more closely with the PHP community early on? I have no idea, but a lot of PHP programmers have looked down on Drupal because much of the codebase can be messy, with poor API design decisions, overuse of globals, and leaky separation of concerns. Add Drupal’s avoidance of object-oriented programming and the best practices that come with it, and it’s no wonder that talented developers would choose a framework like Zend, Symfony, or Cake to build a complicated website. It sounds like a lot of shortcuts and idiosyncrasies are now baked deep into Drupal core, and ripping them out is too much work for core developers.

I’ve always thought that Drupal’s greatest strength is certainly not the great design of its codebase, but the Drupal community and ecosystem. A contrib module usually exists for many common website needs, like managing redirects, creating useful URLs for content, integrating with analytics, and plugging in third-party commenting systems. On top of that, there are super-modules like Views, Panels, and Context, which let you prototype and build parts of a website without having to write any code at all. The Drupal community has solved a lot of problems through determination and individual brilliance, but that model can’t be sustainable in the long run.

Is there a solution?

Drupal core should cater to programmers’ needs, via coherent APIs and pluggable subsystems. A complete rewrite of core, or even big parts of core, would be a waste of time. Drupal would stagnate while other frameworks kept improving. I think Drupal 8 should seriously consider using a framework like Symfony2 as the foundation for core. I mention Symfony because it has an EventDispatcher component that can replace most of Drupal’s magical hooks system. The next release of the Zend Framework will have a similar component. A tested framework, used by more than just content management applications, would expose developers to a wider range of best practices, particularly around configuration management, deployment, and unit testing.
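
To make the hooks comparison concrete, here is a rough sketch of the event dispatcher idea. Drupal’s hooks are discovered by function name (a module named foo implements a hook by defining a foo_hookname function), while a dispatcher has listeners register explicitly. TinyDispatcher and the node.view event name are hypothetical and hand-rolled for illustration; this is not Symfony’s actual API.

```php
<?php
// Hypothetical, hand-rolled dispatcher for illustration only; not Symfony's API.
final class TinyDispatcher
{
    /** @var array<string, callable[]> listeners keyed by event name */
    private $listeners = [];

    // Listeners are registered explicitly rather than found by naming convention.
    public function addListener(string $event, callable $listener): void
    {
        $this->listeners[$event][] = $listener;
    }

    // Pass the payload through each listener in registration order.
    public function dispatch(string $event, array $payload): array
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $payload = $listener($payload);
        }
        return $payload;
    }
}

$dispatcher = new TinyDispatcher();
$dispatcher->addListener('node.view', function (array $node): array {
    $node['title'] = strtoupper($node['title']);
    return $node;
});
$dispatcher->addListener('node.view', function (array $node): array {
    $node['links'][] = 'read more';
    return $node;
});

$node = $dispatcher->dispatch('node.view', ['title' => 'hello', 'links' => []]);
echo $node['title'], ' / ', implode(', ', $node['links']), "\n";
// prints "HELLO / read more"
```

Because registration is explicit, you can see what responds to an event in one place, and listeners can be swapped or tested in isolation.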

Contrib should cater to site builders’ needs and focus on adding the features they want on top of core. Modules in contrib can improve faster to meet user needs, fix bugs, and innovate. This is an idea proposed in the discussion linked above. Moving as many modules as possible out of core also makes Drupal leaner.


A List Apart: Strategic Content Management

If you're involved in designing, developing, supporting, working with, changing, upgrading, or selecting some sort of content management system, you owe it to yourself to read this article and heed its advice.  A good content model is important to make a system useful to the intended users.  A good content model also has to be flexible and extensible, especially as you iterate and improve it.  A good content management system has to be pliable to meet your content model, instead of imposing its own one-size-fits-all model on you.  However, that flexibility means you'll need custom development to get what you want.

As Karen McGrane says, it’s easy to sketch a faceted navigation on a wireframe. It’s more difficult to implement a CMS to power the implied taxonomy, and to commit to ongoing editorial maintenance over time. A wireframe without a corresponding content strategy and a realistic CMS design is a work of fantasy. A CMS that could realize one of these fantasy wireframes would need plenty of magic pixie dust. We need content strategy to help us decide which of our aspirations is feasible; CMS design is an essential part of that decision.

A List Apart: Articles: Strategic Content Management


$5 Million for a CMS?

Jamesr on Column Two sets the bar pretty high when he observes that it costs $5 million to write a CMS. On the face of it, I can’t argue with that. The temptation with writing one from scratch is that it is so easy to have something functional with a pretty minimal investment (say, less than $100k, or heck, one marathon coding weekend by a halfway talented/dangerous programmer). If the $5 million is to write a CMS that "just works right", I wonder what the price tag is for a CMS that handles the majority of the features actually needed by most web sites. I expect the Pareto principle to be in full effect, in that 20% of the features will account for 80% of the cost. Given that rough estimate, the easier 80% of the features would cost 20% of the budget, so you can write a usable CMS for $1 million, or can you? I haven’t convinced myself, but now that it’s written on the Interwebs, it must be true.

This is real money (cash) that needs to be found and spent by vendors, to produce an acceptably good mid-market solution. This isn’t a all-singing-all-dancing product we’re talking about here, but rather a CMS that just "works right", and meets the expectations of most purchasers.

What are the difficult 20% of features? Workflow and full internationalization/localization of content, for a start. ContentHere has a full list of the hard stuff.


Managing Web Content Down Under

The Australian eGovernment Resource Center has posted "Whole of Victorian Government Web Content Lifecycle and Content Management Roles." While it’s not a huge document, it does a good job of mapping the management process and hitting on the important roles within it.


How will normal, non-emacs-using people write the semantic web?

The slides from Jon Udell’s OSCOM keynote are available. Lots of insight for people building their own CMSs.