Retrotag Your Weblog - Tagging with the Yahoo! and Tagyu APIs (part 2)

Published in Application Programming Interfaces on Friday, November 25th, 2005

In part one we looked at the concept behind retrotagging content. Here in part two, with a little cURL and PHP magic, we will request our tags from Tagyu and Yahoo!.

Assuming that you read part 1 of this article, we're going to jump right into action.

Update: The downloadable code has been updated to reflect that Tagyu now allows registration. Once you have a username and password, you can hit the Tagyu server as quickly as you want, although you are capped at 1000 hits per day. Thanks to evariste for spotting the update on Tagyu!

Step 1: Prepare your tables

Our lookup table

The basic design of this process is to query our database for the data that we want tagged, then loop through the result, sending a query to the API server for each row and placing the returned data - the tags - into our database. So, for that, we need some tables in which to store our tag data:
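Something along these lines will do the job; the column names and types here are just a reasonable sketch, so adjust them to suit your data:

CREATE TABLE post2tag (
  post_id INT NOT NULL,           -- the id of the post that was tagged
  tag VARCHAR(100) NOT NULL,      -- the tag text returned by the API
  tag_id INT NOT NULL DEFAULT 0   -- filled in later from the master tag table
);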

We will be inserting the tags that are found, along with their respective post_ids, into the table above. Later, we will build a master list of tags in a separate table and then parse back through the table above to insert tag_ids, effectively turning it into a lookup table. Don't worry if this seems confusing; it will all make sense in the end, I hope.

Our tag table

Next we'll create a table that will become our master tag table.
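Again, a sketch - an auto-incrementing id plus the tag text is all it really needs:

CREATE TABLE tagmaster (
  tag_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- the id we will copy back into post2tag
  tag VARCHAR(100) NOT NULL                        -- the distinct tag text
);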

Once we have queried either API server and stored the tags and post_ids in post2tag, we'll query that table for distinct tags and fill up tagmaster.

Step 2: Prepare your query

This part is easy. We need to query the database for the content and id of that content. We will use the returned resource in a typical while loop along with mysql_fetch_array and query the API server in each loop:
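In sketch form it looks like this (the weblog table and column names are the placeholders used throughout this article, so swap in your own):

// This assumes a database connection has already been made with
// mysql_connect() and mysql_select_db().
$query = "select id, content from weblog"; // The query for your content
$result = mysql_query($query) or die(mysql_error());

while ($row = mysql_fetch_array($result)) {
    $id      = $row['id'];
    $content = $row['content'];

    // Steps 3 and 4 (building the API request and storing the tags) go here.
}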

Step 3: Build and send the request

As mentioned above, all of this next section takes place within the while loop of our database query.

CURL, Client URL Library Functions

We'll be using cURL to send our request to the API servers. For more information on cURL, the PHP manual's Client URL Library section is a good, brief overview.

The code below is the heart of the process. We set the URL of the API server and build the query. We then initialize a cURL handle with $ch = curl_init();, set some options to pass cURL our URL and query (explained in the code), and then execute the request with curl_exec($ch);.
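Here is a sketch of that request for the Yahoo! side; the endpoint URL and POST field names are taken from Yahoo!'s Term Extraction documentation of the time, so double-check them, and swap in the Tagyu URL (and your Tagyu credentials) when hitting Tagyu:

// Endpoint and POST fields per Yahoo!'s Term Extraction docs - verify before use.
$url    = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction';
$fields = 'appid=YOUR_APP_ID&context=' . urlencode($content);

$ch = curl_init();                             // initialize the cURL handle
curl_setopt($ch, CURLOPT_URL, $url);           // where the request goes
curl_setopt($ch, CURLOPT_POST, 1);             // send it as a POST
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields); // the query we built above
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the response as a string
curl_setopt($ch, CURLOPT_TIMEOUT, 15);         // give up after 15 seconds

$chresult = curl_exec($ch);                    // $chresult now holds the XML response
curl_close($ch);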

At this point, we would have the result, that is, our tag data, held in the variable $chresult.

Step 4: Unserialize and store the data

Both of the services that we will be querying return XML. In order to extract our data, we'll use the same library that was used in the Yahoo! Search article - Keith Devens' PHP XML library. So we'll pass our results to the library, then take the returned array and store it in our database, specifically the post2tag table.
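Roughly, and assuming xml.php (Keith Devens' library) is on the include path, that step looks like this; the insert follows the post2tag layout from step 1, and you'll want to dump $data once to confirm the exact array shape for your responses:

include_once('xml.php');                 // Keith Devens' PHP XML library

$data = XML_unserialize($chresult);      // turn the XML into a nested array

// For the Yahoo! request the tags sit in $data['ResultSet']['Result'].
// A single result can come back as a plain string rather than an array.
$tags = $data['ResultSet']['Result'];
if (!is_array($tags)) {
    $tags = array($tags);
}

foreach ($tags as $tag) {
    $tag = mysql_real_escape_string(trim($tag));
    mysql_query("insert into post2tag (post_id, tag) values ('$id', '$tag')")
        or die(mysql_error());
}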

I won't bother printing the whole $data array here - you can do that for yourself - but from the above you can see that "the meat" of this, the Yahoo! request, is held in $data['ResultSet']['Result'].

Boy, this has gotten long.

Step 5: Refine your tags

Here is where things get a bit personal. I've decided to move all of the tags over to the tagmaster table, and to simply scroll through that table and eliminate the tags that I feel are unnecessary. Others may find that they prefer to look at each post and the tags assigned to it and decide that way. I have neither the time nor the patience :-)

Now we want to move all of our tag values into the master table:
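That is a single query, something like this, using the post2tag and tagmaster names from step 1:

INSERT INTO tagmaster (tag)
SELECT DISTINCT tag FROM post2tag;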

With the tags in the tagmaster table, I set about deleting the ones that I was not interested in, and then ran the following queries to finalize things and set up the lookup table:
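Roughly, those queries look like this: first copy the surviving tag ids into the lookup table, then delete any row that never got one. This sketch assumes the tag_id column defaults to 0, as in step 1, and the multi-table UPDATE needs MySQL 4.0 or later:

UPDATE post2tag, tagmaster
SET post2tag.tag_id = tagmaster.tag_id
WHERE post2tag.tag = tagmaster.tag;

DELETE FROM post2tag WHERE tag_id = 0;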

That series of queries inserts the tag ids for the tags I'm keeping, and deletes anything left over that I decided not to use.

Putting it all together

Download the complete set of code. It is commented up for easy switching between the two APIs, Yahoo! and Tagyu. Pay attention to the comments!

The downside of Tagyu

While I feel that Tagyu provides ready-to-use results, at the moment you can only hit the server once a minute from a given IP. The workaround is to insert a sleep(65) into the while loop, but as you can imagine that makes for a fairly long-running script!

I managed to knock off a couple of runs on my local server, adjusting the script execution time (in php.ini) and using PHP5 rather than PHP4 due to a bug I ran into between PHP4 and MySQL.

In addition, in both "full" runs, some posts did not return data. I can only imagine that was because the timeout for the cURL process kicked in (curl_setopt($ch, CURLOPT_TIMEOUT, 15);). In the end I sutured together the two full runs to get a full complement of data (I ran one selecting ids ASC and the other ids DESC in the hopes of overlapping).

Picking your tags

One last note: how you ultimately decide to pick your tags is up to you. I don't have a lot of content, so I was able to judge which tags to pick and which ones to delete. Andrew Krespanis left a good comment in part 1, asking:

Would comparing results from both sources and only including matches produce a more concise tag set or would it merely blow out response/processing time?

Processing-wise, this would depend on how you do it, but I don't see it being an issue. Whether or not you end up with a better or more concise set of tags remains to be seen.

Personally, I don't think that the process can be entirely automated, but using both or either one of these services, along with some hand edits on my part, has definitely made things faster, and added some tags that I would never have thought of using. Good luck!

Comments and Feedback

This is really, really great!

Since I have almost 10,000 posts in my database, to use Yahoo! I would have wanted to modify your mysql query with a LIMIT X,Y clause at the end. That way in the first 24-hour period I could have LIMIT 0,5000, and in the second 24-hour period, LIMIT 5000,5000.

So the line:

$query = "select id, content from weblog"; // The query for your content

becomes:

$query = "select id, content from weblog LIMIT 0,5000"; // The query for your content

and 24 hours later when I want to do the rest, I edit the file to say:

$query = "select id, content from weblog LIMIT 4999,5000"; // The query for your content

Another change I made was to run it as a shell script instead of in my browser. Simply add the line:

#!/usr/bin/php

on its own line at the very top of the script, before <?php, and name the file retrotag.sh. A quick chmod +x retrotag.sh and ./retrotag.sh > retrotag_out &, then log out of your shell; you can periodically check progress by visiting http://yourdomain.com/retrotag_out, and you don't have to worry about limits on how long PHP scripts can run on your server, because they don't apply when running from the command line.

Now, after running the Yahoo! version over the first 5000 entries, I found that I had over 90,000 tags to deal with. So I quickly emptied the table and decided to go the Tagyu way; slower though it is, the data is of far higher quality, so I'm willing to wait for it. My blog was originally based on Movable Type, and although I've since switched to homebrewed PHP code because MT was too slow for me, my data is still in the MT format, so I had MT's database schema to contend with. Here's how I modified the query line so it worked with MT's database:

$query = "select entry_id as id, concat(entry_text,' ', entry_text_more) as content from mt_entry"; // The query for your content

This lets the code blithely and gracefully carry on with its assumption that the entry fields are called "id" and "content", while allowing for the fact that MT has an "extended entry" field for long entries, the contents of which can be valuable for tagging purposes. This is working like a charm. I'm sure WordPress has a similar "extended entry" scheme so this will come in handy for someone.

Anyway, I hope this helps someone else! Thanks for the great code, Mike P! In about five days, a year of my blog will be retrotagged, all thanks to you :-)

Wow, evariste, thanks for taking the time to write that up! It hadn't even occurred to me that some people may have more than 5000 posts... Good advice on the shell script too.

Good luck with Tagyu!

By the way, if you're curious, I found you through 9rules's front page's random post lister. So, you got at least one reader that way :-)

Well, that's a good start. Now if only Scrivs would read the site!

Late update: I've discovered you can skip the sleep() if you register with Tagyu. Then you can do up to 1000 queries a day with no 60-second rate limiting. The API uses HTTP Basic Auth, so you need to add one line of code. Locate this:

$ch=curl_init();

add this line after it:

//HTTP AUTH

curl_setopt($ch, CURLOPT_USERPWD, 'username:password');

Then comment out or delete the sleep() line:

// Sleep if we're hitting Tagyu.

($mode == 'Tagyu') ? sleep(65):'';

becomes

// Sleep if we're hitting Tagyu.

//($mode == 'Tagyu') ? sleep(65):'';

Use LIMIT X, Y to make sure you don't go over the 1000 quota, and just do it once a day till you're done. I just finished LIMIT 0,1000, so tomorrow I'll do LIMIT 1000,1000, the day after LIMIT 2000,1000, and so on until I've retrotagged everything in my database.

Thanks evariste,

That registration is new! I'll sort out the code accordingly. I had access that way myself after asking the fine folks at Tagyu for registration... It made (makes) things much easier, for sure.

Okay, code updated!

I left a sleep of 1 second in there, and it applies to both cases now. I figured that if we're hitting the server that many times, it's better to be polite about it. I know I hate it when bots hit our servers too quickly...

Argh. I've discovered that if MySQL's concat() is fed two fields, one of which is NULL, it returns NULL! So you'll be asking Tagyu/Yahoo! to tag a NULL, which gets you... precisely nothing back. Disastrous, eh? Instead, use concat_ws(), and be aware that it takes its arguments in a different order.

So this:

$query = "select entry_id as id, concat(entry_text,' ', entry_text_more) as content from mt_entry";

// The query for your content. concat() returns NULL if any field is NULL!

Should be this, instead:

$query = "select entry_id as id, concat_ws(' ', entry_text, entry_text_more) as content from mt_entry";

// The query for your content, which is now safe if you've got NULL in entry_text or entry_text_more

