2008-08-18

PixCede gets shorter file identifiers

My initial filenaming convention for PixCede was pure rubbish.

It all started with the best of intentions. After reading about the development of Pastebin and why the database got scrapped, I was inspired to make PixCede rely on the filesystem and not the database. I did, however, need to keep the timestamp so I could sort the images by the order in which they arrived. So my initial naming convention was:

2008-07-04T07:19:03+00:00_f6cfee1f4e477963a3c24d8f9b769722.jpg

That is, PHP's date('c') followed by an MD5 of the file. My logic was that date('c') would keep the time property and, if by chance any two images hit the system in the same second, they would certainly have different contents, and the MD5 would differentiate them (unless, of course, the same image hit at the same time, but... I don't see why it would need to be there twice). Using date('c') was just a bad idea from the start; if I had spent 2 seconds more thinking about it, I would have just used time(), which returns the number of seconds since the Unix epoch. Using an MD5 hash is a dumb idea too, because it's a pretty expensive operation.
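A rough sketch of that first scheme, to make the 58 characters concrete. This is a reconstruction rather than the original PixCede code: the function name pixcede_name_v1 is made up, and it hashes a string of bytes where the real thing would presumably call md5_file() on the uploaded file.

```php
<?php
// Hypothetical reconstruction of the first naming scheme:
// ISO 8601 timestamp, an underscore, then the MD5 of the file contents.
function pixcede_name_v1(string $contents, int $now): string
{
    // date('c') yields e.g. 2008-07-04T07:19:03+00:00 (25 characters).
    // md5() always yields 32 hex characters.
    return date('c', $now) . '_' . md5($contents) . '.jpg';
}

$name = pixcede_name_v1('fake image bytes', time());
echo strlen($name) - strlen('.jpg'), "\n"; // 25 + 1 + 32 = 58
```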

So for version two, I used time() concatenated with uniqid(), a function that creates a UID based on the current time in microseconds (it's what the PHP manual pages recommend using for Session IDs). Without any parameters, uniqid() returns a 13-character string. That brings me to:

1219098558133aaffc6178602.jpg
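The second scheme could look something like this. Again a sketch under assumptions: pixcede_name_v2 is a hypothetical name, and the extension is hard-coded for brevity.

```php
<?php
// Sketch of the second scheme: Unix timestamp followed by uniqid().
function pixcede_name_v2(): string
{
    // time() is 10 digits for current dates; uniqid() with no arguments
    // returns 13 hex characters derived from the current time in microseconds.
    return time() . uniqid() . '.jpg';
}

echo strlen(uniqid()), "\n"; // 13
echo pixcede_name_v2(), "\n";
```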

Substantially better, but... as the great doctor says, if something's worth doing, it's worth doing right. Ideally, I want PixCede to send the user a unique identifier small enough that he can type it into his browser after receiving it by SMS. Next on the chopping block: the 13 character unique identifier.

In a dream world, I might get 1000 pictures per second with PixCede. Realistically, I think I only need to worry about two at once (and that's a stretch), but 1000 seems like a nice round number. If I use PHP's base_convert(), I can create a random number from 0 to 1,679,615 (36^4 - 1), convert it to base 36 (10 digits + 26 letters), and use only 4 precious characters. This brings it down to:

1219098558xe21.jpg
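One way the 4-character suffix might be generated is sketched below. The function names are hypothetical, and the zero-padding is an assumption added here so that small random draws still come out 4 characters wide.

```php
<?php
// Sketch of the third scheme: 10-digit timestamp plus a 4-character
// base-36 random suffix. Function names here are hypothetical.
function random_suffix36(): string
{
    // 36^4 = 1,679,616 possible values, so draw from 0 .. 36^4 - 1;
    // 36^4 itself would encode as "10000", which is 5 characters.
    $n = mt_rand(0, 36 ** 4 - 1);
    // Left-pad with zeros so small draws still give 4 characters.
    return str_pad(base_convert((string) $n, 10, 36), 4, '0', STR_PAD_LEFT);
}

function pixcede_name_v3(): string
{
    return time() . random_suffix36() . '.jpg';
}

echo base_convert((string) (36 ** 4 - 1), 10, 36), "\n"; // zzzz
echo base_convert((string) (36 ** 4), 10, 36), "\n";     // 10000
echo pixcede_name_v3(), "\n";
```

For two uploads in the same second, the chance of a clash is 1 in 1,679,616. At the dream-world 1000 per second, a birthday-style estimate puts it at roughly one in four, so a collision check would still be needed at that scale.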

Which is a lot better! It preserves the ordering in the first 10 characters, and uses 4 characters on the end to make it unique. But why not go balls to the wall? I might as well convert both the timestamp and the random number to base 36; a timestamp in base 36 will sort just as well as one in base 10, since the digits sort before the letters. The final code I am using:

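// Timestamp shifted left 7 decimal digits, plus a random suffix below 10^7.
$id = (time() * pow(10, 7)) + rand(0, pow(36, 4));
// Encode the whole number in base 36 (assumes 64-bit PHP integers).
$uid = base_convert($id, 10, 36);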


3bnd9c4wf66.jpg
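To check that the base-36 IDs really do sort by arrival time, and that the timestamp can be recovered from one, here is a small sketch. It takes the time and suffix as parameters so the result is repeatable; both function names are hypothetical, and it assumes 64-bit PHP integers. The ordering only holds while all IDs have the same length.

```php
<?php
// Sketch: build the ID as in the post, but with the time passed in so
// the result is repeatable. Assumes 64-bit PHP integers.
function pixcede_uid36(int $now, int $suffix): string
{
    // The suffix stays below 10^7, so it never spills into the timestamp digits.
    $id = $now * 10 ** 7 + $suffix;
    return base_convert((string) $id, 10, 36);
}

// Recover the upload time from an ID (hypothetical helper).
function pixcede_uid36_time(string $uid): int
{
    return intdiv((int) base_convert($uid, 36, 10), 10 ** 7);
}

$earlier = pixcede_uid36(1219098558, 1679616);
$later   = pixcede_uid36(1219098559, 0);

var_dump(strlen($earlier));             // int(11)
var_dump(strcmp($earlier, $later) < 0); // bool(true)
var_dump(pixcede_uid36_time($earlier)); // int(1219098558)
```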

11 characters total, a saving of 47 over my original, poorly encoded 58 characters! This, I feel, is an acceptable UID to have to type in by hand. If you want to improve on it, base62 is just a small function away (there's a user-contributed one on the PHP base_convert() page).

EDIT:
I decided to go ahead and switch it over to base62 encoding. The function from the comment on the base_convert() page actually didn't work for what I needed; the images were mostly sorted, but off a little bit. It turns out you need to switch the upper and lower case letter sets in the dec2any() function ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz") so that the alphabet follows ASCII order, and now everything works properly, with a final, 9 character encoding of:

tfebHWV5s.jpg
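For illustration, a minimal base-62 encoder with the alphabet in ASCII order might look like this. It is a sketch, not the dec2any() function from the manual comments; base62_encode and BASE62_ALPHABET are names invented here, and 64-bit PHP integers are assumed. With lower case first, 'a' would stand for a smaller value than 'A' while sorting after it as a byte, which is the kind of thing that throws the ordering off.

```php
<?php
// Minimal base-62 encoder (a sketch, not the dec2any() function from
// the PHP manual comments). The alphabet is in ASCII order:
// digits, then upper case, then lower case.
const BASE62_ALPHABET =
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

function base62_encode(int $n): string
{
    if ($n < 0) {
        throw new InvalidArgumentException('expected a non-negative integer');
    }
    $out = '';
    do {
        // Take the lowest base-62 digit and prepend it.
        $out = BASE62_ALPHABET[$n % 62] . $out;
        $n = intdiv($n, 62);
    } while ($n > 0);
    return $out;
}

$a = base62_encode(1219098558 * 10 ** 7 + 1679616);
$b = base62_encode(1219098559 * 10 ** 7);

var_dump(strlen($a));         // int(9)
var_dump(strcmp($a, $b) < 0); // bool(true)
```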
