Podcast Downloads Part Deux

November 21, 2004 · 4 comments


I just checked the Apache logs and grepped out the lines dealing with the podcasts, and it looks like the podcasts have been downloaded a total of 661 times (corrected from 1377; see the update below)! Holy crap! Who is downloading this stuff? No, really. I’d love to know. If you’ve downloaded any of the experimental podcasts, please let me know!

The simple and not exactly efficient command I call to count podcatches is:


grep "/~dnorman/podcasts/" < /var/log/httpd/commons_access_log | grep "GET " | grep " 200 " | grep ".mp3" | grep -v "65536" | wc -l

The basic logic of that statement is something like: “Look in the Apache log and pull out all lines referring to files in the /~dnorman/podcasts/ directory. Of those lines, retain only the GET requests (ignoring HEAD requests…), and of those, retain only the ones that returned 200 (not incomplete, file not found, etc.). Of those, retain only the ones that point directly to a .mp3 file (ignoring directory listings…), ignore the silly repeated downloads of files by podcatching software (which appears to download 65536 bytes of each file just to check that it’s still there…), and feed what’s left into the wc (word count) command, telling it to return the number of lines left.”

I’m sure there’s a more efficient and reliable way of doing this (if I knew wtf I was doing with regular expressions, I could probably combine all of the piped grep statements into a single one with a more complex pattern), but it seems to work… If I want to see how many times a particular podcast has been downloaded, I can change the last “.mp3” grep to be the filename I’m looking for.
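For what it’s worth, here is a rough sketch of how the chain of greps might collapse into a single pass. It swaps grep for awk rather than building one giant regular expression, and it assumes the log is in Apache’s common or combined format, so that splitting each line on whitespace puts the method in field 6, the path in field 7, the status in field 9 and the byte count in field 10. The function name count_podcatches is made up for illustration. One difference from the grep version: this checks the byte count field specifically, instead of dropping any line that happens to contain “65536” somewhere.

```shell
# Hypothetical helper: count_podcatches LOGFILE [SUBSTRING]
# Assumes Apache common/combined log format, split on whitespace:
#   $6 = opening quote plus method (e.g. "GET), $7 = requested path,
#   $9 = status code, $10 = bytes sent.
# SUBSTRING defaults to ".mp3"; pass a filename to count one episode.
count_podcatches() {
  awk -v dir="/~dnorman/podcasts/" -v want="${2:-.mp3}" '
    $6 == "\"GET" && $9 == 200 && $10 != 65536 &&
    index($7, dir) > 0 && index($7, want) > 0 { n++ }
    END { print n + 0 }
  ' "$1"
}

# Example calls (the episode filename is a placeholder):
#   count_podcatches /var/log/httpd/commons_access_log
#   count_podcatches /var/log/httpd/commons_access_log some_episode.mp3
```

Using index() keeps the matching literal, so the dot in “.mp3” means an actual dot rather than “any character” as it does in a regular expression.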

UPDATE: Christian let me know in the comments that podcatching software typically does a very silly thing: it repeatedly downloads about 65K of the mp3 to see if it’s changed. These clients don’t do the right thing and request the HEAD, which would tell them the same thing in a few bytes of text, but instead download about 65K of the actual file and then abort. That’s so unbelievably silly that my mind reels. And it also pollutes the Apache logs so it’s MUCH harder to see how many real downloads are going on. Perhaps I could modify my grep to drop lines containing the (hopefully fixed) download size associated with podcatch pinging…
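Before trusting any one number as “the” ping size, it may be worth tallying which byte counts show up most often for mp3 requests. I see 65536 in my log, but the sample line Christian posted in the comments shows 14360, so the size probably varies by client. A rough sketch, again assuming Apache’s common or combined log format (status in field 9, bytes sent in field 10 when split on whitespace); the function name probe_sizes is invented for this example:

```shell
# Hypothetical helper: probe_sizes LOGFILE [HOW_MANY]
# Lists the most frequent "bytes sent" values for successful mp3 GETs,
# most common first, as "count size" pairs. Sizes that repeat far more
# often than any real episode would be downloaded are likely pings.
probe_sizes() {
  awk '$6 == "\"GET" && $9 == 200 && $7 ~ /\.mp3$/ { print $10 }' "$1" |
    sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Example call:
#   probe_sizes /var/log/httpd/commons_access_log 5
```

Any suspicious size that turns up can then be excluded the same way the 65536 lines are.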

Anyway, I’m not disappointed by the lower “real” numbers – I was more freaked out by the extremely inflated “raw” numbers. Still, if you’re downloading these things, I’d love to know…


1 Christian Hessmann November 21, 2004 at 1:50 pm

Sorry to disappoint you, but I don’t think your numbers are correct.
Quite a few old podcast clients check for updates by downloading the beginning of an mp3 file and stopping after some time.
And yes, unfortunately, Apache says GET, 200 and MP3:

80.203.##.### - - [17/Nov/2004:17:09:27 +0100] "GET /podcast/poetcast_2004-11-17.mp3 HTTP/1.1" 200 14360 http://www.hessi.org "-" "-" "-"

Look for a bunch of lines like this – it checks all mp3s mentioned in your feed within a few seconds.
Every podcaster I contacted is experiencing entries like this.

Combined with some buggy clients that downloaded shows again and again, it is quite hard to tell the number of real listeners.

btw, I am one of the real listeners to your show. :-)

Greetings from Germany.


2 D'Arcy Norman November 21, 2004 at 2:01 pm

Christian, thanks for the tip! You’re RIGHT! My Apache log is full of repeated requests from many IP addresses. Why on earth wouldn’t the podcatcher just check the HEAD of the URL? It’s so darned wasteful to download a bit of the file (about 65K per shot!) rather than a few bytes of HEAD text…

OK. So it’s nowhere near 1377 actual downloads – more like a couple hundred, with LOTS of repeated braindead partial file downloads to see if the file is still there… Someone hit the podcatch developers with a clue stick…


3 Christian Hessmann November 21, 2004 at 5:45 pm

D’Arcy, well, I asked the developers of the usual podcast clients (ipodder, iPodderX, jPodder) – no one is using this kind of algorithm anymore. In fact, most of them said they never used such an algorithm, but you can’t be sure about that.
Anyway, hopefully most people are going to upgrade their clients, since features are added nearly every day, so we might have a chance to get rid of these partial downloads.


4 Tim Lauer November 22, 2004 at 7:07 am

Hi, I’ve downloaded your podcasts. I find your format interesting. Kind of like having a cup of coffee (or a beer :-) ) with a colleague and talking about things of interest. Actually last night I was doing the dishes and was listening to your Nov. 12 podcast… Take care…

