This tweet asking how many URL shorteners an average tweeted link goes through piqued my interest, so I decided to find out. I wrote a couple of scripts to collect public tweets, look for links, follow them through redirections, and then do some quick and dirty analysis of the resulting data.
I was interested because I’ve noticed the browser hopping through multiple (slow) sites when clicking links on Twitter. But if you’re not familiar, you should also read Joshua Schachter’s excellent overview of URL shorteners and the problems they cause – it’s worse than just a longer wait for your page. But waiting for your page sucks too: compare this to that, which is not only slower but also has annoying clickthroughs.
But on to the data. The short version: the vast majority of links only go through 1 or 2 shorteners, but nearly as many go through 2 as go through 1. The average number of shorteners you have to go through is about 1.48. In other words, on average you’ll make about 2.48 synchronous HTTP requests (1.48 redirects plus the final request) before the page linked in a tweet even starts to load.
To determine this, I collected 101,668 tweets containing 20,868 URLs. Following the links through redirects to their destination and counting them up gives this histogram (note the log scale):
So a few links manage to get through the tweeting process unscathed, most have one or two shorteners between you and the real page, and then there are some outliers. And one was wrapped 7 times!
But wait, there’s more…
When I created the link above – the one to the Wikipedia page about URL shortening that goes through many shorteners – some shorteners rejected links that were already shortened. So they are at least making some effort not to re-shorten links. This may have more to do with spam prevention than any other motivation, but it indicates that some URL shorteners are actively avoiding nested shortened URLs.
So I wanted to look into who is generating rewrapped links. First I looked at the top host names in shortened links, including ones discovered by following redirects. (Again, log scale.)
This essentially just shows market share. Not surprisingly, t.co is the winner, with some other big names following them.
But I was really interested in re-shortened links: links that redirect to another shortened link that also needs to be resolved. If you cut out the last shortened link (the one that finally points to the real resource), you can see who the worst “re-shorteners” are.
Twitter is by far the worst offender. Of course, as the last hop between the tweeter and the published tweet, this isn’t surprising, but it also suggests that Twitter is always, or at least very aggressively, re-shortening links. That matches my experience: Twitter will wrap a link in t.co form even when my tweet was already short enough. I’m assuming this is because they want to collect analytics – the value of the data about where tweets spread outweighs the cost of making links brittle and slower to load.
If you compare the last two graphs, you can also see which services seem to be actively avoiding reshortening. Those that drop to a lower position in the second graph are either not reshortening links or don’t receive links to be reshortened (i.e. they generate them only for their own pages). Good examples, I think, are Facebook (which isn’t reshortening) and Tumblr or YouTube (which probably only short-link their own pages).
My basic approach was pretty simple:
Pull tweets from Twitter’s public streaming API. I pulled around 100,000 tweets over 40 minutes on April 12, 2012, around 10 AM. If you’re curious, this runs at about 100 KB/s. The tweetstream library made this trivial.
Unshorten links by following redirects. I found about 20,000 URLs in the tweets I collected. Since I didn’t have the analysis code ready, I just dumped the list of redirections and status codes to a file. Thanks to the very nice Requests library, this was trivial – it even keeps the redirect history of each request for you.
Run the analysis. Classify requests as shortened-URL redirects and count them up. Ignore anything that didn’t ultimately result in a successful page load, to avoid noise from shortened links that were removed for pointing at spam, or from random server failures.
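The unshortening step can be sketched with Requests, which follows redirects by default and keeps the intermediate responses around. The function names here are my own illustration, not the original scripts:

```python
import requests


def redirect_chain(url, timeout=10):
    """Follow a link to its destination, returning every URL visited.

    Requests follows redirects automatically and keeps the intermediate
    responses in resp.history, so the chain falls out for free.
    """
    resp = requests.get(url, timeout=timeout, allow_redirects=True)
    chain = [r.url for r in resp.history] + [resp.url]
    return chain, resp.status_code


def shortener_hops(chain):
    """Number of redirects between the tweeted link and the final page."""
    return len(chain) - 1
```

A link wrapped twice (say t.co around bit.ly) would yield a three-element chain, i.e. two hops.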
There were a couple of gotchas:
- Extracting links is tricky. You obviously can’t rely on looking for properly formatted URLs. I used a simple regex I had developed previously to look for links in tweets, which I use for archiving tweets and making sure I have the real link URLs. It’s better to be liberal here because we can filter failures out later – we’ll get 404s or name resolution issues.
- Classifying shortened/long links isn’t hard, but you need to tweak things a bit. Just looking for redirects (HTTP 301 or 302 status codes) isn’t sufficient since there are non-URL-shortening uses of these. An initial analysis looked a lot worse because these redirects were also counted. Ultimately I took into account the HTTP status code, the length of the host name portion of the URL, and the total length of the URL. It’s still not perfect, but false positives/negatives seem to be pretty rare in my data set.
- This is simple, but you need async requests or threading to fetch the pages – some sites are down or slow to respond, and doing it serially would have taken far too long. I ran 20 concurrent transfers and the whole process took about 30 minutes.
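The liberal link extraction described above might look something like this – the pattern is an illustrative stand-in, not the regex from the original scripts:

```python
import re

# Deliberately liberal: grab anything that looks like a link and let
# downstream failures (404s, DNS errors) filter out the junk later.
LINK_RE = re.compile(r"https?://[^\s]+", re.IGNORECASE)


def extract_links(tweet_text):
    """Return every link-like substring in a tweet's text."""
    return LINK_RE.findall(tweet_text)
```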
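The classification heuristic – status code plus host-name length plus total URL length – can be sketched as below. The thresholds are my own illustrative guesses, not the values used in the actual analysis:

```python
from urllib.parse import urlparse

# Illustrative thresholds, not the ones from the original analysis.
MAX_HOST_LEN = 12   # shortener hosts like t.co, bit.ly, j.mp are short
MAX_URL_LEN = 30    # and shortened URLs are short overall


def looks_like_shortener(status_code, url):
    """Rough heuristic: a redirect from a short host with a short URL is
    probably a URL shortener rather than an ordinary redirect."""
    if status_code not in (301, 302):
        return False
    host = urlparse(url).netloc
    return len(host) <= MAX_HOST_LEN and len(url) <= MAX_URL_LEN
```

This is exactly the kind of rule that produces occasional false positives (short-hostname sites using redirects for other reasons), which matches the caveat above.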
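And the 20-way concurrent fetching could be done with a thread pool along these lines – a minimal sketch assuming Requests, not the original code:

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def resolve(url):
    """Fetch one URL; return (url, final_url, status) or (url, None, error)."""
    try:
        resp = requests.get(url, timeout=10)
        return url, resp.url, resp.status_code
    except requests.RequestException as exc:
        # Dead or misbehaving sites shouldn't kill the whole run.
        return url, None, exc


def resolve_all(urls, workers=20):
    # 20 concurrent transfers, matching the number used in the post.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(resolve, urls))
```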
See the code for full details.