The Art Of Turning URLs Into A User Readable Preview

By Protonet Team. Published August 24, 2010.

If you’re the impatient kind, click here for the demo; otherwise, enjoy the grand finale ;).

In this blog post I’d like to introduce you to a technology that we here at Protonet call “Text Extensions”.
Text extensions are small embeddable representations of URLs mentioned in real-time chat messages. Every time a user shares a URL with the community, we generate a preview that includes the most important information (title, short description, thumbnail, …) representing the website.

This approach gives our users a series of benefits:

  • The sender doesn’t need to explain what’s behind the URL.
  • The recipient gets much more information than a bare link and can then decide whether it makes sense to leave Protonet for the moment and give another (external) page their attention. Sometimes the preview is so good that the user doesn’t need to leave our chat to get the content. This is especially the case where we’re able to reflect the main information and interaction in the text extension: Twitter status updates, Slideshare presentations, YouTube videos, Flickr images, Google Maps, …
  • Text Extensions generate more valuable data for our search index, since we get all the information associated with a URL. Users don’t necessarily need to ask Google for interesting pages related to a specific topic. At Protonet they get search results that are locally and socially relevant. The users themselves are web crawlers delivering highly relevant external content.
  • Text Extensions replace the need for a social bookmarking tool (Delicious, Google Bookmarks, …). Protonet users just have to share their URLs with the community and can still find them in the future by searching for related keywords.

Ok, so how do we fetch external data?

First let’s have a look at the tool that helps us scrape external web content:

YQL

Amongst other APIs we take advantage of a great service provided by Yahoo. It’s called Yahoo Query Language (YQL). YQL allows us to query the web using simple SQL-like statements. In the “where” clause of those statements we specify the URL to fetch content for and, optionally, an XPath selector. Using XPath selectors we ensure that YQL only gives back the HTML segments that are of interest to us (such as meta tags).
From my personal point of view XPath is not the best way of filtering/selecting HTML, but that’s just the opinion of a frontend developer who is used to CSS selectors. 🙂

// Example query to fetch meta tags and title
SELECT * FROM html WHERE url = 'http://www.amazon.de' AND xpath = 'descendant-or-self::title | descendant-or-self::meta'
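
To make the request side concrete, here is a minimal Python sketch of how such a statement could be turned into a request URL. The endpoint constant and the function name `build_yql_url` are assumptions made for illustration; they are not taken from our text extensions code, and the sketch does not escape single quotes inside the page URL.

```python
from urllib.parse import urlencode

# Assumption: the public YQL REST endpoint; adjust if your setup differs.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url: str, xpath: str) -> str:
    """Build a request URL asking YQL for the nodes of page_url that match xpath."""
    # Note: a page_url or xpath containing a single quote would break the statement.
    statement = (
        f"SELECT * FROM html WHERE url = '{page_url}' AND xpath = '{xpath}'"
    )
    # urlencode percent-encodes the statement so it is safe as a query parameter.
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": "json"})


if __name__ == "__main__":
    print(build_yql_url(
        "http://www.amazon.de",
        "descendant-or-self::title | descendant-or-self::meta",
    ))
```

Asking for JSON keeps the response easy to consume from frontend code.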

Getting structured machine readable data

In order to get the most relevant information for a URL we rely on a bunch of (more or less) web standards that have evolved over the last few years.

#1 Open Graph Protocol

By specifying certain meta tags, your website becomes a so-called “graph object”. Facebook, for instance, uses this to give any web page the same functionality as a Facebook Page.

Example taken from the source code of this article on mashable.com

  <meta property="og:title" content="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" />
  <meta property="og:description" content="A team of archaeologists, scientists and oceanographers will soon be revisiting the wreck of the Titanic for further scientific discovery and documentation." />
  <meta property="og:image" content="http://cdn.mashable.com/wp-content/uploads/2010/08/titanic-1.jpg" />

The resulting text extension:


For more info see http://opengraphprotocol.org/
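
As a rough illustration of how Open Graph meta tags like the ones above can be read, here is a small Python sketch built on the standard library’s `html.parser`. The class and function names are hypothetical, and the sketch works on an HTML string directly rather than on a YQL response.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Strip the "og:" prefix and keep the first value seen per property.
            self.properties.setdefault(prop[3:], content)


def extract_open_graph(html: str) -> dict:
    """Return a dict such as {"title": ..., "description": ..., "image": ...}."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


if __name__ == "__main__":
    sample = '<meta property="og:title" content="Explore the Titanic Wreck Site" />'
    print(extract_open_graph(sample))
```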

#2 Facebook/Digg/Yahoo Share Standard

This standard is also based on meta tags. It was invented by Facebook for converting URLs in status updates into previews and was later adopted by several other major web services such as Digg and Yahoo Search.

Example taken from the source code of this epic flickr photo:

  <meta name="title" content="Henley Beach Reflections">
  <meta name="description" content="Enjoy  – Canon 5D Mark2. – ISO 100, f16, 17mm. – Canon 17-40 f/4 L. – Tripod.  View On Black to see more 🙂  Standard 3 exposure (+2,0,-2 EV)  Processing  Photomatix – Tonemapped generated HDR using detail enhancer option  Photoshop – Adjustment of hue/saturation – Adjustment of curves to increase the overall contrast (Linear) – Adjustment of levels – USM on Background Layer – Reduced noise with Noiseware Pro – Sig/Borders Added  Thanks for all the comments, faves, views, notes and invites!">
  <link rel="image_src" href="http://farm3.static.flickr.com/2555/3888623982_fdc013b4d9_m.jpg">

This snippet gives us all the information we need for a basic text extension. Title and description are taken from the <meta> elements and the image preview from the <link> element.

Besides offering an image preview, webmasters can also specify other media types such as video or audio.

Let’s have a look at the source code of this lovely music video on vimeo.com

  <meta name="title" content="Colbie Caillat & Jason Reeves ‘Droplets’" />
  <meta name="description" content="Shot on a Bluff in Malibu where the Song was written" />
  <meta name="video_type" content="application/x-shockwave-flash" />
  <link rel="image_src" href="http://ats.vimeo.com/210/195/21019554_200.jpg" type="image/jpeg" />
  <link rel="video_src" href="http://vimeo.com/moogaloop.swf?clip_id=1828755" type="application/x-shockwave-flash" />

Hackers! This is by far the best we could get from an external site. The source code gives us the title, a short description, a preview image and, most importantly, the URL of the Flash video in a simple machine-readable format.
Users can therefore embed videos in their messages, allowing others to watch them directly within the timeline.
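
The share-standard tags shown above can be collected in much the same way. The following Python sketch is only an illustration: the names `SharePreviewParser` and `extract_share_preview` are made up for this example, and it reads the tags from an HTML string rather than from a YQL response.

```python
from html.parser import HTMLParser

# Tag names used by the share standard as shown in the examples above.
META_NAMES = {"title", "description", "video_type"}
LINK_RELS = {"image_src", "video_src"}


class SharePreviewParser(HTMLParser):
    """Collects <meta name=...> and <link rel=...> values used for previews."""

    def __init__(self):
        super().__init__()
        self.preview = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") in META_NAMES:
            key, value = attributes["name"], attributes.get("content")
        elif tag == "link" and attributes.get("rel") in LINK_RELS:
            key, value = attributes["rel"], attributes.get("href")
        else:
            return
        if value is not None:
            # Keep the first value seen for each key.
            self.preview.setdefault(key, value)


def extract_share_preview(html: str) -> dict:
    """Return the preview fields found in html, keyed by tag name."""
    parser = SharePreviewParser()
    parser.feed(html)
    return parser.preview
```

If the result contains a `video_src` entry, the preview can embed the player; otherwise it falls back to the `image_src` thumbnail.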


You would be amazed if you knew how many popular video services out there support this. Just to name a few: Break.com, DailyMotion.com, CollegeHumor.com, Discovery.com, …

For more info regarding this standard see http://about.digg.com/thumbnails or http://forum.developers.facebook.com/viewtopic.php?id=32464

#3 Microformats (hCard)

Microformats are a set of simple, machine-readable open data formats. “hCard” (an HTML representation of vCard) is one of those formats and has been adopted by many websites for representing people, companies, organizations, and places. Webmasters mark up particular information by adding semantic class names to their (X)HTML elements.

Example taken from a social profile on XING.com

 

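<div class="vcard">
  work
  <span class="n">
    <span class="given-name">Christopher</span>
    <span class="family-name">Blum</span>
  </span>
  <span class="title">Frontend Engineer</span>
  <span class="org">XING AG</span>
  <img class="photo" src="…" width="140" height="185" alt="Christopher Blum">
</div>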
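
To show how class-based markup like this can be read, here is a simplified Python sketch. It is an illustration under several assumptions: the names are hypothetical, only a handful of hCard properties are handled, the whole document is scanned rather than a single vcard root, and the HTML is expected to close all non-void elements properly.

```python
from html.parser import HTMLParser

# hCard properties whose value is the element's text content.
TEXT_PROPERTIES = {"given-name", "family-name", "title", "org"}
# Elements without an end tag; they must not affect the nesting stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.card = {}
        self._open = []  # one entry per open element: property name or None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        if tag == "img":
            # The photo is taken from the src attribute, not from text.
            if "photo" in classes and attributes.get("src"):
                self.card.setdefault("photo", attributes["src"])
            return
        if tag in VOID_TAGS:
            return
        matched = sorted(classes & TEXT_PROPERTIES)
        prop = matched[0] if matched else None
        if prop in self.card:
            prop = None  # keep only the first occurrence of a property
        if prop:
            self.card[prop] = ""
        self._open.append(prop)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Append text to every property element that is currently open.
        for prop in self._open:
            if prop:
                self.card[prop] += data


def extract_hcard(html: str) -> dict:
    parser = HCardParser()
    parser.feed(html)
    return {key: value.strip() for key, value in parser.card.items()}
```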

#4 oEmbed

oEmbed is probably the best tool for turning a URL into an embedded representation. It doesn’t require us to parse the external page ourselves using complex XPath selectors. Instead we just take the URL and ask the provider to give us structured JSON data back. Even though it’s pretty easy to use, we haven’t implemented it (yet) in our text extensions code: at the moment the number of websites supporting it is just too small (oembed.com lists only 10 providers). We also found that most of those who support it have adopted one of the other standards mentioned as well.

See http://oembed.com/ for detailed documentation.
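
For illustration, an oEmbed lookup boils down to building a request URL and reading a few fields from the JSON response. The Python sketch below is not part of our code; the function names are hypothetical, and the provider endpoint in the usage example is the Flickr one given as an example in the oEmbed documentation.

```python
import json
from urllib.parse import urlencode


def build_oembed_request(endpoint: str, page_url: str, max_width=None) -> str:
    """Build the consumer request URL for a provider's oEmbed endpoint."""
    params = {"url": page_url, "format": "json"}
    if max_width is not None:
        params["maxwidth"] = str(max_width)
    return endpoint + "?" + urlencode(params)


def preview_from_oembed(response_body: str) -> dict:
    """Pick the fields a preview needs from an oEmbed JSON response."""
    data = json.loads(response_body)
    return {
        "type": data.get("type"),
        "title": data.get("title"),
        "thumbnail": data.get("thumbnail_url"),
        "html": data.get("html"),  # only present for video and rich types
    }


if __name__ == "__main__":
    print(build_oembed_request(
        "http://www.flickr.com/services/oembed/",
        "http://www.flickr.com/photos/bees/2341623661/",
    ))
```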

#5 Special cases

Many websites don’t support any of the standards listed above. Some of them simply can’t because their content is too complex or too specific to fit into a couple of structured HTML elements.

To name two cases:

GitHub

Protonet currently has a test instance running in an environment full of web developers whose daily need is to share code snippets and repository commits. To fulfill that need we decided to implement a GitHub text extension that embeds code snippets and commit diffs into messages. Of course GitHub doesn’t support any of the standards explained above, so we had to come up with a dedicated solution. Luckily GitHub provides a JSON API which offers all the data related to a specific commit (including the diff itself, message, author and much more). Thanks to the API we were able to easily satisfy our dearest nerds. =)

Google Maps/Street View

Google Maps is probably one of the most used and shared services. As with GitHub, we couldn’t generate a proper text extension for Google Maps based on the page’s source code.
So we had a closer look at typical Google Maps URLs and at how a Google map can be embedded in your own web page. We were quite surprised to find that adding “&output=embed” to the URL generates an embeddable map preview suitable for iframes. Every time a user enters a maps URL, we attach an iframe with the modified URL.

http://maps.google.com/maps?q=new+york

becomes

http://maps.google.com/maps?q=new+york&output=embed

By the way, the same applies to Google Street View, except that the parameter to add is “&output=svembed”.
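
The URL rewrite described above can be sketched in a few lines of Python. The function name `to_embed_url` is hypothetical; the sketch replaces any existing “output” parameter instead of blindly appending a second one.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_embed_url(maps_url: str, street_view: bool = False) -> str:
    """Return maps_url with output=embed (or output=svembed for Street View)."""
    parts = urlsplit(maps_url)
    # Drop an existing output parameter so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "output"
    ]
    query.append(("output", "svembed" if street_view else "embed"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(to_embed_url("http://maps.google.com/maps?q=new+york"))
```

The resulting URL is what goes into the `src` attribute of the iframe.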

#6 Fallback: Page Title, Meta Tags & Screenshot

Luckily almost every website has a <title> element and a <meta> element, which still allows us to render basic information related to a URL. In order to generate a thumbnail preview we capture a screenshot of the page. This is easily done using one of the free or paid services out there. The guys from SitePoint collected a list of screenshot providers: http://www.sitepoint.com/blogs/2008/07/10/9-ways-to-put-site-screenshots-in-your-web-app/
Nevertheless we at Protonet don’t want to rely on a third-party provider, since it often takes minutes or even hours until the provider has generated a screenshot for a URL. We therefore decided to set up our own screenshot service using a free, open-source command-line utility named “CutyCapt”.
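
As a hedged illustration of what calling CutyCapt from a service might look like, here is a short Python sketch. It assumes the binary is installed and available on the PATH as `CutyCapt`; the function names, the default width, and the timeout are assumptions for this example, not a description of our actual setup.

```python
import subprocess


def cutycapt_command(page_url: str, out_path: str, min_width: int = 1024) -> list:
    """Build the CutyCapt argument list for rendering page_url into out_path."""
    return [
        "CutyCapt",  # assumption: binary name as found on the PATH
        f"--url={page_url}",
        f"--out={out_path}",
        f"--min-width={min_width}",
    ]


def capture_screenshot(page_url: str, out_path: str, timeout_s: int = 60) -> None:
    """Run CutyCapt; raises if it exits with an error or exceeds the timeout."""
    subprocess.run(cutycapt_command(page_url, out_path), check=True, timeout=timeout_s)


if __name__ == "__main__":
    print(" ".join(cutycapt_command("http://www.example.org", "example.png")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with user-supplied URLs.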

Demo!

Since we can’t give you access to our entire chat right now, we encapsulated the text extension code and set up a demo. Happy playing!