Wherein the IngentaConnect Product Management, Engineering, and Sales Teams
ramble, rant, and generally sound off on topics of the day

Persistent linking, web crawlers and social bookmarking

Wednesday, March 07, 2007

Typically a web crawler, unless configured to use a separate index or crawling algorithm, will use the URL from which it retrieves some content as the entry in its search index. This means that anyone clicking on a search result will be taken to this URL.

Where a site has access-controlled content and the full-text resides at a different location, this presents a problem. The site owner or publisher would like users to go to one page, e.g. the abstract, but will want the crawler to get the full-text. Making this work involves some dialogue between the site owner and the search engine. For example, the web crawler needs to use an alternate index or additional metadata to make the connection between the index entry link and the full-text retrieval link.

Some site operators, with approval, use a technique known as "cloaking" to achieve this. This involves serving different content to a web crawler, e.g. a PDF, than would be served to an end user, e.g. an abstract. Most search engines disapprove of this approach, but Google Scholar, for example, has allowed it. This has caused some debate.

On IngentaConnect we use cloaking to serve content to some crawlers. But we no longer do this for Google Scholar. The reason for this is that Google were interested in obtaining the richer metadata that we include (as embedded Dublin Core) in abstract pages. This metadata, supplemented with the full-text, improves the quality of Scholar search indexes.

I thought I'd explain the fairly simple solution I concocted to achieve this and point out where the same technique could be used to address another problem: persistent linking in social bookmarking services.

When the Googlebot requests an abstract page from IngentaConnect, it gets fed some additional metadata that looks like this:

<meta rel="schema.CRAWLER" href="http://labs.ingenta.com/2006/06/16/crawler"/>
<meta name="CRAWLER.fullTextLink" content=""/>
<meta name="CRAWLER.indexEntryLink" content=""/>

The embedded metadata provides two properties. The first, CRAWLER.fullTextLink, indicates to the crawler where it can retrieve the full-text that corresponds to this article.

The second link, CRAWLER.indexEntryLink, indicates to the crawler the URL that it should use in its indexes, i.e. the URL to which users should be sent.
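To make the crawler's side of this concrete, here is a rough sketch in Python of how a crawler might read the two properties from an abstract page. It is purely illustrative and is not the code that Google or IngentaConnect runs: the class name, the function name, and the example.org URLs in the usage note are all made up for the example, and falling back to the page's own URL when a property is missing is an assumption.

```python
from html.parser import HTMLParser


class CrawlerMetaParser(HTMLParser):
    """Collect <meta name="CRAWLER.*" content="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content") or ""
        if name.startswith("CRAWLER.") and content:
            # Store the property under its short name, e.g. "fullTextLink".
            self.links[name[len("CRAWLER."):]] = content


def crawl_targets(page_url, html):
    """Return the URL to fetch for indexing and the URL to show in results.

    Assumption: when a property is absent or empty, the crawler falls back
    to the URL it retrieved the page from.
    """
    parser = CrawlerMetaParser()
    parser.feed(html)
    return {
        "fetch": parser.links.get("fullTextLink", page_url),
        "index": parser.links.get("indexEntryLink", page_url),
    }
```

Calling `crawl_targets("http://example.org/abstract", page_html)` on a page carrying both properties would give the crawler one URL to retrieve the full-text from and another to store in its index. A page without the metadata behaves as it always did.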

The technique is fairly simple and uses existing extensibility in HTML to good effect. It occurred to me recently that the same technique could be used to address a related problem.

When I use del.icio.us, CiteULike, Connotea, or another social bookmarking service, I end up bookmarking the URL of the site I'm currently using. It's this specific URL that goes into their database and is associated with user-assigned tags, etc.

However, as we all know, in an academic publishing environment content may be available on multiple platforms. Content also frequently moves between platforms. The industry solution to this has been to use the DOI as a stable linking syntax. Some sites like CiteULike make attempts to extract DOIs from bookmarked pages, or resolve DOIs via CrossRef. But the metadata they collect is still typically associated with the primary URL and not the stable identifier. This presents something of a problem if, say, one wants to collate tagging information across services, or ensure that links made now will still work in the future.

A more generally applicable approach to addressing this issue, one that is not specific to academic publishing, would be to include, in each article page, embedded metadata that indicates the preferred bookmark link. The DOI could again be pressed into service as the preferred bookmarking link. For example:

<meta rel="schema.BOOKMARK" href="http://labs.ingenta.com/2007/03/7/bookmark"/>
<meta name="BOOKMARK.bookmarkLink" content="http://dx.doi.org/10.1000/1"/>

This is simple to deploy. It'd also be simple to extend existing bookmarking tools to support this without requiring specific updates from the owners of social bookmarking sites. If the tool found this embedded link it could use it, at the option of the user, instead of the current URL.
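As a rough sketch of what such a tool extension might do, the Python below looks for the BOOKMARK.bookmarkLink property and only uses it if the user agrees. Everything here other than the property name is hypothetical: the class and function names are invented for illustration, and `confirm` stands in for whatever prompt a real bookmarking tool would show.

```python
from html.parser import HTMLParser


class BookmarkMetaParser(HTMLParser):
    """Find the first non-empty BOOKMARK.bookmarkLink value in a page."""

    def __init__(self):
        super().__init__()
        self.bookmark_link = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name") == "BOOKMARK.bookmarkLink":
            # Keep the first value seen; ignore empty content attributes.
            self.bookmark_link = self.bookmark_link or attr.get("content") or None


def preferred_bookmark(html):
    """Return the page's declared bookmark link, or None if there is none."""
    parser = BookmarkMetaParser()
    parser.feed(html)
    return parser.bookmark_link


def choose_bookmark_url(current_url, html, confirm):
    """Pick the URL to bookmark, leaving the final decision to the user.

    `confirm` is a hypothetical callback taking (current_url, preferred_url)
    and returning True if the user accepts the substitution.
    """
    preferred = preferred_bookmark(html)
    if preferred and preferred != current_url and confirm(current_url, preferred):
        return preferred
    return current_url
```

With the example metadata above, a user who accepts the prompt would bookmark `http://dx.doi.org/10.1000/1`. A user who declines, or a page with no such metadata, would get the current URL exactly as before.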

The only downside I can see to this is the potential for abuse: it could be used to substitute links to an entirely different site and/or content for that which the user actually wants to bookmark. This is why I think users ought to be given the option to use the link, rather than having it silently substituted. If owners of sites like CiteULike or Connotea decide to support this crude "microformat" then they could easily deploy a simple trust metric, e.g. only using this metadata when it comes from known and approved sites.
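A trust metric of that kind could be as simple as a host allowlist. The sketch below is one way it might look in Python. The host in `TRUSTED_HOSTS` is just a placeholder, and the function names are invented for illustration. Matching on the exact hostname, with no subdomain wildcards, is an assumption made to keep the example small.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of sites whose bookmark metadata is trusted.
TRUSTED_HOSTS = {"www.ingentaconnect.com"}


def is_trusted_source(page_url, trusted_hosts=TRUSTED_HOSTS):
    """Return True if the page was served from a known and approved host."""
    host = urlsplit(page_url).hostname or ""
    return host in trusted_hosts


def accept_bookmark_link(page_url, declared_link, trusted_hosts=TRUSTED_HOSTS):
    """Use the declared link only when the page's host is on the allowlist."""
    if declared_link and is_trusted_source(page_url, trusted_hosts):
        return declared_link
    return page_url
```

A page on an unlisted host that declares a bookmark link would simply be bookmarked at its own URL, so the worst a misbehaving site can do is be ignored.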

I'd be interested in feedback on this as it's something that we'll likely deploy on IngentaConnect in the next few weeks.


posted by Leigh Dodds at 3:41 pm

