All My Eye

Bridging between the blogging and scientific communities

Thursday, January 24, 2008

Interesting posting from Jon Udell today ("Bloggers talk to bloggers, scientists talk to scientists") in which he draws attention to the disconnect between the discourse happening in blogs and mainstream media and that happening in scientific journals, even when the conversation is about the same article.

The comment and the subsequent discussion are worth reading. There's an interesting comment from Geoff Bilder about CrossRef's forthcoming blog engine plugin; keep an eye on CrossTech for a formal announcement.

The BPR3 project is also making practical steps towards helping link up discussions in these two domains.

I've been arguing for a long time that publishers need to keep an eye on what's happening in the blogging arena, as it's a good test bed for exploring the transformation of discourse (scholarly or otherwise) that the Web enables.

Labels: blogging, citations, DOI

posted by Leigh Dodds at 2:18 pm

Persistent Links in Bookmarks

Wednesday, March 28, 2007

A few weeks ago I blogged an idea for incorporating a "preferred bookmark link" into web pages to improve the stability of links submitted to social bookmarking sites. The comments were favourable, so I think it's worthwhile pushing ahead with the idea.

However, I've since decided that my proposed implementation is wrong! Originally I suggested embedding the link in a META tag in the HTML. But it's dawned on me that the LINK tag is obviously a better alternative. The LINK tag is intended to be used to convey relationships between documents, and that's essentially what we're trying to achieve. There's even a predefined link type for indicating bookmark links.

This mechanism is already in use on many blogs to identify the "permalink" for a specific article, e.g.:


<a href="...some...url" rel="bookmark" title="Permalink">Permalink</a>.

So I'm going to revise my proposal so that persistent links to academic articles, e.g. DOIs, are embedded into web pages by adding a LINK tag into the HEAD of the document as follows:


<link rel="bookmark" title="DOI" href="http://dx.doi.org/10.1000/1"/>

The system is extensible, as we can agree a convention, similar to RSS auto-discovery, that the combination of the rel and title attributes conveys information about the type of link. For example, to include both a stable DOI link and a direct link to the current publisher's website, we could use the following:


<link rel="bookmark" title="DOI" href="http://dx.doi.org/10.1000/1"/>
<link rel="bookmark" title="Publisher" href="http://www.doi.org/index.html"/>

User agents (e.g. bookmarklets and other tools) and social bookmarking sites can then offer the user a choice of which link to use (to avoid security issues) or simply store both.
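
As a rough illustration of what a bookmarking tool might do with these links, here's a minimal Python sketch using only the standard library. The names `BookmarkLinkParser` and `find_bookmark_links` are my own invention for this example, and falling back to "untitled" for a missing title is an assumption rather than part of the proposal.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collect <link rel="bookmark"> elements, keyed by their title attribute."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link .../> are also routed here.
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel holds a space-separated list of link types, so split before matching.
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "bookmark" in rel_values and href:
            # Assumption: links without a title are grouped under "untitled".
            title = attributes.get("title") or "untitled"
            self.links[title] = href


def find_bookmark_links(html_text):
    """Return a dict mapping link title (e.g. "DOI") to its href."""
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    return parser.links


page = """<html><head>
<link rel="bookmark" title="DOI" href="http://dx.doi.org/10.1000/1"/>
<link rel="bookmark" title="Publisher" href="http://www.doi.org/index.html"/>
</head><body></body></html>"""

choices = find_bookmark_links(page)
for title, href in choices.items():
    # A real tool would present these to the user rather than print them.
    print(f"{title}: {href}")
```

A tool could show the resulting choices to the user, or simply store them all alongside the URL of the page being bookmarked.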

Labelling DOIs like this also enables them to be more easily extracted for other purposes. We're already including DOIs, expressed as info: URIs, in our embedded Dublin Core metadata, but the actual web link is useful too (if not more so!).
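
To show how the two forms relate, here's a small Python sketch that turns a DOI web link into its info: URI equivalent. The function name `doi_link_to_info_uri` is hypothetical, and the sketch deliberately ignores percent-encoding of unusual characters in DOIs, so treat it as an illustration only.

```python
from urllib.parse import urlsplit

# Assumption: these are the resolver hosts we are prepared to recognise.
DOI_RESOLVER_HOSTS = {"dx.doi.org", "doi.org"}


def doi_link_to_info_uri(link):
    """Convert e.g. http://dx.doi.org/10.1000/1 into info:doi/10.1000/1."""
    parts = urlsplit(link)
    if (parts.hostname or "").lower() not in DOI_RESOLVER_HOSTS:
        raise ValueError(f"not a DOI resolver link: {link}")
    doi = parts.path.lstrip("/")
    # Every DOI starts with the "10." directory indicator.
    if not doi.startswith("10."):
        raise ValueError(f"no DOI found in link: {link}")
    return "info:doi/" + doi


info_uri = doi_link_to_info_uri("http://dx.doi.org/10.1000/1")
print(info_uri)
```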

Thoughts?

Labels: DOI, linking

posted by Leigh Dodds at 11:37 am

Persistent linking, web crawlers and social bookmarking

Wednesday, March 07, 2007

Typically a web crawler, unless configured to use a separate index or crawling algorithm, will use the URL from which it retrieves some content as the entry in its search index. This means that anyone clicking on a search result will be taken to this URL.

Where a site has access-controlled content and the full-text resides at a different location, this presents a problem. The site owner or publisher would like users to go to one page, e.g. the abstract, but will want the crawler to get the full-text. Making this work involves some dialogue between the site owner and the search engine. For example, the web crawler needs to use an alternate index or additional metadata to make the connection between the index entry link and the full-text retrieval link.

Some site operators, with approval, use a technique known as "cloaking" to achieve this. This involves serving different content to a web crawler, e.g. a PDF, than would be served to an end user, e.g. an abstract. Most search engines disapprove of this approach, but Google Scholar, for example, has allowed it. This has caused some debate.

On IngentaConnect we use cloaking to serve content to some crawlers. But we no longer do this for Google Scholar. The reason for this is that Google were interested in obtaining the richer metadata that we include (as embedded Dublin Core) in abstract pages. This metadata, supplemented with the full-text, improves the quality of Scholar search indexes.

I thought I'd explain the fairly simple solution I concocted to achieve this and point out where the same technique could be used to improve another problem: persistent linking in social bookmarking services.

When the Googlebot requests an abstract page from IngentaConnect, it gets fed some additional metadata that looks like this:


<meta rel="schema.CRAWLER" href="http://labs.ingenta.com/2006/06/16/crawler"/> 
<meta name="CRAWLER.fullTextLink" content=""/>
<meta name="CRAWLER.indexEntryLink" content=""/>

The embedded metadata provides two properties. The first, CRAWLER.fullTextLink, indicates to the crawler where it can retrieve the full-text that corresponds to this article.

The second link, CRAWLER.indexEntryLink, indicates to the crawler the URL that it should use in its indexes, i.e. the URL to which users should be sent.
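
To make the division of labour concrete, here's a minimal Python sketch of how a crawler might interpret the two properties. This is purely my own illustration and says nothing about how Googlebot actually works; the names `CrawlerMetaParser` and `plan_crawl` and the example.org URLs are hypothetical, and falling back to the page URL when a property is missing or empty is an assumption.

```python
from html.parser import HTMLParser

PREFIX = "CRAWLER."


class CrawlerMetaParser(HTMLParser):
    """Collect <meta name="CRAWLER.*" content="..."> properties."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith(PREFIX):
            # Store the property under its short name, e.g. "fullTextLink".
            self.properties[name[len(PREFIX):]] = attributes.get("content") or ""


def plan_crawl(page_url, html_text):
    """Decide which URL to fetch for content and which URL to list in the index."""
    parser = CrawlerMetaParser()
    parser.feed(html_text)
    props = parser.properties
    return {
        # Assumption: an absent or empty property means "use the page itself".
        "fetch_url": props.get("fullTextLink") or page_url,
        "index_url": props.get("indexEntryLink") or page_url,
    }


# Hypothetical abstract page; the URLs are placeholders, not real IngentaConnect links.
abstract_page = """<html><head>
<meta name="CRAWLER.fullTextLink" content="http://example.org/article/1/fulltext.pdf"/>
<meta name="CRAWLER.indexEntryLink" content="http://example.org/article/1/abstract"/>
</head><body>Abstract text</body></html>"""

plan = plan_crawl("http://example.org/article/1/abstract?session=abc", abstract_page)
print(plan["fetch_url"])
print(plan["index_url"])
```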

The technique is fairly simple and uses existing extensibility in HTML to good effect. It occurred to me recently that the same technique could be used to address a related problem.

When I use del.icio.us, CiteULike, Connotea or another social bookmarking service, I end up bookmarking the URL of the site I'm currently using. It's this specific URL that goes into their database and is associated with user-assigned tags, etc.

However, as we all know, in an academic publishing environment content may be available on multiple platforms. Content also frequently moves between platforms. The industry solution to this has been to use the DOI as a stable linking syntax. Some sites like CiteULike make attempts to extract DOIs from bookmarked pages, or resolve DOIs via CrossRef. But the metadata they collect is still typically associated with the primary URL and not the stable identifier. This presents something of a problem if, say, one wants to collate tagging information across services, or to ensure that links made now will still work in the future.

A more generally applicable approach to addressing this issue, one that is not specific to academic publishing, would be to include, in each article page, embedded metadata that indicates the preferred bookmark link. The DOI could again be pressed into service as the preferred bookmarking link. For example:


<meta rel="schema.BOOKMARK" href="http://labs.ingenta.com/2007/03/7/bookmark"/> 
<meta name="BOOKMARK.bookmarkLink" content="http://dx.doi.org/10.1000/1"/>

This is simple to deploy. It'd also be simple to extend existing bookmarking tools to support this without requiring specific updates from the owners of social bookmarking sites. If the tool found this embedded link it could use it, at the option of the user, instead of the current URL.

The only downside I can see to this is the potential for abuse: it could be used to substitute links to an entirely different site and/or content for that which the user actually wants to bookmark. This is why I think users ought to be given the option to use the link, rather than silently substituting it. If owners of sites like CiteULike or Connotea decided to support this crude "microformat" then they can easily deploy a simple trust metric, e.g. that they'll use this metadata from known and approved sites.
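
Here's a minimal Python sketch of that kind of trust check, assuming a bookmarking tool or service keeps a simple allowlist of approved hosts. The host names, `TRUSTED_HOSTS`, `BookmarkMetaParser` and `choose_bookmark_url` are all hypothetical names for this example, and a real service would presumably still ask the user before substituting the link.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical allowlist of sites whose embedded bookmark links we accept.
TRUSTED_HOSTS = {"publisher.example"}


class BookmarkMetaParser(HTMLParser):
    """Find the first <meta name="BOOKMARK.bookmarkLink" content="...">."""

    def __init__(self):
        super().__init__()
        self.bookmark_link = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.bookmark_link:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "BOOKMARK.bookmarkLink":
            self.bookmark_link = attributes.get("content") or None


def choose_bookmark_url(page_url, html_text, trusted_hosts=TRUSTED_HOSTS):
    """Return the preferred link only when the page comes from an approved host."""
    parser = BookmarkMetaParser()
    parser.feed(html_text)
    host = (urlsplit(page_url).hostname or "").lower()
    if parser.bookmark_link and host in trusted_hosts:
        return parser.bookmark_link
    # Unknown site or no embedded link: keep the URL the user is actually on.
    return page_url


article_html = '<meta name="BOOKMARK.bookmarkLink" content="http://dx.doi.org/10.1000/1"/>'

trusted = choose_bookmark_url("http://publisher.example/article/1", article_html)
untrusted = choose_bookmark_url("http://unknown.example/article/1", article_html)
print(trusted)
print(untrusted)
```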

I'd be interested in feedback on this, as it's something that we'll likely deploy on IngentaConnect in the next few weeks.

Labels: DOI, Google, hypertext, linking

posted by Leigh Dodds at 3:41 pm
