Wherein the IngentaConnect Product Management, Engineering, and Sales Teams
ramble, rant, and generally sound off on topics of the day
 

Same debate, different forum: self-archiving of academic papers ... via iTunes?

Friday, May 26, 2006

Here's a development in open access (OA) that might be considered, well, a little left-field: Greg Restall has submitted an RSS feed of the metadata of his research opus to iTunes, to enable interested parties to locate and download the full text of his articles (either by receiving the file as an RSS enclosure, or by using the URL in the enclosure to link to the file on Greg's site).

Although Greg admits that "using iTunes in this way is just a bit of a joke", this is an interesting (and, in some quarters perhaps, alarming) development of the self-archiving idea. Whilst Greg's agreements with his publishers allow him to make a version of his papers freely available (so there are no copyright concerns in this case), it's how others might respond to, and develop, the concept that rings alarm bells. For example, Greg's reader Fernando Gros is already suggesting that all journal articles should be distributed this way, thereby saving users from having to pay to access research.

Trying to treat academic research in the same way as digital music files is not a new idea; Leigh Dodds has pointed me at a paper from 2004 [1], which touches on this idea from the reverse angle, i.e. why it is more complicated to download and manage academic papers than MP3s. While the authors' pains have to an extent been resolved since, by online reference management/bookmarking tools such as Connotea or CiteULike (which both launched later that year), and by the increased use of XML as a format for online articles (which unites the full text and metadata in one file), their issues with full text availability remain.

Ay, there's the rub. Fernando Gros is confusing availability of metadata in iTunes with free availability of full text (probably because Greg's full text just happens to be free, so can be served up along with the metadata). Fernando's suggestion that publishers "rethink their distribution" and take advantage of iTunes misses the pretty obvious point that publishers already distribute their metadata widely, and via more appropriate, and in some cases more accessible, channels (think, for example, PubMed, which is freely available and, unlike iTunes, doesn't require you to have a plugin to use it).

Ultimately, of course, this is not a new issue. If a publisher is "green", i.e. allows authors to self-archive their papers for open access, then iTunes is just another potential self-archiving channel. What Greg's use of iTunes, and the responses it provoked, highlight is the lack of awareness of existing repositories amongst many of those who are best placed to use them, and perhaps also an underlying need for greater repository functionality, to help users quickly locate, collate and share relevant research.

We've been mulling over the implications of this, and have had some ideas about what repositories could usefully do to encourage increased usage by publishers.

And finally ... following the iTunes metaphor to a conclusion, Leigh also went back to the core [2] of the iTunes story to consider how it might be analogised to the distribution of scholarly research; consider this posting on ITWire: "Imagine a retailer being able to dictate pricing policies to the world's four largest wholesalers of a market segment. Well that's the position Apple is now in with iTunes." Ultimately, iTunes' success overwhelmed Universal, Warner, SonyBMG and EMI's reservations about its 99¢/track policy, and at the end of the rocky road of negotiation, they had to agree to continue supplying their music to be sold at the flat price.

Could a reasonable, flat, wholesale price be a more realistic answer to the great OA debate?

[1] Howison, J. and Goodrum, A. Why can't I manage academic papers like MP3s? The evolution and intent of metadata standards. Presented at the UMUC Colleges, Code and Copyright event, June 2004. http://freelancepropaganda.com/archives/MP3vPDF.pdf
[2] Sorry, couldn't resist.

With thanks to Leigh for provocative discussion and digging out the examples he cited :)

posted by Charlie Rapple at 11:19 am

 

FOAF for the lazy, with embedded RDF

Thursday, May 25, 2006

So, in the New World Order... all documents and things on the web must have machine-readable RDF metadata. But I'd never, until this week, actually bothered to produce any DC+ metadata for our project pages, or a FOAF file for the team or for myself.

Why? Well:

1. Who the heck wants to write machine readable metadata?

- I mean, seriously, tapping out angle brackets in Notepad is grim enough when you have to write a bit of HTML; at least then you get to hit refresh and see something shiny. But writing RDF/XML? Come on.

2. I'm too lazy to keep it up to date.

- It will become yet another piece of internet junk, caused by me.

3. There's something faintly embarrassing about it.

- The prospect of writing a FOAF file gives me the shivers - like writing a CV. A 'homepage' with a tiled background and a picture of my hamster/holiday - now that would be fun. But a FOAF file? I suspect it's supposed to be rather more swish and work-ish with an impressive list of things I've done and VIPs I've met. Eww.

So I was quite keen on Ian from Talis's talk about embedded RDF at XTech. Basically the idea is to write an HTML page first - so the prose is aimed at humans - and then add a few extra tags and attributes to the HTML to define the things that would be needed to make some RDF metadata. Then you use some XSLT to scrape it out on the fly - there's even a webservice to do the transform for you.

My efforts mainly consisted of putting 'rel' attributes on my existing <a href> anchors, and adding a few extra <span>s.

Eg:
<p>My name is Katie.</p>
becomes
<p>My name is <span class="foaf-firstName">Katie</span></p>
-> creates the triple:
_x foaf:firstName Katie

And:

<p>I work at <a href="http://www.ingenta.com/corporate/">Ingenta</a></p>
becomes
<p>I work at <a href="http://www.ingenta.com" rel="foaf-workplaceHomepage">Ingenta</a></p>
-> creates the triple:
_x foaf:workplaceHomepage http://www.ingenta.com

Obviously there's much more to it than this - e.g. how to get rdf:types in there.

But I just got to grips with the basics. I was able to make this FOAF metadata from my new homepage.

The con is:
The *extra stuff* in your HTML does make it a bit crowded.

The pros are:
1. It's lo-fi. Just a text editor and basic HTML skills required to get going - people may even do it.
2. It gets kept up to date automatically, as you update the human-readable page - which you *might* do.
3. You don't have to type out RDF/XML.
4. Less ewww.

Now that I've been inspired to produce this nice machine-readable page, I can query it with Leigh's newly extended SPARQL service, to get results like this.
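By way of illustration, a query along these lines should do the trick - a minimal sketch, assuming only the two FOAF properties marked up above (the real service's interface and result format may well differ):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# pull the name and workplace homepage back out of the scraped RDF
SELECT ?name ?workplace
WHERE {
  ?person foaf:firstName ?name ;
          foaf:workplaceHomepage ?workplace .
}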

posted by Katie Portwin at 5:02 pm

 

Performance, triplestores, and going round in circles...

Wednesday, May 24, 2006

Can industrial-scale triplestores be made to perform?

Is breaking the "triple table" model the answer?

Scaling SPARQL performance is something I worry about. Mike Bergman worries about it too.

In our XTech paper, we showed that even a simple, bread-and-butter sample query takes 1.5 seconds on our 200-million-triple store. (Scroll down for the colourful graph.) In our Jena paper, we showed that with a triplestore in a relational database, you're at the mercy of your query planner - some shapes of SPARQL query perform reaaaally badly.

Both these facts make sense when you think that your SPARQL doesn't just get done by magic - of course, it gets turned into a ma-hoosive SQL JOIN statement, across the main triples table again and again, like this:
Select A0.Subj, A2.Subj, A3.Obj, A4.Obj, A5.Subj, A6.Obj From jena_g1t1_stmt A0, jena_g1t1_stmt A1,
jena_g1t1_stmt A2, jena_g1t1_stmt A3, jena_g1t1_stmt A4, jena_g1t1_stmt A5, jena_g1t1_stmt A6 Where
A0.Prop=...


Databases weren't really designed to do queries like this - it's not surprising that they aren't very fast at it...
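To make the correspondence concrete, here's a hedged sketch of the kind of SPARQL that produces a join like the one above (the ex: namespace and property names are purely illustrative, not our actual schema) - each triple pattern becomes another alias of the statements table in the generated SQL:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX ex: <http://example.org/terms/>   # stand-in for our real article vocabulary

# six triple patterns = six aliases of the triples table (A0, A1, ...) once translated to SQL
SELECT ?article ?title ?doi ?volume ?issue ?page ?creator
WHERE {
  ?article dc:title   ?title ;
           ex:doi     ?doi ;
           ex:volume  ?volume ;
           ex:issue   ?issue ;
           ex:page    ?page ;
           dc:creator ?creator .
}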

So I was really interested in Kevin Wilkinson's paper on property tables - also presented at the JUC last week. Kevin's idea is basically this:

Say, in your triplestore, you have this:
triples table
-----------------
| s | p   | o   |
-----------------
| X | p1  | o1  |
| X | p2  | o2  |
| X | p3  | o3  |
| Y | p1y | o1y |

(This means that X has three properties attached to it.)

Kevin points out that you often get "groups" - patterns in the data. Often, every X has exactly one o1, one o2 and one o3.

For example, consider X to be an article: articles always have one title, one DOI, one volume, one issue, etc.

In this case, you could have a "property" table like this:
-----------------------
| s | p1  | p2  | p3  |
-----------------------
| X | o1  | o2  | o3  |
| Y | o1y | o2y | o3y |


You could still keep other, unsuitable-for-grouping stuff in the main triples table, but drag the groups out into 'property tables'. This would make them fast to query. (There's a lot more to it than this - e.g. how properties with cardinality >1 go in their own table too - but you'll have to read the paper!)
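A query that only touches the grouped properties could then, in principle, be answered from the property table with a single scan, rather than a chain of self-joins. A hedged sketch, again with stand-in property names:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX ex: <http://example.org/terms/>   # stand-in vocabulary

# volume, issue and title would all live in the same property table row
SELECT ?article ?title
WHERE {
  ?article ex:volume "42" ;
           ex:issue  "3" ;
           dc:title  ?title .
}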

This made sense to me. There are definitely patterns in the data. We identify articles by their ISSN/volume/issue/page data. They always(?) have these properties. How sensible it would seem to put them in a nice normal table, so that we can at least occasionally throw SQL like:
"select * from articles where articles.vol=...."
at our poor old database, rather than scary mega-joins. Sounds like common sense!

BUT

....

....

Hang on....

Consider this: our 'articles' table would look like this:


Articles
--------------------------------------------
| id | title | doi        | volume | ..... |
--------------------------------------------
| 1  | "bla" | 123.456..  | 42     | ..... |
| 2  | "foo" | 789.101..  | 45     | ..... |


Er.... does this look rather familiar? Isn't this exactly what we started with?

Hang on - what was wrong with what we started with again?

Well, for us, the problem was that we found we had to keep re-modelling the database - initial assumptions were correct for the first hundred thousand articles... and then suddenly weren't. For example, we recently discovered that sometimes articles have two DOIs. Horrible, but true. The great thing about the triplestore is that we don't have to bake assumptions about the data into the database - we can have as many whatevers as we like.
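For instance, an article with two DOIs is simply two statements, and the same query returns both without any change to the store (a hedged sketch - ex:doi and the article URI stand in for whatever the real schema uses):

PREFIX ex: <http://example.org/terms/>   # stand-in for the real DOI property

# returns every DOI attached to the article - one row or two, the triples table doesn't care
SELECT ?doi
WHERE {
  <http://example.org/article/1> ex:doi ?doi .
}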

I'm still very keen to try out Kevin's architecture - I hope to nag him into solving this undecided-cardinality problem within his app, by changing the schema dynamically and shuffling the data around. I just hope that could scale. Very much looking forward to the first release and to finding out more!

posted by Katie Portwin at 1:44 pm

 

At the 2006 Jena User Conference

Katie and I were at the Jena User Conference in Bristol on the 10th and 11th of May. The conference had a good mix of papers and demos that covered a variety of subjects. We met a number of people and got useful pointers and ideas from them. There were some papers we found particularly useful, whose ideas we are keen to try out ourselves in the Metastore project.

One of them was Chris Dollin's Eyeball tool for validating the data in a triplestore. This would be great for sanity-checking the Metastore data, given the extreme flexibility of the store. The main question is whether we can run it over the huge amount of data in the store and, if so, how long it would take. I think we will try this on a single-publisher store initially and then on increasing sizes of store... certainly something to look into further.

Another was Max Völkel's RDFReactor - it transforms an ontology in RDF Schema into a Java object model. In the Metastore project, we had to go through a number of iterations and a lot of head-scratching while translating our schemas to a Java object model that would most closely follow the schemas. So it would be interesting to have a look at the model that RDFReactor generates.

An interesting development was Kevin Wilkinson's work on Jena property tables for storing patterns of RDF statements.

Another paper that I liked was Kate Byrne's Tethering Cultural Data with RDF, particularly the natural language processing part - something I would like to explore in my spare time (what there is of it!).

We presented our paper "Scaling Jena in a commercial environment: the Ingenta Metastore Project" and got positive feedback and suggestions. A few people were interested in the customised schemas that we have developed for representing some parts of our dataset.

Max Völkel was interested in using our schemas to compare the object models generated automatically by RDFReactor with the ones that we have developed ourselves.

In the paper, we had mentioned the out-of-memory problems we had with using OWL for inferring relations, due to the sheer size of our database. The Jena developers gave us some useful suggestions on things that we could try to make this work and also asked us for examples of relations that we would want to infer.

The conference was a very enjoyable learning experience for us - we even managed to present our paper without too much panic or too many mishaps :-)

Thanks to the Jena team for organising such a great event...looking forward to the 2007 conference!

posted by Priya Parvatikar at 9:06 am

 

It's worth letting your techies out every now and then...

Saturday, May 20, 2006

Congrats to Katie and Priya, who presented a paper on the Metastore project at the Hewlett Packard Jena User Conference earlier this month, and came away with the prize for "Best Applications Paper".

They went on to talk further about the project at XTech in Amsterdam, where Leigh also presented a paper on SPARQL.

Watch this space for further updates on these and other projects...

posted by Kirsty Meddings at 1:49 pm

 
