All My Eye

ASA Conference review: "Value in Acquisition"

Tuesday, February 28, 2006

This week, I have been mostly attending the annual conference of the Association of Subscription Agents and Intermediaries. Held at London's Royal College of Nursing (oh, how I enjoy its proximity to John Lewis on Oxford Street ...), this year's conference addressed the issue of "Value in Acquisition", with 16 speakers grouped into 5 themed categories.

For your reading pleasure, I feverishly scribbled notes down during the event in order to blog a review upon my return. Of course, I would have liked to blog during the event, but (a) I didn't see anywhere to plug in my laptop, so you'd only have got the first 45 minutes' worth, (b) I don't think RCN has wifi and (c) I'm known for the fierceness (and thus, noisiness) of my typing, and I don't think the presentations would have been greatly enhanced by an arrhythmical staccato percussion accompaniment emanating from the audience.

Anyhoo. For simplicity, I shall blog each of the conference's sub-themes in a separate posting. They are as follows:

I may also, here and there, interject my synopses of the official version of events with a "View from the bar", recording some of the delegates' opinions on the subjects that caused most discussion. Read on, Macduff ...

posted by Charlie Rapple at 9:30 pm

ASA: "Aggregating content and the substitution of subscriptions"

In this final session, customisation and interoperability were recurring themes, with some aggregators performing better than others in these areas. From the publisher perspective, the importance of balancing aggregator licensing with other revenue streams was highlighted, whilst aggregators will need to consider the broadening of future usage both in terms of media/format and geography.

ProQuest's Simon Beale (view slides – .pps) was not the first to warn against over reliance on Google (the many librarians in the audience were no doubt nodding their heads off in agreement), and suggested that, in time, Google's efforts for the public good will have to conflict with their need to give investors a good ROI. Google has relevance to some of the wider debates which Simon listed:

open access vs total access
aggregation vs standalone delivery
general search engines vs specialist retrieval systems
big deals vs specialist content
print vs online

Simon wondered whether the Open Access (OA) issue has become bigger in debate than in reality; though it will continue to grow, it is but one piece of the overall journal business, and will ultimately be only one of a number of journal access models. Nonetheless it is valid to suggest that public/government organisations will not always be able to support non-commercial publishing (e.g. PubMedCentral?). Institutions will increasingly want institutional repositories to leverage their intellectual property (by disseminating their output more widely, thus raising their profile – and having control of that process).

Intermediaries, Simon asserted, are able to sit somewhat on the fence where OA is concerned. Aggregators should embrace any option which supports their overall aim to provide access to quality content. The future will require content providers to become less English (language)-centric, with increasing importance of "local" content necessitating multi-lingual interfaces, search options etc. within the next 10 years. Simon suggested aggregators would also need to prepare for wider use of mobile devices when accessing scholarly content (it would be interesting to know if current statistics bear this out – I find it hard to imagine most of the "content users" I know wanting to access publications in this way).

EBSCO's Melissa Kenton (view slides – .pps) went back to basics to explain that libraries subscribe to aggregated databases to add to their core journal collections, and that aggregated content proves popular with undergraduates who are less fussy about the source of their content. Publishers, meanwhile, use aggregators to access new markets e.g. public libraries, smaller colleges, high schools. The major difference between subscription and aggregator access is that databases do not guarantee to provide content in perpetuity (as the aggregator does not own the content it licenses).

Both Melissa and Simon conceded that very few databases are sold via agents (less than 1%, in ProQuest's case), with orders tending to come direct to the aggregator. Following a question from Scholarly Information Strategies' Chris Beckett, both confirmed that publishers are given some control over the markets their (aggregated) content is sold into, although Melissa noted that EBSCO prefers not to license content with such stipulations, as they are costly to observe.

Southampton University's Gordon Burridge enquired about libraries' ability to customise aggregator collections according to their needs, both in terms of content and interface. Whilst ProQuest is "aiming for more commonality" in its interface, EBSCO's aggregated databases must be careful not to conflict with other EBSCO products, and thus do not enable creation of a library-specific e-journal collection (which is possible for subscribed content accessed via the EBSCO EJS package). Queries about Gale's capabilities in this area were addressed by delegate (and Gale representative) Diane Thomas, who confirmed that Gale enables libraries to create customer-specific collections (thus allowing the "buy-by-the-drink" purchasing referred to the previous day by Rick Anderson).

Consultant John Cox's polished performance at the podium (view slides – .pps) posited a second distinction between journals and databases, whereby journals are the "minutes of science", and databases are used "largely for teaching". Embargoes, being considerably more obstructive to research than to teaching, are therefore enough to resist the cannibalisation of subscriptions by aggregated content (though they should be reasonable i.e. more than 12 months is unacceptable). Title specificity continues to be a factor, such that libraries are more likely to replace a cancelled subscription with on-demand document delivery than with access to an aggregator collection.

Not so, argued the ever-voluble Chris Beckett of Scholarly Information Strategies, whose recent survey indicated that 48% of respondents considered a database to be an adequate substitution for a subscription. Some scholarly joshing followed, and eventually the two agreed that cannibalisation is a risk, but primarily from budget squeeze; databases are a secondary issue.

Order was restored by Helen Edwards of the London Business School, a postgraduate business school which subscribes to 700 journals and 137 aggregator products (view slides – .pps). LBS has invested in creating its own interfaces with relatively "deep" links to available content, to provide interface commonality to its users. (It has also configured its link server at Google Scholar to enable "surfacing [of] content through a range of access points".) Helen reiterated the importance of customisability and interoperability: libraries need to be able to stamp their "seal of approval" on resources they license, and said resources must recognise that they are but one piece of a wider picture.

Blackwell Publishing's Steven Hall brought the conference to a close with a presentation (view slides – .pps) which he suggested might have been called "practising safe aggregation" ... some publishers abstain; some practise a somewhat unsatisfactory withdrawal method. The mass of aggregators can be broken down by focus – be it a niche market, an extension of A&I activities, or a specific discipline. Some are complementary to publishers' activities; others are alternatives. The complementary business model comprises:

an embargo (12 months is acceptable for STM, but not for humanities)
no archival rights for libraries
licensing of collections only (not individual journals)
a "one stop shop" for content (meeting the needs of undergraduates hunting for citations)
functional (i.e. non-sophisticated?) interfaces
non-core markets

The alternative business model comprises:

no embargo (strong cannibalisation risk, but in some markets there may be advantages which outweigh the potential losses)
archival rights (maybe)
licensing of individual journals
emphasis on full text offering, not A&I merits
competition in core markets

A third way is evolving:

other types of content added to "core" content
contextualisation to assist understanding
"smart" searching
interactivity

Steven provided a handy checklist for publishers considering licensing content to aggregators, to ensure they keep the balance right and avoid subscription cancellations:

royalty rates/model
currency/embargo
archival rights
length of agreement
reporting of customers/titles/usage
sub-licensing (the longer the chain, the less control you have)

ASA Chairman Peter Lawson questioned whether the societies on behalf of which Blackwell publishes had views on content licensing; yes, said Steven, but they mostly allow us to make these decisions.

posted by Charlie Rapple at 8:24 pm

ASA: "Controlling low-priced access"

In a relatively short session, we heard from Swets' Robert Jacobs (view slides – .pps) and Karger's Moritz Thommen (view slides – .pps) about the supportive role agents can take to simplify life for both publishers and libraries as licensing models/pricing diverge, and about some specific activities being undertaken to reduce subscription fraud.

Robert addressed the increasing range of licensing models, which has come about largely as a result of the squeeze on institutional budgets. The more choice a model offers for libraries, the more complex it is (print = simple, usage dependent pricing = relatively complex). Libraries primarily want clarity of pricing, so title-based subscriptions remain dominant. As the models fragment, so agents can increase their value to the library by:

increased management of e-resources
driving standardisation
data discovery and evaluation
maintenance of centralised metadata databases.

Moritz presented an interesting overview of some of the work Karger has done to slow the decline in personal subscriptions during the last few (30?) years, and to avoid abuse of personal rates by unscrupulous institutions/agents. As background, Moritz noted that personal subscriptions began to decline during C20th as library copies became more readily available/accessible to scientists. As subscriptions are cancelled, so prices increase – but due to higher price sensitivity in the personal subscriber market, personal subscription prices cannot rise as much as institutional. Thus, there is now a (widening) gap between individual and institutional prices for a journal. (Moritz cited a journal for which individual/institutional rates were £80/£125 in 1985, for which the gap is now £300+).

This kind of price gap evidently encourages exploitation, such as:

society members

donating their issues to the library

or ordering on behalf of the library

or reselling their copy to the library

agents purchasing personal subscriptions, and reselling them to libraries at lower prices than the publisher's list price

Karger's efforts uncovered instances of multiple "personal" subscription orders coming from the same address, accompanied by cheques signed by the same drawer. When the personal subscriptions were withheld, they were chased by an agency which in the end re-ordered institutional subscriptions. Karger have implemented new measures to prevent such fraud, for example, requiring evidence of personal membership to a scientific/professional association, or ensuring that subscriptions are paid by private funds. This has caused some unhappiness amongst Karger users but seemingly amongst those who cancel their personal subscriptions only to reorder institutional ones! (Moritz noted that the ASA's guidelines on the subject are firm and clear). Publishers have won court cases against such unscrupulous agents, and many now offer personal subscriptions only directly. Could societies (in particular) do more to vet members to prevent "phony" members joining and reselling lower-price subscriptions? It is in the interests of all links in the information chain to eliminate this problem.

Moritz suggested that the problem may be easier to manage in an online era, due to the data required to set up online access; Chris Beckett (Scholarly Information Strategies) later commented that online access presents its own problems such as username/password sharing, which is equally hard to control. Ian Johnson (Robert Gordon University) made the analogous point that library associations have noticed members moving from institutional to personal memberships.

posted by Charlie Rapple at 6:17 pm

ASA: "Low price access"

Monday, February 27, 2006

The HINARI initiative (launched in 2002) enables access to current medical research for researchers in developing countries. HINARI's Maurice Long (view slides – .pps) highlighted the strong support for this initiative from scientific publishers, with 3,300 journals now offered for free or subsidised access. This can be attributed to two key facts:

the majority of HINARI-eligible institutions could not afford to pay for the material were it not offered to them without charge
publishers are able to opt out of enabling access in countries where they already have, or expect to have, good business

Uptake by eligible institutions is good, with 3 million article downloads in 2005, but connectivity can still be difficult/expensive in developing countries. Elsevier's Tony McSeán spoke (view slides – .pps) of hardware and power supply limitations, as well as cultural issues making it hard for local librarians to persuade researchers to use the service. Participating publishers are implementing workarounds where possible, for example, using loband.org to strip content down to text-only for thin-pipe delivery. (Tony also made the interesting suggestion that HINARI may be encouraging politicians in developing countries to support better bandwidth for HE institutions).

Elsevier are considering offering Scopus for inclusion in HINARI later this year, which would significantly extend the project's coverage. They dismiss oft-raised concerns about abuse of their content, and indicate that there is no evidence to date of systematic content harvesting, or other forms of abuse.

HINARI is currently looking to further devolve technical support to localised centres, and to fund better training/promotion of its services. In this respect, Maurice Long mentioned evaluation of the HINARI programme in relation to its predecessor INASP, which is successfully facilitating journal purchasing in developing countries. One of INASP's publisher partner representatives, Alan Harris from Springer, was next on the podium, and proposed (view slides – .pps) some potential pricing models by which institutions in developing countries could purchase journal content in the future, for example:

a low, flat fee based on GDP
a calculated fee e.g. (cost + overheads + small margin) divided by (need + available funding)

Both Alan and Tony suggested that current initiatives are a stepping stone; linking researchers from developing countries into the wider academic publishing network will encourage more publishing within developing countries. Agents, then, would take on the role of mediators and administrators between developing world libraries and publishers – but may need to accept lower service charges.

posted by Charlie Rapple at 9:56 pm

ASA: "Collection Management"

Main conclusions from this session:

ERM to save the world (part 2)?

e.g. reducing burden of holdings database maintenance, and improving efficacy of processes drawing on that data

libraries need to act like a business (part 2)

marketing of library's value/services is necessary to protect budget/jobs

refocus available resources to use them more effectively

e.g. consider the article as the unit, not the journal, within a collection

agents to take on wider role (part 2)

e.g. informing libraries of title/publisher changes; activating and troubleshooting online access

Mieko Yamaguchi is now back in the front-line of e-journal administration following Bangor University's famous (infamous?) decision to lose 8 of its 12 librarians last year. In her presentation (view slides – .pps), Mieko traced the recent history of Bangor's periodicals collection, from e-access for £10 per publisher per year in 1996 (under the terms of NESLi's predecessor, the Periodicals Site Licence Initiative), through a move to e-only and back again (2001–2004; reverting to e+print for VAT reasons), configuration of a link resolver in 2003 (and hopes that a forthcoming ERM implementation will help with holdings database maintenance), and finally to 2005, when Bangor began licensing individual titles for the first time (all previous access being purchased via NESLi, other big deals, or backfile deals).

Mieko provided some food for thought re. marketing of libraries and their services, when she mentioned that Bangor had not publicised the implementation of their link resolver (believing it should be transparent to users) but that, in retrospect, this may have undermined their perceived value to users/funding bodies – and perhaps informed the funding body's decision to axe those 8 librarian jobs.

Mieko suggested that, since e-subscriptions are generally not purchased at list price (and instead reflect several parameters e.g. JISC banding, FTE, off-site access requirements, past value of print subscriptions etc.), libraries should in future negotiate the terms of their licence directly with publishers, but order and pay through agents. She supported Rick Anderson's premise that agents could take on more troubleshooting, and suggested also that they should be responsible for activating library access [Ingenta plug: I think we are unique in enabling agents to do this via our agent activation program].

Following a question about Bangor's users' opinions of the shift from print to e-only, Mieko made an interesting (to me – probably obvious to most serials librarians!) distinction between students, "who think print doesn't exist", and academics, who favour specific journals and are not particularly bothered about format.

Indiana University's Julie Bobay is a captivating speaker whose treatment of the theme (view slides – .pps) incorporated a perspicacious overview of US collection trends over the last 50 years. Collection development in the noughties, Julie suggested, is like "gopher-bashing" (I don't think I was the only member of the audience grateful for Julie's explanation/illustration of this term!), with potential problems continually popping up all over the place which require fast, focussed attention to resolve.

Today's libraries are hybrid libraries, part of the global digital library and no longer defined by their local collections. Whilst users have increasingly varied needs and expectations, the market place and associated procedures are less certain – and budgets are tight. Julie came up with some great quotations to indicate how libraries feel about their current role; my favourite was from Mark Sandler (Collection Development Officer at the University of Michigan): "Hopelessly lost, but making good time".

Julie further cited Sandler's description of librarianship as a team sport, where decisions are not binary/straightforward but include many variables e.g. what to get free, what to license, what to build, and what to do cooperatively. Collections should be redefined for an e-world where the unit is increasingly the *article* not the *journal*; libraries need to operate at a more granular level, and projects like the JISC-sponsored TOCRoSS are a step in the right direction. TOCRoSS is a collaboration between a publisher, a library software supplier and a university library; as a community, we need to collaborate more – for example, to track OA articles and decide how libraries could best incorporate these in their collections.

Right now, however, the journal is the main guarantor of quality and there is no reliable alternative to ascertain the credibility of "free-floating" articles – plus, libraries will need to re-focus resource in order to start cataloguing/controlling access at article level. (I think it was Gordon Burridge of Southampton University who made an interesting point at this juncture, suggesting that pricing could move to a phone-company model, with a flat fee (line-rental) topped up by fees per article used (call charges).)

posted by Charlie Rapple at 7:45 pm

ASA: "Comparing and contrasting the methods of purchasing"

Some excellent presentations in the first session, from which the recurring issues seemed to be:

libraries need to adapt their roles in a digital era

agents need to adapt with them to protect and justify their place in the chain

a lot of eggs are being placed in the basket that is ERM software

e.g. overlap analysis

libraries need to operate as businesses, for many diverse reasons, e.g.

to provide compelling value propositions to funding bodies in order to protect and increase their share of institutional budgets

to reclaim VAT paid on e-journal licences
to ensure that the available resources are being used efficiently

libraries will need to take increasing responsibility for success of e-journal access

e.g. maintaining clean, current directory services for Shibboleth

archiving is climbing the agenda, but libraries don't have the funds

Session-by-session overview
The perennially excellent Rick Anderson (Director of Resource Acquisition at the University of Nevada, Reno) opened the event with a presentation (view slides – .pps) which touted materials budget, staff time and (I liked this) staff morale as key library resources, and went on to outline the ways in which agents/intermediaries help to conserve those resources – for example, by consolidating invoicing (saves time) and troubleshooting e.g. access problems (saves "weight", i.e. the impact on staff of dealing with such irritations, both in terms of time and stress). However, Rick stressed that whilst libraries will be prepared to pay some premium for these services, no good librarian will divert much of the budget to these ends, and that there are other ways to make the most of the materials budget – for example:

consortial purchasing offers discounts as a result of consolidation (although are these discounts enough to balance the "weight" of dealing with "mind-boggling" consortial business models?)
big deals use up a lot of the budget, but offer overall savings – or do they? It's important to assess the real value of the additional titles
document delivery a.k.a. "by-the-drink" purchasing. Rick does not count this as a threat to subs agents in the short term, but suggests that in the long term it may be easier to justify "just in time" purchasing than the "just in case" (traditional subscription) model, which arguably results in much wasted material

Rick points to his own library's decline in circulation figures (from 20.1 items per patron in 1994 to 9.7 in 2005) as evidence of a sea change in use of libraries, which will "have to" have an effect on management of serials budgets – at what figure will this decline stop, and what will continue to be circulated at that point? How can agents protect their role in this market? To the latter question, Rick offers some suggested answers:

to protect the value of invoice consolidation, agents should improve the renewals process (and minimise errors within it), and could perhaps begin brokering database licences as well as journal subscriptions
to protect the value of claiming, agents could do better at keeping libraries in the loop
to protect the value of troubleshooting, agents should be finding and fixing problems before libraries (and their patrons!) do; *this* really is worth money

ASA Chairman Peter Lawson revisited Rick's mention of the "white elephant" of residual holdings (the result of the "just in case" purchasing") and suggested these holdings haven't changed in 20 years; so much for collection development? Rick conceded that librarians are not adjusting well to the concept of limited resources, and need to re-assess the basic function of the academic librarian – are the available resources being used efficiently?

In response to Elsevier's Tony McSeán, Rick agreed that the afore-mentioned elephant is still worshipped by academics, who will require "something cataclysmic" to change their view – but what, argued Peter Lawson, represents a cataclysm if not these last two decades of the serials inflation crisis? Rick suggests the effect on *academics* may not yet be felt, but will be over the next few years.

Ian Johnson of Robert Gordon University wondered what libraries could do to get more of their university's budget, given that research content is still valuable and academics are still being encouraged to produce it. We need to demonstrate increasing value to our institutions, responded Rick; we need to present our role NOT as collection building/management, but as ensuring people get the information they need. And it's easier to meet research needs by providing online access. Don't expect funding bodies to be philosophical and altruistic about the role of the library – prove your value.

Paul Harwood of Content Complete had the unenviable task of following Rick on the podium, with a fact-packed presentation (view slides – .pps) of results from a recent survey of NESLi 2 reps in UK HE institutions. Paul indicated that the UK's slow migration to online holdings (31% of respondents' budgets was still print-only) is predominantly due (47%) to VAT charges on e-journals, with only 22% of respondents citing archival concerns, and 19% indicating academic preference (the latter is increasingly less of an issues, but departments do like to retain a physical presence in the library). The government's recent Select Committee Report (concluding that VAT would continue to be charged on e-journals) means that libraries will be forced to accept this issue (as, indeed, Bangor University and likely others have done). 80% of libraries say they cannot currently reclaim VAT, but Paul noted that it is possible for a library to set itself up as a company in order to do so.

Archiving is an increasingly hot topic for librarians, with considerable support for the Dutch National Archive, C/LOCKSS, Portico and the British Library's National Research Reserve (which has even gained coverage in the national press – which I cannot resist mentioning since my own letter on the subject was published by The Times (of London) last month :-) [Sally Rumsey, below, warns that despite the enthusiasm for archiving, very few libraries have a budget for digital preservation].

Big deals remain popular with UK libraries, despite the main downsides (no cancellations, irrelevant content, movement of titles, inability to tailor list, high administration costs, etc.) Paul raised an interesting new (to me) issue: the lack of ownership where big deals are concerned, the variety of the content meaning that no one area owns the deal or takes responsibility for evaluating its usage.

Although 82% of libraries currently buy more than 75% of their subscriptions through an agent, this percentage is decreasing as publishers make it easier for libraries to deal direct, and as the agent's role is (perceived to be) devalued in the electronic environment. Paul suggests it may be time to change the pricing model, but warns that the rigidity of university finance systems will delay any potential changes to the granularity of purchasing models.

Sally Rumsey (LSE Library) set out to consider the overall costs of e-resources, above and beyond licensing/subscription fees (view slides – .pps). Sally suggested that despite increasing standardisation (e.g. COUNTER) and federation (e.g. Athens), e-journal administration is costing more than print journals, and more interoperability is needed. In addition to staff costs and overheads, Sally flagged up the oft-forgotten end-user costs of e.g. printing articles. She also noted that the costs of open access are not yet known; who really pays under author-pays models, or, are we really making savings when effective self-publishing requires an institutional repository to publish in?

JISC consultant Terry Morrow then addressed us on the subject of Shibboleth, a standard which passes authentication responsibility from the resource provider to the current user's home institution, thus enabling providers to authorise access without knowing specifically who the user is (view slides – .pps). Terry highlighted that federated access control requires trust between the various links in the chain, hence current Shibboleth federations tend to be country- and/or sector-based. For the moment, the UK's federation is managed by JISC, and funding from JISC will continue until 2008. (Shibboleth differs from AthensDA in that it is open source; AthensDA uses a proprietary protocol, and must be licensed). Shibboleth's success depends on clean, up-to-date, compatible directory services being maintained by participating institutions.

posted by Charlie Rapple at 6:15 pm

Paper at XTech

Friday, February 24, 2006

The MetaStore project team have had a paper accepted at the XTech 2006 conference in Amsterdam.

XTech (formerly XML Europe) is all about XML technologies; topics include Semantic Web and RDF, Tagging, Annotation, Mashups, Web Services - all the new fangled "Web 2.0" stuff...

It is apparently "the premier European conference for developers, information designers and managers working with web and standards-based technologies." (yikes!) The keynote speakers are from Amazon and Yahoo. (double yikes!).

This is the abstract:

The aim of the Ingenta MetaStore project is to build a flexible and scalable repository for the storage of bibliographic metadata spanning 17 million articles and 20,000 publications.

The repository replaces several existing data stores and will act as a focal point for integration of a number of existing applications and future projects. Scalability, replication and robustness were important considerations in the repository design.

After introducing the benefits of using RDF as the data model for this repository, the paper will focus on the practical challenges involved in creating and managing a very large triple store.

The repository currently contains over 200 million triples from a range of vocabularies including FOAF, Dublin Core and PRISM.

The challenges faced range from schema design, data loading, SPARQL query performance. Load testing of the repository provided some insights into the tuning of SPARQL queries.

The paper will introduce the solutions developed to meet these challenges with the goal of helping others seeking to deploy a large triple store in a production environment. The paper will also suggest some avenues for further research and development.

Now we just need to write and deliver a presentation... eep! In case it all goes horribly wrong, I'm now recruiting for a volunteer member of the audience to faint at my signal...

posted by Katie Portwin at 3:44 pm

My triplestore's bigger than your triplestore..

Monday, February 20, 2006

I am currently working with Priya, on a giant RDF triple store. The aims of the "MetaStore" Project are to:

1. Pin down all of our metadata in one place. (Currently we have article headers in XML files, references in a relational database, other databases which map legacy article identifiers to other (legacy) article identifiers.. etc etc - basically, an integration nuisance for the content team!)

2. Model it properly - using an ultra-flexible database model, and industry standard vocabularies. (For example, it is nice to model books as books, and supplementary data as itself - rather than shoehorning them into the journal/article model.)

The first stage of the process was to do some initial RDFS modelling and convert a big chunk of test data into RDF/XML.

Next, we experimented with a few RDF engines. We went in the end for Jena, with a PostgreSQL back-end. Jena has a solid Java API, good support, and scaled well in initial query performance tests. The Postgres back-end gives us a tried and tested master-slaving mechanism.

We had 4.5 million articles to load.. so had to develop strategies for optimising load performance.

After a January spent watching load logs scroll past, all of the backdata headers are finally loaded, and a daily process keeps the store updated with new arrivals.

The interesting thing about this project (in terms of RDF research) is its scale:

This is a BIG triplestore:


metastore=# select count(subj) from jena_g1t1_stmt;
count
-----------
158267598
(1 row)

(And the references aren't in there yet.)

Because of the scale, we haven't been able to use our OWL ontology to infer extra triples, as Facilities won't buy the planet-sized amount of memory we'd need... This is a big, dumb(ish) store.

One of the interesting features of the store is the number and variety of identifiers we have ended up with for each resource. (We've been using dc:identifier from Dublin Core). But more on that later...

posted by Katie Portwin at 2:45 pm

Use of RSS by Librarians and Researchers

Thursday, February 16, 2006

I've previously written a bit about the IngentaConnect RSS feeds and the improvements we've made to them. One of the key reasons we've included as much metadata as possible in the feeds is to make them useful in a number of different contexts, not just for human readers.

The IngentaConnect feeds are currently available as RSS 1.0, enriched with Dublin Core (DC), PRISM and FOAF metadata. This hits a "sweet spot" in that the RSS 1.0 format is well supported in feed readers and aggregated and, as a RDF format, that data is also available to semantic web applications. I've got an eye on Atom support for the future, but for now I don't see a compelling reason to alter how we're publishing the data.

I'm not aware of any feed readers that can do much with the enhanced data we're providing, although as a number of publishers are providing DC and PRISM metadata in their feeds, a reader tailored for researchers and/or embedded in a citation manager seems like a "no brainer".

As I don't see this kind of innovation happening in RSS readers, I think expansion of RSS usage in scientific and publishing circles is going to come more from dedicated applications or "mashups".

CiteULike was the first application that I'm aware of that harvested the IngentaConnect RSS feeds. It extracts the publishing metadata for display to its users and for storing alongside their bookmarks. I'm pleased to see that others are now starting to emerge, the most recent being uBioRSS a "taxonomically intelligent feeder reader".

uBioRSS harvests the IngentaConnect feeds (amongst others) to attempt to spot and extract species names from the the content. These names are checked against the uBio taxonomy database to enable users to browse and subscribe to RSS alerts tailored to their research interests: e.g. a particular genus of animal, class of micro-organism, etc. There's immediate value add for researchers here as they can avoid having to subscribe to each feed individually, scan them for relevant content, etc.

I'm encouraged to see this kind of innovation coming from researchers themselves. I'd love to hear from other people using data from IngentaConnect in this way. Actually I'd love to hear what additional data people would like; we have plans in this direction, but it'd be nice to make those dove-tail with needs from end-users.

One possible growth area for use of RSS is in library applications. The JISC funded TOCRoSS project is aimed at exploring the interface between publishing, RSS and library applications. One deliverable of this project will be a means to feed data from RSS feeds into a library application such as an OPAC. It's an interesting project and I'll be keen to see how it plays out. (Although, I must admit to some quibbles over the technical directions: RSS1.0+DC+PRISM seems like a better bet than RSS2.0+ONIX/PRISM to my mind).

We've been doing some brainstorming in this direction ourselves. For example, how can RSS feeds be further tailored to library applications and users? e.g. inclusion of proxy service and OpenURL linking, library branding, or perhaps inclusion of additional controls to create "Immediate Action Feeds".

If you're developing applications in this space, or can think of intended uses of RSS feeds that would benefit your patrons, we'd like to hear your thoughts. Leave a comment under this posting, or drop myself or our product manager, Charlie Rapple an email.

posted by Leigh Dodds at 4:50 pm

State of Full-Text

Alf Eaton has conducted two thorough surveys on the state of full-text articles in Biomedical journals. He's investigated content available as PDF and HTML, from a range of different publishers and sites.

Alf's conclusions show up some shortcomings in the way that content is published, in particular the lack of appropriate metadata that would enable researchers to better manage content in their personal libraries.

While IngentaConnect isn't explicitly covered in the review, you can be sure that we'll be taking these usability suggestions on board as we plan enhancements to our content management and delivery systems.

Enhancing of PDF content to include additional metadata (e.g. as XMP) and linking (e.g. to relevant data sources, not just references) is an area in which we've been conducting some research on recently.

posted by Leigh Dodds at 4:48 pm

What we're excited about today: digitisation

This is not shameless plugging, honest. I genuinely am excited about the fact that we can now digitise content i.e. you give us your hard copy, we give you all-singing, all-dancing, kick-ass active PDFs. (Previously we worked from PDF or PostScript). This is such a cool development because it enables us to start putting reeeaaalllly looooonnnnggg backfiles online in cases where the publisher would previously have been stumped by lack of digital content – we have one such customer (not sure if I'm allowed to say who) who is now able to put a 100-year backfile on IngentaConnect. Duuuude.

Now, of course I know that JSTOR have got stuff online that goes back way longer, as have all sorts of others – but let's not throw the oranges in with the apples. To my knowledge, content in archives like JSTOR's is "flat", i.e. it doesn't sing, dance or kick anybody's ass. Which is why I'm so excited that we now Have The Power to add sexy features to ancient content (without resorting to plastic surgery). I can still remember trawling through dusty old indexes when I was at University, desperately wishing that it was easier to trace a relatively little-known artist's correspondence through the volumes of early C20th art journals, without having to write out reference lists, return to the OPAC, painstakingly look for each reference, write out its location, order it from the store, retrieve it some time later (do I imagine that it could be days?), and start the process again. It's why I fell in love with reference linking when I started working at CatchWord, and it still informs the passion I have for electronic publishing & associated technologies today.

Anyway. So. It's pretty simple really. We are scanning hard copies and running them through Readiris, which uses OCR to extract the text and embed it back into the PDF in a format that can then be processed through our proprietary metadata extraction software (known, oddly, as LemonElf.) At that point, Bob's pretty much your uncle – our reference activation within PDFs is one of our oldest and most successfully honed & refined products; it's one of the key differentiators between us and other e-journal hosts. (I'd like to be able to claim that we were the first to do it, but those pesky physicists got there first. Ain't that just the way ...). I'm hoping to gain a current publisher customer's permission to open up an article for demonstrating just how cool the digitisation is, both in terms of the quality of the scanned PDF, and the success of the metadata extraction & reference activation. I'll update this post with a link to that as soon as I can. In the meantime, I feel a celebratory pint or two coming on. Woo.

posted by Charlie Rapple at 4:34 pm

Cast of Characters

Tuesday, February 14, 2006

No self respecting group blog could possibly get itself up and running without first introducing its cast of characters.

We've asked everyone to fill out their blogger profiles with details about what they do at Ingenta, their interests, "lame claim to fame" and the usual mix of personal interests. So, rather than repeat all that here, here are links to everyone's profiles:

Vicky Buser, Information Architect

John Clapham, Technical Lead

Rob Cornelius, Web Developer

Leigh Dodds, Engineering Manager

David Jones, Software Engineer

Talvinder Matharu, Software Engineer

Kirsty Meddings, Product Development Manager

Priya Parvatikar, Software Engineer

Katie Portwin, Technical Lead

Lucy Power, Information Architect

Charlie Rapple, Head of Marketing

As you can see, if you browse the profiles, we've been having way too much fun with the South Park Studio!

I expect the list of contributors will grow over time, hopefully to include the majority of the technical team. Some of us already have personal websites and/or blogs, but "All My Eye" will give us a more focussed forum to share the Cool Stuff we've been doing in our day jobs. Hope you find something useful!

posted by Leigh Dodds at 12:37 pm

All my eye ... why?

Monday, February 13, 2006

Hallo and welcome. We've finally agreed to have a public blog with postings by various folks on the Ingenta staff. It's taken us a while, because we're a naturally modest bunch and didn't imagine we'd have much of interest to say. But it turns out we keep thinking of stuff we'd like to share with the outside world – stuff that's not quite worthy of a press release, and that's going to be a bit stale by the time our next publisher or library newsletter comes out, but stuff, nonetheless, that we'd like to communicate.

define: stuff
Hmmn. Well, we don't want to restrict ourselves unduly by setting boundaries, so let's just say that, *amongst others*, we will post the following types of stuff:

our views on current technical practices and hot topics
commentaries, reviews, passing mentions & recommendations of projects, blogs, technologies &c that we find interesting
updates on what Ingenta's been doing, our plans, and occasional general news (where we think it has Human Interest – i.e., we're unlikely to post the annual report)
synopses of any public speaking we're involved in, plus notes of conferences, workshops &c which we've attended
additional lightweight stuff to break up the monotony of relentless geekery

Whatever we're telling you, please be assured that our blog postings will always be signed (so you can, for example, be gentle on any technical ignorance on my part, but feel free to take it up with Leigh). We will also be sure to fingerprint our (significant) edits, because we're blog readers too and we know it's very annoying when stuff vanishes.

And finally ... All my eye? Que?
World Wide Words tells us it's an antiquated British term for nonsense (or, to throw in a favourite word of mine, poppycock). Since it survives in both British and American English as "my eye!", I hope it is not too parochial for the transglobal audience we fondly imagine hanging on our every word (despite our supposed modesty). In the context of our blog, it should be taken to indicate not so much that we're talking utter baloney, but that this is a forum for ideas as well as fact, for discussion as well as statement, and ultimately, that it's good to talk.

Oh, and it ties in nicely with the company logo too. On which more later.

posted by Charlie Rapple at 5:57 pm