Clifford Lynch Keynote at Charleston
Thursday, November 08, 2007
As Charlie Rapple recently posted, I'm currently attending the XXVII Annual Charleston Conference. This morning I attended the keynote presentations and wanted to share the notes I took during Clifford Lynch's presentation, "Scholarly Literature as an Object of Computation: Implications for Libraries, Publishers, Authors".
Lynch opened by describing the need to take an expanded view of the scholarly literature, suggesting that forthcoming changes in how the literature is used and accessed will put strain on a number of arrangements within the industry, including areas such as service delivery, licensing, etc.
Lynch's main topic was the growing body of research and effort that surrounds computational access to the scholarly literature; in short, analysing scholarly papers using text and data mining techniques. Lynch suggested that Google's PageRank was an early example of the power of this kind of analysis, and proposed that the time is ripe for additional computational exploration of scientific texts. Lynch noted that the life sciences are particularly active in this area at the moment, largely because of the large commercial pay-offs in mining life science literature (think pharmaceutical industry).
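As an aside from my notes: the core idea behind PageRank is simply to rank items by the structure of the links (or citations) between them. Here's a minimal sketch of that idea in Python, using a tiny invented citation graph; it illustrates the general technique only, not Google's actual implementation.

```python
# Minimal PageRank-style ranking by power iteration.
# The citation graph below is invented purely for illustration.

def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each node to the list of nodes it links to / cites."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, outlinks in graph.items():
            if not outlinks:
                # Dangling node: spread its rank evenly across all nodes.
                for n in nodes:
                    new_rank[n] += damping * rank[node] / len(nodes)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[node] / len(outlinks)
        rank = new_rank
    return rank

# Hypothetical papers A-D citing one another.
citations = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for paper, score in sorted(pagerank(citations).items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
```

Run over a real citation graph, the same calculation would surface the most "central" papers in a corpus, which is the kind of computational exploitation of the literature Lynch was pointing to.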
When researchers conduct this kind of research they often want to combine the public literature with private research data (used as "leverage on the literature"), mixing it with other useful data sources, such as thesauri, geographical data, etc.
Lynch also noted that current content licensing agreements don't allow or support this kind of usage, and wondered how the legal and licensing frameworks need to adapt to support it.
Lynch then moved on to discussing three key questions:
The first was: "Where is the scientific corpus mined?". Lynch observed that there are "horrendous" problems with distributed computing scenarios, e.g. normalization, performance problems, cross-referencing and indexing across sites, and how to securely mix in private data.
Lynch felt that, realistically (at least in the near term), mining will be about computation on a local copy of the literature. How do we set things up so that people are able to do that? Lynch noted that institutions may need to consider their role in building, maintaining (and purging) these local copies of the literature. For example, do they become owned and maintained as library services?
Lynch also noted that current hosting platforms don't make it easy to compile collections, although things are much easier with Open Access titles.
Lynch's second question was: "What is the legal status of text mining?". Lynch considered this to be a highly speculative area, with little case law to draw on.
Lynch introduced the notion of "derivative works" under copyright law. Some derivatives can be produced mechanically (e.g. "the first 5 pages"). Others, like translations, involve some additional creativity. Summaries of works, however, are not usually viewed as derivatives, and become the property of the summarizer. The current presumption is that computationally produced content is derivative, and so there are obvious issues for data mining.
Lynch suggested that we need to start being explicit about "making the world safe for text mining", for example by including special provisions in licenses.
As an aside, Lynch wondered what computation Google might be doing, or might be able to do, on the corpus it is currently assembling through its various digitization efforts.
Lynch's third and final question was: "Do we need to change the nature of the literature to support computation?" For example, by making texts available as XML so that they are easier to analyze.
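To illustrate the point (my example, not Lynch's): when an article is available as structured XML, its components can be addressed directly rather than scraped out of presentation formats. The element names below are invented for illustration and don't correspond to any particular publisher schema.

```python
# Sketch: pulling structured pieces out of an XML-formatted article.
import xml.etree.ElementTree as ET

article_xml = """
<article>
  <front>
    <title>An Example Paper</title>
    <abstract>We study something interesting.</abstract>
  </front>
  <body><p>Full text would go here.</p></body>
</article>
"""

root = ET.fromstring(article_xml)
print(root.findtext("front/title"))     # An Example Paper
print(root.findtext("front/abstract"))  # We study something interesting.
```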
Lynch pointed to some useful underpinnings that are already under active research, e.g. efforts to analyze texts to identify and extract references to people, places and organizations. Lynch explained that these are major underpinnings for more sophisticated analysis.
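As a rough illustration of this kind of entity extraction (again my sketch, not something Lynch showed), here is how a general-purpose NLP library such as spaCy can pull people, places and organizations out of a sentence; it assumes the small English model has been installed.

```python
# Named-entity extraction sketch using spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Clifford Lynch of the Coalition for Networked Information "
        "spoke in Charleston, South Carolina.")

doc = nlp(text)
for ent in doc.ents:
    # Labels such as PERSON, ORG, GPE (geo-political entity)
    print(ent.text, ent.label_)
```

Running this kind of extraction across a whole corpus is what turns a pile of papers into something you can cross-reference, index and mine.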
Adding microformats or subject-specialist markup to documents would also help identify key entities in the text. Lynch wondered who would be responsible for doing this: the authors? Or would it become a new value-added service provided by publishers?
Labels: "charleston conference", "data mining"
posted by Leigh Dodds at 4:32 pm