Wherein the IngentaConnect Product Management, Engineering, and Sales Teams
ramble, rant, and generally sound off on topics of the day

performance, triplestores, and going round in circles..

Wednesday, May 24, 2006

Can industrial scale triplestores be made to perform?

Is breaking the "triple table" model the answer?

Scaling SPARQL performance is something I worry about. Mike Bergman worries about it too.

In our XTech paper, we showed that even a simple, bread-and-butter sample query takes 1.5 seconds on our 200-million-triple triplestore. (Scroll down for the colourful graph.) In our Jena paper, we showed that with a triplestore in a relational database, you're at the mercy of your query planner: some configurations of SPARQL query perform reaaaaly badly.

Both these facts make sense when you remember that your SPARQL doesn't just get done by magic - of course, it gets turned into a ma-hoosive SQL JOIN statement, across the main triples table again and again, like this:
Select A0.Subj, A2.Subj, A3.Obj, A4.Obj, A5.Subj, A6.Obj
From jena_g1t1_stmt A0, jena_g1t1_stmt A1, jena_g1t1_stmt A2, jena_g1t1_stmt A3,
     jena_g1t1_stmt A4, jena_g1t1_stmt A5, jena_g1t1_stmt A6
Where

Databases weren't really designed to do queries like this - it's not surprising that they aren't very fast at it...
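To make the "again and again" part concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not Jena's actual schema or the SQL it generates: the `triples` table, its column names, and the article data (including the DOIs) are all made up for illustration. The point is only that every extra triple pattern in the query adds another alias of the same table to the join.

```python
import sqlite3

# Hypothetical miniature triplestore: one row per (subject, predicate, object).
# Table name, column names and all values are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("article1", "title", "bla"),
        ("article1", "doi", "10.1000/1"),
        ("article1", "volume", "42"),
        ("article2", "title", "foo"),
        ("article2", "doi", "10.1000/2"),
        ("article2", "volume", "45"),
    ],
)

# "Give me the title and DOI of every article in volume 42."
# Three triple patterns, so three aliases of the same table.
sql = """
    SELECT a0.s, a0.o AS title, a1.o AS doi
    FROM triples a0, triples a1, triples a2
    WHERE a0.p = 'title'
      AND a1.p = 'doi'    AND a1.s = a0.s
      AND a2.p = 'volume' AND a2.s = a0.s AND a2.o = '42'
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

A query with seven patterns, like the one above, would need seven aliases, and the planner has to pick a sensible join order across all of them.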

So I was really interested by Kevin Wilkinson's paper on property tables - also at the JUC last week. Kevin's idea is basically this:

Say, in your triplestore, you have this:
triples table
| s | p | o |
| X | p1 | o1 |
| X | p2 | o2 |
| X | p3 | o3 |
| Y | p1y| o1y|

(This means that X has three properties attached to it.)

Kevin points out that often, you get "groups" - patterns in the data. Often, Xs always have exactly one o1, one o2 and one o3.

For example, consider X to be an article, and articles always have one title, one doi, one volume, issue, etc.

In this case, you could have a "property" table like this:
property table
| s | p1 | p2 | p3 |
| X | o1 | o2 | o3 |
| Y | o1y| o2y| o3y|

You could still keep other unsuitable-for-grouping stuff in the main triples table, but drag out groups into 'property tables'. This would make them fast to query. (There's a lot more to it than this - e.g. how properties with cardinality >1 go in their own table too - but you'll have to read the paper!)
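Here is a rough sketch of the "drag out groups" step, again with Python's sqlite3. This is only my reading of the idea, not how Kevin's implementation does it: the table names `triples` and `prop_table` are invented, the `p9` triple is an extra made-up row standing in for unsuitable-for-grouping stuff, and Y is given all three properties so that it fits the group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("X", "p1", "o1"), ("X", "p2", "o2"), ("X", "p3", "o3"),
        ("Y", "p1", "o1y"), ("Y", "p2", "o2y"), ("Y", "p3", "o3y"),
        ("X", "p9", "misc"),  # hypothetical property outside the group
    ],
)

# One row per subject, one column per grouped property.
conn.execute(
    "CREATE TABLE prop_table (s TEXT PRIMARY KEY, p1 TEXT, p2 TEXT, p3 TEXT)"
)

# Pivot the grouped properties out of the triples table.
# This assumes exactly one value per property per subject.
conn.execute("""
    INSERT INTO prop_table (s, p1, p2, p3)
    SELECT s,
           MAX(CASE WHEN p = 'p1' THEN o END),
           MAX(CASE WHEN p = 'p2' THEN o END),
           MAX(CASE WHEN p = 'p3' THEN o END)
    FROM triples
    WHERE p IN ('p1', 'p2', 'p3')
    GROUP BY s
""")

# Whatever was moved no longer needs to live in the main table.
conn.execute("DELETE FROM triples WHERE p IN ('p1', 'p2', 'p3')")

prop_rows = conn.execute("SELECT * FROM prop_table ORDER BY s").fetchall()
leftover = conn.execute("SELECT * FROM triples").fetchall()
print(prop_rows)
print(leftover)
```

Note the assumption baked into the pivot: if a subject ever had two values for p1, the MAX would silently keep only one of them.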

This made sense to me. There are definitely patterns in the data. We identify articles by their issn/volume/issue/page data. They always(?) have these properties. How sensible it would seem to put them in a nice normal table, so we can at least occasionally have SQL like:
"select * from articles where articles.vol=...."
- rather than scary mega-joins, to throw at our poor old database. Sounds totally common sense!





Consider this: our 'articles' table would look like this:

id | title | doi | volume | .....
1 | "bla" | 123.456..| 42 | .......
2 | "foo" | 789.101..| 45 | .......

Er.... does this look rather familiar? Isn't this exactly what we started with?

Hang on - what was wrong with what we started with again?

Well, for us, the problem was that we found that we had to keep re-modelling the database - initial assumptions were correct for the first hundred thousand articles... and then suddenly weren't. For example, we recently discovered that sometimes articles have two DOIs. Horrible, but true. The great thing about the triplestore is that we don't have to bake assumptions about the data into the database - we can have as many whatevers as we like.
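The two-DOIs surprise is easy to reproduce in miniature. In this sketch (Python's sqlite3 again), the `articles` table with a primary key on `id` is my guess at what a "nice normal table" would look like, and the DOIs are made up. The triples table happily takes a second DOI; the articles table has no place to put it without a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
# Hypothetical relational layout: one row per article, one doi column.
conn.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT, doi TEXT)")

# Triplestore: a second DOI is just one more row.
conn.execute("INSERT INTO triples VALUES ('article1', 'doi', '10.1000/a')")
conn.execute("INSERT INTO triples VALUES ('article1', 'doi', '10.1000/b')")

# Relational table: the "one DOI per article" assumption is baked in.
conn.execute("INSERT INTO articles VALUES ('article1', 'bla', '10.1000/a')")
try:
    conn.execute("INSERT INTO articles VALUES ('article1', 'bla', '10.1000/b')")
    second_doi_fits = True
except sqlite3.IntegrityError:
    # A second row for the same article violates the primary key.
    second_doi_fits = False

triple_dois = [
    row[0]
    for row in conn.execute(
        "SELECT o FROM triples WHERE s = 'article1' AND p = 'doi' ORDER BY o"
    )
]
print(triple_dois, second_doi_fits)
```

Fixing the relational side means re-modelling: a separate DOI table, say, plus a migration of the existing rows. That is exactly the kind of thing we were trying to get away from.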

I'm still very keen to try out Kevin's architecture - I hope to nag him into solving this undecided-cardinality problem within his app by changing the schema dynamically and shifting the data around. I just hope that could scale. Much looking forward to the first release and finding out more!

posted by Katie Portwin at 1:44 pm

