performance, triplestores, and going round in circles..
Wednesday, May 24, 2006
Can industrial scale triplestores be made to perform?
Is breaking the "triple table" model the answer?
Scaling SPARQL performance is something I worry about. Mike Bergman worries about it too.
In our XTech paper, we showed that even a simple, bread-and-butter sample query takes 1.5 seconds on our 200-million-triple store. (Scroll down for the colourful graph.) In our Jena paper, we showed that with a triplestore in a relational database, you're at the mercy of your query planner - some configurations of SPARQL query perform really badly.
Both these facts make sense when you remember that your SPARQL doesn't just get done by magic - of course, it gets turned into a ma-hoosive SQL JOIN statement, hitting the main triples table again and again, like this:
Select A0.Subj, A2.Subj, A3.Obj, A4.Obj, A5.Subj, A6.Obj From jena_g1t1_stmt A0, jena_g1t1_stmt A1,
jena_g1t1_stmt A2, jena_g1t1_stmt A3, jena_g1t1_stmt A4, jena_g1t1_stmt A5, jena_g1t1_stmt A6 Where
A0.Prop=...
Databases weren't really designed to do queries like this - it's not surprising that they aren't very fast at it...
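To make the self-join cost concrete, here's a minimal sketch in Python with sqlite3. The table and column names (stmt, subj/prop/obj) are illustrative stand-ins for the Jena statement table, not its real schema:

```python
import sqlite3

# A toy triples table in the spirit of Jena's statement table.
# (Table and column names are illustrative, not Jena's real schema.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stmt (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO stmt VALUES (?, ?, ?)",
    [
        ("article1", "title",  "bla"),
        ("article1", "doi",    "123.456"),
        ("article1", "volume", "42"),
        ("article2", "title",  "foo"),
        ("article2", "doi",    "789.101"),
        ("article2", "volume", "45"),
    ],
)

# A three-property SPARQL pattern turns into three self-joins on the
# same table: one alias (like Jena's A0, A1, A2...) per triple pattern.
rows = conn.execute(
    """
    SELECT a0.subj, a0.obj AS title, a1.obj AS doi, a2.obj AS volume
    FROM stmt a0, stmt a1, stmt a2
    WHERE a0.prop = 'title'
      AND a1.prop = 'doi'
      AND a2.prop = 'volume'
      AND a0.subj = a1.subj
      AND a1.subj = a2.subj
    ORDER BY a0.subj
    """
).fetchall()
print(rows)
# [('article1', 'bla', '123.456', '42'), ('article2', 'foo', '789.101', '45')]
```

Every extra triple pattern in the query means another alias and another self-join - exactly the sort of plan the query optimizer can get badly wrong.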
So I was really interested by Kevin Wilkinson's paper on property tables - also at the JUC last week. Kevin's idea is basically this:
Say, in your triplestore, you have this:
triples table
------------------
| s | p   | o    |
------------------
| X | p1  | o1   |
| X | p2  | o2   |
| X | p3  | o3   |
| Y | p1y | o1y  |
(This means that X has three properties attached to it.)
Kevin points out that you often get "groups" - patterns in the data. Often, every X has exactly one p1, one p2 and one p3.
For example, consider X to be an article: articles always have one title, one doi, one volume, one issue, etc.
In this case, you could have a "property" table like this:
------------------------
| s | p1  | p2  | p3   |
------------------------
| X | o1  | o2  | o3   |
| Y | o1y | o2y | o3y  |
You could still keep other, unsuitable-for-grouping stuff in the main triples table, but drag groups out into 'property tables'. This would make them fast to query. (There's a lot more to it than this - e.g. how properties with cardinality >1 go in their own table too - but you'll have to read the paper!)
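The same lookup against a property table collapses into a single-table query, with no self-joins at all. Another minimal sqlite3 sketch, with an illustrative article_props table standing in for one of Kevin's property tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per subject, one column per grouped property.
# (Table and column names are illustrative, not from the paper.)
conn.execute(
    "CREATE TABLE article_props "
    "(s TEXT PRIMARY KEY, title TEXT, doi TEXT, volume TEXT)"
)
conn.executemany(
    "INSERT INTO article_props VALUES (?, ?, ?, ?)",
    [
        ("article1", "bla", "123.456", "42"),
        ("article2", "foo", "789.101", "45"),
    ],
)

# Fetching three properties of an article is now one scan of one table.
rows = conn.execute(
    "SELECT s, title, doi, volume FROM article_props WHERE volume = '42'"
).fetchall()
print(rows)  # [('article1', 'bla', '123.456', '42')]
```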
This made sense to me. There are definitely patterns in the data. We identify articles by their issn/volume/issue/page data. They always(?) have these properties. How sensible it would seem to put them in a nice, normal table, so we can at least occasionally have SQL like:
"select * from articles where articles.vol=...."
- rather than scary mega-joins - to throw at our poor old database. Sounds like total common sense!
BUT
....
....
Hang on....
Consider this: our 'articles' table would look like this:
Articles
-------------------------------------------
| id | title | doi        | volume | .... |
-------------------------------------------
| 1  | "bla" | 123.456..  | 42     | .... |
| 2  | "foo" | 789.101..  | 45     | .... |
Er.... does this look rather familiar? Isn't this exactly what we started with?
Hang on - what was wrong with what we started with again?
Well, for us, the problem was that we found we had to keep re-modelling the database - initial assumptions were correct for the first hundred thousand articles... and then suddenly weren't. For example, we recently discovered that sometimes articles have two DOIs. Horrible, but true. The great thing about the triplestore is that we don't have to bake assumptions about the data into the database - we can have as many whatevers as we like.
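That two-DOI article needs no schema change at all in a triple table - it's just one extra row. A quick sketch with hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stmt (subj TEXT, prop TEXT, obj TEXT)")
# The awkward article with two DOIs is simply two rows:
# no ALTER TABLE, no re-modelling, no NULL-padded extra column.
conn.executemany(
    "INSERT INTO stmt VALUES (?, ?, ?)",
    [
        ("article1", "doi", "123.456"),
        ("article1", "doi", "123.457"),  # the second DOI: horrible, but storable
        ("article2", "doi", "789.101"),
    ],
)
dois = [r[0] for r in conn.execute(
    "SELECT obj FROM stmt WHERE subj = 'article1' AND prop = 'doi' ORDER BY obj"
)]
print(dois)  # ['123.456', '123.457']
```

A property table with a single doi column, by contrast, would have to be re-shaped the moment the second DOI turned up.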
I'm still very keen to try out Kevin's architecture - I hope to nag him into solving this undecided-cardinality problem within his app, by changing the schema dynamically and shuffling the data around. I just hope that can scale. Very much looking forward to the first release and to finding out more!
posted by Katie Portwin at 1:44 pm