2005-10-21

More Wilbur2 progress

I have made good progress with the Wilbur2 manual. It is very useful to write a document that - supposedly - would help others understand your thinking (about design, that is). I have discovered things that I now want to change. Here are some of them:

  1. I no longer see any reason to have two separate packages for the Wilbur code, so I will merge the two; for compatibility reasons "NOX" will be made a nickname of the "WILBUR" package.

  2. It seems that proper handling of information about where triples came from cannot be postponed any longer (I was inspired by this thread on the Semantic Web Interest Group mailing list). The current Wilbur design does not allow "duplicate" triples, and records at most one source with every triple. What if multiple documents assert the same triple - we still would like only one in the database, right? Consequently, the triple class will be changed to allow multiple sources, and when a source is deleted (as happens when you reload a source, for example), only those triples that came solely from that source are deleted.

  3. Regarding #2, we still need to offer the option of deleting a triple from all sources when an individual triple is being deleted.

  4. I wonder whether there are applications where one really does not need to record where triples came from.

  5. Many function signatures in the "Data Source Loading Protocol" will change, as I have rethought the design; no worries, though, because db-load will still stay the same.
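To make items 2 and 3 concrete, here is a minimal sketch of the idea in Python (not Wilbur's actual Lisp code). The class and method names (`TripleStore`, `delete_source`, `delete_triple`) are made up for illustration; the only thing taken from the design above is that each triple is stored once and carries a set of sources.

```python
class TripleStore:
    """Toy store: each distinct triple is kept once, with a set of sources.

    Hypothetical illustration only; not the Wilbur API.
    """

    def __init__(self):
        # Maps a (subject, predicate, object) tuple to the set of sources
        # that assert it.
        self._sources = {}

    def add(self, triple, source):
        # Adding an existing triple only records one more source for it.
        self._sources.setdefault(triple, set()).add(source)

    def sources_of(self, triple):
        return set(self._sources.get(triple, ()))

    def delete_source(self, source):
        """Forget `source`; drop only triples that no other source asserts."""
        for triple in list(self._sources):
            remaining = self._sources[triple]
            remaining.discard(source)
            if not remaining:
                del self._sources[triple]

    def delete_triple(self, triple, source=None):
        """Delete from one source, or from all sources if `source` is None."""
        if triple not in self._sources:
            return
        if source is None:
            del self._sources[triple]
            return
        self._sources[triple].discard(source)
        if not self._sources[triple]:
            del self._sources[triple]

    def __contains__(self, triple):
        return triple in self._sources

    def __len__(self):
        return len(self._sources)
```

With this shape, reloading a document amounts to `delete_source` followed by fresh `add` calls, and a triple that a second document also asserts survives the reload.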

Any comments regarding any aspects of Wilbur design are always welcome.

Posted by ora at 10:06 | Comments (3)

2005-10-07

Erdös Numbers

Another take on the Small World Problem ("six degrees of separation" as some people like to say) is the concept of Erdös Numbers. Hungarian mathematician Paul Erdös was very prolific and published a huge number of papers; he also had a large number of co-authors. The Erdös number is defined as follows: Erdös himself has the Erdös number 0, his co-authors have the Erdös number 1, their co-authors have 2, etc. In other words, in a social network where co-authorship is a link, how many hops away are you from Erdös?

Since mathematics spills over to computer science, many computer scientists also have low Erdös numbers. Determining your Erdös number is not all that simple, however. If only we had special FoaF-style data about publishing and co-authorships, searching could be done automatically.

So far, I believe my own Erdös number is at most 6, given, for example, the following path: me, James Hendler, Lynn Stein, David Karger, Robert Tarjan, Stephen Hedetniemi, Paul Erdös. The real problem, I find, is that once you start pondering path lengths, you cannot stop trying to find shorter ones. :-)
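If co-authorship data were available as a graph, the search would be a plain breadth-first search. The sketch below is Python and assumes a hypothetical `coauthors` mapping from a person to the set of their co-authors; the only data in it is the single path listed above, so it can confirm that path's length but cannot find a shorter one.

```python
from collections import deque


def build_graph(pairs):
    """Build a symmetric co-authorship mapping from (person, person) pairs."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hop_count(coauthors, start, target):
    """Breadth-first search: fewest co-authorship hops, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if person == target:
            return hops
        for other in coauthors.get(person, ()):
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return None


# The one path from the post; a real data set would have many more links.
path = ["me", "James Hendler", "Lynn Stein", "David Karger",
        "Robert Tarjan", "Stephen Hedetniemi", "Paul Erdös"]
coauthors = build_graph(zip(path, path[1:]))

print(hop_count(coauthors, "me", "Paul Erdös"))  # prints 6
```

Because breadth-first search visits people in order of distance, the first time it reaches Erdös it has found the shortest path present in the data, which is why the number is only ever an upper bound until the data is complete.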

Posted by ora at 11:48 | Comments (3)

2005-10-01

Good Progress with Wilbur2

Lately I have been making nice progress with Wilbur2, and I am confident that I will get past the "pre-release" phase soon. Several people have provided bug fixes (for which I am grateful). The bug-fix-provider-of-the-month award goes to Richard Newman.

I am fine-tuning the new Wilbur2 API by structuring it into various "protocols" (expressed as collections of DEFGENERICs). So now I have things like the "Data Management Protocol", "Data Source Loading Protocol", "Parsing Protocol", etc. Documentation is progressing, too.

I have also done some performance measurements. On an 867 MHz PowerBook G4 running OpenMCL, I can populate the triple store at a rate of approximately 800 μs per triple (loading a file with RDF/XML). I am using an indexed main memory database with literal interning. Performance is not terrible, considering that I could still do all kinds of code optimizations (none so far) and even switch to a compiler that produces faster executable code (say, SBCL). I did the tests with data sets of approx. 200,000-300,000 triples. I will post accurate numbers later, with comparisons to other toolkits/libraries. Eventually, I would expect to beat at least the Java-based implementations.
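For a sense of scale, the arithmetic behind those figures is easy to spell out. The short Python snippet below only restates the numbers quoted above (roughly 800 μs per triple, 200,000-300,000 triples) and assumes the per-triple cost stays constant across the whole load, which is a simplification.

```python
MICROSECONDS_PER_TRIPLE = 800  # approximate figure quoted above


def load_seconds(triple_count):
    """Estimated load time, assuming a constant per-triple cost."""
    return triple_count * MICROSECONDS_PER_TRIPLE / 1_000_000


triples_per_second = 1_000_000 / MICROSECONDS_PER_TRIPLE
print(triples_per_second)        # prints 1250.0

for count in (200_000, 300_000):
    print(count, load_seconds(count))  # prints 160.0 and 240.0 seconds
```

So the test data sets take somewhere between about two and a half and four minutes to load at the current rate.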

After some improvements to the Wilbur query engine, I was also able to query at speeds that are quite adequate (a few seconds to produce result sets of 50,000-100,000 nodes using moderately simple and short path patterns). I am particularly interested in query performance.

Posted by ora at 14:21 | Comments (3)