Latest news and information on the D'ni Sphider search engine.
Now that D'ni Sphider has accumulated over 9 million keyword-link relations, and a bit of usage, here's a top 20 list of search queries, just for a bit of fun:
|Query|Count|Average results|Last queried|
|---|---|---|---|
|"It's begun whether we like it or not"|3|0.0|2013-08-11 16:28:38|
|mysterium 2013|2|33.0|2013-08-20 03:17:34|
|"revelation editor"|2|2.0|2013-08-15 01:47:37|
|"shed some light on the myst"|2|0.0|2013-08-11 17:23:48|
|open cave|2|62.0|2013-08-15 17:18:33|
|"mentioned the Myst music as original"|2|0.0|2013-08-16 19:21:19|
I'm planning to take D'ni Sphider down for some maintenance on Wednesday, August 21. When I originally set up the database for D'ni Sphider I forgot to check that it'd handle non-Latin characters properly, so there's a mix of Latin-1 and UTF-8 between the code and the database now (d'oh!). To fix that, I need to convert the database tables (and all their contents) from Latin-1 to UTF-8. Since the database is now around 1.4GB that could take a bit of time.
The outage is likely to be 7-9 PM BST (2-4 PM EDT, 8-10 PM CEST), but I'll post here when the work is done.
The database has been converted and D'ni Sphider is back online. There are still some UTF-8 oddities hanging around, but they're in the code, where they're easier to pick off as I find them than the database issue was.
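For anyone wondering what a Latin-1/UTF-8 mix actually does to text, here's a minimal Python sketch. It only illustrates the symptom and one common way of undoing it on a single string; it is not the conversion that was run on the D'ni Sphider database, and the function names are made up for this example.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate the bug: UTF-8 bytes interpreted as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up, leaving other text untouched."""
    try:
        # Reverse the wrong decode, then decode the bytes as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8: in both cases assume it was never damaged.
        return text


original = "café"
damaged = misread_as_latin1(original)   # each accented letter becomes two characters
restored = repair_mojibake(damaged)

print(repr(damaged))
print(repr(restored))
```

The `try`/`except` matters because a real table would hold a mix of damaged and undamaged rows, and blindly re-decoding good text would break it instead.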
I've added some links to the top of the search page (that's probably how you
found this page, so it may be stating the obvious). What's maybe more important,
though, is that one of the links is for "feedback". That can take many
forms: you might want to ask us to add a site (or indeed remove one), report
links that D'ni Sphider is returning that you don't think ought to be there, or
just make a general comment, observation or suggestion. It saves trying to PM
me through the forums!
[Go to Feedback page]
D'ni Sphider can now index (some) PDFs. The PDF extraction is incomplete, but it should deal with the most common cases. It won't, obviously, extract data from any PDF that has security settings enabled to prevent copying of text; it doesn't seem to handle some older versions of PDF (I'm not sure why yet); it can't locate URLs embedded within a PDF; and text added as captions to images seems to get dropped. But it's better than just letting all those PDFs out there go unindexed.
I'm sure that Tai'lahr (chief indexer and acting unpaid bot driver) has several sites identified as hosting PDFs that will now need to be reindexed.
I cleaned up the database this morning and removed just shy of 10,000 keywords that weren't associated with any valid link in the database. Those will have come from sites that were deleted or edited because the searches were going out of scope and catching things that weren't Uru/Myst/Cyan related. Even with that, I've still got almost 230,000 unique keywords in the database, which in itself seems quite a remarkable statistic. I also got rid of nearly a quarter of a million entries that had got stuck in the temporary tables of the database (probably as a result of timeouts or errors during indexing).
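As a rough illustration of that kind of cleanup, here's a small Python sketch using an in-memory SQLite database. The table and column names (`links`, `keywords`, `link_keyword`) are assumptions chosen for the example rather than the real D'ni Sphider schema, and the actual cleanup may well have been done differently.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny, hypothetical keyword/link schema with one orphan."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE links (link_id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE keywords (keyword_id INTEGER PRIMARY KEY, keyword TEXT);
        CREATE TABLE link_keyword (link_id INTEGER, keyword_id INTEGER);
        INSERT INTO links VALUES (1, 'http://example.org/a');
        INSERT INTO keywords VALUES (1, 'myst'), (2, 'uru'), (3, 'orphan');
        -- link 99 no longer exists, so keyword 3 has no valid link
        INSERT INTO link_keyword VALUES (1, 1), (1, 2), (99, 3);
    """)
    return conn


def remove_orphan_keywords(conn: sqlite3.Connection) -> int:
    """Delete keywords with no valid link; return how many were removed."""
    # First drop relations that point at links which are gone.
    conn.execute(
        "DELETE FROM link_keyword "
        "WHERE link_id NOT IN (SELECT link_id FROM links)"
    )
    # Then drop keywords that no remaining relation refers to.
    cur = conn.execute(
        "DELETE FROM keywords "
        "WHERE keyword_id NOT IN (SELECT keyword_id FROM link_keyword)"
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    db = build_demo_db()
    print("removed:", remove_orphan_keywords(db))
```

Doing the relation table first is the important design choice here: otherwise a keyword still "attached" to a deleted link would look valid and survive the cleanup.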
The metrics as of this moment for D'ni Sphider are: