IKS search engine proposal

Creating a prototype search engine that understands microformats and RDFa might be very useful in the context of IKS:

  • As a benchmark tool, by measuring how well CMSes allow the search engine to extract meaning and semantic information from the published content.
  • To support the creation and communication of best practices about microformats and RDFa.
  • As a publicly visible example of what IKS is doing.

I'm envisioning a search engine that:

  • Is publicly available at http://search.iks-project.eu
  • Crawls a list of websites that are expected to provide metadata using microformats and RDFa.
  • Allows new websites to be added to that list, subject to validation by search engine administrators.
  • Extracts and indexes semantic information encoded in the site's web pages.
  • Provides a number of example queries that demonstrate the benefits of this embedded semantic information.
  • Allows for additional experiments using the "semantic index" built by crawling the configured websites.

CMS vendors and website owners can then enhance specific websites as part of pilot projects or customer projects, and use the search engine to validate and demonstrate their use of microformats and RDFa.

Comments

Semantic Search Engine architecture suggestion

Below is the architecture that DERI would like to suggest for the IKS Semantic Search Engine. The figure [1] shows a set of CMS sites that comply with the best practices of RDF data publishing: RDFa, a local schema export (site vocabulary), and a SPARQL endpoint. We have worked on a set of modules for Drupal, detailed in a technical report [2], whose features could be generalized to other CMSs. Sites can request inclusion in the IKS search engine through a form on the IKS search engine site or programmatically via a ping. A ping is also sent when a specific resource or page has been updated on a given site, so that the search engine can schedule a recrawl of that resource as soon as possible.
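To make the inclusion and ping workflow concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the class `RecrawlQueue`, its method names and the two priority levels are hypothetical and not part of any DERI or IKS code. It shows pings from approved sites being crawled ahead of routine crawls, while pings from sites that have not been validated are ignored.

```python
import heapq
import itertools
from urllib.parse import urlparse


class RecrawlQueue:
    """Hypothetical crawl queue: pinged URLs are served before routine ones."""

    PING = 0     # assumed priority for resources announced by a ping
    ROUTINE = 1  # assumed priority for scheduled background crawls

    def __init__(self):
        self.pending_sites = set()   # hosts waiting for administrator validation
        self.approved_sites = set()  # hosts accepted into the search engine
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority
        self._best = {}                    # url -> best priority currently queued

    def request_inclusion(self, site_url):
        """Record an inclusion request (from the form or from a first ping)."""
        host = urlparse(site_url).netloc
        if host not in self.approved_sites:
            self.pending_sites.add(host)

    def approve(self, site_url):
        """Administrator accepts the site after validation."""
        host = urlparse(site_url).netloc
        self.pending_sites.discard(host)
        self.approved_sites.add(host)

    def _push(self, url, priority):
        best = self._best.get(url)
        if best is not None and best <= priority:
            return  # already queued at the same or a higher priority
        self._best[url] = priority
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def schedule(self, url):
        """Queue a routine crawl of a URL."""
        self._push(url, self.ROUTINE)

    def ping(self, url):
        """Handle an update ping; returns False if the site is not approved."""
        if urlparse(url).netloc not in self.approved_sites:
            return False
        self._push(url, self.PING)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        while self._heap:
            priority, _, url = heapq.heappop(self._heap)
            if self._best.get(url) == priority:  # skip entries superseded by a ping
                del self._best[url]
                return url
        return None
```

A URL that is already queued for a routine crawl is promoted when a ping arrives for it, and the stale routine entry is skipped later, so each resource is fetched once per update.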

The semantic search engine stack is composed of several layers for parsing, validation and indexing. The search engine first gathers the data by crawling the sites. It then parses the RDF data with the any23 parser [3], a Java library that extracts structured data in RDF format from a variety of Web documents (it supports microformats, RDFa and other common RDF serialization formats). If needed, the NxParser [4] cleans up the data and formats it as N-Quads [5]. Before a site can be included in the IKS search engine, it first goes through the RDFAlerts validator, which checks that the RDF data published by the site complies with the RDF publishing best practices. RDFAlerts also performs some RDF consistency checking. Other IKS-specific policies regarding the sites included in the search engine could be added at this stage. Finally, the SWSE engine [6] takes care of indexing and storing the data. Powered by YARS2, it provides distributed storage and retrieval facilities. Its indexing structures are optimized for retrieving RDF statements together with their context (quads) while minimizing the need for joins, and Lucene full-text indexing supports efficient keyword searches. SWSE's SPARQL endpoint allows any RDF visualization tool to be plugged in, for example VisiNav [7]. See the screencast at [8] (1'36) for the possibilities offered by VisiNav.
 
[1] http://srvgal65.deri.ie/files/iks_search_engine_cloud.pdf
[2] http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-30.pdf
[3] http://code.google.com/p/any23/
[4] http://sw.deri.org/2006/08/nxparser/
[5] http://sw.deri.org/2008/07/n-quads/
[6] http://www.swse.org/
[7] http://visinav.deri.org/
[8] http://www.youtube.com/watch?v=r4WgTRIRoa0
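
To illustrate what "statements including context (quads)" means in the stack described above, here is a minimal Python sketch that reads simple N-Quads lines and keeps them in an in-memory index keyed by subject and by context. It is a simplified illustration, not the NxParser or the YARS2 index structures: the names `parse_quad` and `QuadIndex` are hypothetical, and the regular expression only covers IRIs, blank nodes and plain, language-tagged or datatyped literals.

```python
import re
from collections import defaultdict

# One RDF term: <IRI>, _:blank, or "literal" with optional @lang or ^^<datatype>.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, context) or None if the line is not a quad."""
    match = QUAD.match(line)
    return match.groups() if match else None


class QuadIndex:
    """Hypothetical in-memory index; each quad is stored under two lookup keys."""

    def __init__(self):
        self.by_subject = defaultdict(list)
        self.by_context = defaultdict(list)

    def add(self, quad):
        subject, _predicate, _obj, context = quad
        self.by_subject[subject].append(quad)
        self.by_context[context].append(quad)

    def load(self, text):
        """Index every parsable line; return the number of lines that were skipped."""
        skipped = 0
        for line in text.splitlines():
            if not line.strip() or line.lstrip().startswith('#'):
                continue  # ignore blank lines and comments
            quad = parse_quad(line)
            if quad is None:
                skipped += 1
            else:
                self.add(quad)
        return skipped
```

Because every quad carries its context (here assumed to be the URL of the page it was extracted from), all statements from one page can be retrieved, or dropped and re-indexed after a recrawl, with a single lookup in `by_context`.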

Ping the semantic web

The search engine could get the list of sites to crawl from Ping the Semantic Web.

Extend it to Semantic CMIS Federated Search

For content applications that do not support RDF output but do offer CMIS support, we could try to extend the AIIM CMIS Federated Search prototype with semantic extraction in mind (semantic CMIS):

http://wordofpie.com/2009/05/15/the-source-code-from-the-aiim-iecm-cmis-...