[ic] How would you search, store, and display documents

Paul Jordan
Fri Aug 12 15:25:18 UTC 2011

Racke mentioned on Friday, August 12, 2011...
Subject: Re: [ic] How would you search, store, and display documents
On 08/12/2011 10:23 AM, Paul Jordan wrote:
> >
I'm tasked with building a pretty complex training/educational system for
one of my clients. This would bring their paper manual and assets, html
newsletters, FAQ, videos, how-to's, etc. all into one intuitive
"knowledgebase" if you will.
> >
I know how I want it to work and look, but what I don't know ATM is
which format is best to use. My main concern is the searchability
and storage of the main body of each article. This text will
arbitrarily contain html for formatting, images, divs for quotes, or
tables for data, and the like (everything will be styled with css of
course).
> >
It seems to me there are several paths...
> >
#1 Store the page text with any html needed for the article in a table and,
assuming html doesn't play well with fulltext searches, work around that by
saving a text-only copy into a second field used for searching only.
Fulltext search engines like Lucene are able to parse HTML and adjust the
weight according to the position of the words (HTML title, etc).
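
The two-field workaround from #1 could be sketched roughly as below. This is
only an illustration in Python using an in-memory SQLite table; the table
layout and the names make_db, html_to_text, save_article and search are all
made up for the example, and a plain LIKE match stands in for a real fulltext
engine.

```python
import sqlite3
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the character data of a document, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def make_db():
    """Create a hypothetical articles table with separate display and search fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles ("
        "id INTEGER PRIMARY KEY, title TEXT, body_html TEXT, body_text TEXT)"
    )
    return conn


def save_article(conn, title, body_html):
    """Store the html for display and a stripped copy for searching."""
    conn.execute(
        "INSERT INTO articles (title, body_html, body_text) VALUES (?, ?, ?)",
        (title, body_html, html_to_text(body_html)),
    )


def search(conn, term):
    """Naive substring search against the text-only field."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE body_text LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

Because only body_text is searched, a query for "div" or "quote" would not
match an article merely because those words appear in its markup. One caveat
of this sketch: adjacent block elements with no whitespace between them end up
fused in the stripped text, so a real version would want to add a space at
block boundaries.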
> >
#2 Delve into xml/xsl.
> >
#3 Create some sort of wiki parser to use in conjunction with IC. I
would have liked to use Kevin's system, and improved upon that, but that
doesn't seem likely.
I hacked on Wiki stuff based on Wiki::Toolkit, it is inside the WellWell
repository. If you already have Wiki-formatted text, you can use this for
HTML formatting and display:
http://git.icdevgroup.org/?p=wellwell.git;a=blob;f=lib/Vend/Wiki.pm;h=8cb8499ef669ed60d26867db04a41cfba3a641e8;hb=HEAD
> >
#4 Have a parser like Kevin's made, extend it to handle images, and
a simple online "editor" for it.
> >
An editor for Wiki text shouldn't be hard to come up with.
Regards
Racke

Thank you Racke

In my research of Lucene I ran across this post by someone contemplating
exactly my issue.


In there he proposes a pretty nifty idea. Here it is in summary:

One solution I am kicking around is trying to write / find some sort of
text style markup language that is stored separate from the text data
(This has to exist somewhere, probably an old school Unix format, 
but I am not even sure where to start looking). I am thinking it could 
work something like this:

The stylesheet, in its most basic form, would be a type and 
position-length pair. So for the text:

This <b>is</b> <i>example</i> text, <b>man</b>.

A parser would sniff out the tags, and make a stylesheet that could look like:

(sheet (bold (5,2), (22,3)), (italic (8,7)) )
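
A rough sketch of such a parser, in Python with the standard html.parser
module, might look like the following. The names StyleSheetBuilder and
split_markup are hypothetical, and the sheet is keyed by the tag names as
written ("b", "i") rather than by "bold"/"italic". Positions are zero-based
offsets into the tag-free text, which is how the numbers in the example above
work out.

```python
from html.parser import HTMLParser


class StyleSheetBuilder(HTMLParser):
    """Separates markup from text, recording a (position, length) pair per tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []  # pieces of the tag-free text
        self.length = 0       # length of the text collected so far
        self.open_tags = []   # stack of (tag, start offset) still awaiting a close
        self.sheet = {}       # tag name -> list of (position, length)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the most recently opened tag of the same name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, start = self.open_tags.pop(i)
                self.sheet.setdefault(tag, []).append((start, self.length - start))
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def split_markup(html):
    """Return (plain_text, sheet) for an html fragment."""
    builder = StyleSheetBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.text_parts), builder.sheet


text, sheet = split_markup("This <b>is</b> <i>example</i> text, <b>man</b>.")
print(text)   # This is example text, man.
print(sheet)  # {'b': [(5, 2), (22, 3)], 'i': [(8, 7)]}
```

The sketch ignores attributes and never closes void elements such as img or
br, so images and tables would need more work than this shows.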

I read through the comments and the only valid issue someone raised
was about editing and resyncing the positions afterwards. However, my simple
solution to that is to delete and resubmit all of this "positional
logistics" each time, thereby not needing to "adjust positions".

Not that I can build this kind of thing myself, but I think it would not be
that complicated. In fact, instead of supporting code standards, why not
just store the tag verbatim, so in this person's example it would be more


Could this not be stored in a single field, then applied via a regex on
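
To make the "store the tag verbatim" idea concrete, here is one possible
sketch in Python. The storage layout is purely an assumption for
illustration: a list of (offset, tag) pairs in document order, which could be
serialized into a single field. It uses plain string slicing rather than a
regex, and apply_tags is a made-up name.

```python
import html


def apply_tags(text, tags):
    """Re-insert verbatim tags into plain text.

    `tags` is a hypothetical list of (offset, tag) pairs in document order,
    where each offset is a zero-based position in the tag-free text.
    """
    out = []
    last = 0
    for offset, tag in tags:
        # Escape the text between tags so stray < or & cannot break the page.
        out.append(html.escape(text[last:offset], quote=False))
        out.append(tag)
        last = offset
    out.append(html.escape(text[last:], quote=False))
    return "".join(out)


plain = "This is example text, man."
stored = [
    (5, "<b>"), (7, "</b>"),
    (8, "<i>"), (15, "</i>"),
    (22, "<b>"), (25, "</b>"),
]
print(apply_tags(plain, stored))
# This <b>is</b> <i>example</i> text, <b>man</b>.
```

Since the tags are kept exactly as written, attributes such as class names
would survive the round trip, and the plain text stays clean for searching.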

My target dataset would be something like the body of a blog post. Anything
interactive would be built by IC on the page itself as the environment.


