[ic] How would you search, store, and display documents
paul at gishnetwork.com
Fri Aug 12 15:25:18 UTC 2011
> Racke mentioned on Friday, August 12, 2011...
> To: interchange-users at icdevgroup.org
> Subject: Re: [ic] How would you search, store, and display documents
> On 08/12/2011 10:23 AM, Paul Jordan wrote:
> > I'm tasked with building a pretty complex training/educational system
> > for one of my clients. This would bring their paper manual and assets, html
> > newsletters, FAQ, videos, how-to's, etc. all into one intuitive
> > "knowledgebase" if you will.
> > I know how I want it to work and look, but what I don't know ATM is
> > what is the best format to use. My main concern is the searchability
> > and storage of the main body of each article. This text will
> > arbitrarily contain html for formatting, images, div's for quotes, or
> > tables for data, and the like (everything will be styled with css of
> > course)
> > It seems to me there are several paths...
> > #1 Store the text page with any html needed for the article in a table and,
> > assuming html doesn't play well with fulltext searches, work around that by
> > saving text-only into a second field used for searching only.
> Fulltext search engines like Lucene are able to parse HTML and adjust the
> weight according to the position of the words (HTML title, etc).
> > #2 Delve into xml/xsl.
> > #3 Create some sort of wiki parser to use in conjunction with IC. I
> > would have liked to use Kevin's system, and improve upon that, but that
> > doesn't seem likely.
> I hacked on Wiki stuff based on Wiki::Toolkit; it is inside the WellWell
> repository. If you already have Wiki-formatted text, you can use this for
> HTML formatting and display:
> > #4 Have a parser like Kevin's made, extend it to handle images, and
> > add a simple online "editor" for it.
> An editor for Wiki text shouldn't be hard to come up with.
Thank you, Racke.
In my research of Lucene I ran across this post by someone contemplating
exactly my issue.
In there he proposes a pretty nifty idea. Here it is in summary:
One solution I am kicking around is trying to write / find some sort of
text style markup language that is stored separate from the text data
(This has to exist somewhere, probably an old school Unix format,
but I am not even sure where to start looking). I am thinking it could
work something like this:
The stylesheet, in its most basic form, would be a type and
position-length pair. So for the text:
This <b>is</b> <i>example</i> text, <b>man</b>.
A parser would sniff out the tags, and make a stylesheet that could look
like this:
(sheet (bold (5,2), (22,3)), (italic (8,7)) )
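To make the quoted idea concrete, here is a minimal sketch in Python of such a parser, built on the standard html.parser module. The names StandoffParser, STYLE_NAMES and split_markup are made up for illustration, and it assumes only <b> and <i> matter; any other tag is silently dropped. Positions are counted from zero in the tag-free text, which matches the numbers in the example above.

```python
from html.parser import HTMLParser

# Hypothetical mapping from tag names to the style names used in the example.
STYLE_NAMES = {"b": "bold", "i": "italic"}


class StandoffParser(HTMLParser):
    """Collects plain text plus (position, length) pairs for each styled run."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []  # plain-text chunks, in order
        self.length = 0       # length of the plain text seen so far
        self.open_tags = []   # stack of (tag, start position)
        self.sheet = {}       # style name -> list of (position, length)

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_NAMES:
            self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the most recently opened tag with the same name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            open_tag, start = self.open_tags[i]
            if open_tag == tag:
                del self.open_tags[i]
                pair = (start, self.length - start)
                self.sheet.setdefault(STYLE_NAMES[tag], []).append(pair)
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def split_markup(html):
    """Return (plain_text, sheet) for a fragment of HTML."""
    parser = StandoffParser()
    parser.feed(html)
    parser.close()  # flushes any text after the last tag
    return "".join(parser.text_parts), parser.sheet


text, sheet = split_markup("This <b>is</b> <i>example</i> text, <b>man</b>.")
print(text)
print(sheet)
```

The plain text is what would go into the searchable field, and the sheet would be stored next to it.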
I read through the comments and the only valid issue someone had about it
was that the positions have to be resynced whenever the text is edited.
However, my simple solution to that is to delete and resubmit all of this
"positional logistics" each time, thereby not needing to "adjust positions".
Not that I can build this kind of thing myself, but I think it would not be
that complicated. In fact, instead of supporting code standards, why not
just store the tag verbatim, so in this person's example it would be more
Could this not be stored in a single field, then applied via a regex on
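As a rough sketch of the "store the tag verbatim" variant, the Python below keeps each tag together with its position in the plain text, packs them into one string that could sit in a single field, and uses a regex to put them back. The field format (a position number immediately followed by the tag, e.g. 5<b>7</b>) and the function names separate and apply_tags are my own assumptions, not anything from the post. It also assumes no tag contains a ">" inside an attribute value.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")            # any tag, kept verbatim
ENTRY_RE = re.compile(r"(\d+)(<[^>]*>)")   # one stored entry: position + tag


def separate(html):
    """Split HTML into (plain_text, field) where field holds position+tag entries."""
    text_parts, entries = [], []
    pos = 0     # read position in the HTML source
    offset = 0  # position in the tag-free text
    for match in TAG_RE.finditer(html):
        chunk = html[pos:match.start()]
        text_parts.append(chunk)
        offset += len(chunk)
        entries.append(f"{offset}{match.group(0)}")
        pos = match.end()
    text_parts.append(html[pos:])
    return "".join(text_parts), "".join(entries)


def apply_tags(text, field):
    """Re-insert the stored tags into the plain text at their positions."""
    out = []
    last = 0
    for match in ENTRY_RE.finditer(field):
        position = int(match.group(1))
        out.append(text[last:position])
        out.append(match.group(2))
        last = position
    out.append(text[last:])
    return "".join(out)


html = "This <b>is</b> <i>example</i> text, <b>man</b>."
text, field = separate(html)
print(text)
print(field)
print(apply_tags(text, field) == html)
```

On every save the article would simply be run through separate again and both stored values overwritten, which is the "delete and resubmit" approach, so no positions ever need adjusting.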
My target dataset would be something like the body of a blog post. Anything
interactive would be built by IC on the page itself as the environment.