[ic] weighted search result sorting

Kevin Walsh interchange-users@icdevgroup.org
Thu Jan 2 17:23:00 2003


Paul Jordan [paul@gishnetwork.com] wrote:
> 
> In recent testing of my search engine I realized that while, yes it DOES
> return a very good result set, it poorly sorts them. We have a content
> site, that for example sells images. So for example:
> 
> sku		keywords
> sku123	ocean, island, sky, trees, water
> sku124	sky, clouds, blue, day
> 
> Lets say I have thousands similar to this. The problem arises when
> someone searches for the term 'sky'. It will pull both results from
> above, but if I sort by sku it will show the pictures of the island with
> water and sky, or any number of picture with sky in it WILL appear
> BEFORE a simple brilliant SKY by itself.... which is not good, if left
> to sorting a field value.
> 
> I have been thinking of ways to "weight" the result set. I am not an
> expert on efficiency nor databases. I am using Mysql, but NOT an SQL
> query because I am doing full text searches.
> 
> A pseudo idea would be like:
> 
> sku		keywords
> sku123	ocean_7, island_9, sky_5, trees_4, water_5
> sku124	sky_10, clouds_10, blue_3, day_2
> 
> I have no idea if this is possible but in the above it is assumed that
> with substring matching turned on, 'sky' will still be a HIT for both,
> then maybe create some custom tf=? or method of sorting based on the
> numeric TOTALs of the corresponding _'n' with regards to the words
> matched by the users search spec.
> 
> So now with the above, a search for 'sky' will still return both, but
> the first one visible will be sku124 (because sky=10) and for the other
> (sky=5)
> 
> But if someone searched for 'sky ocean' then both would still be
> returned but sku123 will be first because (sky+ocean=12) and the other
> sku is (sky=10)
> 
> I still want to return both, because a Graphic artists can just take the
> sky from one and the ocean from another, so both are relevant. I know I
> know this is starting to sound terribly inefficient :) but the but any
> normal tf=?,?,? will simply not work well at all for us.
> 
I'd have a table along the lines of the following:

    sku     keyword         weight
    ------- --------------- --------
    sku123  ocean           7
    sku123  island          9
    sku123  sky             5
    sku123  trees           4
    sku123  water           5
    sku124  sky             10
    sku124  clouds          10
    sku124  blue            3
    sku124  day             2

PRIMARY INDEX: sku, keyword
NON-UNIQUE INDEX: keyword

The following (untested) SQL queries should then work as follows:

Query:

    SELECT  sku, SUM(weight) AS skuweight
    FROM    keywords
    WHERE   keyword IN ('sky')
    GROUP BY sku
    ORDER BY skuweight DESC

Results:

    sku     skuweight
    ------- ---------
    sku124  10
    sku123  5

Query:

    SELECT  sku, SUM(weight) AS skuweight
    FROM    keywords
    WHERE   keyword IN ('sky','ocean')
    GROUP BY sku
    ORDER BY skuweight DESC

Results:

    sku     skuweight
    ------- ---------
    sku123  12
    sku124  10

Or that's the theory, at least. :-)

You'd pass the keywords into a page which would split them into
individual lower-case words for use in the WHERE clause.  If you
wanted to allow substring matches, you could have multiple "LIKE"
lines instead of one large "IN" line:

User types "sk oce" and the following is generated:

    SELECT  sku, SUM(weight) AS skuweight
    FROM    keywords
    WHERE   keyword LIKE '%sk%'
    OR      keyword LIKE '%oce%'
    GROUP BY sku
    ORDER BY skuweight DESC

Results:

    sku     skuweight
    ------- ---------
    sku123  12
    sku124  10

Once you have your SQL query, you'd execute it with the [query] tag
and display the results in any way you see fit.

-- 
   _/   _/  _/_/_/_/  _/    _/  _/_/_/  _/    _/
  _/_/_/   _/_/      _/    _/    _/    _/_/  _/   K e v i n   W a l s h
 _/ _/    _/          _/ _/     _/    _/  _/_/    kevin@cursor.biz
_/   _/  _/_/_/_/      _/    _/_/_/  _/    _/