Site Search Benchmarking via Crowdsourcing
Objective:
Our objective is to use a crowdsourcing platform to find the relevancy of the SRP (search results page) and benchmark it against competitors. Apart from competitors, we could also gauge search result relevancy against production and planned new releases.
Since site search contributes 60%+ of revenues, it is essential to gauge search relevance and devise a systematic approach for the same. The crowdsourcing platform could provide us optimal results in a very short span of time.
Query Set Preparation:
A comprehensive query set (2,500 queries) will be prepared, representative of the overall search queries on our website (this set will account for a significant percentage of our key metrics – visits and revenue). It will consist of:
- The top 450-500 search queries by visits, since they represent 20-25% of the total visit count and are therefore the most important
- From the next 27k queries (which represent 40% of the search visits), random samples drawn from all deciles/quartiles
- Randomly selected queries from the long tail (a sampling sketch is given below)
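A minimal sketch of how such a stratified sample could be drawn, assuming a hypothetical query log available as a list of (query, visits) pairs sorted by visits; all names and sizes here are illustrative, not final:

```python
import random

def build_query_set(query_log, head_n=500, torso_n=27_000,
                    torso_sample=1500, tail_sample=500):
    """Draw a stratified query sample: full head, decile-sampled torso, random tail.

    query_log: list of (query, visits) tuples, sorted by visits in descending order.
    """
    head = [q for q, _ in query_log[:head_n]]                  # top queries, taken in full

    torso = [q for q, _ in query_log[head_n:head_n + torso_n]]
    decile_size = max(1, len(torso) // 10)
    per_decile = max(1, torso_sample // 10)
    torso_pick = []
    for d in range(10):                                        # sample evenly from each decile
        chunk = torso[d * decile_size:(d + 1) * decile_size]
        torso_pick += random.sample(chunk, min(per_decile, len(chunk)))

    tail = [q for q, _ in query_log[head_n + torso_n:]]        # the long tail
    tail_pick = random.sample(tail, min(tail_sample, len(tail)))

    return head + torso_pick + tail_pick
```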
Now, we will categorize this query set into the buckets mentioned below (and all possible combinations). The key objective here is that QUL should be able to correctly label these parameters within a search query.
Technical Buckets
1 Product name / PRODUCT
2 Brand
3 Cat/Sub-Cat
4 Brand + Cat/Sub-Cat
5 PRODUCT + Cat/Sub-Cat
6 Cat/Sub-Cat + Highlights/Attributes/Filter values
7 PRODUCT + Cat/Sub-Cat + Highlights/Attributes/Filter values
8 Brand + Cat/Sub-Cat + Highlights/Attributes/Filter values
9 Brand + Highlights/Attributes/Filter values
10 Highlights/Attributes/Filter values
Each bucket will be represented by at least 150 queries.
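As a sanity check on this categorization, a minimal sketch of rule-based bucket labelling, assuming hypothetical brand/category/attribute lexicons; the lexicons, names and rule below are illustrative only, not the actual QUL logic:

```python
# Hypothetical lexicons; in practice these would come from the catalogue.
BRANDS = {"samsung", "nike", "sony"}
CATEGORIES = {"mobile", "shoes", "tv", "cable"}
ATTRIBUTES = {"16gb", "red", "leather", "micro", "usb"}

def bucket(query: str) -> str:
    """Return a rough bucket label based on which token types appear in the query."""
    tokens = set(query.lower().split())
    parts = []
    if tokens & BRANDS:
        parts.append("Brand")
    if tokens & CATEGORIES:
        parts.append("Cat/Sub-Cat")
    if tokens & ATTRIBUTES:
        parts.append("Highlights/Attributes/Filter values")
    return " + ".join(parts) if parts else "Product name/PRODUCT"

# e.g. bucket("samsung mobile 16gb")
# -> "Brand + Cat/Sub-Cat + Highlights/Attributes/Filter values"
```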
Category / Sub-Category Buckets (top in terms of revenues/orders)
Each bucket will be represented by at least 100 queries.
Find out best practices for site search.
Input to Crowdsourcing Platform
The test user will be provided with a UI where the query and its respective top 10-20 results (?) are shown.
The results will be shown as per our business objectives:
- New Release Plan: Product results and test avatar results will be shown.
- Competitor Benchmarking: Product results for the host site and similar sites such as Flipkart, eBay and Amazon will be displayed; product and competitor results will be shown.
User Interface
- Most Relevant, Best Result (Totally relevant): The document completely answers the question.
- Relevant, Good Result (Partly relevant): The information in the document is relevant to the question but not complete.
- Irrelevant, Somewhere Close (Related): The document mentions the subject or holds potentially good hyperlinks to relevant pages, but does not contain any actual information regarding the query itself.
- Completely Irrelevant, Useless (Not relevant/Spam): The document is off topic or spam, not giving information about the subject.
Serial No. | Subject | Question Asked | Score Assigned |
1 | Most Relevant | Best Result | 3 |
2 | Relevant | Good Result | 2 |
3 | Irrelevant | Somewhere Close | 1 |
4 | Completely Irrelevant | Useless | 0 |
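A minimal sketch of this rating scale as a lookup from the label shown in the UI to the score assigned; the dictionary name and label strings are illustrative:

```python
# Map the judgement label shown in the UI to its numeric score.
RATING_SCALE = {
    "Most Relevant / Best Result": 3,
    "Relevant / Good Result": 2,
    "Irrelevant / Somewhere Close": 1,
    "Completely Irrelevant / Useless": 0,
}
```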
Output from Crowdsourcing Platform
The user will be asked to mark each PRODUCT with respect to the query provided, and the scores will be stored per query in the following format:
Query | PRODUCTS | Rank | User Scoring |
Q1 | PRODUCT1 | 1 | 3 |
Q1 | PRODUCT2 | 2 | 0 |
Q1 | PRODUCTn | 4 | 2 |
Q2 | PRODUCT4 | 1 | 3 |
Q2 | PRODUCT9 | 2 | 0 |
Q2 | PRODUCT11 | 4 | 2 |
Q2 | PRODUCTz | 1 | 3 |
When multiple users give feedback for the same query, the weighted average rank and weighted average user scoring will be considered.
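A minimal sketch of that aggregation, assuming each judgement arrives as a (query, product, rank, score, weight) tuple, where weight is a hypothetical per-rater reliability weight (use 1.0 throughout for a plain average):

```python
from collections import defaultdict

def aggregate(judgements):
    """Collapse multiple judgements per (query, product) into weighted averages.

    judgements: iterable of (query, product, rank, score, weight) tuples.
    Returns {(query, product): (weighted_avg_rank, weighted_avg_score)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])   # total weight, weighted rank, weighted score
    for query, product, rank, score, weight in judgements:
        acc = sums[(query, product)]
        acc[0] += weight
        acc[1] += weight * rank
        acc[2] += weight * score
    return {key: (r / w, s / w) for key, (w, r, s) in sums.items()}
```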
Post Analysis of Scores
We will calculate the Normalized Discounted Cumulative Gain (NDCG) for each query (explained below). Once we have computed NDCG values for each query, we can average them across thousands of queries. We can then compare two algorithms: we take the mean NDCG value for each and check, using a statistical test (such as a two-sided t-test), whether one algorithm is better than the other, and with what confidence.
Calculation of Cumulative Gains
Cumulative Gain (CG) does not take the position of a result into account when assessing the usefulness of a result set; it is simply the sum of the score values of all PRODUCTs for a query. The CG at a particular rank position p is defined as:
CG_p = rel_1 + rel_2 + ... + rel_p
where rel_i is the user score of the PRODUCT at position i.
Calculation for provided data
After computing the weighted average ranks and scores, the CG scores are as follows:
PRODUCTS | Rank | User Scoring | CG |
PRODUCT1 | 1 | 3 | 3 |
PRODUCT2 | 2 | 0 | 3 |
PRODUCT3 | 3 | 2 | 5 |
PRODUCT4 | 4 | 3 | 8 |
PRODUCTn | 5 | 1 | 9 |
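The CG column above is just a running sum of the user scores; a minimal sketch in Python:

```python
from itertools import accumulate

scores = [3, 0, 2, 3, 1]            # user scoring by rank, as in the table above
cg = list(accumulate(scores))       # [3, 3, 5, 8, 9]
```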
Calculation of Discounted Cumulative Gains
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized: the score value is reduced logarithmically, in proportion to the position of the result. The discounted CG accumulated at a particular rank position p is given by:
DCG_p = rel_1 + rel_2/log2(2) + rel_3/log2(3) + ... + rel_p/log2(p)
where rel_1 is the score of the PRODUCT at the top position and rel_i is the score of the PRODUCT at position i.
Calculation for provided data
PRODUCTS | Rank (i) | User Scoring (rel) | CG | log2(i) | rel/log2(i) | DCG |
PRODUCT1 | 1 | 3 | 3 | 0 | N/A | 3 |
PRODUCT2 | 2 | 2 | 5 | 1 | 2 | 5 |
PRODUCT3 | 3 | 3 | 8 | 1.585 | 1.892 | 6.892 |
PRODUCT4 | 4 | 0 | 8 | 2 | 0 | 6.892 |
PRODUCTn | 5 | 1 | 9 | 2.322 | 0.431 | 7.323 |
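A minimal sketch of the DCG computation, following the convention used in the table (the rank-1 score is taken as-is and later scores are divided by log2 of their rank):

```python
import math

def dcg(scores):
    """DCG_p = rel_1 + sum(rel_i / log2(i) for i = 2..p)."""
    total = scores[0] if scores else 0.0
    for i, rel in enumerate(scores[1:], start=2):
        total += rel / math.log2(i)
    return total

# Scores by rank from the table above: dcg([3, 2, 3, 0, 1]) ≈ 7.323
```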
Calculation of Normalized Discounted Cumulative Gains
The Normalized part in NDCG allows us to compare DCG values between different queries.
Search result lists vary in performance depending on the query. It is not fair to compare DCG values across queries because some queries are easier than others: for example, maybe it is easy to get four perfect results for the query samsung s4, and much harder to get four perfect results for short micro usb cable.
This is done by normalizing DCG with respect to the Ideal Discounted Cumulative Gain (IDCG), which is the best possible score given the results we have seen so far.
Example: if the best possible scores for a query, listed in increasing order of rank, are 3 3 2 2 0, then IDCG = 8.01.
Our NDCG is the DCG for the given result set divided by the ideal DCG (NDCG_p = DCG_p / IDCG_p). Now we can compare scores across queries, since we are comparing percentages of the best possible arrangement rather than raw scores.
PRODUCTS | Rank (i) | User Scoring (rel) | CG | log2(i) | rel/log2(i) | DCG | NDCG |
PRODUCT1 | 1 | 3 | 3 | 0 | N/A | 3 | 0.37 |
PRODUCT2 | 2 | 2 | 5 | 1 | 2 | 5 | 0.62 |
PRODUCT3 | 3 | 3 | 8 | 1.585 | 1.892 | 6.892 | 0.86 |
PRODUCT4 | 4 | 0 | 8 | 2 | 0 | 6.892 | 0.86 |
PRODUCTn | 5 | 1 | 9 | 2.322 | 0.431 | 7.323 | 0.91 |
NDCG = 0.91
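A minimal sketch of the normalization step; here the ideal ordering is assumed to be the observed scores sorted in decreasing order, so the exact IDCG (and hence NDCG) may differ slightly from the rounded figures in the worked example above:

```python
import math

def dcg(scores):
    """Same convention as above: rel_1 plus rel_i / log2(i) for i >= 2."""
    return (scores[0] if scores else 0.0) + sum(
        rel / math.log2(i) for i, rel in enumerate(scores[1:], start=2))

def ndcg(scores):
    """NDCG = DCG of the observed ordering / DCG of the ideal ordering."""
    idcg = dcg(sorted(scores, reverse=True))
    return dcg(scores) / idcg if idcg > 0 else 0.0

# e.g. ndcg([3, 2, 3, 0, 1]); the exact value depends on the IDCG assumed.
```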
Final Score Calculation for the Full Query Set
Once we’ve computed NDCG values for each query, we can average them across thousands of queries.
Testing Across Various Setups
Once the score is calculated for each setup (production, avatar, competitor), the setups will be compared pairwise using a statistical test (such as a two-sided t-test) to determine whether one algorithm is better than the other, and with what confidence.
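A minimal sketch of that comparison, assuming per-query NDCG lists for two setups evaluated on the same query set; scipy's paired two-sided t-test (ttest_rel) is used here, and ttest_ind would be the choice if the query sets differed:

```python
from statistics import mean
from scipy.stats import ttest_rel   # paired, two-sided by default

def compare_setups(ndcg_a, ndcg_b, alpha=0.05):
    """Compare two setups via their per-query NDCG values on the same query set."""
    result = ttest_rel(ndcg_a, ndcg_b)
    better = "first" if mean(ndcg_a) > mean(ndcg_b) else "second"
    if result.pvalue < alpha:
        verdict = f"the {better} setup is better (p = {result.pvalue:.4f})"
    else:
        verdict = f"no significant difference (p = {result.pvalue:.4f})"
    return mean(ndcg_a), mean(ndcg_b), verdict
```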