Site Search Benchmarking via Crowdsourcing
Objective:
Our objective is to use a crowdsourcing platform to find the relevancy of the SRP (search results page) and benchmark it against competitors. Apart from competitors, we could also gauge search result relevancy against production and planned new releases.
Since site search contributes 60%+ of revenues, it is essential to gauge search relevance and devise a systematic approach for the same. The crowdsourcing platform could provide us optimal results in a very short span of time.
Query Set Preparation:
A comprehensive query set (2,500 queries) will be prepared, representative of the overall search queries on our website (this set will account for a significant percentage of our key metrics – visits and revenue). It will consist of:
- The top 450-500 search queries by visits, since they represent 20-25% of the total visit count and are therefore the most important
- From the next 27k queries (which represent 40% of the search visits), random samples drawn from all deciles/quartiles
- Randomly selected queries from the long tail (a sampling sketch is given below)
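A minimal sketch of how such a stratified sample could be drawn, assuming a hypothetical query log available as a list of (query, visits) pairs sorted by visits; all names and sizes here are illustrative, not final:

```python
import random

def build_query_set(query_log, head_n=500, torso_n=27_000,
                    torso_sample=1500, tail_sample=500):
    """Draw a stratified query sample: full head, decile-sampled torso, random tail.

    query_log: list of (query, visits) tuples, sorted by visits in descending order.
    """
    head = [q for q, _ in query_log[:head_n]]                  # top queries, taken in full

    torso = [q for q, _ in query_log[head_n:head_n + torso_n]]
    decile_size = max(1, len(torso) // 10)
    per_decile = max(1, torso_sample // 10)
    torso_pick = []
    for d in range(10):                                        # sample evenly from each decile
        chunk = torso[d * decile_size:(d + 1) * decile_size]
        torso_pick += random.sample(chunk, min(per_decile, len(chunk)))

    tail = [q for q, _ in query_log[head_n + torso_n:]]        # the long tail
    tail_pick = random.sample(tail, min(tail_sample, len(tail)))

    return head + torso_pick + tail_pick
```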
Now, we will categorize this query set into the buckets mentioned below (and all possible combinations). The key objective here is that QUL should be able to correctly label these parameters within a search query.
Technical Buckets
1 Product name / PRODUCT
2 Brand
3 Cat/Sub-Cat
4 Brand + Cat/Sub-Cat
5 PRODUCT + Cat/Sub-Cat
6 Cat/Sub-Cat + Highlights/Attributes/Filter values
7 PRODUCT + Cat/Sub-Cat + Highlights/Attributes/Filter values
8 Brand + Cat/Sub-Cat + Highlights/Attributes/Filter values
9 Brand + Highlights/Attributes/Filter values
10 Highlights/Attributes/Filter values
Each bucket will be represented by at least 150 queries.
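As a sanity check on this categorization, a minimal sketch of rule-based bucket labelling, assuming hypothetical brand/category/attribute lexicons; the lexicons, names and rule below are illustrative only, not the actual QUL logic:

```python
# Hypothetical lexicons; in practice these would come from the catalogue.
BRANDS = {"samsung", "nike", "sony"}
CATEGORIES = {"mobile", "shoes", "tv", "cable"}
ATTRIBUTES = {"16gb", "red", "leather", "micro", "usb"}

def bucket(query: str) -> str:
    """Return a rough bucket label based on which token types appear in the query."""
    tokens = set(query.lower().split())
    parts = []
    if tokens & BRANDS:
        parts.append("Brand")
    if tokens & CATEGORIES:
        parts.append("Cat/Sub-Cat")
    if tokens & ATTRIBUTES:
        parts.append("Highlights/Attributes/Filter values")
    return " + ".join(parts) if parts else "Product name/PRODUCT"

# e.g. bucket("samsung mobile 16gb")
# -> "Brand + Cat/Sub-Cat + Highlights/Attributes/Filter values"
```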
Category / Sub-Category Buckets (top in terms of revenues/orders)
Each bucket will be represented by at least 100 queries.
Find out best practices for site search.
Input to Crowdsourcing Platform
The test user will be provided with a UI where the query and its respective top 10-20 results (?) are shown.
The results will be shown as per our business objectives:
- New Release Plan: Product results and test avatar results will be shown.
- Competitor Benchmarking: Product results for the host site and similar sites such as Flipkart, eBay and Amazon will be displayed; product and competitor results will be shown.
User Interface
- Most Relevant, Best Result (Totally relevant): The document completely answers the question.
- Relevant, Good Result (Partly relevant): The information in the document is relevant to the question but not complete.
- Irrelevant, Somewhere Close (Related): The document mentions the subject or holds potentially good hyperlinks to relevant pages, but does not contain any actual information regarding the query itself.
- Completely Irrelevant, Useless (Not relevant/Spam): The document is off topic or spam, not giving information about the subject.
Serial No. | Subject | Question Asked | Score Assigned |
1 | Most Relevant | Best Result | 3 |
2 | Relevant | Good Result | 2 |
3 | Irrelevant | Somewhere Close | 1 |
4 | Completely Irrelevant | Useless | 0 |
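A minimal sketch of this rating scale as a lookup from the label shown in the UI to the score assigned; the dictionary name and label strings are illustrative:

```python
# Map the judgement label shown in the UI to its numeric score.
RATING_SCALE = {
    "Most Relevant / Best Result": 3,
    "Relevant / Good Result": 2,
    "Irrelevant / Somewhere Close": 1,
    "Completely Irrelevant / Useless": 0,
}
```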
Output from Crowdsourcing Platform
The user will be asked to mark each PRODUCT with respect to the query provided, and the scores will be stored per query in the following format:
Query | PRODUCTS | Rank | User Scoring |
Q1 | PRODUCT1 | 1 | 3 |
Q1 | PRODUCT2 | 2 | 0 |
Q1 | PRODUCTn | 4 | 2 |
Q2 | PRODUCT4 | 1 | 3 |
Q2 | PRODUCT9 | 2 | 0 |
Q2 | PRODUCT11 | 4 | 2 |
Q2 | PRODUCTz | 1 | 3 |
When multiple users give feedback for the same query, the weighted average rank and weighted average user scoring will be considered.
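A minimal sketch of that aggregation, assuming each judgement arrives as a (query, product, rank, score, weight) tuple, where weight is a hypothetical per-rater reliability weight (use 1.0 throughout for a plain average):

```python
from collections import defaultdict

def aggregate(judgements):
    """Collapse multiple judgements per (query, product) into weighted averages.

    judgements: iterable of (query, product, rank, score, weight) tuples.
    Returns {(query, product): (weighted_avg_rank, weighted_avg_score)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])   # total weight, weighted rank, weighted score
    for query, product, rank, score, weight in judgements:
        acc = sums[(query, product)]
        acc[0] += weight
        acc[1] += weight * rank
        acc[2] += weight * score
    return {key: (r / w, s / w) for key, (w, r, s) in sums.items()}
```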
Post Analysis of Scores
We will calculate the Normalized Discounted Cumulative Gain (NDCG) for each query (explained below). Once we have computed NDCG values for each query, we can average them across thousands of queries. We can then compare two algorithms: we take the mean NDCG value for each and check, using a statistical test (such as a two-sided t-test), whether one algorithm is better than the other, and with what confidence.
Calculation of Cumulative Gains
Cumulative Gain (CG) does not take the position of a result into account when assessing the usefulness of a result set; it is simply the sum of the score values of all PRODUCTs for a query. The CG at a particular rank position p is defined as:
CG_p = rel_1 + rel_2 + ... + rel_p
where rel_i is the user score of the PRODUCT at position i.
Calculation for provided data
After computing the weighted average ranks and scores, the CG scores are as follows:
PRODUCTS | Rank | User Scoring | CG |
PRODUCT1 | 1 | 3 | 3 |
PRODUCT2 | 2 | 0 | 3 |
PRODUCT3 | 3 | 2 | 5 |
PRODUCT4 | 4 | 3 | 8 |
PRODUCTn | 5 | 1 | 9 |
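The CG column above is just a running sum of the user scores; a minimal sketch in Python:

```python
from itertools import accumulate

scores = [3, 0, 2, 3, 1]            # user scoring by rank, as in the table above
cg = list(accumulate(scores))       # [3, 3, 5, 8, 9]
```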
Calculation of Discounted Cumulative Gains
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized: the score value is reduced logarithmically, in proportion to the position of the result. The discounted CG accumulated at a particular rank position p is given by:
DCG_p = rel_1 + rel_2/log2(2) + rel_3/log2(3) + ... + rel_p/log2(p)
where rel_1 is the score of the PRODUCT at the top position and rel_i is the score of the PRODUCT at position i.
Calculation for provided data
PRODUCTS | Rank (i) | User Scoring (rel) | CG | log2(i) | rel/log2(i) | DCG |
PRODUCT1 | 1 | 3 | 3 | 0 | N/A | 3 |
PRODUCT2 | 2 | 2 | 5 | 1 | 2 | 5 |
PRODUCT3 | 3 | 3 | 8 | 1.585 | 1.892 | 6.892 |
PRODUCT4 | 4 | 0 | 8 | 2 | 0 | 6.892 |
PRODUCTn | 5 | 1 | 9 | 2.322 | 0.431 | 7.323 |
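A minimal sketch of the DCG computation, following the convention used in the table (the rank-1 score is taken as-is and later scores are divided by log2 of their rank):

```python
import math

def dcg(scores):
    """DCG_p = rel_1 + sum(rel_i / log2(i) for i = 2..p)."""
    total = scores[0] if scores else 0.0
    for i, rel in enumerate(scores[1:], start=2):
        total += rel / math.log2(i)
    return total

# Scores by rank from the table above: dcg([3, 2, 3, 0, 1]) ≈ 7.323
```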
Calculation of Normalized Discounted Cumulative Gains
The Normalized part in NDCG allows us to compare DCG values between different queries.
Search result lists vary in performance depending on the query. It is not fair to compare DCG values across queries because some queries are easier than others: for example, maybe it is easy to get four perfect results for the query samsung s4, and much harder to get four perfect results for short micro usb cable.
This is done by normalizing DCG with respect to the Ideal Discounted Cumulative Gain (IDCG), which is the best possible score given the results we have seen so far.
Example: if the best possible scores for a query, listed in increasing order of rank, are 3 3 2 2 0, then IDCG = 8.01.
Our NDCG is the DCG for the given result set divided by the ideal DCG (NDCG_p = DCG_p / IDCG_p). Now we can compare scores across queries, since we are comparing percentages of the best possible arrangement rather than raw scores.
PRODUCTS | Rank (i) | User Scoring (rel) | CG | log2(i) | rel/log2(i) | DCG | NDCG |
PRODUCT1 | 1 | 3 | 3 | 0 | N/A | 3 | 0.37 |
PRODUCT2 | 2 | 2 | 5 | 1 | 2 | 5 | 0.62 |
PRODUCT3 | 3 | 3 | 8 | 1.585 | 1.892 | 6.892 | 0.86 |
PRODUCT4 | 4 | 0 | 8 | 2 | 0 | 6.892 | 0.86 |
PRODUCTn | 5 | 1 | 9 | 2.322 | 0.431 | 7.323 | 0.91 |
NDCG = 0.91
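A minimal sketch of the normalization step; here the ideal ordering is assumed to be the observed scores sorted in decreasing order, so the exact IDCG (and hence NDCG) may differ slightly from the rounded figures in the worked example above:

```python
import math

def dcg(scores):
    """Same convention as above: rel_1 plus rel_i / log2(i) for i >= 2."""
    return (scores[0] if scores else 0.0) + sum(
        rel / math.log2(i) for i, rel in enumerate(scores[1:], start=2))

def ndcg(scores):
    """NDCG = DCG of the observed ordering / DCG of the ideal ordering."""
    idcg = dcg(sorted(scores, reverse=True))
    return dcg(scores) / idcg if idcg > 0 else 0.0

# e.g. ndcg([3, 2, 3, 0, 1]); the exact value depends on the IDCG assumed.
```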
Final Score Calculation for the Full Query Set
Once we’ve computed NDCG values for each query, we can average them across thousands of queries.
Testing Across Various Setups
Once the score is calculated for each setup (production, avatar, competitor), the setups will be compared pairwise using a statistical test (such as a two-sided t-test) to determine whether one algorithm is better than the other, and with what confidence.
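A minimal sketch of that comparison, assuming per-query NDCG lists for two setups evaluated on the same query set; scipy's paired two-sided t-test (ttest_rel) is used here, and ttest_ind would be the choice if the query sets differed:

```python
from statistics import mean
from scipy.stats import ttest_rel   # paired, two-sided by default

def compare_setups(ndcg_a, ndcg_b, alpha=0.05):
    """Compare two setups via their per-query NDCG values on the same query set."""
    result = ttest_rel(ndcg_a, ndcg_b)
    better = "first" if mean(ndcg_a) > mean(ndcg_b) else "second"
    if result.pvalue < alpha:
        verdict = f"the {better} setup is better (p = {result.pvalue:.4f})"
    else:
        verdict = f"no significant difference (p = {result.pvalue:.4f})"
    return mean(ndcg_a), mean(ndcg_b), verdict
```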