In recent years, Search Personalisation has been attracting more and more attention in both academia and industry. Different from classical search methods in which the personalisation is not taken into account, personalised search engines utilise personal data of each user to tailor search results to the specific user, which depend not only on the input query but also on the user interest. Such personal data can be used to construct a user profile which is crucial to effective personalisation.
A typical approach is to build user profiles using main topics discussed in the user’s relevant (previously clicked) documents. The topics of a document are often obtained from a human-generated online ontology, such as the Open Directory Project (ODP) (White et al., 2013). This approach has a disadvantage that many topics may not appear in the ontology. Moreover, it requires expensive manual effort to determine correct categories for each document. To remedy this problem, we apply Latent Dirichlet Allocation (LDA) algorithm (Blei et al., 2003) to automatically learn latent topics from the user’s relevant documents (Section 1).
The performance of a search personalisation relies on the richness of a user profile (Teevan et al., 2009). Recent studies (Dou et al. 2007, White et al. 2013) indicated that enriching a user profile with data from a group of users who share common interests helps to improve search personalisation. In these methods, users with shared interests are statically grouped using some predetermined criterions such as: common clicked documents (Dou et al., 2007), searching locations (White et al., 2013).
Despite being successful in improving search performance, the methods neglect the fact that users in a statically-formed group may have different interests on different topics with respect to an input query. An example of static group formation using common clicks is described in Figure 1, in which regardless of the input query (either “plum tree” or “iPhone 5s”) from the second user, the first and the third users are grouped in the same group with the second user.
To handle this problem, in this research, we propose a novel model for query-dependent user grouping and leveraging the dynamic group information to enrich user profiles for search personalisation (Sections 2, 3). For example, in Figure 2, the second user shares her interest with the first user on tree-related topics and with the third user on Apple product–related topics, respectively. When she submits the query “plum tree”, the first user should be grouped in the same group with her; and we will only use the search history of the first user to enrich her search profile.
1. Constructing a user profile
We build the topic-based user profile using main topics extracted automatically from the user’s relevant data (i.e., clicked documents) using LDA algorithm (Blei et al., 2003). The proposed approach consists of three following steps:
- The first step is to extract the relevant data of each user from the query logs. A log entity consists of an anonymous user-identifier, a submitted query, top-10 returned URLs, and clicked results along with the user’s dwell time.
- In the second step, we apply LDA algorithm to extract topics from the relevant data. After this step, each document is also described as a distribution over topics, in which each topic is described as a distribution over a fixed vocabulary. Figures 3 and 4 show an example of LDA outputs.
- In the final step, we use the topic distribution of each relevant document extracted from the second step to build a user profile. Each user is described as a set of relevant documents, in which each document is represented as a distribution over topics. Therefore, we can model each user profile as a distribution over topics. The probability of each topic on the user profile indicates how the user is interested in the topic. Figure 5 shows an example of how to construct a topic-based user profile from three relevant documents.
2. Query-dependent user grouping
For each input query of the current user, we use the input query as an indicator to dynamically group other users who share common interests with the user. It leads to a fact that different input queries will lead to different groups of similar users. Our method contains three steps detailed as follows:
- The first step is to build the shared user profile between two users using the common relevant documents of the users. We construct the shared user profile as a distribution over the topics (Figure 6). After that, we get all shared user profiles of the current user with other users (Figure 7).
- In the second step, with each input query, we calculate the similarity between the query and a shared user profile to get a similarity score which indicates how much interest is shared between two users given the query. By using the LDA topic distribution, we model each input query as a vector of topics. For example, in Figure 8, the query “windows 8” is described as a vector of topics. We then measure the similarity between a shared profile and an input query by using simple techniques such as Cosine Similarity. Figure 9 shows an example of cosine similarity scores between the input query “windows 8” and three shared profiles (Figure 7).
- In the final step, we use the similarity score to form a group of the K-nearest users who share common interest with the current user given the input query. As shown in Figure 9, if we set K = 2, the third (green) and the second (pink) users are the two nearest users with the similarity scores of 0.7894 and 0.6130, respectively.
3. Re-ranking search results using group information
After obtaining the K-nearest users, we use the data from the K users to enrich the current user profile. Figure 10 shows an example of an enriched user profile. The profile is enriched on the fly because the K-nearest users are extracted at the query time.
We then utilise the enriched user profile to re-rank the original list of documents returned by a search engine to obtain a new ranked list (Vu et al., 2014 for more details). After re-raking, the performance of the search engine is best if more relevant documents are ranked higher.
The dataset used in our experiments is the query logs of 106 anonymous users from Bing search engine for 15 days from 1st to 15th July 2012. We split the dataset into training set (the first 10 days) and test set (the remaining 5 days). In our experiments, we evaluate our proposed method by comparing the original rank list given by the commercial search engine and the re-ranked list given by our methods with the Inverse Average Rank (IAR) and Personalisation Gain (P-Gain) metrics (Vu et al., 2014). For each metric, the higher value indicates the better ranking quality.
Baseline and personalisation strategies
The baseline is the original ranked results from Bing; S_Profile is a personalisation approach using only the current user profile obtained from Section 1; S_Group uses the profile enriched with information from static grouping (as mentioned in Figure 11, we use only the number of common clicks to group users and are regardless of the input query); and D_Group is enriched with information from dynamic grouping.
Table 1 shows that all personalisation strategies can improve over the baseline (i.e., all reported changes are positive in IAR and P-Gain values). Interestingly, D_Group has the highest (p < 0.01 with the paired t-test) improvement of 8.12% over the baseline (using IAR metric). S_Group and S_Profile also have significant improvements of 7.64% and 5.94%, respectively (p < 0.01) over the baseline. This shows that latent topic-based personalisation methods generally yield better web search performance.
Table 1 also shows that the Group-based methods (S_Group and D_Group) improve the IAR and P-Gain values over the S_Profile. Consistent with Teevan et al., (2009), our result confirms that the information from the group of users who share some common interests is helpful in building better user profiles. Furthermore, D_Group leads to improvements of 0.45% and 14.22% in IAR and P-Gain respectively over S_Group (p < 0.01).
Table 1 Overall performance of the methods.
5. Last words
We have presented a novel dynamic grouping method to build a user profile for search personalisation. For a user, the profile is dynamically constructed and enriched with information from other users whose interests are similar to the user given a query. Applying it to web search, we used the enriched profile to rank relevant documents. In the experiment, we tested the proposed method on query logs from a major commercial web search engine. The results showed that the method improves the performance of current ranking and also achieves better performance than the static grouping method. In the future, we really hope to get larger scale query logs from data providers to perform large scale experiments.
- [Blei et al., 2003] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, 2003.
- [Dou et al., 2007] Z. Dou, R. Song, and J.-R. Wen. A large-scale evaluation and analysis of personalized search strategies. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 581-590, NY, USA, 2007. ACM.
- [Teevan et al., 2009] J. Teevan, M. R. Morris, and S. Bush. Discovering and using groups to improve personalized search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, pages 15-24, USA, 2009. ACM.
- [Vu et al., 2014] Vu, Thanh; Song, Dawei; Willis, Alistair; Tran, Son N. and Li, Jingfei (2014). Improving search personalisation with dynamic group formation. The 37th Annual International ACM SIGIR Conference, 6-11 Jul 2014, Gold Coast, Australia, ACM.
- [White et al., 2013] R. W. White, W. Chu, A. Hassan, X. He, Y. Song, and H. Wang. Enhancing personalized search by mining and modeling task behavior. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 1411-1420, Switzerland, 2013. ACM.