Dynamic Group Formation for Search Personalisation

In recent years, Search Personalisation has been attracting more and more attention in both academia and industry. Different from classical search methods in which the personalisation is not taken into account, personalised search engines utilise personal data of each user to tailor search results to the specific user, which depend not only on the input query but also on the user interest. Such personal data can be used to construct a user profile which is crucial to effective personalisation.

A typical approach is to build user profiles using main topics discussed in the user’s relevant (previously clicked) documents. The topics of a document are often obtained from a human-generated online ontology, such as the Open Directory Project (ODP) (White et al., 2013). This approach has a disadvantage that many topics may not appear in the ontology. Moreover, it requires expensive manual effort to determine correct categories for each document. To remedy this problem, we apply Latent Dirichlet Allocation (LDA) algorithm (Blei et al., 2003) to automatically learn latent topics from the user’s relevant documents (Section 1).

The performance of a search personalisation relies on the richness of a user profile (Teevan et al., 2009). Recent studies (Dou et al. 2007, White et al. 2013) indicated that enriching a user profile with data from a group of users who share common interests helps to improve search personalisation. In these methods, users with shared interests are statically grouped using some predetermined criterions such as: common clicked documents (Dou et al., 2007), searching locations (White et al., 2013).

Despite being successful in improving search performance, the methods neglect the fact that users in a statically-formed group may have different interests on different topics with respect to an input query. An example of static group formation using common clicks is described in Figure 1, in which regardless of the input query (either “plum tree” or “iPhone 5s”) from the second user, the first and the third users are grouped in the same group with the second user.

Static Group Formation

Figure 1 Static Group Formation

To handle this problem, in this research, we propose a novel model for query-dependent user grouping and leveraging the dynamic group information to enrich user profiles for search personalisation (Sections 2, 3). For example, in Figure 2, the second user shares her interest with the first user on tree-related topics and with the third user on Apple product–related topics, respectively. When she submits the query “plum tree”, the first user should be grouped in the same group with her; and we will only use the search history of the first user to enrich her search profile.

Dynamic Group Formation

Figure 2 Dynamic Group Formation

1. Constructing a user profile

We build the topic-based user profile using main topics extracted automatically from the user’s relevant data (i.e., clicked documents) using LDA algorithm (Blei et al., 2003). The proposed approach consists of three following steps:

  • The first step is to extract the relevant data of each user from the query logs. A log entity consists of an anonymous user-identifier, a submitted query, top-10 returned URLs, and clicked results along with the user’s dwell time.
  • In the second step, we apply LDA algorithm to extract topics from the relevant data. After this step, each document is also described as a distribution over topics, in which each topic is described as a distribution over a fixed vocabulary. Figures 3 and 4 show an example of LDA outputs.
The output topic set of LDA

Figure 3 The output topic set of LDA

The distribution over the topic set of a football paper

Figure 4 The distribution over the topic set of a football paper

  • In the final step, we use the topic distribution of each relevant document extracted from the second step to build a user profile. Each user is described as a set of relevant documents, in which each document is represented as a distribution over topics. Therefore, we can model each user profile as a distribution over topics. The probability of each topic on the user profile indicates how the user is interested in the topic. Figure 5 shows an example of how to construct a topic-based user profile from three relevant documents.
Modelling a user profile

Figure 5 Modelling a user profile

2. Query-dependent user grouping

For each input query of the current user, we use the input query as an indicator to dynamically group other users who share common interests with the user. It leads to a fact that different input queries will lead to different groups of similar users. Our method contains three steps detailed as follows:

  • The first step is to build the shared user profile between two users using the common relevant documents of the users. We construct the shared user profile as a distribution over the topics (Figure 6). After that, we get all shared user profiles of the current user with other users (Figure 7).
Constructing a shared user profile

Figure 6 Constructing a shared user profile

The final shared profiles

Figure 7 The final shared profiles

  • In the second step, with each input query, we calculate the similarity between the query and a shared user profile to get a similarity score which indicates how much interest is shared between two users given the query. By using the LDA topic distribution, we model each input query as a vector of topics. For example, in Figure 8, the query “windows 8” is described as a vector of topics. We then measure the similarity between a shared profile and an input query by using simple techniques such as Cosine Similarity. Figure 9 shows an example of cosine similarity scores between the input query “windows 8” and three shared profiles (Figure 7).
A vector over topics of a Query

Figure 8 A vector over topics of a Query

The cosine similarity between a query and the shared profiles

Figure 9 The cosine similarity between a query and the shared profiles

  • In the final step, we use the similarity score to form a group of the K-nearest users who share common interest with the current user given the input query. As shown in Figure 9, if we set K = 2, the third (green) and the second (pink) users are the two nearest users with the similarity scores of 0.7894 and 0.6130, respectively.

3. Re-ranking search results using group information

After obtaining the K-nearest users, we use the data from the K users to enrich the current user profile. Figure 10 shows an example of an enriched user profile. The profile is enriched on the fly because the K-nearest users are extracted at the query time.

An enriched user profile

Figure 10 An enriched user profile

We then utilise the enriched user profile to re-rank the original list of documents returned by a search engine to obtain a new ranked list (Vu et al., 2014 for more details). After re-raking, the performance of the search engine is best if more relevant documents are ranked higher.

4. Experiment


The dataset used in our experiments is the query logs of 106 anonymous users from Bing search engine for 15 days from 1st to 15th July 2012. We split the dataset into training set (the first 10 days) and test set (the remaining 5 days). In our experiments, we evaluate our proposed method by comparing the original rank list given by the commercial search engine and the re-ranked list given by our methods with the Inverse Average Rank (IAR) and Personalisation Gain (P-Gain) metrics (Vu et al., 2014). For each metric, the higher value indicates the better ranking quality.

Baseline and personalisation strategies

The baseline is the original ranked results from Bing; S_Profile is a personalisation approach using only the current user profile obtained from Section 1; S_Group uses the profile enriched with information from static grouping (as mentioned in Figure 11, we use only the number of common clicks to group users and are regardless of the input query); and D_Group is enriched with information from dynamic grouping.

The static grouping method

Figure 11 The static grouping method

Overall Performance

Table 1 shows that all personalisation strategies can improve over the baseline (i.e., all reported changes are positive in IAR and P-Gain values). Interestingly, D_Group has the highest (p < 0.01 with the paired t-test) improvement of 8.12% over the baseline (using IAR metric). S_Group and S_Profile also have significant improvements of 7.64% and 5.94%, respectively (p < 0.01) over the baseline. This shows that latent topic-based personalisation methods generally yield better web search performance.

Table 1 also shows that the Group-based methods (S_Group and D_Group) improve the IAR and P-Gain values over the S_Profile. Consistent with Teevan et al., (2009), our result confirms that the information from the group of users who share some common interests is helpful in building better user profiles. Furthermore, D_Group leads to improvements of 0.45% and 14.22% in IAR and P-Gain respectively over S_Group (p < 0.01).

Table 1 Overall performance of the methods.

Strategy IAR P-Gain
Baseline 0.3473
S_Profile 0.3679 0.1579
S_Group 0.3738 0.2848
D_Group 0.3755 0.3253

5. Last words

We have presented a novel dynamic grouping method to build a user profile for search personalisation. For a user, the profile is dynamically constructed and enriched with information from other users whose interests are similar to the user given a query. Applying it to web search, we used the enriched profile to rank relevant documents. In the experiment, we tested the proposed method on query logs from a major commercial web search engine. The results showed that the method improves the performance of current ranking and also achieves better performance than the static grouping method. In the future, we really hope to get larger scale query logs from data providers to perform large scale experiments.


  1. [Blei et al., 2003] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, 2003.
  2. [Dou et al., 2007] Z. Dou, R. Song, and J.-R. Wen. A large-scale evaluation and analysis of personalized search strategies. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 581-590, NY, USA, 2007. ACM.
  3. [Teevan et al., 2009] J. Teevan, M. R. Morris, and S. Bush. Discovering and using groups to improve personalized search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, pages 15-24, USA, 2009. ACM.
  4. [Vu et al., 2014] Vu, Thanh; Song, Dawei; Willis, Alistair; Tran, Son N. and Li, Jingfei (2014). Improving search personalisation with dynamic group formation. The 37th Annual International ACM SIGIR Conference, 6-11 Jul 2014, Gold Coast, Australia, ACM.
  5. [White et al., 2013] R. W. White, W. Chu, A. Hassan, X. He, Y. Song, and H. Wang. Enhancing personalized search by mining and modeling task behavior. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 1411-1420, Switzerland, 2013. ACM.
About Thanh Vu
Thanh Vu

Thanh Vu is a second-year PhD student in the Computing and Communications Department at the Open University. His main research focuses on web search personalisation. Thanh was awarded a B.Sc in Information System and an M.Sc in Computer Science from VNUH, University of Engineering and Technology, Vietnam in 2009 and 2013, respectively.