Benchmarking News Recommendations: The CLEF NewsREEL Use Case

Many online news portals display at the bottom of their articles a small widget box labelled “You might also be interested in”, “Recommended articles”, or similar, where users can find a list of recommended news articles. An example is shown in Figure 1.

Figure 1: News article recommendations are often displayed in a widget box below the original article. (Image: Courtesy of T. Brodt of plista GmbH)

From a technical point of view, this recommendation use case is rather challenging. First of all, recommendations are required in real time whenever a visitor accesses a news article on one of these portals. Real-time news recommendation differs from most traditional recommender scenarios that have been studied in the literature: instead of computing recommendations based on a static set of users and items, the challenge here is to provide recommendations for a news article stream characterised by a continuously changing set of users and items. Moreover, news content publishers constantly update their existing news articles or add new content. The short lifecycle of items and the strict time constraints for recommending news articles place great demands on the recommender strategies. In a stream-based scenario the recommender algorithms must be able to cope with a large number of newly created articles and should be able to discard old articles, since recommended news articles should be “new”. Thus, the recommender algorithms must be continuously adapted to meet the special requirements of the news recommendation scenario. Finally, the recommendations must be provided quickly, since most users are not willing to wait for recommendations that they did not request in the first place.

Since 2014, this recommendation scenario has been addressed in the News REcommendation Evaluation Lab (NewsREEL), a campaign-style evaluation lab of CLEF [1]. The lab focuses on supporting (personalised) content selection in the form of news recommendations. This blog post provides an overview of the objectives and challenges of the NewsREEL 2015 lab. For a more detailed introduction, we refer to the references below and to the NewsREEL session at CLEF 2015 in Toulouse, France.

CLEF NewsREEL 2015

CLEF NewsREEL 2015 consisted of two tasks in which news recommendation algorithms for streamed data could be evaluated in an online and an offline mode, respectively. The online evaluation platform used in Task 1 enabled participants to provide recommendations and observe users’ responses. This scenario can be seen as an example of Evaluation-as-a-Service [2, 3], where a service API is provided rather than a data set. Task 2 is based on a recorded data set that provides the ground truth for a simulation-based evaluation. In contrast to the first scenario, performing an offline evaluation allows us to issue the same request to different algorithms and subsequently compare them. Additionally, it allows us to measure factors such as time and space complexity.

Task 1: Online Evaluation

In the first subtask, the idea of living laboratories was implemented: researchers gained access to the resources of a company to evaluate different recommendation techniques using A/B testing. A/B testing benchmarks variants of a recommender system on a large group of users. It is increasingly adopted for the evaluation of commercial systems with a large user base, as it allows the efficiency and effectiveness of recommendation algorithms to be observed under real conditions. While online evaluation is the de-facto standard evaluation methodology in industry, university-based researchers often have access to neither the infrastructure nor the user base needed to perform online evaluation on a larger scale. NewsREEL is the first living lab in which researchers gain access to both infrastructure and user requests to benchmark algorithms for information access systems using A/B testing. The living lab is described in detail in [4]. A similar, somewhat more constrained, IR-centric set-up is implemented in the CLEF Living Labs for Information Retrieval lab [5].
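The routing logic behind such an A/B test can be sketched in a few lines of Python. This is our own illustration, not part of ORP: each incoming request is assigned to a randomly chosen recommender variant, and impressions and clicks are tallied per variant so that click-through rates can be compared afterwards.

```python
import random
from collections import defaultdict

class ABTest:
    """Minimal A/B testing harness (illustrative): each request is routed
    to a randomly chosen recommender variant; impressions and clicks are
    tallied per variant so their CTRs can be compared."""

    def __init__(self, variants):
        self.variants = variants            # variant name -> recommender callable
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def handle_request(self, request):
        # Pick a variant uniformly at random and record the impression.
        name = random.choice(list(self.variants))
        self.impressions[name] += 1
        return name, self.variants[name](request)

    def record_click(self, name):
        # Called when a user clicks a recommendation served by `name`.
        self.clicks[name] += 1

    def ctr(self, name):
        shown = self.impressions[name]
        return self.clicks[name] / shown if shown else 0.0
```

In a real deployment the assignment would typically be sticky per user session rather than per request, but the principle of comparing variants on live traffic is the same.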

Within NewsREEL, the infrastructure is provided by plista GmbH, a company that offers a recommendation service for online publishers. Whenever a user requests an article from one of their customers’ web portals, plista recommends further articles that the user might be interested in. In NewsREEL, plista outsourced a subset of this recommendation task to interested researchers via its Open Recommendation Platform (ORP) [6]. Once a user visits a news web page assigned to the NewsREEL challenge, a recommendation request is sent to a randomly selected team registered with ORP. For each request, the team then has to provide a list of up to six recommendations. Since recommendations are delivered to real users, a time limit of 100 ms was set for completing each recommendation request.
We provided a baseline algorithm implementing a simple but powerful recommendation strategy: recommend the items most recently requested by other users. The idea behind this strategy is that items currently interesting to some users might also be interesting to others; implicitly, it assumes that users are able to determine relevant articles for one another. The performance of the recommender algorithms was measured by the click-through rate (CTR) recorded in four pre-defined time periods. Figure 2 shows the CTR of the baseline recommender observed during the final evaluation period of NewsREEL 2015.
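A minimal sketch of this recency-based baseline in Python (our own illustration, not the actual NewsREEL implementation) might look as follows: an ordered map keeps each article once, in order of its latest request, so the oldest articles naturally fall out when capacity is exceeded.

```python
from collections import OrderedDict

class RecencyRecommender:
    """Illustrative sketch of the baseline strategy: recommend the items
    most recently requested by other users. Each article appears once,
    ordered by its latest request; old articles are discarded."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.recent = OrderedDict()   # article_id -> None, most recent last

    def observe(self, article_id):
        # Move the article to the "most recent" end (re-insert if seen before).
        self.recent.pop(article_id, None)
        self.recent[article_id] = None
        if len(self.recent) > self.capacity:
            self.recent.popitem(last=False)   # drop the oldest article

    def recommend(self, current_article, k=6):
        # Up to k most recently requested articles, excluding the one being read.
        return [a for a in reversed(self.recent) if a != current_article][:k]
```

Despite its simplicity, a strategy of this kind is hard to beat in news recommendation, because recency correlates strongly with what readers currently find interesting.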

Figure 2: CTR of the baseline recommender algorithm for the NewsREEL’s evaluation period (May–June 2015)

Task 2: Offline Evaluation

Evaluating recommender algorithms online in a living lab leads to results that are difficult to reproduce, since the set of users and items as well as the user preferences change continuously. This hampers the evaluation and optimisation of algorithms, because different algorithms or different parameter settings cannot be tested in an exactly repeatable procedure. Addressing this issue, the second subtask of NewsREEL focused on simulating a constant data stream as provided by ORP. An offline evaluation lets us issue the identical request to several algorithms and compare their answers directly, and it also allows us to measure factors such as time and space complexity. The offline task is described in more detail in [7].

To achieve this goal, we provided a large data set comprising interactions between users and various news portals over a two-month time span. Since these news portals publish articles in German, around 80% of all users came from one of the German-speaking countries in Central Europe. Figure 3 highlights the regions from which the interactions were triggered.

Figure 3: First-level and second-level NUTS regions in Germany, Austria, and Switzerland from which requests for articles were triggered. The scale indicates the number of requests during one month.

Moreover, we employed the benchmarking framework Idomaar, which makes it possible to simulate data streams by “replaying” a recorded stream. The framework is being developed in the CrowdRec project and adopts open-source technologies widely known in the research community for handling large-scale streams of data (e.g., Apache Kafka, Apache Spark). Idomaar allows participants to execute and test the proposed news recommendation algorithms independently of the execution framework and the programming language used for development. Participants in this task had to predict users’ clicks on recommended news articles in simulated real time. The proposed algorithms were evaluated against both functional (i.e., recommendation quality) and non-functional (i.e., response time) metrics.
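The combination of functional and non-functional metrics can be illustrated with a small Python sketch (our own simplification; Idomaar's actual evaluation pipeline is considerably more involved): for each replayed request we record both whether the clicked article appeared among the top-k recommendations and how long the recommender took to answer.

```python
import time

def evaluate(recommender, requests, ground_truth, k=6):
    """Illustrative combined evaluation: a functional metric (did the
    clicked article appear in the top-k list?) and a non-functional
    metric (how long did each answer take?)."""
    hits, latencies = 0, []
    for request, clicked in zip(requests, ground_truth):
        start = time.perf_counter()
        recs = recommender(request)[:k]
        latencies.append(time.perf_counter() - start)
        if clicked in recs:
            hits += 1
    return {
        "hit_rate": hits / len(requests),            # functional metric
        "max_latency_ms": max(latencies) * 1000.0,   # non-functional metric
    }
```

Reporting the maximum (rather than only the mean) latency matters here, because in the online task every single request has to be answered within the 100 ms limit.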

NewsREEL and You

The NewsREEL challenge supports recommender system benchmarking in taking a critical step towards the widespread adoption of online benchmarking (i.e., “living lab evaluation”). Further, the Idomaar framework for offline evaluation of stream recommendation allows multi-dimensional evaluation of stream-based recommender systems. Testing stream-based algorithms is important for companies that offer recommender system services or provide recommendations directly to their customers. Until now, however, such testing has occurred in house, and consistent, open evaluation of algorithms across the board was frequently impossible. Because NewsREEL provides a huge data set and enables reproducible evaluation of recommender system algorithms, it has the power to reveal the underlying strengths and weaknesses of these algorithms. Such evaluations provide valuable insights that help to drive forward the state of the art.

If you would like to learn more about the advantages of living lab evaluation, have a look at the slides of this presentation. Also, if you are attending ACM RecSys’15 in Vienna, don’t forget to register for our tutorial on “Real-Time Recommendation of Streamed Data”, where we will explain how to use the technologies presented in NewsREEL for your own research. Last but not least, you can also have a look at the list of papers that are based on the NewsREEL use case.


NewsREEL would like to acknowledge the funding that it has received from the EU FP7 project CrowdRec.


  1. Nicola Ferro: CLEF 15th Birthday: Past, Present, and Future. SIGIR Forum 48(2): 31-55 (2014)
  2. Jimmy Lin, Miles Efron: Evaluation as a service for information retrieval. SIGIR Forum 47(2): 8-14 (2013)
  3. Frank Hopfgartner, Allan Hanbury, Henning Müller, Noriko Kando, Simon Mercer, Jayashree Kalpathy-Cramer, Martin Potthast, Tim Gollub, Anastasia Krithara, Jimmy Lin, Krisztian Balog, Ivan Eggel: Report on the Evaluation-as-a-Service (EaaS) Expert Workshop. SIGIR Forum 49(1): 57-65 (2015)
  4. Frank Hopfgartner, Benjamin Kille, Andreas Lommatzsch, Till Plumbaum, Torben Brodt, Tobias Heintz: Benchmarking News Recommendations in a Living Lab. CLEF 2014: 250-267
  5. Krisztian Balog, Liadh Kelly, Anne Schuth: Head First: Living Labs for Ad-hoc Search Evaluation. CIKM 2014: 1815-1818
  6. Torben Brodt, Frank Hopfgartner: Shedding light on a living lab: the CLEF NEWSREEL open recommendation platform. IIiX 2014: 223-226
  7. Benjamin Kille, Andreas Lommatzsch, Roberto Turrin, Andras Sereny, Martha Larson, Torben Brodt, Jonas Seiler, Frank Hopfgartner: Stream-Based Recommendations: Online and Offline Evaluation as a Service. CLEF 2015 (to appear)


About Frank Hopfgartner