This section presents the results obtained by applying the approach presented in Sect. Diversity in multi-partite graphs and recommender systems to a dataset recording user activity on an online platform featuring musical content. We first present the general setting (Sect. Experimental setting), before investigating three questions: how diversified are the recommendations and what is the impact of the parameters (Sect. Analysis of the recommendations)? What is the effect of the recommendations on users’ diversity (Sect. Effect of the recommendations on users’ diversity)? How can the differences in the way recommendations affect users’ diversity be explained (Sect. Examining the effect of representative recommendations)?
Experimental setting
Dataset
The dataset we used comes from the Million Song Dataset project (Bertin-Mahieux et al. 2011). To create the first two layers of the tripartite graph and the user–item links, we used the Echo Nest user taste profile dataset of the project. It features (user, item, play count) triplets that describe how many times a user has listened to a given song. To add the third layer of the tripartite graph and create the item–category links, we used the last.fm dataset of the project. It contains (item, tag, strength) triplets that describe the tags associated with each song, along with their strength. Furthermore, in order to obtain a coherent tripartite graph, we performed the following operations: we selected only the 1,000 most popular tagsFootnote 3, deleted songs with ambiguous identifiersFootnote 4, and deleted songs with no tag and users with no songs. Finally, in order to reduce the training time of our models, we randomly sampled 100,000 users and their recorded items.
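As an illustration of the filtering pipeline, the operations above can be sketched with pandas. The column names are assumptions (the raw Million Song Dataset files use their own schemas), and the ambiguous-identifier deletion of Footnote 4 is omitted:

```python
import pandas as pd

def build_tripartite(plays, tags, n_tags=1000, n_users=100_000, seed=0):
    """Filter the raw triplets into a coherent tripartite graph.

    plays: DataFrame with columns (user, song, count)  -- hypothetical names
    tags:  DataFrame with columns (song, tag, strength)
    """
    # Keep only the n_tags most popular tags (here, popularity is taken
    # to be the number of tagged songs -- an assumption).
    top = tags["tag"].value_counts().head(n_tags).index
    tags = tags[tags["tag"].isin(top)]
    # Drop songs left without any tag, then users left without any song.
    plays = plays[plays["song"].isin(tags["song"].unique())]
    tags = tags[tags["song"].isin(plays["song"].unique())]
    # Randomly sample users to keep training tractable.
    users = plays["user"].drop_duplicates()
    if len(users) > n_users:
        users = users.sample(n_users, random_state=seed)
    plays = plays[plays["user"].isin(users)]
    return plays, tags
```

The order of operations matters: removing unpopular tags can orphan songs, which in turn can orphan users, so the filters cascade from tags to songs to users.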
Implementation
We ran all of the experiments using the Python programming language on a 40-core Intel Xeon server equipped with 256 GB of RAM. We used the package lenskit (Ekstrand 2020) to instantiate, train and evaluate the recommender system. The code is available on GitHubFootnote 5 along with instructions to install and run it. Following the test procedure described in Sect. Collaborative filtering, the dataset was split into 5 user folds, using the leave-one-out strategy to create the train/test datasets. The hyper-parameters were then chosen to empirically maximize the average Normalized Discounted Cumulative Gain (NDCG), which resulted in 512 latent factors, \(\mu = 40\) and \(\lambda = 5 \times 10^{-3}\). Unless stated otherwise, those are the values used in the rest of the paper. When studying particular users, the first fold was used.
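For reference, the metric driving the hyper-parameter search can be sketched as follows. This is a minimal binary-relevance NDCG, purely illustrative; lenskit ships its own implementation, which is what the experiments use:

```python
import math

def ndcg(recommended, relevant, k=None):
    """Binary-relevance NDCG of a ranked recommendation list."""
    if k is not None:
        recommended = recommended[:k]
    # Discounted cumulative gain: each hit discounted by log2 of its rank.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended) if item in relevant)
    # Ideal DCG: all relevant items placed at the top of the list.
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), len(recommended))))
    return dcg / ideal if ideal > 0 else 0.0
```

With leave-one-out evaluation, `relevant` contains the single held-out item of each test user, so the per-user score only depends on the rank at which that item is recovered.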
Tripartite graph of the recommendations
To evaluate the diversity of the recommendations, we define a second tripartite graph \({\mathbb {T}}_r\). In this graph, the links between songs and categories are the same as in the dataset, but the links between the users and the songs now represent the recommendations (instead of the musical record of the user). In addition, to account for the impact of the position rank(i, u) of an item i in the list \(L_u\) of recommendations generated for user u, the weight of the corresponding u–i link in the tripartite graph is set to \(|L_u| - rank(i, u)\). The diversity of the recommendations \(L_u\) exposed to a user u is therefore the \(\alpha\)-diversity of u in \({\mathbb {T}}_r\). Finally, to differentiate between the two tripartite graphs, we will refer to a user’s diversity before being exposed to the recommendations as his/her organic diversity.
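As a concrete sketch (the dictionary-based data layout is hypothetical), the weighted user–item links of \({\mathbb {T}}_r\) can be built as follows:

```python
def recommendation_links(rec_lists):
    """Weighted user-item links of the recommendation tripartite graph.

    rec_lists maps each user to his/her ranked list of recommended items
    (best first, ranks starting at 1). The link weight is |L_u| - rank(i, u),
    so the top recommendation weighs |L_u| - 1 and the last one weighs 0.
    """
    links = {}
    for user, items in rec_lists.items():
        size = len(items)
        for rank, item in enumerate(items, start=1):
            links[(user, item)] = size - rank
    return links
```

This linear decay simply encodes that items displayed higher in the list receive more of the user's attention.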
Analysis of the recommendations
Diversity of the recommendations
First, we investigate the properties of the recommendations in terms of diversity. Figure 3 presents the recommendation diversity (for \(k=50\) recommendations and \(\alpha =0, 2\) and \(\infty\)) with respect to the organic diversity of each user. The users’ volume (the sum of the play counts of all items the user has listened to) is represented by a color in a log scaleFootnote 6.
This figure provides several pieces of information. First, we can observe that there is no strong relation between organic diversity and recommendation diversity since low organic diversity does not necessarily lead to low recommendation diversity. Some users with a low organic diversity are exposed to diversified recommendations,
while others have a narrower exposure. In addition, and whatever the order of the diversity, we can observe that, as soon as a minimum number of items are recommended, the recommendations tend to be more diverse than the organic ones. This is particularly true for the variety captured by \(\alpha =0\), but it can also be observed (although to a lesser extent) for \(\alpha =2\) and \(\alpha =\infty\).
However, one can clearly see that this relation depends on the value of the organic diversity. As this value increases (for all values of \(\alpha\)), it becomes more difficult for the recommendations to reach the same level of diversity.
Impact of the parameters of the model
As regards the impact of the parameters of the model, it has been suggested that there is a trade-off between the usual notion of performance (engagement or accuracy) and the notion of diversity (Holtz et al. 2020). To investigate this relation, in Fig. 4 we show how the number of latent factors impacts the diversityFootnote 7 of the recommendations exposed to the users, along with the model performance (measured by the NDCG, in black dashed lines).
As expected, it can be observed that when the number of latent factors increases, so does the performance of the model. With more latent factors, the model is more efficient at recommending items related to the user’s past musical record. However, this efficiency has a clear effect on diversity, which decreases when the number of latent factors increases, except for a particularly low number of recommendations (\(k=10\)) and for \(\alpha =0\). In this situation, there does seem to be a trade-off between performance and diversity. This supports the observations made in Holtz et al. (2020) and clarifies the effect: the suggested trade-off can be observed for the variety but vanishes as soon as the balance of exposure is taken into account in the diversity measure.
Effect of the recommendations on users’ diversity
Figure 3 presented in Sect. Analysis of the recommendations reveals that a list of recommendations can be diversified, even for users whose past musical records are narrow in scope. However, this plot does not provide any information on the real impact of the recommendations on users’ musical habits. How diverse are the user’s listening habits after being exposed to the recommendations? To investigate this question, for each user we computed the diversity increase, which is defined as
$$\begin{aligned} \Delta _\alpha = D_\alpha ({\mathbb {T}}_{t+r}) - D_\alpha ({\mathbb {T}}_t) \end{aligned}$$
(7)
where \({\mathbb {T}}_t\) is the tripartite graph associated with the training dataset and \({\mathbb {T}}_{t + r}\) the tripartite graph in which we added the recommendations. Thus a positive value of \(\Delta _\alpha\) indicates that the recommendations have improved the diversity of a user’s musical habits, while a negative value indicates the opposite effect.
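A minimal sketch of this computation, under the assumption that the \(\alpha\)-diversity of a single user reduces to the true diversity (Hill number) of order \(\alpha\) of his/her category weight distribution (the full graph-based definition is given in the earlier section):

```python
import math

def hill_diversity(weights, alpha):
    """True diversity (Hill number) of order alpha of a weight vector."""
    total = sum(weights)
    p = [w / total for w in weights if w > 0]
    if alpha == 1:
        # Limit case: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    if math.isinf(alpha):
        # Balance of exposure: inverse of the largest proportion.
        return 1.0 / max(p)
    return sum(x ** alpha for x in p) ** (1.0 / (1.0 - alpha))

def diversity_increase(before, after, alpha):
    """Delta_alpha: diversity with recommendations minus organic diversity."""
    return hill_diversity(after, alpha) - hill_diversity(before, alpha)
```

For \(\alpha = 0\) this counts the categories (variety); for \(\alpha = \infty\) it only reflects the dominance of the top category; intermediate orders interpolate between the two.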
It is worth noting here that, in order for this diversity measure to make sense, one has to carefully adapt the weights in \({\mathbb {T}}_{t + r}\) and in particular to derive a relevant notion of play count for the recommendations. We chose to use a linear relation between the weight and the rank. Assuming that a user would listen to as many recommendations as the number \(u_v\) of items he/she has listened to and that the last item recommended is not listened to (\(w(u, i_{n_r}) = 0\)), we used the following weight function:
$$\begin{aligned} w(u, i) = \frac{2 u_v}{k (k - 1)} (k - \text {rank}(i, u)) \end{aligned}$$
This choice clearly hides a simple user model and other choices could be investigated. In particular, one could use the models proposed in Poulain and Tarissan (2020) that account for the saturation effects in the diversity of users’ attention.
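Under this linear user model, the weight function can be written directly. The sketch below (not the authors' code) makes the two assumptions explicit: the last recommendation gets weight 0, and the \(k\) weights sum to the user's volume \(u_v\):

```python
def rec_weight(user_volume, k, rank):
    """Play-count proxy for the item recommended at `rank` (1-indexed).

    Linear in the rank: the last item (rank k) gets weight 0, and the k
    weights sum to user_volume, i.e. the user is assumed to listen to
    the recommendations as much as to his/her past musical record.
    """
    return 2.0 * user_volume / (k * (k - 1)) * (k - rank)
```

For instance, with \(u_v = 30\) and \(k = 10\), the weights decrease linearly from 6 (rank 1) down to 0 (rank 10), and their sum is 30.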
Figure 5 presents the distribution of the diversity increase (with \(\alpha =2\)) for different numbers of recommended items (\(k=10, 50, 100\)). Surprisingly, contrary to what Fig. 3 suggested, the musical listening habits of most users are diversified by the recommendations. Even with only ten items recommended, there is a positive increase for more than 89% of users. However, the log-scale of the y-axis could be misleading and one could object that the increase, although mostly positive, is relatively small. This does in fact prove to be the case and one can observe that the mean increase is always lower than 10, even for hundreds of recommendations.
To achieve a more detailed and comprehensive picture of the effect recommendations have on users’ diversity, in Fig. 6 we have plotted the diversity increase for different numbers of recommendations (\(k=10, 50, 100\)) and different orders of diversity (\(\alpha =0, 2, \infty\)) for all users. The effect of the number of recommendations on the diversity of users’ attention is clear: diversity increases with the number of items recommended (from left to right) for all diversity orders. However, one can also observe that the capacity of the recommendations to improve users’ diversity is closely related to the users’ profiles. Such an increase can mainly be seen in users that listen to a small number of songs (darker dots on the plots). For very active users or those with broad musical listening habits, the recommendations barely have any positive increase.
The plots also reveal that the recommendations do not have the same effect when diversity is considered in terms of variety or balance. In relation to variety (\(\alpha =0\), top row), we can see that the recommendations improve diversity even with very few recommendations (\(k=10\)). This is not the case for other diversity orders. This suggests that the main effect of the recommendations is to introduce new categories into users’ musical habits. As soon as balance is taken into account in the measure (middle and bottom row), the recommendations are less effective in increasing diversity.
These results indicate that collaborative filtering impacts the diversity of musical listening habits in a specific way. To study this relation in greater depth, we will now investigate exactly how the recommendations affect users’ diversity.
Examining the effect of representative recommendations
We will conclude this section by presenting the cases of some specific users. This will shed light on the relation between recommended items and users’ past musical records, and how, as a result, recommendations have different effects on users’ attention. In Fig. 7, we present four users whose past musical records are either narrow (left column) or diverse (right column) and for whom recommendations (\(k=50\)) generate either a decrease (bottom row) or an increase (top row) in diversity (\(\alpha =2\)). For each user case, the top part of the plot displays the proportion of the categories of the past musical record (left blue line) and of the recommendations (right orange line) independently, while the bottom part shows the distribution of the categories as a result of the recommendations, that is after the user is exposed to the recommendations. It is worth noting that, to ease the comparison between the different cases, we only display the most important categoriesFootnote 8. Finally, users presented in this picture are also visible in Fig. 6 with their respective red marker.
This figure reveals some interesting aspects in terms of how recommendations impact users. First, in all four cases, a significant fraction of the musical categories associated with the recommendations are completely new to the users, thus increasing variety. This is obvious in Fig. 7a–c, even if only the main categories are displayed. In Fig. 7d, the new categories belong to less dominant categories and are thus not visible in the distribution. To support this observation, we computed the number of new categories across the whole dataset whenever a positive increase is observed: on average, a positive diversity increase (for \(\alpha = 2\) and \(k = 50\)) introduces 322 new categories. This more than quadruples the number of categories in the past musical record of an average user.
Second, we can observe that users with similar profiles in terms of organic diversity might be affected differently by the recommendations. The real effect of the recommendations depends strongly on how they fit in with the users’ musical habits. For user fdeb (bottom left), for instance, the most recommended category, hiphop, corresponds to the main category of his/her past musical record. This completely unbalances the musical landscape of this user, whose musical record was already largely restricted to hiphop music.
By contrast, the musical exposure of user dbc1 (top left) is diversified as a result of the recommendations, although he/she also has one particular musical focus: progressive metal and progressive death metal. One reason for this is that those categories barely appear in the list of recommendations (only metal, related to the former, is highly represented). Instead, the recommendations expose other new or less dominant categories in the user’s past musical record (such as alternative or gothic metal). This provides a far more balanced range of musical categories, leading to a higher diversity.
One might wonder whether this effect could be due to the fact that those users have a low organic diversity to begin with. The study of users 4c0b and ac10 (right column) shows this is not the case. Both are highly active users (respectively in the top \(7\%\) and top \(25\%\) of the most active users in the dataset) with a very high organic diversity (top \(2\%\)). Yet, the recommendations have completely different impacts on them: while the four main categories of user 4c0b (triphop, indie, rock and electronic) are also the four main categories of the recommendations, thus unbalancing the range of musical categories to which he/she is exposed, the musical record of user ac10, on the contrary, is broadened by the recommendations due to a more nuanced musical exposure. In particular, one can notice that indie (his/her most listened-to category) is one of the least recommended.