Data
Mercari is an online C2C marketplace service in which users trade various items among themselves. The service operates in Japan and the United States. In the present study, we used data obtained from the Japanese market between July 2013 and January 2019. In addition to normal transactions, we focused on the following types of problematic transactions: fictive, underwear, medicine, and weapon. Fictive transactions are defined as the selling of non-existent items. Underwear refers to transactions of used underwear, which the service prohibits on grounds of morality and hygiene. Medicine refers to transactions of medicinal supplies, which are prohibited by law. Weapon refers to transactions of weapons, which the service prohibits because they may lead to crime. The number of sampled users of each type is shown in Table 1.
Network analysis
We examine a directed and weighted network of users in which a user corresponds to a node and a transaction between two users represents a directed edge. The weight of an edge is equal to the number of transactions between the seller and the buyer. We constructed the egocentric networks of several hundred normal users and of fraudulent users, i.e., those engaged in at least one problematic sale. Figure 1 shows the egocentric networks of two normal users (Fig. 1a, b) and those of two fraudulent users involved in selling a fictive item (Fig. 1c, d). The egocentric network of either a normal or a fraudulent user contained the nodes neighboring the focal user, the edges between the focal user and these neighbors, and the edges between pairs of these neighbors.
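The construction described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the input format (a list of hypothetical `(seller, buyer)` pairs), the function names, and the convention that an edge points from seller to buyer (consistent with the out-degree definition given later) are assumptions made for the example.

```python
from collections import defaultdict


def build_network(transactions):
    """Return weights[(seller, buyer)] = number of transactions from seller to buyer."""
    weights = defaultdict(int)
    for seller, buyer in transactions:
        weights[(seller, buyer)] += 1
    return dict(weights)


def egocentric_network(weights, focal):
    """Keep the focal user, its neighbors, and every edge among these nodes."""
    neighbors = set()
    for seller, buyer in weights:
        if seller == focal:
            neighbors.add(buyer)
        elif buyer == focal:
            neighbors.add(seller)
    nodes = neighbors | {focal}
    return {edge: w for edge, w in weights.items()
            if edge[0] in nodes and edge[1] in nodes}


# Hypothetical toy data: each pair is one transaction (seller, buyer).
transactions = [("a", "b"), ("a", "b"), ("c", "a"),
                ("b", "c"), ("c", "d"), ("d", "e")]
ego = egocentric_network(build_network(transactions), "a")
```

In this toy example, users `d` and `e` are not neighbors of `a`, so the edges touching them are dropped, while the edge between the two neighbors `b` and `c` is retained.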
We calculated eight indices for each focal node. They are local indices in the sense that they require information only up to the connectivity among the neighbors of the focal node.
Five out of the eight indices use only the information about the connectivity of the focal node. The degree \(k_i\) of node \(v_i\) is the number of its neighbors. The node strength (Barrat et al. 2004) (i.e., weighted degree) of node \(v_i\), denoted by \(s_i\), is the number of transactions in which \(v_i\) is involved. Using these two indices, we also considered the mean number of transactions per neighbor, i.e., \(s_i/k_i\), as a separate index. These three indices do not use information about the direction of edges.
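As an illustration of these three undirected indices, the following sketch computes \(k_i\), \(s_i\), and \(s_i/k_i\) from a dictionary of edge weights. The data structure (keys are hypothetical `(seller, buyer)` pairs, values are transaction counts) and the choice to return NaN when the node has no neighbors are assumptions of the example.

```python
def degree_and_strength(weights, focal):
    """Return (k, s, s / k) for the focal node, ignoring edge direction."""
    neighbors = set()
    strength = 0
    for (seller, buyer), w in weights.items():
        if seller == focal:
            neighbors.add(buyer)
            strength += w
        elif buyer == focal:
            neighbors.add(seller)
            strength += w
    k = len(neighbors)
    # Assumption: the ratio is undefined (NaN) for an isolated node.
    mean_per_neighbor = strength / k if k > 0 else float("nan")
    return k, strength, mean_per_neighbor


# Hypothetical toy network: a sold twice to b, bought once from b and once from c.
weights = {("a", "b"): 2, ("b", "a"): 1, ("c", "a"): 1, ("b", "c"): 4}
k, s, ratio = degree_and_strength(weights, "a")
```

Here node `a` has two neighbors and is involved in four transactions, so the mean number of transactions per neighbor is two. The edge between `b` and `c` does not contribute because it does not involve `a`.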
The sell probability of node \(v_i\), denoted by \({{\mathrm {SP}}}_i\), uses the information about the direction of edges and is defined as the proportion of \(v_i\)’s neighbors for which \(v_i\) acts as seller. Precisely, the sell probability is given by
$$\begin{aligned} {{\mathrm {SP}}}_i = \frac{k_i^{{\mathrm {out}}}}{k_i^{{\mathrm {in}}}+k_i^{{\mathrm {out}}}}, \end{aligned}$$
(1)
where \(k_i^{{\mathrm {in}}}\) is \(v_i\)’s in-degree (i.e., the number of neighbors from whom \(v_i\) bought at least one item) and \(k_i^{{\mathrm {out}}}\) is \(v_i\)’s out-degree (i.e., the number of neighbors to whom \(v_i\) sold at least one item). It should be noted that, if \(v_i\) acted as both seller and buyer towards \(v_j\), the contribution of \(v_j\) to both in- and out-degree of \(v_i\) is equal to one. Therefore, \(k_i^{{\mathrm {in}}} + k_i^{{\mathrm {out}}}\) is not equal to \(k_i\) in general.
The weighted version of the sell probability, denoted by \({\mathrm {WSP}}_i\), is defined as
$$\begin{aligned} {\mathrm {WSP}}_i = \frac{s_i^{{\mathrm {out}}}}{s_i^{{\mathrm {in}}}+s_i^{{\mathrm {out}}}}, \end{aligned}$$
(2)
where \(s_i^{{\mathrm {in}}}\) is node \(v_i\)’s weighted in-degree (i.e., the number of buys) and \(s_i^{{\mathrm {out}}}\) is \(v_i\)’s weighted out-degree (i.e., the number of sells).
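Equations (1) and (2) can be illustrated with a short sketch. The edge-weight dictionary keyed by hypothetical `(seller, buyer)` pairs and the NaN return value for a node without any transaction are assumptions of the example.

```python
def sell_probabilities(weights, focal):
    """Return (SP, WSP) for the focal node as in Eqs. (1) and (2)."""
    buyers, sellers = set(), set()
    s_out = s_in = 0
    for (seller, buyer), w in weights.items():
        if seller == focal:      # focal node sold to `buyer`
            buyers.add(buyer)
            s_out += w
        elif buyer == focal:     # focal node bought from `seller`
            sellers.add(seller)
            s_in += w
    k_out, k_in = len(buyers), len(sellers)
    nan = float("nan")           # assumption: undefined without any transaction
    sp = k_out / (k_in + k_out) if k_in + k_out > 0 else nan
    wsp = s_out / (s_in + s_out) if s_in + s_out > 0 else nan
    return sp, wsp


# Hypothetical toy network: a sold twice to b, bought once from b and once from c.
weights = {("a", "b"): 2, ("b", "a"): 1, ("c", "a"): 1}
sp, wsp = sell_probabilities(weights, "a")
```

In this example, neighbor `b` counts toward both the in-degree and the out-degree of `a`, so \(k_i^{{\mathrm {in}}} + k_i^{{\mathrm {out}}} = 3\) although \(k_i = 2\). The sell probability is 1/3, whereas the weighted version is 1/2.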
The other three indices are based on triangles that involve the focal node. The local clustering coefficient \(C_i\) quantifies the abundance of undirected and unweighted triangles around \(v_i\) (Newman 2010). It is defined as the number of undirected and unweighted triangles including \(v_i\) divided by \(k_i(k_i-1)/2\). The local clustering coefficient \(C_i\) ranges between 0 and 1.
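The definition of \(C_i\) can be sketched as follows. The input format (a collection of hypothetical directed `(seller, buyer)` pairs that is first symmetrized) and the convention \(C_i = 0\) for nodes with fewer than two neighbors are assumptions of the example.

```python
from itertools import combinations


def undirected_adjacency(edges):
    """Symmetrize directed (seller, buyer) pairs into neighbor sets."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(edges, focal):
    """Number of triangles through `focal` divided by k (k - 1) / 2."""
    adj = undirected_adjacency(edges)
    neighbors = adj.get(focal, set())
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: no triangle is possible, so C_i is set to 0
    triangles = sum(1 for u, v in combinations(sorted(neighbors), 2)
                    if v in adj[u])
    return triangles / (k * (k - 1) / 2)


# Hypothetical toy network: a has neighbors b, c, d; only b and c are connected.
edges = [("a", "b"), ("a", "c"), ("d", "a"), ("b", "c"), ("c", "b")]
c_a = local_clustering(edges, "a")
```

Node `a` has three neighbors, hence three possible pairs of neighbors, of which one is connected. The bidirectional edge between `b` and `c` counts once because direction and weight are ignored.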
We hypothesized that triangles contributing to an increase in the local clustering coefficient are localized around particular neighbors of node \(v_i\). Such neighbors together with \(v_i\) may form an overlapping set of triangles, which may be regarded as a community (Radicchi et al. 2004; Palla et al. 2005). Therefore, our hypothesis implies that the extent to which the focal node is involved in communities should be different between normal and fraudulent users. To quantify this concept, we introduce the so-called triangle congregation, denoted by \(m_i\). It is defined as the extent to which two triangles involving \(v_i\) share another node and is given by
$$\begin{aligned} m_i = \frac{(\text {Number of pairs of triangles involving }v_i \;\text {that share another node})}{{\mathrm {Tr}}_i({\mathrm {Tr}}_i-1)/2}, \end{aligned}$$
(3)
where \({\mathrm {Tr}}_i = C_ik_i(k_i-1)/2\) is the number of triangles involving \(v_i\). Note that \(m_i\) ranges between 0 and 1.
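Equation (3) can be sketched as follows. Each triangle involving \(v_i\) is represented by the pair of neighbors that completes it, so two triangles share another node exactly when these pairs intersect. The undirected adjacency dictionary and the value 0 returned when fewer than two triangles exist are assumptions of the example.

```python
from itertools import combinations


def triangle_congregation(adj, focal):
    """Fraction of pairs of triangles through `focal` that share a second node."""
    neighbors = adj.get(focal, set())
    # A triangle through `focal` is identified by its two other nodes.
    triangles = [frozenset((u, v))
                 for u, v in combinations(sorted(neighbors), 2)
                 if v in adj[u]]
    tr = len(triangles)
    if tr < 2:
        return 0.0  # assumption: undefined case mapped to 0
    sharing = sum(1 for t1, t2 in combinations(triangles, 2) if t1 & t2)
    return sharing / (tr * (tr - 1) / 2)


# Hypothetical undirected toy network around focal node a.
adj = {
    "a": {"b", "c", "d", "e", "f"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
    "e": {"a", "f"},
    "f": {"a", "e"},
}
m_a = triangle_congregation(adj, "a")
```

Node `a` belongs to three triangles, completed by the neighbor pairs (b, c), (c, d), and (e, f). Only the first two share a node other than `a`, so one of the three pairs of triangles is counted.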
Frequencies of different directed three-node subnetworks, conventionally known as network motifs (Milo et al. 2002), may distinguish between normal and fraudulent users. In particular, among triangles composed of directed edges, we hypothesized that feedforward triangles (Fig. 2a) are natural and that cyclic triangles (Fig. 2b) are not. The rationale is as follows. In a feedforward triangle, the node with out-degree two tends to act as a seller and the node with out-degree zero tends to act as a buyer, and there are many such nodes that use the marketplace mostly as either buyer or seller but not both. In contrast, an abundance of cyclic triangles may imply that relatively many users use the marketplace as both buyer and seller. We used an index called the cycle probability, denoted by \({\mathrm {CYP}}_i\), which is defined by
$$\begin{aligned} {\mathrm {CYP}}_i = \frac{{\mathrm {CY}}_i}{{\mathrm {FF}}_i + {\mathrm {CY}}_i}, \end{aligned}$$
(4)
where \({\mathrm {FF}}_i\) and \({\mathrm {CY}}_i\) are the numbers of feedforward triangles and cyclic triangles, respectively, to which node \(v_i\) belongs. The definitions of \({\mathrm {FF}}_i\) and \({\mathrm {CY}}_i\), and hence \({\mathrm {CYP}}_i\), remain valid when the triangles involving \(v_i\) have bidirectional edges. In the case of Fig. 2c, for example, each of the three nodes belongs to one feedforward triangle and one cyclic triangle. The other four cases in which bidirectional edges are involved in triangles are shown in Fig. 2d–g. In the calculation of \({\mathrm {CYP}}_i\), we ignored the weights of edges.
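One way to count feedforward and cyclic triangles in the presence of bidirectional edges is sketched below. It assumes that a triangle with bidirectional edges is decomposed into all combinations obtained by selecting one direction per node pair. This rule reproduces the Fig. 2c case described above (one feedforward and one cyclic triangle), but it is an assumption of the example rather than a statement of the exact counting procedure. The input, a set of hypothetical `(seller, buyer)` pairs, and the NaN return value for nodes without triangles are likewise assumptions.

```python
from itertools import combinations, product


def cycle_probability(edges, focal):
    """Return (FF, CY, CYP) for `focal`; `edges` is a set of directed pairs."""
    neighbors = ({v for u, v in edges if u == focal}
                 | {u for u, v in edges if v == focal})
    neighbors.discard(focal)
    ff = cy = 0
    for a, b in combinations(sorted(neighbors), 2):
        # For each of the three node pairs, list the directions that exist.
        options = [[d for d in ((x, y), (y, x)) if d in edges]
                   for x, y in ((focal, a), (a, b), (b, focal))]
        if any(not directions for directions in options):
            continue  # the three nodes do not form a triangle
        for choice in product(*options):
            # A directed triangle is cyclic iff every node is the source
            # of exactly one of its three edges.
            if len({u for u, _ in choice}) == 3:
                cy += 1
            else:
                ff += 1
    total = ff + cy
    cyp = cy / total if total > 0 else float("nan")
    return ff, cy, cyp


# Analogue of Fig. 2c: a -> b, b -> c, and a bidirectional edge between c and a.
edges = {("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")}
ff, cy, cyp = cycle_probability(edges, "a")
```

Selecting the direction c to a yields a cyclic triangle, and selecting a to c yields a feedforward triangle, so the cycle probability of each node in this example is 1/2.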
Random forest classifier
To classify users into normal and fraudulent users based on their local network properties, we employed a random forest classifier (Breiman 2001; Breiman et al. 1984; Hastie et al. 2009) implemented in scikit-learn (Pedregosa et al. 2011). A random forest is an ensemble learning method that combines multiple classifiers, each of which is a decision tree built from training data, and classifies test data while mitigating overfitting. We combined 300 decision-tree classifiers to construct a random forest classifier. Each decision tree is constructed from training samples that are randomly subsampled with replacement from the set of all training samples. To compute the best split of each node in a tree, the candidate features are randomly sampled from the set of all features. The probability that a test sample is positive in a tree is estimated as follows. Consider the terminal node in the tree that the test sample eventually reaches. The fraction of positive training samples at that terminal node gives the probability that the test sample is positive. One minus the positive probability gives the negative probability estimated for the same test sample. The positive or negative probability for the random forest classifier is the average of the single-tree positive or negative probabilities over all 300 trees. The random forest classifier classifies a sample as positive if its positive probability is larger than 0.5 and as negative otherwise.
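The averaging step can be illustrated with a small sketch. It assumes that, for one test sample, the terminal node reached in each tree is summarized by a hypothetical pair (number of positive training samples, total number of training samples). Four trees are used instead of 300 to keep the example short.

```python
def forest_positive_probability(leaf_counts):
    """Average over trees of the positive fraction at the reached leaf."""
    per_tree = [n_positive / n_total for n_positive, n_total in leaf_counts]
    return sum(per_tree) / len(per_tree)


def classify(leaf_counts, threshold=0.5):
    """Return True (positive) if the averaged probability exceeds the threshold."""
    return forest_positive_probability(leaf_counts) > threshold


# Hypothetical leaves reached by one test sample in four trees.
leaf_counts = [(3, 4), (1, 2), (0, 5), (4, 4)]
p_positive = forest_positive_probability(leaf_counts)
p_negative = 1.0 - p_positive
is_positive = classify(leaf_counts)
```

The single-tree probabilities are 0.75, 0.5, 0, and 1, so the averaged positive probability is 0.5625 and the sample is classified as positive.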
We split samples of each type into two sets such that 75% and 25% of the samples of each type are assigned to the training and test samples, respectively. There were more normal users than any type of fraudulent user. Therefore, to balance the number of the negative (i.e., normal) and positive (i.e., fraudulent) samples, we uniformly randomly subsampled the negative samples (i.e., under-sampling) such that the number of the samples is the same between the normal and fraudulent types in the training set. Based on the training sample constructed in this manner, we built each of the 300 decision trees and hence a random forest classifier. Then, we examined the classification performance of the random forest classifier on the set of test samples.
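The splitting and under-sampling scheme can be sketched as follows for the normal type and a single fraudulent type. The function name, the use of Python's `random` module, the rounding of the training-set size, and leaving the test set unbalanced are assumptions of the example.

```python
import random


def split_and_undersample(normal, fraudulent, train_fraction=0.75, seed=0):
    """Split each type 75/25, then balance the training set by under-sampling."""
    rng = random.Random(seed)

    def split(samples):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_train = round(train_fraction * len(shuffled))
        return shuffled[:n_train], shuffled[n_train:]

    normal_train, normal_test = split(normal)
    fraud_train, fraud_test = split(fraudulent)
    # Under-sample the negative (normal) training samples uniformly at random
    # so that both types are equally represented in the training set.
    if len(normal_train) > len(fraud_train):
        normal_train = rng.sample(normal_train, len(fraud_train))
    return (normal_train, fraud_train), (normal_test, fraud_test)


# Hypothetical user identifiers: 100 normal users and 20 fraudulent users.
normal_users = list(range(100))
fraud_users = list(range(1000, 1020))
(train_neg, train_pos), (test_neg, test_pos) = split_and_undersample(
    normal_users, fraud_users)
```

With these toy sizes, the training set contains 15 users of each type, and the test set contains 25 normal and 5 fraudulent users.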
The true positive rate, also called the recall, is defined as the proportion of the positive samples (i.e., fraudulent users) that the random forest classifier correctly classifies as positive. The false positive rate is defined as the proportion of the negative samples (i.e., normal users) that are incorrectly classified as positive. The precision is defined as the proportion of the truly positive samples among those that are classified as positive. The true positive rate, false positive rate, and precision range between 0 and 1.
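These three quantities can be computed from the true and predicted labels as in the following sketch. Labels are assumed to be coded as 1 (fraudulent) and 0 (normal), and an undefined ratio is returned as NaN. Both conventions are assumptions of the example.

```python
def confusion_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate, precision)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    nan = float("nan")
    tpr = tp / (tp + fn) if tp + fn > 0 else nan
    fpr = fp / (fp + tn) if fp + tn > 0 else nan
    precision = tp / (tp + fp) if tp + fp > 0 else nan
    return tpr, fpr, precision


# Hypothetical labels: three fraudulent and five normal users.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tpr, fpr, precision = confusion_rates(y_true, y_pred)
```

In this example, two of the three fraudulent users are detected and one of the five normal users is flagged, which gives a true positive rate of 2/3, a false positive rate of 1/5, and a precision of 2/3.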
We used the following two performance measures for the random forest classifier. To draw the receiver operating characteristic (ROC) curve for a random forest classifier, one first arranges the test samples in descending order of the estimated probability that they are positive. Then, one plots each test sample, with the false positive rate on the horizontal axis and the true positive rate on the vertical axis, where both rates are computed by classifying that sample and all samples ranked above it as positive. By connecting the plotted points in a piecewise linear manner, one obtains the ROC curve. The precision–recall (PR) curve is generated by plotting the samples in the same order in \([0, 1]^2\), with the recall on the horizontal axis and the precision on the vertical axis. For an accurate binary classifier, the ROC curve passes near \((x, y) = (0, 1)\) and the PR curve passes near \((x, y) = (1, 1)\). We therefore quantify the performance of the classifier by the area under the curve (AUC) of each curve. The AUC ranges between 0 and 1, and a large value indicates a good performance of the random forest classifier.
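The construction of both curves and their areas can be sketched as follows. The example assumes distinct scores (no ties), prepends the point (0, 0) to the ROC curve, and starts the PR curve at recall 0 with the precision of the top-ranked sample. These conventions, like the function name, are assumptions of the example and may differ from those of a particular library.

```python
def roc_and_pr_auc(y_true, scores):
    """Return (ROC AUC, PR AUC) using piecewise linear interpolation."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    roc = [(0.0, 0.0)]
    pr = []
    for i in order:
        # Classify sample i and all samples ranked above it as positive.
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        roc.append((fp / n_neg, tp / n_pos))   # (FPR, TPR)
        pr.append((tp / n_pos, tp / (tp + fp)))  # (recall, precision)
    pr.insert(0, (0.0, pr[0][1]))  # assumed starting point of the PR curve

    def trapezoid(points):
        return sum((x1 - x0) * (y0 + y1) / 2
                   for (x0, y0), (x1, y1) in zip(points, points[1:]))

    return trapezoid(roc), trapezoid(pr)


# Hypothetical test set: labels and estimated positive probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc, pr_auc = roc_and_pr_auc(y_true, scores)
```

For this toy ranking, eight of the nine (positive, negative) pairs are ordered correctly, so the ROC AUC equals 8/9. The PR AUC equals 65/72 under the stated conventions.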
To calculate the importance of each feature in the random forest classifier, we used the permutation importance (Strobl et al. 2007; Altmann et al. 2010). With this method, the importance of a feature is given by the decrease in the performance of the trained classifier when the feature is randomly permuted among the test samples. A large value indicates that the feature considerably contributes to the performance of the classifier. To calculate the permutation importance, we used the AUC value of the ROC curve as the performance measure of a random forest classifier. We computed the permutation importance of each feature with ten different permutations and adopted the average over the ten permutations as the importance of the feature.
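The procedure can be sketched in a model-agnostic way as follows. The function names, the representation of samples as lists of feature values, the hypothetical `predict_proba` callable returning a positive probability for one sample, and the pair-counting formula for the ROC AUC are assumptions of the example.

```python
import random


def auc_roc(y_true, scores):
    """ROC AUC as the fraction of correctly ordered (positive, negative) pairs."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict_proba, X, y, feature, n_repeats=10, seed=0):
    """Mean decrease in ROC AUC when one feature is permuted among test samples."""
    rng = random.Random(seed)
    baseline = auc_roc(y, [predict_proba(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        X_perm = [row[:feature] + [value] + row[feature + 1:]
                  for row, value in zip(X, column)]
        permuted = auc_roc(y, [predict_proba(row) for row in X_perm])
        drops.append(baseline - permuted)
    return sum(drops) / n_repeats


# Hypothetical trained model that relies on feature 0 only.
def toy_model(row):
    return row[0]


X_test = [[0.9, 5.0], [0.8, 1.0], [0.3, 4.0], [0.1, 2.0]]
y_test = [1, 1, 0, 0]
importance_0 = permutation_importance(toy_model, X_test, y_test, feature=0)
importance_1 = permutation_importance(toy_model, X_test, y_test, feature=1)
```

Because the toy model ignores feature 1, permuting that feature leaves the AUC unchanged and its importance is zero, whereas permuting feature 0 can only lower the AUC from its baseline value of 1.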
We optimized the parameters of the random forest classifier by a grid search with 10-fold cross-validation on the training set. For the maximum depth of each tree (i.e., the max_depth parameter in scikit-learn), we explored the integers between 3 and 10. For the number of candidate features for each split (i.e., max_features), we explored the integers between 3 and 6. For the minimum number of samples required at terminal nodes (i.e., min_samples_leaf), we explored 1, 3, and 5. As mentioned above, the number of trees (i.e., n_estimators) was set to 300. The seed number for the random number generator (i.e., random_state) was set to 0. For the other hyperparameters, we used the default values in scikit-learn version 0.22. In the parameter optimization, we evaluated the performance of the random forest classifier with the AUC value of the ROC curve measured on a single set of training and test samples.
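The search space described above can be written with scikit-learn's `GridSearchCV` as in the following sketch. The helper name `make_grid_search` is hypothetical, the ranges are assumed to be inclusive, and `scoring="roc_auc"` is assumed to correspond to the AUC of the ROC curve used as the performance measure. The arrays `X_train` and `y_train` in the commented line are placeholders for the eight indices and the labels of the training samples.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def make_grid_search():
    """Build a grid search over the hyperparameter ranges stated in the text."""
    param_grid = {
        "max_depth": list(range(3, 11)),      # integers 3, ..., 10
        "max_features": list(range(3, 7)),    # integers 3, ..., 6
        "min_samples_leaf": [1, 3, 5],
    }
    forest = RandomForestClassifier(n_estimators=300, random_state=0)
    # 10-fold cross-validation on the training set, scored by ROC AUC.
    return GridSearchCV(forest, param_grid, scoring="roc_auc", cv=10)


search = make_grid_search()
n_combinations = 1
for values in search.param_grid.values():
    n_combinations *= len(values)
# search.fit(X_train, y_train)   # placeholders, not defined in this sketch
# best_forest = search.best_estimator_
```

The grid contains 8 × 4 × 3 = 96 parameter combinations, each of which is evaluated on ten folds, so fitting the full search trains 960 forests of 300 trees plus the final refit.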
To avoid sampling bias, we built 100 random forest classifiers. We trained each classifier and tested its performance on a separately drawn random set of training and test samples, generated with the sampling scheme described above.