Detecting Malicious Accounts in Permissionless Blockchains using Temporal Graph Properties

Modeling accounts as nodes and transactions as directed edges of a temporal directed graph enables us to understand the behavior (malicious or benign) of accounts in a blockchain. Predictive classification of accounts as malicious or benign could help users of permissionless blockchain platforms operate securely. Motivated by this, we introduce temporal features such as burst and attractiveness on top of several already used graph properties such as node degree and clustering coefficient. Using the identified features, we train various Machine Learning (ML) algorithms and identify the algorithm that performs best at detecting malicious accounts. We then study the behavior of the accounts over different temporal granularities of the dataset before assigning them malicious tags. For the Ethereum blockchain, we find that the ExtraTreesClassifier performs best among supervised ML algorithms on the entire dataset. On the other hand, using cosine similarity on top of the results provided by unsupervised ML algorithms such as K-Means on the entire dataset, we detect 554 more suspicious accounts. Further, using behavior change analysis for accounts, we identify 814 unique suspicious accounts across different temporal granularities.


Introduction
A blockchain is an ever-growing, large, directed temporal network, and more and more industries are adopting it for their businesses. In permissionless blockchains, interactions (also called transactions) happen between different types of accounts. In the Ethereum mainnet public blockchain, these accounts can be either Externally Owned Accounts (EOA) or Smart Contracts (SC). Here, transactions from an EOA (called external transactions) are recorded on the blockchain ledger, whereas transactions from an SC (called internal transactions) are not recorded on the ledger.
With actual money involved in most of the permissionless blockchains, an account must be able to perform secure transactions. Recently, many security threats to various blockchain platforms have been identified [1]. For some identified vulnerabilities, counter-measures have been implemented. We do not delve into surveying all the security threats. In [2], the authors survey security flaws that exist in Ethereum blockchains. In many of the security vulnerabilities identified in Ethereum blockchain, hackers target other accounts by either hacking SCs or implementing malicious SCs for cybercrimes such as ransomware, scams, phishing, and hacking of exchanges or wallets [3].
With the ever-increasing growth and adoption of blockchain technology by the industry and the crypto-currency market, permissionless blockchains are at the epicenter of increased security vulnerabilities and attacks. Our motivation for this work is that there is limited work on learning the behavior of accounts in permissionless blockchains that are malicious and may victimize other accounts in the future. In short, we aim to identify malicious accounts so that potential victims and blockchains can deploy counter-measures. In this paper, henceforth, we use the term blockchain to mean permissionless blockchain. The techniques proposed in related studies classify accounts as malicious using either machine learning (ML) algorithms or motif-based methods (motifs being the basic building subgraphs of a network). Nonetheless, the features used by the available techniques are: (a) limited and not learned from previous attacks on blockchains, and (b) extracted from an aggregated snapshot of time-dependent transaction graphs, which does not consider the temporal evolution of the graphs.
The temporal aspects attached to the features are essential in understanding the actual behavior of an account before we can classify it as malicious. For example, the inDegree and outDegree features are time-variant and should be considered as time series. Nonetheless, it has been shown that the aggregated node degree distribution for accounts follows a power-law in blockchains such as Ethereum [4]. Here, the questions that we ask are: does such behavior exist in all accounts? Is there a burst of degree for certain accounts at certain instances, and can the existence of such bursts be used to identify malicious activity? To answer these questions, we first identify the existence of bursts. Then, to study the effect of bursts, we introduce features such as temporal burst, degree burst, balance burst, and gasPrice burst.
The fat-tailed nature of the power-law degree distribution also gives rise to neighborhood-based fitness preferential attachment in blockchains [5]. In [5], the authors defined fitness as "the ability of the node to attract new connections" and showed that accounts with high fitness are sometimes short-lived and mostly engage in malicious activities, while long-lived accounts with high fitness represent large organizations. Here, the authors define the fitness factor considering the interactions of only one previous time instance. As it does not consider a temporal window, one drawback of the method is that it cannot correctly classify malicious transactions that appear at an interval of 2 time units or more. Inspired by this, we define a neighborhood-based feature called attractiveness that takes into account a temporal window of size θ_a, where 0 < θ_a < T_DS and T_DS is the duration for which we collect the dataset (DS). Our attractiveness measure takes into account the stability of directed transactions that happened between two accounts in the past. Intuitively, a malicious account will have high attractiveness as it will tend to transact with new accounts, while benign accounts will have high neighborhood stability and therefore low attractiveness.
As the behavior of an account can change from malicious to benign or from benign to malicious over time, there is a need for continuous monitoring and analysis of real-time transactions given the history of transactions performed by an account. We thus study the evolution of malicious behavior over different timescales by creating sub-datasets and then answer the question: would a given account show malicious behavior in the future? Towards this, we first apply different ML algorithms and identify the unsupervised ML algorithm that clusters accounts most accurately on the entire dataset. Then we apply the identified algorithm to different sub-datasets within a temporal scale to capture the behavior changes.
In summary, the following are our main contributions:
• Feature Engineering: We identify a feature vector for detecting malicious accounts based on previous attacks on blockchains and perform time series analysis. As new features, we propose temporal burst, degree burst, balance burst, gasPrice burst, and attractiveness.
• Comparative analysis: We perform a comparative study with techniques proposed in related studies and identify the best supervised and unsupervised ML algorithms, with their hyperparameters, for Ethereum transaction data.
• Results: Our results demonstrate that ExtraTreesClassifier performs best with respect to balanced accuracy under supervised settings for the entire dataset, while clustering techniques let us identify 554 more suspect accounts. Analysis of behavioral changes reveals 814 suspects across different temporal granularities.
The rest of the paper is organized as follows. In section 2, we present the background and the state-of-the-art techniques for identifying malicious accounts and compare them. In sections 3 and 4, we present a detailed description of our methodology and the feature vector, respectively. This is followed by an in-depth evaluation along with the results in section 5. We finally conclude in section 6, providing details on prospective future work. Further, in Table 1 we provide a list of acronyms used in the paper.

Background and Related Work
There are two types of blockchain technologies, permissionless and permissioned. The major difference between the two is that in a permissioned blockchain prior access approval is needed for performing any action on the blockchain, while in a permissionless blockchain anyone can perform actions without any approval. Further, there is no way to censor anyone on permissionless blockchains. These aspects allow more fraud and malicious activity to prevail in permissionless blockchains. Ethereum and Bitcoin use permissionless technology. Ethereum was developed by Vitalik Buterin in 2013 [6] and allows users to run programs in its trusted virtual environment known as the Ethereum Virtual Machine (EVM). These programs are called Smart Contracts (SC) and are stored on the ledger, along with the transactions performed, at a given fixed address. Ethereum uses "Ether" as its native crypto-currency for transfers and transaction fees. Smart Contracts can also send, store and receive Ethers. Once deployed, an SC is a hard-coded program that can only be fed input to produce output. Smart Contracts are also used by some applications for their processing. Such applications are called decentralized applications or dapps. Although Ethereum is known for its security and trust, a small bug in SC code can cause a huge loss [7] of crypto-currency. Unlike Bitcoin, Ethereum uses a list of accounts. For a valid transaction, the amount is transferred from the sender to the receiver. If the receiver is an SC, its code is executed and the state of the SC is updated. Internally, an SC can send a message or perform internal transactions with other accounts. Ethereum currently uses a refined form of the PoW (Proof of Work) consensus algorithm. PoW is computationally expensive and energy inefficient.
There is a vast number of studies on fraud detection [8]. Targeting Ethereum specifically, Chen et al. [2] survey attacks and defenses in Ethereum. We do not survey all the attacks and defense mechanisms in this work. However, we provide an in-depth understanding of the different methods used to detect accounts involved in malicious activity. Several works have tried to identify or categorize malicious accounts and activities in different types of blockchains. As blockchains have a graph structure, most of these techniques study graph properties (such as node degree) to identify features before applying supervised or unsupervised learning.
In [9], the authors used a bitcoin transaction network to detect malicious activity. They were able to detect three malicious attacks using unsupervised ML algorithms with a limited amount of available transaction data. In their follow-up work, they used a more comprehensive bitcoin transaction dataset (starting from the genesis block until April 7th, 2013) [10]. They represented the data as two types of graphs, namely a User Graph and a Transaction Graph. In the user graph, nodes represent accounts and edges represent transactions, whereas in the transaction graph, nodes represent transactions and edges represent the flow of bitcoins. They first studied the flow of bitcoins to prove the existence of anomalies and then performed clustering to identify different attacks. They were able to detect the existence of one attack using the Local Outlier Factor (LOF). Inspired by [10], in [11], Monamo et al. also used bitcoin transaction data and proposed an update to counter the scaling issues that are inherent in LOF. They validated their approach using trimmed K-Means, argued its usefulness in detecting anomalies, and detected 5 out of 30 fraudsters.
In another bitcoin-related study on malicious activity detection [12], the authors studied the detection of addresses involved in Ponzi schemes. They used supervised learning and validated their results after addressing the class imbalance that is inherent in any dataset related to malicious activity. They identified that the Gini coefficient of outgoing values and the ratio between incoming and total transactions are the most important features for detecting Ponzi scheme related accounts. In another Ponzi scheme related study [13], the authors use Ethereum data to extract features from the operation codes (opcodes) of a smart contract's bytecode. Their motivation was that opcodes reflect the logic implemented in an SC and therefore provide useful features for distinguishing Ponzi from non-Ponzi SCs. They also found that opcode features are more effective than account-based features for detecting Ponzi scheme accounts. In [14], the authors use partial Ethereum transaction data to classify malicious accounts. They also performed a sensitivity analysis to study the effect of different classifiers on the feature set. In [15], to counter class imbalance, the authors assumed that accounts connected to malicious accounts via incoming transactions are also malicious. They then studied various supervised ML algorithms to identify malicious accounts on this over-sampled Ethereum dataset. In a follow-up study of [15], in [16], the authors used only those benign accounts that have never transacted with malicious accounts. Due to this, their feature vector has only transaction-based properties and no graph-based properties.
N-motifs are frequently occurring subgraphs that serve as basic building blocks of a network. The authors in [17] defined an N-motif as a path of length 2N between two entities where transactions are also considered as vertices. Using the N-motifs present in the transaction graph, they studied transactions happening between entities (people or organizations with multiple accounts). They were able to correctly identify malicious accounts involved in gambling. In another study [18], the authors analysed the transfer of funds within a subnet and used temporal features such as how quickly funds are cashed.
We present all the above-mentioned techniques in detail in Table 2, together with the features that the techniques used, the ML algorithms studied, their hyperparameters, the accounts considered in the dataset, and the performance score. Note that all these techniques use features that are based on some graph properties, transacting amount, and active state to train the ML model. However, several other studies, such as [4,19], use inferences drawn from the analysis of the transaction graph to mark malicious accounts. In [4], the authors try to identify accounts involved in a DDoS attack and argue that accounts that create multiple rarely used contracts are malicious. A similar approach is followed in [20], where the authors used only verified SC code and introduced features like SC size, lifetime and average time between transactions (i.e., inter-event time). In [19], the authors deploy a honeypot and analyze RPC requests to identify malicious accounts. They then analyze transactions to mark as suspicious those accounts that accept crypto-currency from malicious accounts. They perform behavior analysis to identify fisher accounts and attacks such as crypto-currency stealing.
All the above techniques either use a limited set of ML algorithms on highly scaled-down data, which induces over-fitting, or apply inferences on the graph structure to identify malicious activities and accounts. In most cases, studies use features that do not capture temporal behavior and are approximated by the mean behavior, which induces a further bias and inflates the reported accuracy. Techniques that use large datasets with high class imbalance, on the other hand, have either high recall and low precision or low recall and high precision [14]. Nonetheless, using our features, we identify an ML algorithm that provides better precision as well as better recall.

Methodology
We use Ethereum mainnet blockchain transaction data and first validate our assumptions and approach. We segment the transaction data into sub-datasets (SD) to capture the behavioral changes. We create the SDs using different temporal granularities T_g ∈ T_G, where T_G = {Day, Week, Month, Quarter, HalfYearly, Year, All}. A granularity becomes coarser as we move from Day to Year. Here, an SD in a Day consists of the transactions of 6000 blocks. The choice of 6000 blocks is based on the fact that in Ethereum approximately 6000 blocks are created every day. At a coarser T_g, an SD in a Week consists of 7 Days of data. Similarly, an SD in a Month consists of 30 Days of data, an SD in a Quarter consists of 3 Months of data, an SD in a HalfYearly consists of 6 Months of data, and an SD in a Year consists of 12 Months of data.
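The segmentation above can be sketched as a mapping from a block number to the index of the sub-dataset that contains it. This is a minimal illustration, not our pipeline: the names `sub_dataset_index` and `DAYS_PER_GRANULARITY` are hypothetical, and we assume that block numbers start at 0 and that SDs are contiguous, non-overlapping block ranges. Only the 6000-blocks-per-Day constant and the Day multiples come from the text.

```python
# Blocks per Day sub-dataset, as stated in the text.
BLOCKS_PER_DAY = 6000

# Days per sub-dataset for each granularity, following the text:
# Week = 7 Days, Month = 30 Days, Quarter = 3 Months, and so on.
DAYS_PER_GRANULARITY = {
    "Day": 1,
    "Week": 7,
    "Month": 30,
    "Quarter": 3 * 30,
    "HalfYearly": 6 * 30,
    "Year": 12 * 30,
}


def sub_dataset_index(block_number: int, granularity: str) -> int:
    """Return the 0-based index of the sub-dataset containing a block.

    Assumes block numbers start at 0 and that sub-datasets are contiguous,
    non-overlapping block ranges (assumptions of this sketch).
    """
    if granularity == "All":
        return 0  # the entire dataset is a single sub-dataset
    blocks_per_sd = BLOCKS_PER_DAY * DAYS_PER_GRANULARITY[granularity]
    return block_number // blocks_per_sd
```

For example, under these assumptions block 6000 is the first block of the second Day SD, while it still belongs to the first Week SD.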
On all the features that are time series based (features described in section 4), we perform time series analysis of all the SDs at different T_g to quantify them using tsfresh, which "extracts characteristics from time series" [21,22]. The analysis reveals that features such as quantile and median best describe the time series for most of the features we have. We observe this behavior not only in the entire dataset but also in different SDs at different T_g.
We first apply the AutoML pipeline using TPOT [23] to identify the best ML classifier on the entire dataset and to validate state-of-the-art techniques. We configure TPOT with existing tested ML algorithms and their hyperparameters. Note that TPOT also performs imputation and feature scaling internally. Nonetheless, as our aim is to detect malicious accounts, we also apply clustering to identify accounts that show behavior similar to that of malicious accounts. For the entire dataset, we find that K-Means provides the best silhouette score for k = 9 when we consider both EOAs and SCs. For clusters identified as malicious, we use cosine similarity to quantify the similarity among the accounts within the cluster. We acknowledge that there are other methods to measure similarity, but for this work we use cosine similarity. With this method we are able to identify 293 more suspect accounts that behave similarly to malicious accounts. When considering only EOAs, we find the best silhouette score at k = 10 and identify 554 more suspects.
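The cosine-similarity step can be illustrated with a short sketch. It assumes the clustering has already been done and that one cluster is given as a mapping from account id to feature vector. The function `find_suspects`, the rule "similar to at least one tagged account", and the threshold of 0.99 are hypothetical choices made for illustration; the text does not state the rule or cut-off we used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero vectors in this sketch
    return dot / (norm_u * norm_v)


def find_suspects(cluster, known_malicious, threshold=0.99):
    """Return untagged accounts in one cluster that resemble tagged ones.

    cluster: dict mapping account id -> feature vector (one K-Means cluster)
    known_malicious: set of account ids already tagged as malicious
    threshold: hypothetical similarity cut-off, not taken from the paper
    """
    tagged = [cluster[a] for a in cluster if a in known_malicious]
    suspects = []
    for account, features in cluster.items():
        if account in known_malicious:
            continue
        # Flag the account if it is close to at least one tagged account.
        if any(cosine_similarity(features, m) >= threshold for m in tagged):
            suspects.append(account)
    return suspects
```

Because cosine similarity ignores vector magnitude, an account whose features are a scaled copy of a tagged account's features is flagged, which is one reason feature scaling matters before this step.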
Assuming that K-Means with hyperparameter k = 9 identified for entire dataset performs best for all temporal sub-datasets at different temporal granularities, we determine a probability for an account to be malicious at different temporal granularities. Across all temporal granularities we identify 814 unique accounts as suspects.

Feature Engineering
We do not describe the blockchain graph models as they are well understood. Instead, we directly present features that we extract from the blockchain temporal graph structure. The set of features (F) defined in the related work is limited and, in most cases, does not convey correct temporal behavior. We extend the feature set and introduce new features to detect malicious accounts. We follow a two-fold methodology to identify the relevant features. First, we study different attacks that have happened in the past to understand what features malicious accounts have used for malicious activity. Second, as most of the account features (for example, inDegree) are time series based, we perform time series analysis to identify features that best represent the salient properties of the relevant time series. Below we provide a list of all the features we use:
• Non Time Series based (set F_n | F_n ⊂ F)
- Active state (AS): malicious activities are usually short-lived [5] and remain, for example, until remediation is introduced. It is thus essential that we consider features such as when the account first transacted (transactedFirst), last transacted (transactedLast), how long it has been active (durationActive), and since when the account is continuously transacting (activeSinceLast).
• Time Series based (set F_t | F_t ⊂ F): We analyze each of the following time series based features using tsfresh [21,22] and select the 3 top features identified for each of the following attributes. Nonetheless, as inter-event time (IET) itself is a time series, we use it as a feature as well.
- inDegree (iD): it represents the number of transactions in which the account under consideration is a receiver at a particular instant. Most malicious activities involve a transfer of money to a malicious account. Thus, it is one of the most important features for understanding the behavior of a malicious account. In [15], the authors found uniqueInDegree (defined as the number of unique accounts from which the account under consideration has ever received money) to be one of the most critical features for identifying malicious accounts. In addition, we also use the aggregated inDegreeAgg as a feature.
- outDegree (oD): it represents the number of transactions in which the account under consideration has sent money at a particular instant. In some attacks, such as the Bitpoint Hack [24], after the attacker received the sum from the victims in an alias account, they transferred it to another account they hold or to an exchange. Such attacks increase the importance of outDegree as a potential feature. Similar to the case above, we also use the aggregated outDegreeAgg as a feature.
- Balance (Bal): our motivation to use it as a feature is that most malicious activities in a permissionless blockchain are finance based. For example, in the well-known Parity Multisig wallet attack [25], the malicious account drained more than 150k Ethers (the currency used in the Ethereum blockchain). Thus the currency held by an account, as well as its flow, is an important feature. We identify the balance time series for both the in and out cases. Besides balance, we identify for each instance the max balance for both the in and out cases (maxInBalance and maxOutBalance), zeroBalanceTransactions (transactions where no money was transferred either to or from an account), totalBalance (final balance held by the account), and averagePerInBalance (average of received balance) as features.
- Transaction Fees (TF): in crypto-currency based blockchains, a transaction is marked by the transaction fee that a sender is willing to spend on that transaction. In the Ethereum blockchain, operations like transferring Ethers require a fixed sequence of instructions which consume 21,000 Gas (TF = Gas × GasPrice). Several attackers set a higher gas price to bribe the miner so that a particular transaction of interest to them is included in the next block [19]. Nonetheless, in a DDoS attack [26], an attacker created multiple accounts at a very low gas price to increase synchronization and processing time. Thus it is also an essential feature.
- Attractiveness (A): malicious accounts mostly tend to interact with accounts that they have not interacted with before. The probability of their interacting again with an account they have already interacted with is very low. Consider N_i^t to be the neighborhood (accounts from which account i has received crypto-currency) of account i at time t, T = {t, t − 1, · · · , t − θ_a}, and θ_a the time window size. Based on this, we define attractiveness (A_i^t) for account i at time t as shown in equation 1.
[Equation (1) is missing here; only its final case, "otherwise", is recoverable.]
• Burst (BB) (set F_b | F_b ⊂ F): bursty behavior is defined as a temporally non-homogeneous sequence of events [27] and has been characterized by a fat-tailed inter-event time (∆t) distribution. In one of the bitcoin blockchain attacks (Allinvain Theft [28]), a malicious account generated a large number of transactions to taint the bitcoin platform. Motivated by this incident, we define four types of bursts (temporal, degree, balance and gasPrice) that occur in the network under consideration. As an account can be either a sender or a receiver, the following burst types are defined for the cases (a) when the account acts as a sender, (b) when the account acts as a receiver, and (c) when the account acts as both a sender and a receiver.
- Temporal Burst: for an account i, non-homogeneous occurrences of events (in our case transactions) lead to some transactions occurring where ∆t is less than a threshold, θ_t^i, while for other transactions ∆t is large. If a transaction happens when ∆t < θ_t^i, we assume that it is part of a burst. Some bursts are long-lived while others are short-lived, meaning that some events can happen continuously for long time intervals before the account goes dormant. As features, we identify the number of such temporal bursts (numberOfTemporalBursts) and the duration of the longest burst (longestBurstDuration), also for in and out transactions separately.
- Degree Burst: it has been shown that the degree (also inDegree and outDegree) distribution of the aggregated transactions in a blockchain such as Ethereum follows a power-law (fat-tailed) distribution [4] with α ∈ [−2.8, −2.6]. This suggests that many accounts do not transact often while very few accounts act as hubs (for example, exchanges). Nonetheless, when considering the temporal aspects, we believe such behavior also exists, where some accounts have a very high degree at some instants while at other instants they have a low degree. Thus, we define a degree burst as an instant of time at which the degree of an account i is greater than θ_d^i. Similar to the temporal case, for degree bursts we also identify the number of degree bursts (numberOfDegreeBursts) that happened for an account over time, the number of instances where a degree burst happened (numberOfDegreeBurstInstances), and the time at which the largest burst of degree happened (largestBurstAt). Note that these features, except for numberOfDegreeBurstInstances, are also defined for in and out transactions separately.
- Balance Burst: in some cases transactions happen from an account i to an account j where the sum of crypto-currency involved is very large (more than a threshold value θ_b^i). For example, some accounts associated with Silk Road [29] or involved in money laundering sometimes transact large sums for illegal activities. Bursty behavior of the transaction amount could be helpful in identifying potential malicious activities and accounts. Similar to the above cases, for an account i, we identify the number of unique instances where the balance is more than θ_b^i (numberOfBalanceBurstyInstances), and the number of transactions of more than θ_b^i (numberOfBalanceBursts). Note that we define these features only for the in and out cases.
- GasPrice Burst: as described before, an attacker can set a higher gas price (more than a threshold value θ_g^i) to bribe the miner so that the transaction is included in the block. This activity, although abnormal, is useful in understanding an account's behavior. Towards this, similar to the previous cases, we define numberOfGasPriceBurstyInstances as the number of instances where the gasPrice was set to more than θ_g^i. This is defined only for the in case, as gasPrice is set only by the sender.
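To make the intuition behind attractiveness concrete, the sketch below computes, for one account, the fraction of its current in-neighbors that did not appear in the preceding window of θ_a time steps. This is only one plausible reading of the description above and is not equation 1: the ratio form, the value 0 for an empty neighborhood, and the name `attractiveness` are assumptions of this illustration.

```python
def attractiveness(neighbors_by_time, t, theta_a):
    """Fraction of in-neighbors at time t that are new within the window.

    neighbors_by_time: dict mapping time step -> set of accounts that sent
        funds to the account under study at that time step
    t: current time step
    theta_a: window size (number of earlier time steps to remember)

    Assumed form (NOT the paper's equation 1): |N_t - history| / |N_t|,
    and 0.0 when N_t is empty.
    """
    current = neighbors_by_time.get(t, set())
    if not current:
        return 0.0
    # Union of the neighborhoods seen in the theta_a steps before t.
    history = set()
    for past in range(t - theta_a, t):
        history |= neighbors_by_time.get(past, set())
    new_neighbors = current - history
    return len(new_neighbors) / len(current)
```

Under this reading, an account that keeps receiving funds from the same senders scores 0 (stable neighborhood), and an account that only ever receives from previously unseen senders scores 1.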
Note that features such as in/outDegree, burst, and attractiveness are graph-based temporal features. Besides these, the other graph-based properties that we use as features include the clustering coefficient (CC) [30]. For an account i, let N_i^{t,in} be the neighborhood of account i at time t from which the account has received crypto-currency, and N_i^{t,out} be the neighborhood of account i at time t to which the account has paid crypto-currency. Thus, the total account degree is deg_i^{tot}, and a_ir = 1 if there is a transaction between i and r, otherwise 0. We similarly define a_is, a_ri, a_si, a_rs, a_sr. For a directed graph, the CC of account i (CC_i^t) at time t is defined as in equation 2 [31].
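Equation 2 is not reproduced in this text. As an illustration only, the sketch below computes a clustering coefficient for a directed graph using a widely used total-degree formulation built from the same edge indicators (a_ir, a_ri, a_rs, and so on). We assume, without confirmation from the text, that equation 2 has this form; the function name `directed_clustering` and the static (single time instant) edge set are also assumptions.

```python
def directed_clustering(edges, i):
    """Clustering coefficient of node i in a directed graph without self-loops.

    edges: set of (sender, receiver) pairs, one per directed link
    Uses the total-degree formulation
        C_i = T_i / (2 * (d_tot * (d_tot - 1) - 2 * d_bi)),
    where T_i counts directed triangles through i, d_tot is in-degree plus
    out-degree, and d_bi is the number of reciprocated links of i.
    """
    def a(u, v):
        return 1 if (u, v) in edges else 0

    # Accounts linked to i in either direction.
    neighbors = {v for (u, v) in edges if u == i} | {u for (u, v) in edges if v == i}
    neighbors.discard(i)
    d_tot = sum(a(i, j) + a(j, i) for j in neighbors)
    d_bi = sum(a(i, j) * a(j, i) for j in neighbors)
    denominator = 2 * (d_tot * (d_tot - 1) - 2 * d_bi)
    if denominator == 0:
        return 0.0  # fewer than two distinct neighbors
    triangles = sum(
        (a(i, j) + a(j, i)) * (a(j, h) + a(h, j)) * (a(h, i) + a(i, h))
        for j in neighbors for h in neighbors if j != h
    )
    return triangles / denominator
```

With this formulation a fully reciprocated triangle gives 1.0 for each node, while a one-directional 3-cycle gives 0.5, because only half of the possible directed triangles are realized.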

Results and Evaluation
We evaluate the effectiveness of our method using Ethereum's external transactions data which is publicly available for download using the Etherscan APIs [32]. Note that the APIs do not provide any information about the account (such as the name and the account type). Nonetheless, as the hash of the accounts is available, one can check the associated information using the Ethereum Blockchain Explorer [33]. We perform all our evaluations using Python.

Dataset
As of 20th December 2019, Ethereum had ≈79M accounts. Out of these, 3362 accounts were already tagged as involved in malicious activities. The tags mainly include Phishing (3168 accounts), Gambling (8 accounts), [...] 158 SCs and 2 marked compromised exchanges. Note that for these accounts we collect only, but all, external transactions (transactions from EOAs to SCs, and between different EOAs). Also note that at the time of this study Ethereum had removed most of the malicious tags. Recently, however, Ethereum provided new tags and marked more accounts as malicious. As of 27th May 2020, there were 4708 malicious accounts, out of which 2019 were newly tagged accounts. Out of these 2019 accounts, only 1252 ever transacted. Out of these 1252 accounts, 1029 were created before 7th December 2019, of which only 3 are present in our dataset. As the set of malicious accounts is constantly evolving, we take this opportunity to cross-validate the accounts that our analysis found malicious. There is a high class imbalance in the dataset as the number of benign accounts is large. Thus, we perform random under-sampling to uniformly sample 697K benign accounts from the 79M Ethereum accounts. In the total of ≈700K accounts we have, there are 7 exchanges and 23,141 SCs, while the rest are EOAs.
A unique transaction, Tx, contains information about blockHash, blockNumber, source, destination, gas, gasPrice, transaction hash, balance, and timestamp of the block. Note that the Tx data does not include the timestamp of when a transaction was performed by the account. The only time-related information we are able to extract is when a block is mined. However, currently we do not use this information. We assume a time bin of 1 block for our study and assign the respective blockNumber as a timestamp to all the transactions. Based on this notion of timestamp, we also segment the data into several SDs of different T_g and study the behavior of the accounts. We describe the different T_g we consider in section 3. In total, we have 1,531 Day SDs, 219 Week SDs, 52 Month SDs, 18 Quarter SDs, 9 HalfYearly SDs, 5 Year SDs, and the entire dataset, for a total of 1,835 datasets.
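The SD counts above are consistent with each other. As a quick arithmetic check, the sketch below derives every count from the 1,531 Day SDs using the Day multiples given in the Methodology. The use of ceiling division, meaning that a partially filled trailing period still forms its own SD, is our reading for this illustration and is not stated explicitly in the text.

```python
import math

DAY_SDS = 1531  # number of Day sub-datasets reported in the text

# Days per sub-dataset, from the granularity definitions in the Methodology.
days_per_sd = {"Day": 1, "Week": 7, "Month": 30,
               "Quarter": 90, "HalfYearly": 180, "Year": 360}

# Ceiling division: a trailing, partially filled period still counts as one SD
# (an assumption of this sketch).
counts = {g: math.ceil(DAY_SDS / d) for g, d in days_per_sd.items()}
counts["All"] = 1  # the entire dataset

total = sum(counts.values())
```

Under this assumption the derived counts match the reported ones, including the total of 1,835 datasets.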
For our study we assign: (i) θ_t = 2 so that continuous bursts of the smallest size are also captured, (ii) for an account i, θ_d^i = 0.8 × max(d), where d is the in/outDegree of the account in the considered SD, (iii) θ_b^i = 0.8 × max(b), where b is the transaction balance for either the in or the out case, (iv) θ_a^i equal to the duration of the SD, to keep the entire history of neighbors with which a particular account transacted in the given sub-dataset, and (v) θ_g^i = 0.8 × max(gasPrice), where gasPrice is the gas price of the transactions associated with account i. We then analyse the different time series based features to identify their characteristics as potential features.
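The thresholds above can be turned into burst features with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, time is measured in block numbers, and grouping consecutive short inter-event times into one maximal run per burst is an assumption about how individual bursts are delimited.

```python
def temporal_bursts(block_numbers, theta_t=2):
    """Count temporal bursts and the longest burst duration (in blocks).

    block_numbers: sorted block numbers of one account's transactions
    A burst is taken to be a maximal run of consecutive transactions whose
    inter-event time is below theta_t (the grouping into runs is an
    assumption of this sketch).
    """
    bursts = 0
    longest = 0
    run_start = None
    for prev, curr in zip(block_numbers, block_numbers[1:]):
        if curr - prev < theta_t:
            if run_start is None:      # a new burst begins at `prev`
                run_start = prev
                bursts += 1
            longest = max(longest, curr - run_start)
        else:
            run_start = None           # gap too large: the burst ends
    return bursts, longest


def degree_burst_instants(degree_series, fraction=0.8):
    """Return the instants whose degree exceeds fraction * max(degree).

    degree_series: dict mapping time instant -> degree at that instant
    """
    if not degree_series:
        return []
    threshold = fraction * max(degree_series.values())
    return sorted(t for t, d in degree_series.items() if d > threshold)
```

With θ_t = 2 and a time bin of one block, only transactions in the same or the immediately following block extend a burst, which is what makes the smallest continuous bursts visible. The same thresholding pattern as `degree_burst_instants` applies to the balance and gasPrice bursts.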

Results
For the entire dataset, we first study the inDegree and outDegree distributions for both malicious and benign accounts to validate the fat-tailed behavior of the degree distribution. From fig. 1, we identify that a power-law distribution [35] with x_min = 2.3, α ∈ [2.37, 2.54] and α ∈ [2.23, 2.33] fits the inDegree and outDegree distributions, respectively, for both malicious and benign accounts. Here α and x_min are the power-law exponent and the minimum x from which the power-law distribution is observed, respectively.
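For readers unfamiliar with how α is obtained for a given x_min, the sketch below uses the standard continuous maximum-likelihood estimator α = 1 + n / Σ ln(x_i / x_min). It is an illustration on synthetic data with a fixed seed, not our fitting procedure: the tool behind [35] may use a different (for example, discrete) estimator, and the function names are hypothetical.

```python
import math
import random


def powerlaw_alpha_mle(samples, x_min):
    """Continuous maximum-likelihood estimate of the power-law exponent.

    Uses alpha = 1 + n / sum(ln(x / x_min)) over samples with x >= x_min.
    """
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("need at least one sample strictly above x_min")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(n, alpha, x_min, rng):
    """Draw n samples from p(x) ~ x^(-alpha), x >= x_min (inverse transform)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic check with a fixed seed; these are NOT the paper's data.
rng = random.Random(0)
synthetic = sample_powerlaw(20000, alpha=2.4, x_min=2.3, rng=rng)
alpha_hat = powerlaw_alpha_mle(synthetic, x_min=2.3)
```

On the synthetic sample the estimate lands close to the exponent used to generate it, which is the behavior one expects from the estimator on a sufficiently long tail.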
The fat-tailed nature of the degree is evident because some accounts interact with a larger number of accounts at a certain instant, thereby inducing bursty behavior. We study the distribution of inDegree for all individual accounts to understand whether such behavior is shown by all the accounts. Fig. 2a presents the distribution of inDegree for different accounts. We identify that the inDegree of very few accounts is high (>100) for very few time instances, while most of the time it is low, suggesting the existence of bursts. We observe similar behavior for outDegree as well (see fig. 2b).
Next, we validate the existence of temporal bursts. For this, we study the distribution of inter-event time (∆t) for all accounts. We find that it follows a power law with x_min = 3 and with α = 1.25 and α = 1.76 for the benign and malicious cases, respectively (see fig. 3a). Nonetheless, we also observe a truncation at 1.5 × 10^6 blocks. The truncation reflects that some accounts were inactive, that is, did not perform any transactions, for a long period of time. At the individual level, we observe that only a few accounts have a very large inter-event time (> 1 × 10^6), and the probability of occurrence of such events is very low. Most of the activity happens where the inter-event time is very small (see fig. 3b).
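Since blockNumber serves as the timestamp, the inter-event time of an account is the gap, in blocks, between its consecutive transactions. A minimal sketch, assuming the block numbers of one account's transactions are available as a list (the function name `inter_event_times` is hypothetical):

```python
def inter_event_times(block_numbers):
    """Return the gaps (in blocks) between consecutive transactions.

    block_numbers: block numbers of all transactions of a single account,
    in any order. Transactions in the same block give a gap of 0.
    """
    ordered = sorted(block_numbers)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

An account with a single transaction yields an empty list, so it contributes nothing to the ∆t distribution.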
The attractiveness behavior of malicious and benign accounts differs significantly (see fig. 4). Most malicious accounts have a high attractiveness value, while most benign accounts have a low attractiveness value. This supports our assumption that most malicious accounts target accounts that they have not previously interacted with. Some attacks (Upbit Hack -Fake Phishing1431: '0xdf9191889649c442836ef55de5036a7b694115b6') use multiple accounts to evade detection while transferring money to exchanges, placing them as a buffer between the account and the exchange. This is the reason for the relatively high probability (p(A = 0) > 0.2) of low attractiveness values (A = 0) for malicious accounts. Similarly, for some benign accounts p(A = 1) = 0.1 because such accounts have only 1 incoming transaction in their whole lifetime, which means that the account interacted only with new accounts.
For the entire dataset, after applying tsfresh, for every temporal feature F_t^j ∈ F_t we get a set of features, F̂_t^j, that describes F_t^j. From F̂_t^j, we choose the top three features, using Gini as the scoring method. After this process, we get a total of 59 features. For the entire dataset, using Pearson correlation, we remove highly correlated features and find 36 important features. We also perform PCA to identify 28 features that cover >98.2% of the variance, to further reduce the feature space in the entire dataset.
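The correlation-based pruning step can be sketched as a greedy filter: a feature is kept only if its absolute Pearson correlation with every already kept feature stays at or below a cutoff. The cutoff of 0.9, the greedy order, and the names `pearson` and `drop_correlated` are assumptions for illustration; the text does not state the threshold or the exact procedure used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Greedily keep features that are not highly correlated with kept ones.

    features: dict mapping feature name -> list of values (one per account).
    threshold: assumed cutoff on |r|; not taken from the text.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Because the filter is greedy, the surviving set depends on the order in which features are visited; sorting features by importance first is one way to make that choice deliberate.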
Figure 4: Attractiveness
For analysis purposes, besides performing PCA to identify 28 features, and before running the AutoML tool (TPOT) to identify the best supervised learning algorithm, we segment the entire dataset into six dataset configurations. Note that these six dataset configurations are different from the temporal SDs. Three of the six configurations use all types of accounts (EOA and SC) and have 59, 36, and 28 features, respectively. For the remaining three, we separate EOAs from SCs and use only EOAs; these configurations again have 59, 36, and 28 features, respectively. We configure TPOT with all the supervised ML algorithms used in state-of-the-art studies, along with other supervised ML algorithms, to identify the algorithm that gives the best balanced accuracy.
Table 3 lists the dataset configurations we have used, the algorithm that provided the best balanced accuracy for each, and the precision, recall, and F1-score for each class. For each dataset configuration and its best algorithm, we report only those hyperparameters whose values differ from the defaults. We identify that ExtraTreesClassifier provides the best overall balanced accuracy for all dataset configurations, and that among them the configuration with 59 features and all account types has the best balanced accuracy. The difference in balanced accuracy between the configurations with 36 and 59 features is only 0.5%, both when we consider only EOAs and when we consider all accounts. These results show that correlated features do not provide much gain and can be removed with negligible loss of accuracy.
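Balanced accuracy is the mean of the per-class recalls, which matters here because malicious accounts are a small minority of the dataset. The sketch below is a plain-Python illustration of the metric (the function name `balanced_accuracy` is hypothetical, and it is not the implementation used by TPOT):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)
```

A classifier that labels every account as benign scores 0.5 on this metric regardless of how imbalanced the classes are, whereas its plain accuracy would look deceptively high.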
To validate our results, we test ExtraTreesClassifier with the identified hyperparameters on a newly identified set of 1252 malicious accounts. The classifier achieves 50% balanced accuracy. However, when we train the classifier with the identified hyperparameters on the total dataset (consisting of the previously used 700k accounts and the 1252 new accounts), we achieve ≈ 92% balanced accuracy. This raises the question of whether the new malicious nodes have different characteristics. We check the cosine similarity between the 2946 old malicious accounts and the 1252 new malicious accounts (cf. figure 5). We find that most of the newly added malicious accounts have a low similarity score. Only one new malicious account has a similarity score > 0.985, and with only one old malicious account. In many cases the similarity score even falls below −0.89, showing that the accounts are not similar and that the new malicious accounts exhibit some new aspects. Note that to compute cosine similarity we do not use features such as transactedLast and transactedFirst because many of the [...]
We next test unsupervised learning algorithms such as K-Means, DBSCAN, HDBSCAN, and oneClassSVM to identify suspect accounts in the entire dataset. We find that, across the six dataset configurations (those mentioned above, not the SDs) and different values of k ∈ [3, 24], K-Means provides the best silhouette score (score = 0.365) for k = 10 clusters when we use all the features but only EOAs ('59 - EOA') (see fig. 6). Among these 10 clusters, for one initial condition, one cluster contains most of the already known malicious EOAs (≈ 73.9% (2062/2788)) (see fig. 7). We then compute the similarity between all the accounts in the identified cluster. We identify 554 benign accounts whose behavior (cosine similarity; see fig. 8) is within 1 − ε, where ε → 0, of that of malicious accounts. For our analysis we use ε = 10^−7.
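The suspect-selection rule can be sketched as follows: within the chosen cluster, a benign account is flagged when its feature vector has cosine similarity of at least 1 − ε with a known malicious account. Flagging on a match with any single malicious account (rather than all of them) is an assumption, as are the names `cosine_similarity` and `find_suspects` and the dictionary layout.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity with an all-zero vector is treated as 0
    return dot / (norm_u * norm_v)


def find_suspects(benign, malicious, epsilon=1e-7):
    """Return benign accounts nearly identical in direction to a malicious one.

    benign, malicious: dicts mapping account id -> feature vector, restricted
    to the accounts of one cluster.
    epsilon: tolerance; 1e-7 is the value used in the text.
    """
    suspects = []
    for account, vector in benign.items():
        if any(cosine_similarity(vector, m) >= 1.0 - epsilon
               for m in malicious.values()):
            suspects.append(account)
    return suspects
```

Cosine similarity ignores vector magnitude, so an account whose features are a scaled copy of a malicious account's features is flagged even if its absolute values differ.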
We cross-validate the transactions performed by these 554 benign accounts and find that (a) most of the EOAs have a small transactedLast value, meaning that those accounts have not transacted in the recent past (494 EOAs did not interact in the past 6 months), (b) at least 38 EOAs have only incoming transactions and are not exchanges, and (c) totalBalance ∈ [0.0, 150.0] Ethers, with a median of 0.001 Ethers.
When considering both EOAs and SCs, we obtain the best silhouette score (score = 0.356) for k = 9 clusters, again for the case when we use all 59 features ('59 - EOA and SC') (see fig. 6). In this case, for one initial condition, there was one cluster with the maximum number of already tagged malicious EOAs (≈ 64.3% (1793/2788)) and malicious SCs (≈ 62.6% (99/158)). We identify 293 potential suspect EOAs and no suspect SCs within this cluster using our previous method. Out of these 293 accounts, 160 EOAs were also detected in the set of 554 accounts. We further tested whether the accounts we identified as suspects are present in the list of newly tagged malicious accounts. We found that none of the 3 newly tagged malicious accounts that transacted during our analysis period were in our list of suspects. This is possible as the accounts may have changed their behavior and become malicious after our collection period. We do not reveal the account hashes of these 554 or 293 suspects, for the sake of privacy and to avoid maligning accounts that are currently considered benign, until they are officially tagged as malicious. Other unsupervised ML algorithms did not perform better than K-Means. The silhouette scores for HDBSCAN were in the range [−0.06, −0.022], while oneClassSVM did not converge.
Figure 8: Cosine similarity between malicious accounts and benign accounts in the cluster with best Silhouette score.
To further understand temporal behavior changes before classifying accounts as malicious, we use temporal sub-datasets (SDs) created at different temporal granularities (T_g, see section 1). Consider a T_g ∈ T_G, which consists of several SDs. Let this set be SD(T_g) = {SD(T_g)_1, SD(T_g)_2, ..., SD(T_g)_j, ..., SD(T_g)_n}. Further, consider an account i. We first analyse all the time-series-based features in each SD(T_g)_j and characterise them. We employ a similar approach as before: we identify F̂_t^i using tsfresh for each F_t^i ∈ F in a given SD(T_g)_j and use the three features in F̂_t^i with the highest Gini score. We then perform clustering using K-Means with the previously identified hyperparameter (k = 9). As before, we tag accounts in each SD(T_g)_j as malicious or benign after computing cosine similarity. This results in a vector M for each account, of size n_i, where each element M_j of M is either 0 or 1 and n_i is the number of SDs in a T_g in which the account appears. Here 0 represents not identified as malicious. Let this set of SDs be SD(T_g)^i = {SD(T_g)^i_1, SD(T_g)^i_2, ..., SD(T_g)^i_j, ..., SD(T_g)^i_{n_i}}. M depicts the behavior of an account i, where a change in behavior is captured if M_j ≠ M_{j+1}. We note that, as per our analysis, a single benign account changed its behaviour the largest number of times (591), for T_g = Day. Figure 9 shows the probability distribution of the number of changes in behavior performed by accounts. The figure considers only those accounts whose behavior changed at least once.
Figure 9: Probability distribution of number of changes in behavior of accounts with certain probability for being benign at different T_g s.
For the daily case, as the amount of data was significant, we identify that a lognormal-positive distribution with parameters x_min = 1, µ = 1.25, and σ = 2.36 best fits the data. Further, across all T_g s there were 9254 unique benign accounts that showed unstable behavior. From M, the probability of a particular account i being malicious in a given T_g is given by p_m^i = (Σ_{j ∈ SD(T_g)^i} M_j) / n_i. The number of accounts with a certain probability of being benign at different T_g s is shown in figure 10. We identify 814 unique accounts across different T_g s as suspects that have p_m^i = 0. Further, as seen from the figure, most of the accounts were identified as benign.
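The two quantities derived from the per-account vector M can be sketched in a few lines. The sketch assumes M is a list of 0/1 tags ordered by sub-dataset, with 1 meaning the account was tagged malicious in that SD, and counts a behavior change whenever two consecutive entries differ; the function names are hypothetical.

```python
def behavior_changes(m):
    """Count positions j where consecutive tags differ."""
    return sum(1 for current, following in zip(m, m[1:])
               if current != following)


def malicious_probability(m):
    """Fraction of the account's SDs in which it was tagged malicious."""
    if not m:
        raise ValueError("the account appears in no sub-dataset")
    return sum(m) / len(m)
```

For M = [0, 1, 1, 0, 0], this gives two behavior changes and a probability of 0.4.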

Conclusion
Blockchain technology has found applications not only in the financial sector, such as the crypto-currency market, hedge funds, and insurance, but also in sectors such as governance, education, healthcare, and law enforcement. Although blockchains are privacy-preserving, as their adoption increases, security threats are inevitable, more diverse, and deployed using novel techniques. It is therefore essential to have secure transactions. Motivated by the fact that there is limited work on identifying accounts involved in potentially malicious activities, and that the available work does not target the temporal aspects of blockchains, we present in this work a way to detect malicious accounts that considers the temporal nature of blockchains.
In this work, we present graph-based temporal features (such as burst and attractiveness), inspired by existing attacks on blockchains, on top of existing features used to identify malicious accounts. To do so, we first conduct a systematic study of the temporal behavior of the blockchain graph on transaction data collected from one blockchain, Ethereum. Our results show that ExtraTreesClassifier performs best under the supervised setting and achieves balanced accuracy ∈ [87.2, 88.7] for different dataset configurations. Moreover, under the unsupervised setting, K-Means was able to cluster at most 73.9% of the known malicious accounts together and to identify 554 more suspects whose behavior was similar to that of malicious accounts. By considering behavioral changes over time and studying them over different temporal granularities, we are able to estimate the probability of an account being malicious at a particular temporal granularity.
Given such results, we expect that benign accounts will be more careful while transacting with suspects and will safeguard themselves from fraud and security threats. The current technique is applicable to permissionless blockchains. We would like to investigate the applicability of our method to blockchains where features such as Transaction Fees and Balance are missing. Regardless of whether a particular blockchain is permissionless or permissioned, there are many other centrality measures, such as closeness, betweenness, and PageRank, that are applicable to the blockchain graph. Another future research direction is to incorporate these measures as features and study the behavior of the accounts before tagging them as malicious or benign. Further, in this work we detected suspects using supervised and unsupervised learning algorithms. Reinforcement learning is another type of ML that can be applied and studied to detect malicious activity. As our validation failed on the newly tagged malicious accounts, a further direction is to study the new features and new methods that the new malicious accounts use to perform illegal activities.