implement ico rating ( https://arxiv.org/pdf/1803.03670.pdf )
Cryptocurrencies (or digital tokens, digital currencies, e.g., BTC, ETH, XRP, NEO) have been rapidly gaining ground in use, value, and understanding among the public, bringing astonishing profits to investors. Unlike other money and banking systems, most digital tokens do not require central authorities. Being decentralized poses significant challenges for credit rating. Most ICOs are currently not subject to government regulations, which makes a reliable credit rating system for ICO projects necessary and urgent. In this paper, we introduce IcoRating, the first learning-based cryptocurrency rating system. We exploit natural-language processing techniques to analyze various aspects of 2,251 digital currencies to date, such as white paper content, founding teams, GitHub repositories, websites, etc. Supervised learning models are used to correlate the life span and the price change of cryptocurrencies with these features. For the best setting, the proposed system is able to identify scam ICO projects with 0.83 precision. We hope this work will help investors identify scam ICOs and attract more efforts in automatically evaluating and analyzing ICO projects.
Cryptocurrencies (e.g., BTC, ETH, NEO) are gaining unprecedented popularity and understanding. As opposed to centralized electronic money and central banking systems, most digital tokens do not require central authorities. Control of these decentralized systems works through a blockchain, an open and distributed ledger that continuously grows. The market capitalization of cryptocurrencies has increased by a significant margin over the past 3 years, as can be seen from Figure 1. According to the data provider Cryptocurrency Market Capitalizations (https://coinmarketcap.com/), the peak daily trading volume of cryptocurrencies is close to the average daily volume of trade on the New York Stock Exchange in 2017.
Because of their decentralized nature, the crowdfunding of digital coins does not need to go through all the required conditions of venture capital investment, but instead proceeds through Initial Coin Offerings (ICOs) Chohan (2017). In an ICO, investors obtain the crowdfunded cryptocurrency using legal tender (e.g., USD, RMB) or other cryptocurrencies (e.g., BTC, ETH), and these crowdfunded cryptocurrencies become functional units of currency when the ICO is done. A white paper is often prepared prior to launching the new cryptocurrency, detailing the commercial, technological and financial details of the coin. As can be seen in Figure 2, the number of ICO projects grew steadily from July 2013 to January 2017, and skyrocketed in 2017.
Despite the fact that ICOs are able to provide fair and lawful investment opportunities, the ease of crowdfunding creates opportunities and incentives for unscrupulous businesses to use ICOs to execute “pump and dump” schemes, in which the ICO initiators drive up the value of the crowdfunded cryptocurrency and then quickly “dump” the coins for a profit. Additionally, the decentralized nature of cryptocurrencies poses significant challenges for government regulation. According to Engadget (https://www.engadget.com/), 46 percent of the 902 crowdsale-based digital currencies in 2017 have already failed. Figures 3 and 4 show an even more serious situation. Each bucket on the x axis of Figures 3 and 4 denotes a range of price change, and the corresponding value on the y axis denotes the percentage of ICO projects. As can be seen, a large fraction of existing ICO projects suffered a sharp price fall within half a year of the ICO, and this fraction grows further after one year. Though it would be irresponsible to claim that every sharply falling ICO project is a scam, it is necessary and urgent to build a reliable ICO credit rating system to evaluate a digital currency before its ICO.
In this paper, we propose IcoRating, a machine-learning based ICO rating system. By analyzing 2,251 ICO projects, we correlate the life span and the price change of a digital currency with various levels of its ICO information, including its white papers, founding team, GitHub repository, website, etc. For the best setting, the proposed system is able to identify scam ICO projects with a precision of 0.83 and an F1 score of 0.80.
IcoRating is a machine learning–based system. Compared against human-designed rating systems, IcoRating has two key advantages. (1) Objectivity: a machine learning model involves less prior knowledge about the world, instead learning from the data, in contrast to human-designed systems that require massive involvement of human experts, who inevitably introduce biases. (2) Difficulty of manipulation by unscrupulous actors: the credit rating is output by a machine learning model through black-box training, a process that requires minimal human involvement and intervention.
IcoRating can contribute to the cryptocurrency community in two aspects. First, we hope that this work will encourage more effort invested in designing reliable, automatic and difficult-to-manipulate systems to analyze and evaluate the quality of ICO projects. Second, IcoRating can potentially help investors identify scam ICO projects and make rational investments in cryptocurrencies.
The rest of this paper is organized as follows: we give a brief outline of cryptocurrencies, blockchains and ICOs in Section 2. We describe the construction of a dataset of ICO projects in Section 3, and provide some basic analysis of the data. In Section 4, we describe the machine learning model we propose. Experimental results and qualitative analysis are illustrated in Section 5, followed by a short conclusion in Section 6.
In this section, we briefly describe relevant information on cryptocurrencies, blockchains, and ICOs.
A cryptocurrency is “a digital asset designed to work as a medium of exchange that uses cryptography to secure its transactions” (quoted from https://en.wikipedia.org/wiki/Cryptocurrency). Most cryptocurrencies use decentralized control. The first decentralized cryptocurrency is Bitcoin (BTC for short) Nakamoto (2008), created in 2009 by an unknown person or group of people under the name Satoshi Nakamoto. Since the design of BTC, various types of cryptocurrencies have been created, the most well-known of which include Ethereum Dienelt (2016), Ripple Carson (2014), EOS Stanley (2017) and NEO Go (2017).
A cryptocurrency’s transactions are validated by a blockchain. One can think of a blockchain as a distributed ledger that continuously grows and permanently records all transactions between two parties. Each record is called a block, which contains a cryptographic hash pointer that links to the previous block, a timestamp and transaction data. The ledger is owned in a distributed way by all participants, and a record cannot be altered without altering all subsequent blocks of the network. Transactions are broadcast to all nodes in the network. Blockchains use various time-stamping schemes such as proof-of-work Dwork and Naor (1992) or proof-of-stake Vasin (2014).
The concept of blockchain eliminates the risks of data being held centrally: it has no central point of failure and data is transparent to every participant involved.
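To make the hash-pointer mechanism concrete, here is a minimal sketch in Python (an illustrative toy, not any real cryptocurrency's block format): each block stores a hash of its own contents plus the previous block's hash, so tampering with any block invalidates every subsequent pointer.

```python
import hashlib
import json
import time

def make_block(transactions, prev_hash):
    """A block: transaction data, a timestamp, and a hash pointer to the previous block."""
    block = {"transactions": transactions,
             "timestamp": time.time(),
             "prev_hash": prev_hash}
    body = json.dumps({k: block[k] for k in ("transactions", "timestamp", "prev_hash")},
                      sort_keys=True).encode()
    block["hash"] = hashlib.sha256(body).hexdigest()
    return block

def verify_chain(chain):
    """Recompute every hash; altering any block breaks all later hash pointers."""
    for i, cur in enumerate(chain):
        body = json.dumps({k: cur[k] for k in ("transactions", "timestamp", "prev_hash")},
                          sort_keys=True).encode()
        if hashlib.sha256(body).hexdigest() != cur["hash"]:
            return False  # block contents no longer match its stored hash
        if i > 0 and cur["prev_hash"] != chain[i - 1]["hash"]:
            return False  # hash pointer to the previous block is broken
    return True
```

Real blockchains add consensus (proof-of-work or proof-of-stake, as noted above) on top of this basic tamper-evidence property.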
An initial coin offering (ICO; https://en.wikipedia.org/wiki/Initial_coin_offering) is a means of crowdfunding centered around cryptocurrency. In an ICO, crowdfunded cryptocurrency (mostly in the form of tokens) is transferred to investors in exchange for legal tender or other cryptocurrencies. These tokens become functional units of currency that can be exchanged for goods or other cryptocurrencies when the ICO’s funding goal is met.
ICOs provide a crowdfunding opportunity for early-stage projects to avoid the regulations required by venture capitalists, banks and stock exchanges. They also provide investment opportunities beyond venture capital or private equity investments, which have dominated early-stage investment opportunities. On the other hand, because of the lack of regulation, ICOs pose significant risks for investors. Different countries have different regulations on ICOs and cryptocurrencies. For example, the government of the People’s Republic of China banned all ICOs, and the U.S. Securities and Exchange Commission (SEC) indicated that it could have the authority to apply federal securities law to ICOs Higgins (2017), while the government of Venezuela launched its own cryptocurrency called petromoneda.
We collected information for 2,251 past ICO projects, including the white papers, website information, GitHub repositories at the time of ICO, and founding teams. We obtained the data from various providers, including CryptoCompare (http://cryptocompare.com/), CoinMarketCap (https://coinmarketcap.com/) and CoinCheckup (http://coincheckup.com/).
The white paper is a crucial part of an ICO project. It describes how the crowdfunding is intended to work, such as the landscape of the ICO project, how the tokens will be allocated, and how the crowdfunded money will be spent. Out of the 2,251 ICO projects, we were able to obtain 1,317 white papers. We transform the white paper PDFs to text using the PDFMiner API (https://github.com/euske/pdfminer).
Table 1 reports statistics for publicly available ICO projects: the number of documents, and the average, standard deviation, maximum and minimum number of words and sentences per white paper.
A notable observation is the high variance in the length of white papers, with a maximum of 6,228 sentences and a minimum of 38. More concretely, the numbers of sentences in 10 randomly sampled white papers are [886, 143, 38, 967, 3379, 6228, 496, 2057, 3075, 298]. Though the length of a white paper does not necessarily reflect the quality of an ICO project, we can see the large variance in content among ICO white papers.
Figure 5 summarizes the generative process of LDA: for each white paper, (i) sample a topic distribution; (ii) for each word in the document, sample a topic and then sample the word from that topic.
We run a Latent Dirichlet Allocation (LDA) model Blei et al. (2003) on the collected white papers. LDA is a generative statistical model that explains text documents with word clusters called “topics” based on word-to-word co-occurrence. Each document is represented as a probabilistic distribution over latent topics, and each latent topic is characterized by a probabilistic distribution over words. The generative process of LDA is shown in Figure 5, and the corresponding graphical model is illustrated in Figure 6.
We use collapsed Gibbs sampling and run 100 iterations on the dataset. Table 2 shows the top words for different LDA topics, along with the white papers/cryptocurrencies assigned to each topic with the highest probability.
We can see a clear semantic domain represented by each topic: ICOs for gambling, games, medical care, religion, machine-networks, cryptography, insurance, etc.
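A collapsed Gibbs sampler of the kind used above can be sketched in a few dozen lines of pure Python (a minimal illustration, not the paper's implementation; the hyperparameters alpha and beta and the toy corpus below are our choices). Each document's resulting topic-proportion vector is exactly the kind of per-document distribution described above.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    # Count tables: doc-topic counts, topic-word counts, topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [defaultdict(int) for _ in range(n_topics)]
    nk = [0] * n_topics
    # Random initial topic assignment for every token.
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment...
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # ...then resample its topic from the full conditional.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Per-document topic proportions (the "LDA vector" used as features later).
    theta = [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
              for t in range(n_topics)] for d, doc in enumerate(docs)]
    return theta, nkw
```

On the white-paper corpus, each row of `theta` would serve as that document's topic feature vector (50-dimensional in the setting used later in this paper).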
Out of 2,251 ICO projects, we were able to collect the information of 1,230 founding teams. The following passage illustrates the type of descriptions used for founding teams in our dataset:
Justin Sun, born in 1990, master of University of Pennsylvania, bachelor of Peking University, founder and CEO of mobile social application Peiwo and TRON, the former chief representative for Greater China of Ripple. 2011 Asia Weekly Cover People; 2014 Davos Global Shaper; 2015 CNTV new figure of the year; 2017 Forbes Asia 30 under 30 entrepreneurs; 2015/2017 Forbes China 30 Under-30s; the only millennial student in the first batch of entrepreneurs at Hupan University, an elite business school established by Jack Ma, the founder of Alibaba Group.
We aim to automatically extract the most important characteristics from bios of founding team members. We treat this as an NLP tagging problem Toutanova et al. (2003); Huang et al. (2015); Miller et al. (2004); Tjong Kim Sang and De Meulder (2003). Most tagging models are supervised or semi-supervised approaches, which require first designing annotation guidelines (including selecting an appropriate set of types to annotate), then annotating a large corpus.
We define 5 categories of tags: born-date, university, degree, companies and awards received. We annotated 500 bios of different people and split the dataset into 0.8/0.1/0.1 for training, dev and testing. Features we employ include:
Word level features:
Word window size of 3
Letter level word features
Starts with a capital letter?
Has all capital letters?
Has all lower case letters?
Has non initial capital letters?
Is all numbers?
Letter prefixes and suffixes
A dictionary of world companies. We use the Freebase API (https://developers.google.com/freebase/v1/) to augment the dataset with synonyms.
A dictionary of world universities. Freebase API is used as well.
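The letter-level and word-level features listed above can be extracted with straightforward string tests; a sketch (the feature names and the window convention are illustrative, not the paper's exact feature templates):

```python
def word_shape_features(word, affix_len=3):
    """Letter-level features of a single token, mirroring the list above."""
    return {
        "init_cap": word[:1].isupper(),                   # starts with a capital letter
        "all_caps": word.isupper(),                       # has all capital letters
        "all_lower": word.islower(),                      # has all lower-case letters
        "mixed_cap": any(c.isupper() for c in word[1:]),  # non-initial capital letters
        "all_digits": word.isdigit(),                     # is all numbers
        "prefix": word[:affix_len].lower(),               # letter prefix
        "suffix": word[-affix_len:].lower(),              # letter suffix
    }

def window_features(tokens, i, size=3):
    """Word-level features: the token plus a surrounding window of words."""
    lo, hi = max(0, i - size), min(len(tokens), i + size + 1)
    return {"word": tokens[i].lower(),
            "window": [t.lower() for t in tokens[lo:hi]]}
```

These dictionaries would then be fed, together with the company/university gazetteer lookups, into a standard sequence tagger.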
In this section, we detail the deep learning model for ICO rating. The model uses very little prior knowledge about ICO projects, but rather learns the importance of various features from the collected real-world dataset.
The model we use here is a supervised learning model. In a standard supervised-learning setting, we wish to find a model F that maps an input x to an output y: y = F(x).
The input x is an ICO project, comprising the different aspects of its publicly available information. The output y is a binary variable indicating whether the ICO project is a scam. We use a learning model to predict the proportional price change of a given currency one year after its ICO. Out of the 2,251 collected projects, we were able to collect the information of 1,482 projects with known price history whose ICOs were conducted at least one year before this work was done.
At training time, we use the price change of an ICO project over one year as the training signal, trying to predict this price change given the ICO information. The predicting function F is learned by minimizing the distance between the predicted price change and the gold-standard price change:

F = argmin Σ ||F(x) − Δprice(x)||

where Δprice = [price(t₁) − price(t₀)] / price(t₀), t₀ denotes the time of the ICO, and price(t₁) denotes the price of a currency one year after its ICO.
At test time, we use F to predict the price change, and consider a project a scam if the predicted price is less than a fraction δ of its ICO price. The threshold δ can be set as desired; in this paper it is set to 0.01, 0.1 and 1.
We split the obtained ICO projects into training, dev and test sets in a 0.8/0.1/0.1 ratio. For each value of δ, an ICO project x in the training dataset is paired with a gold-standard label y, a binary value indicating whether the project is a scam. The value of y can differ with respect to different values of the scam bar δ.
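The labeling rule above can be sketched directly (a minimal illustration; the variable names are ours):

```python
def price_change(ico_price, price_after_one_year):
    """Proportional price change one year after the ICO."""
    return (price_after_one_year - ico_price) / ico_price

def scam_label(ico_price, price_after_one_year, delta):
    """y = 1 (scam) iff the price one year on is below delta * ICO price."""
    return 1 if price_after_one_year < delta * ico_price else 0
```

Note how the same project can flip label as δ moves from 0.01 to 1: a coin trading at half its ICO price is not a scam under δ = 0.01 but is under δ = 1.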
Note that it would be straightforward to use the distance between predicted and gold-standard prices for evaluation, but the value of this distance is opaque to readers and investors. We thus transform the predicted price change into a more interpretable binary value, so that the evaluation becomes whether the model correctly identifies scam ICO projects.
One of the key parts of a supervised learning model is how to represent the input x. In this subsection, we detail how we transform each aspect of an ICO (e.g., white paper, founding team, website) into a machine-readable vector.
We transform each white paper into a vector representation using deep learning methods. Each white paper D consists of a sequence of sentences s₁, s₂, …, s_{N_D}, where N_D denotes the number of sentences in the current white paper D. Each sentence s consists of a sequence of words w₁, w₂, …, w_{N_s}, where N_s denotes the number of words in sentence s. Each word w is associated with a vector e_w.
We adopt a hierarchical LSTM model Serban et al. (2017); Li et al. (2015b) to map a white paper to a vector representation v_D. We first obtain representation vectors at the sentence level by passing the vectors of each sentence’s words through four layers of LSTMs Hochreiter and Schmidhuber (1997); Gers et al. (1999). An LSTM associates each time step with an input, memory and output gate, respectively denoted as i_t, f_t and o_t. For notational clarity, we distinguish e_t and h_t, where e_t denotes the vector for an individual text unit (e.g., a word or sentence) at time step t, while h_t denotes the vector computed by the LSTM model at time t by combining e_t and h_{t−1}. σ denotes the sigmoid function. The vector representation h_t for each time step t is given by:

i_t = σ(W_i · [h_{t−1}, e_t])
f_t = σ(W_f · [h_{t−1}, e_t])
o_t = σ(W_o · [h_{t−1}, e_t])
l_t = tanh(W_l · [h_{t−1}, e_t])
c_t = f_t ∘ c_{t−1} + i_t ∘ l_t
h_t = o_t ∘ tanh(c_t)

where ∘ denotes the element-wise product. The vector output at the final time step is used to represent the entire sentence. The representation for the current document/paragraph is obtained by summing over the representations of all its sentences.
The best values for the LSTM parameters are unknown. The most straightforward way to learn them is directly through the objective function in Eq. 2. Unfortunately, training an LSTM model on a few thousand examples can easily lead to overfitting. We thus adopt an unsupervised approach, in which we train a skip-thought model Kiros et al. (2015); Tang et al. (2017). The skip-thought model is an encoder-decoder model based on sequence-to-sequence generation techniques Sutskever et al. (2014); Luong et al. (2015); Chung et al. (2014). The parameters of the LSTM are trained to maximize the predicted probability of each word in neighboring sentences given the current sentence. Our skip-thought model is exactly the same as that of Kiros et al. (2015), with the only difference being that word embeddings are initialized using 300-dimensional GloVe vectors Pennington et al. (2014). Given the pre-trained model, we use the encoder to obtain sentence-level representations using Eqs. 3, 4 and 5, and then obtain document-level representations using Eq. 8.
We also use the topic weights from the LDA model as white paper features. Let K be the number of LDA topics, which is 50 in this work. Each document can then be represented as a K-dimensional vector describing a multinomial distribution over the 50 topics. We concatenate the deep learning vector and the LDA vector. The entire process is illustrated in Figure 7.
We map the founding team of an ICO project to the following features:
Again, we use a hierarchical neural model to map the bios of founding teams to a neural vector representation.
For each person, we obtained his or her full name, fed the name into the Google+ API (https://developers.google.com/+/api/), and crawled the information. One key challenge here is that one name can map to multiple Google+ accounts. We assume a person’s name is correctly mapped to a Google+ account if the company extracted from the person’s bio can be found in that Google+ account.
Based on the information extracted (both from Section 3.2 and crawled Google+ accounts), we include the following features:
Whether founding team’s bios can be found
Companies the person previously worked for
Number of jobs the person had in the past 3 years before ICO
Whether the person is involved in other ICO projects
We crawled the websites of each project and were able to retrieve information from 1,087 project websites. We map the website text to a vector representation using a hierarchical LSTM as described above. We also use a binary feature indicating whether an ICO project has a website.
We crawled the GitHub repository (if it exists) of each ICO project. A binary variable indicating whether an ICO project has a GitHub repository is included in the feature vector. We handle GitHub README files similarly to the white papers, using an encoder-decoder model to map the file to a vector representation. Additional features include the number of branches, the number of commits, the total lines of code, and the total number of files. We only consider the version before the time of ICO.
Additional features we consider include:
Platform, e.g., ETH.
Unlimited or hard cap: an unlimited cap allows investors to send unlimited funding to the project’s ICO wallet.
By concatenating white paper features, founding team features, website features and GitHub features, each ICO project is associated with a vector x. The predicted price change is then given by the learned function F(x).
Regularization is added, and the number of training iterations is treated as a hyperparameter tuned on the dev set.
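As a deliberately simplified sketch of such a regularized regressor, the snippet below trains a linear model over the concatenated feature vector by batch gradient descent with an L2 penalty (the paper's actual architecture and hyperparameters may differ; learning rate, penalty and iteration count here are toy values):

```python
def train_ridge(X, y, lr=0.01, lam=0.1, iters=500):
    """Linear predictor y_hat = w·x + b with squared loss and L2 penalty on w."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        # Gradient step; the lam * w term is the L2 regularization.
        w = [wj - lr * (gwj / n + lam * wj) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b
```

The predicted price change would then be thresholded against δ, as described in Section 4, to produce the binary scam label.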
We present experimental results and qualitative analysis in this section.
Tables 4, 5 and 6 report precision, recall and F1 for scam identification as feature sets are cumulatively added:

|Features|Precision|Recall|F1|
|— δ = 0.01 —||||
|white paper + …|0.34|0.93|0.49|
|white paper + …|0.36|0.94|0.52|
|white paper + founding team + …|0.37|0.95|0.53|
|— δ = 0.1 —||||
|white paper + …|0.70|0.80|0.75|
|white paper + …|0.72|0.82|0.76|
|white paper + founding team + …|0.73|0.84|0.78|
|— δ = 1 —||||
|white paper + …|0.77|0.74|0.75|
|white paper + …|0.80|0.76|0.78|
|white paper + founding team + …|0.83|0.77|0.80|
Tables 4, 5 and 6 present results for scam ICO project identification with respect to different values of the scam bar δ. As the value of δ increases from 0.01 to 0.1, then to 1, the proportion of scam projects increases, giving progressively higher precision and lower recall. The white paper and GitHub repository are the two most important classes of features, achieving approximate F1 scores of 0.7 when δ is set to 0.1 and 1. By adding more features, we are able to obtain progressively better precision and recall. The model achieves 0.83 precision, 0.77 recall and an F1 score of 0.80 in predicting scam ICO projects in the δ = 1 setting, when all features are considered.
We need to rationalize the output of the model. Deep learning models are hard to rationalize directly Lei et al. (2016); Mahendran and Vedaldi (2015); Weinzaepfel et al. (2011); Vondrick et al. (2013). This is because neural network models operate like a black box: they use vector representations (as opposed to human-interpretable features) to represent inputs, and apply multiple layers of non-linear transformations.
Various techniques have been proposed to make neural models interpretable Koh and Liang (2017); Montavon et al. (2017); Mahendran and Vedaldi (2015); Li et al. (2015a, 2016). The basic idea of these methods is to build another learning or visualization model on top of a pre-trained neural model for interpretation. We adopt two widely used methods for neural model visualization:
First-derivative saliency has been widely used for visualizing and understanding neural models Montavon et al. (2017); Simonyan et al. (2013); Li et al. (2015a). The basic idea of this saliency method is to compute the contribution of each cell/feature/representation to the final decision using the derivative of the final decision with respect to an input feature.
More formally, for a supervised model, an input x is associated with a class label y. A trained neural model associates the pair (x, y) with a score S_y(x). The saliency of a feature unit x_i is given by the absolute value of the derivative of the score with respect to that unit:

saliency(x_i) = | ∂S_y(x) / ∂x_i |

Suppose that one aspect a of an ICO project (e.g., the white paper) is represented by a vector vector(a). For example, the vector for a white paper is the concatenation of its topic vector output by LDA and the neural vector output by the hierarchical LSTM model. The saliency of the aspect is then the average of the saliency values of its constituent vector units.
The saliency of the four different aspects is illustrated in Figure 9. As can be seen, the white paper and GitHub repository are the most salient aspects.
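The saliency computation can be approximated without analytic gradients, using a central finite difference on any scoring function (the scorer below is a stand-in for illustration, not the paper's trained model; the aspect spans are hypothetical):

```python
def saliency(score, x, eps=1e-5):
    """|dS/dx_i| for each input unit, via central finite differences."""
    sal = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        sal.append(abs(score(xp) - score(xm)) / (2 * eps))
    return sal

def aspect_saliency(sal, spans):
    """Average the unit saliencies belonging to each named aspect."""
    return {name: sum(sal[lo:hi]) / (hi - lo) for name, (lo, hi) in spans.items()}
```

In a real system, `score` would be the trained model's scam score and `spans` the index ranges of the white paper, founding team, website and GitHub sub-vectors inside x.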
One problem with the first-derivative saliency method is that it is able to show which feature is important (in other words, salient), but cannot tell how positively or negatively a particular feature contributes to a decision. We thus adopt the representation erasing strategy for neural model visualization, which has been used in a variety of previous work Li et al. (2016). The basic idea of this method is as follows: how much a feature/cell/representation contributes to a decision is determined by the negative effect of erasing pieces of the representation.
More formally, for a pre-trained supervised model, S(x) denotes the score that the scam label is assigned to the input x, and S_{¬f}(x) denotes the score that the scam label is assigned to x with feature f erased. If f is a significant feature that leads the input ICO project to be classified as a scam, S(x) should be larger than S_{¬f}(x). Let I(f) denote the influence of feature f on the scam class, given as follows:

I(f) = ( S_{¬f}(x) − S(x) ) / S(x)
A negative value of I(f) means that feature f positively contributes to the input being classified as a scam project. By computing the influence score with respect to each LDA topic, we are also able to rank LDA topics by their risk of indicating a scam. We manually labeled 10 LDA topics, each of which has a clear meaning, and computed the influence score of each, as shown in Table 7. ICOs on gaming, gambling and entertainment are more likely to be scams than those on exchanges, payments and smart contracts.
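The erasure computation described above can be sketched as follows (zeroing out a feature's span is one common choice of "erasing"; the scorer is again an illustrative stand-in, not the paper's model):

```python
def erase(x, span):
    """Return a copy of x with the units in span zeroed out ("erased")."""
    lo, hi = span
    return x[:lo] + [0.0] * (hi - lo) + x[hi:]

def influence(score, x, spans):
    """I(f) = (S_{-f}(x) - S(x)) / S(x); negative means f pushes toward 'scam'."""
    s_full = score(x)
    return {name: (score(erase(x, span)) - s_full) / s_full
            for name, span in spans.items()}
```

Ranking LDA-topic spans by this influence score is what yields the topic risk ordering reported in Table 7.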
ICOs have become one of the most controversial topics in the financial world. For legitimate projects, they provide fairness in crowdfunding, but a lack of transparency, technical understanding and legality gives unscrupulous actors an incentive to launch scam ICO projects, bringing significant loss to individual investors and making the world of cryptofinance fraught with danger.
In this paper, we proposed the first machine learning–based scam-ICO identification system. We find that a well-designed neural network system is able to identify subtle warning signs hidden below the surface. By integrating different types of information about an ICO, the system is able to predict whether the price of a cryptocurrency will go down. We hope the proposed system will help investors identify scam ICO projects and attract more academic and public-sector work investigating this problem.
Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, Springer, pages 177–186.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, Association for Computational Linguistics, pages 142–147.
In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, pages 337–344.