Predicting the future success of startup companies is of great importance for both startup companies and venture capital (VC) firms. For startup companies, predicting their own future development and that of their competitors can help them adjust their development strategies and capture opportunities effectively. For VC firms, predicting the future success of startup companies helps them balance profit and risk.
For late-stage companies, the evaluation of future success is mostly based on financial and operating information. For early-stage companies, however, there is usually not enough publicly available data for prediction, and their evaluation traditionally relies heavily on investors' personal experience. In recent years, machine learning has developed rapidly and achieved great success in many areas. Some research applies machine learning algorithms to predict the future success of startup companies (Yankov et al., 2014; McKenzie and Sansone, 2017; Arroyo et al., 2019; Kaiser and Kuhn, 2020), but these methods are not well suited to sparse data, which is common in datasets of startup companies. With the development of machine learning methods, recent algorithms such as XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) have the potential to solve this data sparsity problem.
In this paper, we aim to make the following three contributions. First, we leverage recent progress in machine learning to handle this data sparsity problem. We validate that recently developed algorithms, such as XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017), and the soft Decision Tree (Frosst and Hinton, 2017), outperform many traditional algorithms, including Logistic Regression, K Nearest Neighbor, Decision Tree, Multilayer Perceptron, and Random Forests. Second, we construct 19 factors using data from Crunchbase (http://www.crunchbase.com), a data aggregation platform built to track startups on a global scale. We define multiple time windows to enrich the number of data samples and to take factors such as the macroeconomy into consideration. This time window definition is closer to actual VC practice. Third, we introduce interpretability into machine learning models. We interpret the predictions from the perspective of the contribution of each factor, finding that company age and past funding experience are the most important factors.
The rest of the paper is structured as follows: The previous research and theoretical background are reviewed in Sec 2. Our approach is introduced in Sec 3. Experimental results are discussed in Sec 4. The construction of the portfolio is described in Sec 5. Sec 6 summarizes our main conclusion and discusses future research directions.
2 Theoretical background
The problem of predicting the future success of startup companies has existed for a long time (Schendel and Hofer, 1979; Chandler and Hanks, 1993) and is still being explored by scholars (Arroyo et al., 2019; Kaiser and Kuhn, 2020).
Common solutions are classification models based on decisive factors. Most earlier studies on predicting the success of startup companies are based on regression analysis, such as logistic regression (Lussier, 1995; Kaiser and Kuhn, 2020), ordered probit models (Lussier and Pfeifer, 2001; Lussier and Halabi, 2010; Halabí and Lussier, 2014), and log-logistic hazard models (Holmes et al., 2010). Researchers have also developed expert systems (Ragothaman et al., 2003) and rule-based methods (Yankov et al., 2014).
In recent years, with the emergence of platforms aggregating business information about companies and the development of machine learning approaches, it has become possible to apply machine learning to the problem of startup prediction. Yankov et al. (2014) test several tree-based, rule-based, and Bayes-based machine learning methods on questionnaires gathered from 142 startup companies in Bulgaria. The authors find that the best-derived model is the tree-based C4.5 (Quinlan, 1993). McKenzie and Sansone (2017) compare the performance of human experts and several machine learning methods, including the Least Absolute Shrinkage and Selection Operator, Support Vector Machines, and Boosted Regression, analyzing 2,506 firms in a business plan competition in Nigeria. The authors suggest that investors use a combination of man and machine rather than relying solely on human judges or machine-learning-chosen portfolios. Arroyo et al. (2019)
analyze the performance of several machine learning methods on a dataset of over 120,000 startup companies retrieved from Crunchbase. They consider five machine learning algorithms: Support Vector Machines, Decision Tree, Random Forests, Extremely Randomized Trees, and Gradient Tree Boosting. The results suggest that Gradient Tree Boosting performs better in predicting the next funding round, while Random Forests and Extremely Randomized Trees perform better in predicting acquisitions. One common problem of startup company datasets is their sparse nature: for most early-stage companies, little data is publicly available. Current approaches are not well suited to this problem.
Machine learning has developed rapidly in recent years, and many new models have emerged. The Gradient Boosting Decision Tree (GBDT) (Friedman, 2001) is a highly effective and widely used machine learning method, owing to its efficiency, accuracy, and interpretability. Several effective implementations have appeared recently, including XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017). XGBoost is a scalable end-to-end tree boosting system that is widely used in data science and achieves excellent results. It proposes a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning, and it uses a second-order approximation of the convex loss function to optimize the objective quickly. LightGBM proposes the novel Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) methods. With GOSS, LightGBM obtains an accurate estimate of the information gain from a much smaller data size. EFB bundles mutually exclusive features using a greedy algorithm, addressing the data sparsity problem. Tree-based classifiers are usually preferred in investment-related research, mainly because they can be interpreted. The soft Decision Tree (Frosst and Hinton, 2017) distills the knowledge acquired by a neural network into a model that relies on hierarchical decisions, creating a more explicable model. To our knowledge, no previous approach applies these methods to predicting the future success of startup companies. We leverage this recent progress and compare the performance of these methods on this task.
3 Our Approach
We define the concept of ’success’ to include raising new funding, being acquired, or going for an IPO. To make the prediction closer to the reality of VC investment, we further restrict the concept of future success to be in a defined time window (18 months).
3.1 Problem Statement
We formulate the problem of evaluating startup companies as a binary classification problem. For each company $i$, we synthesize a set of variables $x_i \in \mathbb{R}^d$ to evaluate its future success, where $d$ is the number of features. The selection of the features is discussed in detail in Sec 3.3. The label $y_i$ is assigned according to whether the company will succeed in the defined time window:

$$y_i = \begin{cases} 1 & \text{if company } i \text{ succeeds within the time window,} \\ 0 & \text{otherwise.} \end{cases}$$

Our data set can be denoted as $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the size of the dataset. Given the labeled samples, the machine learning methods learn the conditional probability of $y$ given $x$, i.e. $P(y = 1 \mid x)$. Given a new data sample with no label, the model then outputs a prediction that places the sample into the class of success or failure:

$$\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \geq th, \\ 0 & \text{otherwise,} \end{cases}$$

where $th$ is a manually selected threshold. In our experiments, $th$ is 0.5 for fair comparison.
We then try to construct portfolios according to the suggestions of these algorithms. The aim of constructing a portfolio is to select a subset $S$ of companies of size $K$ such that we maximize the expected number of successful companies in $S$:

$$S = \arg\max_{S: |S| = K} \sum_{i \in S} P(y_i = 1 \mid x_i).$$

If we assume that the success of each company is independent, $S$ consists of the $K$ companies with the largest $P(y_i = 1 \mid x_i)$.
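The classification and portfolio-selection rules above can be sketched in a few lines. This is a minimal illustration; the company names and probabilities are made up, not taken from the paper:

```python
def classify(p_success, th=0.5):
    """Threshold rule: predict success (1) when P(y=1|x) >= th."""
    return 1 if p_success >= th else 0

def build_portfolio(probabilities, k):
    """Select the k companies with the largest predicted success
    probability; under the independence assumption this maximizes the
    expected number of successes in the portfolio."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [company for company, _ in ranked[:k]]

# Hypothetical predicted probabilities for four companies.
probs = {"A": 0.71, "B": 0.32, "C": 0.88, "D": 0.55}
print(classify(probs["B"]))       # 0: below the 0.5 threshold
print(build_portfolio(probs, 2))  # ['C', 'A']
```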
3.2 Data Preprocessing and Time Window
Our data sample is extracted from the daily CSV export of Crunchbase on October 20, 2020. The full dataset contains 1,166,402 organizations and 351,236 venture deals. We filter out companies whose date of establishment is missing, and companies founded before 1990 are removed. Unique ids are assigned to each company to distinguish duplicated company names. A total of 776,273 organizations remain after this preprocessing.
To make the prediction closer to the reality of VC investment, we restrict the concept of future success to be in a defined time window. We expect that the company will raise new funding, be acquired, or go for an IPO within a time threshold after the prediction. These evaluation time windows can be interpreted as the time intervals for investors to evaluate the return of investments. We use multiple time windows to enrich the number of data samples and take more factors like macroeconomy into consideration.
In practice, most VC firms believe the time segment between two funding rounds is usually around 18 months. Table 1 summarizes the time startup companies need to raise their next funding round, validating that more than half of the companies achieve their next round within 18 months in most rounds. We therefore define an evaluation time window of 18 months. The time intervals and the corresponding label distributions are shown in Table 2. (There might be survival bias since Crunchbase was founded in 2007; companies that failed before the creation of Crunchbase may not be registered in the dataset.) The time $t_0$ denotes the start of the evaluation window, which is the time we make the prediction and can be considered the moment VC investors invest in a company. The time $t_1$ denotes the end of the evaluation window. Companies that were acquired, went for an IPO, closed, or had no funding events before $t_0$ are removed. Accumulated over all time windows, the final data sample consists of 398,489 sample events.
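The labeling rule for one evaluation window can be sketched as follows. This is a simplified illustration; the function name and dates are hypothetical, and real preprocessing would also apply the removal rules described above:

```python
from datetime import date

def label_company(success_events, t0, t1):
    """Assign y = 1 if the company raises new funding, is acquired, or
    goes for an IPO inside the evaluation window (t0, t1]."""
    return int(any(t0 < e <= t1 for e in success_events))

# An 18-month window starting January 2009.
t0, t1 = date(2009, 1, 1), date(2010, 6, 30)
print(label_company([date(2010, 5, 5)], t0, t1))  # 1: event inside window
print(label_company([date(2011, 2, 1)], t0, t1))  # 0: event after window
```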
[Table 1: for each funding round, the mean interval, median interval, and 90th percentile interval (in months) to the next round, and the fraction of companies raising their next round within 18 months.]
3.3 Factor Exploration
Crunchbase provides information about companies, news, founders, funding rounds, and acquisitions. We compile a set of 19 factors, grouped in three categories related to the growth of companies, based on the information available in the dataset, as summarized in Table 3. Since we use multiple time windows, all the factors are computed relative to the beginning of the evaluation time window ($t_0$). Some variables that may have changed over time are omitted from our analysis.
|Factor||Description|
|com found year||number of years from 1990 to the year the company was founded|
|macroeconomy||number of newly established companies in the founding year|
|company age||how long the company has existed at $t_0$ (in months)|
|number of news||total number of news items before $t_0$|
|monthly average number of news||monthly average number of news items from the founding year to $t_0$|
|province (city) prosperity||number of companies headquartered in the area at $t_0$|
|mean (max) province (city) prosperity of industries||average and max local prosperity of all industries associated with the company at $t_0$|
|number of funding rounds||number of funding rounds the company achieved before $t_0$|
|total amount raised (in USD)||total amount raised in USD before $t_0$|
|mean (max) IPO fraction||average and max IPO fraction of all the investors of the company before $t_0$|
|mean (max) acquisition fraction||average and max acquisition fraction of all the investors of the company before $t_0$|
|mean (max) founder fail fraction||average and max fail fraction of each founder before $t_0$|
The first group of factors covers general information about companies. The year a company was founded and the economic situation of that year are important for startup companies (Holmes et al., 2010). Since we only consider companies founded after 1990, we use the years elapsed since 1990 to denote the founding year for convenience. We take the number of newly established companies in the founding year as an indicator of the macroeconomy. For different time windows, the company age at $t_0$ is an important indicator of the status of a company; it is counted in months in our analysis. News-related factors are useful measures of company performance (Xiang et al., 2012). We count the total number of news items and the monthly average number of news items associated with each company. Information about the geographic environment (Porter and Stern, 2001; Hoenen et al., 2012) and business sectors (Clarysse et al., 2011) is also very important for the development of startup companies. We quantify the prosperity of an area by the number of companies headquartered in the area registered in Crunchbase. Each company is associated with several industries in Crunchbase. The local prosperity of an industry is quantified by the number of companies associated with the industry in the area. We calculate these factors at the geographical granularity of province and city. When associating the prosperity of industries with a company, we calculate the average and max local prosperity over all industries associated with the company.
The second group of factors is related to funding rounds and investors. A startup company receives funding in a sequence of rounds. Past funding experience is very important for both startup companies and venture capital firms (Nahata, 2008; Nanda et al., 2020). We calculate the number of funding rounds the company achieved and the total amount raised in USD before $t_0$. Research (Nahata, 2008) shows that reputable VC firms are more likely to lead their companies to successful exits. The reputation of past investors is evaluated based on their historical investment data. For each investor, we define the IPO fraction as the number of venture deals the investor invested in that exited with an IPO before $t_0$, divided by the total number of venture deals of the investor. The definition of the acquisition fraction is similar, except that we use the fraction of venture deals that exited with an acquisition. A company often has more than one investor, so we utilize the average and max IPO and acquisition fraction over all its investors.
The third group of factors concerns the founders. The experience of the founders (Jenkins et al., 2014; Littunen and Niittykangas, 2010) and the founding team composition (Nann et al., 2010; Eesley et al., 2014) are also indicators that may influence the potential success of a company. We define the fail fraction of a founder as the fraction of companies the founder previously founded that failed before $t_0$. As a company may have more than one founder, we consider the average and max fail fraction over all founders.
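The investor-reputation factors can be sketched as follows. This is an illustrative computation under the definitions above; the investor names, deal outcomes, and helper names are hypothetical, not from the Crunchbase schema:

```python
def ipo_fraction(deal_outcomes):
    """Fraction of an investor's venture deals (before t0) that exited
    with an IPO."""
    return sum(1 for d in deal_outcomes if d == "ipo") / len(deal_outcomes)

def company_ipo_features(investor_deals):
    """Mean and max IPO fraction over all investors of a company."""
    fracs = [ipo_fraction(d) for d in investor_deals.values()]
    return sum(fracs) / len(fracs), max(fracs)

# Hypothetical deal histories for the two investors of one company.
deals = {"fund_a": ["ipo", "closed", "ipo", "operating"],
         "fund_b": ["closed", "ipo"]}
print(company_ipo_features(deals))  # (0.5, 0.5)
```

The acquisition fraction and the founder fail fraction follow the same pattern, with the outcome test swapped accordingly.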
3.4 Models and Algorithms
We test eight different machine learning classifier algorithms: Logistic Regression, K Nearest Neighbor, Decision Tree, Multilayer Perceptron, Random Forests, XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017), and the soft Decision Tree (Frosst and Hinton, 2017).
We use 90% of the sample for training and 10% for testing. For the training dataset, the Synthetic Minority Over-sampling TEchnique (SMOTE) (Chawla et al., 2002) is used to handle the class imbalance problem shown in Table 2. SMOTE over-samples the minority class by taking each minority class sample and introducing synthetic examples along the line segments joining its k nearest minority class neighbors (k = 5 in this paper). Besides over-sampling, some algorithms can handle class imbalance by assigning the samples of the minority class weights inversely proportional to class frequencies. For tree-based models (Decision Tree, Random Forests, XGBoost, and LightGBM), the weights are applied in the calculation of the split gain. For Logistic Regression, Multilayer Perceptron, and the soft Decision Tree, the weights are applied in the loss function. To tune hyperparameters efficiently, we use Bayesian optimization. We choose the best-performing hyperparameters for the experiments; some of the principal parameters are summarized in Table 4.
|Model||Principal hyperparameters|
|Random Forests||133 estimators, max depth 63|
|XGBoost||180 estimators, max depth 11|
|LightGBM||355 estimators, max depth 8|
|soft Decision Tree||tree depth 8, average weighted|
4 Experiments and Results
4.1 Performance Metrics
We use the following performance metrics:
Recall (True Positive Rate, TPR): the percentage of correctly predicted successful companies to all successful companies in reality.
Precision: the percentage of correctly predicted successful companies to the total companies classified as successful by the classifier.
F1-measure: the harmonic mean of precision and recall.
False Positive Rate (FPR): the proportion of failed companies incorrectly classified as successful by the classifier.
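These four metrics follow directly from confusion-matrix counts. A minimal sketch (the counts below are toy values, not the paper's results):

```python
def metrics(tp, fp, fn, tn):
    """Recall (TPR), precision, F1 (harmonic mean of precision and
    recall), and false positive rate from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return recall, precision, f1, fpr

# Toy confusion counts: 40 TP, 10 FP, 60 FN, 890 TN.
r, p, f1, fpr = metrics(40, 10, 60, 890)
print(round(r, 2), round(p, 2), round(f1, 2), round(fpr, 3))  # 0.4 0.8 0.53 0.011
```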
4.2 Experimental Results
The results of the different machine learning classifiers are shown in Table 5. The results without SMOTE or weight adjustment are shown in Table 5 (a). Due to the class imbalance problem, these results tend to have relatively high precision and low recall. The results with SMOTE over-sampling are shown in Table 5 (b). SMOTE alleviates the class imbalance problem and improves the recall metric to some extent. Adjusting the class weights in the model leads to better results for most models, as shown in Table 5 (c). The results show that LightGBM and XGBoost with weight adjustment perform best among the eight machine learning methods, achieving 53.03% and 52.96% F1, respectively. The data of startup companies are usually incomplete, especially for early-stage companies. LightGBM and XGBoost are sparsity-aware algorithms and handle this data sparsity problem efficiently. The ROC curves of the different models are shown in Fig 1.
4.3 Discussions on Single and Multiple Time Windows
We define multiple time windows to enrich the number of data samples and to conduct time-aware analysis. There may be a potential risk that the events that happen later in the training set can influence the prediction of earlier events, resulting in lower prediction power for future events. To eliminate this concern, we conduct several experiments comparing models using single and multiple time windows to validate the robustness in predicting current and future events. We use LightGBM for the experiments.
First, we compare the performance of models using single and multiple time windows in predicting in-sample current events. To do this, we use the test set of the last time window to evaluate the performance of the models. Take the time window from January 2009 to June 2010 as an example. For the single time window scenario, we take the 15,517 sample events in this time window and split them into 90% training and 10% test sets. For the multiple time window scenario, we take the 37,517 sample events from all the time windows before January 2009 and combine them with the training set of the current time window to form the training set. For a fair comparison, the test set of the multiple time window scenario is the same as that of the single time window scenario. For each time window, we train two models using LightGBM under the single and multiple time window scenarios, tuning the hyperparameters of each model with Bayesian optimization. The results are shown in Fig 2 (a). Since the time windows start from January 2000, the results of the two scenarios in the first time window are the same. The multiple time window scenario performs slightly better than the single time window scenario, showing that adding historical data helps improve performance. The difference is larger in earlier time windows, when the number of sample events is small. After July 2013, there are more than 40,000 startup companies in a single time window, which is large enough to provide sufficient information and diversity for the prediction. Thus the performance of the two scenarios is similar in recent time windows.
We also compare the performance of models using single and multiple time windows in predicting out-of-sample future events. For this purpose, we use models trained on historical time windows to predict the future success of the next time window. For example, to predict the future success of the 15,517 sample events in the time window from January 2009 to June 2010, we use models trained on the datasets of time windows before January 2009. For the single time window scenario, we take the 11,490 sample events in the time window from July 2007 to December 2008 as the training set. For the multiple time window scenario, we merge the 37,517 sample events of all the time windows before January 2009 to form the training set. The results are shown in Fig 2 (b). Since the time windows start from January 2000, the results of the two scenarios in the first time window are the same. The multiple time window scenario performs slightly better than the single time window scenario. This agrees with the in-sample experiment, showing that adding historical data has a positive impact on extrapolation to future events.
4.4 Factor Importance
Based on the experimental results, we use LightGBM to explore the importance of different factors. The importance of a factor is calculated as the total gain of the splits that use the factor. The higher the value, the more important and predictive the factor. As shown in Fig 3, the most important factors are company age and past funding experience. The reputation of past investors, local prosperity, the macroeconomy, and news also have some predictive power, while the experience of the founders has little influence on the future success of the companies.
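The total-gain importance measure amounts to summing, per factor, the gain of every split that uses it across all trees. A minimal sketch of the aggregation (the feature names and gain values are hypothetical; in practice LightGBM exposes this directly via its gain-based feature importance):

```python
from collections import defaultdict

def total_gain_importance(splits):
    """Factor importance as the total gain of all tree splits that use
    the factor, summed over every tree in the ensemble, sorted
    descending."""
    importance = defaultdict(float)
    for feature, gain in splits:
        importance[feature] += gain
    return dict(sorted(importance.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical (feature, split gain) pairs collected from a trained model.
splits = [("company age", 31), ("num funding rounds", 24),
          ("company age", 17), ("macroeconomy", 6)]
print(total_gain_importance(splits))
# {'company age': 48.0, 'num funding rounds': 24.0, 'macroeconomy': 6.0}
```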
4.5 Interpreting Model Predictions
When an investor decides whether to take actions based on a prediction, understanding why a model makes a certain prediction is also important. Explaining the reasons behind the prediction provides more insights into the model and makes the prediction more convincing. SHAP (SHapley Additive exPlanations) (Lundberg and Lee, 2017), a game-theoretic approach to explain the output of any machine learning model, is used to assign each feature an importance value for each prediction.
We take the company Market Logic Software as an example. The output of LightGBM based on the data on 2008-12-31 is 0.71, indicating that the company had a high success probability in the following 18 months (2009-01-01 to 2010-06-30). (Market Logic Software did in fact raise its next round on May 5, 2010.) Figure 4 shows how each factor contributes to this output. The base value is the average model output sampled from the training dataset. Features pushing the prediction higher are shown in red, while those pushing it lower are in blue. As shown in the figure, company age, the number of funding rounds, and reputable investors have positive effects on the future success of the company, while the local prosperity of its industrial sector is the main negative factor.
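The additive structure SHAP relies on can be illustrated without the library itself: for a single prediction, the base value plus the per-feature contributions sums to the model output. The attribution values below are hypothetical, chosen only to loosely echo the Market Logic Software example:

```python
def explain(base_value, contributions):
    """Additive feature attribution: model output equals the base value
    plus the sum of per-feature contributions (the SHAP property)."""
    output = base_value + sum(contributions.values())
    pushes_up = {f: c for f, c in contributions.items() if c > 0}
    pushes_down = {f: c for f, c in contributions.items() if c < 0}
    return output, pushes_up, pushes_down

# Hypothetical attributions for one company's prediction.
contrib = {"company age": 0.18, "num funding rounds": 0.12,
           "investor reputation": 0.06, "industry local prosperity": -0.10}
output, up, down = explain(0.45, contrib)
print(round(output, 2))  # 0.71
```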
5 Portfolio Construction and Model Validation
To validate that models trained on current values can predict the future success of companies, we validate the models on an out-of-sample period. We calculate the conditional probability of success of each company in the time window from Jan 1, 2019 to Jun 30, 2020, using their features on Dec 31, 2018. This out-of-sample set is composed of 121,462 companies, of which 20,184 are successful. This time window is later than the last time window used in training. We construct portfolios of different sizes with LightGBM, XGBoost, and Logistic Regression according to Eq. 3, and use the number of successful companies in each portfolio to evaluate the predictive power of these models. The number of successful companies in the portfolio versus the portfolio size is shown in Figure 5. The LightGBM and XGBoost models perform better than Logistic Regression, which agrees with the results on the in-sample test dataset and shows that these two algorithms have strong generalization and extrapolation power in predicting future events. In the portfolio of size 10 constructed by LightGBM, 8 companies succeeded in the following 18 months, as listed in Table 6.
The performance of several top venture capital firms is also shown in Figure 5. The portfolio size for each of these points is the number of companies the firm invested in during 2018. This result shows that the success rates of the machine learning models are higher than those of human experts.
|Company||Last Venture deal before 2019||Success probability||First venture deal from 2019-01-01 to 2020-06-30|
|Lyft||Series I on Jun 28, 2018||0.9751||IPO on Mar 29, 2019|
|Coinbase||Series E on Oct 30, 2018||0.9667||No Event|
|Grab||Series H on Dec 12, 2018||0.9593||Series H on Jan 7, 2019|
|Revolut||Series C on Apr 26, 2018||0.9590||Non Equity Assistance on Mar 27, 2019|
|Wealthsimple||Venture Round on Feb 22, 2018||0.9558||Venture Round on May 22, 2019|
|Improbable||Corporate Round on Jul 26, 2018||0.9549||No Event|
|Privitar||Corporate Round on Dec 10, 2018||0.9546||Series B on Jun 10, 2019|
|Stripe||Series E on Sep 27, 2018||0.9529||Series E+ on Jan 29, 2019|
|Kabbage||Debt Financing on Nov 16, 2017||0.9513||Debt Financing on Apr 8, 2019|
|BigBasket||Series E on Jul 18, 2018||0.9507||Series F on May 6, 2019|
Most of the companies listed in Table 6 are in their late stages. To further validate the robustness of our models in the early stages, we test their predictive power on early-stage companies. We select companies in investment stages before Series A, at Series A, and at Series B according to their last funding round before 2019. The top-ranking companies are listed in Tables 7, 8, and 9. The comparison with several top venture capital firms is shown in Figure 6. The performance of the machine learning models is still comparable to that of human experts.
|Company||Last Venture deal before 2019||Success probability||First venture deal from 2019-01-01 to 2020-06-30|
|Electron||Seed Round on Nov 20, 2018||0.9503||Seed Round on Mar 1, 2019|
|ACTO||Seed Round on Jan 18, 2018||0.9379||No Event|
|HqO||Seed Round on Sep 27, 2018||0.9322||Series A on Feb 8, 2019|
|Tomorrow Ideas||Convertible Note on Oct 3, 2018||0.9315||Venture Round on Nov 1, 2019|
|Optimal||Seed Round on Dec 22, 2017||0.9258||No Event|
|FunnelAI||Non Equity Assistance on Feb 21, 2018||0.9224||Seed Round on Mar 27, 2019|
|Sprout.ai||Pre Seed Round on Apr 2, 2018||0.9213||Seed Round on Jun 1, 2019|
|ICON||Seed Round on Oct 17, 2018||0.9195||Venture Round on Jan 23, 2020|
|Lifebit||Seed Round on Jul 19, 2018||0.9194||Seed Round on Apr 30, 2020|
|Liveoak Technologies||Seed Round on Sep 13, 2017||0.9183||Series A on Jun 4, 2019|
|Company||Last Venture deal before 2019||Success probability||First venture deal from 2019-01-01 to 2020-06-30|
|Honeycomb||Series A on Feb 1, 2018||0.9357||Series A+ on Sep 26, 2019|
|League Network MyDrCares FundMyTeam||Series A on Jan 12, 2018||0.9297||No Event|
|Bark Technologies||Series A on Aug 29, 2018||0.9259||Series B on Mar 1, 2020|
|Hometree||Series A on Sep 26, 2018||0.9241||Venture Round on Jan 1, 2019|
|Pypestream||Series A+ on Dec 13, 2018||0.9202||No Event|
|Inspectorio||Series A on Jul 11, 2018||0.9134||No Event|
|Triple W Japan||Series A+ on Nov 6, 2017||0.9111||Venture Round on Jan 1, 2019|
|Terminal||Series A on May 22, 2018||0.9104||Series B on Sep 26, 2019|
|Owkin||Series A+ on May 23, 2018||0.9100||Series A++ on Mar 7, 2019|
|Shipwell||Series A on Oct 9, 2018||0.9098||Series B on Oct 24, 2019|
|Company||Last Venture deal before 2019||Success probability||First venture deal from 2019-01-01 to 2020-06-30|
|BigID||Series B on Jun 25, 2018||0.9307||Series C on Sep 5, 2019|
|Valimail, Inc.||Series B on May 22, 2018||0.9274||Series C on Jun 19, 2019|
|Imperfect Foods||Series B on Jun 27, 2018||0.9231||Series B+ on Mar 31, 2020|
|Kayrros||Series B on Sep 18, 2018||0.9191||No Event|
|Aircall||Series B on May 15, 2018||0.9187||Series C on May 27, 2020|
|Zero Hash||Series B+ on Sep 12, 2018||0.9099||No Event|
|Thread||Series B on Oct 16, 2018||0.9074||Series B+ on Nov 20, 2019|
|Packet||Series B on Sep 10, 2018||0.9013||Acquired by Equinix on Jan 15, 2020|
|AXIOS Media||Series B on Nov 17, 2017||0.9010||Series C on Dec 29, 2019|
|Pagaya Investments||Series B on Aug 30, 2018||0.8979||Series C on Apr 3, 2019|
Our models could help investors decrease the failure rates of their portfolios. Note that the success rates of the machine learning methods are optimistic, since we do not consider many practical factors, such as funding size, investment preference, and whether the company is within reach.
6 Conclusion and Discussion
In this work, we address the data sparsity problem with recent machine learning methods. We analyze several machine learning methods using a large dataset derived from Crunchbase. We conduct a time-aware analysis based on multiple time windows, which is more practical in real-world scenarios. We expand the scope of success to include raising new funding, being acquired, or going for an IPO. The results show that the two sparsity-aware algorithms, LightGBM and XGBoost, perform best among the eight machine learning methods, achieving 53.03% and 52.96% F1 scores, respectively. Through feature mining, we find that company age and past funding experience are among the most important factors. We also interpret the predictions from the perspective of feature contribution. We construct portfolio suggestions according to these methods on out-of-sample periods, achieving better results than human experts. The results show that our methods have strong generalization and extrapolation power in predicting future events. These findings have substantial implications for how machine learning methods can help investors identify potential business opportunities.
Future studies will include integrating more data sources and discovering more features, such as features related to founders and public opinion. Instead of building a single model to predict all companies, we will try to build multiple models by sector, so that we can create customized features to improve prediction performance in different sectors. We will also focus on the generalization and interpretation of the models, such as introducing causal inference methods to extract the potential reasons behind a successful prediction.
We would like to thank Jiren Zhu and Haomin Wang for their support and great insights on success prediction. We would also like to thank Yaoqiang Xing for his support on data management. We would like to thank Dr. Kaifu Lee for reviewing the paper and giving very illuminating suggestions. Our work would not have been possible without their support.
- Arroyo et al. (2019) Arroyo, J., Corea, F., Jimenez-Diaz, G., Recio-Garcia, J.A., 2019. Assessment of machine learning performance for decision support in venture capital investments. IEEE Access 7, 124233–124243.
- Chandler and Hanks (1993) Chandler, G.N., Hanks, S.H., 1993. Measuring the performance of emerging businesses: A validation study. Journal of Business venturing 8, 391–408.
- Chawla et al. (2002) Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
- Chen and Guestrin (2016) Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- Clarysse et al. (2011) Clarysse, B., Tartari, V., Salter, A., 2011. The impact of entrepreneurial capacity, experience and organizational support on academic entrepreneurship. Research policy 40, 1084–1093.
- Eesley et al. (2014) Eesley, C.E., Hsu, D.H., Roberts, E.B., 2014. The contingent effects of top management teams on venture performance: Aligning founding team composition with innovation strategy and commercialization environment. Strategic Management Journal 35, 1798–1817.
- Friedman (2001) Friedman, J.H., 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189–1232.
- Frosst and Hinton (2017) Frosst, N., Hinton, G., 2017. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784.
- Halabí and Lussier (2014) Halabí, C.E., Lussier, R.N., 2014. A model for predicting small firm performance: Increasing the probability of entrepreneurial success in Chile. Journal of Small Business and Enterprise Development 21, 4–25.
- Hoenen et al. (2012) Hoenen, S., Kolympiris, C., Schoenmakers, W., 2012. Do patents increase venture capital investments between rounds of financing. Master’s thesis. Wageningen University and Research Center.
- Holmes et al. (2010) Holmes, P., Hunt, A., Stone, I., 2010. An analysis of new firm survival using a hazard function. Applied Economics 42, 185–195.
- Jenkins et al. (2014) Jenkins, A.S., Wiklund, J., Brundin, E., 2014. Individual responses to firm failure: Appraisals, grief, and the influence of prior failure experience. Journal of Business Venturing 29, 17–33.
- Kaiser and Kuhn (2020) Kaiser, U., Kuhn, J.M., 2020. The value of publicly available, textual and non-textual information for startup performance prediction. Journal of Business Venturing Insights 14, e00179.
- Ke et al. (2017) Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y., 2017. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems 30, 3146–3154.
- Littunen and Niittykangas (2010) Littunen, H., Niittykangas, H., 2010. The rapid growth of young firms during various stages of entrepreneurship. Journal of Small Business and Enterprise Development.
- Lundberg and Lee (2017) Lundberg, S.M., Lee, S.I., 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30, 4765–4774.
- Lussier (1995) Lussier, R.N., 1995. A nonfinancial business success versus failure prediction model for young firms. Journal of Small Business Management 33, 8.
- Lussier and Halabi (2010) Lussier, R.N., Halabi, C.E., 2010. A three-country comparison of the business success versus failure prediction model. Journal of Small Business Management 48, 360–377.
- Lussier and Pfeifer (2001) Lussier, R.N., Pfeifer, S., 2001. A crossnational prediction model for business success. Journal of Small Business Management 39, 228–239.
- McKenzie and Sansone (2017) McKenzie, D., Sansone, D., 2017. Man vs. machine in predicting successful entrepreneurs: evidence from a business plan competition in Nigeria. The World Bank.
- Nahata (2008) Nahata, R., 2008. Venture capital reputation and investment performance. Journal of Financial Economics 90, 127–151.
- Nanda et al. (2020) Nanda, R., Samila, S., Sorenson, O., 2020. The persistent effect of initial success: Evidence from venture capital. Journal of Financial Economics 137, 231–248.
- Nann et al. (2010) Nann, S., Krauss, J.S., Schober, M., Gloor, P.A., Fischbach, K., Führes, H., 2010. The power of alumni networks-success of startup companies correlates with online social network structure of its founders, in: MIT Sloan Research Paper, pp. 4766–10.
- Porter and Stern (2001) Porter, M.E., Stern, S., 2001. Innovation: location matters. MIT Sloan Management Review 42, 28.
- Quinlan (1993) Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
- Ragothaman et al. (2003) Ragothaman, S., Naik, B., Ramakrishnan, K., 2003. Predicting corporate acquisitions: An application of uncertain reasoning using rule induction. Information Systems Frontiers 5, 401–412.
- Schendel and Hofer (1979) Schendel, D.E., Hofer, C.W., 1979. A new view of business policy and planning. Strategic Management. Boston: Little, Brown.
- Swets (1988) Swets, J.A., 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285–1293.
- Xiang et al. (2012) Xiang, G., Zheng, Z., Wen, M., Hong, J., Rose, C., Liu, C., 2012. A supervised approach to predict company acquisition with factual and topic features using profiles and news articles on techcrunch, in: Proceedings of the 2012 International AAAI Conference on Web and Social Media, pp. 2690–2696.
- Yankov et al. (2014) Yankov, B., Ruskov, P., Haralampiev, K., 2014. Models and tools for technology start-up companies success analysis. Economic Alternatives 3, 1–10.