Combining Machine Learning Classifiers for Stock Trading with Effective Feature Extraction

by   A. K. M. Amanat Ullah, et al.

The unpredictability and volatility of the stock market render it challenging to make a substantial profit using any generalized scheme. This paper discusses our machine learning model, which can make a significant amount of profit in the US stock market by performing live trading on the Quantopian platform while using resources free of cost. Our top approach was ensemble learning with four classifiers: Gaussian Naive Bayes, Decision Tree, Logistic Regression with L1 regularization, and Stochastic Gradient Descent, used to decide whether to go long or short on a particular stock. Our best model performed daily trades between July 2011 and January 2019, generating 54.35% profit. Finally, our work showed that mixtures of weighted classifiers outperform any individual predictor at making trading decisions in the stock market.




1 Introduction

1.1 Background

Stocks are essentially small pieces of ownership of a company, and the stock market works like an auction where investors buy and sell stocks. Owning stock means the shareholder owns a proportion of the company equal to the number of shares the person bought, against the total outstanding shares of the company. For example, if a company has 1 million shares and an individual owns 50,000 shares of the company, the person has a 5% stake in it.

1.1.1 Investing in Stocks

According to numerous studies, stocks produce greater returns than other assets. Stock returns come mainly from capital gains and dividends. A capital gain is realized when you sell a stock at a higher price than the one at which you purchased it. Dividends are the share of its profit that a company distributes to its shareholders. According to S&P Dow Jones Indices, since 1926 dividends have contributed about a third of investment returns, while the other two thirds have come from capital gains.

The prospect of buying shares from largely successful companies such as Apple, Amazon, Facebook, Google, and Netflix, together denoted by the famous acronym FAANG, during the early stages of stock trading can seem tempting. Investors with a high tolerance for risk would lean more towards capital gains for earning profit rather than dividends. Others who prefer a more conservative approach may choose to stick with stocks which have historically been known to provide consistent and significant dividends.

1.1.2 Stock classification

Several classification methods can be applied to categorize stocks. They are usually classified in two ways: by sector or by market capitalization.

Market capitalization is the total market value of a company's outstanding shares. It is found by multiplying the present market price of a share by the total number of shares outstanding. Companies with a market capitalization of $10 billion or more are classified as large-cap, companies between $2 billion and $10 billion are mid-cap, and small-cap companies are those with a market cap between $300 million and $2 billion.
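As a quick illustration (ours, not from the paper), the definition and tier thresholds above can be encoded directly; the function name is our own:

```python
def market_cap_tier(price_per_share, shares_outstanding):
    """Classify a company by market capitalization (price x shares outstanding)."""
    cap = price_per_share * shares_outstanding
    if cap >= 10e9:
        return "large-cap"
    if cap >= 2e9:
        return "mid-cap"
    if cap >= 300e6:
        return "small-cap"
    return "below small-cap"

# A company with 50 million shares trading at $250: cap = $12.5 billion.
print(market_cap_tier(250, 50_000_000))  # large-cap
```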

The Global Industry Classification Standard (GICS) is the industry standard for sector-wise classification of stocks. Developed by S&P Dow Jones Indices and MSCI (Morgan Stanley Capital International) in 1999, the GICS is a useful tool that reflects the dimensions and progress of the industry sectors. The four-tier classification system is made up of 24 industry groups across 11 sectors. The sectors are listed in table 1.

Different Sectors
Health Care
Real Estate
Communication Services
Consumer Staples
Information Technology
Consumer Discretionary
Energy
Materials
Industrials
Financials
Utilities
Table 1: Different Sectors of the Stock Market

Sector classification allows traders to invest according to their risk preferences. A conservative investor, for example, may opt to buy stocks from industries with more stable prices that consistently provide dividends. Others who pursue a high-risk, high-return strategy may buy stocks from sectors such as energy, financials, and IT.

1.1.3 Long-short investment strategy

Traditionally, stock investing focused on finding stocks to buy long that are likely to appreciate [jacobs1993long]. There was little, if any, thought given to capitalizing on short selling overvalued stocks. When investors began to employ both long and short strategies in their portfolios, benefits and opportunities presented themselves that were previously unavailable.

Buying long is simply buying stock that you think will appreciate, and selling it for a profit when the price rises. For instance, imagine that you bought 500 shares of a stock ABC at $10 per share, for a total of $5000. After a week, the price of a share of ABC rises to $11. You sell the stock for $5500, pocketing a profit of $500.

Shorting is borrowing stocks that you expect will depreciate from a broker, at interest, and selling them while you wait for the price to drop. Once the price has fallen far enough, you pay back the lender by buying the same number of shares you borrowed, at the lower price. Your profit is the difference in price minus the interest and commissions.

For instance, you borrow 100 shares of XYZ at $50 per share and immediately sell them for $5000 while waiting for the share price to fall. Once the price per share of XYZ has dropped to $45, you buy 100 shares back for $4500. You return the 100 shares to the lender, and whatever remains after interest and commissions is your profit: in this case, $500 before those costs.
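The arithmetic of the two worked examples above can be checked with a small sketch (illustrative only; the function names are ours):

```python
def long_profit(shares, buy_price, sell_price):
    """Profit from a long position: buy low, sell high."""
    return shares * (sell_price - buy_price)

def short_profit(shares, sell_price, buyback_price, costs=0.0):
    """Profit from a short position: sell borrowed shares high,
    buy them back low, minus interest and commissions."""
    return shares * (sell_price - buyback_price) - costs

# The long example: 500 shares bought at $10 and sold at $11.
print(long_profit(500, 10, 11))   # 500
# The short example: 100 shares of XYZ sold at $50, bought back at $45.
print(short_profit(100, 50, 45))  # 500
```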

1.2 Motivation

The stock market is notoriously unpredictable and difficult to forecast; some believe it cannot be predicted at all. We believe that with advances in machine learning algorithms and artificial intelligence, we can predict stock market trends with useful accuracy, provided we feed sufficient, refined data to our models. However, large, useful datasets are often not as accessible as we would like them to be. Our supervisor introduced us to the Quantopian platform, and it was mainly using Quantopian's free resources and public datasets that we were able to implement our trading algorithm. Quantopian also allowed us to test our algorithm with its live trading feature, so we could see how it would perform in real life. This encouraged us to believe the problem is tractable, and an accurate predictor of stock market trends is highly valuable, as any investor will tell you.

1.3 Objective

The goal of this research is to describe a prediction model that can predict stock market trends with sufficient accuracy and profitability. While there have been many attempts at this in the past, we wanted to find out whether we could do it using the resources available to us, free of cost. At first we planned to do this with Bitcoin, but we eventually widened our focus to a more generalized algorithm for the stock market. This thesis reflects our research on the stock market, the various ML algorithms used in the past to predict stock market trends, and the specific features, classifiers, and datasets needed to do so accurately.

1.4 Paper orientation

In the next chapter we discuss how others have used machine learning models to predict trends in the stock market. Chapter 3, following the literature review, describes our research plan and workflow. Chapter 4 describes the factors, classifiers, and datasets we used. Chapter 5 covers the algorithms used to obtain our results, and the following chapter discusses those results. The final chapter concludes the paper and discusses the limitations of our thesis and future prospects.

2 Literature Review

The prevalence of volatility in the stock market makes predicting stock prices anything but simple. Before investing, investors perform two kinds of analysis [patel2015predicting]. The first is fundamental analysis, where investors look into the value of the stock, the industry's performance, economic factors, etc., and decide whether or not to invest. Technical analysis is the second, more advanced form, which evaluates stocks through statistics and activity in the current market, such as traded volume and previous price levels [patel2015predicting]. Technical analysts use charts to recognize patterns and try to predict how a stock price will change. Malkiel and Fama's efficient market hypothesis states that predicting stock values from financial information is possible because the prices are informationally efficient [malkiel1970efficient]. As many unpredictable variables influence stocks and the stock market in general, it seems logical that factors such as the public image of the company and the political scenario of a country will be reflected in the prices. If the data obtained from stock prices is sufficiently preprocessed, and the algorithms and their factors are appropriate, it may be possible to predict a stock or stock price index.

There have been quite a few different implementations of machine learning algorithms for making stock market price predictions. Different papers experimented with different machine learning algorithms to determine which models produced the best results. Dai et al. attempted to narrow down the environment by selecting certain criteria [Dai_automatedstock]. Under these criteria, they were able to achieve a profit of 0.0123 with 30.05% recall, 38.39% accuracy, and 55.07% precision, using a logistic regression model trained for an hour. Zheng and Jin observed that, compared with logistic regression, a Bayesian network, and a simple neural network, a support vector machine with a radial kernel gave the most satisfactory results. Due to their limited processing power, they were only able to use a subset of their data for training, and recommended that a more powerful processor be used to achieve better results. Similar recommendations were made by G. Chen et al., who stated that their preferred model, the Long Short-Term Memory (LSTM), would have performed better had they been able to train the different layers and neurons with more computing power. Since the data was non-linear in nature, a recurrent neural network (RNN) was better suited to the task.

In [hegazy2014machine] it was discussed that, for stock price prediction, the once-popular ANN suffers from overfitting due to the large number of parameters it needs to fit [tao2004input]. This is where the support vector machine (SVM) came into play; it was suggested as an alternative that avoids such limitations, since according to VC theory [vapnik2013nature] SVM finds globally optimal solutions, unlike ANN solutions, which mostly tend to fall into local minima. Using an SVM model, prediction accuracy of around 57% was reported [kim2003financial]. Another form of SVM is the LS-SVM (least squares support vector machine). In [madge2015predicting] it was noted that if the input parameters of the LS-SVM are tuned and refined, the output of this classification algorithm improves even further, showing promise as a very powerful method to keep an eye on. SVM, being this powerful and popular, is now almost always considered when predicting prices in a volatile market, and we therefore think that incorporating it into our research will boost our chances of a positive result.

While classical regression was once more commonly used, non-linear machine learning algorithms are increasingly applied, since trading data is regarded as time-series data that is non-stationary in nature. Artificial neural networks (ANN) and SVM remain among the most popular methods today, and every algorithm has a unique learning process. ANN simulates the workings of a human brain by creating a network of neurons [patel2015predicting]. The Hidden Markov Model (HMM), ANN, and genetic algorithms (GA) were combined into one fusion model to predict market behaviour [hassan2007fusion]: the stock prices were converted to distinct value sets using the ANN, which then became input for the HMM. Using a selection of features determined from ARIMA analyses, Wang and Leu [wang1996stock] designed a prediction model that was helpful in predicting trends in the Taiwanese stock market. After the networks were trained with four years of weekly data, this produced an acceptable level of accuracy in predicting market trends up to six weeks ahead [patel2015predicting]. A hybridized soft computing algorithm was proposed by Abraham et al. for automatic market prediction and pattern analysis [abraham2001hybrid]. They used the Nasdaq-100 index of the Nasdaq stock market to forecast one day ahead with neural networks, and a neuro-fuzzy system was used to analyze the predicted values; this system produced promising results. A probabilistic neural network (PNN) model was trained on historical data by Chen et al. for investment purposes [chen2003application]. When set against other investment strategies, namely the buy-and-hold approach and those using forecasts estimated by a random walk or a parametric Gaussian Mixture Model, the PNN-based investment strategies produced better results.

The well-known SVM algorithm, which separates classes by searching for a hyperplane in a higher-dimensional space, was developed by Vapnik [vapnik1999overview]. To test the predictability of price trends in the NIKKEI 225 index, Huang et al. used an SVM to make forecasts [huang2005forecasting]. They also made comparisons with other classification methods, such as Elman backpropagation neural networks, quadratic discriminant analysis, and linear discriminant analysis; the SVM produced better experimental results. Kim compared the use of SVM for predicting daily stock price direction in the Korean stock market against case-based reasoning and a neural network [kim2003financial]. The initial attributes were made up of twelve technical indicators, and SVM was shown to produce better results.

Ensemble methods such as random forests help reduce the probability of overfitting the data; random forests use decision trees and majority voting to obtain reliable results. To analyze stock returns, Lin et al. tested a prediction model that used the classifier ensemble method [tsai2011predicting], taking bagging and majority voting into consideration. When the performances of models using an ensemble of several classifiers and those using single baseline classifiers were compared, the single-classifier models under-performed the multiple-classifier ones in terms of ROI and accuracy [patel2015predicting]. An SVM-ensemble-based Financial Distress Prediction (FDP) method was proposed by Sun and Li [sun2012financial]. Both individual performance and diversity analysis were used to select the base classifiers for the SVM ensemble from the potential candidates, and the SVM ensemble produced superior results compared to the individual SVM classifier. A total of ten data mining techniques, including KNN, Naive Bayes with kernel estimation, linear discriminant analysis (LDA), and least squares SVM, were used by Ou and Wang to forecast price fluctuations in the Hong Kong stock market [ou2009prediction]. The SVM and LS-SVM were shown to produce better predictions than the other models.

As discussed above, the approach an algorithm takes to predicting changes in the stock market is unique to that algorithm, and each algorithm also has its own unique set of limitations to consider. Moreover, the output depends not only on the choice of algorithm but also on the representation of the input; prediction accuracy can thus be improved by identifying and using a set of important features instead of all of them. By putting together support vector regression (SVR) and a self-organizing map (SOM), Hsu and his fellow researchers designed a two-layer architecture. Using the SOM, the input space was split into regions where data points clump together, in order to properly dissect the non-linear nature of financial data. Once the heterogeneous data was transformed into several homogeneous regions, the SVR was run to make predictions. The two-stage architecture yielded potentially significant results for predicting stock prices. Variants of genetic programming (GP) have also been tried for modeling financial markets. To ensure generalization, the model was further extended with Multi Expression Programming, Gene Expression Programming, boosting, and deep learning methods [patel2015predicting]. While modeling the stocks in the NYSE (New York Stock Exchange), Garg et al. analyzed the degree to which model selection criteria affected performance [garg2013empirical]; the results indicated that the FPE criterion was a better fit for the GP model than the other model selection criteria. To predict the closing values of five international stock indices, Nair et al. made use of an adaptive neural network [nair2011ga]. A genetic algorithm helped the system adapt to dynamic market conditions by fine-tuning the neural network parameters after each trade [patel2015predicting]. Using different neural network models trained with 14 years of data from NSE Nifty and BSE Sensex, Mantri et al. tried to estimate the volatility of the Indian stock market [mantri2014artificial]; they concluded that these models showed no significant difference between the volatility estimates for Nifty and Sensex [patel2015predicting]. Mishra et al. tested the rate-of-returns series of six Indian market indices for nonlinear dependency and chaos [mishra2011search]; the research indicated that the returns did not follow a random walk process [patel2015predicting]. To analyze and predict price variations, Liu and Wang implemented improved neural network models under the assumption that an investor's purchasing decision relies on historical stock market data [liu2012fluctuation]. Araújo and Ferreira proposed the Morphological Rank Linear Forecasting model and compared their results with those of the Time-delay Added Evolutionary Forecasting and Multilayer Perceptron network methods.


3 Research Plan

The entire system is built in the Quantopian working environment. It provides a large variety of financial data on major US stocks, from 2002 to the current date. Quantopian has a large number of factors at its disposal for us to use and is also flexible enough to let us create our own custom factors. These factors are a necessity when it comes to predicting future market prices with any machine learning algorithm.

Figure 1: Flowchart of Working Plan

The diagram in figure 1 shows the complete workflow of our model. The processes include stock selection, feature selection, data pre-processing, training and generating predictions using machine learning algorithms, and finally adjusting the portfolio according to the predictions.

Using the Quantopian algorithm feature, we implemented our own strategy, incorporating several machine learning algorithms into it. We started by collecting data on 1500 of the top stocks in the market using the Q1500US function provided by Quantopian. Next, we imported all the factors Quantopian provides, as well as factors from TA-Lib (a widely used technical analysis library), and also built some custom factors ourselves. Some of the factors are Asset Growth 3M, Earnings Quality, Capex to Cashflows, EBIT to Assets, EBITDA Yield, MACD Signal Line, AD, ADX, APO, ATR, BETA, MFI, Mean Reversion 1M, and Asset to Equity Ratio, resulting in a total of 39 features.

What we need to realize, however, is that due to the volatility of the market, simply feeding all the factors into the ML algorithm will not give a very consistent result overall, because a given factor may affect the prediction positively or negatively in different time periods with respect to the market. To overcome this problem, we implemented feature reduction dynamically, which we describe later after touching on other important points.

After collecting the features, we set the number of stocks we wanted to trade, the machine learning window length, the Nth forward day we wanted to predict (in this case set to 5), and the trading frequency, i.e. the number of days after which we initiate a trade. From the 1500 stocks imported earlier, we sort and trade only on two separate quantiles: the upper 30% and the lower 30%. We perform this slicing to avoid trading stocks whose prices change at a very steady rate, and to trade only stocks placed higher or lower on the ladder, on which we can go long (upper 30%) or short (lower 30%) with a significant success rate. We set the upper 30% to 1 (long), the lower 30% to -1 (short), and the remaining 40% to 0, meaning we do not trade them. Together, the upper and lower quantiles yield the 500 stocks we set earlier. We stripped the label (returns) from the pipeline and computed it 5 days forward. The last 5 days of data had to be discarded, because no 5-day-forward data exists for them, which would result in NaN labels; these rows were therefore dropped from the pipeline dataframe. The label column was kept separate to pass to the ML algorithm. Since Quantopian does not support machine learning and data preprocessing inside the pipeline, we had to handle all of it outside.
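A minimal, self-contained sketch of this labeling scheme, using pandas on synthetic prices instead of Quantopian's pipeline (function and variable names are our own; the real implementation runs inside Quantopian):

```python
import numpy as np
import pandas as pd

def label_universe(prices: pd.DataFrame, horizon: int = 5,
                   quantile: float = 0.30) -> pd.DataFrame:
    """Label each stock per day: +1 (long) for the top `quantile` of
    N-day-forward returns, -1 (short) for the bottom, 0 otherwise.
    The last `horizon` rows have no forward return and are dropped."""
    fwd = prices.shift(-horizon) / prices - 1.0   # N-day forward return
    fwd = fwd.iloc[:-horizon]                     # drop the NaN tail
    hi = fwd.quantile(1 - quantile, axis=1)       # per-day upper cutoff
    lo = fwd.quantile(quantile, axis=1)           # per-day lower cutoff
    labels = pd.DataFrame(0, index=fwd.index, columns=fwd.columns)
    labels[fwd.ge(hi, axis=0)] = 1
    labels[fwd.le(lo, axis=0)] = -1
    return labels

rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, (30, 10)), axis=0)),
    columns=[f"S{i}" for i in range(10)])
labels = label_universe(prices)
print(labels.shape)  # (25, 10): the last 5 days are dropped
```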

After preprocessing the data, we create a new column in the pipeline called ML and call the machine learning function to fill it in for each stock on the given day. The parameters of the ML function are the universe and all the columns of the pipeline, i.e. the factors and labels we calculated. This is where we perform the factor reduction mentioned earlier. The process runs dynamically throughout training: every time we train the algorithm, we train it with only the top 15 features, chosen by the SelectKBest feature selection method.
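The dynamic top-15 selection can be sketched with scikit-learn's SelectKBest, which this section names; the synthetic data and the score function (`f_classif`) below are our assumptions:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 39))   # 39 candidate factors, as in our feature set
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Re-fit at every training step, so the "best" 15 factors can change
# as market conditions change.
selector = SelectKBest(score_func=f_classif, k=15)
X_top = selector.fit_transform(X, y)
print(X_top.shape)                  # (500, 15)
print(selector.get_support().sum()) # 15
```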

Figure 2: Structure of the Ensemble Learning Model Used

Figure 2 illustrates our ensemble learning model. Here, four machine learning algorithms each produce a hypothesis, and each hypothesis gives us an output. These four outputs are combined by equal-weighted ensembling to generate the final output. An initial problem we faced while implementing the ML algorithm was that whenever we tried to use three or more high-complexity classifiers (SVM, AdaBoost, etc.) along with dynamic feature selection, we ran into time-limit errors. We therefore selected algorithms with very short runtimes, usually around 2-3 seconds to train and test, which is very important in a live trading algorithm.
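A sketch of the equal-weighted four-classifier ensemble using scikit-learn's VotingClassifier on toy data (the hyperparameters and data here are our assumptions, not the paper's exact settings):

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 15))             # 15 selected features
y = (X[:, 0] - X[:, 1] > 0).astype(int)    # toy long/short labels

# Equal-weighted majority vote over the four base classifiers
# named in the abstract.
ensemble = VotingClassifier(
    estimators=[
        ("gnb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("l1lr", LogisticRegression(penalty="l1", solver="liblinear")),
        ("sgd", SGDClassifier()),
    ],
    voting="hard",
)
ensemble.fit(X, y)
preds = ensemble.predict(X)
print(preds.shape)  # (400,)
```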

4 Data processing

4.1 Primary Features

Our primary data is the daily (1-day) open, low, high, close, and volume of each stock in our tradable universe. Additionally, the daily balance sheet, cash flow statement, income statement, operation ratios, earnings report, valuation, and valuation ratios are taken as primary data. Using these primary data, the secondary factors were created. These factors are commonly used in financial prediction by traders.
bs  = morningstar.balanceSheet
cfs = morningstar.cashFlowStatement
inc = morningstar.incomeStatement   # "is" is a Python keyword, so use "inc"
opr = morningstar.operationRatios   # "or" is a Python keyword, so use "opr"
er  = morningstar.earningsReport
v   = morningstar.valuation
vr  = morningstar.valuationRatios

4.2 Secondary Features

Balance Sheet: The balance sheet is a very important financial statement, central to both financial modeling and accounting. It is used to portray a company's total assets. The balance sheet follows the equation:

Assets = Liabilities + Shareholders' Equity
Cash Flow Statement (CFS): A measure of how a company manages its financial strength and liquidity. It is highly correlated with the balance sheet and income statement and can be used to analyze a company.

Income Statement: Used for reporting a company's financial performance over a specific accounting period. It summarizes a company's total revenues and expenses over that period.


Operating Ratio: Shows the efficiency of a company by comparing total operating expenses to net sales. The lower the ratio, the more efficient the company is at generating revenue.


Earnings Report: A quarterly report made by companies to disclose their performance.

Valuation: Used to determine the current worth of a company's assets.

ADX and DX: ADX is a technical indicator used to measure the strength of a trend. That strength can be either positive or negative, shown by the two indicators +DI and -DI, so an ADX plot commonly includes three separate lines. DX is a technical indicator used to predict which side the market is diverging to; the two components of the DMI are +DI and -DI.


APO: The Absolute Price Oscillator computes the difference between two exponential moving averages of different lengths. When the APO indicator goes above zero we go long (bullish), and when it goes below zero we go short (bearish).
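One common definition of APO (the one TA-Lib uses) is the fast EMA minus the slow EMA; a small pandas sketch, with parameter choices of our own:

```python
import pandas as pd

def apo(close: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
    """Absolute Price Oscillator: fast EMA minus slow EMA.
    Positive -> bullish (go long); negative -> bearish (go short)."""
    return close.ewm(span=fast, adjust=False).mean() \
         - close.ewm(span=slow, adjust=False).mean()

close = pd.Series([10, 11, 12, 13, 12, 11, 10, 9, 8, 9], dtype=float)
signal = apo(close, fast=3, slow=5)
# Positive while prices rise, negative after the decline sets in.
print(signal.round(3).tolist())
```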


Mean Reversion Theory: Used to statistically analyze the market condition, which can affect the overall trading strategy. Mean reversion also takes advantage of extreme price fluctuations of particular stocks, and can be applied to both buying and selling strategies.


CMO: The Chande Momentum Oscillator is similar to other momentum oscillators. It calculates momentum for both up days and down days, but does not smooth out the results. The oscillator ranges between +100 and -100.



Returns: An indication of the total money made or lost during transactions. Returns can be expressed as a ratio of profit to investment.


Williams %R: Known as the Williams Percent Range, a momentum indicator that ranges between 0 and -100 and measures overbought and oversold levels. It is used to find the most opportune times to enter and exit the market.
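Williams %R can be sketched as follows (a standard formulation; the helper name and toy data are ours):

```python
import pandas as pd

def williams_r(high: pd.Series, low: pd.Series, close: pd.Series,
               period: int = 14) -> pd.Series:
    """Williams %R: where the close sits within the recent high-low
    range, scaled to 0 (overbought) .. -100 (oversold)."""
    hh = high.rolling(period).max()
    ll = low.rolling(period).min()
    return -100 * (hh - close) / (hh - ll)

high = pd.Series([10.0, 11.0, 12.0])
low = pd.Series([9.0, 10.0, 11.0])
close = pd.Series([9.5, 10.5, 12.0])
wr = williams_r(high, low, close, period=3)
# Close at the very top of the 3-day range -> %R of 0 (overbought end).
```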


ATR: The Average True Range measures market volatility by decomposing the entire range of an asset's price. Stocks with higher volatility have a higher ATR and vice versa. It acts as an indicator for traders to enter and exit trades.
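A standard ATR computation, sketched in pandas (our helper, not the paper's code):

```python
import pandas as pd

def atr(high: pd.Series, low: pd.Series, close: pd.Series,
        period: int = 14) -> pd.Series:
    """Average True Range: rolling mean of the true range, where
    true range = max(high - low, |high - prev close|, |low - prev close|)."""
    prev_close = close.shift(1)
    tr = pd.concat([high - low,
                    (high - prev_close).abs(),
                    (low - prev_close).abs()], axis=1).max(axis=1)
    return tr.rolling(period).mean()

# With a constant $10-$11 range, the ATR settles at exactly 1.0.
high = pd.Series([11.0] * 5)
low = pd.Series([10.0] * 5)
close = pd.Series([10.5] * 5)
print(atr(high, low, close, period=3).iloc[-1])  # 1.0
```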


AD: The Accumulation/Distribution indicator uses volume and price to determine whether a stock is being accumulated or distributed. It looks for divergences between stock price and volume flow, providing a hint of how strong a trend is. The formula for the Accumulation/Distribution indicator is:

Money Flow Multiplier = ((Close - Low) - (High - Close)) / (High - Low)
Money Flow Volume = Money Flow Multiplier x Volume
AD = previous AD + Money Flow Volume

BETA: A coefficient measuring the volatility of an individual stock relative to the entire market. Statistically, beta is the slope of the regression line of the stock's returns against the market's returns. By definition, the market's beta is 1.0.
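Beta as the regression slope can be computed as covariance over variance; a small NumPy sketch (illustrative, with toy return series):

```python
import numpy as np

def beta(stock_returns, market_returns):
    """Beta as the slope of the regression of stock returns on market
    returns: cov(stock, market) / var(market)."""
    stock = np.asarray(stock_returns)
    market = np.asarray(market_returns)
    return np.cov(stock, market)[0, 1] / np.var(market, ddof=1)

market = np.array([0.01, -0.02, 0.03, 0.00, 0.015])
print(beta(2 * market, market))  # 2.0: moves twice as much as the market
print(beta(market, market))      # 1.0: the market against itself
```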


MP: The midprice calculates the mean of the high and low of a stock candle.


MFI: The Money Flow Index is a technical indicator that uses price and volume to identify whether a stock is overvalued or undervalued. It can be used to spot changes in the daily price of the stock. The oscillator ranges between 0 and 100.


PPO: The Percentage Price Oscillator shows the relationship between two moving averages as a percentage. It compares the asset against its benchmark and market volatility to produce trend signals and help predict the trend of the market.


Asset to Equity Ratio: Shows the ratio of a firm's total assets to its total shareholders' equity. The higher the ratio, the greater the firm's debt.

Capex to Cash Flow: Used to estimate a company's investment in long-term assets relative to how much cash the company is able to generate.


Asset Growth: The growth of the overall assets of a company.


EBIT to Assets: An indicator of a company's earnings generated from operations and trades, ignoring tax burden and capital structure.


EBITDA Yield: Usually reported in a quarterly earnings press release. It ignores taxes and non-operating expenses, highlighting only what is important for the market analyst to focus on.

MACD Signal Line: This indicator shows the relationship between two of a stock's moving averages. After the MACD is calculated, a nine-day EMA line is drawn over the MACD line to use as a buy or sell indicator.
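The MACD line and its nine-day signal line can be sketched in pandas (the standard 12/26/9 parameters are assumed; the helper is ours):

```python
import pandas as pd

def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """MACD line = fast EMA - slow EMA; signal line = 9-day EMA of MACD.
    MACD crossing above its signal line is a common buy indicator."""
    macd_line = close.ewm(span=fast, adjust=False).mean() \
              - close.ewm(span=slow, adjust=False).mean()
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line, signal_line

# For steadily rising prices, MACD stays positive and above its signal line.
close = pd.Series(range(1, 61), dtype=float)
macd_line, signal_line = macd(close)
```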


Money Flow Volume: Money flow indicates when, and at which price, the stock was purchased. If more of the security was bought on upticks than on downticks, the indicator is positive, because most of the investors participating in the trade were willing to pay a higher price for the stock, and vice versa.

Operating Cash Flows to Assets: The flow of revenue generated by a company's normal business operations. It indicates whether a company can generate enough positive cash flow to maintain its growth, giving market analysts a clear view of what the company is capable of.

Return on Invested Capital: Gives the market analyst an indicator of how well a company uses its resources to generate revenue.


39 Week Returns: Gives us the total returns over a period of 39 weeks.


Trend Line: It shows the momentum/trend of the market from one given point to another.

Volume 22 Days: It gives us the total amount of volume traded over a period of 22 days.

4.3 Feature Selection

  1. Mean Period Wise Return By Factor Quartile

    Figure 3: Mean Period Wise Return By Factor Quartile

    The diagram in figure 3 represents the total return as bar height; a positive bar represents going long and a negative bar going short. In our case we take three quartiles and break them across three holding periods, 1D, 5D, and 22D, which we can trade using the Quantopian environment. From the figure we can observe that the factor in question works best with 5D trading, as we get a good amount of return for both long and short positions, whereas for 1D trading the first quartile gives a good result for going long but, as the third quartile shows, is not the best for going short.

  2. Factor Weighted Long Short Portfolio Cumulative Return:

    This graph represents the position of the trader’s portfolio given that the trader traded on the experimented factor alone. It represents the cumulative returns on the trader’s portfolio.

    The graphs in figure 4 display the different positions of the portfolio for 3 different trading frequencies, 1D, 5D and 22D, as per the quartile declaration.

    Figure 4: Factor Weighted Long Short Portfolio Cumulative Return (a)1D, (b)5D, (c)22D
  3. Period Wise Return By Factor Quantile

    This graph is famously known as a violin plot, and comes in handy when the median value alone is not reliable enough to judge the state of the data being experimented on.

    This graph is very convenient for comparing the summary statistics across a range of quartiles. The representation is similar to the return plots above, but this time we get an idea of the density of where our returns are concentrated for each time period.

    Figure 5: Period Wise Return By Factor Quantile
  4. Cumulative Returns by Quantile

    The cumulative quantiles of each time period are taken and aggregated over the trading period. The main objective of this curve is to see if the quantiles spread as far away from each other as possible; the farther apart they are, the better. The third quantile is clearly above the first quantile, and this becomes clearer and clearer as we move forward in time. The less the graphs overlap, the better.

    This is calculated for the 3 different quartiles over the period of time that we traded shown in figure 6.

    Figure 6: Cumulative Returns by Quantile (a)1D, (b)5D, (c)22D
  5. Top Minus Bottom Quantile Mean

    The graph in figure 7 subtracts the bottom quantile from the top quantile and takes a mean of the result to smooth out the series for the given trading time period. The more positive the plot, the more return we get over that period of trade time.

    Figure 7: Top Minus Bottom Quantile Mean 1D, 5D, 22D
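As a rough sketch of how a top-minus-bottom quantile spread like the one in figure 7 can be computed, the snippet below builds a synthetic factor/return panel with pandas and smooths the daily spread with a 22-day rolling mean. All names and the synthetic data are illustrative assumptions, not the paper’s actual Alphalens output:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=250)

# toy panel: 100 stocks per day with a synthetic predictive factor
df = pd.DataFrame({
    "date": dates.repeat(100),
    "factor": rng.normal(size=25000),
})
df["fwd_return"] = 0.01 * df["factor"] + rng.normal(scale=0.02, size=25000)

# bucket stocks into terciles of the factor each day
df["quantile"] = df.groupby("date")["factor"].transform(
    lambda s: pd.qcut(s, 3, labels=False)
)

# mean forward return per (day, tercile), then the smoothed top-minus-bottom spread
daily = df.groupby(["date", "quantile"])["fwd_return"].mean().unstack()
spread = (daily[2] - daily[0]).rolling(22).mean()
```

A persistently positive `spread` suggests the factor separates winners from losers, which is what the figure is meant to diagnose.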

5 Algorithms

5.1 Naive Bayes

Naive Bayes classification uses Bayes’ rule. Let C_k denote a particular class and x = (x_1, ..., x_n) the feature vector.

Bayes’ rule defines the probability of a particular class C_k occurring for feature vector x, which can be computed from the following formula:

    P(C_k | x) = P(x | C_k) P(C_k) / P(x)    (1)

To estimate P(C_k | x) from a dataset we must first compute P(x | C_k). The strategy used to find the distribution of x conditional on C_k is specified by the following formula:

    P(x | C_k) = P(x_1 | C_k) P(x_2 | C_k) ... P(x_n | C_k)    (2)

In this formula we assume that a feature x_i having a particular value is independent of the value of any other feature x_j in the feature vector x for a particular class C_k. By plugging in these estimates, equation 2 turns equation 1 into:

    P(C_k | x) ∝ P(C_k) P(x_1 | C_k) ... P(x_n | C_k)

Gaussian Naive Bayes is better for this case as the features are continuous. Whereas Bernoulli Naive Bayes works better when the features are binary.
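A minimal Gaussian Naive Bayes sketch with scikit-learn is shown below. The data is purely hypothetical (28 continuous factor values per stock, label 1 = go long, 0 = go short), standing in for the paper’s actual features:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
# hypothetical data: 28 continuous factor values per stock,
# label 1 = go long, 0 = go short
X = rng.normal(size=(500, 28))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Gaussian Naive Bayes fits a per-class normal distribution to each feature
model = GaussianNB().fit(X[:400], y[:400])
acc = model.score(X[400:], y[400:])
```

Because the class-conditional density of each feature is modeled as a Gaussian, this variant handles continuous inputs directly, unlike Bernoulli Naive Bayes, which expects binary features.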

5.2 Logistic Regression

Regression models are widely used for data driven decision making. In many fields, logistic regression has become the standard method of data analysis in such situations. The key to any such kind of analysis is to find the model that fits best when explaining the relationship between a dependent and one or more independent variables. Unlike linear regression which most people are familiar with, where the outcome variable is usually continuous, a condition for logistic regression is that the outcome variable is binary. Logistic regression also allows us to determine to what degree a chosen independent variable affects the outcome.

Two reasons why logistic regression is so widely used are that it is 1) flexible and can easily be used in many situations, and 2) it allows for meaningful interpretation of the results. For simplicity, the quantity p(x) = E(Y | x) is used to represent Y’s conditional mean, given a value x.

The logit transformation is integral to logistic regression. The transformation is as follows:

    g(x) = ln[ p(x) / (1 − p(x)) ]

The importance of the logit g(x) lies in the fact that it has many of the useful properties of a linear model: it is linear in its parameters, may take continuous values, positive or negative, depending on the range of x.

To summarize, when the outcome variable is dichotomous, in a regression analysis:

  • The logistic regression mean must be scaled to lie between 0 and 1.

  • The binomial, as opposed to the normal distribution, describes the distribution of errors.

  • The principles of linear regression can also be applied to logistic regression.

For the model to produce reliable results, we need to have a large number of observations (at least 50).

In order to prevent overfitting of the data, L1 (Lasso) or L2 (Ridge) regularization is used. The difference between these two methods of regularization lies in the penalty term: L2 uses the “squared magnitude” as the penalty term added to the loss function, while L1 uses the “absolute value of magnitude”.


    Cost_L2 = Loss + λ Σ_j w_j²
    Cost_L1 = Loss + λ Σ_j |w_j|

The above equations show the cost function when using L2 and L1 regularization respectively, where w_j is the weight put on a particular feature x_j and λ is the coefficient of the penalty term.
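In scikit-learn the L1-penalized logistic regression described above can be sketched as follows; note that scikit-learn’s `C` parameter is the inverse of the regularization coefficient λ. The synthetic data (only two of 28 factors carry signal) is our own illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 28))   # 28 factors, only two actually matter
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.3, size=600) > 0).astype(int)

# penalty="l1" adds the absolute-magnitude penalty; C is the inverse of
# the regularization coefficient lambda, so smaller C = stronger penalty
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_used = int(np.count_nonzero(clf.coef_))
```

A side effect of the L1 penalty, visible in `n_used`, is that coefficients of irrelevant features are driven exactly to zero, giving implicit feature selection.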

5.3 Stochastic Gradient Descent

Stochastic gradient descent is a first-order optimization algorithm for supervised machine learning; in the regression setting it fits a straight line over a series of data points with the least amount of error. The basic gradient descent update is

    w := w − η ∇Q(w)

where w is the weight we want to optimize with respect to the cost Q and η is the learning rate.

Consider first the naive approach. We randomly assume an intercept, draw a straight line, and for each data point compute the predicted value ŷ; the residual is the difference between the real y and ŷ. We square each residual and sum the squared residuals over every data point. If we keep increasing the intercept, each intercept gives a sum of squared residuals, and plotting these values produces a curve that looks similar to y = x². The minimum of this curve corresponds to the intercept with the lowest sum of squared residuals. This exhaustive search is very slow. Gradient descent uses the same idea but works much faster by taking big steps when it is far from the optimal value and gradually decreasing the step size as it gets closer. It differentiates the sum of squared residuals with respect to the intercept, giving the slope of the curve at that point; the closer we get to the optimal intercept, the closer the slope gets to 0. The step size is the slope multiplied by the learning rate η, and the intercept is updated by subtracting the step size. We repeat this process until the slope is close to 0 or we reach our iteration limit, at which point the algorithm stops.

Stochastic gradient descent works very similarly, but is far more efficient on large data sets [bottou2010large].

What stochastic gradient descent does is pick a random sample of the data and use only that sample to perform each update step, reducing the computation per step by a factor of F−1, where F is the number of data points. It also performs well on data with a lot of redundancy: conceptually, if the data forms clusters, picking a random point from each cluster is enough to drive the updates. Thus if there are 5 clusters, stochastic gradient descent effectively works with 5 points.

Stochastic gradient descent therefore works well for stock prediction while trading in real time: it reduces the complexity of the algorithm substantially, and thus does not result in a time-limit-exceeded (TLE) error, unlike gradient descent.

Hinge Loss: It is a loss function mainly used to train maximum-margin classifiers [rosasco2004loss]. For a true label t ∈ {−1, +1} and prediction y, the hinge loss is ℓ(y) = max(0, 1 − t·y).

Elastic Net Penalty:

The elastic net penalty is mainly used to overcome the limitations of Lasso regression. When there are highly correlated variables, Lasso regression tends to pick one variable from the correlated group and ignore the rest; the elastic net adds a squared (L2) penalty to the L1 penalty. Adding this term makes the loss function strongly convex, giving it a unique minimum.
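The combination this section describes, hinge loss with an elastic net penalty, maps directly onto scikit-learn’s `SGDClassifier`; a minimal sketch on synthetic, linearly separable data of our own:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 28))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable toy labels

# loss="hinge" gives the maximum-margin objective; penalty="elasticnet"
# mixes the L1 and L2 penalties via l1_ratio
clf = SGDClassifier(loss="hinge", penalty="elasticnet",
                    l1_ratio=0.15, alpha=1e-4, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)
```

Because each update touches only a small sample of the data, this estimator remains fast enough for the daily retraining loop the paper runs on Quantopian.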

5.4 Support Vector Machine (SVM)

SVM is a classification and regression algorithm. It is used to maximize predictive accuracy whilst avoiding overfitting of the data, and is applied to problems such as handwriting, face, text and hypertext classification, bioinformatics, etc. SVM achieves maximum separation between data points: it finds a hyperplane that maximizes the separation by widening the margin in increments. It starts by drawing a line and two equidistant parallel lines, then picks a stopping point so that the algorithm does not run into an infinite loop, and an expanding factor close to 1, for example 0.99 [jakkula2006tutorial].

5.5 AdaBoost

A boosting algorithm increases the accuracy of weak learners. A weak learner is an algorithm that uses a simple assumption to output a hypothesis from an easily learnable hypothesis class and performs at least slightly better than a random guess. If each weak learner is properly implemented, boosting aggregates the weak hypotheses into a better predictor that performs well on hard-to-learn problems.

Adaboost is the short form of Adaptive Boosting. The AdaBoost algorithm outputs a “strong” function that is a weighted sum of weak classifiers. The algorithm follows an iterative process where, in each iteration t, it focuses on the samples where the previous hypothesis gave incorrect answers. The weak learner returns a weak hypothesis h_t whose weighted error ε_t satisfies

    ε_t = Σ_i D_t(i) · 1[h_t(x_i) ≠ y_i]

where D_t(i) is the weight of sample i at iteration t and 1[·] is the indicator (loss) function. The classifier h_t is then assigned a weight

    w_t = (1/2) ln(1/ε_t − 1)

So, the weight given to that weak classifier is inversely proportional to the error of that weak classifier. [shalev2014understanding]
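The inverse relationship between a weak classifier’s error and its weight can be checked numerically; the helper name below is our own:

```python
import math

def adaboost_weight(err: float) -> float:
    """Weight of a weak classifier with weighted error err: (1/2) ln(1/err - 1)."""
    return 0.5 * math.log(1.0 / err - 1.0)

# a learner barely better than chance gets a small weight,
# a strong learner gets a large one
w_weak = adaboost_weight(0.45)    # error close to 0.5
w_strong = adaboost_weight(0.10)  # low error
```

At err = 0.5 (pure chance) the weight is exactly 0, so a useless weak learner contributes nothing to the final weighted sum.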

5.6 Random Forest

A random forest uses a set of decision trees. As a class of decision trees of unspecified size has infinite VC dimension (Vapnik–Chervonenkis dimension), we restrict the size of the decision tree in order to prevent overfitting. Creating an ensemble of trees is another way to reduce the probability of overfitting [shalev2014understanding].

An advantage of using Random Forest is that it is both a classifier and a regressor [liaw2002classification]. For our purposes, we applied Random Forest for classification. The algorithm works as follows:

Create a bootstrapped dataset from the original data (bagging). An important point about bootstrap samples is that the same sample can be chosen more than once, since they are chosen at random. Next, we create a decision tree for each bootstrap sample. At every tree node, a random subset of variables is considered and the best split among them is chosen. We use the aggregate of the predictions of our trees to predict the classification of new data: for classification we use the majority vote, while the average is used for regression.

We can then easily estimate the error in our results in the following manner: We take a sample from our original data, which was not used to create our decision trees. This sample is called an “Out of bag” (OOB) sample. We then try to predict the data of the out of bag sample using the tree we grew by applying bootstrapping. We then aggregate the predictions of the out of bag samples and calculate the rate of error. This is called the OOB estimate of the error rate.

Given that enough trees have been grown, the OOB estimate of the error rate is quite accurate.
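The OOB procedure described above is built into scikit-learn’s random forest; a minimal sketch on synthetic data of our own (a non-linear, XOR-like target that single linear models cannot fit):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # non-linear (XOR-like) target

# oob_score=True scores each tree on the samples left out of its bootstrap
# sample, yielding the OOB estimate of the error rate described above
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_acc = rf.oob_score_
```

Because the OOB estimate reuses the training data, no separate validation split is needed to gauge generalization.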

5.7 Decision Tree

Decision tree classifiers divide a problem into a collection of subproblems, turning a complex problem into easier ones [8]. Using entropy as the criterion for splitting trees is useful when the problem contains numerous classes. The objective used for tree design in our model is to minimize the uncertainty in each layer, or, put differently, to maximize the entropy reduction. Shannon’s entropy is defined as

    H = − Σ_i p_i log2(p_i)

where p_i is the prior likelihood of class i.

Entropy is used to find the information gain of a particular factor, and the factor with the highest gain is used to make a split. At the terminal nodes a decision is given about the classification.
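The entropy and information-gain computation can be sketched in a few lines; the function names and the tiny long/short label set are our own illustration:

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy H = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right) -> float:
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

labels = ["long", "long", "short", "short"]
# a perfect split sends each class to its own child, removing all uncertainty
gain = information_gain(labels, ["long", "long"], ["short", "short"])
```

A 50/50 binary label set has entropy 1 bit, and a perfect split recovers that entire bit as gain, which is why such a factor would be chosen for the split.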

One advantage of the decision tree over other classifiers is that a training sample is tested against subsets of classes rather than all classes, which reduces computation and improves the performance of the classification task. The decision tree classifier also uses different subsets of the features of the given problem, which makes it perform better than single-layer classifiers. It also overcomes the high-dimensionality problem, as it takes a limited number of factors for each decision.

Overlapping is one of the problems of using a decision tree classifier, and the classifier takes a large amount of time and space when the set of label classes is large; that is not the case in this system, as the classification task is binary. There are many difficulties involved in designing an ideal decision tree, and errors may accumulate at each level, reducing the accuracy of the model [safavian1991survey].

6 Results

We used all seven of the classifiers discussed above to perform calculations on data from 2011-03-06 to 2011-09-07, split in an 80:20 ratio into a train set and a test set. Table 2 shows that ensemble methods work far better in this case. However, for the ensemble methods we only predicted the top and bottom values, as in real life we do not need to trade all 1500 stocks. Ensemble 1, showing an accuracy of 99.25%, included LR, Gaussian_NB, Bernoulli_NB and SGDC, whereas Ensemble 2, showing an accuracy of 74.23%, consisted of LR_L1Regress, LR_L2Regress, Gaussian_NB and Bernoulli_NB.

Name of the Algorithm                      Test accuracy
Naive Bayes (NB)                           51.21%
Logistic Regression (LR)                   51.77%
Stochastic Gradient Descent (SGDC)         50.56%
Support Vector Machine (SVM)               54.06%
Adaboost                                   53.29%
Random Forest                              52.43%
Ensemble 1 (predict top and bottom)        99.25%
Ensemble 2 (predict top and bottom)        74.23%

Table 2: Accuracy test on data from 1500 US stocks, 2011-03-06 to 2011-09-07
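The paper does not publish the exact weighting scheme of its ensembles, but Ensemble 1 (LR, Gaussian NB, Bernoulli NB and SGDC) can be approximated with scikit-learn’s majority-vote combiner; the synthetic data below is our own stand-in for the factor features:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(3)
X = rng.normal(size=(800, 28))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.4, size=800) > 0).astype(int)

# hard voting: each base classifier votes and the majority label wins
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("gnb", GaussianNB()),
    ("bnb", BernoulliNB()),   # binarizes continuous features at 0 by default
    ("sgd", SGDClassifier(loss="hinge", random_state=0)),
], voting="hard").fit(X[:600], y[:600])

acc = ensemble.score(X[600:], y[600:])
```

Majority voting tends to cancel the uncorrelated mistakes of the individual classifiers, which is consistent with the ensembles outperforming every single predictor in table 2.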

All the features that we calculated were later filtered and grouped into the trading frequencies at which they perform best. The three categories are daily, weekly and monthly trading. We then used different algorithms to trade at each frequency in order to compare their performance.

6.1 Day Trading

  • SVM: SVM has a very high complexity, so when performing SVM with a trading period of 1 day we get a time-limit-exceeded (TLE) error; when dealing with real-time data, time complexity is very important.
    Time complexity: O(d)

  • Random Forest: Using the random forest algorithm with daily trading we get a return of 18.08% with a Sharpe ratio of 0.77.
    Time complexity: O(v · n log(n))

  • Ensemble 1 Classifiers:

    1. GaussianNB

    2. LogisticRegression

    3. BernoulliNB

    4. Sgdc

    Using the mixed classifiers of all these algorithms together we get a return of 34.99% with a Sharpe ratio of 0.67. The combined time complexity of these algorithms is very low, which makes them very feasible for our purpose.

  • Best Classifiers:

    1. GaussianNB

    2. LogisticRegression

    3. DTC

    4. Sgdc

    Figure 8 shows that, using the decision tree classifier in the mix, we get a return of 54.63% with a Sharpe ratio of 1.16.
    Time complexity: somewhere between O(Nk log N) and O(N²k).

    1. GaussianNB

    2. LogisticRegression

    3. AdaBoostClassifier

    4. Sgdc

    Using the AdaBoostClassifier in the mix we get a return of 11.69% with a Sharpe ratio of 0.49.
    Time complexity : Is somewhere between and .

Figure 8: top result Daily Trade by ensembling GaussianNB, LogisticRegression, DTC and SGDC (best classifier)

6.2 Weekly Trading:

  • AdaBoostClassifier: Using AdaBoost for weekly trading we get a TLE and a return of -0.75%.

  • Decision Tree: Using the decision tree for weekly trading we get a total return of 2.58% and a Sharpe ratio of 0.19.

  • Random Forest: Using random forest we get a total return of 1.09% and a Sharpe ratio of 0.10.

6.3 Monthly Trading:

  • Decision Tree: Using the decision tree for monthly trading we get a TLE.

  • AdaBoostClassifier: Using AdaBoost for monthly trading we get a return of -13.05% and a Sharpe ratio of -0.38.

  • SVM: Using SVM for monthly trading we get a return of -4.05% and a Sharpe ratio of -0.17.

  • Random Forest: Using random forest for monthly trading we get a return of -6.16% and a Sharpe ratio of -0.2.

6.4 Performance of our best classifier

Total Returns: It is the total amount of return on an investment over a given period of time. This accounts for two different categories of investment:

  1. Fixed income investment

  2. Distribution and capital appreciation

Common Returns: Common returns are how much of your total returns can be attributed to the common risk factors as modeled by Quantopian exposure to market beta, sectors, momentum, mean reversion, volatility, size, and value. If all your returns are common returns, it means your algorithm isn’t doing anything unique and is therefore of little value. Table 3 shows 2.71% of common returns.

Specific Returns: It is the excess return that we get from an asset, independent of the specific returns of other assets. Table 3 shows 50.60% of specific returns.

Sharpe Ratio: It is a measure of the risk-adjusted performance of an investment; it measures the excess return per unit of deviation of a trade. Our approach had a Sharpe ratio of 1.16, which is decent, as shown in table 3.
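For reference, a standard annualized Sharpe ratio computation on daily returns can be sketched as follows; the helper name and the synthetic return series are our own, not the paper’s backtest data:

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its standard deviation."""
    excess = np.asarray(daily_returns) - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

rng = np.random.default_rng(0)
# toy daily return series, roughly 7.5 years of trading days
daily = rng.normal(loc=0.0004, scale=0.005, size=1900)
sr = sharpe_ratio(daily)
```

The √252 factor annualizes a ratio computed from daily returns, which is the usual convention on Quantopian-style tear sheets.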


Max Drawdown: It is the maximum observed loss from a peak of the equity curve to a subsequent trough. It is used to assess the relative risk of a stock strategy.


Volatility: It is a measure of risk, quantified as the dispersion of returns.
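Both risk measures are straightforward to compute from a backtest; the helper names and toy numbers below are our own illustration:

```python
import numpy as np

def max_drawdown(curve):
    """Largest peak-to-trough decline of a cumulative portfolio-value curve."""
    curve = np.asarray(curve, dtype=float)
    running_peak = np.maximum.accumulate(curve)
    return ((curve - running_peak) / running_peak).min()

def annual_volatility(daily_returns, periods_per_year=252):
    """Annualized standard deviation of daily returns."""
    return np.asarray(daily_returns).std(ddof=1) * np.sqrt(periods_per_year)

curve = np.array([100.0, 110.0, 105.0, 120.0, 90.0, 130.0])
mdd = max_drawdown(curve)                       # worst fall is 120 -> 90
vol = annual_volatility([0.01, -0.01, 0.02])
```

Note that the drawdown is measured against the running peak, so the later recovery to 130 does not erase the 120-to-90 decline.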

6.5 Performance Evaluation and Risk Evaluation:

Our best algorithm among all of the above was the ensemble learning algorithm which incorporated the Gaussian Naive Bayes classifier, Logistic Regression, the decision tree classifier and the Stochastic gradient descent classifier. The training window for each decision was set to the 200 days prior to that day, and trading was done daily. Below are the results obtained by running the algorithm from 01/04/2011 to 07/05/2019 with a capital of 10,000,000 USD.

Total Returns       54.35%
Specific Returns    50.60%
Common Returns      2.71%
Sharpe Ratio        1.16
Max Drawdown        -8.31%
Volatility          0.05
Table 3: Performance of the System’s Best Model

Table 3 depicts that the return calculated from the initial investment was 54.35% of the total capital. The average Sharpe ratio is 1.16, the average volatility is 0.05 and the final max drawdown was -8.31%. These values indicate that our model returns a portfolio with a low level of risk.

Cumulative specific and total returns: Cumulative returns are independent of the time period and give us the total amount of profit or loss from a particular investment. The common returns are very low, which is a good sign for the model, as it means that our algorithm has a low beta and performs well irrespective of whether stock prices rise or fall. This made the specific return very high (50.60%), as shown in figure 9.

Figure 9: Cumulative specific and total returns

Returns over time: Returns are the gains or losses made by a particular investment. They can be expressed as a percentage increase or decrease of the investment, or quantified in a particular currency. Figure 10 shows that the returns are mostly positive.

Figure 10: Returns over time

Rolling portfolio beta to equity: This is shown in figure 11. Beta is the risk that can be attributed to the movement of the market. A beta of 1 signifies that a portfolio follows the trend of the market precisely, whereas a beta lower than 1 means that the portfolio is less correlated with the overall market. A low beta combined with a high alpha means that the portfolio will make a profit irrespective of the market movement.

Figure 11: Rolling portfolio beta to Equity

Daily, weekly and monthly returns: Figure 12 illustrates the returns over daily, weekly and monthly periods; each panel shows how much profit was made over a particular period, along with the daily, weekly and monthly return quantiles.

Figure 12: (a)Daily returns, (b) weekly returns, (c)Monthly returns, (d)daily, (e)weekly and (f) monthly quantiles

Style Exposure: Figure 13 shows the exposure to various investing styles. The values displayed are the rolling 63-day mean. The relevant styles are listed in the figure.

Figure 13: Exposure to Momentum, Size, Value, Short-term Reversal and Volatility

Ratio of long and short positions: We implemented an equal long and short position strategy, as shown in figure 14. At any given time we went long on 250 stocks and short on 250 stocks, which allowed our model to perform well in both bull and bear markets.

Figure 14: Ratio of long and short position

Daily Holdings: From figure 15, we see that the total daily holdings of our portfolio never exceed 500, as we set the maximum holding limit of our portfolio to 500.

Figure 15: Daily Holdings

Gross Leverage: Figure 16 shows that we kept our leverage between 0.96 and at most 1.05, so that our money would be fully utilized while avoiding the risk of being liquidated.

Figure 16: Gross Leverage

7 Conclusion

Through experimentation, it is clear that ensemble learning produces better results for stock market trading than any single algorithm. It also became clear that the most important part of a stock trading algorithm is feature extraction: the daily trading algorithm made a 54.35% profit over the course of 8 years due to the quality of the features used for 1-day trading, whereas the weekly and monthly algorithms failed due to their features. Our most significant contribution is that we identified 28 features which can clearly capture the trend of the market over a one-day period.

Due to our limited resources, we were not able to use Pipeline to train our model and, as such, could not train some algorithms without exceeding the time limit. It is also difficult to implement high-frequency trading, such as hourly trading, as Quantopian does not provide features for hourly trading. We were also unable to implement any kind of neural network, as Quantopian would not allow us to import Keras for TensorFlow. Furthermore, with better access to trading data we could have run our own neural network over the data for better results, but Quantopian does not allow downloading of its datasets. With better resources and better access to trading data, we would have been able to produce better and more accurate results. For future work, we intend to design our own reinforcement learning algorithm specifically tailored for this purpose. In order to get better results, we would also like to try higher-frequency trading, preferably daily and hourly, for which we would have to calculate our own features.

8 Information Sharing Statement

In order to ensure reproducibility of our research we published our entire work at:
however, the data we used for our research can only be used for free within the Quantopian platform. Subsequently, all the code provided will only work in Quantopian’s research environment.

Appendix A Appendix

The next list describes several symbols & abbreviations that will later be used within the body of the document