Using Pandas to Manage Large Time Series Files

Although Excel is a useful tool for performing time-series analysis and is the primary analysis application in many hedge funds and financial trading operations, it is fundamentally limited in the size of the datasets it can handle. Excel's stated limit is 1,048,576 rows, but in practice it often becomes unstable on most systems well before that, typically as the row count approaches 500k. This is a problem for time-series analysis, since high-frequency data (typically tick data or 1-minute bars) consumes a great deal of file space.


As such, there is often a need to break up large time-series datasets into smaller, more manageable Excel files. Fortunately, Pandas comes with built-in tools to aggregate, filter, and generate Excel files.


Installation


Whilst Python, Pandas, and Jupyter Notebooks can each be installed separately, the most convenient and efficient way to install all three (along with numerous other popular Python libraries) is to install Anaconda; an installation guide is available here: https://docs.anaconda.com/anaconda/install/windows/



Getting Started


We will work through the examples using Jupyter Notebooks. Note that once we have a working Python script, the most efficient approach is to deploy it as a service; for learning and script development, however, Jupyter Notebooks is a very useful tool.


Starting in a new notebook, we first import Pandas:


import pandas as pd 


We can then load the data using the Pandas read_csv method (note that this also handles zip files). For the purposes of this demo we will use 1-minute intraday bars for AMZN (Amazon) stock provided by FirstRate Data, which provides free high-frequency data samples.


column_names = ["TimeStamp", "open", "high", "low", "close", "volume"]
maindf = pd.read_csv("https://frd001.s3-us-east-2.amazonaws.com/AMZN_FirstRateDatacom1.zip",
                     names=column_names,
                     parse_dates=["TimeStamp"],
                     index_col="TimeStamp")


Note that the read_excel method could be used if the data were stored as an Excel file; however, most large datasets are distributed as CSV to reduce file size.


Also, note a few key points regarding the above snippet. First, we need to ascertain whether the file contains a header row with the column names; if it does, simply include


header=0


in the arguments. If not, we must add the column names manually and pass them to the names argument of read_csv. We also need to make the timestamp column the index by passing its name to the index_col argument. Finally, by default Pandas reads timestamps such as 2019-07-23 09:01:00 as plain strings rather than timestamp objects, so we pass the timestamp column to the parse_dates parameter to have it parsed into valid dates.
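To illustrate the two cases, here is a minimal sketch using small in-memory CSV strings (the prices are invented for illustration) rather than the real FirstRate Data file:

```python
import io
import pandas as pd

# Case 1: CSV with no header row -- supply the column names via names=
raw_no_header = io.StringIO(
    "2019-07-23 09:01:00,1000.0,1001.5,999.8,1001.0,1200\n"
    "2019-07-23 09:02:00,1001.0,1002.0,1000.5,1001.8,800\n"
)
column_names = ["TimeStamp", "open", "high", "low", "close", "volume"]
df = pd.read_csv(raw_no_header,
                 names=column_names,
                 parse_dates=["TimeStamp"],
                 index_col="TimeStamp")

# Case 2: CSV whose first line already holds the column names -- use header=0
raw_with_header = io.StringIO(
    "TimeStamp,open,high,low,close,volume\n"
    "2019-07-23 09:01:00,1000.0,1001.5,999.8,1001.0,1200\n"
)
df2 = pd.read_csv(raw_with_header,
                  header=0,
                  parse_dates=["TimeStamp"],
                  index_col="TimeStamp")

# In both cases the index is now a DatetimeIndex, not strings
print(type(df.index))
```

In both variants the timestamp column ends up as a proper DatetimeIndex, which is what enables the date-based selection and resampling shown below.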


We can check the dataframe by running 


maindf.head()


which will output the first 5 rows of the dataframe.


If the requirement is simply to select a sample of the data file based on a date range, we can use the .loc indexer and pass in the start and end dates for the required selection.


sampled_df = maindf.loc['2019-01-01':'2019-12-31']
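As a side note, a DatetimeIndex also supports partial-string indexing, which makes selecting whole months or years very concise. A small sketch on a synthetic minute-bar frame (the data is invented for illustration):

```python
import pandas as pd

# Synthetic 1-minute bars spanning roughly two months
idx = pd.date_range("2019-01-01 09:30", periods=90000, freq="min")
demo = pd.DataFrame({"close": range(90000)}, index=idx)

# Full date-range slice, as in the article
jan_to_mid_feb = demo.loc["2019-01-01":"2019-02-15"]

# Partial-string indexing: all rows in January 2019
january = demo.loc["2019-01"]

print(january.index.min(), january.index.max())
```

Note that with string slices on a DatetimeIndex, the end label is inclusive up to the end of that day.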


However, analysts often require the data to be reframed into a different time interval. For example, the source file may contain 1-minute intraday bars while the analyst requires 2-hour bars. In this circumstance, the 1-minute bars need to be aggregated into 2-hour bars using a different rule for each of the open, high, low, close, and volume columns.


This can be achieved by using the resample() and agg() methods in combination. The resample() method groups rows into a different timeframe based on the parameter passed in; for example, resample("B") groups the rows into business days (1 row per business day).

However, the resample() method cannot on its own aggregate different columns according to different rules, so the agg() method is used to specify how each column should be aggregated:


sampled_df = maindf.resample("2H").agg({'open': 'first', 'close': 'last', 'high': 'max', 'low': 'min', 'volume': 'sum'})
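To see these per-column rules in action, here is a minimal, self-contained sketch that resamples four synthetic 1-minute bars into 2-minute bars (a smaller interval than 2 hours purely so the output is easy to check by hand; the aggregation dictionary is identical):

```python
import pandas as pd

idx = pd.to_datetime([
    "2019-07-23 09:30", "2019-07-23 09:31",
    "2019-07-23 09:32", "2019-07-23 09:33",
])
bars = pd.DataFrame({
    "open":   [10.0, 10.5, 10.4, 10.6],
    "high":   [10.6, 10.7, 10.8, 10.9],
    "low":    [ 9.9, 10.3, 10.2, 10.5],
    "close":  [10.5, 10.4, 10.6, 10.8],
    "volume": [100,  200,  150,  250],
}, index=idx)

# First open, last close, max high, min low, summed volume per 2-minute bucket
two_min = bars.resample("2min").agg({
    "open": "first", "close": "last",
    "high": "max", "low": "min", "volume": "sum",
})
print(two_min)
```

The 09:30 bucket takes its open from the 09:30 bar, its close from the 09:31 bar, the max/min of the two highs/lows, and the sum of the two volumes.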


In some older code samples a how argument is passed to the resample() method to provide the aggregation information, e.g.:


resample("2H", how='ohlc')


However, the how parameter has been deprecated and removed from Pandas, so the agg() method must be used instead.


Next, we need to filter for trading days, as the new dataframe will contain empty bars for weekends and holidays. This can be done by indexing the dataframe with a boolean mask that is true only for trading bars:


sampled_df = sampled_df[sampled_df.open > 0]
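This works because the empty bars produced by resampling contain NaN in every column, and NaN compares as False in the mask. An equivalent, more explicit alternative is dropna(subset=['open']). A minimal sketch on invented data:

```python
import pandas as pd
import numpy as np

# A resampled frame where the 02:00 and 04:00 bars fall outside trading
# hours and are therefore all-NaN
idx = pd.to_datetime(["2019-07-23 02:00", "2019-07-23 04:00", "2019-07-23 10:00"])
sampled = pd.DataFrame({
    "open":   [np.nan, np.nan, 10.0],
    "close":  [np.nan, np.nan, 10.5],
    "volume": [np.nan, np.nan, 500.0],
}, index=idx)

# Boolean-mask filter: NaN > 0 evaluates to False, so empty bars drop out
filtered = sampled[sampled.open > 0]

# Equivalent: drop rows where 'open' is missing
filtered_alt = sampled.dropna(subset=["open"])

print(len(filtered), len(filtered_alt))
```

For stock prices the two approaches coincide, since a genuine bar never has a zero or negative open.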


Finally, we can save the resampled dataframe to an Excel file using the to_excel() method. Note that the timestamp index is written as the first column by default; passing index=False would discard it:


sampled_df.to_excel(r'path\file.xlsx')


If the error ModuleNotFoundError: No module named 'openpyxl' is shown, then openpyxl will need to be installed first, as Pandas uses it for its Excel file operations:


pip install openpyxl
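Returning to the original goal of breaking a large dataset into manageable Excel files, the export step can be extended to write one file per calendar year, keeping each file well under Excel's row limit. A sketch using invented data and hypothetical filenames (the to_excel call is commented out so the snippet runs without writing files):

```python
import pandas as pd

def split_by_year(df):
    """Return {year: sub-DataFrame}, keyed by the DatetimeIndex year."""
    return {year: sub for year, sub in df.groupby(df.index.year)}

# Illustration with synthetic daily bars spanning two years
idx = pd.date_range("2018-12-30", "2019-01-03", freq="D")
demo = pd.DataFrame({"close": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

parts = split_by_year(demo)
for year, sub in parts.items():
    # sub.to_excel(rf'path\AMZN_2H_{year}.xlsx')  # hypothetical filename
    print(year, len(sub))
```

The same loop applied to the resampled dataframe above would produce one Excel file per year of data.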



