Integrating ML models with the Strategy Tester (Conclusion): Implementing a regression model for price prediction


In the previous article, we completed the implementation of a CSV file management class for storing and retrieving data related to financial markets. Having created the infrastructure, we are now ready to use this data to build and train a machine learning model.

Our task in this article is to implement a regression model that can predict the closing price of a financial asset within a week. This forecast will allow us to analyze market behavior and make informed decisions when trading financial assets.

Price forecasting is a useful tool for developing trading strategies and making decisions in the financial market. The ability to accurately predict price trends can lead to better investment decisions, maximizing profits and minimizing losses. Additionally, price forecasting can help identify trading opportunities and manage risks.

To implement our regression model, we will perform the following steps:

  1. Collect and prepare data: Using a Python script, we will get historical price data and other required information. We will use this data to train and test our regression model.
  2. Select and train a model: We will select a suitable regression model for our task and train it using the collected data. There are several regression models such as linear regression, polynomial regression and support vector regression (SVR). The model choice depends on its suitability for solving our problem and the performance obtained during the training process.
  3. Evaluate model performance: To make sure that our regression model works correctly, we need to evaluate its performance through a series of tests. This evaluation will help us identify potential problems and adjust the model if necessary.

At the end of this article, we will obtain a regression model that can predict the closing price of a financial asset for a week. This forecast will allow us to develop more effective trading strategies and make informed decisions in the financial market.

Section 1: Selecting a Regression Model

Before applying our regression model to predict the weekly closing price of a financial asset, it is necessary to understand the different types of regression models and their characteristics. This will allow us to choose the most suitable model to solve our problem. In this section, we will discuss some of the most common regression models:

  1. Linear Regression is one of the simplest and most popular regression models. It assumes a linear relationship between the independent variables and the dependent variable. It aims at finding a straight line that best fits the data while minimizing the sum of squared errors. Although easy to understand and implement, linear regression may not be suitable for problems where the relationship between variables is not linear.
  2. Polynomial Regression is an extension of linear regression that takes into account non-linear relationships between variables. It uses polynomials of varying degrees to fit a curve to the data. Polynomial regression may provide a better approximation for more complex problems; however, it is important to avoid overfitting, which occurs when the model fits too closely to the training data, reducing its ability to generalize to unseen data.
  3. Decision Tree Regression is a decision tree-based model that divides the feature space into distinct, non-overlapping regions. In each region, the forecast is made based on the average of the observed values. Decision tree regression is capable of capturing complex nonlinear relationships between variables, but can be subject to overfitting, especially if the tree becomes very large. Pruning and cross-validation techniques can be used to combat overfitting.
  4. Support Vector Regression (SVR) is an extension of the Support Vector Machine (SVM) algorithm for solving regression problems. SVR attempts to find the best function for the data while maintaining maximum margin between the function and the training points. SVR is capable of modeling nonlinear and complex relationships using kernel functions such as the Radial Basis Function (RBF). However, SVR training can be computationally expensive compared to other regression models.

To select the most appropriate regression model for predicting the closing price of a financial asset for a week, we need to consider the complexity of the problem and the relationship between the variables. In addition, we must consider the balance between model performance and computational complexity. In general, I recommend experimenting with different models and adjusting their settings to achieve the best performance.

When choosing a model, it is important to consider several criteria that affect the quality and applicability of the model. In this section we will look at the main criteria that must be taken into account when choosing a regression model:

  1. The performance of a regression model is very important to ensure the accuracy and usefulness of predictions. We can evaluate performance using mean squared error (MSE) and mean absolute error (MAE), among other metrics. When comparing different models, it is important to choose the one that performs best on these metrics.
  2. Model Interpretability is the ability to understand the relationships between variables and how they affect the prediction. Simpler models such as linear regression are generally easier to interpret than more complex models such as neural networks. Interpretability is especially important if we want to explain our predictions to others or understand the factors that influence results.
  3. The complexity of a regression model is related to the number of parameters as well as the structure of the model. More complex models can capture more subtle and nonlinear relationships in the data, but they may also be more prone to overfitting. It is important to find a balance between the complexity of the model and the ability to generalize it to unknown data.
  4. Training time is an important point to consider, especially when working with large data sets or when training models iteratively. Simpler models, such as linear and polynomial regression, typically require less training time than other, more complex models, such as neural networks or support vector regression. It is important to find a balance between model performance and training time to ensure that the model is applicable.
  5. Robustness of a regression model is its ability to deal with outliers and noise in the data. Robust models are less sensitive to small changes in data and produce more stable forecasts. It is important to choose a model that can handle outliers and noise in the data.
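The MSE and MAE metrics mentioned in item 1 are straightforward to compute with scikit-learn. The prices below are hypothetical, purely for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical closing prices: actual values vs. model predictions
y_true = np.array([10.2, 10.5, 10.3, 10.8, 11.0])
y_pred = np.array([10.0, 10.6, 10.4, 10.5, 11.2])

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors quadratically
mae = mean_absolute_error(y_true, y_pred)  # average absolute deviation

print(f"MSE: {mse:.4f}")  # 0.0380
print(f"MAE: {mae:.4f}")  # 0.1800
```

Because MSE squares each error, a single large miss affects it much more than it affects MAE, which is worth keeping in mind when comparing models on noisy price data.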

When choosing the most appropriate regression model for forecasting closing prices, it is important to weigh these criteria and find the right balance between them. It is typically recommended to test different models and fine-tune their parameters to optimize performance. This way you can select the best model for a particular problem.

Based on the above criteria, in this article, I decided to use the Decision Tree Regression model to predict the closing price. The choice of this model is justified for the following reasons:

  1. Performance: Decision trees typically work well for regression problems because they are able to capture nonlinear relationships and interactions between variables. By properly tuning model hyperparameters, such as tree depth and minimum number of samples per leaf, we can achieve a balance between fitness and generalization.
  2. Interpretability: One of the main advantages of decision trees is their interpretability. Decision trees are a series of decisions based on attributes and their values, making them easy to understand. This is useful for justifying forecasts and understanding the factors influencing closing prices.
  3. Complexity: The complexity of decision trees can be controlled by tuning the hyperparameters of the model. With this, we can find a balance between the ability to model complex relationships and the simplicity of the model, while avoiding overfitting.
  4. Training time: Decision trees typically train relatively quickly compared to more complex models such as neural networks or SVMs. This fact makes the decision tree regression model suitable for cases where training time is an important factor.
  5. Robustness: Decision trees are robust to outliers and noise in the data because each decision is based on a set of samples rather than a single observation, and this contributes to the stability of predictions and the reliability of the model.
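The hyperparameters mentioned in items 1 and 3 (tree depth and minimum samples per leaf) can be tuned with cross-validation. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the search grid is an illustrative choice, not a recommendation from the article:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic features/target standing in for weekly price data
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] ** 2 - 3 * X[:, 1] + rng.standard_normal(200)

# Tune depth and leaf size to balance fit and generalization
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best CV MSE:", -search.best_score_)
```

A deeper tree with tiny leaves fits the training data better but generalizes worse; cross-validated search makes that trade-off explicit instead of leaving it to guesswork.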

Given the criteria discussed and the benefits of decision tree regression, I believe this model is suitable for predicting the weekly closing price. However, it is important to remember that the choice of models may vary depending on the specific context and requirements of each problem. Therefore, to select the most appropriate model for your specific problem, you should test and compare different regression models.

Section 2: Data Preparation

Data preparation and cleaning are important steps in the process of implementing a regression model, since the quality of the input data directly affects the efficiency and performance of the model. These steps are important for the following reasons:

  1. Elimination of outliers and noise: Raw data may contain outliers, noise and errors that can negatively affect the performance of the model. By identifying and correcting these inconsistencies, you can improve the quality of your data and, therefore, the accuracy of your forecasts.
  2. Filling and removing missing values: Incomplete data is common in datasets, and missing values can cause the model to perform poorly. To ensure data integrity and reliability, you might consider imputing missing values, deleting records with missing data, or using special techniques to deal with such data. The choice between imputing and deletion depends on the nature of the data, the number of missing values, and the potential impact of those values on model performance. It is important to carefully analyze each situation and choose the most appropriate approach to solving the problem.
  3. Selecting Variables: Not all variables present in the data set may be important or useful in predicting the closing price. Appropriate variable selection allows the model to focus on the most important features, improving performance and reducing model complexity.
  4. Data Transformation: Sometimes the original data needs to be transformed to match the assumptions of the regression model or to improve the relationship between independent variables and the dependent variable. Examples of transformations include normalization, standardization, and the use of mathematical functions such as logarithm or square root.
  5. Data division: Dividing the data set into a training subset and a testing subset makes it possible to properly evaluate the performance of the regression model. This division allows the model to be trained on a subset of data and tested for its ability to generalize to unseen data, which provides an assessment of the model’s performance in real-world situations.
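Steps 2–5 above can be sketched on a small, hypothetical table of weekly OHLC values (the data and the forward-fill choice are illustrative only; each real dataset needs its own treatment, as noted below):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical weekly OHLC data with one missing close value
df = pd.DataFrame({
    "open":  [10.0, 10.2, 10.1, 10.6, 10.4],
    "high":  [10.4, 10.5, 10.7, 10.9, 10.8],
    "low":   [ 9.8, 10.0, 10.0, 10.3, 10.2],
    "close": [10.2, 10.4, np.nan, 10.7, 10.5],
})

# Step 2: fill the missing close with a forward fill (one option among several)
df["close"] = df["close"].ffill()

# Step 3: select the variables the model will actually use
features = df[["open", "high", "low"]]
target = df["close"]

# Step 4: standardize the features (zero mean, unit variance)
X = StandardScaler().fit_transform(features)

# Step 5: split into training/testing subsets; shuffle=False preserves time order
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.4, shuffle=False)

print(X_train.shape, X_test.shape)
```

For time series, keeping the split chronological (`shuffle=False`) matters: testing on data that predates the training data would leak future information into the model.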

The data preparation and cleaning stage ensures that the model is trained and evaluated on quality data, maximizing its effectiveness and usefulness in predicting closing prices.

We’ll look at a basic example of preparing data for a regression model using Python. However, I would like to note that it is important to deepen your knowledge as each specific data set and each problem may require special preparation approaches and methods. Therefore, I strongly recommend that you take the time to learn and understand the various data preparation methods.

To collect data, we will use the get_rates_between function, which collects financial data for a specific asset over a specific period. It uses the MetaTrader 5 library to connect to the trading platform and obtain historical price data for various time intervals.

The function has the following parameters:

  • symbol: a string that represents the financial symbol (for example, “PETR3”, “EURUSD”).
  • period: an integer that specifies the time period for which the data will be collected (for example, mt5.TIMEFRAME_W1 for weekly data).
  • ini: a datetime object that represents the start time and date of the time interval for data collection.
  • end: a datetime object that represents the end date and time of the time interval for data collection.

The function starts by checking that MetaTrader 5 initialized successfully. If initialization fails, the function raises an exception and terminates the program.

Then the function uses mt5.copy_rates_range() to get financial data of the specified symbol and period. The data is saved in the DataFrame object from pandas, which is a two-dimensional axis-labeled data structure that is suitable for storing financial data.

After receiving the data, the function checks whether the DataFrame is empty. If it is, the function raises an exception, since an empty DataFrame indicates that an error occurred while collecting the data.

If all goes well, the function will convert the ‘time’ column of DataFrame to a readable date and time format using the pd.to_datetime() function. The ‘time’ column is defined as the DataFrame index, which facilitates data access and manipulation.
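Putting the walkthrough above together, a minimal sketch of get_rates_between could look like the following. The MetaTrader5 package is only available where the platform is installed, so the import is guarded here to keep the sketch readable elsewhere; the exact error handling is an assumption of this sketch, not the article's final code:

```python
from datetime import datetime

import pandas as pd

try:
    import MetaTrader5 as mt5  # platform-specific dependency (Windows only)
except ImportError:
    mt5 = None


def get_rates_between(symbol: str, period: int,
                      ini: datetime, end: datetime) -> pd.DataFrame:
    """Collect historical rates for `symbol` between `ini` and `end`."""
    # Check that MetaTrader 5 initializes; abort otherwise
    if mt5 is None or not mt5.initialize():
        raise RuntimeError("MetaTrader 5 initialization failed")

    # Fetch rates for the symbol/period within the requested interval
    rates = mt5.copy_rates_range(symbol, period, ini, end)
    mt5.shutdown()

    df = pd.DataFrame(rates)
    if df.empty:
        raise ValueError(f"no data collected for {symbol}")

    # Convert the epoch 'time' column to datetimes and use it as the index
    df["time"] = pd.to_datetime(df["time"], unit="s")
    df.set_index("time", inplace=True)
    return df
```

A call such as `get_rates_between("EURUSD", mt5.TIMEFRAME_W1, datetime(2023, 1, 1), datetime(2023, 6, 1))` would then return a DataFrame of weekly bars indexed by date.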
