Machine Learning Models
Machine learning models are trainable structures that learn from the information within a dataset to accomplish a task. These models are widely used for various applications, but in the QuantiX platform, we use them for the following purposes:
Classification: Predicting labels (usually buy, sell, no-action) for candles based on the knowledge learned from the historical price data and a categorical target.
Regression: Forecasting numeric values for candles based on the knowledge learned from the historical price data and a numerical target.
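For illustration, the sketch below shows how a candle dataset might be turned into these two kinds of targets. The column name (close) and the +/-0.5% thresholds are hypothetical choices for the example, not QuantiX defaults.

```python
import pandas as pd

# Hypothetical candle data; only the close price is used here.
candles = pd.DataFrame({"close": [100.0, 101.5, 101.2, 102.8, 102.1, 103.0]})

# Numerical target for regression: the next candle's return.
candles["next_return"] = candles["close"].pct_change().shift(-1)

# Categorical target for classification: buy / sell / no-action,
# using an illustrative +/-0.5% threshold on the next return.
def label(r):
    if pd.isna(r):
        return None
    if r > 0.005:
        return "buy"
    if r < -0.005:
        return "sell"
    return "no-action"

candles["action"] = candles["next_return"].apply(label)
print(candles)
```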
The models introduced below are used for classification and regression tasks.
XGBoost
XGBoost (eXtreme Gradient Boosting) is a highly efficient and flexible gradient boosting framework that has gained immense popularity in machine learning competitions and real-world applications. It's particularly effective for handling both classification and regression problems.
How XGBoost Works
Base Learners: XGBoost builds an ensemble of decision trees, starting from a simple initial prediction.
Predictions and Residuals: The current ensemble makes predictions for the training data; the differences between the actual and predicted values, known as residuals, are calculated.
New Base Learner: A new base learner is trained to predict the residuals from the previous round.
Weighted Sum: The predictions from all base learners are combined using a weighted sum. The weights are adjusted based on the performance of each learner.
Iteration: This process is repeated iteratively, with each new base learner focusing on the errors made by the previous ones.
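The loop below is a minimal from-scratch sketch of this residual-fitting process for a squared-error regression target. It is meant only to illustrate the steps above; XGBoost's real implementation adds second-order gradients, regularization, and many optimizations. X is assumed to be a feature matrix and y a NumPy array of targets.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())  # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction                     # actual minus predicted
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                         # new learner fits the residuals
        prediction += learning_rate * tree.predict(X)  # shrunken (weighted) sum
        trees.append(tree)
    return y.mean(), trees

def boosted_predict(X, base, trees, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)
```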
XGBoost for Classification
For classification problems, XGBoost uses a classification loss, such as logistic loss for binary targets or softmax cross-entropy for multi-class targets, that measures how well the predicted class probabilities match the true labels. The algorithm iteratively trains base learners to minimize this loss, improving the model's ability to classify instances correctly.
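A minimal classification sketch with the xgboost Python package, assuming a feature matrix X_train/X_test and integer-encoded labels y_train (for example 0 = sell, 1 = no-action, 2 = buy) already exist:

```python
from xgboost import XGBClassifier

clf = XGBClassifier(
    objective="multi:softprob",  # softmax cross-entropy loss with probability output
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
)
clf.fit(X_train, y_train)
probabilities = clf.predict_proba(X_test)  # one probability per class per candle
```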
XGBoost for Regression
In regression problems, XGBoost uses a regression loss, typically squared error, that measures the difference between predicted and actual values. The algorithm aims to minimize this loss, leading to more accurate predictions.
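A corresponding regression sketch under the same assumptions, using squared error, XGBoost's default regression objective:

```python
from xgboost import XGBRegressor

reg = XGBRegressor(
    objective="reg:squarederror",  # squared-error loss
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
)
reg.fit(X_train, y_train)
forecast = reg.predict(X_test)  # numeric forecasts for the test candles
```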
XGBoost Parameters
Booster: Specifies the type of model used in XGBoost. Common options include "gbtree" (gradient boosted trees) and "dart" (gradient boosted trees with dropout).
Learning Rate: Controls the step size at each iteration. A smaller learning rate often leads to better generalization but requires more iterations.
Min Split Loss: Sets the minimum loss reduction required for a node to split; a higher value leads to fewer splits and a simpler model.
Subsample: Specifies the fraction of rows to sample at each iteration; this helps prevent overfitting by introducing randomness.
Max Depth: Sets the maximum depth of the trees; a deeper tree can capture more complex patterns but is prone to overfitting.
N Estimators: Determines the number of trees in the ensemble; more trees generally improve performance but can increase training time.
Sampling Method: Controls how training rows are drawn when subsampling; XGBoost supports uniform and gradient-based sampling.
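As a rough guide, the parameters above correspond to the following keyword arguments of the xgboost estimators; the values shown are placeholders, not recommended settings:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    booster="gbtree",           # Booster
    learning_rate=0.1,          # Learning Rate
    gamma=0.0,                  # Min Split Loss (alias: min_split_loss)
    subsample=0.8,              # Subsample
    max_depth=6,                # Max Depth
    n_estimators=300,           # N Estimators
    sampling_method="uniform",  # Sampling Method ("uniform" or "gradient_based")
)
```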
AdaBoost
AdaBoost (Adaptive Boosting) is a machine learning algorithm that combines multiple weak learners, typically shallow decision trees, into a strong learner. It's a popular ensemble method that iteratively adjusts the weights of training instances to focus on the ones that are misclassified by previous learners.
How AdaBoost Works
Initialize Weights: Each training instance is assigned an equal weight initially.
Train Weak Learner: A weak learner (decision tree) is trained on the weighted dataset.
Calculate Error: The error rate of the weak learner is calculated.
Update Weights: The weights of misclassified instances are increased, while the weights of correctly classified instances are decreased.
Repeat: Steps 2-4 are repeated for a specified number of iterations or until a stopping criterion is met.
Combine Learners: The final prediction is made by combining the predictions of all weak learners, weighted based on their performance.
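The sketch below walks through these steps for binary classification with labels in {-1, +1}. It illustrates the weight updates rather than reproducing scikit-learn's exact implementation; X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    w = np.full(len(y), 1.0 / len(y))         # 1) equal initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)      # 2) train weak learner on weighted data
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))         # 3) weighted error rate
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
        w *= np.exp(-alpha * y * pred)        # 4) raise weights of misclassified rows
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)                  # 5) repeat for n_rounds
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # 6) weighted vote over all weak learners
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```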
AdaBoost for Classification
In classification problems, AdaBoost typically uses a weak learner like a decision stump (a decision tree with only one split). Each weak learner is trained on the weighted dataset, and its predictions are combined using a weighted voting scheme.
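A minimal scikit-learn sketch with a decision stump as the weak learner (assuming X_train, y_train, and X_test exist); note that the keyword is estimator in scikit-learn 1.2 and later, and base_estimator in older versions:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stump
    n_estimators=100,
    learning_rate=0.5,
)
clf.fit(X_train, y_train)
labels = clf.predict(X_test)  # weighted vote of all stumps
```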
AdaBoost for Regression
For regression problems, AdaBoost can be used with weak learners like linear regression models. The predictions of the weak learners are combined using a weighted average.
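The regression counterpart in scikit-learn; any regressor that supports sample weights can serve as the weak learner, so a linear model is used here purely for illustration:

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression

reg = AdaBoostRegressor(
    estimator=LinearRegression(),  # linear weak learner (estimator keyword as above)
    n_estimators=100,
    learning_rate=0.5,
)
reg.fit(X_train, y_train)
forecast = reg.predict(X_test)  # weighted average of the weak learners
```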
AdaBoost Parameters
N Estimators: The number of weak learners (decision trees) in the ensemble; more learners typically improve performance but can increase training time.
Learning Rate: Controls the step size at each iteration; a smaller learning rate often leads to better generalization but requires more iterations.
Algorithm: Specifies the boosting variant used; common choices are "SAMME" (boosting on discrete class labels) and "SAMME.R" (boosting on predicted class probabilities).
Max Depth: Sets the maximum depth of the weak-learner trees (a property of the base estimator rather than of AdaBoost itself); deeper trees can capture more complex patterns but are prone to overfitting.
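In scikit-learn terms, these parameters map roughly onto AdaBoostClassifier as shown below; Max Depth belongs to the weak-learner tree rather than to AdaBoost itself, and the values are placeholders:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),  # Max Depth (on the weak learner)
    n_estimators=200,                               # N Estimators
    learning_rate=0.3,                              # Learning Rate
    algorithm="SAMME",                              # Algorithm
)
```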
CatBoost
CatBoost is a gradient boosting framework similar to XGBoost, but it grows its trees differently.
How CatBoost Works
CatBoost is a gradient boosting framework specifically designed for handling categorical features efficiently. Here's a simplified overview of how it works:
Ordered Boosting: Unlike traditional gradient boosting, CatBoost uses ordered boosting: the residuals for each training example are computed with a model fitted only on the examples that precede it in a random permutation of the data. This avoids the target leakage (prediction shift) that occurs when the same examples are used both to build a tree and to evaluate its own errors.
Greedy Tree Construction: CatBoost greedily builds an ensemble of decision trees; by default these are symmetric (oblivious) trees, in which the same split is applied at every node of a given level. Each split, over numerical or categorical features, is chosen to maximize the reduction in the loss function.
Gradient-Based Optimization: CatBoost uses gradient-based optimization to update the model's parameters at each iteration. The gradients are calculated based on the loss function and the current model predictions.
Regularization: CatBoost employs regularization techniques, such as L2 regularization of the leaf values, to prevent overfitting and improve generalization.
Categorical Feature Handling: CatBoost automatically handles categorical features without requiring explicit feature engineering. It uses a special technique called "target encoding" to transform categorical values into numerical values based on their statistical properties with respect to the target variable.
Overall, CatBoost's combination of ordered boosting, greedy tree construction, gradient-based optimization, and regularization makes it a powerful and robust learner.
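The sketch below shows the practical upshot: categorical columns are passed to CatBoost directly, with no manual encoding. The feature name "session" and the data are hypothetical, and X_train is assumed to be a pandas DataFrame.

```python
from catboost import CatBoostClassifier, Pool

# Categorical columns are declared by name (or index); CatBoost encodes them itself.
train_pool = Pool(data=X_train, label=y_train, cat_features=["session"])

model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.05,
    depth=6,
    verbose=False,  # silence per-iteration logging
)
model.fit(train_pool)
```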
CatBoost for Classification
CatBoost is an excellent choice for classification tasks, especially when dealing with datasets containing numerous categorical features. Its unique approach to handling categorical variables, combined with its gradient boosting framework, provides several advantages, including improved generalization, flexibility, speed, and efficiency.
CatBoost for Regression
As with classification, CatBoost can be used to build ensembles of regression trees. The quality and robustness of the results depend heavily on the training parameters.
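A minimal regression sketch, assuming X_train, y_train, and X_test exist; RMSE is CatBoost's default regression loss:

```python
from catboost import CatBoostRegressor

reg = CatBoostRegressor(
    loss_function="RMSE",  # default regression loss
    iterations=300,
    learning_rate=0.05,
    depth=6,
    verbose=False,
)
reg.fit(X_train, y_train)
forecast = reg.predict(X_test)
```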
CatBoost Parameters
N Estimators: The number of weak learners (decision trees) in the ensemble; more learners typically improve performance but can increase training time.
Learning Rate: Controls the step size at each iteration; a smaller learning rate often leads to better generalization but requires more iterations.
Max Depth: Sets the maximum depth of the trees; a deeper tree can capture more complex patterns but is prone to overfitting.
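CatBoost's native names for these parameters are iterations, learning_rate, and depth, but the aliases n_estimators and max_depth are also accepted; the values below are placeholders:

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    n_estimators=300,    # N Estimators (alias for iterations)
    learning_rate=0.05,  # Learning Rate
    max_depth=6,         # Max Depth (alias for depth)
    verbose=False,
)
```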