
General
A decision tree is one of the most common methods used in statistics, machine learning, and data mining. When evaluating a decision tree, the model answers a sequence of questions about the input features at each node. This process yields a function fm(x) that forms the basis of a classification or regression model.
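As a minimal sketch of the idea (assuming scikit-learn is installed; the toy data is hypothetical), a decision tree classifier learns a series of feature questions that partition the samples:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix X and labels y; each row is one sample.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR-like labels

# The tree learns questions such as "is feature 0 <= 0.5?" at each node
# until the samples in each leaf share the same label.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

print(model.predict([[0, 1]]))  # predicts the label of a sample
```

The fitted tree acts as the function fm(x): a new sample is routed through the learned questions to a leaf, whose stored label becomes the prediction.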

XGBoost is a gradient-boosted decision tree (GBDT) library. Gradient boosting is an ensemble learning method that improves accuracy by combining many weak learners, typically shallow decision trees, each trained to correct the errors of the ones before it. Training speed and the scale of data a model can handle depend heavily on available processing power, and XGBoost includes GPU support that can substantially accelerate training.
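As a hedged sketch of the GPU acceleration mentioned above (parameter names follow recent XGBoost releases, which use a `device` parameter; older versions used `tree_method="gpu_hist"` instead), a GPU-backed booster can be configured with a parameter dictionary such as:

```python
# Hypothetical parameter dictionary for xgboost.train(); running it
# requires an XGBoost build with CUDA support and an available GPU.
params = {
    "tree_method": "hist",        # histogram-based tree construction
    "device": "cuda",             # train on the GPU (XGBoost >= 2.0)
    "objective": "binary:logistic",
    "max_depth": 6,
}
```

With a CPU-only build, dropping the `device` entry (or setting it to `"cpu"`) yields the same model, only trained more slowly.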

DMLC, the organization that created XGBoost, works with teams from other organizations to build external tools that improve XGBoost's efficiency, scalability, and feature set, supporting its adoption in the data science industry.

Example code explanation
The example code reads and uses a data frame via the Python library pandas.

Categorical data with the data type str/object should be preprocessed into a numerical format before the model can accept it.
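For example (a sketch assuming pandas is installed; the column names are hypothetical), string columns can be converted to integer codes or one-hot columns:

```python
import pandas as pd

# Hypothetical data frame with a string/object column.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Option 1: integer codes via the categorical dtype
# (categories are ordered alphabetically: blue=0, red=1).
df["color_code"] = df["color"].astype("category").cat.codes

# Option 2: one-hot encoding, one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["color"])

print(df["color_code"].tolist())  # [1, 0, 1]
```

Either representation is numeric, so the resulting columns can be passed to the model directly.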

The feature variables are the column names used as input variables for the model (these can be considered the X of the function/algorithm).

The label variable is the output variable of the model (this can be considered the Y of the function/algorithm).

The train_test_split function, imported from the Python library scikit-learn, creates four different data sets: X train, X test, y train, and y test.

Once the data is split, you can train the model using the model's fit function, passing the X train and y train data in that order.

After the model has been trained, you can use the X test variables to generate predicted values through the model's predict function.

Store the X test variables in a data frame and insert a column called "pred" holding the predictions, so the program can be developed further.

Possible future development suggestions
The classification predictions can be used to evaluate the model, tune it, and deploy it.

Examples include cross-validation, feature selection, and custom scorers. The model can be stored as a Pickle or Joblib file for further utilization, modification, and deployment.
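Persisting a trained model can be sketched with the standard-library pickle module (joblib offers an equivalent dump/load API; a small scikit-learn tree stands in here, but an XGBoost model pickles the same way):

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Train a small stand-in model.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = DecisionTreeClassifier().fit(X, y)

# Serialize to bytes (or use pickle.dump with an open file).
blob = pickle.dumps(model)

# Later: restore the model and reuse it without retraining.
restored = pickle.loads(blob)
print(restored.predict([[3]]))  # same predictions as the original model
```

The restored object behaves identically to the original, which is what makes this pattern useful for deployment.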

Further improvements can be made by combining the model with other Python libraries so that it can be embedded in software applications.

Fraud Detection
Fraud detection in financial institutions is most often performed through machine learning. XGBoost has been gaining recognition in the field of fraud detection for its pattern analysis, showing high accuracy while remaining resistant to overfitting.

Predictive Maintenance
Using features such as equipment status, performance measurements, and time-series data on past maintenance, XGBoost can analyze each feature to produce an optimized maintenance schedule. This allows manufacturers to prevent accidents and equipment failures, making the solution cost-efficient.

Customer Churn
XGBoost can analyze customer behavior and usage patterns to predict whether a customer is likely to cancel their subscription.

Personalized Advertising
This is one of the most profitable and practical applications of machine learning. Because XGBoost performs well at analyzing user patterns and behavior, it is commonly used in personalized advertising.

Regression prediction
Because XGBoost is a gradient-boosted decision tree model, it inherits the decision tree's difficulty with extrapolating continuous outputs. In a tree model, each leaf node returns the mean of the training samples that fall into it, so predictions can never exceed the range between the lowest and highest target values seen in training. This makes predictions outside the training range unreliable.
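This range limitation can be demonstrated with a plain decision tree regressor (a sketch assuming scikit-learn is installed; gradient-boosted trees behave the same way):

```python
from sklearn.tree import DecisionTreeRegressor

# Train on y = 2x for x in 0..9, so targets range from 0 to 18.
X_train = [[x] for x in range(10)]
y_train = [2 * x for x in range(10)]

model = DecisionTreeRegressor().fit(X_train, y_train)

# Ask for a prediction far outside the training range.
pred = model.predict([[100]])[0]
print(pred)  # stays at 18.0, the largest training target, not 200
```

The input x=100 falls into the rightmost leaf, whose value is the mean of the highest training targets, so the prediction is capped at 18.0 rather than extrapolating the linear trend.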

Real Time Data Analysis
XGBoost was designed for batch learning, meaning the model requires all the data to be present before training can occur. This can be worked around through various methods, but XGBoost is generally not the best option for real-time, streaming tasks.

Natural Language Processing
XGBoost was not originally designed for natural language processing (NLP), though it can still perform some tasks in the field. Even where it can handle a given NLP task, it is questionable whether it does so efficiently and effectively. Neural approaches such as recurrent neural networks (RNNs) are typically the first option for natural language processing.

Computer Vision
XGBoost also struggles with computer vision because of the nature of the field. Computer vision tasks require processing high-dimensional image data, often in real time. XGBoost is not well suited to these tasks, especially given that convolutional neural networks (CNNs) can process such data far more efficiently and effectively.