Using the Python API from the xgboost documentation, I am creating the training data with:
dtrain = xgb.DMatrix(file_path)
Here file_path is a libsvm-format txt file. As I am doing pairwise ranking, I am also inputting the lengths of the groups for the dtrain data we just created:
dtrain.set_group(group_len_file)
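For context, set_group expects one integer per query group, and the integers must sum to the number of rows in the DMatrix. A minimal sketch of deriving such group sizes from per-row query ids (the qid array here is hypothetical, and assumes rows of the same query are contiguous):

```python
import numpy as np

# hypothetical per-row query ids, rows of each query stored contiguously
qid = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])

# one count per query group, in order
_, group_sizes = np.unique(qid, return_counts=True)
print(group_sizes.tolist())  # [3, 2, 4]
```

These are the same numbers a group-length file for set_group would contain, one per line.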
and now I am training the model:
param = {'bst:max_depth':2, 'bst:eta':1, 'silent':1, 'objective':'rank:pairwise' }
param['nthread'] = 4
param['eval_metric'] = 'ndcg'
bst = xgb.train(param,dtrain,10)
Now I want to use grid search. So my question is: how can I use GridSearchCV from sklearn when my input is a DMatrix?
Generally:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
param_test1 = {
    'max_depth': list(range(3, 10, 2)),
    'min_child_weight': list(range(1, 6, 2))
}
gsearch = GridSearchCV(estimator=XGBClassifier(objective='rank:pairwise'),
                       param_grid=param_test1)
but gsearch doesn't have a train method. It only has fit, which takes X and y as inputs, not a DMatrix. Is there a way to do this?
And if not, I could convert the DMatrix into X, y arrays, but I can't find in the documentation how to supply the set_group information in that case. Any idea?
To use grid search with XGBoost in scikit-learn, you need one of the scikit-learn wrapper classes. For ranking objectives such as rank:pairwise, that is XGBRanker (XGBClassifier and XGBRegressor are the classification and regression counterparts). These classes expect the input data as NumPy arrays, scipy sparse matrices, or pandas DataFrames, rather than the DMatrix objects used by the low-level XGBoost API.
So you will first need to get your data out of the DMatrix (or load it in array form directly). The .get_label() method returns the labels as a NumPy array, and in recent xgboost versions .get_data() returns the feature matrix as a scipy.sparse matrix.
For example, loading the libsvm file directly and using XGBRanker:
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRanker

# Load the libsvm-format file into a sparse feature matrix and a label vector
X, y = load_svmlight_file(file_path)

# Load the group lengths as a NumPy array
group_len = np.loadtxt(group_len_file, dtype=int)

# Create the grid search object
param_test1 = {
    'max_depth': list(range(3, 10, 2)),
    'min_child_weight': list(range(1, 6, 2))
}
gsearch = GridSearchCV(estimator=XGBRanker(objective='rank:pairwise'),
                       param_grid=param_test1)

# Fit, passing the group lengths through to XGBRanker.fit
gsearch.fit(X, y, group=group_len)
Note that you pass the group-length array as the group parameter of fit, rather than calling set_group as you did with the DMatrix object. One caveat: GridSearchCV's default cross-validation splits rows without regard to query-group boundaries, so the group array will not line up with the training folds; for trustworthy scores you would need a group-aware CV splitter.
Alternatively, you can skip the scikit-learn wrapper entirely and run the grid search yourself against the DMatrix, scoring each parameter combination with xgb.cv. A minimal sketch, using the same parameter ranges as above (check how your xgboost version handles group information when xgb.cv builds its folds):
import itertools
import xgboost as xgb

base_param = {'objective': 'rank:pairwise', 'eta': 1,
              'eval_metric': 'ndcg', 'nthread': 4}
grid = {'max_depth': [3, 5, 7, 9], 'min_child_weight': [1, 3, 5]}

best_score, best_param = None, None
for values in itertools.product(*grid.values()):
    param = {**base_param, **dict(zip(grid.keys(), values))}
    cv_result = xgb.cv(param, dtrain, num_boost_round=10)
    score = cv_result['test-ndcg-mean'].iloc[-1]
    if best_score is None or score > best_score:
        best_score, best_param = score, param
This keeps the DMatrix (and its set_group information) intact while still searching the same parameter grid through the low-level training interface.