I have a huge amount of data on which I would like to run a k-means clustering. The dataset is so big that I cannot load the files into memory.
My idea is to run the clustering on part of the dataset, as a training set, and then apply the fitted model to the rest of the dataset part by part.
import pandas as pd
import pickle
from sklearn.cluster import KMeans

## ifiles is a list of paths to the input HDF5 files
frames = [pd.read_hdf(fin) for fin in ifiles]
data = pd.concat(frames, ignore_index=True, axis=0)
data.dropna(inplace=True)

k = 12  ## Number of clusters

## pd.concat takes a list of objects, not separate positional arguments
x = pd.concat([data['A'], data['B'], data['C']], axis=1, keys=['A', 'B', 'C'])

model = KMeans(n_clusters=k, random_state=0, n_jobs=-2)
model.fit(x)

## Save the fitted model for later use
with open(filename, 'wb') as fout:
    pickle.dump(model, fout)
x looks like this:
array([[-2.26732099,  0.24895614,  2.34840191],
       [-2.26732099,  0.22270912,  1.88942378],
       [-1.99246557,  0.04154312,  2.63458941],
       ...,
       [-4.29596287,  1.97036309, -0.22767511],
       [-4.26055474,  1.72347591, -0.18185197],
       [-4.15980382,  1.73176239, -0.30781225]])
The model looks like this:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=12, n_init=10, n_jobs=-2, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
A plot of two of the features, color coded by the model's cluster labels, looks like this:

[image]
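Since the full dataset will never fit in memory at once, one route I have been looking at is scikit-learn's MiniBatchKMeans, which can be trained incrementally with partial_fit. A rough sketch of the fitting step (assuming the HDF5 files are stored in table format, so pandas can read them in chunks):

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=k, random_state=0)
for fin in ifiles:
    ## chunksize requires the HDF5 files to be written with format='table'
    for chunk in pd.read_hdf(fin, chunksize=100000):
        chunk = chunk.dropna()
        mbk.partial_fit(chunk[['A', 'B', 'C']])  ## update centroids with this chunk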
Now I want to load the model and use it for prediction. As a test, I have loaded the same data again (not shown here) and am trying to predict on it.
with open('test.pkl', 'rb') as fin:
    modelnew = pickle.load(fin)
modelnew.predict(x)
The result:

[image]
This data clearly does not cluster. What am I missing? Do I need to fix the model parameters in some way?
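For completeness, the part-by-part prediction I am ultimately aiming for would look roughly like this (a sketch, with the same table-format assumption as above):

for fin in ifiles:
    for chunk in pd.read_hdf(fin, chunksize=100000):
        chunk = chunk.dropna()
        labels = modelnew.predict(chunk[['A', 'B', 'C']])
        ## ... store the labels for this chunk ...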
I have tried to make a small example with a train and a test data set, and it also goes wrong there, so there is clearly something I am missing:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Splitting data into a train and a test data set
sample_train, sample_test = train_test_split(x, test_size=0.50)

k = 12                                                   ## Setting number of clusters
model = KMeans(n_clusters=k, random_state=0, n_jobs=-2)  ## KMeans model
train = model.fit(sample_train)                          ## Fitting the training data (fit returns the model itself)
model.predict(sample_test)                               ## Predicting the test data
centroids = model.cluster_centers_
labels = model.labels_

## Figures
cmap_model = np.array(['red', 'lime', 'black', 'green', 'orange', 'blue', 'gray', 'magenta', 'cyan', 'purple', 'pink', 'lightblue', 'brown', 'yellow'])
plt.figure()
plt.scatter(sample_train[:, 0], sample_train[:, 1], c=cmap_model[train.labels_], s=10, edgecolors='none')
## Slice the color array to k entries so its length matches the 12 centroids
plt.scatter(centroids[:, 0], centroids[:, 1], c=cmap_model[:k], marker='x', s=150, linewidths=5, zorder=10)
plt.figure()
plt.scatter(sample_test[:, 0], sample_test[:, 1], c=cmap_model[labels], s=10, edgecolors='none')
plt.scatter(centroids[:, 0], centroids[:, 1], c=cmap_model[:k], marker='x', s=150, linewidths=5, zorder=10)
plt.show()
Train data:

[image]

Test data:

[image]
The model is fine; the problem is in the plotting. In your train/test example, the return value of model.predict(sample_test) is thrown away, and the test points are then colored with labels = model.labels_, which are the cluster assignments of the training samples. Those labels have no relation to the test rows they are plotted against, so the colors look random even though the clustering itself works. Keep the predictions and use them for the test plot. Also note that k-means is unsupervised, so there are no y labels to pass to train_test_split or fit; split and fit on the features alone:

from sklearn.model_selection import train_test_split

## Split the features into a training set and a test set
sample_train, sample_test = train_test_split(x, test_size=0.50, random_state=42)

## Fit on the training set only
model.fit(sample_train)

## Keep the predicted cluster labels for the test set
test_labels = model.predict(sample_test)

Then color the test scatter with cmap_model[test_labels] instead of cmap_model[labels]. The same applies to the pickled model: use the array returned by modelnew.predict(x) to color the points, and the clusters will show up again.
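As a quick sanity check (a sketch, reusing the fitted model from above): predicting the training set back should reproduce the labels stored during fitting, since predict simply assigns each point to its nearest centroid; exact agreement can be off by the odd numerical tie.

import numpy as np

## Re-derive the training assignments via predict
relabelled = model.predict(sample_train)

## Compare with the labels stored during fit; expect a fraction of ~1.0
print(np.mean(relabelled == model.labels_))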