I have a vector column called pca, which I decompose into two values that I call x and y in the following way:
from pyspark.ml.linalg import Vectors
var = transformed.select('customer_id','pca')
def extract(row):
    return (row.customer_id, ) + tuple(row.pca.toArray().tolist())
var_a = var.rdd.map(extract).toDF(["customer_id"])
var_a = var_a.withColumnRenamed("_2","x")
var_a = var_a.withColumnRenamed("_3","y")
var_a.show()
Giving me as a result:
Then I separate x and y as follows:
x = var_a.select("x")
y = var_a.select("y")
I do this in order to make a scatter plot of the two variables; my attempt is shown in the following code. Note that the prediction column I refer to only contains values from 0 to 6, hence the seven colors assigned to differentiate the clusters.
df = predictions_pca.select('prediction').toPandas()
colores=['red','green','blue','yellow','fuchsia','black','purple']
asignar=[]
for row in df:
    asignar.append(colores[int(row)])
plt.scatter(x, y, c=asignar, s=1)
plt.xlabel('Var_1')
plt.ylabel('Var_2')
plt.title('K-Means Clustering')
plt.show()
However, this code raises an error. Could someone guide me or tell me what I'm doing wrong?
I attach the error traceback:
I was finally able to solve it. What I was missing was a collect, plus converting the variables to lists, as follows:
With practically the same code to make the plot:
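A minimal sketch of that kind of fix, assuming var_a and predictions_pca are the DataFrames from above and that their rows come back in the same order. The helper names column_to_list, cluster_colors and plot_clusters are hypothetical, not part of PySpark or matplotlib:

```python
COLORES = ['red', 'green', 'blue', 'yellow', 'fuchsia', 'black', 'purple']


def column_to_list(df, name):
    """Collect one column of a Spark DataFrame into a plain Python list."""
    # collect() brings the rows to the driver; each Row supports row[name].
    return [row[name] for row in df.select(name).collect()]


def cluster_colors(labels, palette=COLORES):
    """Map each cluster label (0 .. len(palette) - 1) to a color name."""
    return [palette[int(label)] for label in labels]


def plot_clusters(var_a, predictions_pca):
    """Scatter-plot x against y, colored by cluster (assumes aligned row order)."""
    # Imported here so the two helpers above work without matplotlib installed.
    import matplotlib.pyplot as plt

    x = column_to_list(var_a, "x")
    y = column_to_list(var_a, "y")
    asignar = cluster_colors(column_to_list(predictions_pca, "prediction"))

    plt.scatter(x, y, c=asignar, s=1)
    plt.xlabel('Var_1')
    plt.ylabel('Var_2')
    plt.title('K-Means Clustering')
    plt.show()
```

The point is that plt.scatter receives plain Python lists rather than Spark DataFrames, and that the colors are built from the label values themselves rather than by iterating over the pandas DataFrame, which yields column names.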
I got the expected output:
I hope someone else finds it helpful. Greetings!