I am trying out an unsupervised ML algorithm, and for this I am testing K-Means in pyspark. I am running the code from the documentation, which is as follows:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
# Make predictions
predictions = model.transform(dataset)
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
I get an error saying that ClusteringEvaluator was not found. Does anyone know if it was deprecated in some version of PySpark, or why I get this error when importing it? Thanks in advance.
I share an image of the error that appears:
If you want to use ClusteringEvaluator(), you need PySpark 2.3 or later, since that is the version in which it was introduced. In the pyspark 2.2 documentation you can see that it is NOT found; in the pyspark 2.3 documentation you can see that it IS found.
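To confirm which case you are in before changing anything, you can check the installed version from Python. This is only a minimal sketch: the helper names supports_clustering_evaluator and installed_pyspark_version are made up for illustration, it assumes Python 3.8+ (for importlib.metadata), and it assumes the version string starts with plain numeric major.minor parts.

```python
from importlib import metadata

# ClusteringEvaluator first appeared in PySpark 2.3 (per the answer above).
MIN_VERSION = (2, 3)


def supports_clustering_evaluator(version):
    """Return True if a version string such as '2.4.0' is at least 2.3.

    Hypothetical helper: only the major and minor parts are compared.
    """
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) >= MIN_VERSION


def installed_pyspark_version():
    """Return the installed PySpark version string, or None if not installed."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pyspark_version()
    if version is None:
        print("pyspark is not installed in this environment")
    elif supports_clustering_evaluator(version):
        print("pyspark " + version + " includes ClusteringEvaluator")
    else:
        print("pyspark " + version + " is too old; upgrade to 2.3 or later")
```

If it reports a version below 2.3, the import error in the question is expected and an upgrade should fix it.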
You can switch to any version by uninstalling the package (pip uninstall pyspark) and then installing it again with the package name and the version you want (pip install pyspark==<version>).
Version 3 came out in June. I would install it if you are starting a new project. If you need to make changes to an old project, I would install pyspark 2.4 with pip install pyspark==2.4, since there are significant changes between versions 2 and 3 that can cause problems.