+++
title = "🌸 Clustering Iris Species with K-Means and Hierarchical Methods"
description = "A hands-on comparison of K-Means and Agglomerative Clustering on the Iris dataset, with insights into performance, parameter tuning, and practical trade-offs."
date = 2025-11-01
[taxonomies]
tags = ["machine_learning"]
[extra]
styles = ["notebooks.css"]
+++
## Why Clustering?
Clustering is a foundational technique in unsupervised learning, used to
uncover patterns in data without predefined labels. This project explores how
two popular clustering algorithms—**K-Means** and **Agglomerative Hierarchical
Clustering**—perform on the well-known **Iris flower dataset**, aiming to group
samples by species based solely on their morphological features.
***
## About the Dataset
The Iris dataset contains **150 samples** from three species: *setosa*,
*versicolor*, and *virginica*. Each sample includes four features:
* Sepal length
* Sepal width
* Petal length
* Petal width
While *setosa* is linearly separable, *versicolor* and *virginica* overlap
significantly, making this dataset ideal for testing clustering algorithms.
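The dataset ships with scikit-learn, so it can be loaded directly. A minimal sketch (the notebook may load it differently):

```python
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 morphological features, 3 species.
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)            # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```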
***
## What Was Explored
The analysis focused on:
* Comparing **K-Means** and **Agglomerative Clustering**
* Evaluating performance using **accuracy**, **silhouette score**, and
**Adjusted Rand Index (ARI)**
* Testing different **linkage methods** and **distance metrics**
* Visualizing clusters and errors using **PCA**
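As a baseline for the comparisons above, fitting K-Means with `k=3` and scoring it with silhouette and ARI looks roughly like this (a sketch assuming raw, unscaled features; the notebook's exact preprocessing may differ, so the scores it reports can differ too):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y = load_iris(return_X_y=True)

# Baseline K-Means with k=3 (the true number of species).
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

# Silhouette needs no labels; ARI compares clusters against true species.
sil = silhouette_score(X, labels)
ari = adjusted_rand_score(y, labels)
print(f"silhouette={sil:.2f}, ARI={ari:.2f}")
```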
***
## Key Experiments & Findings
### Dimensionality Reduction with PCA
* **95.8%** of the variance was captured using just **2 principal components**, confirming
strong correlations among features—especially between petal length and width.
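The 95.8% figure is consistent with PCA on standardized features, so a sketch under that assumption:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_.sum())  # ≈ 0.958
```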
### Optimal Cluster Count
* Using metrics like **inertia**, **silhouette score**, and **accuracy**, the optimal
number of clusters was found to be **3**, matching the true number of species.
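Sweeping the cluster count and recording inertia (for the elbow plot) and silhouette score can be sketched as follows; note that on Iris the silhouette alone tends to favor k=2, which is why the notebook combines several metrics:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# For each k, record inertia (elbow method) and the silhouette score.
results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.2f}")
```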
### Parameter Tuning for Agglomerative Clustering
* Tried combinations of:
* Linkage: `ward`, `complete`, `average`, `single`
* Metrics: `euclidean`, `manhattan`, `cosine`
* **Best result**: `average` linkage with `manhattan` distance achieved **88.7%
  accuracy**, outperforming the default settings.
### Performance Comparison
| Algorithm | Accuracy | Silhouette Score | ARI |
| ----------------------- | :------: | :--------------: | :--: |
| K-Means | 83.3% | 0.46 | 0.62 |
| Agglomerative (default) | 82.7% | 0.45 | 0.61 |
| Agglomerative (best) | 88.7% | 0.45 | 0.72 |
***
## Error Analysis
* **Setosa** was classified almost perfectly across all methods.
* Most errors occurred between **versicolor** and **virginica**, confirming
their overlapping nature.
* Agglomerative Clustering showed **bias** depending on parameters—sometimes
misclassifying one species more than the other.
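A quick way to see which species account for the errors is to cross-tabulate true species against assigned clusters (shown here for K-Means; the same works for any clustering):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Rows: true species; columns: assigned cluster ids. Setosa lands in a
# single cluster, while versicolor and virginica bleed into each other.
ct = pd.crosstab(iris.target_names[iris.target], labels)
print(ct)
```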
***
## Final Thoughts
While Agglomerative Clustering achieved the highest accuracy with tuned
parameters, its **sensitivity to configuration** and **instability** in cluster
composition make it less reliable for real-world applications without labeled
data.
**K-Means**, despite slightly lower accuracy, offered **more balanced results**
and **greater stability**, making it a safer choice for practical clustering
tasks.
***
## Future Work
* Extend analysis to other clustering algorithms like DBSCAN or Spectral Clustering
* Apply to more complex datasets
* Explore automated parameter tuning techniques
***
The full notebook with code and visualizations is embedded below 👇
<!-- markdownlint-disable MD033 -->
<iframe title="Iris Species Clustering notebook" class="notebook-embed" src="notebook.html"></iframe>
You can also view the notebook in [a separate page](notebook.html), or check it
on [GitHub](https://github.com/Farzat07/Unsupervised-Learning-Final-Project-Iris-Species-Clustering-Analysis).