+++
title = "🌸 Clustering Iris Species with K-Means and Hierarchical Methods"
description = "A hands-on comparison of K-Means and Agglomerative Clustering on the Iris dataset, with insights into performance, parameter tuning, and practical trade-offs."
date = 2025-11-01
[taxonomies]
tags = ["machine_learning"]
[extra]
styles = ["notebooks.css"]
+++

## Why Clustering?

Clustering is a foundational technique in unsupervised learning, used to
uncover patterns in data without predefined labels. This project explores how
two popular clustering algorithms—**K-Means** and **Agglomerative Hierarchical
Clustering**—perform on the well-known **Iris flower dataset**, aiming to group
samples by species based solely on their morphological features.

***

## About the Dataset

The Iris dataset contains **150 samples** from three species: *setosa*,
*versicolor*, and *virginica*. Each sample includes four features:

* Sepal length
* Sepal width
* Petal length
* Petal width

While *setosa* is linearly separable, *versicolor* and *virginica* overlap
significantly, making this dataset ideal for testing clustering algorithms.
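
The dataset ships with scikit-learn, so reproducing the setup takes only a few
lines. A minimal sketch (variable names are illustrative):

```python
from sklearn.datasets import load_iris

# 150 samples, 4 morphological features, 3 species (50 samples each).
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)            # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```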

***

## What Was Explored

The analysis focused on:

* Comparing **K-Means** and **Agglomerative Clustering**
* Evaluating performance using **accuracy**, **silhouette score**, and
**Adjusted Rand Index (ARI)**
* Testing different **linkage methods** and **distance metrics**
* Visualizing clusters and errors using **PCA**

***

## Key Experiments & Findings

### Dimensionality Reduction with PCA

* **95.8%** of the variance was captured using just **2 principal components**, confirming
strong correlations among features—especially between petal length and width.
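
For reference, the reduction can be sketched with scikit-learn's `PCA`.
Standardising the features first is an assumption on my part, but it is
consistent with the figure above: on scaled Iris data the first two components
capture roughly 95.8% of the variance.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise so each feature contributes on the same scale.
X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Share of total variance captured by the first two components.
print(pca.explained_variance_ratio_.sum())  # ≈ 0.958
```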

### Optimal Cluster Count

* Using metrics like **inertia**, **silhouette score**, and **accuracy**, the optimal
number of clusters was found to be **3**, matching the true number of species.
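
A typical sweep over candidate cluster counts looks like this (a sketch,
assuming standardised features; `random_state=42` is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")
```

Worth noting: on Iris the silhouette score alone tends to favour k=2 (*setosa*
versus everything else), which is exactly why combining it with the inertia
elbow and label-based accuracy matters.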

### Parameter Tuning for Agglomerative Clustering

* Tried combinations of:
  * Linkage: `ward`, `complete`, `average`, `single`
  * Metrics: `euclidean`, `manhattan`, `cosine`
* **Best result**: `average` linkage with `manhattan` distance achieved **88.7%
accuracy**, outperforming the default settings.

### Performance Comparison

| Algorithm               | Accuracy | Silhouette Score | ARI  |
| ----------------------- | :------: | :--------------: | :--: |
| K-Means                 | 83.3%    | 0.46             | 0.62 |
| Agglomerative (default) | 82.7%    | 0.45             | 0.61 |
| Agglomerative (best)    | 88.7%    | 0.45             | 0.72 |
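
The K-Means row can be approximately reproduced like this (a sketch:
standardised features and `random_state=42` are my assumptions, so exact
numbers may differ slightly from the table):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Cluster ids are arbitrary, so accuracy needs a cluster -> species mapping.
mapped = np.empty_like(km.labels_)
for c in np.unique(km.labels_):
    mask = km.labels_ == c
    mapped[mask] = np.bincount(y[mask]).argmax()

accuracy = (mapped == y).mean()
silhouette = silhouette_score(X, km.labels_)
ari = adjusted_rand_score(y, km.labels_)
print(f"accuracy={accuracy:.3f}  silhouette={silhouette:.3f}  ari={ari:.3f}")
```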

***

## Error Analysis

* **Setosa** was classified almost perfectly across all methods.
* Most errors occurred between **versicolor** and **virginica**, confirming
their overlapping nature.
* Agglomerative Clustering showed **bias** depending on parameters—sometimes
misclassifying one species more than the other.
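
A quick way to see this pattern is a cross-tabulation of true species against
cluster labels (K-Means shown here as an illustration; the same table works
for any clustering):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Rows: true species, columns: cluster ids. A cleanly separated species
# concentrates in one cell per row; overlap spreads a row across columns.
table = np.zeros((3, 3), dtype=int)
for true, cl in zip(iris.target, labels):
    table[true, cl] += 1
print(iris.target_names)
print(table)
```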

***

## Final Thoughts

While Agglomerative Clustering achieved the highest accuracy with tuned
parameters, its **sensitivity to configuration** and **instability** in cluster
composition make it less reliable for real-world applications without labeled
data.

**K-Means**, despite slightly lower accuracy, offered **more balanced results**
and **greater stability**, making it a safer choice for practical clustering
tasks.

***

## Future Work

* Extend the analysis to other clustering algorithms such as DBSCAN or Spectral Clustering
* Apply to more complex datasets
* Explore automated parameter tuning techniques

***

The full notebook with code and visualizations is embedded below 👇

<!-- markdownlint-disable MD033 -->
<iframe title="Iris Species Clustering notebook" class="notebook-embed" src="notebook.html"></iframe>

You can also view the notebook in [a separate page](notebook.html), or check it
on [GitHub](https://github.com/Farzat07/Unsupervised-Learning-Final-Project-Iris-Species-Clustering-Analysis).