| author | A Farzat <a@farzat.xyz> | 2025-11-01 10:19:00 +0300 |
|---|---|---|
| committer | A Farzat <a@farzat.xyz> | 2025-11-01 10:19:00 +0300 |
| commit | 576b204aca5d4f8d9f0d3898c1ce54dc08775c53 (patch) | |
| tree | 603a9c7cb422803d658d94c28bffdc9d6c535f7e /content/blog/csca5632-final/index.md | |
| parent | c1cc6eb579d4443e7b02352b6b2a56036637d627 (diff) | |
| download | farzat.xyz-576b204aca5d4f8d9f0d3898c1ce54dc08775c53.tar.gz farzat.xyz-576b204aca5d4f8d9f0d3898c1ce54dc08775c53.zip | |
Add the unsupervised learning final project
Diffstat (limited to 'content/blog/csca5632-final/index.md')
| -rw-r--r-- | content/blog/csca5632-final/index.md | 115 |
1 file changed, 115 insertions, 0 deletions
diff --git a/content/blog/csca5632-final/index.md b/content/blog/csca5632-final/index.md
new file mode 100644
index 0000000..e2b1b66
--- /dev/null
+++ b/content/blog/csca5632-final/index.md
@@ -0,0 +1,115 @@
+++
title = "🌸 Clustering Iris Species with K-Means and Hierarchical Methods"
description = "A hands-on comparison of K-Means and Agglomerative Clustering on the Iris dataset, with insights into performance, parameter tuning, and practical trade-offs."
date = 2025-11-01
[taxonomies]
tags = ["machine_learning"]
[extra]
styles = ["notebooks.css"]
+++

## Why Clustering?

Clustering is a foundational technique in unsupervised learning, used to
uncover patterns in data without predefined labels. This project explores how
two popular clustering algorithms, **K-Means** and **Agglomerative Hierarchical
Clustering**, perform on the well-known **Iris flower dataset**, aiming to
group samples by species based solely on their morphological features.

***

## About the Dataset

The Iris dataset contains **150 samples** from three species: *setosa*,
*versicolor*, and *virginica*. Each sample includes four features:

* Sepal length
* Sepal width
* Petal length
* Petal width

While *setosa* is linearly separable from the other two species, *versicolor*
and *virginica* overlap significantly, making this dataset ideal for testing
clustering algorithms.

***

## What Was Explored

The analysis focused on:

* Comparing **K-Means** and **Agglomerative Clustering**
* Evaluating performance using **accuracy**, **silhouette score**, and
  **Adjusted Rand Index (ARI)**
* Testing different **linkage methods** and **distance metrics**
* Visualizing clusters and errors using **PCA**

***

## Key Experiments & Findings

### Dimensionality Reduction with PCA

* **95.8%** of the variance was captured using just **2 principal
  components**, confirming strong correlations among the features, especially
  between petal length and width. (A short sketch reproducing this figure
  follows below.)
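This figure is easy to check. Below is a minimal sketch, assuming scikit-learn
and that the features were standardized before PCA (on unscaled Iris features,
two components capture closer to 97.8%); the notebook's actual code may differ:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the four morphological features (sepal/petal length and width).
X, _ = load_iris(return_X_y=True)

# Standardize each feature, then project onto two principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Fraction of total variance retained by the 2-D projection (~0.958).
print(pca.explained_variance_ratio_.sum())
```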
### Optimal Cluster Count

* Using metrics like **inertia**, **silhouette score**, and **accuracy**, the
  optimal number of clusters was found to be **3**, matching the true number
  of species.

### Parameter Tuning for Agglomerative Clustering

* Tried combinations of:
  * Linkage: `ward`, `complete`, `average`, `single`
  * Metrics: `euclidean`, `manhattan`, `cosine`
* **Best result**: `average` linkage with `manhattan` distance achieved
  **88.7% accuracy**, outperforming the default settings.

### Performance Comparison

| Algorithm               | Accuracy | Silhouette Score | ARI  |
| ----------------------- | :------: | :--------------: | :--: |
| K-Means                 |  83.3%   |       0.46       | 0.62 |
| Agglomerative (default) |  82.7%   |       0.45       | 0.61 |
| Agglomerative (best)    |  88.7%   |       0.45       | 0.72 |

(Since clustering produces unlabeled groups, "accuracy" requires first
matching cluster IDs to species labels; one way to do this is sketched at the
end of the post.)

***

## Error Analysis

* **Setosa** was classified almost perfectly across all methods.
* Most errors occurred between **versicolor** and **virginica**, confirming
  their overlapping nature.
* Agglomerative Clustering showed **bias** depending on parameters, sometimes
  misclassifying one species more than the other.

***

## Final Thoughts

While Agglomerative Clustering achieved the highest accuracy with tuned
parameters, its **sensitivity to configuration** and **instability** in
cluster composition make it less reliable for real-world applications without
labeled data.

**K-Means**, despite slightly lower accuracy, offered **more balanced
results** and **greater stability**, making it a safer choice for practical
clustering tasks.

***

## Future Work

* Extend the analysis to other clustering algorithms such as DBSCAN or
  Spectral Clustering
* Apply the methods to more complex datasets
* Explore automated parameter tuning techniques

***

The full notebook with code and visualizations is embedded below 👇

<!-- markdownlint-disable MD033 -->
<iframe title="Iris species clustering notebook" class="notebook-embed" src="notebook.html"></iframe>

You can also view the notebook in [a separate page](notebook.html), or check it
out on [GitHub](https://github.com/Farzat07/Unsupervised-Learning-Final-Project-Iris-Species-Clustering-Analysis).
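***

For a feel of the comparison loop without opening the notebook, here is a
minimal sketch, assuming scikit-learn and SciPy. It is not the notebook's
exact code: in particular, the `clustering_accuracy` helper, which matches
cluster IDs to species via the Hungarian algorithm, is just one reasonable way
to compute the accuracy figures reported above.

```python
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (adjusted_rand_score, confusion_matrix,
                             silhouette_score)

X, y = load_iris(return_X_y=True)

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one matching of cluster IDs to
    class labels (Hungarian algorithm on the confusion matrix)."""
    cm = confusion_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-cm)  # negate to maximize matches
    return cm[rows, cols].sum() / cm.sum()

models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative (default)": AgglomerativeClustering(n_clusters=3),
    # `metric=` requires scikit-learn >= 1.2 (older versions used `affinity=`).
    "Agglomerative (best)": AgglomerativeClustering(
        n_clusters=3, linkage="average", metric="manhattan"),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(f"{name}: "
          f"accuracy={clustering_accuracy(y, labels):.3f}, "
          f"silhouette={silhouette_score(X, labels):.3f}, "
          f"ARI={adjusted_rand_score(y, labels):.3f}")
```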