| author | A Farzat <a@farzat.xyz> | 2025-10-13 13:12:12 +0300 |
|---|---|---|
| committer | A Farzat <a@farzat.xyz> | 2025-10-13 13:12:12 +0300 |
| commit | c26169e2d0663d21644d78a1da1348b80f9dff4d (patch) | |
| tree | c21e85613cbb7007188b1477e8aeaee07bac8c2d | |
| parent | c02504be3ff744a740b02a4e9372cd2e155ddc68 (diff) | |
Add the whiteboard
| -rw-r--r-- | README.md | 69 |
| -rw-r--r-- | images/Whiteboard1.png | bin, 0 -> 273691 bytes |
| -rw-r--r-- | images/Whiteboard2.png | bin, 0 -> 183812 bytes |
3 files changed, 69 insertions, 0 deletions
@@ -23,3 +23,72 @@ and [nginx](http://nginx.org/).
 - Language: JavaScript.
 - Built using [Vite](https://vite.dev/).
 - Hosted on [GitHub Pages](https://pages.github.com/).
+
+## Whiteboard
+
+The following is the initial version of the whiteboard:
+
+
+
+The data collector fetches the RSS feeds of YouTube channels/playlists for new videos.
+In the original design, it kept the list of subscriptions to fetch in memory,
+updating it not by re-querying the database but by following instructions
+from the data analyser. Its only interactions with the database were getting the
+subscriptions list at startup and saving new videos after each fetch.
+
+The data analyser is the component that actually controls the subscription list, adding and
+deleting entries. It too keeps the list of subscriptions in memory, and periodically fetches
+additional video details (duration) using [yt-dlp](https://github.com/yt-dlp/yt-dlp),
+writing them to the database.
+
+Finally, we have the Flask app, which serves the data through an API. It can also
+modify the subscriptions list, though not by itself: it has to
+message the data analyser through the message queue. This is because scraping YouTube
+might be needed to convert channel URLs to feed links, for which the data analyser
+is better suited.
+
+However, as development progressed, the whiteboard structure changed to the following:
+
+
+
+There are several changes, listed here with the reason for each:
+
+1. PostgreSQL was replaced with MongoDB, as I decided not to wait until I had learned
+SQL (see the SQL vs NoSQL section for more details).
+2. Message queues were replaced by direct access to the database. As I was creating
+the MVP, I just allowed each application to directly access/manipulate the database
+to save time. As I did so, I realised that this made more sense than keeping
+the whole list of subscriptions in memory for both the data analyser and collector.
+Instead, both periodically loop through the subscriptions and update them
+as appropriate.
+3. The data analyser no longer uses [yt-dlp](https://github.com/yt-dlp/yt-dlp), as
+it was overkill. Instead, web scraping is done directly, as it has a lower chance of
+failing ([yt-dlp](https://github.com/yt-dlp/yt-dlp) fetches many details about a
+video, increasing the points of failure and the chance of being blocked by YouTube).
+To top that off, YouTube blocks all calls to video URLs from non-residential IP
+addresses without login, which meant that the API had to be used in production.
+4. As [yt-dlp](https://github.com/yt-dlp/yt-dlp) was no longer used, it made no
+sense to keep the fetching of the feed URLs separate from the API server. The channel
+URLs were not blocked by YouTube, so only web scraping is used.
+
+## SQL vs NoSQL
+
+NoSQL has some advantages: it is easier to scale horizontally, and in our example
+you only need one collection (table) for the app (no `JOIN` statements). It also eliminates
+the need to create a schema.
+
+It also has a couple of disadvantages. Whenever I want to deal with a subscription
+object, I have to hold all the corresponding videos in memory, which can scale badly
+(some channels accumulate many thousands of videos over the years). I also lose the ability to
+perform operations on videos only. For example, if I want to count the number of
+videos a subscription holds, I have to count in the Python code, and if I want to
+update one video, I have to update the whole videos list.
+
+Furthermore, when I add the users collection (there was initially a plan to add users),
+the advantage of needing only one collection will be lost.
+
+Why did I choose NoSQL then? The deciding factor was familiarity. I had used
+MongoDB before and felt comfortable with its JSON-like syntax. I was set on learning
+SQL, but that was going to take some time and I did not want to wait until I learned
+it to start the project. Now that I am taking the [Databases](https://www.colorado.edu/program/data-science/databases)
+specialisation though, if I were to redo the project I would definitely use SQL.
diff --git a/images/Whiteboard1.png b/images/Whiteboard1.png
new file mode 100644
index 0000000..77cfd60
--- /dev/null
+++ b/images/Whiteboard1.png
Binary files differ
diff --git a/images/Whiteboard2.png b/images/Whiteboard2.png
new file mode 100644
index 0000000..23052ce
--- /dev/null
+++ b/images/Whiteboard2.png
Binary files differ
