# YouTube Subscriber

This application runs continuously, optionally on a server, watching YouTube feeds
for new videos. This can be used to eliminate the need to subscribe to the channels
themselves or even create a Google account. It also allows you to follow YouTube
playlists.

## Technology Stack

### Back-end

- Language: Python.
- Web framework: Flask.
- [venv](https://docs.python.org/3/library/venv.html) for the virtual environment.
- Type checking using [mypy](https://www.mypy-lang.org/).
- Testing using [unittest](https://docs.python.org/3/library/unittest.html).
- Production on a private VPS running Arch Linux.
- Production deployment using [gunicorn](https://flask.palletsprojects.com/en/stable/deploying/gunicorn/)
and [nginx](http://nginx.org/).

### Front-end

- Language: JavaScript.
- Built using [Vite](https://vite.dev/).
- Hosted on [GitHub Pages](https://pages.github.com/).

## Whiteboard

The following is the initial version of the whiteboard:

![Whiteboard 1](images/Whiteboard1.png)

The data collector fetches the RSS feed of YouTube Channels/Playlists for new videos.
The original design had it storing the list of subscriptions to fetch in memory,
not updating it by re-querying the database but instead by getting instructions
from the data analyser. The only interactions with the database were getting the
subscriptions list at startup and saving new videos after each fetch.

The data analyser is the one that actually controls the subscription list by adding
and deleting subscriptions. It too keeps the list of subscriptions in memory, and
periodically fetches additional video details (duration) using [yt-dlp](https://github.com/yt-dlp/yt-dlp),
writing them to the database.

Finally we have the Flask app, which serves the data using an API. It can also modify
the subscriptions list, though not by itself: it has to message the data analyser
through the message queue. This is because scraping YouTube might be needed to convert
channel URLs to feed links, a task for which the data analyser is better suited.

However, as the development progressed, the whiteboard structure changed to the following:

![Whiteboard 2](images/Whiteboard2.png)

There are many changes, which I will list along with the reasons:

1. PostgreSQL was replaced with MongoDB, as I decided not to delay the project until
I had learned SQL (see the SQL vs NoSQL section for more details).
2. Message queues were replaced by direct access to the database. While creating
the MVP, I allowed each application to access and manipulate the database directly
to save time. As I did so, I realised that this made more sense than keeping
the whole list of subscriptions in memory for both the data analyser and collector.
Instead, both periodically loop through the subscriptions and update some
as appropriate.
3. The data analyser no longer uses [yt-dlp](https://github.com/yt-dlp/yt-dlp) as
it was overkill. Instead, web scraping is done directly, as it has a lower chance of
failing ([yt-dlp](https://github.com/yt-dlp/yt-dlp) fetches many details about a
video, increasing the points of failure and the chance of being blocked by YouTube).
On top of that, YouTube blocks all requests to video URLs from non-residential IP
addresses without login, which meant that the YouTube API had to be used in production.
4. As [yt-dlp](https://github.com/yt-dlp/yt-dlp) was no longer used, it made no
sense to separate the fetching of the feed URLs from the API server. The channel
URLs were not blocked by YouTube, so only web scraping was used.

## SQL vs NoSQL

NoSQL has some advantages: it is easier to scale horizontally, and in our example
you only need one collection (table) for the app (no `JOIN` statements). It also eliminates
the need to create a schema.

It also has a couple of disadvantages. Whenever I want to deal with a subscription
object, I have to hold all the corresponding videos in memory, which can scale badly
(many channels have millions of videos over the years). I also lose the ability to
perform operations on videos only. For example, if I want to count the number of
videos a subscription holds, I have to count in the Python code, and if I want to
update one video, I have to update the whole videos list.
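
To make the trade-off concrete, here is a minimal sketch of a single-collection document with its videos embedded. The field names (`title`, `videos`, `duration`) and both helper functions are assumptions made for illustration, not the project's actual schema or code.

```python
from typing import Any

# Hypothetical shape of one subscription document with embedded videos.
subscription: dict[str, Any] = {
    "_id": "UC_example",
    "title": "Example channel",
    "videos": [
        {"id": "vid1", "title": "First", "duration": 120},
        {"id": "vid2", "title": "Second", "duration": None},
    ],
}


def count_videos(sub: dict[str, Any]) -> int:
    """Count videos in Python, after the whole document has been loaded."""
    return len(sub["videos"])


def set_duration(
    sub: dict[str, Any], video_id: str, duration: int
) -> list[dict[str, Any]]:
    """Return a new videos list with one duration changed.

    In this approach the entire list is written back to the database,
    even though only a single video changed.
    """
    return [
        {**video, "duration": duration} if video["id"] == video_id else video
        for video in sub["videos"]
    ]
```

Both helpers need the full `videos` list in memory, which is the scaling concern described above.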

Furthermore, if a users collection were added (there was initially a plan to add users),
the advantage of needing only one collection would be lost.

Why did I choose NoSQL then? The deciding factor was familiarity. I had already used
MongoDB before and felt comfortable with its JSON-like syntax. I was set on learning
SQL, but that was going to take some time and I did not want to wait until I learned
it to start the project. Now that I am taking the [Databases](https://www.colorado.edu/program/data-science/databases)
specialisation though, if I were to redo the project I would definitely use SQL.

## Requirements

### User requirements

1. The user should be able to add subscriptions using their YouTube URLs (channels
or playlists).
2. The user should be able to set and modify the duration between fetches for each
subscription (some may upload more frequently than others).
3. The user should be able to delete subscriptions.
4. The user should be able to see the videos of each subscription along with the
duration of each video.
5. New videos (added since last time viewed) should be indicated to the user.

### System requirements

1. The application should be able to verify valid YouTube URLs.
2. The application should be able to identify valid subscriptions.
3. The application should be able to convert channel/playlist URLs to feed links.
4. The Flask application should have CRUD APIs set up.
5. The data collector should be able to fetch each subscription with the appropriate
interval between fetches (the set duration plus at most 60 seconds).
6. The data collector should be able to identify new/updated videos.
7. The data analyser should update all non-analysed videos in each iteration.
8. The data analyser should get the correct duration (as long as the video is not
private for example).
9. The database (MongoDB in this example) should store the data persistently.
10. The React.js application should be able to correctly communicate with the appropriate
APIs from the Flask application.

All of these are easy to test. That being said, some of them, like 5., would take
a considerable time to test, even for an integration test. For the purpose of this
project, only tests taking under a minute were allowed into the integration test.

## A breakdown of the project directory

Let's start with the `components/` directory:

- `videos.py` defines the video datatype and a function which generates an object
from a feed item.
- `database.py` prepares the appropriate database collection for the `Subscription`
class.
- `subscriptions/` contains `typing.py`, which defines the dictionary form of
the subscriptions (for [mypy](https://www.mypy-lang.org/) typing of the database
collection), while `main.py` contains the actual `Subscription` class, which has
the appropriate functions for fetching the RSS feed and database CRUD operations.
- `extractor/` contains functions designed to extract information about a YouTube
object. It is sometimes done from the URL directly (such as obtaining a Playlist's
feed link), or by web scraping (such as that of a Channel), or even by YouTube's
API (such as the duration of a video). Most of these functions accept HTML input
to facilitate mocking during tests.
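
As an illustration of extracting information from the URL directly, the sketch below builds a playlist's feed link from its `list` query parameter. The function name `playlist_feed_link` and the set of accepted hosts are assumptions and may differ from what `extractor/` actually does.

```python
from urllib.parse import parse_qs, urlparse

FEED_BASE = "https://www.youtube.com/feeds/videos.xml"
# Assumed set of accepted hosts; the real validation may be stricter or looser.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def playlist_feed_link(url: str) -> str:
    """Build the RSS feed link for a playlist URL (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or parsed.hostname not in YOUTUBE_HOSTS:
        raise ValueError(f"Not a YouTube URL: {url}")
    playlist_ids = parse_qs(parsed.query).get("list")
    if not playlist_ids:
        raise ValueError(f"No playlist ID found in: {url}")
    return f"{FEED_BASE}?playlist_id={playlist_ids[0]}"
```

No network request is needed here, which is why playlists are simpler to handle than channels.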

Note that you do not need to set up the YouTube API at all for any of the functions
unless you are running it on a non-residential server. The YouTube API will only be
used if a key is provided as an environment variable.
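
When the API route is taken, the YouTube Data API reports video durations as ISO 8601 strings such as `PT15M33S`. The following is a minimal sketch of converting such a value to seconds; the function name is hypothetical, and the project may parse durations differently.

```python
import re

# Supports days, hours, minutes and seconds; other ISO 8601 fields are rejected.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_iso8601_duration(value: str) -> int:
    """Convert an ISO 8601 duration such as 'PT1H2M3S' to whole seconds."""
    match = _DURATION_RE.match(value)
    if match is None or value in ("P", "PT"):
        raise ValueError(f"Unsupported duration: {value!r}")
    parts = {name: int(number or 0) for name, number in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
```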

### Data collector

It is stored in the `data_collector/` directory.

`utils.py` defines the core function that actually loops through the collection,
identifies which subscriptions are due to be fetched, and calls the `.fetch()` function
on them. `__main__.py` imports that function and runs it periodically (every minute).

This design was chosen to make integration testing easier. Instead of having to call
a separate process, one can just import the `collect_data()` function and call it
to test the outcome. In production, the process is run by calling `python -m data_collector`,
which automatically runs the `__main__.py` file.
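
The idea of one importable function that handles a single pass can be sketched as follows. This is a simplified stand-in, not the real `collect_data()`: the name `collect_due_subscriptions`, the `last_fetch` and `interval_seconds` fields, and the injected `fetch` callable are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Iterable, Optional


def is_due(sub: dict[str, Any], now: datetime) -> bool:
    """A subscription is due if it was never fetched or its interval elapsed."""
    last_fetch = sub.get("last_fetch")
    if last_fetch is None:
        return True
    return now - last_fetch >= timedelta(seconds=sub["interval_seconds"])


def collect_due_subscriptions(
    subscriptions: Iterable[dict[str, Any]],
    fetch: Callable[[dict[str, Any]], None],
    now: Optional[datetime] = None,
) -> int:
    """Fetch every due subscription once and return how many were fetched."""
    current = now or datetime.now(timezone.utc)
    fetched = 0
    for sub in subscriptions:
        if is_due(sub, current):
            fetch(sub)
            sub["last_fetch"] = current
            fetched += 1
    return fetched
```

Because a pass like this runs once a minute, a subscription is fetched at most about 60 seconds after it becomes due. A test can call the function directly with a fixed `now` instead of starting a separate process.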

Most of the heavy lifting is done by the `.fetch()` function from the `Subscription`
component, which is thoroughly tested in the `feed.py` unit test.
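
One part of that work is spotting which feed entries are new. The sketch below shows one way to do it with the standard library, assuming an Atom feed whose entries carry a `yt:videoId` element. The sample feed is fake data, and `new_video_ids` is a hypothetical helper rather than the project's `.fetch()` logic.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
}

# Fake, heavily trimmed feed used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:yt="http://www.youtube.com/xml/schemas/2015">
  <entry>
    <yt:videoId>vid1</yt:videoId>
    <title>Old video</title>
  </entry>
  <entry>
    <yt:videoId>vid2</yt:videoId>
    <title>New video</title>
  </entry>
</feed>
"""


def new_video_ids(feed_xml: str, known_ids: set[str]) -> list[str]:
    """Return the IDs in the feed that are not yet stored, in feed order."""
    root = ET.fromstring(feed_xml)
    found: list[str] = []
    for entry in root.findall("atom:entry", NAMESPACES):
        video_id = entry.findtext("yt:videoId", namespaces=NAMESPACES)
        if video_id and video_id not in known_ids:
            found.append(video_id)
    return found
```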

### Data Analyser

It is stored in the `data_analyser/` directory.

This is divided into `utils.py` and `__main__.py` for the same reason as the data
collector. The main difference is that the main function in `utils.py` is factored
into smaller functions. This makes it easy to write very specific unit tests.

### Flask application

It is stored in the `api/` directory.

While some helper functions are stored in `utils.py`, most of the functions are stored
in `__init__.py`, including Flask's `app` object. This allows it to be easily found
by WSGI servers (`api:app`), as `import api` will import `__init__.py` automatically.

I did not use the same structure as the data collector or analyser, as the Flask
endpoints are not as easy to cover with unit tests as the other two. Instead, the
main testing is done in integration tests using `from api import app`.

The Flask app mostly implements a REST API interface for the front-end.

### React.js app

It is stored in the `front-end/` directory.

It is a basic web application with form functionality for CRUD operations. The main
code can be found in the `src/` subdirectory. It communicates with the Flask application
using the REST API.

When a subscription is selected, its videos are displayed and `last_viewed` is set
to that point in time. New videos since the last view will have a special indicator
over them.

### Tests

All tests are in the `tests/` directory. They are divided into two types:

#### Unit tests

These are the Python scripts right under the `tests/` directory. Each file tests
a different function or area of the code. Unit tests are designed to test a specific
functionality, and are expected to be FIRST: Fast, Isolated, Repeatable, Self-validating,
and Timely. This means that **MOCKS** are heavily used, as external dependencies
would otherwise slow down the execution of the code and the tests would not be isolated,
as failure of the test could be caused by a problem with the external dependency
instead of the code itself. **FAKE DATA**, which is stored in the `data/` subdirectory,
is also needed to make the mocks realistic.

A few utilities to implement the mocks are stored in the `utils/` subdirectory. Specialised
packages were also installed, such as `mongomock`, which creates very realistic and
fast mocks of MongoDB databases.
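
As a rough sketch of this style of test, the example below uses `unittest.mock` and a fake HTML snippet so that no network call is made. The extractor, its `download` parameter, and the `"channelId"` pattern are hypothetical and only illustrate the approach.

```python
import re
import unittest
from typing import Callable, Optional
from unittest import mock


def download_page(url: str) -> str:
    """Placeholder for the real network call (hypothetical)."""
    raise RuntimeError("network access is not allowed in unit tests")


def extract_channel_id(
    url: str,
    html: Optional[str] = None,
    download: Callable[[str], str] = download_page,
) -> str:
    """Find a channel ID in the page, downloading it only if html is not given."""
    if html is None:
        html = download(url)
    match = re.search(r'"channelId":"([\w-]+)"', html)  # assumed page pattern
    if match is None:
        raise ValueError("channel ID not found")
    return match.group(1)


# Fake data standing in for a saved page from the data/ subdirectory.
FAKE_HTML = '<script>{"channelId":"UC_fake_123"}</script>'
FAKE_URL = "https://example.invalid/channel"


class ExtractChannelIdTest(unittest.TestCase):
    def test_with_html_argument(self) -> None:
        self.assertEqual(extract_channel_id(FAKE_URL, FAKE_HTML), "UC_fake_123")

    def test_with_mocked_download(self) -> None:
        fake_download = mock.Mock(return_value=FAKE_HTML)
        result = extract_channel_id(FAKE_URL, download=fake_download)
        self.assertEqual(result, "UC_fake_123")
        fake_download.assert_called_once_with(FAKE_URL)
```

Both tests stay fast and isolated because the only external dependency, the page download, is either bypassed or replaced by a mock.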

Unit tests for this project run in under 1 second.

#### Integration tests

These are stored in the `integration/` subdirectory, and they test how well different
parts of the application work with each other. Mocks and fake data are normally not
used here, as they would mask the behaviour of the very components whose integration
we want to test. A real database instance therefore has to be used (in an isolated
testing area). YouTube API calls are also used when testing on the production server,
while local testing uses web scraping of real URLs.

In addition, instead of calling internal functions, the main functions of the data
collector and analyser are called to simulate real operation. Flask also
provides a way to simulate real API calls using `app.test_client()`.

Integration tests for this project run in around half a minute, depending on the
machine and internet connection.

## Workflow and Deployment

### Local hooks

First, before the code leaves the machine, it passes through two testing phases:

1. Pre-commit. This runs the type checker and **only** the unit tests, because
integration tests take too long to run at each commit. Moreover, one commit might
change one side of an integration while the matching change on the other side is
scheduled for the next commit.
2. Pre-push. This runs the type checker and **all** tests, because pushed
code is expected to integrate properly and be ready for production.

Here is the code for the pre-commit hook:

```bash
#!/usr/bin/env sh

echo "Running pre-commit checks..."

# Ensure we're at the project root
git_root=$(git rev-parse --show-toplevel 2>/dev/null)
if [ $? -ne 0 ]; then
    echo "This script must be run within a Git repository"
    exit 1
fi
cd "$git_root" || exit 1

# Check if virtual environment exists
if [ ! -d "venv" ]; then
    echo "Error: Virtual environment 'venv' not found."
    echo "Please create it with: python -m venv venv"
    exit 1
fi

# Run mypy type checking
if ! venv/bin/mypy --explicit-package-bases --strict .; then
    exit 1
fi

# Run tests
if ! venv/bin/python -m unittest tests/*.py; then
    exit 1
fi

echo "End of pre-commit hook."
```

Here is the code for the pre-push hook:

```bash
#!/usr/bin/env bash

echo "Running pre-push checks..."

# Ensure we're at the project root
git_root=$(git rev-parse --show-toplevel 2>/dev/null)
if [ $? -ne 0 ]; then
    echo "This script must be run within a Git repository"
    exit 1
fi
cd "$git_root" || exit 1

# Check if virtual environment exists
if [ ! -d "venv" ]; then
    echo "Error: Virtual environment 'venv' not found."
    echo "Please create it with: python -m venv venv"
    exit 1
fi

# Run mypy type checking
if ! venv/bin/mypy --explicit-package-bases --strict .; then
    exit 1
fi

# Run tests
if ! YT_DB=testing venv/bin/python -m unittest tests{,/integration}/*.py; then
    exit 1
fi

echo "End of pre-push hook."
```

### After it reaches the VPS

Once the push is accepted, the server runs a post-receive hook. This hook checks
if the `deploy` branch has been updated, and if so it checks out the branch in the
deployment directory, installs any new Python packages into the virtual environment,
runs type checking, unit tests, and integration tests, and if everything passes it
deploys the new code.

First it reloads the docker containers. To make management easier, MongoDB and RabbitMQ
are run as docker containers, and these might get restarted by the hook if `docker-compose.yml`
was updated.

Then it interrupts the three Python services by sending a `USR2` signal. This makes the
data collector and data analyser processes quit and start over, while gunicorn starts
new subprocesses with the new code then removes the old ones to eliminate downtime
on the back-end API server.

You can see how the post-receive hook is implemented here:

```bash
#!/usr/bin/env bash

ROOT=/srv/csca5028
VENV="$ROOT/venv"
PIP="$VENV/bin/pip"
PYTHON="$VENV/bin/python"

github_url="git@github.com:Farzat07/youtube-subscriber.git"

[ -d "$ROOT" ] || mkdir -p "$ROOT"
[ -d "$VENV" ] || python -m venv "$VENV"

while read -r oldrev newrev refname
do
    BRANCH_NAME="${refname#refs/heads/}"
    printf "refname: %s branchname: %s\n" "$refname" "$BRANCH_NAME"
    if [ "$BRANCH_NAME" = deploy ]; then
        GIT_WORK_TREE="$ROOT" git checkout -f "$BRANCH_NAME" || exit
        "$PIP" install -r "$ROOT/requirements.txt"
        # Run type checking, then unit and integration tests; abort on failure.
        # Using && ensures nothing runs if changing directory fails.
        if ! (cd "$ROOT" && "$VENV"/bin/mypy --explicit-package-bases --strict .) ||
            ! (cd "$ROOT" && YT_DB=testing "$PYTHON" -m unittest "$ROOT"/tests{,/integration}/*.py)
            then
            exit 1
        fi
        # Reload services.
        (cd "$ROOT" && docker compose up -d --remove-orphans)
        pkill -USR2 -u gitolite -f "$PYTHON"
    fi
done

git push --mirror "$github_url"
```

As you can see, the repository is also mirrored to GitHub, where the front-end will be deployed.

Nginx forwards all traffic sent to the flask API server to the gunicorn processes.
The docker containers and 3 python processes are all kept running even after host
restarts using systemd services. The systemd services are also instructed to restart
the data collector and data analyser services if they were interrupted by the `USR2`
signal.

### GitHub CI/CD

While the above post-receive hook employs **Continuous Integration** by thoroughly
testing the code before deployment, and **Continuous Deployment** by deploying the
code into production on push, it is not necessarily what comes to mind when a CI/CD
pipeline is discussed.

Many components were missing, such as test environment isolation, and testing
and deployment were tightly coupled in the same script. Most importantly, **Continuous
Delivery** was not employed, as there was no build stage or output artifact.

This requirement is achieved by the GitHub workflow stored in `.github/workflows/cicd.yml`,
which builds the React application from JSX to browser-readable JS, exports
the output as a container (artifact), and then deploys the resulting website to
GitHub Pages.

### How to deploy yourself

First of all, make sure to set the environment variables, either directly or by
writing to a `.env` file at the root directory. The expected environment variables
are `MONGO_USER`, `MONGO_PASS`, `RABBIT_USER`, `RABBIT_PASS`, `VITE_API_BASE_URL`,
and `YOUTUBE_API_KEY`. The `VITE_API_BASE_URL` is the URL from which to access the
Flask API server, while `YOUTUBE_API_KEY` is only needed if running from a non-residential
server.
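
A quick way to check the environment before starting the services is sketched below. Treating the first five variables as required is an assumption, `load_settings` is a hypothetical helper, and the sketch only reads variables that are already set (it does not parse a `.env` file).

```python
import os
from typing import Mapping, Optional

# Assumed split between required and optional variables.
REQUIRED = (
    "MONGO_USER",
    "MONGO_PASS",
    "RABBIT_USER",
    "RABBIT_PASS",
    "VITE_API_BASE_URL",
)
OPTIONAL = ("YOUTUBE_API_KEY",)


def load_settings(
    environ: Optional[Mapping[str, str]] = None,
) -> dict[str, Optional[str]]:
    """Collect the expected variables, failing early if a required one is missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings: dict[str, Optional[str]] = {name: env[name] for name in REQUIRED}
    for name in OPTIONAL:
        # An unset or empty optional variable is reported as None.
        settings[name] = env.get(name) or None
    return settings
```

With this sketch, a `None` value for `YOUTUBE_API_KEY` would mean that web scraping is used instead of the API.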

Next, make sure to have the docker containers up and running by executing `docker-compose
up` at the root directory.

To install the Python dependencies, run `pip install -r requirements.txt`.

The data collector and data analyser scripts are started by running `python -m data_collector`
and `python -m data_analyser` respectively.

For non-production, the API back-end can be started by running `flask --app api:app
run`. For production use, make sure to use gunicorn instead. The command I used is:
`gunicorn --workers 3 --bind unix:/srv/csca5028/flask.sock --umask 007
--access-logfile - --error-logfile - api:app`.

## What about Message Queues?

While I did make preparations to use them (I set up the docker container), I never
ended up implementing them in the code. This is because there was little need for the
processes to communicate with each other, and there was little time left to come
up with a use case. One possible use would be letting the user order the fetch
of a subscription "right now" from the front-end, which would require the Flask
application to communicate with the data collector, but unfortunately I did not have
time to develop that functionality.

Another possible use is scaling the data collector and data analyser.
Each would have a master process which coordinates the fetching, and child processes
would receive fetch orders through message queues. This would be overkill
for the current scale of the project, but as it expands it might become necessary.
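
The master/worker idea can be sketched in a single process, with the standard library's `queue.Queue` and threads standing in for a real message broker and separate worker processes. Everything here is hypothetical, including the `run_workers` name and the `fetch` callable.

```python
import queue
import threading
from typing import Callable, Optional


def run_workers(
    feed_links: list[str],
    fetch: Callable[[str], str],
    worker_count: int = 3,
) -> dict[str, str]:
    """Distribute fetch orders to workers through a queue and collect results."""
    orders: "queue.Queue[Optional[str]]" = queue.Queue()
    results: dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            link = orders.get()
            if link is None:  # sentinel: no more orders for this worker
                break
            outcome = fetch(link)
            with lock:
                results[link] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    # The coordinator only publishes orders; it never fetches anything itself.
    for link in feed_links:
        orders.put(link)
    for _ in threads:
        orders.put(None)
    for thread in threads:
        thread.join()
    return results
```

With a real broker, the queue would live outside the process, so workers could run on other machines and be added or removed without changing the coordinator.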