📝 Prettifies README

2025-05-21 21:17:52 +02:00
parent ddcaba6d92
commit 02d64fb6b8
1 changed files with 3 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -32,7 +32,7 @@ PO - 103
 KE - 113
 ```

-We also needed to adjust for population, since some counties have more inhabitants than others and are therefore expected to win more. The population data have been extracted from the following source: https://sk.wikipedia.org/wiki/Zoznam_krajov_na_Slovensku (accessed in May 2025; it holds the statistical data as of 31.12.2024).
+We also needed to adjust for population, since some counties have more inhabitants than others and are therefore expected to win more. The population data have been extracted from [this source](https://sk.wikipedia.org/wiki/Zoznam_krajov_na_Slovensku) (accessed in May 2025; it holds the statistical data as of 31.12.2024).

 The unit we used is rather cursed, but understandable once you realize that pure wins per capita would yield very small numbers, and therefore we had to use micro-wins per capita - we just multiplied everything by one million. This is the adjusted (and rounded for the purposes of this README) data:

@@ -47,7 +47,7 @@ PO - 127
 KE - 145
 ```

-It is evident that Bratislava county has by far the fewest micro-wins per capita, which already hints at a possible bias. The expected (average) value for each county was `142` micro-wins per capita.
+It is evident that Bratislava county has by far the fewest micro-wins per capita, which already hints at a possible bias. The expected (average) value for each county was 142 micro-wins per capita.

 We used a chi-square test to compare the observed and expected values and received the following results:

@@ -83,7 +83,7 @@ We have demonstrated that bias is very much present, especially against the Brat

 ## Data & Dataset

-The raw documents from the Institute are all saved in [data.zip](data.zip) in the following format: `rYEAR-CAT.EXT` (e.g. `r2023-06.pdf` for the 6th category of 2023). These aren't of much use, since they're very inconsistent and downright terrible to work with; I don't recommend handling them unless you know what you're doing.
+The raw documents from the Institute are all saved in [data.zip](data.zip) in the following format: `rYEAR-CAT.EXT` (e.g. `r2023-06.pdf` for the 6th category of 2023). These aren't of much use, since they're very inconsistent and downright terrible to work with; I don't recommend handling them unless you know what you're doing. They have been scraped from the [official website of the Institute](https://siov.sk/sutaze/stredoskolska-odborna-cinnost).

 The labeled dataset can be found in [dataset.txt](dataset.txt), where each line follows the format `YEAR CAT WIN1,WIN2,WIN3,WIN4,WIN5` (e.g. `2023 06 ZA,NR,PO,BB,KE`). This is what was actually used for the analysis, since it can easily be parsed. You can find 153 entries there (17 categories * 9 years = 153) and a total of 765 samples (county placements, 153 entries * 5 placements = 765).
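
The line format described above is simple enough to parse in a few lines. The following is only a minimal sketch, not the code used for the analysis: the function names `parse_line`, `count_placements` and `micro_wins_per_capita` are hypothetical, and it assumes every non-empty line of `dataset.txt` follows the `YEAR CAT WIN1,WIN2,WIN3,WIN4,WIN5` format exactly.

```python
from collections import Counter


def parse_line(line):
    """Split one 'YEAR CAT WIN1,...,WIN5' line into (year, category, winners)."""
    year, category, winners = line.split()
    # The category is kept as a string so the leading zero in e.g. '06' survives.
    return int(year), category, winners.split(",")


def count_placements(lines):
    """Count how often each county code appears among the listed placements."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines (assumption: the file may contain some)
        _, _, winners = parse_line(line)
        counts.update(winners)
    return counts


def micro_wins_per_capita(wins, population):
    """Wins per capita multiplied by one million, as described in the README."""
    return wins * 1_000_000 / population


# The example line quoted in the README:
year, category, winners = parse_line("2023 06 ZA,NR,PO,BB,KE")
placements = count_placements(["2023 06 ZA,NR,PO,BB,KE"])
```

To run it over the whole dataset, something like `count_placements(open("dataset.txt", encoding="utf-8"))` should work, since iterating over a file yields its lines; the population figures would still have to be supplied separately.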