📝 Reworks README and adds statistical tests

2025-05-21 21:12:21 +02:00
parent 5f6656d710
commit ddcaba6d92
1 changed files with 81 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -4,9 +4,88 @@

 In the fall of 2024, I decided to participate in the 2024/2025 round of SOC, organized by an institute run by the Ministry of Education of the Slovak Republic. My category was Psychology, Sociology, and Pedagogy (number 17); you can find my paper [here](https://gitea.svitan.dev/Streamer272/soc-2024). With this paper, I won first place in the county round and continued on to the state round with a girl who won 2nd place.

-Even before I left, I was warned by my mentor that it's not a fair fight and that it's an open secret that the judges have a massive bias in favor of the students from their own county, especially those from the eastern counties. I didn't think much of it, until we got to the closing ceremony where the results were announced.
+I had been made aware of the bias in the state round, but I wanted to see for myself how it would really turn out. My colleague from Bratislava and I didn't win anything despite having good papers, and by far the best paper only won 4th place (shoutout to Sofia Petrovičová).

-The day before, we defended our papers, and I would say me and my collegue from Bratislava county did reasonably well. The results were shocking, to say the least, terrible papers won and that one paper from Nitra county, which was, by far, the best in our category, only won 4th place. So I decided to create a dataset of the results of the last couple of years and see if some statistics doesn't shine a light on this situation.
+I was quite outraged by the results, so I decided to see whether there is statistical evidence to back up the claims of bias.
+
+## Statistics
+
+This project was written in 2025 and I have only collected data from 2017 to 2025 (inclusive), so if you're reading this a couple of years later, it may no longer be up to date.
+
+We conducted two statistical tests:
+
+1. whether some counties won more often than others
+2. whether, among the counties that won, some won better places than others
+
+### First test
+
+The following data represents how many times a county won (any place):
+
+```
+BA - 75
+TN - 80
+TT - 76
+NR - 94
+BB - 83
+ZA - 141
+PO - 103
+KE - 113
+```
+
+We also needed to adjust for population, since some counties have more inhabitants than others, and therefore are expected to win more. The population data have been extracted from the following source: https://sk.wikipedia.org/wiki/Zoznam_krajov_na_Slovensku (this was accessed in May of 2025 and holds the statistical data from 31.12.2024).
+
+The unit we used is rather cursed, but understandable once you realize that pure wins per capita would yield very small numbers, and therefore we had to use micro-wins per capita - we just multiplied everything by one million. This is the adjusted (and rounded for the purposes of this README) data:
+
+```
+BA - 102
+TN - 141
+TT - 134
+NR - 141
+BB - 136
+ZA - 206
+PO - 127
+KE - 145
+```
+
+Bratislava county clearly has by far the fewest micro-wins per capita, which already hints at a possible bias. The expected (average) value for each county was `142` micro-wins per capita.
+
+We used a chi-square test to compare the observed and expected values and received the following results:
+
+```
+Chi-Square = 42.1935
+p = 0.00000048
+```
+
+Since `p` is far below the common threshold of 0.05, we reject the null hypothesis and accept the alternative hypothesis, which states that some counties win more often than others. Our claims of bias can therefore be supported by statistical evidence.
+
+### Second test
+
+For the purposes of the second test, we had to reshape the dataset so that rows represent counties and columns their placements. An example (not real data) could look like this:
+
+```
+BA | 5 | 1 | 4 | ...
+TN | 2 | 2 | 3 | ...
+TT ...
+```
+
+Then, we performed a Kruskal-Wallis test (since the data didn't fit a normal distribution) to compare the counties with each other and see whether some counties won higher places than others (once they actually won). The results were as follows:
+
+```
+H = 6.2036
+p = 0.51619147
+```
+
+Since `p` is well above the 0.05 threshold, we fail to reject the null hypothesis: there is no statistical evidence that, once counties win, their placements differ from county to county.
+
+### Conclusion
+
+We have demonstrated that bias is very much present, especially against Bratislava county and in favor of Žilina county. More data would shed even more light on the situation, but unfortunately, the digitized records only start in 2017, and we can't really get our hands on earlier data. This topic should be reexamined every couple of years once new data is available.
+
+## Data & Dataset
+
+The raw documents from the Institute are all saved in [data.zip](data.zip) in the following format: `rYEAR-CAT.EXT` (e.g. `r2023-06.pdf` for the 6th category of 2023). These aren't of much use, since they're very inconsistent and downright terrible to work with. I don't recommend handling them unless you know what you're doing.
+
+The labeled dataset can be found in [dataset.txt](dataset.txt), where each line follows the format `YEAR CAT WIN1,WIN2,WIN3,WIN4,WIN5` (e.g. `2023 06 ZA,NR,PO,BB,KE`). This is what was actually used for the analysis, since it can easily be parsed. You can find 153 entries there (17 categories * 9 years = 153) and a total of 765 samples (county-placement pairs).

 ## Credits