In the fall of 2024, I decided to participate in the 2024/2025 round of SOC, organized by an institute run by the Ministry of Education of the Slovak Republic. My category was Psychology, Sociology, and Pedagogy (number 17); you can find my paper here. With this paper, I won first place in the county round and advanced to the state round together with the girl who won 2nd place.

I had been made aware of the bias in the state round, but I wanted to see for myself how it would really turn out. My colleague from Bratislava and I didn't win anything despite having good papers, and the best paper by far only won 4th place (shoutout to Sofia Petrovičová).

I was quite outraged by the results, so I decided to see whether there is any statistical evidence to back up the claims of bias.

About the SOC

The SOC is an annual competition; the name stands for "Stredoškolská Odborná Činnosť" (translates as High School Research Paper). Students submit their papers to their school and then go on to the county round. In this round, only 3 places are awarded (1st, 2nd, and 3rd), and only 1st and 2nd go on to the state round.

This competition is split into separate categories, like history, math, electronics, etc., and there are 17 categories overall. On competition day, students defend their papers in front of a commission of three teachers (I still don't understand why it's teachers instead of scientists or researchers) and are given the results the next day.

There are eight counties overall, each sending two representatives (1st and 2nd place) from each category to the state round. There, each category has 16 competitors (8 * 2), all of them having won their county round. The same procedure follows: defense, evaluation, and results the next day. However, this time there are five places instead of three.

SOC 2025 took place in Košice over the period of three days. We arrived on Monday, 2025-04-14. On Tuesday, 2025-04-15, we defended our papers, having only one break and afterwards being provided lunch. On Wednesday, 2025-04-16, we received the results and traveled back to our hometowns.

Statistics

This project is being written in 2025, and I have only collected data from 2017 to 2025 (inclusive), so if you're reading this a couple of years later, the results may be out of date.

We conducted two statistical tests:

to see whether some counties won more often than others
and, among the counties that won, whether some won better places

First test

The following data show how many times each county won (any of the five places):

BA - 75
TN - 80
TT - 76
NR - 94
BB - 83
ZA - 141
PO - 103
KE - 113

We also needed to adjust for population, since counties with more inhabitants would be expected to win more. The population data were extracted from this source (accessed in May 2025; it holds the statistical data as of 31.12.2024).

The unit we used is rather cursed, but understandable once you realize that pure wins per capita would yield very small numbers. We therefore used micro-wins per capita (that is, wins per million inhabitants): we just multiplied everything by one million. This is the adjusted (and rounded for the purposes of this README) data:

BA - 102
TN - 141
TT - 134
NR - 141
BB - 136
ZA - 206
PO - 127
KE - 145
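The adjustment itself is a one-line formula. Here is a minimal sketch, assuming a helper named micro_wins_per_capita (a name made up for this README). The population figures are not reproduced here, so the example uses a round placeholder value rather than a real county population:

```python
def micro_wins_per_capita(wins: int, population: int) -> float:
    """Return wins per one million inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing so the intermediate value stays a large integer.
    return wins * 1_000_000 / population


# Placeholder population, NOT a real county figure.
example_population = 500_000
print(micro_wins_per_capita(75, example_population))  # 150.0
```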

It is evident that Bratislava county has by far the fewest micro-wins per capita, already hinting at a possible bias. The expected (average) value for each county was 142 micro-wins per capita.

We used a chi-square test to compare the observed and expected values and obtained the following results:

Chi-Square = 42.1935
p = 0.00000048

Since p is far below the common threshold of 0.05, we reject the null hypothesis (that all counties win equally often per capita) in favor of the alternative hypothesis, which states that some counties win more often than others. Our claims of bias can therefore be supported by statistical evidence.
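For anyone who wants to retrace the first test, here is a rough sketch in plain Python. It assumes the test was a chi-square goodness-of-fit on the adjusted values with the average as the expected value for every county, which is what the description above suggests. The function names are made up for this README, and the input is the rounded table, not the unrounded values used in the real analysis. In practice, `scipy.stats.chisquare` does the same job in one call.

```python
import math

# Rounded micro-wins per capita from the table above.
observed = {"BA": 102, "TN": 141, "TT": 134, "NR": 141,
            "BB": 136, "ZA": 206, "PO": 127, "KE": 145}


def chi_square_uniform(values):
    """Goodness-of-fit statistic against a uniform expectation (the mean)."""
    expected = sum(values) / len(values)
    stat = sum((v - expected) ** 2 / expected for v in values)
    return stat, len(values) - 1  # statistic, degrees of freedom


def chi2_sf_odd_df(stat, df):
    """Upper-tail probability of the chi-square distribution for odd df.

    Starts from Q(1/2, x) = erfc(sqrt(x)) and applies the recurrence
    Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / gamma(a + 1) until a = df / 2.
    """
    if df % 2 != 1:
        raise ValueError("this sketch only handles odd degrees of freedom")
    x = stat / 2
    q = math.erfc(math.sqrt(x))
    a = 0.5
    while a < df / 2:
        q += x ** a * math.exp(-x) / math.gamma(a + 1)
        a += 1
    return q


stat, df = chi_square_uniform(list(observed.values()))
print(f"chi-square = {stat:.4f}, df = {df}, p = {chi2_sf_odd_df(stat, df):.8f}")
```

On the rounded figures the statistic comes out slightly higher than the reported 42.1935, presumably because of the rounding. Plugging the reported statistic into chi2_sf_odd_df with 7 degrees of freedom gives a p-value that matches the reported one.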

Second test

For the purposes of the second test, we had to reshape the dataset so that rows represent counties and columns their placements. An example (not real data) could look like this:

BA | 5 | 1 | 4 | ...
TN | 2 | 2 | 3 | ...
TT ...

Then, we performed a Kruskal-Wallis test (since the data didn't fit a normal distribution) to compare the counties with each other and see whether some counties won higher places than others (once they actually won). The results were as follows:

H = 6.2036
p = 0.51619147

Since p is well above 0.05, there is no statistical evidence to support the alternative hypothesis, so we fail to reject the null hypothesis: once counties win, there is no detectable difference in placements between them.
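To show what the second test computes, here is a small sketch of the Kruskal-Wallis H statistic in plain Python. It includes the tie correction, because placements only take the values 1 to 5 and therefore tie heavily. The function names and the toy groups are made up for illustration (not real data); `scipy.stats.kruskal` is the usual way to run this test and also returns the p-value.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected H statistic for a dict of group name -> list of values.

    Raises ZeroDivisionError if every value is identical.
    """
    pooled = [v for g in groups.values() for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups.values():
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - ties / (n ** 3 - n))


# Toy placements, NOT real data.
toy = {"BA": [5, 1, 4, 3], "TN": [2, 2, 3, 5], "TT": [1, 4, 4, 2]}
print(f"H = {kruskal_wallis_h(toy):.4f}")
```

Under the null hypothesis, H approximately follows a chi-square distribution with (number of counties - 1) degrees of freedom, which is 7 for the eight counties here.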

Conclusion

We have demonstrated that bias is very much present, especially against Bratislava county and in favor of Žilina county. More data would shed even more light on the situation, but unfortunately, the digitized records only start in 2017, and we can't really get our hands on earlier data. This topic should be reexamined every couple of years as new data become available.

Data & Dataset

The raw documents from the Institute are all saved in data.zip in the following format: rYEAR-CAT.EXT (e.g. r2023-06.pdf for the 6th category of 2023). These aren't of much use, since they're very inconsistent and downright terrible to work with. I don't recommend handling them unless you know what you're doing. They have been scraped from the official website of the Institute.

The labeled dataset can be found in dataset.txt, where each line follows the format YEAR CAT WIN1,WIN2,WIN3,WIN4,WIN5 (e.g. 2023 06 ZA,NR,PO,BB,KE). This is what was actually used for the analysis, since it can easily be parsed. You can find 153 entries there (17 categories * 9 years = 153) and a total of 765 samples (county-placement pairs).
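As an illustration of how easily that format parses, here is a minimal sketch that turns dataset lines into the two shapes used in the analysis: win counts per county (first test) and the list of placements per county (second test). The function names parse_line and tally are made up for this README and are not necessarily what the original analysis used.

```python
from collections import Counter, defaultdict


def parse_line(line):
    """Split 'YEAR CAT WIN1,...,WIN5' into (year, category, winners)."""
    year, category, winners = line.split()
    return int(year), category, winners.split(",")


def tally(lines):
    """Return (wins per county, placements per county) for the given lines."""
    wins = Counter()
    placements = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, _, winners = parse_line(line)
        # The position in the list is the placement: first entry is 1st place.
        for place, county in enumerate(winners, start=1):
            wins[county] += 1
            placements[county].append(place)
    return wins, placements


# The example line from above; for the real data, read dataset.txt instead:
# with open("dataset.txt", encoding="utf-8") as f:
#     wins, placements = tally(f)
wins, placements = tally(["2023 06 ZA,NR,PO,BB,KE"])
print(wins["ZA"], placements["KE"])
```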

Credits

I would like to thank my mentor, Mgr. Martina Šandor, who has been of great help with my SOC paper, and Adam Kováč, who manually labeled the data and created the dataset.

License

This project is licensed under the GNU GPLv3 license.