📝 Updates README.md and adds images

Daniel Svitan 2025-02-22 17:47:15 +01:00
parent 2f3c547b55
commit cd3755f167
4 changed files with 89 additions and 17 deletions

2
.gitignore vendored

@@ -15,6 +15,8 @@ paper/
*.jasp
*.pth
*.png
!structure.png
!example-graph.png
*.drawio
*.tar.gz

104
README.md

@@ -1,22 +1,92 @@
# Hello!
Welcome! You either don't know what the hell this is or know exactly what the hell this is. Either way, here's a quick
explanation:
I decided to write a special paper at my high school - why? It's kind of a competition: a bunch of students submit their
SOC papers and the best one wins. They're actually graded by a whole committee, and it's a big deal and whatever.
Anyway, I'm here because it's fun, not because I want to win (obviously I want to win, but that's not why I decided to do the SOC).
I will eventually (probably Feb 2025) publish the paper, which I will also send to everyone who participated in the
survey (seriously, thanks to everyone who did). Here I want to keep all the scripts I used while writing the paper; it's mostly
stuff for cleaning, analyzing, and graphing the dataset.
I will also probably send out a link to this repo along with the paper, so if you got here from that link, welcome!
Thanks so much for participating in the survey. You can check out the scripts and the other markdowns (to be added)
where I explain what I did in a slightly more friendly way.
And if you're interested in the dataset, no, I'm not publishing it, sorry (mom said no).
Welcome to the technical repository for my 2024/2025 SOC paper.
This is where I keep all my scripts, scientific tests, algorithms, and graphing programs.
Let me walk you through how it works.
### Dataset
The documentation for the dataset structure can be found [here](https://github.com/Streamer272/soc-2024/blob/main/DATASET.md)
The documentation for the dataset structure can be found in [DATASET.md](DATASET.md);
this part is only interesting for the nerds.
### Distribution
This is probably the easiest part of this whole thing: it's basically just making charts and computing percentages.
Say you have 12 male and 15 female respondents - what is the distribution? It's quite simple:
`(number of elements in a group) / (number of elements in the dataset)`
So in this case, the distribution of male respondents would be `12 / (12 + 15) = ~44%`, and female `15 / (12 + 15) = ~56%`.
Now that we know this, we can make a pretty pie graph! The script that does all of this is [distribution.py](distribution.py).
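Here's a minimal sketch of the same idea (hypothetical data, not the actual [distribution.py](distribution.py)):
```python
from collections import Counter

import matplotlib.pyplot as plt

# hypothetical responses; the real script reads them from the cleaned dataset
responses = ["male"] * 12 + ["female"] * 15

counts = Counter(responses)
total = sum(counts.values())

# distribution = (number of elements in a group) / (number of elements in the dataset)
for group, count in counts.items():
    print(f"{group}: {count / total:.0%}")

# the same counts make a pretty pie graph
plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%.0f%%")
plt.show()
```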
### Analysis and scientific tests
This is where stuff gets interesting. The script that does all the heavy lifting is [analyze.py](analyze.py);
on top of it sit the specialized analysis scripts, like [analyze_sex.py](analyze_sex.py)
(which, surprisingly, only analyzes sex).
[analyze_sex.py](analyze_sex.py) only picks its data out of the dataset and passes it down to [analyze.py](analyze.py) to do all the analyses, where the following things happen:
1. the received data is put into groups, each receiving an assigned letter (A, B, C, etc.); groups of insufficient size are removed
2. if fewer than 2 groups remain, the analysis aborts
3. the [Kruskal-Wallis test](https://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_test) is performed and the `F` (technically the Kruskal-Wallis `H` statistic) and `p` values are received
4. if `p` is greater than 0.05, the difference between the groups is not statistically significant and the analysis aborts
5. the post-hoc [Dunn test](https://www.statology.org/dunns-test/) is performed and its `p` values are saved
6. a result table is created with a row for each pair of groups: their [rank-biserial correlation](https://www.statisticshowto.com/rank-biserial-correlation/), difference in medians, difference in means, and post-hoc `p` value
Problem solved!
If the difference is statistically insignificant,
or we don't have sufficient data to perform a statistical test, the analysis aborts;
otherwise, we get our `F` value, `p` value, and the result table, which could look something like this
(a sketch of the whole pipeline follows the table):
| Group 1 | Group 2 | Effect size | Difference of means | Difference of medians | Post-hoc p-value |
|---------|---------|-------------|---------------------|-----------------------|------------------|
| A | B | 0.0440 | 0.4198 | 0.0000 | 0.0497 |
| A | C | 0.0399 | 0.2723 | 0.0000 | 0.5239 |
| B | C | -0.0084 | -0.1475 | 0.0000 | 0.3706 |
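For the curious, a hedged sketch of that pipeline, assuming [scipy](https://scipy.org/) and the scikit-posthocs package (the actual logic lives in [analyze.py](analyze.py)):
```python
from scipy.stats import kruskal, mannwhitneyu
import scikit_posthocs as sp

# hypothetical groups of grades; analyze.py builds these from the dataset
groups = {"A": [1.0, 1.5, 2.0, 1.2], "B": [2.0, 2.5, 1.8, 2.2], "C": [1.9, 2.1, 2.4, 1.7]}

# step 2: abort if fewer than 2 groups remain
if len(groups) < 2:
    raise SystemExit("not enough groups")

# step 3: Kruskal-Wallis test, yielding the test statistic and p value
f, p = kruskal(*groups.values())

# step 4: abort if the difference is not statistically significant
if p > 0.05:
    raise SystemExit("difference is not statistically significant")

# step 5: post-hoc Dunn test, one p value per pair of groups
dunn_p = sp.posthoc_dunn(list(groups.values()))

# step 6 (shown for one pair): rank-biserial correlation from the Mann-Whitney U
u, _ = mannwhitneyu(groups["A"], groups["B"])
effect_size = 1 - 2 * u / (len(groups["A"]) * len(groups["B"]))
```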
### Graphing
Once the analysis is complete, successful or not, we can graph the data. We mostly use violin plots,
which are quite easy to understand and interpret. It goes like this:
1. the window is split into four subplots: top left for the average grade, top right for the math grade, bottom left for the Slovak grade, and bottom right for the English grade
2. all groups get added to each subplot as violin plots, so for sex, each subplot would contain a violin plot for males and a violin plot for females
3. the `F` and `p` values get added to the top left corner of each subplot
4. the legend gets added to the top right corner of each subplot and the axes are labeled
5. each violin plot contains five pieces of valuable information:
   1. the shaded background that shows the distribution of the data
   2. the gray line that represents the data between the first and third quartiles
   3. the red mean line with the mean value on the left
   4. the green median line with the median value on the right
   5. the minimum and maximum bounds
6. labels get added to each violin plot
Quite complicated, right? A ton of data packed into one small image, which could look something like this:
![example graph](example-graph.png)
It can be overwhelming to look at at first, but once you understand what's going on, it's quite intuitive. Anyway,
the function that does all this is also saved in [analyze.py](analyze.py); a rough sketch of the layout is below.
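Here's a rough matplotlib sketch of that layout (made-up data and labels, not the actual function from [analyze.py](analyze.py)):
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# hypothetical grades per group; the real data comes from the dataset
groups = {"males": rng.normal(2.0, 0.5, 50), "females": rng.normal(2.4, 0.6, 50)}
subjects = ["average grade", "math grade", "slovak grade", "english grade"]

# the window is split into four subplots
fig, axes = plt.subplots(2, 2)
for ax, subject in zip(axes.flat, subjects):
    # every group gets a violin plot in every subplot
    ax.violinplot(list(groups.values()), showmeans=True, showmedians=True)
    ax.set_title(subject)
    ax.set_xticks(range(1, len(groups) + 1), labels=list(groups.keys()))
    # the F and p values go in the top left corner (placeholder values here)
    ax.text(0.02, 0.95, "F=4.2 p=0.04", transform=ax.transAxes, va="top")
plt.tight_layout()
plt.show()
```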
### Neural network
Ah, AI stuff!
Well, it didn't work in the end because of the abysmally small amount of data, but the structure and
the training process are still here and can be looked at.
The script that trains the neural network is [train_nn.py](train_nn.py) (yes, I am very creative, I am aware). It uses
the [pytorch](https://pytorch.org/) library to do all the math stuff that goes on behind the scenes, but the important
part is the structure of the neural network, right here:
![structure of a neural network](structure.png)
Of course, we have to use the `.npy` file format to load the data into our program, so how do we convert the `.csv`
data provided by Google Forms into a `.npy`?
The answer lies in [clean.py](clean.py), but I'm not going to go
into how it all works; it's just cleaning the data.
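If you're curious anyway, the core of such a conversion looks something like this (a sketch with an assumed file name and column handling, not the actual [clean.py](clean.py)):
```python
import numpy as np
import pandas as pd

# load the raw Google Forms export (hypothetical file name)
df = pd.read_csv("responses.csv")

# the real cleaning also encodes answers as numbers and drops invalid rows;
# here we just drop missing values and keep the numeric columns
df = df.dropna().select_dtypes("number")

# save the numeric matrix as .npy so the training script can np.load() it
np.save("dataset.npy", df.to_numpy(dtype=np.float32))
```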
The whole training thing is pretty complicated, so if you don't know anything about neural networks, just forget about
it and attribute it to magic, but if you do, read through [train_nn.py](train_nn.py);
it's pretty clean and readable code. A generic sketch of what such a training loop looks like is below.
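For reference, a typical PyTorch training loop looks roughly like this (the layer sizes, loss, and data layout are assumptions, not the actual structure from [train_nn.py](train_nn.py) or [structure.png](structure.png)):
```python
import numpy as np
import torch
from torch import nn

# load the cleaned dataset; assumed layout: all columns except the last are inputs
data = torch.from_numpy(np.load("dataset.npy"))
x, y = data[:, :-1], data[:, -1:]

# a small fully connected network with assumed layer sizes
model = nn.Sequential(
    nn.Linear(x.shape[1], 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # backpropagation
    optimizer.step()             # weight update
```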

BIN
example-graph.png Normal file

Binary file not shown.

Size: 128 KiB

BIN
structure.png Normal file

Binary file not shown.

Size: 37 KiB