📝 Updates README.md and adds images
commit cd3755f167 (parent 2f3c547b55)

.gitignore (vendored, 2 changes)

```diff
@@ -15,6 +15,8 @@ paper/
 *.jasp
 *.pth
 *.png
+!structure.png
+!example-graph.png
 *.drawio
 
 *.tar.gz
```

README.md (104 changes)

```diff
@@ -1,22 +1,92 @@
-# Hello!
-
-Welcome, you either don't know what the hell this is or know exactly what the hell this is. Either way, here's a quick
-explanation:
-
-I decided to write a special paper at my high school - why? It's kind of a competition: a bunch of students submit their
-SOC paper and the best one wins. They're actually graded by a whole ahh committee and it's a big deal and whatever.
-Anyway, I'm here cuz it's fun, not cuz I wanna win (obviously I wanna win, but that's not why I decided to do the SOC)
-
-I will eventually (probably Feb 2025) publish the paper, which I will also send to everyone who participated in the
-survey (fr thanks to everyone who did). Here I wanna keep all the scripts I used while writing the paper; it's mostly
-stuff for cleaning, analyzing, and graphing the dataset
-
-I will also probably send out a link to this repo along with the paper, so if you got here from that link, welcome!
-Thanks so much for participating in the survey; you can check out the scripts and the other markdowns (to be added)
-where I explain what I did in a slightly more friendly way
-
-And if you're interested in the dataset, no, I'm not publishing it, sorry (mom said no)
```

Welcome to the technical repository for my 2024/2025 SOC paper. This is where I keep all my scripts, scientific tests,
algorithms, and graphing programs. Let me walk you through how it works.

### Dataset

The documentation for the dataset structure can be found in [DATASET.md](DATASET.md). This part is only interesting
for the nerds.

### Distribution

This is probably the easiest part of the whole thing; it's basically just making charts and computing percentages.
Say you have 12 male and 15 female respondents: what is the distribution? It's quite simple:

`(number of elements in a group) / (number of elements in the dataset)`

So in this case, the distribution of male respondents would be `12 / (12 + 15) = ~44%`, and of female respondents
`15 / (12 + 15) = ~56%`. Now that we know this, we can make a pretty pie graph! The script that does all of this is
[distribution.py](distribution.py).
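
For the curious, the computation is only a few lines of pandas and matplotlib. Here's a minimal sketch (this is not
the actual contents of [distribution.py](distribution.py); the file and column names are made up for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# load the (private) dataset; the file name is just an example
df = pd.read_csv("dataset.csv")

# (number of elements in a group) / (number of elements in the dataset)
distribution = df["sex"].value_counts(normalize=True)
print(distribution)  # e.g. male ~0.44, female ~0.56

# and the pretty pie graph
distribution.plot.pie(autopct="%.0f%%")
plt.show()
```
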
### Analysis and scientific tests

This is where stuff gets interesting. The script that does all the heavy lifting is [analyze.py](analyze.py); on top
of it, you have the specialized analysis scripts, like [analyze_sex.py](analyze_sex.py) (which, surprisingly, only
analyzes sex).

[analyze_sex.py](analyze_sex.py) only picks its data out of the dataset and passes it down to [analyze.py](analyze.py),
which does all the analyses. The following things happen there:

1. the received data is split into groups, each assigned a letter (A, B, C, etc.); groups of insufficient size are removed
2. if there are fewer than 2 groups, the analysis aborts
3. a [Kruskal-Wallis test](https://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_test) is performed, yielding the `F` and `p` values
4. if `p` is greater than 0.05, the difference between the groups is not statistically significant and the analysis aborts
5. a post-hoc [Dunn test](https://www.statology.org/dunns-test/) is performed and its `p` values are saved
6. a result table is created: for every pair of groups, it lists their [rank-biserial correlation](https://www.statisticshowto.com/rank-biserial-correlation/), difference in medians, difference in means, and post-hoc `p` value

Problem solved! If the difference is statistically insignificant, or we don't have enough data to perform a statistical
test, the analysis aborts; otherwise, we get our `F` value, `p` value, and the result table, which could look something
like this (a code sketch of the whole pipeline follows the table):

| Group 1 | Group 2 | Effect size | Difference in means | Difference in medians | Post-hoc p-value |
|---------|---------|-------------|---------------------|-----------------------|------------------|
| A       | B       | 0.0440      | 0.4198              | 0.0000                | 0.0497           |
| A       | C       | 0.0399      | 0.2723              | 0.0000                | 0.5239           |
| B       | C       | -0.0084     | -0.1475             | 0.0000                | 0.3706           |
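
For a rough idea of what that pipeline looks like in code, here is a minimal sketch built on [scipy](https://scipy.org/)
and the scikit-posthocs package. The group-size threshold and data layout are made up, and this is not the actual
[analyze.py](analyze.py):

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal, mannwhitneyu
import scikit_posthocs as sp

MIN_GROUP_SIZE = 5  # assumed threshold; the real one is in analyze.py

def analyze(groups: dict[str, np.ndarray]):
    # 1. drop groups of insufficient size
    groups = {k: v for k, v in groups.items() if len(v) >= MIN_GROUP_SIZE}
    # 2. abort if there are fewer than 2 groups
    if len(groups) < 2:
        return None
    # 3. Kruskal-Wallis test yields the F and p values
    f, p = kruskal(*groups.values())
    # 4. abort if the difference is not statistically significant
    if p > 0.05:
        return None
    # 5. post-hoc Dunn test, one p value for every pair of groups
    flat = pd.DataFrame(
        [(k, x) for k, values in groups.items() for x in values],
        columns=["group", "value"],
    )
    dunn = sp.posthoc_dunn(flat, val_col="value", group_col="group")
    # 6. result table: one row per pair of groups
    rows = []
    names = list(groups)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            u, _ = mannwhitneyu(groups[a], groups[b])
            # rank-biserial correlation from the Mann-Whitney U statistic
            rbc = 1 - 2 * u / (len(groups[a]) * len(groups[b]))
            rows.append({
                "Group 1": a,
                "Group 2": b,
                "Effect size": rbc,
                "Difference in means": np.mean(groups[a]) - np.mean(groups[b]),
                "Difference in medians": np.median(groups[a]) - np.median(groups[b]),
                "Post-hoc p-value": dunn.loc[a, b],
            })
    return f, p, pd.DataFrame(rows)
```
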
### Graphing

Once the analysis is complete, successful or not, we can graph the data. We mostly use violin plots, which are quite
easy to understand and interpret. It goes like this:

1. the window is split into four subplots: top left for the average grade, top right for the math grade, bottom left for the Slovak grade, and bottom right for the English grade
2. all groups get added to each subplot as violin plots, so for sex, each subplot would contain one violin plot for males and one for females
3. the `F` and `p` values get added to the top left corner of each subplot
4. the legend gets added to the top right corner of each subplot and the axes are marked
5. each violin plot carries five pieces of valuable information:
    1. the shaded background, which shows the distribution of the data
    2. the gray line, which represents the data between the first and third quartiles
    3. the red mean line, with the mean value on the left
    4. the green median line, with the median value on the right
    5. the minimum and maximum bounds
6. labels get added to each violin plot

Quite complicated, right? A ton of data packed into one small image, which could look something like this:

![https://github.com/Streamer272/soc-2024/blob/main/example-graph.png](example-graph.png)

It can be overwhelming to look at at first, but once you understand what's going on, it's quite intuitive. Anyway,
the function that does all this is also saved in [analyze.py](analyze.py).
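
To give a taste of how such a figure is put together, here is a heavily simplified matplotlib sketch with a single
subplot and made-up data; the real function draws all four subplots plus the extra mean/median lines and labels:

```python
import matplotlib.pyplot as plt
import numpy as np

# made-up example data: one array of grades per group
rng = np.random.default_rng(0)
groups = {"A": rng.normal(2.0, 0.8, 40), "B": rng.normal(2.5, 0.6, 35)}

fig, ax = plt.subplots()
# the shaded violins with mean and median lines
ax.violinplot(list(groups.values()), showmeans=True, showmedians=True)

# label each violin with its group and mark the axes
ax.set_xticks(range(1, len(groups) + 1), labels=list(groups))
ax.set_ylabel("average grade")

# the F and p values (placeholder numbers here) go into the top left corner
ax.text(0.02, 0.95, "F = 4.20, p = 0.04", transform=ax.transAxes, va="top")

plt.show()
```
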
### Neural network

Ah! AI stuff! Well, it didn't work out in the end because of the abysmally small amount of data, but the structure and
the training process are still here and can be looked at.

The script that trains the neural network is [train_nn.py](train_nn.py) (yes, I am very creative, I am aware). It uses
the [pytorch](https://pytorch.org/) library to do all the math stuff that goes on behind the scenes, but the important
part is the structure of the neural network, right here:



Of course, we have to use the `.npy` file format to load the data into our program, so how do we convert the `.csv`
data provided by Google Forms into a `.npy`? The answer lies in [clean.py](clean.py), but I'm not going to go into how
it all works; it's just cleaning the data.
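
The gist of such a conversion fits in a few lines, though. A minimal sketch, with made-up file and column names (the
real logic is in [clean.py](clean.py)):

```python
import numpy as np
import pandas as pd

# the raw Google Forms export; file and column names are just examples
df = pd.read_csv("responses.csv")

# map categorical answers to numbers and drop incomplete rows
df["sex"] = df["sex"].map({"Male": 0, "Female": 1})
df = df.dropna()

# save as .npy so train_nn.py can load it with np.load
np.save("dataset.npy", df.to_numpy(dtype=np.float32))
```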

The whole training thing is pretty complicated, so if you don't know anything about neural networks, just forget about
it and attribute it to magic. But if you do, read through [train_nn.py](train_nn.py); it's pretty clean and readable
code.
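
For the people in the "attribute it to magic" camp who still want the flavor, here is a minimal sketch of a pytorch
training loop for a small fully-connected network. The layer sizes, hyperparameters, and data layout are made up; the
real structure is the one pictured above and implemented in [train_nn.py](train_nn.py):

```python
import numpy as np
import torch
from torch import nn

# load the cleaned dataset; here the last column is assumed to be the target
data = torch.from_numpy(np.load("dataset.npy")).float()
x, y = data[:, :-1], data[:, -1:]

# a small fully-connected network; the sizes are illustrative only
model = nn.Sequential(
    nn.Linear(x.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # forward pass and loss
    loss.backward()              # backpropagation
    optimizer.step()             # weight update
    if epoch % 50 == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```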

example-graph.png (new binary file, 128 KiB)
structure.png (new binary file, 37 KiB)