Exercise 1
The idea was to have you prepare a dataset yourself before running a PCA (principal component analysis). A direct PCA is not possible here: too many rows are incomplete, some columns are a priori of little relevance, and the data are time series (which we will reduce to a single value).
# Import the raw dataset
data_orig <- read.csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")
# Preview the dataset; calling head() and summary() is a good reflex :)
data <- data_orig
dim(data)
head(data)
- 344917
- 67
| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ⋯ | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | AFG | Asia | Afghanistan | 2020-01-03 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
| 2 | AFG | Asia | Afghanistan | 2020-01-04 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
| 3 | AFG | Asia | Afghanistan | 2020-01-05 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
| 4 | AFG | Asia | Afghanistan | 2020-01-06 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
| 5 | AFG | Asia | Afghanistan | 2020-01-07 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
| 6 | AFG | Asia | Afghanistan | 2020-01-08 | NA | 0 | 0 | NA | 0 | 0 | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
# For now we restrict ourselves to global statistics taken on a fixed day:
dmax <- '2021-12-31' # alternative: max(data$date)
data <- data[data$date == dmax,] # row filter, hence *before* the comma
dim(data)
summary(data)
- 254
- 67
iso_code continent location date
Length:254 Length:254 Length:254 Length:254
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
total_cases new_cases new_cases_smoothed total_deaths
Min. : 1 Min. : 0 Min. : 0.0 Min. : 1
1st Qu.: 18326 1st Qu.: 0 1st Qu.: 37.0 1st Qu.: 281
Median : 148839 Median : 295 Median : 296.8 Median : 2468
Mean : 5085552 Mean : 22919 Mean : 19680.6 Mean : 101373
3rd Qu.: 806066 3rd Qu.: 1768 3rd Qu.: 1848.0 3rd Qu.: 16624
Max. :285446097 Max. :1349829 Max. :1122946.0 Max. :5474098
NA's :19 NA's :8 NA's :8 NA's :29
new_deaths new_deaths_smoothed total_cases_per_million
Min. : 0.00 Min. : 0.000 Min. : 8.99
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 8437.41
Median : 0.00 Median : 1.214 Median : 57981.15
Mean : 101.60 Mean : 112.105 Mean : 69265.17
3rd Qu.: 9.75 3rd Qu.: 13.607 3rd Qu.:108410.14
Max. :5965.00 Max. :6481.714 Max. :289593.33
NA's :8 NA's :8 NA's :19
new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million
Min. : 0.00 Min. : 0.000 Min. : 1.086
1st Qu.: 0.00 1st Qu.: 7.484 1st Qu.: 139.656
Median : 48.22 Median : 83.628 Median : 720.977
Mean : 543.64 Mean : 370.260 Mean :1020.674
3rd Qu.: 358.59 3rd Qu.: 426.848 3rd Qu.:1658.001
Max. :6861.15 Max. :3543.851 Max. :5949.676
NA's :8 NA's :8 NA's :29
new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate
Min. : 0.0000 Min. : 0.0000 Min. :0.020
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.990
Median : 0.0000 Median : 0.1995 Median :1.250
Mean : 1.2384 Mean : 1.3266 Mean :1.307
3rd Qu.: 0.9377 3rd Qu.: 1.4287 3rd Qu.:1.570
Max. :29.6820 Max. :16.5200 Max. :4.220
NA's :8 NA's :8 NA's :69
icu_patients icu_patients_per_million hosp_patients
Min. : 19.0 Min. : 0.579 Min. : 67
1st Qu.: 75.5 1st Qu.:10.996 1st Qu.: 575
Median : 317.0 Median :17.341 Median : 1599
Mean : 1129.5 Mean :25.830 Mean : 6510
3rd Qu.: 774.2 3rd Qu.:37.632 3rd Qu.: 3284
Max. :18382.0 Max. :84.912 Max. :93776
NA's :220 NA's :220 NA's :220
hosp_patients_per_million weekly_icu_admissions
Min. : 50.39 Min. : 5.0
1st Qu.: 96.10 1st Qu.: 162.8
Median :157.06 Median : 523.0
Mean :177.82 Mean : 730.1
3rd Qu.:216.94 3rd Qu.:1130.8
Max. :500.19 Max. :1980.0
NA's :220 NA's :246
weekly_icu_admissions_per_million weekly_hosp_admissions
Min. : 0.529 Min. : 230
1st Qu.: 9.080 1st Qu.: 894
Median :15.690 Median : 2434
Mean :15.395 Mean : 8879
3rd Qu.:22.003 3rd Qu.: 8993
Max. :29.198 Max. :92322
NA's :246 NA's :235
weekly_hosp_admissions_per_million total_tests new_tests
Min. : 30.10 Min. : 55939 Min. : 162
1st Qu.: 74.18 1st Qu.: 2056989 1st Qu.: 5700
Median :121.31 Median : 7072918 Median : 18268
Mean :133.89 Mean : 38588425 Mean : 141706
3rd Qu.:203.27 3rd Qu.: 28473905 3rd Qu.: 71131
Max. :272.91 Max. :726359152 Max. :2017702
NA's :235 NA's :156 NA's :161
total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
Min. : 11.45 Min. : 0.097 Min. : 13
1st Qu.: 273.97 1st Qu.: 0.810 1st Qu.: 3442
Median : 954.34 Median : 2.616 Median : 11916
Mean : 1876.38 Mean : 7.742 Mean : 201231
3rd Qu.: 2092.83 3rd Qu.: 5.987 3rd Qu.: 49380
Max. :21654.04 Max. :189.146 Max. :14769984
NA's :156 NA's :161 NA's :118
new_tests_smoothed_per_thousand positive_rate tests_per_case
Min. : 0.008 Min. :0.0000 Min. : 1.5
1st Qu.: 0.477 1st Qu.:0.0443 1st Qu.: 4.2
Median : 1.749 Median :0.0970 Median : 10.3
Mean : 5.240 Mean :0.1394 Mean : 696.6
3rd Qu.: 4.349 3rd Qu.:0.2386 3rd Qu.: 22.6
Max. :110.367 Max. :0.6703 Max. :65979.5
NA's :118 NA's :127 NA's :127
tests_units total_vaccinations people_vaccinated
Length:254 Min. :1.050e+05 Min. :5.568e+04
Class :character 1st Qu.:4.256e+06 1st Qu.:2.258e+06
Mode :character Median :1.549e+07 Median :8.206e+06
Mean :3.647e+08 Mean :1.795e+08
3rd Qu.:9.113e+07 3rd Qu.:4.509e+07
Max. :9.178e+09 Max. :4.558e+09
NA's :155 NA's :162
people_fully_vaccinated total_boosters new_vaccinations
Min. :4.931e+04 Min. : 5245 Min. : 45
1st Qu.:1.832e+06 1st Qu.: 834874 1st Qu.: 4035
Median :6.659e+06 Median : 2729817 Median : 28059
Mean :1.506e+08 Mean : 28731545 Mean : 1611225
3rd Qu.:3.982e+07 3rd Qu.: 9342736 3rd Qu.: 186534
Max. :3.879e+09 Max. :536089356 Max. :35457928
NA's :161 NA's :179 NA's :169
new_vaccinations_smoothed total_vaccinations_per_hundred
Min. : 0 Min. : 6.79
1st Qu.: 987 1st Qu.: 96.15
Median : 10899 Median :148.62
Mean : 629396 Mean :133.91
3rd Qu.: 69828 3rd Qu.:180.89
Max. :35763420 Max. :275.36
NA's :23 NA's :155
people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
Min. : 4.72 Min. : 2.05
1st Qu.:50.73 1st Qu.:43.51
Median :69.69 Median :63.24
Mean :62.04 Mean :55.63
3rd Qu.:77.56 3rd Qu.:71.47
Max. :93.19 Max. :89.24
NA's :162 NA's :161
total_boosters_per_hundred new_vaccinations_smoothed_per_million
Min. : 0.020 Min. : 0
1st Qu.: 6.915 1st Qu.: 750
Median :19.900 Median : 2253
Mean :22.071 Mean : 2995
3rd Qu.:33.240 3rd Qu.: 4310
Max. :57.400 Max. :27586
NA's :179 NA's :23
new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred
Min. : 0 Min. :0.00000
1st Qu.: 199 1st Qu.:0.01500
Median : 2002 Median :0.03300
Mean : 121963 Mean :0.06284
3rd Qu.: 16102 3rd Qu.:0.06900
Max. :6989791 Max. :0.50500
NA's :23 NA's :23
stringency_index population_density median_age aged_65_older
Min. : 6.48 Min. : 0.137 Min. :15.10 Min. : 1.144
1st Qu.:35.19 1st Qu.: 38.612 1st Qu.:22.27 1st Qu.: 3.526
Median :43.95 Median : 90.672 Median :29.80 Median : 6.378
Mean :44.51 Mean : 444.976 Mean :30.53 Mean : 8.703
3rd Qu.:53.80 3rd Qu.: 225.097 3rd Qu.:38.80 3rd Qu.:13.928
Max. :85.08 Max. :20546.766 Max. :48.20 Max. :27.049
NA's :73 NA's :39 NA's :54 NA's :61
aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate
Min. : 0.526 Min. : 661.2 Min. : 0.100 Min. : 79.37
1st Qu.: 2.099 1st Qu.: 4126.5 1st Qu.: 0.625 1st Qu.:174.59
Median : 3.893 Median : 12595.3 Median : 2.500 Median :245.06
Mean : 5.499 Mean : 19173.6 Mean :13.848 Mean :264.27
3rd Qu.: 8.638 3rd Qu.: 27341.8 3rd Qu.:21.350 3rd Qu.:331.93
Max. :18.493 Max. :116935.6 Max. :77.600 Max. :724.42
NA's :56 NA's :58 NA's :128 NA's :58
diabetes_prevalence female_smokers male_smokers handwashing_facilities
Min. : 0.990 Min. : 0.10 Min. : 7.70 Min. : 1.188
1st Qu.: 5.378 1st Qu.: 1.90 1st Qu.:22.60 1st Qu.: 20.482
Median : 7.205 Median : 6.30 Median :33.10 Median : 49.691
Mean : 8.561 Mean :10.79 Mean :32.91 Mean : 50.789
3rd Qu.:10.770 3rd Qu.:19.20 3rd Qu.:41.30 3rd Qu.: 82.687
Max. :30.530 Max. :44.00 Max. :78.10 Max. :100.000
NA's :48 NA's :107 NA's :109 NA's :158
hospital_beds_per_thousand life_expectancy human_development_index
Min. : 0.100 Min. :53.28 Min. :0.3940
1st Qu.: 1.300 1st Qu.:69.59 1st Qu.:0.6030
Median : 2.500 Median :75.05 Median :0.7400
Mean : 3.097 Mean :73.74 Mean :0.7225
3rd Qu.: 4.200 3rd Qu.:79.46 3rd Qu.:0.8287
Max. :13.800 Max. :86.75 Max. :0.9570
NA's :81 NA's :21 NA's :64
population excess_mortality_cumulative_absolute
Min. :4.700e+01 Min. : -13147.4
1st Qu.:4.677e+05 1st Qu.: 205.4
Median :5.763e+06 Median : 4737.8
Mean :1.275e+08 Mean : 52703.1
3rd Qu.:2.827e+07 3rd Qu.: 22096.6
Max. :7.975e+09 Max. :1076833.1
NA's :189
excess_mortality_cumulative excess_mortality
Min. :-6.47 Min. :-24.71
1st Qu.: 6.30 1st Qu.: 3.85
Median :19.24 Median : 13.91
Mean :17.43 Mean : 18.83
3rd Qu.:25.67 3rd Qu.: 32.22
Max. :50.05 Max. : 76.05
NA's :189 NA's :189
excess_mortality_cumulative_per_million
Min. :-1034.9
1st Qu.: 782.7
Median : 1773.4
Mean : 2213.6
3rd Qu.: 3305.7
Max. : 7421.2
NA's :189
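The NA counts scattered through the summary() output can also be collected in one compact table. Here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, they only stand in for the filtered OWID table):

```r
# Toy data frame standing in for the filtered table (values are made up).
toy <- data.frame(
  iso_code     = c("AAA", "BBB", "CCC", "DDD"),
  total_cases  = c(10, NA, 30, 40),
  icu_patients = c(NA, NA, NA, 5),
  stringsAsFactors = FALSE
)

# Number of missing values per column, most incomplete column first.
na_count <- sort(colSums(is.na(toy)), decreasing = TRUE)
# Same information expressed as a share of the rows.
na_share <- na_count / nrow(toy)

print(data.frame(na_count, na_share))
```

On the real table the same two lines applied to `data` should make the mostly empty columns (ICU, hospital, excess mortality...) stand out immediately.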
# The dataset contains a few "summary rows" (continents, income categories...).
# Looking a bit closer, we notice that their ISO code starts with "OWID_":
owid_lines <- startsWith(data$iso_code, "OWID_")
data[owid_lines,]
| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ⋯ | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 2093 | OWID_AFR | | Africa | 2021-12-31 | 9850360 | 58952 | 44044.714 | 228878 | 295 | 211.571 | ⋯ | NA | NA | NA | NA | NA | 1426736614 | NA | NA | NA | NA |
| 17104 | OWID_ASI | | Asia | 2021-12-31 | 84647174 | 114105 | 85279.143 | 1256618 | 1055 | 1107.429 | ⋯ | NA | NA | NA | NA | NA | 4721383370 | NA | NA | NA | NA |
| 87984 | OWID_ENG | Europe | England | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 56550000 | NA | NA | NA | NA |
| 96168 | OWID_EUR | | Europe | 2021-12-31 | 86422548 | 565941 | 569201.000 | 1566172 | 2308 | 3111.857 | ⋯ | NA | NA | NA | NA | NA | 744807803 | NA | NA | NA | NA |
| 97538 | OWID_EUN | | European Union | 2021-12-31 | 53608595 | 241084 | 351451.000 | 915208 | 1139 | 1654.286 | ⋯ | NA | NA | NA | NA | NA | 450146793 | NA | NA | NA | NA |
| 131649 | OWID_HIC | | High income | 2021-12-31 | 134132754 | 1084099 | 912211.429 | 2042081 | 3297 | 3470.714 | ⋯ | NA | NA | NA | NA | NA | 1250514600 | NA | NA | NA | NA |
| 158897 | OWID_KOS | Europe | Kosovo | 2021-12-31 | 161399 | 53 | 20.571 | 2980 | 0 | 0.143 | ⋯ | NA | NA | NA | NA | NA | 1782115 | 6396.4 | 32.62 | 13.45 | 3848.565 |
| 173907 | OWID_LIC | | Low income | 2021-12-31 | 1805520 | 23632 | 16829.571 | 42106 | 38 | 47.714 | ⋯ | NA | NA | NA | NA | NA | 737604900 | NA | NA | NA | NA |
| 175271 | OWID_LMC | | Lower middle income | 2021-12-31 | 65603392 | 70197 | 55666.714 | 1187053 | 831 | 1072.143 | ⋯ | NA | NA | NA | NA | NA | 3432097300 | NA | NA | NA | NA |
| 221102 | OWID_NAM | | North America | 2021-12-31 | 64102325 | 542800 | 357089.571 | 1224887 | 2012 | 1750.286 | ⋯ | NA | NA | NA | NA | NA | 600323657 | NA | NA | NA | NA |
| 224823 | OWID_CYN | Asia | Northern Cyprus | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 382836 | NA | NA | NA | NA |
| 225824 | OWID_NIR | Europe | Northern Ireland | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 1896000 | NA | NA | NA | NA |
| 229914 | OWID_OCE | | Oceania | 2021-12-31 | 548908 | 670 | 14575.286 | 4989 | 14 | 13.714 | ⋯ | NA | NA | NA | NA | NA | 45038860 | NA | NA | NA | NA |
| 270779 | OWID_SCT | Europe | Scotland | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 5466000 | NA | NA | NA | NA |
| 287145 | OWID_SAM | | South America | 2021-12-31 | 39873056 | 67361 | 52751.429 | 1192550 | 281 | 286.857 | ⋯ | NA | NA | NA | NA | NA | 436816679 | NA | NA | NA | NA |
| 328061 | OWID_UMC | | Upper middle income | 2021-12-31 | 83617794 | 169473 | 136426.714 | 2200159 | 1793 | 1887.857 | ⋯ | NA | NA | NA | NA | NA | 2525921300 | NA | NA | NA | NA |
| 337532 | OWID_WLS | Europe | Wales | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 3170000 | NA | NA | NA | NA |
| 340184 | OWID_WRL | | World | 2021-12-31 | 285446097 | 1349829 | 1122946.000 | 5474098 | 5965 | 6481.714 | ⋯ | 34.635 | 60.13 | 2.705 | 72.58 | 0.737 | 7975105024 | NA | NA | NA | NA |
# All these special rows should a priori be removed, either because they are aggregates (which can be recomputed later),
# or because they concern territories with very little data, apparently with the exception of Kosovo. Let's keep only that row.
kosovo_index <- data$iso_code == "OWID_KOS"
kosovo_line <- data[kosovo_index,]
kosovo_line[1] <- "KOS"
data <- rbind(data[!owid_lines,], kosovo_line)
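The same filtering can be written with a single logical mask instead of rbind(). This is only a sketch on a made-up data frame (`toy` is hypothetical); the one behavioural difference is that Kosovo keeps its original position instead of being appended at the end.

```r
# Toy data frame with two ordinary countries and three "OWID_" rows.
toy <- data.frame(
  iso_code = c("FRA", "OWID_EUR", "OWID_KOS", "OWID_WRL", "JPN"),
  location = c("France", "Europe", "Kosovo", "World", "Japan"),
  stringsAsFactors = FALSE
)

is_owid   <- startsWith(toy$iso_code, "OWID_")
is_kosovo <- toy$iso_code == "OWID_KOS"

# Keep ordinary rows, plus the single special row we want to rescue.
kept <- toy[!is_owid | is_kosovo, ]
# Give Kosovo a 3-letter code like every other country.
kept$iso_code[kept$iso_code == "OWID_KOS"] <- "KOS"

print(kept)
```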
# There are several ways to drop the rows that contain missing values.
# Here are 3 of them, from the most complex to the simplest:
data1 <- data[apply(data, 1, function(row) all(!is.na(row))),] # method 1
data2 <- data[complete.cases(data),] # method 2
data3 <- na.omit(data) # method 3
# We then notice that no rows are left ==> we need to restrict the columns
# (or "guess" the missing values one way or another: see the missMDA package)
# (here we stick to the simple version: no missing data in the input).
nrow(data3)
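To convince yourself that the three methods select the same rows, here is a small sketch on a made-up data frame (`toy` is hypothetical). Note that na.omit() also attaches an `na.action` attribute, so the results are compared through their row names rather than with identical().

```r
# Toy data frame mixing a numeric and a character column, with NAs in both.
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA),
  stringsAsFactors = FALSE
)

m1 <- toy[apply(toy, 1, function(row) all(!is.na(row))), ]  # method 1
m2 <- toy[complete.cases(toy), ]                             # method 2
m3 <- na.omit(toy)                                           # method 3

# All three keep only the first row.
print(rownames(m1))
print(rownames(m2))
print(rownames(m3))
```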
# What do the variables represent?
colnames(data)
- 'iso_code'
- 'continent'
- 'location'
- 'date'
- 'total_cases'
- 'new_cases'
- 'new_cases_smoothed'
- 'total_deaths'
- 'new_deaths'
- 'new_deaths_smoothed'
- 'total_cases_per_million'
- 'new_cases_per_million'
- 'new_cases_smoothed_per_million'
- 'total_deaths_per_million'
- 'new_deaths_per_million'
- 'new_deaths_smoothed_per_million'
- 'reproduction_rate'
- 'icu_patients'
- 'icu_patients_per_million'
- 'hosp_patients'
- 'hosp_patients_per_million'
- 'weekly_icu_admissions'
- 'weekly_icu_admissions_per_million'
- 'weekly_hosp_admissions'
- 'weekly_hosp_admissions_per_million'
- 'total_tests'
- 'new_tests'
- 'total_tests_per_thousand'
- 'new_tests_per_thousand'
- 'new_tests_smoothed'
- 'new_tests_smoothed_per_thousand'
- 'positive_rate'
- 'tests_per_case'
- 'tests_units'
- 'total_vaccinations'
- 'people_vaccinated'
- 'people_fully_vaccinated'
- 'total_boosters'
- 'new_vaccinations'
- 'new_vaccinations_smoothed'
- 'total_vaccinations_per_hundred'
- 'people_vaccinated_per_hundred'
- 'people_fully_vaccinated_per_hundred'
- 'total_boosters_per_hundred'
- 'new_vaccinations_smoothed_per_million'
- 'new_people_vaccinated_smoothed'
- 'new_people_vaccinated_smoothed_per_hundred'
- 'stringency_index'
- 'population_density'
- 'median_age'
- 'aged_65_older'
- 'aged_70_older'
- 'gdp_per_capita'
- 'extreme_poverty'
- 'cardiovasc_death_rate'
- 'diabetes_prevalence'
- 'female_smokers'
- 'male_smokers'
- 'handwashing_facilities'
- 'hospital_beds_per_thousand'
- 'life_expectancy'
- 'human_development_index'
- 'population'
- 'excess_mortality_cumulative_absolute'
- 'excess_mortality_cumulative'
- 'excess_mortality'
- 'excess_mortality_cumulative_per_million'
# "new_*" variables: daily snapshot of some indicator. We will not look at them here (see below).
# Same for the "weekly_*" variables (weekly indicators, I suppose). What remains:
selection <- colnames(data)[!startsWith(colnames(data), "new_") & !startsWith(colnames(data), "weekly_")]
# Columns with more than 50% of their values filled in
selection <- selection[ apply(data[,selection], 2, function(col) sum(!is.na(col)) > nrow(data)/2) ]
selection
- 'iso_code'
- 'continent'
- 'location'
- 'date'
- 'total_cases'
- 'total_deaths'
- 'total_cases_per_million'
- 'total_deaths_per_million'
- 'reproduction_rate'
- 'positive_rate'
- 'tests_per_case'
- 'tests_units'
- 'stringency_index'
- 'population_density'
- 'median_age'
- 'aged_65_older'
- 'aged_70_older'
- 'gdp_per_capita'
- 'extreme_poverty'
- 'cardiovasc_death_rate'
- 'diabetes_prevalence'
- 'female_smokers'
- 'male_smokers'
- 'hospital_beds_per_thousand'
- 'life_expectancy'
- 'human_development_index'
- 'population'
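The "more than 50% filled in" rule can also be written with colMeans(). The sketch below uses a made-up data frame (`toy` is hypothetical) and also illustrates one pitfall: a text column whose missing entries are empty strings rather than NA passes the filter, which may be why tests_units is still in the list above.

```r
# Toy data frame: one complete column, one half-filled, one sparse,
# and one text column where "missing" is coded as an empty string.
toy <- data.frame(
  full   = c(1, 2, 3, 4),
  half   = c(1, NA, 3, NA),
  sparse = c(NA, NA, NA, 4),
  label  = c("", "", "", "tests performed"),
  stringsAsFactors = FALSE
)

# Share of non-missing values per column, then strict "> 50%" rule.
filled_share <- colMeans(!is.na(toy))
keep_before  <- names(filled_share)[filled_share > 0.5]

# Recode empty strings as NA and apply the same rule again.
toy$label[toy$label == ""] <- NA
filled_share2 <- colMeans(!is.na(toy))
keep_after    <- names(filled_share2)[filled_share2 > 0.5]

print(keep_before)
print(keep_after)
```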
Things are finally clearer:
- iso_code: country identifier, 3 letters
- location: country name
- continent, date: well, continent and date =)
- total_cases: total number of cases recorded up to dmax
- total_deaths: total number of deaths recorded up to dmax
- total_cases_per_million: total cases relative to the population (per million inhabitants)
- total_deaths_per_million: total deaths relative to the population (per million inhabitants)
- tests_units: "Units used by the location to report its testing data" https://github.com/owid/covid-19-data/blob/master/public/data/README.md
in practice this column only holds text labels and is often empty ==> unusable here.
- population: number of inhabitants
- population_density: population density (per square kilometre)
- median_age: median age; 50% of people are younger and 50% are older
- aged_65_older: percentage of people aged 65 or older
- aged_70_older: same with 70
- gdp_per_capita: GDP per inhabitant
- extreme_poverty: percentage of the population living below the extreme poverty line
- cardiovasc_death_rate: "Death rate from cardiovascular disease in 2017 (annual number of deaths per 100,000 people)"
- diabetes_prevalence: "Diabetes prevalence (% of population aged 20 to 79) in 2017"
- female_smokers: percentage of women who smoke
- male_smokers: percentage of men who smoke
- hospital_beds_per_thousand: number of hospital beds per 1,000 inhabitants
- life_expectancy: life expectancy
- human_development_index: human development index
data$tests_units #???
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'units unclear'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- ''
- 'samples tested'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- ''
- ''
- 'people tested'
- 'tests performed'
- 'people tested'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- ''
- ''
- 'samples tested'
- 'people tested'
- ''
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- ''
- 'tests performed'
- 'people tested'
- ''
- 'tests performed'
- 'tests performed'
- 'people tested'
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'samples tested'
- ''
- ''
- ''
- 'tests performed'
- 'people tested'
- ''
- ''
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'samples tested'
- 'people tested'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'people tested'
- 'tests performed'
- 'samples tested'
- 'people tested'
- ''
- 'tests performed'
- ''
- 'tests performed'
- ''
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- ''
- ''
- ''
- 'samples tested'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'people tested'
- 'samples tested'
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- ''
- 'people tested'
- ''
- 'tests performed'
- ''
- 'samples tested'
- ''
- ''
- 'people tested'
- 'tests performed'
- 'samples tested'
- 'tests performed'
- ''
- 'samples tested'
- 'tests performed'
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- 'samples tested'
- 'tests performed'
- ''
- 'people tested'
- ''
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'people tested'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'samples tested'
- ''
- ''
- 'people tested'
- ''
- ''
- ''
- 'tests performed'
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'people tested'
- ''
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- ''
- ''
- 'people tested'
- 'people tested'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'people tested'
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- ''
- 'people tested'
- 'people tested'
- 'tests performed'
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'people tested'
- ''
- ''
- ''
- ''
- 'people tested'
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
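Rather than printing every value, a frequency table summarizes such a text column in a few lines. A minimal sketch on a made-up vector (`tests_units_toy` is hypothetical and much shorter than the real column):

```r
# Toy vector imitating the kind of labels seen above.
tests_units_toy <- c("", "tests performed", "", "people tested",
                     "tests performed", "samples tested", "units unclear", "")

# Count each distinct label, most frequent first.
counts <- sort(table(tests_units_toy, useNA = "ifany"), decreasing = TRUE)
print(counts)

# Share of entries that are just an empty string.
empty_share <- mean(tests_units_toy == "")
print(empty_share)
```

On the real data, `table(data$tests_units)` should be enough to see that the column is a categorical label, not a quantity that a PCA could use directly.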
selection <- selection[selection != "tests_units"]
newData <- na.omit(data[,selection])
nrow(newData)
92 rows is reasonable (a bit under 40% of the 237 countries and territories kept so far). However, to be consistent we still need to choose one type of variable: absolute or relative? I prefer relative indicators (*_per_million, *_density):
selection <- selection[!selection %in% c("total_cases", "total_deaths", "population")]
newData <- na.omit(data[,selection])
nrow(newData) # still 92
rownames(newData) <- newData$iso_code # used to label the individuals in the plots
newData <- newData[,-c(1,4)] # drop the "ISO code" and "date" columns, no longer needed
Note: the whole analysis up to this point could have been done just as easily in another language, Python for example.
From here on, however, the R package FactoMineR is very convenient (no Python equivalent (?!))
# ...We are finally ready for the PCA!
library(FactoMineR)
res.pca <- PCA(newData, quali.sup=1:2, ncp=6, graph=FALSE)
options(repr.plot.width=15, repr.plot.height=10)
plotInd <- plot(res.pca, choix="ind", invisible="quali")
plotVar <- plot(res.pca, choix="var")
library(gridExtra)
grid.arrange(plotInd, plotVar, ncol=2)
Warning message: “ggrepel: 5 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
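About the warning: ggrepel hides labels that would overlap too much. One possible way to get all labels drawn is to raise its global `ggrepel.max.overlaps` option before plotting. This is a sketch under the assumption that the plotting code relies on ggrepel's default for that setting (as the warning message suggests); I have not checked how FactoMineR calls ggrepel internally.

```r
# Assumption: labels are drawn by ggrepel with its default max.overlaps,
# which is read from this global option. Inf means "never drop a label".
options(ggrepel.max.overlaps = Inf)

# Check the value currently in effect.
print(getOption("ggrepel.max.overlaps"))
```

The price to pay is a more cluttered plot, so it is mostly useful when the figure is large enough.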
Well, it seems that no individual stands out. In fact, some rows have almost all their values filled in, and once completed with the help of Google we find extreme individuals (Monaco, Singapore). EDIT: here I did not complete the file - you can do it.
extremes <- which(data$location %in% c("Monaco", "Singapore"))
data[extremes,selection]
| | iso_code | continent | location | date | total_cases_per_million | total_deaths_per_million | reproduction_rate | positive_rate | tests_per_case | stringency_index | ⋯ | aged_70_older | gdp_per_capita | extreme_poverty | cardiovasc_death_rate | diabetes_prevalence | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 197910 | MCO | Europe | Monaco | 2021-12-31 | 138280.67 | 1041.353 | 0.93 | NA | NA | 34.13 | ⋯ | NA | NA | NA | NA | 5.46 | NA | NA | 13.8 | 86.75 | NA |
| 277597 | SGP | Asia | Singapore | 2021-12-31 | 49566.07 | 146.886 | 1.14 | NA | NA | 42.77 | ⋯ | 7.049 | 85535.38 | NA | 92.243 | 10.99 | 5.2 | 28.3 | 2.4 | 83.62 | 0.938 |
As for the variables, aged_65_older and aged_70_older appear highly correlated (their arrows are in fact almost superimposed). That makes sense, so we will keep only aged_70_older after a numerical check:
cor(newData[,c("aged_65_older", "aged_70_older")])
| | aged_65_older | aged_70_older |
|---|---|---|
| aged_65_older | 1.0000000 | 0.9939191 |
| aged_70_older | 0.9939191 | 1.0000000 |
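Here the redundant pair was spotted by eye on the correlation circle, but it can also be searched for systematically. The sketch below lists every pair of columns whose absolute correlation exceeds a threshold; the data frame `toy`, its column names and the 0.95 cut-off are all hypothetical choices for illustration.

```r
# Deterministic toy data: aged_70 is almost a copy of aged_65, other is unrelated.
x <- 1:20
toy <- data.frame(
  aged_65 = x,
  aged_70 = x + rep(c(-0.5, 0.5), 10),
  other   = rep(c(1, 5, 2, 4), 5)
)

cm <- cor(toy)
# Blank out the diagonal and upper triangle so each pair appears only once.
cm[upper.tri(cm, diag = TRUE)] <- NA

# Positions of the remaining entries above the (arbitrary) threshold.
high <- which(abs(cm) > 0.95, arr.ind = TRUE)
redundant <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(redundant)
```

Which variable of a redundant pair to drop remains a judgment call; the code only points at the candidates.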
newData <- subset(newData, select=-aged_65_older)
res.pca <- PCA(newData, quali.sup=1:2, ncp=6, graph=FALSE)
plotInd <- plot(res.pca, choix="ind", invisible="quali", habillage=1)
plotVar <- plot(res.pca, choix="var")
library(gridExtra)
grid.arrange(plotInd, plotVar, ncol=2)
Warning message: “ggrepel: 2 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
About 65% of the inertia is explained by this first PCA plane (51 + 14). The correlation circle logically opposes "wealth" on the right (HDI, GDP per capita) to "poverty" on the left. On the cloud of individuals, this roughly corresponds to the opposition between Western Europe and Africa (with a few exceptions: Tunisia, Seychelles, ...).
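To see where the percentages of inertia and the arrows of the correlation circle come from, both can be recomputed by hand. This sketch swaps FactoMineR for base R's prcomp() and uses the built-in USArrests dataset as a stand-in, so the numbers are not those of the COVID analysis. It assumes a standardized PCA (centered and scaled variables), which as far as I know is also the default of PCA().

```r
# Stand-in dataset shipped with R (4 numeric variables, 50 individuals).
X <- USArrests
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigenvalues = variances of the components; their shares are the "inertia".
eig <- pca$sdev^2
pct <- 100 * eig / sum(eig)
plane_pct <- sum(pct[1:2])   # inertia explained by the first plane
print(round(pct, 1))

# Correlation circle: coordinate of variable j on axis k = cor(variable j, component k).
circle <- sweep(pca$rotation, 2, pca$sdev, "*")  # loadings times sdev
check  <- cor(X, pca$x)                          # direct computation
print(round(circle[, 1:2], 2))
```

Two variables whose arrows nearly coincide and reach the circle are strongly correlated; arrows pointing in opposite directions indicate a negative correlation, at least for variables that are well represented in the plane.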
It is interesting to note that the wealth indicators are clearly positively correlated with the number of deaths per million, which is itself anti-correlated with extreme_poverty: would COVID rather hit the rich? But why, since the virus is everywhere? Well, part of the answer lies in this same PCA plane: aged_70_older => people live longer there, and to a lesser extent diabetes_prevalence => more cases of diabetes (to be checked numerically).
cor(newData[,c("aged_70_older", "total_deaths_per_million", "human_development_index",
"diabetes_prevalence", "extreme_poverty")])
| | aged_70_older | total_deaths_per_million | human_development_index | diabetes_prevalence | extreme_poverty |
|---|---|---|---|---|---|
| aged_70_older | 1.00000000 | 0.6116174 | 0.8203092 | -0.02528401 | -0.5541654 |
| total_deaths_per_million | 0.61161735 | 1.0000000 | 0.5002665 | 0.14617300 | -0.4433506 |
| human_development_index | 0.82030920 | 0.5002665 | 1.0000000 | 0.17607414 | -0.7396914 |
| diabetes_prevalence | -0.02528401 | 0.1461730 | 0.1760741 | 1.00000000 | -0.3462567 |
| extreme_poverty | -0.55416536 | -0.4433506 | -0.7396914 | -0.34625674 | 1.0000000 |
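One way to probe the "age explains it" hypothesis with the numbers above is a first-order partial correlation: how correlated are deaths per million and HDI once aged_70_older is held fixed? The sketch below only reuses three coefficients copied from the matrix; the partial correlation formula is standard, but applying it here is my own addition, not part of the original analysis.

```r
# Coefficients copied from the correlation matrix above.
r_dh <- 0.5002665  # total_deaths_per_million vs human_development_index
r_da <- 0.6116174  # total_deaths_per_million vs aged_70_older
r_ha <- 0.8203092  # human_development_index  vs aged_70_older

# Partial correlation of deaths and HDI, controlling for aged_70_older.
partial <- (r_dh - r_da * r_ha) / sqrt((1 - r_da^2) * (1 - r_ha^2))
print(partial)
```

The result is very close to zero, which is at least consistent with the idea that, in this sample of 92 countries, the link between wealth and deaths per million goes mostly through the age structure. It remains a linear, descriptive check and says nothing about causality.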
newData
| | continent | location | total_cases_per_million | total_deaths_per_million | reproduction_rate | positive_rate | tests_per_case | stringency_index | population_density | median_age | aged_70_older | gdp_per_capita | extreme_poverty | cardiovasc_death_rate | diabetes_prevalence | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| ALB | Europe | Albania | 73495.999 | 1130.064 | 1.48 | 0.1110 | 9.0 | 46.30 | 104.871 | 38.0 | 8.643 | 11803.431 | 1.1 | 304.195 | 10.08 | 7.1 | 51.2 | 2.890 | 78.57 | 0.795 |
| ARG | South America | Argentina | 127015.620 | 2596.686 | 2.11 | 0.3190 | 3.1 | 36.52 | 16.177 | 31.9 | 7.441 | 18933.907 | 0.6 | 191.032 | 5.50 | 16.2 | 27.7 | 5.000 | 76.67 | 0.845 |
| AUS | Oceania | Australia | 14004.709 | 93.325 | 2.20 | 0.0745 | 13.4 | 47.16 | 3.202 | 37.9 | 10.129 | 44648.710 | 0.5 | 107.791 | 5.07 | 13.0 | 16.5 | 3.840 | 83.44 | 0.944 |
| AUT | Europe | Austria | 141452.257 | 1867.753 | 1.18 | 0.0042 | 239.8 | 44.16 | 106.749 | 44.4 | 13.748 | 45436.686 | 0.7 | 145.183 | 6.35 | 28.4 | 30.9 | 7.370 | 81.54 | 0.922 |
| BGD | Asia | Bangladesh | 9262.063 | 163.985 | 1.43 | 0.0219 | 45.7 | 29.63 | 1265.036 | 27.5 | 3.262 | 3523.984 | 14.8 | 298.003 | 8.38 | 1.0 | 44.7 | 0.800 | 72.59 | 0.632 |
| BEL | Europe | Belgium | 179883.824 | 2429.409 | 1.25 | 0.1750 | 5.7 | 33.89 | 375.564 | 41.8 | 12.849 | 42658.576 | 0.2 | 114.898 | 4.29 | 25.1 | 31.4 | 5.640 | 81.63 | 0.931 |
| BIH | Europe | Bosnia and Herzegovina | 89830.928 | 4152.737 | 1.36 | 0.1867 | 5.4 | 35.19 | 68.496 | 42.5 | 10.711 | 11713.895 | 0.2 | 329.635 | 10.08 | 30.2 | 47.7 | 3.500 | 77.40 | 0.780 |
| BGR | Europe | Bulgaria | 108210.980 | 4501.357 | 1.42 | 0.0838 | 11.9 | 45.82 | 65.180 | 44.7 | 13.272 | 18563.307 | 1.5 | 424.688 | 5.81 | 30.1 | 44.4 | 7.454 | 75.05 | 0.816 |
| CAN | North America | Canada | 54674.470 | 779.054 | 1.49 | 0.2292 | 4.4 | 68.16 | 4.037 | 41.4 | 10.797 | 44017.591 | 0.5 | 105.599 | 7.37 | 12.0 | 16.6 | 2.500 | 82.43 | 0.929 |
| CHL | South America | Chile | 92058.065 | 1994.314 | 1.19 | 0.0258 | 38.8 | 31.42 | 24.282 | 35.4 | 6.938 | 22767.037 | 1.3 | 127.993 | 8.46 | 34.2 | 41.5 | 2.110 | 80.18 | 0.851 |
| CHN | Asia | China | 92.624 | 3.997 | 1.31 | 0.0000 | 65979.5 | 79.17 | 147.674 | 38.7 | 5.929 | 15308.712 | 0.7 | 261.899 | 9.74 | 1.9 | 48.4 | 4.340 | 76.91 | 0.761 |
| COL | South America | Colombia | 99059.263 | 2503.488 | 2.01 | 0.1350 | 7.4 | 49.40 | 44.223 | 32.2 | 4.312 | 13254.949 | 4.5 | 124.240 | 7.44 | 4.7 | 13.5 | 1.710 | 77.29 | 0.767 |
| CRI | North America | Costa Rica | 110206.152 | 1419.462 | 1.05 | 0.0910 | 11.0 | 45.31 | 96.079 | 33.6 | 5.694 | 15524.995 | 1.3 | 137.973 | 8.78 | 6.4 | 17.4 | 1.130 | 80.28 | 0.810 |
| HRV | Europe | Croatia | 176082.986 | 3099.722 | 1.31 | 0.3788 | 2.6 | 38.05 | 73.726 | 44.0 | 13.053 | 22669.797 | 0.7 | 253.782 | 5.59 | 34.3 | 39.9 | 5.540 | 78.49 | 0.851 |
| DNK | Europe | Denmark | 133231.468 | 553.529 | 1.34 | 0.1133 | 8.8 | 31.52 | 136.520 | 42.3 | 12.325 | 46682.515 | 0.2 | 114.767 | 6.41 | 19.3 | 18.8 | 2.500 | 80.90 | 0.940 |
| DOM | North America | Dominican Republic | 37160.446 | 378.134 | 1.62 | 0.0994 | 10.1 | 47.08 | 222.873 | 27.6 | 4.419 | 14600.861 | 1.6 | 266.653 | 8.20 | 8.5 | 19.1 | 1.600 | 74.08 | 0.756 |
| ECU | South America | Ecuador | 30320.534 | 1870.396 | 1.26 | 0.2760 | 3.6 | 53.30 | 66.939 | 28.1 | 4.458 | 10581.936 | 3.6 | 140.448 | 5.55 | 2.0 | 12.3 | 1.500 | 77.01 | 0.759 |
| SLV | North America | El Salvador | 19212.981 | 603.340 | 0.65 | 0.0130 | 76.9 | 31.48 | 307.811 | 27.6 | 5.417 | 7292.458 | 2.2 | 167.295 | 8.87 | 2.5 | 18.8 | 1.300 | 73.32 | 0.673 |
| EST | Europe | Estonia | 171039.256 | 1378.516 | 1.27 | 0.1241 | 8.1 | 37.29 | 31.033 | 42.7 | 13.491 | 29481.252 | 0.5 | 255.569 | 4.02 | 24.5 | 39.3 | 4.690 | 78.74 | 0.892 |
| ETH | Africa | Ethiopia | 3367.185 | 56.136 | 1.42 | 0.4357 | 2.3 | 40.74 | 104.957 | 19.8 | 2.063 | 1729.927 | 26.7 | 182.634 | 7.47 | 0.4 | 8.5 | 0.300 | 66.60 | 0.485 |
| GMB | Africa | Gambia | 3758.322 | 126.756 | 0.53 | 0.0534 | 18.7 | 13.89 | 207.566 | 17.5 | 1.417 | 1561.767 | 10.1 | 331.430 | 1.91 | 0.7 | 31.2 | 1.100 | 62.05 | 0.496 |
| GEO | Asia | Georgia | 249638.058 | 3685.518 | 1.02 | 0.0601 | 16.6 | 50.00 | 65.032 | 38.7 | 10.244 | 9745.079 | 4.2 | 496.218 | 7.11 | 5.3 | 55.5 | 2.600 | 73.77 | 0.812 |
| GHA | Africa | Ghana | 4364.905 | 39.013 | 1.17 | 0.2351 | 4.3 | 51.27 | 126.719 | 21.1 | 1.948 | 4227.630 | 12.0 | 298.245 | 4.97 | 0.3 | 7.7 | 0.900 | 64.07 | 0.611 |
| GRC | Europe | Greece | 104473.849 | 1974.488 | 1.75 | 0.0581 | 17.2 | 74.28 | 83.479 | 45.3 | 14.524 | 24574.382 | 1.5 | 175.695 | 4.55 | 35.3 | 52.0 | 4.210 | 82.24 | 0.888 |
| HUN | Europe | Hungary | 126053.645 | 3931.454 | 1.00 | 0.1975 | 5.1 | 27.96 | 108.043 | 43.4 | 11.976 | 26777.561 | 0.5 | 278.296 | 7.55 | 26.8 | 34.8 | 7.020 | 76.88 | 0.854 |
| IND | Asia | India | 24583.308 | 339.465 | 2.10 | 0.0101 | 98.7 | 68.64 | 450.419 | 28.2 | 3.414 | 6426.674 | 21.2 | 282.280 | 10.39 | 1.9 | 20.6 | 0.530 | 69.66 | 0.645 |
| IDN | Asia | Indonesia | 15472.592 | 523.025 | 1.16 | 0.0012 | 830.5 | 66.69 | 145.725 | 29.3 | 3.053 | 11188.744 | 5.7 | 342.864 | 6.32 | 2.8 | 76.1 | 1.040 | 71.72 | 0.718 |
| IRN | Asia | Iran | 69934.029 | 1485.840 | 0.86 | 0.0181 | 55.2 | 54.63 | 49.831 | 32.4 | 3.182 | 19082.620 | 0.2 | 270.308 | 9.59 | 0.8 | 21.1 | 1.500 | 76.68 | 0.783 |
| IRL | Europe | Ireland | 149793.912 | 1211.202 | 1.53 | 0.4300 | 2.3 | 42.84 | 69.874 | 38.7 | 8.678 | 67335.293 | 0.2 | 126.459 | 3.28 | 23.0 | 25.7 | 2.960 | 82.30 | 0.955 |
| ISR | Asia | Israel | 146252.196 | 874.061 | 2.10 | 0.0280 | 35.7 | 48.91 | 402.606 | 30.6 | 7.359 | 33132.320 | 0.5 | 93.320 | 6.74 | 15.4 | 35.4 | 2.990 | 82.97 | 0.919 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| MAR | Africa | Morocco | 25656.966 | 396.284 | 2.25 | 0.0777 | 12.9 | 65.27 | 80.080 | 29.6 | 4.209 | 7485.013 | 1.0 | 419.146 | 7.14 | 0.8 | 47.1 | 1.100 | 76.68 | 0.686 |
| MOZ | Africa | Mozambique | 5587.555 | 60.541 | 1.48 | 0.6374 | 1.6 | 37.04 | 37.728 | 17.7 | 1.870 | 1136.103 | 62.9 | 329.942 | 3.30 | 5.1 | 29.1 | 0.700 | 60.85 | 0.456 |
| MMR | Asia | Myanmar | 9797.725 | 355.634 | 0.87 | 0.0116 | 86.3 | 80.56 | 81.721 | 29.1 | 3.120 | 5591.597 | 6.4 | 202.104 | 4.61 | 6.3 | 35.2 | 0.900 | 67.13 | 0.583 |
| NPL | Asia | Nepal | 27119.361 | 379.539 | 1.22 | 0.0250 | 40.0 | 33.33 | 204.430 | 25.0 | 3.212 | 2442.804 | 15.0 | 260.797 | 7.26 | 9.5 | 37.8 | 0.300 | 70.78 | 0.602 |
| NOR | Europe | Norway | 72669.388 | 256.518 | 1.10 | 0.2700 | 3.7 | 51.85 | 14.462 | 39.7 | 10.813 | 64800.057 | 0.2 | 114.316 | 5.31 | 19.6 | 20.7 | 3.600 | 82.40 | 0.957 |
| PAK | Asia | Pakistan | 5490.774 | 122.638 | 1.44 | 0.0088 | 113.5 | 47.52 | 255.573 | 23.5 | 2.780 | 5034.708 | 4.0 | 423.031 | 8.35 | 2.8 | 36.7 | 0.600 | 67.27 | 0.557 |
| PAN | North America | Panama | 111383.433 | 1684.215 | 1.71 | 0.1076 | 9.3 | 42.34 | 55.133 | 29.7 | 5.030 | 22267.037 | 2.2 | 128.346 | 8.33 | 2.4 | 9.9 | 2.300 | 78.51 | 0.815 |
| PRY | South America | Paraguay | 68738.907 | 2451.648 | 1.19 | 0.0445 | 22.5 | 40.74 | 17.144 | 26.5 | 3.833 | 8827.010 | 1.7 | 199.128 | 8.27 | 5.0 | 21.6 | 1.300 | 74.25 | 0.728 |
| PRT | Europe | Portugal | 124888.605 | 1838.990 | 1.55 | 0.0731 | 13.7 | 48.15 | 112.371 | 46.2 | 14.924 | 27936.896 | 0.5 | 127.842 | 9.85 | 16.3 | 30.0 | 3.390 | 82.05 | 0.864 |
| ROU | Europe | Romania | 91927.269 | 2986.581 | 1.62 | 0.0275 | 36.4 | 56.48 | 85.129 | 43.0 | 11.690 | 23313.199 | 5.7 | 370.946 | 9.74 | 22.9 | 37.1 | 6.892 | 76.05 | 0.828 |
| RUS | Europe | Russia | 72557.126 | 2134.289 | 0.82 | 0.0517 | 19.3 | 54.17 | 8.823 | 39.6 | 9.393 | 24765.954 | 0.1 | 431.297 | 6.18 | 23.4 | 58.3 | 8.050 | 72.58 | 0.824 |
| SVK | Europe | Slovakia | 149152.071 | 2947.662 | 0.83 | 0.1070 | 9.3 | 56.34 | 113.128 | 41.2 | 9.167 | 30155.152 | 0.7 | 287.959 | 7.29 | 23.1 | 37.7 | 5.820 | 77.54 | 0.860 |
| ZAF | Africa | South Africa | 57543.972 | 1520.372 | 0.78 | 0.2607 | 3.8 | 44.44 | 46.754 | 27.3 | 3.053 | 12294.876 | 18.9 | 200.380 | 5.52 | 8.1 | 33.2 | 2.320 | 64.13 | 0.709 |
| KOR | Asia | South Korea | 12259.772 | 108.558 | 0.79 | 0.0236 | 42.3 | 46.39 | 527.967 | 43.4 | 8.622 | 35938.374 | 0.2 | 85.998 | 6.80 | 6.2 | 40.9 | 12.270 | 83.03 | 0.916 |
| ESP | Europe | Spain | 128265.632 | 1919.210 | 1.74 | 0.3080 | 3.2 | 43.44 | 93.105 | 45.5 | 13.799 | 34272.360 | 1.0 | 99.403 | 7.17 | 27.4 | 31.4 | 2.970 | 83.56 | 0.904 |
| LKA | Asia | Sri Lanka | 26898.175 | 686.098 | 0.94 | 0.0702 | 14.2 | 53.80 | 341.955 | 34.1 | 5.331 | 11669.077 | 0.7 | 197.093 | 10.68 | 0.3 | 27.0 | 3.600 | 76.98 | 0.782 |
| SWE | Europe | Sweden | 125711.264 | 1454.592 | 1.57 | 0.1220 | 8.2 | 41.43 | 24.718 | 41.0 | 13.433 | 46949.283 | 0.5 | 133.982 | 4.79 | 18.8 | 18.9 | 2.220 | 82.80 | 0.945 |
| THA | Asia | Thailand | 31011.538 | 302.635 | 1.09 | 0.0744 | 13.4 | 43.06 | 135.132 | 40.1 | 6.890 | 16277.671 | 0.1 | 109.861 | 7.04 | 1.9 | 38.8 | 2.100 | 77.15 | 0.777 |
| TLS | Asia | Timor | 14789.405 | 90.957 | 0.70 | 0.0023 | 429.1 | 40.58 | 87.176 | 18.0 | 1.897 | 6570.102 | 30.3 | 335.346 | 6.86 | 6.3 | 78.1 | 5.900 | 69.50 | 0.606 |
| TGO | Africa | Togo | 3408.749 | 28.027 | 1.64 | 0.2567 | 3.9 | 21.76 | 143.366 | 19.4 | 1.525 | 1429.813 | 49.2 | 280.033 | 6.15 | 0.9 | 14.2 | 0.700 | 61.04 | 0.515 |
| TUN | Africa | Tunisia | 58743.540 | 2068.935 | 1.65 | 0.0627 | 15.9 | 35.20 | 74.228 | 32.7 | 5.075 | 10849.297 | 2.0 | 318.991 | 8.52 | 1.1 | 65.8 | 2.300 | 76.70 | 0.740 |
| TUR | Asia | Turkey | 110635.410 | 962.067 | 1.39 | 0.0847 | 11.8 | 35.55 | 104.914 | 31.6 | 5.061 | 25129.341 | 0.2 | 171.285 | 12.13 | 14.1 | 41.1 | 2.810 | 77.69 | 0.820 |
| UGA | Africa | Uganda | 3019.518 | 69.778 | 1.69 | 0.1726 | 5.8 | 73.15 | 213.759 | 16.4 | 1.308 | 1697.707 | 41.6 | 213.333 | 2.50 | 3.4 | 16.7 | 0.500 | 63.37 | 0.544 |
| UKR | Europe | Ukraine | 90829.360 | 2330.704 | 0.97 | 0.1988 | 5.0 | 39.49 | 77.390 | 41.4 | 11.133 | 7894.393 | 0.1 | 539.849 | 7.11 | 13.5 | 47.4 | 8.800 | 72.06 | 0.779 |
| GBR | Europe | United Kingdom | 199109.996 | 2619.105 | 1.24 | 0.1001 | 10.0 | 44.06 | 272.898 | 40.8 | 12.527 | 39753.244 | 0.2 | 122.137 | 4.28 | 20.0 | 24.7 | 2.540 | 81.32 | 0.932 |
| USA | North America | United States | 158249.753 | 2421.163 | 1.64 | 0.2600 | 3.8 | 47.65 | 35.608 | 38.3 | 9.732 | 54225.446 | 1.2 | 151.089 | 10.79 | 19.1 | 24.6 | 2.770 | 78.86 | 0.926 |
| URY | South America | Uruguay | 119875.973 | 1802.036 | 1.57 | 0.0773 | 12.9 | 28.70 | 19.751 | 35.6 | 10.361 | 20551.409 | 0.1 | 160.708 | 6.93 | 14.0 | 19.9 | 2.800 | 77.91 | 0.817 |
| VNM | Asia | Vietnam | 17632.268 | 329.922 | 1.03 | 0.0926 | 10.8 | 69.44 | 308.127 | 32.6 | 4.718 | 6171.884 | 2.0 | 245.465 | 6.00 | 1.0 | 45.9 | 2.600 | 75.40 | 0.704 |
| ZMB | Africa | Zambia | 12448.652 | 186.335 | 1.55 | 0.3133 | 3.2 | 37.96 | 22.995 | 17.7 | 1.542 | 3689.251 | 57.5 | 234.499 | 3.94 | 3.1 | 24.7 | 2.000 | 63.89 | 0.584 |
| ZWE | Africa | Zimbabwe | 12973.101 | 306.179 | 0.81 | 0.2858 | 3.5 | 47.22 | 42.729 | 19.6 | 1.882 | 1899.775 | 21.4 | 307.846 | 1.82 | 1.6 | 30.7 | 1.700 | 61.49 | 0.571 |
The correlation (resp. anti-correlation) of aged_70_older with HDI (resp. extreme_poverty) and with total_deaths is confirmed. Likewise, there is a slight positive (resp. negative) correlation between diabetes_prevalence and HDI (resp. extreme_poverty).
Next, the female smoking rate appears strongly correlated with median age: women seem more likely to smoke in countries where people live longer (hence, in general, richer countries). This is not the case for male_smokers: the male smoking rate does not tell us much. Likewise, and more surprisingly, the cardiovascular death rate (heart attacks, I imagine) does not appear correlated with anything, except, quite logically, the proportion of smokers (male_smokers): "According to the American Heart Association, cardiovascular disease accounts for about 800,000 U.S. deaths every year, making it the leading cause of all deaths in the United States. Of those, nearly 20 percent are due to cigarette smoking." [https://www.fda.gov/tobacco-products/health-effects-tobacco-use/how-smoking-affects-heart-health#]
Colouring by continent shows a top/bottom opposition between Eastern Europe and Western Europe + USA/Canada/Israel/Korea/Australia. There seem to be relatively more smokers in Georgia/Ukraine/Russia. Central and South American countries sit lower on the plot, hence a priori less affected by deaths from heart attacks and with fewer smokers. There are not enough countries from Oceania to say much, and Asia is spread all over the plane, showing strong heterogeneity compared with the other continents.
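The correlations read off the correlation circle can be cross-checked with a plain correlation matrix. The sketch below is only illustrative: it uses eight rows copied from the table above, and it assumes that the corresponding columns are named median_age, female_smokers, male_smokers and cardiovasc_death_rate, as in the OWID data set. Coefficients computed on such a small hand-picked subset are not those of the full data.

```r
# Toy subset: eight countries copied from the table above.
# Column names are assumed to follow the OWID naming.
toy <- data.frame(
  median_age            = c(44.0, 42.3, 19.8, 38.7, 28.2, 39.6, 41.4, 17.7),
  female_smokers        = c(34.3, 19.3, 0.4, 5.3, 1.9, 23.4, 13.5, 3.1),
  male_smokers          = c(39.9, 18.8, 8.5, 55.5, 20.6, 58.3, 47.4, 24.7),
  cardiovasc_death_rate = c(253.782, 114.767, 182.634, 496.218,
                            282.280, 431.297, 539.849, 234.499),
  row.names = c("HRV", "DNK", "ETH", "GEO", "IND", "RUS", "UKR", "ZMB")
)

# Pearson correlation matrix between the four variables
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# On the full data, the same call would apply to the numeric columns, e.g.
# cor(newData[, sapply(newData, is.numeric)], use = "pairwise.complete.obs")
```

Reading the off-diagonal entries gives a numerical counterpart to the angles between arrows in the variables plot.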
Let's check our analysis by looking more closely at a few individuals:
indivs_indices <- rownames(newData) %in% c("LUX", "UKR", "NER", "ECU")
newData[indivs_indices,c("location", "total_deaths_per_million", "aged_70_older", "male_smokers",
"cardiovasc_death_rate", "human_development_index")]
| | location | total_deaths_per_million | aged_70_older | male_smokers | cardiovasc_death_rate | human_development_index |
|---|---|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| ECU | Ecuador | 1870.396 | 4.458 | 12.3 | 140.448 | 0.759 |
| LUX | Luxembourg | 980.542 | 9.842 | 26.0 | 128.275 | 0.916 |
| UKR | Ukraine | 2330.704 | 11.133 | 47.4 | 539.849 | 0.779 |
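Luxembourg: older population, high HDI, about half as many male smokers as Ukraine but about twice as many as Ecuador.
Niger: expected profile is a young population, low HDI, few smokers and very few COVID deaths. Note that "NER" is absent from the output above, so these values cannot be read from this table.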
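Only three of the four requested codes appear in the output above: `%in%` silently ignores codes that are not among the row names. A minimal sketch of how to make missing individuals explicit, using a small stand-in data frame (`toy_newData` is hypothetical and only reproduces a few values shown above):

```r
# Toy stand-in for newData, with three of the rows shown above
toy_newData <- data.frame(
  location     = c("Ecuador", "Luxembourg", "Ukraine"),
  male_smokers = c(12.3, 26.0, 47.4),
  row.names    = c("ECU", "LUX", "UKR")
)

wanted <- c("LUX", "UKR", "NER", "ECU")

# Codes that were requested but are not row names of the data frame
missing_codes <- setdiff(wanted, rownames(toy_newData))
print(missing_codes)

# Rows actually available, in the order of the data frame
found <- toy_newData[rownames(toy_newData) %in% wanted, ]
print(found)
```

Running the same `setdiff()` on the real `newData` would tell whether a country was dropped during the cleaning step (for example because of incomplete rows).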
Anyway, let's move on to the second PCA plane:
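# Individuals and variables plots on axes 3 and 4 (second PCA plane)
plotInd <- plot(res.pca, choix="ind", invisible="quali", habillage=1, axes=3:4)
plotVar <- plot(res.pca, choix="var", axes=3:4)
# Display both plots side by side
library(gridExtra)
grid.arrange(plotInd, plotVar, ncol=2)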
Warning message: “ggrepel: 12 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Little inertia is explained in this plane (barely more than 17%), but there is one interesting observation: a possible anti-correlation between population_density and total_deaths_per_million? This has to be checked numerically, of course, since the latter arrow is far from the edge of the circle. Note that the intuitive mechanism points the other way (densely populated => easier contamination => more cases => more very fragile people affected => more deaths), so an anti-correlation would be surprising rather than expected.
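The share of inertia carried by a factorial plane is the sum of the variance percentages of its two axes. The sketch below illustrates the computation with base R's `prcomp()` rather than FactoMineR, on eight rows copied from the table above and restricted to four variables (the column names are assumed to follow the OWID naming). The percentages it prints therefore differ from the 17% quoted above for the full analysis.

```r
# Toy subset: eight countries and four variables copied from the table above
toy <- data.frame(
  population_density       = c(136.520, 450.419, 402.606, 527.967,
                               14.462, 8.823, 17.144, 19.751),
  total_deaths_per_million = c(553.529, 339.465, 874.061, 108.558,
                               256.518, 2134.289, 2451.648, 1802.036),
  extreme_poverty          = c(0.2, 21.2, 0.5, 0.2, 0.2, 0.1, 1.7, 0.1),
  diabetes_prevalence      = c(6.41, 10.39, 6.74, 6.80, 5.31, 6.18, 8.27, 6.93),
  row.names = c("DNK", "IND", "ISR", "KOR", "NOR", "RUS", "PRY", "URY")
)

# Standardised PCA (variables centred and scaled to unit variance)
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Percentage of variance (inertia) carried by each axis
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Inertia of the first plane (axes 1-2) and of the second plane (axes 3-4)
plane_12 <- sum(pct[1:2])
plane_34 <- sum(pct[3:4])
print(round(c(plane_12 = plane_12, plane_34 = plane_34), 1))

# With FactoMineR, the per-axis percentages should be available in
# res.pca$eig (column "percentage of variance").
```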
We also note the anti-correlation between diabetes_prevalence and extreme_poverty, already somewhat visible in the first plane. Numerical check:
indivs_indices <- rownames(newData) %in% c("MLT", "EGY", "MNE", "MWI")
newData[indivs_indices,c("location", "total_deaths_per_million", "population_density",
"diabetes_prevalence", "extreme_poverty")]
| | location | total_deaths_per_million | population_density | diabetes_prevalence | extreme_poverty |
|---|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| MWI | Malawi | 115.411 | 197.519 | 3.94 | 71.4 |
| MLT | Malta | 924.445 | 1454.037 | 8.83 | 0.2 |
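The Egypt/Malawi opposition was to be checked on the diabetes/poverty axis, and the Malta/Montenegro opposition on the deaths-per-million/density axis. Only Malawi and Malta appear in the output above (EGY and MNE are absent), so the table only shows Malawi's side of the first opposition (low diabetes prevalence, very high extreme poverty) and Malta's side of the second (very high density, about 924 deaths per million); the Egypt and Montenegro values remain to be checked.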
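Comparing a handful of individuals only goes so far; the numerical check of the density/deaths link can also be done with a correlation test over all countries. A minimal sketch, assuming the two columns are named as in the table header above, and using ten hand-picked rows copied from the larger table purely as an illustration (the sign and strength obtained on this subset say nothing about the full data):

```r
# Toy subset: ten countries copied from the table above
toy <- data.frame(
  population_density       = c(136.520, 450.419, 402.606, 527.967, 14.462,
                               8.823, 17.144, 19.751, 341.955, 35.608),
  total_deaths_per_million = c(553.529, 339.465, 874.061, 108.558, 256.518,
                               2134.289, 2451.648, 1802.036, 686.098, 2421.163),
  row.names = c("DNK", "IND", "ISR", "KOR", "NOR",
                "RUS", "PRY", "URY", "LKA", "USA")
)

# Pearson correlation test: estimate, confidence interval and p-value
res <- cor.test(toy$population_density, toy$total_deaths_per_million)
print(res$estimate)
print(res$conf.int)
print(res$p.value)

# On the full data, the same test would read:
# cor.test(newData$population_density, newData$total_deaths_per_million)
```

A confidence interval that straddles zero would mean the apparent relation in the 3-4 plane is not supported by the data.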