Exercise 1
The idea was to have you clean up a data set yourself before running a PCA. A PCA cannot be run directly on the raw file: too many rows are incomplete, several columns are a priori of little relevance, and the data are time series (which we will reduce to a single value per country).
# Import the raw data set
data_orig <- read.csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")
# Preview the data set; calling head() and summary() is a good habit :)
data <- data_orig
dim(data)
head(data)
- 344917
- 67
iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ⋯ | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | AFG | Asia | Afghanistan | 2020-01-03 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
2 | AFG | Asia | Afghanistan | 2020-01-04 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
3 | AFG | Asia | Afghanistan | 2020-01-05 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
4 | AFG | Asia | Afghanistan | 2020-01-06 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
5 | AFG | Asia | Afghanistan | 2020-01-07 | NA | 0 | NA | NA | 0 | NA | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
6 | AFG | Asia | Afghanistan | 2020-01-08 | NA | 0 | 0 | NA | 0 | 0 | ⋯ | NA | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NA | NA | NA | NA |
# For now we restrict ourselves to global statistics taken on a fixed day:
dmax <- '2021-12-31' #max(data$date)
data <- data[data$date == dmax,] #row filter, hence *before* the comma
dim(data)
summary(data)
- 254
- 67
iso_code continent location date Length:254 Length:254 Length:254 Length:254 Class :character Class :character Class :character Class :character Mode :character Mode :character Mode :character Mode :character total_cases new_cases new_cases_smoothed total_deaths Min. : 1 Min. : 0 Min. : 0.0 Min. : 1 1st Qu.: 18326 1st Qu.: 0 1st Qu.: 37.0 1st Qu.: 281 Median : 148839 Median : 295 Median : 296.8 Median : 2468 Mean : 5085552 Mean : 22919 Mean : 19680.6 Mean : 101373 3rd Qu.: 806066 3rd Qu.: 1768 3rd Qu.: 1848.0 3rd Qu.: 16624 Max. :285446097 Max. :1349829 Max. :1122946.0 Max. :5474098 NA's :19 NA's :8 NA's :8 NA's :29 new_deaths new_deaths_smoothed total_cases_per_million Min. : 0.00 Min. : 0.000 Min. : 8.99 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 8437.41 Median : 0.00 Median : 1.214 Median : 57981.15 Mean : 101.60 Mean : 112.105 Mean : 69265.17 3rd Qu.: 9.75 3rd Qu.: 13.607 3rd Qu.:108410.14 Max. :5965.00 Max. :6481.714 Max. :289593.33 NA's :8 NA's :8 NA's :19 new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million Min. : 0.00 Min. : 0.000 Min. : 1.086 1st Qu.: 0.00 1st Qu.: 7.484 1st Qu.: 139.656 Median : 48.22 Median : 83.628 Median : 720.977 Mean : 543.64 Mean : 370.260 Mean :1020.674 3rd Qu.: 358.59 3rd Qu.: 426.848 3rd Qu.:1658.001 Max. :6861.15 Max. :3543.851 Max. :5949.676 NA's :8 NA's :8 NA's :29 new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate Min. : 0.0000 Min. : 0.0000 Min. :0.020 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.990 Median : 0.0000 Median : 0.1995 Median :1.250 Mean : 1.2384 Mean : 1.3266 Mean :1.307 3rd Qu.: 0.9377 3rd Qu.: 1.4287 3rd Qu.:1.570 Max. :29.6820 Max. :16.5200 Max. :4.220 NA's :8 NA's :8 NA's :69 icu_patients icu_patients_per_million hosp_patients Min. : 19.0 Min. : 0.579 Min. : 67 1st Qu.: 75.5 1st Qu.:10.996 1st Qu.: 575 Median : 317.0 Median :17.341 Median : 1599 Mean : 1129.5 Mean :25.830 Mean : 6510 3rd Qu.: 774.2 3rd Qu.:37.632 3rd Qu.: 3284 Max. :18382.0 Max. :84.912 Max. 
:93776 NA's :220 NA's :220 NA's :220 hosp_patients_per_million weekly_icu_admissions Min. : 50.39 Min. : 5.0 1st Qu.: 96.10 1st Qu.: 162.8 Median :157.06 Median : 523.0 Mean :177.82 Mean : 730.1 3rd Qu.:216.94 3rd Qu.:1130.8 Max. :500.19 Max. :1980.0 NA's :220 NA's :246 weekly_icu_admissions_per_million weekly_hosp_admissions Min. : 0.529 Min. : 230 1st Qu.: 9.080 1st Qu.: 894 Median :15.690 Median : 2434 Mean :15.395 Mean : 8879 3rd Qu.:22.003 3rd Qu.: 8993 Max. :29.198 Max. :92322 NA's :246 NA's :235 weekly_hosp_admissions_per_million total_tests new_tests Min. : 30.10 Min. : 55939 Min. : 162 1st Qu.: 74.18 1st Qu.: 2056989 1st Qu.: 5700 Median :121.31 Median : 7072918 Median : 18268 Mean :133.89 Mean : 38588425 Mean : 141706 3rd Qu.:203.27 3rd Qu.: 28473905 3rd Qu.: 71131 Max. :272.91 Max. :726359152 Max. :2017702 NA's :235 NA's :156 NA's :161 total_tests_per_thousand new_tests_per_thousand new_tests_smoothed Min. : 11.45 Min. : 0.097 Min. : 13 1st Qu.: 273.97 1st Qu.: 0.810 1st Qu.: 3442 Median : 954.34 Median : 2.616 Median : 11916 Mean : 1876.38 Mean : 7.742 Mean : 201231 3rd Qu.: 2092.83 3rd Qu.: 5.987 3rd Qu.: 49380 Max. :21654.04 Max. :189.146 Max. :14769984 NA's :156 NA's :161 NA's :118 new_tests_smoothed_per_thousand positive_rate tests_per_case Min. : 0.008 Min. :0.0000 Min. : 1.5 1st Qu.: 0.477 1st Qu.:0.0443 1st Qu.: 4.2 Median : 1.749 Median :0.0970 Median : 10.3 Mean : 5.240 Mean :0.1394 Mean : 696.6 3rd Qu.: 4.349 3rd Qu.:0.2386 3rd Qu.: 22.6 Max. :110.367 Max. :0.6703 Max. :65979.5 NA's :118 NA's :127 NA's :127 tests_units total_vaccinations people_vaccinated Length:254 Min. :1.050e+05 Min. :5.568e+04 Class :character 1st Qu.:4.256e+06 1st Qu.:2.258e+06 Mode :character Median :1.549e+07 Median :8.206e+06 Mean :3.647e+08 Mean :1.795e+08 3rd Qu.:9.113e+07 3rd Qu.:4.509e+07 Max. :9.178e+09 Max. :4.558e+09 NA's :155 NA's :162 people_fully_vaccinated total_boosters new_vaccinations Min. :4.931e+04 Min. : 5245 Min. 
: 45 1st Qu.:1.832e+06 1st Qu.: 834874 1st Qu.: 4035 Median :6.659e+06 Median : 2729817 Median : 28059 Mean :1.506e+08 Mean : 28731545 Mean : 1611225 3rd Qu.:3.982e+07 3rd Qu.: 9342736 3rd Qu.: 186534 Max. :3.879e+09 Max. :536089356 Max. :35457928 NA's :161 NA's :179 NA's :169 new_vaccinations_smoothed total_vaccinations_per_hundred Min. : 0 Min. : 6.79 1st Qu.: 987 1st Qu.: 96.15 Median : 10899 Median :148.62 Mean : 629396 Mean :133.91 3rd Qu.: 69828 3rd Qu.:180.89 Max. :35763420 Max. :275.36 NA's :23 NA's :155 people_vaccinated_per_hundred people_fully_vaccinated_per_hundred Min. : 4.72 Min. : 2.05 1st Qu.:50.73 1st Qu.:43.51 Median :69.69 Median :63.24 Mean :62.04 Mean :55.63 3rd Qu.:77.56 3rd Qu.:71.47 Max. :93.19 Max. :89.24 NA's :162 NA's :161 total_boosters_per_hundred new_vaccinations_smoothed_per_million Min. : 0.020 Min. : 0 1st Qu.: 6.915 1st Qu.: 750 Median :19.900 Median : 2253 Mean :22.071 Mean : 2995 3rd Qu.:33.240 3rd Qu.: 4310 Max. :57.400 Max. :27586 NA's :179 NA's :23 new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred Min. : 0 Min. :0.00000 1st Qu.: 199 1st Qu.:0.01500 Median : 2002 Median :0.03300 Mean : 121963 Mean :0.06284 3rd Qu.: 16102 3rd Qu.:0.06900 Max. :6989791 Max. :0.50500 NA's :23 NA's :23 stringency_index population_density median_age aged_65_older Min. : 6.48 Min. : 0.137 Min. :15.10 Min. : 1.144 1st Qu.:35.19 1st Qu.: 38.612 1st Qu.:22.27 1st Qu.: 3.526 Median :43.95 Median : 90.672 Median :29.80 Median : 6.378 Mean :44.51 Mean : 444.976 Mean :30.53 Mean : 8.703 3rd Qu.:53.80 3rd Qu.: 225.097 3rd Qu.:38.80 3rd Qu.:13.928 Max. :85.08 Max. :20546.766 Max. :48.20 Max. :27.049 NA's :73 NA's :39 NA's :54 NA's :61 aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate Min. : 0.526 Min. : 661.2 Min. : 0.100 Min. 
: 79.37 1st Qu.: 2.099 1st Qu.: 4126.5 1st Qu.: 0.625 1st Qu.:174.59 Median : 3.893 Median : 12595.3 Median : 2.500 Median :245.06 Mean : 5.499 Mean : 19173.6 Mean :13.848 Mean :264.27 3rd Qu.: 8.638 3rd Qu.: 27341.8 3rd Qu.:21.350 3rd Qu.:331.93 Max. :18.493 Max. :116935.6 Max. :77.600 Max. :724.42 NA's :56 NA's :58 NA's :128 NA's :58 diabetes_prevalence female_smokers male_smokers handwashing_facilities Min. : 0.990 Min. : 0.10 Min. : 7.70 Min. : 1.188 1st Qu.: 5.378 1st Qu.: 1.90 1st Qu.:22.60 1st Qu.: 20.482 Median : 7.205 Median : 6.30 Median :33.10 Median : 49.691 Mean : 8.561 Mean :10.79 Mean :32.91 Mean : 50.789 3rd Qu.:10.770 3rd Qu.:19.20 3rd Qu.:41.30 3rd Qu.: 82.687 Max. :30.530 Max. :44.00 Max. :78.10 Max. :100.000 NA's :48 NA's :107 NA's :109 NA's :158 hospital_beds_per_thousand life_expectancy human_development_index Min. : 0.100 Min. :53.28 Min. :0.3940 1st Qu.: 1.300 1st Qu.:69.59 1st Qu.:0.6030 Median : 2.500 Median :75.05 Median :0.7400 Mean : 3.097 Mean :73.74 Mean :0.7225 3rd Qu.: 4.200 3rd Qu.:79.46 3rd Qu.:0.8287 Max. :13.800 Max. :86.75 Max. :0.9570 NA's :81 NA's :21 NA's :64 population excess_mortality_cumulative_absolute Min. :4.700e+01 Min. : -13147.4 1st Qu.:4.677e+05 1st Qu.: 205.4 Median :5.763e+06 Median : 4737.8 Mean :1.275e+08 Mean : 52703.1 3rd Qu.:2.827e+07 3rd Qu.: 22096.6 Max. :7.975e+09 Max. :1076833.1 NA's :189 excess_mortality_cumulative excess_mortality Min. :-6.47 Min. :-24.71 1st Qu.: 6.30 1st Qu.: 3.85 Median :19.24 Median : 13.91 Mean :17.43 Mean : 18.83 3rd Qu.:25.67 3rd Qu.: 32.22 Max. :50.05 Max. : 76.05 NA's :189 NA's :189 excess_mortality_cumulative_per_million Min. :-1034.9 1st Qu.: 782.7 Median : 1773.4 Mean : 2213.6 3rd Qu.: 3305.7 Max. : 7421.2 NA's :189
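Filtering on one exact day keeps whatever each country happened to report on that day, NA included. As a possible alternative for reducing a time series to a single value, the sketch below keeps the last known value on or before dmax for each country. The data frame `toy` and its values are invented for illustration; only the column names mirror the real file.

```r
# Toy data (hypothetical values): two countries, daily rows, gaps coded as NA.
toy <- data.frame(
  iso_code = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  date = c("2021-12-29", "2021-12-30", "2021-12-31",
           "2021-12-29", "2021-12-30", "2021-12-31"),
  total_cases = c(10, 12, NA, 5, NA, 7),
  stringsAsFactors = FALSE
)
dmax <- "2021-12-31"

# ISO dates (YYYY-MM-DD) sort correctly as plain strings.
toy <- toy[toy$date <= dmax & !is.na(toy$total_cases), ]
toy <- toy[order(toy$iso_code, toy$date), ]

# Keep the last remaining row of each country.
last_known <- toy[!duplicated(toy$iso_code, fromLast = TRUE), ]
last_known
```

With several indicators, this would have to be repeated per column, since each indicator may have its own last reporting date.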
# The data set contains a few "summary rows" (continents, income groups...).
# On closer inspection, their ISO code starts with "OWID_":
owid_lines <- startsWith(data$iso_code, "OWID_")
data[owid_lines,]
iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ⋯ | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
2093 | OWID_AFR | | Africa | 2021-12-31 | 9850360 | 58952 | 44044.714 | 228878 | 295 | 211.571 | ⋯ | NA | NA | NA | NA | NA | 1426736614 | NA | NA | NA | NA |
17104 | OWID_ASI | | Asia | 2021-12-31 | 84647174 | 114105 | 85279.143 | 1256618 | 1055 | 1107.429 | ⋯ | NA | NA | NA | NA | NA | 4721383370 | NA | NA | NA | NA |
87984 | OWID_ENG | Europe | England | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 56550000 | NA | NA | NA | NA |
96168 | OWID_EUR | | Europe | 2021-12-31 | 86422548 | 565941 | 569201.000 | 1566172 | 2308 | 3111.857 | ⋯ | NA | NA | NA | NA | NA | 744807803 | NA | NA | NA | NA |
97538 | OWID_EUN | | European Union | 2021-12-31 | 53608595 | 241084 | 351451.000 | 915208 | 1139 | 1654.286 | ⋯ | NA | NA | NA | NA | NA | 450146793 | NA | NA | NA | NA |
131649 | OWID_HIC | | High income | 2021-12-31 | 134132754 | 1084099 | 912211.429 | 2042081 | 3297 | 3470.714 | ⋯ | NA | NA | NA | NA | NA | 1250514600 | NA | NA | NA | NA |
158897 | OWID_KOS | Europe | Kosovo | 2021-12-31 | 161399 | 53 | 20.571 | 2980 | 0 | 0.143 | ⋯ | NA | NA | NA | NA | NA | 1782115 | 6396.4 | 32.62 | 13.45 | 3848.565 |
173907 | OWID_LIC | | Low income | 2021-12-31 | 1805520 | 23632 | 16829.571 | 42106 | 38 | 47.714 | ⋯ | NA | NA | NA | NA | NA | 737604900 | NA | NA | NA | NA |
175271 | OWID_LMC | | Lower middle income | 2021-12-31 | 65603392 | 70197 | 55666.714 | 1187053 | 831 | 1072.143 | ⋯ | NA | NA | NA | NA | NA | 3432097300 | NA | NA | NA | NA |
221102 | OWID_NAM | | North America | 2021-12-31 | 64102325 | 542800 | 357089.571 | 1224887 | 2012 | 1750.286 | ⋯ | NA | NA | NA | NA | NA | 600323657 | NA | NA | NA | NA |
224823 | OWID_CYN | Asia | Northern Cyprus | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 382836 | NA | NA | NA | NA |
225824 | OWID_NIR | Europe | Northern Ireland | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 1896000 | NA | NA | NA | NA |
229914 | OWID_OCE | | Oceania | 2021-12-31 | 548908 | 670 | 14575.286 | 4989 | 14 | 13.714 | ⋯ | NA | NA | NA | NA | NA | 45038860 | NA | NA | NA | NA |
270779 | OWID_SCT | Europe | Scotland | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 5466000 | NA | NA | NA | NA |
287145 | OWID_SAM | | South America | 2021-12-31 | 39873056 | 67361 | 52751.429 | 1192550 | 281 | 286.857 | ⋯ | NA | NA | NA | NA | NA | 436816679 | NA | NA | NA | NA |
328061 | OWID_UMC | | Upper middle income | 2021-12-31 | 83617794 | 169473 | 136426.714 | 2200159 | 1793 | 1887.857 | ⋯ | NA | NA | NA | NA | NA | 2525921300 | NA | NA | NA | NA |
337532 | OWID_WLS | Europe | Wales | 2021-12-31 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | 3170000 | NA | NA | NA | NA |
340184 | OWID_WRL | | World | 2021-12-31 | 285446097 | 1349829 | 1122946.000 | 5474098 | 5965 | 6481.714 | ⋯ | 34.635 | 60.13 | 2.705 | 72.58 | 0.737 | 7975105024 | NA | NA | NA | NA |
# All these special rows should a priori be removed, either because they are aggregates that can be recomputed later,
# or because they concern territories with very little data, with the apparent exception of Kosovo. Let's keep only that row.
kosovo_index <- data$iso_code == "OWID_KOS"
kosovo_line <- data[kosovo_index,]
kosovo_line[1] <- "KOS"
data <- rbind(data[!owid_lines,], kosovo_line)
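The same "drop everything except one exception" filter can also be written as a single logical mask. This is only a sketch on a made-up vector `codes`. Unlike the rbind() version, it leaves the exception at its original position instead of moving it to the end.

```r
# Toy ISO codes (hypothetical subset of the real column).
codes <- c("FRA", "OWID_EUR", "OWID_KOS", "SGP", "OWID_WRL")

# Drop every "OWID_" row except the one listed as an exception.
keep <- !startsWith(codes, "OWID_") | codes == "OWID_KOS"

# Strip the prefix so that all remaining codes have three letters.
cleaned <- sub("^OWID_", "", codes[keep])
cleaned
```

On the real data frame the mask would be used as a row index, for instance `data[keep, ]`, before renaming the code.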
# There are several ways to remove rows with missing values.
# Here are 3 of them, from the most complex to the simplest:
data1 <- data[apply(data, 1, function(row) all(!is.na(row))),] #method 1
data2 <- data[complete.cases(data),] #method 2
data3 <- na.omit(data) #method 3
# No rows are left at all ==> the set of columns must be restricted
# (or the missing values must be "guessed" one way or another: see the missMDA package)
# (here we stick to the simple version: no missing data in the input).
nrow(data3)
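To check that the three methods select the same rows, here is a small sketch on an invented data frame `toy` (not part of the exercise data). The only difference it shows is that na.omit() also records which rows it dropped.

```r
# Toy data frame with one missing value in each of two rows.
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d"),
  stringsAsFactors = FALSE
)

m1 <- toy[apply(toy, 1, function(row) all(!is.na(row))), ]  # method 1
m2 <- toy[complete.cases(toy), ]                            # method 2
m3 <- na.omit(toy)                                          # method 3

# The same rows survive in all three cases.
rownames(m1)
identical(rownames(m1), rownames(m2))
identical(rownames(m2), rownames(m3))

# na.omit() additionally stores the indices of the dropped rows.
attr(m3, "na.action")
```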
# What do the variables represent?
colnames(data)
- 'iso_code'
- 'continent'
- 'location'
- 'date'
- 'total_cases'
- 'new_cases'
- 'new_cases_smoothed'
- 'total_deaths'
- 'new_deaths'
- 'new_deaths_smoothed'
- 'total_cases_per_million'
- 'new_cases_per_million'
- 'new_cases_smoothed_per_million'
- 'total_deaths_per_million'
- 'new_deaths_per_million'
- 'new_deaths_smoothed_per_million'
- 'reproduction_rate'
- 'icu_patients'
- 'icu_patients_per_million'
- 'hosp_patients'
- 'hosp_patients_per_million'
- 'weekly_icu_admissions'
- 'weekly_icu_admissions_per_million'
- 'weekly_hosp_admissions'
- 'weekly_hosp_admissions_per_million'
- 'total_tests'
- 'new_tests'
- 'total_tests_per_thousand'
- 'new_tests_per_thousand'
- 'new_tests_smoothed'
- 'new_tests_smoothed_per_thousand'
- 'positive_rate'
- 'tests_per_case'
- 'tests_units'
- 'total_vaccinations'
- 'people_vaccinated'
- 'people_fully_vaccinated'
- 'total_boosters'
- 'new_vaccinations'
- 'new_vaccinations_smoothed'
- 'total_vaccinations_per_hundred'
- 'people_vaccinated_per_hundred'
- 'people_fully_vaccinated_per_hundred'
- 'total_boosters_per_hundred'
- 'new_vaccinations_smoothed_per_million'
- 'new_people_vaccinated_smoothed'
- 'new_people_vaccinated_smoothed_per_hundred'
- 'stringency_index'
- 'population_density'
- 'median_age'
- 'aged_65_older'
- 'aged_70_older'
- 'gdp_per_capita'
- 'extreme_poverty'
- 'cardiovasc_death_rate'
- 'diabetes_prevalence'
- 'female_smokers'
- 'male_smokers'
- 'handwashing_facilities'
- 'hospital_beds_per_thousand'
- 'life_expectancy'
- 'human_development_index'
- 'population'
- 'excess_mortality_cumulative_absolute'
- 'excess_mortality_cumulative'
- 'excess_mortality'
- 'excess_mortality_cumulative_per_million'
# "new_*" variables: daily snapshot of some indicator. We will not look at them here (see below).
# Same for the "weekly_*" variables (presumably weekly indicators). That leaves:
selection <- colnames(data)[!startsWith(colnames(data), "new_") & !startsWith(colnames(data), "weekly_")]
# Columns with more than 50% of values filled in
selection <- selection[ apply(data[,selection], 2, function(col) sum(!is.na(col)) > nrow(data)/2) ]
selection
- 'iso_code'
- 'continent'
- 'location'
- 'date'
- 'total_cases'
- 'total_deaths'
- 'total_cases_per_million'
- 'total_deaths_per_million'
- 'reproduction_rate'
- 'positive_rate'
- 'tests_per_case'
- 'tests_units'
- 'stringency_index'
- 'population_density'
- 'median_age'
- 'aged_65_older'
- 'aged_70_older'
- 'gdp_per_capita'
- 'extreme_poverty'
- 'cardiovasc_death_rate'
- 'diabetes_prevalence'
- 'female_smokers'
- 'male_smokers'
- 'hospital_beds_per_thousand'
- 'life_expectancy'
- 'human_development_index'
- 'population'
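The "more than 50% filled" filter can also be expressed with colMeans(), which gives the fill rate of every column directly. The sketch below uses an invented data frame `toy`. The 0.5 threshold simply mirrors the choice made above.

```r
# Toy data frame: column b is mostly missing.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c("x", NA, "z", "t"),
  stringsAsFactors = FALSE
)

# Share of non-missing values per column.
fill_rate <- colMeans(!is.na(toy))
fill_rate

# Keep the columns that are more than 50% filled.
kept <- names(fill_rate)[fill_rate > 0.5]
kept
```

Note that an empty string is not NA, so a text column made mostly of "" still counts as filled. This is presumably why tests_units passes the filter.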
Things are finally clearer:
- iso_code: country identifier, 3 letters
- location: country name
- continent, date: well, continent and date =)
- total_cases: total number of cases recorded up to dmax
- total_deaths: total number of deaths recorded up to dmax
- total_cases_per_million: total cases relative to population (per million inhabitants)
- total_deaths_per_million: total deaths relative to population (per million inhabitants)
- tests_units: "Units used by the location to report its testing data" https://github.com/owid/covid-19-data/blob/master/public/data/README.md
  In practice this column is a text label that is often empty (see its values below) ==> unusable for the PCA.
- population: number of inhabitants
- population_density: population density (per square kilometre)
- median_age: median age, 50% of people are younger and 50% are older
- aged_65_older: percentage of people over 65
- aged_70_older: same with 70
- gdp_per_capita: GDP per capita
- extreme_poverty: percentage of the population below the extreme poverty line
- cardiovasc_death_rate: "Death rate from cardiovascular disease in 2017 (annual number of deaths per 100,000 people)"
- diabetes_prevalence: "Diabetes prevalence (% of population aged 20 to 79) in 2017"
- female_smokers: percentage of women who smoke
- male_smokers: percentage of men who smoke
- hospital_beds_per_thousand: number of hospital beds per 1000 inhabitants
- life_expectancy: life expectancy
- human_development_index: human development index
data$tests_units #???
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'units unclear'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- ''
- 'samples tested'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- ''
- ''
- 'people tested'
- 'tests performed'
- 'people tested'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- ''
- ''
- 'samples tested'
- 'people tested'
- ''
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- ''
- 'tests performed'
- 'people tested'
- ''
- 'tests performed'
- 'tests performed'
- 'people tested'
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'samples tested'
- ''
- ''
- ''
- 'tests performed'
- 'people tested'
- ''
- ''
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'samples tested'
- 'people tested'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'people tested'
- 'tests performed'
- 'samples tested'
- 'people tested'
- ''
- 'tests performed'
- ''
- 'tests performed'
- ''
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- ''
- ''
- ''
- 'samples tested'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'people tested'
- 'samples tested'
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- ''
- 'people tested'
- ''
- 'tests performed'
- ''
- 'samples tested'
- ''
- ''
- 'people tested'
- 'tests performed'
- 'samples tested'
- 'tests performed'
- ''
- 'samples tested'
- 'tests performed'
- ''
- 'tests performed'
- ''
- ''
- 'tests performed'
- ''
- 'samples tested'
- 'tests performed'
- ''
- 'people tested'
- ''
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'people tested'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'samples tested'
- ''
- ''
- 'people tested'
- ''
- ''
- ''
- 'tests performed'
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'people tested'
- ''
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- ''
- ''
- 'people tested'
- 'people tested'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- 'people tested'
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- ''
- ''
- 'people tested'
- 'people tested'
- 'tests performed'
- ''
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'tests performed'
- 'people tested'
- ''
- ''
- ''
- ''
- 'people tested'
- ''
- ''
- 'tests performed'
- 'tests performed'
- 'tests performed'
selection <- selection[selection != "tests_units"]
newData <- na.omit(data[,selection])
nrow(newData)
92 rows is reasonable (roughly 40% of the 237 rows left after removing the aggregate rows). To be consistent, however, we also need to pick one type of variable: absolute or relative? I prefer relative indicators (*_per_million, *_density):
selection <- selection[!selection %in% c("total_cases", "total_deaths", "population")]
newData <- na.omit(data[,selection])
nrow(newData) #still 92
rownames(newData) <- newData$iso_code #used to label the individuals on the plots
newData <- newData[,-c(1,4)] #drop the "ISO code" and "date" columns, no longer needed
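To see why absolute and relative indicators should not be mixed, here is a minimal sketch with two invented countries. The names and figures in `toy` are hypothetical. The "per million" rescaling is the usual total / population * 1e6.

```r
# Two made-up countries of very different sizes.
toy <- data.frame(
  location = c("Small", "Large"),
  total_cases = c(5000, 200000),
  population = c(100000, 50000000)
)

# Relative indicator: cases per million inhabitants.
toy$total_cases_per_million <- toy$total_cases / toy$population * 1e6
toy
```

The large country has 40 times more cases in absolute terms but far fewer per inhabitant, so the two kinds of variables would rank the countries in opposite orders.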
Note: the whole analysis so far could have been done just as easily in another language, Python for example.
From here on, however, the R package FactoMineR is very convenient (apparently with no direct Python equivalent?!)
# ...We are finally ready for the PCA!
library(FactoMineR)
res.pca <- PCA(newData, quali.sup=1:2, ncp=6, graph=FALSE)
options(repr.plot.width=15, repr.plot.height=10)
plotInd <- plot(res.pca, choix="ind", invisible="quali")
plotVar <- plot(res.pca, choix="var")
library(gridExtra)
grid.arrange(plotInd, plotVar, ncol=2)
Warning message: “ggrepel: 5 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Well, no individual seems to stand out. In fact, some rows have almost all of their values filled in, and once completed by hand (with the help of Google) they yield extreme individuals (Monaco, Singapore). EDIT: I have not completed the file here; you can do it.
extremes <- which(data$location %in% c("Monaco", "Singapore"))
data[extremes,selection]
iso_code | continent | location | date | total_cases_per_million | total_deaths_per_million | reproduction_rate | positive_rate | tests_per_case | stringency_index | ⋯ | aged_70_older | gdp_per_capita | extreme_poverty | cardiovasc_death_rate | diabetes_prevalence | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
197910 | MCO | Europe | Monaco | 2021-12-31 | 138280.67 | 1041.353 | 0.93 | NA | NA | 34.13 | ⋯ | NA | NA | NA | NA | 5.46 | NA | NA | 13.8 | 86.75 | NA |
277597 | SGP | Asia | Singapore | 2021-12-31 | 49566.07 | 146.886 | 1.14 | NA | NA | 42.77 | ⋯ | 7.049 | 85535.38 | NA | 92.243 | 10.99 | 5.2 | 28.3 | 2.4 | 83.62 | 0.938 |
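One way to spot the rows worth completing by hand is to count the missing values per row. The sketch below does this on an invented data frame `toy`. The threshold of one missing value is an arbitrary assumption to adapt to the real data.

```r
# Toy data frame: A is complete, B misses one value, C misses everything.
toy <- data.frame(
  location = c("A", "B", "C"),
  v1 = c(1, NA, NA),
  v2 = c(2, 5, NA),
  v3 = c(3, 6, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each row.
n_missing <- rowSums(is.na(toy))

# Rows that are incomplete but miss at most one value (hypothetical threshold).
almost <- toy[n_missing > 0 & n_missing <= 1, ]
almost$location
```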
On the variable side, aged_65_older and aged_70_older appear highly correlated (practically superimposed, in fact). This makes sense, so we will keep only aged_70_older after a numerical check:
cor(newData[,c("aged_65_older", "aged_70_older")])
aged_65_older | aged_70_older | |
---|---|---|
aged_65_older | 1.0000000 | 0.9939191 |
aged_70_older | 0.9939191 | 1.0000000 |
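The same check can be run over all pairs of numeric variables at once. The sketch below works on simulated data: `toy`, the seed and the 0.95 cut-off are illustrative assumptions, not values from the exercise.

```r
set.seed(1)
n <- 50
base <- rnorm(n)
toy <- data.frame(
  u = base,
  v = base + rnorm(n, sd = 0.05),  # near-duplicate of u
  w = rnorm(n)                     # unrelated variable
)

cmat <- cor(toy)
# Look at each pair once: blank out the diagonal and the lower triangle.
cmat[lower.tri(cmat, diag = TRUE)] <- NA

# Pairs whose absolute correlation exceeds the (hypothetical) cut-off.
high <- which(abs(cmat) > 0.95, arr.ind = TRUE)
data.frame(var1 = rownames(cmat)[high[, "row"]],
           var2 = colnames(cmat)[high[, "col"]],
           r = cmat[high])
```

On newData, the two qualitative columns would have to be excluded first, since cor() only accepts numeric input.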
newData <- subset(newData, select=-aged_65_older)
res.pca <- PCA(newData, quali.sup=1:2, ncp=6, graph=FALSE)
plotInd <- plot(res.pca, choix="ind", invisible="quali", habillage=1)
plotVar <- plot(res.pca, choix="var")
library(gridExtra)
grid.arrange(plotInd, plotVar, ncol=2)
Warning message: “ggrepel: 2 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
About 65% of the inertia is explained by this first PCA plane (51 + 14). The correlation circle logically opposes "wealth" on the right (HDI, GDP per capita) to "poverty" on the left. On the cloud of individuals this roughly corresponds to the opposition Western Europe / Africa (with a few exceptions: Tunisia, Seychelles, ...).
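The percentages of inertia come from the eigenvalues of the PCA. With FactoMineR they can be read from `res.pca$eig`. The sketch below shows the underlying computation with base R's prcomp() on the built-in USArrests data set, which only stands in for newData here.

```r
# Standardized PCA on a data set that ships with R.
pca <- prcomp(USArrests, scale. = TRUE)

eig <- pca$sdev^2            # eigenvalues of the correlation matrix
pct <- 100 * eig / sum(eig)  # percentage of inertia carried by each axis
round(pct, 1)

# Share of inertia carried by the first plane (axes 1 and 2).
sum(pct[1:2])
```

For a standardized PCA the eigenvalues sum to the number of variables, so each percentage is simply an eigenvalue divided by that count.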
Interestingly, the wealth indicators are clearly positively correlated with the number of deaths per million, which is itself negatively correlated with extreme_poverty: would COVID rather hit the rich? But why, since the virus is everywhere? Part of the answer lies in this same PCA plane: aged_70_older => people live longer there, and to a lesser extent diabetes_prevalence => more cases of diabetes (to be checked numerically).
cor(newData[,c("aged_70_older", "total_deaths_per_million", "human_development_index",
"diabetes_prevalence", "extreme_poverty")])
aged_70_older | total_deaths_per_million | human_development_index | diabetes_prevalence | extreme_poverty | |
---|---|---|---|---|---|
aged_70_older | 1.00000000 | 0.6116174 | 0.8203092 | -0.02528401 | -0.5541654 |
total_deaths_per_million | 0.61161735 | 1.0000000 | 0.5002665 | 0.14617300 | -0.4433506 |
human_development_index | 0.82030920 | 0.5002665 | 1.0000000 | 0.17607414 | -0.7396914 |
diabetes_prevalence | -0.02528401 | 0.1461730 | 0.1760741 | 1.00000000 | -0.3462567 |
extreme_poverty | -0.55416536 | -0.4433506 | -0.7396914 | -0.34625674 | 1.0000000 |
newData
continent | location | total_cases_per_million | total_deaths_per_million | reproduction_rate | positive_rate | tests_per_case | stringency_index | population_density | median_age | aged_70_older | gdp_per_capita | extreme_poverty | cardiovasc_death_rate | diabetes_prevalence | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
ALB | Europe | Albania | 73495.999 | 1130.064 | 1.48 | 0.1110 | 9.0 | 46.30 | 104.871 | 38.0 | 8.643 | 11803.431 | 1.1 | 304.195 | 10.08 | 7.1 | 51.2 | 2.890 | 78.57 | 0.795 |
ARG | South America | Argentina | 127015.620 | 2596.686 | 2.11 | 0.3190 | 3.1 | 36.52 | 16.177 | 31.9 | 7.441 | 18933.907 | 0.6 | 191.032 | 5.50 | 16.2 | 27.7 | 5.000 | 76.67 | 0.845 |
AUS | Oceania | Australia | 14004.709 | 93.325 | 2.20 | 0.0745 | 13.4 | 47.16 | 3.202 | 37.9 | 10.129 | 44648.710 | 0.5 | 107.791 | 5.07 | 13.0 | 16.5 | 3.840 | 83.44 | 0.944 |
AUT | Europe | Austria | 141452.257 | 1867.753 | 1.18 | 0.0042 | 239.8 | 44.16 | 106.749 | 44.4 | 13.748 | 45436.686 | 0.7 | 145.183 | 6.35 | 28.4 | 30.9 | 7.370 | 81.54 | 0.922 |
BGD | Asia | Bangladesh | 9262.063 | 163.985 | 1.43 | 0.0219 | 45.7 | 29.63 | 1265.036 | 27.5 | 3.262 | 3523.984 | 14.8 | 298.003 | 8.38 | 1.0 | 44.7 | 0.800 | 72.59 | 0.632 |
BEL | Europe | Belgium | 179883.824 | 2429.409 | 1.25 | 0.1750 | 5.7 | 33.89 | 375.564 | 41.8 | 12.849 | 42658.576 | 0.2 | 114.898 | 4.29 | 25.1 | 31.4 | 5.640 | 81.63 | 0.931 |
BIH | Europe | Bosnia and Herzegovina | 89830.928 | 4152.737 | 1.36 | 0.1867 | 5.4 | 35.19 | 68.496 | 42.5 | 10.711 | 11713.895 | 0.2 | 329.635 | 10.08 | 30.2 | 47.7 | 3.500 | 77.40 | 0.780 |
BGR | Europe | Bulgaria | 108210.980 | 4501.357 | 1.42 | 0.0838 | 11.9 | 45.82 | 65.180 | 44.7 | 13.272 | 18563.307 | 1.5 | 424.688 | 5.81 | 30.1 | 44.4 | 7.454 | 75.05 | 0.816 |
CAN | North America | Canada | 54674.470 | 779.054 | 1.49 | 0.2292 | 4.4 | 68.16 | 4.037 | 41.4 | 10.797 | 44017.591 | 0.5 | 105.599 | 7.37 | 12.0 | 16.6 | 2.500 | 82.43 | 0.929 |
CHL | South America | Chile | 92058.065 | 1994.314 | 1.19 | 0.0258 | 38.8 | 31.42 | 24.282 | 35.4 | 6.938 | 22767.037 | 1.3 | 127.993 | 8.46 | 34.2 | 41.5 | 2.110 | 80.18 | 0.851 |
CHN | Asia | China | 92.624 | 3.997 | 1.31 | 0.0000 | 65979.5 | 79.17 | 147.674 | 38.7 | 5.929 | 15308.712 | 0.7 | 261.899 | 9.74 | 1.9 | 48.4 | 4.340 | 76.91 | 0.761 |
COL | South America | Colombia | 99059.263 | 2503.488 | 2.01 | 0.1350 | 7.4 | 49.40 | 44.223 | 32.2 | 4.312 | 13254.949 | 4.5 | 124.240 | 7.44 | 4.7 | 13.5 | 1.710 | 77.29 | 0.767 |
CRI | North America | Costa Rica | 110206.152 | 1419.462 | 1.05 | 0.0910 | 11.0 | 45.31 | 96.079 | 33.6 | 5.694 | 15524.995 | 1.3 | 137.973 | 8.78 | 6.4 | 17.4 | 1.130 | 80.28 | 0.810 |
HRV | Europe | Croatia | 176082.986 | 3099.722 | 1.31 | 0.3788 | 2.6 | 38.05 | 73.726 | 44.0 | 13.053 | 22669.797 | 0.7 | 253.782 | 5.59 | 34.3 | 39.9 | 5.540 | 78.49 | 0.851 |
DNK | Europe | Denmark | 133231.468 | 553.529 | 1.34 | 0.1133 | 8.8 | 31.52 | 136.520 | 42.3 | 12.325 | 46682.515 | 0.2 | 114.767 | 6.41 | 19.3 | 18.8 | 2.500 | 80.90 | 0.940 |
DOM | North America | Dominican Republic | 37160.446 | 378.134 | 1.62 | 0.0994 | 10.1 | 47.08 | 222.873 | 27.6 | 4.419 | 14600.861 | 1.6 | 266.653 | 8.20 | 8.5 | 19.1 | 1.600 | 74.08 | 0.756 |
ECU | South America | Ecuador | 30320.534 | 1870.396 | 1.26 | 0.2760 | 3.6 | 53.30 | 66.939 | 28.1 | 4.458 | 10581.936 | 3.6 | 140.448 | 5.55 | 2.0 | 12.3 | 1.500 | 77.01 | 0.759 |
SLV | North America | El Salvador | 19212.981 | 603.340 | 0.65 | 0.0130 | 76.9 | 31.48 | 307.811 | 27.6 | 5.417 | 7292.458 | 2.2 | 167.295 | 8.87 | 2.5 | 18.8 | 1.300 | 73.32 | 0.673 |
EST | Europe | Estonia | 171039.256 | 1378.516 | 1.27 | 0.1241 | 8.1 | 37.29 | 31.033 | 42.7 | 13.491 | 29481.252 | 0.5 | 255.569 | 4.02 | 24.5 | 39.3 | 4.690 | 78.74 | 0.892 |
ETH | Africa | Ethiopia | 3367.185 | 56.136 | 1.42 | 0.4357 | 2.3 | 40.74 | 104.957 | 19.8 | 2.063 | 1729.927 | 26.7 | 182.634 | 7.47 | 0.4 | 8.5 | 0.300 | 66.60 | 0.485 |
GMB | Africa | Gambia | 3758.322 | 126.756 | 0.53 | 0.0534 | 18.7 | 13.89 | 207.566 | 17.5 | 1.417 | 1561.767 | 10.1 | 331.430 | 1.91 | 0.7 | 31.2 | 1.100 | 62.05 | 0.496 |
GEO | Asia | Georgia | 249638.058 | 3685.518 | 1.02 | 0.0601 | 16.6 | 50.00 | 65.032 | 38.7 | 10.244 | 9745.079 | 4.2 | 496.218 | 7.11 | 5.3 | 55.5 | 2.600 | 73.77 | 0.812 |
GHA | Africa | Ghana | 4364.905 | 39.013 | 1.17 | 0.2351 | 4.3 | 51.27 | 126.719 | 21.1 | 1.948 | 4227.630 | 12.0 | 298.245 | 4.97 | 0.3 | 7.7 | 0.900 | 64.07 | 0.611 |
GRC | Europe | Greece | 104473.849 | 1974.488 | 1.75 | 0.0581 | 17.2 | 74.28 | 83.479 | 45.3 | 14.524 | 24574.382 | 1.5 | 175.695 | 4.55 | 35.3 | 52.0 | 4.210 | 82.24 | 0.888 |
HUN | Europe | Hungary | 126053.645 | 3931.454 | 1.00 | 0.1975 | 5.1 | 27.96 | 108.043 | 43.4 | 11.976 | 26777.561 | 0.5 | 278.296 | 7.55 | 26.8 | 34.8 | 7.020 | 76.88 | 0.854 |
IND | Asia | India | 24583.308 | 339.465 | 2.10 | 0.0101 | 98.7 | 68.64 | 450.419 | 28.2 | 3.414 | 6426.674 | 21.2 | 282.280 | 10.39 | 1.9 | 20.6 | 0.530 | 69.66 | 0.645 |
IDN | Asia | Indonesia | 15472.592 | 523.025 | 1.16 | 0.0012 | 830.5 | 66.69 | 145.725 | 29.3 | 3.053 | 11188.744 | 5.7 | 342.864 | 6.32 | 2.8 | 76.1 | 1.040 | 71.72 | 0.718 |
IRN | Asia | Iran | 69934.029 | 1485.840 | 0.86 | 0.0181 | 55.2 | 54.63 | 49.831 | 32.4 | 3.182 | 19082.620 | 0.2 | 270.308 | 9.59 | 0.8 | 21.1 | 1.500 | 76.68 | 0.783 |
IRL | Europe | Ireland | 149793.912 | 1211.202 | 1.53 | 0.4300 | 2.3 | 42.84 | 69.874 | 38.7 | 8.678 | 67335.293 | 0.2 | 126.459 | 3.28 | 23.0 | 25.7 | 2.960 | 82.30 | 0.955 |
ISR | Asia | Israel | 146252.196 | 874.061 | 2.10 | 0.0280 | 35.7 | 48.91 | 402.606 | 30.6 | 7.359 | 33132.320 | 0.5 | 93.320 | 6.74 | 15.4 | 35.4 | 2.990 | 82.97 | 0.919 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
MAR | Africa | Morocco | 25656.966 | 396.284 | 2.25 | 0.0777 | 12.9 | 65.27 | 80.080 | 29.6 | 4.209 | 7485.013 | 1.0 | 419.146 | 7.14 | 0.8 | 47.1 | 1.100 | 76.68 | 0.686 |
MOZ | Africa | Mozambique | 5587.555 | 60.541 | 1.48 | 0.6374 | 1.6 | 37.04 | 37.728 | 17.7 | 1.870 | 1136.103 | 62.9 | 329.942 | 3.30 | 5.1 | 29.1 | 0.700 | 60.85 | 0.456 |
MMR | Asia | Myanmar | 9797.725 | 355.634 | 0.87 | 0.0116 | 86.3 | 80.56 | 81.721 | 29.1 | 3.120 | 5591.597 | 6.4 | 202.104 | 4.61 | 6.3 | 35.2 | 0.900 | 67.13 | 0.583 |
NPL | Asia | Nepal | 27119.361 | 379.539 | 1.22 | 0.0250 | 40.0 | 33.33 | 204.430 | 25.0 | 3.212 | 2442.804 | 15.0 | 260.797 | 7.26 | 9.5 | 37.8 | 0.300 | 70.78 | 0.602 |
NOR | Europe | Norway | 72669.388 | 256.518 | 1.10 | 0.2700 | 3.7 | 51.85 | 14.462 | 39.7 | 10.813 | 64800.057 | 0.2 | 114.316 | 5.31 | 19.6 | 20.7 | 3.600 | 82.40 | 0.957 |
PAK | Asia | Pakistan | 5490.774 | 122.638 | 1.44 | 0.0088 | 113.5 | 47.52 | 255.573 | 23.5 | 2.780 | 5034.708 | 4.0 | 423.031 | 8.35 | 2.8 | 36.7 | 0.600 | 67.27 | 0.557 |
PAN | North America | Panama | 111383.433 | 1684.215 | 1.71 | 0.1076 | 9.3 | 42.34 | 55.133 | 29.7 | 5.030 | 22267.037 | 2.2 | 128.346 | 8.33 | 2.4 | 9.9 | 2.300 | 78.51 | 0.815 |
PRY | South America | Paraguay | 68738.907 | 2451.648 | 1.19 | 0.0445 | 22.5 | 40.74 | 17.144 | 26.5 | 3.833 | 8827.010 | 1.7 | 199.128 | 8.27 | 5.0 | 21.6 | 1.300 | 74.25 | 0.728 |
PRT | Europe | Portugal | 124888.605 | 1838.990 | 1.55 | 0.0731 | 13.7 | 48.15 | 112.371 | 46.2 | 14.924 | 27936.896 | 0.5 | 127.842 | 9.85 | 16.3 | 30.0 | 3.390 | 82.05 | 0.864 |
ROU | Europe | Romania | 91927.269 | 2986.581 | 1.62 | 0.0275 | 36.4 | 56.48 | 85.129 | 43.0 | 11.690 | 23313.199 | 5.7 | 370.946 | 9.74 | 22.9 | 37.1 | 6.892 | 76.05 | 0.828 |
RUS | Europe | Russia | 72557.126 | 2134.289 | 0.82 | 0.0517 | 19.3 | 54.17 | 8.823 | 39.6 | 9.393 | 24765.954 | 0.1 | 431.297 | 6.18 | 23.4 | 58.3 | 8.050 | 72.58 | 0.824 |
SVK | Europe | Slovakia | 149152.071 | 2947.662 | 0.83 | 0.1070 | 9.3 | 56.34 | 113.128 | 41.2 | 9.167 | 30155.152 | 0.7 | 287.959 | 7.29 | 23.1 | 37.7 | 5.820 | 77.54 | 0.860 |
ZAF | Africa | South Africa | 57543.972 | 1520.372 | 0.78 | 0.2607 | 3.8 | 44.44 | 46.754 | 27.3 | 3.053 | 12294.876 | 18.9 | 200.380 | 5.52 | 8.1 | 33.2 | 2.320 | 64.13 | 0.709 |
KOR | Asia | South Korea | 12259.772 | 108.558 | 0.79 | 0.0236 | 42.3 | 46.39 | 527.967 | 43.4 | 8.622 | 35938.374 | 0.2 | 85.998 | 6.80 | 6.2 | 40.9 | 12.270 | 83.03 | 0.916 |
ESP | Europe | Spain | 128265.632 | 1919.210 | 1.74 | 0.3080 | 3.2 | 43.44 | 93.105 | 45.5 | 13.799 | 34272.360 | 1.0 | 99.403 | 7.17 | 27.4 | 31.4 | 2.970 | 83.56 | 0.904 |
LKA | Asia | Sri Lanka | 26898.175 | 686.098 | 0.94 | 0.0702 | 14.2 | 53.80 | 341.955 | 34.1 | 5.331 | 11669.077 | 0.7 | 197.093 | 10.68 | 0.3 | 27.0 | 3.600 | 76.98 | 0.782 |
SWE | Europe | Sweden | 125711.264 | 1454.592 | 1.57 | 0.1220 | 8.2 | 41.43 | 24.718 | 41.0 | 13.433 | 46949.283 | 0.5 | 133.982 | 4.79 | 18.8 | 18.9 | 2.220 | 82.80 | 0.945 |
THA | Asia | Thailand | 31011.538 | 302.635 | 1.09 | 0.0744 | 13.4 | 43.06 | 135.132 | 40.1 | 6.890 | 16277.671 | 0.1 | 109.861 | 7.04 | 1.9 | 38.8 | 2.100 | 77.15 | 0.777 |
TLS | Asia | Timor | 14789.405 | 90.957 | 0.70 | 0.0023 | 429.1 | 40.58 | 87.176 | 18.0 | 1.897 | 6570.102 | 30.3 | 335.346 | 6.86 | 6.3 | 78.1 | 5.900 | 69.50 | 0.606 |
TGO | Africa | Togo | 3408.749 | 28.027 | 1.64 | 0.2567 | 3.9 | 21.76 | 143.366 | 19.4 | 1.525 | 1429.813 | 49.2 | 280.033 | 6.15 | 0.9 | 14.2 | 0.700 | 61.04 | 0.515 |
TUN | Africa | Tunisia | 58743.540 | 2068.935 | 1.65 | 0.0627 | 15.9 | 35.20 | 74.228 | 32.7 | 5.075 | 10849.297 | 2.0 | 318.991 | 8.52 | 1.1 | 65.8 | 2.300 | 76.70 | 0.740 |
TUR | Asia | Turkey | 110635.410 | 962.067 | 1.39 | 0.0847 | 11.8 | 35.55 | 104.914 | 31.6 | 5.061 | 25129.341 | 0.2 | 171.285 | 12.13 | 14.1 | 41.1 | 2.810 | 77.69 | 0.820 |
UGA | Africa | Uganda | 3019.518 | 69.778 | 1.69 | 0.1726 | 5.8 | 73.15 | 213.759 | 16.4 | 1.308 | 1697.707 | 41.6 | 213.333 | 2.50 | 3.4 | 16.7 | 0.500 | 63.37 | 0.544 |
UKR | Europe | Ukraine | 90829.360 | 2330.704 | 0.97 | 0.1988 | 5.0 | 39.49 | 77.390 | 41.4 | 11.133 | 7894.393 | 0.1 | 539.849 | 7.11 | 13.5 | 47.4 | 8.800 | 72.06 | 0.779 |
GBR | Europe | United Kingdom | 199109.996 | 2619.105 | 1.24 | 0.1001 | 10.0 | 44.06 | 272.898 | 40.8 | 12.527 | 39753.244 | 0.2 | 122.137 | 4.28 | 20.0 | 24.7 | 2.540 | 81.32 | 0.932 |
USA | North America | United States | 158249.753 | 2421.163 | 1.64 | 0.2600 | 3.8 | 47.65 | 35.608 | 38.3 | 9.732 | 54225.446 | 1.2 | 151.089 | 10.79 | 19.1 | 24.6 | 2.770 | 78.86 | 0.926 |
URY | South America | Uruguay | 119875.973 | 1802.036 | 1.57 | 0.0773 | 12.9 | 28.70 | 19.751 | 35.6 | 10.361 | 20551.409 | 0.1 | 160.708 | 6.93 | 14.0 | 19.9 | 2.800 | 77.91 | 0.817 |
VNM | Asia | Vietnam | 17632.268 | 329.922 | 1.03 | 0.0926 | 10.8 | 69.44 | 308.127 | 32.6 | 4.718 | 6171.884 | 2.0 | 245.465 | 6.00 | 1.0 | 45.9 | 2.600 | 75.40 | 0.704 |
ZMB | Africa | Zambia | 12448.652 | 186.335 | 1.55 | 0.3133 | 3.2 | 37.96 | 22.995 | 17.7 | 1.542 | 3689.251 | 57.5 | 234.499 | 3.94 | 3.1 | 24.7 | 2.000 | 63.89 | 0.584 |
ZWE | Africa | Zimbabwe | 12973.101 | 306.179 | 0.81 | 0.2858 | 3.5 | 47.22 | 42.729 | 19.6 | 1.882 | 1899.775 | 21.4 | 307.846 | 1.82 | 1.6 | 30.7 | 1.700 | 61.49 | 0.571 |
The correlation of aged_70_older with HDI and total_deaths, and its anti-correlation with extreme_poverty, are confirmed. Likewise, diabetes_prevalence shows a slight positive correlation with HDI and a slight negative one with extreme_poverty.
Next, the rate of female smokers appears strongly correlated with median age: women seem more likely to smoke in countries where people live longer (hence generally richer ones). This is not the case for male_smokers: the male smoking rate does not tell us much. Similarly, and more surprisingly, the cardiovascular death rate (heart attacks, I imagine) does not seem correlated with anything, except, rather logically, the proportion of smokers: "According to the American Heart Association, cardiovascular disease accounts for about 800,000 U.S. deaths every year, making it the leading cause of all deaths in the United States. Of those, nearly 20 percent are due to cigarette smoking." [https://www.fda.gov/tobacco-products/health-effects-tobacco-use/how-smoking-affects-heart-health#]
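The correlations read off the correlation circle can be cross-checked numerically with `cor()`. The sketch below is only illustrative: `sample_df` is a hypothetical name, it holds eight rows copied from the table above rather than the full data set, and the column roles (median_age, female_smokers, male_smokers, cardiovasc_death_rate) are assumed from the column order of the OWID file.

```r
# Illustrative subset: 8 countries copied from the table above.
# Column roles are assumed from the OWID column order.
sample_df <- data.frame(
  median_age            = c(44.0, 42.3, 19.8, 21.1, 28.2, 45.5, 16.4, 41.4),
  female_smokers        = c(34.3, 19.3, 0.4, 0.3, 1.9, 27.4, 3.4, 13.5),
  male_smokers          = c(39.9, 18.8, 8.5, 7.7, 20.6, 31.4, 16.7, 47.4),
  cardiovasc_death_rate = c(253.782, 114.767, 182.634, 298.245,
                            282.280, 99.403, 213.333, 539.849),
  row.names = c("HRV", "DNK", "ETH", "GHA", "IND", "ESP", "UGA", "UKR")
)

# Pearson correlation matrix between the four variables
cor_mat <- cor(sample_df, method = "pearson")
print(round(cor_mat, 2))
```

On the full table, the same call would presumably be applied to the numeric columns of `newData`, with `use = "pairwise.complete.obs"` if any missing values remain. Eight countries are far too few to conclude anything; the point is only the mechanics of the check.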
Coloring by continent shows a top/bottom opposition between Eastern Europe on one side and Western Europe plus the USA, Canada, Israel, South Korea and Australia on the other. There seem to be relatively more smokers in Georgia, Ukraine and Russia. Countries of Central and South America sit lower, hence a priori less affected by heart-attack deaths and with fewer smokers. There are too few countries from Oceania to say much, and Asia is spread all over the plane, showing great heterogeneity compared to the other continents.
Let's check our analysis by looking more closely at a few individuals:
# Select a few countries by ISO code (row names) and show the variables discussed above
indivs_indices <- rownames(newData) %in% c("LUX", "UKR", "NER", "ECU")
newData[indivs_indices,c("location", "total_deaths_per_million", "aged_70_older", "male_smokers",
"cardiovasc_death_rate", "human_development_index")]
location | total_deaths_per_million | aged_70_older | male_smokers | cardiovasc_death_rate | human_development_index | |
---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
ECU | Ecuador | 1870.396 | 4.458 | 12.3 | 140.448 | 0.759 |
LUX | Luxembourg | 980.542 | 9.842 | 26.0 | 128.275 | 0.916 |
UKR | Ukraine | 2330.704 | 11.133 | 47.4 | 539.849 | 0.779 |
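Luxembourg: older population, high HDI, about half as many male smokers as Ukraine but about twice as many as Ecuador.
Niger: young population, low HDI, few smokers, very few COVID deaths. (Niger, code NER, does not appear in the output above, so these values cannot be read from this table.)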
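Selecting rows with `%in%` silently ignores codes that are absent from the row names: NER is requested above, but no Niger row is displayed. A minimal sketch of a more explicit lookup follows; `toy` is a hypothetical stand-in for `newData`, built from the three rows shown in the output.

```r
# Toy stand-in for newData: the three rows displayed in the output above
toy <- data.frame(
  location     = c("Ecuador", "Luxembourg", "Ukraine"),
  male_smokers = c(12.3, 26.0, 47.4),
  row.names    = c("ECU", "LUX", "UKR")
)

codes <- c("LUX", "UKR", "NER", "ECU")

# Report the requested codes that are not in the table
missing_codes <- setdiff(codes, rownames(toy))
if (length(missing_codes) > 0) {
  message("Not in the table: ", paste(missing_codes, collapse = ", "))
}

# Keep only the codes that exist, in the requested order
selected <- toy[intersect(codes, rownames(toy)), ]
print(selected)
```

Indexing by row name also returns the rows in the order of `codes`, which makes side-by-side comparisons easier to read.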
In short, let's move on to the second PCA plane:
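# gridExtra provides grid.arrange() to display several plots side by side
library(gridExtra)
# Individuals and variables on the plane spanned by axes 3 and 4
plotInd <- plot(res.pca, choix="ind", invisible="quali", habillage=1, axes=3:4)
plotVar <- plot(res.pca, choix="var", axes=3:4)
grid.arrange(plotInd, plotVar, ncol=2)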
Warning message: “ggrepel: 12 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
Little inertia is explained in this plane (barely more than 17%), but there is one interesting observation: a possible anti-correlation between population_density and total_deaths_per_million. This should of course be checked numerically, since the latter arrow is far from the edge of the circle. Such an anti-correlation would be counter-intuitive, however: one would rather expect densely populated => easier transmission => more cases => more very frail people affected => more deaths, that is, a positive correlation.
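A numerical check can combine two things: the raw correlation between the two variables, and the quality of representation (cos2) of each variable on axes 3 and 4, which tells how far the arrow is from the edge of the circle. The sketch below uses base R `prcomp` on a hypothetical subset `sample_df` (eight rows copied from the table above, column roles assumed from the OWID column order), not the `res.pca` object of this notebook.

```r
# Illustrative subset: 8 countries copied from the table above
sample_df <- data.frame(
  total_deaths_per_million = c(3099.722, 553.529, 56.136, 339.465,
                               108.558, 1919.210, 69.778, 2330.704),
  population_density       = c(73.726, 136.520, 104.957, 450.419,
                               527.967, 93.105, 213.759, 77.390),
  diabetes_prevalence      = c(5.59, 6.41, 7.47, 10.39, 6.80, 7.17, 2.50, 7.11),
  extreme_poverty          = c(0.7, 0.2, 26.7, 21.2, 0.2, 1.0, 41.6, 0.1),
  row.names = c("HRV", "DNK", "ETH", "IND", "KOR", "ESP", "UGA", "UKR")
)

# Direct check: rank correlation, less sensitive to very dense outliers
rho <- cor(sample_df$population_density,
           sample_df$total_deaths_per_million, method = "spearman")

# Standardized PCA; variable-component correlations are loadings * sdev
pca <- prcomp(sample_df, center = TRUE, scale. = TRUE)
var_cor <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: share of each variable's variance carried by each component
cos2 <- var_cor^2
cos2_plane <- rowSums(cos2[, 3:4])  # quality of representation on axes 3-4
print(rho)
print(round(cos2_plane, 3))
```

With the FactoMineR result, the corresponding quantities should be available as `res.pca$var$cor` and `res.pca$var$cos2`. A low cos2 on the plane means the angle between two arrows says little about their actual correlation, so the direct `cor()` on the full table remains the reference.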
We also note the anti-correlation between diabetes_prevalence and extreme_poverty, already somewhat visible in the first plane. Numerical check:
# Select countries expected to be opposed on the density/deaths and diabetes/poverty directions
indivs_indices <- rownames(newData) %in% c("MLT", "EGY", "MNE", "MWI")
newData[indivs_indices,c("location", "total_deaths_per_million", "population_density",
"diabetes_prevalence", "extreme_poverty")]
location | total_deaths_per_million | population_density | diabetes_prevalence | extreme_poverty | |
---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | |
MWI | Malawi | 115.411 | 197.519 | 3.94 | 71.4 |
MLT | Malta | 924.445 | 1454.037 | 8.83 | 0.2 |
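The Egypt/Malawi opposition is confirmed on the diabetes/poverty axis, as is the Malta/Montenegro opposition on the deaths-per-million/density axis. Note that only Malawi and Malta appear in the output above: Egypt (EGY) and Montenegro (MNE) are not displayed, so their values cannot be read from this table.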