Research Paper
Corresponding author: Vito E. Cambria (vitoemanuele.cambria@phd.unipd.it)
Academic editor: Florian Jansen
© 2020 Fabio Attorre, Vito E. Cambria, Emiliano Agrillo, Nicola Alessi, Marco Alfò, Michele De Sanctis, Luca Malatesta, Tommaso Sitzia, Riccardo Guarino, Corrado Marcenò, Marco Massimi, Francesco Spada, Giuliano Fanelli.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Attorre F, Cambria VE, Agrillo E, Alessi N, Alfò M, De Sanctis M, Malatesta L, Sitzia T, Guarino R, Marcenò C, Massimi M, Spada F, Fanelli G (2020) Finite Mixture Model-based classification of a complex vegetation system. Vegetation Classification and Survey 1: 77-86. https://doi.org/10.3897/VCS/2020/48518
Aim: To propose a Finite Mixture Model (FMM) as an additional approach for classifying large datasets of georeferenced vegetation plots from complex vegetation systems.
Study area: The Italian peninsula, including the two main islands (Sicily and Sardinia) but excluding the Alps and the Po plain.
Methods: We used a database of 5,593 georeferenced plots and 1,586 vascular species of forest vegetation, created in TURBOVEG by storing published and unpublished phytosociological plots collected over the last 30 years. The plots were classified according to species composition and environmental variables using a FMM. Classification results were compared with those obtained by the TWINSPAN algorithm. Groups were characterized in terms of ecological parameters and of dominant and diagnostic species, the latter identified using the fidelity coefficient. Interpretation of the resulting forest vegetation types was supported by a predictive map, produced using discriminant functions on environmental predictors, and by a non-metric multidimensional scaling ordination.
Results: FMM clustering yielded 24 groups. When these were compared with those from TWINSPAN, similarities were found only at a higher classification level corresponding to the main orders of the Italian broadleaf forest vegetation: Fagetalia sylvaticae, Carpinetalia betuli, Quercetalia pubescenti-petraeae and Quercetalia ilicis. At a lower syntaxonomic level, these 24 groups were referred to alliances and sub-alliances.
Conclusions: Despite its greater computational complexity, FMM appears to be an effective alternative to traditional classification methods because it incorporates modelling in the classificatory process. This allows the classification to draw on both the co-occurrence of species and environmental factors, so that groups are identified not only by their species composition, as in the case of TWINSPAN, but also by their specific environmental niche.
Taxonomic reference:
Abbreviations: CLM = Community-level models; FMM = Finite Mixture Model; NMDS = non-metric multidimensional scaling.
Keywords: cluster analysis, finite mixture model, forest vegetation, Italian peninsula, vegetation plots
The analysis of the spatial distribution of assemblages of communities is receiving increasing attention from ecologists (
Approaches to CLM clustering can be based either on minimizing a given loss function (for instance, the sum of within-group deviance) or on associating each group with a specific, parametrically specified joint density. In the latter case, CLM-based clustering arises. While in standard (either hard or fuzzy) partitioning groups are summarized or represented by prototypes, in CLM clustering groups are represented by specific shapes of the corresponding probability density. Using such an approach, vegetation plots can be classified using the posterior probability that each belongs to a given component of the mixture, each component describing a group. Moreover, when the dataset is large, hierarchical approaches based on the calculation of pairwise (between-plot) distances rapidly become unfeasible. In this case, methods that partition around prototypes (means, medians or others), in either a hard or a fuzzy perspective, are usually adopted. However, many of these are based on simple Euclidean distances between each plot and the group prototypes, which do not consider the dependence, the association and the covariance between the variables (plant species abundance values) characterizing the plots. In this respect, finite mixtures of multivariate Gaussian densities provide a simple, model-based extension to the K-means method, allowing for overlapping clusters oriented according to the group-specific covariances and providing, a posteriori, for the classification of each plot to one of the groups. For this reason, among CLMs, Finite Mixture Modelling (FMM) is an emerging method and has already been used to identify marine bioregions on the Western Australian continental margin (
Within this framework, this paper aims to verify the applicability of FMM as a classification method for vegetation plots, using a complex case study and a large dataset, and to compare the classification results with (1) those obtained by the TWINSPAN algorithm and (2) current syntaxonomic classification schemes.
Observation data include 5,593 georeferenced vegetation plots of between 100 and 300 m² and 1,586 vascular species of forests in the Italian peninsula and major islands (
Environmental covariates to be used in the statistical model were derived from a database with a spatial resolution of 1×1 km (
We used a FMM to cluster vegetation plots, based on the assumption that the data originate from one of K potential groups, also referred to as components. Each component is completely characterized by a distribution with known parametric form and component-specific parameters. When a (multivariate) Gaussian density is used to describe the component-specific distribution of observed plant species cover, the component is identified by a specific center, defined by the mean vector (as the observed values are on an abundance scale, we may hypothesize that similar plots will be characterized by similar abundance values of the same species), and a specific shape, summarized by the covariance matrix, which allows for varying dependence between the cover values of different plant species for plots in that component. The groups (components) are homogeneous in the sense that they include plots with similar vegetation, as described by plant species cover. Therefore, the observed plots can be allocated to one of the groups using a criterion associated with the proximity between plots and group centers. This criterion is based on the posterior probability that a plot comes from a given group (component of the finite mixture). For a given plot, the posterior probabilities sum to 1 over the components, meaning that the plot has a varying degree of membership in all clusters in the population. A plot is usually allocated to the cluster for which its posterior probability is maximum. At the end of the grouping step, each group is characterized by a weight, defined as the mean of the posterior probabilities, which refers to the (relative) frequency of plots allocated to that group. These weights can be interpreted as the (prior) probabilities that a generic plot is randomly drawn from a “population of plots” belonging to that group (component of the finite mixture).
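The allocation rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: for brevity it assumes univariate Gaussian components (the paper uses multivariate densities), and the weights, means, variances and plot values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_probabilities(x, weights, means, variances):
    """Posterior probability that observation x comes from each component."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-component mixture on a single cover-like axis (assumption).
weights = [0.6, 0.4]
means = [10.0, 40.0]
variances = [25.0, 100.0]
plots = [8.0, 12.0, 25.0, 38.0, 55.0]

posteriors = [posterior_probabilities(x, weights, means, variances) for x in plots]

# Hard allocation: the component with the maximum posterior probability.
allocation = [max(range(len(p)), key=p.__getitem__) for p in posteriors]

# Component weights: mean posterior probability over all plots.
updated_weights = [
    sum(p[k] for p in posteriors) / len(plots) for k in range(len(weights))
]
```

Each row of `posteriors` sums to 1, so a plot keeps a graded membership in every component even after it has been assigned to the most probable one.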
We propose to model these (prior) probabilities as a function of so-called auxiliary variables (see e.g.
After estimating the parameter vectors for the component-specific densities describing observed abundance, and the prior probability models, we derived the updated posterior probabilities as the (normalized) product of the prior information (based on covariates) and the density for that specific component.
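In symbols, this update can be written as follows. The notation is ours, not the source's: $\mathbf{y}_i$ is the (projected) abundance vector of plot $i$, $\mathbf{x}_i$ its vector of environmental covariates, $f_k$ the Gaussian density of component $k$, and $\tau_{ik}$ the posterior probability. The multinomial logit form of the prior model is an assumption made for illustration; the text only states that the prior probabilities are modelled as a function of auxiliary variables.

```latex
\pi_k(\mathbf{x}_i) =
  \frac{\exp\!\left(\mathbf{x}_i^{\top}\boldsymbol{\beta}_k\right)}
       {\sum_{l=1}^{K}\exp\!\left(\mathbf{x}_i^{\top}\boldsymbol{\beta}_l\right)},
\qquad
\tau_{ik} =
  \frac{\pi_k(\mathbf{x}_i)\, f_k\!\left(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)}
       {\sum_{l=1}^{K}\pi_l(\mathbf{x}_i)\, f_l\!\left(\mathbf{y}_i \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l\right)},
\qquad
\sum_{k=1}^{K}\tau_{ik} = 1 .
```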
These two steps can be jointly performed within the same estimation algorithm (e.g. using Latent Gold software, see
In this paper, we adapted the FMM to account for a large data matrix, formed by 5,593 vegetation plots and 1,586 species whose percentage cover is recorded. In this case the direct application of a FMM would be difficult, since it would require the computation and inversion of a 1,586 × 1,586 covariance matrix with a very sparse structure. Looking at the distribution of the number of species observed in each plot, we see that the corresponding median value is equal to 81; if we look at the distribution of the number of plots each species is present in, the median value is equal to 7. As a result, of the 10,402,980 values in the abundance data matrix, 10,241,820 (i.e. 98.45%) are null. Thus, rather than applying a FMM to the observed matrix of percentage covers, we fitted this model to a derived matrix, defined by projecting the original data matrix onto the space spanned by the first 20 principal components of the original data matrix using an approximate method (see
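The dimension-reduction step can be sketched on a toy matrix. The sketch below first recomputes the share of null values from the two counts reported in the text, then projects a small, hypothetical cover matrix onto its leading principal components. Power iteration with deflation is used here only as a stand-in, since the approximate method actually used is not named in this section, and the toy keeps 2 components instead of 20.

```python
import math
import random

# Share of null values implied by the counts reported in the text.
total_cells = 10_402_980
null_cells = 10_241_820
null_share = round(100 * null_cells / total_cells, 2)

def center_columns(rows):
    """Subtract each column mean so components describe variation around the mean."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def mat_vec(matrix, vector):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def leading_components(cov, q, iters=500, seed=0):
    """Top-q eigenvectors of a symmetric matrix by power iteration with deflation."""
    rng = random.Random(seed)
    p = len(cov)
    work = [row[:] for row in cov]
    vectors = []
    for _ in range(q):
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for _ in range(iters):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        vectors.append(v)
        # Deflation: remove the component just found before searching for the next.
        for a in range(p):
            for b in range(p):
                work[a][b] -= eigenvalue * v[a] * v[b]
    return vectors

# Toy cover matrix: 6 plots x 4 species (hypothetical values, many zeros).
cover = [
    [60, 20, 0, 0],
    [50, 30, 0, 0],
    [70, 10, 0, 0],
    [0, 0, 40, 30],
    [0, 0, 55, 20],
    [0, 5, 35, 45],
]
centered = center_columns(cover)
n, p = len(centered), len(centered[0])
cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
       for a in range(p)]
components = leading_components(cov, q=2)

# Plot scores: coordinates of each plot in the reduced space.
scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
          for row in centered]
```

The mixture model is then fitted to `scores` rather than to the raw cover matrix, so the covariance matrices to be estimated have the dimension of the retained components instead of the number of species.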
The optimal number of forest groups (components) was obtained according to penalized likelihood criteria (AIC –
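As a reminder of how penalized likelihood criteria trade fit against complexity, the following sketch computes AIC and BIC for candidate numbers of components. The log-likelihood values, the dimension and the sample size are hypothetical, BIC is included as an assumption (only AIC is named in the surviving text), and the parameter count assumes unrestricted component covariance matrices, which may differ from the authors' specification.

```python
import math

def n_free_parameters(k, d):
    """Free parameters of a k-component Gaussian mixture with full covariances in d dimensions."""
    mixing_weights = k - 1
    mean_vectors = k * d
    covariance_matrices = k * d * (d + 1) // 2
    return mixing_weights + mean_vectors + covariance_matrices

def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical maximized log-likelihoods for K = 2, 3, 4 components.
log_liks = {2: -5200.0, 3: -4900.0, 4: -4885.0}
d, n_obs = 2, 1000

aic_values = {k: aic(ll, n_free_parameters(k, d)) for k, ll in log_liks.items()}
bic_values = {k: bic(ll, n_free_parameters(k, d), n_obs) for k, ll in log_liks.items()}

best_by_aic = min(aic_values, key=aic_values.get)
best_by_bic = min(bic_values, key=bic_values.get)
```

In this toy case the two criteria disagree, because BIC penalizes the extra parameters of the fourth component more heavily than AIC does.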
FMM classification was compared with that obtained by TWINSPAN (
The obtained groups were characterized according to environmental parameters and diagnostic species, which were determined using the fidelity coefficient (phi) of
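For readers unfamiliar with the phi coefficient, a minimal sketch of its basic form for presence/absence data follows. Whether the authors applied this form or a variant (for example, one standardized for group size) cannot be determined from this section, so the function should be read as an illustration only; the argument names are ours.

```python
import math

def phi_coefficient(n_plots, n_group, n_occurrences, n_occurrences_in_group):
    """Phi coefficient of association between a species and a target group.

    n_plots: total number of plots in the dataset
    n_group: number of plots in the target group
    n_occurrences: number of plots in which the species occurs
    n_occurrences_in_group: number of target-group plots in which it occurs
    """
    numerator = n_plots * n_occurrences_in_group - n_occurrences * n_group
    denominator = math.sqrt(
        n_occurrences * n_group
        * (n_plots - n_occurrences) * (n_plots - n_group)
    )
    return numerator / denominator

# Species present in every plot of the group and nowhere else: perfect fidelity.
perfect = phi_coefficient(100, 20, 20, 20)

# Species occurring in the group exactly as often as expected by chance.
neutral = phi_coefficient(100, 20, 20, 4)
```

Values range from -1 to 1; species with high positive values for a group are candidates for its diagnostic species.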
Interpretation of groups was supported by Kruskal’s non-metric multidimensional scaling (NMDS) ordination (function isoMDS in the MASS R package,
FMM R code and R libraries used for the statistical analyses are included in Suppl. material
FMM identified 24 groups, which were considered optimal according to all penalized likelihood criteria. However, four of these were discarded because they were characterized by few plots (less than 50) and were quite heterogeneous. Descriptions of their environmental parameters, spatial distribution and syntaxonomic correspondences are presented in Suppl. material
Cluster A includes groups 8, 2, and 23. These three groups can be found in temperate areas at an average altitude greater than 1000 m and are characterized by the dominance of Fagus sylvatica in groups 8 and 2, and by the codominance of this species with Abies alba in group 23 (Suppl. material
TWINSPAN classification identified three main clusters: temperate broadleaved deciduous forests generally dominated by Fagus sylvatica (Groups 1–13), evergreen Mediterranean forests dominated by Quercus suber and Quercus ilex (Groups 14–18), and sub-Mediterranean deciduous forests dominated by Quercus cerris (Groups 19–24). The first TWINSPAN cluster corresponds to four groups of the FMM classification (FMM groups 2, 8, 18 and 23, Table
Comparative matrix between the 24 groups obtained by Finite Mixture Model classification (rows) and the 24 groups by the modified version of TWINSPAN (columns). Colors of the margins (groups) indicate membership to the clusters. Within the matrix, the red color indicates no correspondence among the groups. An increasing correspondence is highlighted by a color gradient from yellow to dark green.
FMM \ TWINSPAN | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | Tot |
2 | 1 | 5 | 50 | 15 | 12 | 32 | 9 | 21 | 79 | 44 | 4 | 10 | 80 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 363 |
8 | 4 | 10 | 19 | 142 | 34 | 18 | 106 | 21 | 47 | 95 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 497 |
23 | 0 | 12 | 47 | 28 | 4 | 12 | 3 | 0 | 24 | 35 | 28 | 51 | 132 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 5 | 384 |
18 | 0 | 3 | 11 | 0 | 6 | 4 | 0 | 0 | 5 | 0 | 30 | 21 | 79 | 0 | 0 | 0 | 2 | 0 | 10 | 0 | 0 | 3 | 1 | 12 | 187 |
3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 79 | 42 | 13 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 8 | 0 | 50 | 198 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | 0 | 1 | 1 | 0 | 0 | 0 | 3 | 11 | 0 | 1 | 10 | 3 | 50 | 97 |
5 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 27 | 9 | 25 | 0 | 0 | 0 | 0 | 10 | 35 | 0 | 5 | 1 | 1 | 61 | 181 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | 7 | 6 | 1 | 160 | 197 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 1 | 15 | 7 | 13 | 1 | 8 | 4 | 9 | 98 | 163 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 3 | 0 | 0 | 0 | 0 | 18 | 59 | 3 | 0 | 0 | 9 | 0 | 205 | 304 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 39 | 0 | 0 | 0 | 14 | 3 | 323 | 397 |
16 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 84 | 23 | 81 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 3 | 2 | 0 | 38 | 238 |
17 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 1 | 2 | 7 | 0 | 3 | 11 | 43 | 43 | 0 | 13 | 7 | 3 | 55 | 195 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 3 | 0 | 0 | 0 | 21 | 31 | 5 | 0 | 22 | 14 | 1 | 11 | 113 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 87 | 165 | 4 | 1 | 0 | 26 | 0 | 46 | 333 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 237 | 38 | 2 | 0 | 1 | 16 | 66 | 13 | 381 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 55 | 390 | 0 | 1 | 0 | 0 | 0 | 2 | 449 |
21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29 | 0 | 28 | 36 | 246 | 0 | 0 | 0 | 1 | 1 | 16 | 357 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 15 | 76 | 19 | 33 | 0 | 1 | 0 | 0 | 2 | 2 | 166 |
22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 96 | 44 | 46 | 2 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 244 |
1 | 1 | 9 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 4 | 0 | 0 | 0 | 4 | 16 | 0 | 7 | 0 | 0 | 5 | 56 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 9 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 9 | 31 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 3 | 0 | 0 | 0 | 4 | 0 | 6 | 0 | 1 | 3 | 2 | 10 | 33 |
24 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 2 | 0 | 0 | 7 | 0 | 2 | 1 | 0 | 12 | 29 |
Tot | 8 | 39 | 133 | 185 | 58 | 68 | 118 | 42 | 157 | 174 | 308 | 179 | 429 | 157 | 59 | 164 | 526 | 1124 | 185 | 5 | 71 | 127 | 93 | 1184 | 5593 |
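A comparative matrix of this kind is a cross-tabulation of the two group labels assigned to each plot. The sketch below shows one way to build it; the label vectors are hypothetical and much shorter than the 5,593 plots of the study.

```python
from collections import Counter

def cross_tabulate(labels_rows, labels_cols):
    """Count plots for every (row label, column label) pair."""
    counts = Counter(zip(labels_rows, labels_cols))
    rows = sorted(set(labels_rows))
    cols = sorted(set(labels_cols))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical group labels for seven plots under the two classifications.
fmm_labels = [2, 2, 8, 8, 8, 13, 13]
twinspan_labels = [3, 9, 4, 4, 7, 18, 18]

rows, cols, table = cross_tabulate(fmm_labels, twinspan_labels)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
```

Row and column totals give the group sizes under each classification, as in the "Tot" margins of the table above.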
The confusion matrix built to compare classified versus predicted plots highlighted that, with only some exceptions, environmental factors alone are insufficient to clearly discriminate among the groups identified by the FMM classification (Suppl. material
The choice of an algorithm for the classification of vegetation plots depends on the objective of the classification and each algorithm has advantages and drawbacks (
Modified TWINSPAN classification with 24 groups. Light blue indicates groups belonging to Fagetalia sylvaticae (Groups 1–10), purple to Carpinetalia betuli (Groups 11–13), orange and red to Quercetalia ilicis (Groups 14–18), and green to Quercetalia pubescenti-petraeae (Groups 19–24).
Consequently, FMM appears to be an effective alternative to traditional classification methods, such as TWINSPAN, for supporting the analysis of complex vegetation systems, because it can integrate both species composition and environmental factors into the modelled classificatory process. Moreover, since FMM identifies groups according to their ecological space, a predictive distribution map can also be produced (Figure
When compared with current syntaxonomic knowledge, the groups obtained by the FMM classification largely corresponded to several alliances and suballiances recognized for Italy according to
Correspondence between the FMM group and the syntaxonomy in
FMM Group | Alliance in |
2 | New alliance? |
8 | FAG-02B Fagion sylvaticae Luquet 1926 |
23 | FAG-02C Geranio striati-Fagion |
18 | FAG-03 Carpinetalia betuli P. Fukarek 1968 |
3 | PUB-01F Fraxino orni-Ostryion Tomazic 1940 |
4 | FAG-03C Erythronio-Carpinion (Horvat 1958) Marincek in Wallnofer et al. 1993 |
5 | PUB-01L Crataego laevigatae-Quercion cerridis Arrigoni 1997 |
6 | PUB-01L Crataego laevigatae-Quercion cerridis Arrigoni 1997 |
7 | PUB-01L Crataego laevigatae-Quercion cerridis Arrigoni 1997 |
11 | PUB-01G Carpinion orientalis Horvat 1958 |
14 | PUB-01G Carpinion orientalis Horvat 1958 |
16 | FAG-03C Erythronio-Carpinion (Horvat 1958) Marincek in Wallnofer et al. 1993 |
17 | PUB-01L Crataego laevigatae-Quercion cerridis Arrigoni 1997 |
9 | PUB-01M Pino calabricae-Quercion congestae S. Brullo et al. 1999 |
10 | QUI-01D Fraxino orni-Quercion ilicis Biondi, Casavecchia et Gigante in Biondi et al. 2013 |
12 | PUB-01M Pino calabricae-Quercion congestae S. Brullo et al. 1999 |
13 | QUI-01A Quercion ilicis Br.-Bl. ex Molinier 1934 |
20 | QUI-01E Erico-Quercion ilicis S. Brullo et al. 1977 |
21 | QUI-01E Erico-Quercion ilicis S. Brullo et al. 1977 |
22 | QUI-01E Erico-Quercion ilicis S. Brullo et al. 1977 |
In our analysis, a more complex pattern emerged: the gradient of different bioclimates, from temperate to sub-Mediterranean, with decreasing water availability and increasing temperature, follows not only the phytogeographical sector but also an altitudinal gradient. For instance, temperate beech forests of the upper altitude are potentially distributed all along the peninsula including the Etna volcano in Sicily (Group 8), while lower altitude beech forests (Groups 2 and 23) are distributed respectively in the south and in the central north (Figure
Cluster B includes only group 18 and can be referred to the Carpinetalia betuli order (
Sub-Mediterranean deciduous oak forests of cluster C are characterized by a complex geographic pattern along the Apennines, which cannot be explained only by the combination of geo-climatic factors, as is highlighted by the very high omission errors of the confusion matrix (Suppl. material
The geographic pattern also characterizes the evergreen Mediterranean forests, which are difficult to classify due to the low number of characteristic species, especially in the herbaceous layer. FMM (and also TWINSPAN, see Table
The 20 groups can be aggregated into four clusters corresponding to the main syntaxonomic orders recognized for the Italian peninsula: Carpinetalia betuli, Fagetalia sylvaticae, Quercetalia ilicis and Quercetalia pubescenti-petraeae (Figure
Despite its greater computational complexity, the Finite Mixture Model seems to be a promising classificatory approach for analysing complex vegetation systems with a large dataset. Its strength lies in the possibility of modelling, within the classification process, both the co-occurrence of species and environmental variables, so that groups are identified not only by their species composition, as in the case of TWINSPAN, but also by their specific environmental niche. These features can effectively highlight geographical patterns, as depicted by predictive maps, and support the interpretation of classification results.
Primary data are stored in the European Vegetation Archive (
F.A., V.E.C., E.A. and G.F. conceived the study, M.A. and L.M. ran the statistical analyses, and N.A., M.D.S., T.S., R.G., C.M., M.M. and F.S. contributed to the interpretation of results.
We would like to thank Laura Clarke for revising the text and all those who collected vegetation-plot data in the field and integrated these data in the Sapienza database (https://www.givd.info/ID/EU-IT-011).
Fabio Attorre (fabio.attorre@uniroma1.it), ORCID: http://orcid.org/0000-0002-7744-2195
Vito E. Cambria (Corresponding author, vitoemanuele.cambria@phd.unipd.it), ORCID: http://orcid.org/0000-0003-0481-6368
Emiliano Agrillo (emiliano.agrillo@isprambiente.it), ORCID: http://orcid.org/0000-0003-2346-8346
Nicola Alessi (nicola.alessi@natec.unibz.it)
Marco Alfò (marco.alfo@uniroma1.it), ORCID: http://orcid.org/0000-0001-7651-6052
Michele De Sanctis (michele.desanctis@uniroma1.it), ORCID: http://orcid.org/0000-0002-7280-6199
Luca Malatesta (luca.malatesta@uniroma1.it), ORCID: http://orcid.org/0000-0003-1887-4163
Tommaso Sitzia (tommaso.sitzia@unipd.it), ORCID: http://orcid.org/0000-0001-6221-4256
Riccardo Guarino (riccardo.guarino@unipa.it), ORCID: http://orcid.org/0000-0003-0106-9416
Corrado Marcenò (marceno.corrado@ehu.eus), ORCID: http://orcid.org/0000-0003-4361-5200
Marco Massimi (marco.massimi@hotmail.com)
Francesco Spada (francesco.spada@uniroma1.it)
Giuliano Fanelli (giuliano.fanelli@gmail.com), ORCID: http://orcid.org/0000-0002-3143-1212
FMM R code and R libraries used for the statistical analyses (.R)
Ecological, physiognomic and distributional features, floristic composition and syntaxonomy of groups (.DOCX)
Ecological parameters, dominant and diagnostic species of the groups (.XLSX)
Maps of the distribution of the classified plots of each group (.JPG)
Confusion matrix generated for the accuracy assessment of the potential distribution map of groups (.DOCX)