VCS Methods
Corresponding author: Tiago Monteiro-Henriques (tmh@isa.ulisboa.pt). Academic editor: David W. Roberts.
© 2025 Tiago Monteiro-Henriques.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Monteiro-Henriques T (2025) TDV-optimization: A novel numerical method for phytosociological tabulation. Vegetation Classification and Survey 6: 99-127. https://doi.org/10.3897/VCS.140466
I present the Total Differential Value (TDV), an index designed for vegetation analysis based on the operational concept of differential species, as classically illustrated by Heinz Ellenberg and Dieter Mueller-Dombois. Given a phytosociological table and a grouping of its relevés, TDV is obtained by averaging the Differential Value (DiffVal) for each species in the table. DiffVal, grounded in combinatorial-discrete mathematics, quantifies the differential power of a species. The novelty of this approach lies in its distinction between two types of species absences: (i) absences from some relevés within a group and (ii) absences from all relevés representing a group. By leveraging the distribution of species absences among groups, this method effectively quantifies the differential power and distinguishes differential from non-differential species. I illustrate the computation of DiffVal and TDV and show that, because only differential species contribute to TDV, it reflects the strength of the differential species patterns in a classified table. TDV can be optimized (TDV-optimization), providing partitions of relevés. I demonstrate TDV-optimization using both an artificial and a well-known real-world data set. Key features of this method include its ability to identify patterns very closely resembling manual phytosociological tabulation and to detect reticulate patterns. TDV-optimization may lead to partitions where outlier or extreme relevés are isolated in groups; however, enforcing a minimum group size can highlight partitions with more balanced group sizes. An R package is now available, implementing DiffVal and TDV calculation as well as TDV-optimization.
Abbreviations: DiffVal = Differential Value; EllPar = Ellenberg’s partition into three groups; NDR = number of discrepant relevés; NSS = number of significant species; TDV = Total Differential Value.
Keywords: biclustering, block structure, differential species, optimization, patterns, phytosociological tabulation, tabular classification, TDV-optimization, vegetation
Vegetation is a complex phenomenon driven by many deeply interacting factors. As bottom-up mechanistic approaches to vegetation are often hindered by this complexity, physiognomic-floristic patterns are sought as a means to simplify and handle this phenomenon (
Vegetation classification is the process of identifying, describing and interrelating vegetation units using vegetation relevés (see, e.g.,
Early in its history, vegetation classification was conducted solely by manipulating matrices of vegetation samples (relevés), as later described in detail by
As described in
The concept of differential species dates back over 100 years (
The concept of differential species is closely linked to two other key concepts from the Braun-Blanquet school: (i) characteristic species and (ii) fidelity. These concepts have also undergone changes over the years. For a detailed discussion, see Appendix 1.
During the first half of the 20th century, researchers discussed the feasibility of vegetation classification based on physiognomic-floristic patterns, emphasizing the need for objective sampling and classification methods (
The results of numerical methods are often unsatisfactory for vegetation scientists (
Clustering/partitioning approaches commonly used in vegetation analysis. Each approach is characterized by the models on which the grouping is based, the space in which the groups are sought, and other relevant characteristics that may hinder consistency with theoretical models, reality or expert knowledge.
Name | Reference | Model: Grouping is based on… | Space: Groups are searched in… | Relevant characteristics | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
…cluster centre (centroid, medoid, etc.) | …an agglomerative strategy | …a divisive strategy | …density | …biclusters (subsets) | …graph theory | …heuristics1 | …metaheuristics1 | …the data space (as a Euclidean space) | …pairwise distance/dissimilarity space | …an ordination space, plane or axis | …the data space (the original table) | …a graph | …the solution/partition space | Returns a hierarchical structure? | Search is geometric? | Species are considered equally informative? | Ensures exclusiveness of informative species? | ||
Agglomerative hierarchical clustering |
|
■ | ■ | Y | Y | Y | --- | ||||||||||||
k-means |
|
■ | ■ | ■ | N | Y | Y | --- | |||||||||||
--- |
|
■ | ■ | ■ | N | N | N | O4 | |||||||||||
TABORD |
|
■ | ■ | ■ | □ | □ | N | Y | Y | --- | |||||||||
TWINSPAN |
|
■ | ■ | ■ | ■ | Y | Y | P5 | N | ||||||||||
COMPCLUS |
|
■ | ■ | ■ | N | N | Y | --- | |||||||||||
ALOC |
|
■ | ■ | ■ | □ | □ | N | Y | Y | --- | |||||||||
PAM |
|
■ | ■ | ■ | N | Y | Y | --- | |||||||||||
DIANA |
|
■ | ■ | Y | Y | Y | --- | ||||||||||||
--- |
|
■ | ■ | ■ | N | N | Y | --- | |||||||||||
ESPRESSO |
|
■ | ■ | ■ | N | N | N | N | |||||||||||
COCKTAIL |
|
■ | ■ | ■ | N | N | N | N | |||||||||||
ISOPAM |
|
■ | □ | ■ | ■ 6 | O | P7 | Y8 | N | ||||||||||
--- |
|
■ | ■ 9 | N | N | P10 | N | ||||||||||||
--- |
|
■ | ■ 11 | N | N | P12 | N | ||||||||||||
OPTPART |
|
■ | ■ | N | N | Y | --- | ||||||||||||
OPTSIL |
|
■ | ■ | N | N | Y | --- | ||||||||||||
REMOS |
|
■ | ■ | N | N | Y | --- | ||||||||||||
TDV-optimization | present work | ■ | ■ | N | N | N | Y |
Algorithms that produce hierarchical structures are constrained by nested fusion/division (
In vegetation classification, species occurrences are not all equally informative (
For those methods that recognize the relevance of subsets of species in the emergence of relevé groups, it is crucial to ensure the exclusivity of these species to one or more emerging groups, as this is a fundamental aspect of the tabulation technique. Note that several modern approaches to quantifying fidelity or indicator value use a concentration-based approach, which does not guarantee exclusivity to specific groups (see Appendix 1 for a detailed discussion of this issue). Relying on a concentration-based approach when identifying subsets of informative species may lead to different subsets from those obtained through phytosociological tabulation.
When the clustering/partitioning approach does not meet these four requirements of the phytosociological tabulation (italicized above), it risks misplacing relevés compared to the tabulation technique.
Diagrams of bicluster structures (species in rows, relevés in columns). Dark and light grey blocks enclose the occurrences of species, while white areas contain only the absences of species. A) An exclusive-row and -column, non-exhaustive-in-the-rows structure. B) A relaxation of the previous structure, where only exclusive-rows, non-exhaustive-in-the-rows biclusters are depicted.
In manual tabulation, mutually exclusive groups of all columns (i.e., a partition of the relevés) are adjusted to the identified bicluster blocks to reveal exclusive species patterns among the final relevé groups. Figure
Diagrams of bicluster structures and partitions of the columns/relevés as sought in the phytosociological tabulation. A) A partition of the columns/relevés (groups I, II and III) superimposed on the biclustering structure from Figure
The bicluster structure in Figure
Recall that the bottom light grey area contains species left outside the biclusters. In phytosociological tabulation, some species absent from one or more relevé groups may remain in this area, though this can be considered subjective. In an automatic classification, however, each species is preferentially assigned to a unique block, ensuring that each block contains only species present in all its associated groups. Any later adjustments can be justified by the expert. Thus, in Figure
Looking carefully at Figure
The Total Differential Value (TDV), proposed in this work, seeks to answer the following question: “Can a numerical index be created to globally reflect the strength of differential species patterns in a differentiated table?” To answer this question, TDV leverages the distinction between stochastic and differentiating absences.
An ideal block of differential species is characterized by a high density of presences, coupled with a total absence from at least some of the remaining groups, preferably from all. An ideal tabulation would display large blocks of exclusive species, differentiating each group. However, in practice, the number of species between blocks varies, as does the density of occurrences of each species within the same block.
The usefulness of a single species in differentiating a specific group of relevés increases with (i) the frequency of the species’ presences in the group, and (ii) its exclusivity to the group, meaning it is absent from the other groups as much as possible. This feature is commonly sought in species indicator power measures (e.g.,
Given a phytosociological table T and a k-partition P of its relevés (i.e., a classification of all relevés into k groups, with each relevé assigned to only one group), the DiffVal expresses the ability of a species, say s, to differentiate the groups in which it occurs from the other groups. The DiffVal of species s, given partition P, is calculated by the following formula:
$$\mathrm{DiffVal}_{s,P} = \frac{1}{e} \sum_{g=1}^{k} \left( \frac{a_g}{b_g} \times \frac{c_g}{d_g} \right) \qquad (1)$$

where $a_g$ is the total number of presences of species s within group g; $b_g$ is the total number of relevés in group g; $c_g$ is the total number of differentiating absences of species s in groups other than g; $d_g$ is the total number of relevés in all groups excluding group g; and $e$ is the total number of groups in which species s occurs at least once.
DiffVal is, therefore, a summation of k summands, each one representing the partial differential value of the species s to each one of the k groups. Note that ag/bg is the frequency of the presences (of species s inside group g), relative to the group g size (i.e., the number of relevés in g). This is the constancy of species s to group g (
The TDV is the mean of the DiffVal outcomes for species in table T. It is a global measure (ranging from 0 to 1) of how well the k groups can be distinguished from one another using differential species. Thus, the TDV of a partition P over the phytosociological table T is given by:
$$\mathrm{TDV}_{P} = \frac{1}{n} \sum_{s=1}^{n} \mathrm{DiffVal}_{s,P} \qquad (2)$$

where $n$ is the total number of species in table T and $\mathrm{DiffVal}_{s,P}$ is the differential value of species s, given partition P.
Table
Note that an interesting property emerges from the way the distinction between stochastic and differentiating absences is incorporated into DiffVal. Specifically, the distribution of differentiating absences among the groups under comparison allows for the establishment of exclusiveness degrees (i.e., species exclusive to one group, two groups, three groups, etc.), quantifying the differential power of the species and ultimately distinguishing differential from non-differential species. The latter species, which occur in all vegetation groups of the phytosociological table, do not contribute to TDV, as their respective DiffVal is zero.
relevé no. | 12\|345\|6789 | DiffVal calculation | DiffVal summands | DiffVal |
---|---|---|---|---|
group no. | 11\|222\|3333 | | | |
species 1 | 11\|000\|0000 | 1/1 × [(2/2) × (7/7) + (0/3) × (4/6) + (0/4) × (3/5)] = | 1/1 × [1.00 + 0.00 + 0.00] = | 1.00 |
species 2 | 00\|000\|1011 | 1/1 × [(0/2) × (3/7) + (0/3) × (2/6) + (3/4) × (5/5)] = | 1/1 × [0.00 + 0.00 + 0.75] = | 0.75 |
species 3 | 00\|101\|0000 | 1/1 × [(0/2) × (4/7) + (2/3) × (6/6) + (0/4) × (2/5)] = | 1/1 × [0.00 + 0.67 + 0.00] = | 0.67 |
species 4 | 11\|000\|1111 | 1/2 × [(2/2) × (3/7) + (0/3) × (0/6) + (4/4) × (3/5)] = | 1/2 × [0.43 + 0.00 + 0.60] = | 0.51 |
species 5 | 01\|100\|0000 | 1/2 × [(1/2) × (4/7) + (1/3) × (4/6) + (0/4) × (0/5)] = | 1/2 × [0.29 + 0.22 + 0.00] = | 0.25 |
species 6 | 11\|111\|1101 | 1/3 × [(2/2) × (0/7) + (3/3) × (0/6) + (3/4) × (0/5)] = | 1/3 × [0.00 + 0.00 + 0.00] = | 0.00 |
species 7 | 10\|101\|0100 | 1/3 × [(1/2) × (0/7) + (2/3) × (0/6) + (1/4) × (0/5)] = | 1/3 × [0.00 + 0.00 + 0.00] = | 0.00 |
TDV = 3.18/7 = 0.45 |
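The worked example in the table above can be checked numerically. The following Python sketch is not the diffval R package; the function names `diffval` and `tdv` and the 0/1 list encoding of the table are assumptions made only for illustration. It follows the definitions of the quantities a, b, c, d and e given for formula (1) and takes the species mean of formula (2).

```python
def diffval(presences, groups):
    """DiffVal of one species: presences are 0/1 values, groups holds one label per relevé."""
    labels = sorted(set(groups))
    size = {g: groups.count(g) for g in labels}      # b_g: relevés in group g
    pres = {g: 0 for g in labels}                    # a_g: presences in group g
    for p, g in zip(presences, groups):
        pres[g] += p
    e = sum(1 for g in labels if pres[g] > 0)        # groups where the species occurs
    if e == 0:
        return 0.0                                   # species absent from the whole table
    n = len(groups)
    total = 0.0
    for g in labels:
        # c_g: relevés of the other groups from which the species is completely absent
        c = sum(size[h] for h in labels if h != g and pres[h] == 0)
        d = n - size[g]                              # d_g: relevés outside group g
        if d > 0:
            total += (pres[g] / size[g]) * (c / d)
    return total / e


def tdv(table, groups):
    """TDV: mean DiffVal over all species (rows) of the table."""
    return sum(diffval(row, groups) for row in table) / len(table)


# The seven species and nine relevés of the worked example (groups of size 2, 3 and 4).
GROUPS = [1, 1, 2, 2, 2, 3, 3, 3, 3]
TABLE = [[int(ch) for ch in row] for row in (
    "110000000", "000001011", "001010000", "110001111",
    "011000000", "111111101", "101010100",
)]

values = [round(diffval(row, GROUPS), 2) for row in TABLE]
print(values)                        # [1.0, 0.75, 0.67, 0.51, 0.25, 0.0, 0.0]
print(round(tdv(TABLE, GROUPS), 2))  # 0.45
```

Species 6 and 7 occur in all three groups, so no absence is differentiating and their DiffVal is zero, as in the table.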
The number of possible partitions of a set with n elements is well-studied in combinatorics and corresponds to the nth Bell number (
The TDV metric may provide a way to assist tabulation. Specifically, if we accept that DiffVal is a good measure of the differential value of a species, and if we accept that TDV is a good measure of how well a partition is characterized in terms of the strength of the differential species patterns it contains, then optimizing TDV would allow us to find tabulations close to the ones sought in Braun-Blanquet’s school.
As Bell numbers and Stirling numbers of the second kind increase rapidly, calculating TDV for all possible partitions of the relevés in a phytosociological table is often impracticable. In other words, complete enumeration is not a viable option, as is the case with other criteria (see, e.g.,
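To give a sense of scale, the following Python sketch counts partitions using the standard recurrence for Stirling numbers of the second kind and sums them to obtain Bell numbers. The function names are illustrative choices, not part of any package mentioned in the text.

```python
def stirling2(n, k):
    """Number of ways to split n labelled relevés into exactly k non-empty groups."""
    # Recurrence: S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1), with S(0, 0) = 1.
    row = [1] + [0] * k
    for _ in range(n):
        new = [0] * (k + 1)
        for j in range(1, k + 1):
            new[j] = j * row[j] + row[j - 1]
        row = new
    return row[k]


def bell(n):
    """nth Bell number: partitions of n relevés into any number of groups."""
    return sum(stirling2(n, k) for k in range(n + 1))


print(stirling2(9, 3))   # 3025 three-group partitions of nine relevés
print(bell(9))           # 21147 partitions of nine relevés in total
print(stirling2(64, 3))  # three-group partitions of 64 relevés
```

Even a nine-relevé table admits 3025 three-group partitions, and for 64 relevés the count is of the order of 10^29, which is why exhaustive evaluation of TDV is not feasible.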
R package diffval (
In the diffval package, users can also find the “tabulation” function, which, given a phytosociological table and a partition of its relevés, rearranges the table’s rows and columns to display exclusive species at the top. The reordering follows these steps: (i) Species are first ordered by the increasing number of groups in which they occur (i.e., by exclusiveness level/degree). (ii) Within each exclusiveness level, species are further ordered lexicographically based on the groups to which they belong. (iii) Finally, within each level defined by the previous two steps (also known as shortlex order), species are reordered by the decreasing sum of their relative frequencies across the groups in which they occur.
The columns are also reordered according to the increasing order of the assigned group membership numbers. Optionally, the rearranged table can be visualized graphically (an example is provided below).
It is important to note that the tabular rearrangement performed by the “tabulation” function is based on simple combinatorial rules (primarily the shortlex order). Its effectiveness in generating a meaningful phytosociological tabulation depends entirely on the data and the given partition.
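The three ordering rules can be sketched as a single sort key. The Python code below is an illustration under stated assumptions, not the code of the "tabulation" function: the name `tabulation_order`, the tuple-based key and the 0/1 list encoding are hypothetical, and ties are broken by the original row position.

```python
def tabulation_order(table, groups):
    """Return (species order, relevé order) as lists of row and column indices."""
    labels = sorted(set(groups))
    size = {g: groups.count(g) for g in labels}
    keys = []
    for i, row in enumerate(table):
        pres = {g: 0 for g in labels}
        for p, g in zip(row, groups):
            pres[g] += p
        occupied = tuple(g for g in labels if pres[g] > 0)
        freq_sum = sum(pres[g] / size[g] for g in occupied)
        # (i) number of groups, (ii) lexicographic group set, (iii) decreasing frequency sum
        keys.append((len(occupied), occupied, -freq_sum, i))
    species_order = [key[3] for key in sorted(keys)]
    # Columns follow increasing group label; sorted() is stable within a group.
    releve_order = sorted(range(len(groups)), key=lambda j: groups[j])
    return species_order, releve_order


GROUPS = [1, 1, 2, 2, 2, 3, 3, 3, 3]
TABLE = [[int(ch) for ch in row] for row in (
    "110000000", "000001011", "001010000", "110001111",
    "011000000", "111111101", "101010100",
)]
species_order, releve_order = tabulation_order(TABLE, GROUPS)
print([i + 1 for i in species_order])  # [1, 3, 2, 5, 4, 6, 7]
```

Applied to the seven-species example used to illustrate DiffVal, the species exclusive to groups 1, 2 and 3 come first, followed by those shared by groups 1 and 2, then 1 and 3, and finally the species present in all groups.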
I illustrate TDV-optimization using an artificial data set (Example 1), and a real-world data set (Example 2). I contrast TDV-optimization with ten clustering/partitioning methods (see Table
Summary of clustering/partitioning methods compared to TDV-optimization, along with their abbreviated names used hereafter.
Method | Abbreviated name (in bold) and software package | Parameters |
---|---|---|
Agglomerative hierarchical clustering | Ward, cluster1 | BCD, Ward’s method, dendrogram cut into three groups |
Agglomerative hierarchical clustering | Flexible β, cluster1 | BCD, Flexible β method, β = -0.25, dendrogram cut into three groups |
Centroid partitioning (k-means) | k-means, stats2 | number of groups = 3, no. of runs/starts = 10 |
Modified-TWINSPAN ( | Mod-TWINSPAN, twinspan3 | cut levels = 0, max. no. of indicators for division = 7, min. group size for division = 5, max. depth of levels of divisions = 6, dendrogram cut into three groups, using class heterogeneity ( |
Partition around the medoids | PAM, cluster1 | BCD, no. of groups = 3 |
Divisive analysis clustering | DIANA, cluster1 | BCD, dendrogram cut into three groups |
Isometric feature mapping and partitioning around medoids | ISOPAM, isopam4 | BCD, no. of groups = 3 |
Partitioning by optimizing PARTANA ratio | OPTPART, optpart5 | BCD, desired no. of groups = 3 (optimization parameters given in text) |
Partitioning by optimizing mean silhouette width | OPTSIL, optpart5 | BCD, desired no. of groups = 3 (optimization parameters given in text) |
Reallocation of misclassified objects using the silhouette width criterion | OPTSIL+REMOS, R code supplied in | REMOS1 application to all OPTSIL local optima. BCD, threshold of silhouette width for misclassified objects = -0.001 |
Partitioning by optimizing TDV | TDV-optimization, diffval6 | no. of groups = 3 (optimization parameters given in text) |
IndVal p-values are calculated using permutations. To obtain stable estimates, ssIndval and nsIndic are averaged over 100 runs of the “indval” function from the labdsv package (
For all methods that required dissimilarity matrices, the Bray-Curtis index (
OptimClass 1 was calculated using Fisher’s exact test for the right-tailed hypothesis (
Reticulated patterns are misaligned subsets of species that produce multiple, sound groupings of relevés (see also
To test this, I created an artificial data set (107 species × 64 relevés) containing two distinct patterns of differential species distributed across three groups. The first pattern included three blocks of 16, 18, and 30 relevés, with 6, 8 and 12 differential species, respectively. The second pattern comprised three additional blocks of 14, 22, and 28 relevés, with 5, 11, and 9 differential species, respectively. These two patterns were randomly interwoven, ensuring that the blocks composing each pattern remained misaligned. The remaining 56 species were considered common-to-rare, without any associated pattern. Species presences within each block were randomly generated with varying theoretical relative frequencies, decreasing exponentially from 0.9 to around 0.1. A similar process was applied to the 56 common-to-rare species, but their presences were distributed across all 64 relevés, with frequencies decreasing exponentially from 1.0 to around 0.05.
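A data set with this layout could be generated along the following lines. This Python sketch is only an approximation of the description above and not the generator actually used: the geometric interpolation between the end frequencies, the independent Bernoulli draws, the random seed and all function names are assumptions. Misalignment of the two patterns is made likely by an independent random ordering of the relevés rather than being enforced.

```python
import random


def exp_frequencies(m, start=0.9, end=0.1):
    """m theoretical frequencies decreasing exponentially (geometrically) from start to end."""
    if m == 1:
        return [start]
    ratio = (end / start) ** (1 / (m - 1))
    return [start * ratio ** i for i in range(m)]


def random_rows(freqs, columns, n_total, rng):
    """One 0/1 row per frequency; presences are drawn only inside the given columns."""
    rows = []
    for f in freqs:
        row = [0] * n_total
        for j in columns:
            row[j] = 1 if rng.random() < f else 0
        rows.append(row)
    return rows


def make_pattern(block_sizes, n_species, order, rng):
    """Blocks of differential species laid over consecutive chunks of a relevé ordering."""
    rows, start = [], 0
    for size, m in zip(block_sizes, n_species):
        columns = order[start:start + size]
        rows += random_rows(exp_frequencies(m), columns, len(order), rng)
        start += size
    return rows


rng = random.Random(42)          # arbitrary seed, for reproducibility only
n_rel = 64
order_123 = list(range(n_rel))
order_abc = list(range(n_rel))
rng.shuffle(order_abc)           # second pattern uses an independent random ordering

table = (make_pattern([16, 18, 30], [6, 8, 12], order_123, rng)
         + make_pattern([14, 22, 28], [5, 11, 9], order_abc, rng)
         + random_rows(exp_frequencies(56, 1.0, 0.05), range(n_rel), n_rel, rng))
print(len(table), len(table[0]))  # 107 64
```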
The first pattern of differential species defined three groups of relevés, labelled 1, 2, and 3 (hereafter pattern ‘123’), while the second pattern defined three groups labelled a, b, and c (pattern ‘abc’). Each relevé was assigned both a number and a letter based on the differential species it contained (Figure
I submitted this data set, with relevés in random order, to the classification procedures listed in Table
Optimization procedures (OPTPART, OPTSIL, and TDV-optimization) were run multiple times with random initial partitions, retaining local optima for further analysis. OPTPART was run 50,000 times with default settings: a maximum of 100 iterations per run and a minimum PARTANA ratio increment of 0.001 to continue iterating. OPTSIL was run 20,000 times with the default maximum of 100 iterations per run. TDV-optimization was run 5,000 times using the “optim_tdv_hill_climb” function (
For each optimization procedure, I analysed the partition with the highest observed value of the optimized criterion and searched the local optima for partitions closest to patterns ‘123’ and ‘abc’. For the OPTSIL+REMOS method, I also analysed the solution with the highest ASW and searched the local optima for partitions as close as possible to these patterns. Partition discrepancy (or distance) was measured as the minimum number of relevés that must be reassigned to obtain one partition from another (see MINDMT in
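The discrepancy measure described above, the minimum number of relevés to reassign, can be computed by matching the group labels of the two partitions so that agreement is maximal. The Python sketch below tries every label matching, which is adequate for a handful of groups; the function name `n_discrepant` is hypothetical and the brute-force search is a simplification chosen for brevity, not necessarily how the cited measure is implemented.

```python
from itertools import permutations


def n_discrepant(p, q):
    """Minimum number of relevés to reassign to turn partition p into partition q."""
    labels_p = sorted(set(p))
    labels_q = sorted(set(q))
    k = max(len(labels_p), len(labels_q))
    # Pad the shorter label list with dummy (empty) groups.
    labels_p = labels_p + [None] * (k - len(labels_p))
    labels_q = labels_q + [None] * (k - len(labels_q))
    # Contingency counts: relevés placed in group a of p and group b of q.
    counts = {}
    for a, b in zip(p, q):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    # Best one-to-one matching of labels, found by trying all k! permutations.
    best = max(
        sum(counts.get((a, b), 0) for a, b in zip(labels_p, perm))
        for perm in permutations(labels_q)
    )
    return len(p) - best


print(n_discrepant([1, 1, 2, 2, 3, 3], ["b", "b", "c", "c", "a", "a"]))  # 0
print(n_discrepant([1, 1, 2, 2, 3, 3], [1, 1, 1, 2, 3, 3]))              # 1
```

The first call returns 0 because the two partitions differ only in how the groups are named.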
Table
Significance for (1) to (5) was determined using Fisher’s exact test, exactly as in OptimClass 1, with a significance level (α) of 0.05. The sum of NSS-c-to-r, NSS’123’, and NSS’abc’ equals OptimClass 1 (α = 0.05).
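As a minimal sketch of this kind of count, the Python code below computes the right-tailed Fisher's exact test from the hypergeometric distribution and counts the species that are significant in at least one group. The counting rule (a species is counted once if any group gives p < α), the function names and the small example table are assumptions for illustration; this is not the code used to produce the reported figures.

```python
from math import comb


def fisher_right_tail(a, n_group, k_species, n_total):
    """P(X >= a) when n_group relevés are drawn from n_total, k_species of which hold the species."""
    upper = min(n_group, k_species)
    favourable = sum(
        comb(k_species, x) * comb(n_total - k_species, n_group - x)
        for x in range(a, upper + 1)
    )
    return favourable / comb(n_total, n_group)


def count_significant_species(table, groups, alpha=0.05):
    """Number of species significantly concentrated in at least one group."""
    n_total = len(groups)
    count = 0
    for row in table:
        k_species = sum(row)
        p_values = []
        for g in set(groups):
            n_group = groups.count(g)
            a = sum(p for p, h in zip(row, groups) if h == g)
            p_values.append(fisher_right_tail(a, n_group, k_species, n_total))
        if min(p_values) < alpha:
            count += 1
    return count


GROUPS = [1, 1, 2, 2, 2, 3, 3, 3, 3]
TABLE = [[int(ch) for ch in row] for row in (
    "110000000", "000001011", "001010000", "110001111",
    "011000000", "111111101", "101010100",
)]
print(count_significant_species(TABLE, GROUPS))  # 2
```

In this nine-relevé example only species 1 (p = 1/36) and species 2 (p = 6/126) fall below α = 0.05, which shows how little power the test has in very small tables.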
NSS-c-to-r counts how many of these 56 species showed a statistically significant concentration in at least one group. At the commonly accepted significance level of α = 0.05, an average of 2.8 species are expected to be falsely detected as concentrated in at least one group. Table
Despite the high false positive rates among the common-to-rare species, k-means and OPTPART detected 32 of the 51 differential species as significant (NSSExc’123’ + NSSExc’abc’). ISOPAM found 31. These are notably high values, given that the corresponding partitions deviate substantially from patterns ‘123’ and ‘abc’ and represent a mixture of some of the artificially created groups.
For partitions J and K, which are closer to partitions ‘123’ and ‘abc’, the figures are necessarily different, showing lower values of NSS-c-to-r. Unsurprisingly, partitions ‘123’ and ‘abc’ (found by OPTSIL, OPTSIL+REMOS and TDV-optimization) have OptimClass 1 values (α = 0.05) that are closer to the actual number of differential species composing each pattern. These numbers reflect high statistical power, ranging from 88 to 96% (see the NSS of the respective pattern), and acceptable false positive rates, ranging from 4 to 5% (see NSS-c-to-r and the NSS of the other pattern).
It is worth mentioning that: (i) Mod-TWINSPAN and DIANA had relatively low NSS-c-to-r values; (ii) PAM and ISOPAM showed the highest values of ssIndval and nsIndic; (iii) Flexible β and DIANA had relatively high values of avISAMIC; and (iv) k-means showed a high value for the PARTANA ratio.
The number of significant indicators (nsIndic) was consistently between OptimClass 1 values for α = 0.05 and those for α = 0.01.
For the optimization procedures, Table
OPTSIL found partition ‘123’ (among the set of local optima) and got extremely close to partition ‘abc’, with just two discrepant relevés (partition K, which also has the highest ASW). It is not possible for OPTSIL to find partition ‘abc’ exactly, as its ASW is lower than the ASW of partition K, and applying OPTSIL directly to partition ‘abc’ leads to convergence with partition K. The increase in ASW can only be explained by the noise that other (randomly distributed) species introduce to the ASW.
In contrast, OPTSIL+REMOS improved the MSW of partition K, converging to partition ‘abc’ (which is a local optimum of OPTSIL+REMOS). However, since the MSW is negative for partition ‘123’, it becomes impossible for OPTSIL+REMOS to find it. Applying the REMOS algorithm directly to partition ‘123’ changes 13 relevés from their original groups, leading to partition L, which has the highest ASW found by the REMOS procedure and is the closest to pattern ‘123’.
SillyPutty (
TDV-optimization was the only method capable of converging to local optima that identified both patterns (‘123’ and ‘abc’). However, the highest TDV corresponds to a partition (M) where two groups consist of single relevés. Partition M also exhibits the highest avISAMIC. Nevertheless, when TDV-optimization is applied with the parameter min_g_size = 4, partitions ‘123’ and ‘abc’ are ranked first and second among the returned solutions (see Discussion).
Artificial data set displaying its two reticulated patterns of differential species. Species presences are represented as small rectangles (coloured or grey). A) Relevés are sorted by the numbers in their labels, revealing pattern ‘123’ with 26 differential species. B) Relevés are sorted by the letters in their labels, revealing pattern ‘abc’ with 25 differential species.
Statistics for the partitions returned by the tested methods for the Example 1 data set. The best scores in each row are in bold, and the maximum (or minimum) values are in bold italics.
Statistic | Ward (agglomerative clustering) | Flexible β (agglomerative clustering) | k-means | Mod-TWINSPAN | PAM | DIANA | ISOPAM | OPTPART: maximum found | OPTPART: local optimum closest to ‘123’ | OPTPART: local optimum closest to ‘abc’ | OPTSIL: maximum found and closest to ‘abc’ | OPTSIL: ‘123’ is among local optima | OPTSIL+REMOS: maximum found and closest to ‘123’ | OPTSIL+REMOS: ‘abc’ is among local optima | TDV-optimization: maximum found | TDV-optimization: ‘123’ is among local optima | TDV-optimization: ‘abc’ is among local optima |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Partition id: | A | B | C | D | E | F | G | H | I | J | K | ‘123’ | L | ‘abc’ | M | ‘123’ | ‘abc’ |
PARTANA | 1.207 | 1.183 | 1.258 | 1.189 | 1.187 | 1.173 | 1.258 | 1.265 | 1.240 | 1.255 | 1.249 | 1.228 | 1.256 | 1.243 | 1.246 | 1.228 | 1.243 |
ASW | 0.0814 | 0.0714 | 0.1018 | 0.0729 | 0.0723 | 0.0496 | 0.1024 | 0.0989 | 0.0946 | 0.1010 | 0.1042 | 0.1036 | 0.1039 | 0.1014 | 0.0491 | 0.1036 | 0.1014 |
MSW (×10) | -0.965 | -1.545 | -0.087 | -0.758 | -0.925 | -1.379 | 0.012 | -0.336 | -0.51 | -0.461 | -0.318 | -0.229 | 0.002 | 0.022 | -2.461 | -0.229 | 0.022 |
OptimClass 1 (α = 0.05)1 | 43 (25) | 34 (24) | 42 (33) | 33 (27) | 43 (24) | 29 (22) | 43 (32) | 45 (35) | 41 (24) | 31 (23) | 29 (25) | 26 (24) | 39 (24) | 28 (24) | 5 (5) | 26 (24) | 28 (24) |
OptimClass 1 (α = 0.01) 1 | 25 (17) | 16 (15) | 18 (17) | 19 (16) | 23 (15) | 16 (15) | 22 (19) | 20 (19) | 23 (17) | 17 (15) | 16 (16) | 18 (18) | 22 (18) | 17 (17) | 0 (0) | 18 (18) | 17 (17) |
OptimClass 1 (α = 10-3) 1 | 10 (6) | 12 (11) | 11 (10) | 11 (10) | 11 (6) | 10 (10) | 11 (10) | 9 (9) | 14 (11) | 12 (11) | 12 (12) | 14 (14) | 16 (13) | 14 (14) | 0 (0) | 14 (14) | 14 (14) |
OptimClass 1 (α = 10-6) 1 | 3 (3) | 3 (3) | 5 (4) | 3 (3) | 2 (1) | 4 (4) | 6 (5) | 4 (4) | 4 (3) | 7 (6) | 8 (8) | 9 (9) | 5 (4) | 9 (9) | 0 (0) | 9 (9) | 9 (9) |
ssIndval (α = 0.05) | 10.66 | 10.41 | 10.64 | 9.82 | 12.08 | 7.94 | 12.00 | 9.99 | 10.88 | 9.76 | 9.74 | 9.61 | 11.89 | 10.01 | 1.87 | 9.61 | 10.02 |
nsIndic (α = 0.05) | 31.93 | 25.00 | 30.53 | 29.39 | 36.53 | 20.06 | 36.37 | 27.90 | 30.87 | 22.32 | 24.22 | 21.00 | 34.27 | 23.81 | 2.00 | 21.00 | 23.87 |
avISAMIC | 0.6654 | 0.6846 | 0.6695 | 0.6717 | 0.6709 | 0.6794 | 0.6710 | 0.6636 | 0.6705 | 0.6657 | 0.6640 | 0.6771 | 0.6731 | 0.6670 | 0.8797 | 0.6771 | 0.6670 |
NSS-c-to-r (α = 0.05) | 11 | 10 | 8 | 4 | 14 | 5 | 9 | 10 | 11 | 3 | 5 | 2 | 7 | 3 | 4 | 2 | 3 |
| | (20%) | (18%) | (14%) | (7%) | (25%) | (9%) | (16%) | (18%) | (20%) | (5%) | (9%) | (4%) | (12%) | (5%) | (7%) | (4%) | (5%) |
NSS’123’ (α = 0.05) | 17 | 21 | 17 | 16 | 18 | 19 | 18 | 15 | 20 | 6 | 0 | 23 | 19 | 1 | 0 | 23 | 1 |
| | (65%) | (81%) | (65%) | (62%) | (69%) | (73%) | (69%) | (58%) | (77%) | (23%) | (0%) | (88%) | (73%) | (4%) | (0%) | (88%) | (4%) |
NSSExc’123’ (α = 0.05) | 10 | 20 | 17 | 13 | 12 | 19 | 18 | 14 | 19 | 3 | 0 | 23 | 19 | 0 | 0 | 23 | 0 |
| | (38%) | (77%) | (65%) | (50%) | (46%) | (73%) | (69%) | (54%) | (73%) | (12%) | (0%) | (88%) | (73%) | (0%) | (0%) | (88%) | (0%) |
NSS’abc’ (α = 0.05) | 15 | 3 | 17 | 13 | 11 | 5 | 16 | 20 | 10 | 22 | 24 | 1 | 13 | 24 | 1 | 1 | 24 |
| | (60%) | (12%) | (68%) | (52%) | (44%) | (20%) | (64%) | (80%) | (40%) | (88%) | (96%) | (4%) | (52%) | (96%) | (4%) | (4%) | (96%) |
NSSExc’abc’ (α = 0.05) | 12 | 1 | 15 | 12 | 8 | 2 | 13 | 18 | 4 | 20 | 24 | 1 | 5 | 24 | 1 | 1 | 24 |
| | (48%) | (4%) | (60%) | (48%) | (32%) | (8%) | (52%) | (72%) | (16%) | (80%) | (96%) | (4%) | (20%) | (96%) | (4%) | (4%) | (96%) |
NDR’123’ | 23 | 13 | 18 | 18 | 21 | 16 | 17 | 21 | 12 | 34 | 36 | 0 | 13 | 36 | 33 | 0 | 36 |
NDR’abc’ | 22 | 39 | 22 | 26 | 25 | 37 | 22 | 20 | 29 | 7 | 2 | 36 | 25 | 0 | 34 | 36 | 0 |
TDV | 0.0515 | 0.0503 | 0.0615 | 0.0601 | 0.0433 | 0.0678 | 0.0586 | 0.0700 | 0.0591 | 0.0700 | 0.0904 | 0.1010 | 0.0511 | 0.1024 | 0.1622 | 0.1010 | 0.1024 |
Research works aiming to classify the Arrhenatheretum data set and their performance compared to the three-group partition proposed by
Reference | Method | No. of discrepant relevés compared to Ellenberg’s partition |
---|---|---|
| | TABORD | 7 |
| | Agglomerative clustering (Euclidean distance, Ward’s method) | 3 |
| | TWINSPAN | 7 |
| | COMPCLUS (non-hierarchical composite clustering) | 2 |
| | VEGTAB, using the minimum spanning tree1 | 6 |
| | FLEXCLUS, modified single linkage clustering + centroid clustering | 8 |
| | Agglomerative clustering (complete linkage) | 3 |
For the Arrhenatheretum data set, OPTPART was run 5,000 times with default values for the maximum number of iterations per run (100) and for the minimum increment in the PARTANA ratio required to continue iterating (0.001). OPTSIL was run 5,000 times with default values for the maximum number of iterations per run (100). TDV-optimization was run 500 times using the function “optim_tdv_hill_climb”. Each run began with 300 iterations of stochastic hill climbing, followed by up to 30 iterations of greedy hill climbing. These iteration numbers were chosen based on initial tests, which confirmed that TDV plateaued in individual runs. The 100 highest-TDV partitions were retained for subsequent analysis. The minimum group size (min_g_size) was set to the default value of 1.
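The two-phase search described above can be sketched as follows. This Python code is an assumption-laden illustration and not the implementation of “optim_tdv_hill_climb”: the neighbourhood (reassigning a single relevé), the balanced random starting partition, the acceptance rule and all function and parameter names other than min_g_size are hypothetical choices.

```python
import random


def tdv(table, groups, k):
    """Total Differential Value for groups labelled 0..k-1 (all non-empty, k >= 2)."""
    n = len(groups)
    size = [groups.count(g) for g in range(k)]
    total = 0.0
    for row in table:
        pres = [0] * k
        for p, g in zip(row, groups):
            pres[g] += p
        e = sum(1 for a in pres if a > 0)
        if e == 0:
            continue                          # species absent from the table
        s = 0.0
        for g in range(k):
            c = sum(size[h] for h in range(k) if h != g and pres[h] == 0)
            s += (pres[g] / size[g]) * (c / (n - size[g]))
        total += s / e
    return total / len(table)


def sizes_ok(groups, k, min_g_size):
    return all(groups.count(g) >= min_g_size for g in range(k))


def hill_climb(table, k, n_stochastic=300, n_greedy=30, min_g_size=1, seed=None):
    rng = random.Random(seed)
    n = len(table[0])
    if k < 2 or min_g_size < 1 or n // k < min_g_size:
        raise ValueError("k and min_g_size are incompatible with the number of relevés")
    groups = [j % k for j in range(n)]        # balanced start ...
    rng.shuffle(groups)                       # ... in random order (an assumption)
    best = tdv(table, groups, k)

    # Phase 1: stochastic hill climbing, one random reassignment per iteration.
    for _ in range(n_stochastic):
        j, g = rng.randrange(n), rng.randrange(k)
        if g == groups[j]:
            continue
        old = groups[j]
        groups[j] = g
        value = tdv(table, groups, k) if sizes_ok(groups, k, min_g_size) else -1.0
        if value > best:
            best = value                      # keep the improving move
        else:
            groups[j] = old                   # reject the move

    # Phase 2: greedy hill climbing, best single reassignment per iteration.
    for _ in range(n_greedy):
        move = None
        for j in range(n):
            old = groups[j]
            for g in range(k):
                if g == old:
                    continue
                groups[j] = g
                if sizes_ok(groups, k, min_g_size):
                    value = tdv(table, groups, k)
                    if value > best:
                        best, move = value, (j, g)
            groups[j] = old
        if move is None:
            break                             # local optimum for this neighbourhood
        groups[move[0]] = move[1]
    return groups, best


TABLE = [[int(ch) for ch in row] for row in (
    "110000000", "000001011", "001010000", "110001111",
    "011000000", "111111101", "101010100",
)]
groups, value = hill_climb(TABLE, k=3, seed=1)
print(groups, round(value, 4))
```

Repeating such runs from many random starts and keeping the partitions with the highest TDV mirrors, in outline, the retention of the best solutions described above; raising min_g_size in the sketch prevents single-relevé groups.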
OPTPART consistently converged to a two-group solution with a PARTANA ratio of 2.183. In the OPTPART algorithm, “no minimum cluster size is enforced” (
Table
Suppl. material
OPTSIL found the solution with the highest ASW, followed by OPTSIL+REMOS. TDV-optimization yielded the highest values of TDV and avISAMIC. The second highest value of TDV was obtained by Mod-TWINSPAN. TDV-optimization also found the solution with the highest PARTANA ratio, again followed by Mod-TWINSPAN (but recall that OPTPART was not included in the comparison).
In general, all methods identified a relatively high proportion of the differential species defined by Ellenberg (nEll-diff-found), with Ward, k-means and OPTSIL performing relatively worse.
The local optimum closest to EllPar found by the OPTSIL algorithm is the same partition produced by k-means (partition c). Partition c has the highest number of species exclusive to one or two groups (nEll-not-diff-but-excl + nEll-diff-found), but the respective TDV is relatively low. Recall that DiffVal weighs the constancy in groups and the degree of exclusiveness. Partition c has a relatively low number of species exclusive to just one group.
The local optimum closest to EllPar found by the OPTSIL+REMOS procedure is the same partition produced by Flexible β (partition b), differing from EllPar by only three discrepant relevés.
ISOPAM produced an interesting partition that, despite having eight discrepant relevés (NDR’EllPar’), closely resembled EllPar in terms of differential species. The Bromus erectus group remains the same as in EllPar, with the difference lying in how the remaining relevés are split. The tabulation shown in Suppl. material
The statistic nEll-not-diff-but-excl counts the number of species not considered differential by Ellenberg but found to be exclusive to one or two groups (and thus potential differentials if the partition is to be accepted). High values of this measure indicate that species considered pervasive in Ellenberg’s tabulation contribute to differentiating the groups of the partition under analysis. The baseline for this measure is not 0, as Ellenberg did not highlight as differential some species that were exclusive to two groups. As a result, nEll-not-diff-but-excl for Ellenberg’s own partition is 5. While the decision not to consider these species as differential may have been subjective (or based on expert knowledge), all had a low number of presences in the matrix: four for Lotus corniculatus and Galium boreale, and three for Silene inflata, Silaus pratensis and Pastinaca sativa. The automatic tabulation performed by the “tabulation” function, which is primarily based on the shortlex order, assigns these species to the respective differential block (see Suppl. material
Figure
Suppl. material
Several of the top solutions from TDV-optimization (see Suppl. material
Statistics for the partitions returned by the tested methods for the Arrhenatheretum data set. The best scores in each row are in bold, and the maximum (or minimum) values are in bold italics.
Statistic | Ward (agglomerative clustering) | Flexible β (agglomerative clustering) | k-means | Mod-TWINSPAN | PAM | DIANA | ISOPAM | OPTSIL: maximum found | OPTSIL: local optimum closest to EllPar | OPTSIL+REMOS: maximum found | OPTSIL+REMOS: local optimum closest to EllPar | TDV-optimization: maximum found | TDV-optimization: local optimum closest to EllPar | Ellenberg’s partition |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Partition id: | a | b | c | d | e | f | g | h | c | i | b | j | k | EllPar |
PARTANA | 1.693 | 1.978 | 1.750 | 2.015 | 1.724 | 1.819 | 1.709 | 1.978 | 1.750 | 1.936 | 1.978 | 2.094 | 1.854 | 1.803 |
ASW | 0.1556 | 0.1695 | 0.1824 | 0.1489 | 0.1679 | 0.1883 | 0.1644 | 0.2280 | 0.1824 | 0.2079 | 0.1695 | 0.1622 | 0.1697 | 0.1627 |
MSW (×10) | -1.302 | 0.332 | -0.346 | -0.637 | -0.641 | 0.157 | -0.658 | -0.728 | -0.346 | 0.020 | 0.332 | -1.435 | -1.299 | -1.354 |
OptimClass 1 (α = 0.05) 1 | 21 (18) | 20 (19) | 23 (19) | 21 (19) | 22 (18) | 25 (19) | 25 (21) | 15 (14) | 23 (19) | 22 (20) | 20 (19) | 21 (21) | 22 (21) | 23 (21) |
OptimClass 1 (α = 0.01) 1 | 12 (10) | 10 (10) | 13 (11) | 12 (12) | 14 (12) | 17 (14) | 14 (13) | 9 (9) | 13 (11) | 12 (12) | 10 (10) | 11 (11) | 11 (11) | 12 (12) |
OptimClass 1 (α = 10-3) 1 | 2 (2) | 3 (3) | 2 (2) | 3 (3) | 2 (2) | 3 (3) | 2 (2) | 5 (5) | 2 (2) | 4 (4) | 3 (3) | 4 (4) | 3 (3) | 4 (4) |
ssIndval (α = 0.05) | 10.55 | 5.67 | 10.16 | 6.04 | 9.68 | 11.61 | 10.81 | 6.97 | 10.16 | 8.59 | 5.69 | 5.12 | 8.07 | 8.64 |
nsIndic (α = 0.05) | 20.82 | 8.00 | 19.00 | 9.00 | 17.83 | 21.83 | 20.34 | 9.00 | 19.00 | 13.43 | 8.04 | 7.11 | 13.00 | 15.00 |
avISAMIC | 0.6006 | 0.5997 | 0.6047 | 0.5832 | 0.5940 | 0.6150 | 0.5939 | 0.6429 | 0.6047 | 0.6090 | 0.5997 | 0.7071 | 0.5808 | 0.5824 |
nEll-not-diff-but-excl | 9 | 8 | 12 | 9 | 6 | 9 | 5 | 11 | 12 | 8 | 8 | 15 | 6 | 5 |
(41%) | (36%) | (55%) | (41%) | (27%) | (41%) | (23%) | (50%) | (55%) | (36%) | (36%) | (68%) | (27%) | (23%) | |
nEll-diff-found | 23 | 25 | 23 | 24 | 24 | 24 | 26 | 22 | 23 | 24 | 25 | 26 | 26 | 26 |
(88%) | (96%) | (88%) | (92%) | (92%) | (92%) | (100%) | (85%) | (88%) | (92%) | (96%) | (100%) | (100%) | (100%) | |
NDR’EllPar’ | 8 | 3 | 7 | 4 | 8 | 5 | 8 | 9 | 7 | 7 | 3 | 5 | 1 | 0 |
TDV | 0.1821 | 0.2409 | 0.1909 | 0.2471 | 0.1838 | 0.2051 | 0.1854 | 0.2380 | 0.1909 | 0.2330 | 0.2409 | 0.2511 | 0.2285 | 0.2127 |
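The counting rule behind nEll-not-diff-but-excl can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the table: the function names, the toy presence-absence table, and the reference list of differential species are all hypothetical. A species is counted when it is absent from at least one whole group of the partition (so it is exclusive to a proper subset of the groups) and is not on the reference list.

```python
def occupied_groups(presences, groups):
    """Return the set of groups in which the species occurs at least once."""
    return {g for present, g in zip(presences, groups) if present}


def count_exclusive_not_in_reference(table, groups, reference_differentials):
    """Count species outside the reference list that are exclusive to a
    proper, non-empty subset of the groups (hypothetical helper)."""
    all_groups = set(groups)
    count = 0
    for species, presences in table.items():
        if species in reference_differentials:
            continue  # already treated as differential in the reference
        occupied = occupied_groups(presences, groups)
        # Exclusive: present somewhere, but wholly absent from some group.
        if occupied and occupied != all_groups:
            count += 1
    return count


# Toy data (hypothetical): seven relevés assigned to three groups.
groups = [1, 1, 1, 2, 2, 3, 3]
table = {
    "sp_a": [1, 1, 0, 0, 0, 0, 0],  # exclusive to group 1, on the reference list
    "sp_b": [0, 1, 0, 1, 0, 0, 0],  # exclusive to groups 1 and 2 -> counted
    "sp_c": [1, 0, 0, 0, 1, 0, 1],  # occurs in all groups -> not counted
    "sp_d": [0, 0, 0, 0, 0, 1, 0],  # exclusive to group 3 -> counted
}
reference = {"sp_a"}

n_excl = count_exclusive_not_in_reference(table, groups, reference)
print(n_excl)
```

With three groups, "exclusive to one or two groups" is the same condition as "absent from at least one group", which is what the set comparison checks.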
Visualization of the exclusive species for each group (or combination of two groups) in partition k, as generated by the “tabulation” function. Relevé and species names have been added, with Ellenberg’s differential species highlighted in orange. See
Reticulate patterns are not uncommon in vegetation data, and ordination techniques are typically used to explore them (
Example 1 demonstrated that reticulate patterns pose challenges to most clustering and partitioning methods. This is methodologically understandable: when some relevés contain only one or a few differential species per pattern, the signal may be masked by other species combinations and remain undetected by conventional dissimilarity measures.
A key sentence in
“The idea of diagnostic-floristic similarity applies to the number of differential species per group that are present in a relevé. It thus differs from the idea of total floristic similarity, as evaluated for standard floristic similarity relationships.”
Clustering and partitioning strategies that treat all species as equally informative (see Table
TDV-optimization was run with the minimum group size parameter set to 1, allowing the formation of groups with isolated relevés. The treatment of isolated objects, however, varies between methods. Some methods, like Ward and k-means, tend to produce relatively balanced groups and avoid isolating individual objects. In contrast, the β parameter in Flexible β offers a continuous control that balances the chaining effect of single linkage (which tends to isolate objects when cutting the dendrogram) and the more compact, balanced groups produced by complete linkage. Some metrics, such as the silhouette width, are not defined for isolated objects. The solution with the highest TDV (0.2511, Table
Two solutions among the local optima, in positions 196 and 197 (Suppl. material
The ability of TDV-optimization to find patterns is explained by its design. As DiffVal equals zero for species that pervade all groups of a given partition, TDV-optimization ignores such species in each iteration, focusing solely on the subset of the matrix where differential species may occur. The idea that it could be useful to focus on subsets of a phytosociological data set was also expressed by
The use of indices such as OptimClass (
One of the top partitions obtained using TDV-optimization (Figure
TDV-optimization of the Arrhenatheretum data set showed that presence-absence data alone was sufficient to generate a relevé partition that closely mirrors the original classification given by
DiffVal and TDV were designed to satisfy traditional tabulation requirements. For this reason, TDV-optimization is conceptually distinct from previously published clustering/partitioning approaches (recall Table
Most of the approaches listed in Table
Conceptually, TDV-optimization is particularly similar to biclustering approaches (like
The algorithm proposed by
Using the parameter Y = 0% in Češka and Roemer’s approach would ensure the exclusivity of species to the forming biclusters, conforming to an exclusive-rows, non-exhaustive-in-the-rows structure. However, this approach does not seem to have been explored by the authors. This parametrization is the closest known approach to TDV-optimization (see Table
The ESPRESSO algorithm, proposed by
TDV-optimization, based on DiffVal, searches for patterns of differential species in a manner similar to
Internal indices are probably the most commonly used (
Relative evaluators are used less often.
Considering external validity indices, a recent example is
A major finding of
Under purely exploratory analysis, multiple clustering/partitioning methods can be applied to gain potential insights. The performance of these methods can be evaluated in relation to the vegetation phenomenon (e.g., some methods prioritize dominant species, while others highlight rare species). External validation is often compelling, particularly when groups accord with environmental gradients or reveal biogeographical patterns. Expert validation is also valuable in assessing whether the resulting groups agree with existing knowledge, relate to established vegetation classifications, or serve a practical purpose (see, e.g.,
I assume that vegetation scientists may also be interested in specific methods for partitioning relevés based on differential species (see
Under such an applied approach and optimization framework, the optimized criterion also serves as an internal evaluator (this is self-evident, as there would otherwise be no justification for its optimization). Other internal indices may highlight characteristics that are not necessarily aligned with the concept of differential species. Species-based evaluators (other than the one being optimized) may provide insight but are unlikely to outperform the optimized criterion, especially if their conceptual basis diverges from Ellenberg’s concept. As expected, indirect external validation (using proxies for ground truth and expert knowledge) functions in the same way as in exploratory analyses and should play a crucial role in validating solutions.
In heterogeneous data sets, the highest-TDV partitions may include small groups of relevés or even isolated relevés containing some exclusive species, which some authors may consider outliers. Such isolated relevés may arise from stochastic processes, as demonstrated in the analysis of the artificial data set in this study. However, they may also represent under-sampled vegetation units with a small set of differential species.
Given that DiffVal and TDV were specifically designed to identify patterns of exclusive species, the isolation of such relevés is not only expected but also desirable. The decision to (i) impose a minimum group size (thus forcing these extreme relevés to cluster with others), (ii) exclude them from the analysis (e.g., if they contain errors or exhibit distinct physiognomies), or (iii) accept the isolation as meaningful (e.g., as distinct vegetation types), possibly increasing the total number of groups to accommodate them, must always be guided by expert knowledge, which is not embedded in the data table. Consequently, TDV-optimization cannot be expected to automatically resolve cases involving isolated relevés, if present.
TDV-optimization provides a means of identifying partitions that are likely to be relevant for vegetation classification. However, the researcher must still examine the solutions, possibly assessing the distinctiveness/separability of groups in the geographical space or environmental hyperspace (
By design, TDV-optimization should be robust to spatial autocorrelation, as the species pools associated with a given biogeographical or environmental unit are not expected to change with the inclusion of geographically close relevés. However, a substantial number of geographically proximal relevés sharing a significant number of exclusive species may ultimately form a group. Such an outcome is nonetheless desirable, as it reflects a genuine floristic pattern that should be detected and analysed.
It is also worth noting that, when applied to two relevés, each forming its own group, TDV reduces to the Jaccard distance, i.e., the complement of the Jaccard index (
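The reduction to the Jaccard distance can be checked by hand. The sketch below assumes the definitions given in the main text and not restated in this section: DiffVal of a species is (1/e) Σ (a/b)(c/d) over the groups g in which it occurs, where a is the number of relevés of g containing the species, b the size of g, c the number of relevés outside g that belong to groups from which the species is wholly absent, d the number of relevés outside g, and e the number of groups in which the species occurs; TDV is the mean DiffVal over the species in the table. It further assumes that every species in the table occurs in at least one of the two relevés, here written as species sets A and B, so that b = d = 1.

```latex
\begin{align*}
s \in A \,\triangle\, B &: \quad e = 1,\ \tfrac{a}{b} = 1,\ \tfrac{c}{d} = 1
  \ \Rightarrow\ \mathrm{DiffVal}_s = 1, \\
s \in A \cap B &: \quad e = 2,\ c = 0
  \ \Rightarrow\ \mathrm{DiffVal}_s = 0, \\
\mathrm{TDV} &= \frac{1}{|A \cup B|} \sum_{s \in A \cup B} \mathrm{DiffVal}_s
  = \frac{|A \,\triangle\, B|}{|A \cup B|}
  = 1 - \frac{|A \cap B|}{|A \cup B|}
  = 1 - J(A, B).
\end{align*}
```

The per-species DiffVal is therefore an indicator of whether the species separates the two relevés, and averaging these indicators gives the complement of the Jaccard index.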
Since the inception of vegetation science, different schools of thought have emerged across regions, closely tied to their respective regional ecological contexts. By identifying patterns of differential species in vegetation data, TDV-optimization is expected to be particularly useful in floristically rich regions with high geographical turnover and/or strong environmental filtering, such as the Mediterranean region. In areas where floristic variation is primarily reflected in differences in species cover-abundance or fluctuations in species relative frequency, TDV-optimization may have limited utility, unless broad environmental gradients are analysed.
TDV-optimization is not an instantaneous method yielding a single answer; rather, it is a tool for exploring vegetation data in search of meaningful and useful classifications, potentially revealing different patterns within the same data (see
The partitioning method presented in this article, along with all its specific details, emerged from within a vegetation science conundrum, generally referred to as vegetation classification (see
I dedicate this work to three professors who were instrumental in shaping this study. First, Dieter Mueller-Dombois (posthumously), whom I was fortunate to meet in 2005 during the annual meeting of the International Association for Vegetation Science. Dieter left a lasting impression on me, and his seminal work, alongside that of Heinz Ellenberg, inspired this research many years ago. Second, José Carlos Costa, whose profound knowledge of flora and chorology guided me in the practice of the tabulation technique and the intricate description of vegetation communities. He patiently answered the myriad questions I raised throughout my PhD, enabling me to gain a deep understanding of the underlying principles of the method. Finally, Jorge Orestes Cerdeira, a brilliant and sharp mathematical mind with an exceptional commitment to interdisciplinary dialogue. His great patience and deeply insightful teachings on discrete mathematics – particularly graph theory, combinatorics, logic, optimization, and computer science – equipped me with the tools needed to develop DiffVal and TDV, addressing the complexity of the tabulation method in what I hope is a useful and mathematically elegant manner.
I acknowledge Professor David W. Roberts for his thorough review of an earlier version of this article, his clarifications on the index notation, and his cordial exchange of long letters, which guided me toward a clearer, better-grounded manuscript. I also thank three anonymous reviewers, whose critiques, comments, and generous suggestions have enhanced the manuscript considerably.
T.M.H. was partially funded by the European Social Fund (POCH and NORTE 2020) and by National Funds (MCTES), through a FCT – Fundação para a Ciência e a Tecnologia postdoctoral fellowship (SFRH/BPD/115057/2016), as well as by National Funds, through the same foundation, under the project UIDB/04033/2020.
Initial remarks
Most of the issues discussed in this appendix stem from correspondence with Professor David W. Roberts, who kindly exchanged ideas with me after the submission of the first version of this article. Anticipating that these issues would likely be raised by other researchers, Professor Roberts recommended their discussion. I summarize some of the viewpoints that I have defended, highlighting subtleties that I believe merit consideration and that underpin the development of this method.
DiffVal, TDV, and TDV-optimization have the potential to reignite long-standing debates on the nature of vegetation communities. The use of the word ‘differential’ in their names may also resurrect old terminological discussions. Precisely because such debates have persisted for over a century, without reaching a clear consensus among vegetation scientists, I intentionally avoided delving into these epistemological or terminological matters in the body of the article. The sole objective of the article is to present a method for vegetation classification, using a block-based approach. As with any other method, the suitability of the approach for the intended analysis must be assessed by the user. Furthermore, the article makes no claim to resolve long-standing issues about the nature of vegetation communities.
Going back to the seminal concepts of vegetation science
Characteristic species, fidelity, and differential species: The notion of exclusiveness
At the beginning of the 20th century, characteristic species were those that indicated a particular environment or site. To refine this concept,
The concept of fidelity relies on how the species are distributed across different plant communities. Braun-Blanquet refers to this as the “sociological distribution of species” (
The first three degrees of fidelity imply that (i) all, or almost all, of the presences are confined to a single plant community; (ii) a higher frequency is confined to a single plant community; or (iii) the higher abundances or vitalities are confined to a single plant community. In short, when any of these three conditions hold, the species is said to be centred in that community (see
Since the phytosociological method was first proposed, and before the notions of fidelity and differential species emerged, Braun-Blanquet and Furrer were well aware that characteristic species were not enough to differentiate all plant communities (
Differential species are those appearing “only in one of two or more related societies” (
The loss of the characteristic species at the association level
Braun-Blanquet presents fidelity as a pivotal concept in the study of plant communities at the association level. For him, fidelity is a synthetic quality of plant communities, determined after the associations are established by tabulating each association side by side (
The initial community-unit hypothesis put forward by
From the 1940s to the 1960s, the narrowing of the association concept, as described in
At this stage, it no longer made sense to refer to characteristic species at the association level. In fact,
Differential species, as initially defined, became crucial in the delimitation of associations. Eventually, the floristic distinction between associations was based entirely on differential species. In practice, phytosociologists find combinations of species, with these species potentially serving as differential species in other associations (
Even though the narrowing of the association concept compromised the identification of characteristic species at the association level, there was considerable reluctance to abandon the doctrine in the following decades (
Changes in postulates over the years
Nuances still under the exclusiveness notion
Ellenberg introduces a slightly modified concept of differential species. While Braun-Blanquet and Walo Koch define differential species as species appearing “only in one of two or more related societies”, Ellenberg’s approach allows a differential species to occur in more than one vegetation unit, provided it is absent from at least one of the units under comparison. In other words, the species remains exclusive to one or some of the emerging groups. This adjustment is particularly relevant to the present work, as it is incorporated into the DiffVal formulation.
A shift from the exclusiveness notion to the concentration notion
“If there are no character-species for a grouping; but certain species are present in samples of this grouping and absent or clearly less important in those of another, closely related grouping, these are differential-species for a unit usually of lower rank than the association.” [my own bold emphasis]
Compared to Braun-Blanquet’s initial definition, there are two modifications: (i) Whittaker retains the possibility that a species may be entirely absent from one grouping but relaxes this criterion by adding “or clearly less important”. In contrast, Braun-Blanquet strictly required complete absence from one (or more) of the compared groups. (ii) Differential species are now considered useful for distinguishing units below the association level, whereas Braun-Blanquet also applied the concept at the association level (note that, in Braun-Blanquet’s time, the association concept was even broader).
The concentration notion was also employed by
Barkman extends the concept of fidelity to that of differential species, asserting that this extension is implicit in a work by Koch in 1926, and stating that “there is no reason not to use the fidelity classes for differential taxa as well”. At that time, and still today, differential species form the foundation of vegetation classification, as evidenced by the works of
“Fidelity is one of the most important concepts of the Braun-Blanquet (Zürich-Montpellier) approach. Generally speaking, fidelity is the degree to which a species is concentrated in a given vegetation unit. The fidelity of a species determines whether it can be considered a differential or character species or just a companion or accidental species. A character species can be interpreted as a special case of a differential species: a differential species shows a distinct accumulation of occurrences in one or more vegetation units; whereas a character species should accumulate in only one single vegetation unit (
Bruelheide’s definition of fidelity likely draws support from the previously mentioned works, such as the review by
Considering Bruelheide’s work, I highlight three issues:
These issues, particularly (i) and (ii), are deeply rooted in the fact that Bruelheide built on Barkman’s idea of extending the fidelity concept to the differential species concept.
As we can see, the exclusiveness notion found in the works of Braun-Blanquet, Whittaker, and Mueller-Dombois and Ellenberg progressively mutated into the idea of concentration, which seems to have become well-established in vegetation science. Chytrý et al. (2002) described fidelity (also in general terms) as “a measure of species concentration in vegetation units” and, building upon
Later,
Tables
Exclusiveness versus concentration: What are the consequences?
In the previous subchapters, I have examined the emergence of the definitions of fidelity and differential species, as well as their changes over the years, highlighting the existence of two paradigms: one based on exclusiveness and the other on concentration.
Concentration-based fidelity is now used in the examination of syntaxa after a classification has been obtained, rather than during the classification process itself. This is understandable, as previous attempts to use fidelity directly in classification have led researchers into paradoxes, circular definitions, and begging-the-question misconceptions. Fidelity is derived after the different syntaxa are established (
I will continue exploring the fidelity concept here, as some readers might find it interesting to highlight the differences between DiffVal and other indices currently used in the study of concentration-based fidelity. However, I caution the reader that DiffVal, TDV and TDV-optimization are grounded in the exclusiveness-based differential species concept, as implemented in
If we closely examine the first three degrees of the original fidelity concept (
The first two degrees of the original fidelity concept can also be described as a concentration or accumulation of presences in one (or a few) groups, with exclusiveness to that group (or those groups) guaranteed. This is where we begin to spot some disparities: IndVal, the phi coefficient, and similar indices do not ensure exclusiveness to the groups and can return intermediate to high values for species with presences in all the groups being compared.
Table
Species 5 and 8 demonstrate how the phi coefficient returns reasonably high values, detecting the concentration of the presences in one of the groups, even though exclusiveness to that group is not ensured. When considering the best group combination, species 4, 7 and 9 show cases where the phi coefficient behaves quite differently from DiffVal. Similarly, the high values of IndVal and the point-biserial correlation coefficient for species 8, 9 and 10 illustrate how these indices detect the concentration of cover in one group, again irrespective of exclusiveness. Note that DiffVal is zero for all three species. When considering the best group combination, species 4, 7, 9 and 10 provide examples where IndVal and the point-biserial correlation coefficient can diverge significantly from DiffVal.
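The behaviour described for species 5 can be reproduced with a short Python sketch. It assumes the group-equalized forms of IndVal and of the phi coefficient as defined by De Cáceres and Legendre (2009), restricted here to presence-absence data and single target groups; the function names are hypothetical, and the code is an illustration rather than the routine used to build the table. IndVal is taken as the product A × B (not its square root), which is what matches the tabulated values.

```python
from math import sqrt


def relative_frequencies(presences, groups):
    """Relative frequency of the species in each group."""
    freqs = {}
    for g in sorted(set(groups)):
        members = [p for p, lab in zip(presences, groups) if lab == g]
        freqs[g] = sum(members) / len(members)
    return freqs


def indval_g(presences, groups, target):
    """Group-equalized IndVal (A x B) for one target group, presence-absence only."""
    f = relative_frequencies(presences, groups)
    total = sum(f.values())
    if total == 0:
        return 0.0
    specificity = f[target] / total  # A: share of equalized occurrences in target
    fidelity = f[target]             # B: frequency within the target group
    return specificity * fidelity


def phi_g(presences, groups, target):
    """Group-equalized phi coefficient for one target group.
    Returns None when undefined (e.g. species present in every relevé)."""
    f = relative_frequencies(presences, groups)
    k = len(f)
    total = sum(f.values())
    denom = sqrt(total * (k - total) * (k - 1))
    if denom == 0:
        return None
    return (k * f[target] - total) / denom


# Species 5 of the comparison table: present in all three groups,
# but with presences concentrated in group 1.
groups = [1] * 10 + [2] * 13 + [3] * 7
species_5 = [int(ch) for ch in "1110111010" + "0100010000100" + "0000100"]

phi_5 = [round(phi_g(species_5, groups, g), 2) for g in (1, 2, 3)]
indval_5 = [round(indval_g(species_5, groups, g), 2) for g in (1, 2, 3)]
print(phi_5, indval_5)
```

Both indices single out group 1 for species 5, whereas DiffVal is zero for this species because it is not wholly absent from any group.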
The third degree of the original fidelity concept is the closest to a general notion of concentration, as it is postulated almost independently of exclusiveness (i.e., it assumes the species occurs in “several” associations/communities – though not necessarily in all of them). The third degree is also the closest to the quantity A used in IndVal. However, it is important to note that a species may occur in all the groups under analysis and still show a high IndVal, while its practical diagnostic value may remain questionable (see, e.g., species 8 in Table
While the concentration notion seems to generalize the exclusiveness notion (i.e., a species considered exclusive to one vegetation unit has its presences fully concentrated in it), this is not entirely the case. We can easily think of a species with concentrated presences (or concentrated cover) in one vegetation unit, yet not exclusive to any of the vegetation units under consideration (see, e.g., species 5 and 8 in Table
From Table
The concepts of fidelity (and characteristic species) and their nuances over the years (relevant subtleties in italics). When authors provide numerical approaches to these concepts, the characteristics of such approaches are also considered in the description of the concepts’ intensions presented in the table.
Year | Author(s) | Number of levels of characteristic species | Intensions¹ (species are required to occur exclusively in one to some communities): species presences exclusive or almost exclusive of one… | Intensions¹ (species are required to occur exclusively in one to some communities): a higher frequency of the species is found in one… | Intensions¹: species may occur in several communities, but the higher abundance/cover or vitalities are found in one… | Intensions¹ (the exclusiveness requirement is fully relaxed): species presences/cover are concentrated or accumulated in one or more than one… | Intensions¹: geographical context | Intensions¹: floristic-vegetational context | Obtained… | Used… |
---|---|---|---|---|---|---|---|---|---|---|
1913 | Braun-Blanquet and Furrer | 1 | …association. | --- | --- | --- | ≈ Regional (not clearly stated). | As defined by the regional context (not clearly stated). | …simultaneously to classification.² | …to demarcate the association³ level. |
1918 / 1932 | Braun-Blanquet | 3 | …syntaxon. | …syntaxon. | …syntaxon. | --- | Regional. | As defined by the regional context. | …after classification. | …during the ranking of the vegetation units (syntaxonomic positioning). |
1956 | Ellenberg | 3 | …syntaxon. | …syntaxon. | …syntaxon. | --- | Local, regional, and absolute. | As defined by the regional context. | …after classification. | …in ranking (syntaxonomic positioning). |
1962 | Whittaker | Not relevant | …syntaxon⁴ | …syntaxon⁴ | …syntaxon⁴ | --- | As defined by the set of samples/relevés being analysed. | Set of samples/relevés. | Not clear.⁵ | …to determine the complete characteristic species-combination, i.e. classification; but also used to characterize the different syntaxonomic ranks.⁵ |
1974 | Mueller-Dombois and Ellenberg | 3 | …syntaxon. | …syntaxon. | …syntaxon. | --- | Local, regional, and absolute. | As defined by the regional context. | …after classification. | …during the ranking of the vegetation units (syntaxonomic positioning). |
1978 | Westhoff and van der Maarel | 3 | …syntaxon. | …syntaxon. | …syntaxon. | ---⁶ | Local, regional, and general. | As defined by the regional context. | …after classification. | …at the end of the synthetical phase, preparing the syntaxonomic phase. |
1989 | Barkman | ---⁷ | ---⁷ | ---⁷ | ---⁷ | …syntaxon.⁸ | Regional. | As defined by the regional context.⁹ | …after classification. | …to express the ideas of optima and role of species in plant communities (or other elementary coenological units). |
1997 | Dufrêne and Legendre | Not relevant | --- | --- | --- | …groups of a certain classification. | As defined by the set of samples/relevés being analysed. | Not relevant | …after classification. | …to explore the bond between species and the groups of a certain classification. |
2000 | Bruelheide | Not relevant | --- | --- | --- | …vegetation unit | As defined by the set of samples/relevés being analysed. | Set of samples/relevés. | …during classification.¹⁰ | …to demarcate vegetation units.¹⁰ |
The concept of differential species and its nuances over the years (relevant subtleties in italics). When authors provide numerical approaches to this concept, the characteristics of such approaches are also considered in the description of the concept’s intensions presented in the table.
Year | Author(s) | Intensions¹: floristic-vegetational context | Intensions¹ (considering the given floristic/vegetational context): differential species presences are exclusive to… | Intensions¹ (considering the given floristic/vegetational context): differential species presences/cover are concentrated but not necessarily exclusive to… | Obtained… | Used… |
---|---|---|---|---|---|---|
1925 | Braun-Blanquet and Walo Koch | Two or more similar vegetation units. | …one of the vegetation units under comparison. (N.b., cannot occur in all of them). | --- | …during classification. | …to define associations and lower ranks, i.e., used in classification.² |
1956 | Ellenberg | The set of relevés under analysis. (Usually, a set of similar relevés, potentially containing two or more vegetation units). | …one or more than one of the emerging vegetation units. (N.b., cannot occur in all of them). | --- | …during classification (more concretely during tabulation). | …to define unranked vegetation units. (I.e., used in classification, specifically through tabulation.) |
1962 | Whittaker | Two closely related vegetation units, among the sampled plots (pairwise comparisons among the emerging groups). | --- | …one vegetation unit. (N.b., possibly also occurring in the second unit under comparison). | …during classification. | …to define lower ranks than association, i.e., used in classification. |
1974 | Mueller-Dombois and Ellenberg | The set of relevés under analysis. (Usually, a set of similar relevés, potentially containing two or more vegetation units). | …one or more than one of the emerging vegetation units. (N.b., cannot occur in all of them.) | --- | …during classification (more concretely during tabulation). | …to define unranked vegetation units. (I.e., used in classification, specifically through tabulation.) |
1978 | Westhoff and van der Maarel (these authors distinguish differential and differentiating species): differentiating species | The set of relevés under analysis. (As defined by Ellenberg, see above in this table.) | …one or more than one of the emerging vegetation units. (As defined by Ellenberg) | --- | …during classification. (As defined by Ellenberg) | …to define unranked vegetation units. (As defined by Ellenberg) |
1978 | Westhoff and van der Maarel: differential species | Two closely related syntaxa. | --- | …one vegetation unit. (N.b., possibly also occurring in the second unit under comparison). | …while comparing two syntaxa. | …to define lower ranks than association, i.e. used in classification. |
1989 | Barkman | The next higher syntaxon, thus two or more vegetation units. | --- | …one or more than one vegetation unit(s) (N.b., possibly also occurring in all the other units under comparison). | Not clear.³ | Not clear.³ |
1997 | Dufrêne and Legendre | Sites/relevés under analysis. | --- | …one or more than one vegetation unit(s) (N.b., possibly also occurring in all the other units under comparison). | …after classification.⁴ | …to explore the bond between species and the groups of a certain classification.⁴ |
2000 | Bruelheide | The next higher syntaxon. | --- | …one or more than one vegetation unit(s) (N.b., possibly also occurring in all the other units under comparison). | …by optimization of indices.⁵ | …to assist the extraction of differential species groups, i.e. to assist classification.⁶ |
2025 | (the current work) | The set of relevés under analysis. (Usually, a set of similar relevés, potentially containing two or more vegetation units). | …one or more than one of the emerging vegetation units. (N.b., cannot occur in all of them). | --- | …by optimization of TDV. | …to define unranked vegetation units. (I.e., used in classification, specifically through tabulation). |
Exemplifying how IndVal, the phi coefficient, the point-biserial correlation coefficient, and DiffVal behave depending on exclusiveness to groups, concentration of presences, and concentration of cover.
Species | Relevés 1–10, 11–23, 24–30 (groups 1, 2, 3) | Short description of the example | IndVal^g_ind for single groups (1 / 2 / 3) | IndVal^g_ind for the best group combination | r^g_ϕ for single groups (1 / 2 / 3) | r^g_ϕ for the best group combination | r^g_pb for single groups (1 / 2 / 3) | r^g_pb for the best group combination | DiffVal |
---|---|---|---|---|---|---|---|---|---|
Relevé no. (tens digit) | 0000000001 1111111112222 2222223 | | | | | | | | |
Relevé no. (units digit) | 1234567890 1234567890123 4567890 | | | | | | | | |
Group/community | 1111111111 2222222222222 3333333 | | | | | | | | |
species 1 | 1111111111 0000000000000 0000000 | Presences are concentrated in group 1 and exclusive to that group | 1.00* / 0.00 / 0.00 | 1.00* (group 1) | 1.00* / -0.50 / -0.50 | 1.00* (group 1) | 1.00* / -0.50 / -0.50 | 1.00* (group 1) | 1.00 |
species 2 | 0000000000 0000000000000 1111111 | Presences are concentrated in group 3 and exclusive to that group | 0.00 / 0.00 / 1.00* | 1.00* (group 3) | -0.50 / -0.50 / 1.00* | 1.00* (group 3) | -0.50 / -0.50 / 1.00* | 1.00* (group 3) | 1.00 |
species 3 | 1110111010 0000000000000 0000000 | Presences are concentrated in group 1 and exclusive to that group | 0.70* / 0.00 / 0.00 | 0.70* (group 1) | 0.78* / -0.39 / -0.39 | 0.78* (group 1) | 0.78* / -0.39 / -0.39 | 0.78* (group 1) | 0.70 |
species 4 | 1110111010 0000000000000 1011110 | Presences are concentrated in groups 1 and 3 and exclusive to those groups | 0.35 / 0.00 / 0.36 | 0.71* (groups 1 and 3) | 0.32 / -0.67 / 0.34 | 0.67* (groups 1 and 3) | 0.32 / -0.67 / 0.34 | 0.67* (groups 1 and 3) | 0.43 |
species 5 | 1110111010 0100010000100 0000100 | Presences are concentrated in group 1 yet not exclusive to any group | 0.46* / 0.05 / 0.02 | 0.46 (group 1) | 0.50* / -0.19 / -0.32 | 0.50* (group 1) | 0.50* / -0.19 / -0.32 | 0.50* (group 1) | 0.00 |
species 6 | 243035+534 0000000000000 0000000 | Presences (and cover) are concentrated in group 1 and exclusive to that group | 0.90* / 0.00 / 0.00 | 0.90* (group 1) | 0.93* / -0.46 / -0.46 | 0.93* (group 1) | 0.76* / -0.38 / -0.38 | 0.76* (group 1) | 0.90 |
species 7 | 243035+534 0000000000000 120211+ | Presences (and cover) are concentrated in groups 1 and 3 and exclusive to those groups | 0.79* / 0.00 / 0.10 | 0.88* (groups 1 and 3) | 0.45* / -0.84 / 0.39 | 0.84* (groups 1 and 3) | 0.72* / -0.44 / -0.28 | 0.72* (group 1) | 0.53 |
species 8 | 243035+534 000201000+00+ 00020+0 | Presences (and cover) are concentrated in group 1 yet not exclusive to any group | 0.83* / 0.01 / 0.01 | 0.83* (group 1) | 0.57* / -0.27 / -0.30 | 0.57* (group 1) | 0.73* / -0.38 / -0.36 | 0.73* (group 1) | 0.00 |
species 9 | 3435355534 2232312121232 120211+ | Cover is concentrated in group 1 (and group 2) yet not exclusive to any group | 0.73* / 0.20 / 0.06 | 0.97¹ (groups 1, 2, and 3) | 0.16 / 0.16 / -0.32 | 0.32 (groups 1 and 2) | 0.84* / -0.27 / -0.56 | 0.84* (group 1) | 0.00 |
species 10 | 3435355534 2232312121232 12+211+ | Cover is concentrated in group 1 (and group 2) yet not exclusive to any group. Presences in all relevés. | 0.73* / 0.20 / 0.07 | 1.00¹ (groups 1, 2, and 3) | NA / NA / NA | NA | 0.84* / -0.28 / -0.56 | 0.84* (group 1) | 0.00 |
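The DiffVal column of the table can be recomputed from the relevé strings alone. The following is a minimal Python sketch, not the code of the R package. It assumes the DiffVal definition given earlier in the article: for each group g in which the species occurs, the share of relevés of g containing the species (a/b) is multiplied by the share of relevés outside g that belong to groups where the species is completely absent (c/d); the products are summed and divided by e, the number of groups in which the species occurs. The names `diffval` and `to_presence` are illustrative only, and cover codes other than `0` are treated as presences.

```python
def to_presence(codes):
    """Turn a string of cover codes into 0/1 values ('0' = absent, anything else = present)."""
    return [0 if ch == "0" else 1 for ch in codes]


def diffval(presences, groups):
    """DiffVal of one species under the definition assumed in the lead-in."""
    labels = sorted(set(groups))
    if len(labels) < 2:
        raise ValueError("at least two groups are needed")
    size = {g: 0 for g in labels}   # b: relevés per group
    hits = {g: 0 for g in labels}   # a: presences per group
    for x, g in zip(presences, groups):
        size[g] += 1
        hits[g] += 1 if x else 0
    occupied = [g for g in labels if hits[g] > 0]
    if not occupied:
        return 0.0
    # c: relevés in groups from which the species is completely absent
    c = sum(size[g] for g in labels if hits[g] == 0)
    n = len(groups)
    total = sum((hits[g] / size[g]) * (c / (n - size[g])) for g in occupied)
    return total / len(occupied)    # divide by e


groups = [1] * 10 + [2] * 13 + [3] * 7
table = {
    "species 1": "1111111111" "0000000000000" "0000000",
    "species 2": "0000000000" "0000000000000" "1111111",
    "species 3": "1110111010" "0000000000000" "0000000",
    "species 4": "1110111010" "0000000000000" "1011110",
    "species 5": "1110111010" "0100010000100" "0000100",
    "species 6": "243035+534" "0000000000000" "0000000",
    "species 7": "243035+534" "0000000000000" "120211+",
    "species 8": "243035+534" "000201000+00+" "00020+0",
    "species 9": "3435355534" "2232312121232" "120211+",
    "species 10": "3435355534" "2232312121232" "12+211+",
}
results = {name: round(diffval(to_presence(codes), groups), 2)
           for name, codes in table.items()}
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the values agree with the DiffVal column. Species 5 and species 8 to 10 score zero because they occur in every group, so no relevé outside a given group belongs to a group without the species.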
On the significance testing for the association between a species and a group of relevés
“Finally, practitioners should keep in mind that if the classification of the sites has been obtained from the species composition itself (for example by K-means partitioning), the site groups would not be completely independent of the species data. In such a case of circularity, we can expect more significantly associated species than expected by chance only.”
We must bear in mind that, in vegetation science, the classification of relevés is expected to emerge from species composition itself (i.e., be floristically based). Therefore, using such inference in this context is not advisable, as it would violate the independence assumption of significance testing (see
“More indicator species will be found than expected by chance when the classification of sites has been obtained from the species composition itself (De Cáceres and Legendre 2009). In this case, p-values must be taken with caution: they do not result from a genuine test of significance since the classification of sites is not independent from the species data used in the indicator species analysis.”
This is also why p-values are interpreted as a direct measure of fidelity rather than as probabilities in hypothesis testing (see, e.g.,
For those interested in p-values, a time-consuming approach can be used to assess whether a given TDV (and the associated differential species pattern) could arise by chance – i.e., under the null hypothesis that the pattern does not reflect an underlying structure in species distribution. This can be tested with a Monte Carlo permutation test, optimizing TDV multiple times on randomized tables by permuting each species’ presences among the relevés. A p-value is then obtained by computing the permutational probability, following
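The permutation scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure implemented in the R package. The function `best_tdv_two_groups` is a hypothetical stand-in for TDV-optimization: it simply tries every split of a very small table into two groups. TDV is taken as the mean DiffVal over species, using the definition given earlier in the article. The p-value uses one common convention, (1 + number of randomized tables reaching the observed TDV) / (1 + number of permutations); whether this matches the formulation cited above is an assumption.

```python
import random


def tdv(table, groups):
    """Mean DiffVal over all species (rows of 0/1 values) for a given partition."""
    labels = sorted(set(groups))
    size = {g: groups.count(g) for g in labels}
    n = len(groups)
    total = 0.0
    for row in table:
        hits = {g: 0 for g in labels}
        for x, g in zip(row, groups):
            hits[g] += x
        occupied = [g for g in labels if hits[g] > 0]
        if not occupied:
            continue  # a species absent from the whole table adds nothing
        # relevés in groups from which the species is completely absent
        c = sum(size[g] for g in labels if hits[g] == 0)
        s = sum(hits[g] / size[g] * c / (n - size[g]) for g in occupied)
        total += s / len(occupied)
    return total / len(table)


def best_tdv_two_groups(table):
    """Exhaustive stand-in for TDV-optimization: best TDV over all two-group splits."""
    n = len(table[0])
    best = 0.0
    # The first relevé stays in group 0, so mirrored splits are not repeated.
    for mask in range(1, 2 ** (n - 1)):
        groups = [0] + [(mask >> r) & 1 for r in range(n - 1)]
        best = max(best, tdv(table, groups))
    return best


def permutation_p_value(table, n_perm=99, seed=1):
    """Re-optimize TDV on tables in which each species row is shuffled independently."""
    rng = random.Random(seed)
    observed = best_tdv_two_groups(table)
    at_least = 0
    for _ in range(n_perm):
        shuffled = []
        for row in table:
            row = list(row)
            rng.shuffle(row)  # permute this species' presences among the relevés
            shuffled.append(row)
        if best_tdv_two_groups(shuffled) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Artificial table: 6 species, 8 relevés, two perfectly exclusive blocks.
block_table = [[1, 1, 1, 1, 0, 0, 0, 0]] * 3 + [[0, 0, 0, 0, 1, 1, 1, 1]] * 3
observed_tdv, p_value = permutation_p_value(block_table)
print(observed_tdv, p_value)
```

Exhaustive search is only feasible for toy tables. With real data, each randomized table would need a full run of the optimization procedure, which is why the approach is time-consuming.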
Concluding remarks: Different contexts might call for different paradigms
The subtle changes found in
The recognized loss of the exclusiveness notion of characteristic species at the association level should not have been extended to the concept of differential species. Floristics play a key role in the phytosociological approach. As it is currently postulated, concentration-based measures are ineffective in the vegetation classification process at the association level.
The DiffVal includes a measure of how exclusive a species is to one (or more) groups, i.e., a measure of the degree of exclusiveness. It is intended as a single measure for each species in a table, determined solely by the partition of the samples (note that, once a specific partition is given, it is straightforward to identify which species occur exclusively in each group or groups). In the case of IndVal, the phi coefficient, or the point-biserial correlation coefficient, comparisons are always pairwise (target group vs. all other groups). For these indices, it is irrelevant how the species’ presences/absences are distributed among the remaining groups – i.e., outside the target group or combination of groups. This is not true for the DiffVal summands, marking a fundamental difference between DiffVal and the other indices.
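The difference can be illustrated with two hypothetical species that have identical presences in the target group and the same number of presences outside it, but distributed differently among the remaining groups. The Python sketch below uses the plain phi coefficient (not the group-equalized variant reported in the table) for simplicity, and assumes the DiffVal definition given earlier in the article. The function names and the two species are illustrative only.

```python
from math import sqrt


def phi(presences, groups, target):
    """Plain phi coefficient between a species and one target group (target vs. rest)."""
    n_all = len(groups)
    n_target = sum(1 for g in groups if g == target)
    occ_all = sum(presences)
    occ_target = sum(x for x, g in zip(presences, groups) if g == target)
    return (n_all * occ_target - occ_all * n_target) / sqrt(
        occ_all * n_target * (n_all - occ_all) * (n_all - n_target)
    )


def diffval(presences, groups):
    """DiffVal of one species: (1/e) * sum over occupied groups of (a/b) * (c/d)."""
    labels = sorted(set(groups))
    size = {g: groups.count(g) for g in labels}
    hits = {g: sum(x for x, h in zip(presences, groups) if h == g) for g in labels}
    occupied = [g for g in labels if hits[g] > 0]
    if not occupied:
        return 0.0
    # c: relevés in groups from which the species is completely absent
    c = sum(size[g] for g in labels if hits[g] == 0)
    n = len(groups)
    return sum(hits[g] / size[g] * c / (n - size[g]) for g in occupied) / len(occupied)


groups = [1] * 10 + [2] * 13 + [3] * 7

# Both species fill group 1 and have four presences outside it.
# Species A keeps them all in group 2, so group 3 stays empty.
species_a = [1] * 10 + [1, 1, 1, 1] + [0] * 9 + [0] * 7
# Species B spreads them over groups 2 and 3, so no group is empty.
species_b = [1] * 10 + [1, 1] + [0] * 11 + [1, 1] + [0] * 5

phi_a, phi_b = phi(species_a, groups, 1), phi(species_b, groups, 1)
dv_a, dv_b = diffval(species_a, groups), diffval(species_b, groups)
print(f"phi for group 1: A = {phi_a:.3f}, B = {phi_b:.3f}")
print(f"DiffVal:         A = {dv_a:.3f}, B = {dv_b:.3f}")
```

The target-versus-rest comparison cannot tell the two species apart, whereas DiffVal is positive for species A (it still differentiates groups 1 and 2 from group 3) and zero for species B.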
I hope I have demonstrated why approaches based on the concentration notion are paradigmatically different from those based on the exclusiveness notion. This is the primary reason why comparisons between them are difficult, or even inconsistent. I believe that the block-based approaches are the only ones truly comparable to TDV-optimization.
In some regional contexts, particularly those with generally low diversity and low environmental roughness, an approach based on concentration without ensuring exclusiveness might be the only useful way of identifying vegetation types, as there may not be sufficient richness and turnover to find species exhibiting some degree of exclusiveness. In regions with high diversity, rich in endemism, and with significant environmental variation and roughness, an approach that ensures exclusiveness can be more effective for understanding the vegetation phenomenon, helping to reduce noise caused by numerous species with very high frequencies (e.g., dominant and companion species).
It is not my intention to raise any objections to approaches based on concentration. First, when analysing (or tabulating) highly dissimilar vegetation (e.g., higher syntaxonomic ranks), I have observed that concentration-based measures give results similar to those based on exclusiveness. Second, in regions with low species turnover and low environmental roughness, such approaches have proven useful.
The idea of a universal method to classify vegetation stands is appealing, but not necessarily feasible. Nearly a century ago,
“The apparent simplicity of the analysis of the concept of vegetation is directly in contrast with the difficulty of making any universal rules. Sometimes it is entirely impossible to submit different vegetation types to similar methods of treatment.”
The two different paradigms might have to coexist.
R code for auxiliary functions (OptimClass and partition discrepancy) (*.pdf)
Annotated R code for constructing the Example 1 data set and partitioning via TDV-optimization (*.pdf)
Graphical output of the “tabulation” function for each partition obtained in Example 1, alongside ordination biplots (*.pdf)
Statistical power analysis of Fisher’s exact test, adjusted for Example 1 (includes R code) (*.pdf)
Variation in OptimClass 1 with α for each clustering/partitioning method in Example 1 (*.pdf)
Analysis of local optima in the optimization procedures used in Example 1 (*.pdf)
Graphical output of the “tabulation” function for each partition obtained in Example 2, alongside ordination biplots (*.pdf)
Graphical output of the “tabulation” function for the first ten solutions of TDV-optimization, searching for three groups, in Example 2 (*.pdf)
Graphical output of the “tabulation” function for the first ten solutions of TDV-optimization, searching for two groups, in Example 2 (*.pdf)
A tentative bibliographic list of works containing numerical approaches for comparing or classifying vegetation plot data (*.pdf)