Compositional Data Analysis
January 5th, 2021
Overview
Compositional Data Analysis is very useful for measuring the relative values of components within a larger whole: for example, the proportions of each mineral in a rock, the proportions of each color/flavor of Skittle in a bag, or the topics in a given document. Compositional data analysis is not complicated, but I have not found a quickstart guide for it. This post is intended to be that quickstart guide.
Ratios and Logratios
Subcompositional coherence says that results computed from a subcomposition should be the same as those computed from the full composition. Raw proportions, however, are not subcompositionally coherent: the proportions of the parts in a subcomposition differ from those of the same parts in the original composition. For example, when analyzing time spent on daily activities, excluding sleep will produce different results than if all activities were included.
Sleeping | Eating | Exercising | Reading | Chores | Working |
---|---|---|---|---|---|
8 hrs | 1 hr | 2 hrs | 1 hr | 2 hrs | 10 hrs |
Based on a 24 hour day, the parts of the compositional data would then be
Sleeping | Eating | Exercising | Reading | Chores | Working |
---|---|---|---|---|---|
0.3333 | 0.0417 | 0.0833 | 0.0417 | 0.0833 | 0.4167 |
But if we were to exclude sleep from this composition, the subcomposition would be based on a 16 hour day, which would cause the relationships between the parts to differ:
Sleeping | Eating | Exercising | Reading | Chores | Working |
---|---|---|---|---|---|
– | 0.0625 | 0.125 | 0.0625 | 0.125 | 0.625 |
This makes the values from the subcomposition incomparable to those of the original composition. To resolve this discrepancy, we use ratios between proportions to compare compositional data. Ratios respect the principle of subcompositional coherence, which is why they are fundamental to compositional data analysis. Respecting subcompositional coherence also allows summary statistics to be taken of compositional data, as those statistics will be the same for subcompositions and compositions of the same data.
Eating / Working | Exercising / Working | Reading / Working | Chores / Working | Sleeping / Working | Eating / Sleeping | Exercising / Sleeping | Reading / Sleeping | Chores / Sleeping | Working / Sleeping |
---|---|---|---|---|---|---|---|---|---|
0.1 | 0.2 | 0.1 | 0.2 | 0.8 | 0.125 | 0.25 | 0.125 | 0.25 | 1.25 |
And the ratios computed from the subcomposition (with sleep excluded):

Eating / Working | Exercising / Working | Reading / Working | Chores / Working |
---|---|---|---|
0.1 | 0.2 | 0.1 | 0.2 |
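These tables are easy to verify in a few lines of Python. The sketch below recomputes the proportions and confirms that the ratios available in the subcomposition match those from the full composition:

```python
hours = {'Sleeping': 8, 'Eating': 1, 'Exercising': 2,
         'Reading': 1, 'Chores': 2, 'Working': 10}

def proportions(values):
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}

full = proportions(hours)
sub = proportions({k: v for k, v in hours.items() if k != 'Sleeping'})

# ratios between the remaining parts are identical in both compositions
for part in sub:
    assert abs(full[part] / full['Working'] - sub[part] / sub['Working']) < 1e-12
```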
The ratios clearly respect subcompositional coherence. Ratios are strictly positive values, but they can range widely depending on the parts they are constructed from, which can produce skewed statistical distributions where two standard deviations below the mean fall well under zero (an impossible value for a ratio). The common approach to this problem is converting the ratios to logratios by applying a log transformation, which addresses this issue and provides a few other benefits as well:
- Converts the strictly positive values into real numbers that can be positive or negative, which addresses the standard deviation issue.
- Makes the statistical distribution more symmetric
- Reduces the effect of outliers
- Converts the ratios to an interval scale, which is key for statistical computations such as means, variances, and regression models.
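The symmetry benefit is easy to see with a pair of reciprocal ratios from the table above: on the raw scale they are wildly asymmetric (0.1 vs 10.0), while their logratios are mirror images around zero:

```python
import math

eating_over_working = 1 / 10   # 0.1
working_over_eating = 10 / 1   # 10.0

# log(a/b) = -log(b/a): reciprocal ratios become symmetric around zero
assert abs(math.log(eating_over_working) + math.log(working_over_eating)) < 1e-12
```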
Code & Usage
We have covered a couple of approaches for compositional data analysis. These calculations are all self-contained, which makes them low-hanging fruit for modularization. I have done so, resulting in the following Composition class for building compositional data from raw values and for efficiently computing proportions, ratios, and logratios from those raw values.
```python
import math
import itertools


class Composition(object):
    """
    A class for efficiently managing/updating values for compositional data
    analysis in Python
    """

    __slots__ = ('_values', '_proportions', '_ratios', '_logratios', '_base')

    def __init__(self, raw_values, base=math.e):
        # Copy so that later mutations don't alias the caller's dict
        self._values = dict(raw_values)
        self._proportions = self._proportions_from_raw_values(self._values)
        self._ratios = self._ratios_from_proportions(self._proportions)
        self._logratios = self._logratios_from_proportions(self._proportions, base=base)
        self._base = base

    def __repr__(self):
        values = []
        for key in self._values:
            values.append('%r=%r' % (key, self._values[key]))
        values = ', '.join(values)
        return '<%s %s %s>' % (self.__class__.__name__, id(self), values)

    def __setitem__(self, key, value):
        """
        Set/update the value for a key
        """
        if not isinstance(value, (int, float)):
            raise ValueError('value (%r) must be an integer or a float' % value)
        if value <= 0:
            raise ValueError('value (%r) must be > 0' % value)
        self._values[key] = value
        updated_keys = set([key])
        # The total changes, so every proportion is recalculated; ratios and
        # logratios between unchanged parts are unaffected, so only the pairs
        # involving the updated key are recomputed.
        self._proportions = self._proportions_from_raw_values(self._values)
        self._ratios = self._ratios_from_proportions(self._proportions, self._ratios, updated_keys)
        self._logratios = self._logratios_from_proportions(self._proportions, self._logratios, updated_keys, base=self._base)

    def __getitem__(self, key):
        return self._values[key]

    def __delitem__(self, key):
        self.pop(key)

    def __iter__(self):
        for key in self._values:
            yield key

    def __contains__(self, key):
        return key in self._values

    def pop(self, key):
        self._values.pop(key)
        self._proportions.pop(key)
        self._ratios.pop(key)
        self._logratios.pop(key)
        for other_key in self._ratios:
            self._ratios[other_key].pop(key, None)
            self._logratios[other_key].pop(key, None)

    def keys(self):
        return self._values.keys()

    def values(self):
        return self._values.values()

    @property
    def base(self):
        return self._base

    @base.setter
    def base(self, value):
        self._base = value
        self._logratios = self._logratios_from_proportions(self._proportions, base=value)

    @property
    def proportions(self):
        return dict(**self._proportions)

    @property
    def ratios(self):
        output = dict()
        for key in self._ratios:
            output[key] = dict(**self._ratios[key])
        return output

    @property
    def logratios(self):
        output = dict()
        for key in self._logratios:
            output[key] = dict(**self._logratios[key])
        return output

    def euclidean_distance(self, other, normalize=False, attributes=None):
        self_keys = set(self.keys())
        other_keys = set(other.keys())
        shared_keys = self_keys & other_keys
        if not shared_keys:
            raise ValueError('These compositions share no attributes')
        if self.base != other.base:
            raise ValueError('Compositions must use the same base in the logratios (%r != %r)' % (self.base, other.base))
        if not attributes:
            attributes = list(itertools.combinations(shared_keys, 2))
        total = 0.0
        for numerator, denominator in attributes:
            y1 = self._logratios[numerator][denominator]
            x1 = other.logratios[numerator][denominator]
            total += math.pow(y1 - x1, 2)
        output = math.sqrt(total)
        if normalize:
            output = output / len(self._values)
        return output

    def alr(self, denominator):
        """
        Additive logratio (ALR) transformation; returns a dict mapping each
        remaining part to its logratio against the chosen denominator
        """
        if denominator not in self:
            raise ValueError('Denominator must be a part of the composition')
        keys = sorted(k for k in self.keys() if k != denominator)
        output = dict()
        for numerator in keys:
            output[numerator] = self._logratios[numerator][denominator]
        return output

    def clr(self):
        """
        Centered logratio (CLR) transformation
        """
        mean_log = self.geometric_mean()
        output = dict()
        for key in self.keys():
            output[key] = math.log(self._values[key], self._base) - mean_log
        return output

    def geometric_mean(self):
        # Returns the mean of the log values, i.e. the log of the geometric
        # mean of the raw values, which is what the CLR subtracts
        total = 0.0
        for key in self.keys():
            total += math.log(self._values[key], self._base)
        return total / len(self._values)

    @staticmethod
    def _proportions_from_raw_values(raw_values, proportions=None, updated_keys=None):
        """
        Calculates proportions from raw values

        Parameters:
            raw_values (dict):
            proportions (dict|None):
            updated_keys (list|tuple|set|None):

        Returns:
            dict: the proportion of the whole for each part
        """
        keys = tuple(raw_values)
        if updated_keys:
            other_keys = updated_keys
        else:
            other_keys = keys
        total = float(sum(raw_values.values()))
        if not proportions:
            proportions = dict()
        for key in other_keys:
            proportions[key] = raw_values[key] / total
        return proportions

    @staticmethod
    def _ratios_from_proportions(proportions, ratios=None, updated_keys=None):
        """
        Calculates ratios from proportions

        Parameters:
            proportions (dict):
            ratios (dict|None):
            updated_keys (list|tuple|set|None):

        Returns:
            dict<object, dict>: ratios between parts
        """
        keys = tuple(proportions)
        if ratios:
            for key in keys:
                ratios.setdefault(key, dict())
        else:
            ratios = dict((key, dict()) for key in keys)
        if updated_keys:
            other_keys = updated_keys
        else:
            other_keys = keys
        for key in keys:
            key_value = float(proportions[key])
            for other_key in other_keys:
                if key == other_key:
                    continue
                other_key_value = proportions[other_key]
                ratios[key][other_key] = key_value / other_key_value
                ratios[other_key][key] = other_key_value / key_value
        return ratios

    @staticmethod
    def _logratios_from_proportions(proportions, logratios=None, updated_keys=None, base=math.e):
        """
        Calculates logratios from proportions

        Parameters:
            proportions (dict):
            logratios (dict|None):
            updated_keys (list|tuple|set|None):
            base (float): base for the log transformation

        Returns:
            dict<object, dict>: logratios between parts
        """
        keys = tuple(proportions)
        if updated_keys:
            other_keys = updated_keys
        else:
            other_keys = keys
        if logratios:
            for key in keys:
                logratios.setdefault(key, dict())
        else:
            logratios = dict((key, dict()) for key in keys)
        for key in keys:
            key_value = float(proportions[key])
            for other_key in other_keys:
                if key == other_key:
                    continue
                other_key_value = proportions[other_key]
                logratios[key][other_key] = math.log(key_value / other_key_value, base)
                logratios[other_key][key] = math.log(other_key_value / key_value, base)
        return logratios


if __name__ == '__main__':
    asparagus = dict(carbohydrate=61.07, fat=3.27, protein=35.66)
    compositional_asparagus = Composition(asparagus)
    beans = dict(carbohydrate=35.88, fat=22.07, protein=42.05)
    compositional_beans = Composition(beans)
    euclidean_distance = compositional_asparagus.euclidean_distance(compositional_beans)
    print(euclidean_distance)
```
This implementation also supports computing additive logratios (ALRs), centered logratios (CLRs), and the logratio Euclidean distance between compositions that share parts. Compositions can also be updated over time by changing the raw value of a part, and parts can be added or removed to create subcompositions as needed.
The Composition class has a number of helper methods implemented to make interacting with a Composition instance easier.
This application contains a test of the functionality, in the form of food macronutrient composition. In the example, a dictionary of macronutrients is passed to the Composition class in raw units (grams). The Composition class takes these raw values and calculates the proportions, ratios, and logratios. The developer can also specify the base for the logratios; it defaults to e, the most commonly used base for logratios. To change the base of an existing Composition instance, update its base attribute, which will automatically recompute the logratios.
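As a sketch of what the base setter does under the hood: changing the base simply rescales every logratio by a constant factor, since log_b(x) = ln(x) / ln(b):

```python
import math

ratio = 8 / 10                     # Sleeping / Working from the earlier table
logratio_e = math.log(ratio)       # base e, the default
logratio_10 = math.log(ratio, 10)  # after setting base = 10

# the two bases differ only by the constant factor ln(10)
assert abs(logratio_e / math.log(10) - logratio_10) < 1e-12
```

On a Composition instance, this is just composition.base = 10; the property setter recomputes the stored logratios.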
The developer can optionally add/update any part of the composition using the __setitem__(self, key, value) method. For example, if we wanted to update the amount of protein in asparagus:
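Here is a self-contained sketch of what such an update implies (40.0 grams is a hypothetical new measurement; with the class above the call is simply compositional_asparagus['protein'] = 40.0):

```python
asparagus = dict(carbohydrate=61.07, fat=3.27, protein=35.66)
asparagus['protein'] = 40.0  # hypothetical updated measurement in grams

# every proportion must be recomputed, because the total changed
total = sum(asparagus.values())
proportions = {key: value / total for key, value in asparagus.items()}
assert abs(sum(proportions.values()) - 1.0) < 1e-12
```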
One thing to note is that raw values must be strictly positive, as zero values result in ZeroDivisionError exceptions when creating the ratios and logratios. If a zero value is necessary, I would recommend substituting an extremely small positive value instead, like 1e-28.
Parts can be removed using the __delitem__(self, key) or pop(self, key) methods to create subcompositions. Any changes made using __setitem__, pop, or __delitem__ are automatically reflected in the ratios and logratios.
Update: Aitchison Distance & Log-Ratio Transformation
In real-life applications we may want to integrate these transformations into data science pipelines. One of the key challenges in analyzing compositional data is the constant sum constraint: the components of a composition always sum to a constant (usually 1 or 100%). This constraint violates many assumptions of standard statistical methods. John Aitchison developed several techniques to address this challenge. The Aitchison distance is a distance metric specifically designed for compositional data. It takes into account the relative nature of compositions and is invariant to perturbation and powering operations.
- Scale invariance: Multiplying a composition by a positive constant does not change the distance between compositions. This property is important because compositional data is often subject to a constant sum constraint, and the absolute values of the components may not be as meaningful as their relative proportions.
- Perturbation invariance: Perturbing both compositions by the same composition (the simplex analogue of translation, i.e. component-wise multiplication followed by renormalization) does not affect the distance. This property is useful because compositional data is often analyzed in terms of ratios or log-ratios of components.
- Subcompositional coherence: The distance between two compositions is not affected by the presence or absence of other components in the composition. This property is important because compositional data often involves subcompositions, where some components are analyzed separately from others.
- Geometric interpretation: The Aitchison distance has a geometric interpretation in the simplex space, which is the natural space for compositional data. The simplex is a subset of the real space where all components are non-negative and sum up to a constant. The Aitchison distance can be visualized as the Euclidean distance between the centered log-ratio (CLR) transformations of the compositions in the simplex space.
The Aitchison distance can be used to compare compositions based on the relative proportions of their components rather than absolute values, and can be used in clustering algorithms such as DBSCAN or KMeans without pre-processing the compositional data.
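A minimal standard-library sketch of the Aitchison distance, computed as the Euclidean distance between CLR-transformed compositions (the values reuse the asparagus/beans macronutrient example from earlier); note how scale invariance falls out of the CLR construction:

```python
import math

def aitchison_distance(x, y):
    # Euclidean distance between the CLR transforms of two compositions
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    return math.sqrt(sum(((a - mx) - (b - my)) ** 2 for a, b in zip(lx, ly)))

# Scale invariance: raw grams and closed proportions give identical distances
asparagus = [61.07, 3.27, 35.66]
beans = [35.88, 22.07, 42.05]
d_raw = aitchison_distance(asparagus, beans)
d_closed = aitchison_distance([v / sum(asparagus) for v in asparagus], beans)
assert abs(d_raw - d_closed) < 1e-9
```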
Isometric Log-Ratio
The Isometric Log-Ratio (ILR) transformation is a technique used to transform compositional data from the simplex space to the real space while preserving the Aitchison geometry (and the properties it provides). The ILR transformation allows us to use the Euclidean distance to compare transformed compositions based on the original relative proportions of components. Here is a breakdown of the key properties:
- Simplex to real space: Compositional data lies in a simplex space, which is a subset of the real space where all components are non-negative and sum up to a constant (usually 1 or 100). The ILR transformation maps the compositions from the simplex space to the real space, allowing the use of standard statistical techniques that assume Euclidean geometry.
- Isometry: The ILR transformation is an isometry, which means that it preserves the Aitchison geometry of the simplex. The Aitchison distance between compositions in the simplex space is equal to the Euclidean distance between their ILR-transformed counterparts in the real space. This property ensures that the relative distances and relationships between compositions are maintained after the transformation.
- Orthonormal basis: The ILR transformation relies on the construction of an orthonormal basis for the simplex space. An orthonormal basis is a set of vectors that are orthogonal (perpendicular) to each other and have unit length. The choice of the orthonormal basis is not unique, and different bases can be used depending on the specific problem or interpretation desired.
The choice of the orthonormal basis for the ILR transformation can impact the interpretation of the results. Different bases may highlight different balances or contrasts between components, so the selection of the basis should be guided by the specific research question or domain knowledge.
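One common choice of orthonormal basis yields so-called pivot coordinates. The sketch below is a plain-Python illustration (not a full library implementation) showing the isometry property: the ILR distance matches the CLR/Aitchison distance, and the transform yields D − 1 coordinates for D parts:

```python
import math

def clr(x):
    # centered log-ratio: log of each part minus the mean of the logs
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def ilr(x):
    # pivot coordinates: one particular orthonormal basis for the simplex
    D = len(x)
    coords = []
    for i in range(D - 1):
        tail = x[i + 1:]
        geo_mean = math.exp(sum(math.log(v) for v in tail) / len(tail))
        coef = math.sqrt((D - i - 1) / (D - i))
        coords.append(coef * math.log(x[i] / geo_mean))
    return coords

a = [0.2, 0.3, 0.5]
b = [0.4, 0.4, 0.2]
dist_clr = math.sqrt(sum((p - q) ** 2 for p, q in zip(clr(a), clr(b))))
dist_ilr = math.sqrt(sum((p - q) ** 2 for p, q in zip(ilr(a), ilr(b))))
assert abs(dist_clr - dist_ilr) < 1e-9   # the ILR preserves Aitchison distance
assert len(ilr(a)) == len(a) - 1         # full rank: D - 1 coordinates
```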
Centered Log-Ratio
The centered log ratio (CLR) transformation is another powerful tool introduced by John Aitchison to address these challenges. It allows us to transform compositional data into a form that can be analyzed using standard multivariate statistical techniques.
- It removes the constant sum constraint, allowing the use of standard multivariate statistical methods.
- It preserves the relative magnitudes and relationships between components.
- It facilitates the interpretation of compositional variability and covariance structure.
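As a quick sketch of the first property: the CLR-transformed components of any composition sum to zero, which is exactly what trades the constant sum constraint for the singular covariance matrix noted among the limitations:

```python
import math

def centered_logratio(x):
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)   # log of the geometric mean
    return [l - mean_log for l in logs]

daily = [8, 1, 2, 1, 2, 10]            # hours from the earlier example
clr_values = centered_logratio(daily)
assert abs(sum(clr_values)) < 1e-9     # CLR components always sum to zero
```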
While the CLR transformation is useful, it's important to be aware of its limitations.
- The CLR transformation cannot handle zero values in the composition, so data preprocessing may be needed. The inability to handle zero values is a significant practical concern, especially in fields like ecology or microbiome research where zero counts are common. Various strategies exist to address this, such as multiplicative replacement or Bayesian approaches, but each has its own implications for the analysis.
- The resulting CLR-transformed data has a singular covariance matrix, which can cause issues in some multivariate analyses like principal component analysis or discriminant analysis.
- Interpreting results in the CLR space can be challenging and may require back-transformation.
The ILR and CLR transformations differ in their approach and properties. While the ILR transformation creates orthonormal coordinates that preserve distances and angles, the CLR transformation centers the data on the geometric mean of the composition. The ILR results in a set of coordinates equal to the number of parts minus one, maintaining full rank, whereas the CLR produces a set of coordinates equal to the number of parts, resulting in a singular covariance matrix. ILR coordinates are interpretable in terms of log-ratios between groups of parts, while CLR coordinates represent the logarithm of each part relative to the geometric mean of all parts.
When choosing between ILR and CLR, consider the specific requirements of your analysis. Use ILR when you need a full-rank transformation for statistical methods that assume non-singularity, such as regression or principal component analysis. ILR is also preferable when interpreting relationships between groups of components is important. On the other hand, CLR is more suitable when you want to preserve the number of original components and when the interpretation of individual parts relative to the whole composition is crucial. CLR can be particularly useful for exploratory data analysis and visualizations, as it maintains a one-to-one correspondence with the original parts. However, be cautious when applying methods that assume non-singularity to CLR-transformed data.
Conclusion and Further Reading
We have gone over why you should use ratios and logratios over raw proportions, and presented a simple wrapper for managing this conversion. While this class does not provide analytical tools such as regression, it provides an interface by which to feed the data into regression models. I believe the class could be refined or combined with other tools to create a more comprehensive interface for compositional data analysis, but I hope it provides a good start.
For more information on compositional data analysis, see: