1 Introduction
Under the name “Fair Machine Learning,” researchers have attempted to tackle problems of injustice, fairness, and discrimination arising in the context of machine learning applications embedded in society [5]. Despite the variety of definitions of fairness and proposed “fair algorithms,” we still lack a conceptual understanding of fairness in machine learning [75, 84]. What does it mean for predictions to be fair? How does the statistical frame influence fairness? Is there fair data and what would it look like? More concretely: what would a population of individuals and their corresponding predictions look like if a given definition of fairness were fulfilled?
We focus on a collection of widely used fairness notions which are based on statistical independence e.g., [16, 46, 22], but examine them from a new perspective. Surprisingly, debates concerning these notions have not questioned the role and meaning of statistical independence upon which they are based. As we shall argue, statistical independence is far from being a mathematical concept linked to one unique interpretation (see §4.2). This paper, in contrast to much of the literature on fairness in machine learning, e.g., [16, 33, 46, 22], investigates what many definitions of fairness take for granted: a well-defined and meaningful notion of statistical independence.
Another, less popular, strand of research investigates the role of randomness in machine learning [94]. The standard randomness notion, independently and often identically distributed data points, has long been criticized as inadequate (cf. [89]). Again, statistical independence lies at the foundation of this concept, hitherto unrelated to fairness. (We will justify below our use of “randomness” as used here.)
At the core of both our observations is the unreflective use of a convenient mathematical theory of probability. Kolmogorov’s axiomatization of probability theory, developed in 1933 in his book [55] (translated in [56]), dominates most research in machine learning. As Kolmogorov explicitly stated, his theory was designed as a purely axiomatic, mathematical theory detached from meaning and interpretation. In particular, Kolmogorov’s statistical independence lacks such reference. However, the modeling nature of machine learning and the ethical complications arising when machine learning is applied in society call for a semantics of probabilistic notions.
In this work, we focus on statistical independence. We leverage a theory of probability axiomatized by Von Mises [107] in order to obtain meaningful access to probabilistic notions. (In leaning on Von Mises we are directly following the explicit advice of Kolmogorov [56, page 3, footnote 4].) This theory construes probability theory as “scientific”2 (as opposed to purely mathematical), with the aim of describing the world and providing interpretation and verifiability [107, pages 1 and 14]. Von Mises’ theory of probability provides a mathematical definition of statistical independence which describes statistical phenomena observable in the world. In particular, Von Mises’ statistical independence is mathematically, but not conceptually, related to Kolmogorov’s definition.
In this paper, we, to the best of our knowledge, are the first to apply Von Mises’ randomness to machine learning and to interpret randomness in machine learning in a Von Mises’ way. The paper is structured as follows:
In Section 2, we outline our statistical perspective on machine learning. We present the “independent and identically distributed”-assumption (i.i.d.-assumption) as one commonly used choice for modeling randomness. The further occurrence of statistical independence as a fundamental ingredient of fairness notions in machine learning (§3) pushes us to the question: “How to interpret statistical independence in (fair) machine learning?” Remarkably, “Independence” governs many discussions around fairness in machine learning without this term ever being given a concrete meaning. Its deeper semantics remain untouched even in the otherwise comprehensive book by Barocas et al. [5, Chapter 3, p. 13].
We first dissect Kolmogorov’s widely used definition of statistical independence in Section 4 before we propose another mathematical notion following Von Mises. Von Mises uses his notion of independence in order to define randomness. We contrast his definition and the i.i.d.-assumption in Section 5. This reveals a general typification of mathematical definitions of randomness, which differ most importantly in whether they are absolute or relative to the problem under consideration (§5.2).
Finally, we leverage Von Mises’ definition of statistical independence to redefine three fairness notions from machine learning (§6). Against the background of Von Mises’ probability theory, we then link randomness and fairness both expressed as statistical independence (§7). Thereby, we reveal an unexpected hypothesis: randomness and fairness can be considered equivalent concepts in machine learning. Randomness becomes a relative, even an ethical choice. Fairness, however, turns out to be a modeling assumption about the data used in the machine learning system.
Due to the frequent use of the word “independence” with different meanings in this paper, we differentiate as follows. By “independence” we mean an abstract concept of unrelatedness and non-influence [92]. We use it interchangeably with “statistical independence,” which emphasizes the probabilistic and statistical context. When referring to the formal definitions of statistical independence following Kolmogorov or Von Mises, introduced later, we state this explicitly. Finally, we assign “Independence” (capital “I”) to one of the fairness criteria in machine learning, which demands statistical independence of predictions and sensitive attribute [5, 83]. Despite the abstract appearance of this work, we consider it part of the project to make the very abstract notion of “independence” or “statistical independence” at least a bit more concrete: specifically, we describe independence in terms of samples, not abstractions such as countably additive probability measures. We ask the reader (and try to assist them) to become aware of the implicit assumptions made about the concept of “independence”.
2 A Statistical Perspective on Machine Learning
Machine learning ingests data and provides decisions or inferences. In this sense, at its core, it is statistics.3 Adopting this perspective, we understand machine learning as “modeling data-generating processes”. Statistics, and with it machine learning, asks about properties of data-generating processes given a collection of data [109, p. ix], [4, p. 1].
We assume that a data-generating process occurs somehow in the world. We are confronted with a collection of numbers, the data, produced by the process and acquired by measurement. A “model” is a mathematical description of such a data-generating process. This description should allow us to make predictions in an algorithmic fashion. Thus, we require, inter alia, a mathematical description of data — “a model for data collection” [18, p. 207]. What is arguably the standard model of data is stated in [30, p. 11]:
We shall assume in this book that $({X_{1}},{Y_{1}}),\dots ,({X_{n}},{Y_{n}})$, the data, is a sequence of independent identically distributed (i.i.d.) random pairs with the same distribution[…].

Similar definitions can be found in many machine learning and statistics textbooks (e.g., [18, Def. 5.1.1]). Data (measurements from the world) is conceived of (mathematically) as a collection of random variables which share the same distribution and which are (statistically) independent of each other. Implicit in this definition is that the data indeed has a stable distribution. The assumed independence can be interpreted as a presumption of randomness of the data. Each data point was “drawn independently”4 from all others, with the obvious interpretation that each data point does not give a hint about the value of any other.
The i.i.d. assumption is two-fold: 1) the assumption of identical distributions across the sample, and 2) the mutual independence of points in the sample. The second assumption alone captures and pertains to randomness [50, Section 3]. However, since the i.i.d. assumption is the more common one, we refer to this more specific assumption.
Even though many results in statistics and machine learning rely on the i.i.d. assumption (e.g., the law of large numbers and the central limit theorem in statistics [18], generalization bounds in statistical learning theory [91] and computationally feasible expressions in probabilistic machine learning [30]), it has always been subject to fundamental critique.5 Other randomness definitions are rarely applied, but exceptions exist [108, 96].
In summary, statistical conclusions often rely on the i.i.d.-description of data. This description embraces a model of randomness making randomness a substantial assumption about the data in statistics and machine learning. Interestingly, statistical independence lies at the foundation of another, hitherto unrelated, concept: many fairness criteria in fair machine learning are expressed in terms of statistical independence.
3 Fair Machine Learning Relies on Statistical Independence
With the broad use of machine learning algorithms in many socially relevant domains, e.g., recidivism risk prediction or algorithmic hiring, such algorithms turned out to be part of discriminatory practices [22, 80]. These revelations were accompanied by the rise of an entire research field, called “fair machine learning” (cf. [16, 22, 5]). We do not attempt to summarize this large literature here. Instead, we simply take a snapshot of the most widely known fairness criteria in machine learning [5, p. 45].
3.1 Three Fairness Criteria in Machine Learning
The three so-called observational fairness criteria, which are expressed in terms of statistical independence, encompass a large part of the fair machine learning literature:6
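In their common Kolmogorovian form (our paraphrase of the standard statements, cf. [5, 46]), with prediction $\hat{Y}$, sensitive attribute $A$ and true label $Y$: Independence demands $\hat{Y}\perp A$; Separation demands $\hat{Y}\perp A\mid Y$; Sufficiency demands $Y\perp A\mid \hat{Y}$.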
For the sake of distinguishing between the fairness criterion “Independence” and statistical independence, we henceforth mark all fairness criteria by a leading capital letter. Each of the notions appears in a variety of ways and under different names [5, p. 45ff]. From the perspective of ethics, the fairness criteria have been substantiated via loss egalitarianism [11, 111], absence of discrimination [11], affirmative action [9, 83] or equality of opportunity [46].
Certainly, statistical independence is not equivalent to fairness in general (a constellation of concepts sharing little more than a common name) [33, 46, 83]. The nature of fairness has been discussed for decades, e.g., in political philosophy [82], moral philosophy [61] and actuarial science [1]. The “essentially contested” nature of fairness suggests that no universal, statistical criterion of fairness exists [39]. How fairness should be defined is a context-specific decision [84, 47].
Nevertheless, in order to incorporate fairness notions into algorithmic tools we require mathematical formalisations of fairness definitions. The three named criteria dominate most of the practical fair machine learning tools [5, p. 45], presumably because their simple definitions make it easy to incorporate them in learning procedures in a pre-, in- or post-processing way [5, Chapter 3, p. 20]. Regarding both the reductionist definition of fairness and the pragmatic justification, we emphasize that our argumentation is solely with respect to the fairness criteria named above.
The three fairness criteria are described as group fairness notions since each of the definitions is intrinsically relativized with respect to sensitive groups. The definition of sensitive groups substantially influences the notion of fairness. For instance, via custom categorization one can provide fairness by group-design (see [67, Section H.3] for a detailed discussion of the question of choice of groups). In addition, the meaning of groups in a societal context influences the choice of groups as elaborated in [49], and as explored in a long line of work in social psychology [17, 97, 48, 66, 19]. We contribute to the debate by drawing a connection between the choice of groups and the choice of randomness in §7.1.1.
3.2 Independence in Mathematics and the World
Behind the formalization of fairness as statistical independence, there is an apparently rigid, mathematical definition of statistical independence. The fairness criteria presume that we have the machinery of probability theory at our disposal and that the relationship between mathematics and the world is clear and unambiguous. However, as we elaborate in the following, there is no single notion of “the” mathematical theory of probability [36]. Furthermore, it is not clear what it means to be statistically independent when talking of measurements in the world. Likewise, it is not obvious that the standard formulation of statistical independence is the right one to use.
The current definitions of fairness in machine learning fail, and even hurt, in practice, because debates on fairness notions take for granted a commonly agreed meaning of statistical independence. Fatally, this common ground does not exist within standard probability theory alone. Hence, given a specified scenario, e.g., a hiring algorithm for public service in Kenya, fair machine learning research should enable well-founded debates on the meaning and purposefulness of deploying a specific notion of fairness, in particular those notions which require independence statements because they are algorithmically attractive. If the debate in Kenya relied on the intuitive understanding of statistical independence of everybody involved in the deployment process, it would result in a fatal misalignment of reasoning.
We perceive this work as part of the project to emphasize the substantive character formal notions of fairness should have. In other words, fairness is a societal and ethical concept that does not allow for the separation of machine learning as a technical tool on one side and the idea of fairness in society on the other. Debates on fairness have to consider machine learning in society.
Hence, if we desire statistical independence to capture a fairness notion applicable to the world, we ought to understand what the mathematical formulae signify in the world. Thus, in addition to the debate about the fairness criteria, a debate on the interpretation of statistical concepts in ethical context is required.
In this work, we contribute to this understanding by scrutinizing the standard definition of statistical independence. Motivated by the occurrence of statistical independence as a fundamental ingredient in randomness as well as fairness in machine learning, we first detail the standard account due to Kolmogorov. What is statistical independence? How does statistical independence relate to independence in the world?
4 Statistical Independence Revisited
To make any sense of phenomena in our world, we need to ignore large parts of it in order to avoid being overwhelmed. Hence, we usually assume or presume that the phenomenon of interest depends only on a few factors and is independent of everything else [72]. Thus the concept of independence is inherent in a variety of subjects ranging from causal reasoning [93] to logic [42], accounting [24], public law [69] and many more. Independence, as we understand it, captures the idea that an entity cannot be expressed, derived or deduced from something else [92].
4.1 From Independence to Statistical Independence
Of special interest to us is the concept of independence in probability theories and statistics [60], [36, Section IIF, IIIG and VH]. Independence in a probabilistic context should somehow capture the unrelatedness between the occurrence of events, as has been understood for centuries:
Two Events are independent, when they have no connexion [sic] one with the other, and that the happening of one neither forwards nor obstructs the happening of the other. [28, Introduction, p. 6]

Modern probability theory loosely follows this intuition, as we see in the following.
4.2 Statistical Independence As We Know It Lacks Semantics
Since its development by Kolmogorov [55] (translated in [56]), the measure-theoretic axiomatization of probability theory has displaced many other approaches. Mathematically, Kolmogorov’s axiomatization dominates all other mathematical formalizations of probability and related concepts.7 In particular, his definition of statistical independence developed into a well-accepted, ubiquitous notion. In a simple form it is given by:8
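For two events $A$ and $B$ in a probability space, the standard condition reads (our rendering of the textbook definition): $A$ and $B$ are statistically independent iff $P(A\cap B)=P(A)\,P(B)$.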
Independence plays a central role in Kolmogorov’s probability theory: “measure theory ends and probability begins with the definition of independence” [32, p. 37] (quoted in [98]). However, Kolmogorov’s definition of independence is subtle and requires closer investigation.
We employ a small toy example in order to convey the semantic emptiness of Kolmogorov’s definition: consider the experiment of throwing a die. The events under observation are $A=\{1,2\}$, seeing one or two pips, respectively $B=\{2,3\}$, seeing two or three pips. If the die were fair, so that each face has equal probability $\frac{1}{6}$ of showing up, the events A and B would turn out not to be independent. In contrast, if the die were loaded in a very special way, ${p_{2}}={p_{3}}=\frac{1}{2}$, ${p_{1}}={p_{4}}={p_{5}}={p_{6}}=0$, where ${p_{i}}$ denotes the probability of seeing i pips, the events A and B would be independent following Kolmogorov’s definition.
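A minimal sketch of this arithmetic (our code, not part of the original; the numerical tolerance is an implementation detail):

```python
# Check Kolmogorov's product condition P(A and B) = P(A) * P(B)
# for the two dice from the text.

def is_independent(p, A, B, tol=1e-12):
    prob = lambda E: sum(p[face] for face in E)
    return abs(prob(A & B) - prob(A) * prob(B)) < tol

A, B = {1, 2}, {2, 3}
fair = {face: 1 / 6 for face in range(1, 7)}
loaded = {1: 0, 2: 1 / 2, 3: 1 / 2, 4: 0, 5: 0, 6: 0}

print(is_independent(fair, A, B))    # False: P(A & B) = 1/6, P(A)P(B) = 1/9
print(is_independent(loaded, A, B))  # True:  P(A & B) = 1/2, P(A)P(B) = 1/2 * 1
```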
Thus, statistical independence, even though defined over events, manifests in how probabilities are assigned to events and why. The definition focuses on events, but the crucial ingredient is the probability. Hence, there is no unique interpretation of statistical independence. A more detailed interpretation and meaning depends heavily on the interpretation of probability in the first place.
Given this observation, we may ask for a notion of independence in the world, which is somehow captured by Kolmogorov’s definition. Kolmogorov himself underlined the avowedly mathematical, axiomatic nature of probability. His theory is in principle detached from any meaning of probabilistic concepts such as statistical independence in the world [56, p. 1]. He even questioned the validity of his axioms as reasonable descriptions of the world [56, p. 17].
However, one can possibly construct a notion of independence in the world which is captured by Kolmogorov’s definition. If one assumes that one’s calculations about one’s beliefs in the happening of events are governed by the mathematical rules laid out by Kolmogorov, then Kolmogorov’s definition of statistical independence captures one’s (in-the-world) understanding of an independence of beliefs in the happening of events. This sketch of a purely subjectivist account of statistical independence neglects a justification for the choice of mathematical formulation and skips over any reference to an objective world. In conclusion, Kolmogorov’s independence might capture a worldly concept. But this independence in the world is not uniquely attached to Kolmogorov’s definition.
There is a third major concern arising from Kolmogorov’s definition. As observed above, Kolmogorov treats events as the entities of independence. However, against the background of De Moivre [28]’s intuition on statistical independence, we wonder about this focus. Statistical independence, as De Moivre [28] already emphasized, refers to altering “the happening of the event,” not to the event itself. It is not the independence between the shown numbers of the die (whatever this means), but the independence of the processes by which the numbers show up (loading the die, throwing the die, etc.) that is captured by statistical independence.
This critique is not new. Von Mises already criticized the measure-theoretic definition by Kolmogorov in his book [107, pp. 36–39]. In summary, he argued that there is no interpretation of statistical independence of “single events.” The unrelatedness which the probabilistic notion of statistical independence is trying to capture resides in the process of recurring events, not in single events themselves. More recently, Von Collani [104] argued in a similar way. It is probability which brings the definition of statistical independence to life, and it is the question of what probability means and why we use it which links the mathematical definition to a concept in the real world.
In machine learning and statistics it is often presumed that the mathematical definition of independence captures a worldly concept. As we argued in this section, this link is far from being well-defined. However, if we consider machine learning as worldly data modeling, then the natural question arises: what do we model when we leverage Kolmogorov’s statistical independence? What do we mean by independence of events? In order to circumvent these questions, we propose to look into another mathematical theory of probability. This theory was guided by the idea of modeling statistical phenomena in the world.
4.3 Statistical Independence and a Probability Theory with Inherent Semantics
Around 15 years before Kolmogorov’s Grundbegriffe der Wahrscheinlichkeitsrechnung [55] (translated as [56]), Von Mises proposed an earlier axiomatization of probability theory [105]. His lesser-known theory approached the problem of a mathematical theory of probability through the lens of physics. Von Mises aimed for a “mathematical theory of repetitive events.”
This aim included the emphasis on the link between real-world and mathematical idealization. In particular, he offers interpretability and verifiability of his theory [107, p. 1 and 14].9 For interpretation he defined probabilities in a frequency-based way (see Definition 2). This inherently reflects the repetitive nature of the phenomena under description. By verifiability he referred to the ability to approximately verify the probabilistic statements made about the world [107, p. 45].
In summary, Von Mises’ theory, in our conception of machine learning, starts the “modeling of data-generating processes” at an even more fundamental level than is currently done via the use of Kolmogorov’s axiomatization. His aim for a mathematical description of data-generating processes (the sequence of repetitive events) aligns with our perspective on machine learning as laid out earlier (cf. Section 2). With Von Mises we obtain access to meaningful foundations for statistical concepts in machine learning. In particular, we redefine and reinterpret statistical independence in a Von Misesean way. This suggests new perspectives on the problem of fair machine learning and on the concepts of fairness and randomness themselves.
For the further discussion, we summarize the major ingredients of Von Mises’ theory. Fortunately, it turns out that Von Mises’ notion of statistical independence, central to our discussion, is mathematically analogous to the well-known Kolmogorovian definition. Thus, one’s intuition on statistical independence is refined but its mathematical applicability remains.
4.4 Von Mises’ Theory of Probability and Randomness in a Nutshell
Von Mises’ axiomatization of probability theory is based on random sequences of events and the interpretation of probability as the limiting frequency with which an event occurs in such a sequence [105]. These random sequences, called collectives, are the main ingredients of his theory. For the sake of simplicity, we stick to binary collectives. Thus, collectives are 0-1-sequences with certain randomness properties which define probabilities for the labels 0 and 1. Nevertheless, it is possible to define collectives, and thus probabilities, on richer label sets. Collectives can even be extended up to a continuum [107, II.B].
For notational economy, we note that a sequence ${({s_{i}})_{i\in \mathbb{N}}}$ taking values in $\{0,1\}$ can be identified with a function $s:\mathbb{N}\to \{0,1\}$.
Definition 2 (Collective [107, p. 12]).
Let $\mathcal{S}$ be a set of sequences $s:\mathbb{N}\to \{0,1\}$ with $s(j)=1$ for infinitely many $j\in \mathbb{N}$. In mathematical terms, collectives with respect to $\mathcal{S}$ are sequences $x:\mathbb{N}\to \{0,1\}$ for which the following two conditions hold.
1. The limit of relative frequencies of 1s exists.10 If it exists, then the limit of relative frequencies of 0s exists, too. We define ${p_{1}}$, respectively ${p_{0}}=1-{p_{1}}$, to be its value.

2. For all $s\in \mathcal{S}$,
\[
\lim_{n\to \infty }\frac{|\{i\in \mathbb{N}:x(i)=1\text{ and }s(i)=1,\, 1\le i\le n\}|}{|\{j\in \mathbb{N}:s(j)=1,\, 1\le j\le n\}|}
=\lim_{n\to \infty }\frac{|\{i\in \mathbb{N}:x(i)=1,\, 1\le i\le n\}|}{n}={p_{1}}.
\]
We call ${p_{0}}$ the probability of label 0; likewise, ${p_{1}}$ is the probability of label 1.11 The existence of the limit (Condition 1) is a non-vacuous condition. One can easily construct sequences whose frequencies do not converge [37].
The sequences $s\in \mathcal{S}$ are called selection rules. A selection rule selects the $j$th element of $x$ whenever $s(j)=1$.12 Informally, a collective (w.r.t. $\mathcal{S}$) is a sequence whose frequency limits are invariant with respect to all selection rules in $\mathcal{S}$. We call any selection rule which does not change the frequency limit of a collective admissible. This invariance property of collectives is often called the “law of excluded gambling strategy” [107]. When thinking of a sequence of coin tosses, a gambler is not able to gain an advantage by considering only specific selected coin tosses: the probability of seeing “heads” or “tails” remains unchanged.
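A finite-sample sketch of this invariance (our code; the definition itself concerns the limit $n\to\infty$, which no finite simulation can verify):

```python
import random

def freq(seq):
    """Relative frequency of 1s in a finite 0-1 sequence."""
    return sum(seq) / len(seq)

def freq_selected(x, s):
    """Relative frequency of 1s in x among the positions selected by s."""
    selected = [xi for xi, si in zip(x, s) if si == 1]
    return freq(selected)

random.seed(0)
n = 100_000
x = [1 if random.random() < 0.3 else 0 for _ in range(n)]  # stand-in for a collective
s_even = [1 if i % 2 == 0 else 0 for i in range(n)]        # select every other element
s_third = [1 if i % 3 == 0 else 0 for i in range(n)]       # select every third element

print(freq(x))                   # ~0.3
print(freq_selected(x, s_even))  # ~0.3: subselection gives no gambling advantage
print(freq_selected(x, s_third)) # ~0.3
```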
Von Mises introduced the “law of excluded gambling strategy” with the goal of defining randomness of a collective [107, p. 8]. A collective with respect to $\mathcal{S}$ is called random with respect to $\mathcal{S}$. Consequently, Von Mises integrated randomness and probability into one theory. In fact, admissibility of selection rules is equivalent to statistical independence in the sense of Von Mises, except that the latter is defined with respect to collectives instead of selection rules.
Definition 3 (Von Mises’ Definition of Statistical Independence of Collectives [107, p. 30, Def. 2]).
A collective x with respect to ${\mathcal{S}_{x}}$ is called statistically independent of the collective y with respect to ${\mathcal{S}_{y}}$ iff the following limits exist and
\[
\lim_{n\to \infty }\frac{|\{i\in \mathbb{N}:x(i)=1\text{ and }y(i)=1,\, 1\le i\le n\}|}{|\{j\in \mathbb{N}:y(j)=1,\, 1\le j\le n\}|}
=\lim_{n\to \infty }\frac{|\{i\in \mathbb{N}:x(i)=1,\, 1\le i\le n\}|}{n}={p_{1}},
\]
where $\frac{0}{0}:=0$. When two collectives are independent of each other we write
In comparison to admissibility, the collective y adopts the role of a selection rule. It is in fact an admissible selection rule, with the difference that potentially only finitely many elements of x are selected (cf. [41, p. 120f]). Conversely, Von Mises’ randomness is statistical independence with respect to sequences with infinitely many ones and potentially no frequency limit. (For a general comparison between Kolmogorov’s and Von Mises’ theory of probability see Appendix A and Table 2.)
4.5 Kolmogorov’s Independence versus Von Mises’ Independence
What is the relationship between Kolmogorov’s and Von Mises’ definition of statistical independence? On a conceptual level, the critique posed earlier, which questioned the meaning of statistical independence between events following Kolmogorov, is resolved.
Von Mises adopted a strongly frequentist perspective on probabilities which clarifies the mapping from the real world to the mathematical definition. He idealized repetitive observations as infinite sequences and defined probabilities as limiting frequencies.13 Von Mises’ independence states that there is no difference between counting the frequency of occurrences of an event in the entire sequence and counting it in a subselected sequence. His independence forbids any statistical interference between processes described as sequences. No statistical patterns can be derived from one sequence by leveraging the other. Von Mises’ definition formalizes the concept of statistical independence between processes of recurring events.
In contrast to Kolmogorov’s, Von Mises’ definition does not evoke this conceptual obscurity. His focus on idealized sequences of repetitive events restricts his definition of statistical independence to specific applications, with the gain of clarity in the goal of the mathematical description. Von Mises’ definition makes statistical independence more concrete than Kolmogorov’s definition does.
On a more formal level, Kolmogorov defined statistical independence via the factorization of measure (cf. Definition 5), whereas Von Mises defined statistical independence via conditionalization of measures. The invariance of the frequency limit of a collective with regard to the subselection via another collective can be interpreted as the invariance of a probability of an event with regard to the conditioning on another event, i.e., “selecting with respect to” is “conditioning on” (cf. Theorem 1 and Theorem 2).
Mathematically, it turns out that Kolmogorov’s definition and Von Mises’ definition are both special cases (modulo the measure-zero problem in conditioning) of a more general form of measure-theoretic statistical independence. A selection rule with converging frequency limit is admissible for (respectively, statistically independent of) a collective if and only if the two are statistically independent of each other in the sense of Kolmogorov, generalized to finitely additive probability spaces (see Appendix A for a formal statement of this claim). Thus, we can replace the well-known definition of statistical independence by Kolmogorov with the definition by Von Mises. Thereby, we give a specific meaning to statistical independence.
We have been motivated to dissect the notion of statistical independence for its central role in fair machine learning. Von Mises’ definition drew us closer to a more transparent mathematical formalization of statistical independence for fairness notions in machine learning. However, our discussion of Von Mises’ theory has so far skipped over a substantial part of his work. Von Mises included a definition of randomness in his theory of probability. Much in contrast to Kolmogorov: there is no definition of “randomness” in Kolmogorov’s Grundbegriffe der Wahrscheinlichkeitsrechnung [55] (translated as [56]). Even more interestingly, Von Mises’ definition of randomness is stated in terms of statistical independence. The reader might notice that in Section 2 we already stumbled upon a heavily used notion of randomness in machine learning which is expressed as statistical independence (i.i.d.). How do i.i.d. and Von Mises’ randomness relate to each other? How does the close connection between statistical independence and randomness complement our picture of the three fairness criteria from machine learning?
5 Randomness as Statistical Independence
The nature and definition of randomness seems as “random” as the term itself [35, 73, 68, 103, 8]. Usually, a very broad distinction between two approaches to randomness is made: process randomness versus outcome randomness [35]. In this work, we focus on outcome randomness and more specifically the role of randomness in statistics and machine learning.
Randomness is a modeling assumption in statistics (cf. Section 2). Upon looking into statistics and machine learning textbooks one often finds the assumption of independent and identically distributed (i.i.d.) data points as the expression of randomness [18, p. 207], [30, p. 4].
We adopt Von Mises’ differing account of randomness. Expressing randomness relative to the problem at hand turns out to be substantial, particularly in settings with data models such as statistics.
5.1 Orthogonal Perspectives on Randomness as Independence in Machine Learning and Statistics
Von Mises defined a random sequence as a sequence which is statistically independent of a (pre-specified) set of selection rules, respectively of other sequences. In contrast, an i.i.d. sequence consists of elements each statistically independent of all others.
Both definitions are stated in terms of statistical independence. But the relationship between independence and randomness differs substantially between the i.i.d. assumption and Von Mises’ theory. Von Mises’ randomness is stated relative to a set of selection rules. Furthermore, it is stated between sequences, respectively collectives, whereas in an i.i.d. sequence randomness is expressed between random variables. The two randomness definitions are in an abstract sense “orthogonal.” We consider a concrete example for better understanding.
1. Horizontal Randomness. Let $\Omega =\mathbb{N}$ be a penguin colony. Let $s,f$ be two attributes of a penguin, namely its sex and whether the penguin has the penguin flu or not. Mathematically: $s:\Omega \to \{0,1\}$, $f:\Omega \to \{0,1\}$. So penguins are individuals $n\in \Omega $ which we do not know individually, but we know some of their attributes. Suppose we are given a sequence $f(1),f(2),f(3),\dots $ of flu values with existing frequency limit. This allows us to state randomness of f with respect to the corresponding sequence of sex values s (containing infinitely many ones and having a frequency limit) by: the sequence of sex values s is admissible on f. Respectively, s and f are statistically independent of each other. In the context of the colony Ω, a penguin having the flu is random with respect to the sex of the penguin.
2. Vertical Randomness. This differs from the i.i.d. setting, in which each penguin $i\in \mathbb{N}$ obtains its own random variable ${F_{i}}:\Omega \to \{0,1\}$ on some probability space $(\Omega ,\mathcal{F},P)$. Here, ${F_{i}}$ encodes whether penguin i has the penguin flu or not. The sequence ${F_{1}},{F_{2}},{F_{3}},\dots $ somehow represents the colony. The included random variables share their distribution and are statistically independent of each other. The attribute flu is not random with respect to the attribute sex here; rather, the penguins are random with respect to each other. The random variables are (often implicitly) defined on a standard probability space on Ω. The set Ω here does not model the colony. It shrivels to an abstract source of randomness and probability.
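To make the contrast tangible, here is a small simulation in the spirit of the penguin example (hypothetical data and attribute names; ours, not the paper’s):

```python
import random

random.seed(1)
n = 200_000
sex = [random.randint(0, 1) for _ in range(n)]               # attribute s
flu = [1 if random.random() < 0.1 else 0 for _ in range(n)]  # attribute f

# Horizontal randomness: within this ONE realized colony, subselecting the
# flu sequence by the sex sequence leaves the frequency of flu unchanged.
overall = sum(flu) / n
among_sex1 = sum(f for f, s in zip(flu, sex) if s == 1) / sum(sex)
print(overall, among_sex1)  # both ~0.1: s is admissible on f

# Vertical randomness (i.i.d.) would instead be a statement ACROSS resampled
# colonies: penguin i's random variable F_i is independent of penguin j's.
```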
The choice of perspective on randomness expressed as statistical independence, horizontal or vertical, is a question of the data model. The two types of randomness definitions are distinct in a number of ways; for a summary see Table 1. Most importantly, horizontal randomness is inherently expressed with respect to some mathematical object. Vertical randomness lacks this explicit relativization. This typification into horizontal and vertical mathematical definitions of randomness is in fact more broadly applicable.
Table 1
Typification of horizontal and vertical randomness (“RV” = “random variable”).

| | Horizontal Randomness | Vertical Randomness |
| Data points are modelled as: | Evaluations of RVs | RVs |
| Mathematical definition of randomness of: | Sequences | Sequences of RVs |
| Explicit relativization: | Yes | No |
To the set of vertical randomness notions one can add exchangeability [27], α-mixing, β-mixing [94] and possibly many more. The set of horizontal randomness notions is spanned by an entire branch of computer science and mathematics: algorithmic randomness.
Algorithmic randomness poses the question of whether a sequence is random or not. This question arose in [105] within the attempt to axiomatize probability theory [10, p. 3]. In algorithmic randomness, further definitions of random sequences have been proposed. For the sake of simplicity, the sequences considered here consist only of zeros and ones.
Four intuitions for random sequences crystallized [77, p. 280ff]: typicality, incompressibility, unpredictability and independence (see Appendix C). For our purposes, the key point to note is that a random sequence is typical, incompressible, unpredictable or independent with respect to “something” (they are all relativised in some way). Each of these intuitions has been expressed in various mathematical terms. In particular, formalizations of the same intuitions are not necessarily equivalent, and formalizations of different intuitions sometimes coincide or are logically related (see Appendix D).14 We mainly stick to the intuition of independence in this paper. A random sequence is independent of “some” other sequences [105, 23].
5.2 Relative Randomness Instead of Absolute, Universal Randomness
The definition of randomness for sequences is inherently relative. Even though the notion is relative with respect to “something,” most of the effort has been spent on finding the set of statistically independent sequences defining randomness [23, 68].15
Naively, one could attempt to define a random sequence as: a sequence is random if and only if it is independent with respect to all sequences. However, this approach is doomed to fail. There is no sequence fulfilling this condition except for trivial ones such as endless repetitions of zeros or ones; for instance, for any sequence $x$ with infinitely many ones and frequency limit ${p_{1}}<1$, the selection rule $s=x$ selects exactly the ones and yields frequency limit $1\neq {p_{1}}$ (see Kamke’s critique of Von Mises’ notion of randomness [100]).
So instead, research focused on computability expressed in various ways (because it was felt by those investigating these matters that computability was somehow given, or more primitive, and thus a natural way to resolve the relativity of the notion of randomness). Intuitively, randomness is considered the antithesis of computability [77, p. 288]: something which is computable is not random. Something which is random is not computable. If we then informally update the definition above we obtain: a sequence is random if and only if it is independent with respect to all computable sequences [23].16 Analogous to the definition of computability [77, p. 165], this is taken as an argument for the existence of the definition of randomness [77, p. 287].
In our work, we argue towards a relativized conception of randomness in line with work by [77], [50] and [107].17 A relative definition of randomness is a definition of randomness which is relative with respect to the problem under consideration.18 In contrast, an absolute and universal definition of randomness would preserve its validity in all problems. It presupposes the existence of the randomness.
Relative randomness, with respect to the problem which we want to describe, aligns with Von Mises’ theory of probability and randomness. Von Mises emphasized the ex ante choice of randomness [106, p. 89] relative to the problem at hand [107, p. 12]. First, one formalizes randomness with respect to the underlying problem; then one can consider a sequence to be random or not. Otherwise, if we are given a sequence, it is easy to construct a set of selection rules such that the sequence is random with respect to this set.19 This, however, undermines the concept of randomness, which should capture the pre-existing typicality, incompressibility, unpredictability or independence of a sequence (cf. [108, p. 321]). Von Mises’ randomness intrinsically possesses a modeling character, similar to our needs in machine learning and statistics.
Given its role as a modeling assumption in statistics, randomness lacks substantial justification to be expressed in any absolute and universal manner in this context. Nor is there a reason why computability20 should be the only expressive mathematical way to encode one of the four intuitions of randomness. The i.i.d. assumption, an absolute and universal definition of randomness, does not fit this purpose. To model data appropriately we require adjustable notions of randomness. Otherwise, we restrict our modeling choices without reason or gain.21
Equipped with the interpretation of statistical independence as randomness we now return to our motivation for investigating statistical independence. ML-friendly fairness criteria are built upon statistical independence. In contrast to Kolmogorov, Von Mises’ statistical independence transparently refers to a concept of independence in the real world. To clarify the meaning of fairness expressed as statistical independence, we directly apply Von Mises’ independence to the fairness criteria listed in Section 3 in the following.
6 Von Mises’ Fairness
With Von Mises’ definition of statistical independence we have a notion at our disposal which is conceptually focused on a more “scientific” perspective on statistical concepts (i.e., making claims about the world). Since it is mathematically related to Kolmogorov’s standard account of statistical independence, Kolmogorov’s definition can in many places be easily replaced by Von Mises’ definition.
Let us restate the three presented fairness criteria in a Von Misesean way (cf. Section 4.5). In essence (Definition 4): a collective $x$ is fair with respect to a set of sensitive groups $\mathcal{G}={\{{s^{j}}\}_{j\in J}}$ of 0-1-sequences iff every ${s^{j}}$ is an admissible selection rule for $x$, i.e., $x$ is statistically independent of the sensitive groups.
The 0-1-sequences ${s^{j}}$ determine for each individual i whether it is part of the group or not (according to whether ${s^{j}}(i)=1$ or ${s^{j}}(i)=0$). We call these groups “sensitive,” as these are the groups which are of moral and ethical concern. In the philosophical literature these groups are often called “socially salient groups” [2, 64].22 We see that the connection between Von Misesean independence and fairness arises from the observation that the set of sensitive groups $\mathcal{G}$ is a family of selection rules, so that if $\mathcal{G}\subseteq \mathcal{S}$, then the collective x is indeed fair for $\mathcal{G}$.
Following Von Mises’ interpretation of independence, the given definition reads as follows: we assume we are in the idealized setting of infinitely many individuals with values ${x_{i}}$, e.g., binary predictions. The predictions are fair if and only if there is no difference between counting the frequency of 1-predictions in the entire population and counting it in a sensitive group. (For an illustration see Appendix 6.) A proper conceptualization of fairness requires such immediate semantics, but a purely mathematical theory of probability cannot offer them (see Section 4.2).
Each of the three fairness criteria is captured in Definition 4; the choice of fairness criterion manifests in the collective under consideration:
Independence: The collective $x:\mathbb{N}\to \{0,1\}$ consists of predictions; i.e., $\{0,1\}$ is the set of predictions.

Separation: The collective $x:\mathbb{N}\to \{0,1\}$ is obtained via the subselection of predictions based on the sequence of true labels corresponding to the predictions.23

Sufficiency: The true labels are subselected by the predictions.

To enable intuitive access to the Von Misesean notions of fairness, we provide a toy example in Appendix B and an empirical sketch below. The three fairness criteria Independence, Separation and Sufficiency encompass a large part of fair machine learning [5, p. 45]. Von Mises’ statistical independence gives a consistent interpretation to all of them. In fact, Von Mises’ independence opens the door to further investigations. To this end, we recapitulate the strong linkage between statistical independence and randomness in Von Mises’ theory.
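The following finite-sample sketch renders the three readings as frequency comparisons (our code; the function and variable names are ours, and finite samples can only approximate the defining limits):

```python
def freq(seq):
    return sum(seq) / len(seq) if seq else 0.0

def subselect(x, s):
    """Keep x(i) whenever s(i) = 1, as a selection rule does."""
    return [xi for xi, si in zip(x, s) if si == 1]

def independence_gap(yhat, group):
    # Independence: frequency of 1-predictions in the group vs. overall.
    return abs(freq(subselect(yhat, group)) - freq(yhat))

def separation_gap(yhat, y, group, label=1):
    # Separation: first subselect predictions (and the group sequence)
    # by the true label, then compare frequencies as above.
    mask = [int(yi == label) for yi in y]
    return independence_gap(subselect(yhat, mask), subselect(group, mask))

def sufficiency_gap(yhat, y, group, pred=1):
    # Sufficiency: the true labels are subselected by the predictions.
    mask = [int(pi == pred) for pi in yhat]
    return independence_gap(subselect(y, mask), subselect(group, mask))

# A gap near zero indicates that the corresponding criterion holds empirically.
```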
7 The Ethical Implications of Modeling Assumptions
Machine learning methods try to model data in complex ways. Derived statements, such as predictions, then potentially get applied in society. In these cases one is obliged to ask which ought-state the machine learning model should reflect [75, 83]. To enable a justified choice, statistical concepts in machine learning require relations to the real world. Furthermore, modeling even requires an understanding of the entanglement of societal and statistical concepts.
We proposed one specific meaningful definition of statistical independence which can be directly applied to the three observational fairness criteria from fair machine learning. In addition, this Von Mises’ independence is key to a relativized notion of randomness. Pulling these threads together, we are now able to establish the following link: Randomness is fairness. Fairness is randomness.
7.1 Randomness is Fairness. Fairness is Randomness
The concepts of fairness and randomness frequently appear jointly: [14] argues that a random allocation of goods is fair under certain conditions; the literature on sortition argues for just representation of society by random selection of people [74, 95];24 Bennett [7, p. 633] even states that randomness encompasses fairness.
With Von Mises’ axiom 2 and Definition 4 we can now tighten the conceptual relationship between fairness and randomness. The following proposition directly follows from the definitions of randomness, respectively fairness, in the sense of Von Mises.
Proposition 1 (Randomness is fairness. Fairness is randomness.).
Let x be a collective with respect to ∅ (the empty set). It is fair with respect to a set of sensitive groups (0-1-sequences) ${\{{s^{j}}\}_{j\in J}}$ if and only if it is random with respect to ${\{{s^{j}}\}_{j\in J}}$.
The given proposition establishes a helpful link, giving insight into both concepts. In particular, it substantiates the relativized conception of randomness in machine learning, as it presents randomness as an ethical choice.
7.1.1 Randomness as Ethical Choice
Randomness in machine learning is a modeling assumption (Section 2). Fairness is an ethical choice.25 In light of Proposition 1, randomness becomes an ethical choice and fairness a modeling assumption. We now detail this perspective further.
We assume that we are given a fixed set of selection rules which defines “the” randomness. As far-fetched as this may sound, if we, for example, accept so-called Martin-Löf randomness as the absolute and universal definition, then we do exactly this and fix the set of selection rules to the partial computable ones (see Appendix D.1). A sequence which is random with respect to this specified set of selection rules is fair with respect to the groups defined by the selection rules. Rephrased in terms of Martin-Löf randomness: a Martin-Löf random sequence is fair with respect to all partial computable groups. Only non-partial-computable groups (respectively sequences) can be discriminated against in this setting. If we interpret statistical independence as fairness (Section 3), then fairness is as absolute and universal as randomness here. How did the “essentially contested” nature of fairness [39] drop out of the picture?
The set of admissible selection rules specifies the choice of sensitive groups, which indeed is a fraught and contestable choice [67, Section H.3]. Thus each selection rule becomes ethically loaded. Furthermore, the choice of collective which we consider as random fixes the fairness criterion. In summary, the determination of randomness is analogous to the determination of fairness.
However one defines randomness, it is an ethical choice. For symmetry reasons one can equivalently state, in machine learning: fairness is a modeling assumption. The randomness assumption has ethical, moral and potentially legal implications. For each problem at hand, we need non-mathematical, contextual arguments which justify the adjustable and explicit randomness assumptions.
Given that randomness is an ethical choice, an absolute, universal conception of randomness counteracts any ethical debate in machine learning. Discussions about sexism, racism and other kinds of discrimination and injustice persist over time without ever arrogating the discovery of “the” fairness [39]. But if “the” randomness as statistical independence existed, then “the” fairness as statistical independence would be an accessible notion. For illustration, we reconsider Martin-Löf randomness. A Martin-Löf random sequence is independent of, respectively fair towards, the set of all partial computable selection rules. But it is completely unclear what the ethical meaning of partial computable groups is. And it remains unresolved whether the groups given by gender are partial computable, should we desire to be fair with respect to them. We conceive of Proposition 1 as a further counterargument to an absolute, universal definition of randomness. Randomness is, like fairness, better interpreted as a relative notion.
Concluding further, the equivalence of randomness and fairness highlights the deficiency of fairness notions in machine learning. The equivalence only holds due to the very reductionist perspective on fairness in fair machine learning. Despite their regular co-occurrence [14, 74], [7, p. 633], fairness and randomness are more multi-faceted, and less overlapping, concepts than Proposition 1 suggests.
7.1.2 Fairy Tales of Fairness: “Perfectly Fair” Data
With the relationship between fairness and randomness in mind, we now turn towards random data as primitive. Discussions in fair machine learning sometimes seemingly presume the existence of “perfectly fair” data (e.g., as highlighted in [83, p. 134]), as if fair machine learning merely tackles the cases where “perfectly fair” data is not available.
We interpret “perfectly fair” data as a collective with respect to all possible selection rules. The data does not depend on any (sensitive) group at all. In other words, “perfectly fair” data is “totally random” data. As we saw in Section 5.2, this is self-contradictory except for the trivial constant case. “Perfectly fair” data does not exist or is statistically useless.
7.2 Demanding Fairness Is Randomization: Fair Predictors Are Randomizers
In practice, it is often unreasonable to assume random or fair data as in Proposition 1. Instead, one demands fairness, respectively randomness, of the predictions. In these settings, fair machine learning techniques are deployed to achieve ex post fulfillment of fairness criteria.
We assume for the following discussion that the collective x consists of predictions, as in the fairness criteria Independence or Separation. Fair machine learning techniques enforce statistical independence of predictions and sensitive attributes. Rephrased, fair machine learning techniques actually introduce randomness post-hoc into the predictions. Thus, fair machine learning techniques can potentially be interpreted as randomization techniques.
7.2.1 Fairness-Accuracy Trade-Off — Another Perspective
We noticed that fair predictions are random predictions with respect to the sensitive attribute. In contrast, accurate predictions exploit all dependencies between the given attributes and the predictive goal, including the sensitive attributes. Thus, in fair machine learning, the morally wrongful discriminative potential of sensitive attributes is thrown away on purpose. On these grounds, it is not surprising that an increase in fairness, respectively randomness, (usually) goes hand in hand with a decrease in accuracy [110]. Randomization of predictions leads to the so-called fairness-accuracy trade-off.
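A toy sketch of this trade-off (entirely hypothetical data; ours, not from the original): when labels correlate with the sensitive attribute, a predictor exploiting that dependence beats one randomized with respect to the group.

```python
import random

random.seed(2)
n = 100_000
group = [random.randint(0, 1) for _ in range(n)]
# Labels depend on the group: base rate 0.8 in group 1, 0.2 in group 0.
y = [1 if random.random() < (0.8 if g else 0.2) else 0 for g in group]

exploitative = group[:]                                # predicts via the sensitive attribute
randomized = [random.randint(0, 1) for _ in range(n)]  # independent of the group

accuracy = lambda pred: sum(p == t for p, t in zip(pred, y)) / n
print(accuracy(exploitative))  # ~0.8: exploits the dependence
print(accuracy(randomized))    # ~0.5: fair w.r.t. the group, less accurate
```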
Concluding, via Von Mises’ axiomatization we established: Randomness is fairness. Fairness is randomness. Exploiting this equivalence, we unlocked a perspective on fair predictors as randomizers, demonstrated the nonexistence of “perfectly fair” data and treated randomness as an ethical choice, which can be neither universal nor total. In particular, the “essentially contested” nature of fairness is tied to the “essentially relative” nature of randomness.
8 Conclusion
Fair machine learning has attracted increasing interest in recent years. However, its conceptual maturity lags behind. In particular, the interplay between data, its mathematical representation and its relation to fairness is shrouded in a veil of nescience. In this paper, we contribute towards a better understanding of randomness and fairness in machine learning.
We started from the most commonly used definition of statistical independence and questioned its representational role due to its lack of semantics. Generally, we observe that in machine learning, as in statistics, probability and its related concepts should be interpreted as modeling assumptions about the world (of data). Von Mises aimed for exactly this “scientific” perspective on probability theory. We lean on his statistical independence, which clarifies the relation to the real world, and his definition of randomness, which is relative and orthogonal to the i.i.d. assumption, but similarly expressed as statistical independence. Via the three fairness criteria in machine learning we then obtained a further interpretation of independence, which we finally exploited to argue for a relative conception of randomness, for randomness as an ethical choice in machine learning, and for fair predictors as randomizers.
8.1 Future Work: Approximate Randomness and Fairness, Randomness as Fairness via Calibration
Beyond future conclusions connecting fairness and randomness in research subjects other than machine learning, we claim that a significant dimension is missing from the present discussion. Practitioners usually deal with approximate versions of randomness, statistical independence or fairness. Yet, approximation spans another dimension of choice beset with pitfalls [79, 63]. Several questions arise, ranging from the choice of approximation to the interference of concepts. Future work should detail the implications of this choice.
Second, we conjecture that “Randomness is fairness. Fairness is randomness.” can be substantiated via the intuition of unpredictability. Starting from the definition of unpredictability-based randomness in [90], which is closely related to the calibration idea presented in [25], we can build a bridge to fairness as calibration as given in [22]. A recent work by Cynthia Dwork and collaborators in fact shows a formal link between pseudo-randomness and fairness as calibration [34]. This work, however, still misses a more thorough discussion of the concepts of individual versus group fairness in machine learning [12]. As a subproblem contained therein, the categorization into (sensitive) groups in fair machine learning deserves its own work.
Third, regarding a more thorough definition of statistical independence within the fairness criteria, we are convinced that a subjectivist interpretation of probability might reveal yet another perspective on the problem. We assume that the interplay between different interpretations of probability and ethical concepts such as fairness still leaves room for many important investigations.
Fourth, there are certainly more frameworks to give an interpretation and concretization to current notions in (fair) machine learning (cf. [41, 36]).
Last but not least, we already referred to sortition literature and random allocation. The somewhat different relation between fairness and randomness in this literature leads us to speculate that further fruitful discussions between the two concepts may develop.
In the jungle of statistical concepts such as probability, uncertainty, randomness, independence, etc., further relations to social and ethical concepts wait to be brought to light. And machine learning research should care:
The arguments that justify inference from a sample to a population should explicitly refer to the variety of non-mathematical considerations involved. [6, p. 11]