The Conditions of Analyzability
In Part 1, I argued that not everything can be turned into data by means of analysis. Some experiences resist detection, some collapse under measurement, and others vanish once we try to replicate them. But when analysis is possible, it depends on certain hidden preconditions, which I’ll call the five conditions of analyzability. We will look at three here and the remaining two in the next article. Some, if not all, of these points might come off as truisms at first, but the point of this part of the series is to take a deeper look at the hidden assumptions that lie at the heart of analysis. These conditions serve as thresholds: they are not guarantees of insight, but the minimum requirements for analysis to even begin. Without them, statistics and machine learning alike produce empty formalisms, outputs untethered from reality.
The first condition we will discuss is detectability. In short, the signal in question must not merely exist; it must exist in a context that gives off a trace we can observe. If there is no signal to detect, analysis has no starting point. Hypothesis testing in statistics is a perfect example of this condition: it depends on being able to distinguish a real effect from random noise, and if the effect is completely buried in noise, inference collapses. The same holds in machine learning, where a system can only learn a signal if that signal actually appears in the distribution of the data being analyzed. Returning to the fraud example from Part 1 of the series, fraud can exist and yet, by its nature, tend to escape detection, because fraudsters deliberately avoid leaving a detectable signal.
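To make this concrete, here is a minimal simulation sketch. The numbers are invented for illustration (a tiny true effect of 0.05, noise with standard deviation 1, and 30 observations per group; none of this comes from a real study): a real effect exists, but at this noise level a standard t-test usually cannot detect it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

n = 30              # observations per group (illustrative, not from real data)
true_effect = 0.05  # a real but tiny shift in the mean
noise_sd = 1.0      # noise that dwarfs the effect

# Two hypothetical groups: the effect genuinely exists in "treatment".
control = rng.normal(loc=0.0, scale=noise_sd, size=n)
treatment = rng.normal(loc=true_effect, scale=noise_sd, size=n)

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"p-value: {p_value:.3f}")
# At this effect size and sample size, the p-value will almost always
# land well above any conventional threshold. The signal is real, but
# it leaves no detectable trace, so inference has nothing to grip.
```

The point is not that the effect is absent; it's that detectability is a property of the signal in its context, not of the signal alone.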
Beyond detecting the signal, we need to be able to quantify it. We can think of this condition as measurability: what we detect must be structured into a form that can be compared, ordered, or aggregated. In statistics, this shows up in the choice of measurement scale. Not all numbers mean the same thing; the way something is measured determines which comparisons are valid. Going back to the customer satisfaction example from Part 1 of this series, a question like “on a scale of 1 to 5, how satisfied were you with your visit?” converts a complex feeling into a narrow range of integers. Yet when analysis is performed on the satisfaction data, you’ll often see results like an average of 3.5 out of 5. When you apply quantitative analysis to qualitative measurements, you risk producing results that look precise but aren’t. Treating qualitative judgments as if they lie on an interval scale creates an illusion of structure where there may be none. In machine learning, this condition is mirrored by embeddings, normalizations, and encodings of the data: all attempts to organize messy real-world data into measurable, consistent signals we can learn from. But every such translation, whether of quantitative or qualitative data, is also a reduction: nuance becomes structure, texture becomes number. Whether in statistics or machine learning, measurability always involves a trade-off between richness and regularity, between what is real and what is processable.
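A short sketch of the illusion, using made-up satisfaction ratings: two groups of customers with completely different experiences produce the exact same “average satisfaction” once feeling is reduced to integers and the integers are treated as interval data.

```python
import numpy as np

lukewarm = np.array([3, 3, 3, 3, 3, 3])   # everyone mildly satisfied
polarized = np.array([1, 1, 1, 5, 5, 5])  # half love it, half hate it

print(lukewarm.mean())   # 3.0
print(polarized.mean())  # 3.0 -- identical means, very different realities
# The mean looks precise, but the ordinal scale never licensed it:
# nothing guarantees the "distance" from 1 to 2 equals that from 4 to 5.
```

The average is a perfectly valid computation; what's invalid is the assumption that the underlying scale supports it.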
Next, let’s talk about sufficiency, having enough data to produce generalizable results, as a condition of analysis. Even with measurable data, scarcity limits inference. Our two main frameworks for this philosophical analysis, statistics and machine learning, both illustrate this condition well. In statistics, small samples produce unstable estimates, wide confidence intervals, and high variance. The law of large numbers only works when “large” actually applies. That law is a pivotal result of statistics which says, roughly, that the more data you have, the closer your observed results will be to the “actual” ones. Take a coin flip as a concrete example: flip a coin twice and you might get heads both times, but that doesn’t mean the coin lands heads 100% of the time. You need to flip it many more times to become confident about the odds of a given outcome. In machine learning, sparse data makes models overfit: they find relationships that hold for the data at hand but don’t generalize to a wider variety of data. Sparse data can trick a model into learning the “quirks” of a specific data set rather than general rules.
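The coin example is easy to see in a few lines of simulation. This is a rough sketch (the sample sizes are arbitrary): the running proportion of heads for a fair coin wanders badly at small n and only settles near the true 0.5 as n grows.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

for n in (2, 10, 100, 10_000):
    flips = rng.integers(0, 2, size=n)  # 1 = heads, 0 = tails
    print(f"n={n:>6}: proportion of heads = {flips.mean():.3f}")
# With n=2 you can easily observe 0.000 or 1.000; by n=10_000 the
# observed proportion sits close to the true probability of 0.5.
```

This is sufficiency in miniature: the estimator is the same at every n, but only with enough data does it say anything reliable about the coin.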
These first three conditions, detectability, measurability, and sufficiency, set the stage for everything that follows. They define the threshold at which analysis becomes possible at all. Without something to detect, a way to measure it, and enough data to generalize from, analysis risks turning into form without substance. In the next part, I’ll look at the final two conditions of analyzability, learnability and replicability, and explore what happens when even well-structured data begins to fail those tests.

