  • The Limits of Analysis Pt. 2

    The Conditions of Analyzability

    In Part 1, I argued that not everything can be turned into analyzable data. Some experiences resist detection, some collapse under measurement, and others vanish once we try to replicate them. But when analysis is possible, it depends on certain hidden preconditions, which I’ll call the five conditions of analyzability. We will look at three of them here and the remaining two in the next article. Some, if not all, of these points might come off as truisms at first, but the aim of this part of the series is to take a deeper look at the hidden assumptions that lie at the heart of analysis. These conditions serve as thresholds: they are not guarantees of insight, but rather the minimum requirements for analysis to even begin. Without them, statistics and machine learning alike produce empty formalisms, outputs untethered from reality.

    The first condition we will discuss is detectability. In short, the signal in question must not merely exist; it must exist in a context that leaves a detectable trace for us to observe. If there is no signal to detect, there is no starting point for analysis. A natural example of this condition from statistics is hypothesis testing, which depends on being able to distinguish a real effect from random noise. If the effect is completely buried in noise, inference collapses. The same viewpoint applies to machine learning: a system can only learn about a signal if that signal actually shows up in the distribution of the data you are analyzing. Returning to the fraud example from Part 1 of the series, fraud can exist yet still escape analysis, because by its nature it deliberately avoids leaving a detectable signal.
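    As a rough illustration of this threshold, here is a minimal sketch of a two-sample t-test where the same kind of effect is either recoverable or buried, depending on its size relative to the noise. The sample size, effect sizes, and noise level are invented for illustration, not taken from any real study.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 100          # observations per group (illustrative choice)
    noise_sd = 1.0   # standard deviation of the noise

    for effect in (1.0, 0.05):  # a strong effect vs. one buried in noise
        control = rng.normal(loc=0.0, scale=noise_sd, size=n)
        treated = rng.normal(loc=effect, scale=noise_sd, size=n)

        # Two-sample t-test: can we distinguish the groups from noise alone?
        t_stat, p_value = stats.ttest_ind(treated, control)
        print(f"effect={effect:>4}: t={t_stat:5.2f}, p={p_value:.3f}")

    # The large effect yields a tiny p-value; the small effect typically does not,
    # even though it "exists" -- it simply fails the detectability threshold here.
    ```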

    Beyond detecting the signal, we need to be able to quantify it. We can think of this condition as measurability, in terms analogous to the magnitude or strength of the signal: we must be able to structure what we detect into a form that can be compared, ordered, or aggregated. In statistics, this shows up in the choice of measurement scale. Not all numbers mean the same thing; the way something is measured determines which comparisons are valid. Going back to the customer satisfaction example from Part 1 of this series, a question like “on a scale of 1 to 5, how satisfied were you with your visit?” converts a complex feeling into a narrow range of integers. Yet when analysis is performed on that satisfaction data, you will often see results like an average of 3.5 out of 5. When you apply quantitative analysis to qualitative measurements, you risk producing results that look precise but aren’t truly quantitative: treating qualitative judgments as if they sit on an interval scale creates an illusion of structure where there may be none. In machine learning, the same condition is mirrored by embeddings, normalizations, and encodings, all of which attempt to organize messy real-world data into measurable, consistent signals we can learn from. But every such translation is also a reduction: nuance becomes structure, texture becomes number. Whether in statistics or machine learning, measurability always involves a trade-off between richness and regularity, between what is real and what is processable.
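    To make the “illusion of precision” concrete, here is a small sketch, using two made-up sets of 1-to-5 satisfaction ratings, showing how very different groups of respondents can share the same average once the ordinal labels are treated as interval numbers.

    ```python
    from collections import Counter
    from statistics import mean, median

    # Two hypothetical sets of 1-to-5 satisfaction ratings (invented for illustration).
    lukewarm  = [3, 4, 3, 4, 3, 4, 3, 4]   # everyone mildly satisfied
    polarized = [1, 5, 1, 5, 1, 5, 5, 5]   # a mix of delighted and unhappy customers

    for name, ratings in [("lukewarm", lukewarm), ("polarized", polarized)]:
        # Treating ordinal labels as interval numbers gives a tidy-looking average...
        avg = mean(ratings)
        # ...while ordinal-friendly summaries keep more of the distribution's shape.
        med = median(ratings)
        counts = dict(sorted(Counter(ratings).items()))
        print(f"{name:>9}: mean={avg:.2f}, median={med}, counts={counts}")

    # Both groups report a mean of 3.50, but they describe very different experiences.
    # The average is only as meaningful as the scale assumptions behind it.
    ```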

    Next, let’s talk about sufficiency, having enough data to produce generalizable results, as a condition of analysis. Even with measurable data, scarcity limits inference. Again, our two main frameworks for this philosophical analysis, statistics and machine learning, offer good examples of this condition. In statistics, small samples produce unstable estimates, wide confidence intervals, and high variance. The law of large numbers only works when “large” actually applies. The law of large numbers is a pivotal result of statistics which says, more or less, that the more data you have, the closer your observed results come to the “actual” underlying values. A concrete example is flipping a coin: flip it twice and you might get heads both times, but that doesn’t mean the coin lands heads 100% of the time. You need to flip it many more times to become confident about the odds of a specific outcome. In machine learning, sparse data makes models overfit, which is when a model finds relationships that work for the data at hand but don’t generalize to a wider variety of data. Sparse data can trick the model into learning “quirks” of the specific data set rather than general rules.
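    Both halves of this condition are easy to simulate. Below is a minimal sketch, with sample sizes, a noise level, and a polynomial degree chosen arbitrarily for illustration, showing the coin-flip version of the law of large numbers, followed by a tiny overfitting demo where a very flexible fit nails ten noisy training points but falls apart on fresh data.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # --- Law of large numbers: the running proportion of heads settles near 0.5 ---
    flips = rng.integers(0, 2, size=10_000)           # 0 = tails, 1 = heads
    for n in (2, 10, 100, 1_000, 10_000):
        print(f"after {n:>6} flips: proportion of heads = {flips[:n].mean():.3f}")

    # --- Overfitting on sparse data: a degree-9 polynomial fit to 10 noisy points ---
    def true_fn(x):
        return np.sin(x)                              # the "real" pattern we hope to learn

    x_train = rng.uniform(0, 2 * np.pi, size=10)
    y_train = true_fn(x_train) + rng.normal(0, 0.2, size=10)

    coeffs = np.polyfit(x_train, y_train, deg=9)      # enough flexibility to memorize
    x_test = rng.uniform(0, 2 * np.pi, size=200)
    y_test = true_fn(x_test) + rng.normal(0, 0.2, size=200)

    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
    # The near-zero training error reflects quirks of the 10 points, not a general rule.
    ```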

    These first three conditions, detectability, measurability, and sufficiency, set the stage for everything that follows. They define the threshold at which analysis becomes possible at all. Without something to detect, a way to measure it, and enough variation to generalize from, analysis risks turning into form without substance. In the next part, I’ll look at the final two conditions of analyzability, learnability and replicability, and explore what happens when even well-structured data begins to fail those tests.

  • The Limits of Analysis Pt 1

    The Preconditions of Knowing

    This is Part 1 of a 3-part series on what makes a quantity analyzable. In this opening piece, I set the stage by exploring the limits of analysis itself, why some things can’t be turned into data at all. Part 2 will unpack the five conditions that make analysis possible, and Part 3 will reflect on what happens when those conditions fail.

    When it comes to machine learning and statistics, the impulse to model, predict, or explain is often very compelling. But before any of that is possible, there is a more fundamental question we need to ask: is the thing we want to analyze even analyzable? Not every phenomenon can be subsumed into the realm of data. Some things resist measurement, others produce too little evidence to generalize from, and still others dissolve entirely when scrutinized for reproducibility. My main claim is that analysis isn’t automatic; it only works under the right conditions.

    It helps to look at some familiar, real-world examples that demonstrate this claim. Customer satisfaction is a good one: it is easy to have a conversation about customer satisfaction, but very difficult to measure it directly. A single review does not capture the entire picture, and bias creeps in because people are more likely to leave a review after a negative experience than after a positive one. Fraud detection is another example: fraud does exist, but it leaves only subtle traces behind, and if those traces are not detectable, the data alone cannot solve the problem. Happiness is a third example that empiricists and positivists have struggled to account for. It is a deep human experience, but how do we adequately detect, quantify, and replicate it across individuals?

    As we look across both statistics and machine learning, we see a meta-pattern about analysis. For a quantity to be analyzable, it must meet certain conditions. In part 2 of this series, I will look into these conditions in much more detail, but we’ll start with a preview of the conditions for now.

    • The data must first be detectable, meaning there has to be some signal, however faint, for us to find.

    • The data must be measurable, meaning it can be organized into a structured form. On the surface this seems nearly identical to the first condition, but the requirement of structure means we need to be able to do basic things like determine whether one signal exceeds another, or whether there is a hierarchy among the signals we detect, something analogous to the strength of the signal.

    • Next, we need to have sufficient data, enough variation to meaningfully capture a representation of the pattern we wish to analyze. This is one reason why sample sizes in statistics are of such importance when making claims with a specific confidence level.

    • The data must also represent something learnable in the first place, given both its structure and the amount you have collected. In other words, the data must be organized so that algorithms or models can generalize from it.

    • Finally, the data has to be replicable. This is another subtle assumption that will take deeper analysis, but in general, we expect the data to be stable across samples and systems.

    In the next article, I’ll dive into each of these conditions, showing how they shape both the limits and the possibilities of data science. If Part 1 is about asking whether something can be analyzed, Part 2 is about learning how to test those conditions in practice.