The Preconditions of Knowing
This is Part 1 of a 3-part series on what makes a quantity analyzable. In this opening piece, I set the stage by exploring the limits of analysis itself: why some things can’t be turned into data at all. Part 2 will unpack the five conditions that make analysis possible, and Part 3 will reflect on what happens when those conditions fail.
In machine learning and statistics, the impulse to model, predict, or explain is often compelling. But before any of that is possible, there is a more fundamental question to ask: is the thing we want to analyze even analyzable? Not every phenomenon can be subsumed into the realm of data. Some things resist measurement, others produce too little evidence to generalize from, and still others dissolve entirely when scrutinized for reproducibility. My main claim is that analysis isn’t automatic: it only works under the right conditions.
It helps to ground this claim in familiar, real-world examples. Customer satisfaction is a good one: it is easy to talk about, but very difficult to measure directly. A single review does not capture the whole picture, and the reviews we do get are biased, because people are more likely to leave one after a negative experience than after a positive one (the short simulation below shows how this skews what we observe). Fraud detection is another example: fraud exists, but it leaves only subtle traces behind, and if those traces are not detectable, the problem cannot be solved from the data alone. Happiness is a third example, one that empiricists and positivists have long struggled to account for. It is a deep human experience, but how do we adequately detect, quantify, and replicate it across individuals?
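To make the review-bias point concrete, here is a minimal simulation sketch. The numbers are made up purely for illustration: assume 80% of customers are truly satisfied, but dissatisfied customers are three times as likely to leave a review.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical numbers, for illustration only.
n_customers = 100_000
true_satisfaction_rate = 0.80
p_review_if_satisfied = 0.05
p_review_if_dissatisfied = 0.15

# Simulate which customers are satisfied and which of them leave a review.
satisfied = rng.random(n_customers) < true_satisfaction_rate
review_prob = np.where(satisfied, p_review_if_satisfied, p_review_if_dissatisfied)
left_review = rng.random(n_customers) < review_prob

# Satisfaction rate as seen through reviews alone.
observed_rate = satisfied[left_review].mean()

print(f"True satisfaction rate:     {true_satisfaction_rate:.2f}")
print(f"Satisfaction among reviews: {observed_rate:.2f}")
```

Under these assumptions, only about 57% of reviewers are satisfied even though 80% of customers are, which is exactly the kind of gap that makes a raw review score a misleading measurement of satisfaction.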
Looking across both statistics and machine learning, a meta-pattern about analysis emerges: for a quantity to be analyzable, it must meet certain conditions. In Part 2 of this series I will examine these conditions in much more detail, but here is a preview.
• The data must first be detectable, meaning it has to have some signal, however faint, for us to find.
• The data must be measurable, meaning it can be cast into a structured form. On the surface this seems nearly identical to the first condition, but structure means we can do basic things with what we detect: determine whether one signal exceeds another, for example, or order signals by something analogous to their strength.
• Next, we need sufficient data: enough variation to meaningfully capture the pattern we wish to analyze. This is one reason sample sizes matter so much in statistics when making claims at a specific confidence level (see the sketch after this list).
• The data must also represent something learnable in the first place. Given its structure and the amount collected, a model or algorithm must be able to generalize from it.
• Finally, the data has to be replicable. This is another subtle assumption that will take time to unpack, but in general we expect the data to be stable across samples and systems.
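To give a feel for the sufficiency condition, here is a minimal sketch with synthetic data: as the sample size grows, the 95% confidence interval around an estimated mean narrows. The distribution parameters are arbitrary placeholders, not real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples of increasing size from the same distribution and watch
# the 95% confidence interval for the mean shrink. The 1.96 factor is
# the usual normal approximation for a 95% interval.
for n in [30, 300, 3_000, 30_000]:
    sample = rng.normal(loc=50.0, scale=10.0, size=n)
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:>6}: mean ≈ {sample.mean():.2f} ± {half_width:.2f}")
```

With these settings, the estimate is only pinned down to roughly ±3.6 at n=30 but to roughly ±0.1 at n=30,000; that narrowing is the practical meaning of having "sufficient" data for a claim at a given confidence level.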
In the next article, I’ll dive into each of these conditions, showing how they shape both the limits and the possibilities of data science. If Part 1 is about asking whether something can be analyzed, Part 2 is about learning how to test those conditions in practice.
