
CHAPTER 18

Mastering the Evidence: The Lawyer’s Essential Guide to Data Analytics

Murali Neelakantan and Ashish Kulkarni 1

“Data is the sword of the 21st century, those who wield it well, the Samurai.”

– Jonathan Rosenberg, American technologist

SUMMARY

There is a long-standing belief in the judicial ecosystem that data is definite and represents the truth;

One can go a very long way by learning the art of asking the right questions using statistical analysis;

Averages can be misleading, and dangerously so;

Sampling is a tricky thing. And nowhere is this more applicable than in the case of DNA testing as evidence;

An estimate arrived at by studying a sample is never the same as studying the population.

Introduction

This quote may surprise the legal community, since the law curriculum completely ignores data analysis. It is almost as if data analysis is mathematics and therefore unnecessary for lawyers. Some would observe that lawyers have been able to get by without any understanding of mathematics and statistics and therefore do not need it. We disagree. We maintain that the lack of a basic understanding of data analysis causes injustice. The law claims that it deals with facts and is not concerned with complicated theoretical equations. But that gives rise to an interesting question: when is something a fact? When is a fact said to be proved?

The Indian Evidence Act, 1872 tells us that a fact is “said to be proved when, after considering the matters before it, the court either believes it to exist or considers its existence so probable that a prudent man ought, under the circumstances of the particular case, to act upon the supposition that it exists.” 2 Over 100 years ago, Nathan Isaac said, “... at some point the law must draw a line and say that some sort of ascertainment shall pass for the truth. Thus, even in connection with the visible and tangible facts of a particular case, it is a constructive truth – the verdict of a jury or the findings of some other tribunal – subject to more or less arbitrary rules of evidence, that must pass as the unchallengeable truth. The difference is only in degree, and the only criticism that can be offered against any particular mode of deciding any question of fact is that the arbitrary line is drawn back too far from the realm of realities.”

We believe that data analytics using rudimentary statistics can help us understand where the aforementioned line is and how far away it is from the realm of reality. There is a long-standing belief in the judicial ecosystem that data is definite and represents the truth. The idea that uncertainty pervades every aspect of our existence and that we ought to think probabilistically is absent from the legal system, even though the Evidence Act commands us to think about it carefully. But the very process of marshalling facts involves choosing relevant data from observed reality and interpreting that data into facts that are relevant to the case. This involves an understanding of the principles of statistics, and a grounding in statistics is therefore crucial for a law student.

Statistics makes this ‘construction’ of the truth objective, and keeps the ‘line’ consistent in its application. Wikipedia defines statistics as “...the discipline that concerns the collection, organisation, analysis, interpretation, and presentation of data.” 3 Statisticians often ask if their field is art, science or perhaps both. 4 In our opinion, it is more of the former than the latter, but that being said, it is possible to systematise one’s approach to working with data. Such systematisation doesn’t lead us to a world with no errors, but to a world where errors can be understood and minimised. 5

The principles and key ideas of statistics can be taught to and learnt by everybody. While the more arcane calculations are perhaps best left to qualified experts, the principles of statistics require nothing more than common sense. One can go a long way by asking the right questions using statistical analysis when presented with data or analysis based on data. Asking these questions and interpreting the answers can help in minimising the chances of error when arriving at a judgement based on data.

Our goal in this chapter is twofold. First, we outline key principles that underpin statistical theory that can be used for data analysis. We do so using only words, with the occasional sprinkling of equations or diagrams. These principles are easy to explain, understand, and apply. We build an outline of the principle, explain its importance, and provide examples where the application (or misapplication) of the principle was key to a sound (or otherwise) judgment. These examples are drawn from the field of law and beyond.

Next, we use these principles to recommend a syllabus for the study of data analytics using statistics at law schools across the country. This isn’t a radical new idea - “For the rational study of the law the blackletter man may be the man of the present, but the man of the future is the man of statistics and the master of economics.” 6 Such a syllabus is easily designed and implemented. More importantly, we hope to be able to prove in the course of this chapter that studying statistics is inevitable as a tool for data analysis for members of the legal profession. We hope that lawyers and judges will be able to use these tools to translate sets of numbers into words and decide if the data presented to them is relevant, proves a fact, or otherwise.

We then outline the 10 principles and explain why they are important with the help of legal cases where these principles were either successfully applied, or unfortunately ignored. Further, we also illustrate examples from outside the legal profession for better comprehension and recollection.

The Hacks

In this section, we build an outline of a syllabus on statistics that can be taught at every law school across the country. The syllabus will require no knowledge of mathematics beyond what has been taught at the high school level. We also explain the urgent need to introduce such a syllabus at the earliest possible opportunity in law school curricula given that we are, in the opinion of Justice Holmes, already late by about a century. The final section concludes with some suggestions of how one may proceed, along with some foreseeable limitations and problems.

1. When is a ‘fact’ a fact?

The onset of winter in Delhi is a guarantee of three things: 1) a gradual but distinct drop in temperatures, 2) the advent of the famed wedding season, and 3) smog. Visibility is close to non-existent during this season and the smog affects the health of the residents of Delhi. Consequently, the government and courts 7 have taken a keen interest in the matter, though the resolution is not easy.

It is difficult to accurately determine the underlying cause of smog. Reasons put forth include firecrackers during Diwali, air pollution from factories and vehicles, harsher winters, and the burning of crop stubble in Punjab. Stubble burning is a process of setting fire to the straw stubble left after the harvesting of grains, like paddy, wheat, etc. It is usually required in areas that use the combined harvesting method, which leaves crop residue behind. Consequently, acres of farmland are set on fire at the onset of winters in northern India, with entirely predictable results – heavy smog. Apart from announcing a few schemes, the Punjab government is not enthusiastic about footing the bill for whatever new equipment may be required to get farmers to give up on stubble burning. The Delhi government is naturally chary of footing the bill and so the problem continues to burn.

Eventually, the issue reached the Indian courts and senior Supreme Court counsel and Solicitor General Tushar Mehta informed the court that stubble burning accounted for only 10 per cent of the capital’s Particulate Matter (PM) 2.5 pollution. 8 How do we verify whether the Solicitor General’s statement is true? How do we ascertain the relevance of this statement to the problem? And finally, how do we determine if the statement can, and should be, modified to make it relevant?

Let’s use another example for the sake of clarity. If someone were to tell you that the road outside your house has an average traffic flow of 100 vehicles per hour, would it strike you as a reasonable estimate? Would you therefore assume that 100 vehicles per hour is a good estimate, no matter the time of day? This is, in fact, an important exercise, and the data from it forms the basis for drawing up tenders for roads. And then there is electricity demand. Can electricity demand be the same throughout the day, or is it likely to rise and fall depending upon human activity? Is there a difference between urban and rural areas? The answers to these questions help compute load factors for power plants and fix electricity tariffs. In fact, there is a separate tribunal for these matters.

Even if one were to assume that the statement about stubble burning contributing to only 10 per cent of PM 2.5 pollution is true, is it relevant? Analysis suggests otherwise. 9 Averages can be misleading and dangerously so. As Nassim Nicholas Taleb is fond of saying, ‘Never wade across a river that is on average four feet deep’.

Finally, the report says: “It is like presenting the annual concentration of methyl isocyanate gas in Bhopal’s air and then concluding that the leak on the night of 2-3 December 1984 that killed thousands was insignificant and thus, Union Carbide should not be blamed.” 10 The simple principle at play when it comes to data analysis is this: Asking about the validity of a statement is only half the battle. Asking about its relevance, and asking if the submitted statement is the best possible description of the underlying data, is the more important question. Ask the right questions, and don’t hesitate to keep on asking them.
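To make the point concrete, here is a minimal sketch in Python with entirely invented PM 2.5 readings (the numbers are hypothetical and chosen only to show how an annual average can bury a seasonal crisis):

```python
# Hypothetical daily PM2.5 readings (micrograms per cubic metre), invented for illustration.
# Eight months of moderate air, four winter months of severe smog.
moderate_days = [60] * 240   # roughly April to November
winter_days = [400] * 125    # roughly December to March

all_days = moderate_days + winter_days

annual_average = sum(all_days) / len(all_days)
winter_average = sum(winter_days) / len(winter_days)

print(f"Annual average PM2.5: {annual_average:.0f}")   # about 177: looks 'bad but manageable'
print(f"Winter average PM2.5: {winter_average:.0f}")   # 400: the number that actually harms health
```

The annual figure is arithmetically correct, yet it is the winter figure that describes the problem the court is actually being asked to solve.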

As any statistician will tell you, there isn’t ever a one-size-fits-all correct way of representing data. A person, for example, can at the same time be a mother, a daughter, a friend, a neighbour, a professional at the workplace, a student and a commuter. Which role she will use to describe herself is very much a function of the social setting she finds herself in. It is not sufficient to introduce herself as a commuter when attending a professional conference on behalf of her employers. It is a fact that she commutes daily to work, but is that fact relevant? And so it goes with data. Generating credible data is relatively easy; interpreting its relevance to the problem at hand is far more difficult.

Take the example of the Wholesale Price Index (WPI) and the Consumer Price Index (CPI). We often see the WPI and the CPI move in opposite directions. To begin with, index numbers are a way to aggregate information about a phenomenon; price indices, for instance, aggregate information about the prices of commodities. A lot depends on three factors: what information is being aggregated, what the rule of aggregation is, and how the information is being collapsed into a single number. Two indices can show very different, even distorted, pictures based on these factors. The WPI aggregates information about the wholesale prices of commodities, while the CPI tracks the prices consumers pay. The upward movement in the CPI was driven by an increase in food costs, because these items carry a much higher weight in the CPI (48 per cent as against 24 per cent). Services are another component of the CPI which is absent from the WPI.

Often the price increase indicated by the CPI does not represent the actual increase in all places across India. The food CPI may have increased overall, but it is still possible that milk in Mumbai became cheaper, as compared to Chennai, at the same time. As a result, a national minimum wage linked to the national CPI will hurt employees who find that actual inflation where they live is far greater than the CPI suggests. It is for this reason that the minimum wage is set for regions within a state. Since the actual cost of living and the impact of inflation are not uniform across India, or even within a state, linking the minimum wage to the CPI does not adequately compensate for inflation. This has material implications for the hundreds of cases about the fixing of minimum wages and their applicability to different classes of employees.

All aggregation is based on principles. When we aggregate data, we produce a single value but we also miss out on critical details. We must understand the principles underlying this aggregation in sufficient detail before we judge index numbers.
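A minimal sketch of how the choice of weights drives an index. The 48 per cent and 24 per cent food weights are the figures mentioned above; the price changes themselves are invented for illustration:

```python
# Year-on-year price changes for two broad groups, invented for illustration.
price_change = {"food": 0.10, "non_food": -0.04}  # food up 10%, everything else down 4%

# Two indices aggregating the same price changes with different weights.
cpi_weights = {"food": 0.48, "non_food": 0.52}
wpi_weights = {"food": 0.24, "non_food": 0.76}

def index_change(weights):
    # A weighted average: each group's price change multiplied by its weight.
    return sum(weights[group] * price_change[group] for group in weights)

print(f"CPI-style inflation: {index_change(cpi_weights):+.1%}")  # about +2.7%
print(f"WPI-style inflation: {index_change(wpi_weights):+.1%}")  # about -0.8%
```

Same underlying price changes, two aggregation rules, two opposite stories – which is exactly why an index number should never be judged without understanding the principles behind its construction.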

2. What is sampling all about?

The Statistical Research Group is a name unlikely to arouse much excitement in the minds of most people. However, this team at Columbia University had a crucial role to play during World War II. It was their job to analyse the damage done to allied warplanes in dogfights and suggest suitable reinforcements to increase the chances of survival. The team of analysts had access to planes that had returned from dogfights. These planes had damage patterns as shown in the sample below.

What this image shows us is the damage suffered, on average, by all planes in the sample. Unless you have seen the image before on social media, or have heard of the story elsewhere, the answer by the Statistical Research Group may surprise you. The group recommended that additional armour plating or reinforcements be added to those parts of the plane that had seen the least damage.


Figure 1: Damage patterns in planes 11

Abraham Wald, one of the most famous statisticians of the 20th century, and an important member of the Statistical Research Group, had a convincing, if counter-intuitive, reason. The core argument was this: planes that had sustained damage in these parts had not made it back to the base for analysis. They had gone down in the fight. The sample consisted only of planes that had made it back, which means that these planes were able to return despite being hit where they were hit. If you think about it carefully, the Statistical Research Group’s answer is not surprising. 12

Another example to consider is the question of which treatment works best for severely ill COVID-19 patients. The question should be answered by taking into account not just survivors, but also those who died. The name given to this phenomenon is survivorship bias.

Sampling is a tricky thing. And nowhere is this more applicable than in the case of DNA testing as evidence. There is enough research and more, to show that DNA testing is not infallible 13 and cases abound where poor statistical understanding has led to false convictions. 14 These problems could be related to contamination of DNA evidence or a failure to utilise correct statistical reasoning. Incorrect statistical reasoning is not to be taken lightly as it can have horrifying consequences.

In November 1999, for example, an English solicitor named Sally Clark was sentenced to life in prison for two charges of murder. 15 We reproduce the case description: “The cause of death in both cases was first attributed to Sudden Infant Death Syndrome (SIDS), also known as ‘cot death’ in the UK. We still do not know about the specific causes of SIDS. But suspicion against the mother rose on account of the unspecified causes for the death of two babies from the same family. Shortly after the demise of her second child, Clark was held, tried and incarcerated.

So what exactly was deemed relevant to the case built against Clark? One important factor was expert testimony provided by a paediatrician, who put the odds of two children from the same family dying of SIDS at 1 in 73 million. How did the doctor arrive at this number? For the level of affluence of Clark’s family, it was reasoned that the chance of one infant dying of SIDS was 1 in 8,543 – that is, roughly 1 out of every 8,500 such infants would die of SIDS. What, then, were the chances that two children from the same family would die of SIDS?

According to statisticians, the answer depends on whether the two deaths are independent of each other. If one assumes that they are, then the probability of two deaths in the same family is simply the multiplicative product of the two probabilities. 16 The figure we arrive at is 1 in 8,543 multiplied by itself, which is roughly 1 in 73 million. This figure would be enough to convince any ‘reasonable man’ that the deaths could not have been a coincidence. On the other hand, if the two events are not independent of each other — for genetic or environmental reasons that we are not aware of just yet — then it is possible that several children from the same family may die of SIDS. In fact, doctors say that the likelihood of another SIDS death increases when the family has already seen one such death.

Clark’s conviction was overturned on her second appeal, and she was released from prison. She died four years later”. Another example worthy of discussion is State v. Soto. 17 The defendants in this case moved to suppress evidence from traffic stops on the ground that the stops resulted from discriminatory enforcement of traffic laws. The motions were granted.
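Returning to the arithmetic in the Sally Clark case, the figure the jury heard is easy to reproduce. A minimal sketch in Python, where the 1-in-8,543 probability comes from the testimony described above and the conditional probability used in the dependent case is a purely hypothetical number, chosen only for illustration:

```python
p_one_sids = 1 / 8543          # chance of one SIDS death, from the testimony described above

# The prosecution's calculation: assume the two deaths are independent,
# so simply multiply the probability by itself.
p_two_independent = p_one_sids * p_one_sids
print(f"Assuming independence: 1 in {1 / p_two_independent:,.0f}")   # about 1 in 72,982,849

# If the deaths are NOT independent (shared genetic or environmental causes),
# the relevant figure is the conditional probability of a second death
# given that one has already occurred. Suppose, purely hypothetically, it is 1 in 100.
p_second_given_first = 1 / 100
p_two_dependent = p_one_sids * p_second_given_first
print(f"With dependence:       1 in {1 / p_two_dependent:,.0f}")     # about 1 in 854,300
```

The multiplication itself is trivial; the legal question is whether the independence assumption that licenses the multiplication was ever justified.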

Sampling errors

What is the average height of the Indian female? You would soon realise that a definitive answer simply is not possible as it would mean measuring the heights of all women in India, an impossible exercise. Impossible for two major reasons: (i) Even with an army of assistants it would take far too much effort, time, manpower and money, (ii) ‘All’ women would never be a static list as births and deaths are an ongoing process.

So, statisticians construct a sample. A sample by definition is smaller in number than the entire population, and therefore more easily assembled, measured and verified. What we gain in convenience and tractability, we lose in terms of thoroughness and surety. An estimate arrived at by studying a sample, no matter how carefully constructed, is never the same as studying the population. And when conclusions from the sample lead us astray about the features of the population, we get what is known as sampling error. No matter how perfect a sample you construct, you will always have sampling errors. The question then is, how big is the error and in which direction? What are the errors implicit in sampling?

The answer lies in two different, but related, topics: specificity and sensitivity. Given the experiences of the last two years or so, almost everybody is familiar with the Rapid Antigen Test (RAT) that can be self-administered at home to find out if one is infected with the COVID-19 virus. If the result of the RAT turns positive, does it mean you are infected with the virus? Or is it a false positive? If the result is negative, does it mean you are not infected with the virus? Or again, is it a false negative? Specificity and sensitivity help us answer these questions. Sensitivity is sometimes referred to as the ‘True Positive Rate’. Simply put, it is the probability that the test result turns out to be positive, given that you are infected with the virus.

Specificity, on the other hand, is the probability that the test result turns out to be negative, given that you are not infected with the virus. Consider Figure 2 18 :


Figure 2: Statistical Errors 19

Some definitions of the terms in the diagram are:

A False Negative is when an infected individual returns a negative result.

A True Positive is when an infected individual returns a positive result.

A False Positive is when an uninfected individual returns a positive result.

A True Negative is when an uninfected individual returns a negative result.

One way to make sure that all infected individuals are detected correctly is to ensure that all tests come back positive. Unfortunately, this will also mean that uninfected individuals will show up positive as well (false positives). Specificity, therefore, captures how specific the test is: does it return a positive result only for those who actually have the virus? The True Positive Rate then becomes the following:

True Positive Rate (Sensitivity) = True Positives / (True Positives + False Negatives)

While the False Positive Rate is:

False Positive Rate = False Positives / (False Positives + True Negatives)
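Both rates are simple ratios once a test’s results are tabulated against the truth. A minimal sketch in Python, using invented counts for a hypothetical evaluation of a rapid antigen test:

```python
# Invented counts from a hypothetical evaluation of a rapid antigen test.
true_positives = 90    # infected, test positive
false_negatives = 10   # infected, test negative
false_positives = 25   # not infected, test positive
true_negatives = 875   # not infected, test negative

sensitivity = true_positives / (true_positives + false_negatives)     # True Positive Rate
specificity = true_negatives / (true_negatives + false_positives)     # True Negative Rate
false_positive_rate = false_positives / (false_positives + true_negatives)

print(f"Sensitivity (true positive rate): {sensitivity:.0%}")          # 90%
print(f"Specificity:                      {specificity:.1%}")          # about 97.2%
print(f"False positive rate:              {false_positive_rate:.1%}")  # about 2.8%
```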

For law students, the concept ought to be very familiar indeed. In Selvi & Ors v. State Of Karnataka, 20 it was stated that errors associated with polygraph tests are either ‘false positives’ or ‘false negatives’. In the first case, the results show that a person has been “deceitful” even if the answers to the test are true. A ‘false negative’ occurs when misleading responses are held as true. “On account of such inherent complexities, the qualifications and competence of the polygraph examiner are of the utmost importance.”

From the point of view of statistical theory, it is worth asking some questions in this case. Would repeated sampling have made a material difference? Therefore, should the person have been subjected to multiple lie detector tests and not just one? If so, how should we factor in fatigue on the part of both the subject as well as the person administering the test? If different people are administering the test, to help avoid fatigue, are the results comparable? What about the questions in the test itself? Are they appropriate and suitable and on what basis? Would different questions yield different results, and if so, how should these results be interpreted?

There are no easy answers to these questions, and indeed, some of them may even prove to be indeterminate. Our point is that if a judgement is to be made based on these facts, then one must ask if these are indeed facts ‘beyond all reasonable doubt’. In the case of a lie detector test, we would argue that the questions become, if anything, even more pressing than usual. Back to our charts. Now consider the same picture, but with labels that are more familiar in a legal context, in Figure 3:


Figure 3: Statistical errors framed in a judicial context 21

So, a sampling error occurs when an analysis to reach a conclusion about a population is undertaken based on a sample that is not representative of the population. Trying to arrive at the average height of all Indians by measuring the height of basketball players in all Indian cities, for example, will almost certainly result in sampling error. While there exist ways and means to objectively measure sampling error, the purpose here is to help the reader understand what sampling error is, and how to develop a framework which assesses if a sampling error exists in the dataset. What follows are simple questions about the data.

Ask for a description of the population for which the data is being collected.

How has the data been collected? Has it been collected over the telephone, 22 over email, or in person?

Why was the method used deemed to be the best? Were other methods considered? Have potential biases that may emerge as a consequence of the collection methodology been considered?

Was the data collected at a particular time of day, a particular day of the week, or during a particular time of the year? Will this lead to a bias? (For example, measuring road traffic density on the weekends is likely to give biased results).

Always ask to see a copy of the questionnaire, and make sure that the text of the questionnaire does not have a bias. 23

The feel for whether the data looks right or otherwise develops as you work on it. This is an imperfect approach, and errors are possible – even ‘experts’ fail to notice them. 24 A series of checks about the quality of data is always advisable.

Correlation is not Causation

Causation has been the subject of many philosophical treatises, right from the time of Aristotle, 25 if not earlier. David Hume and Immanuel Kant, 26 among others, have written at length on the subject, but it remains poorly understood among statisticians, philosophers, and others. Consider the charts taken from Tyler Vigen’s excellent website dedicated to this topic, shown in Figures 4 and 5:

Figures 4 and 5: Examples of spurious correlations, from Tyler Vigen’s website

As you can see, there is a near-perfect correlation between the two series shown in each chart, and though it may be trite to point it out, one could not possibly cause the other. This is what is meant by the title of this section: correlation is not causation. There exist several websites, textbooks, videos and podcasts that explain the many fallacies that can arise as a consequence of a poor understanding of causation, and we list some of them in an appendix at the end of this chapter. To reiterate – correlation is not causation.

Consider the legal controversy over Bendectin, the market leader for the treatment of morning sickness in the 1980s. It was around this time that a substantial number of cases against the manufacturer, Merrell Dow Pharmaceuticals, surfaced, alleging that the drug caused birth defects. Several studies in the late 1970s and early 1980s appeared to support these lawsuits by associating the use of Bendectin with certain birth defects. This is a classic example of correlation not being causation. We borrow from an existing case 29 and reproduce it here: “Bendectin (Doxylamine/Dicyclomine/Pyridoxine) was widely used for the treatment of nausea and vomiting during pregnancy until 1983. A meta-analysis of the 16 cohort and 11 case-control studies gives us an idea of the relative risk of malformation at birth associated with Bendectin exposure. The pooled estimate of the relative risk of any malformation in the first trimester was 0.95 (95 per cent CI 0.88 to 1.04). For cardiac defects, central nervous system defects, neural tube defects, limb reductions, oral clefts and genital tract malformations, the pooled estimates of relative risk were in the range of 0.81 for oral clefts to 1.11 for limb reductions. Barring two categories, tests for heterogeneity of association showed that all studies were estimating the same odds ratio. In other words, the results show no difference in the risk of birth defects between those babies whose mothers had consumed Bendectin during the first trimester and those babies whose mothers had not”.
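To see what a relative risk close to 1 means in practice, here is a minimal sketch with invented counts (these are not the figures from the studies quoted above; the point is only to show the calculation):

```python
# Invented counts for an illustrative cohort study.
exposed_babies = 10_000        # mothers took the drug in the first trimester
exposed_malformations = 95

unexposed_babies = 10_000      # mothers did not take the drug
unexposed_malformations = 100

risk_exposed = exposed_malformations / exposed_babies        # 0.95%
risk_unexposed = unexposed_malformations / unexposed_babies  # 1.00%

relative_risk = risk_exposed / risk_unexposed
print(f"Relative risk: {relative_risk:.2f}")   # 0.95 -- essentially no difference in risk
```

A pooled relative risk of roughly 1 says that the exposed and unexposed groups fared about the same: the association alleged in the lawsuits did not survive a careful comparison.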

This kind of analysis is used in drug and vaccine evaluation in clinical trials. How many patients with what kind of diversity would be a representative sample for the trial? What is a significant result and what are the conclusions drawn from the trial data? If a drug is to be approved for use, what are the risks and how does one mitigate them?

Given the events of the last two years, the statistics associated with clinical trials take on even more urgency than usual. In a recently conducted study, 30 it was found that at least 50 per cent of the 47 randomised clinical trials examined would have been statistically insignificant if only four events had been reported differently. In other words, such results are not nearly as firm as one might have thought.

Thinking about Probability

Thinking probabilistically is a tremendously difficult thing to do, but also crucially important. For example, in Syad Akbar v. State of Karnataka, 31 “the Supreme Court dealt with, in detail, the distinction between negligence in civil law and in criminal law. It has been held that there is a marked difference as to the effect of evidence, namely, the proof, in civil and criminal proceedings. In civil proceedings, a mere preponderance of probability is sufficient, and the defendant is not necessarily entitled to the benefit of every reasonable doubt. But in criminal proceedings, the persuasion of guilt must amount to such a moral certainty as convinces the mind of the court, as a reasonable man, beyond all reasonable doubt.”

The statement has intrinsic clarity and is generally accepted by lawyers to be true. But if this statement were to be expressed in statistical terms, what percentage of probability would be ‘beyond reasonable doubt’? Also, does the probability need to be more than 51 per cent for it to be a ‘preponderance of probability’? Remember, judgments issued based on these calculations have a chance (by definition, since we are talking about probability here) of being wrong. Should we, as students of the legal system, be satisfied with a 50 per cent chance of being wrong? The honest answer is that not all facts and human experiences can be represented by data or subjected to statistical analysis, and even when they can be, the data may not represent them accurately. However, the comparison should not be with some theoretical, Utopian world free of uncertainty. Such a world does not exist, and likely never will. In the world in which we live, risk and uncertainty will always exist. Statistics help reduce this risk and uncertainty, quantify it, and bring an element of objectivity. Relative to the status quo, we argue that it is indeed worth our while to subject the available data to statistical analysis.

Here is a problem to think about:

Linda is 31. She is single, outspoken and bright. She majored in Philosophy. As a student of philosophy, she was moved by social discrimination and injustice and joined anti-nuclear protests.

Which is more probable?

1. Linda is a bank teller.

2. Linda is a bank teller and is active in the feminist movement. 32

This example, perhaps the most famous of its kind, was originally devised by Kahneman and Tversky, and if you chose the second option, you chose the incorrect one. The error is, in some sense, understandable. The description of Linda almost forces one to choose the second option, 33 but a little thinking helps us realise that the second option is always going to be less likely than the first one, for the second one is an intersection event. Recall your high school lessons in drawing Venn diagrams, and try answering the question again.
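A minimal sketch of why the conjunction can never be more probable than either of its parts, using an invented population of 1,000 people who fit Linda’s description (all the counts are hypothetical):

```python
# An invented population of 1,000 people who fit Linda's description.
bank_tellers = 50                 # hypothetical: 50 of them are bank tellers
feminist_bank_tellers = 45        # of those 50, 45 are also active feminists

p_teller = bank_tellers / 1000
p_teller_and_feminist = feminist_bank_tellers / 1000

print(f"P(bank teller)              = {p_teller:.3f}")               # 0.050
print(f"P(bank teller AND feminist) = {p_teller_and_feminist:.3f}")  # 0.045

# However the numbers are chosen, the intersection can never exceed either set on its own.
assert feminist_bank_tellers <= bank_tellers
```

However strongly the description pulls us towards the second option, the group it describes is a subset of the first, and a subset can never be larger than the set that contains it.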

Consider another very famous example: You are a participant in a game show. There are three doors in front of you and behind one of the doors lies a brand-new sports car, which will be yours, if you choose the correct door. Behind the other two doors are goats. The game show host, who just so happens to be named Monty Hall, asks you to pick a door. Keeping your fingers crossed, you pick (say) Door 1. Monty Hall then proceeds to open one of the other two doors and shows that behind it is a goat. There now remain two unopened doors: the one that you chose, and one other. Monty Hall now asks, ‘Do you wish to change your choice, or do you wish to stick with your original choice’? One might think that with two doors remaining, the chance is evenly split between both doors, but you would be wrong. There is only a 1 in 3 chance that your original choice is correct, while there is a 2 in 3 chance that the car is behind the door that you did not originally choose. How is this so?

There exist many possible ways to help resolve the apparent paradox. Here’s just one: What if there were a million doors instead of three? The chance that you picked the correct door in the first instance is literally one in a million. That is, it is all but a guarantee that you picked the wrong door. Now, what if Monty Hall opened all of the other doors except one and revealed goats behind all of them? Would you still stick to your original choice, or would you prefer to switch? And if you prefer to switch in this case, well, the original problem with three doors is simply a milder version of the same principle at work.
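For readers who remain unconvinced, the game is easy to simulate. A minimal sketch in Python that plays the game 100,000 times with each strategy:

```python
import random

def play(switch: bool) -> bool:
    """Play one round of the Monty Hall game; return True if the car is won."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    first_pick = random.choice(doors)

    # Monty opens a door that is neither the contestant's pick nor the car.
    monty_opens = random.choice([d for d in doors if d != first_pick and d != car])

    if switch:
        # Switch to the one remaining unopened door.
        final_pick = next(d for d in doors if d not in (first_pick, monty_opens))
    else:
        final_pick = first_pick

    return final_pick == car

trials = 100_000
wins_staying = sum(play(switch=False) for _ in range(trials))
wins_switching = sum(play(switch=True) for _ in range(trials))

print(f"Win rate when staying:   {wins_staying / trials:.3f}")   # about 0.333
print(f"Win rate when switching: {wins_switching / trials:.3f}") # about 0.667
```

Switching wins roughly two times out of three, just as the argument above predicts.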

Both these examples demonstrate that all of us struggle to think systematically about issues related to probability. If it is any consolation, the first time the Monty Hall problem was discussed in a public forum, PhDs in statistics got it wrong and refused to accept the irrefutable logic behind the correct solution. 34 It is only a matter of time before law students either hear or use the phrase ‘beyond reasonable doubt’, or the other phrase, ‘the preponderance of probabilities’.

What exactly do these terms mean? Are they context-dependent, and if so, on what basis do they change? Is our understanding of these phrases the same as yours in a quantitative sense? An understanding of the principles and laws of probability is indispensable.

Keeping a list of simple questions handy is of great help in sidestepping potential pitfalls in issues related to probability. The questions are:

Are the basic laws of probability 35 being violated?

Can I visualise the problem I’m trying to solve as a Venn diagram?

Can I think of a simpler version of the problem?

Probability is not an intuitive subject, and the best among us are prone to the occasional slip-up when dealing with it. But a little reflection does go a very long way in getting answers.

Another point about probability is that a student of statistics needs to understand the intuition behind a seemingly innocuous phrase – probability distributions. Rather than get into the technicalities of what a probability distribution is, answer this question: When is the traffic on the street outside your house at its busiest?

Your answer, more likely than not, will be that it is at its peak when people leave for work in the morning and when they return home in the evening. Saturday evenings may also be a little busy. On the other hand, Sunday mornings are likely to have light traffic.

If we had to visualise what we just discussed, it would go something like this: Imagine a horizontal axis that starts at midnight and has 24 notches, with each notch representing an hour. The vertical axis will represent traffic density. The graph will be fairly flat and low until around 6 am or 7 am, after which it will begin to rise. It may peak at around 11 am and then fall until around 5 pm, at which point it might inch upwards again as people head home. The graph will then fall away to its lowest levels around midnight. You may have seen this on Google Maps.

A mathematical description of this phenomenon is referred to as a distribution. These distributions can take many forms, and some of the more typically occurring forms often have entire chapters devoted to them in statistical textbooks. If you have heard of terms such as the bell curve, the Gaussian distribution, the normal distribution, or indeed the chi-square, the t-distribution and the F-distribution, this is what they mean in practice. Of course, statisticians don’t stop there. There is a bewildering variety of distributions lying in wait in more advanced courses. But for the moment, it suffices to understand the idea behind what a distribution really is: a mathematical description of all possible events and the probability of their occurrence. But what does this mean in practice? Consider the graph in Figure 6:



Figure 6: Probability distribution 36

What you’re looking at on the left is a dartboard. Imagine two students who step out for a couple of drinks on the weekend – let’s call them Rahul and Girish. Rahul, unfortunately, isn’t that good at playing darts; Girish is better. Rahul’s darts are shown in black, while Girish’s are in green.

As you can see, Rahul’s throws are all over the dartboard, rather than being clustered around the bull’s eye. In the language of statistics, we say that Rahul’s throws have high variance, or high standard deviation. 37 Girish’s throws are more tightly clustered or have low variance and low standard deviation.

Now take a look at the figures on the right-hand side of that picture, especially the one in green. These depict Girish’s throws: the bar in the middle counts the number of times Girish manages to hit the bullseye, while the bars immediately to its left and right count the throws that landed in the first ring around the bullseye, and so on outwards. What this distribution 38 represents is that Girish is most likely to hit the bullseye, and even when he misses, he comes close; his wild misses are relatively rare. In the language of statistics, his throws are normally distributed.

Rahul, on the other hand, is likely to hit any part of the board. There is none of the pleasing bell-shaped symmetry in the case of his chart, and we would therefore say that his throws are not normally distributed. 39
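The same idea in numbers rather than pictures: a minimal sketch comparing the spread of two invented sets of throws, each measured as the distance of the dart from the bullseye in centimetres:

```python
from statistics import mean, stdev

# Invented distances (in cm) of each dart from the bullseye.
rahul = [2, 18, 25, 1, 30, 12, 27, 4, 22, 9]     # all over the board
girish = [1, 3, 2, 4, 1, 2, 3, 2, 1, 3]          # tightly clustered

print(f"Rahul : mean {mean(rahul):.1f} cm, standard deviation {stdev(rahul):.1f} cm")
print(f"Girish: mean {mean(girish):.1f} cm, standard deviation {stdev(girish):.1f} cm")
```

A large standard deviation is the numerical counterpart of darts scattered all over the board; a small one is the tight cluster around the bullseye.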

Think intervals, not points

What follows is a little drama enacted in most households in India, and certainly was in mine. The walk back home after the mathematics exam would always be one that involved deep trepidation and marked reluctance on my part, for it was only a matter of time before ‘The Conversation’ took place:

My father would ask with studied casualness, “How did the exam go?”

“Fine,” would be my cautiously non-committal response.

“Hmm,” he would say before the heavy artillery was wheeled out. “And how much do you think you will score?”

How does one answer a question like this if one’s aptitude for the subject is not particularly good?

“I think I’ll pass with good marks,” would be my weak attempt to let the matter lie, but it has not worked in all of recorded history.

The correct answer – between 0 and 100 – was wholly unsatisfactory. What my father was seeking was a narrower range, and preferably a single number. This, in a nutshell, is the difference between a point estimate and an interval estimate, and the good news is that, in statistics, the interval estimate is preferred: because of the uncertainty involved, an interval is more likely to contain the true answer than any single-point estimate.

An oft-repeated adage in statistics is that one must think in terms of intervals rather than point estimates. Readers of detective novels are also likely to be familiar with this approach, for the time of death is usually given in terms of an interval (‘no later than 5 pm and no earlier than 1 pm’) rather than a point estimate.

Where statistics is concerned, there is a trade-off between precision and surety. And this is a point that every single child understands intuitively when it comes to guessing the outcome of an examination. The narrower (more precise) one wants the guess to be, the less sure one can be that it is right; the wider (less precise) the interval one tolerates, the surer one can be. ‘I am 100 per cent sure that I will score between 0 and 100’ isn’t just a joke, but also happens to be an intuitive way to understand the importance of thinking in terms of intervals.
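A minimal sketch of that trade-off, using an invented sample of 25 past exam scores to estimate a student’s long-run average score; the multipliers (1, 1.96, 3) come from the normal distribution discussed later in this chapter:

```python
from math import sqrt
from statistics import mean, stdev

# An invented sample of 25 past exam scores for one student.
scores = [62, 71, 58, 66, 74, 69, 63, 60, 72, 65,
          68, 59, 70, 64, 61, 67, 73, 66, 62, 69,
          75, 57, 68, 64, 66]

point_estimate = mean(scores)
standard_error = stdev(scores) / sqrt(len(scores))

# Wider multipliers give wider intervals -- and more confidence that the
# interval actually contains the student's true average score.
for label, z in [("68%", 1.0), ("95%", 1.96), ("99.7%", 3.0)]:
    low = point_estimate - z * standard_error
    high = point_estimate + z * standard_error
    print(f"{label} interval: {low:.1f} to {high:.1f}")

print(f"Point estimate: {point_estimate:.1f}")
```

The wider the interval, the surer we can be that it contains the true value – and the less useful it is as a precise answer.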

This point has been made in the context of a case. “On the other hand, although catch-phrases such as “beyond the shadow of a doubt”, “innocent until proven guilty”, etc., abound in the legal folklore, lawyers are well aware that the legal process is necessarily imperfect in the presence of the confusion and uncertainty which is characteristic of the real world, and that, if the process is to operate at all, small numbers of errors of the first type and the second type must be permitted to occur.” It goes on to add that “statistical methods of making decisions and of evaluation of evidence” under uncertainties “should be most appealing and convincing to legal professionals”. 40

Here’s a simple checklist in the case of interval estimates vis-a-vis point estimates:

The Central Limit Theorem

A very large part of statistics involves the idea that conclusions that can be drawn based on a sample can be extended, within reasonable limits, to the population. That is, studying only a small subset of the population can help us understand the characteristics of the entire population. For instance, exit polls estimate how an entire population has voted by asking a very limited number of people. 42

The question that presents itself is: how can we possibly justify drawing conclusions about millions of people by interviewing only a few hundred? The answer involves one of the most remarkable concepts to be found in statistics: the Central Limit Theorem.

There are many technical definitions of the Central Limit Theorem, but here is one restatement that should suffice.

Take repeated samples from a population – whatever the shape of that population’s distribution – and compute the average of each sample. As the samples get larger, the distribution of these sample averages looks more and more like the familiar bell-shaped normal distribution.

While this may be easy to understand, it is difficult to explain why this matters so much.

So, let’s say you construct a sample of lawyers in Delhi who happen to play basketball on the weekends. Let’s say the average height of this sample turns out to be 5’6”. How can we be sure that this is representative of the population, and can therefore be the basis for drawing conclusions?

The normal distribution has an extremely useful property. If a given dataset is normally distributed, then about 68 per cent of the data lies within one standard deviation 46 of the mean. In other words, if there are 100 lawyers with an average height of 5’6”, then about 68 of these lawyers will be within one standard deviation of this number. For example, if the standard deviation is 3”, then of these 100 lawyers, about 68 will lie in the interval from 5’3” to 5’9”. Of these 68, half will lie on the left-hand side of the mean, and half on the right, as can be seen below.

Figure: The normal distribution, showing the share of observations that fall within one, two and three standard deviations of the mean

This idea is extendable: 95 per cent of the data will lie within two standard deviations of the mean, while 99.7 per cent of the data will lie within three standard deviations of the mean. 48 How many lawyers out of 100 will lie within two standard deviations of 5’6”? How many lawyers out of 100 will lie within three standard deviations of 5’6”? We are fully aware that this concept is rather elaborate, so take a look at the accompanying diagram, and if necessary, read the preceding paragraphs once again.
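For readers who would rather see the rule at work than take it on trust, here is a minimal sketch that simulates 100,000 heights from a normal distribution (a mean of 66 inches, i.e., 5’6”, and a standard deviation of 3 inches, both invented) and counts how many fall within one, two and three standard deviations of the mean:

```python
import random

random.seed(42)
mean_height, sd = 66.0, 3.0   # 5'6" expressed in inches, with an invented standard deviation

# Simulate the heights of 100,000 lawyers drawn from a normal distribution.
heights = [random.gauss(mean_height, sd) for _ in range(100_000)]

for k in (1, 2, 3):
    within = sum(1 for h in heights if abs(h - mean_height) <= k * sd)
    print(f"Within {k} standard deviation(s): {within / len(heights):.1%}")
    # Prints roughly 68.3%, 95.4% and 99.7% respectively.
```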

And now for the coup de grace: Imagine collecting not one but 50 such samples of lawyers in Delhi who play basketball. There is a statistical property which guarantees that, on average, the mean of a sample equals the population mean. That is, if you take 50 samples, compute the average height of each, and then take the average of these 50 averages, that number will be very close to the average height of all lawyers who play basketball in Delhi. 49

To be precise, the mean of the sampling distribution – a result that can be rigorously proven – is equal to the population mean. Combine this property with the central limit theorem and it quickly becomes clear that the mean of any one sample will lie within one standard deviation (of the sampling distribution) of the population mean with a probability of 68 per cent, within two standard deviations with a probability of 95 per cent, and within three standard deviations with a probability of 99.7 per cent.

Will the sample mean be exactly equal to the population mean? The answer generally is no. On the other hand, will the sample mean be acceptably close to the population mean? Almost inevitably, the answer is yes. That is how a limited sample can support claims about an entire population.

And that is the point of studying the central limit theorem. If a dataset does not fit the CLT, one must ask why it does not.
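The theorem is easiest to believe after watching it work. In the minimal sketch below, the underlying population is deliberately skewed (an exponential distribution, chosen arbitrarily for illustration), yet the averages of repeated samples cluster tightly and symmetrically around the true population mean:

```python
import random
from statistics import mean, stdev

random.seed(1)

# A deliberately skewed population: exponential, with a true mean of 10.
population_mean = 10.0

def one_sample_mean(n):
    # Draw a sample of size n from the skewed population and return its average.
    return mean(random.expovariate(1 / population_mean) for _ in range(n))

sample_means = [one_sample_mean(100) for _ in range(2000)]

print(f"Average of the 2,000 sample means: {mean(sample_means):.2f}")   # very close to 10
print(f"Spread of the sample means:        {stdev(sample_means):.2f}")  # close to 10 / sqrt(100) = 1
```

Individual observations from this population are heavily skewed, but the sample averages behave like the well-mannered bell curve the theorem promises – and that is what licenses inference from a sample to a population.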

Beware visualisations

Visualisations help since we are a species that relates much more to that which can be seen, rather than thought of in the abstract. But we are also susceptible to being fooled with visuals and in our haste to arrive at an understanding we are prone to simple errors of judgement. Consider a simple visual that shows the relationship between how much wealth a person on average has in a given country, vis-a-vis the average life expectancy for that country, for all countries on earth in Figure 9:


Figure 9: Visualisations and their pitfalls 50

As it turns out, the countries towards the top right of this chart are not close to each other, although they appear to be. A simple way to see this is to look carefully at the horizontal axis: each unit movement to the right represents a doubling. So while these countries seem to be fairly close to each other, they are not. The last unit movement (from USD 64,000 to USD 1,28,000) spans as many dollars as the entire distance from the origin to USD 64,000. And here is the truly mind-boggling bit: one can understand this, agree with it, make a note of it, and still think of the countries towards the top right as being fairly close to each other. Visualisations, as it turns out, are full of potential surprises and slip-ups.

Consider in Figure 10 one of the most famous examples in statistics: The Anscombe dataset.


Table 1: The Anscombe dataset 51



         I               II              III              IV
    x       y       x       y       x       y       x       y
   10     8.04     10     9.14     10     7.46      8     6.58
    8     6.95      8     8.14      8     6.77      8     5.76
   13     7.58     13     8.74     13    12.74      8     7.71
    9     8.81      9     8.77      9     7.11      8     8.84
   11     8.33     11     9.26     11     7.81      8     8.47
   14     9.96     14     8.10     14     8.84      8     7.04
    6     7.24      6     6.13      6     6.08      8     5.25
    4     4.26      4     3.10      4     5.39     19    12.50
   12    10.84     12     9.13     12     8.15      8     5.56
    7     4.82      7     7.26      7     6.42      8     7.91
    5     5.68      5     4.74      5     5.73      8     6.89

Figure 10: The four Anscombe datasets plotted

As the graphs show, each dataset throws up almost the same regression line, 53 and the same summary statistics, 54 but the datasets themselves look very different.
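A minimal sketch that computes the summary statistics for the four datasets in Table 1 and shows just how alike they are, despite the very different pictures:

```python
from statistics import mean, stdev, variance

x_common = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "I":   (x_common, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x_common, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_common, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in datasets.items():
    # Sample covariance, correlation, and the least-squares regression line.
    cov_xy = sum((a - mean(x)) * (b - mean(y)) for a, b in zip(x, y)) / (len(x) - 1)
    r = cov_xy / (stdev(x) * stdev(y))
    slope = cov_xy / variance(x)
    intercept = mean(y) - slope * mean(x)
    print(f"Dataset {name:>3}: mean(y) = {mean(y):.2f}, var(y) = {variance(y):.2f}, "
          f"r = {r:.2f}, regression line: y = {intercept:.2f} + {slope:.2f}x")
```

Identical numbers, four very different scatter plots: the summary statistics are true, but on their own they cannot tell us which of the four worlds we are in.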

Is the visualisation descriptive of the data? Is the data as visualised doing a good job describing the facts? What else can we look at? These are questions worth asking every time one is presented with data, the statistical analysis of that data, and the visualisation of any dataset.

Some simple points go a very long way:

A Syllabus for Law Students

Adjudicating a case requires some definitive basis and an understanding of facts that all concerned parties can agree upon. These facts are constructed from data made available for all to analyse, and the data itself is selectively picked up from the reality that surrounds us. For the concerned parties to agree upon these datasets, their method of analysis, and the basis on which these data points (and no other) have been selected, requires an understanding of statistics. Lawyers have always been handy with words, but as we have argued throughout this chapter, an understanding of statistics is just as important, so that facts can be deduced from data.

Statistics, much like mathematics, comes with a lot of baggage. Most of us, even as adults, carry battle scars from dealing with these two subjects in school and are only too glad to be done with them as adults. Statistics and math conjure up images of impenetrable equations, mystifying symbols, impossibly lengthy derivations and very little in terms of meaningful insight at the end of it.

We contend that it need not be so. It is possible to teach statistics using nothing more than English. The only accessories we need are open, curious and willing minds and a liberal sprinkling of common sense. Armed with just these readily available tools, it is possible to draw up a syllabus for an introduction to statistics for law students. We outline herein a syllabus for learning the principles of statistics and suggest resources that may be useful for driving home these principles.

The aim of such a course is not to make statisticians out of lawyers but to shape lawyers with the ability to understand data and interrogate the findings using basic, but robust statistical tools and principles, with the same acumen and acuity they bring to bear upon other non-quantitative fields.

We assume here that such a course may stretch to 30 hours. More time is welcome, but not less. No prerequisites would be necessary beyond a grasp of high-school mathematics. We have, in the previous section, outlined 10 ‘hacks’ or statistical principles that one should know. However, a complete syllabus of statistics should contain more material beyond this preliminary outline and should be fleshed out further. Here is what such an outline could look like:

We maintain that it is possible to speak about each of the concepts outlined above without having to delve into the world of formulae and equations. The idea behind such a course would be to familiarise students with the underlying concepts, ideas and principles of statistics. Interested students could be made familiar with more advanced treatment, but we believe that no student of law should leave university without an introduction to these ideas. They should be equipped to read and interpret data, draw conclusions from it, and at the very least, not be scared of, or indifferent to, data.

Such an introduction may not make it possible for the student to be able to do statistical analysis, but it should certainly be possible for such a student to ask meaningful questions when presented with data and analysis done by others. In much the same way that lawyers are not expected to be experts about forensic DNA analysis, or be able to do DNA analysis themselves, it is unfair to expect lawyers to be statisticians.

That being said, lawyers certainly are expected to know enough about DNA testing, for example, to be able to ask meaningful questions about the opinions of experts. And so it is with statistics. One must know enough about the subject to be able to ask questions that matter. We are sure that statistical experts may have different lists or may go into apoplectic fits about topics we have missed. But if you will allow us to mix our metaphors, may we say, “Let a thousand syllabi bloom!” Our argument isn’t so much about the content of the syllabi as it is about how it is taught.

Conclusion

A non-quantitative, principles-based approach to teaching statistics as a tool for data analytics isn’t just necessary, we believe that it is the need of the hour. Our intent in this chapter has been to outline three things:

Finally, a quote that we agree with wholeheartedly:

“It is to be hoped that in the future attorneys and judges will become more knowledgeable about chance, uncertainty, probability, statistical procedures, and statistical inference in the presence of uncertainty, so that the instructional phase of the statistician’s testimony may be shortened. It is reasonable to expect that more universities, with or without attached law schools, will seize on the need for the understanding of these matters and provide training for future lawyers involving enough statistics and probability to enable them to be better informed and, therefore, more knowledgeable and appreciative consumers of statistical evaluations.” 55

Editors’ Comments

With this ends the third part of the volume, and the next chapter starts the fourth part, which seeks to focus on the institutional dimension of the law and justice system. The first chapter in the fourth part, i.e., chapter 19, discusses the issue of building capacity for the introduction of technology in justice. In particular, it highlights the role, including the benefits and potential difficulties, of the private sector in this endeavour.

References