Errors, Polls, and US Elections

Parigyan - The Data Science Society of GIM

7 min read · Jan 16, 2021

Data is the new oil, whatever the field may be. Even events as important as the US presidential election have long relied on it. Predicting the next president of the world’s biggest economy is no small thing, and the role of data in this affair has grown with every election cycle. Data analytics helps election campaigns understand voters better and adapt to their sentiments.

But have you ever wondered what these opinion polls and statistical surveys actually try to tell us? The authenticity of such surveys is often questioned because a sample far smaller than the electorate is used to predict the outcome. It is true that no survey, not even the most optimally designed one, can predict the true state of nature with complete confidence; there is always a degree of uncertainty involved. But let us understand the science behind them.

The first step in opinion polls is the estimation of some unknown parameter. In the election context, the unknown parameter can be the proportion of voters who intend to vote for a particular party (or alliance). And mind you, this is not a simple task. In an election, the main interest of the populace (and hence the media) is not in the percentage of votes for a given party (or alliance), but rather in its seat share. This complicates the problem, and we need more analysis than the simple statistical estimation of a population proportion.

We now discuss one by one the two stages of estimating the seat share of a particular party; these are (i) estimating the proportion of votes for a given party in a given region and (ii) estimating the corresponding number of seats based on a suitable model.

A statistician will not have complete information about the population, so it is unlikely that an estimate derived from a sample will exactly equal the true population value. One must therefore strive to minimize this error by using a sample that is representative of the true population. The simplest way of doing this is simple random sampling (equal probability sampling), which can be done with or without replacement. This randomness is a property of the sample collection method and not of the actual sample.
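As a minimal sketch of simple random sampling, the snippet below draws one sample without replacement and one with replacement using Python's standard `random` module. The voter IDs, the sample size of 500, and the seed are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical sampling frame: voter IDs 1..10000 (an assumption for illustration).
voters = list(range(1, 10_001))

# A seeded generator so the draw is reproducible.
rng = random.Random(2021)

# Without replacement: every voter can appear at most once.
without_replacement = rng.sample(voters, k=500)

# With replacement: the same voter may be drawn more than once.
with_replacement = rng.choices(voters, k=500)

print(len(set(without_replacement)), len(set(with_replacement)))
```

In both cases every voter has the same chance of being picked on each draw, which is what makes the method, rather than any particular sample, random.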

The sample size is very small (an almost negligible fraction of the population) when it comes to estimating the percentage of voters favoring a particular party. It is commonly believed that the sample size must reach a certain proportion of the true population size for a given degree of accuracy in estimation to be attained. This is incorrect. For large populations, accuracy depends on the absolute sample size and on the actual value of the true proportion, not on the fraction of the population sampled. If two constituencies have roughly the same proportion of voters supporting a party's candidate, one will require the same sample size to estimate these proportions with a given degree of accuracy, even if the total population sizes are quite different.
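To make this concrete, here is a small sketch using the usual normal approximation for a sample proportion. The function names, the 95% level (z = 1.96), and the constituency sizes are assumptions for illustration, not something taken from a specific poll.

```python
import math

def margin_of_error(p, n, population=None, z=1.96):
    """Approximate margin of error for a sample proportion.

    p: assumed true proportion, n: sample size,
    population: optional population size for the finite population correction.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction for sampling without replacement.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

def required_sample_size(p, margin, z=1.96):
    """Smallest n that reaches the requested margin, ignoring population size."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Same sample size, very different (hypothetical) constituency sizes.
small = margin_of_error(0.5, 1000, population=100_000)
large = margin_of_error(0.5, 1000, population=100_000_000)
print(round(small, 4), round(large, 4))

# The required sample size depends on the true proportion, not the population.
print(required_sample_size(0.5, 0.03), required_sample_size(0.1, 0.03))
```

With 1,000 respondents the margin of error is roughly three percentage points in both constituencies, even though one is a thousand times larger than the other. The required sample size changes with the true proportion instead.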

The second issue is estimating the number of seats for each party, which requires a suitable mathematical model. One such model assumes that the change in voting intentions from the previous election to the current one is alike across states, their subdivisions, and other such regions. Sampling techniques can then be used to gauge this change in preferences. Once the estimation is done, we can assign each party a probability of winning a given seat.

The probability assignment follows the one given in [2].
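One possible way to turn an estimated swing into seat-level probabilities is sketched below. It is not necessarily the scheme used in [2]: it assumes a two-party contest, a swing that is uniform across seats, and a normally distributed estimation error. The seat names, vote shares, swing, and standard error are all hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def win_probability(prev_share_a, prev_share_b, swing, swing_se):
    """Probability that party A beats party B in one seat.

    swing: estimated gain in A's vote share (B is assumed to lose the same amount)
    swing_se: standard error of the estimated swing
    """
    # Projected lead of A over B after applying the uniform swing.
    lead = (prev_share_a + swing) - (prev_share_b - swing)
    # The lead moves by twice the swing, so its standard error is 2 * swing_se.
    return normal_cdf(lead / (2 * swing_se))

# Hypothetical previous-election vote shares (party A, party B) per seat.
previous = {
    "Seat 1": (0.55, 0.45),
    "Seat 2": (0.48, 0.52),
    "Seat 3": (0.50, 0.50),
}

probabilities = {
    seat: win_probability(a, b, swing=0.02, swing_se=0.01)
    for seat, (a, b) in previous.items()
}
# Expected number of seats for party A is the sum of its win probabilities.
expected_seats = sum(probabilities.values())
print(probabilities, round(expected_seats, 2))
```

Summing the per-seat probabilities gives an expected seat count, which is the quantity the media and the public usually care about.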

Types of Sampling:

1. Probability Samples

• Random-Digit Dialing (RDD)

Samples of telephone area codes and exchanges are drawn, and random digits are then appended to create 10-digit phone numbers. The first step ensures that phone numbers are well distributed by geographical location. The second step, adding the random digits, ensures that even unlisted numbers are included. This is the standard practice among almost all public pollsters. An advantage of RDD is its coverage of the population: anyone with a phone is eligible to be sampled. A major drawback is that it is expensive, since many of the generated telephone numbers are non-working. Within-household sample selection matters too: in households with more than one eligible registered voter, further sampling among the members of the household should be done to produce a random sample of voters. Journalists should ask how respondents were selected. Just taking the person who answers the phone would not necessarily result in a representative sample.

• Registration-Based Sampling (RBS)

To start with, a sample of individuals is drawn from lists of registered voters, to which phone numbers are then matched (these may be available from the voter list). This is less expensive and more efficient, as almost all calls reach an active phone number, which is not the case with an RDD sample. The major drawback of an RBS sample is that voter lists often do not contain unlisted telephone numbers and may include voters who have moved or are otherwise not eligible to vote in their current precinct. [1]
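The two RDD steps described above, plus the within-household selection, can be sketched roughly as follows. The six-digit prefixes (area code plus exchange), the household members, and the function names are made up for illustration; a real pollster would draw prefixes from an actual telephone frame.

```python
import random

def rdd_sample(prefixes, n, seed=None):
    """Generate n 10-digit numbers: a sampled 6-digit prefix plus 4 random digits."""
    rng = random.Random(seed)
    numbers = []
    for _ in range(n):
        # Step 1: sample an area code + exchange, which spreads calls geographically.
        prefix = rng.choice(prefixes)
        # Step 2: append random digits so unlisted numbers can also be reached.
        suffix = f"{rng.randrange(10_000):04d}"
        numbers.append(prefix + suffix)
    return numbers

def pick_respondent(eligible_voters, rng):
    """Within-household selection: choose one eligible voter at random."""
    return rng.choice(eligible_voters)

# Hypothetical prefixes (area code + exchange).
prefixes = ["212555", "415555", "312555"]
numbers = rdd_sample(prefixes, n=5, seed=1)
print(numbers)

# Hypothetical household with several eligible registered voters.
rng = random.Random(1)
print(pick_respondent(["Alex", "Sam", "Jordan"], rng))
```

Many of the generated numbers would not be in service, which is the cost problem mentioned above. RBS avoids it by starting from a voter list instead of generated digits.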

2. Non-probability Samples

• Self-Selected Samples (SSS)

In self-selected samples, respondents choose themselves, so their answers might not be representative of the larger population. Types of self-selected samples include dial-in polls popular with the media and many Internet-based polls. The American Association for Public Opinion Research (AAPOR) warns that results of surveys based on respondents who self-select might not be reliable. The characteristics of people who choose to participate in this type of survey may differ from those of people who do not, in ways that may bias the final results. These polls can sometimes be accurate, but it is very difficult to evaluate whether they are accurate simply because of good luck or because they were able to capture decent information about the population they were trying to represent. AAPOR warns that this type of sample is not based on the full target population.

• Samples from Internet Panels

Another variation of the self-selected sample is the random sample drawn from people who have signed up to be members of an Internet panel. While the draw itself is random, the population it is drawn from consists only of people who chose to join the panel. [1]
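A small simulation can illustrate why self-selection is risky. The setup is entirely hypothetical: exactly half of the population supports a candidate, supporters opt in to a dial-in poll 30% of the time, and everyone else opts in 10% of the time.

```python
import random

def simulate(seed=0, population_size=100_000, sample_size=1_000):
    """Compare a simple random sample with a self-selected one."""
    rng = random.Random(seed)
    half = population_size // 2
    # 1 = supports the candidate, 0 = does not; true support is 50%.
    population = [1] * half + [0] * (population_size - half)

    # Probability sample: every voter is equally likely to be drawn.
    random_sample = rng.sample(population, sample_size)
    random_estimate = sum(random_sample) / sample_size

    # Self-selected sample: supporters are three times as likely to respond.
    volunteers = [v for v in population if rng.random() < (0.3 if v else 0.1)]
    self_selected_estimate = sum(volunteers) / len(volunteers)

    return random_estimate, self_selected_estimate

random_estimate, self_selected_estimate = simulate()
print(round(random_estimate, 3), round(self_selected_estimate, 3))
```

Under these assumptions the random sample lands close to the true 50%, while the self-selected poll drifts toward 75% even though it collects far more responses. A larger sample does not repair a biased selection mechanism.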

Putting everything into context, people tend to believe that exit poll results mirror the actual election results. This belief rests on the assumption that judgments backed by data collected right after election day are epistemologically different from faulty pre-election forecasts.

According to the exit polls, Joe Biden was expected to replace Donald Trump as president of the USA with a clear and massive majority, but the true picture turned out to be jaw-dropping. In the actual results, Mr. Trump performed much better than anticipated and gave Mr. Biden a tough fight. This clearly indicates that some error occurred in the exit polls and the processes associated with them.

The stark difference between the anticipated and actual results can be attributed to sampling error, i.e., the selection of people who were asked whom they voted for was not done properly and was affected by confirmation and attention bias. Political scientists such as Robert Griffin have also stated that exit polls are plagued by sampling biases.

The problem with sampling is that, if it is not done properly, it does not take all people into consideration and usually covers one particular section of the society, state, or country while ignoring the rest. As a result, only the views of that section get recorded and analyzed, and the views of the others, which could change the entire picture, are left out.

Another problem was that, during sampling, certain groups of people were overrepresented, which skewed the results. For example, the youth of America were given a lot of weight in the exit polls, and the majority of them were against Trump being re-elected as president of the USA.
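The effect of overrepresentation can be shown with a toy calculation. All numbers here are hypothetical: young voters make up 20% of the electorate but 40% of the respondents, and the support figures are invented. The correction shown, reweighting each group to its population share (post-stratification), is a standard survey technique and is offered only as an illustration.

```python
def raw_and_weighted(counts, supporters, population_share):
    """Return the unweighted estimate and the post-stratified estimate.

    counts: respondents per group
    supporters: respondents per group who back the candidate
    population_share: each group's share of the actual electorate
    """
    total = sum(counts.values())
    # Unweighted: every respondent counts equally, so large groups dominate.
    raw = sum(supporters.values()) / total
    # Weighted: each group's support rate counts in proportion to the electorate.
    weighted = sum(
        population_share[g] * supporters[g] / counts[g] for g in counts
    )
    return raw, weighted

# Hypothetical exit poll in which young voters are overrepresented.
counts = {"young": 400, "older": 600}
supporters = {"young": 120, "older": 300}
population_share = {"young": 0.2, "older": 0.8}

raw, weighted = raw_and_weighted(counts, supporters, population_share)
print(round(raw, 2), round(weighted, 2))
```

In this example the unweighted poll reports 42% support, while the reweighted figure is 46%. A gap of that size is enough to change the apparent winner in a close race.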

Proper sampling and collection of the right data are prerequisites for correct exit poll results. Clearly, shortcomings in both have surfaced again, as they did in 2016, when the exit polls predicted that Hillary Clinton would win with a majority and the actual results were in favor of Mr. Trump.

Thus, the faulty exit poll results can be attributed to statistical errors made during sampling. [3]

References:

[1] https://www.aapor.org/AAPOR_Main/media/MainSiteFiles/Sampling-Methods-for-Political-Polling.pdf

[2] *0049–0058 (ias.ac.in)

[3] https://www.google.com/amp/s/www.cnbc.com/amp/2020/11/07/election-pollsters-2020-reckoning.html
