Predicting PredictIt: An Analysis of a Political Betting Forum and its Accuracy

Noah Fine

Introduction

It is tough to predict an election. Both camps taunt the other and claim their chances of victory are high, but in actuality, giving a precise estimate of either the final margin of an election or its win probability is very difficult.

In both the 2016 and 2020 U.S. Presidential elections, polling was at least somewhat inaccurate. In most cases, and certainly at the national level, the results were well within the margin of error. But this was not always the case, whether in the Rust Belt in 2016 or Florida in 2020, where the results fell well outside the polls' margin of error (even if, in some cases, the polling miss did not affect who actually won the state).

Models which aggregated polls, meanwhile, have had mixed results. The Huffington Post's model famously gave Hillary Clinton a 98% chance of winning the Presidential election in 2016, while the stats blog FiveThirtyEight gave her a less confident 71% chance of victory.

Prediction markets, though, are a very interesting case: in theory, their implied predictions should correlate with concrete factors which give data about the election, yet they seem to have outperformed the polls and even many models in both 2016 and 2020. But how does this work? As one article puts it: "The idea is straightforward: trade contingent claims in a market where the claims pay off as a function of something one is interested in forecasting. If structured correctly, the prices should reflect the expected payoffs to the claims, and therefore the expected outcome of the event of interest." In other words, if a betting market is structured well, its pricing will reflect the probability of some event occurring. Similarly to how the wisdom of the crowd can effectively determine the number of jelly beans in a jar, the thinking goes, perhaps the wisdom of the crowd can also determine the true probability of some event in politics.

In this tutorial, we will examine the accuracy and pricing of one specific political betting website, PredictIt, in their markets for the victors in the thirty-five 2020 U.S. Senate races. In these markets, we will examine PredictIt's overall accuracy, explore how PredictIt's pricing and accuracy relate to other factors (such as time and trade volume), and compare PredictIt's accuracy to a statistical model's accuracy (FiveThirtyEight). If you'd like to delve into this topic with prediction markets more generally, though, I highly recommend reading this article.

But first: a dive into how PredictIt works

In many sports-betting markets, prices are set by "Vegas" - centralized sportsbooks who offer odds at a certain price based on, ultimately, what they believe will enable them to make the highest profit.

PredictIt, though, works much more like the stock market than like conventional sports-betting markets - users make buy/sell offers at the price at which they want to buy/sell a contract, and are then matched either by other users who want to perform the opposite action (sell/buy) at the same price (we will call this (1)); or, in the case of a buy, by users who want to purchase the complementary contract at a complementary price (we will call this (2)). We will explain (1) and (2) in more detail.

Contracts are for a given event E, are priced between 1 cent and 99 cents (inclusive), and are binary. If you purchase a "Yes" contract for event E, and E occurs, you receive one dollar (though this isn't entirely true, as I will explain shortly), but if E does not occur, your contract resolves to 0 cents, and you lose the money you used to purchase the contract. The inverse is true for the "No" contract (which is complementary to the "Yes" contract) - if you purchase the "No" contract for event E, and E does not occur, you receive one dollar, but if E occurs, your contract resolves to 0 cents, and you lose the money you used to purchase the contract. These are the basics, but the full rules for PredictIt can be found [here](https://www.predictit.org/support/how-to-trade-on-predictit).

Let's give examples of each of the two types of transactions mentioned above.

(1) - this is when a buyer/seller is matched by another user who wants to perform the opposite action at the same price: I believe that Democrats have a 52% chance of winning North Carolina, and I want to make a small profit, so I put in a bid to buy a "Yes" contract in that market at 50 cents. You believe that Democrats only have a 48% chance of winning North Carolina, and you own "Yes" contracts in that market; when you see my offer to buy a "Yes" contract at 50 cents, you offer to sell at 50 cents. The transaction goes through - I get your contract, and you get 50 cents from me.

(2) - this is when users want to purchase the complimentary contract at a complimentary price: I believe that Democrats have a 52% chance of winning North Carolina, and I want to make a small profit, so I put in a bid to buy a "Yes" contract in that market at 50 cents. You believe that Democrats only have a 48% chance of winning North Carolina, and you want to make a small profit, so you put in a bid to buy a "No" contract in that market at 50 cents. The transaction goes through - we each get contracts for 50 cents, and our money goes to PredictIt (most of it will come back as payout at the end).

However, there is an important factor which has been omitted up until now - PredictIt takes 10% of the profits made on each contract. This is significant - suppose I am considering buying a "Yes" contract for some event at 50 cents, because I believe the event's true probability of occurring is 52%. I can quickly deduce that, because PredictIt takes 10% of my profits, it is not worth buying the contract. Suppose I am right that the event has a 52% chance of occurring. Then the expected value of purchasing the contract is (0.52 × 0.95) + (0.48 × 0.00) = 0.494, as the winnings would only be 95 cents, not a full dollar, because PredictIt would take 10% of my 50-cent profit. So, I would not purchase the contract, as its expected value is less than its price.

Further, there can be arbitrage in PredictIt, even within one market. For example, consider again the North Carolina Senate market:

Suppose Alice believes Republicans are at a disadvantage, and Bob believes they are heavily advantaged; so, when Bob offers to purchase a Republican "Yes" contract at 60 cents, Alice purchases a Republican "No" contract at 40 cents, and the type (2) transaction goes through. Meanwhile, Candace believes Democrats are at a disadvantage, while Dennis does not; so, when Dennis offers to purchase a Democratic "Yes" contract at 60 cents, Candace purchases a Democratic "No" contract at 40 cents, and that type (2) transaction goes through as well.

Now note: the price for a Democratic "Yes" contract is at 60 cents, while the price for a Republican "Yes" contract is also at 60 cents, and a huge opportunity for arbitrage exists - one could buy a 40-cent "No" contract for both the Democrats and the Republicans (80 cents total) and be nearly guaranteed a profit of 14 cents when one of the two parties wins: the winning "No" contract pays out 94 cents after PredictIt's 10% cut of its 60-cent profit, while the other resolves to nothing.

As we will see, such large arbitrage opportunities are incredibly rare, and most often arbitrage opportunities fall inside the range where they are not profitable because PredictIt takes 10% of the profits. But here is the bottom line - even when one of several events is guaranteed to occur, the sum of the "Yes" contract prices of those events is not necessarily a dollar - it can be less, and it can be more. The issue is discussed in greater depth in this paper.

This raises the question - does price alone tell us the implied probability that the market gives to an event's occurrence? The answer is: certainly not! If the Democratic "Yes" price is 60 cents and the Republican "Yes" price is 60 cents, this does not reflect an implied 60% probability of the Democrats winning the seat in question. Though there are other methods in other contexts for calculating implied probability, what I believe will work best here is the following definition: assuming one of the two major parties is going to win a given Senate seat (which was the case in these 35 races), to calculate the implied probability of a given party winning that Senate seat, we simply take the price of that party's "Yes" contract and divide it by the sum of the prices of both parties' "Yes" contracts. So, in the 60 cents/60 cents example above, we would say that each party has an implied probability of 50% of winning the Senate seat.
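In code, this definition amounts to a one-liner (the function name below is mine):

```python
def implied_prob(party_price, other_party_price):
    """Two-party implied probability for the party whose "Yes" contract
    is priced at party_price, given the other major party's "Yes" price."""
    return party_price / (party_price + other_party_price)

# The 60-cent/60-cent North Carolina example above: each party is implied at 50%.
print(implied_prob(0.60, 0.60))  # 0.5
```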

Note: all prices given in the upcoming analysis of PredictIt's Senate races data are for the "Yes" shares of a given contract. When I say "the contract for the Democrats winning North Carolina was priced at 50 cents on 11/1/20", I am effectively using shorthand to say "the "Yes" contract for the Democrats winning North Carolina was priced at 50 cents on 11/1/20."

Obtaining and Unpacking Data

To obtain this data, I reached out to PredictIt. After some discussion, they sent me a folder containing data on all thirty-five 2020 U.S. Senate races markets. I am incredibly appreciative of their providing me their data for this project.

There are four files containing our data in the folder which PredictIt shared with me. As they are in the "PredictIt-data" folder, these files are of the form: "PredictIt-data/Price History By Market -NoahFinei.xlsx", where the "i" in "NoahFinei" is either 1, 2, 3, or 4. We can more effectively examine what each of these files contains by looking at the first few lines of one of them after bringing in the data. We can do this, in turn, using Pandas to both read in and store the data (in a dataframe):
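A minimal sketch of that first read, assuming nothing about the sheet layout beyond the file names given above:

```python
import pandas as pd

# Read the first of the four files PredictIt provided and peek at its structure.
# (pandas needs the openpyxl engine installed to read .xlsx files.)
data = pd.read_excel("PredictIt-data/Price History By Market -NoahFine1.xlsx")
data.head()
```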

Each row represents information about a market on a given date. Let's go through what each column tells us:

With this explained, we will continue using Pandas to add the rest of the data to our "data" dataframe.
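Continuing the sketch above, the remaining three files can be appended to the same dataframe:

```python
# Append the remaining three files to the "data" dataframe.
for i in range(2, 5):
    part = pd.read_excel(f"PredictIt-data/Price History By Market -NoahFine{i}.xlsx")
    data = pd.concat([data, part], ignore_index=True)

data.shape
```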

Parsing and Improving the Data

Now that we have our data, we need to figure out if there are any major issues with it, and if any cleaning or parsing needs to be done. And indeed, even from the output above, we can see that we have a few potential issues with our data.

First, if we want to measure correlations between how far out from the election a given data point is and accuracy of pricing in a meaningful way, we should make note of that somewhere in the dataframe; this is currently not noted. Second, though we know that all of these markets are for Senate races in specific U.S. states, we currently have no easy way of classifying which state a given market is for; the market name needs to be parsed to obtain this data. Third, the rows do not note who the eventual winner was in their markets; this will make checking price accuracy (which we will define later on) impossible. Finally, the rows do not note the implied probability of victory of the party in the contract concerned; as discussed above, this often differs from the actual price (and will be necessary in hypothesis testing).

So, we will fix these issues. We will fix the first issue by adding a "Days Out" column which measures how far out the day in a given row is from the election, in days. Note that election day in 2020 occurred on November 3rd, 2020, in all states concerned (though the two Georgia races did head to a runoff, we will ignore this for now and treat their election day as November 3rd, 2020, the date of the first round of Georgia's general election; seeing as no runoff was guaranteed, this doesn't seem too unreasonable).
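A sketch of the "Days Out" computation; the "Date" column name is a stand-in for whatever PredictIt's export actually calls its date column:

```python
# Election day for every race in this dataset (treating both Georgia races
# as decided on the date of the first round).
ELECTION_DAY = pd.Timestamp(2020, 11, 3)

# "Date" is a stand-in for the actual date column name in PredictIt's export.
data["Days Out"] = (ELECTION_DAY - pd.to_datetime(data["Date"])).dt.days
```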

Now, we want to parse out the state's name from each market name, to resolve issue two. The best way to do this is with regular expressions, and by noticing a few things:

I found all of these properties by looking through the files that PredictIt sent to me, but I will check after the fact that this worked as expected as well. For now, let's use these properties to parse out the state names from the market names, and add them to our data frame:
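Here is one way this parsing could look; the market-name pattern below is only a guess at the format of PredictIt's market names (the sanity check in the next step is what confirms whether the parse worked), and "MarketName" is a stand-in column name:

```python
import re

# Hypothetical market-name format: "Which party will win the U.S. Senate
# race in North Carolina?" -- the real names may differ slightly.
state_pattern = re.compile(r"Senate race in (.+?)\?")

def extract_state(market_name):
    """Pull the state name out of a market name, or None if no match."""
    match = state_pattern.search(market_name)
    return match.group(1) if match else None

# "MarketName" is a stand-in for the actual market-name column.
data["State"] = data["MarketName"].apply(extract_state)
```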

We can also quickly check that these properties produced 35 unique state names (for the 35 unique Senate races), and that none of them are nonsensical:
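The check itself is short:

```python
# Expect 35 unique, plausible names -- one per 2020 Senate race.
print(data["State"].nunique())
print(sorted(data["State"].dropna().unique()))
```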

As we see above, there are 35 unique state names, as expected, and they are all reasonable/real state names. So, we have resolved issue two.

Now, we need to deal with the issue of marking the winner in each market. To do this, I made an Excel file which has the state name in one column and the winning party (Republican or Democratic) in another. After reading in the data, it just took a dictionary and a list to add the appropriate column to the dataframe. I also decided to add a column called "538 District", which indicates how 538 refers to the state in their data; as we shall see, this will be relevant later. To do this, I used the same method that I used above to mark the winners in each market, adding a column to my Excel sheet called "538 District".
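A sketch of that dictionary-and-list approach; the file name "winners.xlsx" and its column names are stand-ins for my actual sheet:

```python
# One row per race: State, Winner, and 538 District.
winners = pd.read_excel("winners.xlsx")

winner_map = dict(zip(winners["State"], winners["Winner"]))
district_map = dict(zip(winners["State"], winners["538 District"]))

# Build the new columns as lists keyed off each row's state.
data["Winner"] = [winner_map[state] for state in data["State"]]
data["538 District"] = [district_map[state] for state in data["State"]]
```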

This issue is resolved.

Finally, we need to add the implied probability of victory in each row, with a few caveats. First, to keep things simple, the implied probability will be computed using only the Democratic and Republican contracts (as noted in a previous section); meaning, the value in the implied probability column will answer the question "Assuming either the Democratic or Republican party wins this race, what is the implied probability of the party in this contract winning this race?" and not "What is the implied probability of the party in this contract winning this race?" Accordingly, rows for non-Democratic and non-Republican contracts will have a value of 0 in this column. Additionally, if only one of the two major parties has a contract present in a given market on a given date (which happens very infrequently), the implied probability will be set to that contract's actual price. Further, the value in this column will be computed from the closing price, as the opening price can pose problems when it reflects the market's opening to the public (the initial prices, which are set by PredictIt, are often somewhat arbitrary and can skew our data), and the average trade price defaults to 0 on a day with no trades. Let's add this column to the dataframe:
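A sketch of this logic, with the same caveat as before that "MarketName", "Date", "ContractName", and "CloseSharePrice" are stand-ins for PredictIt's actual column names:

```python
def add_implied_probability(df):
    """Two-party implied probability from closing prices, per market and date."""
    implied = []
    for _, row in df.iterrows():
        if row["ContractName"] not in ("Democratic", "Republican"):
            implied.append(0.0)                     # non-major-party contracts
            continue
        other = "Republican" if row["ContractName"] == "Democratic" else "Democratic"
        same_day = df[(df["MarketName"] == row["MarketName"]) &
                      (df["Date"] == row["Date"]) &
                      (df["ContractName"] == other)]
        if same_day.empty:                          # only one major party listed
            implied.append(row["CloseSharePrice"])
        else:
            other_price = same_day["CloseSharePrice"].iloc[0]
            implied.append(row["CloseSharePrice"] /
                           (row["CloseSharePrice"] + other_price))
    df["Implied Probability"] = implied
    return df

data = add_implied_probability(data)
```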

Exploratory Data Analysis of the PredictIt data

To start the analysis, let's examine how well price correlated to observed outcome without considering any other factors. In general for these sorts of analyses, we will use the closing share price (for the reason described above) and only consider markets at least a day out from the election (as afterwards, they tend to either hover around 99 cents for the winner until the contract is paid off, or they irrationally don't, and this behavior deserves a separate analysis in a different project). Further, unless specified otherwise, "price" on a given day will refer to the closing share price on that day.

We can create an interesting bar chart using the numpy and matplotlib libraries, which are commonly used in data analysis. In this chart, we will chart two things: the frequency at which a contract priced in a given range resolved to "Yes", and the frequency at which a contract priced in that range would have been expected to resolve to yes if the price were reflective of true probability (which we will say is the median value of that range, but could reasonably be labelled as any value in that range).
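A sketch of how such a chart can be put together with numpy and matplotlib, continuing with the hypothetical column names from the earlier snippets:

```python
import numpy as np
import matplotlib.pyplot as plt

# Major-party contracts at least one day out from the election.
subset = data[(data["Days Out"] >= 1) &
              (data["ContractName"].isin(["Democratic", "Republican"]))].copy()
subset["Resolved Yes"] = (subset["ContractName"] == subset["Winner"]).astype(int)

# Ten-cent price buckets: 0-9 cents, 10-19 cents, ..., 90-99 cents.
bins = np.arange(0.0, 1.1, 0.1)
labels = [f"{10 * i}-{10 * i + 9}" for i in range(10)]
subset["Bucket"] = pd.cut(subset["CloseSharePrice"], bins=bins,
                          labels=labels, include_lowest=True)

observed = subset.groupby("Bucket")["Resolved Yes"].mean()
expected = bins[:-1] + 0.05        # the median price of each bucket

x = np.arange(len(labels))
plt.bar(x - 0.2, observed.values, width=0.4, label="Observed frequency of Yes")
plt.bar(x + 0.2, expected, width=0.4, label="Expected if price = probability")
plt.xticks(x, labels)
plt.xlabel("Closing price bucket (cents)")
plt.ylabel("Proportion resolved to Yes")
plt.legend()
plt.show()
```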

Though the observed bars mostly follow the expected, there are some areas of significant deviation, such as from 30-39 cents and from 60-69 cents. We can visualize this in a more granular way via a scatterplot:

Indeed, even on a more granular level, we see PredictIt pricing is not so representative of the true probability, at least in this sample - very few values fall even close to where they would fall if they represented true probability. That said, we will see if any meaning can be extracted from the scatterplot above in our hypothesis testing, specifically in the context of correlation.

Meanwhile, we can visualize the difference between the proportion resolved to "Yes" and the contract price more effectively in the plot below:

The spikes on this graph seem to mirror each other, and this makes sense - for each Democratic contract priced at x, there is usually a corresponding Republican contract priced around 1-x; further, for whichever contract resolves to yes, the other resolves to no (as only one of the two parties will win a given seat).

We can also attempt to see if time out from the election plays a role in pricing accuracy. In the bar chart below, we can see the frequency at which a contract priced in a given range resolved to "Yes" a specific amount of time out from the election (1 day out, 7 days out, 30 days out, and 90 days out), and the frequency at which a contract priced in that range would have been expected to resolve to yes if the price were reflective of true probability (which we will again say is the median value of that range, but could reasonably be labeled as any value in that range).

Looking at each of these ten buckets, it seems like there may be some correlation between time and pricing accuracy (namely, that pricing accuracy gets better as an election gets closer, but we will test this later).

We will now create a similar plot to the one above, but instead of making a different column for each time, we will make columns based on trade volume. In the bar chart below, we can see the frequency at which a contract priced in a given range resolved to "Yes" given a specific volume (0 trades, 1-9 trades, 10-99 trades, 100-999 trades, and 1000+ trades), and the frequency at which a contract priced in that range would have been expected to resolve to yes if the price were reflective of true probability (which we will say is the median value of that range, but could reasonably be labeled as any value in that range).

It is much harder to see from here if trading volume correlates in any way with pricing accuracy, but we can test this more thoroughly later.

Finally, one more thing I was interested to see was the sum of all contracts in a market over time; as mentioned earlier, contracts in a market may not always sum to a dollar, potentially presenting arbitrage opportunities, and I wanted to visualize this. As there are 35 markets, it would be too crowded to view the sums of all of the markets' contracts over time. So, the chart below shows the sum of all contracts in a market for a sample of the markets in our data across time, and the average sum of all contracts in a market across time, up to 90 days out.

As we can see (which will be important for a decision shortly), the prices of all contracts in a market usually sum to a value greater than a dollar.

Hypothesis Testing

First, we'll start with a simple hypothesis - that, at least in the 2020 U.S. Senate markets, the contract price and the observed probability (meaning, the proportion of contracts that resolved to "Yes") are related. We can test this hypothesis by using the stats library to get both the p-value and the r-value of the relationship between contract price and observed probability in our dataset. The p-value tells us the likelihood of observing the sort of variance that we see in the data if the two variables (in our case, contract price and observed probability) were completely unrelated; a very low p-value (0.05 is a conventional cutoff for "very low") tells us that the observed data is unlikely to have occurred if no relationship between the variables existed, so we can conclude that a relationship does exist between the two variables and that what we are seeing is not just random noise. Meanwhile, the r-value tells us the strength of the relationship: the correlation between the two variables.

Though observed probability will be on the y-axis in the graph below, it is not really a dependent variable (as a variable on the y-axis usually is); in reality, both contract price and observed probability depend on other variables, such as the demographics of the state, the candidates' performance on the campaign trail, and the state's early voting laws, all of which impact both gamblers' evaluation of the odds and the true odds of victory (which translate into the observed odds) for a given party. However, since one variable needs to go on the x-axis and one on the y-axis, I am placing the contract price on the x-axis because it will be easier to code up that way.
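A sketch of the regression using scipy's stats module; grouping by exact closing price is one reasonable way to turn prices into observed proportions, though the exact aggregation behind the original plot may differ:

```python
from scipy import stats

# Observed probability: the share of contracts at each closing price that
# ended up resolving to "Yes".
by_price = subset.groupby("CloseSharePrice")["Resolved Yes"].mean()
prices = by_price.index.values
observed_prob = by_price.values

fit = stats.linregress(prices, observed_prob)
print(f"r = {fit.rvalue:.3f}, p = {fit.pvalue:.2e}")

plt.scatter(prices, observed_prob, s=10)
plt.plot(prices, fit.intercept + fit.slope * prices, color="black")
plt.xlabel("Closing contract price (dollars)")
plt.ylabel("Observed probability of resolving to Yes")
plt.show()
```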

As we can see from the p-value, the probability of seeing points like this if there was no relationship between contract price and observed (true) probability is well under 0.05, and thus we conclude that such a relationship does in fact exist. Further, the correlation between the two variables is very high, at around 91%! So, there is a strong relationship between a contract's price and the observed probability of that contract resolving to "Yes".

Additionally, we can use this output to predict the probability of a contract resolving to "Yes", given its price, using the regression line: simply plug in the contract price as the x-value, and the y-value returned is the predicted probability of that contract resolving to "Yes". Of course, this isn't entirely accurate, as probabilities greater than 1 can be returned, but it is worth noting that, along the interval from 0.01 to 0.99, y never deviates from x by more than 0.07; meaning, for any given price, we would predict the probability of that contract resolving to yes to be within 0.07 of that price. We can also see that the difference between x and y on the line gets greater around the edges, and disappears near 0.50. Meaning, if PredictIt's price implies something has a very high chance of happening, it is generally even likelier than implied by the price, and if PredictIt's price implies something has a very low chance of happening, it is generally even less likely than implied by the price, at least according to this linear model. But as we will see, there are other factors we need to consider.

Sidenote - establishing a metric

Before testing any of our hypotheses relating to pricing accuracy, whether over time or over trading volume, it is important to establish a metric with which we can measure pricing accuracy. Fortunately, there are plenty of strong options that we could use. Unfortunately, the number of options is somewhat overwhelming, and which one to choose is unclear (though they would likely all work for our purposes). So, to help with the decision, I turned to my favorite stats gurus at FiveThirtyEight, as I knew that they had done an autopsy of their 2020 election models; I figured that whatever metric they used to evaluate the accuracy of their model's predictions there, I could use to evaluate the accuracy of PredictIt's "predictions" here. But the level of analysis in FiveThirtyEight's publicly-available analysis of their predictions' performance is closer to the level of the EDA above than to the level of rigorous hypothesis testing (I assume that they have done or will do this internally, though even if that's the case, we still don't have access to their metric). This is not helpful for, say, comparing performances across time, so I needed to look elsewhere for suggestions.

Luckily, an excellent stats blog came to the rescue; in fact, the same one: FiveThirtyEight. More specifically, their sports section. FiveThirtyEight allows readers to compete against their NFL forecasting model by assigning their own probabilistic predictions to games and comparing accuracy. Of course, in their case as much as in ours, properly comparing accuracy requires a metric, and there FiveThirtyEight settled on a system based on Brier scores. We will use this metric as well, and refer to a score given by this method as an FTE-score.

Here is how an FTE-score is calculated for a given market on a given day:
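A minimal sketch, assuming the same scale FiveThirtyEight uses in its NFL game (a perfectly confident correct call earns 25 points, a coin-flip call earns 0, and a perfectly confident wrong call loses 75): for each market, we score the implied probability of its Democratic contract that day against whether the Democrat actually won.

```python
def fte_score(implied_prob, won):
    """Brier-style score, assumed to be on FiveThirtyEight's NFL-game scale:
    25 points for a perfectly confident correct call, 0 for a 50/50 call,
    -75 for a perfectly confident wrong call."""
    outcome = 1.0 if won else 0.0
    return 25 - 100 * (implied_prob - outcome) ** 2

# A market whose Democratic contract implies 70% on a day the Democrat
# eventually wins scores 25 - 100 * 0.09 = 16 points for that day.
print(fte_score(0.70, True))   # 16.0
print(fte_score(0.70, False))  # -24.0
```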

We use implied probability, and not price, because, as noted above, this method of scoring punishes overconfidence; since PredictIt's prices more often sum to over 1.00 than not (see the EDA above), scoring the raw prices would saddle PredictIt with lower FTE-scores due to seeming overconfidence. Also, we only use the Democratic contracts in a given market because, since the two parties' implied probabilities add up to 1.00, the scores on both sides simply mirror each other (this is easy to prove mathematically, but we won't do so here), so including both the Democratic and Republican contracts would just double-count the same information.

Back to hypothesis testing

So our hypothesis structure now changes slightly from the previous one: we hypothesize that PredictIt's implied probability (not price) becomes more accurate as days out from the election decreases. We want to see in the plot below that, as election day gets closer, PredictIt's markets' average FTE-score, which measures the accuracy of the implied probabilities of the markets, rises.
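A sketch of this test, building on the earlier snippets:

```python
# FTE-score for each Democratic contract on each day, then the daily average
# across markets, regressed against days out.
dem = subset[subset["ContractName"] == "Democratic"].copy()
dem["FTE Score"] = [fte_score(p, winner == "Democratic")
                    for p, winner in zip(dem["Implied Probability"], dem["Winner"])]

daily_avg = dem.groupby("Days Out")["FTE Score"].mean()
days, scores = daily_avg.index.values, daily_avg.values

fit = stats.linregress(days, scores)
print(f"slope = {fit.slope:.4f} points per day out, p = {fit.pvalue:.2e}")

plt.scatter(days, scores, s=10)
plt.plot(days, fit.intercept + fit.slope * days, color="black")
plt.gca().invert_xaxis()           # so election day sits on the right
plt.xlabel("Days out from the election")
plt.ylabel("Average FTE-score across markets")
plt.show()
```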

As we can see above, our hypothesis has more than ample evidence; taking no relationship between days out and PredictIt's markets' average FTE-score as the null hypothesis, we can see that our p-value is orders of magnitude below 0.05, thus enabling us to reject the null hypothesis (though I recognize the null hypothesis we would really have to reject is the one that the relationship is positive, this can also be done easily given the absurdly low p-value here; the same applies in the upcoming analyses). So, we can reasonably conclude that PredictIt's implied odds become more accurate as election day draws closer.

However, a reader may wonder what, exactly, is happening with those spikes >100 days out? Why was PredictIt so much more volatile further out from the election? We can assume that this was due to the lower number of Senate markets present on PredictIt, which increased the day-by-day variance of the score of the average market (smaller sample leads to higher variance of the average). As we can see, after all of the markets are added (around 100 days out), the average FTE-score stabilizes, but previously the smaller number of markets was enabling huge swings (with a smaller number of markets, a huge swing in one market's accuracy would have a greater influence on the average accuracy). So, perhaps it is worth analyzing if the relationship between days out and average FTE-score still exists among only points where all markets were already added. And we can do this below.

Sure enough, among only points which average out PredictIt's FTE-scores for all thirty-five 2020 U.S. Senate races, we still observe an increase in average FTE-score as the election draws nearer, meaning, at least by how we are measuring accuracy, the odds implied by PredictIt's prices become more accurate as election day draws nearer.

A friend of mine once mentioned that he had heard Nate Silver on the FiveThirtyEight politics podcast state that the polls aren't especially indicative of the election results until 20 days before the election. Whether or not this is true, his statement inspired the question: does PredictIt's accuracy actually improve before 20 days prior to the election, or is the regression line over the 100+ day period above simply being dragged upward by the points within 20 days of the election? We can see if accuracy really does improve closer to the election in the plots below, which separate out the plot above into two periods: one for points within 20 days of the election, and one for points more than 20 days out from the election.

This first plot shows an expected result: in the 20 days leading up to the election, at least in our sample, PredictIt becomes more accurate, with a rising FTE-score in that timeframe. We see this in the very low p-value (less than 1 in 10 billion).

But, somewhat surprisingly, accuracy grows over time even further than 20 days out from the election. The second plot shows that, in the timeframe starting when all thirty-five markets were up and running until 21 days out from the election, the FTE-score still grows on average, albeit more slowly than in the timeframe from the first plot (by about 0.0033 points per day here versus about 0.1151 points per day there). And with a p-value near 0.01, we can be fairly confident in this result.

So, we have shown another cool result: for the 2020 Senate races, PredictIt became more accurate, on average, as election day approached, and this growth in accuracy sped up as the election got especially close. And this intuitively makes sense, as uncertainty declines closer to the election - debates have already happened, and we know who won them; scandalous stories are less likely to break with so little time left; new national crises and issues are less likely to emerge in a smaller timeframe - leading the markets to become more confident in their assessments, thus raising the FTE-score relative to what it was weeks before when their predictions end up being correct. Occasionally, of course, the market will become more confident in the wrong outcome, but, as we see above, this is more than offset by the occasions where the market will become more confident in the correct outcome - at least, for the 2020 U.S. Senate races.

Before moving on to the next section, let's test one more hypothesis: namely, that a relationship exists between trade volume and accuracy. We can plot trade volume versus the average FTE-score of a market at that trade volume below.

And, as we can see from the high p-value, it is very likely that we could see results like this if no relationship existed between FTE-score (accuracy) and trade volume; so, we cannot conclude that such a relationship exists.

Comparison Introduction - Obtaining and Cleaning the 538 Data

I found analyzing the above data to be both a fun and an informative exercise. But, as the reader may have noticed, this analysis was only step one of a larger plan. Step two: comparing PredictIt's performance to FiveThirtyEight's. When I began working on this project, I was not planning on this comparison, but once I realized how simple it was to obtain FiveThirtyEight's data, I became very excited to compare the performance of the two. So, let's do it.

First, we need to load in FiveThirtyEight's data.

As we can see, there are a lot of unnecessary columns, but by looking through the CSV file in Excel and planning ahead, we can decide now which columns are relevant and remove the unneeded ones. We only want to keep district, forecastdate (from what I can tell, the file contains only one recording per day, so the timestamp, which also tells us the time of day, is irrelevant; since we will be comparing FiveThirtyEight's predictions to PredictIt closing prices, and we know the forecast update occurred before the end of the day, we can take whatever odds are listed on the forecast date as being for that date's end), winner_Dparty (the probability the model gives to a Democratic victory), and winner_Rparty (the probability the model gives to a Republican victory).

Then, as we did with the PredictIt data, let's label each entry with how many days out it is from the election. We will simply take the date portion of the timestamp, and subtract it from the actual election day to get this value.
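A sketch of loading and trimming the FiveThirtyEight data; the file name is a stand-in, but the column names are the ones described above:

```python
# "senate_forecast_2020.csv" is a stand-in name for FiveThirtyEight's
# published Senate toplines file.
fte = pd.read_csv("senate_forecast_2020.csv")
fte = fte[["district", "forecastdate", "winner_Dparty", "winner_Rparty"]]

# Days out from the election, using only the date portion of the timestamp.
forecast_dates = pd.to_datetime(fte["forecastdate"]).dt.normalize()
fte["Days Out"] = (ELECTION_DAY - forecast_dates).dt.days

fte.head()
```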

Unlike what we did with the PredictIt data, we will not add a column here for implied two-party win probability, as the structure of the data allows this to be calculated easily as we iterate through: implied two-party Democratic win probability = winner_Dparty/(winner_Dparty + winner_Rparty).

Exploratory Data Analysis - PredictIt versus FiveThirtyEight

Before doing anything, we must note that FiveThirtyEight and PredictIt can only be compared on common predictions. Meaning, if PredictIt has a market which opened before FiveThirtyEight publicized their model, or if FiveThirtyEight publicized their model before a given prediction market opened on PredictIt, there is nothing to be compared. Fortunately, we can easily see that all FiveThirtyEight forecasts occur after their corresponding markets opened on PredictIt: recall from the plots above that all 2020 Senate prediction markets were open by 100 days out from the election. Then, we only need to see that the highest value of days out in the FiveThirtyEight dataset is less than 100:
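That check is a one-liner:

```python
# Every PredictIt Senate market was open by 100 days out, so a maximum below
# 100 here means each FiveThirtyEight forecast has a PredictIt counterpart.
print(fte["Days Out"].max())
```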

So, any prediction made by FiveThirtyEight has a corresponding prediction on PredictIt. Additionally, the corresponding prediction of a FiveThirtyEight prediction is made easy to find by the structure of the pred dictionary, created earlier.

That said, let's do our first bit of exploration: seeing how the implied Democratic odds of FiveThirtyEight compare to the implied Democratic odds of PredictIt in common markets, via both a scatterplot and a hard number.

Points above the black line are points where PredictIt gave a higher Democratic win probability than FiveThirtyEight, while for points below the black line, the opposite is the case. If all points were above the line, it would indicate an overall PredictIt Democratic bias relative to FiveThirtyEight, whereas if all points were below the line, it would indicate an overall PredictIt Republican bias relative to FiveThirtyEight. Though this is not clear visually, it appears from the text output that PredictIt has a slight Republican bias relative to FiveThirtyEight. But we can test the significance of this properly in our hypothesis testing.

To possibly get an idea of relative accuracy, we can also visualize how the races turned out using a multicolored plot, similar to the one above.

From the above plot, it appears both FiveThirtyEight and PredictIt generally underweighted Republicans in 2020 Senate races, as there are many more red points (indicating Republican victory) in cases where both FiveThirtyEight and PredictIt favored the Democrats than there are blue points in the inverse situation. This plot does not make relative accuracy clear, though it is worth noting that there are two red clusters near the middle of the plot which are significantly above the line, implying FiveThirtyEight had been much more accurate than PredictIt in those cases. Further, it is worth noting that no candidate which had been given above a 90% win probability at any point in the election lost. We will use this fact in accuracy comparison later.

As we saw earlier, though, at least with PredictIt, predictions become more accurate closer to the election; perhaps it is worth viewing a plot similar to the above with just points from the day before the election.

On the day before the election, a PredictIt Republican bias relative to FiveThirtyEight appears more clearly from the high number of points below the y=x line. So, perhaps we should see if bias changes over time in the next section as well.

Hypothesis Testing - PredictIt versus FiveThirtyEight

There are really two hypotheses we want to test: is PredictIt biased relative to FiveThirtyEight, and which of the two is more accurate?

First, we will test bias. We will plot PredictIt's bias relative to FiveThirtyEight over days out from the election, and see whether either the overall bias (indicated by the intercept of the graph; the "const" row in the upcoming chart has the data for this) or the change in bias over time (indicated by the x-coefficient; the "x1" row in the upcoming chart has the data for this) is statistically significant.
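A sketch of this regression using statsmodels; the bias measure here (PredictIt's implied Democratic probability minus FiveThirtyEight's) and the (district, days out) key format of the pred lookup mentioned earlier are my assumptions:

```python
import statsmodels.api as sm

# One bias observation per forecast: PredictIt's implied Democratic probability
# minus FiveThirtyEight's (negative values = Republican bias relative to 538).
rows = []
for _, r in fte.iterrows():
    fte_dem = r["winner_Dparty"] / (r["winner_Dparty"] + r["winner_Rparty"])
    pi_dem = pred[(r["district"], r["Days Out"])]   # hypothetical key format
    rows.append((r["Days Out"], pi_dem - fte_dem))

bias = pd.DataFrame(rows, columns=["Days Out", "Bias"])
daily_bias = bias.groupby("Days Out")["Bias"].mean().reset_index()

# Plot the average bias over time, then fit an OLS line; the "const" and "x1"
# rows of the summary are the two significance tests described above.
plt.plot(daily_bias["Days Out"], daily_bias["Bias"])
plt.gca().invert_xaxis()
plt.xlabel("Days out from the election")
plt.ylabel("PredictIt implied Dem prob minus 538 implied Dem prob")
plt.show()

X = sm.add_constant(daily_bias["Days Out"].values)
model = sm.OLS(daily_bias["Bias"].values, X).fit()
print(model.summary())
```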

Curiously, PredictIt seems to be flipping back and forth in its bias relative to 538 until about a month out, when it begins to head sharply towards a Republican bias. Twenty days out, the bias becomes Republican for the first time since about thirty-six days out, and by the day before the election, PredictIt's Republican bias in the 2020 U.S. Senate markets averages 3%. Further, as we can see in the table (P>|t| column), both p-values are significantly lower than 0.05; we can therefore conclude that the trendline above is significant.

But what if, like with accuracy before, we broke down bias into two graphs: one twenty days out and closer to the election, and one over twenty days out from the election. We would almost certainly still see a trend near the election, but would there be any meaningful bias farther out? Let's see.

While the near-election output turned out as expected, the output farther out showed something seemingly crazy - a trend in the opposite direction! And the p-values in all cases are low enough to consider the results significant.

So, farther out from the election, PredictIt tends to show a slight Democratic bias relative to 538, while when the election gets close, PredictIt tends to show a heavy Republican bias relative to 538. We will break this down more in the conclusion.

Now, the moment I've been most excited for: let's compare PredictIt's accuracy to FiveThirtyEight's! Again, we will be using FTE-scores as our metric to measure accuracy; after all, if FiveThirtyEight is deemed to be less accurate than PredictIt by their own metric, surely they would not dispute the result! We will do this comparison over time, as we found earlier (at least in the case of PredictIt) that accuracy changes over time significantly.

This is intense! They seem to be running neck-and-neck; let's plot out the differences in scores and see if we can pull out anything statistically significant.

Neither the constant nor the x-coefficient has a low enough p-value to deem either of them statistically significant; they are 0.101 and 0.061 respectively, and both are greater than 0.05. So, we cannot conclude whether FiveThirtyEight or PredictIt is more accurate overall in a statistically significant way over this dataset, though we can view accuracy winners on individual days using the plots above.

It may also be worth examining who was more accurate in only close races, which I'll define as a race in which both FiveThirtyEight and PredictIt never gave either party a 90% chance or higher of winning; as we saw earlier, in any race which we would not call "close", the winner was of the favored party. Let's examine this in the scatterplot below.

Here, it seems we have more consistent PredictIt outperformance of FiveThirtyEight; as above, we can test this by building a linear regression model on the differences.

Here, we do have statistically significant results, as seen in the low p-values! So, in close races (as defined), PredictIt not only outperforms FiveThirtyEight at the baseline (94 days out, as we see above), but also has an increasing accuracy relative to FiveThirtyEight as election day comes closer! And this is all measured using a metric commonly used by 538!

Machine Learning - PredictIt versus FiveThirtyEight

Though we can't tell whether PredictIt or FiveThirtyEight is more accurate in predictive power over all 2020 U.S. Senate races, we could still try to see whose predictions "mean more". As in, we can answer the question - what gives us more information as to who the winner of a race will actually be: knowing FiveThirtyEight's prediction for a race, or knowing PredictIt's? Fortunately, we have the perfect tool to answer this question: decision trees.

A decision tree allows us to classify inputs by following its branches using the attributes of our input. An example of how this may work: we may have a dataset of dogs, humans, and cats, with two attributes: #legs and #whiskers. The first node on the tree may say: if #legs is less than 3, follow the left branch; otherwise, follow the right branch. The left branch leads to a human classification, while the right branch leads to another node, which could say: if #whiskers is 0, follow the left branch; otherwise, follow the right branch. The left branch here leads to a dog classification, while the right branch leads to a cat classification. An example of how this plays out: suppose we have data on a middle-aged man with 2 legs and 4 whiskers (he hasn't been good about shaving). We look at the root node, which tells us to follow the left branch, which (correctly) classifies this input as human!

A decision tree, at least when using "entropy" as the criterion in sklearn, is built using the following algorithm: look over the whole training data (which I will explain shortly) to find the attribute which, when split on, reduces uncertainty the most. Split on this attribute. Repeat for resulting nodes until either certainty is reached, and we classify as the class of all of the objects represented in that node, or we hit the max_depth, and we classify as the class of the majority of objects represented in that node.

Now, we'll try to build a decision tree. However, we want an optimal max_depth parameter so that there is not overfitting to our data (I recognize other parameters can be tweaked, but I concluded in an earlier project that this is the most significant one). To find the optimal max depth of the tree, we will use holdout validation, where a random sample is taken from our dataset to train the model (about 70% of the data), and the remainder of the data is used to test the model's classification accuracy. We then choose a max_depth which maximizes the proportion of the test data correctly classified.

Note that the attribute in the root node is what the algorithm considered to be the most important attribute to split on. So, let's create a few decision trees using this method, and see if the algorithm consistently chooses one attribute to split on at the root. We will use three attributes: FiveThirtyEight implied Democratic odds, PredictIt implied Democratic odds, and days out from the election (as this was found to significantly influence accuracy). If the trees are consistently splitting on FiveThirtyEight implied Democratic odds, then perhaps the model believes that to be better information to know with regards to predicting outcome, and if the trees are consistently splitting on PredictIt implied Democratic odds, then perhaps the model believes that to be better information to know with regards to predicting outcome.
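A sketch of this procedure with scikit-learn; "merged" and the feature column names are stand-ins for a dataframe that pairs each FiveThirtyEight forecast with the PredictIt market for the same race on the same day:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

features = ["538 Dem Prob", "PredictIt Dem Prob", "Days Out"]
X = merged[features]
y = merged["Winner"]

# Holdout validation: train on a random ~70% of the data, test on the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Choose the max_depth that classifies the held-out data best.
best_depth, best_acc = 1, 0.0
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    candidate.fit(X_train, y_train)
    acc = candidate.score(X_test, y_test)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

tree = DecisionTreeClassifier(criterion="entropy", max_depth=best_depth)
tree.fit(X_train, y_train)

# The feature the algorithm split on at the root of the tree.
print("Root attribute:", features[tree.tree_.feature[0]])
```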

In the trees above, we can see that the root attribute used is sometimes FiveThirtyEight's probability and sometimes PredictIt's probability; it really just depends on which portion of the data is used as training data and which is used as testing data. So, no meaningful conclusion can be drawn here, either. Though it is worth noting: in creating these decision trees, I realized that the choice of root attribute does not necessarily imply which trait is truly more important. In the above data, we see in most of the trees that the first attribute is used generally to classify the most Democratic markets, which makes sense: very rarely does a market in which PredictIt or FiveThirtyEight had given the Democrats over 75% odds go to the Republicans, so the tree uses a number around there to classify points as certainly Democratic. However, classifying toss-ups, for which one could argue that knowing whether to trust PredictIt's or FiveThirtyEight's predictions matters more, occurs in the middle of the tree and requires much more nuanced splitting than what occurs at the root, as we can see in the trees above. And in that part of the tree, both PredictIt and FiveThirtyEight predictions are heavily used (days out is sometimes used as well, so it isn't irrelevant in prediction).

Also of note: using this dataset to train a decision tree to classify PredictIt U.S. Senate election markets in general (let alone simply PredictIt U.S. election markets more broadly) may be flawed, as this entire dataset is likely skewed. But we will dive into this more in our closing section.

Conclusion

A few points from throughout our analyses and hypothesis testing stand out as noteworthy:

But now we need to discuss our skewed sample, which may have played a role in our observations. In the 2020 general election, Democratic margins on average underperformed the polls by nearly 4%, meaning polling was skewed towards Democrats, and the actual results were more Republican than the polling predicted. So, there may be a cause-and-effect relationship between PredictIt's Republican bias relative to FiveThirtyEight and PredictIt's increased accuracy relative to FiveThirtyEight: PredictIt likely beat out FiveThirtyEight in accuracy simply because it was more bullish on Republicans in the 2020 U.S. Senate elections, and Republicans outperformed expectations more frequently than Democrats. This would render our accuracy conclusions a by-product of our bias conclusion: PredictIt's markets only outperform FiveThirtyEight's forecasts in this dataset because its markets are more biased towards Republicans.

This raises the question: is this (ultimately correct) bias over the 2020 U.S. Senate sample representative of PredictIt's predictive power, or did PredictIt simply get lucky in this dataset? The answer is not clear, and we would need more data - perhaps PredictIt has a general bias towards Republicans relative to FiveThirtyEight's forecast, regardless of the year, and while in years like 2016 and 2020 this would cause PredictIt to appear more accurate, in years like 2018, where Democrats slightly overperformed the polls, this bias would hurt PredictIt's accuracy relative to FiveThirtyEight. Or, perhaps in years like 2016 and 2020, PredictIt shows (accurate) Republican bias relative to FiveThirtyEight's significantly polls-based forecasting, while in years like 2018, PredictIt shows (accurate) Democratic bias relative to it. Whether the former or the latter is the case, we cannot know, as we do not have the data to test these hypotheses.

In conclusion: over this dataset, PredictIt's prices and implied probabilities appear strongly correlated with actual, observed outcomes. And, as election day approaches, PredictIt's implied probabilities become more accurate; PredictIt even appears more accurate than FiveThirtyEight, a reputable forecaster! But when it comes to generalizing these conclusions to other markets in PredictIt, we simply don't have the data to do so.

Short Thank-Yous

I would like to again thank PredictIt for providing me with their 2020 U.S. Senate election data. This analysis was a pleasure to perform, and I hope I have the opportunity to perform analyses like this on more of their data in the future.

Additionally, I would like to thank my professor, John Dickerson, for his guidance throughout this process. I feel confident in saying that this project wouldn't have come to fruition without his advice at a few crucial steps in the process.

Finally, I want to thank my parents, as they had to put up with me talking about FiveThirtyEight, PredictIt, and polling nonstop from ages 15-18.