Machine Learning Bracketology Methodology

Machine Learning Bracketology Methodology, say that five times fast.

Anyway, the AMSTS Machine Learning Bracket Projection is exactly what it's called: it leverages a number of machine learning models to try to replicate the NCAA Selection Committee's decision process. We'll cover the process first, then talk a little bit about each model and what it likes and dislikes.

A full breakdown of the committee's selection process is done here, and the AMSTS Bracketology process attempts to recreate it as closely as possible, with minor adjustments as needed. A simple example: until conference tournaments are completed, there is no guarantee of who the champion of a particular conference will be. Thus, conference champions are projected throughout the season using each team's remaining schedule and its AMSTS Computer Ranking.

Like the selection committee, we then select the 36 best at-large teams in the nation. Well, we try to simulate what the committee will pick, because it might not always pick the 36 "best" teams out there. So rather than simply taking the top 36 computer-ranked at-large candidates, the projection follows the NCAA Selection Committee's process, with each machine learning model acting as a "committee member". As in the committee's process, an initial round of voting populates the pool of potential at-large teams, along with a list of teams that should be in consideration. Like the NCAA Selection Committee, if all but two of the models choose a team as "at-large", it is immediately passed through to the tournament. The remaining teams that have received at least three consideration or at-large votes are placed in the consideration pool.
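The initial ballot can be sketched in a few lines of Python. This is only an illustration of the vote thresholds described above, not the actual AMSTS script: the ballot structure (one dict of sets per model) and the function name are made up for the example, and "all but two" is read as "at least n_models - 2".

```python
from collections import Counter

N_MODELS = 8  # eight "committee members" in this write-up

def initial_ballot(ballots, n_models=N_MODELS):
    """Split teams into automatic at-large picks and a consideration pool.

    ballots: one dict per model with 'at_large' and 'consideration'
    sets of team names (a hypothetical structure for this sketch).
    """
    at_large_votes = Counter()
    any_votes = Counter()
    for ballot in ballots:
        for team in ballot["at_large"]:
            at_large_votes[team] += 1
            any_votes[team] += 1
        # Count a consideration vote only once per model and team.
        for team in ballot["consideration"] - ballot["at_large"]:
            any_votes[team] += 1

    # "All but two" models voting at-large sends a team straight in.
    locks = {t for t, v in at_large_votes.items() if v >= n_models - 2}
    # At least three votes of either kind puts a team in the pool.
    pool = {t for t, v in any_votes.items() if v >= 3 and t not in locks}
    return locks, pool
```

With eight models, a team named on six at-large ballots is locked in, while a team with five at-large votes, or three consideration votes, lands in the consideration pool.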

Once in the consideration pool, the models select the eight best teams of that pool. The top eight selections across all the models are placed into a voting round. Each model once again ranks all eight teams from 1-8 and the top four teams are given at-large berths. The committee then goes to the next four teams and again ranks them 1-8 with the top four getting an at-large berth. The remaining four teams are placed back in the pool and the process repeats itself until all 36 at-large bids have been allocated.
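Here is a rough sketch of that loop in Python. It is a simplified reading of the paragraph above, not the real script: every round, each model nominates its eight best teams, the eight most-nominated teams are ranked 1-8 by every model, the four with the lowest rank totals get bids, and the other four go back into the pool. The per-model score dicts, the alphabetical tie-break, and the function name are all assumptions for the example.

```python
from collections import Counter

def fill_at_large(pool, model_scores, bids_left, ballot_size=8, per_round=4):
    """Hand out at-large bids through repeated nominate-and-rank rounds.

    pool: set of team names still under consideration.
    model_scores: one dict per model mapping team -> score, where a
    higher score means the model likes the team more (hypothetical).
    """
    pool = set(pool)
    selected = []
    while bids_left > 0 and pool:
        # Step 1: every model nominates its eight best teams in the pool.
        nominations = Counter()
        for scores in model_scores:
            best = sorted(pool, key=lambda t: (-scores[t], t))[:ballot_size]
            nominations.update(best)
        # Step 2: the eight most-nominated teams go onto the ballot.
        ballot = sorted(pool, key=lambda t: (-nominations[t], t))[:ballot_size]
        # Step 3: every model ranks the ballot 1-8; lowest total wins.
        rank_sum = Counter()
        for scores in model_scores:
            ordered = sorted(ballot, key=lambda t: (-scores[t], t))
            for rank, team in enumerate(ordered, start=1):
                rank_sum[team] += rank
        n_winners = min(per_round, bids_left)
        winners = sorted(ballot, key=lambda t: (rank_sum[t], t))[:n_winners]
        selected.extend(winners)
        # Teams that were not picked simply stay in the pool.
        pool.difference_update(winners)
        bids_left -= len(winners)
    return selected
```

The last round hands out fewer than four bids if fewer than four remain, so the total never exceeds the number of open spots.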

After the field is fully set, with 36 at-large teams and 32 conference champions, the teams are seeded in an S-curve in a similar fashion to the method used by the selection committee. Placement into the bracket follows the committee's rules, as follows:

Each of the first four teams selected from a conference shall be placed in different regions if they are seeded on the first four lines.
Teams from the same conference shall not meet prior to the regional final if they played each other three or more times during the regular season and conference tournament.
Teams from the same conference shall not meet prior to the regional semifinals if they played each other twice during the regular season and conference tournament.
Teams from the same conference may play each other as early as the second round if they played no more than once during the regular season and conference tournament.
Any principle can be relaxed if two or more teams from the same conference are among the last four at-large seeded teams participating in the First Four.
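The three rematch rules above boil down to a lookup on how many times two conference opponents have already played. A small Python sketch of that check follows; the function names and round labels are ours, and it leaves out the first-four-lines rule and the First Four relaxation.

```python
# Tournament rounds in order, using labels chosen for this sketch.
ROUNDS = [
    "first round",
    "second round",
    "regional semifinal",
    "regional final",
    "national semifinal",
    "national championship",
]

def earliest_allowed_round(times_played):
    """Earliest round in which two same-conference teams may meet."""
    if times_played >= 3:
        return "regional final"
    if times_played == 2:
        return "regional semifinal"
    return "second round"

def meeting_allowed(times_played, meeting_round):
    """True if a same-conference matchup in meeting_round obeys the rules."""
    earliest = earliest_allowed_round(times_played)
    return ROUNDS.index(meeting_round) >= ROUNDS.index(earliest)
```

For example, two teams that split a home-and-home series (two games) could be bracketed to meet in a regional semifinal, but not in the second round.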

The region preferences shown on the analysis page give each team's preference for each Pod and Region. Conflicts due to one of the above rules are given a score of 88888, and hard limitations (BYU cannot play in a region or pod that plays on Sunday, and hosts of a region or pod may not play in that region or pod) are given a score of 99999. The bracket is then filled out from top to bottom. If a team is the lowest of its seed and has conflicts in the remaining region or pod, it is slotted down a seed, while the team below it is given the higher seed, likely at the expense of travel. This is in line with the committee's policies on seed moves.
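To make the scoring concrete, here is a minimal Python sketch of how sentinel scores like these can drive placement. The 88888 and 99999 values come from the description above; everything else is assumed for the example, including using travel miles as the ordinary preference score, the site names, and the function names.

```python
CONFLICT = 88888    # placement would break one of the bracketing rules
HARD_LIMIT = 99999  # e.g. a Sunday site for BYU, or a host's own site

def preference_scores(miles_by_site, conflicts=(), hard_limits=()):
    """Score each site for one team; lower is better.

    miles_by_site: dict of site -> travel distance (an assumed stand-in
    for whatever the real preference measure is).
    """
    scores = {}
    for site, miles in miles_by_site.items():
        if site in hard_limits:
            scores[site] = HARD_LIMIT
        elif site in conflicts:
            scores[site] = CONFLICT
        else:
            scores[site] = miles
    return scores

def pick_site(scores, open_sites):
    """Return the best open site, or None if every open site is blocked."""
    if not open_sites:
        return None
    best = min(open_sites, key=lambda s: scores[s])
    return best if scores[best] < CONFLICT else None
```

A None result is the signal that a seed move would be needed: the only slots left for that team are ones it cannot take.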

That's it! That is the process that gets us to a completed bracket. Now let's meet the selection committee members.

The "committee members" fall into four main groups. All are classification models from scikit-learn, operated through a Python script with SECRET ALGORITHMIC INPUTS. (They're probably just the ones you think they are, and they're the ones the committee has publicly said it uses, such as NET tier wins, out-of-conference strength of schedule, etc.) The models are trained on AMSTS ratings and tournament bids from the 2014-15 season onwards. As each year passes, more data is fed into the models in the hope that they will get better at prediction. Of the eight models used, two are k-nearest neighbors classifiers, three are Naive Bayes classifiers, two are Support Vector Machines, and the last one is a Random Forest Classifier. Let's meet them individually:

Gaussian Naive Bayes [GNB]: This model returns the probability that a particular team will receive an at-large bid, assuming that each input follows a normal (Gaussian) distribution within each class.
K-Nearest Neighbors (3 data points) [KNN-3]: This model finds the three most similar teams in its data history (the last five years) and bases the probability of making the tournament on how those three teams fared.
K-Nearest Neighbors (6 data points) [KNN-6]: The same process as the previous model, but it draws on six teams from its history, so it triangulates across a wider array of teams. You'd think the results would be similar between the two, but sometimes they are fairly different!
Support Vector Machine with RBF Kernel [SVC-RBF]: This model classifies by fitting a boundary, via a radial basis function, that separates teams that had favorable tournament outcomes (i.e. made the tournament) from those that did not. Because of the nature of the function used, this model also tends to believe that the bottom teams in the rankings deserve a shot at the tournament (think those teams ranked between 330-353). Those picks are removed from the selection process.
Support Vector Machine with Polynomial Kernel [SVC-Poly]: Like the above model, but using a polynomial function instead of an RBF. Again, the nature of polynomial functions means that this committee member sometimes thinks the lowest-ranked teams (330-353) should make the tournament. Those picks are also removed from the selection process.
Random Forest Classifier [RFC]: This model builds many decision trees from the historical data, and the probability of a tournament appearance comes from the historical teams that land in the same "leaves" as the team in question.
Multinomial Naive Bayes [MNB]: Similar to the GNB, this model also returns a probability of tournament appearance, but it is based on a multinomial distribution rather than a normal one.
Complement Naive Bayes [CNB]: Also a Naive Bayes model, this one is designed to cope with imbalanced classes, with the aim of a more exact estimate of the percent likelihood of a tournament appearance.
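For the curious, a committee like this can be assembled from scikit-learn in a few lines. The sketch below is a guess at the general shape, not the actual AMSTS script: the classifier classes are real scikit-learn classes, but every hyperparameter other than the neighbor counts and kernels is left at its default, and the helper names and the 0/1 vote encoding are assumptions. Note that the multinomial and complement Naive Bayes models require non-negative inputs.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import ComplementNB, GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def build_committee(random_state=0):
    """Return the eight committee members keyed by their short names."""
    return {
        "GNB": GaussianNB(),
        "KNN-3": KNeighborsClassifier(n_neighbors=3),
        "KNN-6": KNeighborsClassifier(n_neighbors=6),
        "SVC-RBF": SVC(kernel="rbf"),
        "SVC-Poly": SVC(kernel="poly"),
        "RFC": RandomForestClassifier(random_state=random_state),
        "MNB": MultinomialNB(),
        "CNB": ComplementNB(),
    }

def committee_votes(committee, X_train, y_train, X_new):
    """Fit every member on past seasons and collect its votes.

    y_train holds 1 for teams that received an at-large bid and 0
    otherwise (an assumed encoding); the result maps each model name
    to an array with one 0/1 vote per row of X_new.
    """
    votes = {}
    for name, model in committee.items():
        model.fit(X_train, y_train)
        votes[name] = model.predict(X_new)
    return votes
```

Summing the votes per team across the eight arrays gives the counts that the initial ballot thresholds work from.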

So that's the committee. Obviously, just like in a real selection committee, some members are better at their job than others, but the committee as a whole tends to steer towards better results. If you have a better science background than we do and would like to tell us why we are idiots for using a particular model, or what we can do to create a better output, please reach out to us via Twitter.