This assignment is to be done individually.
The following is offered with apologies to the vast majority of students who do their work honestly and take their university learning seriously:
Your instructor takes academic integrity seriously and has no tolerance for plagiarism or any other form of academic misconduct. Failure to respect these guidelines will result in you receiving a grade of zero on this assignment.
Acceptable collaboration between students, provided it is acknowledged explicitly in your report and code, might include:
Unacceptable collaboration and violations of academic integrity include, but are not limited to:
Modern classification techniques can be used to distinguish and categorize many things that some people may consider uniquely human. In this assignment you will use two basic classification algorithms to classify songs into ten music genres: classical, country, edm_dance, jazz, kids, latin, metal, pop, rnb, and rock.
This assignment is organized as a Kaggle competition. To participate you will need to create a Kaggle account using your McGill email. Once you have your account, follow the link posted on Moodle to join the competition.
Before doing any classification, you will need to download the data found in the 'Data' menu of the Kaggle competition. There you can download a zip file containing a CSV file and two subdirectories, one with training data and one with testing data. The data are a set of files, each corresponding to a single song. The CSV file, labels.csv, contains the labels for the songs in the training directory.
Every line in the song files consists of 12 decimal numbers, corresponding to the frequency analysis of a small segment of the song. There are between approximately 300 and 1300 segments per song, depending on the length, beat, and timbre of the music.
Each line can be thought of as a 12-dimensional feature vector that carries the genre label of its song.
Your first task is to download these files, and load them into your program so that they may be manipulated. The files are in CSV format, so you may benefit from the use of a dataframe library (e.g., pandas in python). Additionally, you should split your data into separate training and testing sets.
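A minimal loading sketch follows. It assumes that labels.csv has a header row followed by a song identifier and a genre in its first two columns, and that each song file is comma-separated; check the actual files and adjust as needed. The function names `load_labels` and `load_song` are hypothetical, and `pandas.read_csv` would serve equally well for the labels file.

```python
import csv

import numpy as np


def load_labels(labels_path):
    """Read a labels CSV into a {song_id: genre} dictionary.

    Assumption: one header row, then (song identifier, genre) in the
    first two columns. Adjust the indices if the real file differs.
    """
    with open(labels_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the assumed header row
        return {row[0]: row[1] for row in reader}


def load_song(song_path):
    """Read one song file into an (n_segments, 12) float array.

    Assumption: values on each line are separated by commas.
    """
    return np.loadtxt(song_path, delimiter=",", ndmin=2)
```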
You may wish to start by holding out a small subset of the training data for testing, making sure that you don't also train on these examples. This will allow you to gain confidence in the performance of your classifier before you try it on a larger portion of held-out data and, eventually, validate it on the competition test set.
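One way to keep held-out examples out of training is to split at the level of songs rather than individual feature vectors, so that no song contributes to both sets. The sketch below assumes songs are identified by hashable ids; the name `split_song_ids` and the default 20% hold-out fraction are illustrative choices, not requirements of the assignment.

```python
import random


def split_song_ids(song_ids, holdout_fraction=0.2, seed=0):
    """Shuffle song ids and return (train_ids, holdout_ids)."""
    ids = sorted(song_ids)  # sort first so the split is reproducible
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, int(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]
```

Fixing the seed makes it possible to compare classifiers on exactly the same held-out songs.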
The first type of classifier you will build is a Gaussian classifier.
For each genre, calculate the 12-dimensional mean vector and the 12×12 covariance matrix from your training set. These 12+144 numbers fully describe the probabilistic model of the genre. You should have one mean vector and one covariance matrix for each of the ten genres.
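The fitting step can be sketched as follows, assuming the training vectors have already been grouped into one array per genre. The function name `fit_gaussians` and the dictionary layout are hypothetical; only numpy is used for the arithmetic.

```python
import numpy as np


def fit_gaussians(vectors_by_genre):
    """Return {genre: (mean, covariance)} from {genre: (n, 12) array}."""
    models = {}
    for genre, vectors in vectors_by_genre.items():
        vectors = np.asarray(vectors, dtype=float)
        mean = vectors.mean(axis=0)
        # rowvar=False: each row is one observation, each column one feature
        cov = np.cov(vectors, rowvar=False)
        models[genre] = (mean, cov)
    return models
```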
With the model trained, you will now want to evaluate it on the testing set. This model allows you to make predictions both on the feature vectors and on songs as a whole. To make a prediction on a new feature vector x, simply plug it into the following equation, giving the unnormalized negative log likelihood (UNLL) for each genre:
where μ is the mean vector of that genre, and Σ is its covariance matrix.
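For reference, one common form of the UNLL for a multivariate Gaussian is the squared Mahalanobis distance between the feature vector and the genre mean. It is written here under the assumption that the whole normalization term, including the log-determinant of the covariance, is dropped. The subscript g, indexing the genre, is notation introduced for this sketch.

```latex
\mathrm{UNLL}_g(x) = (x - \mu_g)^{\top} \, \Sigma_g^{-1} \, (x - \mu_g)
```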
The UNLL is an indicator of the probability of a feature vector belonging to a song of a specific genre. The lower the UNLL, the higher the probability. Therefore, the genre with the lowest value of UNLL is the highest probability prediction for the corresponding feature vector.
Extending this idea, it is possible to predict a genre for an entire song. To accomplish this, repeat the previous step on all the feature vectors of that song and average the results for each genre. The genre with the lowest average UNLL is the prediction.
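The song-level prediction described above might look like the sketch below. It assumes the UNLL is the squared Mahalanobis distance (x − μ)ᵀ Σ⁻¹ (x − μ), and that `models` maps each genre to a (mean, covariance) pair; the names `song_unll` and `predict_song` are hypothetical.

```python
import numpy as np


def song_unll(song, mean, cov):
    """Average UNLL of all feature vectors of one song under one genre model.

    Assumption: UNLL(x) = (x - mean)^T cov^{-1} (x - mean).
    """
    diff = np.asarray(song, dtype=float) - mean        # (n_segments, d)
    # Solve cov @ z = diff.T rather than forming the explicit inverse
    solved = np.linalg.solve(cov, diff.T).T            # (n_segments, d)
    per_vector = np.einsum("ij,ij->i", diff, solved)   # one UNLL per segment
    return per_vector.mean()


def predict_song(song, models):
    """Return the genre whose model gives the lowest average UNLL."""
    scores = {genre: song_unll(song, mean, cov)
              for genre, (mean, cov) in models.items()}
    return min(scores, key=scores.get)
```

Using `np.linalg.solve` is a design choice for numerical stability; inverting each covariance matrix once with `np.linalg.inv` and reusing it would also work.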
The second type of classifier you will build is a k-nearest-neighbours (k-NN) classifier. One of the main advantages of this type of classifier is that no training is required: all the classifying information is found in the raw training data.
Given a new song, calculate the Euclidean distance between each of its feature vectors x and every feature vector in your training set. Then find the k training vectors with the smallest distance from x. Assign x the genre of the majority of these k vectors. Finally, to assign a genre to the song, find the majority genre among its feature vectors.
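The two-level vote described above can be sketched as follows. This assumes the training vectors fit in memory as a single array with a parallel sequence of genre labels, and that k is no larger than the number of training vectors. The name `knn_predict_song` and the default k=5 are illustrative, and ties are broken arbitrarily by `Counter`.

```python
from collections import Counter

import numpy as np


def knn_predict_song(song, train_vectors, train_genres, k=5):
    """Predict a song's genre by majority vote over per-segment k-NN votes."""
    song = np.asarray(song, dtype=float)
    train_vectors = np.asarray(train_vectors, dtype=float)
    train_genres = np.asarray(train_genres)
    # Squared Euclidean distances, shape (n_segments, n_train); squaring
    # preserves the ordering, so the nearest neighbours are unchanged.
    sq_dists = (
        (song ** 2).sum(axis=1)[:, None]
        - 2.0 * song @ train_vectors.T
        + (train_vectors ** 2).sum(axis=1)[None, :]
    )
    # Indices of the k closest training vectors for each segment (unordered)
    nearest = np.argpartition(sq_dists, k - 1, axis=1)[:, :k]
    # First vote: majority genre among the k neighbours of each segment
    votes = [Counter(train_genres[row]).most_common(1)[0][0] for row in nearest]
    # Second vote: majority genre among the segments of the song
    return Counter(votes).most_common(1)[0][0]
```

The full distance matrix grows with the number of segments times the number of training vectors, so with the complete training set it may be necessary to process the song in chunks or to subsample the training vectors.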
Try to improve your score on Kaggle. You are free to use any method, classifier or third-party library you want. Discuss your method and how it compares to the previous two classifiers.
You are expected to write the algorithms for the Gaussian and k-NN classifiers. You are allowed to use a linear algebra package, e.g., numpy, as you see fit, as long as you write the core of the prediction algorithm.
Your report should address the following questions:
Your assignment must be submitted through Moodle to allow for peer- and self-assessment. The submission must contain:
(Subject to minor revision)
Question/Criterion | Unsatisfactory | Bare minimum | Satisfactory | Above and beyond |
---|---|---|---|---|
1. What assumptions about the data do we make when we model the data using a Gaussian distribution? | 0. No discussion. | 1. One assumption is stated. | 3. Several key assumptions are stated. | |
2. When do you expect that a Gaussian will work well and when do you think it will not work well? | 0. No discussion. | 1. Only one situation is given, where one classifier is better than the other. | 3. Several situations are given, without elaborating on what makes one classifier better than the other. | 5. All previous criteria, with elaborate discussion of the factors that make one classifier better than the other. |
3. What values of k work best for the k-NN classifier? | 0. No discussion. | 1. A value of k is stated without a supporting graph. | 3. A value of k is stated, with a graph showing the results of different experiments. | |
4. Based on your results from this assignment, which classifier (Gaussian, k-NN, or other) works best for the task of music genre classification? | 0. No discussion. | 1. A classifier is stated as best, without supporting evidence. | 3. A classifier is stated as best, with supporting evidence, but without discussing why. | 5. A classifier is stated as best, with elaboration on why, and with supporting evidence. |
5. Code and documentation. | 0. Code does not run. | 1. Code runs without documentation or feedback. | 3. Code runs, is well documented, and gives meaningful feedback. | |
6. Kaggle performance. | 0. No submissions were made to Kaggle. | 1. Bottom third of submissions. | 2. Top two-thirds of submissions. | 3. Top submissions. |
Original specifications by Douglas Turnbull
Last updated on 15 October 2017