Sentiment analysis of product reviews is needed for business use. The “customer” wants us to build such an algorithm and evaluate its likely quality on a small test sample of only 100 records. No training data is provided, so we need to find a training set of sufficient quality and size ourselves. Because the test set is so small, careful feature selection and model ensembling are necessary for high accuracy, even once the training sample has been formed. A general description and the data are available on Kaggle.
My solution can be found here: Github. I use web scraping to build a sufficiently large training set of 10,000 records, then vectorise it and compare different classification techniques: individual models as well as their ensemble via VotingClassifier.
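The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the toy reviews stand in for the scraped training set, and the choice of TF-IDF features and of these three base models is an assumption made for the example.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the scraped training reviews (1 = positive, 0 = negative).
train_texts = [
    "great product, works perfectly",
    "absolutely love it, highly recommend",
    "excellent quality and fast delivery",
    "very happy with this purchase",
    "terrible quality, broke after a day",
    "waste of money, do not buy",
    "awful experience, very disappointed",
    "stopped working, poor build",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Toy stand-in for the small labelled test sample.
test_texts = [
    "love the quality, great purchase",
    "disappointed, it broke, waste of money",
]
test_labels = [1, 0]

# Assumed base models; the ensemble takes a majority (hard) vote over them.
base_models = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("nb", MultinomialNB()),
    ("svm", LinearSVC()),
]
candidates = dict(base_models)
candidates["ensemble"] = VotingClassifier(estimators=base_models, voting="hard")

# Fit each candidate on the same vectorised text and record its test accuracy.
scores = {}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(train_texts, train_labels)
    scores[name] = model.score(test_texts, test_labels)

print(scores)
```

With only 100 test records, accuracy differences of a point or two between candidates are within noise, so cross-validation on the training set is a useful complement to the test score.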
Predict Grant Applications is a knowledge competition on Kaggle. This task requires participants to predict the outcome of grant applications for the University of Melbourne.
Objective: given 38 indicators related to a grant application (the scientists' area of research, information on their academic background, the size of the grant, the area in which it is issued, etc.), predict whether the application will be accepted.
My solution can be found here: Github
Full range of data exploration and preparation:
Based on Russian Federal State Statistics Service data (http://gks.ru), I tried to predict future monthly wages. I worked with time series data, using STL decomposition, the Box-Cox transformation, the Dickey-Fuller test, correlograms, etc. to properly clean and transform the data and predict the values.
The solution can be found here: Github
Clustering is an approach to unsupervised machine learning. In this notebook I just demonstrate its basic functionality using the “digits” dataset (it is available here).
The notebook can be found here: Github
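For a quick feel of what such a demonstration might look like, here is a minimal sketch. It assumes that “digits” refers to the dataset bundled with scikit-learn and uses k-means, which may differ from the algorithms shown in the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

# 1,797 images of handwritten digits, each flattened to 64 pixel features.
digits = load_digits()

# Group the images into 10 clusters without looking at the labels.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(digits.data)

# The true labels are used only afterwards, to check how well clusters match digits.
ari = adjusted_rand_score(digits.target, cluster_ids)
print(f"Adjusted Rand index: {ari:.3f}")
```

Cluster numbers are arbitrary, so a permutation-invariant score such as the adjusted Rand index is used instead of plain accuracy.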