GLASS CLASSIFICATION USING VARIOUS MACHINE LEARNING TECHNIQUES
In this article I apply several classification techniques, namely logistic regression, decision tree, SVM (support vector machine, linear kernel), KNN (k-nearest neighbors), random forest, and ANN (artificial neural network), to predict the class for the given features.
FLOW OF THE ARTICLE
- Overview
- Important libraries and reading the file
- Data visualization and pre processing
- Splitting the data into training set and testing set
- Classification models
- Analyzing classification metrics such as accuracy, precision, recall, and F1 score
- Comparing the models and concluding the best model
OVERVIEW
You can visit the above link and download the data set for this article. It has 214 rows and 10 columns.
The columns of the data set are:
- RI: refractive index
- Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4–10)
- Mg: Magnesium
- Al: Aluminum
- Si: Silicon
- K: Potassium
- Ca: Calcium
- Ba: Barium
- Fe: Iron
- Type of glass:
  1. building_windows_float_processed
  2. building_windows_non_float_processed
  3. vehicle_windows_float_processed
  4. vehicle_windows_non_float_processed (none in this database)
  5. containers
  6. tableware
  7. headlamps
IMPORTANT LIBRARIES AND READING THE CSV FILE:
Import the following libraries for this model:
Now let's read our CSV file (glass.csv).
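A minimal sketch of loading the data with pandas. Since glass.csv isn't bundled here, the three inline rows are illustrative placeholder values so the snippet runs on its own; in practice this is simply `pd.read_csv("glass.csv")`.

```python
import io
import pandas as pd

# Placeholder rows in the same 10-column layout as glass.csv (illustration only)
csv_text = """RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1
1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1
1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)                  # (3, 10) here; (214, 10) for the full file
print(df.isnull().sum().sum())   # 0 -> no missing values
```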
All the values are numeric and there are no missing values, so the data set is in good shape and needs no further pre-processing.
CORRELATION MATRIX FOR THE DATA SET (DATA VISUALIZATION):
To see which features contribute most to predicting the class, we draw a correlation matrix; the features with the largest absolute correlation to the Type column contribute the most.
From the matrix above we can see that Al (aluminium), with a correlation of 0.60 to the Type column, contributes the most among the features. A negative correlation value indicates that the feature is inversely related to the glass type in this data set.
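The correlation step can be sketched as below. A random DataFrame stands in for the glass data here (so the printed correlations are not the article's real values); the real matrix comes from running `df.corr()` on glass.csv.

```python
import numpy as np
import pandas as pd

# Random stand-in for the glass data (illustration only, not real correlations)
rng = np.random.default_rng(0)
cols = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]
df = pd.DataFrame(rng.random((214, 10)), columns=cols)

corr = df.corr()
# Correlation of each feature with the Type column, strongest first
print(corr["Type"].drop("Type").sort_values(key=abs, ascending=False))
```

With seaborn installed, `sns.heatmap(corr, annot=True)` draws the matrix as a figure like the one in the article.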
SPLITTING THE DATA INTO TRAINING SET AND TESTING SET
Here we split the data into 80% training and 20% testing.
The input training set (x_train) has 171 rows and 9 columns, the input testing set (x_test) has 43 rows and 9 columns, and the corresponding output sets (y_train and y_test) have 171 and 43 rows respectively, each with 1 column.
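A sketch of the split with scikit-learn's `train_test_split`, using synthetic arrays shaped like the glass data (9 features, 214 rows); the resulting shapes match the ones quoted above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 214 synthetic rows shaped like the glass data (labels 1..7, illustration only)
rng = np.random.default_rng(0)
X = rng.random((214, 9))
y = rng.integers(1, 8, size=214)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)   # (171, 9) (43, 9)
```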
CLASSIFICATION MODELS
LOGISTIC REGRESSION:-
So we got the training and testing accuracy for the logistic regression model. Now let's look at the confusion matrix between y_pred_lg_test and y_test, along with metrics such as precision, recall, and F1 score.
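The logistic regression step looks roughly like this. Synthetic 6-class data from `make_classification` stands in for the glass features, so the printed accuracies are illustrative, not the article's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 6-class stand-in for the glass data (illustration only)
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

lg = LogisticRegression(max_iter=1000)
lg.fit(x_train, y_train)
y_pred_lg_test = lg.predict(x_test)
print("train accuracy:", lg.score(x_train, y_train))
print("test accuracy:", lg.score(x_test, y_test))
```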
From the above confusion matrix we can read, for example, that there are 9 samples where y_test is 1 and y_pred_lg_test is also 1, and similarly for the other cells.
As for the metrics: precision is the percentage of your predicted results that are actually relevant, while recall is the percentage of all relevant samples that your algorithm correctly classified. The support column lists the number of samples of each class in y_test, and the F1 score (the traditional F-measure or balanced F-score) is the harmonic mean of precision and recall.
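These metrics can be computed with scikit-learn as below, here on small toy label arrays standing in for y_test and y_pred_lg_test.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Toy labels standing in for y_test and y_pred_lg_test (illustration only)
y_true = np.array([1, 1, 1, 2, 2, 2, 3, 3])
y_pred = np.array([1, 1, 2, 2, 2, 3, 3, 3])

cm = confusion_matrix(y_true, y_pred)
print(cm)   # rows are true classes, columns are predicted classes
# precision, recall, f1-score and support per class
print(classification_report(y_true, y_pred))
```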
KNN (WITH K=4):-
Now the confusion matrix and the classification metrics for KNN:-
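A KNN sketch with k = 4, as the heading states, again on a synthetic stand-in for the glass data (so the accuracy printed is illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the glass data (illustration only)
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=4)   # k = 4 as in the article
knn.fit(x_train, y_train)
print("test accuracy:", knn.score(x_test, y_test))
```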
SVM (SUPPORT VECTOR MACHINE LINEAR):-
Now the confusion matrix and classification metrics for SVM:-
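The linear SVM step can be sketched with `SVC(kernel="linear")`, on the same kind of synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the glass data (illustration only)
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

svm = SVC(kernel="linear")   # linear kernel, as in the article's SVM
svm.fit(x_train, y_train)
print("test accuracy:", svm.score(x_test, y_test))
```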
DECISION TREE CLASSIFIER (CART):-
Now the decision tree will look like this:
Confusion matrix and classification metrics for CART:-
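A CART sketch with scikit-learn's `DecisionTreeClassifier` (which uses Gini impurity by default), again on synthetic stand-in data. `export_text` prints a textual version of the tree; `sklearn.tree.plot_tree` draws the graphical one shown in the article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the glass data (illustration only)
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0)   # CART, gini criterion
tree.fit(x_train, y_train)
# First few lines of the tree's textual representation
print(export_text(tree, feature_names=[f"f{i}" for i in range(9)])[:400])
print("test accuracy:", tree.score(x_test, y_test))
```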
RANDOM FOREST CLASSIFICATION:
Confusion matrix and classification metrics for Random forest classification:-
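The random forest step follows the same pattern with `RandomForestClassifier`; as before, the data is a synthetic stand-in, so the accuracy printed is not the article's 0.790698.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the glass data (illustration only)
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x_train, y_train)
print("test accuracy:", rf.score(x_test, y_test))
```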
ANN (ARTIFICIAL NEURAL NETWORKS):-
Now we plot the graphs of loss and accuracy of train and test data.
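The article's ANN is built with a deep-learning framework; as a stand-in that stays within scikit-learn, the sketch below uses `MLPClassifier` on synthetic data. Its `loss_curve_` attribute holds the training loss per epoch, which can be plotted with matplotlib to get a curve like the article's loss graph.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the glass data (illustration only)
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hidden layer sizes here are illustrative, not the article's architecture
ann = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
ann.fit(x_train, y_train)
print("test accuracy:", ann.score(x_test, y_test))
# ann.loss_curve_ is the per-epoch training loss; plot with:
#   plt.plot(ann.loss_curve_)
```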
COMPARING THE MODELS
Now that all the classification models are complete, we compare them based on accuracy score. These results are for the 80% training / 20% testing split.
Now we visualize the models' accuracies for the 80% training / 20% testing split.
We use only the testing-set accuracy to validate a model: the higher the test accuracy, the better the model. In the results above, random forest has the best test accuracy at 0.790698.
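The comparison can be sketched as a loop over the fitted models, collecting test accuracies into a DataFrame; on this synthetic stand-in data the scores will differ from the article's table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the glass data (illustration only)
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "KNN (k=4)": KNeighborsClassifier(n_neighbors=4),
    "SVM (linear)": SVC(kernel="linear"),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}
scores = {name: m.fit(x_train, y_train).score(x_test, y_test)
          for name, m in models.items()}
results = pd.DataFrame(list(scores.items()),
                       columns=["model", "test_accuracy"])
print(results.sort_values("test_accuracy", ascending=False))
```

`results.plot.bar(x="model", y="test_accuracy")` then gives a comparison chart like the one in the article.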
NOW WE SPLIT OUR TRAINING SET TO 60% AND TESTING SET TO 40% AND COMPARE THE MODELS
After splitting, we run all the classification models again, and their accuracy scores look like this:-
Here the model with the highest testing accuracy is again random forest, at 0.755814. However, the 80%/20% split gave random forest a testing accuracy of 0.790698, which is better than the 60%/40% split.
Now we split the data into 70% training and 30% testing and again compare the models.
Once more, random forest has the best testing accuracy, 0.769231, which beats the 60–40 split but not the 80–20 split.
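The three-split experiment can be condensed into one loop over `test_size`, here shown for random forest on synthetic stand-in data (so the printed accuracies are illustrative, not 0.790698 / 0.769231 / 0.755814).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the glass data (illustration only)
X, y = make_classification(n_samples=214, n_features=9, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=0)

accs = {}
for test_size in (0.2, 0.3, 0.4):   # 80/20, 70/30, 60/40 splits
    x_tr, x_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0)
    rf = RandomForestClassifier(random_state=0).fit(x_tr, y_tr)
    accs[test_size] = rf.score(x_te, y_te)
    print(f"test_size={test_size}: accuracy={accs[test_size]:.4f}")
```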
So finally we conclude that:
- 80%–20% split: random forest, testing accuracy 0.790698
- 70%–30% split: random forest, testing accuracy 0.769231
- 60%–40% split: random forest, testing accuracy 0.755814
So we prefer the first option, the random forest model trained on the 80%–20% split with a testing accuracy of 0.790698, to predict the class for the given features.