The scikit-learn library provides numerous datasets that are useful for testing many data-analysis and prediction problems; the Digits dataset is one of them. A scientist claims that the digit can be predicted accurately 95% of the time. We perform data analysis to accept or reject this hypothesis.
In this project, we use the handwritten digits dataset that ships with the sklearn library. We can import the dataset:
from sklearn import datasets
digits = datasets.load_digits()
Info about Dataset:
print(digits.DESCR)
OUTPUT:
main_data = digits['data']
targets = digits['target']
len(main_data)
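As a quick sanity check (a short addition, not part of the original walkthrough), the dataset shapes confirm 1797 samples, each an 8×8 image flattened into 64 pixel features:

```python
from sklearn import datasets

digits = datasets.load_digits()

# 1797 images, each an 8x8 grid flattened into 64 pixel features
print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8)
print(digits.target.shape)  # (1797,)
```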
%matplotlib inline
import matplotlib.pyplot as plt

plt.subplot(321)
plt.imshow(digits.images[1791], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(322)
plt.imshow(digits.images[1792], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(323)
plt.imshow(digits.images[1793], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(324)
plt.imshow(digits.images[1794], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(325)
plt.imshow(digits.images[1795], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(326)
plt.imshow(digits.images[1796], cmap=plt.cm.gray_r, interpolation='nearest')
OUTPUT:
Support Vector Classifier:
from sklearn import svm
svc = svm.SVC(gamma=0.001, C=100.)
svc.fit(main_data[:1790], targets[:1790])
predictions = svc.predict(main_data[1791:])
predictions, targets[1791:]
OUTPUT:
From the SVC we get 100% accuracy, but note that only 6 test samples are used here, so this estimate is not reliable on its own.
Training Data : 1790
Test Data : 6
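Because six test samples are too few to trust a 100% figure, a cross-validation sketch gives a steadier estimate (this is an addition to the original walkthrough, reusing the same SVC hyperparameters):

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
svc = svm.SVC(gamma=0.001, C=100.)

# 5-fold cross-validation: five held-out accuracy scores instead of one
scores = cross_val_score(svc, digits.data, digits.target, cv=5)
print(scores.mean())  # typically around 0.97 on this dataset
```

Each fold holds out roughly 360 samples, so the averaged score is far less sensitive to which handful of images happens to land in the test set.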
Decision Tree Classifier:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion = 'gini')
dt.fit(main_data[:1600] , targets[:1600])
predictions2 = dt.predict(main_data[1601:])
from sklearn.metrics import accuracy_score, confusion_matrix
confusion_matrix(targets[1601:], predictions2)
OUTPUT:
array([[17,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 17,  0,  0,  1,  0,  0,  0,  2,  0],
       [ 0,  0, 13,  1,  0,  1,  0,  1,  1,  0],
       [ 0,  2,  2,  9,  0,  3,  2,  4,  0,  0],
       [ 0,  0,  0,  0, 18,  0,  1,  2,  0,  1],
       [ 0,  0,  0,  1,  2, 15,  0,  0,  1,  0],
       [ 0,  0,  0,  1,  2,  0, 19,  0,  0,  0],
       [ 0,  0,  0,  2,  1,  0,  0, 17,  0,  0],
       [ 0,  2,  1,  0,  0,  0,  0,  1, 13,  0],
       [ 0,  1,  0,  0,  0,  0,  0,  2,  1, 16]], dtype=int64)
accuracy_score(targets[1601:] , predictions2)
OUTPUT:
From the Decision Tree Classifier we get about 78% accuracy.
Training Data : 1600
Test Data : 196
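The 78% figure can be checked directly against the confusion matrix above: the diagonal counts the correct predictions, so accuracy is the trace divided by the total. A quick numpy recomputation using the matrix printed above:

```python
import numpy as np

# Confusion matrix reported above for the Decision Tree run
cm = np.array([
    [17,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0, 17,  0,  0,  1,  0,  0,  0,  2,  0],
    [ 0,  0, 13,  1,  0,  1,  0,  1,  1,  0],
    [ 0,  2,  2,  9,  0,  3,  2,  4,  0,  0],
    [ 0,  0,  0,  0, 18,  0,  1,  2,  0,  1],
    [ 0,  0,  0,  1,  2, 15,  0,  0,  1,  0],
    [ 0,  0,  0,  1,  2,  0, 19,  0,  0,  0],
    [ 0,  0,  0,  2,  1,  0,  0, 17,  0,  0],
    [ 0,  2,  1,  0,  0,  0,  0,  1, 13,  0],
    [ 0,  1,  0,  0,  0,  0,  0,  2,  1, 16],
])

# correct predictions (diagonal) / all predictions
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 154 / 196, about 0.786
```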
Random Forest Classifier:
from sklearn.ensemble import RandomForestClassifier
rc = RandomForestClassifier(n_estimators = 150)
rc.fit(main_data[:1500], targets[:1500])
predictions3 = rc.predict(main_data[1501:])
accuracy_score(targets[1501:], predictions3)
OUTPUT:
0.9222972972972973
From the Random Forest Classifier we get about 92% accuracy for n_estimators = 150.
Training data : 1500
Test Data : 296
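To actually accept or reject the 95% hypothesis rather than eyeball it, a binomial test can be run on the Random Forest result (273 correct out of 296 gives the 0.9223 score above). This is a sketch added to the original post, assuming scipy is available:

```python
from scipy.stats import binomtest

# Random Forest run: 273 correct predictions out of 296 test samples
n_correct, n_test = 273, 296

# H0: true accuracy is 95%; H1: it is lower
result = binomtest(n_correct, n_test, p=0.95, alternative='less')
print(result.pvalue)  # a small p-value is evidence against the 95% claim
```

A p-value below the usual 0.05 threshold would mean this classifier's accuracy is significantly below the claimed 95%; a better-tuned model (like the SVC above) could still meet it.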
Conclusion:
Data matters the most: we need a good amount of data for the model. If we have less data, we can use other machine-learning classifiers such as Random Forest, which still gives 92% accuracy on a 1500-sample training set, less training data than the Support Vector Classifier used.
As per our hypothesis, with hyperparameter tuning of different machine-learning models, or with more data, we can achieve near 95% accuracy on the handwritten digits dataset. But we should also keep a good amount of test data; otherwise we cannot reliably tell whether the model has overfit.