The Scikit-learn library provides numerous datasets that are useful for testing many data analysis and prediction problems; one of them is the Digits dataset of handwritten digits. A scientist claims that a classifier can predict the digit correctly 95% of the time. Perform data analysis to accept or reject this hypothesis.
In this project we use the Handwritten Digits dataset, which ships with the sklearn library, so we can import it directly:
```python
from sklearn import datasets

digits = datasets.load_digits()
```

Info about the dataset:
```python
print(digits.DESCR)
```

The description tells us the dataset contains 1,797 8x8 grayscale images of handwritten digits (0-9). We pull out the feature matrix and the targets:

```python
main_data = digits['data']
targets = digits['target']

len(main_data)
```

We can display the last six images with matplotlib:

```python
%matplotlib inline
import matplotlib.pyplot as plt

plt.subplot(321)
plt.imshow(digits.images[1791], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(322)
plt.imshow(digits.images[1792], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(323)
plt.imshow(digits.images[1793], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(324)
plt.imshow(digits.images[1794], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(325)
plt.imshow(digits.images[1795], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(326)
plt.imshow(digits.images[1796], cmap=plt.cm.gray_r, interpolation='nearest')
```

Support Vector Classifier:

```python
from sklearn import svm

svc = svm.SVC(gamma=0.001, C=100.)
svc.fit(main_data[:1790], targets[:1790])
predictions = svc.predict(main_data[1791:])
predictions, targets[1791:]
```

The predictions match the targets exactly: from SVC we get 100% accuracy.
Training Data : 1790
Test Data : 6
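Rather than comparing the two arrays by eye, the 100% figure can be checked numerically with `accuracy_score`. A minimal sketch reproducing the split above:

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
main_data = digits['data']
targets = digits['target']

# Same split as above: first 1790 samples for training,
# the last 6 (indices 1791-1796) for testing.
svc = svm.SVC(gamma=0.001, C=100.)
svc.fit(main_data[:1790], targets[:1790])
predictions = svc.predict(main_data[1791:])

print(accuracy_score(targets[1791:], predictions))
```

Note that 6 samples is a very small test set, a point we return to in the conclusion.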
Decision Tree Classifier:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

dt = DecisionTreeClassifier(criterion='gini')
dt.fit(main_data[:1600], targets[:1600])
predictions2 = dt.predict(main_data[1601:])

confusion_matrix(targets[1601:], predictions2)
```

OUTPUT:

```
array([[17,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 17,  0,  0,  1,  0,  0,  0,  2,  0],
       [ 0,  0, 13,  1,  0,  1,  0,  1,  1,  0],
       [ 0,  2,  2,  9,  0,  3,  2,  4,  0,  0],
       [ 0,  0,  0,  0, 18,  0,  1,  2,  0,  1],
       [ 0,  0,  0,  1,  2, 15,  0,  0,  1,  0],
       [ 0,  0,  0,  1,  2,  0, 19,  0,  0,  0],
       [ 0,  0,  0,  2,  1,  0,  0, 17,  0,  0],
       [ 0,  2,  1,  0,  0,  0,  0,  1, 13,  0],
       [ 0,  1,  0,  0,  0,  0,  0,  2,  1, 16]], dtype=int64)
```

```python
accuracy_score(targets[1601:], predictions2)
```

From the Decision Tree Classifier we get about 78% accuracy.
Training Data : 1600
Test Data : 196

Random Forest Classifier:
```python
from sklearn.ensemble import RandomForestClassifier

rc = RandomForestClassifier(n_estimators=150)
rc.fit(main_data[:1500], targets[:1500])
predictions3 = rc.predict(main_data[1501:])
accuracy_score(targets[1501:], predictions3)
```

OUTPUT:

0.9222972972972973

From the Random Forest Classifier we get about 92% accuracy with n_estimators = 150.
Training data : 1500
Test Data : 296
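The scientist's 95% claim can also be tested formally. The Random Forest split above leaves 1797 − 1501 = 296 test samples, so the reported accuracy 0.92229... corresponds to 273 correct predictions. A sketch (not part of the original notebook) using `scipy.stats.binomtest`, assumed available in SciPy ≥ 1.7:

```python
from scipy.stats import binomtest

# Random Forest result from above: accuracy 0.92229... on the
# 296 held-out samples, i.e. 273 correct predictions.
n_test = 296
n_correct = round(0.9222972972972973 * n_test)  # 273

# H0: the classifier's true accuracy is 0.95.
result = binomtest(n_correct, n_test, p=0.95, alternative='two-sided')
print(n_correct, result.pvalue)
```

If the p-value falls below 0.05, the hypothesis that this model's true accuracy is 95% is rejected at that significance level.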
Conclusion:
Data matters the most: we need a good amount of data for the model. If we have less data, we can use other machine learning classifiers such as Random Forest, which still gives about 92% accuracy on a 1,500-sample training set, less training data than the Support Vector Classifier used.
As per our hypothesis, with hyperparameter tuning across different machine learning models, or with more training data, we can achieve close to 95% accuracy on the handwritten digits dataset. But we must also keep a good amount of test data; otherwise the accuracy estimate is unreliable and we cannot tell whether the model has overfit.
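The hyperparameter tuning suggested above can be sketched with `GridSearchCV`, and cross-validation also addresses the small-test-set problem by averaging accuracy over several splits. A sketch under those assumptions (the grid values below are illustrative, not from the original notebook):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# 5-fold cross-validated accuracy for the Random Forest used above.
rf_scores = cross_val_score(RandomForestClassifier(n_estimators=150), X, y, cv=5)
print(rf_scores.mean())

# Small illustrative grid search over SVC hyperparameters.
grid = GridSearchCV(svm.SVC(),
                    {'gamma': [0.01, 0.001, 0.0001], 'C': [1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Each fold here holds out roughly 360 samples for testing, a far more reliable estimate than the 6-sample test set used for the SVC above.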
