Today, vast amounts of data and information exist for everyone. Data can be kept
in many different kinds of databases and information repositories, besides being
available online or in hard copy. With such large amounts of data, there is a
need for powerful techniques to interpret data that exceed the human capacity
for comprehension, and to support better decision making. In order to identify
the best classification technique, as well as the tools required for handling
the classification task that supports decision making, this survey presents a
comparative study of several data mining techniques and the tools required for
their implementation. Results show that the performance of the tools on the
classification task is strongly influenced by the kind of dataset used and by
the way the classification algorithms are implemented within the toolkits.
Keywords: Decision tree, WEKA (Waikato Environment for Knowledge Analysis)
Today's databases and data repositories contain huge amounts of data and
information, so it becomes very hard, even impossible, for a human being to
evaluate them manually for better decision making. Humans therefore need
assistance, techniques that make the work faster and more efficient; this is
where data mining and its applications come in [1]. Data mining is defined as
the process of finding desired information from large amounts of data kept in
databases, data warehouses, and other information repositories. Data mining
also involves a combination of techniques from multiple perspectives, such as
database and data warehousing technology, statistics, machine learning,
high-performance computing, and pattern matching [2]. Business, science and
engineering, economics, games, and bioinformatics are among the different
application fields of data mining. As huge amounts of information exist and
only a particular part of it needs to be retrieved, efficient methods should be
used for this operation.
1.1 Decision Tree Algorithm
A decision tree is a decision support tool that uses a tree-like graph of
decisions and their possible consequences, including chance outcomes, resource
costs, and utility. A decision tree, also known as a classification tree, is
used to discover a classification function that deduces the value of a
dependent attribute from the values of the independent attributes. A decision
tree can also be described as a flowchart-like structure in which each internal
node represents a test on an attribute, each branch represents an outcome of
the test, and each leaf represents a class label.
Classification rules can be derived from the paths from the root to the
leaves. Decision trees are commonly applied in operations research,
specifically in decision analysis, to help identify the strategy most likely
to reach a given goal.
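The flowchart-like structure described above can be sketched in a few lines of Python. This is a minimal illustration, not an induction algorithm: the tree, attribute names, and values (outlook, humidity, windy) are hypothetical, chosen only to show how a root-to-leaf path yields a class label.

```python
# Minimal sketch of a decision tree: each internal node is a pair
# (attribute, branches), each branch key is an outcome of the test,
# and each leaf is a class label. All names here are hypothetical.

TREE = ("outlook", {
    "sunny": ("humidity", {        # nested test on a second attribute
        "high": "no",
        "normal": "yes",
    }),
    "overcast": "yes",             # leaf: class label
    "rainy": ("windy", {
        True: "no",
        False: "yes",
    }),
})

def classify(tree, record):
    """Follow the root-to-leaf path selected by the record's attribute values."""
    while isinstance(tree, tuple):         # internal node: keep descending
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree                            # leaf reached: the class label

print(classify(TREE, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

Reading off one root-to-leaf path, e.g. outlook = sunny and humidity = high leading to "no", gives exactly one classification rule, which is how the rule set mentioned above is derived from the tree.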
1.1.1 Advantages and Disadvantages
Decision trees are among the most suitable approaches for knowledge discovery
and data mining. They include techniques for exploring large and complicated
sets of data with a view to finding useful patterns [4]. This approach is
essential because it enables modeling and information retrieval from such
bodies of data. Theoreticians and practitioners are continually searching for
methods to make it more efficient, economical, and precise. Besides data
mining, decision trees have applications in many fields such as knowledge
retrieval, machine learning, and pattern matching.
The decision tree algorithm offers several benefits:
• Simple to understand and maps naturally to a set of production rules.
• Can be applied efficiently to real problems.
• No prior assumptions about the behavior of the data need to be made.
• Able to build models from data containing both numerical and categorical
values.
But it has some limitations compared to other algorithms:
• Output attributes must be categorical, and multiple output attributes are
not permitted.
• Unstable, in that minor fluctuations in the training data can lead to
different attribute selections at every choice node within the tree. The
effect can be significant, since attribute choices affect all descendant
subtrees.
• Trees built from numeric datasets can be more complex, since attribute
splits for numeric data are typically binary.
J48 is the Java implementation of an improved version of the decision tree
algorithm [8]. The improvements are as follows:
• Handling continuous as well as discrete attributes
• Handling training data with unspecified (missing) attribute values
• Modification (pruning) of trees after their creation
With the improved algorithm, quick and more efficient outcomes can be achieved
without altering the final decision, and the resulting decision tree is more
specific and easier to understand. Improvements in efficiency and
categorization are also achieved [15].
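The first improvement listed above, handling continuous attributes, is commonly realized by trying a binary split at each midpoint between consecutive sorted values and keeping the split with the highest information gain. A minimal sketch of that idea follows; the sample values and labels are made up for illustration, and this is a simplification of what C4.5-style builders actually do (which also involves gain ratio and other refinements).

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_threshold(values, labels):
    """Try a binary split at each midpoint between consecutive distinct
    sorted values; keep the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no midpoint between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Hypothetical continuous attribute with two cleanly separated classes.
t, g = best_threshold([1.0, 2.0, 3.0, 10.0, 11.0], ["a", "a", "a", "b", "b"])
print(t)  # 6.5
```

Splitting at 6.5 puts all "a" labels on one side and all "b" labels on the other, so the gain equals the full entropy of the label set, which is why that midpoint wins.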
1.2 K-Means Algorithm
K-means is a basic, simple partitional clustering technique that searches for a
user-specified number k of clusters. These clusters are represented by their
centroids, where a centroid is typically the mean of the points in the cluster.
Two separate phases are involved in this algorithm. In the first phase, k
centers are selected at random, where the value of k is fixed from the start.
In the next phase, each data object is assigned to its nearest center; the
Euclidean distance is used to determine the distance from each data object to
the cluster centers.
After all data objects have been included in some cluster, the average of each
cluster is recalculated. This iterative process repeats until the criterion
function reaches its minimum value [12].
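The two phases above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation; the sample points are made up, and convergence is detected by assignments no longer changing, a common practical stand-in for the criterion function reaching its minimum.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch: (1) pick k initial centers at random,
    (2) repeatedly assign each point to its nearest center (Euclidean
    distance) and recompute each center as the mean of its cluster,
    until the assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # phase 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # phase 2a: assign each point to its nearest center
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:   # converged
            break
        assignment = new_assignment
        # phase 2b: recompute each center as the mean of its cluster
        for j in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == j]
            if cluster:
                centers[j] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centers, assignment

# Hypothetical 2-D data with two well-separated groups.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(pts, 2)
print(sorted(centers))  # two centers near (0.33, 0.33) and (10.33, 10.33)
```

Because the two groups are far apart, any random initialization from these points converges to the same pair of centroids, each the mean of the three points in its group.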
1.2.1 Algorithm steps
The steps involved in the k-means algorithm are as follows:
• Select k data objects from dataset S at random as the initial cluster
centers.
• Repeat steps 3 to 5 until no new cluster centers are found.
• Measure the distance from each data object di (1 ≤ i ≤ n) to every cluster
center.
• Assign each data object to the cluster whose center is nearest.
• Recompute each cluster center as the mean of the data objects assigned to
it.

References
[1] DOI: 10.1145/846170.846172
[2] Han, J., Kamber, M., Pei, J., Data Mining: Concepts and Techniques. San
Francisco, CA: Morgan Kaufmann Publishers, 2011.
[3] Rokach, L., Maimon, O., Data Mining with Decision Trees: Theory and
Applications. World Scientific Pub Co, 2008.
[4] Tan, P.-N., Steinbach, M., Kumar, V., Introduction to Data Mining.
Pearson, 2006.
[5] Bhatia, M. P. S., Khurana, D., "Experimental Study of Data Clustering
Using k-Means and Modified Algorithms", IJDKP, Vol. 3, No. 3, May 2013.
[6] Cao, R., Xu, L., "Improved C4.5 Decision Tree Algorithm for the Analysis
of Sales", Southeast University, Nanjing 211189.
[7] Hardikar, S., Shrivastava, A., Choudhary, V., "Comparison between ID3 and
C4.5 in Contrast to IDS", VSRD-IJCSIT, Vol. 2 (7), 2012.
[8] Rokach, L., Maimon, O., "Top-Down Induction of Decision Trees Classifiers:
A Survey", IEEE Transactions on Systems, Man, and Cybernetics, Part C, 35 (4):
476–487, 2005.
[9] Hall, P., Park, B. U., Samworth, R. J., "Choice of neighbor order in",
2008.
[10] Toussaint, G. T., "Geometric Proximity Graphs for Improving Nearest
Neighbor Methods in Instance-Based Learning and Data Mining", International
Journal of Computational Geometry and Applications, April 2005.
[11] "Classification Algorithm in Data Mining: An Overview", IJPPT, Vol. 4,
Issue 8, 2013.
[12] Garg, T., Malik, A., "Survey on Various Enhanced K-Means Algorithms",
IJARCCE, Vol. 3, Issue 11, Nov. 2014.
[13] Songh, N., Singh, D., "Performance Evaluation of K-Means and Hierarchical
Clustering in Terms of Accuracy and Running Time", IJCSIT, Vol. 3 (3), 2012.
[14] Bhavsar, H., Panchal, M., "A Review on SVM for Data Classification",
IJARCET, Vol. 1, Issue 10, 2012.
[15] Patel, J. A., Sharma, P., "Big Data for Better Health Planning",
International Conference on Advances in Engineering and Technology Research.