Discretization of Unlabeled Data using RST & Clustering

Girish Kumar Singh, Shrabanti Mandal


An algorithm may operate on numerical (continuous) attributes, on nominal (discrete) attributes, or on both. If an algorithm accepts only nominal or discrete attributes, the continuous attributes of the dataset must be discretized before the algorithm is applied. Discretization methods are of two types: supervised and unsupervised. Supervised methods use the class labels of the dataset, whereas unsupervised methods disregard them entirely. Many studies in the literature have shown that supervised methods give good discretization results. However, supervised methods cannot be applied when the dataset is unlabeled. In practice, many datasets have no class (label) attribute, and only unsupervised discretization methods are applicable in such cases. This paper presents discretization schemes for unlabeled data based on Rough Set Theory (RST) and clustering. Experiments on two benchmark datasets compare the proposed technique with other discretization methods for labeled data. Two parameters, Class-Attribute Interdependence Redundancy (CAIR) and the total number of intervals, are used for the comparison. The results show a satisfactory tradeoff between information loss and the number of intervals for the proposed method.
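As an illustration of the evaluation criterion named in the abstract, the following minimal sketch computes Class-Attribute Interdependence Redundancy for one discretized attribute, using the usual definition: the mutual information between class and interval divided by their joint entropy. The function names `cair` and `equal_width_bins` are hypothetical, and equal-width binning is only a simple unsupervised stand-in here; it is not the RST and clustering scheme proposed in the paper.

```python
import math
from collections import Counter


def cair(classes, intervals):
    """Class-Attribute Interdependence Redundancy: I(C;F) / H(C,F).

    classes   -- class label of each sample (used for evaluation only)
    intervals -- interval index assigned to each sample by a discretizer
    """
    n = len(classes)
    joint = Counter(zip(classes, intervals))
    class_counts = Counter(classes)
    interval_counts = Counter(intervals)

    mutual_info = 0.0
    joint_entropy = 0.0
    for (c, f), count in joint.items():
        p_cf = count / n
        p_c = class_counts[c] / n
        p_f = interval_counts[f] / n
        mutual_info += p_cf * math.log2(p_cf / (p_c * p_f))
        joint_entropy -= p_cf * math.log2(p_cf)

    # A single class in a single interval has zero joint entropy.
    return mutual_info / joint_entropy if joint_entropy > 0 else 0.0


def equal_width_bins(values, k):
    """Assumed stand-in discretizer: k equal-width intervals, no labels used."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), k - 1) for v in values]


if __name__ == "__main__":
    values = [0.0, 1.0, 2.0, 3.0]
    labels = ["a", "a", "b", "b"]
    bins = equal_width_bins(values, 2)
    print(bins, cair(labels, bins))
```

A higher value means the intervals retain more of the class information, so the measure is typically read together with the number of intervals: fewer intervals at a comparable value indicates a more compact discretization.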


Discretization; Data mining; Rough set theory


DOI: http://dx.doi.org/10.26713/jims.v11i1.890

eISSN 0975-5748; pISSN 0974-875X