A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems

Published: 01 July 2001

Abstract

Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. Yet such fields are quite common in real-world data mining applications, and they often contain potentially relevant information that is difficult to represent for modeling purposes.

This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks and linear and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, such as ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation.

While the statistical methods discussed in this paper were first introduced in the mid-1950s, their use as a preprocessing step for complex models, such as neural networks, has not previously been discussed in the literature.
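The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, written under stated assumptions. Each category value is replaced by a weighted blend of its own mean target and a prior mean, with the weight w = n / (n + m) growing with the number of observations n. The weighting function, the smoothing parameter `m`, and the function names `shrunken_means` and `hierarchical_means` are illustrative assumptions, not taken from the paper. The second function shows one possible way to use a hierarchy: a fine-level value (a 5-digit ZIP code) is shrunk toward the estimate of its coarse-level parent (here assumed to be the 3-digit prefix), which is itself shrunk toward the global mean.

```python
from collections import defaultdict


def _sums_and_counts(keys, targets):
    """Accumulate the target sum and row count for each key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, y in zip(keys, targets):
        sums[k] += y
        counts[k] += 1
    return sums, counts


def shrunken_means(categories, targets, m=10.0, prior=None):
    """Map each category to a smoothed estimate of its mean target.

    estimate = w * category_mean + (1 - w) * prior, with w = n / (n + m).
    `m` is a hypothetical smoothing strength, not a value from the paper.
    """
    if prior is None:
        prior = sum(targets) / len(targets)  # global mean as the prior
    sums, counts = _sums_and_counts(categories, targets)
    encoding = {}
    for c, n in counts.items():
        w = n / (n + m)
        encoding[c] = w * (sums[c] / n) + (1.0 - w) * prior
    return encoding


def hierarchical_means(fine, coarse, targets, m=10.0):
    """Shrink each fine-level value toward its coarse-level parent estimate.

    Assumes every fine value belongs to exactly one coarse value.
    """
    coarse_encoding = shrunken_means(coarse, targets, m)
    parent = dict(zip(fine, coarse))
    sums, counts = _sums_and_counts(fine, targets)
    encoding = {}
    for f, n in counts.items():
        w = n / (n + m)
        encoding[f] = w * (sums[f] / n) + (1.0 - w) * coarse_encoding[parent[f]]
    return encoding


# Toy data: ZIP codes with a binary target (illustrative values only).
zips = ["02139", "02139", "02139", "02139", "02142", "94105"]
y = [1, 1, 0, 1, 1, 0]

flat = shrunken_means(zips, y, m=2.0)
hier = hierarchical_means(zips, [z[:3] for z in zips], y, m=2.0)
for z in sorted(flat):
    print(z, round(flat[z], 4), round(hier[z], 4))
```

In this sketch a ZIP code seen only once is pulled strongly toward its prior, while a frequently observed one stays close to its own mean. The resulting numeric column can then be fed to a model that requires numerical inputs.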

Published In

ACM SIGKDD Explorations Newsletter, Volume 3, Issue 1
July 2001
50 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/507533

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. categorical attributes
2. empirical bayes
3. hierarchical attributes
4. neural networks
5. predictive models
