We describe how we configured benchmark datasets to properly evaluate our proposed method, STCC: Semi-Supervised Learning for Tabular Datasets with Continuous and Categorical Variables. Unlike popular domains such as computer vision and natural language processing, tabular model research has no standard benchmark suite, so most studies employ different sets of datasets. A major concern with these collections is that they are heavily skewed toward continuous variables. For example, OpenML-CC18 consists of 72 datasets, but 48 of them (66.7%) contain only continuous variables. Similarly, more than half of the datasets in AMLB (42/71), Grinsztajn et al. (15/22), Somepalli et al. (10/16), and Gorishniy et al. (7/11) contain only continuous variables. Numerical tests on such datasets therefore provide weak evidence that a model will perform well in real-world applications. We instead selected 24 datasets after carefully reviewing more than 4,000 candidates, including OpenML (3,953 datasets), AMLB (71 datasets), and Grinsztajn et al. (22 datasets). The detailed criteria are as follows (a code sketch of the mechanical checks appears after the list):

(1) Preprocessing: Datasets with more than 30% missing values were excluded. For the remaining datasets, columns with more than 30% missing values were removed, as were redundant categorical variables containing only a single category.

(2) Variable types: To evaluate tabular models in a more realistic environment, we selected datasets with both continuous and categorical variables. Surprisingly, around 60% of all candidate datasets failed this condition.

(3) Data distribution: This study assumes that data samples are i.i.d., so datasets with sequential or temporal structure were excluded. We also eliminated datasets with overly simple distributions that naive models can already predict with high accuracy, as well as artificially generated datasets. Lastly, because this study focuses on classification tasks, regression datasets were not considered.

(4) Dataset size: Most previous studies did not evaluate their models on datasets of varying sizes. For a more comprehensive evaluation, we selected datasets of three sizes: small (fewer than 10,000 samples), medium (10,000 to 100,000 samples), and large (more than 100,000 samples).
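
Criteria (1), (2), and (4) are mechanical and can be expressed directly as dataset filters; criterion (3) requires case-by-case judgment and is omitted here. The following is a minimal sketch in pandas, assuming one DataFrame per candidate dataset; the function names and the MISSING_THRESHOLD constant are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the mechanical selection criteria, assuming one
# pandas DataFrame per candidate dataset. Names and helpers below
# (preprocess, has_mixed_types, size_bucket, MISSING_THRESHOLD) are
# illustrative assumptions, not the paper's actual implementation.
from typing import Optional

import pandas as pd

MISSING_THRESHOLD = 0.30  # criterion (1): 30% missing-value cutoff


def preprocess(df: pd.DataFrame) -> Optional[pd.DataFrame]:
    """Criterion (1): exclude or clean a dataset based on missing values."""
    # Exclude the entire dataset if over 30% of its cells are missing.
    if df.isna().to_numpy().mean() > MISSING_THRESHOLD:
        return None
    # Remove individual columns with over 30% missing values.
    df = df.loc[:, df.isna().mean() <= MISSING_THRESHOLD]
    # Remove redundant categorical columns with only one category.
    cat_cols = df.select_dtypes(exclude="number").columns
    constant = [c for c in cat_cols if df[c].nunique(dropna=True) <= 1]
    return df.drop(columns=constant)


def has_mixed_types(df: pd.DataFrame) -> bool:
    """Criterion (2): keep only datasets with both variable types."""
    n_continuous = df.select_dtypes(include="number").shape[1]
    n_categorical = df.shape[1] - n_continuous
    return n_continuous > 0 and n_categorical > 0


def size_bucket(n_samples: int) -> str:
    """Criterion (4): assign a dataset to a size category."""
    if n_samples < 10_000:
        return "small"
    if n_samples < 100_000:
        return "medium"
    return "large"
```

Under this sketch, a candidate dataset is retained only if preprocess returns a non-empty frame for which has_mixed_types holds, with size_bucket used to balance the final selection across the three size categories.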