MNIST Dataset (784 Inputs, 10 Output Classes, 60,000 Training Patterns, 10,000 Testing Patterns)
The MNIST ("Modified National Institute of Standards and Technology") database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. This dataset is a classic within the machine learning community and has been studied extensively.
For more information on the data file, see
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 86(11):2278-2324, November 1998.
[2] http://yann.lecun.com/exdb/mnist/
- MNIST_train.dat (Zipped 18.5 MB) (rename files to unzip)
- MNIST_test.dat.zip (Zipped 3.1 MB) (rename files to unzip)
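As a rough illustration, a loader for a flat pattern file of this shape might look like the sketch below. It assumes each line holds the 784 input values followed by the class label as whitespace-separated numbers; the actual layout of the .dat files is not documented here, and the function name `load_patterns` is hypothetical.

```python
def load_patterns(path, n_inputs=784):
    """Read a flat pattern file: each line is assumed to hold n_inputs
    feature values followed by one integer class label (layout assumed)."""
    inputs, labels = [], []
    with open(path) as f:
        for line in f:
            vals = line.split()
            if len(vals) != n_inputs + 1:
                continue  # skip blank or malformed lines
            inputs.append([float(v) for v in vals[:n_inputs]])
            labels.append(int(float(vals[-1])))
    return inputs, labels
```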
Character Features (49 inputs, 36 output classes, 7044 training patterns, 5713 testing patterns, 1.9MB)
These features have been computed from characters extracted from license plate images.
For more information on the data file, see
GRNG.TRN: (16 Inputs, Class Id, 800 Training Patterns, 196KB)
The geometric shape recognition data file consists of four geometric shapes: ellipse, triangle, quadrilateral, and pentagon. Each shape image is a 64×64 matrix. For each shape, 200 training patterns were generated using different degrees of deformation; the deformations included rotation, scaling, translation, and oblique distortions. The feature set is ring-wedge energy (RNG) and has 16 features.
For more information on the data file, see
H. C. Yau and M. T. Manry, "Iterative Improvement of a Nearest Neighbor Classifier," Neural Networks, Vol. 4, pp. 517-524, 1991.
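The ring-wedge energy feature mentioned above can be sketched as follows. The split into 8 rings and 8 wedges (16 features total), the folding of opposite wedges, and the function name are illustrative assumptions, not the exact partition used for GRNG.TRN.

```python
import math

def ring_wedge_energy(mag, n_rings=8, n_wedges=8):
    """Sum squared spectral magnitude into concentric rings and angular
    wedges about the spectrum centre (partition counts are an assumption)."""
    h, w = len(mag), len(mag[0])
    cy, cx = h / 2.0, w / 2.0
    rmax = math.hypot(cy, cx)
    rings = [0.0] * n_rings
    wedges = [0.0] * n_wedges
    for y in range(h):
        for x in range(w):
            e = mag[y][x] ** 2
            r = math.hypot(y - cy, x - cx)
            rings[min(int(n_rings * r / rmax), n_rings - 1)] += e
            a = math.atan2(y - cy, x - cx) % math.pi  # fold opposite wedges
            wedges[min(int(n_wedges * a / math.pi), n_wedges - 1)] += e
    return rings + wedges  # 8 ring energies followed by 8 wedge energies
```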
GONGTRN.TRA: (16 Inputs, Class Id, 3000 Training Patterns, 780KB)
The raw data consists of images of hand-printed numerals collected from 3,000 people by the Internal Revenue Service. We randomly chose 300 characters from each class to generate the 3,000-character training set. Images are 32 by 24 binary matrices. An image scaling algorithm is used to remove size variation in the characters. The feature set contains 16 elements, and the 10 classes correspond to the 10 Arabic numerals. For more details concerning the features, see
W. Gong, H. C. Yau, and M. T. Manry, “Non-Gaussian Feature Analyses Using a Neural Network,” Progress in Neural Networks, vol. 2, 1994, pp. 253-269.
A testing version GONGTST is also available (780K) for download.
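The size-normalization step mentioned above can be illustrated with a minimal nearest-neighbour rescale to a fixed 32-by-24 grid; the actual scaling algorithm used for GONGTRN is not specified here, so this is only a stand-in.

```python
def scale_binary(img, out_h=32, out_w=24):
    """Nearest-neighbour rescale of a binary character matrix to a fixed
    out_h x out_w grid, removing size variation (illustrative stand-in)."""
    h, w = len(img), len(img[0])
    return [[img[y * h // out_h][x * w // out_w]
             for x in range(out_w)] for y in range(out_h)]
```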
COMF18.TRA: (18 Inputs, Class Id, 12,392 Training Patterns, 3.8MB)
The training data file is generated from segmented images. Each segmented region is separately histogram-equalized to 20 levels. Then the joint probability density of pairs of pixels separated by a given distance and a given direction is estimated. We use 0, 90, 180, and 270 degrees for the directions and 1, 3, and 5 pixels for the separations. The density estimates are computed for each classification window. For each separation, the co-occurrences for the four directions are folded together to form a triangular matrix. From each of the resulting three matrices, six features are computed: angular second moment, contrast, entropy, correlation, and the sums of the main diagonal and the first off diagonal. This results in 18 features for each classification window.
For more details concerning the features, see
R.R. Bailey, E. J. Pettit, R. T. Borochoff, M. T. Manry, and X. Jiang, “Automatic Recognition of USGS Land Use/Cover Categories Using Statistical and Neural Network Classifiers,” Proceedings of SPIE OE/Aerospace and Remote Sensing, April 12-16, 1993, Orlando Florida.
Four regions of land use/cover types were identified in the images per Level I of the US Geological Survey Land Use/Land Cover Classification System: urban areas, fields or open grassy land, trees (forested land), and water (lakes or rivers).
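The co-occurrence feature computation described above can be sketched as follows for a single separation; the normalisation and the exact folding convention are assumptions, and the function name is hypothetical.

```python
import math

def cooccurrence_features(img, d=1, levels=20):
    """Six texture features from a folded co-occurrence matrix over the
    0/90/180/270-degree directions at separation d (normalisation assumed)."""
    h, w = len(img), len(img[0])
    C = [[0.0] * levels for _ in range(levels)]
    total = 0
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, d), (d, 0), (0, -d), (-d, 0)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    C[img[y][x]][img[ny][nx]] += 1
                    total += 1
    p = [[C[i][j] / total for j in range(levels)] for i in range(levels)]
    pairs = [(i, j) for i in range(levels) for j in range(levels)]
    asm = sum(p[i][j] ** 2 for i, j in pairs)                 # angular second moment
    contrast = sum((i - j) ** 2 * p[i][j] for i, j in pairs)
    entropy = -sum(p[i][j] * math.log(p[i][j]) for i, j in pairs if p[i][j] > 0)
    mu = sum(i * sum(p[i]) for i in range(levels))
    var = sum((i - mu) ** 2 * sum(p[i]) for i in range(levels))
    corr = 0.0 if var == 0 else sum((i - mu) * (j - mu) * p[i][j] for i, j in pairs) / var
    diag = sum(p[i][i] for i in range(levels))                # main diagonal sum
    off = sum(p[i][i + 1] + p[i + 1][i] for i in range(levels - 1))
    return [asm, contrast, entropy, corr, diag, off]
```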
SPEECH_CLASS.TRA: (39 Inputs, 34 Classes, 2184 Training Patterns, 853 KB)
The speech samples are first preemphasized and then converted into the frequency domain with a DFT. The result is passed through Mel filter banks, and the inverse DFT is applied to the output to obtain Mel-Frequency Cepstrum Coefficients (MFCC). Each of MFCC(n), MFCC(n)-MFCC(n-1), and MFCC(n)-MFCC(n-2) has 13 features, which results in a total of 39 features. Each class corresponds to a phoneme.
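The 39-feature construction from the 13 cepstrum coefficients can be sketched as below, assuming the per-frame MFCC vectors have already been computed; the boundary handling for the first two frames is an assumption.

```python
def add_deltas(mfcc_frames):
    """Stack each 13-dim MFCC frame with its first and second backward
    differences, giving 39 features per frame. Frames 0 and 1 reuse the
    earliest available frame (boundary rule assumed)."""
    feats = []
    for n, c in enumerate(mfcc_frames):
        c1 = mfcc_frames[max(n - 1, 0)]  # MFCC(n-1)
        c2 = mfcc_frames[max(n - 2, 0)]  # MFCC(n-2)
        d1 = [a - b for a, b in zip(c, c1)]
        d2 = [a - b for a, b in zip(c, c2)]
        feats.append(list(c) + d1 + d2)
    return feats
```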
F17C.DAT: (17 Inputs, 39 Classes, 4745 Training Patterns, 1.33 MB)
This data file consists of parameters that are available in the basic health usage monitoring system (HUMS), plus some others. The data was obtained from the M430 flight load level survey conducted in Mirabel, Canada, in early 1995. The input features include: (1) CG F/A load factor, (2) CG lateral load factor, (3) CG normal load factor, (4) pitch attitude, (5) pitch rate, (6) roll attitude, (7) roll rate, (8) yaw rate, (9) corrected airspeed, (10) rate of climb, (11) longitudinal cyclic stick position, (12) pedal position, (13) collective stick position, (14) lateral cyclic stick position, (15) main rotor mast torque, (16) main rotor mast rpm, (17) density ratio. The 39 classes represent different flight maneuvers such as taking off, landing, and turning right or left. This is an application for prognostics or flight condition recognition.
Object Recognition Dataset (576 Inputs, 2 Classes, 17977 Training Patterns, 0.32 MB)
The features are extracted from a trained Convolutional Neural Network (CNN) after discarding the fully-connected layers at the top, so the features are the output of the last convolutional layer. The CNN was trained on 128×128 grayscale images.
- alscrap_train.zip (rename files to unzip)
Two Spirals Benchmark Problem
The inputs of the spirals problem are points on two entangled spirals. Gaussian noise with sd=0.05 is added to each data point.
spiral_1.txt has one cycle in the spirals. N=2, num classes = 2, Nv=5000 points.
spiral_2.txt has two cycles in the spirals. N=2, num classes = 2, Nv=10000 points.
spiral_3.txt has three cycles in the spirals. N=2, num classes = 2, Nv=20000 points.
For more details concerning the features, see
Friedrich Leisch & Evgenia Dimitriadou (2010). mlbench: Machine Learning Benchmark Problems. R package version 2.1-1.
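A minimal generator for two-spiral data of this kind might look like the following; the radius/angle parameterisation and the alternating class assignment are assumptions, matching only the stated noise level (sd = 0.05), not the exact generator behind the spiral_*.txt files.

```python
import math
import random

def make_spirals(cycles=1, n_points=5000, noise_sd=0.05, seed=0):
    """Generate two entangled spirals, the second offset by pi radians,
    with Gaussian noise (sd=0.05) added to each coordinate."""
    rng = random.Random(seed)
    data = []
    for i in range(n_points):
        cls = i % 2                        # alternate between the two spirals
        t = cycles * 2 * math.pi * rng.random()
        r = t / (cycles * 2 * math.pi)     # radius grows from 0 to 1
        phase = math.pi if cls == 1 else 0.0
        x = r * math.cos(t + phase) + rng.gauss(0, noise_sd)
        y = r * math.sin(t + phase) + rng.gauss(0, noise_sd)
        data.append((x, y, cls))
    return data
```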