Data Model

Our data model is a standard database table or spreadsheet in comma-separated (CSV) format: a single table with no preset limit on the number of rows or columns.

Data Upload Format

  1. The uploaded file is a zip file, and its name has the extension .zip
  2. The zip file contains the data file, which is in CSV format and has a name with the extension .csv
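
The packaging step above can be sketched in Python with the standard zipfile module. This is a minimal sketch, not a required procedure: the function name zip_csv, the demo helper, and the file names patients.csv and patients.zip are all hypothetical.

```python
import os
import tempfile
import zipfile


def zip_csv(csv_path, zip_path):
    """Write the CSV file at csv_path into a new zip archive at zip_path."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname stores only the file name, not the local directory path.
        archive.write(csv_path, arcname=os.path.basename(csv_path))
    return zip_path


def demo_zip():
    """Build a tiny CSV in a temporary folder, zip it, and list the archive."""
    with tempfile.TemporaryDirectory() as folder:
        csv_path = os.path.join(folder, "patients.csv")  # hypothetical name
        with open(csv_path, "w", newline="") as handle:
            handle.write('"systolic blood pressure","class"\n170,2\n120,1\n')
        zip_path = zip_csv(csv_path, os.path.join(folder, "patients.zip"))
        with zipfile.ZipFile(zip_path) as archive:
            return archive.namelist()
```

Listing the archive before uploading is a quick way to confirm that it holds a .csv file at the top level.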

Representational View

  1. The first line is the header, describing the field in each column
  2. Each row is a single unit on which measurements are taken
  3. Each column is a specific measurement
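
To make the header/row/column view concrete, here is a small sketch that reads a table with Python's csv module. The sample text is a hypothetical stand-in that follows the layout described above, and read_table is an illustrative helper, not part of the service. Note that csv.reader strips the quotes and returns every value as a string.

```python
import csv
import io

# Hypothetical sample: header line first, then one row per unit.
SAMPLE = '''"systolic blood pressure","bad cholesterol","family history of heart disease","comments","class"
170,150,1,"taking statins",2
120,110,0,,1
'''


def read_table(text):
    """Return (header, rows, columns) for CSV text whose first line is the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                   # first line: field names
    rows = [row for row in reader if row]   # each row: one unit of measurements
    # Each column: the same measurement across all rows.
    columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}
    return header, rows, columns


header, rows, columns = read_table(SAMPLE)
```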

Field/Column Attributes

  1. The header is text that appears in quotes for each field
  2. A column and its values are one of these types: numbers (unquoted) or text (quoted)
  3. The last column is the item to be predicted

Optional column headers

  1. "ID" in column 1 - optional, case-insensitive identifier field used for cross-reference. It is not used for prediction. The default is internal sequential ordering.
  2. "CLASS" in last column - indicates that the numbers are labels for class 1, class 2, ...
  3. X2 as a suffix in column 1 - by default, test cases are selected at random. The X2 suffix indicates that test cases should be selected from the end of the data file instead. It is case-insensitive, so "x2" works as well.
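
The optional headers above could be detected along the following lines. This is only a sketch of one reading of the rules, not the service's own parser. The function name header_options and the returned keys are hypothetical. It assumes that X2 is separated from the rest of the first header by a space, and that CLASS is matched case-insensitively; the text states case-insensitivity only for ID and X2.

```python
def header_options(header):
    """Inspect a list of header names (quotes already removed) for optional markers."""
    first = header[0].split()
    # Assumption: the X2 suffix is a separate, space-delimited token.
    test_from_end = bool(first) and first[-1].upper() == "X2"
    if test_from_end:
        first = first[:-1]
    has_id = len(first) == 1 and first[0].upper() == "ID"
    # Assumption: CLASS is matched without regard to case.
    is_class = header[-1].strip().upper() == "CLASS"
    return {"has_id": has_id, "test_from_end": test_from_end, "is_class": is_class}
```

For a header that starts with "ID X2" and ends with a numeric target, this reports an identifier column, test cases taken from the end of the file, and no class labels.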

Avoiding Data Errors

  1. Excluding the header, columns must be pure numbers or quoted text. Do not mix types in the same column.
  2. Unquoted alphabetic fields are errors. For text or categories, quote every value in the column.
  3. Real, ordered numbers are not quoted.
  4. Missing values are expressed as empty fields.
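
One way to follow these rules when generating a file is to format each field by its type. The sketch below is illustrative only; format_field and format_row are hypothetical helpers. Fields are formatted by hand rather than with the csv module's quoting modes, so that a missing value (None) is always written as an empty, unquoted field.

```python
def format_field(value):
    """Format one value: empty if missing, bare if numeric, quoted if text."""
    if value is None:
        return ""                      # missing value: empty field
    if isinstance(value, bool):
        value = int(value)             # write True/False as 1/0
    if isinstance(value, (int, float)):
        return str(value)              # numbers are not quoted
    # Text is quoted; embedded quotes are doubled, as is usual in CSV.
    return '"' + str(value).replace('"', '""') + '"'


def format_row(values):
    """Join formatted fields with commas to make one CSV line."""
    return ",".join(format_field(v) for v in values)
```

For instance, format_row([120, 110, 0, None, 1]) yields 120,110,0,,1 with the missing comment left empty.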

Examples:

In these examples, each row is a patient, and the same measurements are taken on each patient.

Example 1

"ID","systolic blood pressure","bad cholesterol","family history of heart disease","comments","risk"
"Joe",170,150,1,"very overweight","high"
"Emily",120,110,0,,"low"
"Brenda",200,200,0,"healthy","high"
"Robert",100,90,0,,"low"

Example 2

"systolic blood pressure","bad cholesterol","family history of heart disease","comments","class"
170,150,1,"taking statins",2
120,110,0,,1

Example 3

"ID X2","systolic blood pressure","bad cholesterol","family history of heart disease","comments","life expectancy years"
18979,170,150,1,"very overweight",70
94321,120,110,0,,85

Transaction Data and Rare Events

Prediction models typically expect records that summarize many previous transactions, such as a customer's total yearly purchases. For rare-event prediction, the best results are often achieved by assembling a sample of all rare events and an equal-size sample of alternative events.
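
Both preparation steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed method: the function names, the (customer, amount) transaction layout, and the is_rare callback are all hypothetical.

```python
import random
from collections import defaultdict


def total_per_customer(transactions):
    """Summarize (customer, amount) transactions into one total per customer."""
    totals = defaultdict(float)
    for customer, amount in transactions:
        totals[customer] += amount
    return dict(totals)


def balanced_sample(records, is_rare, seed=0):
    """Keep every rare record plus an equal-size random sample of the others."""
    rare = [r for r in records if is_rare(r)]
    other = [r for r in records if not is_rare(r)]
    rng = random.Random(seed)           # fixed seed so the sample is repeatable
    k = min(len(rare), len(other))      # cannot draw more than are available
    return rare + rng.sample(other, k)
```

With 5 rare events among 100 records, balanced_sample returns 10 records: the 5 rare ones and 5 drawn at random from the remaining 95.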

A Word about Results

You will be emailed an extensive set of results to be displayed in a browser. Many different predictive models are invoked, and their results are presented in a manner intended to be transparent and understandable to the non-expert. You control the output and quality of results by incremental revisions to the input data. Do not underestimate the importance of this task. You are the expert in that endeavor, and you can improve results by reacting to the results of previous experiments.

Big Data

Currently, our limit on zip file transfer is 58 MB, which is in the vicinity of half a terabyte of uncompressed data. This is quite large for structured data and prediction models. Small files with fewer than 5,000 records will be processed using a 50% train/test split.
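
To illustrate what a 50% train/test split means, here is a minimal sketch. It is not the service's actual procedure, which is not described here; split_rows and its parameters are hypothetical. The from_end option mimics selecting test cases from the end of the file, and the default selects them at random.

```python
import random


def split_rows(rows, test_fraction=0.5, from_end=False, seed=0):
    """Split rows into (train, test); test rows are random or taken from the end."""
    n_test = int(len(rows) * test_fraction)
    if from_end:
        cut = len(rows) - n_test
        return rows[:cut], rows[cut:]
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)    # fixed seed for a repeatable split
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test
```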