VisiRex Tips & Tricks
Choosing a VisiRex Project
Is the Data Predictable?
- For a project to be successful, there must be some relationship between the
input fields and the field you are attempting to predict.
- If your data is not predictable using the selected input fields,
VisiRex will simply grow a huge tree that tries to memorize some unique combination of inputs for every row.
- Data is called "noisy" if it contains contradictions to the underlying rules.
Rule extraction can be successful on noisy data as long as the quantity of noise is small
compared to the "well behaved" rows.
- Noise can often be filtered out of the data by using Tree Pruning and Min Items.
- The presence of noise will always result in reduced confidence.
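The points above about contradictions and memorization can be checked before any rules are extracted. The following is a minimal Python sketch, not part of VisiRex: the function name `noise_report`, the field names, and the sample rows are all hypothetical. It groups rows by their input values, then counts the rows whose target disagrees with the majority for that same combination of inputs.

```python
from collections import Counter, defaultdict

def noise_report(rows, input_fields, target_field):
    """Group rows by their input values and count target disagreements."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in input_fields)
        groups[key][row[target_field]] += 1
    total = sum(sum(c.values()) for c in groups.values())
    # Rows that disagree with the majority target of their own input combination.
    contradicting = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return {
        "rows": total,
        "distinct_input_combinations": len(groups),
        "contradicting_rows": contradicting,
        "noise_fraction": contradicting / total if total else 0.0,
    }

# Hypothetical sample data, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "no"},  # contradicts the two rows above
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
report = noise_report(rows, ["outlook", "windy"], "play")
print(report)
```

A small noise fraction suggests the data is workable. If the number of distinct input combinations is close to the number of rows, almost every row is unique, which is the situation in which a tree can only memorize.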
How Much Data Do I Need?
- There is no firm answer to this question.
It is best to experiment with project configuration and various sizes of training set.
In some cases, rules may be extracted using only a few dozen training rows.
In other cases, thousands of training rows may be required.
- It depends on how closely the target prediction is related to your input values,
and the complexity of this relationship.
- It depends on how crisp (clean) your data is, or whether it contains contradictions and anomalies (dirty).
Clean data will require a smaller training set and produce rules of higher confidence
than can be achieved using dirty data.
- If VisiRex is used for data mining and anomaly detection,
it is possible to train VisiRex on a small sample of cleaned data,
then use these "clean" rules to detect the "dirt" in the larger data.
- It depends on how many input fields you are using.
Usually it is best to include any input field you think might be useful,
and then let VisiRex decide exactly how useful.
However, if you include a large number of redundant or unrelated inputs,
there is an increased chance that VisiRex will discover some spurious rules.
This spurious correlation can be averaged out by increasing the number of training rows.
- VisiRex is fast enough that you can easily experiment with the size of the training set.
If your data is well shuffled, try reducing the number of rows selected for the training set until the extracted rules begin to look silly.
Try selecting your test set completely outside of the training set,
then monitor the confusion matrix as a report on the performance of the test set,
while decreasing the size of the training set.
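The experiment in the last point can be sketched outside VisiRex as follows. This is an illustration under stated assumptions only: `train_lookup` is a hypothetical stand-in learner (majority target per input combination), not the VisiRex algorithm, and the data is synthetic with about 5% of labels deliberately flipped. The test set is held completely outside the training pool, and the confusion matrix is recomputed as the training set shrinks.

```python
import random
from collections import Counter, defaultdict

def train_lookup(rows, inputs, target):
    """Stand-in learner (not VisiRex): majority target per input combination."""
    table = defaultdict(Counter)
    overall = Counter()
    for r in rows:
        table[tuple(r[f] for f in inputs)][r[target]] += 1
        overall[r[target]] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen combinations
    rules = {k: c.most_common(1)[0][0] for k, c in table.items()}
    return lambda r: rules.get(tuple(r[f] for f in inputs), default)

def confusion_matrix(predict, rows, target):
    """Count (actual, predicted) pairs over the given rows."""
    return Counter((r[target], predict(r)) for r in rows)

def accuracy(matrix):
    correct = sum(n for (actual, pred), n in matrix.items() if actual == pred)
    return correct / sum(matrix.values())

# Synthetic data (an assumption for this sketch): a simple rule plus 5% noise.
rng = random.Random(0)

def make_row():
    a = rng.choice("xyz")
    b = rng.choice("pqr")
    label = "high" if (a == "x" and b != "q") else "low"
    if rng.random() < 0.05:
        label = "low" if label == "high" else "high"
    return {"a": a, "b": b, "target": label}

data = [make_row() for _ in range(1200)]
rng.shuffle(data)
test_set, pool = data[:200], data[200:]  # test rows never appear in training

results = {}
for size in (1000, 300, 100, 30, 10):
    predict = train_lookup(pool[:size], ["a", "b"], "target")
    matrix = confusion_matrix(predict, test_set, "target")
    results[size] = accuracy(matrix)
    print(size, round(results[size], 3), dict(matrix))
```

The training size at which the off-diagonal counts of the confusion matrix start to grow gives a rough lower bound on how much data the project needs.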
Which Input Fields Should I Use?
- Begin by selecting any inputs that could possibly be related to the target prediction.
- Try to avoid using the row counter index field as an input,
because VisiRex takes a long time to determine that this numeric field is unimportant.
- As you visually inspect the extracted tree,
you will gain a feeling for which fields are most important.
You can go back and deselect those fields that are unnecessary.
- In some projects, you might find that one field is too good a predictor.
Usually this fact will be obvious before you start.
However, if you are interested in any rules that underlie this one obvious rule,
simply deselect the obvious field to extract the more subtle rules.
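The field-selection advice above can be approximated with a quick screening pass before training. The sketch below is hypothetical and independent of VisiRex: `screen_fields`, the field names, and the sample rows are invented for illustration. It flags a field whose values are unique on every row (typical of a row counter index) and measures how well each field on its own predicts the target, which exposes a field that is "too good".

```python
from collections import Counter, defaultdict

def screen_fields(rows, fields, target):
    """Per-field summary: distinct values and single-field predictive accuracy."""
    report = {}
    for f in fields:
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[f]][r[target]] += 1
        # Accuracy on these same rows when predicting the majority target per field value.
        hits = sum(max(c.values()) for c in by_value.values())
        report[f] = {
            "distinct_values": len(by_value),
            "looks_like_row_index": len(by_value) == len(rows),
            "single_field_accuracy": hits / len(rows),
        }
    return report

# Hypothetical sample data, for illustration only.
rows = [
    {"row_id": 1, "invoice_paid": "yes", "region": "north", "status": "good"},
    {"row_id": 2, "invoice_paid": "yes", "region": "south", "status": "good"},
    {"row_id": 3, "invoice_paid": "no", "region": "north", "status": "bad"},
    {"row_id": 4, "invoice_paid": "no", "region": "south", "status": "bad"},
    {"row_id": 5, "invoice_paid": "yes", "region": "north", "status": "good"},
    {"row_id": 6, "invoice_paid": "no", "region": "south", "status": "bad"},
]
report = screen_fields(rows, ["row_id", "invoice_paid", "region"], "status")

# Fields left after dropping the row index and the one obvious predictor.
subtle = [
    f for f, s in report.items()
    if not s["looks_like_row_index"] and s["single_field_accuracy"] < 1.0
]
print(report)
print(subtle)
```

Note that a row index also scores a perfect single-field accuracy, but only because each value occurs once. That is why the uniqueness flag is reported separately from the accuracy figure.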
A Complete System for Inductive Rule Extraction
CorMac Technologies Inc.
34 North Cumberland Street ~ Thunder Bay ON P7A 4L3 ~ Canada