I've got a dilemma which I hope someone has a solution to.
Let's say we're building a data mining model to predict aircraft reliability. Among many other columns, the training table has a unique aircraft ID, a column for the type (737, 747), and a column for the series (100, 200, 300). For example, a 737-800 would be stored as "737" and "800".
There is in essence a parent-child relationship between these two columns. All 737s should share a common set of reliability factors, and those factors might then be refined by the series number (for instance, the 737 might have very reliable radar except for the 500 series). The series is analogous to a car's model year. What I want to make sure doesn't happen is for the system to correlate a 747-400 and a 737-400 because they are the same series. They are totally independent if the type is different.
My only idea was to merge the columns into a single value such as "737-100". But then it would seem that the model won't have any idea that a "737-100" and a "737-200" should have a lot more in common than a "737-100" and a "747-100", because the values will be completely different.
I was hoping to find some sort of parent-child hint in the column properties but found none.
What solutions have other people tried? It sure seems that there should be an elegant solution for something like this, but I'm missing it.
Geof
You can still use two columns. The first is the type, the second is type+series:
Type  TypeSeries
737   737-100
737   737-200
…
747   747-100
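This solution is basically an extension of your proposed one. Because the second column carries the type, a 737-400 and a 747-400 no longer share any value, while the first column still groups every 737 together. Please let me know if this works for you.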
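If it helps, here is a minimal sketch of how the two columns could be derived before the training table is loaded. It is plain Python, not tied to any particular data mining tool; the sample rows, the aircraft IDs, and the helper name `add_type_series` are made up for illustration.

```python
# Hypothetical training rows: (aircraft_id, type, series).
rows = [
    ("N101", "737", "100"),
    ("N102", "737", "200"),
    ("N103", "747", "100"),
]

def add_type_series(rows):
    """Replace the raw series with a type-qualified TypeSeries value."""
    out = []
    for aircraft_id, ac_type, series in rows:
        out.append({
            "AircraftID": aircraft_id,
            "Type": ac_type,                      # parent: shared by all 737s
            "TypeSeries": f"{ac_type}-{series}",  # child: unique per type and series
        })
    return out

training = add_type_series(rows)
for r in training:
    print(r["AircraftID"], r["Type"], r["TypeSeries"])
```

The raw series column is left out of the output on purpose, so the model never sees a bare "100" that a 737 and a 747 would have in common.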
Thanks,
Of course! Thanks, I thought I was close.
It worked just fine.
Geof