24: Recoding and Transforming
It’s often the case that although a DataFrame contains the raw information you need, it’s not exactly in the form you need for your analysis. Perhaps the data is in different units than you need – meters instead of feet; dollars instead of yen. Or perhaps you need some combination of available quantities – miles per gallon instead of just miles and gallons separately. Or perhaps you need to reframe a variable by binning it into meaningful subdivisions – categorizing a raw column of salaries into “high,” “medium,” and “low” wage earners, for instance.
In data science, these activities are known as recoding and/or transforming. There’s not a sharp division between the two; usually I think of recoding as converting a single variable to a different representation of the same information, whether that means different units (dollars to yen) or coarser categories (high/medium/low earners), and transforming as creating a new variable entirely out of a combination of columns (like miles per gallon). In both cases, though, we’ll be creating and adding new columns to a DataFrame. These columns are sometimes called derived columns since they’re based on (derived from) existing columns rather than containing independent information.
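To make the distinction concrete, here is a minimal sketch of all three kinds of derived column in Pandas. The DataFrame `drivers`, its column names, its values, and the salary cut-offs are all invented for illustration; only the operations themselves (column arithmetic and `pd.cut`) are standard Pandas.

```python
import pandas as pd

# Hypothetical per-driver records, made up for this example.
drivers = pd.DataFrame({
    "miles": [300.0, 120.0, 450.0],
    "gallons": [10.0, 5.0, 15.0],
    "salary": [30000, 65000, 120000],
})

# Recoding (unit change): the same quantity, expressed in kilometers.
drivers["km"] = drivers["miles"] * 1.609344

# Transforming: combine two existing columns into a new quantity.
drivers["mpg"] = drivers["miles"] / drivers["gallons"]

# Recoding (binning): the cut-off values here are arbitrary assumptions.
drivers["wage_level"] = pd.cut(
    drivers["salary"],
    bins=[0, 40000, 100000, float("inf")],
    labels=["low", "medium", "high"],
)

print(drivers)
```

Each assignment adds a derived column to the right of the existing ones and leaves the original columns untouched, so the raw data is still available if you need it later.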