The Changing Role of the DBA: Week 3, Data Preparation
Welcome to the fourth post and the third step in our “Oracle Data Professional to Oracle Data Scientist” series. Last time in Week 2, we covered data understanding, including the importance of not just accepting the data you’ve been handed, assembling the “right data,” and understanding and verifying the right data to make sure it looks valid and “makes sense.” If you haven’t read that post or the summary blog post in this series, I highly recommend doing so.
You can now also watch the Oracle Machine Learning Overview: from Oracle Data Professional to Oracle Data Scientist in 6 Weeks! on YouTube.
In this blog post, we are going to move into the third step to becoming an Oracle data scientist: Data Preparation.
Part of what makes you a great candidate for being an Oracle Data Scientist is that you’ve learned quite a bit just by working at your company. Having experiential knowledge of that company, the industry, your competitors, the factors that drive your business, and the data that your company collects and stores, puts you in a solid position for this next step—Data Preparation. You learned from Week 1 on Business Understanding that you need to be very precise in your problem statement definition. In Week 2, you learned to be thoughtful about the data you’ve assembled so that machine learning can address the business problem.
Now, you need to prepare the data before you can run the machine learning algorithms and achieve good results. First, you should know that there are two main types of data preparation. We’ll go through them both.
Data preparation required by the algorithms: Data transformations such as normalization, scaling, binning, missing value treatment, and feature reduction are often required by the machine learning algorithms to yield useful insights.
Engineered features: New attributes or variables derived to capture domain knowledge and generate more accurate and useful models.
Data preparation required by the algorithms
To start, let’s go one-by-one through the manual data preparation steps that are required by most machine learning algorithms. Most classically-trained data scientists will start here looking at the data they have and begin addressing missing values, data normalization, binning, and feature reduction.
Full disclosure: Oracle Machine Learning’s Automatic Data Preparation will automatically do much of this work for you. However, before you faithfully adopt everything Automatic Data Preparation does, you should understand why it’s doing what it’s doing and when you should accept its suggestions (almost always). Then, you can concentrate your energies on more important data preparation aspects such as deriving new “engineered features” that help tease out information hidden in your data to make it easier for the machine learning algorithms.
Or you can use Oracle Machine Learning’s Data Transformation functions and go back to having full control over your data preparation and manually perform any or all of the following data transforms:
Missing value treatment: In general, missing values are important because many machine learning algorithms won’t work with missing values or “null” values. Missing values are also interesting because in some cases, missing data indicates that something’s wrong with the data collection, like a bad sensor, or in a more deceitful case, an attempt to hide or withhold information.
When any record is found to have one or more missing values, the most common approach to satisfy the algorithm’s requirements is to delete that record, which can unfortunately lead to loss of data.
Another popular approach is to replace nulls or blanks with the mean for numeric attributes or mode for categorical attributes (ADP does this automatically) or replace the value with an assigned value like “999.”
But you should also keep an eye out for whether a missing response is simply an oversight or a deliberate deception and how that should be reflected in the data. For instance, is there a reason somebody skipped the question: “Have you ever declared bankruptcy before?”
Normalization: To mitigate differences in scale or the size of numerical attribute values, most algorithms require transforming numerical attributes to standardized 0-1 scales so that the model fitting and model coefficients generated all share the same standardized scale. For example, in a dataset with both AGE and SALARY, AGE, which may have a natural range between 0-120 years, would be normalized or rescaled to values between 0 and 1 or -1 to +1. SALARY which would have natural ranges between $0 and millions of dollars would be rescaled to the same 0 and 1 or -1 to +1 ranges. That way, when the model is generated by the algorithm, all the model’s coefficients would be standardized. A model coefficient value multiplied times SALARY would not overpower the same model coefficient value multiplied times AGE.
Care needs to be made both in the machine learning model build and model apply that both AGE and SALARY are used and displayed in their scaled or unscaled form. Oracle Machine Learning’s Automatic Data Preparation automatically deals with data normalization, but now you know why it’s done.
Binning: Binning is performed for different purposes in different situations. Sometimes known as “recoding,” binning can be used to “bucketize” categorical or numerical data to both add contextual meaning and to simplify the data. For example, AGE could be binned into CHILD, TEENAGER, ADULT and SENIOR. Similarly, PROFESSION could be recoded into “WHITE_COLAR” and “BLUE_COLAR” workers to add more domain context and help with model interpretability.
Data scientists often want to “bin” their data to group continuous numbers into ranges, also known as quantization.
Oracle Machine Learning’s Automatic Data Preparation automatically deals with all required binning—and required preparatory steps for the data. Additionally, most Oracle Machine Learning algorithms—for example, Decision Tree algorithms and the data split points they create—manage their own data preparation internally.
Random and Stratified Sampling: Sometimes data scientists will process samples of a larger data population to reduce the amount of data and speed up model building. However, with Oracle, this data reduction strategy is less critical.
Stratified sampling is a common practice used in supervised learning when the target field is unbalanced. For example, if the business problem is to identify people who will be interested in a new digital banking product, but only 5% of existing customers currently have it, you could build a model that always guesses the most popular data value (No) and be accurate 95% of the time. However, the models would be wrong 100% of the time for the smaller population who represent the types of customers you seek to target.
To overcome these unbalanced data distribution situations, data scientists will perform stratified sampling so that the rare cases (people who have the digital banking product) and the more common cases (people who don’t have the product) represent a more balanced, 50/50 dataset. For more information, see Sampling in the Oracle Data Miner documentation.
There are also other techniques and strategies for dealing with unbalanced data, including weighting the target classes and Receiver Operating Curves (ROC), but more on that in Week 5 – Model Evaluation.
Again, Oracle’s strategy to “move the algorithms, not the data” takes advantage of Oracle Database’s powerful, scalable, parallelized architecture, so it can easily process billions of records. These capabilities make data sampling less necessary. Stratified sampling may still be practical, though, if you find yourself with a 10-billion-record dataset.
Transactional data: For transactional data (e.g. purchases, clicks, sensor readings, etc.), data scientists will often require multiple tables to be exported and “flattened out” into large tables with a column for every attribute that’s being used in the model. This is because that is how traditional data analysis and machine learning tools are designed to operate but it is not how an enterprise database is designed to function.
In the real world of relational databases, data is managed in multiple tables joined together by unique customer or transaction identifiers in star schemas (e.g. customer data, sales transaction data, supplemental demographics data, customer service department transactional data, etc.).
Instead of doing this work, you can rely on Oracle Machine Learning to aggregate and “mine” these disparate data sources in their natural states inside the database. For more, see Preparing the Case Table.
Feature Selection and Reduction: Machine learning toolkits and platforms often struggle to handle large numbers of input attributes (whatever “large” is for them). So, they force the user to either select their “best” input attributes or run feature selection and reduction techniques to winnow down a larger list of candidate input attributes. If there are tens, hundreds or thousands of input attributes, machine learning tools, languages, and environments may require dedicated compute platforms.
Oracle Machine Learning doesn’t have this limitation and has been run with many thousands of input attributes. This last statement may sound strange as many will recall the database’s 1,000-column limit. However, Oracle Machine Learning can process transactional data (e.g. sales transactions, web clicks, medications taken by a patient) and process it as “nested data.” See "Nested Data” in the documentation.
Unstructured data: Although we can’t go into detail here, it’s worth mentioning that Oracle Machine Learning can also deal with unstructured, or “text” data. This powerful capability means that potentially valuable input data, such as customer comments, tweets, documents, descriptions, and notes can be added along with traditional structured or semi-structured transactional data. Most Oracle Machine Learning algorithms can ingest unstructured data to build the model and apply it. But dealing with unstructured data could be the topic of another blog series. So, for now, I’d like to point the interested reader to the Oracle Machine Learning documentation for Mining Unstructured Text.
Oracle Machine Learning’s automated processing of unstructured data can yield better models and insights as shown in this Support Vector Machine model that uses information about wines, including their descriptions, to predict good wines (greater than 90 points).
For a further discussion of Oracle Machine Learning’s support for data preparation and transformations see DBMS_Machine Learning in Mining_TRANSFORM. Oracle Data Miner (ODMr), an extension to Oracle SQL Developer, provides automation, defaults, and system determined defaults based on the data to simplify and automate machine learning as much as possible while still allowing expert users to manually override.
While you may need to manually address some or all of the above data preparation issues, the hope is that you can take advantage of Oracle Machine Learning’s Automatic Data Preparation processing capabilities and concentrate more of your data preparation energies on developing clever new “engineered features.”