Tuesday, August 19, 2014

Data Cleaning is a critical part of the Data Science process

by David Smith

A New York Times article yesterday discovered the 80-20 rule: 80% of a typical data science project is spent sourcing, cleaning and preparing the data, while the remaining 20% is actual data analysis. The article gives short shrift to this important task by calling it "janitorial work", but whether you call it data munging, data wrangling or anything else, it's a critical part of the data science process. I'm in agreement with Jeffrey Heer, professor of computer science at the University of Washington and a co-founder of Trifacta, who is quoted in the article as saying,

     “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”

As an illustration of this point, check out the essay by Julia Evans, Machine learning isn't Kaggle competitions (hat tip: Drew Conway). A Kaggle competition typically presents a nice, clean, regularized data set to the competitors, but this isn't representative of the real-world process of making predictions from data. As Julia points out:

     Cleaning up data to the point where you can work with it is a huge amount of work. If you’re trying to reconcile a lot of sources of data that you don’t control like in this flight search example, it can take 80% of your time.

While there are projects underway to help automate the data cleaning process and reduce the time it takes, automation is made difficult by the fact that the process is as much art as science, and no two data preparation tasks are the same. That's why flexible, high-level languages like R are a key part of the process. As Mitchell Sanders notes in a TechRepublic article,

     Data science requires a difficult blend of domain knowledge, math and statistics expertise, and code hacking skills. In particular, he suggests that expert knowledge of tools like R and SAS is critical. "If you can't use the tools, you can't analyze the data."

This is a critical step to gaining any kind of insight from data, which is why data scientists still command premium salaries today, according to data from Indeed.com.
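
To make that 80% concrete, here is a minimal sketch of typical munging steps in base R (the file name and column names are hypothetical, purely for illustration):

     # hypothetical raw file with stray whitespace, blanks, and mixed-case categories
     raw <- read.csv("survey_raw.csv", stringsAsFactors = FALSE)
     # normalize a text column: trim whitespace, lower-case
     raw$region <- tolower(trimws(raw$region))
     # recode empty strings as missing, then keep only complete rows
     raw$income[raw$income == ""] <- NA
     clean <- raw[complete.cases(raw), ]
     # parse dates stored as text
     clean$date <- as.Date(clean$date, format = "%Y-%m-%d")

Even this toy example takes as many lines as a typical model fit, and real data sources are far messier.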

Tuesday, August 05, 2014

How to generate a color-coded/formatted Word file of R programs in RStudio

Here is my approach/cheat-sheet/template:

If you have code like this:
 
     library("xxx")    
     summary(usmort)
 
Method I (using RStudio)
  • Start a new R Markdown file (File -> New File -> R Markdown).
  • Delete all the code that RStudio generated in that new file.
  • Copy and paste the R code above into the new file.
  • Add the global options (between --- and ---), a chunk head (```{r,eval=FALSE}) and a chunk tail (```):
    ---
    output:
      word_document:
        highlight: pygments
    ---
    ```{r,eval=FALSE}
                 ... your code here ...
    ``` 
  • Click the ‘Knit Word’ button, and you get a Word file with the code shown like this: 
     library("xxx")
     summary(usmort)
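 
Putting the pieces of Method I together, the complete .Rmd file (using the example code above) looks like this:

     ---
     output:
       word_document:
         highlight: pygments
     ---
     ```{r,eval=FALSE}
     library("xxx")
     summary(usmort)
     ```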
 
Method II (with or without using RStudio)
  • Add a chunk head (```{r,eval=FALSE}) and tail (```) to wrap the whole code, and save the file ("xxx_r_code.R").
  • Run these R commands in the console window:
       > library(rmarkdown)
       > render("xxx_r_code.R", word_document(highlight = "pygments"))
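
A variation on Method II (a sketch, relying on knitr's spin() conventions, which render() applies to plain .R files): instead of adding backtick fences to the script, set the chunk option with a #+ comment that knitr recognizes, and leave the rest as ordinary R code:

     #+ eval=FALSE
     library("xxx")
     summary(usmort)

Then run render("xxx_r_code.R", word_document(highlight = "pygments")) as before.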