Showing posts with label Data. Show all posts

Wednesday, October 12, 2016

Battle of the Data Science Venn Diagrams

by David Taylor    
Data science is a rather fuzzily defined field; some of the definitions I've heard are:
  • "Work that takes more programming skills than most statisticians have, and more statistics skills than a programmer has."
  • "Applied statistics, but in San Francisco."
  • "The field of people who decide to print 'Data Scientist' on their business cards and get a salary bump."
Personally, I've recently decided to avoid the controversy by calling myself a data spelunker. (Data miners are out of vogue anyway.)
As a field in search of a definition, it's unsurprising that you can find a lot of different attempts to define it.
As a field full of data nerds with a penchant for visualization, it's also unsurprising that a lot of them use Venn diagrams. (Fun fact: John Venn, who invented the eponymous diagrams, and his son filed a patent in 1909 for a lawn bowling machine.)

Tuesday, August 19, 2014

Data Cleaning is a critical part of the Data Science process

by David Smith

A New York Times article yesterday discovers the 80-20 rule: that 80% of a typical data science project is sourcing, cleaning, and preparing the data, while the remaining 20% is actual data analysis. The article gives short shrift to this important task by calling it "janitorial work", but whether you call it data munging, data wrangling or anything else, it's a critical part of data science. I'm in agreement with Jeffrey Heer, professor of computer science at the University of Washington and a co-founder of Trifacta, who is quoted in the article saying,

     “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”

As an illustration of this point, check out the essay by Julia Evans, Machine learning isn't Kaggle competitions (hat tip: Drew Conway). A Kaggle competition typically presents a nice, clean, regularized data set to the competitors, but this isn't representative of the real-world process of making predictions from data. As Julia points out:

     Cleaning up data to the point where you can work with it is a huge amount of work. If you’re trying to reconcile a lot of sources of data that you don’t control like in this flight search example, it can take 80% of your time.
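To make the reconciliation problem concrete, here is a minimal Python sketch of cleaning and merging two sources that describe the same flights in different formats. Everything in it is an assumption made for illustration: the sample records, the field names (`flight`, `date`, `price`), and the helper functions are hypothetical and do not come from the essay or the article.

```python
from datetime import datetime

# Hypothetical raw records: two sources formatting the same facts differently.
SOURCE_A = [
    {"flight": " ua 123 ", "date": "2014-08-19", "price": "$1,204.50"},
    {"flight": "DL456", "date": "2014-08-20", "price": "310"},
]
SOURCE_B = [
    {"flight": "UA123", "date": "08/19/2014", "price": "1204.5 USD"},
    {"flight": "aa-789", "date": "08/21/2014", "price": "n/a"},
]

# Date layouts we assume the sources use; real data usually has more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def clean_flight(raw):
    """Upper-case the flight code and drop spaces and punctuation."""
    return "".join(ch for ch in raw.upper() if ch.isalnum())


def clean_date(raw):
    """Try each known layout; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_price(raw):
    """Keep digits and the decimal point; return None for unusable values."""
    digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    try:
        return float(digits)
    except ValueError:
        return None


def clean_record(rec):
    return {
        "flight": clean_flight(rec["flight"]),
        "date": clean_date(rec["date"]),
        "price": clean_price(rec["price"]),
    }


def reconcile(*sources):
    """Merge sources on (flight, date), preferring a row that has a price."""
    merged = {}
    for source in sources:
        for rec in source:
            row = clean_record(rec)
            key = (row["flight"], row["date"])
            if key not in merged or merged[key]["price"] is None:
                merged[key] = row
    return list(merged.values())


if __name__ == "__main__":
    for row in reconcile(SOURCE_A, SOURCE_B):
        print(row)
```

Even in this toy case, most of the code is format handling rather than analysis, and each new source tends to bring its own quirks that need another rule.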

While there are projects underway to help automate the data cleaning process and reduce the time it takes, the task of automation is made difficult by the fact that the process is as much art as science, and no two data preparation tasks are the same. That's why flexible, high-level languages like R are a key part of the process. As Mitchell Sanders notes in a Tech Republic article,

     Data science requires a difficult blend of domain knowledge, math and statistics expertise, and code hacking skills. In particular, he suggests that expert knowledge of tools like R and SAS are critical. "If you can't use the tools, you can't analyze the data."

This is a critical step to gaining any kind of insight from data, which is why data scientists still command premium salaries today, according to data from Indeed.com.

Friday, May 10, 2013

Integrated Health Interview Series

Integrated Health Interview Series (IHIS)

If you're working on trend projects using the US National Health Interview Survey (NHIS), IHIS, a database of NHIS data that facilitates trend analyses, can be a big help. This well-organized online resource includes annual NHIS data from the 1960s to the present. You can easily link to additional variables in the NHIS public use data and to comprehensive online documentation.

Thursday, May 09, 2013

Medicare Provider Charge Data

 
"As part of the Obama administration's work to make our health care system more affordable and accountable, data are being released that show significant variation across the country and within communities in what hospitals charge for common inpatient services.
 
The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent almost 7 million discharges or 60 percent of total Medicare IPPS discharges.
 
Hospitals determine what they will charge for items and services provided to patients and these charges are the amount the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deductible amount and DRG outlier amount.
 
For these DRGs, average charges and average Medicare payments are calculated at the individual hospital level. Users will be able to make comparisons between the amount charged by individual hospitals within local markets, and nationwide, for services that might be furnished in connection with a particular inpatient stay.
 
Data are being made available in Microsoft Excel (.xlsx) format and comma separated values (.csv) format."
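The comparison described above, between the amounts charged by individual hospitals for the same DRG, could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the column names (`drg`, `hospital`, `city`, `avg_charge`, `avg_payment`), the hospital names, the DRG codes and all dollar figures are made up and are not the headers or values of the released files.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows. Column names are assumptions, not the real file's headers.
SAMPLE_CSV = """drg,hospital,city,avg_charge,avg_payment
470,Hospital A,Springfield,32000.00,12500.00
470,Hospital B,Springfield,58000.00,13100.00
470,Hospital C,Riverton,21000.00,11900.00
291,Hospital A,Springfield,18000.00,6400.00
291,Hospital C,Riverton,27000.00,6900.00
"""


def charge_spread(csv_text):
    """For each DRG, find the lowest- and highest-charging hospital
    and the ratio between their average charges."""
    by_drg = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_drg[row["drg"]].append((float(row["avg_charge"]), row["hospital"]))

    spread = {}
    for drg, charges in by_drg.items():
        low, high = min(charges), max(charges)
        spread[drg] = {
            "lowest": low[1],
            "highest": high[1],
            "ratio": high[0] / low[0],
        }
    return spread


if __name__ == "__main__":
    for drg, info in charge_spread(SAMPLE_CSV).items():
        print(drg, info["lowest"], info["highest"], round(info["ratio"], 2))
```

To compare hospitals within a local market instead of nationwide, the same grouping could be keyed on the DRG and city together.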
 
 

Hospital Prices No Longer Secret As New Data Reveals Bewildering System, Staggering Cost Differences - 05/08/2013 The Huffington Post

Tuesday, January 31, 2012

Series of NCHS Data Evaluation and Methods Research

Series 2. Data Evaluation and Methods Research
Source: NCHS - Vital and Health Statistics Series
Studies of new statistical methodology including experimental tests of new survey methods, studies of vital statistics collection methods, new analytical techniques, objective evaluations of reliability of collected data, and contributions to statistical theory. Studies also include comparison of U.S. methodology with those of other countries.
No. 154 (2012). NCHS Urban–Rural Classification Scheme for Counties. 72 pp. (PHS) 2012-1354.

Thursday, November 03, 2011

Final Data Collection Standards for Race, Ethnicity, Primary Language, Sex, and Disability Status Required by Section 4302 of the Affordable Care Act


HHS on Oct. 31, 2011, published final standards for data collection on race, ethnicity, sex, primary language and disability status, as required by Section 4302 of the Affordable Care Act [PDF | 1.6 MB].
The law requires that data collection standards for these measures be used, to the extent practicable, in all national population health surveys. They will apply to self-reported information only. The law also requires any data standards published by HHS comply with standards created by the Office of Management and Budget (OMB).

Proposed standards were published on June 29, 2011, and public comments were accepted until August 1, 2011.

The standards, effective upon publication today, apply to population health surveys sponsored by HHS, where respondents either self-report information or a knowledgeable person responds for all members of a household. HHS will begin implementation of these new data standards in all new surveys and at the time of major revisions to current surveys.

Thursday, June 30, 2011

Measured Smoking-Related Chemicals in NHANES

Cotinine in Serum Cotinine file

Cotinine: Tobacco use is the most important preventable cause of premature morbidity and mortality in the United States. The consequences of smoking and of using smokeless tobacco products are well known and include an increased risk for several types of cancer, emphysema, acute respiratory illness, cardiovascular disease, stroke, and various other disorders (U.S. DHHS, 2006). Persons exposed to secondhand tobacco smoke (environmental tobacco smoke [ETS]) may have adverse health effects that include lung cancer and coronary heart disease; maternal exposure during pregnancy can result in lower birth weight. Children exposed to ETS are at increased risk for sudden infant death syndrome, acute respiratory infections, ear problems, and exacerbated asthma (U.S. DHHS, 2004). The smoke produced by burning tobacco contains at least 250 chemicals that are toxic or carcinogenic, and more than 50 compounds present in ETS are known or reasonably anticipated to be human carcinogens (NTP, 2004). Source: http://www.cdc.gov/exposurereport/chemical_information.html

2,5-Dimethylfuran:  2,5-Dimethylfuran is a volatile chemical found in tobacco smoke (Baggett et al., 1974) and in roasted coffee aroma (Wang et al., 1983). Exposure among the general population may occur through inhaling cigarette smoke and coffee aroma. 2,5-Dimethylfuran in blood and exhaled air has been used to determine smoking status (Ashley et al., 1996; Gordon et al., 2002; Perbellini et al., 2003). In addition, levels of 2,5-dimethylfuran found in blood provide a rough estimate of the number of cigarettes smoked per day (Ashley et al., 1995, 1996). After a person smokes cigarettes, 2,5-dimethylfuran is absorbed from the respiratory tract and then rapidly eliminated from the blood (Egle and Gochberg, 1979; Gordon et al., 2002). 2,5-Dimethylfuran is also a human urinary metabolite of n-hexane. Workers exposed to n-hexane will eliminate 2,5-dimethylfuran, along with other metabolites, in their urine (ATSDR, 2007; Iwata et al., 1983; Mutti et al., 1984; Perbellini et al., 1981). Source: http://www.cdc.gov/exposurereport/chemical_information.html