Thursday, March 14, 2013

10 tips for making your R graphics look their best

Source: Revolutions - David Smith

[Y Cheng: FYI, I usually use Daniel's XL Toolbox, a free, open-source add-in for Excel, to export Excel figures as high-quality, publishable figures. Hope you like it as well.]

So you've spent hours slaving over the code for a beautiful statistical graphic in R, and now you're ready to show it to the world.  You might be printing it, embedding it in a document, or displaying it on the web.  Don't do your graph a disservice by causing it to look anything less than perfect in its final venue. Here are 10 tips to help make sure your graphic will always look best.

1. Call the right device driver from a script

It's tempting to just create graphics to the on-screen device (such as X11 on Linux or Quartz on MacOS) and then just use "Save As..." from the menu.  However, this doesn't allow you to explicitly set the options for the device, and on some platforms, you don't even get to choose the file format.  Also, if you resize the graphics window after you create the graph, you can get some unexpected results (such as circles that look like ovals). Avoid using dev.copy for the same reason, despite its convenience.

The best practice is to create a script file that begins with a call to the device driver (usually pdf or png), runs the graphics commands, and then finishes with a call to dev.off().  For example:
png(file="mygraphic.png",width=400,height=350)
plot(x=rnorm(10),y=rnorm(10),main="example")
dev.off()
Not only will you often get better-looking results, but you'll have the means to recreate the graphic file six months down the line, when you've long forgotten how you did it manually.

2. If you're printing, use PDF

If you plan to print your graphic, you want to use a vector-based format.  This means that the graphic is represented in a scale-independent format, and it can be recreated in any size small or large without resulting in jagged lines or pixellated text.  When you print it on a printer, lines will appear smooth and text will be clear, even if the graphic has been enlarged or reduced and regardless of the DPI (dots-per-inch) rating of the printer.

PDF (via the pdf() driver) is the best choice: because PDF viewers are ubiquitous these days, your graphic can easily be viewed on Windows, MacOS and Linux machines.  It's also easy to create a high-quality printout of a PDF file on almost any printer.

PDF is also the best choice whenever you want to send the graph as a file via email, and the recipient needs the best quality possible.

3. For Web display, use PNG

PDF files aren't conveniently embedded in Web pages, so you'll need to use a pixel-based format instead.  GIF was the most popular format for many years, but it has several limitations (not least, graphs using many colors -- like image plots -- might not look correct in GIF format).  These days, the best choice is the PNG format, generated by the png() driver. Most browsers these days can display PNG graphics without trouble.

The main choice you need to make when using png() is the dimensions of the graphic in pixels. (This is specified with the width= and height= arguments to png). The choice of the X dimension is the most important: ideally, you want the whole graph to fit on the screen at once, and you definitely don't want the viewer to have to scroll horizontally to see your whole graph in all its beauty.

Almost every display is more than 800 pixels wide these days, so width=800 is a good choice for a full-screen graphic.  If your graph needs to fit into a column (for a blog-entry, say), you might want to cut that down to 400 pixels.  Choose the Y dimension based on your desired aspect ratio (see #6, below) -- for most purposes I find choosing a slightly smaller Y-dimension (about 85-90% of X) works well.

If you're not sure in advance how large the graphic will be on the Web page, a simple trick is to create it at high resolution (more than 1200 pixels in either direction), and use the height= OR width= options (but not both, to preserve aspect ratio) for the img tag in HTML to shrink it down to size. This can make your page slower to load than necessary, but most browsers these days are good at preserving image quality when resizing images. (See #5 for some caveats when generating high-resolution PNG files.)

Remember, though: the lower the display resolution, the fewer fine details will be visible on the final graph.  Some graphics just have to be displayed big for the full effect.

4. For documents or for detail, go hi-resolution

If you're inserting a graphic into a document like Word or PowerPoint, a vector format like PDF would in theory be the best, since it's independent of scale.  In practice though, Microsoft products don't handle embedded vector graphics reliably: with some effort you can make it look OK when printed, but it can be a pain to edit or review a document that includes a vector graphic. (Open-source LaTeX handles this much better, where embedded PostScript is the best choice.)

In this situation the best compromise is to stick with PNG as with the Web example, but at MUCH higher resolution.  In Word you can resize the graphic to an appropriate size, but the high resolution gives you the flexibility to choose a size while not compromising on the quality.  I'd recommend at least 1200 pixels on the longest side for standard printers. If your graphic is being professionally printed (in a book or on a poster, for example), check with your print shop for their recommendations (they'll probably want a PostScript file or a very high-resolution TIFF file).

5. Choose your dimensions carefully

R always has a concept of the real-world dimensions of your graphic measured in inches, independent of the number of pixels used to render a PNG or the actual size a PDF may be enlarged or reduced to when printing. The choice of physical dimensions is important whenever you use text on a graph -- which is almost always, since tick labels, axis labels and titles are all examples of text.

R uses the number of graph inches on the X and Y axis to determine the actual width and height of letters drawn on the page. As a general rule, as the graph size in inches gets large, the size of the text relative to the graphic gets smaller; conversely, for smaller graphics the text gets large relative to the graph elements.  You can correct for this using the cex option to the text-plotting commands, but this gets real fiddly real fast.

For PDF graphs this is easiest to deal with, since you specify width and height in inches anyway.  Even if you plan to display your graph on a huge poster, it's best to stick with human-scale dimensions of 7-10 inches per side.  This is a size that would fit comfortably when printed on Letter (US) or A4 (metric) paper. Since PDF is scalable, you can zoom the graphic up to whatever size you need, and the text will stay at a comfortable size relative to the data.
 
For PNG graphs, it's a bit trickier. By default, R assumes 72 pixels to the inch, so when you increase the pixel dimensions you're also increasing the implicit size of the graph area.

Here's an example of a 400x350 graphic with the default settings:
 
library(MASS)  # the Animals data set ships with the MASS package
png(file="animals72.png",width=400,height=350,res=72)
plot(Animals,log="xy",type="n",main="Animal brain/body size")
text(Animals,lab=row.names(Animals))
dev.off()

[Figure: animals72.png, the plot rendered at 72 pixels per inch]
R is assuming the graph area is about 5.56 inches across (400/72), so the default text size is large relative to the graph itself.  You can correct this with the res= argument to png, which specifies the number of pixels per inch.  The smaller this number, the larger the plot area in inches, and the smaller the text relative to the graph itself.  Let's see what happens when you drop this down to 45/inch:

library(MASS)  # the Animals data set ships with the MASS package
png(file="animals45.png",width=400,height=350,res=45)
plot(Animals,log="xy",type="n",main="Animal brain/body size")
text(Animals,lab=row.names(Animals))
dev.off()

[Figure: animals45.png, the same plot rendered at 45 pixels per inch]
Note the title is smaller, and the text labels are smaller too, making for a less-crowded plot.  I like to choose a resolution that gives me an X dimension in the 8-10 inches range (here 400/45 = 8.89 inches).
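
The inch arithmetic behind these two examples is easy to check by hand. Here is a minimal sketch in Python, used only as a calculator; the helper names are hypothetical and are not part of R. It converts pixel dimensions and a res value into the implied size in inches, and goes the other way to find a res value for a target width.

```python
def implied_size_inches(width_px, height_px, res):
    """Size in inches implied by pixel dimensions at `res` pixels per inch."""
    return width_px / res, height_px / res


def res_for_width(width_px, target_width_in):
    """Pixels per inch needed so that `width_px` spans `target_width_in` inches."""
    return width_px / target_width_in


if __name__ == "__main__":
    # The two settings used in the animals72.png and animals45.png examples.
    for res in (72, 45):
        w, h = implied_size_inches(400, 350, res)
        print(f"res={res}: {w:.2f} x {h:.2f} inches")
    # A 400-pixel-wide graphic that should behave like a 9-inch-wide plot.
    print(f"res for a 9 inch width: {res_for_width(400, 9):.1f}")
```

At res=72 the 400-pixel width works out to about 5.56 inches, and at res=45 to about 8.89 inches, which sits inside the 8-10 inch range.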

6. Think about aspect ratio

R's PDF graphics driver by default gives a 7x7 inch square surface, and it's tempting to choose equal X and Y pixel dimensions for PNGs.  But some graphs lend themselves to displays much wider than they are tall (like time series), and others look better as tall, thin graphs (lattice graphs, for example).

Consider the aspect ratio when choosing the dimensions of your PDF or PNG graphic, and choose a width-to-height ratio that serves the data best.  Whatever you do, don't stick with a square default and resize the graphic to a new aspect ratio for display.  This will result in stretched text elements and other unpleasant artifacts.

Also, remember that the graph dimensions you set up in the pdf or png call include all the outer margins around the graph itself, and by default they're not the same size on all sides. You'll either want to adjust the graph size accordingly, or reset the margins as shown in the next tip.

7. Remove the outer margins, if you're not using them

R reserves space at the top of the graph for the title, and space on the bottom and left side for the axis labels.  If your graph doesn't include any such labels, it's a good idea to tell R to use this space for the graphic, instead.  This makes it easier to embed your graph into a Web page or document without having to futz with clipping or spacing.  It also makes things a bit easier if you later have to reproduce your graph in smaller dimensions, where the space reserved for the labels can take up a significant portion of the plot area.

To remove the space reserved for labels, use par(mar=...).  For example:
png(file="notitle.png",width=400,height=350)
par(mar=c(5,3,2,2)+0.1)
hist(rnorm(100),ylab=NULL,main=NULL)
dev.off()
The four numbers in the call to par are the number of lines of text reserved on the bottom, left, top and right, respectively.  The default is to leave 4.1 lines at the top; in the example above I've reduced it to 2.1, which is a sensible minimum to leave a small buffer of whitespace around the graph. On the left side I reduced it to 3.1 (from a default of 4.1), leaving enough room for the y-axis tick labels. This is the result (with a surrounding box to show the dimensions of the graph area):

[Figure: notitle.png, the histogram drawn with the reduced margins]

Compare to the result using default margins by eliminating the call to par in the script above:

[Figure: the same histogram drawn with the default margins]

8. Make sure anti-aliasing is enabled

When a diagonal line is displayed on a computer screen, the points on the line don't line up exactly with the rectangular grid on the screen. This causes the line to look jagged, as if it's taking a series of steps up (with each row of pixels on the screen) instead of smoothly ascending.  This visual effect is alleviated with anti-aliasing, which automatically uses grey pixels where the line doesn't quite fill an on-screen pixel, lessening the "jaggie" effect and generally making lines, text, and other elements look smoother on the screen.

You don't need to worry about this with PDF graphics (the PDF viewer handles this for you), but it can be an issue with PNG graphics.  On most systems R will use anti-aliasing automatically, but on some (those without X11), it's not available.  Here's a graphic (from the R graph gallery) created without anti-aliasing:
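[Figure: Graph_72, the gallery graphic rendered without anti-aliasing]
Anti-aliasing is enabled on my system, so here's what it looks like when I recreate it (typo and all):
[Figure: Coal-antialias-on, the same graphic recreated with anti-aliasing enabled]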
In this version, the text is much easier to read and the lines appear smoother.

If you don't have anti-aliasing on your system (and can't recompile R to enable it), you can use the poor-man's anti-aliasing trick: generate the graph in double the resolution, and display it at half the size. The browser will handle the anti-aliasing, at the expense of additional bandwidth for your graphic.
 
9. Don't use JPEG, ever

You might be tempted to use the JPEG (aka .JPG) graphics format for the final product on the Web, but this is almost certainly a bad idea.  JPEG works fine for photograph-like images, but introduces blurry artifacts around lines and letters for the typical R graph.  You might save a few kilobytes in the file size by using the jpeg device driver or converting your PNG into .JPG, but only at a significant expense in quality.
 
10. Be creative

Of course, the most important tip for making your graph look good is: make a good-looking graph! Graphical display of quantitative data is in some ways more art than science, but as a general rule it takes time and effort to make a truly effective display that lets your data tell the story it needs to tell. Fortunately, R provides you with all the tools you need to pull out all the details, make the right comparisons, and make the results pleasing to the eye.  Don't be satisfied with the "stock" graphs from the top-level functions like plot or hist. Make liberal use of the annotation functions like text and lines, and experiment with choices of color, layout, and size.

There are many good resources for learning about making good graphical displays, but my favorite is Tufte's classic: The Visual Display of Quantitative Information. Not only is it chock-full of wonderful examples and sensible guidelines for displaying data, it makes a beautiful coffee-table book to show your non-statistician friends that Statistics is about more than just numbers.

If you want to download the scripts that generated the graphs in this article, you can get them here:

Wednesday, March 06, 2013

The logic operation with %IF

The logic operation with %IF of SAS Macro

I have used SAS for many years, and in the DATA step I routinely write range comparisons in the simplified, chained form without any problem: for example, IF 18 <= age <= 80 instead of IF age >= 18 AND age <= 80. Today, however, I spent hours tracking down a logic problem caused by using the same chained form with %IF in SAS macro code. To get the correct result from %IF, we must write the comparison the explicit way (age is a macro variable in the following example): %IF &age >= 18 AND &age <= 80 %THEN %DO; ......; %END;
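
A likely explanation, as far as I understand the macro language, is that %IF evaluates a chained comparison strictly from left to right: the first comparison yields 1 or 0, and that 1 or 0 is then compared with the upper bound. The sketch below emulates that behavior in Python. It is an illustration under that assumption, not SAS code, and the function names are hypothetical. (Python's own chained comparison behaves like the DATA step, so the emulation is spelled out explicitly.)

```python
def macro_style_chained(low, value, high):
    """Emulate left-to-right evaluation of `low <= value <= high`.

    Assumption: the first comparison is reduced to 1 (true) or 0 (false)
    before the second comparison is applied to that result.
    """
    first = 1 if low <= value else 0
    return first <= high


def explicit_and(low, value, high):
    """The explicit form: both bounds are tested against the value itself."""
    return (value >= low) and (value <= high)


if __name__ == "__main__":
    for age in (5, 50, 95):
        print(age, macro_style_chained(18, age, 80), explicit_and(18, age, 80))
```

Under this assumption the chained form is true for every age, because 0 or 1 is always below 80, while the explicit AND form is true only for ages from 18 to 80.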


Tuesday, February 12, 2013

Thin Asians at Risk for Diabetes Due to Hidden Body Fat

Source: Medscape.com by Lisa Nainggolan

Type 2 diabetes, usually associated with obesity, can occur in many seemingly thin people from ethnic minorities, physicians told attendees here at the Excellence in Diabetes 2013 meeting last week.

Researchers showed that Japanese American women are twice as likely to be diagnosed with diabetes as whites, despite having lower body-mass indexes (BMIs). Epidemiologist Gertraud Maskarinec, MD, from the University of Hawaii Cancer Center, Honolulu, presented the findings, which cover a number of studies from her group, in a poster.

She told Medscape Medical News: "Diabetes risk is higher in all ethnic groups than in whites, and of course some of this is just due to body weight, but evidence is now building that people of many races may be at increased risk of diabetes and cancer before they are even considered conventionally overweight."

In communities where there are a lot of Asians, "I think it's on everybody's radar already," said Dr. Maskarinec. "If an Asian walks in, you don’t have to wait until they weigh hundreds of pounds to do a diabetes test." The World Health Organization (WHO) has worked on the idea to lower the "at-risk" BMI to 23 kg/m2 for certain ethnic groups, she adds, but "not everybody has adopted it."

Meanwhile, Chittaranjan Yajnick, MD, from King Edward Memorial Diabetes Unit, Pune, India, also gave a talk on what makes Indians so susceptible to diabetes. "We have seen that Indians are often diagnosed with diabetes 10 years earlier and 5- to 10-units BMI thinner than whites," he noted.

Both believe the explanation lies in "hidden" visceral fat found inside the body, between organs, in Asians and probably other ethnic groups too, but not in whites. This in turn affects the levels of adipokines secreted, such as leptin and adiponectin, which can have adverse metabolic effects. ...full text ...

Friday, January 18, 2013

Janelia Automatic Animal Behavior Annotator


The Janelia Automatic Animal Behavior Annotator (JAABA), an open-source program, is a machine learning-based system that enables researchers to automatically compute interpretable, quantitative statistics describing video of behaving animals. It may be a good tool for low-cost science projects for middle and high school students.


Friday, December 07, 2012

How to extract an image from a PDF file

  • Manually
    • Open the PDF file containing the image in Adobe Acrobat Reader.
    • Go to the page with the image you want to extract.
    • If the 'Select Tool' (with an arrow icon) is not on your toolbar, right-click the toolbar, click 'Select & Zoom', then check 'Select Tool'.
    • Click 'Select Tool'; the mouse pointer is now an arrow.
    • Right-click the image, then click 'Copy Image' (or press 'Ctrl-c').
    • Click 'Start', 'All Programs', 'Accessories', then run 'Paint' (or use your favorite image-editing software instead).
    • Press 'Ctrl-v'.
    • Finally, save the file as a jpg or other graphic format file.
  • Online
    • I like ExtractPdf.com, which can extract images, text or even fonts from a PDF file.
  • Software

Tuesday, November 27, 2012

Survival analysis using Stata

Monday, November 19, 2012

Stata Programming

      • di "`2+2'" // ==> N/A
      • local x 2
        • di "`=`x'-2'" // ==> 0
      • local pth "c:\project"
        • di "`pth'\data\" //==> "c:\project\data\"
      • global pt "c:\project"
        • di "$pt\data\" //==> "c:\project\data\"
        • di "${pt}data\" //==> "c:\projectdata\"
      • local a 2+3
      • local b 7
        • display `a'+`b' //==> 12
        • display "`a'+`b'" //==> 2+3+7
        • display "`a'"+"`b'" //==> 2+3+"7" invalid name
        • display "`a'""+""`b'" //==> 2+3+7
      • regress mpg weight
        • local rsqf e(r2)
        • local rsqv = e(r2)
        • di "R-squared_1f=`rsqf'"  //==> R-squared_1f=e(r2)
        • di "R-squared_1v=`rsqv'"  //==> R-squared_1v=.6515312529087511
        • di "R-squared_2f=" `rsqf' //==> R-squared_2f=.65153125
        • di "R-squared_2v=" `rsqv' //==> R-squared_2v=.65153125
        • di "R-squared_3f=" "``rsqf''" //==> R-squared_3f=.6515312529087511
        • di "R-squared_3v=" "``rsqv''" //==> N/A
  • Stata Blog: Programming an estimation command in Stata
      • Example 1: Storing and extracting the result of an extended macro function
        • local count : word count a b c
        • display "count contains `count'" ==> count contains 3
      • Example 2: Using gettoken to store first token only
        • local mylist y x1 x2
        • display "mylist contains `mylist'" ==> mylist contains y x1 x2
        • gettoken first : mylist
        • display "first contains `first'" ==> first contains y
      • Example 3: Using gettoken to store first and remaining tokens
        • gettoken first left: mylist
        • display "first contains `first'" ==> first contains y
        • display "left  contains `left'" ==> left  contains  x1 x2
      • Example 4: Local macro update
        • local p = 1
        • local p = `p' + 3
        • display "p is now `p'" ==> p is now 4
      • Example 5: Local macro update
        • local p = 1
        • local ++p
        • display "p is now `p'" ==> p is now 2

Friday, November 16, 2012

The rule of thumb: AIC and BIC preference/significance levels

The Akaike information criterion (AIC) is a measure of the relative goodness of fit of a statistical model, developed by Hirotsugu Akaike in 1974. It was not much appreciated until the 21st century. Now it’s one of the most widely used fit statistics in statistical modeling output. The preferred model is the one with the minimum AIC value; however, there is no statistical test for model choice based on this criterion. I found a table in Joseph M Hilbe’s book (Negative Binomial Regression, 2nd Ed, 2011) which may help us choose a better model. Below is a table I modified based on his confusing table:

AIC = -2(log-likelihood) + 2(number of predictors including the intercept)

================================================================
AIC(A) - AIC(B), where AIC(A) > AIC(B)    Preference
----------------------------------------------------------------
< 2.5                                     No difference between models
2.5 – 5.9                                 Prefer B if sample size n > 256
6.0 – 9.9                                 Prefer B if sample size n > 64
≥ 10.0                                    Prefer B
================================================================
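
To make the AIC rule of thumb concrete, here is a minimal sketch in Python. The function names and the log-likelihood values are hypothetical, chosen only for illustration. Two points are my own assumptions rather than part of the table: the bands are treated as continuous (so 5.95 falls in the 2.5 – 5.9 band), and when the sample size is too small for a band the function returns "no clear preference", since the table is silent on that case.

```python
def aic(log_likelihood, n_params):
    """AIC = -2(log-likelihood) + 2(number of predictors incl. intercept)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def aic_rule_of_thumb(aic_a, aic_b, n):
    """Apply the rule-of-thumb table to two AIC values and sample size n."""
    diff = abs(aic_a - aic_b)
    better = "B" if aic_a > aic_b else "A"  # the model with the smaller AIC
    if diff < 2.5:
        return "no difference"
    if diff < 6.0:
        return f"prefer {better}" if n > 256 else "no clear preference"
    if diff < 10.0:
        return f"prefer {better}" if n > 64 else "no clear preference"
    return f"prefer {better}"


if __name__ == "__main__":
    # Hypothetical fits: model A has 3 parameters, model B has 4.
    aic_a = aic(log_likelihood=-520.0, n_params=3)   # 1046.0
    aic_b = aic(log_likelihood=-515.5, n_params=4)   # 1039.0
    print(aic_a, aic_b, aic_rule_of_thumb(aic_a, aic_b, n=100))
```

Here the difference is 7.0, which falls in the 6.0 – 9.9 band, so with n = 100 (more than 64) the rule prefers model B.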

Another criterion is the Bayesian information criterion (BIC). Raftery AE (1995) gave a scale for the relative preference between two models (I modified the table in Hilbe's book):

BIC = -2(log-likelihood) + (number of predictors including the intercept)*(ln(sample size))

================================================================
BIC(A) - BIC(B), where BIC(A) > BIC(B)    Degree of preference
----------------------------------------------------------------
< 2.0                                     Weak
2.0 – 5.9                                 Positive and prefer B
6.0 – 9.9                                 Strong and prefer B
≥ 10.0                                    Very strong and prefer B
================================================================
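
The BIC scale can be sketched the same way in Python. Again, the function names and the log-likelihood values are hypothetical, and treating the bands as continuous is my assumption.

```python
import math


def bic(log_likelihood, n_params, n):
    """BIC = -2(log-likelihood) + (number of predictors incl. intercept)*ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n)


def bic_evidence(bic_a, bic_b):
    """Label the strength of preference for the model with the smaller BIC."""
    diff = abs(bic_a - bic_b)
    if diff < 2.0:
        return "weak"
    if diff < 6.0:
        return "positive"
    if diff < 10.0:
        return "strong"
    return "very strong"


if __name__ == "__main__":
    n = 100
    # Hypothetical fits: model A has 3 parameters, model B has 4.
    bic_a = bic(log_likelihood=-520.0, n_params=3, n=n)
    bic_b = bic(log_likelihood=-515.5, n_params=4, n=n)
    print(f"difference = {bic_a - bic_b:.2f}, evidence = {bic_evidence(bic_a, bic_b)}")
```

With these made-up numbers the difference is about 4.39, so the preference for model B is only "positive". BIC penalizes the extra parameter more heavily than AIC once ln(n) exceeds 2.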

Sunday, November 11, 2012

Gardening and Cooking


Gardening


Cooking
Restaurants
Others



Thursday, November 01, 2012

Resource and Tips of Stata

Wednesday, October 31, 2012

How to get predicted incidence rate using -poisson- of Stata

The Stata -poisson- command can be used to model count variables and incidence rates. However, the default predicted margin is the predicted number of events. We need to use the -predict(ir)- option to get the predicted incidence rate.

-n-, the default, calculates the predicted number of events. -ir- calculates the incidence rate exp(Xjbeta), which is the predicted number of events when exposure is 1. Specifying -ir- is equivalent to specifying -n- when neither -offset()- nor -exposure()- was specified when the model was fit. For example:

webuse nhanes2, clear
svy: poisson hlthstat age i.region
margins region, predict(ir) at(age=50) vce(svy)
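
The relationship between the two predictions can be sketched with a little arithmetic. The Python snippet below is an illustration only: the linear predictor value is hypothetical, not taken from the nhanes2 fit, and the function names are made up. It shows that the incidence rate is exp(xb) and that the predicted count is that rate multiplied by the exposure, so the two agree when exposure is 1.

```python
import math


def incidence_rate(xb):
    """Incidence rate for a given linear predictor xb: exp(xb)."""
    return math.exp(xb)


def expected_count(xb, exposure=1.0):
    """Predicted number of events: exposure * exp(xb)."""
    return exposure * math.exp(xb)


if __name__ == "__main__":
    xb = -1.5  # hypothetical linear predictor for one covariate pattern
    print(f"incidence rate:            {incidence_rate(xb):.4f}")
    print(f"count with exposure = 1:   {expected_count(xb):.4f}")
    print(f"count with exposure = 10:  {expected_count(xb, exposure=10):.4f}")
```

This is why specifying -ir- makes no difference when the model was fit without -offset()- or -exposure()-: every observation then has an implicit exposure of 1.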

Factor variable operators

  • i. unary operator to specify indicators
  • c. unary operator to treat as continuous
  • o. unary operator to omit a variable or indicator
  • # binary operator to specify interactions
  • ## binary operator to specify full-factorial interactions
  • use ib#., ib(first)., ib(last)., ib(freq). to set base level
  • ibn. means no base level
  • ".fvset base 3 group" sets the base for group to be 3
  • ".list if 3.group" to list all when group is equal to 3.
  • ".gen over_age=cond(3.group, age-21, 0)"


Tuesday, October 23, 2012


"YOUTH" - Samuel Ullman
Source: Samuel Ullman Museum

Youth is not a time of life; it is a state of mind; it is not a matter of rosy cheeks, red lips and supple knees; it is a matter of the will, a quality of the imagination, a vigor of the emotions; it is the freshness of the deep springs of life.

Youth means a temperamental predominance of courage over timidity, of the appetite for adventure over the love of ease. This often exists in a man of sixty more than a boy of twenty. Nobody grows old merely by a number of years. We grow old by deserting our ideals.

Years may wrinkle the skin, but to give up enthusiasm wrinkles the soul. Worry, fear, self-distrust bows the heart and turns the spirit back to dust.

Whether sixty or sixteen, there is in every human being's heart the lure of wonder, the unfailing child-like appetite of what's next, and the joy of the game of living. In the center of your heart and my heart there is a wireless station; so long as it receives messages of beauty, hope, cheer, courage and power from men and from the infinite, so long are you young.

When the aerials are down, and your spirit is covered with snows of cynicism and the ice of pessimism, then you are grown old, even at twenty, but as long as your aerials are up, to catch the waves of optimism, there is hope you may die young at eighty.


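Youth - Samuel Ullman (English rendering of the Chinese translation)
Life rushes by, yet youth is not a fleeting stretch of it. Youth is an enduring state of mind. A ruddy face, red lips and nimble legs are not the whole of youth. True youth is a strong will, a refined imagination, a fullness of feeling, and the ever-fresh clarity of the spring of life.

Youth means courage prevailing over timidity and enterprise prevailing over ease. Must the turning of the years bring old age? Know that people grow decrepit because they give up the pursuit of truth.

The relentless passage of time leaves deep wrinkles, but the loss of enthusiasm leaves its mark deep within. Anxiety, fear and self-doubt will in the end dishearten the mind and wear away the will.

Whether sixty or sixteen, everyone should keep in their heart an unfading will to explore new things and pursue the joys of life. In each of our hearts there should be a wireless station; as long as it keeps receiving beauty, hope, courage and strength from humanity and from God, we will stay forever young.

If you take down your aerial and let your spirit be covered with the frost and snow of cynicism and the ice of pessimism, then even at twenty you are already old. But if you are eighty and near the end of life, and you raise your aerial to catch the waves of optimism and enterprise, you will still be radiant with youth.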

Friday, October 19, 2012

Look AHEAD halted: Lifestyle management fails to reduce hard CV outcomes in diabetics

Source: theHeart.org

Los Angeles, CA - The Action for Health in Diabetes (Look AHEAD) study, a trial of an intensive lifestyle-intervention program aimed at achieving and maintaining weight loss and fitness in patients with type 2 diabetes, has been stopped for futility.

A large cardiovascular-outcomes study funded by the National Institutes of Health that included 5145 adults with diabetes and a body mass index >25 kg/m2, Look AHEAD failed to show a difference in the rate of nonfatal MI, nonfatal stroke, death, or hospitalization for angina among patients randomized to an intensive lifestyle intervention and those randomized to a control arm consisting of education alone.

Despite significant reductions in weight and improvements in physical-fitness levels among patients with diabetes, investigators concluded that the intervention arm, which included individual sessions with a nutritionist and/or personal trainer, as well as group sessions and refresher courses, failed to provide any benefit in terms of cardiovascular outcomes.

Dr Anne Peters (University of Southern California, Los Angeles), one of the study investigators, said in an interview that the trial was successful on one level, namely that patients lost weight and improved their fitness. Data published at four years showed that the intensive intervention led to weight loss of up to 10% in the first year and that patients maintained a 6.5% reduction in body weight in the following three years. Over an 11-year follow-up period, the patients reported a 5% reduction in body weight from baseline, said Peters.

In addition, early data showed that treadmill fitness levels, hemoglobin A1c levels, systolic and diastolic blood pressure, HDL-cholesterol levels, and triglyceride levels were all significantly improved among patients in the lifestyle-intervention arm when compared with the control group. The only cardiovascular risk factor that remained unchanged with treatment was LDL-cholesterol levels.

Despite the lack of cardiovascular benefit observed in Look AHEAD, Peters stressed that diabetic patients should not stop exercising or begin eating anything they wish.
"We do know that weight loss and exercise can prevent diabetes," said Peters. "I am a big advocate of prevention, both early prevention of obesity altogether, as well as prevention of diabetes in individuals who have become overweight. Lifestyle changes can help prevent diabetes. Once you have diabetes, I think weight loss and exercise can have benefits, but they are not going to reduce the risk for the primary outcome that we set for Look AHEAD, which was a risk for macrovascular events or death."