Thursday, 22 November 2012

q's instead of dots in Inkscape with R

I've come across the problem of circles in an R plot's PDF output turning into lower-case q's, and have solved, or at least half solved, the problem.


Recently, while getting an R plot ready for a presentation, I exported my ggplot as a PDF and imported it into Inkscape. To my dismay, all of my points turned into q’s, and I had to choose between making do with overlapping text (fixing that was the reason I was importing the PDF into Inkscape in the first place) and suffering q’s for points. For those who don't know, Inkscape is a free vector graphics editor, similar to Adobe Illustrator.

I eventually remembered that R can plot different shapes, and guessed that not all of them would respond in the same way in Inkscape. Luckily I was right. R's pch values 1 to 25 give 25 different shapes, and some of these don't change when imported into Inkscape.

R Code:

plot(x=1:25,y=rep(1,25), pch = 1:25, cex=3)
savePlot("plot.pdf",type="pdf")


Luckily it is only circles that are affected; any shape without a circle is completely fine.
The R shapes that work with Inkscape are listed below:


2,3,4,5,6,7,8,9,11,12,14,15,17,18,22,23,24,25

And a list of R shapes that don’t work with Inkscape:

1,10,13,16,19,20,21
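
A possible way to keep the circles, offered as a sketch rather than something tested in the post above: the q's are commonly attributed to R's pdf() device drawing small circles as characters from the Dingbats font, which some programs fail to map on import. pdf() has a useDingbats argument that turns this off. The file name below is made up for illustration, and the example writes through pdf() directly rather than savePlot().

```r
# Hypothetical output file; any writable path would do
out_file <- file.path(tempdir(), "plot_no_dingbats.pdf")

# Open a PDF device that draws circles as paths instead of Dingbats characters
pdf(out_file, useDingbats = FALSE)
plot(x = 1:25, y = rep(1, 25), pch = 1:25, cex = 3)
dev.off()  # close the device so the file is written
```

For a ggplot, ggsave() passes extra arguments on to the graphics device, so adding useDingbats = FALSE to a ggsave() call that writes a PDF should have the same effect.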


Monday, 26 September 2011

Memory usage in R

I've been having some trouble running R code in Eclipse using StatET, so I thought I'd compare memory usage across the different applications using one of the scripts that I have written.

Just after start-up, before any script has run, Eclipse is by far the biggest memory user.



It also seems to use a lot more memory once the script has run.


In conclusion, use whichever editor you feel most comfortable with, but if you're running something memory intensive and still need a GUI, use the basic RGui.

The code is below.


library(ggplot2)
setwd("C:\\Users\\DealeyJ\\Blog")
### Insert data
### NB: added a space before R in Type to switch the order around

Program<-c("R","Eclipse","Eclipse","Rstudio","Rstudio")
Process<-c("rgui.exe","eclipse.exe","javaw.exe","rsession.exe","rstudio.exe")
Type<-c(" R","GUI"," R"," R","GUI")
BaseMem<-c(15,107,46,16,50)
AfterProgMem<-c(136,130,206,141,64)
###Bind as data frame - not needed
#Rmem<-as.data.frame(cbind(Program,Process,BaseMem,AfterProgMem,Type))
### Plot Base Memory and Save

ggplot()+
  geom_col(aes(Program,BaseMem,fill=factor(Type)))+
  scale_y_continuous("Memory (MB)")+ggtitle("Before Running script")
savePlot("R programs - Base Memory",type="png")
### Plot After Program Memory and save
BaseMemory<-1.5
ggplot()+
  geom_col(aes(Program,AfterProgMem,fill=factor(Type)))+
  scale_y_continuous("Memory (MB)")+ggtitle("After running script")+
  geom_linerange(aes(Program,ymax=BaseMem,ymin=0,size=BaseMemory))
savePlot("R programs - after R Script",type="png")
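
To put numbers on the comparison, the per-process figures from the script above can be summed per program. This is a minimal sketch in base R, assuming it is fair to add together the two processes that each editor starts. The names Rmem, totals and Increase are introduced here for illustration only.

```r
# Same figures as in the script above, held in a data frame so numbers stay numeric
Rmem <- data.frame(
  Program      = c("R", "Eclipse", "Eclipse", "Rstudio", "Rstudio"),
  Process      = c("rgui.exe", "eclipse.exe", "javaw.exe", "rsession.exe", "rstudio.exe"),
  BaseMem      = c(15, 107, 46, 16, 50),
  AfterProgMem = c(136, 130, 206, 141, 64)
)

# Sum the memory (MB) of all processes belonging to each program
totals <- aggregate(cbind(BaseMem, AfterProgMem) ~ Program, data = Rmem, FUN = sum)

# Extra memory (MB) in use after running the script
totals$Increase <- totals$AfterProgMem - totals$BaseMem
print(totals)
```

On these figures Eclipse goes from 153 MB to 336 MB, RStudio from 66 MB to 205 MB, and the plain R GUI from 15 MB to 136 MB.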

Wednesday, 7 September 2011

Machine Learning is a lot harder than I expected

I signed up to Kaggle a couple of months back, having a look into the Heritage Health Prize but never doing anything about it. The amount of data available in the HHP looked overwhelming from the beginning, and I thought it better to steer clear for now.

Recently, however, I saw Sali Mali's blog post about another Kaggle competition, released by Dunnhumby (the inventors of Tesco Clubcard). The data was a lot simpler: just a customer ID, visit date, and spend amount. This looked like a great way to get onto the bandwagon, and here is my progress so far.

You can find Sali's post at http://anotherdataminingblog.blogspot.com/2011/08/gone-shopping.html

My first step (the only step so far) is to recreate the import and upload a simple guess.

I have used http://tohtml.com/ as a basis for colouring my R script. However, it doesn't have a specific style for R, so I have taken the PHP one and done a quick edit to make some (not all) of the colours look better.

### Set working directory and load libraries
setwd("C:\\Documents and Settings\\MyName\\Desktop\\R\\Eclipse Files\\DunnHumby")
library(Hmisc)
library(plyr)
### read in data and create temporary columns for use later
test<-read.csv("test.csv")
test$a<-1
test$row<-c(1:nrow(test))
### create a variable to find the last visit date
test$visit_number<-ave(test$a,test$customer_id, FUN=cumsum)
test$last_visit_date<-as.POSIXlt(Lag(test$visit_date,shift=1))
test$a<-Lag(test$customer_id,shift=1)
y<-which(test$visit_number==1)
test$last_visit_date[y]<-rep(as.POSIXlt(NA), length(y))
test$visit_date<-as.POSIXlt(test$visit_date)
### find the time since last visit in days (units set explicitly, as "-" chooses its own units)
test$time_since_last_visit<-as.numeric(difftime(test$visit_date,test$last_visit_date,units="days"))
### create next visit date
test2<-ddply(test,.(customer_id),summarise,medianspend=median(visit_spend,na.rm=T),
         maxvisitdate=max(as.POSIXct(last_visit_date),na.rm=T),
         maxtime=max(time_since_last_visit,na.rm=T),
         maxrow=max(row))
### define next visit as last visit date + the time since last visit
test2$next_visit<-test2$maxvisitdate+(test$time_since_last_visit[test2$maxrow]*24*60*60)
### define the next spend as last spend. Any spend less than 10 should get a spend of 10 as the problem allows for you to be out by £10
test2$next_spend<-test$visit_spend[test2$maxrow]
test2$next_spend<-ifelse(test2$next_spend<10,10,test2$next_spend)
### create the output ready to read directly into the kaggle database
test2$visit_date<-sapply(strsplit(as.character(test2$next_visit)," "),function(x) x[1])
write.csv(subset(test2,select=c("customer_id","visit_date","next_spend")),"output.csv",row.names=FALSE)
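
The idea described in the comments above (next visit = last visit plus the most recent gap, next spend = last spend with a floor of 10) can also be sketched in base R on a few made-up rows. This is an illustration under those assumptions, not the script above: the visits data is invented, only the column names customer_id, visit_date and visit_spend come from the competition file, and days_since_last_visit, last and guess are names introduced here.

```r
# Made-up visits for two customers, using the column names from test.csv
visits <- data.frame(
  customer_id = c(1, 1, 1, 2, 2),
  visit_date  = as.Date(c("2011-01-01", "2011-01-08", "2011-01-10",
                          "2011-02-01", "2011-02-15")),
  visit_spend = c(20, 5, 8, 30, 12)
)

# Sort so that each customer's visits are in date order
visits <- visits[order(visits$customer_id, visits$visit_date), ]

# Days since the previous visit, NA for a customer's first visit
visits$days_since_last_visit <- ave(
  as.numeric(visits$visit_date), visits$customer_id,
  FUN = function(d) c(NA, diff(d))
)

# Keep the final visit of each customer
last <- visits[!duplicated(visits$customer_id, fromLast = TRUE), ]

# Guess: repeat the last gap after the final visit, and never predict under 10
guess <- data.frame(
  customer_id = last$customer_id,
  visit_date  = last$visit_date + last$days_since_last_visit,
  next_spend  = pmax(last$visit_spend, 10)
)
print(guess)
```

Working with Date values and whole days avoids converting between hours, days and seconds.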

Now that I have successfully uploaded my first output to Kaggle, I am going to create my variables from the training set in SAS, because my PC has only a small amount of memory and R doesn't cope well with it. Also, because I have struggled with some of the R code for fairly simple manipulations, I see SAS as a faster alternative for me.

Thursday, 25 August 2011

Hello World!

I'm going to have a go at starting my own blog, so I'd just like to say Hello World!

Jason