Sunday, 24 April 2016

Analysis of Census data set using R Studio


Basic Data Analysis through R/R Studio 
Hey readers,
TASH here!

In this blog post, I'll walk through a basic data analysis in R, using R Studio and its features to create some visual representations of the data. We will perform the following steps to achieve our goal:
  1. Installing R.
  2. Installing R Studio.
  3. Downloading/importing data in R.
  4. Transforming Data / Running queries on data.
  5. Basic data analysis using statistical averages.
  6. Plotting data distribution.


Let's go over the tutorial by performing one step at a time.
1. Installing R:
The R installer for Windows XP/Vista/7/8/8.1/10 (32-bit and 64-bit) can be downloaded from the following website: https://cran.r-project.org/bin/windows/base/

 


2. Installing R Studio:
R Studio for Windows XP/Vista/7/8/8.1/10 (32-bit and 64-bit) can be downloaded from the following website: https://www.rstudio.com/products/rstudio/download/



* Direct link to download R Studio for the Windows platform: https://download1.rstudio.org/RStudio-0.99.896.exe

3. Downloading/importing data in R:


For this tutorial we will use the sample ACS census data set (http://stat511.cwick.co.nz/homeworks/acs_or.csv). There are two ways to import this data into R. One way is to import it programmatically by executing the following command in the console window of R Studio:

acs_or <- read.csv(url("http://stat511.cwick.co.nz/homeworks/acs_or.csv"))

Once this command is executed by pressing Enter, the dataset will be downloaded from the internet, read as a CSV file and assigned to the variable acs_or, which is the name used in the rest of this tutorial.

Alternatively, we can import the dataset by clicking Import Dataset in R Studio and providing the URL.

 

Imported Dataset..
After setting the separator ( , ), the name (table name) and the other parameters, click on the Import button. The dataset will be imported into R Studio and assigned to the variable name you set.


 Any dataset can be viewed by executing the following line:
 
> View(acs_or)
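
Besides View(), a few console functions give a quick look at a freshly imported data frame. The sketch below is only an illustration: it reads a tiny made-up CSV through the text argument of read.csv so that it runs without an internet connection. The values are invented; only the column names follow the ones used in this tutorial, and demo is a placeholder name.

```r
# Tiny made-up CSV standing in for acs_or.csv (values are invented)
csv_text <- "age_husband,age_wife,number_children,bedrooms
35,33,2,3
52,54,0,2
41,41,1,4
29,27,3,3"

# read.csv() can parse literal text passed through the 'text' argument
demo <- read.csv(text = csv_text)

dim(demo)      # number of rows and number of columns
names(demo)    # column names
str(demo)      # type of each column plus the first few values
head(demo, 2)  # first two rows
```

The same four calls work unchanged on acs_or once it has been imported.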


 



4. Transforming Data:
You can use R's various transformation features to manipulate the data. Let's learn a few of the basic data access techniques.

To access a particular column, e.g. age_husband in our case:
  > acs_or$age_husband

To access a single value by its row and column index (here row 1, column 3):
  > acs_or[1,3]
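
The indexing forms can be compared side by side on a small example. This is a minimal sketch with made-up rows (demo is a placeholder data frame, and only the column names follow the tutorial's dataset):

```r
# Made-up rows; only the column names follow the tutorial's dataset
demo <- data.frame(age_husband = c(35, 52, 41),
                   age_wife    = c(33, 54, 41),
                   bedrooms    = c(3, 2, 4))

demo$age_husband                      # one column, returned as a vector
demo[["age_husband"]]                 # the same column, selected by name
demo[1, 3]                            # single value: row 1, column 3
demo[2, ]                             # the whole second row (a one-row data frame)
demo[, c("age_husband", "age_wife")]  # two columns, all rows
```

Leaving the row or column position empty inside [ , ] means "all rows" or "all columns".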
 
You can also use R's subset function, for example if we want those rows from the dataset in which age_husband is greater than age_wife.
For this we'll run the following command in the console:

>  a <- subset(acs_or,age_husband > age_wife)


 

The above statement returns the set of rows in which age_husband is greater than age_wife and assigns those rows to a.
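
The same row filter can also be written with logical indexing, which may help to see what subset does. A minimal sketch on made-up data (demo and its values are invented for illustration):

```r
# Made-up ages; only the column names follow the tutorial's dataset
demo <- data.frame(age_husband = c(35, 52, 41, 29),
                   age_wife    = c(33, 54, 41, 27))

# subset() evaluates the condition using the column names directly
a <- subset(demo, age_husband > age_wife)

# Equivalent filter: a TRUE/FALSE vector in the row position of [ , ]
b <- demo[demo$age_husband > demo$age_wife, ]

nrow(a)          # number of rows where the husband is older
all.equal(a, b)  # compares the two results
```

One difference worth knowing: when the condition is NA for some row, subset drops that row, while plain logical indexing returns a row filled with NA.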


5. Basic data analysis using statistical averages:

The following functions can be used to calculate summary statistics of the dataset:

a.       For the mean (the sum of the numbers divided by how many numbers there are) of a column:
> mean(acs_or$age_husband)
 
b.      For the median (the middle value once the numbers are sorted) of a column:
> median(acs_or$age_husband)
 
c.       For the quantiles (cut points that divide the sorted observations into equal-sized groups; by default the minimum, the three quartiles and the maximum) of a column:
> quantile(acs_or$age_husband)
 
d.      For the variance (a measure of the spread of the numbers, based on the squared deviations from the mean) of a column:
> var(acs_or$age_husband)
 
e.       For the standard deviation (the square root of the variance, a measure of how spread out the numbers are in the same units as the data) of a column:
> sd(acs_or$age_husband)
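
To make the definitions concrete, here is a small worked sketch that computes each statistic by hand on five made-up ages and compares it with the built-in function. The vector x is invented for illustration; note that var and sd use the sample formula with an n - 1 denominator.

```r
x <- c(30, 35, 40, 45, 70)  # made-up ages, already sorted
n <- length(x)

m   <- sum(x) / n                 # mean: 220 / 5 = 44
med <- sort(x)[(n + 1) / 2]       # median for odd n: the middle value, 40
v   <- sum((x - m)^2) / (n - 1)   # sample variance: 970 / 4 = 242.5
s   <- sqrt(v)                    # standard deviation: square root of the variance

# Each hand computation agrees with the built-in function
all.equal(m, mean(x))
all.equal(med, median(x))
all.equal(v, var(x))
all.equal(s, sd(x))

quantile(x)  # default cut points: 0%, 25%, 50%, 75%, 100%
```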
 
You can also get a statistical summary by running summary() on either a single column or the complete dataset:
> summary(acs_or)
 

To perform the analysis on a small range of rows from a huge data set:

> s <- acs_or[1:100,]


This creates a new dataset containing rows 1 to 100 and all columns, and stores it in s.
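
The first 100 rows are not necessarily representative of the whole data set, so a random sample of rows is sometimes preferable. A minimal sketch of both approaches on a made-up data frame (demo, its columns and the seed value are assumptions for illustration):

```r
set.seed(1)  # fixes the random number generator so the draw is repeatable

# Made-up data frame with 1000 rows
demo <- data.frame(id = 1:1000, value = rnorm(1000))

first100  <- demo[1:100, ]      # rows 1 to 100, all columns
first100b <- head(demo, 100)    # the same rows, using head()

# 100 distinct row numbers drawn at random, then used as a row index
rand100 <- demo[sample(nrow(demo), 100), ]

nrow(first100)
nrow(rand100)
```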

 

6. Plotting data distribution:
A well-liked feature of R Studio is its built-in plot viewer for R. Any data set imported into R can be visualized using plot and several other R functions. For example:
a.       To create a scatter plot of a data set, you can run the following command in the console:
> plot(x = s$age_husband, y = s$age_wife, type = 'p')
 
Here s is the subset of the original dataset created in the previous step, and type = 'p' sets the plot type to points.
 
You can also draw lines instead of points by changing the type argument to 'l':

> plot(x = s$age_husband, y = s$age_wife, type = 'l')
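
One thing to keep in mind with type = 'l' is that plot joins the points in the order of the rows, so unsorted data produces a zigzag. A minimal sketch, using made-up ages, that sorts the rows by the x column before drawing the line (the axis labels and the pdf(NULL) wrapper are additions for this example):

```r
# Made-up ages in no particular order; column names follow the tutorial
s <- data.frame(age_husband = c(52, 29, 41, 35),
                age_wife    = c(54, 27, 41, 33))

# Sort the rows by the x variable so the line moves left to right
s_sorted <- s[order(s$age_husband), ]

pdf(NULL)  # null graphics device so the sketch also runs outside R Studio;
           # remove this line and dev.off() to see the plot in the Plots pane
plot(x = s_sorted$age_husband, y = s_sorted$age_wife, type = 'l',
     xlab = "Age of husband", ylab = "Age of wife")
dev.off()
```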
  
           

 
b.      To draw a histogram of a column, you can run the command:
> hist(acs_or$number_children)
 
 
               
c.       For bar plots, run the following set of commands:
> counts <- table(acs_or$bedrooms)
> barplot(counts, main = "BedRooms Distribution", xlab = "Number of BedRooms")
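
It can help to look at what table returns before handing it to barplot: one count per distinct value, which barplot then draws as one bar each. A minimal sketch on a made-up vector (bedrooms and its values are invented for illustration):

```r
bedrooms <- c(2, 3, 3, 4, 3, 2)  # made-up bedroom counts for six households

counts <- table(bedrooms)  # tallies how often each distinct value occurs
counts
names(counts)              # the distinct values, stored as labels
counts[["3"]]              # households with 3 bedrooms: 3

# Proportions instead of raw counts; these add up to 1
props <- prop.table(counts)
props
```

Passing props instead of counts to barplot would draw the same bars on a 0 to 1 scale.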
 
 
 
 
I hope this gives you a basic idea of how to do simple statistics in R.
