# Using R to Find Correlation Between Salaries and Companies' Ratings Author: Ilan Dan-Gur
Date of publication: January 4th, 2021

I used a dataset from kaggle  containing 2253 job advertisements for Data Analysts. According to the author of the data file, the data was taken from the GlassDoor website (www.glassdoor.com). For each job, the dataset includes 16 variables such as job description, company name, company rating, industry, and also a range of GlassDoor estimated salaries for each type of job.

My goal was to find what type of correlation (if any) exists between the ratings of companies and the estimated salaries for jobs offered by those companies. The analysis was done using R (a language for statistical computing and graphics).

## 1. Importing Data

I began by importing the CSV data from a file (DataAnalyst.csv ) and saving it to a data frame:

``` data_analyst_raw <- read.csv("DataAnalyst.csv", stringsAsFactors = FALSE) ```

First, let's check the structure of the dataset:

``` str(data_analyst_raw) ```

As discussed above, we have 2253 "observations" (i.e. job advertisements) and 16 variables (company name, company rating, salary estimates, etc).

## 2. Data Exploration and Cleaning

We are going to check that some variables have correct (i.e. meaningful) values, and remove data that is not useful for our analysis.

### 2.1 Duplicates

Let's check that there are no duplicate job descriptions in the dataset:

``` sum(duplicated(data_analyst_raw\$Job.Description)) ```

The function returns "0", meaning there are no duplications, so we can move to the next stage.

### 2.2 Ratings

#### 2.2.1 Rating range and unavailable data

Let's check the range of ratings/reviews:

``` summary(data_analyst_raw\$Rating) ```

The result is:

```
Min. -1.000
1st Qu 3.100
Median 3.600
Mean 3.161
3rd Qu. 4.000
Max. 5.000

```

Since we can ignore ratings of "-1" for the purpose of this analysis, we are going to delete these lines from the dataset. Let's check how many lines (out of 2253) have a "-1" rating:

``` sum(data_analyst_raw\$Rating==-1) ```

There are 272 lines with no rating, so let's remove them from the dataset:

``` data_analyst_clean <- data_analyst_raw[!data_analyst_raw\$Rating==-1,] ```

Now let's check again the range of ratings/reviews:

``` summary(data_analyst_clean\$Rating) ```

The result is:

```
Min. 1.000
1st Qu 3.300
Median 3.700
Mean 3.732
3rd Qu. 4.100
Max. 5.000

```

So it appears that companies have a rating between 1.0 and 5.0.

#### 2.2.2. Rating variability

Let's check if companies' ratings are consistent across the dataset. We expect that if a certain company has 20 job listings in the dataset, the ratings of the company would be identical across the dataset.

Let's check how many companies are in the dataset:

``` length(unique(data_analyst_clean\$Company.Name)) ```

There are 1295 companies!

Let's look at companies that have the most job listings in the dataset:

``` head(sort(table(data_analyst_clean\$Company.Name),decreasing = TRUE)) ```

The first four results are:

```
58 Staffigo Technical Services, LLC\n5.0
22 Diverse Lynx\n3.9
19 Kforce\n4.1
19 Lorven Technologies Inc\n4.0

```

The format of the results is somewhat unexpected, because each company name appears to be followed by a "new line" character (i.e. "\n"), and then a number which might be a rating.

To get to the bottom of it, let's investigate the first company name ("Staffigo Technical Services, LLC\n5.0") by creating a subset dataframe for all company names in the dataset that contain the word "Staffigo", and then checking how many job listings there are, and the ratings:

``````
Staffigo <- data_analyst_clean[grepl("Staffigo",data_analyst_clean\$Company.Name,ignore.case = TRUE),]
str(Staffigo)
summary(Staffigo\$Rating)
```
```

We find that there are 58 job listings from company names that contain the word "Staffigo" in the dataset, which is exactly what we would have expected based on the list above, and they all have a company rating of "5".

We have similar findings when creating a subset dataframe for all company names in the dataset that contain the word "Diverse" (the second company name in the list above).

To conclude this subsection, it does appear, based on the two companies with the most job listings in the dataset, that companies have consistent (i.e. single) rating in the dataset.

#### 2.2.3 Rating plots

Let's plot the ratings to see if anything requires our attention.

First, we will plot the ratings in the same order they appear in the dataset:

``` plot(x = 1:1981,y = data_analyst_clean\$Rating,main = "Company Ratings in Job Ads",xlab = "Job Ads",ylab = "Rating") ``` Plot 1: Companies' ratings in job ads

Second, we will plot the histogram of the ratings:

``` hist(data_analyst_clean\$Rating, main = "Histogram of Company Ratings in Job Ads", xlab = "Rating") ``` Plot 2: Histogram of companies' ratings

There doesn't appear to be anything glaringly out of the ordinary in the two plots above, so we move to the next stage.

### 2.3 Salary Estimates

#### 2.3.1 Salary range and unavailable data

Let's check the structure of the salary estimate variable:

``` str(data_analyst_clean\$Salary.Estimate) ```

We learn that it's a character variable, with example like: "\$37K-\$66K (Glassdoor est.)".

To find all the unique salary estimates, we use:

``` unique(data_analyst_clean\$Salary.Estimate) ```

and we find that there are 90 unique salary ranges (e.g. "\$57K-\$100K (Glassdoor est.)", "\$78K-\$104K (Glassdoor est.)", including "-1".

Since we can ignore salary estimates of "-1" for the purpose of this analysis, we are going to use the same procedure as section 2.2.1 (Rating range and unavailable data, above) to delete these lines from the dataset. Let's check how many lines have a "-1" salary:

``` sum(data_analyst_clean\$Salary.Estimate=="-1") ```

There is only one line with a salary estimate of "-1", so let's remove it from the dataset:

``` data_analyst_clean <- data_analyst_clean[!data_analyst_clean\$Salary.Estimate=="-1",] ```

Next, we want to find, for each job, the average estimated salary. For example, in the expression "\$78K-\$104K (Glassdoor est.)", the average estimated salary is 91,000. To do that we need to extract the two numbers (78 and 104 in the example above) and calculate their average.

Here is the R code for all that:

``````
install.packages("stringr")
library(stringr)
x <- str_extract_all(data_analyst_clean[,3],"\\(?[0-9]+\\)?")
x[] <- lapply(x, function(y) as.numeric(y))
y <- lapply(x, mean)
data_analyst_clean\$Salary <- as.numeric(unlist(y))
```
```

Let's check the salaries range:

``` summary(data_analyst_clean\$Salary) ```

The result is:

```
Min. 33.50
1st Qu 58.00
Median 68.50
Mean 72.02
3rd Qu. 80.50
Max. 150.00

```

#### 2.3.2 Correlation between salaries and job titles

It may be worthwhile to investigate the correlation between salaries and job titles. For example, if we find that salaries for "Senior Data Analyst" are generally lower than those for "Junior Data Analyst" it may mean that the salary information in the dataset cannot be trusted, and therefore our analysis would be meaningless.

Let's find the ten most common job titles:

``` sort(table(data_analyst_clean\$Job.Title),decreasing = TRUE)[1:10] ```

Here are the results:

```
349 Data Analyst
79 Senior Data Analyst
30 Junior Data Analyst
18 Sr. Data Analyst
17 Data Analyst Junior
16 Data Analyst II
16 Data Quality Analyst
13 Data Governance Analyst

```

Let's create subset dataframes for Senior Data Analyst, Data Analyst, and Junior Data Analyst.

``````
senior_analyst <- subset(data_analyst_clean, data_analyst_clean\$Job.Title=="Senior Data Analyst" | data_analyst_clean\$Job.Title=="Sr. Data Analyst")

analyst <- subset(data_analyst_clean, data_analyst_clean\$Job.Title=="Data Analyst")

junior_analyst <- subset(data_analyst_clean, data_analyst_clean\$Job.Title=="Junior Data Analyst" | data_analyst_clean\$Job.Title=="Data Analyst Junior")
```
```

Now let's check the range of salaries for a senior data analyst:

``` round(summary(senior_analyst\$Salary)) ```

The result (in \$1000s) is:

```
Min. 36
1st Qu 59
Median 72
Mean 72
3rd Qu. 80
Max. 150

```

We repeat calculating the summary() for "data analyst" and for "junior data analyst", and the results are summarized in the table below:

Senior Analyst Analyst Junior Analyst
Min 36 34 36
Median 72 66 59
Mean 72 73 63
Max 150 150 126
Table 1: Salary Range (in \$1000s) vs. Job Titles

Since salaries can vary widely between companies and industries, and since job titles in and of themselves (i.e. without the job description) can be misleading, and since the salaries are only GlassDoor estimates, the data in Table 1 above seems plausible.

#### 2.3.3 Salary plots

Let's plot the estimated salaries to see if anything requires our attention.

First, we will plot the estimated salaries in the same order they appear in the dataset:

``` plot(x = 1:1980,y = data_analyst_clean\$Salary,main = "Estimated Salaries in Job Ads",xlab = "Job Ads",ylab = "Salary [\$1000s]") ``` Plot 3: Estimated salaries in job ads

Second, we will plot the histogram of the estimated salaries:

``` hist(data_analyst_clean\$Salary, main = "Histogram of Estimated Salaries in Job Ads", xlab = "Estimated Salary [\$1000s]") ``` Plot 4: Histogram of estimated salaries

Looking at plot 3, it seems that the salaries appear in the dataset in "clumps" or "clusters" of similar (or identical) salaries. Let's plot only the first 500 salaries in the dataset:

``` plot(x = 1:500,y = data_analyst_clean\$Salary[1:500],main = "Estimated Salaries in Job Ads",xlab = "Job Ads",ylab = "Salary [\$1000s]") ``` Plot 5: Estimated salaries for the first 500 job ads in the dataset

Other than the fact that salaries appear in the dataset in "clusters" of similar (or identical) values, there doesn't appear to be anything glaringly out of the ordinary in the plots above, so we move to the next stage.

### 2.4 Job Titles

As the final stage of data cleaning, let's check that all the job titles make sense.

We can look at all the unique job titles with:

``` unique(data_analyst_clean\$Job.Title) ```

We find that there are 1148 unique job titles, such as:

• - "Application Data Analyst"
• - "Senior Data Analyst (Corporate Audit)"
• - "Senior Quality Data Analyst"
• - "Game Data Analyst"
• - "HR Data Analyst II"

We can easily check if all the titles contain the words "Data" and "Analyst":

``````
sum(grepl("data",data_analyst_clean\$Job.Title,ignore.case = TRUE))

sum(grepl("analyst",data_analyst_clean\$Job.Title,ignore.case = TRUE))
```
```

In both cases the result is 1980, confirming that all the job titles (1980) indeed contain these words.

## 3. Analysis and Results

To get an idea of the type of correlation that might exist between the estimated salaries and company ratings we can look at the scatterplot from the dataframes for Senior Data Analyst, Data Analyst, and Junior Data Analyst (we have already created these dataframes in section 2.3.2 above).

### 3.1 Senior Data Analyst

``` plot(x = senior_analyst\$Salary,y = senior_analyst\$Rating,main = "Senior Data Analyst, Company Rating vs. Salary",xlab = "Salary [\$1000s]",ylab = "Rating") ``` Plot 6: Senior data analyst, companies' ratings vs. salaries

Looking at the plot, there seems to be little hope of correlation between the two variables, given the dataset we have used for the analysis.

Just to put a number on the graph, we can calculate the Pearson (i.e. linear) correlation coefficient:

``` cor(senior_analyst\$Salary,senior_analyst\$Rating) ```

The result is -0.064, indicating indeed poor linear correlation between estimated salaries and company ratings.

It would be interesting to know how many job ads (i.e. data point) we have, as well as how many companies are part of this graph:

``````
str(senior_analyst)
length(unique(senior_analyst\$Company.Name))
```
```

There are 97 data points (i.e. job ads) and 93 unique companies in our dataframe for a Senior Data Analyst.

We are now going to repeat a similar analysis for a Data Analyst, and a Junior Data Analyst, but present only the R code and the results:

### 3.2 Data Analyst

``` plot(x = analyst\$Salary,y = analyst\$Rating,main = "Data Analyst, Company Rating vs. Salary",xlab = "Salary [\$1000s]",ylab = "Rating") ``` Plot 7: Data analyst, companies' ratings vs. salaries

``` cor(analyst\$Salary,analyst\$Rating) ```

The result is 0.10, indicating weak linear correlation between estimated salaries and company ratings.

``````
str(analyst)
length(unique(analyst\$Company.Name))
```
```

There are 349 data points (i.e. job ads) and 289 unique companies in our dataframe for a Data Analyst.

### 3.3 Junior Data Analyst

``` plot(x = junior_analyst\$Salary,y = junior_analyst\$Rating,main = "Junior Data Analyst, Company Rating vs. Salary",xlab = "Salary [\$1000s]",ylab = "Rating") ``` Plot 8: Junior data analyst, companies' ratings vs. salaries

``` cor(junior_analyst\$Salary,junior_analyst\$Rating) ```

The result is 0.018, indicating poor linear correlation between estimated salaries and company ratings.

``````
str(junior_analyst)
length(unique(junior_analyst\$Company.Name))
```
```

There are 47 data points (i.e. job ads) and 13 unique companies in our dataframe for a Data Analyst.

## 4. Conclusion

Investigating 493 job advertisements with job titles of Senior Data Analyst, Data Analyst, and Junior Data Analyst, we found weak linear correlation (r ≤ 0.10) between estimated salaries and company ratings.