Wednesday, 12 April 2023

How to extract data from a PDF file in R






PDF files are widely used to share documents because they preserve the original formatting and layout. However, extracting data from a PDF file can be challenging, especially if the file contains tables, images, or complex layouts. This blog post shows how to use R to extract data from a PDF file and convert it into a data frame or a tibble.

 

There are several R packages that can help us with this task, such as pdftools, tabulizer, and magick. We will use pdftools as an example, but you can also explore the other options. pdftools provides various tools to read, render, and analyze PDF files in R. You can install it from CRAN using the following command:

 

install.packages("pdftools")


To extract data from a PDF file, we first need to read the file into R using the pdf_text() function. This function returns a character vector whose length equals the number of pages in the PDF file. Each element of the vector contains the text of one page. For example, let's say we have a PDF file called "example.pdf" that contains three pages of text. We can read it into R using:


library(pdftools)
text <- pdf_text("example.pdf")
text


This will output something like:


[1] "This is the first page of the pdf file.\nIt contains some text and a table.\n\nName\tAge\tGender\nAlice\t25\tF\nBob\t30\tM\nCharlie\t35\tM\n"
[2] "This is the second page of the pdf file.\nIt contains some more text and an image.\n\nHere is an image of a cat:\n\n[image]\n"
[3] "This is the third and final page of the pdf file.\nIt contains some text and a bullet list.\n\nSome advantages of using R are:\n- It is free and open source\n- It has a large and active community\n- It supports various types of data analysis and visualization\n"
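If you do not have a suitable PDF at hand, the sketch below rebuilds the same three-element vector by hand and inspects it. The name sample_text is only an illustration; its contents are copied from the sample output above, and a real call to pdf_text() on your own file will return different text.

```r
# Stand-in for pdf_text("example.pdf"): the sample pages shown above, typed by hand
sample_text <- c(
  "This is the first page of the pdf file.\nIt contains some text and a table.\n\nName\tAge\tGender\nAlice\t25\tF\nBob\t30\tM\nCharlie\t35\tM\n",
  "This is the second page of the pdf file.\nIt contains some more text and an image.\n\nHere is an image of a cat:\n\n[image]\n",
  "This is the third and final page of the pdf file.\nIt contains some text and a bullet list.\n\nSome advantages of using R are:\n- It is free and open source\n- It has a large and active community\n- It supports various types of data analysis and visualization\n"
)

# The vector has one element per page
n_pages <- length(sample_text)
n_pages

# cat() interprets the \n and \t escapes, so the page prints as laid out text
cat(sample_text[1])

# Number of lines on each page
lines_per_page <- lengths(strsplit(sample_text, "\n", fixed = TRUE))
lines_per_page
```

To follow the rest of the post without a file, assign sample_text to text.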


As you can see, the text is extracted as plain text, without any formatting or structure. This means that we need to do some additional processing to extract the data we want. For example, if we want to extract the table from the first page, we need to split the text by newline characters (`\n`) and tab characters (`\t`) to create a matrix. Then we can convert the matrix into a data frame with as.data.frame(), or into a tibble with as_tibble() from the tibble package. For example:


# Split the text of the first page by newline characters
lines <- strsplit(text[1], "\n")[[1]]
# Keep only the table rows, i.e. the lines that contain a tab character
# (this drops the two prose lines and the empty line)
lines <- lines[grepl("\t", lines, fixed = TRUE)]
# Split each line by tab characters
cells <- strsplit(lines, "\t", fixed = TRUE)
# Create a character matrix from the list of cells
mat <- do.call(rbind, cells)
# Convert the matrix into a data frame, leaving out the header row
df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
# Set the column names from the first row
colnames(df) <- mat[1, ]

# Shortcut: convert the character lines into a data frame in one step.
# read.table() also guesses the column types, so Age is read as an integer.
char2DF <- function(input) {
  read.table(text = input, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
}

df <- char2DF(lines)

# Convert the columns to appropriate types
df$Age <- as.integer(df$Age)       # Age
df$Gender <- as.factor(df$Gender)  # Gender

# View the data frame
df

This will output something like:

     Name Age Gender
1   Alice  25      F
2     Bob  30      M
3 Charlie  35      M
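If you need to do this for several pages, the manual steps can be wrapped in a small function. The sketch below is one possible way to do it; extract_tab_table is a hypothetical name, and it assumes that every table row on the page contains a tab and that the first such row is the header.

```r
# Hypothetical helper: parse the tab-separated lines of one page into a data frame
extract_tab_table <- function(page_text) {
  # Split the page into lines
  page_lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  # Keep only the lines that contain a tab (assumed to be table rows)
  table_lines <- page_lines[grepl("\t", page_lines, fixed = TRUE)]
  if (length(table_lines) < 2) {
    stop("no tab-separated table found on this page")
  }
  # Let read.table() handle the header and guess the column types
  read.table(text = table_lines, header = TRUE, sep = "\t",
             stringsAsFactors = FALSE, strip.white = TRUE)
}

# Sample page copied from the output shown earlier
page1 <- "This is the first page of the pdf file.\nIt contains some text and a table.\n\nName\tAge\tGender\nAlice\t25\tF\nBob\t30\tM\nCharlie\t35\tM\n"

people <- extract_tab_table(page1)
str(people)
```

With a real file, the same function could be applied to every page, for example with lapply(text, extract_tab_table), as long as each page contains such a table; pages without one will stop with the error above.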

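Alternatively, we can use the read_tsv() function from the readr package to parse the tab-separated table lines. This function takes the column names from the first line and guesses the column types from the data. For example:

library(readr)
# Keep only the tab-separated table lines from the first page
table_lines <- grep("\t", strsplit(text[1], "\n")[[1]], fixed = TRUE, value = TRUE)
# A string that contains newlines is treated as literal data, not as a file path
df <- read_tsv(paste(table_lines, collapse = "\n"))
# View the data frame
df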

This will output something like:


# A tibble: 3 x 3
  Name      Age Gender
  <chr>   <dbl> <chr>
1 Alice      25 F
2 Bob        30 M
3 Charlie    35 M
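Tab-separated text is the convenient case. With real files, pdf_text() often pads table columns with runs of spaces to mimic the page layout. The sketch below handles that situation under two assumptions: columns are separated by at least two spaces, and no cell contains such a run itself. The page string is made up for illustration and does not come from a real file.

```r
# Assumed example of space-padded output (not taken from a real PDF)
page <- "Name        Age   Gender\nAlice        25   F\nBob          30   M\nCharlie      35   M\n"

# Split into lines, trim them, and drop the empty ones
rows <- trimws(strsplit(page, "\n", fixed = TRUE)[[1]])
rows <- rows[nzchar(rows)]

# Split each line on runs of two or more whitespace characters
cells <- strsplit(rows, "\\s{2,}")

# Every row must have the same number of cells, otherwise the layout assumption is wrong
stopifnot(length(unique(lengths(cells))) == 1)

# Build the data frame, using the first row as the header
mat <- do.call(rbind, cells)
spaced_df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(spaced_df) <- mat[1, ]

# Guess the column types; as.is = TRUE keeps text columns as character
spaced_df <- type.convert(spaced_df, as.is = TRUE)
spaced_df
```

Splitting on two or more spaces, rather than on any whitespace, keeps cells that contain single spaces (such as full names) in one piece.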


To extract data from other pages or other types of elements in the PDF file, we need to use different methods or packages. For example, to get the image on the second page, we can use the pdf_convert() function from pdftools. Note that it renders the whole page as an image file; it does not isolate the embedded picture.

pdf_convert("example.pdf", format = "png", pages = 2, filenames = "output.png")

 

This saves a rendering of page 2 as output.png in the current working directory.


Now let’s retrieve the bullet point list from page 3 of the document.

# Split the text of the third page into lines
page_lines <- strsplit(text[3], "\n")[[1]]

# Keep only the lines that start with a bullet marker ("-" here, or "•")
bullet_points <- grep("^\\s*(-|\u2022)", page_lines, value = TRUE)

# Strip the bullet marker and any surrounding whitespace
bullet_points <- trimws(sub("^\\s*(-|\u2022)\\s*", "", bullet_points))

# Print the bullet points
cat(bullet_points, sep = "\n")

This gives the following output:

It is free and open source
It has a large and active community
It supports various types of data analysis and visualization

Have fun with R!
