PDF files are widely used to share documents that preserve the original
formatting and layout. However, extracting data from a PDF file can be
challenging, especially if the file contains tables, images, or complex
layouts. This blog post shows how to use R to extract data from a PDF file
and convert it into a data frame or a tibble.
There are several R packages that can help with this task, such as pdftools,
tabulizer, and magick. We will use pdftools as an example, but you can also
explore the other options. pdftools provides tools for reading and
manipulating PDF files in R. You can install it from CRAN using the
following command:
install.packages("pdftools")
To extract data from a PDF file, we first read the file into R using the
pdf_text() function. This function returns a character vector whose length
equals the number of pages in the PDF file. Each element of the vector
contains the text of one page. For example, suppose we have a PDF file called
"example.pdf" that contains three pages of text. We can read it into R using:
library(pdftools)
text <- pdf_text("example.pdf")
text
This will output something like:
[1] "This is the first page of the pdf file.\nIt contains some text and a table.\n\nName\tAge\tGender\nAlice\t25\tF\nBob\t30\tM\nCharlie\t35\tM\n"
[2] "This is the second page of the pdf file.\nIt contains some more text and an image.\n\nHere is an image of a cat:\n\n[image]\n"
[3] "This is the third and final page of the pdf file.\nIt contains some text and a bullet list.\n\nSome advantages of using R are:\n- It is free and open source\n- It has a large and active community\n- It supports various types of data analysis and visualization\n"
As you can see, the text is extracted as plain text, without any formatting
or structure. This means that we need to do some additional processing to
extract the data we want. For example, to extract the table from the first
page, we split the text by newline characters (`\n`) and the table rows by
tab characters (`\t`) to create a matrix. Then we can convert the matrix into
a data frame with as.data.frame() or into a tibble with as_tibble(). For
example:
# Split the text of the first page by newline characters
lines <- strsplit(text[1], "\n")[[1]]
# Keep only the table rows, i.e. the lines that contain a tab
# (this also drops the empty lines and the introductory sentences)
lines <- lines[grepl("\t", lines, fixed = TRUE)]
# Split each line by tab characters
cells <- strsplit(lines, "\t")
# Create a matrix from the list of cells
mat <- do.call(rbind, cells)
# Convert the matrix into a data frame, leaving out the header row
df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
# Set the column names from the header row
colnames(df) <- mat[1, ]
# Convert the columns to appropriate types
df$Age <- as.integer(df$Age)      # Age
df$Gender <- as.factor(df$Gender) # Gender
# View the data frame
df
This will output something like:
     Name Age Gender
1   Alice  25      F
2     Bob  30      M
3 Charlie  35      M
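As a more compact variant of the steps above, base R's read.delim() can parse the tab-separated rows in one call through its text argument. This is a minimal sketch that assumes the table rows are the only lines containing a tab; page1 is a hypothetical stand-in for text[1], so the example runs without a PDF file.

```r
# Hypothetical stand-in for text[1], copied from the sample output above
page1 <- "This is the first page of the pdf file.\nIt contains some text and a table.\n\nName\tAge\tGender\nAlice\t25\tF\nBob\t30\tM\nCharlie\t35\tM\n"

# Assumption: only the table rows contain a tab character
page_lines <- strsplit(page1, "\n", fixed = TRUE)[[1]]
table_lines <- page_lines[grepl("\t", page_lines, fixed = TRUE)]

# read.delim() expects tab-separated input and treats the first row as the header
people <- read.delim(text = table_lines, stringsAsFactors = FALSE)
str(people)
```

Because read.delim() guesses the column types itself, Age arrives as an integer column without a separate conversion step.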
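Alternatively, we can use the read_table() function from the readr package to
read the table from the text. This function detects the column names and
types automatically. It expects only the table itself, so the introductory
lines of the page have to be dropped first. For example:
library(readr)
# Keep only the table rows (the lines that contain a tab) from the first page
lines <- strsplit(text[1], "\n")[[1]]
table_text <- paste(lines[grepl("\t", lines, fixed = TRUE)], collapse = "\n")
# Read the table from the remaining text
df <- read_table(table_text)
# View the data frame
df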
This will output something like:
# A tibble: 3 x 3
  Name      Age Gender
  <chr>   <dbl> <chr>
1 Alice      25 F
2 Bob        30 M
3 Charlie    35 M
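The sample output above uses tab characters, but pdf_text() often returns table columns padded with runs of spaces instead. Here is a hedged sketch for that case, assuming that columns are separated by at least two spaces and that no cell contains a double space. The padded_page string is made up for illustration and is not output from a real file.

```r
# Hypothetical page text with space-padded columns (made up for illustration)
padded_page <- "Name       Age   Gender\nAlice      25    F\nBob        30    M\nCharlie    35    M\n"

# Split into lines, trim them, and drop the empty ones
rows <- trimws(strsplit(padded_page, "\n", fixed = TRUE)[[1]])
rows <- rows[nzchar(rows)]

# Assumption: two or more whitespace characters mark a column boundary
cells <- strsplit(rows, "\\s{2,}")
mat <- do.call(rbind, cells)

# The first row holds the header; the rest are data rows
padded_df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
colnames(padded_df) <- mat[1, ]
padded_df$Age <- as.integer(padded_df$Age)
padded_df
```

This only works when every row splits into the same number of cells, so it is worth checking lengths(cells) before building the matrix on a real document.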
To extract data from other pages or other types of elements in the PDF file,
we need to use different methods or packages. For example, to get the image
on page 2, we can use the pdf_convert() function from pdftools, which renders
a page of the PDF file to an image file. Note that this renders the whole
page rather than extracting only the embedded picture:
pdf_convert("example.pdf", format = "png", pages = 2, filenames = "output.png")
This saves page 2 as output.png in the working directory.
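To render several pages at once, pdf_convert() accepts a vector of page numbers together with one file name per page. The sketch below rests on a few assumptions: the page range, the page_%d.png naming scheme, and the dpi value are arbitrary choices, and the conversion is skipped when pdftools or example.pdf is not available.

```r
# Hypothetical choice of pages and output file names
pages_to_render <- 1:3
png_names <- sprintf("page_%d.png", pages_to_render)

# Only convert when the package and the sample file are both available
if (requireNamespace("pdftools", quietly = TRUE) && file.exists("example.pdf")) {
  pdftools::pdf_convert("example.pdf", format = "png",
                        pages = pages_to_render, filenames = png_names,
                        dpi = 150)
}

# The file names that the conversion writes to
png_names
```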
Now let’s retrieve the bullet point list from page 3 of the document.
# Split the text of the third page into lines
lines <- strsplit(text[3], "\n")[[1]]
# Keep only the bullet lines; they start with "-" here, and "\u2022" also
# covers lists that use the bullet character
bullet_points <- lines[grepl("^\\s*(-|\u2022)\\s*", lines)]
# Print the bullet points
cat(bullet_points, sep = "\n")
This gives the following output:
- It is free and open source
- It has a large and active community
- It supports various types of data analysis and visualization
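If the bullet items are needed as data rather than as printed text, a small follow-up step can strip the markers and store the items in a data frame. This is only a sketch: page3 is a hypothetical stand-in for text[3], and the item column name is an arbitrary choice.

```r
# Hypothetical stand-in for text[3], copied from the sample output above
page3 <- "This is the third and final page of the pdf file.\nIt contains some text and a bullet list.\n\nSome advantages of using R are:\n- It is free and open source\n- It has a large and active community\n- It supports various types of data analysis and visualization\n"

page_lines <- strsplit(page3, "\n", fixed = TRUE)[[1]]

# Keep the lines that start with "-" or a bullet character, then drop the marker
marker <- "^\\s*(-|\u2022)\\s*"
items <- sub(marker, "", page_lines[grepl(marker, page_lines)])

# One row per bullet item
advantages <- data.frame(item = items, stringsAsFactors = FALSE)
advantages
```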
Have fun with R!