Elixir HTTPoison Floki scraper

Written by Mahabub I.

GitHub link to the repo: https://github.com/prio101/gurdian_scrapper

I started working with Elixir a short time ago. And as people say, there is no better way to sharpen your skills than building one or two small side projects.

So, I started to think about writing a small scraper script in Elixir, specifically for this resource page on The Guardian. They tried to crowdsource quotes and book data, each tied to a time of day.

For example,

00:00:00h - “He is certain he heard footsteps: they come nearer, and then die away. The ray of light beneath his door is extinguished. It is midnight; someone has turned out the gas; the last servant has gone to bed, and he must lie all night in agony with no one to bring him any help.”

Swann’s Way Marcel Proust - @westrowc

So, first of all I created a Mix project for Elixir with the command,

> mix new gurdian_scrapper

This created a ready-to-use Mix project at that path. After that I added HTTPoison and Floki to mix.exs and ran

> mix deps.get

This command fetched the packages into the project's local directory for use.
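For reference, the dependency entries in mix.exs might look roughly like the fragment below. The version requirements are assumptions for illustration and are not taken from the repo, so adjust them to whatever is current.

```elixir
# Fragment of mix.exs (inside the project module).
# The version requirements are assumed, not copied from the original project.
defp deps do
  [
    {:httpoison, "~> 1.0"},
    {:floki, "~> 0.20"}
  ]
end
```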

From here, most of what I did was use HTTPoison to request the Guardian page and get the response, then parse the data with Floki and save it to a new text file.

Broken into three steps, it looks like this. Fetching the response,

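def start_and_fetch_response do
  IO.puts("starting")
  HTTPoison.start()
  # Use a double-quoted (binary) string for the URL rather than a charlist
  url = "https://www.theguardian.com/books/table/2011/apr/21/literary-clock"

  case HTTPoison.get(url) do
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
      file = File.open!("data_save_scrapped.txt", [:write, :utf8])
      parse_by_id(body, 0, file)

    {:ok, %HTTPoison.Response{status_code: 404}} ->
      IO.puts("Not found :(")

    {:ok, %HTTPoison.Response{status_code: status_code}} ->
      # Any other status would otherwise raise a CaseClauseError
      IO.puts("Unexpected status: #{status_code}")

    {:error, %HTTPoison.Error{reason: reason}} ->
      IO.inspect(reason)
  end
end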

Formatting the response with Floki,

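# Walk the table rows 0..934, writing the cells of each row to the file
def parse_by_id(body, n, file) when n <= 934 do
  by_td_data(body, n, file)
  parse_by_id(body, n + 1, file)
end

# Past the last row: close the file and stop recursing
def parse_by_id(_body, _n, file) do
  File.close(file)
end

# Each row i has five cells (columns 0..4), addressed by element id
def by_td_data(body, i, file) do
  Enum.each(0..4, fn j ->
    text =
      body
      |> Floki.find("#table-cell-7619-#{i}-#{j}")
      |> Floki.text()

    save_into_txt_file(text, file)
  end)
end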
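The lookup above hands the raw HTML string to Floki.find for every cell. As far as I know, newer Floki releases prefer an already parsed document, and parsing once also avoids repeating that work for each cell. Below is a minimal sketch of that variation, assuming Floki 0.24 or later (where Floki.parse_document/1 exists). The module name ParseOnceSketch, the helper row_texts/2, the sample HTML, its column order, and the pinned version are all made up for illustration.

```elixir
# Standalone script sketch. The Floki version requirement is an assumption.
Mix.install([{:floki, "~> 0.36"}])

defmodule ParseOnceSketch do
  # Hypothetical helper: return the text of the five cells of row `i`
  # from a document that has already been parsed.
  def row_texts(document, i) do
    Enum.map(0..4, fn j ->
      document
      |> Floki.find("#table-cell-7619-#{i}-#{j}")
      |> Floki.text()
    end)
  end
end

# Tiny stand-in for the real page. The column order here is assumed.
html = """
<table><tr>
<td id="table-cell-7619-0-0">00:00:00h</td>
<td id="table-cell-7619-0-1">It is midnight</td>
<td id="table-cell-7619-0-2">Swann's Way</td>
<td id="table-cell-7619-0-3">Marcel Proust</td>
<td id="table-cell-7619-0-4">@westrowc</td>
</tr></table>
"""

# Parse the HTML once, then query the parsed tree as often as needed.
{:ok, document} = Floki.parse_document(html)
IO.inspect(ParseOnceSketch.row_texts(document, 0))
```

In the scraper itself, the parsed document would be passed to parse_by_id in place of body.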

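And after that I just saved it to the file,

def save_into_txt_file(data, file) do
  # The file was opened with :utf8, so use IO.write rather than IO.binwrite,
  # which is meant for raw devices opened without an encoding
  IO.write(file, data <> ", \n")
end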
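One detail worth checking here: the file is opened with the :utf8 option, and the Elixir docs recommend IO.write for devices with an encoding, keeping IO.binwrite for raw devices. The quotes on the page contain non-ASCII characters such as curly quotation marks, so this matters. Here is a small sketch of a round trip through a UTF-8 file, where the temporary file name is made up for illustration:

```elixir
# Hypothetical temp file, used only for this sketch.
path = Path.join(System.tmp_dir!(), "literary_clock_sketch.txt")

# Open the file as a UTF-8 device, as the scraper does.
file = File.open!(path, [:write, :utf8])

# IO.write sends character data, which the device encodes as UTF-8.
IO.write(file, "“It is midnight”" <> ", \n")
File.close(file)

# Reading the file back returns the same UTF-8 text.
IO.inspect(File.read!(path))
```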

After this I just ran,

> iex -S mix
iex> GurdianScrapper.start_and_fetch_response

This saves the scraped data to the file in the project directory.

So, that's all for today. It's fun to work with Elixir. Maybe next time I will be able to share something more with you.

Written by: Mahabub Islam

Done at: Jul 22, 2018