top of page
Rosablanche

Web data import to Tableau

Updated: Mar 9, 2021

So I needed to reuse data from a table in a web page to support a visualisation in Tableau and I thought surely there's a way of achieving already available in standard.

Well, no...

The web data connector available in the list of standard connection doesn't do that: a simple data extraction from a webpage.

I did a bit of research and found on article on import.io that seemed to do just that. However, it didn't work with Tableau Public or Tableau 2020.1. I tried to create a new account with import.io but was informed that:

As of August 19th, 2020 we have suspended the ability to create new Import.io accounts.

So, back to square one.


Now, Tableau can connect to Google Sheets or upload a file in Excel or csv format; it should also be possible to extract the data from a web table to the formats above and then process it in Tableau.

So let's try to do just that.

 

I found this blog post:

It explains how to extract data from a web page with the importxml function in Google sheets.


I'm trying to extract the list of most popular websites and some metadata from Wikipedia (link here).

I'm interested in columns: site, domain, type, country.


Most useful info from the blog post above:

a tag with [1] means "only give me the first instance of <tag> inside <parent tag>." So, td/span/a[1] gives you the first link inside the <span> inside each <td>. In the same way, td/b[1] gives you the very first bold text inside each <td>

So I try to replicate this logic with my table, but I need to understand the xml structure first.

In Chrome, right-mouse click in a table cell and select "Inspect" to view the code behind.

I click on the first cell of the table that contains "Google Search and inspect it.


The syntax of the function is: IMPORTXML(url; xpath query). Both arguments are written between quotation marks " ".

The xpath query is written: "//<parent tag>/<child tag>[column index]".

Tags are written without the <>.

In this example, you see that the site name is the the first link <a> of the <td> within a <tr>. It's also the first <td> within a <tr>. So you could write the the xpath query two ways: "//tr/td/a[1]" or "//tr/td/a[1]".


To extract site, I type the formula below in cell A1:

=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td/a[1]")

The domain is the <td> number 2 within a <tr>. To extract domain, I type the formula below in cell A2:

=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td[2]")

The type is the <td> number 4 within a <tr>. To extract type, I type the formula below in cell A3:

=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td[4]")

and here is the result:


 

It is time now to import the data into Tableau.

Open up any version of Tableau and go to the Connect panel.



Click on Google sheets.


This will open a new tab in your browser where you'll be asked to authenticate (you need to have a google account btw).


Once authenticated, a popup appears with the list of your g-sheets.


Select the one you're interested in and click "Connect".


Tableau displays the list of worksheet available in your g-sheet.

Double-click on the one your interested in and start exploring your data !







 

I'm not really satisfied with this process, it's way too complicated but that's all I could find after a quick search. I will do more digging in the near future.


For the record, things are so much easier with PowerBI. 3 clicks and you're done!


Here's how:


In the "Get data" panel, select "Web"







Enter the website URL





PowerBI finds the table and identifies the columns









So much easier, no ?

11 views0 comments

Recent Posts

See All

How I aced the Tableau certification

From zero to hero in less than a month, my journey to the Tableau Desktop Specialist certification. Tableau is a very popular...

Comentarios


DSC02405-Edit.jpg

Rosablanche

A BLOG ABOUT ANALYTICS & DATA
bottom of page