So I needed to reuse data from a table in a web page to support a visualisation in Tableau and I thought surely there's a way of achieving already available in standard.
Well, no...
The web data connector available in the list of standard connection doesn't do that: a simple data extraction from a webpage.
I did a bit of research and found on article on import.io that seemed to do just that. However, it didn't work with Tableau Public or Tableau 2020.1. I tried to create a new account with import.io but was informed that:
As of August 19th, 2020 we have suspended the ability to create new Import.io accounts.
So, back to square one.
Now, Tableau can connect to Google Sheets or upload a file in Excel or csv format; it should also be possible to extract the data from a web table to the formats above and then process it in Tableau.
So let's try to do just that.
I found this blog post:
It explains how to extract data from a web page with the importxml function in Google sheets.
I'm trying to extract the list of most popular websites and some metadata from Wikipedia (link here).
I'm interested in columns: site, domain, type, country.
Most useful info from the blog post above:
a tag with [1] means "only give me the first instance of <tag> inside <parent tag>." So, td/span/a[1] gives you the first link inside the <span> inside each <td>. In the same way, td/b[1] gives you the very first bold text inside each <td>
So I try to replicate this logic with my table, but I need to understand the xml structure first.
In Chrome, right-mouse click in a table cell and select "Inspect" to view the code behind.
I click on the first cell of the table that contains "Google Search and inspect it.
The syntax of the function is: IMPORTXML(url; xpath query). Both arguments are written between quotation marks " ".
The xpath query is written: "//<parent tag>/<child tag>[column index]".
Tags are written without the <>.
In this example, you see that the site name is the the first link <a> of the <td> within a <tr>. It's also the first <td> within a <tr>. So you could write the the xpath query two ways: "//tr/td/a[1]" or "//tr/td/a[1]".
To extract site, I type the formula below in cell A1:
=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td/a[1]")
The domain is the <td> number 2 within a <tr>. To extract domain, I type the formula below in cell A2:
=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td[2]")
The type is the <td> number 4 within a <tr>. To extract type, I type the formula below in cell A3:
=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td[4]")
and here is the result:
It is time now to import the data into Tableau.
Open up any version of Tableau and go to the Connect panel.
Click on Google sheets.
This will open a new tab in your browser where you'll be asked to authenticate (you need to have a google account btw).
Once authenticated, a popup appears with the list of your g-sheets.
Select the one you're interested in and click "Connect".
Tableau displays the list of worksheet available in your g-sheet.
Double-click on the one your interested in and start exploring your data !
I'm not really satisfied with this process, it's way too complicated but that's all I could find after a quick search. I will do more digging in the near future.
For the record, things are so much easier with PowerBI. 3 clicks and you're done!
Here's how:
In the "Get data" panel, select "Web"
Enter the website URL
PowerBI finds the table and identifies the columns
So much easier, no ?
Comentarios