Extracting HTML
Feeds can be used to get content from ordinary HTML web pages. The fetched page can be accessed through the Liquid doc object, as follows:
{{doc}}
On the test tab this looks like:
This returns the entire HTML document. To extract content from the HTML document there are 3 helpers that can be used:
Tag Stripping:
Liquid’s standard ‘strip_html’ filter, which removes all HTML tags from a string, can be useful when working with HTML documents: https://shopify.github.io/liquid/filters/strip_html
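As a quick illustration of what strip_html does on its own, here is a minimal sketch using a hard-coded string instead of a fetched page. The sample sentence is only a placeholder and is not part of any feed:

```liquid
{% comment %} Input is a literal string here; in a feed it would come from doc {% endcomment %}
{{ "Have <em>you</em> read <strong>Ulysses</strong>?" | strip_html }}
```

This should render as: Have you read Ulysses?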
HTML
In this example we will get the biography text from the Taxi for Email Twitter profile.
Feed set up
Set the feed URL to https://twitter.com/taxiforemail
Set the method to ‘GET’
Set the data type to ‘HTML’
Data Extraction
First, open the Twitter page in a browser, then use the browser’s ‘inspect’ tool to find the element we’re looking for:
We can see that the text is in a <p> tag with the class ‘ProfileHeaderCard-bio’. We can use this to make the following CSS selector:
p.ProfileHeaderCard-bio
We can get the content of this <p> element through the doc object, using the find_first_by_css filter:
{{doc | find_first_by_css: 'p.ProfileHeaderCard-bio' }}
This gives the following result:
If we want just the text from this, without the HTML tags, we can add the strip_html filter:
{{doc | find_first_by_css: 'p.ProfileHeaderCard-bio' | strip_html }}
Which gives just the text:
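The extracted text can also be stored in a variable and tidied up with other standard Liquid filters. The following is a minimal sketch, not a confirmed recipe: the variable name bio and the fallback wording are hypothetical, and it assumes that find_first_by_css produces an empty result when the selector matches nothing, which should be checked on the test tab.

```liquid
{% comment %} bio is a hypothetical variable name chosen for this example {% endcomment %}
{% assign bio = doc | find_first_by_css: 'p.ProfileHeaderCard-bio' | strip_html | strip %}

{% comment %} default supplies fallback text when bio is nil or an empty string {% endcomment %}
{{ bio | default: 'Biography unavailable' }}
```

Here strip removes leading and trailing whitespace left behind once the tags are gone, and default keeps the email from showing a blank space if the page layout changes and the selector stops matching.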