Extracting HTML

Feeds can be used to get content from ordinary HTML web pages. The fetched page can be accessed through the Liquid doc object, as follows:

{{doc}}

On the test tab this looks like:

This returns the entire HTML document. To extract content from the HTML document there are 3 helpers that can be used:

Tag Stripping:

Liquid’s standard ‘strip_html’ filter can be useful when working with HTML documents: https://shopify.github.io/liquid/filters/strip_html
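
As a small illustration of what the filter does on its own, the snippet below applies strip_html to a literal string instead of the doc object. The sample string is only an illustration and is not part of the feed used in this article.

```liquid
{{ "Have <em>you</em> read <strong>Ulysses</strong>?" | strip_html }}
```

This renders as: Have you read Ulysses? The tags are removed and the text between them is kept.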

HTML

In this example we will get the biography text from the Taxi for Email Twitter profile.

Feed set up

Data Extraction

First, open the Twitter page in a browser, then use the browser’s ‘inspect’ tool to find the element we’re looking for:

We can see that the text is in a <p> tag with the class ‘ProfileHeaderCard-bio’. We can use this to make the following CSS selector:

p.ProfileHeaderCard-bio

We can get the content of this <p> element through the doc object, using the find_first_by_css filter:

{{doc | find_first_by_css: 'p.ProfileHeaderCard-bio' }}

This gives the following result:

If we want just the text from this, without the HTML tags, we can add the strip_html filter:

{{doc | find_first_by_css: 'p.ProfileHeaderCard-bio' | strip_html }}

Which gives just the text:
