Parsing HTML

    About this guide

    As described in the Integrating Any Ecommerce Platform guide, APIs are not always available for every feature you are building. In these situations, you must retrieve the data by fetching and parsing HTML.

    This guide describes the “dos” and “don’ts” of parsing HTML.

    What you’ll learn and how you can apply it

    By the end of this section, you will have an understanding of:

    And you will have:

    This guide is for you because…

    A note about jQuery

    In some of the code examples of this guide we use jQuery to demonstrate how parsers can be used. jQuery is not strictly required in Progressive Web Apps, but this guide demonstrates at least one kind of scenario where it can be helpful.

    What is a parser?

    A parser is a function that takes in an HTML string (or jQuery object) and outputs an object or array of objects. The output is a formatted set of plain values (strings, numbers, objects, arrays). It could be, for example, a list of products, a promotion advert, a sale price, or the page heading text. A parser should not output actual DOM nodes or element objects, such as HTMLElement instances.

    A typical parser will look something like this:

    // Note that `$` refers to an instance of jQuery, and
    // `$html` is a jQuery object representation of the HTML
    // being parsed.
    const productParser = ($, $html) => {
        const $products = $html.find('.products .product')
    
        // jQuery's get function returns the elements in an array.
        // This allows us to use Array.prototype.map.
        // We don't want to use jQuery's map
        // as it has a different signature than Array.prototype.map
        // and using both in a single codebase can be confusing.
        return $products.get().map((product) => {
            const $product = $(product)
    
            return {
                href: $product.find('a').attr('href'),
                text: $product.text(),
                price: $product.find('.price').text()
            }
        })
    }
    
    export default productParser
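
    The rule that a parser returns plain data rather than DOM nodes can be checked mechanically, for example in a unit test. The sketch below is one possible way to do that; `isPlainData` is a hypothetical helper, not part of jQuery or of any SDK, and it assumes the parser output contains no circular references.

```javascript
// Hypothetical helper: returns true when `value` is made up only of
// primitives, arrays, and plain objects (no DOM nodes, no class instances).
const isPlainData = (value) => {
    if (value === null || value === undefined) {
        return true
    }
    if (['string', 'number', 'boolean'].includes(typeof value)) {
        return true
    }
    if (Array.isArray(value)) {
        return value.every((item) => isPlainData(item))
    }
    if (typeof value === 'object') {
        const proto = Object.getPrototypeOf(value)
        // Plain objects have Object.prototype (or null) as their prototype;
        // an HTMLElement or a jQuery object does not.
        if (proto !== Object.prototype && proto !== null) {
            return false
        }
        return Object.values(value).every((item) => isPlainData(item))
    }
    // Functions, symbols, and anything else are rejected.
    return false
}
```

    A test could then assert that `isPlainData(productParser($, $html))` is true for a sample page.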
    

    Here’s a common scenario where we’d need a parser. A user follows a link and we need to get some content from the existing HTML page associated with that link. To do that, we’d fetch the HTML, parse its content, and dispatch an action to populate the Redux store with that content. A reducer will need to be listening for that action to actually perform the update on the store.

    Any project that relies on parsing should use the stubConnector as its starting point. This connector contains a utility called fetchPageData to help fetch and parse data from an existing HTML page.

    Here’s what the code for that scenario would look like, using the parser shown earlier:

    // By using this action, the product data will be placed into the expected location
    // in the Redux store
    import {
        receiveProductDetailsProductData
    } from 'mobify-integration-manager/dist/integration-manager/api/products/results'
    
    // Assuming that you are implementing a command within the stubConnector
    import {fetchPageData} from '../app'
    
    import productParser from './parsers/product'
    
    export const fetchProduct = (urlToTheProduct) => (dispatch) => {
        // Use util to send request and fetch the HTML
        return fetchPageData(urlToTheProduct)
            .then((res) => {
                const [$, $html] = res
                // fetchPageData will return jQuery
                // and a jQuery wrapped version of the HTML
                const parsedData = productParser($, $html)
    
                // Update the Redux store with the parsed data
                return dispatch(receiveProductDetailsProductData(parsedData))
            })
    }
    

    To learn more about the stubConnector, read the guide on Integrating Any Ecommerce Platform.
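
    The last step of the scenario above, a reducer that listens for the dispatched action, can be sketched without any library. The action type `RECEIVE_PRODUCT_DATA`, the `receiveProductData` action creator, and the state shape below are hypothetical stand-ins for illustration; the Integration Manager's actual actions and reducers may be structured differently.

```javascript
// Hypothetical action type and action creator, used only for illustration.
const RECEIVE_PRODUCT_DATA = 'RECEIVE_PRODUCT_DATA'

const receiveProductData = (products) => ({
    type: RECEIVE_PRODUCT_DATA,
    payload: products
})

const initialState = {products: []}

// A reducer is a pure function: it returns a new state object
// instead of mutating the one it was given.
const productsReducer = (state = initialState, action) => {
    switch (action.type) {
        case RECEIVE_PRODUCT_DATA:
            return {...state, products: action.payload}
        default:
            // Unrelated actions leave the state untouched.
            return state
    }
}

// Usage: the parser's output travels to the store in the action payload.
const parsed = [{href: '/shirt', text: 'Shirt', price: '$20'}]
const nextState = productsReducer(undefined, receiveProductData(parsed))
console.log(nextState.products.length) // 1
```

    Because the parser already returned plain data, the reducer has nothing to clean up and can store the payload as is.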

    When to use a parser

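    Before deciding to parse an HTML page, the following conditions should be met: the page has a structure that can be traversed programmatically, and the page content follows a consistent format.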

    Pages that typically meet the above criteria include product descriptions and category lists.

    The page has structure that can be traversed programmatically

    The page markup should be structured so that it can be traversed using plain and simple JavaScript. For example, a typical category list page should be structured in a way that makes it easy to find each individual category.

    The page content follows a consistent format

    Ecommerce site product pages typically follow a consistent format from one product to another. It’s reasonable to expect every product to contain a title, price, description, and other metadata.

    It’s possible that there will be some inconsistencies. Some products may contain options that don’t exist for others. For example, clothes might have color options while hardware tools might not. These inconsistencies can be manageable so long as you can easily identify those differences.
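
    One way to keep such differences manageable is to have the parser always return the same shape and represent a missing option as an empty list. The following is a minimal sketch under that assumption; the selectors (`.product-title`, `.price`, `.color-options li`) and the `hasColors` field are hypothetical and would need to match the page being parsed.

```javascript
// Selectors such as `.product-title` and `.color-options li` are
// hypothetical; use whatever the page being parsed actually provides.
const productDetailsParser = ($, $html) => {
    // On an empty jQuery set, get() returns [], so products
    // without color options simply end up with an empty list.
    const colors = $html
        .find('.color-options li')
        .get()
        .map((option) => $(option).text().trim())

    return {
        title: $html.find('.product-title').text().trim(),
        price: $html.find('.price').text().trim(),
        colors,
        // Lets the UI decide whether to render a color picker at all.
        hasColors: colors.length > 0
    }
}
```

    With this approach, a component that renders the product never has to check whether `colors` exists, only whether it is empty.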

    When not to use a parser

    Now that we know what conditions we need for using a parser, let’s look at when those conditions are not in place.

    If it’s impossible to programmatically traverse a page’s content, the page should not be parsed. Consider a page where the page title is only wrapped in a <span> tag with no ID or unique class—and there are many other <span> tags being used throughout the page.

    <span>Sub-title</span>
    <div>
        <!--
          It might be difficult to tell this heading apart from the sidebar below
          -->
        <span>Main Heading</span>
    </div>
    <p>Main content goes here</p>
    <div>
        <span>Sidebar</span>
        <p>Information</p>
    </div>
    

    If, from one page to another, it’s difficult or impossible to know what’s the same and what’s different, the page should not be parsed. Articles or blog-like pages are a common offender. Their structures can vary dramatically, often because the content is created using WYSIWYG editors. For example, a recipe blog post might list its ingredients with a plain list element, while another article might just list them in a paragraph with <br /> tags mixed in.

    <!-- Article Blog Post Example #1 -->
    <h1>Egg Pancakes</h1>
    <p>Ingredients</p>
    <ul>
        <li>Eggs</li>
        <li>Flour</li>
        <li>Syrup</li>
    </ul>
    
    <!-- Article Blog Post Example #2 -->
    <!-- Notice that this example is formatted completely differently from #1 -->
    <h1>Banana Brownies</h1>
    <p>Ingredients<br />
        - Bananas<br />
        - Cocoa Powder<br />
        - Eggs<br />
    </p>
    

    Displaying unparsed content

    A useful technique for displaying unpredictable content, as described above, is to simply output the markup unmodified in a DangerousHTML component.

    Here’s an example:

    const unpredictableBlogContent = getBlockContent() // => a raw HTML string, e.g. `<div><p>...</p><div><span>...</span></div></div>`
    const content = (
        <DangerousHTML html={unpredictableBlogContent}>
            {(htmlObj) => <div dangerouslySetInnerHTML={htmlObj} />}
        </DangerousHTML>
    )
    

    Be cautious about how you use DangerousHTML. This component relies on React’s dangerouslySetInnerHTML prop, which comes with its own security implications. Please familiarize yourself with the dangers of using this component as explained in its documentation.

    Striking a balance

    You will find that most web pages contain a mixture of well-structured and poorly structured data.

    For example, a product page could have titles, prices, and product options that are easy to find, while its descriptions and promotions are inconsistently formatted.

    For each piece of content, you will need to choose whether to parse or not to parse.
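
    That choice can be made field by field inside a single parser. The sketch below parses the well-structured fields into text and passes the inconsistently formatted description through as a raw HTML string, which could then be displayed unmodified (for example, with DangerousHTML). The selectors and the `descriptionHtml` field name are hypothetical.

```javascript
// Selectors are hypothetical. Structured fields are parsed into text;
// the inconsistently formatted description is kept as a raw HTML string.
const productPageParser = ($, $html) => {
    const descriptionHtml = $html.find('.product-description').html()

    return {
        title: $html.find('.product-title').text().trim(),
        price: $html.find('.price').text().trim(),
        // jQuery's html() does not return a string for an empty set,
        // so fall back to an empty string when there is no description.
        descriptionHtml: descriptionHtml || ''
    }
}
```

    The output is still plain data: the description is a string, not a DOM node, so it can be stored in Redux like any other parsed field.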