Running frequent, targeted crawls of your website is a key part of improving its technical health and its rankings in organic search. In this guide, you'll learn how to crawl a website efficiently and effectively with Similarweb.

Step 1 - Basic Information

Before starting a crawl, it’s a good idea to familiarize yourself with your site’s domain structure. Enter your domain name into the ‘Domain’ field and click ‘Check’. You’ll then see a thumbs-up for the relevant domain.
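
If you want to sanity-check the protocol and sub-domain variants before the crawler does, the step 1 options boil down to a small set of candidate start URLs. A minimal sketch of that idea (the helper name is ours, not part of Similarweb):

```python
def crawl_entry_points(domain, include_subdomains=False):
    """Build the candidate start URLs a crawler would probe for a domain.

    Mirrors the step 1 choices: both HTTP and HTTPS are covered, and the
    'www' sub-domain is added when sub-domain crawling is enabled.
    """
    hosts = [domain]
    if include_subdomains and not domain.startswith("www."):
        hosts.append("www." + domain)
    return [f"{scheme}://{host}/" for scheme in ("https", "http") for host in hosts]
```

Requesting each of these and noting which respond successfully tells you which variants your crawl will actually cover.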

If you’d like the crawl to include any sub-domains it finds, check the ‘Crawl sub-domains’ option. Underneath that, you can also choose whether to crawl both HTTP and HTTPS.

The project name will automatically populate with the domain name, but you can change this to anything you like to help you identify the project.

Previewing Project Setup

Once you've configured your settings, you can preview the project setup and verify the crawl configuration before starting the crawl. For example, you can check for 401 errors upfront using the preview feature, spotting issues and making the necessary amendments before the crawl begins rather than discovering them when it completes.

You can access the preview in step 1 of the crawl setup process, where you’ll see an option to ‘Save & Preview’.

When the preview opens, you can choose to view a screenshot, the rendered HTML, static HTML, and response headers. Once you’ve reviewed the preview, you can go back to the settings using the button in the bottom left corner, or the x in the top right.

You can also access the ‘Save & Preview’ option in step 4 of the crawl setup process. Click “Advanced Settings” and you’ll see the option just underneath.

Step 2 - Sources

There are seven different types of URL sources you can include in your Similarweb Site Audit projects.

Consider running a crawl with as many URL sources as possible in order to supplement your linked URLs with XML Sitemap and Google Analytics data, as well as other data types. Check the box next to the relevant source to include the different elements in your crawl.

  • Website: Crawl only the site by following its links to deeper levels. The crawl will start from your primary domain by default, but if you need it to start from a different point or multiple points, you can also specify those by expanding the website option.
  • Sitemaps: Crawl a set of sitemaps, and the URLs in those sitemaps. Links on these pages will not be followed or crawled. When you expand the options here, you can also manually add sitemaps, select or deselect different sitemaps, and choose whether to discover and crawl new sitemaps in robots.txt or upload XML or TXT sitemaps. 
  • Backlinks: Upload backlink source data, and crawl the URLs to discover additional URLs with backlinks on your site. This can also be automatically brought in via integration with Majestic. 
  • Google Search Console: Use our Google Search Console integration to enrich your reports with data such as impressions, positions on a page, devices used, etc. You can also discover additional pages on your site which may not be linked. To use the integration, you will need to connect your Google Account to Similarweb. See our Integrating Google Search Console with Site Audit guide for more details.
  • Analytics: Similarly, you can use our Google Analytics or Adobe Analytics integration, or upload analytics source data to discover additional landing pages on your site which may not be linked. Again, to use this you will need to connect your Google Account. See our Integrating Google Analytics with Site Audit guide for more details. 
  • Log Summary: Upload log file summary data from log analyzer tools such as Logz.io or Splunk to get a view of how bots interact with your site. You can also upload log file data manually.
  • URL lists: Crawl a fixed list of URLs by uploading a list in a text file or CSV. Links on these pages will not be followed or crawled. This can be particularly useful for crawling a specific set of pages for accessibility issues, such as those that feature key templates used across the site.
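
To illustrate the ‘discover and crawl new sitemaps in robots.txt’ source option above: sitemap discovery amounts to reading Sitemap: directives out of the robots.txt body. A simplified sketch of that parsing (not Similarweb's actual implementation):

```python
def discover_sitemaps(robots_txt):
    """Extract Sitemap: directives from a robots.txt body.

    Splits each line on the first colon only, so the colons inside the
    sitemap URL itself are preserved. Matching is case-insensitive.
    """
    sitemaps = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps
```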

Screenshot of Step 2 of the Similarweb crawl setup process, showing the available sources with Website, Sitemaps, Backlinks, Google Search Console and Analytics checked, but Log Summary and URL lists unchecked.

Step 3 - Limits

In step 3 you can set the relevant limits for your crawl. We recommend starting with a small “website” crawl to look for any signs that your site may be uncrawlable. The default options are to crawl 100 levels deep from the starting page or a maximum of 100,000 URLs, whichever is reached first. For the first crawl, we recommend reducing the second limit to a maximum of 100 URLs. You can then choose whether to be notified, or to finish the crawl anyway, if the limit turns out not to be enough. For the initial, small crawl, you can set this to finish anyway.
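
The interplay of the two limits can be pictured as a breadth-first crawl that stops at whichever cap is hit first. A rough sketch, with a plain dictionary standing in for fetching pages and extracting their links:

```python
from collections import deque

def crawl(start, links, max_depth=100, max_urls=100_000):
    """Breadth-first crawl honoring the two step 3 limits: levels deep
    from the start page, and a total URL cap, whichever is reached first.

    `links` maps each URL to the URLs linked from it, standing in for
    real fetching and link extraction.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_urls:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return order
```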

The other option you have in this step is to set the crawl speed. The Similarweb crawler can crawl as fast as your infrastructure allows (up to 350 URLs per second for JavaScript-rendered crawls in testing). However, crawling too fast means your server may not be able to keep up, leading to performance issues on your site. To avoid this, Similarweb sets a low maximum crawl speed for your account. We can increase this, but it is essential to consult your DevOps team first to identify the crawl rate your infrastructure can handle.

You can also add restrictions to lower or increase the speed of crawls at particular times. For example, you may decide that you want a slower crawl rate during peak times for your site, but a faster crawl speed early in the morning when traffic is lower.
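
Conceptually, a time-based restriction is just a lookup from the hour of day to a rate. A toy illustration (the hours and rates here are examples, not Similarweb defaults):

```python
def crawl_rate_for_hour(hour, base_rate=5.0, restrictions=None):
    """Pick a crawl rate (URLs per second) for a given hour of the day.

    `restrictions` pairs an hour range with an overriding rate, e.g.
    slow down during peak traffic, speed up overnight. Hours outside
    any restriction fall back to the base rate.
    """
    for hours, rate in restrictions or []:
        if hour in hours:
            return rate
    return base_rate
```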

Step 4 - Settings

In step 4 you can start your crawl. Once you have completed your smaller initial crawl and ensured your settings are correct, you’ll be able to set a schedule to run crawls at regular intervals. 

To set a schedule, choose your required frequency from the drop-down and choose your starting date and time. You can also choose “One Time” to schedule a single, non-recurring crawl at a future date and time.

Underneath the Schedule options, you’ll also see a button for Advanced Settings. Clicking this will open a range of additional options you can set as required. You’ll see a check mark on any elements that already have settings applied (most likely added during the steps above) and can open up each section to add or amend settings.

The advanced options are:

  • Scope:
    • Domain scope: Detailing the primary domain, whether sub-domains and both HTTP and HTTPS will be crawled, and any secondary domains that will be crawled. These may have been set in steps 1 and 2 above. 
    • URL scope: Here, you can choose to include only specific URL paths or exclude specific URL paths.

  • Page grouping: Create a new group, add a name for the page grouping, and add a regular expression in the 'Page URL Match' column. Select the percentage of matching URLs that you would like to crawl. Once that limit has been reached, all further matching URLs go into the 'Page Group Restrictions' report and are not crawled.

  • Resource restrictions: To define which types of URLs you want Similarweb to crawl (e.g. non-HTML, CSS resources, images, etc.). You can also set the crawler to ignore an invalid SSL certificate.
  • Link restrictions: To define which links you want Similarweb to crawl (e.g. follow anchor links, pagination links, etc.). 
  • Redirect settings: To choose whether to follow internal or external redirects.
  • Link validation: Where you can choose which links are crawled to see if they are responding correctly.
  • Spider settings:
    • Start URLs: This was set in Step 2 above but can be accessed and changed here. 
    • JavaScript rendering: Here you can enable or disable JavaScript rendering. You can also add any custom rejections, any additional custom JavaScript, and any external JavaScript resources. 
    • Crawler IP settings: Where you can select regional IPs if required. If your crawl is blocked, or you need to crawl behind a firewall (e.g. a staging environment), you will need to ask your web team to whitelist 52.5.118.182 and 52.86.188.211. 
    • User agent: By default, the crawler will use the Googlebot Smartphone user agent, but you can change this here if needed, and amend the viewport size if required. 
    • Mobile site: If your website has a separate mobile site, you can enter settings here to help Similarweb’s crawler use a mobile user agent when crawling the mobile URLs. 
    • Stealth mode crawl: Allowing you to run a crawl as if it was performed by a set of real users. 
    • Custom request header: Where you can add any custom request headers that will be sent with every request.
    • Cookies: This setting is mostly used for accessibility crawls, to ensure any cookie popup is cleared so the crawl can progress. This is not generally required for tech SEO crawls, but you can see how to configure cookie details here if you need to use it.
  • Extraction:
    • Custom extraction: Where you can use regular expressions to extract custom information from pages when they are crawled.
  • Test settings:
    • Robots overwrite: This allows you to identify additional URLs that can be excluded using a custom robots.txt file, letting you test the impact of pushing a new file to a live environment. Upload the alternative version of your robots file and select 'Use Robots Overwrite' when starting the crawl.

    • Test site domain: Here you can enter your test environment domain to allow comparisons with your live site.
    • Custom DNS: This allows custom DNS entries to be configured if your website does not have public DNS records (e.g. a staging environment).
    • Authentication: To include authentication credentials in all requests using basic authentication.
  • Remove URL parameters: If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these in the 'Remove URL Parameters' field under 'Advanced Settings.'

  • URL rewriting: Add a regular expression to match a URL and add an output expression. 
  • Report setup:
    • API callback: Where you can specify a URL to be called once your crawl has been completed, to trigger an external application. 
    • Crawl email alerts: To set whether to receive email notifications on the progress of your crawl, and specify the email addresses that will receive notifications. 
    • Report settings: Here you can specify additional settings for your reports.
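
Of the options above, custom extraction is the most code-like: each named pattern is a regular expression whose first capture group is pulled from the page HTML. A minimal sketch of the idea (the pattern names and sample markup are hypothetical):

```python
import re

def custom_extraction(html, patterns):
    """Apply named regular expressions to a page's HTML, returning the
    first capture group for each pattern, or None when nothing matches.
    """
    return {name: (m.group(1) if (m := re.search(pattern, html)) else None)
            for name, pattern in patterns.items()}
```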

Step 5 - The Final Step

Once your first, smaller crawl has been completed, check the results to ensure that everything looks OK.

First, check the number of URLs crawled in the project summary. If you selected to crawl a maximum of 100 URLs in step 3, the URL count should be close to that number. If the URL count is 0, your site has probably blocked the crawler, and the IP addresses mentioned above need to be whitelisted.

Second, check the domains of the URLs returned in the reports. Select the project from the project list and click on Accessibility Overview. You can then click on the top SEO Health Score error on the right-hand side to open a report.

Once the report opens, find the “Example URL” in the URL details column and verify that the domain or sub-domain is correct.

If everything looks OK, you can then return to step 3 of the crawl setup to increase the limits and run a full crawl.

Handy Tips

Setting for Specific Requirements

If you have a test/sandbox site, you can run a “Comparison Crawl” by adding your test site domain and authentication details in “Advanced Settings”. For more about the Test vs Live feature, check out our guide to Comparing a Test Website to a Live Website.

To crawl an AJAX-style website with an escaped fragment solution, use the “URL Rewrite” function to modify all linked URLs to the escaped fragment format. Read more about our testing features in Testing Development Changes Before Putting Them Live.
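
For reference, the escaped-fragment rewrite turns ‘#!’ URLs into their ‘?_escaped_fragment_=’ equivalents. A simplified version of the transformation a URL Rewrite rule would perform (a production rewrite should also URL-encode the fragment):

```python
def to_escaped_fragment(url):
    """Rewrite an AJAX-style '#!' URL to its '_escaped_fragment_' form.

    Uses '&' when the base URL already carries a query string, '?'
    otherwise. URLs without '#!' are returned unchanged.
    """
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}_escaped_fragment_={fragment}"
```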

Changing Crawl Rate

Watch for performance issues caused by the crawler while running a crawl. If you see connection errors or multiple 502/503 type errors, you may need to reduce the crawl rate in the 'Limits' tab.

If you have a robust hosting solution, you may be able to crawl the site at a faster rate. The crawl rate can be increased at times when the site load is reduced, such as 4 a.m. Head to the 'Limits' tab and click 'Show Restrictions'.
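
A simple way to decide when to throttle is to watch the share of 5xx responses in recent crawl results. A sketch of that heuristic (the 5% threshold is an illustrative choice, not a Similarweb setting):

```python
def should_reduce_rate(status_codes, threshold=0.05):
    """Flag when the share of 5xx responses (e.g. 502/503) in recent
    results exceeds the threshold, suggesting the crawl rate is too
    high for the server to keep up with.
    """
    if not status_codes:
        return False
    errors = sum(1 for code in status_codes if 500 <= code < 600)
    return errors / len(status_codes) > threshold
```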

Analyze Outbound Links

Sites with a large number of external links may want to ensure that users are not directed to dead links. To check this, select 'Crawl External Links' under 'Project Settings', which adds an HTTP status code next to external links in your report. Read more on outbound link audits to learn about analyzing and cleaning up external links.
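
The same check can be reproduced outside the tool: request each external link and flag anything returning a 4xx or 5xx. A small sketch with the HTTP call injected as a callable (in practice you would pass a function that issues a HEAD request and returns the status code):

```python
def audit_external_links(urls, fetch_status):
    """Collect the HTTP status for each external link and report dead ones.

    `fetch_status` is any callable taking a URL and returning its status
    code; injecting it keeps the audit logic testable without a network.
    """
    results = {url: fetch_status(url) for url in urls}
    dead = [url for url, code in results.items() if code >= 400]
    if dead:
        print(f"{len(dead)} dead link(s): {dead}")
    return results
```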

Change User Agent

See your site through a variety of crawlers' eyes (Facebook, Bingbot, etc.) by changing the user agent in 'Advanced Settings'. You can also add a custom user agent to determine how your website responds.
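
Testing user-agent handling boils down to fetching the same URL under several User-Agent strings and comparing the responses. A sketch with the request function injected (pass something that sets the User-Agent header and returns the status code):

```python
def check_user_agents(url, agents, fetch):
    """Request one URL under several User-Agent strings and return the
    status code seen for each, to reveal crawler-specific blocking.

    `fetch` is any callable taking (url, user_agent) -> status code.
    """
    codes = {agent: fetch(url, agent) for agent in agents}
    if len(set(codes.values())) > 1:
        print(f"Responses differ by user agent: {codes}")
    return codes
```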

Next Steps

Reset your 'Project Settings' after the crawl, so you can continue to crawl with 'real-world' settings applied. Remember, the more you experiment and crawl, the closer you are to becoming an expert crawler.
