| Symptom / crawl results | Potential reason | Solution |
| --- | --- | --- |
| 0 or 1 disallowed URL is returned | The robots.txt file disallows everything | Use a custom robots.txt file through the "Robots Overwrite" feature in Advanced Settings > Test Settings > Robots Overwrite. |
| Only 1 indexable URL, returning a 200 status code | The whole site, or at least part of it, relies on JavaScript | Enable JavaScript rendering and run a rendered crawl for accurate results. |
| Only 1 indexable URL, returning a 200 status code | The pages matching the "Include only" rules/paths aren’t linked from the primary domain | Add a start URL that contains links to the pages satisfying the "Include only" rule. You can do this in Advanced Settings > Scope > Start URLs. |
| Only 1 indexable URL, returning a 200 status code | The site has a login portal and might require cookies to be crawled | Contact Similarweb Support and provide the session cookie; they will add it to your project. |
| 1 URL crawled with a status code of 401 or 403 | The site is blocking Similarweb’s bot by IP address | Make sure the default IP (52.5.118.182 or 52.86.188.211) is selected in Advanced Settings > Spider Settings > Crawler IP Settings, then ask the site's webmaster to whitelist these IP addresses. |
| 1 URL returned with a 3xx status code | The primary domain redirects to a URL that isn't in the scope of the crawl | Bring the redirect target into scope by selecting Crawl both HTTP/HTTPS or Crawl all subdomains, or by adding the specific secondary domain. Adding the redirect target as a start URL can also fix this. |
| 1 URL crawled with curl_GotNothing, or with no links or metrics at all | The website has security features that block fake crawlers pretending to be real crawlers | If the page failed, changing the user agent in Advanced Settings from Googlebot to similarweb will resolve the issue. |
| 1 failed URL with reason error_Curl_Err_SSLCertificateError or error_Curl_Err_CurlError | The website's SSL certificate might be invalid. This is a common issue when crawling staging sites. You can check the certificate's validity in the browser address bar or with an external validator. | Select "Ignore invalid SSL Certificate" in Advanced Settings > Scope > Crawl Restrictions. |
| All or most URLs return a 403 status code (the page title might be "Attention Required! \| Cloudflare") | The website has security features that block fake crawlers pretending to be real crawlers | If the page was blocked, changing the user agent in Advanced Settings from Googlebot to similarweb will resolve the issue. You can also test this by changing your browser user agent to Googlebot. |
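
To confirm the first cause in the table (a robots.txt that disallows everything) before applying an overwrite, you can evaluate the rules locally. The following is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_crawlable`, the sample rule sets, and the `example.com` URLs are illustrative assumptions and not part of the product; paste in the robots.txt content of the site you are crawling.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url.

    Hypothetical helper for illustration only: it parses robots.txt text
    that you supply and makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rule sets (assumptions for the example).
BLOCK_ALL = """User-agent: *
Disallow: /
"""

ALLOW_ALL = """User-agent: *
Disallow:
"""

print(is_crawlable(BLOCK_ALL, "https://example.com/"))      # False
print(is_crawlable(ALLOW_ALL, "https://example.com/page"))  # True
```

If the start URL comes back as not crawlable, the crawl will match the "0 or 1 disallowed URL" symptom, and a custom robots.txt supplied through Robots Overwrite is the likely fix.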