XML Sitemap URL Extractor

This program lists every URL in a sitemap, including the URLs in any nested sitemaps.

Example of a regular sitemap: https://pinoymix.com/sitemap.xml
Example of nested sitemaps: https://moz.com/sitemap.xml
Example of a gzipped sitemap: https://www.yahoo.com/entertainment/sitemaps/entertainment-sitemap_articles_2024-02-23_US_en-US.xml.gz
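
The three cases above (regular, nested, gzipped) can be handled by one recursive routine. The following is only a rough Python sketch of that idea, not the code behind this program. The names download, local_name and extract_urls are made up for illustration, and it assumes the server returns the sitemap to a plain request, which some sites may refuse.

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def download(url):
    """Fetch a sitemap and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_urls(sitemap_url, fetch=download, seen=None):
    """Return every page URL in a sitemap, following nested sitemaps."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    data = fetch(sitemap_url)
    if data[:2] == b"\x1f\x8b":  # gzip files start with these two bytes
        data = gzip.decompress(data)

    root = ET.fromstring(data)
    # A <sitemapindex> root lists other sitemaps; a <urlset> root lists pages.
    is_index = local_name(root.tag) == "sitemapindex"
    urls = []
    for entry in root:  # each <sitemap> or <url> element
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                loc = child.text.strip()
                if is_index:
                    urls.extend(extract_urls(loc, fetch, seen))
                else:
                    urls.append(loc)
    return urls
```

Calling extract_urls("https://moz.com/sitemap.xml") would then walk the nested sitemaps and return one flat list. The fetch parameter exists only so that the download step can be swapped out, for example when testing without network access.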

What is a Google XML Sitemap?

A Google XML Sitemap, or simply a sitemap, is a blueprint of your website stored in a file called sitemap.xml or something similar. It helps search engines find, crawl, and index all of your website's content. Sitemaps also tell search engines which pages on your site are most important.

Why is it called XML Sitemap?

It is sometimes called an XML Sitemap because it is stored in XML format. This is a sample entry from a sitemap:

<url>
  <loc>https://aruljohn.com/</loc>
  <lastmod>2024-02-26T19:33:40.000Z</lastmod>
</url>
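
To show how such an entry can be read, here is a small Python sketch that parses the sample with the standard library. It assumes the bare snippet exactly as shown. In a complete sitemap file the tags normally carry the sitemaps.org XML namespace, so the plain tag names used here would not match without extra handling.

```python
import xml.etree.ElementTree as ET

# The sample entry from above, as a string.
SAMPLE = """
<url>
  <loc>https://aruljohn.com/</loc>
  <lastmod>2024-02-26T19:33:40.000Z</lastmod>
</url>
"""

entry = ET.fromstring(SAMPLE.strip())
loc = entry.findtext("loc")          # the page address
lastmod = entry.findtext("lastmod")  # when the page last changed

print(loc)
print(lastmod)
```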

Where can I find the sitemap of a website?

Type the website's domain name and add /robots.txt to the end. The sitemap is usually listed there. For example, to find the sitemap of slashdot.org, go to https://slashdot.org/robots.txt. You will find more than one sitemap listed there.
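
The same lookup can be done in code. The sketch below is one possible approach in Python: it takes the text of a robots.txt file and collects the addresses on its "Sitemap:" lines. The function name sitemaps_from_robots and the example.com addresses are invented for illustration, and treating "#" as the start of a comment is a simplifying assumption.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        field, sep, value = line.partition(":")  # split at the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


# Made-up robots.txt content for demonstration.
EXAMPLE = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml.gz
"""

print(sitemaps_from_robots(EXAMPLE))
```

Python 3.8 and later also offer urllib.robotparser.RobotFileParser, whose site_maps() method returns the same kind of list after fetching and parsing the file.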

Can you share the Linux commands to get this information?

Yes, you can read our article on how to write Linux commands to extract URLs from sitemaps. It uses curl, grep, gunzip, and other commands.