How to Download and Extract URLs from Sitemaps using the Command Line

Published on February 28, 2024

If you work on SEO-related tasks and are looking for sitemaps or URLs in a sitemap, this article contains a list of Linux / Unix commands to make your job easy.


What is an XML Sitemap?

A sitemap is like a map: it shows where things are and how to get to them. Search engines like Google use sitemaps to navigate a website in a more structured way. An XML sitemap is an XML file that lists all, or the most important, URLs on that website.

This is an example of the sitemap residing on our other website pinoymix.com:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url><loc>https://pinoymix.com/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/schools/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/history/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/about/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/contact/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/recipes/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/lyrics/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
</urlset>

Googlebot constantly crawls the Internet and looks for updates to webpages. When it finds a sitemap, it uses the listed URLs to discover pages, which it can then crawl and store (index) in its database.

If you organize the webpages of your website neatly in a sitemap, Googlebot can understand your website better, crawl more efficiently and index the pages faster. The interval or frequency with which Googlebot crawls and indexes your webpages varies.

Until now, I've been saying Googlebot, but Googlebot is only one of the gazillion spiders on the Internet. Spiders and bots from Bing, Yahoo, Yandex, Baidu and other search engines also use sitemaps to index your information.

There are two main kinds of sitemaps: XML sitemaps and HTML sitemaps. HTML sitemaps are simple webpages that link to other webpages on that website. XML sitemaps are text files that contain a list of URLs on your website.

There are other kinds as well, such as RSS or Atom feeds, which we have on this website, and text sitemaps such as urllist.txt, which have one URL per line. The text sitemap is the simplest type.

What do I need to get a sitemap from the command line?

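You will use curl and wget to get sitemaps from other websites, and you need a terminal to run them. macOS and Linux come with a terminal. macOS ships with curl but not wget, and most Linux distributions include at least one of the two; anything missing can be installed from the package manager. Windows 10 and later include curl, but wget has to be installed manually.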

Most of these commands are meant for macOS and Linux/Unix, so if a command does not work on Windows, you might want to consider installing a Linux emulator or a virtual machine with Linux. You will thank me later.
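Before going further, it may help to confirm which of the two tools is available on your machine. The following is a minimal sketch; check_tool is a hypothetical helper name used only for illustration, and it relies on the standard shell built-in command -v.

```shell
# check_tool is a hypothetical helper (not part of curl or wget).
# It reports whether the program named in $1 can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: installed"
    else
        echo "$1: not found"
    fi
}

check_tool curl
check_tool wget
```

If one of them is reported as not found, you can either install it or use only the commands in this article that rely on the other tool.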

Download an XML sitemap and print the contents

For our example, let us download the sitemap from moz.com. It is located at https://moz.com/sitemap.xml.

Using curl:

curl -sL https://moz.com/sitemap.xml

OUTPUT:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="sitemap.xsl"?><sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><sitemap><loc>https://moz.com/sitemaps-1-section-apiProductPages-1-sitemap.xml</loc><lastmod>2020-06-25T14:45:56-07:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-about-1-sitemap.xml</loc><lastmod>2024-01-04T09:04:18-08:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-blog-1-sitemap.xml</loc><lastmod>2024-02-27T06:46:35-08:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-categorygroup-blogCategories-1-sitemap.xml</loc><lastmod>2023-12-06T02:23:47-08:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-blogHome-1-sitemap.xml</loc><lastmod>2024-01-11T15:27:10-08:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-caseStudies-1-sitemap.xml</loc><lastmod>2023-09-11T15:22:31-07:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-community-1-sitemap.xml</loc><lastmod>2023-12-04T10:56:30-08:00</lastmod></sitemap>
.....

In curl, the -s option is to keep it silent (without the progress information), and the -L option is to follow any redirects.

Using wget:

wget -q -O- https://moz.com/sitemap.xml

In wget, the -q option is to keep it quiet, and -O- option is to get the file and print the result on STDOUT. Instead of -O-, if you use -O followed by a filename, it will save the output as a file with that filename.
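Since the two commands do the same job, one possible convenience is a small wrapper that uses whichever tool is installed. This is only a sketch: fetch_url is a hypothetical function name, and the output filename moz-sitemap.xml in the commented example is an assumption.

```shell
# fetch_url is a hypothetical helper: it prints the document at the URL in $1
# to STDOUT, using curl when it is available and falling back to wget.
fetch_url() {
    if command -v curl >/dev/null 2>&1; then
        curl -sL "$1"
    elif command -v wget >/dev/null 2>&1; then
        wget -q -O- "$1"
    else
        echo "fetch_url: neither curl nor wget is installed" >&2
        return 1
    fi
}

# Example (requires network access): save the sitemap to a local file.
# fetch_url https://moz.com/sitemap.xml > moz-sitemap.xml
```

Because the function writes to STDOUT, it can be redirected into a file or piped into other commands in the same way as the plain curl and wget commands.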

Extract the URLs from sitemap.xml

To extract only the URLs, we first need to see where they are located in the sitemap. Each URL sits between a <loc> tag and a </loc> tag.

This is a sample:

<sitemap><loc>https://moz.com/sitemaps-1-section-caseStudies-1-sitemap.xml</loc><lastmod>2023-09-11T15:22:31-07:00</lastmod></sitemap>

Using curl with grep and sed, we can get the URLs with this command:

curl -sL https://moz.com/sitemap.xml | grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g'

OUTPUT (first few results). Because moz.com's sitemap.xml is a sitemap index, the URLs it lists are themselves sitemaps:

https://moz.com/sitemaps-1-section-apiProductPages-1-sitemap.xml
https://moz.com/sitemaps-1-section-about-1-sitemap.xml
https://moz.com/sitemaps-1-section-blog-1-sitemap.xml
https://moz.com/sitemaps-1-categorygroup-blogCategories-1-sitemap.xml
https://moz.com/sitemaps-1-section-blogHome-1-sitemap.xml
https://moz.com/sitemaps-1-section-caseStudies-1-sitemap.xml
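To see what each stage of the grep and sed pipeline does, you can try it on a small local file without touching the network. The sketch below wraps the same pipeline in a function; extract_locs and sample-sitemap.xml are hypothetical names, and the example.com URLs are placeholders.

```shell
# extract_locs is a hypothetical helper: it reads sitemap XML on STDIN and
# prints one URL per line.
extract_locs() {
    # grep -o prints only the matching text: "<loc>" plus everything up to
    # the next "<". sed then deletes the leftover "<loc>" tag.
    grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g'
}

# A small made-up sitemap (the example.com URLs are placeholders).
cat > sample-sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://example.com/</loc><lastmod>2024-02-26</lastmod></url>
<url><loc>https://example.com/about/</loc><lastmod>2024-02-26</lastmod></url>
</urlset>
EOF

extract_locs < sample-sitemap.xml
```

This prints https://example.com/ and https://example.com/about/ on separate lines. The pattern is a simple text match rather than a real XML parser, so it assumes the URL follows the <loc> tag directly, and it leaves XML escapes such as &amp; in the URL unchanged.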

Extract the URLs from sitemap.xml.gz

What if the sitemap is saved as a gzipped file? In that case, we pipe it through gunzip.

As an example, we will read URLs from the gzipped sitemap at https://www.yahoo.com/lifestyle/sitemaps/lifestyles-sitemap_index_US_en-US.xml.gz

curl -s https://www.yahoo.com/lifestyle/sitemaps/lifestyles-sitemap_index_US_en-US.xml.gz | gunzip | grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g'

OUTPUT (first few results):

https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-27_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-26_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-25_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-24_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-23_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-22_US_en-US.xml.gz
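If you deal with a mix of plain and gzipped sitemaps, one possible approach is to decide from the URL whether to decompress. The following is a sketch under that assumption; read_sitemap is a hypothetical helper name, and it looks only at the .gz suffix, not at the file contents.

```shell
# read_sitemap is a hypothetical helper: it reads raw sitemap data on STDIN
# and prints XML, decompressing only when the URL or filename given in $1
# ends in .gz.
read_sitemap() {
    case "$1" in
        *.gz) gunzip -c ;;
        *)    cat ;;
    esac
}

# Example (requires network access), using the URL from this section:
# url=https://www.yahoo.com/lifestyle/sitemaps/lifestyles-sitemap_index_US_en-US.xml.gz
# curl -s "$url" | read_sitemap "$url" | grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g'
```

With this in place, the same pipeline can be used for both sitemap.xml and sitemap.xml.gz URLs.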

Extract the URLs from a sitemap and save into a text file

To save the URLs from https://www.nps.gov/sitemap.xml in a file called nps.txt, redirect the output of the command to that file:

curl -s https://www.nps.gov/sitemap.xml | grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g' > nps.txt

The URLs are saved in nps.txt. To verify:

$ cat nps.txt 
https://www.nps.gov/sitemap/sitemap1.xml
https://www.nps.gov/sitemap/sitemap2.xml
https://www.nps.gov/sitemap/sitemap3.xml
https://www.nps.gov/sitemap/sitemap4.xml
https://www.nps.gov/sitemap/sitemap5.xml
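Once the URLs are in a text file, the usual text utilities apply. As a small sketch, the commands below remove duplicate lines and count what is left; urls.txt and urls-unique.txt are hypothetical filenames, and the contents are made-up placeholders standing in for a real file such as nps.txt.

```shell
# Made-up contents standing in for a real URL list (placeholder URLs).
cat > urls.txt <<'EOF'
https://example.com/sitemap/sitemap1.xml
https://example.com/sitemap/sitemap2.xml
https://example.com/sitemap/sitemap1.xml
EOF

# Sort the list, drop duplicate lines and save the result.
sort -u urls.txt > urls-unique.txt

# Count the URLs that remain.
wc -l < urls-unique.txt
```

Here the count is 2, because one of the three lines is a duplicate.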

Use our online XML Sitemap Extractor tool

You can just use our online XML Sitemap Extractor if you do not want to do all this.

We have done all the hard work for you. In this online tool, you have the option to save your result as a text file or copy it to the clipboard.

Conclusion

There is a lot more you can do with curl, wget, grep, awk, sed and other Linux utilities.

You may bookmark this page if you work on SEO. We will keep updating this page depending on your feedback. Please leave a comment or contact me via email if you have any questions or comments. Thank you for reading this article.


If you have any questions, please contact me by email. You can also post questions in our Facebook group. Thank you.

Disclaimer: Our website is supported by our users. We sometimes earn affiliate commissions when you click through the affiliate links on our website.
