If you work on SEO-related tasks and are looking for sitemaps or URLs in a sitemap, this article contains a list of Linux / Unix commands to make your job easy.
Commands to make your SEO-related tasks easier
Table of Contents
- What is an XML Sitemap?
- What do I need to get a sitemap from the command line?
- Download an XML sitemap and print the contents
- Extract the URLs from sitemap.xml
- Extract the URLs from sitemap.xml.gz
- Extract the URLs from a sitemap and save into a text file
- Use our online XML Sitemap Extractor tool
- Conclusion
What is an XML Sitemap?
A sitemap works like a map: it shows where things are and how to reach them. Search engines like Google use sitemaps to navigate a website in a more structured way. An XML sitemap is a file that lists all, or the most important, URLs on that website.
This is an example of the sitemap residing on our other website pinoymix.com:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url><loc>https://pinoymix.com/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/schools/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/history/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/about/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/contact/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/recipes/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
<url><loc>https://pinoymix.com/lyrics/</loc><lastmod>2024-02-26</lastmod><changefreq>weekly</changefreq><priority>0.7</priority></url>
</urlset>
The Googlebot constantly crawls the Internet and looks for updates in webpages. When it finds a sitemap, it categorizes and stores [indexes] each page in its database.
If you organize the webpages of your website neatly in a sitemap, the Googlebot can understand your website better, crawl more efficiently and index the pages faster. The interval or frequency with which Googlebot crawls and indexes your webpages varies.
Until now, I've been saying Googlebot, but Googlebot is only one of the gazillion spiders on the Internet. There are several others, including spiders and bots from Bing, Yahoo, Yandex, Baidu and others, that use sitemaps to index your information.
There are two main kinds of sitemaps: XML sitemaps and HTML sitemaps. HTML sitemaps are simple webpages that point to other webpages in that website. XML sitemaps are text files that contain a list of URLs on your website.
There are other sitemaps such as RSS or Atom feeds, which we have on this website, as well as text sitemaps such as urllist.txt, where you have one URL per line. This is the simplest type of sitemap.
What do I need to get a sitemap from the command line?
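You will use curl and wget to get sitemaps from other websites. You need to use a Terminal for this. macOS and most Linux distributions come with a terminal and curl pre-installed; wget is usually available on Linux, but on macOS it has to be installed separately. On Windows, recent versions of Windows 10 and 11 include curl, while wget has to be installed manually.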
Most of these commands are meant for MacOS and Linux/Unix, so if your program does not work on Windows, you might want to consider installing a Linux emulator or virtual machine with Linux. You will thank me later.
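Before running the commands in this article, it can help to confirm which tools are actually installed. The following is a minimal sketch, not an official check: check_tools is a hypothetical helper name chosen for illustration, and it relies only on the standard command -v shell builtin.

```shell
# check_tools TOOL...: report whether each named tool is on the PATH.
# (check_tools is a hypothetical helper, not part of curl or wget.)
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

# The tools used throughout this article.
check_tools curl wget gunzip grep sed
```

Any tool reported as missing would need to be installed through your system's package manager before the later commands will work.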
Download an XML sitemap and print the contents
For our example, let us download the sitemap from moz.com. It is located at https://moz.com/sitemap.xml.
Using curl:
curl -sL https://moz.com/sitemap.xml
OUTPUT:
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="sitemap.xsl"?><sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><sitemap><loc>https://moz.com/sitemaps-1-section-apiProductPages-1-sitemap.xml</loc><lastmod>2020-06-25T14:45:56-07:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-about-1-sitemap.xml</loc><lastmod>2024-01-04T09:04:18-08:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-blog-1-sitemap.xml</loc><lastmod>2024-02-27T06:46:35-08:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-categorygroup-blogCategories-1-sitemap.xml</loc><lastmod>2023-12-06T02:23:47-08:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-blogHome-1-sitemap.xml</loc><lastmod>2024-01-11T15:27:10-08:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-caseStudies-1-sitemap.xml</loc><lastmod>2023-09-11T15:22:31-07:00</lastmod></sitemap><sitemap><loc>https://moz.com/sitemaps-1-section-community-1-sitemap.xml</loc><lastmod>2023-12-04T10:56:30-08:00</lastmod></sitemap>
.....
In curl, the -s option is to keep it silent (without the progress information), and the -L option is to follow any redirects.
Using wget:
wget -q -O- https://moz.com/sitemap.xml
In wget, the -q option is to keep it quiet, and the -O- option is to get the file and print the result on STDOUT. Instead of -O-, if you use -O followed by a filename, it will save the output as a file with that filename.
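To make the save-to-file variant concrete, here is a small sketch that writes a sitemap to a named file using whichever of the two tools is present. The function name save_sitemap and the output filename moz-sitemap.xml are assumptions made for this example; the flags used are curl's -o and wget's -O, each followed by a filename.

```shell
# save_sitemap URL FILE: download URL into FILE.
# Uses curl when it is installed and falls back to wget otherwise.
# (save_sitemap is a hypothetical helper name.)
save_sitemap() {
  if command -v curl >/dev/null 2>&1; then
    curl -sL -o "$2" "$1"   # -o FILE writes the response body to FILE
  else
    wget -q -O "$2" "$1"    # -O FILE does the same in wget
  fi
}

# Example (uncomment to run; requires network access):
# save_sitemap https://moz.com/sitemap.xml moz-sitemap.xml
```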
Extract the URLs from sitemap.xml
To extract only the URLs, we first need to see where they are located in the sitemap. Each URL sits between a <loc> and a </loc> tag, as in this sample entry:
<sitemap><loc>https://moz.com/sitemaps-1-section-caseStudies-1-sitemap.xml</loc><lastmod>2023-09-11T15:22:31-07:00</lastmod></sitemap>
Using curl with grep and sed, we can get the URLs with this command:
curl -sL https://moz.com/sitemap.xml | grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g'
OUTPUT (first few results):
https://moz.com/sitemaps-1-section-apiProductPages-1-sitemap.xml
https://moz.com/sitemaps-1-section-about-1-sitemap.xml
https://moz.com/sitemaps-1-section-blog-1-sitemap.xml
https://moz.com/sitemaps-1-categorygroup-blogCategories-1-sitemap.xml
https://moz.com/sitemaps-1-section-blogHome-1-sitemap.xml
https://moz.com/sitemaps-1-section-caseStudies-1-sitemap.xml
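To see what each stage of that pipeline contributes, the sketch below wraps the same grep and sed steps in a function and runs it on a tiny inline sample, so it works without network access. The function name extract_locs and the example.com URLs are made up for illustration.

```shell
# extract_locs: read sitemap XML on standard input and print one <loc> value per line.
# (extract_locs is a hypothetical helper name.)
extract_locs() {
  grep -o "<loc>[^<]*" |   # keep only the "<loc>URL" fragments, one per line
    sed -e 's/<[^>]*>//g'  # strip the leading <loc> tag, leaving the URL
}

# Made-up sample data, used so the sketch runs offline.
extract_locs <<'EOF'
<urlset>
<url><loc>https://example.com/</loc><lastmod>2024-02-26</lastmod></url>
<url><loc>https://example.com/about/</loc><lastmod>2024-02-26</lastmod></url>
</urlset>
EOF
```

This prints https://example.com/ and https://example.com/about/ on separate lines. Note that this is plain text matching rather than XML parsing: a URL containing an ampersand appears as &amp; in sitemap XML, and this pipeline leaves it escaped.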
Extract the URLs from sitemap.xml.gz
What if the sitemap is saved as a gzipped file? In that case, we pipe it through gunzip.
As an example, we will read URLs from the gzipped sitemap at https://www.yahoo.com/lifestyle/sitemaps/lifestyles-sitemap_index_US_en-US.xml.gz
curl -s https://www.yahoo.com/lifestyle/sitemaps/lifestyles-sitemap_index_US_en-US.xml.gz | gunzip | grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g'
OUTPUT (first few results):
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-27_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-26_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-25_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-24_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-23_US_en-US.xml.gz
https://www.yahoo.com/lifestyle/sitemaps/lifestyle-sitemap_articles_2024-02-22_US_en-US.xml.gz
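The gunzip step can be tried offline with a locally compressed sample. In this sketch, gzip stands in for the server that would normally deliver the compressed file; the function name extract_locs_gz and the example.com URL are assumptions made for illustration.

```shell
# Made-up sample sitemap (an assumption for illustration).
sample='<urlset><url><loc>https://example.com/</loc></url></urlset>'

# extract_locs_gz: read gzip-compressed sitemap XML on standard input
# and print one <loc> value per line.
# (extract_locs_gz is a hypothetical helper name.)
extract_locs_gz() {
  gunzip | grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g'
}

# gzip plays the role of the remote server here; with a real sitemap,
# "curl -s URL" would take the place of the printf and gzip steps.
printf '%s\n' "$sample" | gzip | extract_locs_gz
```

This prints https://example.com/, showing that the compressed data passes through gunzip before grep and sed ever see it.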
Extract the URLs from a sitemap and save into a text file
If you want to extract the URLs from https://www.nps.gov/sitemap.xml and store them in a file called nps.txt:
curl -s https://www.nps.gov/sitemap.xml | grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g' > nps.txt
It gets saved in nps.txt. To verify:
$ cat nps.txt
https://www.nps.gov/sitemap/sitemap1.xml
https://www.nps.gov/sitemap/sitemap2.xml
https://www.nps.gov/sitemap/sitemap3.xml
https://www.nps.gov/sitemap/sitemap4.xml
https://www.nps.gov/sitemap/sitemap5.xml
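The entries saved in nps.txt are themselves sitemaps, because the top-level file is a sitemap index. One possible way to collect the page URLs is to loop over the file and run the same extraction on each child sitemap. This is a sketch under stated assumptions: the names FETCH, extract_locs, expand_sitemaps and the output file nps-pages.txt are hypothetical, and the child sitemaps are assumed to be uncompressed XML.

```shell
# Command used to fetch a sitemap. Override FETCH (for example with "cat")
# to read local files instead of URLs.
FETCH=${FETCH:-curl -sL}

# extract_locs: print every <loc> value found on standard input, one per line.
extract_locs() {
  grep -o "<loc>[^<]*" | sed -e 's/<[^>]*>//g'
}

# expand_sitemaps LISTFILE: read one sitemap location per line from LISTFILE
# and print the URLs each of those sitemaps contains.
expand_sitemaps() {
  while IFS= read -r sitemap; do
    [ -n "$sitemap" ] || continue      # skip blank lines
    $FETCH "$sitemap" | extract_locs   # FETCH is left unquoted so its options split
  done < "$1"
}

# Usage (assumes nps.txt was created by the command above; requires network access):
#   expand_sitemaps nps.txt > nps-pages.txt
```

Keeping the fetch command in a variable is a design choice that lets the loop be tried against local files before pointing it at a live website.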
Use our online XML Sitemap Extractor tool
You can just use our online XML Sitemap Extractor if you do not want to do all this.
We have done all the hard work for you. In this online tool, you have the option to save your result as a text file or copy/paste into the clipboard.
Conclusion
There is a lot more you can do with curl, wget, grep, awk, sed and other Linux utilities.
You may bookmark this page if you work on SEO. We will keep updating this page depending on your feedback. Please leave a comment or contact me via email if you have any questions or comments. Thank you for reading this article.
If you have any questions, please contact me at arulbOsutkNiqlzziyties@gNqmaizl.bkcom. You can also post questions in our Facebook group. Thank you.