Infrastructure
Check out all of our GitHub Actions: https://actions.cicirello.org/
The generate-sitemap GitHub action generates a sitemap for a website hosted on GitHub Pages, and has the following features:
- When generating an xml sitemap, it uses the last commit date of each file to generate the <lastmod> tag in the sitemap entry. If the file was created during that workflow run, but not yet committed, then it instead uses the current date (however, we recommend committing newly created files first, if possible).
- It checks the content of html files for <meta name="robots" content="noindex"> directives, excluding any files that contain them from the sitemap.
- It parses a robots.txt file, if present, excluding from the sitemap any URLs that match Disallow: rules for User-agent: *.
- It assumes, for files with the name index.html, that the preferred URL for the page ends with the enclosing directory, leaving out the index.html. For example, instead of https://WEBSITE/PATH/index.html, the sitemap will contain https://WEBSITE/PATH/ in such a case.
- It provides an option to exclude the .html extension from URLs listed in the sitemap.

The generate-sitemap GitHub action is designed to be used in combination with other GitHub Actions. For example, it does not commit and push the generated sitemap. See the Examples section for workflows that combine it with other actions.
The generate-sitemap action is for GitHub Pages sites where the repository contains the html and other files of the site itself, regardless of whether the html was generated by a static site generator or written by hand. For example, I use it for multiple Java project documentation sites, where most of the site is generated by javadoc. I also use it with my personal website, which is generated with a custom static site generator. As long as the repository for the GitHub Pages site contains the site as served (e.g., html files, pdf files, etc.), the generate-sitemap action is applicable.
The generate-sitemap action is not for GitHub Pages Jekyll sites (unless you generate the site locally and push the html output instead of the markdown, but why would you do that?). In the case of a GitHub Pages Jekyll site, the repository contains markdown, and not the html that is generated from the markdown. The generate-sitemap action does not support that use-case. If you are looking to generate a sitemap for a Jekyll website, there is a Jekyll plugin for that.
This action relies on the actions/checkout action with fetch-depth: 0.
Setting the fetch-depth to 0 for the checkout action ensures
that the generate-sitemap action will have access to the commit
history, which is used for generating the <lastmod> tags in the
sitemap.xml file. If you instead use the checkout action's default
(a shallow clone containing only the most recent commit), the
<lastmod> tags will be incorrect. So be sure to include the
following as a step in your workflow:
```yaml
steps:
- name: Checkout the repo
  uses: actions/checkout@v4
  with:
    fetch-depth: 0
```
path-to-root
The path to the root of the website relative to the
root of the repository. The default, ., is appropriate in most cases,
such as whenever the root of your Pages site is the root of the
repository itself. If you are using this for a GitHub Pages site
in the docs directory, such as for a documentation website, then
just pass docs for this input.
base-url-path
This is the URL of your website. You must specify this
for your sitemap to be meaningful. It defaults
to https://web.address.of.your.nifty.website/ for demonstration
purposes.
include-html
This flag determines whether html files are included in
your sitemap (files with an extension of either .html
or .htm). Default: true.
include-pdf
This flag determines whether pdf files are included in
your sitemap. Default: true.
additional-extensions
If you want to include URLs to other document types, you can use
the additional-extensions input to specify a list (separated by
spaces) of file extensions. For example, Google (and other search
engines) index a variety of other file types, including docx, doc,
source code for various common programming languages, etc. Here
is an example:
```yaml
- name: Generate the sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    additional-extensions: doc docx ppt pptx
```
exclude-paths
The action will automatically exclude any files or directories
based on a robots.txt file, if present. But if you have additional
directories or individual files that you wish to exclude from the
sitemap that are not otherwise blocked, you can use the exclude-paths
input to specify a list of them, separated by any whitespace characters.
For example, if you wish to exclude the directory /exclude-these as
well as the individual file /nositemap.html, you can use the following:
```yaml
- name: Generate the sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    exclude-paths: /exclude-these /nositemap.html
```
If you have many such cases to exclude, your workflow may be easier to read if you use a YAML multi-line string, as in the following:
```yaml
- name: Generate the sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    exclude-paths: >
      /exclude-these
      /nositemap.html
```
sitemap-format
Use this to specify the sitemap format. Default: xml.
The sitemap.xml generated by the default will contain lastmod dates
that are generated using the last commit dates of each file. Setting
this input to anything other than xml will generate a plain text
sitemap.txt simply listing the URLs.
drop-html-extension
The drop-html-extension input provides the option to exclude the .html extension
from URLs listed in the sitemap. The default is drop-html-extension: false, which
includes the .html extension for pages where the filename has the .html extension.
GitHub Pages automatically serves the corresponding html file if a URL has no
file extension. For example, if a user of your site browses to the URL
https://WEBSITE/PATH/filename (with no extension), GitHub Pages automatically
serves https://WEBSITE/PATH/filename.html if it exists. If you prefer to exclude
the .html extension from the URLs in your sitemap, then
pass drop-html-extension: true to the action in your workflow.
Note that you should also ensure that any canonical links that you list within
the html files correspond to your choice here.
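As a minimal sketch of a step that opts in to this behavior, the fragment below passes drop-html-extension: true alongside base-url-path. The URL is the same placeholder used elsewhere in this documentation, not a real site.

```yaml
- name: Generate the sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    # Placeholder URL; replace with the address of your own site.
    base-url-path: https://THE.URL.TO.YOUR.PAGE/
    # List https://WEBSITE/PATH/filename instead of
    # https://WEBSITE/PATH/filename.html in the sitemap.
    drop-html-extension: true
```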
date-only
The date-only input controls whether XML sitemaps include the full date and time in lastmod,
or only the date. The default is date-only: false, which includes the full date and time
in the lastmod fields. If you only want the date in the lastmod, then use date-only: true.
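A step using this input might look like the following sketch. The URL is a placeholder, and the comment about the shape of the resulting value is an illustrative assumption rather than captured output.

```yaml
- name: Generate the sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    # Placeholder URL; replace with the address of your own site.
    base-url-path: https://THE.URL.TO.YOUR.PAGE/
    # Assumption for illustration: each lastmod then holds only a date
    # (a value shaped like YYYY-MM-DD) rather than a full date and time.
    date-only: true
```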
sitemap-path
The generated sitemap is placed in the root of the website. This
output is the path to the generated sitemap file relative to the
root of the repository. If you didn't use the path-to-root input, then
this output should simply be the name of the sitemap file (sitemap.xml
or sitemap.txt).
url-count
This output provides the number of URLs in the sitemap.
excluded-count
This output provides the number of URLs excluded from the sitemap due
to either <meta name="robots" content="noindex"> within html files,
or due to directives in a robots.txt file.
You can run the action with a step in your workflow like this:
```yaml
- name: Generate the sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    base-url-path: https://THE.URL.TO.YOUR.PAGE/
```
In the above example, the major release version was used, which ensures that you'll be using the latest patch level release, including any bug fixes, etc. If you prefer, you can also pin a specific version, such as:
```yaml
- name: Generate the sitemap
  uses: cicirello/generate-sitemap@v1.10.1
  with:
    base-url-path: https://THE.URL.TO.YOUR.PAGE/
```
In this example workflow, we use all of the default inputs except for
the base-url-path input. The result will be a sitemap.xml
file in the root of the repository. After completion, it then
simply echoes the outputs.
```yaml
name: Generate xml sitemap
on:
  push:
    branches: [ main ]
jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap
    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/
    - name: Output stats
      run: |
        echo "sitemap-path = ${{ steps.sitemap.outputs.sitemap-path }}"
        echo "url-count = ${{ steps.sitemap.outputs.url-count }}"
        echo "excluded-count = ${{ steps.sitemap.outputs.excluded-count }}"
```
This example workflow illustrates how you might use this to generate
a sitemap for a Pages site in the docs directory of the
repository. It also demonstrates excluding pdf files, and
configuring a plain text sitemap.
```yaml
name: Generate API sitemap
on:
  push:
    branches: [ main ]
jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap
    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/
        path-to-root: docs
        include-pdf: false
        sitemap-format: txt
    - name: Output stats
      run: |
        echo "sitemap-path = ${{ steps.sitemap.outputs.sitemap-path }}"
        echo "url-count = ${{ steps.sitemap.outputs.url-count }}"
        echo "excluded-count = ${{ steps.sitemap.outputs.excluded-count }}"
```
In this example workflow, we add various additional file types to the
sitemap using the additional-extensions input. Note that the sitemap
will also include html files and pdf files, since the workflow uses the
default values for include-html and include-pdf, which both default to
true.
```yaml
name: Generate xml sitemap
on:
  push:
    branches: [ main ]
jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap
    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/
        additional-extensions: doc docx ppt pptx xls xlsx
    - name: Output stats
      run: |
        echo "sitemap-path = ${{ steps.sitemap.outputs.sitemap-path }}"
        echo "url-count = ${{ steps.sitemap.outputs.url-count }}"
        echo "excluded-count = ${{ steps.sitemap.outputs.excluded-count }}"
```
Presumably you want to do something with your sitemap once it is
generated. In this example workflow, we combine it with the action
peter-evans/create-pull-request.
First, the cicirello/generate-sitemap action generates the sitemap.
Then the peter-evans/create-pull-request action checks for changes,
and creates a pull request if the sitemap changed.
```yaml
name: Generate xml sitemap
on:
  push:
    branches: [ main ]
jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap
    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/
    - name: Create Pull Request
      uses: peter-evans/create-pull-request@v3
      with:
        title: "Automated sitemap update"
        body: >
          Sitemap updated by the [generate-sitemap](https://github.com/cicirello/generate-sitemap)
          GitHub action. Automated pull-request generated by the
          [create-pull-request](https://github.com/peter-evans/create-pull-request) GitHub action.
```
This first real example is from the personal website
of the developer. One of the workflows, sitemap-generation.yml,
is strictly for generating the sitemap. It runs on pushes of either *.html or *.pdf
files to the staging branch of this repository. After generating the sitemap, it uses
peter-evans/create-pull-request to generate a pull request.
You can also replace that step with a commit and push instead.
You can find the resulting sitemap here: sitemap.xml.
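The commit-and-push alternative mentioned above might look roughly like the following workflow. This is a sketch under stated assumptions, not the contents of the actual sitemap-generation.yml: the paths filter, the permissions block, the commit identity, and the commit message are all illustrative choices.

```yaml
name: Generate xml sitemap
on:
  push:
    branches: [ staging ]
    # Assumption: run only when html or pdf files change.
    paths: [ '**.html', '**.pdf' ]
permissions:
  # The default GITHUB_TOKEN needs write access in order to push.
  contents: write
jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap
    steps:
    - name: Checkout the repo
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - name: Generate the sitemap
      id: sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://THE.URL.TO.YOUR.PAGE/
    - name: Commit and push the sitemap
      run: |
        # Illustrative commit identity; use whatever identity suits your project.
        git config user.name "github-actions[bot]"
        git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
        git add "${{ steps.sitemap.outputs.sitemap-path }}"
        # Commit and push only if the sitemap actually changed.
        if ! git diff --cached --quiet; then
          git commit -m "Automated sitemap update"
          git push
        fi
```

Staging the file through the sitemap-path output, rather than a hard-coded filename, keeps the step correct if you later set path-to-root or switch sitemap-format.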
This next example is for the documentation website of
the Chips-n-Salsa library. The docs.yml
workflow runs on pushes and pull requests that change *.java files. It uses Maven
to run javadoc (e.g., with mvn javadoc:javadoc). It then copies the generated javadoc
documentation to the docs directory, from which the API website is served. This is followed
by another GitHub Action, cicirello/javadoc-cleanup,
which makes a few edits to the javadoc generated website to improve mobile browsing.
Next, it commits any changes (without pushing yet) produced by javadoc and/or javadoc-cleanup. After performing those commits, it runs the generate-sitemap action to generate the sitemap. It does this after committing the site changes so that the lastmod dates will be accurate. Finally, it uses peter-evans/create-pull-request to generate a pull request. You can also replace that step with a commit and push instead.
You can find the resulting sitemap here: sitemap.xml.
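The ordering described above (commit first, then generate the sitemap, then open the pull request) can be sketched as the following fragment of a job's steps. It is not the actual docs.yml: it assumes earlier steps already checked out the repository with fetch-depth: 0 and regenerated the site into the docs directory, and the commit identity, commit message, and pull request title are hypothetical.

```yaml
steps:
# Earlier steps (checkout with fetch-depth: 0, javadoc build, copy to docs) omitted.
- name: Commit the regenerated site without pushing
  run: |
    # Illustrative commit identity; use whatever identity suits your project.
    git config user.name "github-actions[bot]"
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
    git add docs
    # Commit only if the documentation actually changed.
    if ! git diff --cached --quiet; then
      git commit -m "Automated API website updates"
    fi
- name: Generate the sitemap
  id: sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    base-url-path: https://THE.URL.TO.YOUR.PAGE/
    # The site is served from the docs directory.
    path-to-root: docs
- name: Create Pull Request
  uses: peter-evans/create-pull-request@v3
  with:
    title: "Automated API website updates"
```

Committing before the generate-sitemap step means the changed files have a commit date to draw on, so their lastmod values come from the commit history rather than from the fallback to the current date.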
You can support the project in a number of ways:
- If you find the generate-sitemap action useful, consider starring the repository.

The scripts and documentation for this GitHub action are released under the MIT License.