Free Software Directory:Participate/Editing multiple pages

Currently, the best known way to edit multiple pages without remote access is to ask Semantic MediaWiki which pages match a given set of properties, export those pages as XML, make the edits, and import the changes back.

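As a rough illustration of the approach, here is a minimal sketch (not part of the page's script) that asks the API which pages match a Semantic MediaWiki query and then exports the latest revisions of those pages through Special:Export; the query "[[Category:Entry]]|limit=5" and the file name "example-export.xml" are placeholders. The script in the #XML transfer section wraps these same two requests with pagination, login, metadata updates and import support.

import requests

url = "https://directory.fsf.org/"
session = requests.Session()

# Ask Semantic MediaWiki which pages match the example query.
response = session.post(
    url + "w/api.php",
    data = {
        "action": "ask",
        "query": "[[Category:Entry]]|limit=5",
        "format": "json"
    }
)
page_names = list(response.json()["query"]["results"].keys())

# Export the latest revision of each matching page as XML.
response = session.post(
    url + "wiki/Special:Export",
    data = {
        "curonly": "true",
        "pages": "\n".join(page_names)
    }
)
with open("example-export.xml", "wb") as f:
    f.write(response.content)
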
However, there are some issues which may need consideration:

  • The procedures on this page do not make the Free Software Directory wiki's job queue process any faster, so importing many changes at once may make the server/service unstable.
  • For anonymous access and for accounts that are not marked as bots, the results might be limited.
  • According to MediaWiki, the download/export is limited to 35 pages per request.
  • If you plan to upload/import by using the upload_xmls function, then the account must have the importupload user right/permission (only granted through backend access to MediaWiki's LocalSettings.php) and, with the default clientlogin login method, must have been created without the Free Software Foundation Central Login.

Requirements

  • Knowledge of:
  • A computer with any of:
  • If you plan to upload/import by using the upload_xmls function, then fulfil one of the following:
    • If using the default clientlogin method: a bot account created without the Free Software Foundation Central Login, and which must have importupload user right/permission (only given through backend access to MediaWiki's LocalSettings.php).
    • If using the login method: a user account with the importupload user right/permission (only granted through backend access to MediaWiki's LocalSettings.php), as well as a bot account for the same user, created from Special:BotPasswords, which must also be given the importupload user right/permission and the following rights from Special:BotPasswords:
      • High-volume editing.
      • Edit existing pages.
      • Edit protected pages.
      • Create, edit, and move pages.
      • Upload new files.
      • Upload, replace, and move files.
  • Regardless of the account type, username and password are indirectly required by the write_updated_xmls function.
  • Python 3 and the modules mentioned on the lines starting with import in the script in the #XML transfer section.

QEMU setup

For the procedures it is assumed that you have either a Technoethical T400s, whose mix of low and high specifications is suitable for most office uses, or QEMU. A similar setup can be achieved with QEMU using the following commands:

    qemu-img create -f "qcow2" "Technoethical T400s.qcow2" "120G"

    qemu-system-x86_64 \
        -enable-kvm \
        -cpu "Penryn-v1" \
        -m "4G" \
        -hda "Technoethical T400s.qcow2" \
        -cdrom ~/"Downloads/trisquel_10.0.1_amd64.iso" \
        -netdev user,id=vmnic,hostname="Technoethical-T400s" \
        -device virtio-net,netdev=vmnic

Afterwards, use Trisquel's package manager to install the dependencies listed in the #Requirements section, as well as the Python modules from the lines starting with import in the script in the #XML transfer section.
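
A quick way to check that the required Python modules are present is sketched below (an optional verification, not part of the page's script); requests is the only module used by the script that is not in the Python standard library.

import importlib

# Modules imported by the script in the #XML transfer section.
for module in ("glob", "os", "re", "requests", "sys", "time", "xml.etree.ElementTree"):
    try:
        importlib.import_module(module)
        print(module, "found")
    except ImportError:
        print(module, "missing; install it with Trisquel's package manager")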

XML transfer

The following Python 3 script can be used to ask, download, update and upload Semantic MediaWiki pages as XML of the Free Software Directory.

#!/usr/bin/python3

# xml_transfer: Ask, download, update and upload Semantic MediaWiki pages as XML of the Free Software Directory (<https://directory.fsf.org>).
# Copyright (C) 2023  Adonay "adfeno" Felipe Nogueira <https://libreplanet.org/wiki/User:Adfeno>

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.

# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

"""
Ask, download, update and upload Semantic MediaWiki pages as XML of the Free Software Directory (<https://directory.fsf.org>).

Requirements:

* If you plan to upload/import by using the "upload_xmls" function, then fulfil one of the following:

** If using the default "clientlogin" method: a bot account created without the Free Software Foundation Central Login, and which must have importupload user right/permission (only given through backend access to MediaWiki's LocalSettings.php).

** If using the "login" method: a user account with the importupload user right/permission (only granted through backend access to MediaWiki's LocalSettings.php), as well as a bot account for the same user, from [[Special:BotPasswords]], which must also be given the importupload user right/permission and the following rights from [[Special:BotPasswords]]:

*** High-volume editing.
*** Edit existing pages.
*** Edit protected pages.
*** Create, edit, and move pages.
*** Upload new files.
*** Upload, replace, and move files.

* Regardless of the account type, "username" and "password" are indirectly required by the "write_updated_xmls" function.

Before using the functions from this module, please read the comment on the requirements and, after fulfilling those, set the following variables:

* work_dir (optional, default: "."): work directory; if you change it, call "os.chdir(work_dir)" afterwards.

* username and password: depending on the account type, may allow more page names to be fetched or the usage of the "upload_xmls" function. Regardless of the account type, the information is indirectly required by the "write_updated_xmls" function.

* edit_summary: edit summary/comment, which is only used by the "write_updated_xmls" function.

After changing the variables above, here follows a basic workflow:

1. Depending on the account type, calling the "login" function may allow more page names to be fetched or enable the usage of the "upload_xmls" function.

1.a. It's recommended to repeat this step if you take too long (or the connection is lost) between steps 2, 3.a, 5 or 8.

2. (Optional) Test page selection with "ask" function.

3. Get the full page list by either:

3.a. Call "download_full_page_list".

3.b. Read full page list from existing file by calling "read_full_page_list_file".

4. Split the full list into a list of lists by either:

4.a. Call "write_page_list_files" to write one file per sublist, returning sublists with a maximum of 35 page names.

4.b. Read multiple list files using "read_page_list_files", regardless of how many page names are in each sublist.

4.c. Split an existing full page list with "split_full_page_list" into a list of lists, each with a maximum of 35 page names, without touching any files.

5. Get the XMLs by calling "download_xmls".

6. Edit the main text of the entries using any tool; for example, the "xml.etree.ElementTree" Python module, an XPath filter, or an XML editor.

7. Update the XMLs' metadata using "write_updated_xmls", which takes into account "edit_summary", the current session's user name and identity, as well as the date and time.

8. Finally, send the XMLs by calling "upload_xmls".
"""

import glob
import os
import re
import requests
import sys
import time
import xml.etree.ElementTree as ET

# Basic configuration
# If you change "work_dir", run "os.chdir(work_dir)" afterwards.
work_dir = "."
# Name and password which, depending on the account type, may allow more page names to be fetched or the usage of the "upload_xmls" function.
username = ""
password = ""
# Edit summary/comment, which is only used by the "write_updated_xmls" function.
edit_summary = ""

# Automated setup after basic configuration
url = "https://directory.fsf.org/"
session = requests.Session()
os.chdir(work_dir)
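# Namespace map for MediaWiki XML export files; the empty prefix lets find/findall match unprefixed element paths (supported since Python 3.8).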
ns = {"" : "http://www.mediawiki.org/xml/export-0.10/"}
ET.register_namespace("", ns[""])

# Function definitions

def login(action = "clientlogin"):
    """
    Log in. This should be done first, and repeated if the server returns an authentication error when using other functions. Returns True on success, and nothing otherwise.
    Arguments:
    * action: "clientlogin" or "login".
    """
    if ((action != "clientlogin")
        and (action != "login")):
        raise TypeError('"action" must be "clientlogin" or "login".')
    global logintoken
    global session_username
    global session_userid
    global csrftoken
    # Step 1: Get a login token.
    response = session.get(
        url + "w/api.php",
        params = {
            "action": "query",
            "meta": "tokens",
            "type": "login",
            "format": "json"
        }
    )
    logintoken = response.json()["query"]["tokens"]["logintoken"]
    # Step 2
    if action == "clientlogin":
        # Method: 2.a: Log in using the clientlogin action.
        response = session.post(
            url + "w/api.php",
            data = {
                "action": action,
                "username": username,
                "password": password,
                "logintoken": logintoken,
                "loginreturnurl": url,
                "format": "json"
            }
        )
    elif action == "login":
        # Method: 2.b: Log in using the login action.
        response = session.post(
            url + "w/api.php",
            data = {
                "action": action,
                "lgname": username,
                "lgpassword": password,
                "lgtoken": logintoken,
                "format": "json"
            }
        )

    # Step 3: While logged in, retrieve a CSRF/edit token.
    response = session.get(
        url + "w/api.php",
        params = {
            "action": "query",
            "meta": "tokens",
            "format": "json"
        }
    )
    csrftoken = response.json()["query"]["tokens"]["csrftoken"]

    # Step 4: Get user identity and name.
    response = session.get(
        url + "w/api.php",
        params = {
            "action": "query",
            "meta": "userinfo",
            "format": "json"
        }
    )
    session_username = response.json()["query"]["userinfo"]["name"]
    session_userid = response.json()["query"]["userinfo"]["id"]
    return True

def ask(query = "[[Category:Entry]]"):
    """
    Do a single Semantic MediaWiki query, useful for testing. Returns a dictionary whose "data" is the paginated result, and "offset" is the next offset to use or None if no continuation is needed.
    Arguments:
    * query: Semantic MediaWiki query.
    """
    response = session.post(
        url + "w/api.php",
        data = {
            "action": "ask",
            "query": query,
            "format": "json"
        }
    )
    try:
        offset = response.json()["query-continue-offset"]
    except KeyError:
        offset = None
    return {
        "data": response.json()["query"]["results"],
        "offset": offset
    }

def download_full_page_list(query = "[[Category:Entry]]", full_page_list_file = ".list"):
    """
    Download the full page list matching the given single Semantic MediaWiki query and write it to a file. Returns a list with all the pages that were written into the file.
    Arguments:
    * query: Semantic MediaWiki query.
    * full_page_list_file: file path to write.
    """
    full_page_list = []
    # Strip any "|limit=N" from the query; pagination is driven by "|offset=N" below.
    query = [
        re.sub(
            re.escape("|") + r"\s*limit\s*=\s*\d+",
            "",
            query
        ),
        0
    ]
    while True:
        results = ask(query[0])
        full_page_list.extend(results["data"].keys())
        if results["offset"] is None:
            break
        # Replace an existing "|offset=N" with the new offset, or append one if absent.
        query = list(
            re.subn(
                re.escape("|") + r"\s*offset\s*=\s*\d+",
                "|offset=" + str(results["offset"]),
                query[0]
            )
        )
        if query[1] == 0:
            query[0] = query[0] + "|offset=" + str(results["offset"])
    with open(full_page_list_file, "w") as f:
        f.write("\n".join(full_page_list))
    return full_page_list

def read_full_page_list_file(full_page_list_file = ".list"):
    """
    Read the full page list from a file, without downloading the list. Returns a list with all the pages that were read from the file. This is useful for resuming a paused work.
    Arguments:
    * full_page_list_file: file path to read.
    """
    with open(full_page_list_file, "r") as f:
        full_page_list = f.read().splitlines()
    return full_page_list

def split_full_page_list(full_page_list):
    """
    Split the full page list, returning a list containing a series of lists, each with up to 35 page names.
    Arguments:
    * full_page_list: full page list. Required.
    """
    return [full_page_list[i:i + 35] for i in range(0, len(full_page_list), 35)]

def write_page_list_files(full_page_list):
    """
    Write the split page lists into text files, each named with the first page name of each set of 35 pages, suffixed with the ".txt" extension. Returns a list containing a series of lists, each with up to 35 page names.
    Arguments:
    * full_page_list: full page list. Required.
    """
    page_lists = split_full_page_list(full_page_list)
    for each_list in page_lists:
        with open(each_list[0] + ".txt", "w") as f:
            f.write("\n".join(each_list))
    return page_lists

def read_page_list_files(page_list_files = []):
    """
    Read the split page lists from text files. Returns a list containing a series of lists, each of those with page names.
    Arguments:
    * page_list_files: a list of file paths whose contents are page lists. Default: a search is done for every file in "work_dir" that ends with ".txt".
    """
    if ((page_list_files is None)
        or (len(page_list_files) == 0)):
        page_list_files = glob.glob("*.txt")
    if not isinstance(page_list_files, list):
        raise TypeError('"page_list_files" must be the instance of a list.')
    page_lists = []
    for each_list in page_list_files:
        with open(each_list, "r") as f:
            page_lists.append(f.read().splitlines())
    return page_lists

def download_xmls(page_lists = []):
    """
    Download the XMLs of each page list inside a major list; each XML will be named with the first page name of its set of pages, suffixed with the ".xml" extension. Returns True on success, and nothing otherwise.
    Arguments:
    * page_lists: a list of lists, each containing the page names to work on. Required.
    """
    for each_list in page_lists:
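        # Ask Special:Export for the latest revision only ("curonly") of the pages in this sublist (at most 35 per request).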
        response = session.post(
            url + "wiki/Special:Export",
            data = {
                "curonly": "true",
                "pages": "\n".join(each_list)
            }
        )
        with open(each_list[0] + ".xml", "wb") as f:
            f.write(response.content)
    return True

def write_updated_xmls(xml_files = []):
    """
    Update and rewrite the specified XMLs. The update writes "edit_summary" and the current session's user name and id into each revision's metadata, as well as a timestamp with the current date and time. Returns True on success, and nothing otherwise.
    Arguments:
    * xml_files: a list of file paths whose contents are the XML exports of the FSD. Default: a search is done for every file in "work_dir" that ends with ".xml".
    """
    if ((xml_files is None)
        or (len(xml_files) == 0)):
        xml_files = glob.glob("*.xml")
    if not isinstance(xml_files, list):
        raise TypeError('"xml_files" must be the instance of a list.')
    for each_xml in xml_files:
        xml = ET.parse(each_xml)
        for each_page in xml.findall(".//page", ns):
            each_page.find("revision/contributor/username", ns).text = session_username
            each_page.find("revision/contributor/id", ns).text = session_userid
            each_page.find("revision/comment", ns).text = edit_summary
            each_page.find("revision/timestamp", ns).text = time.strftime("%FT%TZ", time.gmtime())
        xml.write(each_xml, encoding = "utf-8")
    return True

def upload_xmls(xml_files = []):
    """
    Upload the specified XMLs. Returns True on success, and nothing otherwise.
    Arguments:
    * xml_files: a list of file paths whose contents are the XML exports of the FSD. Default: a search is done for every file in "work_dir" that ends with ".xml".
    """
    if ((xml_files is None)
        or (len(xml_files) == 0)):
        xml_files = glob.glob("*.xml")
    if not isinstance(xml_files, list):
        raise TypeError('"xml_files" must be the instance of a list.')
    for each_xml in xml_files:
        # Open the export in binary mode and close it after the request.
        with open(each_xml, "rb") as f:
            response = session.post(
                url + "w/api.php",
                data = {
                    "action": "import",
                    "token": csrftoken,
                    "summary": edit_summary,
                    "format": "json"
                },
                files = {
                    "xml": f
                }
            )
    return True
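
For reference, here is a hedged example of a complete session following the workflow described in the script's docstring. It assumes the script above has been saved as xml_transfer.py in the work directory; the credentials, the edit summary and the "Old text"/"New text" substitution are placeholders, not part of the script.

import xml_transfer as xt
import xml.etree.ElementTree as ET

# Placeholders: use your own bot credentials and edit summary.
xt.username = "ExampleUser@ExampleBot"
xt.password = "bot-password"
xt.edit_summary = "Example bulk edit"

xt.login("login")                                       # step 1
full_page_list = xt.download_full_page_list()           # step 3.a
page_lists = xt.write_page_list_files(full_page_list)   # step 4.a
xt.download_xmls(page_lists)                            # step 5

# Step 6: edit the wikitext of every page in every downloaded XML.
for each_list in page_lists:
    xml = ET.parse(each_list[0] + ".xml")
    for text_node in xml.findall(".//page/revision/text", xt.ns):
        if text_node.text:
            text_node.text = text_node.text.replace("Old text", "New text")  # placeholder edit
    xml.write(each_list[0] + ".xml", encoding = "utf-8")

xt.write_updated_xmls()                                 # step 7
xt.upload_xmls()                                        # step 8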


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the page “GNU Free Documentation License”.

The copyright and license notices on this page only apply to the text on this page. Any software, copyright licenses, or other similar notices described in this text have their own copyright notices and licenses, which can usually be found in the distribution or license text itself.