Free Software Directory:Participate/Editing multiple pages
Currently, the best known way to edit multiple pages without remote access is to ask Semantic MediaWiki which pages match a given set of properties, export those pages as XML, make the edits, and import the changes (a minimal query example follows the list below).
However, there are some issues which need consideration:
- The procedures on this page do not guarantee that the Free Software Directory wiki's job queue will be processed any faster, so queued jobs may pile up and the server/service may become unstable.
- For anonymous access and accounts which are not marked as bots, the results might be limited.
- According to MediaWiki, the download/export is limited to 35 pages per request.
- If you plan to upload/import by using the upload_xmls function, then the account is required to have the importupload user right/permission (only given through backend access to MediaWiki's LocalSettings.php) and, with the default clientlogin login method, must be created without the Free Software Foundation Central Login.
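As a minimal illustration of the approach, the following self-contained sketch performs the same kind of Semantic MediaWiki ask request that the ask function in the #XML transfer script sends; the |limit=5 parameter is only there to keep the test small, and [[Category:Entry]] is the default selection used by that script:

import requests

# Ask the Free Software Directory's Semantic MediaWiki API which pages
# match the query, and print the matching page names.
response = requests.post(
    "https://directory.fsf.org/w/api.php",
    data = {
        "action": "ask",
        "query": "[[Category:Entry]]|limit=5",
        "format": "json"
    }
)
for page_name in response.json()["query"]["results"].keys():
    print(page_name)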
Requirements
- Knowledge of:
- Python 3.
- Python regular expressions.
- Semantic MediaWiki ask queries.
- XPath 1.0.
- A computer with any of:
- QEMU.
- Specifications similar to those of the Technoethical T400s.
- If you plan to upload/import by using the upload_xmls function, then fulfil one of the following (a sketch for checking the importupload right follows this list):
  - If using the default clientlogin method: a bot account created without the Free Software Foundation Central Login, which must have the importupload user right/permission (only given through backend access to MediaWiki's LocalSettings.php).
  - If using the login method: a user account with the importupload user right/permission (only given through backend access to MediaWiki's LocalSettings.php), as well as a bot account for the same user, created from Special:BotPasswords, which must also be given the importupload user right/permission and the following rights from Special:BotPasswords:
    - High-volume editing.
    - Edit existing pages.
    - Edit protected pages.
    - Create, edit, and move pages.
    - Upload new files.
    - Upload, replace, and move files.
- Independently of the account type, username and password are indirectly required by the write_updated_xmls function.
- Python 3 and the modules mentioned at the lines starting with import in the #XML transfer script.
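If you are unsure whether an account actually has the importupload user right/permission, the following sketch (which assumes the #XML transfer script was saved as xml_transfer.py and had its username and password variables filled in) asks the MediaWiki API for the rights of the logged-in session:

import xml_transfer as xt

# Log in with the configured account, then check whether its user rights
# include "importupload".
xt.login()
response = xt.session.get(
    xt.url + "w/api.php",
    params = {
        "action": "query",
        "meta": "userinfo",
        "uiprop": "rights",
        "format": "json"
    }
)
print("importupload" in response.json()["query"]["userinfo"]["rights"])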
QEMU setup
The procedures assume that you have QEMU or a Technoethical T400s, which has a mix of low and high specifications suitable for most office uses. A similar setup can be achieved with QEMU using the following commands:
qemu-img create -f "qcow2" "Technoethical T400s.qcow2" "120G"

qemu-system-x86_64 \
    -enable-kvm \
    -cpu "Penryn-v1" \
    -m "4G" \
    -hda "Technoethical T400s.qcow2" \
    -cdrom ~/"Downloads/trisquel_10.0.1_amd64.iso" \
    -netdev user,id=vmnic,hostname="Technoethical-T400s" \
    -device virtio-net,netdev=vmnic
Afterwards, use Trisquel's package manager to install the dependencies listed in the #Requirements section, as well as the Python modules from the lines starting with import in the script inside the #XML transfer section.
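Of the modules imported by the script, only requests is outside the Python 3 standard library; glob, os, re, sys, time and xml.etree.ElementTree ship with Python itself. As a sketch, assuming the package names of Trisquel's Ubuntu-derived repositories, the installation amounts to:

sudo apt install python3 python3-requests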
XML transfer
The following Python 3 script can be used to ask for, download, update, and upload Semantic MediaWiki pages of the Free Software Directory as XML.
#!/usr/bin/python3

# xml_transfer: Ask, download, update and upload Semantic MediaWiki pages as XML of the Free Software Directory (<https://directory.fsf.org>).
# Copyright (C) 2023 Adonay "adfeno" Felipe Nogueira <https://libreplanet.org/wiki/User:Adfeno>

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.

# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

"""
Ask, download, update and upload Semantic MediaWiki pages as XML of the Free Software Directory (<https://directory.fsf.org>).

Requirements:

* If you plan to upload/import by using the "upload_xmls" function, then fulfil one of the following:

** If using the default "clientlogin" method: a bot account created without the Free Software Foundation Central Login, and which must have the importupload user right/permission (only given through backend access to MediaWiki's LocalSettings.php).

** If using the "login" method: a user account with the importupload user right/permission (only given through backend access to MediaWiki's LocalSettings.php), as well as a bot account for the same user, from [[Special:BotPasswords]], which must also be given the importupload user right/permission, as well as having the following rights from [[Special:BotPasswords]]:

*** High-volume editing.
*** Edit existing pages.
*** Edit protected pages.
*** Create, edit, and move pages.
*** Upload new files.
*** Upload, replace, and move files.

* Independently of the account type, "username" and "password" are indirectly required by the "write_updated_xmls" function.

Before using the functions from this module, please read the comment on the requirements and, after fulfilling those, set the following variables:

* work_dir (optional, default: "."): work directory; if you change it, call "os.chdir(work_dir)" afterwards.

* username and password: depending on the account type, may allow more page names to be fetched or the usage of the "upload_xmls" function. Independently of the account type, the information is indirectly required by the "write_updated_xmls" function.

* edit_summary: edit summary/comment, which is only used by the "write_updated_xmls" function.

After changing the variables above, here follows a basic workflow:

1. Depending on the account type, if you want to allow more page names to be fetched or the usage of the "upload_xmls" function, call the "login" function.

1.a. It's recommended to repeat this step if you take too long (or the connection is lost) between steps 2, 3.a, 5 or 8.

2. (Optional) Test page selection with the "ask" function.

3. Get the full page list by either:

3.a. Calling "download_full_page_list".

3.b. Reading the full page list from an existing file by calling "read_full_page_list_file".

4. Split the full list into a list of lists by either:

4.a. Calling "write_page_list_files" to write to files, returning sublists with a maximum of 35 page names.

4.b. Reading multiple list files using "read_page_list_files", regardless of how many page names are in each sublist.

4.c. Splitting an existing full page list with "split_full_page_list" into a list of lists, each with a maximum of 35 page names, without touching any files.

5. Get the XMLs by calling "download_xmls".

6. Edit the main text of the entries using any tool, for which you can also use the "xml.etree.ElementTree" Python module, as well as any XPath filter or XML editor.

7. Update the XMLs' metadata using "write_updated_xmls", which takes into account "edit_summary", the current session's user name and identity, as well as the date and time.

8. Finally, send the XMLs by calling "upload_xmls".
"""

import glob
import os
import re
import requests
import sys
import time
import xml.etree.ElementTree as ET

# Basic configuration

# If you change "work_dir", run "os.chdir(work_dir)" afterwards.
work_dir = "."

# Name and password which, depending on the account type, may allow more page names to be fetched or the usage of the "upload_xmls" function.
username = ""
password = ""

# Edit summary/comment, which is only used by the "write_updated_xmls" function.
edit_summary = ""

# Automated setup after basic configuration
url = "https://directory.fsf.org/"
session = requests.Session()
os.chdir(work_dir)
ns = {"": "http://www.mediawiki.org/xml/export-0.10/"}
ET.register_namespace("", ns[""])

# Function definitions

def login(action = "clientlogin"):
    """
    Do login, which should be done first or if the server returns an authentication error when using other functions.

    Returns True on success, and nothing otherwise.

    Arguments:

    * action: "clientlogin" or "login".
    """
    if ((action != "clientlogin") and (action != "login")):
        raise TypeError('"action" must be "clientlogin" or "login".')
    global logintoken
    global session_username
    global session_userid
    global csrftoken
    # Step 1: Get a login token.
    response = session.get(
        url + "w/api.php",
        params = {
            "action": "query",
            "meta": "tokens",
            "type": "login",
            "format": "json"
        }
    )
    logintoken = response.json()["query"]["tokens"]["logintoken"]
    # Step 2
    if action == "clientlogin":
        # Method 2.a: Log in using the clientlogin action.
        response = session.post(
            url + "w/api.php",
            data = {
                "action": action,
                "username": username,
                "password": password,
                "logintoken": logintoken,
                "loginreturnurl": url,
                "format": "json"
            }
        )
    elif action == "login":
        # Method 2.b: Log in using the login action.
        response = session.post(
            url + "w/api.php",
            data = {
                "action": action,
                "lgname": username,
                "lgpassword": password,
                "lgtoken": logintoken,
                "format": "json"
            }
        )
    # Step 3: While logged in, retrieve a CSRF/edit token.
    response = session.get(
        url + "w/api.php",
        params = {
            "action": "query",
            "meta": "tokens",
            "format": "json"
        }
    )
    csrftoken = response.json()["query"]["tokens"]["csrftoken"]
    # Step 4: Get user identity and name, used later by "write_updated_xmls".
    response = session.get(
        url + "w/api.php",
        params = {
            "action": "query",
            "meta": "userinfo",
            "format": "json"
        }
    )
    session_username = response.json()["query"]["userinfo"]["name"]
    session_userid = response.json()["query"]["userinfo"]["id"]
    return True

def ask(query = "[[Category:Entry]]"):
    """
    Do a single Semantic MediaWiki query, useful for testing.

    Returns a dictionary whose "data" is the paginated result, and whose "offset" is the next offset to use, or None if no continuation is needed.

    Arguments:

    * query: Semantic MediaWiki query.
    """
    response = session.post(
        url + "w/api.php",
        data = {
            "action": "ask",
            "query": query,
            "format": "json"
        }
    )
    try:
        offset = response.json()["query-continue-offset"]
    except KeyError:
        offset = None
    return {
        "data": response.json()["query"]["results"],
        "offset": offset
    }

def download_full_page_list(query = "[[Category:Entry]]", full_page_list_file = ".list"):
    """
    Download the full page list matching the given single Semantic MediaWiki query and write it to a file.

    Returns a list with all the pages that were written into the file.

    Arguments:

    * query: Semantic MediaWiki query.

    * full_page_list_file: file path to write.
    """
    full_page_list = []
    # Strip any user-given "limit" parameter, since pagination is handled here through "offset".
    query = [
        re.sub(
            r"\|\s*limit\s*=\s*\d+",
            "",
            query
        ),
        0
    ]
    while True:
        results = ask(query[0])
        if results["offset"] is not None:
            # Replace the previous "offset" parameter with the new one, appending it if absent.
            query = list(
                re.subn(
                    r"\|\s*offset\s*=\s*\d+",
                    "|offset=" + str(results["offset"]),
                    query[0]
                )
            )
            if query[1] == 0:
                query[0] = query[0] + "|offset=" + str(results["offset"])
        full_page_list.extend(results["data"].keys())
        if results["offset"] is None:
            break
    with open(full_page_list_file, "w") as f:
        f.write("\n".join(full_page_list))
    return full_page_list

def read_full_page_list_file(full_page_list_file = ".list"):
    """
    Read the full page list from a file, without downloading the list.

    Returns a list with all the pages that were read from the file. This is useful for resuming paused work.

    Arguments:

    * full_page_list_file: file path to read.
    """
    with open(full_page_list_file, "r") as f:
        full_page_list = f.read().splitlines()
    return full_page_list

def split_full_page_list(full_page_list):
    """
    Split the full page list, returning a list containing a series of lists, each of those with at most 35 page names.

    Arguments:

    * full_page_list: full page list. Required.
    """
    return [full_page_list[i:i + 35] for i in range(0, len(full_page_list), 35)]

def write_page_list_files(full_page_list):
    """
    Write the split page lists into text files, each named with the first page name of each set of 35 pages, suffixed with the ".txt" extension.

    Returns a list containing a series of lists, each of those with at most 35 page names.

    Arguments:

    * full_page_list: full page list. Required.
    """
    page_lists = split_full_page_list(full_page_list)
    for each_list in page_lists:
        with open(each_list[0] + ".txt", "w") as f:
            f.write("\n".join(each_list))
    return page_lists

def read_page_list_files(page_list_files = []):
    """
    Read the split page lists from text files, each containing one page name per line.

    Returns a list containing a series of lists, each of those with page names.

    Arguments:

    * page_list_files: a list of file paths whose contents are page lists. Default: a search is done for every file in "work_dir" that ends with ".txt".
    """
    if ((page_list_files is None) or (len(page_list_files) == 0)):
        page_list_files = glob.glob("*.txt")
    if not isinstance(page_list_files, list):
        raise TypeError('"page_list_files" must be the instance of a list.')
    page_lists = []
    for each_list in page_list_files:
        with open(each_list, "r") as f:
            page_lists.append(f.read().splitlines())
    return page_lists

def download_xmls(page_lists = []):
    """
    Download the XMLs of each page list inside a major list; each XML will be named with the first page name of each set of pages, suffixed with the ".xml" extension.

    Returns True on success, and nothing otherwise.

    Arguments:

    * page_lists: a list of lists, each containing the page names to work on. Required.
    """
    for each_list in page_lists:
        response = session.post(
            url + "wiki/Special:Export",
            data = {
                "curonly": "true",
                "pages": "\n".join(each_list)
            }
        )
        with open(each_list[0] + ".xml", "wb") as f:
            f.write(response.content)
    return True

def write_updated_xmls(xml_files = []):
    """
    Update and rewrite the specified XMLs.

    The update takes "edit_summary", the current session's user name and identity to change the same metadata in the revision, as well as the timestamp of the current date and time.

    Returns True on success, and nothing otherwise.

    Arguments:

    * xml_files: a list of file paths whose contents are the XML exports of the FSD. Default: a search is done for every file in "work_dir" that ends with ".xml".
    """
    if ((xml_files is None) or (len(xml_files) == 0)):
        xml_files = glob.glob("*.xml")
    if not isinstance(xml_files, list):
        raise TypeError('"xml_files" must be the instance of a list.')
    for each_xml in xml_files:
        xml = ET.parse(each_xml)
        for each_page in xml.findall(".//page", ns):
            each_page.find("revision/contributor/username", ns).text = session_username
            each_page.find("revision/contributor/id", ns).text = str(session_userid)
            each_page.find("revision/comment", ns).text = edit_summary
            each_page.find("revision/timestamp", ns).text = time.strftime("%FT%TZ", time.gmtime())
        xml.write(each_xml, encoding = "utf-8")
    return True

def upload_xmls(xml_files = []):
    """
    Upload the specified XMLs.

    Returns True on success, and nothing otherwise.

    Arguments:

    * xml_files: a list of file paths whose contents are the XML exports of the FSD. Default: a search is done for every file in "work_dir" that ends with ".xml".
    """
    if ((xml_files is None) or (len(xml_files) == 0)):
        xml_files = glob.glob("*.xml")
    if not isinstance(xml_files, list):
        raise TypeError('"xml_files" must be the instance of a list.')
    for each_xml in xml_files:
        response = session.post(
            url + "w/api.php",
            data = {
                "action": "import",
                "token": csrftoken,
                "summary": edit_summary,
                "format": "json"
            },
            files = {
                "xml": open(each_xml, "rb")
            }
        )
    return True
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the page “GNU Free Documentation License”.
The copyright and license notices on this page only apply to the text on this page. Any software, copyright licenses, or other similar notices described in this text have their own copyright notices and licenses, which can usually be found in the distribution or the license text itself.