Robots or bots are automatic processes which interact with Wikipedia as though they were human editors. This page attempts to explain how to carry out the development of a bot for use on Wikipedia. The explanation is geared mainly towards those who have some prior programming experience, but are unsure of how to apply this knowledge to creating a Wikipedia bot.
Why would I need to create a bot?

Bots can automate tasks and perform them much faster than humans. If you have a simple task that must be performed many times (for example, adding a template to every page in a category of 1000 pages), it is better suited to a bot than to a human.

Considerations before creating a bot

There are already a number of bots running on Wikipedia. Many of these bots publish their source code, which can sometimes be reused with little additional development time. In addition, there are a number of semi-bots available to anyone. Most of these take the form of enhanced web browsers with Wikipedia-specific functionality. The most popular of these is AWB; see Wikipedia:Tools/Editing tools for a complete list.

If you have no previous programming experience, it may be simpler to ask an existing bot to do the job, or ask others to develop a bot for you. These requests can be made at Wikipedia:Bot requests. If you wish to write a new bot anyway, be aware that learning a programming language is a non-trivial task. However, it is not black magic; anyone can learn how to program with sufficient time and effort. Good luck!

If you decide to create a bot, planning is crucial to obtain an error-free, efficient, and effective program. The following initial considerations are important:
How does a Wikipedia bot work?

Overview of operation

Just like a human editor, a Wikipedia bot reads Wikipedia pages and makes changes where it thinks changes need to be made. The difference is that although bots are faster and less prone to fatigue than humans, they are nowhere near as bright as we are. Bots are good at repetitive tasks that have easily defined patterns, where few decisions have to be made.

In the most typical case, a bot logs in to its own account and requests pages from Wikipedia just as a browser does (although it works on the page in memory rather than displaying it on screen), and then programmatically examines the page code to see if any changes need to be made. It then makes and submits whatever edits it was designed to do, again using the same requests a browser would use. This approach, often called screen scraping, uses the standard HTTP GET method: whenever you see /w/index.php?...=...&...=... in the browser address bar, everything after the question mark is variables and data sent by the GET method. There are also a handful of Application Programming Interfaces (described below) for getting pages from Wikipedia and sending edits to it.

Because bots access pages the same way people do, bots can experience the same kinds of difficulties that human users do. They can get caught in edit conflicts, have page timeouts, or run across other unexpected complications while requesting pages or making edits. Because the volume of work done by a bot is larger than that done by a live person, the bot is more likely to encounter these issues. Thus, it is important to consider these situations when writing a bot.

APIs for bots

In order to make changes to Wikipedia pages, a bot necessarily has to retrieve pages from Wikipedia and send edits back. There are several Application Programming Interfaces (APIs) available for that purpose.
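To make the point about GET variables in the overview concrete, here is a minimal Python sketch that splits an index.php URL into its variables and builds one back from a set of variables. The helper names `get_parameters` and `build_index_url` are hypothetical, and the sample URL (with `action=raw`) is only an illustration, not a recommendation of which interface a bot should use.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_parameters(url):
    """Return the GET variables of a URL as a dict of single values."""
    query = urlsplit(url).query  # everything after the question mark
    return {name: values[0] for name, values in parse_qs(query).items()}


def build_index_url(base, **params):
    """Append URL-encoded GET variables to a base address."""
    return base + "?" + urlencode(params)


# Illustrative URL only; the page title and action are assumptions.
example = "https://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=raw"
params = get_parameters(example)
rebuilt = build_index_url("https://en.wikipedia.org/w/index.php", **params)
```

Parsing the rebuilt URL yields the same variables again, even though characters such as the colon in the title are percent-encoded on the way out.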
Some Wikipedia web servers are configured to grant requests for compressed (gzip) content. This can be requested by including the line "Accept-Encoding: gzip" in the HTTP request header; if the HTTP reply header contains "Content-Encoding: gzip", the document is in gzip form, otherwise it is in the regular uncompressed form. Note that this is specific to the web server and not to the MediaWiki software. Other sites employing MediaWiki may not have this feature.

Logging in

Approved bots need to be logged in to make edits. Although a bot can make read requests without logging in, bots that have completed testing should log in for all activities. Bots logged in from an account with the bot flag can obtain more results per query from the MediaWiki API (api.php).

For security, login data must be passed using the HTTP POST method. Because parameters of HTTP GET requests are easily visible in the URL, logins via GET are disabled. To log a bot in using the MediaWiki API, use this URL and POST data:
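A minimal Python sketch of such a POST request follows, showing only that the credentials travel in the request body rather than in the URL. The endpoint address, the parameter names `action=login`, `lgname` and `lgpassword`, the helper `build_login_request`, and the account name and password are assumptions made for illustration; mw:API:Login is the authority on what the server currently expects (for instance, a login token may also be required). The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint for the English Wikipedia API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_login_request(username, password):
    """Build (but do not send) a login request with credentials in the body."""
    body = urlencode({
        "action": "login",       # assumed parameter names, see mw:API:Login
        "lgname": username,
        "lgpassword": password,
        "format": "xml",
    }).encode("utf-8")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder credentials, not a real account.
request = build_login_request("ExampleBot", "not-a-real-password")
```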
This will return a result (success or error) in XML form, as documented at mw:API:Login. Other output formats are available.

A successful login attempt will result in the Wikimedia server setting several HTTP cookies. The bot must save these cookies and send them back every time it makes a request (this is particularly crucial for editing). On the English Wikipedia, the following cookies should be used: enwikiUserID, enwikiToken, and enwikiUserName. The enwiki_session cookie is required to actually send an edit or commit some change; otherwise the MediaWiki:Session fail preview error message will be returned.

Editing; edit tokens

Wikipedia uses a system of edit tokens for making edits to Wikipedia pages, as well as for some other operations such as rollback. The token looks like a long hexadecimal number followed by '+\', for example:
The role of edit tokens is to prevent "edit hijacking", where users are tricked into making an edit by clicking a single link. The editing process involves two HTTP requests. First, a request for an edit token must be made. Then, a second HTTP request must be made that sends the new content of the page along with the edit token just obtained. It is not possible to make an edit in a single HTTP request. To obtain an edit token, follow these steps:
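Once a token has been returned, a bot can check its shape before attempting the second request. The Python sketch below assumes only the format described above (a hexadecimal string followed by '+\'); the function name `classify_edit_token` and the sample token are hypothetical.

```python
import re

# A bare suffix with no hexadecimal part.
TOKEN_SUFFIX = "+\\"

# One or more hexadecimal digits followed by a literal "+\".
_TOKEN_PATTERN = re.compile(r"^[0-9a-fA-F]+\+\\$")


def classify_edit_token(token):
    """Return 'ok', 'anonymous' or 'malformed' for a received edit token."""
    if token == TOKEN_SUFFIX:
        return "anonymous"   # suffix only: the session is probably not logged in
    if _TOKEN_PATTERN.match(token):
        return "ok"
    return "malformed"


# Made-up token value, used only to exercise the check.
sample_status = classify_edit_token("0123456789abcdef+\\")
```

A result other than "ok" is a signal to stop and re-check the login state rather than to submit the edit.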
If the edit token the bot receives does not have the hexadecimal string (i.e., the edit token is just '+\'), then the bot most likely is not logged in. This might be due to a number of factors: failure in authentication with the server, a dropped connection, a timeout of some sort, or an error in storing or returning the correct cookies. If it is not because of a programming error, just log in again to refresh the login cookies.

Edit conflicts

Edit conflicts occur when multiple, overlapping edit attempts are made on the same page. Almost every bot will eventually get caught in an edit conflict of one sort or another, and should include some mechanism to test for and accommodate these issues.

Bots that use the MediaWiki API (api.php) should retrieve the edit token, along with the ...

Generally speaking, if an edit fails to complete, the bot should check the page again before trying to make a new edit, to make sure the edit is still appropriate. Further, if a bot rechecks a page to resubmit a change, it should be careful to avoid any behavior that could lead to an infinite loop and any behavior that could even resemble edit warring.

Overview of the process of developing a bot

Actually coding or writing a bot is only one part of developing a bot. You should generally follow the development cycle below to ensure that your bot follows Wikipedia's bot policy. Failure to comply with the policy may lead to your bot failing to be approved or being blocked from editing Wikipedia.

Idea
Specification

Software architecture
Implementation

Implementation (or coding) involves turning design and planning into code. It may be the most obvious part of the software engineering job, but it is not necessarily the largest portion. In the implementation stage you should:
Testing

A good way of testing your bot as you are developing it is to have it show the changes (if any) it would have made to a page, rather than actually editing the live wiki. Some bot frameworks (such as pywikipedia) have pre-coded methods for showing diffs. During the approvals process, the bot will most likely be given a trial period (usually with a restriction on the number of edits or days it is to run for) during which it may actually edit, so that it can be fine-tuned and any bugs ironed out. At the end of the trial period, if everything went according to plan, the bot should get approval for full-scale operation.

Documentation

An important (and often overlooked) task is documenting the internal design of your bot for the purpose of future maintenance and enhancement. This is especially important if you are going to allow clones of your bot. Ideally, you should post the source code of your bot on its userpage if you want others to be able to run clones of it. This code should be well documented (usually using comments) for ease of use.

Queries/Complaints

You should be ready to respond to queries about or objections to your bot on your user talk page, especially if it is operating in a potentially sensitive area, such as fair-use image cleanup.

Maintenance

Maintaining and enhancing your bot to cope with newly discovered bugs or new requirements can take far more time than the initial development of the software. Not only may it be necessary to add code that does not fit the original design, but just determining how the software works at some point after it is completed may require significant effort (this is another reason to document your code as you go along).
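The dry-run approach described under Testing can be sketched with Python's standard difflib module: instead of saving, the bot prints the diff it would have produced. This is a minimal sketch and not the method any particular framework uses; the function name `preview_edit`, the page title and the sample text are hypothetical.

```python
import difflib


def preview_edit(title, old_text, new_text):
    """Return a unified diff of the proposed change, or '' if nothing changes."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=title + " (current)",
        tofile=title + " (proposed)",
        lineterm="",
    )
    return "\n".join(diff)


# Made-up page content used only to demonstrate the preview.
current = "Teh cat sat on the mat.\nIt was a sunny day."
proposed = current.replace("Teh", "The")
preview = preview_edit("Example page", current, proposed)
print(preview or "No changes would be made.")
```

An empty result is a cheap way to detect that a page needs no edit at all, which also avoids submitting null edits once the bot goes live.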
General guidelines for running a bot

In addition to the official bot policy, which covers the main points to consider when developing your bot, there are a number of more general advisory points to keep in mind.

Bot best practices
Common bot features you should consider implementing

Manual assistance

If your bot is doing anything that requires judgement or evaluation of context (e.g., correcting spelling), then you should consider making your bot manually assisted; that is, it should not make edits without human confirmation.

Disabling the bot

It is good bot policy to have a feature to disable the bot's operation if it is requested. Remember that if your bot goes bad, it is your responsibility to clean up after it! You could have the bot refuse to run if a message has been left on its talk page, on the assumption that the message may be a complaint against its activities; this can be checked using the API.

Signature

Just like a human, if your bot makes edits to a talk page on Wikipedia, it should sign its post with four tildes (~~~~). It should not sign any edits to text in the main namespace.

Open source bots

Many bot operators choose to make their code open source, and occasionally it may be required before approval for particularly complex bots. Making your code open source has several advantages:
Open source code, while rarely required, is typically encouraged in keeping with the open and transparent nature of Wikipedia, though there are some cases when code should not be made public. For example, the open proxy-finding code of ProcseeBot could be used for malicious purposes on other sites. Making code open source can add some extra work to coding. One has to make sure that sensitive information such as passwords is separated into a file that isn't made public.

There are several options available for users wishing to make their code open. Some users choose to put the code in a subpage of the bot's userspace, although this can be a hassle to maintain if not automated and results in the code being multi-licensed under Wikipedia's licensing terms in addition to any other terms you may specify. Another solution is to use a revision control system such as SVN, Git, or Mercurial. Wikipedia has articles comparing the different software options and websites for code hosting, many of which have no cost. The Wikimedia Toolserver also offers SVN hosting for its users.

Programming languages and libraries

See also: mw:API:Client Code

Bots can be written in almost any programming language. The choice of a language often depends on the experience of the bot writer (which languages are familiar) or on the availability of pre-developed libraries to perform the desired task. The following list includes some languages that have libraries to assist with bot tasks.

Perl

Perl has a run-time compiler. This means that it is not necessary to compile builds of your code yourself as it is with other programming languages. Instead, you simply create your program using a text editor such as gvim. You then run the code by passing it to an interpreter. This can be located either on your own computer or on a remote computer (webserver).
If the interpreter is located on a webserver, you can start your program and interact with it while it is running via the Common Gateway Interface from your browser. Perl is available for most operating systems, including Microsoft Windows (which most human editors use) and UNIX/Linux (which many webservers use). If your internet service provider provides you with webspace, the chances are good that you have access to a Perl build on the webserver from which you can run your Perl programs.

Guides to getting started with Perl programming:
Libraries:
PHP

PHP can also be used for programming bots. PHP is an especially good choice if you wish to provide a webform-based interface to your bot. For example, suppose you wanted to create a bot for renaming categories. You could create an HTML form into which you will type the current and desired names of a category. When the form is submitted, your bot could read these inputs, then edit all the articles in the current category and move them to the desired category. (Obviously, any bot with a form interface would need to be secured somehow from random web surfers.)
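The text-rewriting step of the category-renaming task described above can be sketched as follows. Python is used here rather than PHP, and the sketch covers only the change to a page's wikitext, not the form, fetching, or saving. The function name `rename_category` and the sample text are hypothetical, and the pattern is a simplification: it assumes the English "Category:" prefix and does not treat underscores and spaces as equivalent.

```python
import re


def rename_category(wikitext, old_name, new_name):
    """Replace links to one category with links to another, keeping sort keys."""
    pattern = re.compile(
        r"\[\[\s*(?i:Category)\s*:\s*"   # "[[Category:" with flexible spacing
        + re.escape(old_name)
        + r"\s*(\|[^\]]*)?\]\]"          # optional "|sort key", then "]]"
    )

    def replacement(match):
        sort_key = match.group(1) or ""
        return "[[Category:" + new_name + sort_key + "]]"

    return pattern.sub(replacement, wikitext)


# Made-up wikitext used only to demonstrate the substitution.
page = "Text [[Category:Old name]] [[Category:Old name|Sort]] [[Category:Other]]"
updated = rename_category(page, "Old name", "New name")
```

A function is used for the replacement instead of a template string so that backslashes or group references in the new category name are inserted literally.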
Python

Python is a popular interpreted language with object-oriented features.

Getting started with Python:

Libraries:
Microsoft .NET

Microsoft .NET is a software framework that supports a set of languages including C#, C++/CLI, Visual Basic .NET, J#, JScript .NET, IronPython, and Windows PowerShell. The free Microsoft Visual Studio .NET development environment is often used. Using the Mono Project, .NET programs can also run on Linux, Unix, BSD, Solaris and Mac OS X as well as under Windows.

Getting started:
Libraries:
Java

Java programs are generally developed with an IDE, such as Eclipse; development using a command line console (with the javac and java programs) is also an option.

Getting started:

Libraries:

Ruby

Libraries:
Chicken Scheme

Iron Chicken is an extension or "egg" for Chicken Scheme that makes the MediaWiki API programmable using s-expressions, and presents API and HTML output as SXML, which can be queried easily. A simple example that gets members of a category and writes them to a page in the client user's userspace is:

Libraries: