Assembly line examples


Here you can see some examples of typical bot tasks, implemented through the Apibot Assembly line interface. (Look here for examples using the Bridge interface.)

This interface works like an assembly line made of PHP objects linked one to another. You assemble the line, set it running, and off your Frankenstein goes. :-)

The line objects belong to different types - feeders, filters, fetchers, workers, writers... Each type has a role in the line. Feeders load onto the line the objects to be processed, eg. pages. Filters discard some of the objects so they won't be processed further. Fetchers get additional info about the processed objects if they don't have it, eg. they load page texts by their titles. Workers modify the objects according to the line's goal. Finally, writers put the changes into effect - they send the changed objects back to the wiki, write them into a file or database, etc.

Most objects are standardized to do specific tasks. Further customization is achieved by setting params on them. For this reason, the assembly line interface is easier to use than the Bridge one. All the PHP knowledge you need is the names of the classes, how to create an object, how to string it to another and how to assign a value to its property - all of this can be learned in an hour. You can concentrate on the processing logic instead of on writing PHP code.
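
The general pattern is always the same - here is a minimal sketch (the concrete classes are placeholders borrowed from the examples below; $core and $replaces are created as shown there):

# Assemble the line...
$feeder = new Feeder_Query_List_Allpages ( $core );  // puts page objects on the line
$worker = new Worker_EditPage_ReplaceStrings ( $core, $replaces );  // modifies them
$worker->link_to ( $feeder );  // each object is strung to the previous one
$writer = new Writer_Edit ( $core );  // puts the changes into effect
$writer->link_to ( $worker );

# ... and kick it running:
$feeder->feed();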

The examples assume that:

  • Apibot is in a subdirectory of the current directory, named "apibot"
  • the $logins array variable (probably defined in the file logins.php) contains a bot login with the key My_Bot@en.wikipedia.org.
  • the $bot_settings array variable (probably defined in the file settings.php) contains the bot settings.

Simple examples

Fix some popular typos in all articles

The pages in the Main namespace are typically called articles, and are the reason for the wiki's existence. Spelling typos are not welcome there. Let's fix some.

# Include the needed modules.
require_once ( dirname ( __FILE__ ) . '/apibot/settings.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/logins.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/core/core.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/interfaces/line/__all_modules.php' );
# line/__all_modules.php is a convenience that just includes all line module files.
# For faster start and less RAM usage, include only the needed module files.

# Create the Core object.
$core = new Core ( $logins['My_Bot@en.wikipedia.org'], $bot_settings );

# Create the assembly line objects and string them.
$feeder = new Feeder_Query_List_Allpages ( $core );
$feeder->filterredir = "non-redirects";  // we won't proofread redirects

$replaces = array (    // here we list the 'replaces' to perform
  'teh' => array (     // first popular typo desc (the key does not matter)
    'string' => "Teh", // replace this
    'with' => "The",   // ... with this
    'report' => "$1 'Teh's replaced",  // and log this message ($1 will be replaced with the count)
  ),
  'apiobt' => array (     // second popular typo desc... you already know the drill
    'string' => "Apiobt",
    'with' => "Apibot",
    'report' => "$1 'Apiobt's replaced",
  )
);
$worker = new Worker_EditPage_ReplaceStrings ( $core, $replaces );
// you might prefer Worker_EditPage_ReplaceRegexes, with 'regex' instead of 'string' in $replaces
$worker->link_to ( $feeder );

$writer = new Writer_Edit ( $core );  // will submit the changes (if any) to the wiki,
// using the list of $replaces reports as the edit summary (since none is given here)
$writer->link_to ( $worker );

$feeder->feed();  // power up! :-)
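
If plain strings are not enough, the regex variant mentioned in the comment above would look roughly like this (a sketch - the PCRE pattern here is only an illustration):

$replaces = array (
  'teh' => array (
    'regex' => "/\\bTeh\\b/u",  // a PCRE pattern instead of a plain string
    'with' => "The",
    'report' => "$1 'Teh's replaced",
  ),
);
$worker = new Worker_EditPage_ReplaceRegexes ( $core, $replaces );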

Change a specific text everywhere

If you are making a lot of possible changes in a text, like fixing thousands of popular typos, it might be worth parsing all articles. However, if you have to change one specific text, it is better to check only the pages that contain it.

You come across an impossibly old wiki that still refers to Myanmar as Burma. Its admin asks you to help replace the old name everywhere:

# Include the needed modules.
require_once ( dirname ( __FILE__ ) . '/apibot/settings.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/logins.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/core/core.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/interfaces/line/__all_modules.php' );

# Create the Core object.
$core = new Core ( $logins['My_Bot@en.wikipedia.org'], $bot_settings );

# Create the assembly line objects and string them.
$feeder = new Feeder_Query_List_Search ( $core );
$feeder->search = "Burma";
$feeder->what = "text";
$feeder->namespace = 0;

$replaces = array (    // here we list the 'replaces' to perform
  'Burma' => array (     // first replacement desc
    'string' => "Burma", // replace this
    'with' => "Myanmar",   // ... with this
    'report' => "$1 replacements done",  // and log this message ($1 will be replaced with the count)
  ),
);
$worker = new Worker_EditPage_ReplaceStrings ( $core, $replaces );
$worker->link_to ( $feeder );

$writer = new Writer_Edit ( $core );
$writer->link_to ( $worker );

$feeder->feed();
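
If you want to be certain that only pages still containing the exact text get edited (search indexes can lag behind edits), you could insert a filter between the feeder and the worker. A sketch, using the Filter_Page_Text_StringsExist_Any class that appears in the complex example below:

$filter = new Filter_Page_Text_StringsExist_Any ( $core, array ( "Burma" ) );
$filter->link_to ( $feeder );
$worker->link_to ( $filter );  // the worker is now strung to the filter, not to the feeder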

Fix all wikilinks pointing to a moved page

A page was moved, but there are a lot of links to it in the wiki. What about fixing them to avoid the need for a redirect?

Assuming that the old page title was "Old Title", and the new one is "New Title":

# Include the needed modules.
require_once ( dirname ( __FILE__ ) . '/apibot/settings.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/logins.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/core/core.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/interfaces/line/__all_modules.php' );

# Create the Core object.
$core = new Core ( $logins['My_Bot@en.wikipedia.org'], $bot_settings );

# Create the assembly line objects and string them.
$feeder = new Feeder_Query_List_Backlinks ( $core );
$feeder->title = "Old Title";

$replaces = array (    // here we list the 'replaces' to perform
  'Old Title' => array (     // first replacement desc
    'string' => "Old Title", // replace this
    'with' => "New Title",   // ... with this
    'report' => "$1 replacements done",  // and log this message ($1 will be replaced with the count)
  ),
);
$worker = new Worker_EditPage_ReplaceWikilinksTargets ( $core, $replaces );
$worker->link_to ( $feeder );

$writer = new Writer_Edit ( $core );
$writer->link_to ( $worker );

$feeder->feed();

Move pages from one category to another

What if you need to move a lot of pages from one category to another? (In this example - all of them.)

Assuming that you have to move pages from "Old Category" to "New Category":

# Include the needed modules.
require_once ( dirname ( __FILE__ ) . '/apibot/settings.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/logins.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/core/core.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/interfaces/line/__all_modules.php' );

# Create the Core object.
$core = new Core ( $logins['My_Bot@en.wikipedia.org'], $bot_settings );

# Create the assembly line objects and string them.
$feeder = new Feeder_Query_List_Categorymembers ( $core );
$feeder->title = "Category:Old Category";  // note the "Category:" prefix!

$replaces = array (    // here we list the 'replaces' to perform
  'Old Category' => array (     // first replacement desc
    'old_name' => "Old Category", // replace this
    'new_name' => "New Category", // ... with this
  ),
);
$worker = new Worker_EditPage_ReplaceCategories ( $core, $replaces );
$worker->link_to ( $feeder );

$writer = new Writer_Edit ( $core );
$writer->link_to ( $worker );

$feeder->feed();
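
If the category also holds pages you don't want to touch (talk pages, templates, etc.), you could presumably restrict the feeder to a namespace, as with the search feeder above. This is an assumption - a namespace parameter is not shown for the Categorymembers feeder in these examples:

$feeder->namespace = 0;  // hypothetical: feed only the Main namespace members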

Replace a template in all pages

Often someone makes a new wiki template that is a better and more powerful version of an old one.

Let's replace the old template ("Template:Old Template") with the new one ("Template:New Template") everywhere.

Also, to make things more interesting, the templates are not completely compatible. The parameter "chairman" in Old Template is replaced by "chairperson" in New Template. We should take this into account, too.

# Include the needed modules.
require_once ( dirname ( __FILE__ ) . '/apibot/settings.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/logins.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/core/core.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/interfaces/line/__all_modules.php' );

# Create the Core object.
$core = new Core ( $logins['My_Bot@en.wikipedia.org'], $bot_settings );

# Create the assembly line objects and string them.
$feeder = new Feeder_Query_List_Embeddedin ( $core );
$feeder->title = "Template:Old Template";  // note the "Template:" prefix!

$replaces = array (    // here we list the 'replaces' to perform
  'Old Template' => array (     // first replacement desc
    'old_name' => "Old Template", // replace this
    'new_name' => "New Template", // ... with this
  ),
);
$worker = new Worker_EditPage_ReplaceTemplatesNames ( $core, $replaces );
$worker->link_to ( $feeder );

# We are not limited to stringing only one worker in a line :-)
$replaces2 = array (
  'chairman' => array (
    'name' => "New Template",  // the name of the template to change paramname in
    'old_paramname' => "chairman",
    'new_paramname' => "chairperson",
  )
);
$worker2 = new Worker_EditPage_ReplaceTemplatesParamnames ( $core, $replaces2 );
$worker2->link_to ( $worker );

$writer = new Writer_Edit ( $core );
$writer->link_to ( $worker2 );

$feeder->feed();

Revert all edits done by a prolific vandal

A vandal (going by the account "Nasty Vandal") messed up a thousand pages in your favorite wiki overnight. What about reverting all of his "contributions"?

# Include the needed modules.
require_once ( dirname ( __FILE__ ) . '/apibot/settings.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/logins.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/core/core.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/interfaces/line/__all_modules.php' );

# Create the Core object.
$core = new Core ( $logins['My_Bot@en.wikipedia.org'], $bot_settings );

# Create the assembly line objects and string them.
$feeder = new Feeder_Query_List_Usercontribs ( $core );
$feeder->user = "Nasty Vandal";
$feeder->show = "top";

# No workers needed - we will string the writer directly to the feeder:
$writer = new Writer_Rollback ( $core );
$writer->user = "Nasty Vandal";
$writer->link_to ( $feeder );

$feeder->feed();

Complex examples

The simple examples above barely scratch the surface of the Assembly line's power. Think of its objects as Lego blocks and build from them processing as complex as you like.

A convoluted task

This example exists only to give you an idea of how complex the Assembly line patterns can be, and how powerful they are.

Suppose you want to process pages (only once each) that have been edited during the last three days and are not redirects - but only those that match one of the following conditions:

  • They have a title starting with "Car "
  • They are up to 10 000 bytes long
  • They contain the text "When all else fails"

In addition, the pages that match the first and the second condition (but not the third) must also have a pageid smaller than 5100. And after this, if they belong to the category "Newly created pages", they must be removed from it.

After all this, the selected (and possibly de-categorized) pages must be edited:

  • If they belong to the category "Satirists", they must be moved to the category "Humorists"
  • The first three occurrences of the text "Driving a DeLorean" must be replaced with "Driving the DeLorean"
  • The file "DeLorean.jpg" must be replaced with "DeLorean2.jpg"

Then, the pages must be submitted back to the wiki (if changed).

In addition, they must be written to a local .CSV file. After this, the text "The Time Machine" in them must be replaced with "The DeLorean Time Machine", and they must be written again to a different local .CSV file.

Complex enough? :-) Well, let's construct the assembly line:

                  +----------------------+
                  |       feeder         |
                  | (edit recentchanges) |
                  +----------------------+
                             |
                  +----------------------+
                  |       filter         |
                  | (unique pages only)  |
                  +----------------------+
                             |
       +---------------------+-------------------+
       |                     |                   |
 +----------------+  +----------------+  +----------------+
 | filter by title|  | filter by size |  | filter by text |
 |    ("Car ")    |  |  (< 10 000)    |  | ("When all...")|
 +----------------+  +----------------+  +----------------+
       |                |                        |
       +----------------+                        |
                |                                |
       +----------------+                        |
       |filter by pageid|                        |
       |    (< 5100)    |                        |
       +----------------+                        |
                |                                |
     +--------------------+                      |
     |  worker - del cat  |                      |
     |("Newly created...")|                      |
     +--------------------+                      |
                |                                |
                +--------------------------------+
                                 |
                 +------------------------------+
                 |   worker - replace category  |
                 | ("Satirists" -> "Humorists") |
                 +------------------------------+
                                 |
                 +------------------------------+
                 |     worker - replace text    |
                 | ("Driving a DeLorean" -> ...)|
                 +------------------------------+
                                 |
                 +------------------------------+
                 |     worker - replace file    |
                 |    ("DeLorean.jpg" -> ...)   |
                 +------------------------------+
                                 |
                 +------------------------------+
                 |                              |
  +----------------------------+  +---------------------------+
  |   writer - send to wiki    |  | writer - write to a .CSV  |
  +----------------------------+  +---------------------------+
                                                |
                                  +---------------------------+
                                  |    worker - replace text  |
                                  |("The Time Machine" -> ...)|
                                  +---------------------------+
                                                |
                                  +---------------------------+
                                  | writer - write to a .CSV  |
                                  +---------------------------+

Of course, the line might be much more complex than this. :-) Let's see how we implement it.

# Include the needed modules.
require_once ( dirname ( __FILE__ ) . '/apibot/settings.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/logins.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/core/core.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/interfaces/line/__all_modules.php' );

# Create the Core object.
$core = new Core ( $logins['My_Bot@en.wikipedia.org'], $bot_settings );

# Create the feeder.
$feeder = new Feeder_Query_List_Recentchanges ( $core );
$feeder->start = date ( 'Y-m-d H:i:s', ( time() - ( 3 * 86400 ) ) );
$feeder->end   = date ( 'Y-m-d H:i:s', time() );
$feeder->dir   = "newer";
$feeder->type  = "edit";

$filter_unique = new Filter_Unique_Title ( $core );
$filter_unique->link_to ( $feeder );

# Create and link the first three filters.
$filter_by_title_car = new Filter_Page_Title_RegexesExist_Any ( $core, array ( "/^Car /u" ) );
$filter_by_title_car->link_to ( $filter_unique );

$filter_by_size_sub10000 = new Filter_Page_Size_Diap ( $core, 0, 10000 );
$filter_by_size_sub10000->link_to ( $filter_unique );  // we can link objects in parallel, too :-)

$filter_by_when_etc = new Filter_Page_Text_StringsExist_Any ( $core, array ( "When all else fails" ) );
$filter_by_when_etc->link_to ( $filter_unique );

# Filter what gets through the first and the second filter by the pageid < 5100 condition
$filter_by_pageid_sub5100 = new Filter_Page_Pageid_Diap ( $core, 0, 5100 );
$filter_by_pageid_sub5100->link_to ( $filter_by_title_car );
$filter_by_pageid_sub5100->link_to ( $filter_by_size_sub10000 );  // we can also link to several sources :-)

# ... and remove the pages that pass from [[Category:Newly created pages]]
$worker_del_cat_newlycreated = new Worker_EditPage_DeleteCategories ( $core, array ( "Newly created pages" ) );
$worker_del_cat_newlycreated->link_to ( $filter_by_pageid_sub5100 );

# Replace [[Category:Satirists]] with [[Category:Humorists]] in all data
$replcat1 = array (
  'sathum' => array (
    'old_name' => "Satirists",
    'new_name' => "Humorists",
  ),
);
$worker_repl_cat_sathum = new Worker_EditPage_ReplaceCategories ( $core, $replcat1 );
$worker_repl_cat_sathum->link_to ( $worker_del_cat_newlycreated );
$worker_repl_cat_sathum->link_to ( $filter_by_when_etc );  // forgot this already? :-)

# Replace text "Driving a DeLorean" with "Driving the DeLorean"
$repltext1 = array (
  'athe' => array (
    'string' => "Driving a DeLorean",
    'with' => "Driving the DeLorean",
  ),
);
$worker_repl_text_athe = new Worker_EditPage_ReplaceStrings ( $core, $repltext1 );
$worker_repl_text_athe->link_to ( $worker_repl_cat_sathum );

# Replace file "DeLorean.jpg" with "DeLorean2.jpg"
$replfile1 = array (
  'to2' => array (
    'old_name' => "DeLorean.jpg",
    'new_name' => "DeLorean2.jpg",
  ),
);
$worker_repl_file_DL2 = new Worker_EditPage_ReplaceFilelinksNames ( $core, $replfile1 );
$worker_repl_file_DL2->link_to ( $worker_repl_text_athe );

# Submit the changes to the wiki
$writer_edit = new Writer_Edit ( $core );
$writer_edit->link_to ( $worker_repl_file_DL2 );

# Also, write some of the page data to a .CSV file (eg. "./datafiles/pages1.csv")
$writer_csv1 = new Writer_File_CSV ( $core );
$writer_csv1->filename = "./datafiles/pages1.csv";
$writer_csv1->fields = array ( "title", "pageid", "revid", "timestamp", "user", "comment", "size", "content" );
$writer_csv1->link_to ( $worker_repl_file_DL2 );  // after all the edit workers, as the diagram shows

# Then, replace "The Time Machine" with "The DeLorean Time Machine"
$repltext2 = array (
  'deloreantm' => array (
    'string' => "The Time Machine",
    'with' => "The DeLorean Time Machine",
  ),
);
$worker_repl_text_deloreantm = new Worker_EditPage_ReplaceStrings ( $core, $repltext2 );
$worker_repl_text_deloreantm->link_to ( $writer_csv1 );

# Finally, write the changed page data to another .CSV file (eg. "./datafiles/pages2.csv")
$writer_csv2 = new Writer_File_CSV ( $core );
$writer_csv2->filename = "./datafiles/pages2.csv";
$writer_csv2->fields = array ( "title", "pageid", "revid", "timestamp", "user", "comment", "size", "content" );
$writer_csv2->link_to ( $worker_repl_text_deloreantm );

$feeder->feed();  // wake up, monster!

Stringing feeders to feeders

What if one feeder is not enough for the task? For example, what if we want to walk through all namespaces, list all pages in each namespace, and obtain all revisions of each page? Logic says there must be something like three consecutive feeders. And that is precisely what we must do.

There is, however, a little trick to know. The feeders must pass info to one another. Some feeders are able to inspect the signals for suitable info and use it, but not all. Sometimes you might need to tweak the info the signals carry a bit, in order to tell each next feeder what it needs to know. This is usually done by Workers that process signal parameters. We will use them to set data from the earlier feeders as parameters of the later ones.

When started by a signal, a feeder will look within the signal for a group of signal parameters called "feeder", and for a group named after the line object's slot name. If found, it will try to use the parameters in this group as its own parameters.

Unless told otherwise, a "slave" feeder will just pass through the "feed start" and "feed end" packets (which carry no data). The "feed data" packets will be stopped, and the feeder will send an entire feed for every one of them. We can use this to set, through a worker, data fields from the data signal sent by the master feeder as parameters of the slave feeder.

# Include the needed modules.
require_once ( dirname ( __FILE__ ) . '/apibot/settings.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/logins.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/core/core.php' );
require_once ( dirname ( __FILE__ ) . '/apibot/interfaces/line/__all_modules.php' );

# Create the Core object.
$core = new Core ( $logins['My_Bot@en.wikipedia.org'], $bot_settings );

# Create the namespaces feeder.
$feeder_nsid = new Feeder_Info_Namespaces_Ids ( $core );

# Create a worker that sets the namespace id passed as data as a feeder parameter.
$worker_sp_nsid = new Worker_SetParam_FromDataValue ( $core );
$worker_sp_nsid->group = "feeder";     // set it as a parameter in the group "feeder"
$worker_sp_nsid->name  = "namespace";  // ... with the parameter name "namespace"
$worker_sp_nsid->link_to ( $feeder_nsid );

# Create an allpages feeder. On every data signal (a new namespace) it will restart, setting feeder parameters
# from the signal parameters group "feeder" - that means, the data namespace id as a "namespace" parameter.
$feeder_ap = new Feeder_Query_List_Allpages ( $core );
$feeder_ap->link_to ( $worker_sp_nsid );

# Create a worker that sets the page title from the data as a feeder parameter.
$worker_sp_title = new Worker_SetParam_FromTitle ( $core );
$worker_sp_title->group = "feeder";  // set it as a parameter in the group "feeder", as we already know
$worker_sp_title->name = "titles";   // ... with the parameter name "titles"
$worker_sp_title->link_to ( $feeder_ap );

# Create a page revisions feeder. On every data signal (a new page) it will restart, setting feeder parameters
# from the signal group "feeder" - that means, the page title as a "titles" parameter.
$feeder_rv = new Feeder_Query_Property_Revisions ( $core );
# The page properties feeders (the Revisions one included) can add page data to the property data, if told to.
$feeder_rv->pagedata = array ( "pageid", "title" );  // add the page pageid and title to every revision
$feeder_rv->parent_page_key  = "";                   // ... not as a sub-array, but directly in the data array
$feeder_rv->start = date ( 'Y-m-d H:i:s', 1 );       // from the beginning of time (0 will not work!)
$feeder_rv->end   = date ( 'Y-m-d H:i:s', time() );  // ... to now
$feeder_rv->dir   = "newer";                         // ... traversing the revisions in this direction
$feeder_rv->prop = "parentid|user|timestamp|comment|size|content";  // obtain these revision properties
$feeder_rv->link_to ( $worker_sp_title );

# Create a CSV file writer - we want these revisions stored :-)
$writer_csv = new Writer_File_CSV ( $core );
$writer_csv->filename = "revisions.csv";  // into this file name
# Store these data fields in the file, in this order:
$writer_csv->fields = array ( "title", "ns", "pageid", "revid", "parentid", "user", "timestamp", "comment", "size", "content", "*" );
$writer_csv->link_to ( $feeder_rv );

$feeder_nsid->feed();

(Actually, the Allpages and the Revisions feeders can extract the namespace and the title, respectively, from the data without any extra workers in between. The workers are added here just to show how things work.)
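
So the same line could presumably be shortened by stringing the feeders directly one to another - a sketch, under the assumption stated above:

$feeder_ap->link_to ( $feeder_nsid );  // Allpages picks the namespace id from the data itself
$feeder_rv->link_to ( $feeder_ap );    // Revisions picks the page title from the data itself
$writer_csv->link_to ( $feeder_rv );

$feeder_nsid->feed();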