Programming and Webmasters forum

Processing Lists of Words for Wikipedia

 
polas  Offline
March 24, 2010, 12:23:26 PM

This request is from our member Calaka:
------------------------------------------------

There is a website (hxxp: www. globalnames. org/) that lists well over 18 million names, many of which are individual genera and species.  The problem is that, in addition to the legitimate names, there are also a lot of false positives (the list is gathered by a bot from a number of web resources).  I am a volunteer at Wikipedia doing minor things, and I started a little "project" where I gather lists of names so that other users (and I) can create articles based on the names from that massive list, using the references it gives.  That project page can be found here: hxxp: en. wikipedia. org/wiki/Wikipedia:WikiProject_Missing_encyclopedic_articles/Global_Names_Index .  I managed to get help from another Wikipedian volunteer, who used a Python program to extract the names for me to then place on Wikipedia lists manually (see: hxxp: www. eecis. udel. edu/~mpellegr/wikipedia/data/ ).  He told me that I am on my own when it comes to putting those lists on Wikipedia for others to work from, as he does not know how to do what I describe below.  I did the first 10 pages of A and the first page of B (an example is here: hxxp: en. wikipedia. org/wiki/Wikipedia:WikiProject_Missing_encyclopedic_articles/Global_Names_Index/A ).

There are two problems with this.  First, there seems to be a limit of about 2,000 names per list: beyond 2,000 entries the {{search}} link next to each name stops working.  That means over 9,000 list pages (18 million / 2,000) would be needed to cover all 18 million names, which would take forever to do by hand.
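The page count above is just a ceiling division, and splitting a name list into pages is a simple slice loop. A minimal sketch, assuming the ~2,000 limit from the post; the names `NAMES_PER_PAGE`, `pages_needed` and `split_into_pages` are made up for illustration:

```python
import math

NAMES_PER_PAGE = 2000  # approximate limit reported in the post, not an official figure


def pages_needed(total_names, per_page=NAMES_PER_PAGE):
    """Number of list pages required to hold total_names entries."""
    return math.ceil(total_names / per_page)


def split_into_pages(names, per_page=NAMES_PER_PAGE):
    """Yield consecutive slices of at most per_page names."""
    for start in range(0, len(names), per_page):
        yield names[start:start + per_page]


if __name__ == "__main__":
    print(pages_needed(18_000_000))  # full list
    print(pages_needed(4_000_000))   # upper end of the estimated filtered list
```

If the filtered list really comes down to 3-4 million names, the same arithmetic gives roughly 1,500 to 2,000 pages instead of 9,000.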

Second (and solving this would also help with the first problem), there needs to be a way of separating the crud from the good stuff, i.e. removing all the names on the list that are not needed and keeping only those that can be made into an article.  So what is useful and what is not?

Well:

1.  Anything that is just one word can remain on the list.  (This is a genus.)
2.  Anything that is two words, with the second word in lower case, can remain on the list.  (This is a species.)
3.  Anything that is three words, with the second and third words in lower case, can remain on the list.  (A lot rarer: this is a subspecies.)

Things that can be removed (they do not need to be created as articles):

1.  Names in which the second word starts with a capital letter (usually the person who discovered/named the species or genus).  This covers a genus followed by a person's name (two words) and a species followed by a person's name (three words, where it is the third word that is capitalized).
2.  Any name that has a number in it (this usually coincides with point 1, but not always).
3.  Anything with a comma, parenthesis or square bracket.
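The six points above can be sketched as a single predicate. This is only an illustration under stated assumptions: "lower case" is read as "the whole word is lower case", words are split on whitespace, and the function name `is_wanted` is made up here. The rejected example names use a placeholder "Author" rather than a real authority.

```python
import re

# Removal rules 2 and 3: any digit, comma, parenthesis or square bracket.
FORBIDDEN = re.compile(r"[0-9,()\[\]]")


def is_wanted(name):
    """Return True if name looks like a bare genus, species or subspecies."""
    if FORBIDDEN.search(name):
        return False
    words = name.split()
    # Keep rules 1-3: one, two or three words only.
    if not 1 <= len(words) <= 3:
        return False
    # Every word after the first must be lower case (assumed: the whole word).
    # This also covers removal rule 1, a capitalized second or third word.
    return all(word.islower() for word in words[1:])


if __name__ == "__main__":
    for candidate in [
        "Asceles",                       # one word: kept
        "Abacopteris penangiana",        # two words, second lower case: kept
        "Akera bullata Author",          # hypothetical author suffix: dropped
        "Akera bullata (Author, 1776)",  # hypothetical, brackets and digits: dropped
    ]:
        print(candidate, "->", is_wanted(candidate))
```

Note that the rules say nothing about the first word, so this sketch does not check its capitalization.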
 

So as you can see, if it were possible to design a bot/program that runs through the first list made by the other volunteer and compiles its own list by following the six points above, the list could be reduced to only the relevant names, greatly cutting the number of names (I estimate from 18 million down to 3-4 million, if not fewer!).  With that new list available, it would then be possible to paste the names into Wikipedia (if you are unsure how to do that automatically, there is surely someone in the community who knows how; the difficulty was finding someone capable of writing a program for the rules above).  The list can start alphabetically with everything after "Abacopteris penangiana" (as that is the last legitimate species on page 1), and pages 2-10 of the letter A can be redone to follow the new rules, since no one has gone through them yet, whereas I have been working on page 1 for a bit now.

I know this might seem a bit full on, even more so for someone who is not familiar with the "behind the scenes" dealings of Wikipedia, but if you can give any advice or comments on this, or perhaps even recommend a place to post the above, it would be much appreciated.

In case you are curious, this is the code used by the first individual to create that list:

hxxp: www. eecis. udel. edu/~mpellegr/wikipedia/gen

Oh, and PS: the names themselves can be turned into articles by following the references/lists provided by the Global Names Index.  Examples of such pages can be found here: hxxp: en. wikipedia. org/wiki/Asceles and here: hxxp: en. wikipedia. org/wiki/Akera_bullata (I made a few more, but I think you get the idea of what is possible with the references provided).

Mesham Type Oriented Parallel Programming Language
 
polas  Offline
March 24, 2010, 12:29:02 PM

Ok, so just so I have this correct:

The grammar of each line in the files at eecis.udel.edu is

* [[TOKEN]] -- {{Search|TOKEN}}

where TOKEN is the name (one or more words).
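Reading and writing lines in that grammar could look like the sketch below. It assumes the files follow the format exactly as written above (single spaces, the same TOKEN in both places); the helper names `parse_line` and `format_line` are invented for this example. The lazy match plus backreference lets a TOKEN contain square brackets, which some of the unwanted names apparently do.

```python
import re

# "* [[TOKEN]] -- {{Search|TOKEN}}" with the same TOKEN in both places.
# Both "Search" and "search" are accepted since the thread uses both spellings.
LINE = re.compile(
    r"^\*\s*\[\[(?P<token>.+?)\]\]\s*--\s*\{\{[Ss]earch\|(?P=token)\}\}\s*$"
)


def parse_line(line):
    """Return the TOKEN from one list line, or None if the line does not fit."""
    match = LINE.match(line)
    return match.group("token") if match else None


def format_line(token):
    """Build a list line in the same format for a given TOKEN."""
    return "* [[" + token + "]] -- {{Search|" + token + "}}"


if __name__ == "__main__":
    line = format_line("Akera bullata")
    print(line)
    print(parse_line(line))
```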

The rules are:

1) Only one word - ok
2) Two words, second lower case - ok
3) Three words, second and third words are lower case - ok
4) Anything else - fail and remove from the list
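Because rule 4 rejects "anything else", the four rules can also be written as one whitelist pattern over the whole line, so digits, commas and brackets fail without being listed. A rough sketch, assuming names contain only ASCII letters and hyphens separated by single spaces (the thread does not confirm this); `KEEP` and `filter_lines` are hypothetical names:

```python
import re

# Whitelist for one whole line. Assumed: ASCII letters and hyphens only,
# single spaces between words, and the same name repeated in {{Search|...}}.
KEEP = re.compile(
    r"^\* \[\[("
    r"[A-Za-z][A-Za-z-]*"       # rule 1: the first word
    r"(?: [a-z][a-z-]*){0,2}"   # rules 2-3: up to two more lower-case words
    r")\]\] -- \{\{Search\|\1\}\}\s*$"
)


def filter_lines(lines):
    """Return only the lines whose name passes rules 1-3 (rule 4 drops the rest)."""
    return [line for line in lines if KEEP.match(line)]


if __name__ == "__main__":
    sample = [
        "* [[Asceles]] -- {{Search|Asceles}}",
        "* [[Akera bullata]] -- {{Search|Akera bullata}}",
        "* [[Akera Author]] -- {{Search|Akera Author}}",  # hypothetical author name
    ]
    for kept in filter_lines(sample):
        print(kept)
```

The trade-off is that a whitelist silently drops names with accented or otherwise unexpected characters, so the rejected lines would be worth spot-checking.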

And you want the names that pass these rules written out (in the same format) to a text file?

I am asking in such simple terms just so we all have a clear idea of EXACTLY what you want.

I think it will be pretty easy to knock together such a program. I am actually really busy with work at the moment and away this weekend, but if you can confirm the exact rules then hopefully I can have a brief look at it next week.

Calaka  Offline
 
March 24, 2010, 07:40:32 PM

Hey there!

Yes, those rules are correct and would include only the names we are interested in creating articles for.

Yeah, they can be in a text file or whatever suits anyone (e.g. on a website somewhere).  The names would then be copied and pasted ~2,000 at a time onto a page, and once the names are compiled, someone can design a bot to submit them to the Wikipedia pages (but that can be arranged afterwards; the first issue is separating the good from the bad).

Hmm, but as I write this I see that the person who made the original files seems to have tweaked them already, though I am not sure what rules he followed! The number of names has been reduced significantly! I will try to get in contact with him and will let you know how it turns out.

Sorry for the trouble.  :S

 
polas  Offline
March 25, 2010, 11:46:35 AM

No probs. Let me know whether or not you still want this, and if you do I will see what I can do :)

P.S. I would probably use Perl to write a script to do it.
