What are the acronyms for ObsceneClean?
There are only three. OW stands for Offensive Word, OE stands for Offensive Expression, and OL stands for Offensive Language. OL is a more general term that includes both OWs and OEs.

Why can a very well disguised and long OW, like 'hfsjhfagkjh', not get detected?
Because it may be so well disguised that it is unrecognizable or barely recognizable, and if an OW is not recognizable, who cares?

What is the basis for assigning severities to OWs?
The assignment of severities to OWs does not involve subjective judgement of any kind. Generally two criteria are used: 1) The offensiveness of the OW as per studies (public surveys) in this area. 2) Prevalence. Less prevalent OWs are generally less offensive. Sound-alikes will have a lesser severity than the OW they sound like.

Will ObsceneClean recognize OWs consisting of 2 words, like 'butt hole'?
If an OW is two words, then it is not, strictly speaking, an OW. It is, rather, an obscene expression. There are a few exceptions for very well known OWs that are, by consensus, considered to be a single OW, like a--hole.

How can I ensure that ObsceneClean will not consider a certain OW or expression?
Enter the expression or word in the LocalSafeList.txt file on one line and load it. Avoid complex syntax! Long, disguised or complicated expressions will cause problems. Words and expressions in this file will not be considered at all by ObsceneClean. Entries in the LocalSafeList.txt are not case-insensitive.

I want ObsceneClean to detect a custom OW or expression. Is that possible?
Yes. In the LocalOEs.txt file, enter the custom OW or expression on one line, followed by a comma and a number from 1 to 10 indicating the severity (or offensiveness) of the entry. ObsceneClean will detect your entries in this file exactly as you entered them.

ObsceneClean detected an OW and reported it with a high probability but the word was used in a non-offensive way.
For example, a website about dog shows used the word bitch to refer to female dogs. How can I deal with this problem?
Software analysis of English text requires a great deal of disambiguation. ObsceneClean will attempt to disambiguate OWs with more than one meaning, but results may not be perfect. No AI at this time can speak and understand English like a human. Please report the example to ObsceneClean and we will try to add logic to deal with your particular situation. In the meantime, you could experiment with putting the word in the LocalSafeList.txt file.

I do not care about OWs with a low severity. Can I ignore them?
Yes. The lowestsev parameter specifies the lowest severity for OWs that you want to detect. If you set this parameter to 5, only OWs with a severity of 5 and up will be reported.

Someone attacked my website with offensive language and ObsceneClean did not stop them. How can I prevent these kinds of attacks?
Users of offensive language will simply repeat their attempts on a website, with increasing sophistication, until they succeed, and no software algorithm is smarter than a human. There are other methods, besides ObsceneClean, for dealing with offensive users. The best way of dealing with offensive language is to use all methods. Track users based on their IP address, a cookie or their registered userid and then ban the offending user. You might also consider warning the user after their first violation and then using more aggressive OL detection. This can be accomplished by changing the parameters in the settings file.

I do not care about low level OWs like 'damn' and 'hell'. Should I enter them in the LocalSafeList.txt file or should I use the lowestsev parameter?
It is much better to use the lowestsev parameter. This parameter indicates the lowest severity of OW you care about. OWs below this setting will not be reported.
If there is, however, a particular OW you don't want detected and it has a severity higher than your lowestsev value, then put this OW in the LocalSafeList.txt file on a line by itself.

If I enter an offensive word in the LocalOEs.txt file and also enter it in the LocalSafeList.txt file, which one wins?
Don't do that. But if you do, remember that anything in LocalSafeList.txt is always treated as safe and is never considered in any way.

What are the Shakespeare and King James Bible settings all about?
Setting these options to true (their default is false) will cause ObsceneClean to ignore any OWs that appear in the King James Bible and in the complete works of Shakespeare. Using these options will make ObsceneClean slightly slower, and it is recommended to use them only if a website is specifically about one of these subjects.

In what order does ObsceneClean search for OWs?
The order is not precise, but generally unambiguous and high severity OWs come first. Well known combination OWs, like a--hole, must come before the OW they are derived from. So the OW a--hole must come before ass.

So, once ObsceneClean is installed and configured I can just sit back and let it auto-magically filter out all profanity?
Uh, no. ObsceneClean is a tool that will assist website admins in filtering out profanity. Checking posts and banning users will still be necessary, but ObsceneClean should cut down on the time spent doing this kind of thing. No software is smarter than a human, yet.

Developer FAQs

Your code is messy. Why is it like that?
It's free. Leap off a bridge. (This code is a proof-of-concept algorithm. Care to rewrite the whole thing in Java or perfectly platform-independent ANSI C++ with full Unicode compliance?)

What is the basis for setting up the ObsceneClean data files?
A completely dispassionate analysis of offensive language. Offensive language's primary purpose is to offend. Allowing subjective reactions, even at an unconscious level, to influence this analysis will produce completely useless results.
GIGO.

There must be a way to simplify the whole algorithm. It seems the logic is overly complex and there are too many special rules. Can it be simplified?
No doubt some simplification is possible, but by and large the algorithm is necessarily complex. There are three reasons for this:
1) England was invaded many times and each invasion (especially the Danish invasion) altered the English language until it became one of the most complex languages in the world. It is not well known that English is a pidginized language. Adding to the complexity of English is the fact that it has the largest vocabulary of any language in the world. Any software that proposes to analyze or to some extent determine the meaning of English text will necessarily be complex.
2) Users will deliberately disguise OWs in a myriad of ways.
3) Users will employ slang in the extreme. Slang does not follow any rules of syntax and grammar, leaving little basis for an algorithm based on the rules of English.

What is the difference between parameters $LowestSevConsidered and $lowestsev?
OWs under the severity specified for $LowestSevConsidered will not be considered at all in detection. The parameter $lowestsev is the lowest severity for an OW that will be reported, although words under this severity may still be considered in the overall evaluation of OWs. If an ambiguous OW is detected, the presence of other OWs improves the likelihood that the ambiguous OW is used in an offensive way. For instance, if the word b-tch is detected and the words ass and wh-re are also detected in the same text, there is a much higher likelihood that the word b-tch was used in an offensive way as opposed to its non-offensive context. However, if $LowestSevConsidered is raised from its default of 1 up to 3, then the word ass, with a severity of 1, will be completely ignored in this scenario. In most situations, it is best to leave $LowestSevConsidered at its default.
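
The interaction between the two thresholds can be illustrated with a small Python sketch. This is not ObsceneClean's code. The Hit class, the function names and the severities given to b-tch and wh-re are hypothetical and chosen only for illustration; only the severity of 1 for ass comes from the answer above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    word: str
    severity: int  # 1 (mild) to 10 (severe)


def apply_thresholds(hits, lowest_sev_considered=1, lowest_sev=1):
    """Split detected hits into (considered, reported).

    considered: hits at or above lowest_sev_considered; these may act as
                supporting context for ambiguous OWs.
    reported:   the subset of considered hits at or above lowest_sev.
    """
    considered = [h for h in hits if h.severity >= lowest_sev_considered]
    reported = [h for h in considered if h.severity >= lowest_sev]
    return considered, reported


def supporting_count(target_word, considered):
    """Number of other considered hits available as context for target_word."""
    return sum(1 for h in considered if h.word != target_word)


# Severities for b-tch and wh-re are assumed values, not taken from the data files.
hits = [Hit("b-tch", 5), Hit("ass", 1), Hit("wh-re", 6)]

# Default $LowestSevConsidered of 1: ass is not reported, but still counts as context.
considered_default, reported_default = apply_thresholds(hits, 1, 5)
context_default = supporting_count("b-tch", considered_default)

# $LowestSevConsidered raised to 3: ass is ignored completely.
considered_raised, reported_raised = apply_thresholds(hits, 3, 5)
context_raised = supporting_count("b-tch", considered_raised)
```

In this sketch the same two words are reported in both runs, but b-tch has two supporting OWs with the default setting and only one after $LowestSevConsidered is raised, which is why leaving it at its default is usually the safer choice.
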
The main REGEX used to detect OWs does not consider word boundaries. Why?
Word boundaries are of limited value in detecting OWs. Consider, for example: aaafuckaaa YOU!

In what order are OWs searched for?
The order is by prevalence of the OW, then severity. An OW that is within another OW must go after it, e.g. asshole must be before ass! OWs are sorted in MasterOWs.txt in this order.
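
Both points can be tried out with a short Python sketch. It assumes, purely for illustration, that the OWs are joined into one ordered alternation; ObsceneClean's actual REGEX is not shown in this FAQ and may be built differently. The function names build_pattern and misordered_pairs are hypothetical.

```python
import re


def build_pattern(ordered_ows, word_boundaries=False):
    """Join OWs into one alternation, keeping the order they are given in."""
    body = "|".join(re.escape(w) for w in ordered_ows)
    if word_boundaries:
        body = r"\b(?:" + body + r")\b"
    return re.compile(body, re.IGNORECASE)


def misordered_pairs(ordered_ows):
    """Return (inner, outer) pairs where inner is listed before an OW containing it."""
    pairs = []
    for i, inner in enumerate(ordered_ows):
        for outer in ordered_ows[i + 1:]:
            if inner != outer and inner in outer:
                pairs.append((inner, outer))
    return pairs


text = "aaafuckaaa YOU!"

# Without word boundaries the padded OW is still found.
found_plain = build_pattern(["fuck"]).search(text)

# With \b on both sides nothing matches, because the OW is glued to other letters.
found_bounded = build_pattern(["fuck"], word_boundaries=True).search(text)

# Alternation tries the alternatives left to right, so list order decides the match.
good_order = build_pattern(["asshole", "ass"]).search("what an asshole").group()
bad_order = build_pattern(["ass", "asshole"]).search("what an asshole").group()
```

Here found_plain is a match and found_bounded is None, while good_order is "asshole" and bad_order is only "ass". A check like misordered_pairs(["ass", "asshole"]) returns [("ass", "asshole")], which could be used to catch ordering mistakes in a list before it is loaded.
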