Here's the approach I went with after playing around a bit. It's a combination of basic rules, "fixes" and synonyms. First, apply a char_filter to enforce a set of basic spelling rules. It's not 100% correct, but it does the job pretty well:
"char_filter": {
"en_char_filter": { "type": "mapping", "mappings": [
# fixes
"aerie=>axerie", "aeroplane=>airplane", "aloe=>aloxe", "canoe=>canoxe", "coerce=>coxerce", "poem=>poxem", "prise=>prixse",
# whole words
"armour=>armor", "behaviour=>behavior", "centre=>center", "colour=>color", "clamour=>clamor", "draught=>draft", "endeavour=>endeavor", "favour=>favor", "flavour=>flavor", "harbour=>harbor", "honour=>honor",
"humour=>humor", "labour=>labor", "litre=>liter", "metre=>meter", "mould=>mold", "neighbour=>neighbor", "plough=>plow", "saviour=>savior", "savour=>savor",
# generic transformations
"ae=>e", "ction=>xion", "disc=>disk", "gramme=>gram", "isable=>izable", "isation=>ization", "ise=>ize", "ising=>izing", "ll=>l", "oe=>e", "ogue=>og", "sation=>zation", "yse=>yze", "ysing=>yzing"
] }
}
The "fixes" entries are there to prevent incorrect application of the other rules. For example, "prise=>prixse" prevents "prise" from being changed to "prize" by the generic "ise=>ize" rule; the two words have different meanings. You may need to adapt these to your own needs.
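To see why a "fix" entry shields a word from a generic rule, here is a minimal Python sketch that imitates the matching as I understand it from the Elasticsearch documentation: the input is scanned left to right, the longest pattern matching at a given position wins, and replaced text is not rescanned. The helper name `apply_mappings` is hypothetical and this is only a simplified simulation, not Elasticsearch's actual implementation.

```python
def apply_mappings(text, rules):
    """Simplified simulation: greedy, longest-match-first, left-to-right."""
    table = dict(rule.split("=>") for rule in rules)
    keys = sorted(table, key=len, reverse=True)  # try the longest patterns first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])  # emit the replacement, do not rescan it
                i += len(key)
                break
        else:
            out.append(text[i])  # no pattern starts here, copy the character
            i += 1
    return "".join(out)


generic = ["ise=>ize", "isation=>ization", "ll=>l"]
with_fix = ["prise=>prixse"] + generic

print(apply_mappings("prise", generic))      # prize   (meaning changed)
print(apply_mappings("prise", with_fix))     # prixse  (kept distinct from "prize")
print(apply_mappings("organise", with_fix))  # organize
```

Because the same char_filter runs at index time and at query time, the artificial form "prixse" never needs to be readable; it only has to stay different from "prize".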
Next, include a synonym filter to catch the most frequently used exceptions:
"en_synonym_filter": { "type": "synonym", "synonyms": EN_SYNONYMS }
Here's our list of synonyms, which includes the most important keywords for our use case. You may want to adapt this list to your own needs:
EN_SYNONYMS = (
"accolade, prize => award",
"accoutrement => accouterment",
"aching, pain => hurt",
"acw, anticlockwise, counterclockwise, counter-clockwise => ccw",
"adaptor => adapter",
"advocate, attorney, barrister, procurator, solicitor => lawyer",
"ageing => aging",
"agendas, agendum => agenda",
"almanack => almanac",
"aluminium => aluminum",
"america, united states, usa",
"amphitheatre => amphitheater",
"anti-aliased, anti-aliasing => antialiased",
"arbour => arbor",
"ardour => ardor",
"arse => ass",
"artefact => artifact",
"aubergine => eggplant",
"automobile, motorcar => car",
"axe => ax",
"bannister => banister",
"barbecue => bbq",
"battleaxe => battleax",
"baulk => balk",
"beetroot => beet",
"biassed => biased",
"biassing => biasing",
"biscuit => cookie",
"black american, african american, afro-american, negro",
"bobsleigh => bobsled",
"bonnet => hood",
"bulb, electric bulb, light bulb, lightbulb",
"burned => burnt",
"bussines, bussiness => business",
"business man, business people, businessman",
"business woman, business people, businesswoman",
"bussing => busing",
"cactus, cactuses => cacti",
"calibre => caliber",
"candour => candor",
"candy floss, candyfloss, cotton candy",
"car park, parking area, parking ground, parking lot, parking-lot, parking place, parking",
"carburettor => carburetor",
"castor => caster",
"cataloguing => cataloging",
"catboat, sailboat, sailing boat",
"champion, gainer, victor, win, winner => victory",
"chat => talk",
"chequebook => checkbook",
"chequer => checker",
"chequerboard => checkerboard",
"chequered => checkered",
"christmas tree ball, christmas tree ball ornament, christmas ball ornament, christmas bauble",
"christmas, x-mas => xmas",
"cinema => movies",
"clangour => clangor",
"clarinettist => clarinetist",
"conditioning => conditioner",
"conference => meeting",
"coriander => cilantro",
"corporate => company",
"cosmos, universe => outer space",
"cosy, cosiness => cozy",
"criminal => crime",
"curriculums => curricula",
"cypher => cipher",
"daddy, father, pa, papa => dad",
"defence => defense",
"defenceless => defenseless",
"demeanour => demeanor",
"departure platform, station platform, train platform, train station",
"dishrag => dish cloth",
"dishtowel, dishcloth => dish towel",
"doughnut => donut",
"downspout => drainpipe",
"drugstore => pharmacy",
"e-mail => email",
"enamoured => enamored",
"england => britain",
"english => british",
"epaulette => epaulet",
"exercise, excercise, training, workout => fitness",
"expressway, motorway, highway => freeway",
"facebook => facebook, social media",
"fanny => buttocks",
"fanny pack => bum bag",
"farmyard => barnyard",
"faucet => tap",
"fervour => fervor",
"fibre => fiber",
"fibreglass => fiberglass",
"flashlight => torch",
"flautist => flutist",
"flier => flyer",
"flower fly, hoverfly, syrphid fly, syrphus fly",
"foot-walk, sidewalk, sideway => pavement",
"football, soccer",
"forums => fora",
"fourth => 4",
"freshman => fresher",
"chips, fries, french fries",
"gaol => jail",
"gaolbird => jailbird",
"gaolbreak => jailbreak",
"gaoler => jailer",
"garbage, rubbish => trash",
"gasoline => petrol",
"gases, gasses",
"gauge => gage",
"gauged => gaged",
"gauging => gaging",
"gipsy, gipsies, gypsies => gypsy",
"glamour => glamor",
"glueing => gluing",
"gravesite, sepulchre, sepulture => sepulcher",
"grey => gray",
"greyish => grayish",
"greyness => grayness",
"groyne => groin",
"gryphon, griffon => griffin",
"hand shake, shake hands, shaking hands, handshake",
"haulier => hauler",
"hobo, homeless, tramp => bum",
"new year, new year's eve, hogmanay, silvester, sylvester",
"holiday => vacation",
"holidaymaker, holiday-maker, vacationer, vacationist => tourist",
"homosexual, fag => gay",
"inbox, letterbox, outbox, postbox => mailbox",
"independence day, 4th of july, fourth of july, july 4th, july 4, 4th july, july fourth, forth of july, 4 july, fourth july, 4th july",
"infant, suckling, toddler => baby",
"infeasible => unfeasible",
"inquire, inquiry => enquire",
"insure => ensure",
"internet, website => www",
"jelly => jam",
"jewelery, jewellery => jewelry",
"jogging => running",
"journey => travel",
"judgement => judgment",
"kerb => curb",
"kiwifruit => kiwi",
"laborer => worker",
"lacklustre => lackluster",
"ladybeetle, ladybird, ladybug => ladybird beetle",
"larrikin, scalawag, rascal, scallywag => naughty boy",
"leaf => leaves",
"licence, licenced, licencing => license",
"liquorice => licorice",
"lorry => truck",
"loupe, magnifier, magnifying, magnifying glass, magnifying lens, zoom",
"louvred => louvered",
"louvres => louver",
"lustre => luster",
"mail => post",
"mailman => postman",
"marriage, married, marry, marrying, wedding => wed",
"mayonaise => mayo",
"meagre => meager",
"misdemeanour => misdemeanor",
"mitre => miter",
"mom, momma, mummy, mother => mum",
"moonlight => moon light",
"moult => molt",
"moustache, moustached => mustache",
"nappy => diaper",
"nightlife => night life",
"normalcy => normality",
"octopus => kraken",
"odour => odor",
"odourless => odorless",
"offence => offense",
"omelette => omelet",
"# fix torres del paine",
"paine => painee",
"pajamas => pyjamas",
"pantyhose => tights",
"parenthesis, parentheses => bracket",
"parliament => congress",
"parlour => parlor",
"persnickety => pernickety",
"philtre => filter",
"phoney => phony",
"popsicle => iced-lolly",
"porch => veranda",
"pretence => pretense",
"pullover, jumper => sweater",
"pyjama => pajama",
"railway => railroad",
"rancour => rancor",
"rappel => abseil",
"row house, serial house, terrace house, terraced house, terraced housing, town house",
"rigour => rigor",
"rumour => rumor",
"sabre => saber",
"saltpetre => saltpeter",
"sanitarium => sanatorium",
"santa, santa claus, st nicholas, st nicholas day",
"sceptic, sceptical, scepticism, sceptics => skeptic",
"sceptre => scepter",
"shaikh, sheikh => sheik",
"shivaree => charivari",
"silverware, flatware => cutlery",
"simultaneous => simultanous",
"sleigh => sled",
"smoulder, smouldering => smolder",
"sombre => somber",
"speciality => specialty",
"spectre => specter",
"splendour => splendor",
"spoilt => spoiled",
"street => road",
"streetcar, tramway, tram => trolley-car",
"succour => succor",
"sulphate, sulphide, sulphur, sulphurous, sulfurous => sulfur",
"super hero, superhero => hero",
"surname => last name",
"sweets => candy",
"syphon => siphon",
"syphoning => siphoning",
"tack, thumb-tack, thumbtack => drawing pin",
"tailpipe => exhaust pipe",
"taleban => taliban",
"teenager => teen",
"television => tv",
"thank you, thanks",
"theatre => theater",
"tickbox => checkbox",
"ticked => checked",
"timetable => schedule",
"tinned => canned",
"titbit => tidbit",
"toffee => taffy",
"tonne => ton",
"transportation => transport",
"trapezium => trapezoid",
"trousers => pants",
"tumour => tumor",
"twitter => twitter, social media",
"tyre => tire",
"tyres => tires",
"undershirt => singlet",
"university => college",
"upmarket => upscale",
"valour => valor",
"vapour => vapor",
"vigour => vigor",
"waggon => wagon",
"windscreen, windshield => front shield",
"world championship, world cup, worldcup",
"worshipper, worshipping => worshiping",
"yoghourt, yoghurt => yogurt",
"zip, zip code, postal code, postcode",
"zucchini => courgette"
)
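For completeness, here is one way the two pieces might be combined into index settings. This is a sketch under assumptions: the analyzer name `en_analyzer`, the `standard` tokenizer, the `lowercase` filter and the helper `build_settings` are my own choices for illustration and are not part of the setup above; only `en_char_filter` and `en_synonym_filter` come from it. The lists are shortened stand-ins for the full ones.

```python
import json

# Shortened stand-ins for the full lists shown above.
EN_MAPPINGS = ("prise=>prixse", "colour=>color", "ise=>ize")
EN_SYNONYMS = ("aluminium => aluminum", "grey => gray")


def build_settings(mappings, synonyms):
    """Assemble index settings with a custom analyzer (names are assumptions)."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "en_char_filter": {"type": "mapping", "mappings": list(mappings)}
                },
                "filter": {
                    "en_synonym_filter": {"type": "synonym", "synonyms": list(synonyms)}
                },
                "analyzer": {
                    "en_analyzer": {  # hypothetical analyzer name
                        "type": "custom",
                        "char_filter": ["en_char_filter"],  # runs before tokenizing
                        "tokenizer": "standard",
                        "filter": ["lowercase", "en_synonym_filter"],
                    }
                },
            }
        }
    }


settings = build_settings(EN_MAPPINGS, EN_SYNONYMS)
print(json.dumps(settings, indent=2))
```

The printed JSON could then be sent as the body of an index-creation request. Since the char_filter runs before the tokenizer, the synonym filter only ever sees text that the spelling rules have already rewritten.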
Even though it's not my question, thank you for this! Great work.