2015-10-19 5 views
5

Converting a list of lists into a dictionary

I have a data file that looks like this:

["Arts & Entertainment", "Arts & Entertainment/Animation & Comics", "Arts & Entertainment/Books & Literature", "Arts & Entertainment/Celebrity/Gossip", "Arts & Entertainment/Fine Art", "Arts & Entertainment/Humor", "Arts & Entertainment/Movies", "Arts & Entertainment/Movies/Action", "Arts & Entertainment/Movies/Comedy", "Arts & Entertainment/Movies/Documentary", "Arts & Entertainment/Movies/Drama", "Arts & Entertainment/Movies/Horror", "Arts & Entertainment/Music", "Arts & Entertainment/Music/Alternative Music", "Arts & Entertainment/Music/Blues", "Arts & Entertainment/Music/Christian Music", "Arts & Entertainment/Music/Classic Rock", "Arts & Entertainment/Music/Classical Music", "Arts & Entertainment/Music/Country Music", "Arts & Entertainment/Music/Electronic Dance Music", "Arts & Entertainment/Music/Heavy Metal", "Arts & Entertainment/Music/Pop Music", "Arts & Entertainment/Music/Rap", "Arts & Entertainment/Radio Stations", "Arts & Entertainment/Television", "Arts & Entertainment/Television/Game Show", "Arts & Entertainment/Television/Kids", "Arts & Entertainment/Television/News", "Arts & Entertainment/Television/Reality", "Arts & Entertainment/Television/Science", "Arts & Entertainment/Television/Sitcom", "Arts & Entertainment/Television/Soap Opera", "Arts & Entertainment/Television/Talk Show", "Autos", "Autos/4-Wheel Drive/SUVs", "Autos/Buying/Selling Cars", "Autos/Certified Pre-Owned", "Autos/Convertible", "Autos/Coupe", "Autos/Crossover", "Autos/Diesel", "Autos/Electric Vehicles", "Autos/Hatchback", "Autos/Hybrid", "Autos/Luxury", "Autos/Maintenance", "Autos/Maintenance/Parts", "Autos/Maintenance/Repair", "Autos/MiniVan", "Autos/Motorcycles", "Autos/Off-Road Vehicles", "Autos/Road-Side Assistance", "Autos/Sedan", "Autos/Trucks", "Autos/Trucks/Pickup", "Autos/Vintage Cars", "Autos/Wagon", "Business & Industry", "Business & Industry/Advertising", "Business & Industry/Agriculture", "Business & Industry/Biotech/Biomedical", "Business & Industry/Business 
Software", "Business & Industry/Construction", "Business & Industry/Construction/Composites & Plastics", "Business & Industry/Forestry", "Business & Industry/Government", "Business & Industry/Green Solutions", "Business & Industry/Human Resources", "Business & Industry/Logistics", "Business & Industry/Marketing", "Business & Industry/Metals", "Business & Industry/Non-Profit Organizations", "Business & Industry/Power Industry", "Business & Industry/Public Services", "Business & Industry/Public Services/Emergency Services", "Business & Industry/Public Services/Waste Management", "Business & Industry/Purchasing", "Business & Industry/Retail Industry", "Business & Industry/Small Business", "Business & Industry/Telecom", "Career", "Career/Career Planning", "Career/Job Search", "Career/Job Search/Resume Writing/Advice", "Career/Telecommuting", "Career/U.S. Military", "Education", "Education/Business School", "Education/College Education", "Education/College Education/Admissions", "Education/College Education/College Life", "Education/Continuing Education", "Education/Distance Learning", "Education/Financial Aid", "Education/Financial Aid/Scholarships", "Education/Graduate School", "Education/Homeschooling", "Education/Language Learning", "Education/Language Learning/English as a 2nd Language", "Education/Primary Education", "Education/Secondary Education", "Education/Special Education", "Finance & Money", "Finance & Money/Credit/Debt & Loans", "Finance & Money/Day Trading", "Finance & Money/Exchange Traded Funds", "Finance & Money/Financial News", "Finance & Money/Financial Planning", "Finance & Money/Financial Planning/Retirement Planning", "Finance & Money/Financial Planning/Tax Planning", "Finance & Money/Foreign Exchange Trading", "Finance & Money/Hedge Fund", "Finance & Money/Insurance", "Finance & Money/Investing", "Finance & Money/Mutual Funds", "Finance & Money/Options", "Finance & Money/Stocks", "Food & Drink", "Food & Drink/Barbecues & Grilling", "Food & 
Drink/Beverages", "Food & Drink/Beverages/Cocktails/Beer", "Food & Drink/Beverages/Coffee/Tea", "Food & Drink/Beverages/Wine", "Food & Drink/Cuisine-Specific", "Food & Drink/Cuisine-Specific/American Cusine", "Food & Drink/Cuisine-Specific/Cajun/Creole", "Food & Drink/Cuisine-Specific/Chinese Cuisine", "Food & Drink/Cuisine-Specific/French Cuisine", "Food & Drink/Cuisine-Specific/Italian Food", "Food & Drink/Cuisine-Specific/Japanese Food", "Food & Drink/Cuisine-Specific/Mexican Cuisine", "Food & Drink/Desserts & Baking", "Food & Drink/Health/LowFat Cooking", "Food & Drink/Organic Food", "Food & Drink/Vegetarian", "Health & Fitness", "Health & Fitness/A.D.D.", "Health & Fitness/AIDS/HIV", "Health & Fitness/Allergies", "Health & Fitness/Alternative Medicine", "Health & Fitness/Alzheimer\\'s Disease", "Health & Fitness/Arthritis", "Health & Fitness/Asthma", "Health & Fitness/Autism/PDD", "Health & Fitness/Bipolar Disorder", "Health & Fitness/Brain Tumor", "Health & Fitness/Cancer", "Health & Fitness/Cancer/Breast Cancer", "Health & Fitness/Cancer/Lung Cancer", "Health & Fitness/Cancer/Prostate Cancer", "Health & Fitness/Cholesterol", "Health & Fitness/Chronic Fatigue Syndrome", "Health & Fitness/Chronic Obstructive Pulmonary Disease", "Health & Fitness/Chronic Pain", "Health & Fitness/Cold & Flu", "Health & Fitness/Deafness", "Health & Fitness/Dental Care", "Health & Fitness/Depression", "Health & Fitness/Dermatology", "Health & Fitness/Diabetes", "Health & Fitness/Epilepsy", "Health & Fitness/Exercise", "Health & Fitness/GERD/Acid Reflux", "Health & Fitness/Headaches/Migraines", "Health & Fitness/Heart Disease", "Health & Fitness/Heart Disease/Women\\'s Heart Disease", "Health & Fitness/Hepatitis", "Health & Fitness/Herbs for Health", "Health & Fitness/Holistic Healing", "Health & Fitness/Hypertension", "Health & Fitness/IBS/Crohn\\'s Disease", "Health & Fitness/Incest/Abuse Support", "Health & Fitness/Incontinence", "Health & Fitness/Infertility", "Health & 
Fitness/Men\\'s Health", "Health & Fitness/Nursing", "Health & Fitness/Nutrition", "Health & Fitness/Orthopedics", "Health & Fitness/Orthopedics/Sports Medicine", "Health & Fitness/Panic/Anxiety Disorders", "Health & Fitness/Pediatrics", "Health & Fitness/Pharmaceutical", "Health & Fitness/Physical Therapy", "Health & Fitness/Psychology/Psychiatry", "Health & Fitness/Senior Health", "Health & Fitness/Sexuality", "Health & Fitness/Sleep Disorders", "Health & Fitness/Smoking Cessation", "Health & Fitness/Substance Abuse", "Health & Fitness/Substance Abuse/Alcoholism", "Health & Fitness/Thyroid Disease", "Health & Fitness/Weight Loss", "Health & Fitness/Women\\'s Health", "Hobbies & Games", "Hobbies & Games/Arts & Crafts", "Hobbies & Games/Arts & Crafts/Beadwork", "Hobbies & Games/Arts & Crafts/Drawing/Sketching", "Hobbies & Games/Arts & Crafts/Needlework", "Hobbies & Games/Arts & Crafts/Painting", "Hobbies & Games/Arts & Crafts/Photography", "Hobbies & Games/Arts & Crafts/Woodworking", "Hobbies & Games/Astrology", "Hobbies & Games/Birdwatching", "Hobbies & Games/BoardGames/Puzzles", "Hobbies & Games/Candle & Soap Making", "Hobbies & Games/Card Games", "Hobbies & Games/Chess", "Hobbies & Games/Cigars", "Hobbies & Games/Collecting", "Hobbies & Games/Collecting/Antiques", "Hobbies & Games/Collecting/Book Collecting", "Hobbies & Games/Collecting/Miniatures", "Hobbies & Games/Collecting/Stamps & Coins", "Hobbies & Games/Creative Writing", "Hobbies & Games/Getting Published", "Hobbies & Games/Home Recording", "Hobbies & Games/Inventors & Patents", "Hobbies & Games/Learning a Musical Instrument", "Hobbies & Games/Learning a Musical Instrument/Guitar", "Hobbies & Games/Magic & Illusion", "Hobbies & Games/Paranormal Phenomena", "Hobbies & Games/Sci-Fi & Fantasy", "Hobbies & Games/Video Games", "Hobbies & Games/Video Games/Nintendo", "Hobbies & Games/Video Games/PSP", "Hobbies & Games/Video Games/Playstation", "Hobbies & Games/Video Games/RPG", "Hobbies & Games/Video 
Games/Racing", "Hobbies & Games/Video Games/X-Box", "Home & Garden", "Home & Garden/Appliances", "Home & Garden/Environmental Safety", "Home & Garden/Gardening/Landscaping", "Home & Garden/Home Repair", "Home & Garden/Interior Decorating", "News & Current Affairs", "News & Current Affairs/Law & Politics", "News & Current Affairs/Law & Politics/Immigration", "News & Current Affairs/Law & Politics/Legal Issues", "News & Current Affairs/Law & Politics/U.S. Government Resources", "Parenting & Family", "Parenting & Family/Adoption", "Parenting & Family/Babies & Toddlers", "Parenting & Family/Daycare/Pre-School", "Parenting & Family/Parenting Children", "Parenting & Family/Parenting Teens", "Parenting & Family/Pregnancy", "Parenting & Family/Special Needs Kids", "Pets", "Pets/Aquariums", "Pets/Cats", "Pets/Dogs", "Pets/Veterinary Medicine", "Real Estate", "Real Estate/Apartments", "Real Estate/Architecture", "Real Estate/Buying/Selling Homes", "Religion", "Religion/Alternative Religions", "Religion/Atheism/Agnosticism", "Religion/Buddhism", "Religion/Catholicism", "Religion/Christianity", "Religion/Hinduism", "Religion/Islam", "Religion/Judaism", "Religion/Latter-Day Saints", "Religion/Pagan/Wiccan", "Science", "Science/Astronomy", "Science/Biology", "Science/Chemistry", "Science/Geology", "Science/Physics", "Sensitive Content", "Sensitive Content/Gambling", "Sensitive Content/Gambling/Sports Gambling", "Society", "Society/Dating", "Society/Divorce", "Society/Gay Life", "Society/Marriage", "Society/Senior Living", "Society/Weddings", "Sports & Recreation", "Sports & Recreation/Auto Racing", "Sports & Recreation/Auto Racing/NASCAR Racing", "Sports & Recreation/Baseball", "Sports & Recreation/Basketball", "Sports & Recreation/Bicycling", "Sports & Recreation/Bicycling/Mountain Biking", "Sports & Recreation/Bodybuilding", "Sports & Recreation/Boxing", "Sports & Recreation/Canoeing/Kayaking", "Sports & Recreation/Cheerleading", "Sports & Recreation/Climbing", "Sports & 
Recreation/College Sports", "Sports & Recreation/Cricket", "Sports & Recreation/Figure Skating", "Sports & Recreation/Fishing", "Sports & Recreation/Fishing/Fly Fishing", "Sports & Recreation/Fishing/Freshwater Fishing", "Sports & Recreation/Fishing/Game & Fish", "Sports & Recreation/Fishing/Saltwater Fishing", "Sports & Recreation/Football", "Sports & Recreation/Golf", "Sports & Recreation/Horses", "Sports & Recreation/Horses/Horse Racing", "Sports & Recreation/Hunting/Shooting", "Sports & Recreation/Ice Hockey", "Sports & Recreation/Inline Skating", "Sports & Recreation/Martial Arts", "Sports & Recreation/Olympics", "Sports & Recreation/Paintball", "Sports & Recreation/Rodeo", "Sports & Recreation/Rugby", "Sports & Recreation/Running/Walking", "Sports & Recreation/Sailing", "Sports & Recreation/Scuba Diving", "Sports & Recreation/Skateboarding", "Sports & Recreation/Skiing", "Sports & Recreation/Snowboarding", "Sports & Recreation/Soccer", "Sports & Recreation/Surfing/Bodyboarding", "Sports & Recreation/Swimming", "Sports & Recreation/Table Tennis/Ping-Pong", "Sports & Recreation/Tennis", "Sports & Recreation/Volleyball", "Sports & Recreation/Waterski/Wakeboard", "Sports & Recreation/Yachting", "Style & Fashion", "Style & Fashion/Body Art", "Style & Fashion/Cosmetics", "Style & Fashion/Fashion", "Style & Fashion/Jewelry", "Technology & Computing", "Technology & Computing/Cameras & Camcorders", "Technology & Computing/Cell Phones", "Technology & Computing/Computer Certification", "Technology & Computing/Computer Networking", "Technology & Computing/Computer Peripherals", "Technology & Computing/Computer Security", "Technology & Computing/Computer Security/Antivirus Software", "Technology & Computing/Computer Security/Network Security", "Technology & Computing/Databases", "Technology & Computing/Graphics", "Technology & Computing/Graphics/3-D Graphics", "Technology & Computing/Graphics/Animation", "Technology & Computing/Graphics/Desktop Publishing", "Technology & 
Computing/Graphics/Desktop Video", "Technology & Computing/Graphics/Web Design/HTML", "Technology & Computing/Home Theater Systems", "Technology & Computing/Operating Systems", "Technology & Computing/Operating Systems/Linux", "Technology & Computing/Operating Systems/Mac OS", "Technology & Computing/Operating Systems/Unix", "Technology & Computing/Operating Systems/Windows", "Technology & Computing/Portable Device", "Technology & Computing/Programming", "Technology & Computing/Programming/C/C++", "Technology & Computing/Programming/Java", "Technology & Computing/Programming/JavaScript", "Technology & Computing/Programming/Visual Basic", "Travel", "Travel/Adventure Travel", "Travel/Africa", "Travel/Air Travel", "Travel/Asia", "Travel/Asia/Japan", "Travel/Australia & New Zealand", "Travel/Bed & Breakfasts", "Travel/Budget Travel", "Travel/Business Travel", "Travel/Camping", "Travel/Canada", "Travel/Caribbean", "Travel/Cruises", "Travel/Europe", "Travel/Europe/Eastern Europe", "Travel/Europe/France", "Travel/Europe/Greece", "Travel/Europe/Italy", "Travel/Europe/United Kingdom", "Travel/Honeymoons/Getaways", "Travel/Hotels", "Travel/Mexico & Central America", "Travel/National Parks", "Travel/South America", "Travel/Spas", "Travel/Theme Parks", "Travel/United States", "Travel/United States/California", "Travel/United States/Florida", "Travel/United States/Hawaii", "Travel/United States/Las Vegas, Nevada", "Travel/United States/Manhattan, New York", "Travel/United States/New England", "Travel/United States/Texas", "Travel/Weather"] 

I clean the data file and split each entry, so that it looks something like this:

['Arts & Entertainment'] 
['Arts & Entertainment', 'Animation & Comics'] 
['Arts & Entertainment', 'Books & Literature'] 
['Arts & Entertainment', 'Celebrity Gossip'] 
['Arts & Entertainment', 'Fine Art'] 
['Arts & Entertainment', 'Humor'] 
['Arts & Entertainment', 'Movies'] 
['Arts & Entertainment', 'Movies', 'Action'] 
['Arts & Entertainment', 'Movies', 'Comedy'] 
['Arts & Entertainment', 'Movies', 'Documentary'] 
['Arts & Entertainment', 'Movies', 'Drama'] 
['Arts & Entertainment', 'Movies', 'Horror'] 
['Arts & Entertainment', 'Music'] 
['Arts & Entertainment', 'Music', 'Alternative Music'] 
['Arts & Entertainment', 'Music', 'Blues'] 
['Arts & Entertainment', 'Music', 'Christian Music'] 
['Arts & Entertainment', 'Music', 'Classic Rock'] 
['Arts & Entertainment', 'Music', 'Classical Music'] 
['Arts & Entertainment', 'Music', 'Country Music'] 
['Arts & Entertainment', 'Music', 'Electronic Dance Music'] 
['Arts & Entertainment', 'Music', 'Heavy Metal'] 
['Arts & Entertainment', 'Music', 'Pop Music'] 
['Arts & Entertainment', 'Music', 'Rap'] 
['Arts & Entertainment', 'Radio Stations'] 
['Arts & Entertainment', 'Television'] 
['Arts & Entertainment', 'Television', 'Game Show'] 
['Arts & Entertainment', 'Television', 'Kids'] 
['Arts & Entertainment', 'Television', 'News'] 
['Arts & Entertainment', 'Television', 'Reality'] 
['Arts & Entertainment', 'Television', 'Science'] 
['Arts & Entertainment', 'Television', 'Sitcom'] 
['Arts & Entertainment', 'Television', 'Soap Opera'] 
['Arts & Entertainment', 'Television', 'Talk Show']... 

Now I'm trying to convert the list items into a dictionary that looks like this:

{ 
    "Arts & Entertainment": { 
     "Animation & Comics": {}, 
     "Books & Literature": {}, 
     "Celebrity Gossip": {}, 
     "Fine Art": {}, 
     "Humor": {}, 
     "Movies": { 
      "Horror": {}, 
      "Action": {}, 
      "Comedy": {}, ... 
     }, ... 
} 

The problem is that I can't figure out how to avoid overwriting my subcategories. In the example above, the Movies subcategory has three categories in it. However, when I run my code (shown below), Movies contains only the "Horror" key, because Horror is the last element of the last list in that category. Example of what I'm getting:

{ 
    "Arts & Entertainment": { 
     "Animation & Comics": {}, 
     "Books & Literature": {}, 
     "Celebrity Gossip": {}, 
     "Fine Art": {}, 
     "Humor": {}, 
     "Movies": { 
      "Horror": {} # notice there are no other categories in the movies section 
     }, ... 
} 

The code I tried:

def cleanup_contextweb():
    contextweb_file_path = directory_path + raw_file_names[1]
    tree = {}
    with open(contextweb_file_path, 'r') as contextweb_file:
        cats = contextweb_file.read().replace('Manhattan, New York', 'Manhattan New York').replace('Las Vegas, Nevada', 'Las Vegas Nevada').replace('Celebrity/Gossip', 'Celebrity Gossip').replace('Atheism/Agnosticism', 'Atheism Agnosticism').replace('Pagan/Wiccan', 'Pagan Wiccan').split(',')
    #cats = re.sub(r'"|\[|\]', '', cats)
    cats = [map(str.strip, re.sub(r'"|\[|\]', '', cat).split('/')) for cat in cats]
    cats = sorted(cats)
    for cat in cats:
        if len(cat) == 1:
            tree[cat[0]] = {}
        elif len(cat) == 2:
            tree[cat[0]][cat[1]] = {}
        elif len(cat) == 3:
            tree[cat[0]][cat[1]] = {}
            tree[cat[0]][cat[1]][cat[2]] = {}
        elif len(cat) == 4:
            tree[cat[0]][cat[1]] = {}
            tree[cat[0]][cat[1]][cat[2]] = {}
            tree[cat[0]][cat[1]][cat[2]][cat[3]] = {}
    with open(directory_path + 'cleaned_' + raw_file_names[1], 'w') as contextweb_file_out:
        json.dump(tree, contextweb_file_out, sort_keys=True, indent=4)

    return json.dumps(tree, sort_keys=True, indent=4)

As you can see, I'm trying to build the dictionary based on how deep I need to go (how many keys I need), which I know from the length of the list passed in. Other things I tried, but deleted, include sorting the list of lists (cats) by the length of each sublist and reversing it, so that all the lists with 4 elements would be iterated over first. I thought I could build the keys that way, so that the keys would already exist for the lower levels. It didn't help much.


You are replacing the value of each key with an empty dict every time or, in the case of 'tree[cat[0]][cat[1]] = {cat[2]: {}}', with a dict containing only that one key. You should instead add a new key if the key already exists, and a new dict if it doesn't. –


Have you tried recursion? Take all the items with the same first element. Strip off the first element and have the function call itself with the shortened lists. This returns a dictionary of those items. When you have only a single element, return it as the key to an empty dictionary. Is that clear enough to implement? – Prune


@RNar Oh yes, sorry about that, it was a typo from when I was playing with the code. Removing it produces the same result, though. – reticentroot
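
The overwrite described in the first comment can be reproduced in a few lines. This is only a minimal sketch, not the asker's code: the three-row `cats` list is a hypothetical subset of the cleaned data, and the names `overwritten` and `preserved` are made up for illustration. It contrasts plain assignment with `dict.setdefault`, which leaves an existing sub-dict alone.

```python
# Hypothetical three-row subset of the cleaned data from the question.
cats = [
    ['Arts & Entertainment', 'Movies', 'Action'],
    ['Arts & Entertainment', 'Movies', 'Comedy'],
    ['Arts & Entertainment', 'Movies', 'Horror'],
]

# Plain assignment, as in the question: every row resets 'Movies' to {}.
overwritten = {'Arts & Entertainment': {}}
for cat in cats:
    overwritten[cat[0]][cat[1]] = {}
    overwritten[cat[0]][cat[1]][cat[2]] = {}

# setdefault creates the sub-dict only when the key is missing.
preserved = {}
for cat in cats:
    preserved.setdefault(cat[0], {}).setdefault(cat[1], {}).setdefault(cat[2], {})

print(sorted(overwritten['Arts & Entertainment']['Movies']))  # ['Horror']
print(sorted(preserved['Arts & Entertainment']['Movies']))    # ['Action', 'Comedy', 'Horror']
```

Only the last row survives in the first loop because each iteration assigns a fresh empty dict to the 'Movies' key before adding its own child.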

Answers

5

Here is what it looks like with recursion:

data = [
    ['Arts & Entertainment'],
    ['Arts & Entertainment', 'Animation & Comics'],
    # ... full data list elided for readability
    ['Arts & Entertainment', 'Television', 'Talk Show']
]

def classify(in_list):
    # Build one level of the tree: each distinct first element becomes a key.
    sub_dict = {}

    label_set = set([category[0] for category in in_list])
    for label in label_set:
        # print label
        # Rows under this label, with the label itself stripped off.
        sub_category = [sub[1:] for sub in in_list if sub[0] == label and len(sub) > 1]
        # print sub_category
        sub_dict[label] = classify(sub_category)

    return sub_dict


print(classify(data))

Output (which I have not formatted for readability):

{'Arts & Entertainment': {'Celebrity Gossip': {}, 'Humor': {}, 'Television': {'Game Show': {}, 'Kids': {}, 'Science': {}, 'Talk Show': {}, 'Sitcom': {}, 'Reality': {}, 'Soap Opera': {}, 'News': {}}, 'Animation & Comics': {}, 'Movies': {'Action': {}, 'Drama': {}, 'Horror': {}, 'Comedy': {}, 'Documentary': {}}, 'Radio Stations': {}, 'Music': {'Alternative Music': {}, 'Christian Music': {}, 'Electronic Dance Music': {}, 'Pop Music': {}, 'Country Music': {}, 'Classical Music': {}, 'Rap': {}, 'Heavy Metal': {}, 'Blues': {}, 'Classic Rock': {}}, 'Fine Art': {}, 'Books & Literature': {}}} 
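
If a more readable dump is wanted, one option is `json.dumps` with `indent` and `sort_keys`, the same call the question's code already uses. The sketch below repeats the recursive function so it runs on its own; the `sample` list is a hypothetical subset of the data, not the full file.

```python
import json

def classify(in_list):
    """Same recursive grouping as above, repeated so this block runs alone."""
    sub_dict = {}
    for label in set(category[0] for category in in_list):
        sub_category = [sub[1:] for sub in in_list if sub[0] == label and len(sub) > 1]
        sub_dict[label] = classify(sub_category)
    return sub_dict

# Hypothetical subset of the cleaned data.
sample = [
    ['Arts & Entertainment'],
    ['Arts & Entertainment', 'Movies'],
    ['Arts & Entertainment', 'Movies', 'Action'],
    ['Arts & Entertainment', 'Movies', 'Horror'],
    ['Autos'],
]

# sort_keys makes the key order deterministic; indent adds the nesting layout.
pretty = json.dumps(classify(sample), sort_keys=True, indent=4)
print(pretty)
```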

I hope this is simple enough to understand. I left in my tracing print statements, in case they help you follow the logic. – Prune


Yes, thank you very much. It's easy to read and therefore easy to study, thanks! – reticentroot

6

Actually, a for loop can produce a pretty nice solution too:

>>> data 
[['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 's', 'd'], ['a', 'b', 'c', 'd', 'e']] 
>>> tree = {} 
>>> for cats in data: 
...     curtree = tree 
...     for c in cats: 
...         curtree = curtree.setdefault(c, {}) 
... 
>>> tree 
{'a': {'s': {'d': {}}, 'b': {'c': {'d': {'e': {}}}}}} 

The .setdefault() method ensures that a sub-dictionary is added if and only if the key (category) did not already exist.

curtree starts at the base dictionary tree and traverses/creates the tree using the categories.
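
The same loop can be wrapped in a function that works directly on slash-separated strings like those in the question's data file. This is a sketch under assumptions: `build_tree` and its `sep` parameter are hypothetical names, and the `paths` list is a small made-up subset of the data.

```python
def build_tree(paths, sep='/'):
    """Nest each separator-delimited path into a dict of dicts."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.split(sep):
            # Descend into the existing sub-dict, or create it if missing.
            node = node.setdefault(part.strip(), {})
    return tree

# Hypothetical subset of the raw category strings.
paths = [
    "Arts & Entertainment",
    "Arts & Entertainment/Movies",
    "Arts & Entertainment/Movies/Action",
    "Arts & Entertainment/Movies/Horror",
    "Autos/Maintenance/Parts",
]
tree = build_tree(paths)
print(sorted(tree["Arts & Entertainment"]["Movies"]))  # ['Action', 'Horror']
print(tree["Autos"])  # {'Maintenance': {'Parts': {}}}
```

Note that a plain split on '/' would also break up names such as "Celebrity/Gossip", so the replacements done in the question's code would still be needed before calling it.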


I really like your solution. I'm choosing the other one because it gives me practice with recursion, which is a weak point of mine. – reticentroot


Thanks. I upvoted this one: when the loop solution is as simple as the recursive one, it's advisable to use the loop. – Prune