Write three very simple Python scripts to reduce .csv files

  • Status: Closed
  • Prize: $35
  • Entries received: 2
  • Winner: sarasixti

Contest brief

I need three simple scripts to reduce data files (information about mutations in proteins).

All the scripts should work from the Mac command line and accept an arbitrary input file name, e.g.:
python script1.py Random_file_name_1.csv
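
One way to support an arbitrary input name is to take it from the command line and derive the output name from it. The sketch below is only an illustration: the helper names `output_path` and `main` are hypothetical, and it assumes the output file is written next to the input file.

```python
from pathlib import Path


def output_path(input_name, suffix):
    """Insert `suffix` before the extension: XXX.csv -> XXX<suffix>.csv."""
    source = Path(input_name)
    return source.with_name(source.stem + suffix + source.suffix)


def main(argv):
    """Take the argument list (without the script name) and return the output path."""
    # Exactly one input file name is expected for scripts 1 and 2.
    if len(argv) != 1:
        raise SystemExit("usage: python script1.py INPUT.csv")
    return output_path(argv[0], "_reduced_to_species")
```

In a real script, `main(sys.argv[1:])` would be called at the bottom of the file (after `import sys`), so that `python script1.py Random_file_name_1.csv` targets `Random_file_name_1_reduced_to_species.csv`.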

Here are the scripts:
Script 1:
Reduce the XXX.csv file to an XXX_reduced_to_species.csv file by keeping only the first line for each unique combination of "Organism" and "Total number of mutations" values. Please find the files in the attachment. (XXX is just an example, a generic name of a protein.)

For instance, if XXX.csv file looks like this:

Organism: Total number of mutations:
Helicobacter pylori 0
Helicobacter pylori 0
Helicobacter pylori 1
Helicobacter pylori 0
Escherichia coli 0
Escherichia coli 2
Escherichia coli 1
Escherichia coli 0
Escherichia coli 1

then the XXX_reduced_to_species.csv file will look like this:

Organism: Total number of mutations:
Helicobacter pylori 0
Helicobacter pylori 1
Escherichia coli 0
Escherichia coli 2
Escherichia coli 1
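
A minimal sketch of how Script 1 could be written with the standard `csv` module. It assumes the files are comma-separated with a header row, and that the headers are exactly "Organism" and "Total number of mutations"; the real attachments may spell them differently (the example above shows trailing colons), in which case the two constants need adjusting. The function names are illustrative.

```python
import csv

ORGANISM = "Organism"                     # assumed header name
MUTATIONS = "Total number of mutations"   # assumed header name


def reduce_to_species(rows):
    """Keep the first row for each (organism, mutation count) pair."""
    seen = set()
    kept = []
    for row in rows:
        key = (row[ORGANISM].strip(), row[MUTATIONS].strip())
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def reduce_file(src, dst):
    """Read `src`, drop repeated pairs, and write all columns to `dst`."""
    with open(src, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = reduce_to_species(reader)
    with open(dst, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Rows are kept in their original order, which matches the example: the first occurrence of each pair survives and later repeats are dropped.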


Script 2:
Reduce XXX_reduced_to_species.csv to XXX_reduced_to_genus.csv by using the following rule:
for all organism names that share the same first word (the genus), keep only one line: the first line with the highest number of mutations.

For instance, the lines:

Organism: Total number of mutations:
Helicobacter acinonychis 1
Helicobacter bilis 0
Helicobacter cetorum 0
Helicobacter cinaedi 2
Helicobacter felis 2

will be reduced to:
Helicobacter cinaedi 2
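
The genus rule could be sketched as follows. This assumes the same two header names as in Script 1 and that every mutation count is a whole number; both are assumptions about the attached files, and the function name is illustrative. A strict `>` comparison is what keeps the first line when several lines tie for the highest count.

```python
ORGANISM = "Organism"                     # assumed header name
MUTATIONS = "Total number of mutations"   # assumed header name


def reduce_to_genus(rows):
    """Keep, per genus, the first row with the highest mutation count."""
    best = {}  # genus -> best row so far; dicts keep insertion order
    for row in rows:
        name = row[ORGANISM].strip()
        if not name:
            continue  # skip rows without an organism name
        genus = name.split()[0]
        count = int(row[MUTATIONS])
        # Strict ">" means an equal count never replaces an earlier row.
        if genus not in best or count > int(best[genus][MUTATIONS]):
            best[genus] = row
    return list(best.values())
```

With the five Helicobacter lines above, both "Helicobacter cinaedi" and "Helicobacter felis" have 2 mutations, and the earlier one is kept.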


Script 3:
Finally, I need a script, script_3.py, to create a Mutations_summary.csv file from multiple .csv files:

python script_3.py XXX_reduced_to_genus.csv YYY_reduced_to_genus.csv ZZZ_reduced_to_genus.csv

This script should generate the Mutations_summary.csv file by following these steps:

  • Merge all .csv files into one .csv file
  • Remove all the columns except "Organism name", "Organism Groups", "Lifestyle", "Size (Mb)" and "GC%"
  • Remove all the duplicates, so that the .csv file contains only lines with a unique combination of "Organism name", "Organism Groups", "Lifestyle", "Size (Mb)" and "GC%" values.

For instance, the list:


Organism name Organism Groups Lifestyle Size (Mb) GC%
Bla-bla1 X A 1 20
Bla-bla1 X A 1 20
Bla-bla2 X A 2 30
Bla-bla2 X B 2 30


should be reduced to:

Organism name Organism Groups Lifestyle Size (Mb) GC%
Bla-bla1 X A 1 20
Bla-bla2 X A 2 30
Bla-bla2 X B 2 30
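
The merge, column selection and de-duplication steps could look roughly like this. It is a sketch that assumes comma-separated input files whose headers match the five column names exactly; `unique_rows` and `merge_files` are illustrative names.

```python
import csv

# Assumed to match the headers in the input files exactly.
SUMMARY_COLUMNS = ["Organism name", "Organism Groups", "Lifestyle",
                   "Size (Mb)", "GC%"]


def unique_rows(rows):
    """Keep only the summary columns and drop repeated combinations."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[column].strip() for column in SUMMARY_COLUMNS)
        if key not in seen:
            seen.add(key)
            kept.append(dict(zip(SUMMARY_COLUMNS, key)))
    return kept


def merge_files(paths):
    """Read every input file and return the de-duplicated summary rows."""
    merged = []
    for path in paths:
        with open(path, newline="") as handle:
            merged.extend(csv.DictReader(handle))
    return unique_rows(merged)
```

Two lines count as duplicates only when all five values match, so "Bla-bla2 X A 2 30" and "Bla-bla2 X B 2 30" are both kept.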



  • Add columns "XXX", "YYY", "ZZZ" and fill these columns with values by using the following rule: for each "Organism name" in the Mutations_summary.csv file, find the identical "Organism name" value in the XXX_reduced_to_genus.csv file, then find the corresponding value in the "Total number of mutations" column of the XXX_reduced_to_genus.csv file, and print this "Total number of mutations" value into the XXX column of the Mutations_summary.csv file. If the XXX_reduced_to_genus.csv file does not have a matching "Organism name" entry, then print "-" in the XXX column of the Mutations_summary.csv file.
  • Finally, add a column "Total number of mutated proteins" to the Mutations_summary.csv file and fill it by counting how many of the XXX, YYY and ZZZ values are neither "0" nor "-". For example:


Organism: XXX YYY ZZZ Total number of mutated proteins
Helicobacter cinaedi 2 1 5 3
Escherichia coli 0 0 0 0
Weird name 1 0 4 2
Gibberish word 0 - 4 1


  • So, at the end, the Mutations_summary.csv file will have the following columns:
Organism name Organism Groups Lifestyle Size (Mb) GC% Organism XXX YYY ZZZ Total number of mutated proteins
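
The per-protein columns and the final count could be filled in as sketched below. The sketch assumes that each input file is named `<protein>_reduced_to_genus.csv`, that the files are comma-separated, and that they contain "Organism name" and "Total number of mutations" headers; all function names are illustrative.

```python
import csv
from pathlib import Path

NAME = "Organism name"                    # assumed header name
MUTATIONS = "Total number of mutations"   # assumed header name
TOTAL = "Total number of mutated proteins"


def protein_label(path):
    """XXX_reduced_to_genus.csv -> XXX (assumed naming scheme)."""
    stem = Path(path).stem
    suffix = "_reduced_to_genus"
    return stem[:-len(suffix)] if stem.endswith(suffix) else stem


def load_counts(path):
    """Map each organism name in one file to its mutation count."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # setdefault keeps the first value if a name repeats.
            counts.setdefault(row[NAME].strip(), row[MUTATIONS].strip())
    return counts


def add_protein_columns(summary, per_protein):
    """Add one column per protein plus the count of mutated proteins.

    `summary` is a list of dicts with an "Organism name" key;
    `per_protein` maps a column label to the dict from `load_counts`.
    """
    for row in summary:
        mutated = 0
        for label, counts in per_protein.items():
            value = counts.get(row[NAME], "-")  # "-" when no match exists
            row[label] = value
            if value not in ("0", "-"):
                mutated += 1
        row[TOTAL] = mutated
    return summary
```

A caller would build `per_protein` as `{protein_label(p): load_counts(p) for p in paths}` and then write the rows with `csv.DictWriter`.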

Please find the examples of the .csv files in the attachment. Also, if you are sure you can do this within an hour, please post on the Clarification board so that other Freelancers know that this project is most likely already taken care of by somebody.

When you complete the task, please take a screenshot of a small portion of the Mutations_summary.csv file so that I can confirm that I explained the task clearly and can award the project.

Thank you.
Sergey


Public Clarification Board

  • sergeyvmelnikov
    Contest holder
    • 5 years ago

    Dear all, it seems my explanations were not clear enough. I have attached three sample files to this contest and I was expecting you to process these files. So, rather than using the generic XXX, YYY and ZZZ from my explanations, I was expecting you to use AlaRS_new, IleRS_new and LeuRS_new.

  • roshansanthoshh
    • 5 years ago

    Have completed the code. Can provide you the code right now itself.
