The website offers metadata, tags, and features for the files it contains, and allows interested parties to download the available malware samples for further analysis, with the aim of driving security improvements across the industry. The publicly available dataset is meant to help accelerate machine learning research for malware detection by providing a curated and labeled collection of samples and related metadata. While machine learning models are built on data, the security sector lacks a standard, large-scale dataset that can easily be accessed by all types of users (from independent researchers to laboratories and corporations), which has so far slowed progress, Sophos argues. It is both costly and difficult to obtain a large number of curated, labeled samples, and sharing datasets is also difficult because of intellectual property concerns and the risk of providing unknown third parties with malicious software. As a result, most published malware detection papers work on private, internal datasets, with results that cannot be compared directly with each other, the company says.

The SoReL-20M dataset, a production-scale dataset covering 20 million samples, including 10 million disarmed pieces of malware, aims to fix that problem. For each sample, the dataset contains features extracted based on the EMBER 2.0 dataset, labels, detection metadata, and complete binaries for the included malware samples. In addition, PyTorch and LightGBM models that have already been trained on this data as baselines are provided, along with the scripts needed to load and iterate over the data, as well as to load, train, and test the models.

Because the malware being released has been disarmed, it would take knowledge, skill, and time to reconstitute and run it, Sophos says.
The company acknowledges that attackers could learn from these samples or use them to build attack tools, but maintains that there are already many other sources attackers could leverage to gain access to malware information and samples, and that those sources are easier, faster, and more cost-effective to use. The company also argues that the disarmed samples are more useful to security researchers looking to advance their independent defenses. The disabled malware samples, which have been in the wild for some time, are expected to call back to infrastructure that has already been taken down. In addition, most anti-virus vendors should be able to detect them, and detection is expected to improve with the metadata published alongside the samples. ReversingLabs, which claims to provide a reputation database of more than 12 billion goodware and malware files, says that, as an industry, security practitioners know malware is not limited to Windows or even to executable files, which is why researchers and security teams still need more data.
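To illustrate why a shared, labeled dataset makes published results comparable, here is a minimal sketch in Python. It is not taken from the SoReL-20M scripts. The `Sample` record, the toy feature values, and the two threshold "detectors" are all hypothetical and exist only to show that two approaches scored on the same labeled samples produce directly comparable numbers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Sample:
    """One labeled record: an ID, a feature vector, and a malware/benign label."""
    sample_id: str              # hypothetical identifier, not a real hash
    features: Sequence[float]   # stand-in for extracted features
    is_malware: bool


def evaluate(detector: Callable[[Sequence[float]], bool],
             samples: List[Sample]) -> Dict[str, float]:
    """Score a detector on a shared sample set.

    Returns the detection rate (share of malware flagged) and the
    false positive rate (share of benign files flagged).
    """
    tp = fp = fn = tn = 0
    for s in samples:
        flagged = detector(s.features)
        if s.is_malware and flagged:
            tp += 1
        elif s.is_malware:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return {
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy data invented for illustration; real feature vectors are far larger.
SAMPLES = [
    Sample("a1", [0.9, 0.2], True),
    Sample("b2", [0.8, 0.7], True),
    Sample("c3", [0.4, 0.9], True),
    Sample("d4", [0.1, 0.1], False),
    Sample("e5", [0.6, 0.3], False),
    Sample("f6", [0.2, 0.7], False),
]


def detector_a(features: Sequence[float]) -> bool:
    """Hypothetical detector: flags a file when the first feature is high."""
    return features[0] > 0.5


def detector_b(features: Sequence[float]) -> bool:
    """Hypothetical detector: flags a file when the feature sum is high."""
    return features[0] + features[1] > 1.0


if __name__ == "__main__":
    for name, det in (("A", detector_a), ("B", detector_b)):
        print(name, evaluate(det, SAMPLES))
```

Because both detectors are scored against the same samples and labels, their detection and false positive rates can be set side by side, which is not possible when each study uses its own private dataset.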